Re: List of known BTRFS Raid 5/6 Bugs?

2018-09-11 Thread Duncan
Stefan K posted on Tue, 11 Sep 2018 13:29:38 +0200 as excerpted:

> wow, holy shit, thanks for this extended answer!
> 
>> The first thing to point out here again is that it's not
>> btrfs-specific.
> so does that mean that every RAID implementation (with parity) has such a bug?
> Looking around a bit, it seems that ZFS doesn't have a write hole.

Every parity-raid implementation that doesn't contain specific write-hole 
workarounds, yes, but some already have workarounds built-in, as btrfs 
will after the planned code is written/tested/merged/tested-more-broadly.

https://www.google.com/search?q=parity-raid+write-hole

As an example, back some years ago when I was doing raid6 on mdraid, it 
had the write-hole problem and I remember reading about it at the time.  
However, right on the first page of hits for the above search...

LWN: A journal for MD/RAID5 : https://lwn.net/Articles/665299/

It seems md/raid5's write hole was closed in kernel 4.4 with an optional 
journal device... preferably a fast ssd or nvram, to avoid performance 
issues, and mirrored, to avoid the journal itself becoming a single point 
of failure.
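
For illustration only (the device names below are placeholders, not from 
the thread; check mdadm(8) on your version for exact syntax), closing the 
md write hole with a journal looks roughly like this:

    # sketch: 4-disk md raid5 with a raid5/6 write journal on a fast device;
    # the journal device can itself be an md raid1 of two SSDs so it isn't
    # a single point of failure (mdadm 3.4+ / kernel 4.4+)
    mdadm --create /dev/md0 --level=5 --raid-devices=4 \
          --write-journal=/dev/nvme0n1p1 \
          /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1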

For me zfs is strictly an arm's-length thing: if Oracle wanted to, they 
could easily resolve the licensing problem, as they own the code, but they 
haven't, which at this point can only be deliberate, and as a result I 
simply don't touch it.  That isn't to say I don't recommend it for those 
comfortable with, or simply willing to overlook, the licensing issues, 
because zfs remains the most mature Linux option for many of the same 
feature points that btrfs has, btrfs only having them at a lower maturity 
level.

But while I keep zfs at personal arm's length, from what I've picked up, 
I /believe/ zfs gets around the write-hole by doing strict copy-on-write 
combined with variable-length stripes -- unlike current btrfs, a stripe 
isn't always written as widely as possible, so for instance in a 20-
device raid5-alike it can do a 3-device or possibly even 2-device 
"stripe".  Being entirely copy-on-write, that avoids the read-modify-
write cycle on modified existing data which, unless mitigated, creates 
the parity-raid write-hole.

Variable-length stripes are actually one of the possible longer-term 
solutions already discussed for btrfs as well, but the logging/journalling 
solution seems to be what they've decided to implement first, and there 
are other tradeoffs to it (as discussed elsewhere).  Of course, since as 
I've already explained I'm interested in the 3/4-way-mirroring option -- 
which would be used for the journal but would also be available to expand 
the current 2-way-raid1 mode to additional mirrors -- this is absolutely 
fine with me! =:^)

> And it _only_ happens when the server has an ungraceful shutdown, caused
> by a power outage? So that means if I'm running btrfs raid5/6 and have no
> power outages, I have no problems?

Sort-of yes?

Keep in mind that a power outage isn't the /only/ way to have an 
ungraceful shutdown, just one of the most common.  Should the kernel crash 
or lock up for some reason -- common examples include video and 
occasionally network driver bugs, due to the direct access to hardware and 
memory those drivers get -- that can trigger an "ungraceful shutdown" as 
well.  Altho with care (basically always trying to ssh in for a remote 
shutdown if possible, and/or using alt-sysrq-reisub sequences on apparent 
lockups) it's often possible to keep those from being /entirely/ 
ungraceful at the hardware level, so they're not /quite/ as bad as an 
abrupt power outage, or perhaps even worse a brownout that doesn't kill 
writes entirely but can at least theoretically trigger garbage scribbling 
on random device blocks.
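
As a rough sketch (assuming the magic-sysrq interface is enabled and you 
can still get a shell in over ssh or a serial console), the same 
sync/remount/reboot part of that sequence can be fed in by hand:

    echo 1 > /proc/sys/kernel/sysrq     # enable all sysrq functions
    echo s > /proc/sysrq-trigger        # emergency sync of all filesystems
    echo u > /proc/sysrq-trigger        # remount all filesystems read-only
    echo b > /proc/sysrq-trigger        # reboot immediately, no further sync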

So yes, sort-of, but it's not just power outages.

>>  it's possible to specify data as raid5/6 and metadata as raid1

> does someone have this in production?

I'm sure people do.  (As I said I'm a raid1 guy here, even 3-way-
mirroring for some things were it possible, so no parity-raid at all for 
me personally.)

On btrfs, it is in fact the multi-device default and thus quite common to 
have data and metadata as different profiles.  The multi-device default 
for metadata if not specified is raid1, with single profile data.  So if 
you just specify raid5/6 data and don't specify metadata at all, you'll 
get exactly what was mentioned, raid5/6 data as specified, raid1 metadata 
as the unspecified multi-device default.

So were I to guess, I'd guess that a lot of people who say they're running 
raid5/6 but weren't paying attention when setting up actually have it only 
for data; having not specified anything for metadata, they got raid1 for 
it.
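
As a sketch (the device names are placeholders), explicitly asking for 
that split at mkfs time looks like:

    # raid5 data, raid1 metadata, across four devices
    mkfs.btrfs -d raid5 -m raid1 /dev/sdb /dev/sdc /dev/sdd /dev/sde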


> ZFS btw has 2 copies of metadata by
> default, maybe it would also be an option for btrfs?

It actually sounds like they do hybrid raid, then, not just pure parity-
raid, but mirroring the metadata as well.  That would be in accord with a 
couple things I'd read about zfs but hadn't quite pursued to the logical 
conclusion, and woul

Re: List of known BTRFS Raid 5/6 Bugs?

2018-09-11 Thread Stefan K
wow, holy shit, thanks for this extended answer!

> The first thing to point out here again is that it's not btrfs-specific.  
so does that mean that every RAID implementation (with parity) has such a bug? 
Looking around a bit, it seems that ZFS doesn't have a write hole. And it _only_ 
happens when the server has an ungraceful shutdown, caused by a power outage? So 
that means if I'm running btrfs raid5/6 and have no power outages, I have no problems?

>  it's possible to specify data as raid5/6 and metadata as raid1
does someone have this in production? ZFS btw has 2 copies of metadata by 
default, maybe it would also be an option for btrfs?
In this case, do you think 'btrfs fi balance start -mconvert=raid1 -dconvert=raid5 
/path' is safe at the moment?

> That means small files and modifications to existing files, the ends of large 
> files, and much of the 
> metadata, will be written twice, first to the log, then to the final 
> location. 
that sounds like the performance will go down? As far as I can see btrfs can't 
beat ext4 nor zfs, and then they will make it even slower?

thanks in advance!

best regards
Stefan



On Saturday, September 8, 2018 8:40:50 AM CEST Duncan wrote:
> Stefan K posted on Fri, 07 Sep 2018 15:58:36 +0200 as excerpted:
> 
> > sorry for disturb this discussion,
> > 
> > are there any plans/dates to fix the raid5/6 issue? Is somebody working
> > on this issue? Cause this is for me one of the most important things for
> > a fileserver, with a raid1 config I loose to much diskspace.
> 
> There's a more technically complete discussion of this in at least two 
> earlier threads you can find on the list archive, if you're interested, 
> but here's the basics (well, extended basics...) from a btrfs-using-
> sysadmin perspective.
> 
> "The raid5/6 issue" can refer to at least three conceptually separate 
> issues, with different states of solution maturity:
> 
> 1) Now generally historic bugs in btrfs scrub, etc, that are fixed (thus 
> the historic) in current kernels and tools.  Unfortunately these will 
> still affect for some time many users of longer-term stale^H^Hble distros 
> who don't update using other sources for some time, as because the raid56 
> feature wasn't yet stable at the lock-in time for whatever versions they 
> stabilized on, they're not likely to get the fixes as it's new-feature 
> material.
> 
> If you're using a current kernel and tools, however, this issue is 
> fixed.  You can look on the wiki for the specific versions, but with the 
> 4.18 kernel current latest stable, it or 4.17, and similar tools versions 
> since the version numbers are synced, are the two latest release series, 
> with the two latest release series being best supported and considered 
> "current" on this list.
> 
> Also see...
> 
> 2) General feature maturity:  While raid56 mode should be /reasonably/ 
> stable now, it remains one of the newer features and simply hasn't yet 
> had the testing of time that tends to flush out the smaller and corner-
> case bugs, that more mature features such as raid1 have now had the 
> benefit of.
> 
> There's nothing to do for this but test, report any bugs you find, and 
> wait for the maturity that time brings.
> 
> Of course this is one of several reasons we so strongly emphasize and 
> recommend "current" on this list, because even for reasonably stable and 
> mature features such as raid1, btrfs itself remains new enough that they 
> still occasionally get latent bugs found and fixed, and while /some/ of 
> those fixes get backported to LTS kernels (with even less chance for 
> distros to backport tools fixes), not all of them do and even when they 
> do, current still gets the fixes first.
> 
> 3) The remaining issue is the infamous parity-raid write-hole that 
> affects all parity-raid implementations (not just btrfs) unless they take 
> specific steps to work around the issue.
> 
> The first thing to point out here again is that it's not btrfs-specific.  
> Between that and the fact that it *ONLY* affects parity-raid operating in 
> degraded mode *WITH* an ungraceful-shutdown recovery situation, it could 
> be argued not to be a btrfs issue at all, but rather one inherent to 
> parity-raid mode and considered an acceptable risk to those choosing 
> parity-raid because it's only a factor when operating degraded, if an 
> ungraceful shutdown does occur.
> 
> But btrfs' COW nature along with a couple technical implementation 
> factors (the read-modify-write cycle for incomplete stripe widths and how 
> that risks existing metadata when new metadata is written) does amplify 
> the risk somewhat compared to that seen with the same write-hole issue in 
> various other parity-raid implementations that don't avoid it due to 
> taking write-hole avoidance countermeasures.
> 
> 
> So what can be done right now?
> 
> As it happens there is a mitigation the admin can currently take -- btrfs 
> allows specifying data and metadata modes separately, and even where 
> 

Re: List of known BTRFS Raid 5/6 Bugs?

2018-09-08 Thread Duncan
Stefan K posted on Fri, 07 Sep 2018 15:58:36 +0200 as excerpted:

> sorry for disturb this discussion,
> 
> are there any plans/dates to fix the raid5/6 issue? Is somebody working
> on this issue? Cause this is for me one of the most important things for
> a fileserver, with a raid1 config I loose to much diskspace.

There's a more technically complete discussion of this in at least two 
earlier threads you can find on the list archive, if you're interested, 
but here's the basics (well, extended basics...) from a btrfs-using-
sysadmin perspective.

"The raid5/6 issue" can refer to at least three conceptually separate 
issues, with different states of solution maturity:

1) Now generally historic bugs in btrfs scrub, etc, that are fixed (thus 
the historic) in current kernels and tools.  Unfortunately these will 
still affect many users of longer-term stale^H^Hble distros who don't 
update from other sources for some time: because the raid56 feature 
wasn't yet stable at the lock-in time for whatever versions they 
stabilized on, they're not likely to get the fixes, as it's new-feature 
material.

If you're using a current kernel and tools, however, this issue is 
fixed.  You can look on the wiki for the specific versions, but with the 
4.18 kernel currently the latest stable, the two latest release series 
are 4.18 and 4.17, along with the matching tools versions (the version 
numbers are synced), and the two latest release series are what's best 
supported and considered "current" on this list.

Also see...

2) General feature maturity:  While raid56 mode should be /reasonably/ 
stable now, it remains one of the newer features and simply hasn't yet 
had the testing of time that tends to flush out the smaller and corner-
case bugs, that more mature features such as raid1 have now had the 
benefit of.

There's nothing to do for this but test, report any bugs you find, and 
wait for the maturity that time brings.

Of course this is one of several reasons we so strongly emphasize and 
recommend "current" on this list, because even for reasonably stable and 
mature features such as raid1, btrfs itself remains new enough that they 
still occasionally get latent bugs found and fixed, and while /some/ of 
those fixes get backported to LTS kernels (with even less chance for 
distros to backport tools fixes), not all of them do and even when they 
do, current still gets the fixes first.

3) The remaining issue is the infamous parity-raid write-hole that 
affects all parity-raid implementations (not just btrfs) unless they take 
specific steps to work around the issue.

The first thing to point out here again is that it's not btrfs-specific.  
Between that and the fact that it *ONLY* affects parity-raid operating in 
degraded mode *WITH* an ungraceful-shutdown recovery situation, it could 
be argued not to be a btrfs issue at all, but rather one inherent to 
parity-raid mode and considered an acceptable risk to those choosing 
parity-raid because it's only a factor when operating degraded, if an 
ungraceful shutdown does occur.

But btrfs' COW nature, along with a couple of technical implementation 
factors (the read-modify-write cycle for incomplete stripe widths, and how 
that risks existing metadata when new metadata is written), does amplify 
the risk somewhat compared to the same write-hole issue in various other 
parity-raid implementations that haven't taken write-hole avoidance 
countermeasures.


So what can be done right now?

As it happens there is a mitigation the admin can currently take -- btrfs 
allows specifying data and metadata modes separately, so even where raid1 
loses too much space to be used for both, it's possible to specify data 
as raid5/6 and metadata as raid1.  While btrfs raid1 only covers loss of 
a single device, it doesn't have the parity-raid write-hole, as it's not 
parity-raid.  For most use-cases at least, specifying raid1 for metadata 
only, with raid5 for data, should strictly limit both the risk of the 
parity-raid write-hole (it'll be limited to data, which in most cases 
will be full-stripe writes and thus not subject to the problem) and the 
size-doubling of raid1 (it'll be limited to metadata).
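
On an already-populated filesystem the same split can be reached with a 
conversion balance; a sketch (the mount point is a placeholder, and the 
balance rewrites every chunk, so expect it to take a while):

    # convert metadata to raid1 and data to raid5 on an existing filesystem
    btrfs balance start -mconvert=raid1 -dconvert=raid5 /mnt/array
    btrfs balance status /mnt/array    # check progress from another shell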

Meanwhile, consider the sysadmin's first rule of backups: the true value 
of data isn't defined by arbitrary claims, but by the number of backups 
it is considered worth the time/trouble/resources to have of that data.  
For an admin properly following that rule, this is a known parity-raid 
risk specifically limited to the corner-case of having an ungraceful 
shutdown *WHILE* already operating degraded, and as such it can be 
managed along with all the other known risks to the data, including admin 
fat-fingering, the risk that more devices will go out than the array can 
tolerate, the risk of general bugs affecting the filesystem or other 
storage-related code, etc.

IOW, in the context of the admin's first rule of backups, no matter the 
issue, rai

Re: List of known BTRFS Raid 5/6 Bugs?

2018-09-07 Thread Stefan K
sorry to disturb this discussion,

are there any plans/dates to fix the raid5/6 issue? Is somebody working on this 
issue? Because this is for me one of the most important things for a fileserver; 
with a raid1 config I lose too much disk space.

best regards
Stefan


Re: List of known BTRFS Raid 5/6 Bugs?

2018-08-17 Thread Menion
OK, but I cannot guarantee that I won't need to cancel scrub during the process.
As said, this is domestic storage, and when scrub is running the
performance hit is big enough to prevent smooth streaming of HD and 4K
movies.
On Thu, Aug 16, 2018 at 21:38,  wrote:
>
> Could you show scrub status -d, then start a new scrub (all drives) and show 
> scrub status -d again? This may help us diagnose the problem.
>
> On 15-Aug-2018 09:27:40 +0200, men...@gmail.com wrote:
> > I needed to resume scrub two times after an unclear shutdown (I was
> > cooking and using too much electricity) and two times after a manual
> > cancel, because I wanted to watch a 4k movie and the array
> > performances were not enough with scrub active.
> > Each time I resumed it, I checked also the status, and the total
> > number of data scrubbed was keep counting (never started from zero)
> > On Wed, Aug 15, 2018 at 05:33, Zygo Blaxell wrote:
> > >
> > > On Tue, Aug 14, 2018 at 09:32:51AM +0200, Menion wrote:
> > > > Hi
> > > > Well, I think it is worth to give more details on the array.
> > > > the array is built with 5x8TB HDD in an esternal USB3.0 to SATAIII 
> > > > enclosure
> > > > The enclosure is a cheap JMicron based chinese stuff (from Orico).
> > > > There is one USB3.0 link for all the 5 HDD with a SATAIII 3.0Gb
> > > > multiplexer behind it. So you cannot expect peak performance, which is
> > > > not the goal of this array (domestic data storage).
> > > > Also the USB to SATA firmware is buggy, so UAS operations are not
> > > > stable, it run in BOT mode.
> > > > Having said so, the scrub has been started (and resumed) on the array
> > > > mount point:
> > > >
> > > > sudo btrfs scrub start(resume) /media/storage/das1
> > >
> > > So is 2.59TB the amount scrubbed _since resume_? If you run a complete
> > > scrub end to end without cancelling or rebooting in between, what is
> > > the size on all disks (btrfs scrub status -d)?
> > >
> > > > even if reading the documentation I understand that it is the same
> > > > invoking it on mountpoint or one of the HDD in the array.
> > > > In the end, especially for a RAID5 array, does it really make sense to
> > > > scrub only one disk in the array???
> > >
> > > You would set up a shell for-loop and scrub each disk of the array
> > > in turn. Each scrub would correct errors on a single device.
> > >
> > > There was a bug in btrfs scrub where scrubbing the filesystem would
> > > create one thread for each disk, and the threads would issue commands
> > > to all disks and compete with each other for IO, resulting in terrible
> > > performance on most non-SSD hardware. By scrubbing disks one at a time,
> > > there are no competing threads, so the scrub runs many times faster.
> > > With this bug the total time to scrub all disks individually is usually
> > > less than the time to scrub the entire filesystem at once, especially
> > > on HDD (and even if it's not faster, one-at-a-time disk scrubs are
> > > much kinder to any other process trying to use the filesystem at the
> > > same time).
> > >
> > > It appears this bug is not fixed, based on some timing results I am
> > > getting from a test array. iostat shows 10x more reads than writes on
> > > all disks even when all blocks on one disk are corrupted and the scrub
> > > is given only a single disk to process (that should result in roughly
> > > equal reads on all disks slightly above the number of writes on the
> > > corrupted disk).
> > >
> > > This is where my earlier caveat about performance comes from. Many parts
> > > of btrfs raid5 are somewhere between slower and *much* slower than
> > > comparable software raid5 implementations. Some of that is by design:
> > > btrfs must be at least 1% slower than mdadm because btrfs needs to read
> > > metadata to verify data block csums in scrub, and the difference would
> > > be much larger in practice due to HDD seek times, but 500%-900% overhead
> > > still seems high especially when compared to btrfs raid1 that has the
> > > same metadata csum reading issue without the huge performance gap.
> > >
> > > It seems like btrfs raid5 could still use a thorough profiling to figure
> > > out where it's spending all its IO.
> > >
> > > > Regarding the data usage, here you have the current figures:
> > > >
> > > > menion@Menionubuntu:~$ sudo btrfs fi show
> > > > [sudo] password for menion:
> > > > Label: none uuid: 6db4baf7-fda8-41ac-a6ad-1ca7b083430f
> > > > Total devices 1 FS bytes used 11.44GiB
> > > > devid 1 size 27.07GiB used 18.07GiB path /dev/mmcblk0p3
> > > >
> > > > Label: none uuid: 931d40c6-7cd7-46f3-a4bf-61f3a53844bc
> > > > Total devices 5 FS bytes used 6.57TiB
> > > > devid 1 size 7.28TiB used 1.64TiB path /dev/sda
> > > > devid 2 size 7.28TiB used 1.64TiB path /dev/sdb
> > > > devid 3 size 7.28TiB used 1.64TiB path /dev/sdc
> > > > devid 4 size 7.28TiB used 1.64TiB path /dev/sdd
> > > > devid 5 size 7.28TiB used 1.64TiB path /dev/sde
> > > >
> >

Re: List of known BTRFS Raid 5/6 Bugs?

2018-08-16 Thread erenthetitan
Could you show scrub status -d, then start a new scrub (all drives) and show 
scrub status -d again? This may help us diagnose the problem.
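
Concretely, using the array mount point from earlier in the thread, that 
would be something like:

    sudo btrfs scrub status -d /media/storage/das1              # per-device statistics
    sudo watch -n 1 btrfs scrub status -d /media/storage/das1   # refresh every second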

On 15-Aug-2018 09:27:40 +0200, men...@gmail.com wrote:
> I needed to resume scrub two times after an unclear shutdown (I was
> cooking and using too much electricity) and two times after a manual
> cancel, because I wanted to watch a 4k movie and the array
> performances were not enough with scrub active.
> Each time I resumed it, I checked also the status, and the total
> number of data scrubbed was keep counting (never started from zero)
> On Wed, Aug 15, 2018 at 05:33, Zygo Blaxell wrote:
> >
> > On Tue, Aug 14, 2018 at 09:32:51AM +0200, Menion wrote:
> > > Hi
> > > Well, I think it is worth to give more details on the array.
> > > the array is built with 5x8TB HDD in an esternal USB3.0 to SATAIII 
> > > enclosure
> > > The enclosure is a cheap JMicron based chinese stuff (from Orico).
> > > There is one USB3.0 link for all the 5 HDD with a SATAIII 3.0Gb
> > > multiplexer behind it. So you cannot expect peak performance, which is
> > > not the goal of this array (domestic data storage).
> > > Also the USB to SATA firmware is buggy, so UAS operations are not
> > > stable, it run in BOT mode.
> > > Having said so, the scrub has been started (and resumed) on the array
> > > mount point:
> > >
> > > sudo btrfs scrub start(resume) /media/storage/das1
> >
> > So is 2.59TB the amount scrubbed _since resume_? If you run a complete
> > scrub end to end without cancelling or rebooting in between, what is
> > the size on all disks (btrfs scrub status -d)?
> >
> > > even if reading the documentation I understand that it is the same
> > > invoking it on mountpoint or one of the HDD in the array.
> > > In the end, especially for a RAID5 array, does it really make sense to
> > > scrub only one disk in the array???
> >
> > You would set up a shell for-loop and scrub each disk of the array
> > in turn. Each scrub would correct errors on a single device.
> >
> > There was a bug in btrfs scrub where scrubbing the filesystem would
> > create one thread for each disk, and the threads would issue commands
> > to all disks and compete with each other for IO, resulting in terrible
> > performance on most non-SSD hardware. By scrubbing disks one at a time,
> > there are no competing threads, so the scrub runs many times faster.
> > With this bug the total time to scrub all disks individually is usually
> > less than the time to scrub the entire filesystem at once, especially
> > on HDD (and even if it's not faster, one-at-a-time disk scrubs are
> > much kinder to any other process trying to use the filesystem at the
> > same time).
> >
> > It appears this bug is not fixed, based on some timing results I am
> > getting from a test array. iostat shows 10x more reads than writes on
> > all disks even when all blocks on one disk are corrupted and the scrub
> > is given only a single disk to process (that should result in roughly
> > equal reads on all disks slightly above the number of writes on the
> > corrupted disk).
> >
> > This is where my earlier caveat about performance comes from. Many parts
> > of btrfs raid5 are somewhere between slower and *much* slower than
> > comparable software raid5 implementations. Some of that is by design:
> > btrfs must be at least 1% slower than mdadm because btrfs needs to read
> > metadata to verify data block csums in scrub, and the difference would
> > be much larger in practice due to HDD seek times, but 500%-900% overhead
> > still seems high especially when compared to btrfs raid1 that has the
> > same metadata csum reading issue without the huge performance gap.
> >
> > It seems like btrfs raid5 could still use a thorough profiling to figure
> > out where it's spending all its IO.
> >
> > > Regarding the data usage, here you have the current figures:
> > >
> > > menion@Menionubuntu:~$ sudo btrfs fi show
> > > [sudo] password for menion:
> > > Label: none uuid: 6db4baf7-fda8-41ac-a6ad-1ca7b083430f
> > > Total devices 1 FS bytes used 11.44GiB
> > > devid 1 size 27.07GiB used 18.07GiB path /dev/mmcblk0p3
> > >
> > > Label: none uuid: 931d40c6-7cd7-46f3-a4bf-61f3a53844bc
> > > Total devices 5 FS bytes used 6.57TiB
> > > devid 1 size 7.28TiB used 1.64TiB path /dev/sda
> > > devid 2 size 7.28TiB used 1.64TiB path /dev/sdb
> > > devid 3 size 7.28TiB used 1.64TiB path /dev/sdc
> > > devid 4 size 7.28TiB used 1.64TiB path /dev/sdd
> > > devid 5 size 7.28TiB used 1.64TiB path /dev/sde
> > >
> > > menion@Menionubuntu:~$ sudo btrfs fi df /media/storage/das1
> > > Data, RAID5: total=6.57TiB, used=6.56TiB
> > > System, RAID5: total=12.75MiB, used=416.00KiB
> > > Metadata, RAID5: total=9.00GiB, used=8.16GiB
> > > GlobalReserve, single: total=512.00MiB, used=0.00B
> > > menion@Menionubuntu:~$ sudo btrfs fi usage /media/storage/das1
> > > WARNING: RAID56 detected, not implemented
> > > WARNING: RAID56 detected, not implemented
> > > WARNING: RAI

Re: List of known BTRFS Raid 5/6 Bugs?

2018-08-15 Thread Menion
I needed to resume scrub two times after an unclean shutdown (I was
cooking and using too much electricity) and two times after a manual
cancel, because I wanted to watch a 4K movie and the array
performance was not enough with scrub active.
Each time I resumed it I also checked the status, and the total
amount of data scrubbed kept counting up (it never restarted from zero).
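
For reference, the cancel/resume cycle described above is just (using the 
array's mount point from this thread):

    sudo btrfs scrub cancel /media/storage/das1   # stop the scrub, keeping its progress state
    sudo btrfs scrub resume /media/storage/das1   # continue from where it was cancelled
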
On Wed, Aug 15, 2018 at 05:33, Zygo Blaxell wrote:
>
> On Tue, Aug 14, 2018 at 09:32:51AM +0200, Menion wrote:
> > Hi
> > Well, I think it is worth to give more details on the array.
> > the array is built with 5x8TB HDD in an esternal USB3.0 to SATAIII enclosure
> > The enclosure is a cheap JMicron based chinese stuff (from Orico).
> > There is one USB3.0 link for all the 5 HDD with a SATAIII 3.0Gb
> > multiplexer behind it. So you cannot expect peak performance, which is
> > not the goal of this array (domestic data storage).
> > Also the USB to SATA firmware is buggy, so UAS operations are not
> > stable, it run in BOT mode.
> > Having said so, the scrub has been started (and resumed) on the array
> > mount point:
> >
> > sudo btrfs scrub start(resume) /media/storage/das1
>
> So is 2.59TB the amount scrubbed _since resume_?  If you run a complete
> scrub end to end without cancelling or rebooting in between, what is
> the size on all disks (btrfs scrub status -d)?
>
> > even if reading the documentation I understand that it is the same
> > invoking it on mountpoint or one of the HDD in the array.
> > In the end, especially for a RAID5 array, does it really make sense to
> > scrub only one disk in the array???
>
> You would set up a shell for-loop and scrub each disk of the array
> in turn.  Each scrub would correct errors on a single device.
>
> There was a bug in btrfs scrub where scrubbing the filesystem would
> create one thread for each disk, and the threads would issue commands
> to all disks and compete with each other for IO, resulting in terrible
> performance on most non-SSD hardware.  By scrubbing disks one at a time,
> there are no competing threads, so the scrub runs many times faster.
> With this bug the total time to scrub all disks individually is usually
> less than the time to scrub the entire filesystem at once, especially
> on HDD (and even if it's not faster, one-at-a-time disk scrubs are
> much kinder to any other process trying to use the filesystem at the
> same time).
>
> It appears this bug is not fixed, based on some timing results I am
> getting from a test array.  iostat shows 10x more reads than writes on
> all disks even when all blocks on one disk are corrupted and the scrub
> is given only a single disk to process (that should result in roughly
> equal reads on all disks slightly above the number of writes on the
> corrupted disk).
>
> This is where my earlier caveat about performance comes from.  Many parts
> of btrfs raid5 are somewhere between slower and *much* slower than
> comparable software raid5 implementations.  Some of that is by design:
> btrfs must be at least 1% slower than mdadm because btrfs needs to read
> metadata to verify data block csums in scrub, and the difference would
> be much larger in practice due to HDD seek times, but 500%-900% overhead
> still seems high especially when compared to btrfs raid1 that has the
> same metadata csum reading issue without the huge performance gap.
>
> It seems like btrfs raid5 could still use a thorough profiling to figure
> out where it's spending all its IO.
>
> > Regarding the data usage, here you have the current figures:
> >
> > menion@Menionubuntu:~$ sudo btrfs fi show
> > [sudo] password for menion:
> > Label: none  uuid: 6db4baf7-fda8-41ac-a6ad-1ca7b083430f
> > Total devices 1 FS bytes used 11.44GiB
> > devid1 size 27.07GiB used 18.07GiB path /dev/mmcblk0p3
> >
> > Label: none  uuid: 931d40c6-7cd7-46f3-a4bf-61f3a53844bc
> > Total devices 5 FS bytes used 6.57TiB
> > devid1 size 7.28TiB used 1.64TiB path /dev/sda
> > devid2 size 7.28TiB used 1.64TiB path /dev/sdb
> > devid3 size 7.28TiB used 1.64TiB path /dev/sdc
> > devid4 size 7.28TiB used 1.64TiB path /dev/sdd
> > devid5 size 7.28TiB used 1.64TiB path /dev/sde
> >
> > menion@Menionubuntu:~$ sudo btrfs fi df /media/storage/das1
> > Data, RAID5: total=6.57TiB, used=6.56TiB
> > System, RAID5: total=12.75MiB, used=416.00KiB
> > Metadata, RAID5: total=9.00GiB, used=8.16GiB
> > GlobalReserve, single: total=512.00MiB, used=0.00B
> > menion@Menionubuntu:~$ sudo btrfs fi usage /media/storage/das1
> > WARNING: RAID56 detected, not implemented
> > WARNING: RAID56 detected, not implemented
> > WARNING: RAID56 detected, not implemented
> > Overall:
> > Device size:   36.39TiB
> > Device allocated:  0.00B
> > Device unallocated:   36.39TiB
> > Device missing:  0.00B
> > Used:  0.00B
> > Free (estimated):  0.00B (min: 8.00EiB)
> > Data ratio:   0.00
> > Metadata ratio:   0.00
> > Global reserve: 512.00Mi

Re: List of known BTRFS Raid 5/6 Bugs?

2018-08-14 Thread Zygo Blaxell
On Tue, Aug 14, 2018 at 09:32:51AM +0200, Menion wrote:
> Hi
> Well, I think it is worth to give more details on the array.
> the array is built with 5x8TB HDD in an esternal USB3.0 to SATAIII enclosure
> The enclosure is a cheap JMicron based chinese stuff (from Orico).
> There is one USB3.0 link for all the 5 HDD with a SATAIII 3.0Gb
> multiplexer behind it. So you cannot expect peak performance, which is
> not the goal of this array (domestic data storage).
> Also the USB to SATA firmware is buggy, so UAS operations are not
> stable, it run in BOT mode.
> Having said so, the scrub has been started (and resumed) on the array
> mount point:
> 
> sudo btrfs scrub start(resume) /media/storage/das1

So is 2.59TB the amount scrubbed _since resume_?  If you run a complete
scrub end to end without cancelling or rebooting in between, what is
the size on all disks (btrfs scrub status -d)?

> even if reading the documentation I understand that it is the same
> invoking it on mountpoint or one of the HDD in the array.
> In the end, especially for a RAID5 array, does it really make sense to
> scrub only one disk in the array???

You would set up a shell for-loop and scrub each disk of the array
in turn.  Each scrub would correct errors on a single device.

There was a bug in btrfs scrub where scrubbing the filesystem would
create one thread for each disk, and the threads would issue commands
to all disks and compete with each other for IO, resulting in terrible
performance on most non-SSD hardware.  By scrubbing disks one at a time,
there are no competing threads, so the scrub runs many times faster.
With this bug the total time to scrub all disks individually is usually
less than the time to scrub the entire filesystem at once, especially
on HDD (and even if it's not faster, one-at-a-time disk scrubs are
much kinder to any other process trying to use the filesystem at the
same time).
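
A minimal sketch of such a loop (device names as in this array; -B keeps 
each scrub in the foreground so the next one only starts after the 
previous one has finished):

    for dev in /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde; do
        btrfs scrub start -B "$dev"    # scrub (and repair) one device at a time
    done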

It appears this bug is not fixed, based on some timing results I am
getting from a test array.  iostat shows 10x more reads than writes on
all disks even when all blocks on one disk are corrupted and the scrub
is given only a single disk to process (that should result in roughly
equal reads on all disks slightly above the number of writes on the
corrupted disk).

This is where my earlier caveat about performance comes from.  Many parts
of btrfs raid5 are somewhere between slower and *much* slower than
comparable software raid5 implementations.  Some of that is by design:
btrfs must be at least 1% slower than mdadm because btrfs needs to read
metadata to verify data block csums in scrub, and the difference would
be much larger in practice due to HDD seek times, but 500%-900% overhead
still seems high especially when compared to btrfs raid1 that has the
same metadata csum reading issue without the huge performance gap.

It seems like btrfs raid5 could still use a thorough profiling to figure
out where it's spending all its IO.

> Regarding the data usage, here you have the current figures:
> 
> menion@Menionubuntu:~$ sudo btrfs fi show
> [sudo] password for menion:
> Label: none  uuid: 6db4baf7-fda8-41ac-a6ad-1ca7b083430f
> Total devices 1 FS bytes used 11.44GiB
> devid1 size 27.07GiB used 18.07GiB path /dev/mmcblk0p3
> 
> Label: none  uuid: 931d40c6-7cd7-46f3-a4bf-61f3a53844bc
> Total devices 5 FS bytes used 6.57TiB
> devid1 size 7.28TiB used 1.64TiB path /dev/sda
> devid2 size 7.28TiB used 1.64TiB path /dev/sdb
> devid3 size 7.28TiB used 1.64TiB path /dev/sdc
> devid4 size 7.28TiB used 1.64TiB path /dev/sdd
> devid5 size 7.28TiB used 1.64TiB path /dev/sde
> 
> menion@Menionubuntu:~$ sudo btrfs fi df /media/storage/das1
> Data, RAID5: total=6.57TiB, used=6.56TiB
> System, RAID5: total=12.75MiB, used=416.00KiB
> Metadata, RAID5: total=9.00GiB, used=8.16GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> menion@Menionubuntu:~$ sudo btrfs fi usage /media/storage/das1
> WARNING: RAID56 detected, not implemented
> WARNING: RAID56 detected, not implemented
> WARNING: RAID56 detected, not implemented
> Overall:
> Device size:   36.39TiB
> Device allocated:  0.00B
> Device unallocated:   36.39TiB
> Device missing:  0.00B
> Used:  0.00B
> Free (estimated):  0.00B (min: 8.00EiB)
> Data ratio:   0.00
> Metadata ratio:   0.00
> Global reserve: 512.00MiB (used: 32.00KiB)
> 
> Data,RAID5: Size:6.57TiB, Used:6.56TiB
>/dev/sda1.64TiB
>/dev/sdb1.64TiB
>/dev/sdc1.64TiB
>/dev/sdd1.64TiB
>/dev/sde1.64TiB
> 
> Metadata,RAID5: Size:9.00GiB, Used:8.16GiB
>/dev/sda2.25GiB
>/dev/sdb2.25GiB
>/dev/sdc2.25GiB
>/dev/sdd2.25GiB
>/dev/sde2.25GiB
> 
> System,RAID5: Size:12.75MiB, Used:416.00KiB
>/dev/sda3.19MiB
>/dev/sdb3.19MiB
>/dev/sdc3.19MiB
>/dev/sdd3.19MiB
>/dev/sde3.19MiB
> 
> Unallocated:
>/dev/sda5.63TiB
>/dev/sdb5.63TiB
>/dev/sdc  

Re: List of known BTRFS Raid 5/6 Bugs?

2018-08-14 Thread Menion
Hi
Well, I think it is worth giving more details on the array.
The array is built with 5x8TB HDDs in an external USB 3.0 to SATA III enclosure.
The enclosure is cheap JMicron-based Chinese stuff (from Orico).
There is one USB 3.0 link for all 5 HDDs, with a SATA III 3.0Gb
multiplexer behind it. So you cannot expect peak performance, which is
not the goal of this array (domestic data storage).
Also the USB-to-SATA firmware is buggy, so UAS operation is not
stable; it runs in BOT mode.
Having said that, the scrub has been started (and resumed) on the array
mount point:

sudo btrfs scrub start(resume) /media/storage/das1

even if, reading the documentation, I understand that it is the same
whether it is invoked on the mount point or on one of the HDDs in the array.
In the end, especially for a RAID5 array, does it really make sense to
scrub only one disk in the array???
Regarding the data usage, here are the current figures:

menion@Menionubuntu:~$ sudo btrfs fi show
[sudo] password for menion:
Label: none  uuid: 6db4baf7-fda8-41ac-a6ad-1ca7b083430f
Total devices 1 FS bytes used 11.44GiB
devid 1 size 27.07GiB used 18.07GiB path /dev/mmcblk0p3

Label: none  uuid: 931d40c6-7cd7-46f3-a4bf-61f3a53844bc
Total devices 5 FS bytes used 6.57TiB
devid 1 size 7.28TiB used 1.64TiB path /dev/sda
devid 2 size 7.28TiB used 1.64TiB path /dev/sdb
devid 3 size 7.28TiB used 1.64TiB path /dev/sdc
devid 4 size 7.28TiB used 1.64TiB path /dev/sdd
devid 5 size 7.28TiB used 1.64TiB path /dev/sde

menion@Menionubuntu:~$ sudo btrfs fi df /media/storage/das1
Data, RAID5: total=6.57TiB, used=6.56TiB
System, RAID5: total=12.75MiB, used=416.00KiB
Metadata, RAID5: total=9.00GiB, used=8.16GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
menion@Menionubuntu:~$ sudo btrfs fi usage /media/storage/das1
WARNING: RAID56 detected, not implemented
WARNING: RAID56 detected, not implemented
WARNING: RAID56 detected, not implemented
Overall:
Device size:   36.39TiB
Device allocated:  0.00B
Device unallocated:   36.39TiB
Device missing:  0.00B
Used:  0.00B
Free (estimated):  0.00B (min: 8.00EiB)
Data ratio:   0.00
Metadata ratio:   0.00
Global reserve: 512.00MiB (used: 32.00KiB)

Data,RAID5: Size:6.57TiB, Used:6.56TiB
   /dev/sda   1.64TiB
   /dev/sdb   1.64TiB
   /dev/sdc   1.64TiB
   /dev/sdd   1.64TiB
   /dev/sde   1.64TiB

Metadata,RAID5: Size:9.00GiB, Used:8.16GiB
   /dev/sda   2.25GiB
   /dev/sdb   2.25GiB
   /dev/sdc   2.25GiB
   /dev/sdd   2.25GiB
   /dev/sde   2.25GiB

System,RAID5: Size:12.75MiB, Used:416.00KiB
   /dev/sda   3.19MiB
   /dev/sdb   3.19MiB
   /dev/sdc   3.19MiB
   /dev/sdd   3.19MiB
   /dev/sde   3.19MiB

Unallocated:
   /dev/sda   5.63TiB
   /dev/sdb   5.63TiB
   /dev/sdc   5.63TiB
   /dev/sdd   5.63TiB
   /dev/sde   5.63TiB
menion@Menionubuntu:~$
menion@Menionubuntu:~$ sf -h
The program 'sf' is currently not installed. You can install it by typing:
sudo apt install ruby-sprite-factory
menion@Menionubuntu:~$ df -h
Filesystem  Size  Used Avail Use% Mounted on
udev934M 0  934M   0% /dev
tmpfs   193M   22M  171M  12% /run
/dev/mmcblk0p3   28G   12G   15G  44% /
tmpfs   962M 0  962M   0% /dev/shm
tmpfs   5,0M 0  5,0M   0% /run/lock
tmpfs   962M 0  962M   0% /sys/fs/cgroup
/dev/mmcblk0p1  188M  3,4M  184M   2% /boot/efi
/dev/mmcblk0p3   28G   12G   15G  44% /home
/dev/sda 37T  6,6T   29T  19% /media/storage/das1
tmpfs   193M 0  193M   0% /run/user/1000
menion@Menionubuntu:~$ btrfs --version
btrfs-progs v4.17

So I don't fully understand where the scrub data size comes from
On Mon, Aug 13, 2018 at 23:56,  wrote:
>
> Running time of 55:06:35 indicates that the counter is right, it is not 
> enough time to scrub the entire array using hdd.
>
> 2TiB might be right if you only scrubbed one disc, "sudo btrfs scrub start 
> /dev/sdx1" only scrubs the selected partition,
> whereas "sudo btrfs scrub start /media/storage/das1" scrubs the actual array.
>
> Use "sudo btrfs scrub status -d " to view per disc scrubbing statistics and 
> post the output.
> For live statistics, use "sudo watch -n 1".
>
> By the way:
> 0 errors despite multiple unclean shutdowns? I assumed that the write hole 
> would corrupt parity the first time around, was i wrong?
>
> On 13-Aug-2018 09:20:36 +0200, men...@gmail.com wrote:
> > Hi
> > I have a BTRFS RAID5 array built on 5x8TB HDD filled with, well :),
> > there are contradicting opinions by the, well, "several" ways to check
> > the used space on a BTRFS RAID5 array, but I should be aroud 8TB of
> > data.
> > This array is running on kernel 4.17.3 and it definitely experienced
> > power loss while data was being written.
> > I can say that it wen through at least a dozen of unclear shutdown
> > So following this thread I started my first scrub on the array. and
> > this is the outcome (after having resumed it 4 times, two after a
>

Re: List of known BTRFS Raid 5/6 Bugs?

2018-08-13 Thread Zygo Blaxell
On Mon, Aug 13, 2018 at 11:56:05PM +0200, erentheti...@mail.de wrote:
> Running time of 55:06:35 indicates that the counter is right, it is
> not enough time to scrub the entire array using hdd.
> 
> 2TiB might be right if you only scrubbed one disc, "sudo btrfs scrub
> start /dev/sdx1" only scrubs the selected partition,
> whereas "sudo btrfs scrub start /media/storage/das1" scrubs the actual array.
> 
> Use "sudo btrfs scrub status -d " to view per disc scrubbing statistics
> and post the output.
> For live statistics, use "sudo watch -n 1".
> 
> By the way:
> 0 errors despite multiple unclean shutdowns? I assumed that the write
> hole would corrupt parity the first time around, was i wrong?

You won't see the write hole from just a power failure.  You need a
power failure *and* a disk failure, and writes need to be happening at
the moment power fails.

Write hole breaks parity.  Scrub silently(!) fixes parity.  Scrub reads
the parity block and compares it to the computed parity, and if it's
wrong, scrub writes the computed parity back.  Normal RAID5 reads with
all disks online read only the data blocks, so they won't read the parity
block and won't detect wrong parity.
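
So after an unclean shutdown on a non-degraded array, a full scrub is what 
quietly recomputes and rewrites any stale parity; for example (using the 
mount point discussed elsewhere in this thread):

    sudo btrfs scrub start -B -d /media/storage/das1   # foreground scrub, per-device stats at the end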

I did a couple of order-of-magnitude estimations of how likely a power
failure is to trash a btrfs RAID system and got a probability between 3%
and 30% per power failure if there were writes active at the time, and
a disk failed to join the array after boot.  That was based on 5 disks
having 31 writes queued with one of the disks being significantly slower
than the others (as failing disks often are) with continuous write load.

If you have a power failure on an array that isn't writing anything at
the time, nothing happens.

> 
> On 13-Aug-2018 09:20:36 +0200, men...@gmail.com wrote:
> > Hi
> > I have a BTRFS RAID5 array built on 5x8TB HDD filled with, well :),
> > there are contradicting opinions by the, well, "several" ways to check
> > the used space on a BTRFS RAID5 array, but I should be aroud 8TB of
> > data.
> > This array is running on kernel 4.17.3 and it definitely experienced
> > power loss while data was being written.
> > I can say that it wen through at least a dozen of unclear shutdown
> > So following this thread I started my first scrub on the array. and
> > this is the outcome (after having resumed it 4 times, two after a
> > power loss...):
> > 
> > menion@Menionubuntu:~$ sudo btrfs scrub status /media/storage/das1/
> > scrub status for 931d40c6-7cd7-46f3-a4bf-61f3a53844bc
> > scrub resumed at Sun Aug 12 18:43:31 2018 and finished after 55:06:35
> > total bytes scrubbed: 2.59TiB with 0 errors
> > 
> > So, there are 0 errors, but I don't understand why it says 2.59TiB of
> > scrubbed data. Is it possible that also this values is crap, as the
> > non zero counters for RAID5 array?
> > On Sat, Aug 11, 2018 at 17:29, Zygo Blaxell wrote:
> > >
> > > On Sat, Aug 11, 2018 at 08:27:04AM +0200, erentheti...@mail.de wrote:
> > > > I guess that covers most topics, two last questions:
> > > >
> > > > Will the write hole behave differently on Raid 6 compared to Raid 5 ?
> > >
> > > Not really. It changes the probability distribution (you get an extra
> > > chance to recover using a parity block in some cases), but there are
> > > still cases where data gets lost that didn't need to be.
> > >
> > > > Is there any benefit of running Raid 5 Metadata compared to Raid 1 ?
> > >
> > > There may be benefits of raid5 metadata, but they are small compared to
> > > the risks.
> > >
> > > In some configurations it may not be possible to allocate the last
> > > gigabyte of space. raid1 will allocate 1GB chunks from 2 disks at a
> > > time while raid5 will allocate 1GB chunks from N disks at a time, and if
> > > N is an odd number there could be one chunk left over in the array that
> > > is unusable. Most users will find this irrelevant because a large disk
> > > array that is filled to the last GB will become quite slow due to long
> > > free space search and seek times--you really want to keep usage below 95%,
> > > maybe 98% at most, and that means the last GB will never be needed.
> > >
> > > Reading raid5 metadata could theoretically be faster than raid1, but that
> > > depends on a lot of variables, so you can't assume it as a rule of thumb.
> > >
> > > Raid6 metadata is more interesting because it's the only currently
> > > supported way to get 2-disk failure tolerance in btrfs. Unfortunately
> > > that benefit is rather limited due to the write hole bug.
> > >
> > > There are patches floating around that implement multi-disk raid1 (i.e. 3
> > > or 4 mirror copies instead of just 2). This would be much better for
> > > metadata than raid6--more flexible, more robust, and my guess is that
> > > it will be faster as well (no need for RMW updates or journal seeks).
> > >

Re: List of known BTRFS Raid 5/6 Bugs?

2018-08-13 Thread Zygo Blaxell
On Mon, Aug 13, 2018 at 09:20:22AM +0200, Menion wrote:
> Hi
> I have a BTRFS RAID5 array built on 5x8TB HDD filled with, well :),
> there are contradicting opinions by the, well, "several" ways to check
> the used space on a BTRFS RAID5 array, but I should be aroud 8TB of
> data.
> This array is running on kernel 4.17.3 and it definitely experienced
> power loss while data was being written.
> I can say that it wen through at least a dozen of unclear shutdown
> So following this thread I started my first scrub on the array. and
> this is the outcome (after having resumed it 4 times, two after a
> power loss...):
> 
> menion@Menionubuntu:~$ sudo btrfs scrub status /media/storage/das1/
> scrub status for 931d40c6-7cd7-46f3-a4bf-61f3a53844bc
> scrub resumed at Sun Aug 12 18:43:31 2018 and finished after 55:06:35
> total bytes scrubbed: 2.59TiB with 0 errors
> 
> So, there are 0 errors, but I don't understand why it says 2.59TiB of
> scrubbed data. Is it possible that also this values is crap, as the
> non zero counters for RAID5 array?

I just tested a quick scrub with injected errors on 4.18.0 and it looks
like the garbage values are finally fixed (yay!).

I never saw invalid values for 'total bytes' from raid5; however, scrub
has (had?) trouble resuming, especially if the system was rebooted between
cancel and resume, but sometimes just if the scrub had just been suspended
too long (maybe if there are changes to the chunk tree...?).

55 hours for 2600 GB is just under 50GB per hour, which doesn't sound
too unreasonable for btrfs, though it is known to be a bit slow compared
to other raid5 implementations.

> On Sat, Aug 11, 2018 at 17:29, Zygo Blaxell wrote:
> >
> > On Sat, Aug 11, 2018 at 08:27:04AM +0200, erentheti...@mail.de wrote:
> > > I guess that covers most topics, two last questions:
> > >
> > > Will the write hole behave differently on Raid 6 compared to Raid 5 ?
> >
> > Not really.  It changes the probability distribution (you get an extra
> > chance to recover using a parity block in some cases), but there are
> > still cases where data gets lost that didn't need to be.
> >
> > > Is there any benefit of running Raid 5 Metadata compared to Raid 1 ?
> >
> > There may be benefits of raid5 metadata, but they are small compared to
> > the risks.
> >
> > In some configurations it may not be possible to allocate the last
> > gigabyte of space.  raid1 will allocate 1GB chunks from 2 disks at a
> > time while raid5 will allocate 1GB chunks from N disks at a time, and if
> > N is an odd number there could be one chunk left over in the array that
> > is unusable.  Most users will find this irrelevant because a large disk
> > array that is filled to the last GB will become quite slow due to long
> > free space search and seek times--you really want to keep usage below 95%,
> > maybe 98% at most, and that means the last GB will never be needed.
> >
> > Reading raid5 metadata could theoretically be faster than raid1, but that
> > depends on a lot of variables, so you can't assume it as a rule of thumb.
> >
> > Raid6 metadata is more interesting because it's the only currently
> > supported way to get 2-disk failure tolerance in btrfs.  Unfortunately
> > that benefit is rather limited due to the write hole bug.
> >
> > There are patches floating around that implement multi-disk raid1 (i.e. 3
> > or 4 mirror copies instead of just 2).  This would be much better for
> > metadata than raid6--more flexible, more robust, and my guess is that
> > it will be faster as well (no need for RMW updates or journal seeks).
> >
> 




Re: List of known BTRFS Raid 5/6 Bugs?

2018-08-13 Thread erenthetitan
The running time of 55:06:35 indicates that the counter is right; it is not enough 
time to scrub the entire array on HDDs.

2TiB might be right if you only scrubbed one disc, "sudo btrfs scrub start 
/dev/sdx1" only scrubs the selected partition, 
whereas "sudo btrfs scrub start /media/storage/das1" scrubs the actual array.

Use "sudo btrfs scrub status -d " to view per disc scrubbing statistics and 
post the output.
For live statistics, use "sudo watch -n 1".

By the way:
0 errors despite multiple unclean shutdowns? I assumed that the write hole 
would corrupt parity the first time around, was I wrong?

On 13-Aug-2018 09:20:36 +0200, men...@gmail.com wrote:
> Hi
> I have a BTRFS RAID5 array built on 5x8TB HDD filled with, well :),
> there are contradicting opinions by the, well, "several" ways to check
> the used space on a BTRFS RAID5 array, but I should be aroud 8TB of
> data.
> This array is running on kernel 4.17.3 and it definitely experienced
> power loss while data was being written.
> I can say that it wen through at least a dozen of unclear shutdown
> So following this thread I started my first scrub on the array. and
> this is the outcome (after having resumed it 4 times, two after a
> power loss...):
> 
> menion@Menionubuntu:~$ sudo btrfs scrub status /media/storage/das1/
> scrub status for 931d40c6-7cd7-46f3-a4bf-61f3a53844bc
> scrub resumed at Sun Aug 12 18:43:31 2018 and finished after 55:06:35
> total bytes scrubbed: 2.59TiB with 0 errors
> 
> So, there are 0 errors, but I don't understand why it says 2.59TiB of
> scrubbed data. Is it possible that also this values is crap, as the
> non zero counters for RAID5 array?
> On Sat, Aug 11, 2018 at 17:29, Zygo Blaxell wrote:
> >
> > On Sat, Aug 11, 2018 at 08:27:04AM +0200, erentheti...@mail.de wrote:
> > > I guess that covers most topics, two last questions:
> > >
> > > Will the write hole behave differently on Raid 6 compared to Raid 5 ?
> >
> > Not really. It changes the probability distribution (you get an extra
> > chance to recover using a parity block in some cases), but there are
> > still cases where data gets lost that didn't need to be.
> >
> > > Is there any benefit of running Raid 5 Metadata compared to Raid 1 ?
> >
> > There may be benefits of raid5 metadata, but they are small compared to
> > the risks.
> >
> > In some configurations it may not be possible to allocate the last
> > gigabyte of space. raid1 will allocate 1GB chunks from 2 disks at a
> > time while raid5 will allocate 1GB chunks from N disks at a time, and if
> > N is an odd number there could be one chunk left over in the array that
> > is unusable. Most users will find this irrelevant because a large disk
> > array that is filled to the last GB will become quite slow due to long
> > free space search and seek times--you really want to keep usage below 95%,
> > maybe 98% at most, and that means the last GB will never be needed.
> >
> > Reading raid5 metadata could theoretically be faster than raid1, but that
> > depends on a lot of variables, so you can't assume it as a rule of thumb.
> >
> > Raid6 metadata is more interesting because it's the only currently
> > supported way to get 2-disk failure tolerance in btrfs. Unfortunately
> > that benefit is rather limited due to the write hole bug.
> >
> > There are patches floating around that implement multi-disk raid1 (i.e. 3
> > or 4 mirror copies instead of just 2). This would be much better for
> > metadata than raid6--more flexible, more robust, and my guess is that
> > it will be faster as well (no need for RMW updates or journal seeks).
> >




Re: List of known BTRFS Raid 5/6 Bugs?

2018-08-13 Thread Menion
Hi
I have a BTRFS RAID5 array built on 5x8TB HDDs, filled with, well :),
there are contradicting opinions from the, well, "several" ways to check
the used space on a BTRFS RAID5 array, but I should be at around 8TB of
data.
This array is running on kernel 4.17.3 and it has definitely experienced
power loss while data was being written.
I can say that it went through at least a dozen unclean shutdowns.
So following this thread I started my first scrub on the array, and
this is the outcome (after having resumed it 4 times, two after a
power loss...):

menion@Menionubuntu:~$ sudo btrfs scrub status /media/storage/das1/
scrub status for 931d40c6-7cd7-46f3-a4bf-61f3a53844bc
scrub resumed at Sun Aug 12 18:43:31 2018 and finished after 55:06:35
total bytes scrubbed: 2.59TiB with 0 errors

So, there are 0 errors, but I don't understand why it says 2.59TiB of
scrubbed data. Is it possible that this value is also crap, like the
non-zero counters for RAID5 arrays?
On Sat, Aug 11, 2018 at 17:29, Zygo Blaxell wrote:
>
> On Sat, Aug 11, 2018 at 08:27:04AM +0200, erentheti...@mail.de wrote:
> > I guess that covers most topics, two last questions:
> >
> > Will the write hole behave differently on Raid 6 compared to Raid 5 ?
>
> Not really.  It changes the probability distribution (you get an extra
> chance to recover using a parity block in some cases), but there are
> still cases where data gets lost that didn't need to be.
>
> > Is there any benefit of running Raid 5 Metadata compared to Raid 1 ?
>
> There may be benefits of raid5 metadata, but they are small compared to
> the risks.
>
> In some configurations it may not be possible to allocate the last
> gigabyte of space.  raid1 will allocate 1GB chunks from 2 disks at a
> time while raid5 will allocate 1GB chunks from N disks at a time, and if
> N is an odd number there could be one chunk left over in the array that
> is unusable.  Most users will find this irrelevant because a large disk
> array that is filled to the last GB will become quite slow due to long
> free space search and seek times--you really want to keep usage below 95%,
> maybe 98% at most, and that means the last GB will never be needed.
>
> Reading raid5 metadata could theoretically be faster than raid1, but that
> depends on a lot of variables, so you can't assume it as a rule of thumb.
>
> Raid6 metadata is more interesting because it's the only currently
> supported way to get 2-disk failure tolerance in btrfs.  Unfortunately
> that benefit is rather limited due to the write hole bug.
>
> There are patches floating around that implement multi-disk raid1 (i.e. 3
> or 4 mirror copies instead of just 2).  This would be much better for
> metadata than raid6--more flexible, more robust, and my guess is that
> it will be faster as well (no need for RMW updates or journal seeks).
>


Re: List of known BTRFS Raid 5/6 Bugs?

2018-08-11 Thread Zygo Blaxell
On Sat, Aug 11, 2018 at 08:27:04AM +0200, erentheti...@mail.de wrote:
> I guess that covers most topics, two last questions:
> 
> Will the write hole behave differently on Raid 6 compared to Raid 5 ?

Not really.  It changes the probability distribution (you get an extra
chance to recover using a parity block in some cases), but there are
still cases where data gets lost that didn't need to be.

> Is there any benefit of running Raid 5 Metadata compared to Raid 1 ? 

There may be benefits of raid5 metadata, but they are small compared to
the risks.

In some configurations it may not be possible to allocate the last
gigabyte of space.  raid1 will allocate 1GB chunks from 2 disks at a
time while raid5 will allocate 1GB chunks from N disks at a time, and if
N is an odd number there could be one chunk left over in the array that
is unusable.  Most users will find this irrelevant because a large disk
array that is filled to the last GB will become quite slow due to long
free space search and seek times--you really want to keep usage below 95%,
maybe 98% at most, and that means the last GB will never be needed.
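
For illustration, a hedged way to see this per-device allocation picture on an
existing array (the mount point is just the example from earlier in the thread):

  # Per-profile chunk allocation plus per-device "Unallocated" figures;
  # a stray ~1GiB remainder that raid5 cannot use would show up here
  sudo btrfs filesystem usage /media/storage/das1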

Reading raid5 metadata could theoretically be faster than raid1, but that
depends on a lot of variables, so you can't assume it as a rule of thumb.

Raid6 metadata is more interesting because it's the only currently
supported way to get 2-disk failure tolerance in btrfs.  Unfortunately
that benefit is rather limited due to the write hole bug.

There are patches floating around that implement multi-disk raid1 (i.e. 3
or 4 mirror copies instead of just 2).  This would be much better for
metadata than raid6--more flexible, more robust, and my guess is that
it will be faster as well (no need for RMW updates or journal seeks).





Re: List of known BTRFS Raid 5/6 Bugs?

2018-08-10 Thread erenthetitan
I guess that covers most topics, two last questions:

Will the write hole behave differently on Raid 6 compared to Raid 5 ?
Is there any benefit of running Raid 5 Metadata compared to Raid 1 ? 


Re: List of known BTRFS Raid 5/6 Bugs?

2018-08-10 Thread Zygo Blaxell
On Sat, Aug 11, 2018 at 04:18:35AM +0200, erentheti...@mail.de wrote:
> Write hole:
> 
> 
> > The data will be readable until one of the data blocks becomes
> > inaccessible (bad sector or failed disk). This is because it is only the
> > parity block that is corrupted (old data blocks are still not modified
> > due to btrfs CoW), and the parity block is only required when recovering
> > from a disk failure.
> 
> > I am unsure about your meaning. 
> > Assuming you perform an unclean shutdown (e.g. a crash), and after restart
> > perform a scrub, with no additional error (bad sector, bit-rot) before
> > or after the crash:
> > will you lose data? 

No, the parity blocks will be ignored and RAID5 will act like slow RAID0
if no other errors occur.

> Will you be able to mount the filesystem like normal? 

Yes.

> > Additionally, will the crash create additional errors like bad
> > sectors and/or bit-rot aside from the parity-block corruption?

No, only parity-block corruptions should occur.

> > It's actually part of my first mail, where the btrfs Raid5/6 page
> > assumes no data damage while the spinics comment implies the opposite.

The above assumes no drive failures or data corruption; however, if this
were the case, you could use RAID0 instead of RAID5.

The only reason to use RAID5 is to handle cases where at least one block
(or an entire disk) fails, so the behavior of RAID5 when all disks are
working is almost irrelevant.

A drive failure could occur at any time, so even if you mount successfully,
if a disk fails immediately after, any stripes affected by write hole will
be unrecoverably corrupted.

> > The write hole does not seem as dangerous if you could simply scrub
> > to repair damage (on smaller disks, that is, where scrub doesn't take
> > enough time for additional errors to occur)

Scrub can repair parity damage on normal data and metadata--it recomputes
parity from data if the data passes a CRC check.

No repair is possible for data in nodatasum files--the parity can be
recomputed, but there is no way to determine if the result is correct.

Metadata is always checksummed and transid verified; alas, there isn't
an easy way to get btrfs to perform an urgent scrub on metadata only.
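
One hedged way to at least find which files fall into the unprotected nodatasum
class: files created under chattr +C carry the No_COW attribute, which also
disables data checksums (the path is only an example, and files covered by a
nodatacow mount option won't show up this way):

  # List files whose attributes include 'C' (No_COW, hence no data csums);
  # scrub can neither verify nor repair the contents of these files
  lsattr -R /media/storage/das1 2>/dev/null | awk 'NF==2 && $1 ~ /C/ {print $2}'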

> > Put another way: if all disks are online then RAID5/6 behaves like a slow
> > RAID0, and RAID0 does not have the partial stripe update problem because
> > all of the data blocks in RAID0 are independent. It is only when a disk
> > fails in RAID5/6 that the parity block is combined with data blocks, so
> > it is only in this case that the write hole bug can result in lost data.
> 
> So data will not be lost if no drive has failed?

Correct, but the array will have reduced failure tolerance, and RAID5
only matters when a drive has failed.  It is effectively operating in
degraded mode on parts of the array affected by write hole, and no single
disk failure can be tolerated there.

It is possible to recover the parity by performing an immediate scrub
after reboot, but this cannot be as effective as a proper RAID5 update
journal which avoids making the parity bad in the first place.
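
A minimal sketch of that immediate-scrub step, assuming the example mount point
from earlier (-B keeps it in the foreground, -d prints per-device statistics):

  # Recompute/repair parity right after an unclean shutdown,
  # before any disk gets a chance to fail
  sudo btrfs scrub start -Bd /media/storage/das1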

> > > > If the filesystem is -draid5 -mraid1 then the metadata is not vulnerable
> > > > to the write hole, but data is. In this configuration you can determine
> > > > with high confidence which files you need to restore from backup, and
> > > > the filesystem will remain writable to replace the restored data, 
> > > > because
> > > > raid1 does not have the write hole bug.
> 
> In regard to my earlier questions, what would change if I do -draid5 -mraid1?

Metadata would be using RAID1 which is not subject to the RAID5 write
hole issue.  It is much more tolerant of unclean shutdowns especially
in degraded mode.

Data in RAID5 may be damaged when the array is in degraded mode and
a write hole occurs (in either order as long as both occur).  Due to
RAID1 metadata, the filesystem will continue to operate properly,
allowing the damaged data to be overwritten or deleted.
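
For reference, a hedged sketch of getting into that configuration (device names
and mount point are placeholders only):

  # New filesystem with raid5 data and raid1 metadata
  mkfs.btrfs -d raid5 -m raid1 /dev/sdb /dev/sdc /dev/sdd
  # Or convert an existing array in place
  sudo btrfs balance start -dconvert=raid5 -mconvert=raid1 /media/storage/das1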

> Lost Writes:
> 
> > Hotplugging causes an effect (lost writes) which can behave similarly
> > to the write hole bug in some instances. The similarity ends there.
> 
> Are we speaking about the same problem that is causing transid mismatch? 

Transid mismatch is usually caused by lost writes, by any mechanism
that prevents a write from being completed after the disk reports that
it was completed.
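
When that happens, the damage is reported in the kernel log; a hedged way to
look for it:

  # Lost-write damage typically shows up as "parent transid verify failed"
  sudo dmesg | grep -i 'parent transid verify failed'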

Drives may report that data is "in stable storage", i.e. the drive
believes it can complete the write in the future even if power is lost
now because the drive or controller has capacitors or NVRAM or similar.
If the drive is reset by the SATA host because of a cable disconnect
event, the drive may forget that it has promised to do writes in the
future.  Drives may simply lie, and claim that data has been written to
disk when the data is actually in volatile RAM and will disappear in a
power failure.
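
One blunt, hedged mitigation for drives suspected of that last behaviour is to
turn off their volatile write cache entirely (the device name is a placeholder,
and this costs write performance):

  # Query, then disable, the volatile write cache on a SATA drive
  sudo hdparm -W /dev/sdb
  sudo hdparm -W 0 /dev/sdb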

btrfs uses a transaction mechanism and CoW metadata to handle lost writes
within an interrupted transaction. 

Re: List of known BTRFS Raid 5/6 Bugs?

2018-08-10 Thread erenthetitan
Write hole:


> The data will be readable until one of the data blocks becomes
> inaccessible (bad sector or failed disk). This is because it is only the
> parity block that is corrupted (old data blocks are still not modified
> due to btrfs CoW), and the parity block is only required when recovering
> from a disk failure.

I am unsure about your meaning. 
Assuming you perform an unclean shutdown (e.g. a crash), and after restart perform 
a scrub, with no additional error (bad sector, bit-rot) before or after the 
crash:
will you lose data? Will you be able to mount the filesystem like normal? 
Additionally, will the crash create additional errors like bad sectors and/or 
bit-rot aside from the parity-block corruption?
It's actually part of my first mail, where the btrfs Raid5/6 page assumes no 
data damage while the spinics comment implies the opposite.
The write hole does not seem as dangerous if you could simply scrub to repair 
the damage (on smaller disks, that is, where scrub doesn't take enough time for 
additional errors to occur).

> Put another way: if all disks are online then RAID5/6 behaves like a slow
> RAID0, and RAID0 does not have the partial stripe update problem because
> all of the data blocks in RAID0 are independent. It is only when a disk
> fails in RAID5/6 that the parity block is combined with data blocks, so
> it is only in this case that the write hole bug can result in lost data.

So data will not be lost if no drive has failed?

> > > If the filesystem is -draid5 -mraid1 then the metadata is not vulnerable
> > > to the write hole, but data is. In this configuration you can determine
> > > with high confidence which files you need to restore from backup, and
> > > the filesystem will remain writable to replace the restored data, because
> > > raid1 does not have the write hole bug.

In regard to my earlier questions, what would change if I do -draid5 -mraid1?


Lost Writes:


> Hotplugging causes an effect (lost writes) which can behave similarly
> to the write hole bug in some instances. The similarity ends there.

Are we speaking about the same problem that is causing transid mismatch? 

> They are really two distinct categories of problem. Temporary connection
> loss can do bad things to all RAID profiles on btrfs (not just RAID5/6)
> and the btrfs requirements for handling connection loss and write holes
> are very different.

What kind of bad things? Will scrub (on raid1/10 and raid5/6) detect and repair it?

> > > Hot-unplugging a device can cause many lost write events at once, and
> > > each lost write event is very bad.

> Transid mismatch is btrfs detecting data
> that was previously silently corrupted by some component outside of btrfs.
> 
> btrfs can't prevent disks from silently corrupting data. It can only
> try to detect and repair the damage after the damage has occurred.

Aside from the chance that all copies of data are corrupted, is there any way 
scrubbing could fail?

> Normally RAID1/5/6/10 or DUP profiles are used for btrfs metadata, so any
> transid mismatches can be recovered by reading up-to-date data from the
> other mirror copy of the metadata, or by reconstructing the data with
> parity blocks in the RAID 5/6 case. It is only after this recovery
> mechanism fails (i.e. too many disks have a failure or corruption at
> the same time on the same sectors) that the filesystem is ended.

Does this mean that transid mismatch is harmless unless both copies are hit at 
once (and, in the case of Raid 6, all three)?


Old hardware:


> > > It's fun and/or scary to put known good and bad hardware in the same
> > > RAID1 array and watch btrfs autocorrecting the bad data after every
> > > other power failure; however, the bad hardware is clearly not sufficient
> > > to implement any sort of reliable data persistence, and arrays with bad
> > > hardware in them will eventually fail.

I am having a hard time wrapping my head around this statement.
If Btrfs can repair corrupted data and Raid 6 allows two disk failures at once 
without data loss, isn't using old disks, even with a high average error count, 
still pretty much safe?
You would simply have to repeat the scrubbing process more often to make sure 
that not enough data is corrupted to break redundancy.

> > > I have one test case where I write millions of errors into a raid5/6 and
> > > the filesystem recovers every single one transparently while verifying
> > > SHA1 hashes of test data. After years of rebuilding busted ext3 on
> > > mdadm-raid5 filesystems, watching btrfs do it all automatically is
> > > just...beautiful.

Once again, if Btrfs is THIS good at repairing data, then are old hardware, 
hotplugging, and maybe even (depending on whether I understood your point) the 
write hole really that dangerous? Are there bugs that could destroy the data or 
filesystem without corrupting all copies of the data (or all copies at once)? 
Assuming Raid 6, corrupted data would not break redundancy, and repeated 
scrubbing would fix any upcoming issue.
--

Re: List of known BTRFS Raid 5/6 Bugs?

2018-08-10 Thread Zygo Blaxell
On Fri, Aug 10, 2018 at 06:55:58PM +0200, erentheti...@mail.de wrote:
> Did I get you right?
> Please correct me if I am wrong:
> 
> Scrubbing seems to have been fixed, you only have to run it once.

Yes.

There is one minor bug remaining here:  when scrub detects an error
on any disk in a raid5/6 array, the error counts are garbage (random
numbers on all the disks).  You will need to inspect btrfs dev stats
or the kernel log messages to learn which disks are injecting errors.

This does not impair the scrubbing function, only the detailed statistics
report (scrub status -d).

If there are no errors, scrub correctly reports 0 for all error counts.
Only raid5/6 is affected this way--other RAID profiles produce correct
scrub statistics.
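
Concretely, something along these lines (the mount point is a placeholder):

  # Per-device lifetime error counters maintained by btrfs itself
  sudo btrfs device stats /media/storage/das1
  # Plus the kernel log, where btrfs reports which device produced each error
  sudo dmesg | grep -i btrfs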

> Hotplugging (temporary connection loss) is affected by the write hole
> bug, and will create undetectable errors every 16 TB (crc32 limitation).

Hotplugging causes an effect (lost writes) which can behave similarly
to the write hole bug in some instances.  The similarity ends there.

They are really two distinct categories of problem.  Temporary connection
loss can do bad things to all RAID profiles on btrfs (not just RAID5/6)
and the btrfs requirements for handling connection loss and write holes
are very different.

> The write hole bug can affect both old and new data. 

Normally, only old data can be affected by the write hole bug.

The "new" data is not committed before the power failure (otherwise we
would call it "old" data), so any corrupted new data will be inaccessible
as a result of the power failure.  The filesystem will roll back to the
last complete committed data tree (discarding all new and modified data
blocks), then replay the fsync log (which repeats and completes some
writes that occurred since the last commit).  This process eliminates
new data from the filesystem whether the new data was corrupted by the
write hole or not.

Only corruptions that affect old data will remain, because old data is
not overwritten by data saved in the fsync log, and old data is not part
of the incomplete data tree that is rolled back after power failure.

Exception:  new data in nodatasum files can also be corrupted, but since
nodatasum disables all data integrity or recovery features it's hard to
define what "corrupted" means for a nodatasum file.

> Reason: BTRFS saves data in fixed size stripes, if the write operation
> fails midway, the stripe is lost.
> This does not matter much for Raid 1/10, data always uses a full stripe,
> and stripes are copied on write. Only new data could be lost.

This is incorrect.  Btrfs saves data in variable-sized extents (between
1 and 32768 4K data blocks) and btrfs has no concept of stripes outside of
its raid layer.  Stripes are never copied.

In RAID 1/10/DUP all data blocks are fully independent of each other,
i.e. writing to any block on these RAID profiles does not corrupt data in
any other block.  As a result these RAID profiles do not allow old data
to be corrupted by partially completed writes of new data.

There is striping in some profiles, but it is only used for performance
in these cases, and has no effect on data recovery.

> However, for some reason Raid 5/6 works with partial stripes, meaning
> that data is stored in stripes not completely filled by prior data,

In RAID 5/6 each data block is related to all other data blocks in the
same stripe with the parity block(s).  If any individual data block in the
stripe is updated, the parity block(s) must also be updated atomically,
or the wrong data will be reconstructed during RAID5/6 recovery.

Because btrfs does nothing to prevent it, some writes will occur
to RAID5/6 stripes that are already partially occupied by old data.
btrfs also does nothing to ensure that parity block updates are atomic,
so btrfs has the write hole bug as a result.
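
A toy illustration of why the non-atomic parity update matters, in plain shell
arithmetic rather than anything btrfs-specific: with stale parity plus one
rewritten data block, an untouched block can no longer be reconstructed.

  # raid5 parity is the XOR of the data blocks in a stripe
  d1=$((0xA5)); d2=$((0x3C)); d3=$((0x5A))
  p=$(( d1 ^ d2 ^ d3 ))        # parity as originally written
  d1=$((0xFF))                 # d1 rewritten, but the parity update was lost
  # Reconstruct d2 after "losing" its disk: stale parity gives the wrong answer
  printf 'reconstructed d2 = 0x%02X (expected 0x3C)\n' $(( p ^ d1 ^ d3 ))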

> and stripes are removed on write.

Stripes are never removed...?  A stripe is just a group of disk blocks
divided on 64K boundaries, same as mdadm and many hardware RAID5/6
implementations.

> Result: If the operation fails midway, the stripe is lost as is all
> data previously stored in it.

You can only lose as many data blocks in each stripe as there are parity
disks (i.e. raid5 can lose 0 or 1 block, while raid6 can lose 0, 1, or 2
blocks); however, multiple writes can be lost affecting multiple stripes
in a single power loss event.  Losing even 1 block is often too much.  ;)

The data will be readable until one of the data blocks becomes
inaccessible (bad sector or failed disk).  This is because it is only the
parity block that is corrupted (old data blocks are still not modified
due to btrfs CoW), and the parity block is only required when recovering
from a disk failure.

Put another way:  if all disks are online then RAID5/6 behaves like a slow
RAID0, and RAID0 does not have the partial stripe update problem because
all of the data blocks in RAID0 are independent.  It is only when a disk
fails in RAID5/6 that the parity block is combined with data blocks, so
it is only in this case that the write hole bug can result in lost data.

Re: List of known BTRFS Raid 5/6 Bugs?

2018-08-10 Thread erenthetitan
Did I get you right?
Please correct me if I am wrong:

Scrubbing seems to have been fixed, you only have to run it once.

Hotplugging (temporary connection loss) is affected by the write hole bug, and 
will create undetectable errors every 16 TB (crc32 limitation).

The write hole bug can affect both old and new data.
Reason: BTRFS saves data in fixed-size stripes; if the write operation fails 
midway, the stripe is lost.
This does not matter much for Raid 1/10: data always uses a full stripe, and 
stripes are copied on write. Only new data could be lost.
However, for some reason Raid 5/6 works with partial stripes, meaning that data 
is stored in stripes not completely filled by prior data, and stripes are 
removed on write.
Result: If the operation fails midway, the stripe is lost, as is all data 
previously stored in it.

Transid mismatch can silently corrupt data.
Reason: It is a separate metadata failure that is triggered by lost or incomplete 
writes, i.e. writes that are lost somewhere during transmission.
It can happen to all BTRFS configurations and is not triggered by the write 
hole.
It could happen due to a brownout (temporary undersupply of voltage), faulty 
cables, faulty RAM, a faulty disk cache, or faulty disks in general.

Both bugs could damage metadata and trigger the following:
Data will be lost (0 to 100% unreadable), and the filesystem will be readonly.
Reason: BTRFS saves metadata as a tree structure. The closer the error is to 
the root, the more data cannot be read.

Transid Mismatch can happen up to once every 3 months per device,
depending on the drive hardware!

Question: Does this not make transid mismatch way more dangerous than
the write hole? What would happen to other filesystems, like ext4?

On 10-Aug-2018 09:12:21 +0200, ce3g8...@umail.furryterror.org wrote: 
> > On Fri, Aug 10, 2018 at 03:40:23AM +0200, erentheti...@mail.de wrote:
> > > I am searching for more information regarding possible bugs related to
> > > BTRFS Raid 5/6. All sites i could find are incomplete and information
> > > contradicts itself:
> > >
> > > The Wiki Raid 5/6 Page (https://btrfs.wiki.kernel.org/index.php/RAID56)
> > > warns of the write hole bug, stating that your data remains safe
> > > (except data written during power loss, obviously) upon unclean shutdown
> > > unless your data gets corrupted by further issues like bit-rot, drive
> > > failure etc.
> > 
> > The raid5/6 write hole bug exists on btrfs (as of 4.14-4.16) and there are
> > no mitigations to prevent or avoid it in mainline kernels.
> > 
> > The write hole results from allowing a mixture of old (committed) and
> > new (uncommitted) writes to the same RAID5/6 stripe (i.e. a group of
> > blocks consisting of one related data or parity block from each disk
> > in the array, such that writes to any of the data blocks affect the
> > correctness of the parity block and vice versa). If the writes were
> > not completed and one or more of the data blocks are not online, the
> > data blocks reconstructed by the raid5/6 algorithm will be corrupt.
> > 
> > If all disks are online, the write hole does not immediately
> > damage user-visible data as the old data blocks can still be read
> > directly; however, should a drive failure occur later, old data may
> > not be recoverable because the parity block will not be correct for
> > reconstructing the missing data block. A scrub can fix write hole
> > errors if all disks are online, and a scrub should be performed after
> > any unclean shutdown to recompute parity data.
> > 
> > The write hole always puts both old and new data at risk of damage;
> > however, due to btrfs's copy-on-write behavior, only the old damaged
> > data can be observed after power loss. The damaged new data will have
> > no references to it written to the disk due to the power failure, so
> > there is no way to observe the new damaged data using the filesystem.
> > Not every interrupted write causes damage to old data, but some will.
> > 
> > Two possible mitigations for the write hole are:
> > 
> > - modify the btrfs allocator to prevent writes to partially filled
> > raid5/6 stripes (similar to what the ssd mount option does, except
> > with the correct parameters to match RAID5/6 stripe boundaries),
> > and advise users to run btrfs balance much more often to reclaim
> > free space in partially occupied raid stripes
> > 
> > - add a stripe write journal to the raid5/6 layer (either in
> > btrfs itself, or in a lower RAID5 layer).
> > 
> > There are assorted other ideas (e.g. copy the RAID-Z approach from zfs
> > to btrfs or dramatically increase the btrfs block size) that also solve
> > the write hole problem but are somewhat more invasive and less practical
> > for btrfs.
> > 
> > Note that the write hole also affects btrfs on top of other similar
> > raid5/6 implementations (e.g. mdadm raid5 without stripe journal).
> > The btrfs CoW layer does not understand how to allocate data to avoid RMW
> > raid5 stripe updates without corrupting existing committed data, and this
> > limitation applies to every combination of unjournalled raid5/6 and btrfs.

Re: List of known BTRFS Raid 5/6 Bugs?

2018-08-10 Thread Zygo Blaxell
On Fri, Aug 10, 2018 at 03:40:23AM +0200, erentheti...@mail.de wrote:
> I am searching for more information regarding possible bugs related to
> BTRFS Raid 5/6. All sites i could find are incomplete and information
> contradicts itself:
>
> The Wiki Raid 5/6 Page (https://btrfs.wiki.kernel.org/index.php/RAID56)
> warns of the write hole bug, stating that your data remains safe
> (except data written during power loss, obviously) upon unclean shutdown
> unless your data gets corrupted by further issues like bit-rot, drive
> failure etc.

The raid5/6 write hole bug exists on btrfs (as of 4.14-4.16) and there are
no mitigations to prevent or avoid it in mainline kernels.

The write hole results from allowing a mixture of old (committed) and
new (uncommitted) writes to the same RAID5/6 stripe (i.e. a group of
blocks consisting of one related data or parity block from each disk
in the array, such that writes to any of the data blocks affect the
correctness of the parity block and vice versa).  If the writes were
not completed and one or more of the data blocks are not online, the
data blocks reconstructed by the raid5/6 algorithm will be corrupt.

If all disks are online, the write hole does not immediately
damage user-visible data as the old data blocks can still be read
directly; however, should a drive failure occur later, old data may
not be recoverable because the parity block will not be correct for
reconstructing the missing data block.  A scrub can fix write hole
errors if all disks are online, and a scrub should be performed after
any unclean shutdown to recompute parity data.

The write hole always puts both old and new data at risk of damage;
however, due to btrfs's copy-on-write behavior, only the old damaged
data can be observed after power loss.  The damaged new data will have
no references to it written to the disk due to the power failure, so
there is no way to observe the new damaged data using the filesystem.
Not every interrupted write causes damage to old data, but some will.

Two possible mitigations for the write hole are:

- modify the btrfs allocator to prevent writes to partially filled
raid5/6 stripes (similar to what the ssd mount option does, except
with the correct parameters to match RAID5/6 stripe boundaries),
and advise users to run btrfs balance much more often to reclaim
free space in partially occupied raid stripes

- add a stripe write journal to the raid5/6 layer (either in
btrfs itself, or in a lower RAID5 layer).
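
A hedged sketch of the second option implemented below btrfs rather than inside
it: an md raid5 array with a write journal, and a single-device btrfs on top
(all device names are placeholders):

  # md raid5 with a journal device closes the write hole at the md layer
  sudo mdadm --create /dev/md0 --level=5 --raid-devices=3 \
      --write-journal /dev/nvme0n1p1 /dev/sdb /dev/sdc /dev/sdd
  # btrfs then sees a single device; DUP metadata keeps metadata self-healing
  sudo mkfs.btrfs -d single -m dup /dev/md0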

There are assorted other ideas (e.g. copy the RAID-Z approach from zfs
to btrfs or dramatically increase the btrfs block size) that also solve
the write hole problem but are somewhat more invasive and less practical
for btrfs.

Note that the write hole also affects btrfs on top of other similar
raid5/6 implementations (e.g. mdadm raid5 without stripe journal).
The btrfs CoW layer does not understand how to allocate data to avoid RMW
raid5 stripe updates without corrupting existing committed data, and this
limitation applies to every combination of unjournalled raid5/6 and btrfs.

> The Wiki Gotchas Page (https://btrfs.wiki.kernel.org/index.php/Gotchas)
> warns of possible incorrigible "transid" mismatch, not stating which
> versions are affected or what transid mismatch means for your data. It
> does not mention the write hole at all.

Neither raid5 nor write hole are required to produce a transid mismatch
failure.  transid mismatch usually occurs due to a lost write.  Write hole
is a specific case of lost write, but write hole does not usually produce
transid failures (it produces header or csum failures instead).

During real disk failure events, multiple distinct failure modes can
occur concurrently.  i.e. both transid failure and write hole can occur
at different places in the same filesystem as a result of attempting to
use a failing disk over a long period of time.

A transid verify failure is metadata damage.  It will make the filesystem
readonly and make some data inaccessible as described below.

> This Mail Archive post
> (https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg55161.html)
> states that scrubbing BTRFS Raid 5/6 will always repair Data Corruption,
> but may corrupt your Metadata while trying to do so - meaning you have
> to scrub twice in a row to ensure data integrity.

Simple corruption (without write hole errors) is fixed by scrubbing
as of the last...at least six months?  Kernel v4.14.xx and later can
definitely do it these days.  Both data and metadata.

If the metadata is damaged in any way (corruption, write hole, or transid
verify failure) on btrfs and btrfs cannot use the raid profile for
metadata to recover the damaged data, the filesystem is usually forever
readonly, and anywhere from 0 to 100% of the filesystem may be readable
depending on where in the metadata tree structure the error occurs (the
closer to the root the error is, the more of the filesystem becomes unreadable).