Re: Convert from RAID 5 to 10
Thanks a lot to all who replied! I learned a lot from this thread. However, what I learned has made me even more doubtful that a btrfs RAID is the right choice for me at this moment. There seems to be much uncertainty about the real state (experimental, stable, production-ready, mature, ...) of btrfs' RAID implementation, even on this very well informed list. I really want the checksumming and auto-repair features of btrfs; that was the original reason why I didn't go with dmraid in the first place. So there are basically two options left: btrfs with a raid 10, or ZFS with some raid 10 or raid 5 equivalent. ZFS seems to be a nice, mature solution, but I prefer to use something native to Linux.

Best Regards,
Florian

Am 29.11.2016 um 18:20 schrieb Florian Lindner:
> Hello,
>
> I have 4 harddisks with 3TB capacity each. They are all used in a btrfs RAID 5. It has come to my attention that there seem to be major flaws in btrfs' RAID 5 implementation. Because of that, I want to convert the raid 5 to a raid 10, and I have several questions.
>
> * Is that possible as an online conversion?
>
> * Since my effective capacity will shrink during conversion, does btrfs check whether there is enough free capacity to convert? As you see below, right now it's probably too full, but I'm going to delete some stuff.
>
> * I understand the command to convert is
>
>   btrfs balance start -dconvert=raid10 -mconvert=raid10 /mnt
>
>   Correct?
>
> * Which disks are allowed to fail? My understanding of a raid 10 is like this:
>
>   disks = {a, b, c, d}
>
>   raid0( raid1(a, b), raid1(c, d) )
>
>   This way (a XOR b) AND (c XOR d) are allowed to fail without the raid failing (either a or b, and c or d, may fail).
>
>   How is that with a btrfs raid 10?
>
> * Any other advice? ;-)
>
> Thanks a lot,
>
> Florian
>
> Some information about my filesystem:
>
> # btrfs filesystem show /
> Label: 'data'  uuid: 57e5b9e9-01ae-4f9e-8a3d-9f42204d7005
>         Total devices 4 FS bytes used 7.57TiB
>         devid 1 size 2.72TiB used 2.72TiB path /dev/sda4
>         devid 2 size 2.72TiB used 2.72TiB path /dev/sdb4
>         devid 3 size 2.72TiB used 2.72TiB path /dev/sdc4
>         devid 4 size 2.72TiB used 2.72TiB path /dev/sdd4
>
> # btrfs filesystem df /
> Data, RAID5: total=8.14TiB, used=7.56TiB
> System, RAID5: total=96.00MiB, used=592.00KiB
> Metadata, RAID5: total=12.84GiB, used=11.06GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> # df -h
> Filesystem  Size  Used  Avail  Use%  Mounted on
> /dev/sda4    11T  7.6T   597G   93%  /
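For reference, a minimal sketch of the conversion sequence asked about in the quoted mail might look like the following, assuming the filesystem is mounted at /mnt as in the question. As far as I know btrfs does not pre-compute the space requirement; a conversion balance simply aborts with ENOSPC if it cannot allocate new chunks, so free up space first.

# free up space first, then check how much is unallocated on each device
btrfs filesystem usage /mnt

# convert data and metadata chunks to raid10 (online, filesystem stays mounted)
btrfs balance start -dconvert=raid10 -mconvert=raid10 /mnt

# depending on the btrfs-progs version, the System chunks may need an explicit conversion:
# btrfs balance start -sconvert=raid10 -f /mnt

# watch progress from another shell and verify the result
btrfs balance status /mnt
btrfs filesystem df /mnt    # all lines except GlobalReserve should now say RAID10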
Re: Convert from RAID 5 to 10
FYI, there is an old saying in embedded circles, evolved from Arthur C. Clarke's "Any sufficiently advanced technology is indistinguishable from magic." The engineering version states: "Any sufficiently advanced incompetence is indistinguishable from malice."

Also, I'll quote you on the throwing-under-the-bus thing :) (I actually like that justification)

On 1 December 2016 at 17:28, Chris Murphy wrote:
> On Wed, Nov 30, 2016 at 1:29 PM, Tomasz Kusmierz wrote:
>
>> Please, I beg you, add another column to the man page and wiki stating
>> clearly how many device losses every profile can withstand. I frequently
>> have to explain how btrfs profiles work and show quotes from this
>> mailing list because "Dunning-Kruger effect victims" keep popping up
>> with statements like "in btrfs raid10 with 8 drives you can lose 4
>> drives" ... I seriously beg you guys, my beating stick is half broken
>> by now.
>
> You need a new stick. It's called the ad hominem attack. When stupid
> people say stupid things, the dispute is not about the facts or
> opinions in the argument itself, but rather the person involved. There
> is the possibility this is more than stupidity; it really borders on
> maliciousness. Any ethical code of conduct for a list will accept ad
> hominem attacks over the willful dissemination of provably wrong
> information. When stupid assholes throw users under the bus with
> provably wrong (and bad) advice, it becomes something of an obligation
> to resort to name calling.
>
> Of course, I'd also like the wiki to clearly state that the only profile
> that tolerates more than one device loss is raid6, and to be very
> explicit about the manifestly wrong terminology used for Btrfs's raid10.
> That is a fairly egregious violation of common terminology and of the
> trust we're supposed to be developing, both in the usage of common terms
> and in Btrfs specifically.
>
> --
> Chris Murphy
Re: Convert from RAID 5 to 10
On Wed, Nov 30, 2016 at 1:29 PM, Tomasz Kusmierz wrote:
> Please, I beg you, add another column to the man page and wiki stating
> clearly how many device losses every profile can withstand. I frequently
> have to explain how btrfs profiles work and show quotes from this
> mailing list because "Dunning-Kruger effect victims" keep popping up
> with statements like "in btrfs raid10 with 8 drives you can lose 4
> drives" ... I seriously beg you guys, my beating stick is half broken
> by now.

You need a new stick. It's called the ad hominem attack. When stupid people say stupid things, the dispute is not about the facts or opinions in the argument itself, but rather the person involved. There is the possibility this is more than stupidity; it really borders on maliciousness. Any ethical code of conduct for a list will accept ad hominem attacks over the willful dissemination of provably wrong information. When stupid assholes throw users under the bus with provably wrong (and bad) advice, it becomes something of an obligation to resort to name calling.

Of course, I'd also like the wiki to clearly state that the only profile that tolerates more than one device loss is raid6, and to be very explicit about the manifestly wrong terminology used for Btrfs's raid10. That is a fairly egregious violation of common terminology and of the trust we're supposed to be developing, both in the usage of common terms and in Btrfs specifically.

--
Chris Murphy
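For quick reference, and not taken from the wiki (this is just a restatement of what is said in this thread and in the mkfs.btrfs profile table), the guaranteed number of device losses each profile can survive is:

profile   guaranteed device losses survivable
single    0
DUP       0  (two copies, but on the same device)
raid0     0
raid1     1
raid10    1  (despite the name; mirroring is per chunk, not per fixed device pair)
raid5     1
raid6     2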
Re: Convert from RAID 5 to 10
On Thursday, 1 December 2016, 10:37:13 CET, Wilson Meier wrote:
> The only thing I have asked for is to document the *known*
> problems/flaws/limitations of all raid profiles and link to them from
> the stability matrix.

+1

Does anyone mind if I ask for an account and start copy-pasting the relevant posts from this thread?

Niccolò Belli
Re: Convert from RAID 5 to 10
On 30/11/16 at 17:48, Austin S. Hemmelgarn wrote:
> On 2016-11-30 10:49, Wilson Meier wrote:
>> On 30/11/16 at 15:37, Austin S. Hemmelgarn wrote:
>>
>> Transferring this to a car analogy, just to make it a bit more funny:
>> the airbag (raid level, whatever) itself is ok, but the microcontroller
>> (general btrfs) which is responsible for inflating the airbag suffers
>> from some problems, sometimes doesn't inflate it, and the manufacturer
>> doesn't mention that fact.
>> From your point of view the airbag is ok. From my point of view -> don't
>> buy that car!!!
>> Don't you think that the fact that the life saver suffers problems should
>> be noted, and that every dependent component should point to that fact?
>> I think it should.
>> I'm not talking about performance issues, I'm talking about data loss.
>> Now the next one can throw in "Backups, always make backups!".
>> Sure, but backup is backup and raid is raid. Both have their own concerns.
> A better analogy for a car would be something along the lines of the
> radio working fine but the general wiring having issues that cause all
> the electronics in the car to stop working under certain circumstances.
> In that case, the radio itself is absolutely OK, but it suffers from
> issues caused directly by poor design elsewhere in the vehicle.

Ahm, no. You cannot swap a safety mechanism (raid) for a comfort feature (compression) and treat them as the same in terms of importance. It makes a serious difference whether you have an airbag that doesn't work properly or you just can't listen to music while you're driving into a wall. Anyway, we should stop this here. I'm not angry or anything like that :) . I would just like to be able to read such information about the storage I put my personal data (> 3 TB) on, in its official wiki.

> There are more places than the wiki to look for info about BTRFS (and
> this is the case with almost any piece of software, not just BTRFS;
> very few things have one central source for everything). I don't mean
> to sound unsympathetic, but given what you're saying, it's sounding
> more and more like you didn't look at anything beyond the wiki and
> should have checked other sources as well.

This is your assumption.

On 01/12/16 at 07:47, Duncan wrote:
> Austin S. Hemmelgarn posted on Wed, 30 Nov 2016 11:48:57 -0500 as excerpted:
>> On 2016-11-30 10:49, Wilson Meier wrote:
>>> Do you also have all the home users in mind who go on vacation (sometimes
>>> 3 weeks) and don't have a 24/7 support team to replace monitored disks
>>> which do report SMART errors?
>> Better than 90% of the people I know either shut down their systems when
>> they're going to be away for a long period of time, or like me have
>> ways to log in remotely and tell the FS to not use that disk anymore.
> https://btrfs.wiki.kernel.org/index.php/Getting_started has
> two warnings offset in red right in the first section:
> * If you have btrfs filesystems, run the latest kernel.

I do. Ok, not the very latest, but I'm always on the latest major version. Right now I have 4.8.4 and the very latest is 4.8.11.

> * You should keep and test backups of your data, and be prepared to use them.

I have daily backups.

> As to the three weeks vacation thing... And "daily use" != "three
> weeks without physical access to something you're going to actually be
> relying on for parts of those three weeks".

Maybe I have my own mail server and ownCloud to serve files to my family? Maybe I'm out of the country somewhere with no internet access?
I will not comment on this any further as it leads us nowhere. In general I think that this discussion is taking a completely wrong direction. The only thing I have asked for is to document the *known* problems/flaws/limitations of all raid profiles and link to them from the stability matrix.

Regarding raid10: even if one knows that btrfs handles things at the chunk level, one would assume that the code is written in a way that puts the copies on different stripes. Otherwise raid10 ***can*** become pretty useless in terms of data redundancy, and 2 x raid1 with LVM should be considered as a replacement. This is a serious thing and should be documented. If this is documented somewhere then please point me to it, as I cannot find a word about it anywhere.

Cheers,
Wilson
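For anyone who wants to check where the copies of their raid10 chunks actually ended up, the chunk tree can be inspected directly. A sketch (the device path is an example; on older btrfs-progs the same dump is available as btrfs-debug-tree, and some versions want the tree given by id, -t 3, instead of by name):

# dump only the chunk tree and keep the chunk/stripe lines
btrfs inspect-internal dump-tree -t chunk /dev/sda4 | grep -E 'CHUNK_ITEM|stripe '
# every CHUNK_ITEM is followed by its stripe lines; the devid on each stripe line
# shows which disk holds that copy, so you can see how the mirrors are spread out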
Re: Convert from RAID 5 to 10
Austin S. Hemmelgarn posted on Wed, 30 Nov 2016 11:48:57 -0500 as excerpted:

> On 2016-11-30 10:49, Wilson Meier wrote:
>> Do you also have all the home users in mind who go on vacation (sometimes
>> 3 weeks) and don't have a 24/7 support team to replace monitored disks
>> which do report SMART errors?
> Better than 90% of the people I know either shut down their systems when
> they're going to be away for a long period of time, or like me have ways
> to log in remotely and tell the FS to not use that disk anymore.

https://btrfs.wiki.kernel.org/index.php/Getting_started has two warnings offset in red right in the first section:

* If you have btrfs filesystems, run the latest kernel.
* You should keep and test backups of your data, and be prepared to use them.

It also says: "The status of btrfs was experimental for a long time, but the core functionality is considered good enough for daily use. [...] While many people use it reliably, there are still problems being found."

Were I editing it, that or something very similar would be on the main landing page, and a general status announcement would be on the feature and profile status page. However, it IS on the wiki.

As to the three weeks vacation thing... "Daily use" != "three weeks without physical access to something you're going to actually be relying on for parts of those three weeks". And "keep and test backups [and] be prepared to use them" != "go away for three weeks and leave yourself unable to restore from those backups, for something you're relying on over those three weeks", either.

As Austin says, many home users actually shut down their systems when they're going to be away, because they are /not/ going to be using them in that period, and *certainly* *don't* actually /rely/ on them. And most of those that /do/ actually rely on them have learned, or will learn, possibly the hard way, that "things happen", and they need either someone who can be called to poke the systems if necessary, or alternative plans in case what they can't access at the moment fails.

Meanwhile, arguably those who /are/ relying on their filesystems to be up and running for extended periods while they can't actually poke (or have someone else poke) the hardware if necessary, shouldn't be running btrfs as yet in the first place, as it's simply not stable and mature enough for that. And people who really care about it will have done the research to know the stability status. And people who don't... well, by not doing that research they've effectively defined it as not that important in their life; other things have taken priority. So if btrfs fails on them and they didn't know its stability status, it can only be because it wasn't that important to them to know, so no big deal.

(I know for certain that before /I/ switched to btrfs, I scoured both the wiki and the manpages, as well as reading a number of articles on btrfs, and then still posted to this list a number of questions I had remaining after doing all that, and got answers I read as well, before I actually did my switch. That's because it was my data at risk, data I place a high enough value on to want to know the risk at which I was placing it, and the best way to deal with various issues I could anticipate possibly happening, before they actually happened. And I actually did some of my own testing before final deployment as well, satisfying myself that I /could/ reasonably deal with various hardware and software disaster scenarios, before putting any real data at risk, as well.
Of course I don't expect everyone to do all that, but then I don't expect everyone to place the value in their data that I do in mine. Which is fine, as long as they're willing to live with the consequences of the priority they placed on appreciating and dealing appropriately with the risk to their data, based on the value their actions placed on it. If they're willing to risk the data because it's of no particular value to them anyway, well then, no such preliminary research and testing is required. Indeed, it would be stupid, because they surely have more important and higher priority things to deal with.)

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
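Since "keep and test backups, and be prepared to use them" is the recurring advice in this thread, here is a minimal btrfs-native way to do that (the paths and the backup target are hypothetical; any mechanism you actually test a restore from is equally valid):

# take a read-only snapshot and replicate it to a separate backup filesystem
btrfs subvolume snapshot -r /home /home/.snapshots/home-$(date +%F)
btrfs send /home/.snapshots/home-$(date +%F) | btrfs receive /mnt/backup/

# periodically confirm the backup copy is actually readable and its checksums are intact
btrfs scrub start -Bd /mnt/backup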
Re: Convert from RAID 5 to 10
On 30 November 2016 at 19:09, Chris Murphy wrote:
> On Wed, Nov 30, 2016 at 7:37 AM, Austin S. Hemmelgarn wrote:
>
>> The stability info could be improved, but _absolutely none_ of the things
>> mentioned as issues with raid1 are specific to raid1. And in general, in
>> the context of a feature stability matrix, 'OK' generally means that there
>> are no significant issues with that specific feature, and since none of the
>> issues outlined are specific to raid1, it does meet that description of 'OK'.
>
> Maybe the gotchas page needs a one or two liner for each profile's
> gotchas compared to what the profile leads the user into believing.
> The overriding gotcha with all Btrfs multiple device support is the
> lack of monitoring and notification other than kernel messages; and
> the raid10 actually being more like raid0+1 is, I think, certainly a
> gotcha. However, 'man mkfs.btrfs' contains a grid that very clearly
> states raid10 can only safely lose 1 device.
>
>> Looking at this another way, I've been using BTRFS on all my systems since
>> kernel 3.16 (I forget what exact vintage that is in regular years). I've
>> not had any data integrity or data loss issues as a result of BTRFS itself
>> since 3.19, and in just the past year I've had multiple raid1 profile
>> filesystems survive multiple hardware issues with near zero issues (with the
>> caveat that I had to re-balance after replacing devices to convert a few
>> single chunks to raid1), and that includes multiple disk failures and 2 bad
>> PSU's plus about a dozen (not BTRFS related) kernel panics and 4 unexpected
>> power loss events. I also have exhaustive monitoring, so I'm replacing bad
>> hardware early instead of waiting for it to actually fail.
>
> Possibly nothing aids predictably reliable storage stacks more than healthy
> doses of skepticism and awareness of all limitations. :-D
>
> --
> Chris Murphy

Please, I beg you, add another column to the man page and wiki stating clearly how many device losses every profile can withstand. I frequently have to explain how btrfs profiles work and show quotes from this mailing list because "Dunning-Kruger effect victims" keep popping up with statements like "in btrfs raid10 with 8 drives you can lose 4 drives" ... I seriously beg you guys, my beating stick is half broken by now.
Re: Convert from RAID 5 to 10
On Wednesday, 30 November 2016, 12:09:23 CET, Chris Murphy wrote:
> On Wed, Nov 30, 2016 at 7:37 AM, Austin S. Hemmelgarn wrote:
>> The stability info could be improved, but _absolutely none_ of the things
>> mentioned as issues with raid1 are specific to raid1. And in general, in
>> the context of a feature stability matrix, 'OK' generally means that there
>> are no significant issues with that specific feature, and since none of
>> the issues outlined are specific to raid1, it does meet that description
>> of 'OK'.
>
> Maybe the gotchas page needs a one or two liner for each profile's
> gotchas compared to what the profile leads the user into believing.
> The overriding gotcha with all Btrfs multiple device support is the
> lack of monitoring and notification other than kernel messages; and
> the raid10 actually being more like raid0+1 is, I think, certainly a
> gotcha. However, 'man mkfs.btrfs' contains a grid that very clearly
> states raid10 can only safely lose 1 device.

Wow, that manpage is quite a resource. The developers and documentation people have definitely improved the official BTRFS documentation.

Thanks,
--
Martin
Re: Convert from RAID 5 to 10
On Wed, Nov 30, 2016 at 7:37 AM, Austin S. Hemmelgarn wrote:
> The stability info could be improved, but _absolutely none_ of the things
> mentioned as issues with raid1 are specific to raid1. And in general, in
> the context of a feature stability matrix, 'OK' generally means that there
> are no significant issues with that specific feature, and since none of the
> issues outlined are specific to raid1, it does meet that description of 'OK'.

Maybe the gotchas page needs a one or two liner for each profile's gotchas compared to what the profile leads the user into believing. The overriding gotcha with all Btrfs multiple device support is the lack of monitoring and notification other than kernel messages; and the raid10 actually being more like raid0+1 is, I think, certainly a gotcha. However, 'man mkfs.btrfs' contains a grid that very clearly states raid10 can only safely lose 1 device.

> Looking at this another way, I've been using BTRFS on all my systems since
> kernel 3.16 (I forget what exact vintage that is in regular years). I've
> not had any data integrity or data loss issues as a result of BTRFS itself
> since 3.19, and in just the past year I've had multiple raid1 profile
> filesystems survive multiple hardware issues with near zero issues (with the
> caveat that I had to re-balance after replacing devices to convert a few
> single chunks to raid1), and that includes multiple disk failures and 2 bad
> PSU's plus about a dozen (not BTRFS related) kernel panics and 4 unexpected
> power loss events. I also have exhaustive monitoring, so I'm replacing bad
> hardware early instead of waiting for it to actually fail.

Possibly nothing aids predictably reliable storage stacks more than healthy doses of skepticism and awareness of all limitations. :-D

--
Chris Murphy
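On the monitoring gotcha: until proper notification exists, a crude self-built check is easy to put in cron. A sketch only (the mount point, mail recipient and schedule are placeholders); it relies on nothing beyond the existing per-device error counters and scrub:

#!/bin/sh
# hypothetical /etc/cron.daily/btrfs-check: mail root if any error counter is non-zero
MNT=/mnt/data

ERRS=$(btrfs device stats "$MNT" | grep -vE ' 0$')
[ -n "$ERRS" ] && echo "$ERRS" | mail -s "btrfs device errors on $(hostname)" root

# once a week, also scrub and mail the summary:
# btrfs scrub start -Bd "$MNT" 2>&1 | mail -s "btrfs scrub report for $MNT" root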
Re: Convert from RAID 5 to 10
On Wed, Nov 30, 2016 at 7:04 AM, Roman Mamedov wrote:
> On Wed, 30 Nov 2016 07:50:17 -0500
> Also I don't know what is particularly insane about copying a 4-8 GB file
> onto a storage array. I'd expect both disks to write at the same time (like
> they do in pretty much any other RAID1 system), not one-after-another,
> effectively slowing down the entire operation by as much as 2x in extreme
> cases.

I don't experience this behavior. Writes take the same amount of time to a single-profile volume as to a two-device raid1-profile volume. iotop reports 2x the write bandwidth when writing to the raid1 volume, which corresponds to simultaneous writes to both drives in the volume. It's also not an elaborate setup by any means: two laptop drives, each in a cheap USB 3.0 case using bus power only, connected to a USB 3.0 hub, in turn connected to an Intel NUC.

> Comparing to Ext4, that one appears to have the "errors=continue" behavior by
> default, the user has to explicitly request "errors=remount-ro", and I have
> never seen anyone use or recommend the third option of "errors=panic", which
> is basically the equivalent of the current Btrfs practice.

I think in the context of degradedness, it may be appropriate to mount degraded,ro by default rather than fail. But changing the default isn't enough for the root fs use case, because the mount command isn't even issued when udev's btrfs 'dev scan' fails to report back all devices as available. In this case there is a sort of "pre-check" before mounting is even attempted, and that is what fails.

Also, Btrfs has fatal_errors=panic and it's not the default; rather, we just get a mount failure. There really isn't anything quite like this in the mdadm/LVM + other filesystem world, where the array is active degraded and the filesystem mounts anyway; if it doesn't mount, it's because the array isn't active and doesn't even exist yet.

> Unplugging and replugging a SATA cable of a RAID1 member should never put
> your system under the risk of a massive filesystem corruption; you cannot
> say it absolutely doesn't with the current implementation.

I can't say it absolutely doesn't even with md. Of course it shouldn't, but users do report corruptions on all of the other fs lists (ext4, XFS, linux-raid) from time to time that are not the result of user error.

--
Chris Murphy
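For completeness, the manual path through a single-device failure on a btrfs raid1/raid10 volume currently looks roughly like this (a sketch; the device names and devid are made up, and for a degraded root filesystem the 'degraded' option also has to reach the initramfs, e.g. via rootflags=degraded, which per the udev issue above may still not be enough):

# 'degraded' is never assumed; it has to be given explicitly
mount -o degraded /dev/sdb4 /mnt

# replace the failed device; use its devid (from 'btrfs filesystem show') if the disk is gone
btrfs replace start -B 4 /dev/sde4 /mnt

# alternative: add a new device, then drop the missing one
# btrfs device add /dev/sde4 /mnt && btrfs device delete missing /mnt

# re-mirror any chunks that were written with a reduced profile while running degraded
btrfs balance start -dconvert=raid10,soft -mconvert=raid10,soft /mnt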
Re: Convert from RAID 5 to 10
On 2016-11-30 10:49, Wilson Meier wrote: Am 30/11/16 um 15:37 schrieb Austin S. Hemmelgarn: On 2016-11-30 08:12, Wilson Meier wrote: Am 30/11/16 um 11:41 schrieb Duncan: Wilson Meier posted on Wed, 30 Nov 2016 09:35:36 +0100 as excerpted: Am 30/11/16 um 09:06 schrieb Martin Steigerwald: Am Mittwoch, 30. November 2016, 10:38:08 CET schrieb Roman Mamedov: [snip] So the stability matrix would need to be updated not to recommend any kind of BTRFS RAID 1 at the moment? Actually I faced the BTRFS RAID 1 read only after first attempt of mounting it "degraded" just a short time ago. BTRFS still needs way more stability work it seems to me. I would say the matrix should be updated to not recommend any RAID Level as from the discussion it seems they all of them have flaws. To me RAID is broken if one cannot expect to recover from a device failure in a solid way as this is why RAID is used. Correct me if i'm wrong. Right now i'm making my thoughts about migrating to another FS and/or Hardware RAID. It should be noted that no list regular that I'm aware of anyway, would make any claims about btrfs being stable and mature either now or in the near-term future in any case. Rather to the contrary, as I generally put it, btrfs is still stabilizing and maturing, with backups one is willing to use (and as any admin of any worth would say, a backup that hasn't been tested usable isn't yet a backup; the job of creating the backup isn't done until that backup has been tested actually usable for recovery) still extremely strongly recommended. Similarly, keeping up with the list is recommended, as is staying relatively current on both the kernel and userspace (generally considered to be within the latest two kernel series of either current or LTS series kernels, and with a similarly versioned btrfs userspace). In that context, btrfs single-device and raid1 (and raid0 of course) are quite usable and as stable as btrfs in general is, that being stabilizing but not yet fully stable and mature, with raid10 being slightly less so and raid56 being much more experimental/unstable at this point. But that context never claims full stability even for the relatively stable raid1 and single device modes, and in fact anticipates that there may be times when recovery from the existing filesystem may not be practical, thus the recommendation to keep tested usable backups at the ready. Meanwhile, it remains relatively common on this list for those wondering about their btrfs on long-term-stale (not a typo) "enterprise" distros, or even debian-stale, to be actively steered away from btrfs, especially if they're not willing to update to something far more current than those distros often provide, because in general, the current stability status of btrfs is in conflict with the reason people generally choose to use that level of old and stale software in the first place -- they prioritize tried and tested to work, stable and mature, over the latest generally newer and flashier featured but sometimes not entirely stable, and btrfs at this point simply doesn't meet that sort of stability/ maturity expectations, nor is it likely to for some time (measured in years), due to all the reasons enumerated so well in the above thread. In that context, the stability status matrix on the wiki is already reasonably accurate, certainly so IMO, because "OK" in context means as OK as btrfs is in general, and btrfs itself remains still stabilizing, not fully stable and mature. 
If there IS an argument as to the accuracy of the raid0/1/10 OK status, I'd argue it's purely due to people not understanding the status of btrfs in general, and that if there's a general deficiency at all, it's in the lack of a general stability status paragraph on that page itself explaining all this, despite the fact that the main https:// btrfs.wiki.kernel.org landing page states quite plainly under stability status that btrfs remains under heavy development and that current kernels are strongly recommended. (Tho were I editing it, there'd certainly be a more prominent mention of keeping backups at the ready as well.) Hi Duncan, i understand your arguments but cannot fully agree. First of all, i'm not sticking with old stale versions of whatever as i try to keep my system up2date. My kernel is 4.8.4 (Gentoo) and btrfs-progs is 4.8.4. That being said, i'm quite aware of the heavy development status of btrfs but pointing the finger on the users saying that they don't fully understand the status of btrfs without giving the information on the wiki is in my opinion not the right way. Heavy development doesn't mean that features marked as ok are "not" or "mostly" ok in the context of overall btrfs stability. There is no indication on the wiki that raid1 or every other raid (except for raid5/6) suffers from the problems stated in this thread. The performance issues are inherent to BTRFS right now, and none of the other issues are likely to impact most regular
Re: Convert from RAID 5 to 10
Am Mittwoch, 30. November 2016, 16:49:59 CET schrieb Wilson Meier: > Am 30/11/16 um 15:37 schrieb Austin S. Hemmelgarn: > > On 2016-11-30 08:12, Wilson Meier wrote: > >> Am 30/11/16 um 11:41 schrieb Duncan: > >>> Wilson Meier posted on Wed, 30 Nov 2016 09:35:36 +0100 as excerpted: > Am 30/11/16 um 09:06 schrieb Martin Steigerwald: > > Am Mittwoch, 30. November 2016, 10:38:08 CET schrieb Roman Mamedov: […] > >> It is really disappointing to not have this information in the wiki > >> itself. This would have saved me, and i'm quite sure others too, a lot > >> of time. > >> Sorry for being a bit frustrated. > > I'm not angry or something like that :) . > I just would like to have the possibility to read such information about > the storage i put my personal data (> 3 TB) on its official wiki. Anyone can get an account on the wiki and add notes there, so feel free. You can even use footnotes or something like that. Maybe it would be good to add a paragraph there that features are related to one another, so while BTRFS RAID 1 for example might be quite okay, it depends on features that are still flaky. I for myself rely quite much on BTRFS RAID 1 with lzo compression and it seems to work okay for me. -- Martin -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Convert from RAID 5 to 10
I completely agree, the whole wiki status is simply *FRUSTRATING*. Niccolò Belli On mercoledì 30 novembre 2016 14:12:36 CET, Wilson Meier wrote: Am 30/11/16 um 11:41 schrieb Duncan: Wilson Meier posted on Wed, 30 Nov 2016 09:35:36 +0100 as excerpted: ... Hi Duncan, i understand your arguments but cannot fully agree. First of all, i'm not sticking with old stale versions of whatever as i try to keep my system up2date. My kernel is 4.8.4 (Gentoo) and btrfs-progs is 4.8.4. That being said, i'm quite aware of the heavy development status of btrfs but pointing the finger on the users saying that they don't fully understand the status of btrfs without giving the information on the wiki is in my opinion not the right way. Heavy development doesn't mean that features marked as ok are "not" or "mostly" ok in the context of overall btrfs stability. There is no indication on the wiki that raid1 or every other raid (except for raid5/6) suffers from the problems stated in this thread. If there are know problems then the stability matrix should point them out or link to a corresponding wiki entry otherwise one has to assume that the features marked as "ok" are in fact "ok". And yes, the overall btrfs stability should be put on the wiki. Just to give you a quick overview of my history with btrfs. I migrated away from MD Raid and ext4 to btrfs raid6 because of its CoW and checksum features at a time as raid6 was not considered fully stable but also not as badly broken. After a few months i had a disk failure and the raid could not recover. I looked at the wiki an the mailing list and noticed that raid6 has been marked as badly broken :( I was quite happy to have a backup. So i asked on the btrfs IRC channel (the wiki had no relevant information) if raid10 is usable or suffers from the same problems. The summary was "Yes it is usable and has no known problems". So i migrated to raid10. Now i know that raid10 (marked as ok) has also problems with 2 disk failures in different stripes and can in fact lead to data loss. I thought, hmm ok, i'll split my data and use raid1 (marked as ok). And again the mailing list states that raid1 has also problems in case of recovery. It is really disappointing to not have this information in the wiki itself. This would have saved me, and i'm quite sure others too, a lot of time. Sorry for being a bit frustrated. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Convert from RAID 5 to 10
Am 30/11/16 um 15:37 schrieb Austin S. Hemmelgarn: > On 2016-11-30 08:12, Wilson Meier wrote: >> Am 30/11/16 um 11:41 schrieb Duncan: >>> Wilson Meier posted on Wed, 30 Nov 2016 09:35:36 +0100 as excerpted: >>> Am 30/11/16 um 09:06 schrieb Martin Steigerwald: > Am Mittwoch, 30. November 2016, 10:38:08 CET schrieb Roman Mamedov: >> [snip] > So the stability matrix would need to be updated not to recommend any > kind of BTRFS RAID 1 at the moment? > > Actually I faced the BTRFS RAID 1 read only after first attempt of > mounting it "degraded" just a short time ago. > > BTRFS still needs way more stability work it seems to me. > I would say the matrix should be updated to not recommend any RAID Level as from the discussion it seems they all of them have flaws. To me RAID is broken if one cannot expect to recover from a device failure in a solid way as this is why RAID is used. Correct me if i'm wrong. Right now i'm making my thoughts about migrating to another FS and/or Hardware RAID. >>> It should be noted that no list regular that I'm aware of anyway, would >>> make any claims about btrfs being stable and mature either now or in >>> the >>> near-term future in any case. Rather to the contrary, as I >>> generally put >>> it, btrfs is still stabilizing and maturing, with backups one is >>> willing >>> to use (and as any admin of any worth would say, a backup that hasn't >>> been tested usable isn't yet a backup; the job of creating the backup >>> isn't done until that backup has been tested actually usable for >>> recovery) still extremely strongly recommended. Similarly, keeping up >>> with the list is recommended, as is staying relatively current on both >>> the kernel and userspace (generally considered to be within the latest >>> two kernel series of either current or LTS series kernels, and with a >>> similarly versioned btrfs userspace). >>> >>> In that context, btrfs single-device and raid1 (and raid0 of course) >>> are >>> quite usable and as stable as btrfs in general is, that being >>> stabilizing >>> but not yet fully stable and mature, with raid10 being slightly less so >>> and raid56 being much more experimental/unstable at this point. >>> >>> But that context never claims full stability even for the relatively >>> stable raid1 and single device modes, and in fact anticipates that >>> there >>> may be times when recovery from the existing filesystem may not be >>> practical, thus the recommendation to keep tested usable backups at the >>> ready. >>> >>> Meanwhile, it remains relatively common on this list for those >>> wondering >>> about their btrfs on long-term-stale (not a typo) "enterprise" distros, >>> or even debian-stale, to be actively steered away from btrfs, >>> especially >>> if they're not willing to update to something far more current than >>> those >>> distros often provide, because in general, the current stability status >>> of btrfs is in conflict with the reason people generally choose to use >>> that level of old and stale software in the first place -- they >>> prioritize tried and tested to work, stable and mature, over the latest >>> generally newer and flashier featured but sometimes not entirely >>> stable, >>> and btrfs at this point simply doesn't meet that sort of stability/ >>> maturity expectations, nor is it likely to for some time (measured in >>> years), due to all the reasons enumerated so well in the above thread. 
>>> >>> >>> In that context, the stability status matrix on the wiki is already >>> reasonably accurate, certainly so IMO, because "OK" in context means as >>> OK as btrfs is in general, and btrfs itself remains still stabilizing, >>> not fully stable and mature. >>> >>> If there IS an argument as to the accuracy of the raid0/1/10 OK status, >>> I'd argue it's purely due to people not understanding the status of >>> btrfs >>> in general, and that if there's a general deficiency at all, it's in >>> the >>> lack of a general stability status paragraph on that page itself >>> explaining all this, despite the fact that the main https:// >>> btrfs.wiki.kernel.org landing page states quite plainly under stability >>> status that btrfs remains under heavy development and that current >>> kernels are strongly recommended. (Tho were I editing it, there'd >>> certainly be a more prominent mention of keeping backups at the >>> ready as >>> well.) >>> >> Hi Duncan, >> >> i understand your arguments but cannot fully agree. >> First of all, i'm not sticking with old stale versions of whatever as i >> try to keep my system up2date. >> My kernel is 4.8.4 (Gentoo) and btrfs-progs is 4.8.4. >> That being said, i'm quite aware of the heavy development status of >> btrfs but pointing the finger on the users saying that they don't fully >> understand the status of btrfs without giving the information on the >> wiki is in my opinion not the right way. Heavy development doesn't mean >> that
Re: Convert from RAID 5 to 10
On 2016-11-30 09:04, Roman Mamedov wrote: On Wed, 30 Nov 2016 07:50:17 -0500 "Austin S. Hemmelgarn"wrote: *) Read performance is not optimized: all metadata is always read from the first device unless it has failed, data reads are supposedly balanced between devices per PID of the process reading. Better implementations dispatch reads per request to devices that are currently idle. Based on what I've seen, the metadata reads get balanced too. https://github.com/torvalds/linux/blob/v4.8/fs/btrfs/disk-io.c#L451 This starts from the mirror number 0 and tries others in an incrementing order, until succeeds. It appears that as long as the mirror with copy #0 is up and not corrupted, all reads will simply get satisfied from it. That's actually how all reads work, it's just that the PID selects what constitutes the 'first' copy. IIRC, that selection is doen by a lower layer. *) Write performance is not optimized, during long full bandwidth sequential writes it is common to see devices writing not in parallel, but with a long periods of just one device writing, then another. (Admittedly have been some time since I tested that). I've never seen this be an issue in practice, especially if you're using transparent compression (which caps extent size, and therefore I/O size to a given device, at 128k). I'm also sane enough that I'm not doing bulk streaming writes to traditional HDD's or fully saturating the bandwidth on my SSD's (you should be over-provisioning whenever possible). For a desktop user, unless you're doing real-time video recording at higher than HD resolution with high quality surround sound, this probably isn't going to hit you (and even then you should be recording to a temporary location with much faster write speeds (tmpfs or ext4 without a journal for example) because you'll likely get hit with fragmentation). I did not use compression while observing this; Compression doesn't make things parallel, but it does cause BTRFS to distribute the writes more evenly because it writes first one extent then the other, which in turn makes things much more efficient because you're not stalling as much waiting for the I/O queue to finish. It also means you have to write less overall to the disk, so on systems which can do LZO compression significantly faster than they can write to or read from the disk, it will generally improve performance all around. Also I don't know what is particularly insane about copying a 4-8 GB file onto a storage array. I'd expect both disks to write at the same time (like they do in pretty much any other RAID1 system), not one-after-another, effectively slowing down the entire operation by as much as 2x in extreme cases. I'm not talking 4-8GB files, I'm talking really big stuff at least an order of magnitude larger than that, stuff like filesystem images and big databases. On the only system I have where I have traditional hard disks (7200RPM consumer SATA3 drives connected to an LSI MPT2SAS HBA, about 80-100MB/s bulk write speed to a single disk), an 8GB copy from tmpfs is only in practice about 20% slower to BTRFS raid1 mode than to XFS on top of a DM-RAID RAID1 volume, and about 30% slower than the same with ext4. In both cases, this is actually about 50% faster than ZFS (which does prallelize reads and writes) in an equivalent configuration on the same hardware. 
Comparing all of that to single disk versions on the same hardware, I see roughly the same performance ratios between filesystems, and the same goes for running on the motherboard's SATA controller instead of the LSI HBA. In this case, I am using compression (and the data gets reasonable compression ratios), and I see both disks running at just below peak bandwidth, and based on tracing, most of the difference is in the metadata updates required to change the extents. I would love to see BTRFS properly parallelize writes and stripe reads sanely, but I seriously doubt it's going to have as much impact as you think, especially on systems with fast storage. As far as not mounting degraded by default, that's a conscious design choice that isn't going to change. There's a switch (adding 'degraded' to the mount options) to enable this behavior per-mount, so we're still on-par in that respect with LVM and MD, we just picked a different default. In this case, I actually feel it's a better default for most cases, because most regular users aren't doing exhaustive monitoring, and thus are not likely to notice the filesystem being mounted degraded until it's far too late. If the filesystem is degraded, then _something_ has happened that the user needs to know about, and until some sane monitoring solution is implemented, the easiest way to ensure this is to refuse to mount. The easiest is to write to dmesg and syslog, if a user doesn't monitor those either, it's their own fault; and the more user friendly one would be to still auto mount degraded, but
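If you want to see which of the two write behaviours debated above your own array shows during a long sequential write, something along these lines is enough (iostat comes from the sysstat package; the file size, mount point and device names are placeholders):

# write a large file while watching both members of the raid1/raid10 volume
dd if=/dev/zero of=/mnt/testfile bs=1M count=8192 conv=fdatasync &
iostat -x 1 sda sdb
# parallel writes show similar write throughput on both devices at the same time;
# long alternating bursts on one device, then the other, is the pattern Roman describes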
Re: Convert from RAID 5 to 10
On 2016-11-30 08:12, Wilson Meier wrote: Am 30/11/16 um 11:41 schrieb Duncan: Wilson Meier posted on Wed, 30 Nov 2016 09:35:36 +0100 as excerpted: Am 30/11/16 um 09:06 schrieb Martin Steigerwald: Am Mittwoch, 30. November 2016, 10:38:08 CET schrieb Roman Mamedov: [snip] So the stability matrix would need to be updated not to recommend any kind of BTRFS RAID 1 at the moment? Actually I faced the BTRFS RAID 1 read only after first attempt of mounting it "degraded" just a short time ago. BTRFS still needs way more stability work it seems to me. I would say the matrix should be updated to not recommend any RAID Level as from the discussion it seems they all of them have flaws. To me RAID is broken if one cannot expect to recover from a device failure in a solid way as this is why RAID is used. Correct me if i'm wrong. Right now i'm making my thoughts about migrating to another FS and/or Hardware RAID. It should be noted that no list regular that I'm aware of anyway, would make any claims about btrfs being stable and mature either now or in the near-term future in any case. Rather to the contrary, as I generally put it, btrfs is still stabilizing and maturing, with backups one is willing to use (and as any admin of any worth would say, a backup that hasn't been tested usable isn't yet a backup; the job of creating the backup isn't done until that backup has been tested actually usable for recovery) still extremely strongly recommended. Similarly, keeping up with the list is recommended, as is staying relatively current on both the kernel and userspace (generally considered to be within the latest two kernel series of either current or LTS series kernels, and with a similarly versioned btrfs userspace). In that context, btrfs single-device and raid1 (and raid0 of course) are quite usable and as stable as btrfs in general is, that being stabilizing but not yet fully stable and mature, with raid10 being slightly less so and raid56 being much more experimental/unstable at this point. But that context never claims full stability even for the relatively stable raid1 and single device modes, and in fact anticipates that there may be times when recovery from the existing filesystem may not be practical, thus the recommendation to keep tested usable backups at the ready. Meanwhile, it remains relatively common on this list for those wondering about their btrfs on long-term-stale (not a typo) "enterprise" distros, or even debian-stale, to be actively steered away from btrfs, especially if they're not willing to update to something far more current than those distros often provide, because in general, the current stability status of btrfs is in conflict with the reason people generally choose to use that level of old and stale software in the first place -- they prioritize tried and tested to work, stable and mature, over the latest generally newer and flashier featured but sometimes not entirely stable, and btrfs at this point simply doesn't meet that sort of stability/ maturity expectations, nor is it likely to for some time (measured in years), due to all the reasons enumerated so well in the above thread. In that context, the stability status matrix on the wiki is already reasonably accurate, certainly so IMO, because "OK" in context means as OK as btrfs is in general, and btrfs itself remains still stabilizing, not fully stable and mature. 
If there IS an argument as to the accuracy of the raid0/1/10 OK status, I'd argue it's purely due to people not understanding the status of btrfs in general, and that if there's a general deficiency at all, it's in the lack of a general stability status paragraph on that page itself explaining all this, despite the fact that the main https:// btrfs.wiki.kernel.org landing page states quite plainly under stability status that btrfs remains under heavy development and that current kernels are strongly recommended. (Tho were I editing it, there'd certainly be a more prominent mention of keeping backups at the ready as well.) Hi Duncan, i understand your arguments but cannot fully agree. First of all, i'm not sticking with old stale versions of whatever as i try to keep my system up2date. My kernel is 4.8.4 (Gentoo) and btrfs-progs is 4.8.4. That being said, i'm quite aware of the heavy development status of btrfs but pointing the finger on the users saying that they don't fully understand the status of btrfs without giving the information on the wiki is in my opinion not the right way. Heavy development doesn't mean that features marked as ok are "not" or "mostly" ok in the context of overall btrfs stability. There is no indication on the wiki that raid1 or every other raid (except for raid5/6) suffers from the problems stated in this thread. The performance issues are inherent to BTRFS right now, and none of the other issues are likely to impact most regular users. Most of the people who would be interested in the features of BTRFS also have existing
Re: Convert from RAID 5 to 10
On Wed, 30 Nov 2016 07:50:17 -0500 "Austin S. Hemmelgarn"wrote: > > *) Read performance is not optimized: all metadata is always read from the > > first device unless it has failed, data reads are supposedly balanced > > between > > devices per PID of the process reading. Better implementations dispatch > > reads > > per request to devices that are currently idle. > Based on what I've seen, the metadata reads get balanced too. https://github.com/torvalds/linux/blob/v4.8/fs/btrfs/disk-io.c#L451 This starts from the mirror number 0 and tries others in an incrementing order, until succeeds. It appears that as long as the mirror with copy #0 is up and not corrupted, all reads will simply get satisfied from it. > > *) Write performance is not optimized, during long full bandwidth sequential > > writes it is common to see devices writing not in parallel, but with a long > > periods of just one device writing, then another. (Admittedly have been some > > time since I tested that). > I've never seen this be an issue in practice, especially if you're using > transparent compression (which caps extent size, and therefore I/O size > to a given device, at 128k). I'm also sane enough that I'm not doing > bulk streaming writes to traditional HDD's or fully saturating the > bandwidth on my SSD's (you should be over-provisioning whenever > possible). For a desktop user, unless you're doing real-time video > recording at higher than HD resolution with high quality surround sound, > this probably isn't going to hit you (and even then you should be > recording to a temporary location with much faster write speeds (tmpfs > or ext4 without a journal for example) because you'll likely get hit > with fragmentation). I did not use compression while observing this; Also I don't know what is particularly insane about copying a 4-8 GB file onto a storage array. I'd expect both disks to write at the same time (like they do in pretty much any other RAID1 system), not one-after-another, effectively slowing down the entire operation by as much as 2x in extreme cases. > As far as not mounting degraded by default, that's a conscious design > choice that isn't going to change. There's a switch (adding 'degraded' > to the mount options) to enable this behavior per-mount, so we're still > on-par in that respect with LVM and MD, we just picked a different > default. In this case, I actually feel it's a better default for most > cases, because most regular users aren't doing exhaustive monitoring, > and thus are not likely to notice the filesystem being mounted degraded > until it's far too late. If the filesystem is degraded, then > _something_ has happened that the user needs to know about, and until > some sane monitoring solution is implemented, the easiest way to ensure > this is to refuse to mount. The easiest is to write to dmesg and syslog, if a user doesn't monitor those either, it's their own fault; and the more user friendly one would be to still auto mount degraded, but read-only. Comparing to Ext4, that one appears to have the "errors=continue" behavior by default, the user has to explicitly request "errors=remount-ro", and I have never seen anyone use or recommend the third option of "errors=panic", which is basically the equivalent of the current Btrfs practce. > > *) It does not properly handle a device disappearing during operation. > > (There > > is a patchset to add that). > > > > *) It does not properly handle said device returning (under a > > different /dev/sdX name, for bonus points). 
> These are not an easy problem to fix completely, especially considering > that the device is currently guaranteed to reappear under a different > name because BTRFS will still have an open reference on the original > device name. > > On top of that, if you've got hardware that's doing this without manual > intervention, you've got much bigger issues than how BTRFS reacts to it. > No correctly working hardware should be doing this. Unplugging and replugging a SATA cable of a RAID1 member should never put your system under the risk of a massive filesystem corruption; you cannot say it absolutely doesn't with the current implementation. -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Convert from RAID 5 to 10
Am 30/11/16 um 11:41 schrieb Duncan: > Wilson Meier posted on Wed, 30 Nov 2016 09:35:36 +0100 as excerpted: > >> Am 30/11/16 um 09:06 schrieb Martin Steigerwald: >>> Am Mittwoch, 30. November 2016, 10:38:08 CET schrieb Roman Mamedov: [snip] >>> So the stability matrix would need to be updated not to recommend any >>> kind of BTRFS RAID 1 at the moment? >>> >>> Actually I faced the BTRFS RAID 1 read only after first attempt of >>> mounting it "degraded" just a short time ago. >>> >>> BTRFS still needs way more stability work it seems to me. >>> >> I would say the matrix should be updated to not recommend any RAID Level >> as from the discussion it seems they all of them have flaws. >> To me RAID is broken if one cannot expect to recover from a device >> failure in a solid way as this is why RAID is used. >> Correct me if i'm wrong. Right now i'm making my thoughts about >> migrating to another FS and/or Hardware RAID. > It should be noted that no list regular that I'm aware of anyway, would > make any claims about btrfs being stable and mature either now or in the > near-term future in any case. Rather to the contrary, as I generally put > it, btrfs is still stabilizing and maturing, with backups one is willing > to use (and as any admin of any worth would say, a backup that hasn't > been tested usable isn't yet a backup; the job of creating the backup > isn't done until that backup has been tested actually usable for > recovery) still extremely strongly recommended. Similarly, keeping up > with the list is recommended, as is staying relatively current on both > the kernel and userspace (generally considered to be within the latest > two kernel series of either current or LTS series kernels, and with a > similarly versioned btrfs userspace). > > In that context, btrfs single-device and raid1 (and raid0 of course) are > quite usable and as stable as btrfs in general is, that being stabilizing > but not yet fully stable and mature, with raid10 being slightly less so > and raid56 being much more experimental/unstable at this point. > > But that context never claims full stability even for the relatively > stable raid1 and single device modes, and in fact anticipates that there > may be times when recovery from the existing filesystem may not be > practical, thus the recommendation to keep tested usable backups at the > ready. > > Meanwhile, it remains relatively common on this list for those wondering > about their btrfs on long-term-stale (not a typo) "enterprise" distros, > or even debian-stale, to be actively steered away from btrfs, especially > if they're not willing to update to something far more current than those > distros often provide, because in general, the current stability status > of btrfs is in conflict with the reason people generally choose to use > that level of old and stale software in the first place -- they > prioritize tried and tested to work, stable and mature, over the latest > generally newer and flashier featured but sometimes not entirely stable, > and btrfs at this point simply doesn't meet that sort of stability/ > maturity expectations, nor is it likely to for some time (measured in > years), due to all the reasons enumerated so well in the above thread. > > > In that context, the stability status matrix on the wiki is already > reasonably accurate, certainly so IMO, because "OK" in context means as > OK as btrfs is in general, and btrfs itself remains still stabilizing, > not fully stable and mature. 
> > If there IS an argument as to the accuracy of the raid0/1/10 OK status, > I'd argue it's purely due to people not understanding the status of btrfs > in general, and that if there's a general deficiency at all, it's in the > lack of a general stability status paragraph on that page itself > explaining all this, despite the fact that the main https:// > btrfs.wiki.kernel.org landing page states quite plainly under stability > status that btrfs remains under heavy development and that current > kernels are strongly recommended. (Tho were I editing it, there'd > certainly be a more prominent mention of keeping backups at the ready as > well.) Hi Duncan, I understand your arguments but cannot fully agree. First of all, I'm not sticking with old, stale versions of whatever, as I try to keep my system up to date. My kernel is 4.8.4 (Gentoo) and btrfs-progs is 4.8.4. That being said, I'm quite aware of the heavy development status of btrfs, but pointing the finger at the users, saying that they don't fully understand the status of btrfs, without giving that information on the wiki is in my opinion not the right way. Heavy development doesn't mean that features marked as OK are "not" or only "mostly" OK in the context of overall btrfs stability. There is no indication on the wiki that raid1 or any other raid level (except for raid5/6) suffers from the problems stated in this thread. If there are known problems, then the stability matrix should point them out.
Re: Convert from RAID 5 to 10
On 2016-11-30 00:38, Roman Mamedov wrote: > On Wed, 30 Nov 2016 00:16:48 +0100 Wilson Meier wrote: >> That said, btrfs shouldn't be used for other than raid1 as every other >> raid level has serious problems or at least doesn't work as the expected >> raid level (in terms of failure recovery). > RAID1 shouldn't be used either: > > *) Read performance is not optimized: all metadata is always read from the > first device unless it has failed, data reads are supposedly balanced > between devices per PID of the process reading. Better implementations > dispatch reads per request to devices that are currently idle. Based on what I've seen, the metadata reads get balanced too. As for the read balancing in general, while it doesn't work very well for single processes, if you have a large number of processes started sequentially (for example, a thread-pool based server), it actually works out to being near optimal with a lot less logic than DM and MD have. Aggregated over an entire system it's usually near optimal as well. > *) Write performance is not optimized, during long full bandwidth sequential > writes it is common to see devices writing not in parallel, but with long > periods of just one device writing, then another. (Admittedly it has been > some time since I tested that.) I've never seen this be an issue in practice, especially if you're using transparent compression (which caps extent size, and therefore I/O size to a given device, at 128k). I'm also sane enough that I'm not doing bulk streaming writes to traditional HDDs or fully saturating the bandwidth on my SSDs (you should be over-provisioning whenever possible). For a desktop user, unless you're doing real-time video recording at higher than HD resolution with high quality surround sound, this probably isn't going to hit you (and even then you should be recording to a temporary location with much faster write speeds (tmpfs or ext4 without a journal, for example) because you'll likely get hit with fragmentation). This also has overall pretty low impact compared to a number of other things that BTRFS does (BTRFS on a single disk with the single profile for everything versus 2 of the same disks with the raid1 profile for everything gets less than a 20% performance difference in all the testing I've done). > *) A degraded RAID1 won't mount by default. > > If this was the root filesystem, the machine won't boot. > > To mount it, you need to add the "degraded" mount option. > However you have exactly a single chance at that, you MUST restore the RAID > to non-degraded state while it's mounted during that session, since it > won't ever mount again in the r/w+degraded mode, and in r/o mode you can't > perform any operations on the filesystem, including adding/removing > devices. There is a fix pending for the single chance to mount degraded thing, and even then, it only applies to a 2 device raid1 array (with more devices, new chunks are still raid1 if you're missing 1 device, so the checks don't trigger and refuse the mount). As far as not mounting degraded by default, that's a conscious design choice that isn't going to change. There's a switch (adding 'degraded' to the mount options) to enable this behavior per-mount, so we're still on par in that respect with LVM and MD, we just picked a different default. In this case, I actually feel it's a better default for most cases, because most regular users aren't doing exhaustive monitoring, and thus are not likely to notice the filesystem being mounted degraded until it's far too late. 
If the filesystem is degraded, then _something_ has happened that the user needs to know about, and until some sane monitoring solution is implemented, the easiest way to ensure this is to refuse to mount. > *) It does not properly handle a device disappearing during operation. > (There is a patchset to add that). > > *) It does not properly handle said device returning (under a > different /dev/sdX name, for bonus points). These are not easy problems to fix completely, especially considering that the device is currently guaranteed to reappear under a different name because BTRFS will still have an open reference on the original device name. On top of that, if you've got hardware that's doing this without manual intervention, you've got much bigger issues than how BTRFS reacts to it. No correctly working hardware should be doing this. > Most of these also apply to all other RAID levels. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
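To make the degraded-mount workflow discussed above concrete, a recovery on a two-device raid1 might look roughly like the sketch below. Device names and the mount point are only examples, and on the kernels discussed in this thread you effectively get one read-write degraded mount, so the repair should happen in that same session:

# mount -o degraded /dev/sdb4 /mnt
# btrfs device add /dev/sdc4 /mnt          # add a replacement for the missing disk
# btrfs device delete missing /mnt         # drop the dead device and re-replicate its chunks
# btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt   # re-raid1 chunks written as "single" while degraded

The "soft" filter only touches chunks that are not already in the target profile, so the last step is cheap if nothing was written while the array was degraded.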
Re: Convert from RAID 5 to 10
Wilson Meier posted on Wed, 30 Nov 2016 09:35:36 +0100 as excerpted: > Am 30/11/16 um 09:06 schrieb Martin Steigerwald: >> Am Mittwoch, 30. November 2016, 10:38:08 CET schrieb Roman Mamedov: >>> On Wed, 30 Nov 2016 00:16:48 +0100 >>> >>> Wilson Meierwrote: That said, btrfs shouldn't be used for other then raid1 as every other raid level has serious problems or at least doesn't work as the expected raid level (in terms of failure recovery). >>> RAID1 shouldn't be used either: >>> >>> *) Read performance is not optimized: all metadata is always read from >>> the first device unless it has failed, data reads are supposedly >>> balanced between devices per PID of the process reading. Better >>> implementations dispatch reads per request to devices that are >>> currently idle. >>> >>> *) Write performance is not optimized, during long full bandwidth >>> sequential writes it is common to see devices writing not in parallel, >>> but with a long periods of just one device writing, then another. >>> (Admittedly have been some time since I tested that). >>> >>> *) A degraded RAID1 won't mount by default. >>> >>> If this was the root filesystem, the machine won't boot. >>> >>> To mount it, you need to add the "degraded" mount option. >>> However you have exactly a single chance at that, you MUST restore the >>> RAID to non-degraded state while it's mounted during that session, >>> since it won't ever mount again in the r/w+degraded mode, and in r/o >>> mode you can't perform any operations on the filesystem, including >>> adding/removing devices. >>> >>> *) It does not properly handle a device disappearing during operation. >>> (There is a patchset to add that). >>> >>> *) It does not properly handle said device returning (under a >>> different /dev/sdX name, for bonus points). >>> >>> Most of these also apply to all other RAID levels. >> So the stability matrix would need to be updated not to recommend any >> kind of BTRFS RAID 1 at the moment? >> >> Actually I faced the BTRFS RAID 1 read only after first attempt of >> mounting it "degraded" just a short time ago. >> >> BTRFS still needs way more stability work it seems to me. >> > I would say the matrix should be updated to not recommend any RAID Level > as from the discussion it seems they all of them have flaws. > To me RAID is broken if one cannot expect to recover from a device > failure in a solid way as this is why RAID is used. > Correct me if i'm wrong. Right now i'm making my thoughts about > migrating to another FS and/or Hardware RAID. It should be noted that no list regular that I'm aware of anyway, would make any claims about btrfs being stable and mature either now or in the near-term future in any case. Rather to the contrary, as I generally put it, btrfs is still stabilizing and maturing, with backups one is willing to use (and as any admin of any worth would say, a backup that hasn't been tested usable isn't yet a backup; the job of creating the backup isn't done until that backup has been tested actually usable for recovery) still extremely strongly recommended. Similarly, keeping up with the list is recommended, as is staying relatively current on both the kernel and userspace (generally considered to be within the latest two kernel series of either current or LTS series kernels, and with a similarly versioned btrfs userspace). 
In that context, btrfs single-device and raid1 (and raid0 of course) are quite usable and as stable as btrfs in general is, that being stabilizing but not yet fully stable and mature, with raid10 being slightly less so and raid56 being much more experimental/unstable at this point. But that context never claims full stability even for the relatively stable raid1 and single device modes, and in fact anticipates that there may be times when recovery from the existing filesystem may not be practical, thus the recommendation to keep tested usable backups at the ready. Meanwhile, it remains relatively common on this list for those wondering about their btrfs on long-term-stale (not a typo) "enterprise" distros, or even debian-stale, to be actively steered away from btrfs, especially if they're not willing to update to something far more current than those distros often provide, because in general, the current stability status of btrfs is in conflict with the reason people generally choose to use that level of old and stale software in the first place -- they prioritize tried and tested to work, stable and mature, over the latest generally newer and flashier featured but sometimes not entirely stable, and btrfs at this point simply doesn't meet that sort of stability/ maturity expectations, nor is it likely to for some time (measured in years), due to all the reasons enumerated so well in the above thread. In that context, the stability status matrix on the wiki is already reasonably accurate, certainly so IMO, because "OK" in context means as OK as
Re: Convert from RAID 5 to 10
Am 30/11/16 um 09:06 schrieb Martin Steigerwald: > Am Mittwoch, 30. November 2016, 10:38:08 CET schrieb Roman Mamedov: >> On Wed, 30 Nov 2016 00:16:48 +0100 >> >> Wilson Meierwrote: >>> That said, btrfs shouldn't be used for other then raid1 as every other >>> raid level has serious problems or at least doesn't work as the expected >>> raid level (in terms of failure recovery). >> RAID1 shouldn't be used either: >> >> *) Read performance is not optimized: all metadata is always read from the >> first device unless it has failed, data reads are supposedly balanced >> between devices per PID of the process reading. Better implementations >> dispatch reads per request to devices that are currently idle. >> >> *) Write performance is not optimized, during long full bandwidth sequential >> writes it is common to see devices writing not in parallel, but with a long >> periods of just one device writing, then another. (Admittedly have been >> some time since I tested that). >> >> *) A degraded RAID1 won't mount by default. >> >> If this was the root filesystem, the machine won't boot. >> >> To mount it, you need to add the "degraded" mount option. >> However you have exactly a single chance at that, you MUST restore the RAID >> to non-degraded state while it's mounted during that session, since it >> won't ever mount again in the r/w+degraded mode, and in r/o mode you can't >> perform any operations on the filesystem, including adding/removing >> devices. >> >> *) It does not properly handle a device disappearing during operation. >> (There is a patchset to add that). >> >> *) It does not properly handle said device returning (under a >> different /dev/sdX name, for bonus points). >> >> Most of these also apply to all other RAID levels. > So the stability matrix would need to be updated not to recommend any kind of > BTRFS RAID 1 at the moment? > > Actually I faced the BTRFS RAID 1 read only after first attempt of mounting > it > "degraded" just a short time ago. > > BTRFS still needs way more stability work it seems to me. > I would say the matrix should be updated to not recommend any RAID Level as from the discussion it seems they all of them have flaws. To me RAID is broken if one cannot expect to recover from a device failure in a solid way as this is why RAID is used. Correct me if i'm wrong. Right now i'm making my thoughts about migrating to another FS and/or Hardware RAID. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Convert from RAID 5 to 10
Am Mittwoch, 30. November 2016, 10:38:08 CET schrieb Roman Mamedov: > On Wed, 30 Nov 2016 00:16:48 +0100 > > Wilson Meierwrote: > > That said, btrfs shouldn't be used for other then raid1 as every other > > raid level has serious problems or at least doesn't work as the expected > > raid level (in terms of failure recovery). > > RAID1 shouldn't be used either: > > *) Read performance is not optimized: all metadata is always read from the > first device unless it has failed, data reads are supposedly balanced > between devices per PID of the process reading. Better implementations > dispatch reads per request to devices that are currently idle. > > *) Write performance is not optimized, during long full bandwidth sequential > writes it is common to see devices writing not in parallel, but with a long > periods of just one device writing, then another. (Admittedly have been > some time since I tested that). > > *) A degraded RAID1 won't mount by default. > > If this was the root filesystem, the machine won't boot. > > To mount it, you need to add the "degraded" mount option. > However you have exactly a single chance at that, you MUST restore the RAID > to non-degraded state while it's mounted during that session, since it > won't ever mount again in the r/w+degraded mode, and in r/o mode you can't > perform any operations on the filesystem, including adding/removing > devices. > > *) It does not properly handle a device disappearing during operation. > (There is a patchset to add that). > > *) It does not properly handle said device returning (under a > different /dev/sdX name, for bonus points). > > Most of these also apply to all other RAID levels. So the stability matrix would need to be updated not to recommend any kind of BTRFS RAID 1 at the moment? Actually I faced the BTRFS RAID 1 read only after first attempt of mounting it "degraded" just a short time ago. BTRFS still needs way more stability work it seems to me. -- Martin -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Convert from RAID 5 to 10
On Wed, 30 Nov 2016 00:16:48 +0100 Wilson Meier wrote: > That said, btrfs shouldn't be used for other than raid1 as every other > raid level has serious problems or at least doesn't work as the expected > raid level (in terms of failure recovery). RAID1 shouldn't be used either: *) Read performance is not optimized: all metadata is always read from the first device unless it has failed, data reads are supposedly balanced between devices per PID of the process reading. Better implementations dispatch reads per request to devices that are currently idle. *) Write performance is not optimized, during long full bandwidth sequential writes it is common to see devices writing not in parallel, but with long periods of just one device writing, then another. (Admittedly it has been some time since I tested that.) *) A degraded RAID1 won't mount by default. If this was the root filesystem, the machine won't boot. To mount it, you need to add the "degraded" mount option. However you have exactly a single chance at that, you MUST restore the RAID to non-degraded state while it's mounted during that session, since it won't ever mount again in the r/w+degraded mode, and in r/o mode you can't perform any operations on the filesystem, including adding/removing devices. *) It does not properly handle a device disappearing during operation. (There is a patchset to add that). *) It does not properly handle said device returning (under a different /dev/sdX name, for bonus points). Most of these also apply to all other RAID levels. -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Convert from RAID 5 to 10
On 30.11.2016 00:49, Chris Murphy wrote: > On Tue, Nov 29, 2016 at 4:16 PM, Wilson Meierwrote: >> >> >> On 29.11.2016 23:52, Chris Murphy wrote: >>> On Tue, Nov 29, 2016 at 3:34 PM, Wilson Meier >>> wrote: On 29.11.2016 18:54, Austin S. Hemmelgarn wrote: > On 2016-11-29 12:20, Florian Lindner wrote: >> Hello, >> >> I have 4 harddisks with 3TB capacity each. They are all used in a >> btrfs RAID 5. It has come to my attention, that there >> seem to be major flaws in btrfs' raid 5 implementation. Because of >> that, I want to convert the the raid 5 to a raid 10 >> and I have several questions. >> >> * Is that possible as an online conversion? > Yes, as long as you have a complete array to begin with (converting from > a degraded raid5/6 array has the same issues as rebuilding a degraded > raid5/6 array). >> >> * Since my effective capacity will shrink during conversions, does >> btrfs check if there is enough free capacity to >> convert? As you see below, right now it's probably too full, but I'm >> going to delete some stuff. > No, you'll have to do the math yourself. This would be a great project > idea to place on the wiki though. >> >> * I understand the command to convert is >> >> btrfs balance start -dconvert=raid10 -mconvert=raid10 /mnt >> >> Correct? > Yes, but I would personally convert first metadata then data. The > raid10 profile gets better performance than raid5, so converting the > metadata first (by issuing a balance just covering the metadata) should > speed up the data conversion a bit). >> >> * What disks are allowed to fail? My understanding of a raid 10 is >> like that >> >> disks = {a, b, c, d} >> >> raid0( raid1(a, b), raid1(c, d) ) >> >> This way (a XOR b) AND (c XOR d) are allowed to fail without the raid >> to fail (either a or b and c or d are allowed to fail) >> >> How is that with a btrfs raid 10? > A BTRFS raid10 can only sustain one disk failure. Ideally, it would > work like you show, but in practice it doesn't. I'm a little bit concerned right now. I migrated my 4 disk raid6 to raid10 because of the known raid5/6 problems. I assumed that btrfs raid10 can handle 2 disk failures as longs as they occur in different stripes. Could you please point out why it cannot sustain 2 disk failures? >>> >>> Conventional raid10 has a fixed assignment of which drives are >>> mirrored pairs, and this doesn't happen with Btrfs at the device level >>> but rather the chunk level. And a chunk stripe number is not fixed to >>> a particular device, therefore it's possible a device will have more >>> than one chunk stripe number. So what that means is the loss of two >>> devices has a pretty decent chance of resulting in the loss of both >>> copies of a chunk, whereas conventional RAID 10 must lose both >>> mirrored pairs for data loss to happen. >>> >>> With very cursory testing what I've found is btrfs-progs establishes >>> an initial stripe number to device mapping that's different than the >>> kernel code. The kernel code appears to be pretty consistent so long >>> as the member devices are identically sized. So it's probably not an >>> unfixable problem, but the effect is that right now Btrfs raid10 >>> profile is more like raid0+1. >>> >>> You can use >>> $ sudo btrfs insp dump-tr -t 3 /dev/ >>> >>> That will dump the chunk tree, and you can see if any device has more >>> than one chunk stripe number associated with it. >>> >>> >> Huh, that makes sense. 
That probably should be fixed :) >> >> Given your advised command (extended it a bit for readability): >> # btrfs insp dump-tr -t 3 /dev/mapper/luks-2.1 | grep "stripe " | awk '{ >> print $1" "$2" "$3" "$4 }' | sort -u >> >> I get: >> stripe 0 devid 1 >> stripe 0 devid 4 >> stripe 1 devid 2 >> stripe 1 devid 3 >> stripe 1 devid 4 >> stripe 2 devid 1 >> stripe 2 devid 2 >> stripe 2 devid 3 >> stripe 3 devid 1 >> stripe 3 devid 2 >> stripe 3 devid 3 >> stripe 3 devid 4 >> >> Now i'm even more concerned! > > Uhh yeah, this is a four device raid10? I'm a little confused why it's > not consistently showing four stripes per chunk, which would mean the > same number of strip 0's as stripe 3's. I don't know what that's > about. > Yes, 4 devices. It does show 4 stripes per chunk, but the command above sorts and makes the results unique (sort -u). This gives a quick overview of multiple stripes on a single device. > A full balance might make the mapping consistent. > Will give i a try. >> That said, btrfs shouldn't be used for other then raid1 as every other >> raid level has serious problems or at least doesn't work as the expected >> raid level (in terms of failure recovery). > > Well, raid1 is also single device failure tolerance only as well. > There is no device n raid1. > Sure, but
Re: Convert from RAID 5 to 10
On Tue, Nov 29, 2016 at 4:16 PM, Wilson Meierwrote: > > > On 29.11.2016 23:52, Chris Murphy wrote: >> On Tue, Nov 29, 2016 at 3:34 PM, Wilson Meier wrote: >>> On 29.11.2016 18:54, Austin S. Hemmelgarn wrote: On 2016-11-29 12:20, Florian Lindner wrote: > Hello, > > I have 4 harddisks with 3TB capacity each. They are all used in a > btrfs RAID 5. It has come to my attention, that there > seem to be major flaws in btrfs' raid 5 implementation. Because of > that, I want to convert the the raid 5 to a raid 10 > and I have several questions. > > * Is that possible as an online conversion? Yes, as long as you have a complete array to begin with (converting from a degraded raid5/6 array has the same issues as rebuilding a degraded raid5/6 array). > > * Since my effective capacity will shrink during conversions, does > btrfs check if there is enough free capacity to > convert? As you see below, right now it's probably too full, but I'm > going to delete some stuff. No, you'll have to do the math yourself. This would be a great project idea to place on the wiki though. > > * I understand the command to convert is > > btrfs balance start -dconvert=raid10 -mconvert=raid10 /mnt > > Correct? Yes, but I would personally convert first metadata then data. The raid10 profile gets better performance than raid5, so converting the metadata first (by issuing a balance just covering the metadata) should speed up the data conversion a bit). > > * What disks are allowed to fail? My understanding of a raid 10 is > like that > > disks = {a, b, c, d} > > raid0( raid1(a, b), raid1(c, d) ) > > This way (a XOR b) AND (c XOR d) are allowed to fail without the raid > to fail (either a or b and c or d are allowed to fail) > > How is that with a btrfs raid 10? A BTRFS raid10 can only sustain one disk failure. Ideally, it would work like you show, but in practice it doesn't. >>> I'm a little bit concerned right now. I migrated my 4 disk raid6 to >>> raid10 because of the known raid5/6 problems. I assumed that btrfs >>> raid10 can handle 2 disk failures as longs as they occur in different >>> stripes. >>> Could you please point out why it cannot sustain 2 disk failures? >> >> Conventional raid10 has a fixed assignment of which drives are >> mirrored pairs, and this doesn't happen with Btrfs at the device level >> but rather the chunk level. And a chunk stripe number is not fixed to >> a particular device, therefore it's possible a device will have more >> than one chunk stripe number. So what that means is the loss of two >> devices has a pretty decent chance of resulting in the loss of both >> copies of a chunk, whereas conventional RAID 10 must lose both >> mirrored pairs for data loss to happen. >> >> With very cursory testing what I've found is btrfs-progs establishes >> an initial stripe number to device mapping that's different than the >> kernel code. The kernel code appears to be pretty consistent so long >> as the member devices are identically sized. So it's probably not an >> unfixable problem, but the effect is that right now Btrfs raid10 >> profile is more like raid0+1. >> >> You can use >> $ sudo btrfs insp dump-tr -t 3 /dev/ >> >> That will dump the chunk tree, and you can see if any device has more >> than one chunk stripe number associated with it. >> >> > Huh, that makes sense. 
> That probably should be fixed :) > > Given your advised command (extended it a bit for readability): > # btrfs insp dump-tr -t 3 /dev/mapper/luks-2.1 | grep "stripe " | awk '{ > print $1" "$2" "$3" "$4 }' | sort -u > > I get: > stripe 0 devid 1 > stripe 0 devid 4 > stripe 1 devid 2 > stripe 1 devid 3 > stripe 1 devid 4 > stripe 2 devid 1 > stripe 2 devid 2 > stripe 2 devid 3 > stripe 3 devid 1 > stripe 3 devid 2 > stripe 3 devid 3 > stripe 3 devid 4 > > Now I'm even more concerned! Uhh yeah, this is a four device raid10? I'm a little confused why it's not consistently showing four stripes per chunk, which would mean the same number of stripe 0's as stripe 3's. I don't know what that's about. A full balance might make the mapping consistent. > That said, btrfs shouldn't be used for other than raid1 as every other > raid level has serious problems or at least doesn't work as the expected > raid level (in terms of failure recovery). Well, raid1 also only tolerates a single device failure. There is no n-way raid1. -- Chris Murphy -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
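For reference, the "full balance" suggested here is just a balance with no filters, which rewrites every chunk and therefore also redoes the stripe-to-device assignment. A rough sketch (mount point is an example; recent btrfs-progs ask for the --full-balance flag to confirm an unfiltered balance):

# btrfs balance start --full-balance /mnt
# btrfs balance status /mnt                # from a second shell, to watch progress
# btrfs insp dump-tr -t 3 /dev/mapper/luks-2.1 | grep "stripe " | awk '{ print $1" "$2" "$3" "$4 }' | sort -u

Whether the resulting mapping is actually pairwise consistent is exactly what the re-check in the last step is for; as discussed above, there is no guarantee.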
Re: Convert from RAID 5 to 10
On 29.11.2016 23:52, Chris Murphy wrote: > On Tue, Nov 29, 2016 at 3:34 PM, Wilson Meierwrote: >> On 29.11.2016 18:54, Austin S. Hemmelgarn wrote: >>> On 2016-11-29 12:20, Florian Lindner wrote: Hello, I have 4 harddisks with 3TB capacity each. They are all used in a btrfs RAID 5. It has come to my attention, that there seem to be major flaws in btrfs' raid 5 implementation. Because of that, I want to convert the the raid 5 to a raid 10 and I have several questions. * Is that possible as an online conversion? >>> Yes, as long as you have a complete array to begin with (converting from >>> a degraded raid5/6 array has the same issues as rebuilding a degraded >>> raid5/6 array). * Since my effective capacity will shrink during conversions, does btrfs check if there is enough free capacity to convert? As you see below, right now it's probably too full, but I'm going to delete some stuff. >>> No, you'll have to do the math yourself. This would be a great project >>> idea to place on the wiki though. * I understand the command to convert is btrfs balance start -dconvert=raid10 -mconvert=raid10 /mnt Correct? >>> Yes, but I would personally convert first metadata then data. The >>> raid10 profile gets better performance than raid5, so converting the >>> metadata first (by issuing a balance just covering the metadata) should >>> speed up the data conversion a bit). * What disks are allowed to fail? My understanding of a raid 10 is like that disks = {a, b, c, d} raid0( raid1(a, b), raid1(c, d) ) This way (a XOR b) AND (c XOR d) are allowed to fail without the raid to fail (either a or b and c or d are allowed to fail) How is that with a btrfs raid 10? >>> A BTRFS raid10 can only sustain one disk failure. Ideally, it would >>> work like you show, but in practice it doesn't. >> I'm a little bit concerned right now. I migrated my 4 disk raid6 to >> raid10 because of the known raid5/6 problems. I assumed that btrfs >> raid10 can handle 2 disk failures as longs as they occur in different >> stripes. >> Could you please point out why it cannot sustain 2 disk failures? > > Conventional raid10 has a fixed assignment of which drives are > mirrored pairs, and this doesn't happen with Btrfs at the device level > but rather the chunk level. And a chunk stripe number is not fixed to > a particular device, therefore it's possible a device will have more > than one chunk stripe number. So what that means is the loss of two > devices has a pretty decent chance of resulting in the loss of both > copies of a chunk, whereas conventional RAID 10 must lose both > mirrored pairs for data loss to happen. > > With very cursory testing what I've found is btrfs-progs establishes > an initial stripe number to device mapping that's different than the > kernel code. The kernel code appears to be pretty consistent so long > as the member devices are identically sized. So it's probably not an > unfixable problem, but the effect is that right now Btrfs raid10 > profile is more like raid0+1. > > You can use > $ sudo btrfs insp dump-tr -t 3 /dev/ > > That will dump the chunk tree, and you can see if any device has more > than one chunk stripe number associated with it. > > Huh, that makes sense. 
That probably should be fixed :)

Given your advised command (extended it a bit for readability):

# btrfs insp dump-tr -t 3 /dev/mapper/luks-2.1 | grep "stripe " | awk '{ print $1" "$2" "$3" "$4 }' | sort -u

I get:

stripe 0 devid 1
stripe 0 devid 4
stripe 1 devid 2
stripe 1 devid 3
stripe 1 devid 4
stripe 2 devid 1
stripe 2 devid 2
stripe 2 devid 3
stripe 3 devid 1
stripe 3 devid 2
stripe 3 devid 3
stripe 3 devid 4

Now I'm even more concerned!

That said, btrfs shouldn't be used for other than raid1 as every other raid level has serious problems or at least doesn't work as the expected raid level (in terms of failure recovery). -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Convert from RAID 5 to 10
On Tue, Nov 29, 2016 at 3:34 PM, Wilson Meierwrote: > On 29.11.2016 18:54, Austin S. Hemmelgarn wrote: >> On 2016-11-29 12:20, Florian Lindner wrote: >>> Hello, >>> >>> I have 4 harddisks with 3TB capacity each. They are all used in a >>> btrfs RAID 5. It has come to my attention, that there >>> seem to be major flaws in btrfs' raid 5 implementation. Because of >>> that, I want to convert the the raid 5 to a raid 10 >>> and I have several questions. >>> >>> * Is that possible as an online conversion? >> Yes, as long as you have a complete array to begin with (converting from >> a degraded raid5/6 array has the same issues as rebuilding a degraded >> raid5/6 array). >>> >>> * Since my effective capacity will shrink during conversions, does >>> btrfs check if there is enough free capacity to >>> convert? As you see below, right now it's probably too full, but I'm >>> going to delete some stuff. >> No, you'll have to do the math yourself. This would be a great project >> idea to place on the wiki though. >>> >>> * I understand the command to convert is >>> >>> btrfs balance start -dconvert=raid10 -mconvert=raid10 /mnt >>> >>> Correct? >> Yes, but I would personally convert first metadata then data. The >> raid10 profile gets better performance than raid5, so converting the >> metadata first (by issuing a balance just covering the metadata) should >> speed up the data conversion a bit). >>> >>> * What disks are allowed to fail? My understanding of a raid 10 is >>> like that >>> >>> disks = {a, b, c, d} >>> >>> raid0( raid1(a, b), raid1(c, d) ) >>> >>> This way (a XOR b) AND (c XOR d) are allowed to fail without the raid >>> to fail (either a or b and c or d are allowed to fail) >>> >>> How is that with a btrfs raid 10? >> A BTRFS raid10 can only sustain one disk failure. Ideally, it would >> work like you show, but in practice it doesn't. > I'm a little bit concerned right now. I migrated my 4 disk raid6 to > raid10 because of the known raid5/6 problems. I assumed that btrfs > raid10 can handle 2 disk failures as longs as they occur in different > stripes. > Could you please point out why it cannot sustain 2 disk failures? Conventional raid10 has a fixed assignment of which drives are mirrored pairs, and this doesn't happen with Btrfs at the device level but rather the chunk level. And a chunk stripe number is not fixed to a particular device, therefore it's possible a device will have more than one chunk stripe number. So what that means is the loss of two devices has a pretty decent chance of resulting in the loss of both copies of a chunk, whereas conventional RAID 10 must lose both mirrored pairs for data loss to happen. With very cursory testing what I've found is btrfs-progs establishes an initial stripe number to device mapping that's different than the kernel code. The kernel code appears to be pretty consistent so long as the member devices are identically sized. So it's probably not an unfixable problem, but the effect is that right now Btrfs raid10 profile is more like raid0+1. You can use $ sudo btrfs insp dump-tr -t 3 /dev/ That will dump the chunk tree, and you can see if any device has more than one chunk stripe number associated with it. -- Chris Murphy -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
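Spelled out, the abbreviated command above is btrfs inspect-internal dump-tree (tree 3 is the chunk tree). One way to see at a glance whether any device appears under more than one stripe index, building on the pipeline used elsewhere in this thread (the device path is only an example):

# btrfs inspect-internal dump-tree -t 3 /dev/sda4 | grep "stripe " | awk '{ print $1" "$2" "$3" "$4 }' | sort | uniq -c

Each output line then shows how many chunks map a given stripe index to a given devid; with a conventional, fixed-pairing raid10 you would expect each devid to show up under a single stripe index only.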
Re: Convert from RAID 5 to 10
On 29.11.2016 18:54, Austin S. Hemmelgarn wrote: > On 2016-11-29 12:20, Florian Lindner wrote: >> Hello, >> >> I have 4 harddisks with 3TB capacity each. They are all used in a >> btrfs RAID 5. It has come to my attention, that there >> seem to be major flaws in btrfs' raid 5 implementation. Because of >> that, I want to convert the the raid 5 to a raid 10 >> and I have several questions. >> >> * Is that possible as an online conversion? > Yes, as long as you have a complete array to begin with (converting from > a degraded raid5/6 array has the same issues as rebuilding a degraded > raid5/6 array). >> >> * Since my effective capacity will shrink during conversions, does >> btrfs check if there is enough free capacity to >> convert? As you see below, right now it's probably too full, but I'm >> going to delete some stuff. > No, you'll have to do the math yourself. This would be a great project > idea to place on the wiki though. >> >> * I understand the command to convert is >> >> btrfs balance start -dconvert=raid10 -mconvert=raid10 /mnt >> >> Correct? > Yes, but I would personally convert first metadata then data. The > raid10 profile gets better performance than raid5, so converting the > metadata first (by issuing a balance just covering the metadata) should > speed up the data conversion a bit). >> >> * What disks are allowed to fail? My understanding of a raid 10 is >> like that >> >> disks = {a, b, c, d} >> >> raid0( raid1(a, b), raid1(c, d) ) >> >> This way (a XOR b) AND (c XOR d) are allowed to fail without the raid >> to fail (either a or b and c or d are allowed to fail) >> >> How is that with a btrfs raid 10? > A BTRFS raid10 can only sustain one disk failure. Ideally, it would > work like you show, but in practice it doesn't. I'm a little bit concerned right now. I migrated my 4 disk raid6 to raid10 because of the known raid5/6 problems. I assumed that btrfs raid10 can handle 2 disk failures as longs as they occur in different stripes. Could you please point out why it cannot sustain 2 disk failures? Thanks >> >> * Any other advice? ;-) > You'll actually get significantly better performance with no loss of > data safety by running BTRFS in raid1 mode on top of two RAID0 volumes > (LVM/MD/hardware doesn't matter much). I do this myself and see roughly > 10-20% improved performance on average with my workloads. > > If you do decide to do this, it's theoretically possible to do so > online, but it's kind of tricky, so I won't post any instructions for > that here unless someone asks for them. >> >> Thanks a lot, >> >> Florian >> >> >> Some information of my filesystem: >> >> # btrfs filesystem show / >> Label: 'data' uuid: 57e5b9e9-01ae-4f9e-8a3d-9f42204d7005 >> Total devices 4 FS bytes used 7.57TiB >> devid1 size 2.72TiB used 2.72TiB path /dev/sda4 >> devid2 size 2.72TiB used 2.72TiB path /dev/sdb4 >> devid3 size 2.72TiB used 2.72TiB path /dev/sdc4 >> devid4 size 2.72TiB used 2.72TiB path /dev/sdd4 >> >> # btrfs filesystem df / >> Data, RAID5: total=8.14TiB, used=7.56TiB >> System, RAID5: total=96.00MiB, used=592.00KiB >> Metadata, RAID5: total=12.84GiB, used=11.06GiB >> GlobalReserve, single: total=512.00MiB, used=0.00B > Based on this output, you will need to delete some data before you can > convert to raid10. With 4 2.72TiB drives, you're looking at roughly > 5.44TiB of usable space, so you're probably going to have to delete at > least 2-3TiB of data from this filesystem before converting. 
> > If you're not already using transparent compression, it could probably > help some with this, but it likely won't save you more than a few > hundred GB unless you are storing lots of data that compresses very well. >> >> # df -h >> Filesystem Size Used Avail Use% Mounted on >> >> /dev/sda411T 7.6T 597G 93% / > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
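Since transparent compression keeps coming up here as a way to claw back space, a small sketch of how it is typically enabled (mount point, path and the choice of zlib are just examples; the mount option only affects newly written data, and defragmenting with -c will break reflink sharing with existing snapshots):

# mount -o remount,compress=zlib /mnt
# btrfs filesystem defragment -r -czlib /mnt/some/large/directory    # rewrite existing files compressed

For data that does not compress at all (already-compressed media, for instance) the savings will be close to zero, which is why the advice above hedges at "a few hundred GB".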
Re: Convert from RAID 5 to 10
On 2016-11-29 14:03, Lionel Bouton wrote: Hi, Le 29/11/2016 à 18:20, Florian Lindner a écrit : [...] * Any other advice? ;-) Don't rely on RAID too much... The degraded mode is unstable even for RAID10: you can corrupt data simply by writing to a degraded RAID10. I could reliably reproduce this on a 6 devices RAID10 BTRFS filesystem with a missing device. It affected even a 4.8.4 kernel where our PostgreSQL clusters got frequent write errors (on the fs itself but not the 5 working devices) and managed to corrupt their data. Have backups, you probably will need them. With Btrfs RAID If you have a failing device, replace it early (monitor the devices and don't wait for them to fail if you get transient errors or see worrying SMART values). If you have a failed device, don't actively use the filesystem in degraded mode. Replace or delete/add before writing to the filesystem again. This is an excellent point I didn't think of. If you don't have some way you can monitor things, don't trust RAID (not just BTRFS raid modes, but any RAID like system in general). The only reason I'm willing to trust it is because I have really good monitoring set up (SMART status on the disks + daily scrubs + hourly event counter checks on the FS + watching for changes to filesystem flags + a couple of other things) which will e-mail me the moment something starts to go bad (and I've jumped through hoops to get the mailing to work under almost any circumstances as long as userspace still exists and has network access). I can confirm though that things work well with BTRFS raid1 mode for at least the following: * Basic, mostly static, network services (DHCP server, DNS relay, web server serving static content, very low volume postfix installation, etc). * Moderate disk usage in very sequential usage patterns (BOINC applications in my case, but almost anything replacing files or appending in reasonably sized chunks semi-regularly falls into this). * Infrequent typical usage for software builds (I run Gentoo, so system updates = building software, and I've never had any issues with this (at least, not any issues because of BTRFS)). * Bulk sequential streaming of data (stuff like multimedia recordings). In all cases except the last (which I've only had some limited recent experience with), I've had BTRFS raid1 mode filesystems survive just fine through: * 3 bad PSU's (common case for this is that you see filesystem and storage device errors tracing down to the disks at rates proportionate to the overall load on the system) * 7 different storage devices going bad (1 catastrophic mechanical failure, 1 connector failure (poor soldering job for the connector), 2 disk controller failures, and 3 media failures) * 2 intermittently bad storage controllers * 100+ kernel panics/crashes All with no issues with data corruption (there was corruption, but BTRFS safely handled all of it and fixed it, and actually helped me diagnose two of the bad PSU's and one of the bad storage controllers). 90% of the reason it's survived all this though is because of the monitoring I have in place which let me track down exactly what was wrong and fix it before it became an issue. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
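As a very rough sketch of the kind of monitoring described above (the mount point, log path, schedule and working local mail delivery are assumptions about the setup, not anything BTRFS ships):

#!/bin/sh
# run periodically from cron; mails root if any btrfs device error counter is non-zero
MNT=/data
btrfs scrub start -Bd "$MNT" > /var/log/btrfs-scrub.log 2>&1
if btrfs device stats "$MNT" | grep -vqw 0; then
    btrfs device stats "$MNT" | mail -s "btrfs errors on $MNT" root
fi

btrfs device stats reports cumulative per-device counters (write/read/flush I/O errors, corruption and generation errors), so anything non-zero is worth a look even if the filesystem still mounts fine.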
Re: Convert from RAID 5 to 10
Hi, On 29/11/2016 at 18:20, Florian Lindner wrote: > [...] > > * Any other advice? ;-) Don't rely on RAID too much... The degraded mode is unstable even for RAID10: you can corrupt data simply by writing to a degraded RAID10. I could reliably reproduce this on a 6-device RAID10 BTRFS filesystem with a missing device. It affected even a 4.8.4 kernel, where our PostgreSQL clusters got frequent write errors (on the fs itself, but not on the 5 working devices) and managed to corrupt their data. Have backups, you probably will need them. With Btrfs RAID: if you have a failing device, replace it early (monitor the devices and don't wait for them to fail if you get transient errors or see worrying SMART values). If you have a failed device, don't actively use the filesystem in degraded mode. Replace it, or delete/add, before writing to the filesystem again. Best regards, Lionel -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
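As a sketch of the "replace it early" advice (device names and mount point are examples; -r limits reads from the flaky source device to blocks that have no healthy mirror elsewhere):

# btrfs replace start -r /dev/sdd4 /dev/sde4 /mnt
# btrfs replace status /mnt

Doing this while the old device is still at least partially readable is much safer than waiting for it to vanish and then rebuilding purely from the remaining mirrors.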
Re: Convert from RAID 5 to 10
On 2016-11-29 12:20, Florian Lindner wrote: Hello, I have 4 harddisks with 3TB capacity each. They are all used in a btrfs RAID 5. It has come to my attention, that there seem to be major flaws in btrfs' raid 5 implementation. Because of that, I want to convert the the raid 5 to a raid 10 and I have several questions. * Is that possible as an online conversion? Yes, as long as you have a complete array to begin with (converting from a degraded raid5/6 array has the same issues as rebuilding a degraded raid5/6 array). * Since my effective capacity will shrink during conversions, does btrfs check if there is enough free capacity to convert? As you see below, right now it's probably too full, but I'm going to delete some stuff. No, you'll have to do the math yourself. This would be a great project idea to place on the wiki though. * I understand the command to convert is btrfs balance start -dconvert=raid10 -mconvert=raid10 /mnt Correct? Yes, but I would personally convert first metadata then data. The raid10 profile gets better performance than raid5, so converting the metadata first (by issuing a balance just covering the metadata) should speed up the data conversion a bit). * What disks are allowed to fail? My understanding of a raid 10 is like that disks = {a, b, c, d} raid0( raid1(a, b), raid1(c, d) ) This way (a XOR b) AND (c XOR d) are allowed to fail without the raid to fail (either a or b and c or d are allowed to fail) How is that with a btrfs raid 10? A BTRFS raid10 can only sustain one disk failure. Ideally, it would work like you show, but in practice it doesn't. * Any other advice? ;-) You'll actually get significantly better performance with no loss of data safety by running BTRFS in raid1 mode on top of two RAID0 volumes (LVM/MD/hardware doesn't matter much). I do this myself and see roughly 10-20% improved performance on average with my workloads. If you do decide to do this, it's theoretically possible to do so online, but it's kind of tricky, so I won't post any instructions for that here unless someone asks for them. Thanks a lot, Florian Some information of my filesystem: # btrfs filesystem show / Label: 'data' uuid: 57e5b9e9-01ae-4f9e-8a3d-9f42204d7005 Total devices 4 FS bytes used 7.57TiB devid1 size 2.72TiB used 2.72TiB path /dev/sda4 devid2 size 2.72TiB used 2.72TiB path /dev/sdb4 devid3 size 2.72TiB used 2.72TiB path /dev/sdc4 devid4 size 2.72TiB used 2.72TiB path /dev/sdd4 # btrfs filesystem df / Data, RAID5: total=8.14TiB, used=7.56TiB System, RAID5: total=96.00MiB, used=592.00KiB Metadata, RAID5: total=12.84GiB, used=11.06GiB GlobalReserve, single: total=512.00MiB, used=0.00B Based on this output, you will need to delete some data before you can convert to raid10. With 4 2.72TiB drives, you're looking at roughly 5.44TiB of usable space, so you're probably going to have to delete at least 2-3TiB of data from this filesystem before converting. If you're not already using transparent compression, it could probably help some with this, but it likely won't save you more than a few hundred GB unless you are storing lots of data that compresses very well. # df -h Filesystem Size Used Avail Use% Mounted on /dev/sda411T 7.6T 597G 93% / -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
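As a sketch of the metadata-first ordering suggested above (mount point as in the original question; the soft filter is optional and merely skips chunks that already have the target profile if a conversion is re-run after an interruption):

# btrfs balance start -mconvert=raid10 /mnt
# btrfs balance start -dconvert=raid10,soft /mnt
# btrfs balance status /mnt                # from a second shell, to watch progress

On the capacity side the arithmetic is simple: raid10 keeps two copies of everything, so 4 x 2.72 TiB yields roughly 5.44 TiB usable; with about 7.56 TiB of data currently on the filesystem, something over 2 TiB has to be deleted (plus some slack for the balance to work in) before the conversion can complete.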