Re: Convert from RAID 5 to 10
Thanks a lot to all who replied! I learned a lot from this thread. However, what I learned has made me even more doubtful that a btrfs RAID is the right choice for me at this moment. There seems to be much uncertainty about the real state (experimental, stable, production-ready, mature, ...) of btrfs' RAID implementation, even on this very well informed list. I really want the checksumming and auto-repair features of btrfs; that was the original reason why I didn't go with dmraid in the first place. So there are basically two options left: btrfs with a raid 10, or ZFS with some raid 10 or raid 5 equivalent. ZFS seems to be a nice, mature solution, but I prefer to use something native to Linux.

Best Regards,
Florian

Am 29.11.2016 um 18:20 schrieb Florian Lindner:
> Hello,
>
> I have 4 harddisks with 3TB capacity each. They are all used in a btrfs RAID 5. It has come to my attention that there seem to be major flaws in btrfs' RAID 5 implementation. Because of that, I want to convert the raid 5 to a raid 10, and I have several questions.
>
> * Is that possible as an online conversion?
>
> * Since my effective capacity will shrink during conversion, does btrfs check whether there is enough free capacity to convert? As you see below, right now it's probably too full, but I'm going to delete some stuff.
>
> * I understand the command to convert is
>
>   btrfs balance start -dconvert=raid10 -mconvert=raid10 /mnt
>
>   Correct?
>
> * Which disks are allowed to fail? My understanding of a raid 10 is like this:
>
>   disks = {a, b, c, d}
>
>   raid0( raid1(a, b), raid1(c, d) )
>
>   This way (a XOR b) AND (c XOR d) are allowed to fail without the raid failing (either a or b, and c or d, may fail).
>
>   How is that with a btrfs raid 10?
>
> * Any other advice? ;-)
>
> Thanks a lot,
>
> Florian
>
> Some information about my filesystem:
>
> # btrfs filesystem show /
> Label: 'data'  uuid: 57e5b9e9-01ae-4f9e-8a3d-9f42204d7005
>         Total devices 4 FS bytes used 7.57TiB
>         devid 1 size 2.72TiB used 2.72TiB path /dev/sda4
>         devid 2 size 2.72TiB used 2.72TiB path /dev/sdb4
>         devid 3 size 2.72TiB used 2.72TiB path /dev/sdc4
>         devid 4 size 2.72TiB used 2.72TiB path /dev/sdd4
>
> # btrfs filesystem df /
> Data, RAID5: total=8.14TiB, used=7.56TiB
> System, RAID5: total=96.00MiB, used=592.00KiB
> Metadata, RAID5: total=12.84GiB, used=11.06GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> # df -h
> Filesystem  Size  Used  Avail  Use%  Mounted on
> /dev/sda4    11T  7.6T   597G   93%  /
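For reference, a minimal sketch of the conversion sequence asked about in the quoted mail might look like the following, assuming the filesystem is mounted at /mnt as in the question. As far as I know btrfs does not pre-compute the space requirement; a conversion balance simply aborts with ENOSPC if it cannot allocate new chunks, so free up space first.

# free up space first, then check how much is unallocated on each device
btrfs filesystem usage /mnt

# convert data and metadata chunks to raid10 (online, filesystem stays mounted)
btrfs balance start -dconvert=raid10 -mconvert=raid10 /mnt

# depending on the btrfs-progs version, the System chunks may need an explicit conversion:
# btrfs balance start -sconvert=raid10 -f /mnt

# watch progress from another shell and verify the result
btrfs balance status /mnt
btrfs filesystem df /mnt    # all lines except GlobalReserve should now say RAID10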
Re: Convert from RAID 5 to 10
FYI, there is an old saying in embedded circles, evolved from Arthur C. Clarke's "Any sufficiently advanced technology is indistinguishable from magic." The engineering version states: "Any sufficiently advanced incompetence is indistinguishable from malice."

Also, I'll quote you on the throwing-under-the-bus thing :) (I actually like that justification)

On 1 December 2016 at 17:28, Chris Murphy wrote:
> On Wed, Nov 30, 2016 at 1:29 PM, Tomasz Kusmierz wrote:
>
>> Please, I beg you, add another column to the man page and wiki stating
>> clearly how many device losses every profile can withstand. I frequently
>> have to explain how btrfs profiles work and show quotes from this
>> mailing list because "Dunning-Kruger effect victims" keep popping up
>> with statements like "in btrfs raid10 with 8 drives you can lose 4
>> drives" ... I seriously beg you guys, my beating stick is half broken
>> by now.
>
> You need a new stick. It's called the ad hominem attack. When stupid
> people say stupid things, the dispute is not about the facts or
> opinions in the argument itself, but rather the person involved. There
> is the possibility this is more than stupidity; it really borders on
> maliciousness. Any ethical code of conduct for a list will accept ad
> hominem attacks over the willful dissemination of provably wrong
> information. When stupid assholes throw users under the bus with
> provably wrong (and bad) advice, it becomes something of an obligation
> to resort to name calling.
>
> Of course, I'd also like the wiki to clearly state that the only profile
> that tolerates more than one device loss is raid6, and to be very
> explicit about the manifestly wrong terminology used for Btrfs's raid10.
> That is a fairly egregious violation of common terminology and of the
> trust we're supposed to be developing, both in the usage of common terms
> and in Btrfs specifically.
>
> --
> Chris Murphy
Re: Convert from RAID 5 to 10
On Wed, Nov 30, 2016 at 1:29 PM, Tomasz Kusmierz wrote:
> Please, I beg you, add another column to the man page and wiki stating
> clearly how many device losses every profile can withstand. I frequently
> have to explain how btrfs profiles work and show quotes from this
> mailing list because "Dunning-Kruger effect victims" keep popping up
> with statements like "in btrfs raid10 with 8 drives you can lose 4
> drives" ... I seriously beg you guys, my beating stick is half broken
> by now.

You need a new stick. It's called the ad hominem attack. When stupid people say stupid things, the dispute is not about the facts or opinions in the argument itself, but rather the person involved. There is the possibility this is more than stupidity; it really borders on maliciousness. Any ethical code of conduct for a list will accept ad hominem attacks over the willful dissemination of provably wrong information. When stupid assholes throw users under the bus with provably wrong (and bad) advice, it becomes something of an obligation to resort to name calling.

Of course, I'd also like the wiki to clearly state that the only profile that tolerates more than one device loss is raid6, and to be very explicit about the manifestly wrong terminology used for Btrfs's raid10. That is a fairly egregious violation of common terminology and of the trust we're supposed to be developing, both in the usage of common terms and in Btrfs specifically.

--
Chris Murphy
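For quick reference, and not taken from the wiki (this is just a restatement of what is said in this thread and in the mkfs.btrfs profile table), the guaranteed number of device losses each profile can survive is:

profile   guaranteed device losses survivable
single    0
DUP       0  (two copies, but on the same device)
raid0     0
raid1     1
raid10    1  (despite the name; mirroring is per chunk, not per fixed device pair)
raid5     1
raid6     2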
Re: Convert from RAID 5 to 10
On Thursday, 1 December 2016, 10:37:13 CET, Wilson Meier wrote:
> The only thing I have asked for is to document the *known*
> problems/flaws/limitations of all raid profiles and link to them from
> the stability matrix.

+1

Does anyone mind if I ask for an account and start copy-pasting the relevant posts from this thread?

Niccolò Belli
Re: Convert from RAID 5 to 10
On 30/11/16 at 17:48, Austin S. Hemmelgarn wrote:
> On 2016-11-30 10:49, Wilson Meier wrote:
>> On 30/11/16 at 15:37, Austin S. Hemmelgarn wrote:
>>
>> Transferring this to a car analogy, just to make it a bit more funny:
>> the airbag (raid level, whatever) itself is ok, but the microcontroller
>> (general btrfs) which is responsible for inflating the airbag suffers
>> from some problems, sometimes doesn't inflate it, and the manufacturer
>> doesn't mention that fact.
>> From your point of view the airbag is ok. From my point of view -> don't
>> buy that car!!!
>> Don't you think that the fact that the life saver suffers problems should
>> be noted, and that every dependent component should point to that fact?
>> I think it should.
>> I'm not talking about performance issues, I'm talking about data loss.
>> Now the next one can throw in "Backups, always make backups!".
>> Sure, but backup is backup and raid is raid. Both have their own concerns.
> A better analogy for a car would be something along the lines of the
> radio working fine but the general wiring having issues that cause all
> the electronics in the car to stop working under certain circumstances.
> In that case, the radio itself is absolutely OK, but it suffers from
> issues caused directly by poor design elsewhere in the vehicle.

Ahm, no. You cannot swap a safety mechanism (raid) for a comfort feature (compression) and treat them as the same in terms of importance. It makes a serious difference whether you have an airbag that doesn't work properly or you just can't listen to music while you're driving into a wall. Anyway, we should stop this here. I'm not angry or anything like that :) . I would just like to be able to read such information about the storage I put my personal data (> 3 TB) on, in its official wiki.

> There are more places than the wiki to look for info about BTRFS (and
> this is the case with almost any piece of software, not just BTRFS;
> very few things have one central source for everything). I don't mean
> to sound unsympathetic, but given what you're saying, it's sounding
> more and more like you didn't look at anything beyond the wiki and
> should have checked other sources as well.

This is your assumption.

On 01/12/16 at 07:47, Duncan wrote:
> Austin S. Hemmelgarn posted on Wed, 30 Nov 2016 11:48:57 -0500 as excerpted:
>> On 2016-11-30 10:49, Wilson Meier wrote:
>>> Do you also have all the home users in mind who go on vacation (sometimes
>>> 3 weeks) and don't have a 24/7 support team to replace monitored disks
>>> which do report SMART errors?
>> Better than 90% of the people I know either shut down their systems when
>> they're going to be away for a long period of time, or like me have
>> ways to log in remotely and tell the FS to not use that disk anymore.
> https://btrfs.wiki.kernel.org/index.php/Getting_started has
> two warnings offset in red right in the first section:
> * If you have btrfs filesystems, run the latest kernel.

I do. Ok, not the very latest, but I'm always on the latest major version. Right now I have 4.8.4 and the very latest is 4.8.11.

> * You should keep and test backups of your data, and be prepared to use them.

I have daily backups.

> As to the three weeks vacation thing... And "daily use" != "three
> weeks without physical access to something you're going to actually be
> relying on for parts of those three weeks".

Maybe I have my own mail server and ownCloud to serve files to my family? Maybe I'm out of the country somewhere with no internet access?
I will not comment on this any further as it leads us nowhere. In general I think that this discussion is taking a completely wrong direction. The only thing I have asked for is to document the *known* problems/flaws/limitations of all raid profiles and link to them from the stability matrix.

Regarding raid10: even if one knows that btrfs handles things at the chunk level, one would assume that the code is written in a way that puts the copies on different stripes. Otherwise raid10 ***can*** become pretty useless in terms of data redundancy, and 2 x raid1 with LVM should be considered as a replacement. This is a serious thing and should be documented. If this is documented somewhere then please point me to it, as I cannot find a word about it anywhere.

Cheers,
Wilson
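For anyone who wants to check where the copies of their raid10 chunks actually ended up, the chunk tree can be inspected directly. A sketch (the device path is an example; on older btrfs-progs the same dump is available as btrfs-debug-tree, and some versions want the tree given by id, -t 3, instead of by name):

# dump only the chunk tree and keep the chunk/stripe lines
btrfs inspect-internal dump-tree -t chunk /dev/sda4 | grep -E 'CHUNK_ITEM|stripe '
# every CHUNK_ITEM is followed by its stripe lines; the devid on each stripe line
# shows which disk holds that copy, so you can see how the mirrors are spread out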
Re: Convert from RAID 5 to 10
Austin S. Hemmelgarn posted on Wed, 30 Nov 2016 11:48:57 -0500 as excerpted:

> On 2016-11-30 10:49, Wilson Meier wrote:
>> Do you also have all the home users in mind who go on vacation (sometimes
>> 3 weeks) and don't have a 24/7 support team to replace monitored disks
>> which do report SMART errors?
> Better than 90% of the people I know either shut down their systems when
> they're going to be away for a long period of time, or like me have ways
> to log in remotely and tell the FS to not use that disk anymore.

https://btrfs.wiki.kernel.org/index.php/Getting_started has two warnings offset in red right in the first section:

* If you have btrfs filesystems, run the latest kernel.
* You should keep and test backups of your data, and be prepared to use them.

It also says: "The status of btrfs was experimental for a long time, but the core functionality is considered good enough for daily use. [...] While many people use it reliably, there are still problems being found."

Were I editing it, that or something very similar would be on the main landing page, and a general status announcement would be on the feature and profile status page. However, it IS on the wiki.

As to the three weeks vacation thing... "Daily use" != "three weeks without physical access to something you're going to actually be relying on for parts of those three weeks". And "keep and test backups [and] be prepared to use them" != "go away for three weeks and leave yourself unable to restore from those backups, for something you're relying on over those three weeks", either.

As Austin says, many home users actually shut down their systems when they're going to be away, because they are /not/ going to be using them in that period, and *certainly* *don't* actually /rely/ on them. And most of those that /do/ actually rely on them have learned, or will learn, possibly the hard way, that "things happen", and they need either someone who can be called to poke the systems if necessary, or alternative plans in case what they can't access at the moment fails.

Meanwhile, arguably those who /are/ relying on their filesystems to be up and running for extended periods while they can't actually poke (or have someone else poke) the hardware if necessary, shouldn't be running btrfs as yet in the first place, as it's simply not stable and mature enough for that. And people who really care about it will have done the research to know the stability status. And people who don't... well, by not doing that research they've effectively defined it as not that important in their life; other things have taken priority. So if btrfs fails on them and they didn't know its stability status, it can only be because it wasn't that important to them to know, so no big deal.

(I know for certain that before /I/ switched to btrfs, I scoured both the wiki and the manpages, as well as reading a number of articles on btrfs, and then still posted to this list a number of questions I had remaining after doing all that, and got answers I read as well, before I actually did my switch. That's because it was my data at risk, data I place a high enough value on to want to know the risk at which I was placing it, and the best way to deal with various issues I could anticipate possibly happening, before they actually happened. And I actually did some of my own testing before final deployment as well, satisfying myself that I /could/ reasonably deal with various hardware and software disaster scenarios, before putting any real data at risk, as well.
Of course I don't expect everyone to do all that, but then I don't expect everyone to place the value in their data that I do in mine. Which is fine, as long as they're willing to live with the consequences of the priority they placed on appreciating and dealing appropriately with the risk to their data, based on the value their actions placed on it. If they're willing to risk the data because it's of no particular value to them anyway, well then, no such preliminary research and testing is required. Indeed, it would be stupid, because they surely have more important and higher priority things to deal with.)

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
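Since "keep and test backups, and be prepared to use them" is the recurring advice in this thread, here is a minimal btrfs-native way to do that (the paths and the backup target are hypothetical; any mechanism you actually test a restore from is equally valid):

# take a read-only snapshot and replicate it to a separate backup filesystem
btrfs subvolume snapshot -r /home /home/.snapshots/home-$(date +%F)
btrfs send /home/.snapshots/home-$(date +%F) | btrfs receive /mnt/backup/

# periodically confirm the backup copy is actually readable and its checksums are intact
btrfs scrub start -Bd /mnt/backup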
Re: Convert from RAID 5 to 10
On 30 November 2016 at 19:09, Chris Murphy wrote:
> On Wed, Nov 30, 2016 at 7:37 AM, Austin S. Hemmelgarn wrote:
>
>> The stability info could be improved, but _absolutely none_ of the things
>> mentioned as issues with raid1 are specific to raid1. And in general, in
>> the context of a feature stability matrix, 'OK' generally means that there
>> are no significant issues with that specific feature, and since none of the
>> issues outlined are specific to raid1, it does meet that description of 'OK'.
>
> Maybe the gotchas page needs a one or two liner for each profile's
> gotchas compared to what the profile leads the user into believing.
> The overriding gotcha with all Btrfs multiple device support is the
> lack of monitoring and notification other than kernel messages; and
> the raid10 actually being more like raid0+1 is, I think, certainly a
> gotcha. However, 'man mkfs.btrfs' contains a grid that very clearly
> states raid10 can only safely lose 1 device.
>
>> Looking at this another way, I've been using BTRFS on all my systems since
>> kernel 3.16 (I forget what exact vintage that is in regular years). I've
>> not had any data integrity or data loss issues as a result of BTRFS itself
>> since 3.19, and in just the past year I've had multiple raid1 profile
>> filesystems survive multiple hardware issues with near zero issues (with the
>> caveat that I had to re-balance after replacing devices to convert a few
>> single chunks to raid1), and that includes multiple disk failures and 2 bad
>> PSU's plus about a dozen (not BTRFS related) kernel panics and 4 unexpected
>> power loss events. I also have exhaustive monitoring, so I'm replacing bad
>> hardware early instead of waiting for it to actually fail.
>
> Possibly nothing aids predictably reliable storage stacks more than healthy
> doses of skepticism and awareness of all limitations. :-D
>
> --
> Chris Murphy

Please, I beg you, add another column to the man page and wiki stating clearly how many device losses every profile can withstand. I frequently have to explain how btrfs profiles work and show quotes from this mailing list because "Dunning-Kruger effect victims" keep popping up with statements like "in btrfs raid10 with 8 drives you can lose 4 drives" ... I seriously beg you guys, my beating stick is half broken by now.
Re: Convert from RAID 5 to 10
On Wednesday, 30 November 2016, 12:09:23 CET, Chris Murphy wrote:
> On Wed, Nov 30, 2016 at 7:37 AM, Austin S. Hemmelgarn wrote:
>> The stability info could be improved, but _absolutely none_ of the things
>> mentioned as issues with raid1 are specific to raid1. And in general, in
>> the context of a feature stability matrix, 'OK' generally means that there
>> are no significant issues with that specific feature, and since none of
>> the issues outlined are specific to raid1, it does meet that description
>> of 'OK'.
>
> Maybe the gotchas page needs a one or two liner for each profile's
> gotchas compared to what the profile leads the user into believing.
> The overriding gotcha with all Btrfs multiple device support is the
> lack of monitoring and notification other than kernel messages; and
> the raid10 actually being more like raid0+1 is, I think, certainly a
> gotcha. However, 'man mkfs.btrfs' contains a grid that very clearly
> states raid10 can only safely lose 1 device.

Wow, that manpage is quite a resource. The developers and documentation people have definitely improved the official BTRFS documentation.

Thanks,
--
Martin
Re: Convert from RAID 5 to 10
On Wed, Nov 30, 2016 at 7:37 AM, Austin S. Hemmelgarn wrote:
> The stability info could be improved, but _absolutely none_ of the things
> mentioned as issues with raid1 are specific to raid1. And in general, in
> the context of a feature stability matrix, 'OK' generally means that there
> are no significant issues with that specific feature, and since none of the
> issues outlined are specific to raid1, it does meet that description of 'OK'.

Maybe the gotchas page needs a one or two liner for each profile's gotchas compared to what the profile leads the user into believing. The overriding gotcha with all Btrfs multiple device support is the lack of monitoring and notification other than kernel messages; and the raid10 actually being more like raid0+1 is, I think, certainly a gotcha. However, 'man mkfs.btrfs' contains a grid that very clearly states raid10 can only safely lose 1 device.

> Looking at this another way, I've been using BTRFS on all my systems since
> kernel 3.16 (I forget what exact vintage that is in regular years). I've
> not had any data integrity or data loss issues as a result of BTRFS itself
> since 3.19, and in just the past year I've had multiple raid1 profile
> filesystems survive multiple hardware issues with near zero issues (with the
> caveat that I had to re-balance after replacing devices to convert a few
> single chunks to raid1), and that includes multiple disk failures and 2 bad
> PSU's plus about a dozen (not BTRFS related) kernel panics and 4 unexpected
> power loss events. I also have exhaustive monitoring, so I'm replacing bad
> hardware early instead of waiting for it to actually fail.

Possibly nothing aids predictably reliable storage stacks more than healthy doses of skepticism and awareness of all limitations. :-D

--
Chris Murphy
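On the monitoring gotcha: until proper notification exists, a crude self-built check is easy to put in cron. A sketch only (the mount point, mail recipient and schedule are placeholders); it relies on nothing beyond the existing per-device error counters and scrub:

#!/bin/sh
# hypothetical /etc/cron.daily/btrfs-check: mail root if any error counter is non-zero
MNT=/mnt/data

ERRS=$(btrfs device stats "$MNT" | grep -vE ' 0$')
[ -n "$ERRS" ] && echo "$ERRS" | mail -s "btrfs device errors on $(hostname)" root

# once a week, also scrub and mail the summary:
# btrfs scrub start -Bd "$MNT" 2>&1 | mail -s "btrfs scrub report for $MNT" root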
Re: Convert from RAID 5 to 10
On Wed, Nov 30, 2016 at 7:04 AM, Roman Mamedov wrote:
> On Wed, 30 Nov 2016 07:50:17 -0500
> Also I don't know what is particularly insane about copying a 4-8 GB file
> onto a storage array. I'd expect both disks to write at the same time (like
> they do in pretty much any other RAID1 system), not one-after-another,
> effectively slowing down the entire operation by as much as 2x in extreme
> cases.

I don't experience this behavior. Writes take the same amount of time to a single-profile volume as to a two-device raid1-profile volume. iotop reports 2x the write bandwidth when writing to the raid1 volume, which corresponds to simultaneous writes to both drives in the volume. It's also not an elaborate setup by any means: two laptop drives, each in a cheap USB 3.0 case using bus power only, connected to a USB 3.0 hub, in turn connected to an Intel NUC.

> Comparing to Ext4, that one appears to have the "errors=continue" behavior by
> default, the user has to explicitly request "errors=remount-ro", and I have
> never seen anyone use or recommend the third option of "errors=panic", which
> is basically the equivalent of the current Btrfs practice.

I think in the context of degradedness, it may be appropriate to mount degraded,ro by default rather than fail. But changing the default isn't enough for the root fs use case, because the mount command isn't even issued when udev's btrfs 'dev scan' fails to report back all devices as available. In this case there is a sort of "pre-check" before mounting is even attempted, and that is what fails.

Also, Btrfs has fatal_errors=panic and it's not the default; rather, we just get a mount failure. There really isn't anything quite like this in the mdadm/LVM + other filesystem world, where the array is active degraded and the filesystem mounts anyway; if it doesn't mount, it's because the array isn't active and doesn't even exist yet.

> Unplugging and replugging a SATA cable of a RAID1 member should never put
> your system under the risk of a massive filesystem corruption; you cannot
> say it absolutely doesn't with the current implementation.

I can't say it absolutely doesn't even with md. Of course it shouldn't, but users do report corruptions on all of the other fs lists (ext4, XFS, linux-raid) from time to time that are not the result of user error.

--
Chris Murphy
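For completeness, the manual path through a single-device failure on a btrfs raid1/raid10 volume currently looks roughly like this (a sketch; the device names and devid are made up, and for a degraded root filesystem the 'degraded' option also has to reach the initramfs, e.g. via rootflags=degraded, which per the udev issue above may still not be enough):

# 'degraded' is never assumed; it has to be given explicitly
mount -o degraded /dev/sdb4 /mnt

# replace the failed device; use its devid (from 'btrfs filesystem show') if the disk is gone
btrfs replace start -B 4 /dev/sde4 /mnt

# alternative: add a new device, then drop the missing one
# btrfs device add /dev/sde4 /mnt && btrfs device delete missing /mnt

# re-mirror any chunks that were written with a reduced profile while running degraded
btrfs balance start -dconvert=raid10,soft -mconvert=raid10,soft /mnt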
Re: Convert from RAID 5 to 10
On 2016-11-30 10:49, Wilson Meier wrote: Am 30/11/16 um 15:37 schrieb Austin S. Hemmelgarn: On 2016-11-30 08:12, Wilson Meier wrote: Am 30/11/16 um 11:41 schrieb Duncan: Wilson Meier posted on Wed, 30 Nov 2016 09:35:36 +0100 as excerpted: Am 30/11/16 um 09:06 schrieb Martin Steigerwald: Am Mittwoch, 30. November 2016, 10:38:08 CET schrieb Roman Mamedov: [snip] So the stability matrix would need to be updated not to recommend any kind of BTRFS RAID 1 at the moment? Actually I faced the BTRFS RAID 1 read only after first attempt of mounting it "degraded" just a short time ago. BTRFS still needs way more stability work it seems to me. I would say the matrix should be updated to not recommend any RAID Level as from the discussion it seems they all of them have flaws. To me RAID is broken if one cannot expect to recover from a device failure in a solid way as this is why RAID is used. Correct me if i'm wrong. Right now i'm making my thoughts about migrating to another FS and/or Hardware RAID. It should be noted that no list regular that I'm aware of anyway, would make any claims about btrfs being stable and mature either now or in the near-term future in any case. Rather to the contrary, as I generally put it, btrfs is still stabilizing and maturing, with backups one is willing to use (and as any admin of any worth would say, a backup that hasn't been tested usable isn't yet a backup; the job of creating the backup isn't done until that backup has been tested actually usable for recovery) still extremely strongly recommended. Similarly, keeping up with the list is recommended, as is staying relatively current on both the kernel and userspace (generally considered to be within the latest two kernel series of either current or LTS series kernels, and with a similarly versioned btrfs userspace). In that context, btrfs single-device and raid1 (and raid0 of course) are quite usable and as stable as btrfs in general is, that being stabilizing but not yet fully stable and mature, with raid10 being slightly less so and raid56 being much more experimental/unstable at this point. But that context never claims full stability even for the relatively stable raid1 and single device modes, and in fact anticipates that there may be times when recovery from the existing filesystem may not be practical, thus the recommendation to keep tested usable backups at the ready. Meanwhile, it remains relatively common on this list for those wondering about their btrfs on long-term-stale (not a typo) "enterprise" distros, or even debian-stale, to be actively steered away from btrfs, especially if they're not willing to update to something far more current than those distros often provide, because in general, the current stability status of btrfs is in conflict with the reason people generally choose to use that level of old and stale software in the first place -- they prioritize tried and tested to work, stable and mature, over the latest generally newer and flashier featured but sometimes not entirely stable, and btrfs at this point simply doesn't meet that sort of stability/ maturity expectations, nor is it likely to for some time (measured in years), due to all the reasons enumerated so well in the above thread. In that context, the stability status matrix on the wiki is already reasonably accurate, certainly so IMO, because "OK" in context means as OK as btrfs is in general, and btrfs itself remains still stabilizing, not fully stable and mature. 
If there IS an argument as to the accuracy of the raid0/1/10 OK status, I'd argue it's purely due to people not understanding the status of btrfs in general, and that if there's a general deficiency at all, it's in the lack of a general stability status paragraph on that page itself explaining all this, despite the fact that the main https:// btrfs.wiki.kernel.org landing page states quite plainly under stability status that btrfs remains under heavy development and that current kernels are strongly recommended. (Tho were I editing it, there'd certainly be a more prominent mention of keeping backups at the ready as well.) Hi Duncan, i understand your arguments but cannot fully agree. First of all, i'm not sticking with old stale versions of whatever as i try to keep my system up2date. My kernel is 4.8.4 (Gentoo) and btrfs-progs is 4.8.4. That being said, i'm quite aware of the heavy development status of btrfs but pointing the finger on the users saying that they don't fully understand the status of btrfs without giving the information on the wiki is in my opinion not the right way. Heavy development doesn't mean that features marked as ok are "not" or "mostly" ok in the context of overall btrfs stability. There is no indication on the wiki that raid1 or every other raid (except for raid5/6) suffers from the problems stated in this thread. The performance issues are inherent to BTRFS right now, and none of the other issues are likely to impact most regular
Re: Convert from RAID 5 to 10
Am Mittwoch, 30. November 2016, 16:49:59 CET schrieb Wilson Meier: > Am 30/11/16 um 15:37 schrieb Austin S. Hemmelgarn: > > On 2016-11-30 08:12, Wilson Meier wrote: > >> Am 30/11/16 um 11:41 schrieb Duncan: > >>> Wilson Meier posted on Wed, 30 Nov 2016 09:35:36 +0100 as excerpted: > Am 30/11/16 um 09:06 schrieb Martin Steigerwald: > > Am Mittwoch, 30. November 2016, 10:38:08 CET schrieb Roman Mamedov: […] > >> It is really disappointing to not have this information in the wiki > >> itself. This would have saved me, and i'm quite sure others too, a lot > >> of time. > >> Sorry for being a bit frustrated. > > I'm not angry or something like that :) . > I just would like to have the possibility to read such information about > the storage i put my personal data (> 3 TB) on its official wiki. Anyone can get an account on the wiki and add notes there, so feel free. You can even use footnotes or something like that. Maybe it would be good to add a paragraph there that features are related to one another, so while BTRFS RAID 1 for example might be quite okay, it depends on features that are still flaky. I for myself rely quite much on BTRFS RAID 1 with lzo compression and it seems to work okay for me. -- Martin -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Convert from RAID 5 to 10
I completely agree, the whole wiki status is simply *FRUSTRATING*. Niccolò Belli On mercoledì 30 novembre 2016 14:12:36 CET, Wilson Meier wrote: Am 30/11/16 um 11:41 schrieb Duncan: Wilson Meier posted on Wed, 30 Nov 2016 09:35:36 +0100 as excerpted: ... Hi Duncan, i understand your arguments but cannot fully agree. First of all, i'm not sticking with old stale versions of whatever as i try to keep my system up2date. My kernel is 4.8.4 (Gentoo) and btrfs-progs is 4.8.4. That being said, i'm quite aware of the heavy development status of btrfs but pointing the finger on the users saying that they don't fully understand the status of btrfs without giving the information on the wiki is in my opinion not the right way. Heavy development doesn't mean that features marked as ok are "not" or "mostly" ok in the context of overall btrfs stability. There is no indication on the wiki that raid1 or every other raid (except for raid5/6) suffers from the problems stated in this thread. If there are know problems then the stability matrix should point them out or link to a corresponding wiki entry otherwise one has to assume that the features marked as "ok" are in fact "ok". And yes, the overall btrfs stability should be put on the wiki. Just to give you a quick overview of my history with btrfs. I migrated away from MD Raid and ext4 to btrfs raid6 because of its CoW and checksum features at a time as raid6 was not considered fully stable but also not as badly broken. After a few months i had a disk failure and the raid could not recover. I looked at the wiki an the mailing list and noticed that raid6 has been marked as badly broken :( I was quite happy to have a backup. So i asked on the btrfs IRC channel (the wiki had no relevant information) if raid10 is usable or suffers from the same problems. The summary was "Yes it is usable and has no known problems". So i migrated to raid10. Now i know that raid10 (marked as ok) has also problems with 2 disk failures in different stripes and can in fact lead to data loss. I thought, hmm ok, i'll split my data and use raid1 (marked as ok). And again the mailing list states that raid1 has also problems in case of recovery. It is really disappointing to not have this information in the wiki itself. This would have saved me, and i'm quite sure others too, a lot of time. Sorry for being a bit frustrated. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Convert from RAID 5 to 10
Am 30/11/16 um 15:37 schrieb Austin S. Hemmelgarn: > On 2016-11-30 08:12, Wilson Meier wrote: >> Am 30/11/16 um 11:41 schrieb Duncan: >>> Wilson Meier posted on Wed, 30 Nov 2016 09:35:36 +0100 as excerpted: >>> Am 30/11/16 um 09:06 schrieb Martin Steigerwald: > Am Mittwoch, 30. November 2016, 10:38:08 CET schrieb Roman Mamedov: >> [snip] > So the stability matrix would need to be updated not to recommend any > kind of BTRFS RAID 1 at the moment? > > Actually I faced the BTRFS RAID 1 read only after first attempt of > mounting it "degraded" just a short time ago. > > BTRFS still needs way more stability work it seems to me. > I would say the matrix should be updated to not recommend any RAID Level as from the discussion it seems they all of them have flaws. To me RAID is broken if one cannot expect to recover from a device failure in a solid way as this is why RAID is used. Correct me if i'm wrong. Right now i'm making my thoughts about migrating to another FS and/or Hardware RAID. >>> It should be noted that no list regular that I'm aware of anyway, would >>> make any claims about btrfs being stable and mature either now or in >>> the >>> near-term future in any case. Rather to the contrary, as I >>> generally put >>> it, btrfs is still stabilizing and maturing, with backups one is >>> willing >>> to use (and as any admin of any worth would say, a backup that hasn't >>> been tested usable isn't yet a backup; the job of creating the backup >>> isn't done until that backup has been tested actually usable for >>> recovery) still extremely strongly recommended. Similarly, keeping up >>> with the list is recommended, as is staying relatively current on both >>> the kernel and userspace (generally considered to be within the latest >>> two kernel series of either current or LTS series kernels, and with a >>> similarly versioned btrfs userspace). >>> >>> In that context, btrfs single-device and raid1 (and raid0 of course) >>> are >>> quite usable and as stable as btrfs in general is, that being >>> stabilizing >>> but not yet fully stable and mature, with raid10 being slightly less so >>> and raid56 being much more experimental/unstable at this point. >>> >>> But that context never claims full stability even for the relatively >>> stable raid1 and single device modes, and in fact anticipates that >>> there >>> may be times when recovery from the existing filesystem may not be >>> practical, thus the recommendation to keep tested usable backups at the >>> ready. >>> >>> Meanwhile, it remains relatively common on this list for those >>> wondering >>> about their btrfs on long-term-stale (not a typo) "enterprise" distros, >>> or even debian-stale, to be actively steered away from btrfs, >>> especially >>> if they're not willing to update to something far more current than >>> those >>> distros often provide, because in general, the current stability status >>> of btrfs is in conflict with the reason people generally choose to use >>> that level of old and stale software in the first place -- they >>> prioritize tried and tested to work, stable and mature, over the latest >>> generally newer and flashier featured but sometimes not entirely >>> stable, >>> and btrfs at this point simply doesn't meet that sort of stability/ >>> maturity expectations, nor is it likely to for some time (measured in >>> years), due to all the reasons enumerated so well in the above thread. 
>>> >>> >>> In that context, the stability status matrix on the wiki is already >>> reasonably accurate, certainly so IMO, because "OK" in context means as >>> OK as btrfs is in general, and btrfs itself remains still stabilizing, >>> not fully stable and mature. >>> >>> If there IS an argument as to the accuracy of the raid0/1/10 OK status, >>> I'd argue it's purely due to people not understanding the status of >>> btrfs >>> in general, and that if there's a general deficiency at all, it's in >>> the >>> lack of a general stability status paragraph on that page itself >>> explaining all this, despite the fact that the main https:// >>> btrfs.wiki.kernel.org landing page states quite plainly under stability >>> status that btrfs remains under heavy development and that current >>> kernels are strongly recommended. (Tho were I editing it, there'd >>> certainly be a more prominent mention of keeping backups at the >>> ready as >>> well.) >>> >> Hi Duncan, >> >> i understand your arguments but cannot fully agree. >> First of all, i'm not sticking with old stale versions of whatever as i >> try to keep my system up2date. >> My kernel is 4.8.4 (Gentoo) and btrfs-progs is 4.8.4. >> That being said, i'm quite aware of the heavy development status of >> btrfs but pointing the finger on the users saying that they don't fully >> understand the status of btrfs without giving the information on the >> wiki is in my opinion not the right way. Heavy development doesn't mean >> that
Re: Convert from RAID 5 to 10
On 2016-11-30 09:04, Roman Mamedov wrote: On Wed, 30 Nov 2016 07:50:17 -0500 "Austin S. Hemmelgarn"wrote: *) Read performance is not optimized: all metadata is always read from the first device unless it has failed, data reads are supposedly balanced between devices per PID of the process reading. Better implementations dispatch reads per request to devices that are currently idle. Based on what I've seen, the metadata reads get balanced too. https://github.com/torvalds/linux/blob/v4.8/fs/btrfs/disk-io.c#L451 This starts from the mirror number 0 and tries others in an incrementing order, until succeeds. It appears that as long as the mirror with copy #0 is up and not corrupted, all reads will simply get satisfied from it. That's actually how all reads work, it's just that the PID selects what constitutes the 'first' copy. IIRC, that selection is doen by a lower layer. *) Write performance is not optimized, during long full bandwidth sequential writes it is common to see devices writing not in parallel, but with a long periods of just one device writing, then another. (Admittedly have been some time since I tested that). I've never seen this be an issue in practice, especially if you're using transparent compression (which caps extent size, and therefore I/O size to a given device, at 128k). I'm also sane enough that I'm not doing bulk streaming writes to traditional HDD's or fully saturating the bandwidth on my SSD's (you should be over-provisioning whenever possible). For a desktop user, unless you're doing real-time video recording at higher than HD resolution with high quality surround sound, this probably isn't going to hit you (and even then you should be recording to a temporary location with much faster write speeds (tmpfs or ext4 without a journal for example) because you'll likely get hit with fragmentation). I did not use compression while observing this; Compression doesn't make things parallel, but it does cause BTRFS to distribute the writes more evenly because it writes first one extent then the other, which in turn makes things much more efficient because you're not stalling as much waiting for the I/O queue to finish. It also means you have to write less overall to the disk, so on systems which can do LZO compression significantly faster than they can write to or read from the disk, it will generally improve performance all around. Also I don't know what is particularly insane about copying a 4-8 GB file onto a storage array. I'd expect both disks to write at the same time (like they do in pretty much any other RAID1 system), not one-after-another, effectively slowing down the entire operation by as much as 2x in extreme cases. I'm not talking 4-8GB files, I'm talking really big stuff at least an order of magnitude larger than that, stuff like filesystem images and big databases. On the only system I have where I have traditional hard disks (7200RPM consumer SATA3 drives connected to an LSI MPT2SAS HBA, about 80-100MB/s bulk write speed to a single disk), an 8GB copy from tmpfs is only in practice about 20% slower to BTRFS raid1 mode than to XFS on top of a DM-RAID RAID1 volume, and about 30% slower than the same with ext4. In both cases, this is actually about 50% faster than ZFS (which does prallelize reads and writes) in an equivalent configuration on the same hardware. 
Comparing all of that to single disk versions on the same hardware, I see roughly the same performance ratios between filesystems, and the same goes for running on the motherboard's SATA controller instead of the LSI HBA. In this case, I am using compression (and the data gets reasonable compression ratios), and I see both disks running at just below peak bandwidth, and based on tracing, most of the difference is in the metadata updates required to change the extents. I would love to see BTRFS properly parallelize writes and stripe reads sanely, but I seriously doubt it's going to have as much impact as you think, especially on systems with fast storage. As far as not mounting degraded by default, that's a conscious design choice that isn't going to change. There's a switch (adding 'degraded' to the mount options) to enable this behavior per-mount, so we're still on-par in that respect with LVM and MD, we just picked a different default. In this case, I actually feel it's a better default for most cases, because most regular users aren't doing exhaustive monitoring, and thus are not likely to notice the filesystem being mounted degraded until it's far too late. If the filesystem is degraded, then _something_ has happened that the user needs to know about, and until some sane monitoring solution is implemented, the easiest way to ensure this is to refuse to mount. The easiest is to write to dmesg and syslog, if a user doesn't monitor those either, it's their own fault; and the more user friendly one would be to still auto mount degraded, but
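If you want to see which of the two write behaviours debated above your own array shows during a long sequential write, something along these lines is enough (iostat comes from the sysstat package; the file size, mount point and device names are placeholders):

# write a large file while watching both members of the raid1/raid10 volume
dd if=/dev/zero of=/mnt/testfile bs=1M count=8192 conv=fdatasync &
iostat -x 1 sda sdb
# parallel writes show similar write throughput on both devices at the same time;
# long alternating bursts on one device, then the other, is the pattern Roman describes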
Re: Convert from RAID 5 to 10
On 2016-11-30 08:12, Wilson Meier wrote: Am 30/11/16 um 11:41 schrieb Duncan: Wilson Meier posted on Wed, 30 Nov 2016 09:35:36 +0100 as excerpted: Am 30/11/16 um 09:06 schrieb Martin Steigerwald: Am Mittwoch, 30. November 2016, 10:38:08 CET schrieb Roman Mamedov: [snip] So the stability matrix would need to be updated not to recommend any kind of BTRFS RAID 1 at the moment? Actually I faced the BTRFS RAID 1 read only after first attempt of mounting it "degraded" just a short time ago. BTRFS still needs way more stability work it seems to me. I would say the matrix should be updated to not recommend any RAID Level as from the discussion it seems they all of them have flaws. To me RAID is broken if one cannot expect to recover from a device failure in a solid way as this is why RAID is used. Correct me if i'm wrong. Right now i'm making my thoughts about migrating to another FS and/or Hardware RAID. It should be noted that no list regular that I'm aware of anyway, would make any claims about btrfs being stable and mature either now or in the near-term future in any case. Rather to the contrary, as I generally put it, btrfs is still stabilizing and maturing, with backups one is willing to use (and as any admin of any worth would say, a backup that hasn't been tested usable isn't yet a backup; the job of creating the backup isn't done until that backup has been tested actually usable for recovery) still extremely strongly recommended. Similarly, keeping up with the list is recommended, as is staying relatively current on both the kernel and userspace (generally considered to be within the latest two kernel series of either current or LTS series kernels, and with a similarly versioned btrfs userspace). In that context, btrfs single-device and raid1 (and raid0 of course) are quite usable and as stable as btrfs in general is, that being stabilizing but not yet fully stable and mature, with raid10 being slightly less so and raid56 being much more experimental/unstable at this point. But that context never claims full stability even for the relatively stable raid1 and single device modes, and in fact anticipates that there may be times when recovery from the existing filesystem may not be practical, thus the recommendation to keep tested usable backups at the ready. Meanwhile, it remains relatively common on this list for those wondering about their btrfs on long-term-stale (not a typo) "enterprise" distros, or even debian-stale, to be actively steered away from btrfs, especially if they're not willing to update to something far more current than those distros often provide, because in general, the current stability status of btrfs is in conflict with the reason people generally choose to use that level of old and stale software in the first place -- they prioritize tried and tested to work, stable and mature, over the latest generally newer and flashier featured but sometimes not entirely stable, and btrfs at this point simply doesn't meet that sort of stability/ maturity expectations, nor is it likely to for some time (measured in years), due to all the reasons enumerated so well in the above thread. In that context, the stability status matrix on the wiki is already reasonably accurate, certainly so IMO, because "OK" in context means as OK as btrfs is in general, and btrfs itself remains still stabilizing, not fully stable and mature. 
If there IS an argument as to the accuracy of the raid0/1/10 OK status, I'd argue it's purely due to people not understanding the status of btrfs in general, and that if there's a general deficiency at all, it's in the lack of a general stability status paragraph on that page itself explaining all this, despite the fact that the main https:// btrfs.wiki.kernel.org landing page states quite plainly under stability status that btrfs remains under heavy development and that current kernels are strongly recommended. (Tho were I editing it, there'd certainly be a more prominent mention of keeping backups at the ready as well.) Hi Duncan, i understand your arguments but cannot fully agree. First of all, i'm not sticking with old stale versions of whatever as i try to keep my system up2date. My kernel is 4.8.4 (Gentoo) and btrfs-progs is 4.8.4. That being said, i'm quite aware of the heavy development status of btrfs but pointing the finger on the users saying that they don't fully understand the status of btrfs without giving the information on the wiki is in my opinion not the right way. Heavy development doesn't mean that features marked as ok are "not" or "mostly" ok in the context of overall btrfs stability. There is no indication on the wiki that raid1 or every other raid (except for raid5/6) suffers from the problems stated in this thread. The performance issues are inherent to BTRFS right now, and none of the other issues are likely to impact most regular users. Most of the people who would be interested in the features of BTRFS also have existing
Re: Convert from RAID 5 to 10
On Wed, 30 Nov 2016 07:50:17 -0500 "Austin S. Hemmelgarn"wrote: > > *) Read performance is not optimized: all metadata is always read from the > > first device unless it has failed, data reads are supposedly balanced > > between > > devices per PID of the process reading. Better implementations dispatch > > reads > > per request to devices that are currently idle. > Based on what I've seen, the metadata reads get balanced too. https://github.com/torvalds/linux/blob/v4.8/fs/btrfs/disk-io.c#L451 This starts from the mirror number 0 and tries others in an incrementing order, until succeeds. It appears that as long as the mirror with copy #0 is up and not corrupted, all reads will simply get satisfied from it. > > *) Write performance is not optimized, during long full bandwidth sequential > > writes it is common to see devices writing not in parallel, but with a long > > periods of just one device writing, then another. (Admittedly have been some > > time since I tested that). > I've never seen this be an issue in practice, especially if you're using > transparent compression (which caps extent size, and therefore I/O size > to a given device, at 128k). I'm also sane enough that I'm not doing > bulk streaming writes to traditional HDD's or fully saturating the > bandwidth on my SSD's (you should be over-provisioning whenever > possible). For a desktop user, unless you're doing real-time video > recording at higher than HD resolution with high quality surround sound, > this probably isn't going to hit you (and even then you should be > recording to a temporary location with much faster write speeds (tmpfs > or ext4 without a journal for example) because you'll likely get hit > with fragmentation). I did not use compression while observing this; Also I don't know what is particularly insane about copying a 4-8 GB file onto a storage array. I'd expect both disks to write at the same time (like they do in pretty much any other RAID1 system), not one-after-another, effectively slowing down the entire operation by as much as 2x in extreme cases. > As far as not mounting degraded by default, that's a conscious design > choice that isn't going to change. There's a switch (adding 'degraded' > to the mount options) to enable this behavior per-mount, so we're still > on-par in that respect with LVM and MD, we just picked a different > default. In this case, I actually feel it's a better default for most > cases, because most regular users aren't doing exhaustive monitoring, > and thus are not likely to notice the filesystem being mounted degraded > until it's far too late. If the filesystem is degraded, then > _something_ has happened that the user needs to know about, and until > some sane monitoring solution is implemented, the easiest way to ensure > this is to refuse to mount. The easiest is to write to dmesg and syslog, if a user doesn't monitor those either, it's their own fault; and the more user friendly one would be to still auto mount degraded, but read-only. Comparing to Ext4, that one appears to have the "errors=continue" behavior by default, the user has to explicitly request "errors=remount-ro", and I have never seen anyone use or recommend the third option of "errors=panic", which is basically the equivalent of the current Btrfs practce. > > *) It does not properly handle a device disappearing during operation. > > (There > > is a patchset to add that). > > > > *) It does not properly handle said device returning (under a > > different /dev/sdX name, for bonus points). 
> These are not an easy problem to fix completely, especially considering > that the device is currently guaranteed to reappear under a different > name because BTRFS will still have an open reference on the original > device name. > > On top of that, if you've got hardware that's doing this without manual > intervention, you've got much bigger issues than how BTRFS reacts to it. > No correctly working hardware should be doing this. Unplugging and replugging a SATA cable of a RAID1 member should never put your system under the risk of a massive filesystem corruption; you cannot say it absolutely doesn't with the current implementation. -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Convert from RAID 5 to 10
Am 30/11/16 um 11:41 schrieb Duncan: > Wilson Meier posted on Wed, 30 Nov 2016 09:35:36 +0100 as excerpted: > >> Am 30/11/16 um 09:06 schrieb Martin Steigerwald: >>> Am Mittwoch, 30. November 2016, 10:38:08 CET schrieb Roman Mamedov: [snip] >>> So the stability matrix would need to be updated not to recommend any >>> kind of BTRFS RAID 1 at the moment? >>> >>> Actually I faced the BTRFS RAID 1 read only after first attempt of >>> mounting it "degraded" just a short time ago. >>> >>> BTRFS still needs way more stability work it seems to me. >>> >> I would say the matrix should be updated to not recommend any RAID Level >> as from the discussion it seems they all of them have flaws. >> To me RAID is broken if one cannot expect to recover from a device >> failure in a solid way as this is why RAID is used. >> Correct me if i'm wrong. Right now i'm making my thoughts about >> migrating to another FS and/or Hardware RAID. > It should be noted that no list regular that I'm aware of anyway, would > make any claims about btrfs being stable and mature either now or in the > near-term future in any case. Rather to the contrary, as I generally put > it, btrfs is still stabilizing and maturing, with backups one is willing > to use (and as any admin of any worth would say, a backup that hasn't > been tested usable isn't yet a backup; the job of creating the backup > isn't done until that backup has been tested actually usable for > recovery) still extremely strongly recommended. Similarly, keeping up > with the list is recommended, as is staying relatively current on both > the kernel and userspace (generally considered to be within the latest > two kernel series of either current or LTS series kernels, and with a > similarly versioned btrfs userspace). > > In that context, btrfs single-device and raid1 (and raid0 of course) are > quite usable and as stable as btrfs in general is, that being stabilizing > but not yet fully stable and mature, with raid10 being slightly less so > and raid56 being much more experimental/unstable at this point. > > But that context never claims full stability even for the relatively > stable raid1 and single device modes, and in fact anticipates that there > may be times when recovery from the existing filesystem may not be > practical, thus the recommendation to keep tested usable backups at the > ready. > > Meanwhile, it remains relatively common on this list for those wondering > about their btrfs on long-term-stale (not a typo) "enterprise" distros, > or even debian-stale, to be actively steered away from btrfs, especially > if they're not willing to update to something far more current than those > distros often provide, because in general, the current stability status > of btrfs is in conflict with the reason people generally choose to use > that level of old and stale software in the first place -- they > prioritize tried and tested to work, stable and mature, over the latest > generally newer and flashier featured but sometimes not entirely stable, > and btrfs at this point simply doesn't meet that sort of stability/ > maturity expectations, nor is it likely to for some time (measured in > years), due to all the reasons enumerated so well in the above thread. > > > In that context, the stability status matrix on the wiki is already > reasonably accurate, certainly so IMO, because "OK" in context means as > OK as btrfs is in general, and btrfs itself remains still stabilizing, > not fully stable and mature. 
> > If there IS an argument as to the accuracy of the raid0/1/10 OK status, > I'd argue it's purely due to people not understanding the status of btrfs > in general, and that if there's a general deficiency at all, it's in the > lack of a general stability status paragraph on that page itself > explaining all this, despite the fact that the main https:// > btrfs.wiki.kernel.org landing page states quite plainly under stability > status that btrfs remains under heavy development and that current > kernels are strongly recommended. (Tho were I editing it, there'd > certainly be a more prominent mention of keeping backups at the ready as > well.) Hi Duncan, I understand your arguments but cannot fully agree. First of all, I'm not sticking with old, stale versions of whatever, as I try to keep my system up to date. My kernel is 4.8.4 (Gentoo) and btrfs-progs is 4.8.4. That being said, I'm quite aware of the heavy development status of btrfs, but pointing the finger at the users, saying that they don't fully understand the status of btrfs, without giving that information on the wiki is in my opinion not the right way. Heavy development doesn't mean that features marked as OK are "not" or only "mostly" OK in the context of overall btrfs stability. There is no indication on the wiki that raid1 or any other raid level (except for raid5/6) suffers from the problems stated in this thread. If there are known problems, then the stability matrix should point them out.
Re: Convert from RAID 5 to 10
On 2016-11-30 00:38, Roman Mamedov wrote: > On Wed, 30 Nov 2016 00:16:48 +0100 Wilson Meier wrote: >> That said, btrfs shouldn't be used for other than raid1 as every other >> raid level has serious problems or at least doesn't work as the expected >> raid level (in terms of failure recovery). > RAID1 shouldn't be used either: > > *) Read performance is not optimized: all metadata is always read from the > first device unless it has failed, data reads are supposedly balanced > between devices per PID of the process reading. Better implementations > dispatch reads per request to devices that are currently idle. Based on what I've seen, the metadata reads get balanced too. As for the read balancing in general, while it doesn't work very well for single processes, if you have a large number of processes started sequentially (for example, a thread-pool based server), it actually works out to being near optimal with a lot less logic than DM and MD have. Aggregated over an entire system it's usually near optimal as well. > *) Write performance is not optimized, during long full bandwidth sequential > writes it is common to see devices writing not in parallel, but with long > periods of just one device writing, then another. (Admittedly it has been > some time since I tested that.) I've never seen this be an issue in practice, especially if you're using transparent compression (which caps extent size, and therefore I/O size to a given device, at 128k). I'm also sane enough that I'm not doing bulk streaming writes to traditional HDDs or fully saturating the bandwidth on my SSDs (you should be over-provisioning whenever possible). For a desktop user, unless you're doing real-time video recording at higher than HD resolution with high quality surround sound, this probably isn't going to hit you (and even then you should be recording to a temporary location with much faster write speeds (tmpfs or ext4 without a journal, for example) because you'll likely get hit with fragmentation). This also has overall pretty low impact compared to a number of other things that BTRFS does (BTRFS on a single disk with the single profile for everything versus 2 of the same disks with the raid1 profile for everything gets less than a 20% performance difference in all the testing I've done). > *) A degraded RAID1 won't mount by default. > > If this was the root filesystem, the machine won't boot. > > To mount it, you need to add the "degraded" mount option. > However you have exactly a single chance at that, you MUST restore the RAID > to non-degraded state while it's mounted during that session, since it > won't ever mount again in the r/w+degraded mode, and in r/o mode you can't > perform any operations on the filesystem, including adding/removing > devices. There is a fix pending for the single chance to mount degraded thing, and even then, it only applies to a 2 device raid1 array (with more devices, new chunks are still raid1 if you're missing 1 device, so the checks don't trigger and refuse the mount). As far as not mounting degraded by default, that's a conscious design choice that isn't going to change. There's a switch (adding 'degraded' to the mount options) to enable this behavior per-mount, so we're still on par in that respect with LVM and MD, we just picked a different default. In this case, I actually feel it's a better default for most cases, because most regular users aren't doing exhaustive monitoring, and thus are not likely to notice the filesystem being mounted degraded until it's far too late. 
If the filesystem is degraded, then _something_ has happened that the user needs to know about, and until some sane monitoring solution is implemented, the easiest way to ensure this is to refuse to mount. > *) It does not properly handle a device disappearing during operation. > (There is a patchset to add that). > > *) It does not properly handle said device returning (under a > different /dev/sdX name, for bonus points). These are not easy problems to fix completely, especially considering that the device is currently guaranteed to reappear under a different name because BTRFS will still have an open reference on the original device name. On top of that, if you've got hardware that's doing this without manual intervention, you've got much bigger issues than how BTRFS reacts to it. No correctly working hardware should be doing this. > Most of these also apply to all other RAID levels. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
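To make the degraded-mount workflow discussed above concrete, a recovery on a two-device raid1 might look roughly like the sketch below. Device names and the mount point are only examples, and on the kernels discussed in this thread you effectively get one read-write degraded mount, so the repair should happen in that same session:

# mount -o degraded /dev/sdb4 /mnt
# btrfs device add /dev/sdc4 /mnt          # add a replacement for the missing disk
# btrfs device delete missing /mnt         # drop the dead device and re-replicate its chunks
# btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt   # re-raid1 chunks written as "single" while degraded

The "soft" filter only touches chunks that are not already in the target profile, so the last step is cheap if nothing was written while the array was degraded.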
Re: Convert from RAID 5 to 10
Wilson Meier posted on Wed, 30 Nov 2016 09:35:36 +0100 as excerpted: > Am 30/11/16 um 09:06 schrieb Martin Steigerwald: >> Am Mittwoch, 30. November 2016, 10:38:08 CET schrieb Roman Mamedov: >>> On Wed, 30 Nov 2016 00:16:48 +0100 >>> >>> Wilson Meierwrote: That said, btrfs shouldn't be used for other then raid1 as every other raid level has serious problems or at least doesn't work as the expected raid level (in terms of failure recovery). >>> RAID1 shouldn't be used either: >>> >>> *) Read performance is not optimized: all metadata is always read from >>> the first device unless it has failed, data reads are supposedly >>> balanced between devices per PID of the process reading. Better >>> implementations dispatch reads per request to devices that are >>> currently idle. >>> >>> *) Write performance is not optimized, during long full bandwidth >>> sequential writes it is common to see devices writing not in parallel, >>> but with a long periods of just one device writing, then another. >>> (Admittedly have been some time since I tested that). >>> >>> *) A degraded RAID1 won't mount by default. >>> >>> If this was the root filesystem, the machine won't boot. >>> >>> To mount it, you need to add the "degraded" mount option. >>> However you have exactly a single chance at that, you MUST restore the >>> RAID to non-degraded state while it's mounted during that session, >>> since it won't ever mount again in the r/w+degraded mode, and in r/o >>> mode you can't perform any operations on the filesystem, including >>> adding/removing devices. >>> >>> *) It does not properly handle a device disappearing during operation. >>> (There is a patchset to add that). >>> >>> *) It does not properly handle said device returning (under a >>> different /dev/sdX name, for bonus points). >>> >>> Most of these also apply to all other RAID levels. >> So the stability matrix would need to be updated not to recommend any >> kind of BTRFS RAID 1 at the moment? >> >> Actually I faced the BTRFS RAID 1 read only after first attempt of >> mounting it "degraded" just a short time ago. >> >> BTRFS still needs way more stability work it seems to me. >> > I would say the matrix should be updated to not recommend any RAID Level > as from the discussion it seems they all of them have flaws. > To me RAID is broken if one cannot expect to recover from a device > failure in a solid way as this is why RAID is used. > Correct me if i'm wrong. Right now i'm making my thoughts about > migrating to another FS and/or Hardware RAID. It should be noted that no list regular that I'm aware of anyway, would make any claims about btrfs being stable and mature either now or in the near-term future in any case. Rather to the contrary, as I generally put it, btrfs is still stabilizing and maturing, with backups one is willing to use (and as any admin of any worth would say, a backup that hasn't been tested usable isn't yet a backup; the job of creating the backup isn't done until that backup has been tested actually usable for recovery) still extremely strongly recommended. Similarly, keeping up with the list is recommended, as is staying relatively current on both the kernel and userspace (generally considered to be within the latest two kernel series of either current or LTS series kernels, and with a similarly versioned btrfs userspace). 
In that context, btrfs single-device and raid1 (and raid0 of course) are quite usable and as stable as btrfs in general is, that being stabilizing but not yet fully stable and mature, with raid10 being slightly less so and raid56 being much more experimental/unstable at this point. But that context never claims full stability even for the relatively stable raid1 and single device modes, and in fact anticipates that there may be times when recovery from the existing filesystem may not be practical, thus the recommendation to keep tested usable backups at the ready. Meanwhile, it remains relatively common on this list for those wondering about their btrfs on long-term-stale (not a typo) "enterprise" distros, or even debian-stale, to be actively steered away from btrfs, especially if they're not willing to update to something far more current than those distros often provide, because in general, the current stability status of btrfs is in conflict with the reason people generally choose to use that level of old and stale software in the first place -- they prioritize tried and tested to work, stable and mature, over the latest generally newer and flashier featured but sometimes not entirely stable, and btrfs at this point simply doesn't meet that sort of stability/ maturity expectations, nor is it likely to for some time (measured in years), due to all the reasons enumerated so well in the above thread. In that context, the stability status matrix on the wiki is already reasonably accurate, certainly so IMO, because "OK" in context means as OK as
Re: Convert from RAID 5 to 10
Am 30/11/16 um 09:06 schrieb Martin Steigerwald: > Am Mittwoch, 30. November 2016, 10:38:08 CET schrieb Roman Mamedov: >> On Wed, 30 Nov 2016 00:16:48 +0100 >> >> Wilson Meierwrote: >>> That said, btrfs shouldn't be used for other then raid1 as every other >>> raid level has serious problems or at least doesn't work as the expected >>> raid level (in terms of failure recovery). >> RAID1 shouldn't be used either: >> >> *) Read performance is not optimized: all metadata is always read from the >> first device unless it has failed, data reads are supposedly balanced >> between devices per PID of the process reading. Better implementations >> dispatch reads per request to devices that are currently idle. >> >> *) Write performance is not optimized, during long full bandwidth sequential >> writes it is common to see devices writing not in parallel, but with a long >> periods of just one device writing, then another. (Admittedly have been >> some time since I tested that). >> >> *) A degraded RAID1 won't mount by default. >> >> If this was the root filesystem, the machine won't boot. >> >> To mount it, you need to add the "degraded" mount option. >> However you have exactly a single chance at that, you MUST restore the RAID >> to non-degraded state while it's mounted during that session, since it >> won't ever mount again in the r/w+degraded mode, and in r/o mode you can't >> perform any operations on the filesystem, including adding/removing >> devices. >> >> *) It does not properly handle a device disappearing during operation. >> (There is a patchset to add that). >> >> *) It does not properly handle said device returning (under a >> different /dev/sdX name, for bonus points). >> >> Most of these also apply to all other RAID levels. > So the stability matrix would need to be updated not to recommend any kind of > BTRFS RAID 1 at the moment? > > Actually I faced the BTRFS RAID 1 read only after first attempt of mounting > it > "degraded" just a short time ago. > > BTRFS still needs way more stability work it seems to me. > I would say the matrix should be updated to not recommend any RAID Level as from the discussion it seems they all of them have flaws. To me RAID is broken if one cannot expect to recover from a device failure in a solid way as this is why RAID is used. Correct me if i'm wrong. Right now i'm making my thoughts about migrating to another FS and/or Hardware RAID. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Convert from RAID 5 to 10
Am Mittwoch, 30. November 2016, 10:38:08 CET schrieb Roman Mamedov: > On Wed, 30 Nov 2016 00:16:48 +0100 > > Wilson Meierwrote: > > That said, btrfs shouldn't be used for other then raid1 as every other > > raid level has serious problems or at least doesn't work as the expected > > raid level (in terms of failure recovery). > > RAID1 shouldn't be used either: > > *) Read performance is not optimized: all metadata is always read from the > first device unless it has failed, data reads are supposedly balanced > between devices per PID of the process reading. Better implementations > dispatch reads per request to devices that are currently idle. > > *) Write performance is not optimized, during long full bandwidth sequential > writes it is common to see devices writing not in parallel, but with a long > periods of just one device writing, then another. (Admittedly have been > some time since I tested that). > > *) A degraded RAID1 won't mount by default. > > If this was the root filesystem, the machine won't boot. > > To mount it, you need to add the "degraded" mount option. > However you have exactly a single chance at that, you MUST restore the RAID > to non-degraded state while it's mounted during that session, since it > won't ever mount again in the r/w+degraded mode, and in r/o mode you can't > perform any operations on the filesystem, including adding/removing > devices. > > *) It does not properly handle a device disappearing during operation. > (There is a patchset to add that). > > *) It does not properly handle said device returning (under a > different /dev/sdX name, for bonus points). > > Most of these also apply to all other RAID levels. So the stability matrix would need to be updated not to recommend any kind of BTRFS RAID 1 at the moment? Actually I faced the BTRFS RAID 1 read only after first attempt of mounting it "degraded" just a short time ago. BTRFS still needs way more stability work it seems to me. -- Martin -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Convert from RAID 5 to 10
On Wed, 30 Nov 2016 00:16:48 +0100 Wilson Meier wrote: > That said, btrfs shouldn't be used for other than raid1 as every other > raid level has serious problems or at least doesn't work as the expected > raid level (in terms of failure recovery). RAID1 shouldn't be used either: *) Read performance is not optimized: all metadata is always read from the first device unless it has failed, data reads are supposedly balanced between devices per PID of the process reading. Better implementations dispatch reads per request to devices that are currently idle. *) Write performance is not optimized, during long full bandwidth sequential writes it is common to see devices writing not in parallel, but with long periods of just one device writing, then another. (Admittedly it has been some time since I tested that.) *) A degraded RAID1 won't mount by default. If this was the root filesystem, the machine won't boot. To mount it, you need to add the "degraded" mount option. However you have exactly a single chance at that, you MUST restore the RAID to non-degraded state while it's mounted during that session, since it won't ever mount again in the r/w+degraded mode, and in r/o mode you can't perform any operations on the filesystem, including adding/removing devices. *) It does not properly handle a device disappearing during operation. (There is a patchset to add that). *) It does not properly handle said device returning (under a different /dev/sdX name, for bonus points). Most of these also apply to all other RAID levels. -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Convert from RAID 5 to 10
On 30.11.2016 00:49, Chris Murphy wrote: > On Tue, Nov 29, 2016 at 4:16 PM, Wilson Meierwrote: >> >> >> On 29.11.2016 23:52, Chris Murphy wrote: >>> On Tue, Nov 29, 2016 at 3:34 PM, Wilson Meier >>> wrote: On 29.11.2016 18:54, Austin S. Hemmelgarn wrote: > On 2016-11-29 12:20, Florian Lindner wrote: >> Hello, >> >> I have 4 harddisks with 3TB capacity each. They are all used in a >> btrfs RAID 5. It has come to my attention, that there >> seem to be major flaws in btrfs' raid 5 implementation. Because of >> that, I want to convert the the raid 5 to a raid 10 >> and I have several questions. >> >> * Is that possible as an online conversion? > Yes, as long as you have a complete array to begin with (converting from > a degraded raid5/6 array has the same issues as rebuilding a degraded > raid5/6 array). >> >> * Since my effective capacity will shrink during conversions, does >> btrfs check if there is enough free capacity to >> convert? As you see below, right now it's probably too full, but I'm >> going to delete some stuff. > No, you'll have to do the math yourself. This would be a great project > idea to place on the wiki though. >> >> * I understand the command to convert is >> >> btrfs balance start -dconvert=raid10 -mconvert=raid10 /mnt >> >> Correct? > Yes, but I would personally convert first metadata then data. The > raid10 profile gets better performance than raid5, so converting the > metadata first (by issuing a balance just covering the metadata) should > speed up the data conversion a bit). >> >> * What disks are allowed to fail? My understanding of a raid 10 is >> like that >> >> disks = {a, b, c, d} >> >> raid0( raid1(a, b), raid1(c, d) ) >> >> This way (a XOR b) AND (c XOR d) are allowed to fail without the raid >> to fail (either a or b and c or d are allowed to fail) >> >> How is that with a btrfs raid 10? > A BTRFS raid10 can only sustain one disk failure. Ideally, it would > work like you show, but in practice it doesn't. I'm a little bit concerned right now. I migrated my 4 disk raid6 to raid10 because of the known raid5/6 problems. I assumed that btrfs raid10 can handle 2 disk failures as longs as they occur in different stripes. Could you please point out why it cannot sustain 2 disk failures? >>> >>> Conventional raid10 has a fixed assignment of which drives are >>> mirrored pairs, and this doesn't happen with Btrfs at the device level >>> but rather the chunk level. And a chunk stripe number is not fixed to >>> a particular device, therefore it's possible a device will have more >>> than one chunk stripe number. So what that means is the loss of two >>> devices has a pretty decent chance of resulting in the loss of both >>> copies of a chunk, whereas conventional RAID 10 must lose both >>> mirrored pairs for data loss to happen. >>> >>> With very cursory testing what I've found is btrfs-progs establishes >>> an initial stripe number to device mapping that's different than the >>> kernel code. The kernel code appears to be pretty consistent so long >>> as the member devices are identically sized. So it's probably not an >>> unfixable problem, but the effect is that right now Btrfs raid10 >>> profile is more like raid0+1. >>> >>> You can use >>> $ sudo btrfs insp dump-tr -t 3 /dev/ >>> >>> That will dump the chunk tree, and you can see if any device has more >>> than one chunk stripe number associated with it. >>> >>> >> Huh, that makes sense. 
That probably should be fixed :) >> >> Given your advised command (extended it a bit for readability): >> # btrfs insp dump-tr -t 3 /dev/mapper/luks-2.1 | grep "stripe " | awk '{ >> print $1" "$2" "$3" "$4 }' | sort -u >> >> I get: >> stripe 0 devid 1 >> stripe 0 devid 4 >> stripe 1 devid 2 >> stripe 1 devid 3 >> stripe 1 devid 4 >> stripe 2 devid 1 >> stripe 2 devid 2 >> stripe 2 devid 3 >> stripe 3 devid 1 >> stripe 3 devid 2 >> stripe 3 devid 3 >> stripe 3 devid 4 >> >> Now i'm even more concerned! > > Uhh yeah, this is a four device raid10? I'm a little confused why it's > not consistently showing four stripes per chunk, which would mean the > same number of strip 0's as stripe 3's. I don't know what that's > about. > Yes, 4 devices. It does show 4 stripes per chunk, but the command above sorts and makes the results unique (sort -u). This gives a quick overview of multiple stripes on a single device. > A full balance might make the mapping consistent. > Will give i a try. >> That said, btrfs shouldn't be used for other then raid1 as every other >> raid level has serious problems or at least doesn't work as the expected >> raid level (in terms of failure recovery). > > Well, raid1 is also single device failure tolerance only as well. > There is no device n raid1. > Sure, but
Re: Convert from RAID 5 to 10
On Tue, Nov 29, 2016 at 4:16 PM, Wilson Meierwrote: > > > On 29.11.2016 23:52, Chris Murphy wrote: >> On Tue, Nov 29, 2016 at 3:34 PM, Wilson Meier wrote: >>> On 29.11.2016 18:54, Austin S. Hemmelgarn wrote: On 2016-11-29 12:20, Florian Lindner wrote: > Hello, > > I have 4 harddisks with 3TB capacity each. They are all used in a > btrfs RAID 5. It has come to my attention, that there > seem to be major flaws in btrfs' raid 5 implementation. Because of > that, I want to convert the the raid 5 to a raid 10 > and I have several questions. > > * Is that possible as an online conversion? Yes, as long as you have a complete array to begin with (converting from a degraded raid5/6 array has the same issues as rebuilding a degraded raid5/6 array). > > * Since my effective capacity will shrink during conversions, does > btrfs check if there is enough free capacity to > convert? As you see below, right now it's probably too full, but I'm > going to delete some stuff. No, you'll have to do the math yourself. This would be a great project idea to place on the wiki though. > > * I understand the command to convert is > > btrfs balance start -dconvert=raid10 -mconvert=raid10 /mnt > > Correct? Yes, but I would personally convert first metadata then data. The raid10 profile gets better performance than raid5, so converting the metadata first (by issuing a balance just covering the metadata) should speed up the data conversion a bit). > > * What disks are allowed to fail? My understanding of a raid 10 is > like that > > disks = {a, b, c, d} > > raid0( raid1(a, b), raid1(c, d) ) > > This way (a XOR b) AND (c XOR d) are allowed to fail without the raid > to fail (either a or b and c or d are allowed to fail) > > How is that with a btrfs raid 10? A BTRFS raid10 can only sustain one disk failure. Ideally, it would work like you show, but in practice it doesn't. >>> I'm a little bit concerned right now. I migrated my 4 disk raid6 to >>> raid10 because of the known raid5/6 problems. I assumed that btrfs >>> raid10 can handle 2 disk failures as longs as they occur in different >>> stripes. >>> Could you please point out why it cannot sustain 2 disk failures? >> >> Conventional raid10 has a fixed assignment of which drives are >> mirrored pairs, and this doesn't happen with Btrfs at the device level >> but rather the chunk level. And a chunk stripe number is not fixed to >> a particular device, therefore it's possible a device will have more >> than one chunk stripe number. So what that means is the loss of two >> devices has a pretty decent chance of resulting in the loss of both >> copies of a chunk, whereas conventional RAID 10 must lose both >> mirrored pairs for data loss to happen. >> >> With very cursory testing what I've found is btrfs-progs establishes >> an initial stripe number to device mapping that's different than the >> kernel code. The kernel code appears to be pretty consistent so long >> as the member devices are identically sized. So it's probably not an >> unfixable problem, but the effect is that right now Btrfs raid10 >> profile is more like raid0+1. >> >> You can use >> $ sudo btrfs insp dump-tr -t 3 /dev/ >> >> That will dump the chunk tree, and you can see if any device has more >> than one chunk stripe number associated with it. >> >> > Huh, that makes sense. 
> That probably should be fixed :) > > Given your advised command (extended it a bit for readability): > # btrfs insp dump-tr -t 3 /dev/mapper/luks-2.1 | grep "stripe " | awk '{ > print $1" "$2" "$3" "$4 }' | sort -u > > I get: > stripe 0 devid 1 > stripe 0 devid 4 > stripe 1 devid 2 > stripe 1 devid 3 > stripe 1 devid 4 > stripe 2 devid 1 > stripe 2 devid 2 > stripe 2 devid 3 > stripe 3 devid 1 > stripe 3 devid 2 > stripe 3 devid 3 > stripe 3 devid 4 > > Now I'm even more concerned! Uhh yeah, this is a four device raid10? I'm a little confused why it's not consistently showing four stripes per chunk, which would mean the same number of stripe 0's as stripe 3's. I don't know what that's about. A full balance might make the mapping consistent. > That said, btrfs shouldn't be used for other than raid1 as every other > raid level has serious problems or at least doesn't work as the expected > raid level (in terms of failure recovery). Well, raid1 also only tolerates a single device failure. There is no n-way raid1. -- Chris Murphy -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
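For reference, the "full balance" suggested here is just a balance with no filters, which rewrites every chunk and therefore also redoes the stripe-to-device assignment. A rough sketch (mount point is an example; recent btrfs-progs ask for the --full-balance flag to confirm an unfiltered balance):

# btrfs balance start --full-balance /mnt
# btrfs balance status /mnt                # from a second shell, to watch progress
# btrfs insp dump-tr -t 3 /dev/mapper/luks-2.1 | grep "stripe " | awk '{ print $1" "$2" "$3" "$4 }' | sort -u

Whether the resulting mapping is actually pairwise consistent is exactly what the re-check in the last step is for; as discussed above, there is no guarantee.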
Re: Convert from RAID 5 to 10
On 29.11.2016 23:52, Chris Murphy wrote: > On Tue, Nov 29, 2016 at 3:34 PM, Wilson Meierwrote: >> On 29.11.2016 18:54, Austin S. Hemmelgarn wrote: >>> On 2016-11-29 12:20, Florian Lindner wrote: Hello, I have 4 harddisks with 3TB capacity each. They are all used in a btrfs RAID 5. It has come to my attention, that there seem to be major flaws in btrfs' raid 5 implementation. Because of that, I want to convert the the raid 5 to a raid 10 and I have several questions. * Is that possible as an online conversion? >>> Yes, as long as you have a complete array to begin with (converting from >>> a degraded raid5/6 array has the same issues as rebuilding a degraded >>> raid5/6 array). * Since my effective capacity will shrink during conversions, does btrfs check if there is enough free capacity to convert? As you see below, right now it's probably too full, but I'm going to delete some stuff. >>> No, you'll have to do the math yourself. This would be a great project >>> idea to place on the wiki though. * I understand the command to convert is btrfs balance start -dconvert=raid10 -mconvert=raid10 /mnt Correct? >>> Yes, but I would personally convert first metadata then data. The >>> raid10 profile gets better performance than raid5, so converting the >>> metadata first (by issuing a balance just covering the metadata) should >>> speed up the data conversion a bit). * What disks are allowed to fail? My understanding of a raid 10 is like that disks = {a, b, c, d} raid0( raid1(a, b), raid1(c, d) ) This way (a XOR b) AND (c XOR d) are allowed to fail without the raid to fail (either a or b and c or d are allowed to fail) How is that with a btrfs raid 10? >>> A BTRFS raid10 can only sustain one disk failure. Ideally, it would >>> work like you show, but in practice it doesn't. >> I'm a little bit concerned right now. I migrated my 4 disk raid6 to >> raid10 because of the known raid5/6 problems. I assumed that btrfs >> raid10 can handle 2 disk failures as longs as they occur in different >> stripes. >> Could you please point out why it cannot sustain 2 disk failures? > > Conventional raid10 has a fixed assignment of which drives are > mirrored pairs, and this doesn't happen with Btrfs at the device level > but rather the chunk level. And a chunk stripe number is not fixed to > a particular device, therefore it's possible a device will have more > than one chunk stripe number. So what that means is the loss of two > devices has a pretty decent chance of resulting in the loss of both > copies of a chunk, whereas conventional RAID 10 must lose both > mirrored pairs for data loss to happen. > > With very cursory testing what I've found is btrfs-progs establishes > an initial stripe number to device mapping that's different than the > kernel code. The kernel code appears to be pretty consistent so long > as the member devices are identically sized. So it's probably not an > unfixable problem, but the effect is that right now Btrfs raid10 > profile is more like raid0+1. > > You can use > $ sudo btrfs insp dump-tr -t 3 /dev/ > > That will dump the chunk tree, and you can see if any device has more > than one chunk stripe number associated with it. > > Huh, that makes sense. 
That probably should be fixed :)

Given your advised command (extended it a bit for readability):

# btrfs insp dump-tr -t 3 /dev/mapper/luks-2.1 | grep "stripe " | awk '{ print $1" "$2" "$3" "$4 }' | sort -u

I get:

stripe 0 devid 1
stripe 0 devid 4
stripe 1 devid 2
stripe 1 devid 3
stripe 1 devid 4
stripe 2 devid 1
stripe 2 devid 2
stripe 2 devid 3
stripe 3 devid 1
stripe 3 devid 2
stripe 3 devid 3
stripe 3 devid 4

Now I'm even more concerned!

That said, btrfs shouldn't be used for other than raid1 as every other raid level has serious problems or at least doesn't work as the expected raid level (in terms of failure recovery). -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Convert from RAID 5 to 10
On Tue, Nov 29, 2016 at 3:34 PM, Wilson Meierwrote: > On 29.11.2016 18:54, Austin S. Hemmelgarn wrote: >> On 2016-11-29 12:20, Florian Lindner wrote: >>> Hello, >>> >>> I have 4 harddisks with 3TB capacity each. They are all used in a >>> btrfs RAID 5. It has come to my attention, that there >>> seem to be major flaws in btrfs' raid 5 implementation. Because of >>> that, I want to convert the the raid 5 to a raid 10 >>> and I have several questions. >>> >>> * Is that possible as an online conversion? >> Yes, as long as you have a complete array to begin with (converting from >> a degraded raid5/6 array has the same issues as rebuilding a degraded >> raid5/6 array). >>> >>> * Since my effective capacity will shrink during conversions, does >>> btrfs check if there is enough free capacity to >>> convert? As you see below, right now it's probably too full, but I'm >>> going to delete some stuff. >> No, you'll have to do the math yourself. This would be a great project >> idea to place on the wiki though. >>> >>> * I understand the command to convert is >>> >>> btrfs balance start -dconvert=raid10 -mconvert=raid10 /mnt >>> >>> Correct? >> Yes, but I would personally convert first metadata then data. The >> raid10 profile gets better performance than raid5, so converting the >> metadata first (by issuing a balance just covering the metadata) should >> speed up the data conversion a bit). >>> >>> * What disks are allowed to fail? My understanding of a raid 10 is >>> like that >>> >>> disks = {a, b, c, d} >>> >>> raid0( raid1(a, b), raid1(c, d) ) >>> >>> This way (a XOR b) AND (c XOR d) are allowed to fail without the raid >>> to fail (either a or b and c or d are allowed to fail) >>> >>> How is that with a btrfs raid 10? >> A BTRFS raid10 can only sustain one disk failure. Ideally, it would >> work like you show, but in practice it doesn't. > I'm a little bit concerned right now. I migrated my 4 disk raid6 to > raid10 because of the known raid5/6 problems. I assumed that btrfs > raid10 can handle 2 disk failures as longs as they occur in different > stripes. > Could you please point out why it cannot sustain 2 disk failures? Conventional raid10 has a fixed assignment of which drives are mirrored pairs, and this doesn't happen with Btrfs at the device level but rather the chunk level. And a chunk stripe number is not fixed to a particular device, therefore it's possible a device will have more than one chunk stripe number. So what that means is the loss of two devices has a pretty decent chance of resulting in the loss of both copies of a chunk, whereas conventional RAID 10 must lose both mirrored pairs for data loss to happen. With very cursory testing what I've found is btrfs-progs establishes an initial stripe number to device mapping that's different than the kernel code. The kernel code appears to be pretty consistent so long as the member devices are identically sized. So it's probably not an unfixable problem, but the effect is that right now Btrfs raid10 profile is more like raid0+1. You can use $ sudo btrfs insp dump-tr -t 3 /dev/ That will dump the chunk tree, and you can see if any device has more than one chunk stripe number associated with it. -- Chris Murphy -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
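Spelled out, the abbreviated command above is btrfs inspect-internal dump-tree (tree 3 is the chunk tree). One way to see at a glance whether any device appears under more than one stripe index, building on the pipeline used elsewhere in this thread (the device path is only an example):

# btrfs inspect-internal dump-tree -t 3 /dev/sda4 | grep "stripe " | awk '{ print $1" "$2" "$3" "$4 }' | sort | uniq -c

Each output line then shows how many chunks map a given stripe index to a given devid; with a conventional, fixed-pairing raid10 you would expect each devid to show up under a single stripe index only.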
Re: Convert from RAID 5 to 10
On 29.11.2016 18:54, Austin S. Hemmelgarn wrote: > On 2016-11-29 12:20, Florian Lindner wrote: >> Hello, >> >> I have 4 harddisks with 3TB capacity each. They are all used in a >> btrfs RAID 5. It has come to my attention, that there >> seem to be major flaws in btrfs' raid 5 implementation. Because of >> that, I want to convert the the raid 5 to a raid 10 >> and I have several questions. >> >> * Is that possible as an online conversion? > Yes, as long as you have a complete array to begin with (converting from > a degraded raid5/6 array has the same issues as rebuilding a degraded > raid5/6 array). >> >> * Since my effective capacity will shrink during conversions, does >> btrfs check if there is enough free capacity to >> convert? As you see below, right now it's probably too full, but I'm >> going to delete some stuff. > No, you'll have to do the math yourself. This would be a great project > idea to place on the wiki though. >> >> * I understand the command to convert is >> >> btrfs balance start -dconvert=raid10 -mconvert=raid10 /mnt >> >> Correct? > Yes, but I would personally convert first metadata then data. The > raid10 profile gets better performance than raid5, so converting the > metadata first (by issuing a balance just covering the metadata) should > speed up the data conversion a bit). >> >> * What disks are allowed to fail? My understanding of a raid 10 is >> like that >> >> disks = {a, b, c, d} >> >> raid0( raid1(a, b), raid1(c, d) ) >> >> This way (a XOR b) AND (c XOR d) are allowed to fail without the raid >> to fail (either a or b and c or d are allowed to fail) >> >> How is that with a btrfs raid 10? > A BTRFS raid10 can only sustain one disk failure. Ideally, it would > work like you show, but in practice it doesn't. I'm a little bit concerned right now. I migrated my 4 disk raid6 to raid10 because of the known raid5/6 problems. I assumed that btrfs raid10 can handle 2 disk failures as longs as they occur in different stripes. Could you please point out why it cannot sustain 2 disk failures? Thanks >> >> * Any other advice? ;-) > You'll actually get significantly better performance with no loss of > data safety by running BTRFS in raid1 mode on top of two RAID0 volumes > (LVM/MD/hardware doesn't matter much). I do this myself and see roughly > 10-20% improved performance on average with my workloads. > > If you do decide to do this, it's theoretically possible to do so > online, but it's kind of tricky, so I won't post any instructions for > that here unless someone asks for them. >> >> Thanks a lot, >> >> Florian >> >> >> Some information of my filesystem: >> >> # btrfs filesystem show / >> Label: 'data' uuid: 57e5b9e9-01ae-4f9e-8a3d-9f42204d7005 >> Total devices 4 FS bytes used 7.57TiB >> devid1 size 2.72TiB used 2.72TiB path /dev/sda4 >> devid2 size 2.72TiB used 2.72TiB path /dev/sdb4 >> devid3 size 2.72TiB used 2.72TiB path /dev/sdc4 >> devid4 size 2.72TiB used 2.72TiB path /dev/sdd4 >> >> # btrfs filesystem df / >> Data, RAID5: total=8.14TiB, used=7.56TiB >> System, RAID5: total=96.00MiB, used=592.00KiB >> Metadata, RAID5: total=12.84GiB, used=11.06GiB >> GlobalReserve, single: total=512.00MiB, used=0.00B > Based on this output, you will need to delete some data before you can > convert to raid10. With 4 2.72TiB drives, you're looking at roughly > 5.44TiB of usable space, so you're probably going to have to delete at > least 2-3TiB of data from this filesystem before converting. 
> > If you're not already using transparent compression, it could probably > help some with this, but it likely won't save you more than a few > hundred GB unless you are storing lots of data that compresses very well. >> >> # df -h >> Filesystem Size Used Avail Use% Mounted on >> >> /dev/sda411T 7.6T 597G 93% / > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
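Since transparent compression keeps coming up here as a way to claw back space, a small sketch of how it is typically enabled (mount point, path and the choice of zlib are just examples; the mount option only affects newly written data, and defragmenting with -c will break reflink sharing with existing snapshots):

# mount -o remount,compress=zlib /mnt
# btrfs filesystem defragment -r -czlib /mnt/some/large/directory    # rewrite existing files compressed

For data that does not compress at all (already-compressed media, for instance) the savings will be close to zero, which is why the advice above hedges at "a few hundred GB".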
Re: Convert from RAID 5 to 10
On 2016-11-29 14:03, Lionel Bouton wrote: Hi, Le 29/11/2016 à 18:20, Florian Lindner a écrit : [...] * Any other advice? ;-) Don't rely on RAID too much... The degraded mode is unstable even for RAID10: you can corrupt data simply by writing to a degraded RAID10. I could reliably reproduce this on a 6 devices RAID10 BTRFS filesystem with a missing device. It affected even a 4.8.4 kernel where our PostgreSQL clusters got frequent write errors (on the fs itself but not the 5 working devices) and managed to corrupt their data. Have backups, you probably will need them. With Btrfs RAID If you have a failing device, replace it early (monitor the devices and don't wait for them to fail if you get transient errors or see worrying SMART values). If you have a failed device, don't actively use the filesystem in degraded mode. Replace or delete/add before writing to the filesystem again. This is an excellent point I didn't think of. If you don't have some way you can monitor things, don't trust RAID (not just BTRFS raid modes, but any RAID like system in general). The only reason I'm willing to trust it is because I have really good monitoring set up (SMART status on the disks + daily scrubs + hourly event counter checks on the FS + watching for changes to filesystem flags + a couple of other things) which will e-mail me the moment something starts to go bad (and I've jumped through hoops to get the mailing to work under almost any circumstances as long as userspace still exists and has network access). I can confirm though that things work well with BTRFS raid1 mode for at least the following: * Basic, mostly static, network services (DHCP server, DNS relay, web server serving static content, very low volume postfix installation, etc). * Moderate disk usage in very sequential usage patterns (BOINC applications in my case, but almost anything replacing files or appending in reasonably sized chunks semi-regularly falls into this). * Infrequent typical usage for software builds (I run Gentoo, so system updates = building software, and I've never had any issues with this (at least, not any issues because of BTRFS)). * Bulk sequential streaming of data (stuff like multimedia recordings). In all cases except the last (which I've only had some limited recent experience with), I've had BTRFS raid1 mode filesystems survive just fine through: * 3 bad PSU's (common case for this is that you see filesystem and storage device errors tracing down to the disks at rates proportionate to the overall load on the system) * 7 different storage devices going bad (1 catastrophic mechanical failure, 1 connector failure (poor soldering job for the connector), 2 disk controller failures, and 3 media failures) * 2 intermittently bad storage controllers * 100+ kernel panics/crashes All with no issues with data corruption (there was corruption, but BTRFS safely handled all of it and fixed it, and actually helped me diagnose two of the bad PSU's and one of the bad storage controllers). 90% of the reason it's survived all this though is because of the monitoring I have in place which let me track down exactly what was wrong and fix it before it became an issue. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
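As a very rough sketch of the kind of monitoring described above (the mount point, log path, schedule and working local mail delivery are assumptions about the setup, not anything BTRFS ships):

#!/bin/sh
# run periodically from cron; mails root if any btrfs device error counter is non-zero
MNT=/data
btrfs scrub start -Bd "$MNT" > /var/log/btrfs-scrub.log 2>&1
if btrfs device stats "$MNT" | grep -vqw 0; then
    btrfs device stats "$MNT" | mail -s "btrfs errors on $MNT" root
fi

btrfs device stats reports cumulative per-device counters (write/read/flush I/O errors, corruption and generation errors), so anything non-zero is worth a look even if the filesystem still mounts fine.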
Re: Convert from RAID 5 to 10
Hi, On 29/11/2016 at 18:20, Florian Lindner wrote: > [...] > > * Any other advice? ;-) Don't rely on RAID too much... The degraded mode is unstable even for RAID10: you can corrupt data simply by writing to a degraded RAID10. I could reliably reproduce this on a 6-device RAID10 BTRFS filesystem with a missing device. It affected even a 4.8.4 kernel, where our PostgreSQL clusters got frequent write errors (on the fs itself, but not on the 5 working devices) and managed to corrupt their data. Have backups, you probably will need them. With Btrfs RAID: if you have a failing device, replace it early (monitor the devices and don't wait for them to fail if you get transient errors or see worrying SMART values). If you have a failed device, don't actively use the filesystem in degraded mode. Replace it, or delete/add, before writing to the filesystem again. Best regards, Lionel -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
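As a sketch of the "replace it early" advice (device names and mount point are examples; -r limits reads from the flaky source device to blocks that have no healthy mirror elsewhere):

# btrfs replace start -r /dev/sdd4 /dev/sde4 /mnt
# btrfs replace status /mnt

Doing this while the old device is still at least partially readable is much safer than waiting for it to vanish and then rebuilding purely from the remaining mirrors.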
Re: Convert from RAID 5 to 10
On 2016-11-29 12:20, Florian Lindner wrote: Hello, I have 4 harddisks with 3TB capacity each. They are all used in a btrfs RAID 5. It has come to my attention, that there seem to be major flaws in btrfs' raid 5 implementation. Because of that, I want to convert the the raid 5 to a raid 10 and I have several questions. * Is that possible as an online conversion? Yes, as long as you have a complete array to begin with (converting from a degraded raid5/6 array has the same issues as rebuilding a degraded raid5/6 array). * Since my effective capacity will shrink during conversions, does btrfs check if there is enough free capacity to convert? As you see below, right now it's probably too full, but I'm going to delete some stuff. No, you'll have to do the math yourself. This would be a great project idea to place on the wiki though. * I understand the command to convert is btrfs balance start -dconvert=raid10 -mconvert=raid10 /mnt Correct? Yes, but I would personally convert first metadata then data. The raid10 profile gets better performance than raid5, so converting the metadata first (by issuing a balance just covering the metadata) should speed up the data conversion a bit). * What disks are allowed to fail? My understanding of a raid 10 is like that disks = {a, b, c, d} raid0( raid1(a, b), raid1(c, d) ) This way (a XOR b) AND (c XOR d) are allowed to fail without the raid to fail (either a or b and c or d are allowed to fail) How is that with a btrfs raid 10? A BTRFS raid10 can only sustain one disk failure. Ideally, it would work like you show, but in practice it doesn't. * Any other advice? ;-) You'll actually get significantly better performance with no loss of data safety by running BTRFS in raid1 mode on top of two RAID0 volumes (LVM/MD/hardware doesn't matter much). I do this myself and see roughly 10-20% improved performance on average with my workloads. If you do decide to do this, it's theoretically possible to do so online, but it's kind of tricky, so I won't post any instructions for that here unless someone asks for them. Thanks a lot, Florian Some information of my filesystem: # btrfs filesystem show / Label: 'data' uuid: 57e5b9e9-01ae-4f9e-8a3d-9f42204d7005 Total devices 4 FS bytes used 7.57TiB devid1 size 2.72TiB used 2.72TiB path /dev/sda4 devid2 size 2.72TiB used 2.72TiB path /dev/sdb4 devid3 size 2.72TiB used 2.72TiB path /dev/sdc4 devid4 size 2.72TiB used 2.72TiB path /dev/sdd4 # btrfs filesystem df / Data, RAID5: total=8.14TiB, used=7.56TiB System, RAID5: total=96.00MiB, used=592.00KiB Metadata, RAID5: total=12.84GiB, used=11.06GiB GlobalReserve, single: total=512.00MiB, used=0.00B Based on this output, you will need to delete some data before you can convert to raid10. With 4 2.72TiB drives, you're looking at roughly 5.44TiB of usable space, so you're probably going to have to delete at least 2-3TiB of data from this filesystem before converting. If you're not already using transparent compression, it could probably help some with this, but it likely won't save you more than a few hundred GB unless you are storing lots of data that compresses very well. # df -h Filesystem Size Used Avail Use% Mounted on /dev/sda411T 7.6T 597G 93% / -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
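As a sketch of the metadata-first ordering suggested above (mount point as in the original question; the soft filter is optional and merely skips chunks that already have the target profile if a conversion is re-run after an interruption):

# btrfs balance start -mconvert=raid10 /mnt
# btrfs balance start -dconvert=raid10,soft /mnt
# btrfs balance status /mnt                # from a second shell, to watch progress

On the capacity side the arithmetic is simple: raid10 keeps two copies of everything, so 4 x 2.72 TiB yields roughly 5.44 TiB usable; with about 7.56 TiB of data currently on the filesystem, something over 2 TiB has to be deleted (plus some slack for the balance to work in) before the conversion can complete.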