Re: BTRFS did its job nicely (thanks!)
Sterling Windmill wrote: Out of curiosity, what led to you choosing RAID1 for data but RAID10 for metadata? I've flip-flopped between these two modes myself after finding out that BTRFS RAID10 doesn't work how I would've expected. Wondering what made you choose your configuration. Thanks!

Sure. The "RAID"1 profile for data was chosen to maximize disk space utilization, since I have a lot of mixed-size devices. The "RAID"10 profile for metadata was chosen simply because it *feels* a bit faster for some of my (previous) workload, which was reading a lot of small files (which I guess were embedded in the metadata). While I never measured any performance increase, the system simply felt smoother (which is strange, since "RAID"10 should hog more disks at once). I would love to try "RAID"10 for both data and metadata, but I have to delete some files first (or add yet another drive).

Would you like to elaborate a bit more yourself about how BTRFS "RAID"10 does not work as you expected? As far as I know, BTRFS' version of "RAID"10 means it ensures 2 copies (1 replica) are striped over as many disks as it can (as long as there is free space). So if I am not terribly mistaken, a "RAID"10 with 20 devices will stripe over (20/2) x 2, and if you run out of space on 10 of the devices it will continue to stripe over (10/2) x 2. So your stripe width varies with the available space, essentially... I may be terribly wrong about this (until someone corrects me, that is...)
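If my reading above holds, the allocation-time stripe width can be modelled roughly like this. This is a toy sketch of my understanding only, not the actual kernel chunk allocator:

```python
# Rough model of btrfs RAID10 chunk allocation width (a guess at the
# behaviour described above, NOT the real kernel allocator): each new
# chunk stripes 2 copies across as many devices as currently have free
# space, in pairs.

def raid10_stripe_width(devices_with_free_space):
    """Devices a new RAID10 chunk spans: an even number, at least 4,
    using as many devices with free space as possible."""
    n = devices_with_free_space
    if n < 4:
        return 0          # RAID10 needs at least 4 devices
    return n - (n % 2)    # pairs only: 2 copies over n//2 stripes

# 20 devices with space -> stripe over all 20 (10 stripes x 2 copies)
print(raid10_stripe_width(20))  # 20
# 10 of them full -> new chunks stripe over the remaining 10
print(raid10_stripe_width(10))  # 10
# odd counts round down to the nearest pair
print(raid10_stripe_width(5))   # 4
```

So the width degrades gracefully as devices fill up, which matches the "varies with available space" observation.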
Re: BTRFS did its job nicely (thanks!)
Duncan wrote: waxhead posted on Fri, 02 Nov 2018 20:54:40 +0100 as excerpted: Note that I tend to interpret the btrfs de st / output as if the error was NOT fixed even if (seems clearly that) it was, so I think the output is a bit misleading... just saying... See the btrfs-device manpage, stats subcommand, -z|--reset option, and device stats section: -z|--reset Print the stats and reset the values to zero afterwards. DEVICE STATS The device stats keep a persistent record of several error classes related to doing IO. The current values are printed at mount time and updated during filesystem lifetime or from a scrub run. So stats keeps a count of historic errors and is only reset when you specifically reset it, *NOT* when the error is fixed.

Yes, I am perfectly aware of all that. The issue I have is that the manpage describes corruption errors as "A block checksum mismatched or corrupted metadata header was found". This does not tell me if this was a permanent corruption or if it was fixed. That is why I think the output is a bit misleading (and I should have said that more clearly). My point being that "btrfs device stats /mnt" would have been a lot easier to read and understand if it distinguished between permanent corruption, i.e. unfixable errors, vs. fixed errors.

(There's actually a recent patch, I believe in the current dev kernel 4.20/5.0, that will reset a device's stats automatically for the btrfs replace case when it's actually a different device afterward anyway. Apparently, it doesn't even do /that/ automatically yet. Keep that in mind if you replace that device.)

Oh, thanks for the heads up. I was under the impression that the device stats were tracked by btrfs devid, but apparently they are (were) not. Good to know!
BTRFS did its job nicely (thanks!)
Hi, my main computer runs on a 7x SSD BTRFS as rootfs with data:RAID1 and metadata:RAID10. One SSD is probably about to fail, and it seems that BTRFS fixed it nicely (thanks everyone!) I decided to just post the ugly details in case someone just wants to have a look. Note that I tend to interpret the btrfs de st / output as if the error was NOT fixed even if (seems clearly that) it was, so I think the output is a bit misleading... just saying...

--- below are the details for those curious (just for fun) ---

scrub status for [YOINK!]
        scrub started at Fri Nov 2 17:49:45 2018 and finished after 00:29:26
        total bytes scrubbed: 1.15TiB with 1 errors
        error details: csum=1
        corrected errors: 1, uncorrectable errors: 0, unverified errors: 0

btrfs fi us -T /
Overall:
    Device size:                 1.18TiB
    Device allocated:            1.17TiB
    Device unallocated:          9.69GiB
    Device missing:                0.00B
    Used:                        1.17TiB
    Free (estimated):            6.30GiB  (min: 6.30GiB)
    Data ratio:                     2.00
    Metadata ratio:                 2.00
    Global reserve:            512.00MiB  (used: 0.00B)

              Data       Metadata   System
Id Path       RAID1      RAID10     RAID10     Unallocated
-- ---------  ---------  ---------  ---------  -----------
 6 /dev/sda1  236.28GiB  704.00MiB   32.00MiB    485.00MiB
 7 /dev/sdb1  233.72GiB    1.03GiB   32.00MiB      2.69GiB
 2 /dev/sdc1  110.56GiB  352.00MiB          -    904.00MiB
 8 /dev/sdd1  234.96GiB    1.03GiB   32.00MiB      1.45GiB
 1 /dev/sde1  164.90GiB    1.03GiB   32.00MiB      1.72GiB
 9 /dev/sdf1  109.00GiB    1.03GiB   32.00MiB    744.00MiB
10 /dev/sdg1  107.98GiB    1.03GiB   32.00MiB      1.74GiB
-- ---------  ---------  ---------  ---------  -----------
   Total      598.70GiB    3.09GiB   96.00MiB      9.69GiB
   Used       597.25GiB    1.57GiB  128.00KiB

uname -a
Linux main 4.18.0-2-amd64 #1 SMP Debian 4.18.10-2 (2018-10-07) x86_64 GNU/Linux

btrfs --version
btrfs-progs v4.17

dmesg | grep -i btrfs
[7.801817] Btrfs loaded, crc32c=crc32c-generic
[8.163288] BTRFS: device label btrfsroot devid 10 transid 669961 /dev/sdg1
[8.163433] BTRFS: device label btrfsroot devid 9 transid 669961 /dev/sdf1
[8.163591] BTRFS: device label btrfsroot devid 1 transid 669961 /dev/sde1
[8.163734] BTRFS: device label btrfsroot devid 8 transid 669961 /dev/sdd1
[8.163974] BTRFS: device label btrfsroot devid 2 transid 669961 /dev/sdc1
[8.164117] BTRFS: device label btrfsroot devid 7 transid 669961 /dev/sdb1
[8.164262] BTRFS: device label btrfsroot devid 6 transid 669961 /dev/sda1
[8.206174] BTRFS info (device sde1): disk space caching is enabled
[8.206236] BTRFS info (device sde1): has skinny extents
[8.348610] BTRFS info (device sde1): enabling ssd optimizations
[8.854412] BTRFS info (device sde1): enabling free space tree
[8.854471] BTRFS info (device sde1): using free space tree
[ 68.170580] BTRFS warning (device sde1): csum failed root 3760 ino 3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
[ 68.185973] BTRFS warning (device sde1): csum failed root 3760 ino 3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
[ 68.185991] BTRFS warning (device sde1): csum failed root 3760 ino 3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
[ 68.186003] BTRFS warning (device sde1): csum failed root 3760 ino 3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
[ 68.186015] BTRFS warning (device sde1): csum failed root 3760 ino 3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
[ 68.186028] BTRFS warning (device sde1): csum failed root 3760 ino 3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
[ 68.186041] BTRFS warning (device sde1): csum failed root 3760 ino 3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
[ 68.186052] BTRFS warning (device sde1): csum failed root 3760 ino 3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
[ 68.186063] BTRFS warning (device sde1): csum failed root 3760 ino 3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
[ 68.186075] BTRFS warning (device sde1): csum failed root 3760 ino 3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
[ 68.199237] BTRFS info (device sde1): read error corrected: ino 3247424 off 36700160 (dev /dev/sda1 sector 244987192)
[ 68.202602] BTRFS info (device sde1): read error corrected: ino 3247424 off 36704256 (dev /dev/sda1 sector 244987192)
[ 68.203176] BTRFS info (device sde1): read error corrected: ino 3247424 off 36712448 (dev /dev/sda1 sector 244987192)
[ 68.206762] BTRFS info (device sde1): read error corrected: ino 3247424 off 36708352 (dev /dev/sda1 sector 244987192)
[ 68.212071] BTRFS info
BTRFS bad block management. Does it exist?
In case BTRFS fails to WRITE to a disk, what happens? Does the bad area get mapped out somehow? Does it try again until it succeeds, or until it "times out" or reaches a threshold counter? Does it eventually try to write to a different disk (in case of using the raid1/10 profiles)?
Re: lazytime mount option—no support in Btrfs
Adam Hunt wrote: Back in 2014 Ted Tso introduced the lazytime mount option for ext4, and shortly thereafter a more generic VFS implementation which was then merged into mainline. His early patches included support for Btrfs, but those changes were removed prior to the feature being merged. His changelog includes the following note about the removal:

- Per Christoph's suggestion, drop support for btrfs and xfs for now, issues with how btrfs and xfs handle dirty inode tracking. We can add btrfs and xfs support back later or at the end of this series if we want to revisit this decision.

My reading of the current mainline shows that Btrfs still lacks any support for lazytime. Has any thought been given to adding support for lazytime to Btrfs? Thanks, Adam

Is there any news regarding this?
Re: [PATCH 0/4] 3- and 4- copy RAID1
Hugo Mills wrote: On Wed, Jul 18, 2018 at 08:39:48AM +, Duncan wrote: Duncan posted on Wed, 18 Jul 2018 07:20:09 + as excerpted: Perhaps it's a case of coder's view (no code doing it that way, it's just a coincidental oddity conditional on equal sizes), vs. sysadmin's view (code or not, accidental or not, it's a reasonably accurate high-level description of how it ends up working most of the time with equivalent sized devices).) Well, it's an *accurate* observation. It's just not a particularly *useful* one. :) Hugo.

A bit off topic perhaps - but I've got to give it a go: Pretty please with sugar, nuts, a cherry and chocolate sprinkles dipped in syrup and coated with ice cream on top, would it not be about time to update your online btrfs-usage calculator (which is insanely useful in so many ways) to support the new modes!? In fact it would be great, or even better, as a CLI tool. And yes, a while ago I toyed with porting it to C, mostly for my own use, but never got that far. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/4] 3- and 4- copy RAID1
waxhead wrote: David Sterba wrote: An interesting question is the naming of the extended profiles. I picked something that can be easily understood but it's not a final proposal. Years ago, Hugo proposed a naming scheme that described the non-standard raid varieties of the btrfs flavor: https://marc.info/?l=linux-btrfs=136286324417767 Switching to this naming would be a good addition to the extended raid.

As just a humble BTRFS user I agree, and I really think it is about time to move far away from the RAID terminology. However, adding some more descriptive profile names (or at least some aliases) would be much better for the commoners (such as myself). ...snip... > Which would make the above table look like so:

Old format / My Format / My suggested alias
SINGLE  / R0.S0.P0 / SINGLE
DUP     / R1.S1.P0 / DUP (or even MIRRORLOCAL1)
RAID0   / R0.Sm.P0 / STRIPE
RAID1   / R1.S0.P0 / MIRROR1
RAID1c3 / R2.S0.P0 / MIRROR2
RAID1c4 / R3.S0.P0 / MIRROR3
RAID10  / R1.Sm.P0 / STRIPE.MIRROR1
RAID5   / R1.Sm.P1 / STRIPE.PARITY1
RAID6   / R1.Sm.P2 / STRIPE.PARITY2

And I think this is much more readable, but others may disagree. And as a side note... from a (hobby) coder's perspective this is probably simpler to parse as well. ...snap...

...and before someone else points out that my suggestion has an ugly flaw: I got a bit copy/paste happy and messed up the RAID 5 and 6 like profiles.
The table below is corrected, and hopefully it makes the point why using the word 'replicas' is easier to understand than 'copies', even if I messed it up :)

Old format / My Format / My suggested alias
SINGLE  / R0.S0.P0 / SINGLE
DUP     / R1.S1.P0 / DUP (or even MIRRORLOCAL1)
RAID0   / R0.Sm.P0 / STRIPE
RAID1   / R1.S0.P0 / MIRROR1
RAID1c3 / R2.S0.P0 / MIRROR2
RAID1c4 / R3.S0.P0 / MIRROR3
RAID10  / R1.Sm.P0 / STRIPE.MIRROR1
RAID5   / R0.Sm.P1 / STRIPE.PARITY1
RAID6   / R0.Sm.P2 / STRIPE.PARITY2
Re: [PATCH 0/4] 3- and 4- copy RAID1
David Sterba wrote: An interesting question is the naming of the extended profiles. I picked something that can be easily understood but it's not a final proposal. Years ago, Hugo proposed a naming scheme that described the non-standard raid varieties of the btrfs flavor: https://marc.info/?l=linux-btrfs=136286324417767 Switching to this naming would be a good addition to the extended raid.

As just a humble BTRFS user I agree, and I really think it is about time to move far away from the RAID terminology. However, adding some more descriptive profile names (or at least some aliases) would be much better for the commoners (such as myself). For example:

Old format / New Format / My suggested alias
SINGLE  / 1C     / SINGLE
DUP     / 2CD    / DUP (or even MIRRORLOCAL1)
RAID0   / 1CmS   / STRIPE
RAID1   / 2C     / MIRROR1
RAID1c3 / 3C     / MIRROR2
RAID1c4 / 4C     / MIRROR3
RAID10  / 2CmS   / STRIPE.MIRROR1
RAID5   / 1CmS1P / STRIPE.PARITY1
RAID6   / 1CmS2P / STRIPE.PARITY2

I find that writing something like "btrfs balance start -dconvert=stripe5.parity2 /mnt" is far less confusing and therefore less error prone than writing "-dconvert=1C5S2P". While Hugo's suggestion is compact and to the point, I would call for expanding it so it is a bit more descriptive and human readable. So for example: STRIPE<n>, where <n> is obviously the same as Hugo proposed - the number of storage devices for the stripe - and no <n> would best mean 'use max devices'. For PARITY a <n> is then obviously required. Keep in mind that most people (...and I am willing to bet even Duncan, who probably HAS backups ;) ) get a bit stressed when their storage system is degraded. With that in mind I hope for more elaborate, descriptive and human readable profile names to be used, to avoid making mistakes with the "compact" layout. ...and yes, of course this could go both ways.
A more compact (and dare I say cryptic) variant can cause people to stop and think before doing something and thus avoid errors. Now that I made my point, I can't help being a bit extra harsh, obnoxious and possibly difficult, so I would also suggest that Hugo's format could have been changed (dare I say improved?) from <num>COPIES<num>STRIPES<num>PARITY to REPLICAS<num>.STRIPES<num>.PARITY<num>. Which would make the above table look like so:

Old format / My Format / My suggested alias
SINGLE  / R0.S0.P0 / SINGLE
DUP     / R1.S1.P0 / DUP (or even MIRRORLOCAL1)
RAID0   / R0.Sm.P0 / STRIPE
RAID1   / R1.S0.P0 / MIRROR1
RAID1c3 / R2.S0.P0 / MIRROR2
RAID1c4 / R3.S0.P0 / MIRROR3
RAID10  / R1.Sm.P0 / STRIPE.MIRROR1
RAID5   / R1.Sm.P1 / STRIPE.PARITY1
RAID6   / R1.Sm.P2 / STRIPE.PARITY2

And I think this is much more readable, but others may disagree. And as a side note... from a (hobby) coder's perspective this is probably simpler to parse as well.
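To back up the parsing claim: here is a throwaway sketch (my own toy code, nothing like this exists in btrfs-progs) of how trivially the proposed R<num>.S<num>.P<num> notation splits apart:

```python
import re

# Toy parser for the proposed Rn.Sn.Pn profile notation (illustrative
# only -- not part of btrfs-progs). 'm' in the stripes position means
# "as many devices as available".

PROFILE_RE = re.compile(r'^R(\d+)\.S(\d+|m)\.P(\d+)$')

def parse_profile(s):
    m = PROFILE_RE.match(s)
    if not m:
        raise ValueError(f"bad profile string: {s}")
    stripes = m.group(2)  # keep 'm' as-is, convert numbers
    return {
        'replicas': int(m.group(1)),
        'stripes': stripes if stripes == 'm' else int(stripes),
        'parity': int(m.group(3)),
    }

# RAID10-like profile from the table above
print(parse_profile("R1.Sm.P0"))  # {'replicas': 1, 'stripes': 'm', 'parity': 0}
# RAID6-like profile
print(parse_profile("R0.Sm.P2"))  # {'replicas': 0, 'stripes': 'm', 'parity': 2}
```

One regular expression and a dict: hard to get much simpler than that from a parsing point of view.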
Re: unsolvable technical issues?
Chris Murphy wrote: On Thu, Jun 21, 2018 at 5:13 PM, waxhead wrote: According to this: https://stratis-storage.github.io/StratisSoftwareDesign.pdf Page 4, section 1.2 It claims that BTRFS still has significant technical issues that may never be resolved. Could someone shed some light on exactly what these technical issues might be?! What are BTRFS' biggest technical problems? I think it's appropriate to file an issue and ask what they're referring to. It very well might be use case specific to Red Hat. https://github.com/stratis-storage/stratis-storage.github.io/issues I also think it's appropriate to crosslink: include the URL for the start of this thread in the issue, and the issue URL in this thread. https://github.com/stratis-storage/stratis-storage.github.io/issues/1

Apparently the author has toned down the wording a bit; this confirms that the claim was without basis and probably based on "popular myth". The document the PDF links to is not yet updated.
Re: unsolvable technical issues?
David Sterba wrote: On Fri, Jun 22, 2018 at 01:13:31AM +0200, waxhead wrote: According to this: https://stratis-storage.github.io/StratisSoftwareDesign.pdf Page 4, section 1.2 It claims that BTRFS still has significant technical issues that may never be resolved. Could someone shed some light on exactly what these technical issues might be?! What are BTRFS' biggest technical problems? The subject you write is 'unsolvable', which I read as 'impossible to solve', eg. on the design level. I'm not aware of such issues.

Alright, so I interpret this as there being no showstopper regarding implementation of existing and planned features...

If this is about issues that are difficult either to implement or getting right, there are a few known ones.

Alright again, and I interpret this as there might be some code that is not flexible enough, and changing that might affect working / stable parts of the code, so other solutions are looked at instead - which is not that uncommon for software. Apart from not listing the known issues, I think I got my questions answered :) and now it is perhaps finally appropriate to file a request at the Stratis bugtracker to ask what specifically they are referring to.

If you forget about the "RAID"5/6 like features, then the only annoyances that I have with BTRFS so far are...
1. Lack of per subvolume "RAID" levels
2. Lack of using the deviceid to re-discover and re-add dropped devices
And that's about it really...

This could quickly turn into a 'my favourite bug/feature' list that can be very long. The most asked for are raid56, and performance of qgroups. Qu Wenruo improved some of the core problems and Jeff is working on the performance problem. So there are people working on that. On the raid56 front, there were some recent updates that fixed some bugs, but the fix for the write hole is still missing so we can't raise the status yet. I have some good news but nobody should get too excited until the code lands.
I have a prototype for the N-copy raid (where N is 3 or 4). This will provide the underlying infrastructure for the raid5/6 logging mechanism; the rest can be taken from Liu Bo's patchset sent some time ago. In the end the N-copy can be used for data and metadata too, independently and flexibly switched via the balance filters. This will cost one incompatibility bit.

I hope I am not asking for too much (but I know I probably am), but I suggest that having a small snippet of information on the status page, showing a little bit about what is currently the development focus or what people are known to be working on, would be very valuable for users. It may of course work both ways, such as exciting people or calming them down. ;) For example something simple like a "development focus" list...

2018-Q4: (planned) Renaming the grotesque "RAID" terminology
2018-Q3: (planned) Magical feature X
2018-Q2: N-way mirroring
2018-Q1: Feature work "RAID"5/6

I think it would be good for people living their lives outside the mailing list, as it would perhaps spark some attention from developers and perhaps even media as well.
Re: unsolvable technical issues?
Jukka Larja wrote: waxhead wrote on 24.6.2018 at 1.01: Nikolay Borisov wrote: On 22.06.2018 02:13, waxhead wrote: According to this: https://stratis-storage.github.io/StratisSoftwareDesign.pdf Page 4, section 1.2 It claims that BTRFS still has significant technical issues that may never be resolved. Could someone shed some light on exactly what these technical issues might be?! What are BTRFS' biggest technical problems? That's a question that needs to be directed at the author of the statement. I think not, and here's why: I am asking the BTRFS developers a general question, with some basis as to why I became curious. The question is simply what (if any) are the biggest technical issues in BTRFS, because one must expect that if anyone is going to give me a credible answer it must be the people that hack on BTRFS and understand what they are working on, and not the Stratis guys. It would surprise me if they knew better than the BTRFS devs.

I think the problem with that question is that it is too general. Duncan's post already highlights several things that could be a significant problem for some users while being a non-issue for most. Without a more specific problem description, the best you can hope for is speculation on things that Btrfs currently does badly. -Jukka Larja

Well, I still don't agree (apparently I am starting to become difficult). There is a "roadmap" on the BTRFS wiki that describes features implemented and features planned, for example. Naturally people are working on improvements to existing features and prep-work for new features. If some of this work is not moving ahead due to design issues, it sounds likely that someone would know about it by now.
Re: unsolvable technical issues?
Nikolay Borisov wrote: On 22.06.2018 02:13, waxhead wrote: According to this: https://stratis-storage.github.io/StratisSoftwareDesign.pdf Page 4, section 1.2 It claims that BTRFS still has significant technical issues that may never be resolved. Could someone shed some light on exactly what these technical issues might be?! What are BTRFS' biggest technical problems? That's a question that needs to be directed at the author of the statement.

I think not, and here's why: I am asking the BTRFS developers a general question, with some basis as to why I became curious. The question is simply what (if any) are the biggest technical issues in BTRFS, because one must expect that if anyone is going to give me a credible answer it must be the people that hack on BTRFS and understand what they are working on, and not the Stratis guys. It would surprise me if they knew better than the BTRFS devs. And yes, absolutely, I do understand why one would want to direct that to the author of the statement, as this claim is as far as I can tell completely without basis, and we all know that extraordinary claims require extraordinary evidence, right? I do however feel that I should educate myself a bit on BTRFS to have some sort of basis to work on before confronting the Stratis guys and risk ending up as the middle man in a potential email flame war. So again, does BTRFS have any *known* major technical obstacles which the devs are having a hard time solving? (Duncan already gave the best answer so far.)

PS! I have a tendency to sound a bit aggressive / harsh. I assure you all that it is not my intent. I am simply trying to get some knowledge of a filesystem (that interests me a lot) before trying to validate a "third party" claim.
unsolvable technical issues?
According to this: https://stratis-storage.github.io/StratisSoftwareDesign.pdf Page 4, section 1.2 It claims that BTRFS still has significant technical issues that may never be resolved. Could someone shed some light on exactly what these technical issues might be?! What are BTRFS' biggest technical problems?

If you forget about the "RAID"5/6 like features, then the only annoyances that I have with BTRFS so far are...
1. Lack of per subvolume "RAID" levels
2. Lack of using the deviceid to re-discover and re-add dropped devices
And that's about it really...
Re: RAID56
Gandalf Corvotempesta wrote: Another kernel release was made. Any improvements in RAID56? I didn't see any changes in that sector; is something still being worked on, or is it stuck waiting for something? Based on the official BTRFS status page, RAID56 is the only "unstable" item marked in red. No interest from SUSE in fixing that? I think it's the real missing part for a feature-complete filesystem. Nowadays parity raid is mandatory; we can't rely on mirroring alone.

First of all: I am not a BTRFS developer, but I follow the mailing list closely, and I too have a particular interest in the "RAID"5/6 feature, which realistically is probably about 3-4 years (if not more) in the future. From what I am able to understand, the pesky write hole is one of the major obstacles to having BTRFS "RAID"5/6 work reliably. There were patches to fix this a while ago, but whether these patches are to be classified as a workaround or actually as "the darn thing done right" is perhaps up for discussion. In general there seems to be a lot more momentum on the "RAID"5/6 feature now compared to earlier. There also seems to be a lot of focus on fixing bugs and running tests as well. This is why I am guessing that 3-4 years ahead is an absolute minimum until "RAID"5/6 might be somewhat reliable and usable.

There are a few other basics missing that may be acceptable for you as long as you know about them. For example, as far as I know BTRFS still does not use the "device-id" (or "BTRFS internal number") to keep track of storage devices. This means that if you have a multi storage device filesystem with, for example, /dev/sda /dev/sdb /dev/sdc etc., and /dev/sdc disappears and shows up again as /dev/sdx, then BTRFS would not recognize this and would happily try to continue writing to /dev/sdc even if it does not exist.
...and perhaps even worse - I can imagine that if you swap device ordering and a different device takes /dev/sdc's place, then BTRFS *could* overwrite data on this device - possibly making a real mess of things. I am not sure if this holds true, but if it does, it's for sure a real nugget of basic functionality missing right there. BTRFS also so far has no automatic "drop device" function, e.g. it will not automatically kick out a storage device that is throwing lots of errors and causing delays etc. There may be benefits to keeping this design of course, but for some, dropping the device might be desirable. And no hot-spare "or hot-(reserved-)space" (which would be more accurate in BTRFS terms) is implemented either, and that is one good reason to keep an eye on your storage pool.

What you *might* consider is to have your metadata in "RAID"1 or "RAID"10 and your data in "RAID"5 or even "RAID"6, so that if you run into problems then you might in the worst case lose some data; but since "RAID"1/10 is beginning to be rather mature, it is likely that your filesystem will survive a disk failure. So if you are prepared to perhaps lose a file or two, but want to feel confident that your filesystem survives and will give you a report about what file(s) are toast, then this may be acceptable for you, as you can always restore from backups (because you do have backups, right? If not, read 'any' of Duncan's posts - he explains better than most people why you need and should have backups!)

Now keep in mind that this is just a humble user's analysis of the situation, based on whatever I have picked up from the mailing list, which may or may not be entirely accurate - so take it for what it is!
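To illustrate the device tracking point above with a toy model (my own mock, not actual btrfs code): a table keyed by the volatile kernel path goes stale when a device re-enumerates, while a table keyed by a persistent devid can simply follow the device to its new path:

```python
# Toy illustration of why tracking devices by kernel path is fragile
# (a mock, NOT btrfs internals): if a device drops and reappears under
# a new path, the path-keyed table still points at the old, gone name,
# while the devid-keyed table just updates and keeps working.

path_table = {"/dev/sdc": "devid 3"}   # keyed by volatile device path
devid_table = {3: "/dev/sdc"}          # keyed by persistent devid

# The device drops off the bus and re-enumerates as /dev/sdx.
devid_table[3] = "/dev/sdx"            # devid tracking: update the path, done

# Path tracking is now stale: writes would still target /dev/sdc,
# a name that no longer refers to this device (or to any device).
print("/dev/sdc" in path_table)        # True  -- stale, dangerous
print(devid_table[3])                  # /dev/sdx -- correct
```

The worse case hinted at above is when a *different* device inherits the old path: the stale path-keyed entry then points at the wrong disk entirely.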
Re: replace drive with write_io_errs?
Adam Bahe wrote: Hello all,

'All' includes me as well, but keep in mind I am not a BTRFS dev.

I have a drive that has been in my btrfs array for about 6 months now. It was purchased new. It's an IBM-ESXS SAS drive, rebranded from an HGST HUH721010AL4200. Here are the stats; it passed a long smartctl test, but I'm not sure what to make of it.

[/dev/sdi].write_io_errs    1823
[/dev/sdi].read_io_errs     0
[/dev/sdi].flush_io_errs    0
[/dev/sdi].corruption_errs  0
[/dev/sdi].generation_errs  0

Just a few observations. You are more likely to get (faster) help from the friendly devs here if you provide the output of...

btrfs --version
uname -a
btrfs filesystem show

Have you gone through the "regular" stuff?! E.g. things like bad cables, rerouting cables, checking your power supply (noise, correct voltages), temperature (your drive is not *that* far off the trip temperature; if it is 52 C I imagine it could easily hit 65 C with a bit of load), trying to eliminate other hardware, sound cards, graphics cards etc... If you run your array in a USB enclosure weird things may/will/(have to) happen.

=== START OF INFORMATION SECTION ===
Vendor:               IBM-ESXS
Product:              HUH721010AL4200
Revision:             J6R2
User Capacity:        9,931,038,130,176 bytes [9.93 TB]
Logical block size:   4096 bytes
Formatted with type 2 protection
Logical block provisioning type unreported, LBPME=0, LBPRZ=0
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca266a405e4
Serial number:        *YOINK*
Device type:          disk
Transport protocol:   SAS
Local Time is:        Sat May 12 03:06:35 2018 CDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Current Drive Temperature:     52 C
Drive Trip Temperature:        65 C
Manufactured in week 33 of year 2017
Specified cycle count over device lifetime:  5
Accumulated start-stop cycles:  28
Specified load-unload count over device lifetime:  60
Accumulated load-unload cycles:  170
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 1848304782540800

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0      1096283      14317.360         0
write:         0        0         0         0         2906      27801.489         0
verify:        0        0         0         0        13027          0.000         0

Non-medium error count:        0

SMART Self-test log
Num  Test               Status     segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                   number   (hours)
# 1  Background long    Completed       -      2466              - [-   -    -]
# 2  Background short   Completed       -      2448              - [-   -    -]

Long (extended) Self Test duration: 65535 seconds [1092.2 minutes]

I have not seen the 'correction algorithm invocations' before, but I expect that such a large drive probably does some of this as part of regular use. If the number is significantly higher than on your other drives (given the same load) I would suspect something is fishy with your drive. But then again, it's better to ask someone else.

I can't RMA the drive as I have no idea how or where to RMA an IBM branded HGST drive. So if on the off chance someone here is reading this who can point me in the right direction, let me know where to RMA an IBM standalone drive with no FRU.

Uhm... can't you just return the drive where you purchased it?

But is this drive healthy or should I have it replaced? What is the extent of a write_io_err? Are they somewhat common or a sign of a bad drive? A scrub returned no errors.
The manual is a bit hard to understand: https://btrfs.wiki.kernel.org/index.php/Manpage/btrfs-device It does not clearly say what happens if you have a redundant storage profile for your (meta)data. Would a write be redirected to another copy? If yes, would it retry the original write? I *assume* that as long as you don't get any write errors in your application it works. But perhaps someone else cares to explain this better (by preferably updating the manual/wiki).

Also, what about the correction algorithm invocations? All of my IBM drives seem to have those, whereas all of my other drives do not. I was curious about that too, if anyone knows. Thanks!
Re: RAID56 - 6 parity raid
Andrei Borzenkov wrote: 02.05.2018 21:17, waxhead wrote: Goffredo Baroncelli wrote: On 05/02/2018 06:55 PM, waxhead wrote: So again, which problem would having the parity checksummed solve? To the best of my knowledge, nothing. In any case the data is checksummed so it is impossible to return corrupted data (modulo bugs :-) ). I am not a BTRFS dev, but this should be quite easy to answer. Unless you checksum the parity there is no way to verify that the data (parity) you use to reconstruct other data is correct. In any case you could catch that the computed data is wrong, because the data is always checksummed. And in any case you must check the data against its checksum. What if you lost an entire disk? How does it matter exactly? RAID is per chunk anyway. It does not matter. I was wrong; I got bitten by thinking about BTRFS "RAID5" as normal RAID5. Again a good reason to change the naming for it, I think... or had corruption for both data AND checksum? By the same logic you may have corrupted parity and its checksum. Yup. Indeed. How do you plan to safely reconstruct that without checksummed parity? Define "safely". The main problem of the current RAID56 implementation is that a stripe is not updated atomically (at least, that is what I understood from past discussions), and this is not solved by having an extra parity checksum. So how exactly is "safety" improved here? You still need the overall checksum to verify the result of reconstruction, so what exactly does the extra parity checksum buy you? > [...] Again - please describe when having a parity checksum will be beneficial over the current implementation. You do not reconstruct anything as long as all data strips are there, so the parity checksum will not be used. If one data strip fails (including its checksum) it will be reconstructed and verified. If the parity itself is corrupted, checksum verification fails (hopefully). How is that different from verifying the parity checksum before reconstructing? 
In both cases data cannot be reconstructed, end of story. Ok, before attempting an answer I have to admit that I do not know enough about how RAID56 is laid out on disk in BTRFS terms. Is data checksummed per stripe or per disk? Is parity calculated on the data only, or on the data+checksum?
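To make the exchange above concrete, here is a tiny toy model of single-parity reconstruction (plain Python, nothing btrfs-specific, all names mine; btrfs actually uses crc32c, not sha256). It illustrates the point both sides agree on: even without a checksum on the parity itself, a corrupted parity block produces a reconstructed strip that fails the *data* checksum, so the corruption is still caught. The debate above is only about whether checking a parity checksum first would add safety or just write amplification.

```python
import hashlib

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def checksum(data: bytes) -> bytes:
    # Stand-in for btrfs's per-block data checksum (crc32c in reality).
    return hashlib.sha256(data).digest()

# A toy 3-strip stripe: two data strips plus one XOR parity strip.
d0 = b"A" * 16
d1 = b"B" * 16
parity = xor(d0, d1)
sums = {"d0": checksum(d0), "d1": checksum(d1)}  # only data is checksummed

def reconstruct_d1(surviving: bytes, par: bytes) -> bytes:
    # Rebuild the lost strip from the surviving strip and the parity.
    return xor(surviving, par)

# Clean parity: reconstruction verifies against the stored data checksum.
assert checksum(reconstruct_d1(d0, parity)) == sums["d1"]

# Corrupted parity: reconstruction is wrong, but the data checksum catches
# it, even though the parity carries no checksum of its own.
bad_parity = bytes([parity[0] ^ 0xFF]) + parity[1:]
assert checksum(reconstruct_d1(d0, bad_parity)) != sums["d1"]
```

What a parity checksum cannot model away is the atomic-stripe-update problem Andrei mentions; this sketch only covers detection, not the write hole.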
Re: RAID56 - 6 parity raid
Goffredo Baroncelli wrote: On 05/02/2018 06:55 PM, waxhead wrote: So again, which problem would having the parity checksummed solve? To the best of my knowledge, nothing. In any case the data is checksummed so it is impossible to return corrupted data (modulo bugs :-) ). I am not a BTRFS dev, but this should be quite easy to answer. Unless you checksum the parity there is no way to verify that the data (parity) you use to reconstruct other data is correct. In any case you could catch that the computed data is wrong, because the data is always checksummed. And in any case you must check the data against its checksum. What if you lost an entire disk? Or had corruption for both data AND checksum? How do you plan to safely reconstruct that without checksummed parity? My point is that storing the checksum is a cost that you pay *every time*. Every time you update a part of a stripe you need to update the parity, and then in turn the parity checksum. It is not a problem of space occupied nor a computational problem. It is a problem of write amplification... How much of a problem is this? No benchmarks have been run, I suppose, since the feature is not there yet. The only gain is to avoid trying to use the parity when a) you need it (i.e. when the data is missing and/or corrupted) I'm not sure I can make out your argument here, but with RAID5/6 you don't have another copy to restore from. You *have* to use the parity to reconstruct data, and it is a good thing if this data is trusted. and b) it is corrupted. But the likelihood of this case is very low. And you can catch it during the data checksum check (which has to be performed in any case!). So on one side you have a *cost every time* (the write amplification); on the other side you have a gain (cpu-time) *only in case* the parity is corrupted and you need it (e.g. scrub or corrupted data). 
IMHO the costs are far higher than the gain, and the likelihood of the gain is very low compared to the likelihood (=100%, i.e. always) of the cost. Then run benchmarks and consider making parity checksums optional (but pretty please, dipped in syrup with sugar on top - keep it on by default).
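Goffredo's write-amplification argument can be put in rough numbers. The accounting below is my own simplification, not btrfs's real I/O path (it ignores batching of checksum updates into shared metadata blocks, which would soften the difference), but it shows the shape of the trade-off: the parity checksum adds one more write to every partial stripe update.

```python
def writes_per_stripe_update(parity_checksummed: bool) -> int:
    """Hypothetical count of device writes for updating one data strip
    in a RAID5 stripe. Simplified model, not actual btrfs behavior."""
    writes = 1          # the modified data strip itself
    writes += 1         # its data checksum (needed in any case)
    writes += 1         # the recomputed parity strip
    if parity_checksummed:
        writes += 1     # the parity's own checksum as well
    return writes

# The cost is paid on every update; the benefit only when parity is
# both corrupted and needed, which is what the post above argues.
assert writes_per_stripe_update(False) == 3
assert writes_per_stripe_update(True) == 4
```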
Re: RAID56 - 6 parity raid
Goffredo Baroncelli wrote: Hi On 05/02/2018 03:47 AM, Duncan wrote: Gandalf Corvotempesta posted on Tue, 01 May 2018 21:57:59 + as excerpted: Hi to all I've found some patches from Andrea Mazzoleni that add support for up to 6-parity raid. Why weren't these merged? With modern disk sizes, having something greater than 2-parity would be great. 1) [...] the parity isn't checksummed, Why is the fact that the parity is not checksummed a problem? I read several times that this is a problem. However each time the thread reached the conclusion that... it is not a problem. So again, which problem would having the parity checksummed solve? To the best of my knowledge, nothing. In any case the data is checksummed so it is impossible to return corrupted data (modulo bugs :-) ). I am not a BTRFS dev, but this should be quite easy to answer. Unless you checksum the parity there is no way to verify that the data (parity) you use to reconstruct other data is correct. On the other side, having the parity checksummed would increase both the code complexity and the write amplification, because every time a part of the stripe is touched not only the parity has to be updated, but also the checksum. Which is a good thing. BTRFS's main selling point is that you can feel pretty confident that whatever you put in is exactly what you get out.
libbrtfsutil questions
Howdy! I am pondering writing a little C program that uses libmicrohttpd and libbtrfsutil to display some very basic (overview) details about BTRFS. I was hoping to display the same information that 'btrfs fi sh /mnt' and 'btrfs fi us -T /mnt' do, but somewhat combined. Since I recently figured out how easy it is to do SVG graphics I was hoping to try to visualize things a bit. What I was hoping to achieve is: - show all filesystems - ..show all devices in a filesystem (and mark missing devices clearly) - show usage and/or allocation for each device - possibly display chunks as blocks (like old defrag programs) where the brightness indicates how utilized a (meta)data chunk is - possibly mark devices with errors ('btrfs de st /mnt'). The problem is... I looked at libbtrfsutil and it appears that there is mostly sync + subvolume/snapshot stuff in there. So my question is: Is libbtrfsutil the right choice, intended to at some point (in the future?) supply me with the data I need for these things, or should I look elsewhere? PS! This is a completely private project for my own egoistic reasons. However if it turns out to be useful and the code is not too embarrassing I am happy to put the code into the public domain... if it ever gets written :S
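The "chunks as blocks with brightness" idea above is easy to sketch independently of where the data comes from. The fragment below (Python rather than C for brevity; the input list is made-up sample data standing in for whatever per-chunk utilization source ends up being available) renders chunks as greyscale SVG squares.

```python
def chunk_rects(utilization, cols=8, size=12):
    """Render chunks as SVG squares where brightness tracks how full each
    (meta)data chunk is: 0.0 = empty (black), 1.0 = full (white).
    `utilization` is a plain list of floats - hypothetical input, since
    libbtrfsutil does not currently expose per-chunk data."""
    rects = []
    for i, u in enumerate(utilization):
        x, y = (i % cols) * size, (i // cols) * size
        v = int(round(255 * max(0.0, min(1.0, u))))
        rects.append(f'<rect x="{x}" y="{y}" width="{size}" height="{size}" '
                     f'fill="rgb({v},{v},{v})"/>')
    return ('<svg xmlns="http://www.w3.org/2000/svg">'
            + "".join(rects) + "</svg>")

svg = chunk_rects([0.0, 0.5, 1.0])  # three chunks: empty, half, full
```

Serving the resulting string from a libmicrohttpd handler would then be a separate, purely mechanical step.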
Re: Status of RAID5/6
Liu Bo wrote: On Wed, Mar 21, 2018 at 9:50 AM, Menion wrote: Hi all I am trying to understand the status of RAID5/6 in BTRFS I know that there are some discussions ongoing on the RFC patch proposed by Liu Bo, but it seems that everything stopped last summer. It also mentioned a "separate disk for journal"; does this mean that the final implementation of RAID5/6 will require a dedicated HDD for the journaling? Thanks for the interest in btrfs and raid56. The patch set is to plug the write hole, which is very rare in practice, tbh. The feedback is to use existing space instead of another dedicated "fast device" as the journal, in order to get some extent of raid protection. I'd need some time to pick it up. With that being said, we have several data reconstruction fixes for raid56 (esp. raid6) in 4.15, so I'd say please deploy btrfs with the upstream kernel or some distros which do kernel updates frequently; the most important one is 8810f7517a3b Btrfs: make raid6 rebuild retry more https://patchwork.kernel.org/patch/10091755/ AFAIK, no other data corruptions showed up. I am very interested in the "raid"5/6 like behavior myself. Actually, calling it RAID in the past may have had its benefits, but these days continuing to use the RAID term is not helping. Even technically minded people seem to get confused. For example: It was suggested that "raid"5/6 should have hot-spare support. In BTRFS terms a hot spare device sounds wrong to me, but reserving extra space for a "hot-space" so any "raid"5/6 like system can (auto?) rebalance the missing blocks to the rest of the pool sounds sensible enough (as long as the number of devices allows separating the different bits and pieces). Anyway, I got carried away a bit there. Sorry about that. What I really wanted to comment on is the usability of "raid"5/6. How would a metadata "raid"1 + data "raid"5 or 6 setup really compare to, say, mdraid 5 or 6 from a reliability point of view? 
Sure mdraid has the advantage, but even with the write hole and the risk of corruption of data (not the filesystem), would not BTRFS "in theory" be safer than at least mdraid 5 if run with metadata "raid"5?! You have to run scrub on both mdraid as well as BTRFS to ensure data is not corrupted. PS! It might be worth mentioning that I am slightly affected by a Glenfarclas 105 Whisky while writing this so please bear with me in case something is too far off :)
Re: Crashes running btrfs scrub
Liu Bo wrote: On Sat, Mar 17, 2018 at 5:26 PM, Liu Bo wrote: On Fri, Mar 16, 2018 at 2:46 PM, Mike Stevens wrote: Could you please paste the whole dmesg? It looks like it hit btrfs_abort_transaction(), which should give us more information about what goes wrong. The whole thing is here https://pastebin.com/4ENq2saQ Given this, [ 299.410998] BTRFS: error (device sdag) in btrfs_create_pending_block_groups:10192: errno=-27 unknown it refers to -EFBIG, so I think the warning comes from btrfs_add_system_chunk() { ... if (array_size + item_size + sizeof(disk_key) > BTRFS_SYSTEM_CHUNK_ARRAY_SIZE) { mutex_unlock(&fs_info->chunk_mutex); return -EFBIG; } If that's the case, we need to check this earlier, during mount. I didn't realize this until now: we do have a limitation on how many disks btrfs can handle. In order to make balance/scrub work properly (where system chunks may be set readonly), ((BTRFS_SYSTEM_CHUNK_ARRAY_SIZE / 2) - sizeof(struct btrfs_chunk)) / sizeof(struct btrfs_stripe) + 1 will be the number of disks btrfs can handle at most. Am I understanding this correctly - BTRFS has a limit on the number of physical devices it can handle?! (max 30 devices?!) Or is this referring to the number of devices BTRFS can utilize in a stripe (in which case 30 actually sounds like a high number)? 30 devices is really not that much; heck, you can get 90-disk top-load JBOD storage chassis these days, and BTRFS does sound like an attractive choice for things like that.
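Plugging numbers into Liu Bo's formula makes the limit concrete. The struct sizes below are my reading of the on-disk format (struct btrfs_stripe is 32 bytes; struct btrfs_chunk is 80 bytes including its first embedded stripe, which is what the "+ 1" accounts for; the superblock reserves 2048 bytes for the system chunk array) - worth double-checking against the headers before relying on them.

```python
# Sizes as I understand the btrfs on-disk format (verify before relying on):
BTRFS_SYSTEM_CHUNK_ARRAY_SIZE = 2048   # bytes reserved in the superblock
SIZEOF_BTRFS_CHUNK = 80                # struct btrfs_chunk, incl. 1st stripe
SIZEOF_BTRFS_STRIPE = 32               # struct btrfs_stripe

def max_stripes() -> int:
    # The formula quoted in the thread: half the array is kept in reserve
    # so balance/scrub (which may set system chunks read-only) can still
    # write a second copy of the system chunk item.
    return ((BTRFS_SYSTEM_CHUNK_ARRAY_SIZE // 2 - SIZEOF_BTRFS_CHUNK)
            // SIZEOF_BTRFS_STRIPE) + 1

assert max_stripes() == 30   # matches the "max 30 devices?!" in the thread
```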
Re: Crashes running btrfs scrub
Mike Stevens wrote: First, the required information ~ $ uname -a Linux auswscs9903 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 19:03:37 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux ~ $ btrfs --version btrfs-progs v4.9.1 ~ $ sudo btrfs fi show Label: none uuid: 77afc2bb-f7a8-4ce9-9047-c031f7571150 Total devices 34 FS bytes used 89.06TiB devid1 size 5.46TiB used 4.72TiB path /dev/sdb devid2 size 5.46TiB used 4.72TiB path /dev/sda devid3 size 5.46TiB used 4.72TiB path /dev/sdx devid4 size 5.46TiB used 4.72TiB path /dev/sdt devid5 size 5.46TiB used 4.72TiB path /dev/sdz devid6 size 5.46TiB used 4.72TiB path /dev/sdv devid7 size 5.46TiB used 4.72TiB path /dev/sdab devid8 size 5.46TiB used 4.72TiB path /dev/sdw devid9 size 5.46TiB used 4.72TiB path /dev/sdad devid 10 size 5.46TiB used 4.72TiB path /dev/sdaa devid 11 size 5.46TiB used 4.72TiB path /dev/sdr devid 12 size 5.46TiB used 4.72TiB path /dev/sdy devid 13 size 5.46TiB used 4.72TiB path /dev/sdj devid 14 size 5.46TiB used 4.72TiB path /dev/sdaf devid 15 size 5.46TiB used 4.72TiB path /dev/sdag devid 16 size 5.46TiB used 4.72TiB path /dev/sdh devid 17 size 5.46TiB used 4.72TiB path /dev/sdu devid 18 size 5.46TiB used 4.72TiB path /dev/sdac devid 19 size 5.46TiB used 4.72TiB path /dev/sdk devid 20 size 5.46TiB used 4.72TiB path /dev/sdah devid 21 size 5.46TiB used 4.72TiB path /dev/sdp devid 22 size 5.46TiB used 4.72TiB path /dev/sdae devid 23 size 5.46TiB used 4.72TiB path /dev/sdc devid 24 size 5.46TiB used 4.72TiB path /dev/sdl devid 25 size 5.46TiB used 4.72TiB path /dev/sdo devid 26 size 5.46TiB used 4.72TiB path /dev/sdd devid 27 size 5.46TiB used 4.72TiB path /dev/sdi devid 28 size 5.46TiB used 4.72TiB path /dev/sdn devid 29 size 5.46TiB used 4.72TiB path /dev/sds devid 30 size 5.46TiB used 4.72TiB path /dev/sdm devid 31 size 5.46TiB used 4.72TiB path /dev/sdf devid 32 size 5.46TiB used 4.72TiB path /dev/sdq devid 33 size 5.46TiB used 4.72TiB path /dev/sdg devid 34 size 5.46TiB used 4.72TiB path /dev/sde ~ $ sudo 
btrfs fi df /gpfs_backups Data, RAID6: total=150.82TiB, used=88.88TiB System, RAID6: total=512.00MiB, used=19.08MiB Metadata, RAID6: total=191.00GiB, used=187.38GiB GlobalReserve, single: total=512.00MiB, used=0.00B That's a hell of a filesystem. RAID5 and RAID6 are unstable and should not be used for anything but throwaway data. You will be happy that you value your data enough to have backups - all sensible sysadmins do have backups, correct?! (Do read just about any of Duncan's replies - he describes this better than me.) Also, if you are running kernel ***3.10***, that is nearly antique in btrfs terms. As a word of advice, try a more recent kernel (there have been lots of patches to raid5/6 since kernel 4.9), and if you ever get the filesystem running again then *at least* rebalance the metadata to raid1 as quickly as possible, as the raid1 profile is (unlike raid5 or raid6) working really well. PS! I'm not a BTRFS dev so don't run away just yet. Someone else may magically help you recover. Best of luck! - Waxhead
Re: How to replace a failed drive in btrfs RAID 1 filesystem
Austin S. Hemmelgarn wrote: On 2018-03-09 11:02, Paul Richards wrote: Hello there, I have a 3 disk btrfs RAID 1 filesystem, with a single failed drive. Before I attempt any recovery I’d like to ask what is the recommended approach? (The wiki docs suggest consulting here before attempting recovery[1].) The system is powered down currently and a replacement drive is being delivered soon. Should I use “replace”, or “add” and “delete”? Once replaced should I rebalance and/or scrub? I believe that the recovery may involve mounting in degraded mode. If I do this, how do I later get out of degraded mode, or if it’s automatic how do i determine when I’m out of degraded mode? It won't automatically mount degraded; you either have to explicitly ask it to, or you have to have an option to do so in your default mount options for the volume in /etc/fstab (which is dangerous for multiple reasons). Now, as to what the best way to go about this is, there are three things to consider: 1. Is the failed disk still usable enough that you can get good data off of it in a reasonable amount of time? If you're replacing the disk because of a lot of failed sectors, you can still probably get data off of it, while something like a head crash isn't worth trying to get data back from. 2. Do you have enough room in the system itself to add another disk without removing one? 3. Is the replacement disk at least as big as the failed disk? If the answer to all three is yes, then just put in the new disk, mount the volume normally (you don't need to mount it degraded if the failed disk is working this well), and use `btrfs replace` to move the data. This is the most efficient option in terms of both time and effort, and is also generally the safest (and I personally always over-spec drive-bays in systems we build where I work specifically so that this approach can be used). 
If the answer to the third question is no, put in the new disk (removing the failed one first if the answer to the second question is no), mount the volume (mount it degraded if one of the first two questions is no, normally otherwise), then add the new disk to the volume with `btrfs device add` and remove the old one with `btrfs device delete` (using the 'missing' option if you had to remove the failed disk). This is needed because the replace operation requires the new device to be at least as big as the old one. If the answer to either one or two is no but the answer to three is yes, pull out the failed disk, put in a new one, mount the volume degraded, and use `btrfs replace` as well (you will need to specify the device ID for the now missing failed disk, which you can find by calling `btrfs filesystem show` on the volume). In the event that the replace operation refuses to run in this case, instead add the new disk to the volume with `btrfs device add` and then run `btrfs device delete missing` on the volume. If you follow any of the above procedures, you don't need to balance (the replace operation is equivalent to a block level copy and will result in data being distributed exactly the same as it was before, while the delete operation is a special type of balance), and you generally don't need to scrub the volume either (though it may still be a good idea). As far as getting back from degraded mode, you can just remount the volume to do so, though I would generally suggest rebooting. Note that there are three other possible approaches to consider as well: 1. If you can't immediately get a new disk _and_ all the data will fit on the other two disks, use `btrfs device delete` to remove the failed disk anyway, and run with just the two until you can get a new disk. This is exponentially safer than running the volume degraded until you get a new disk, and is the only case you realistically should delete a device before adding the new one. 
Make sure to balance the volume after adding the new device. 2. Depending on the situation, it may be faster to just recreate the whole volume from scratch using a backup than it is to try to repair it. This is actually the absolute safest method of handling this situation, as it makes sure that nothing from the old volume with the failed disk causes problems in the future. 3. If you don't have a backup, but have some temporary storage space that will fit all the data from the volume, you could also use `btrfs restore` to extract files from the old volume to temporary storage, recreate the volume, and copy the data back in from the temporary storage. I did a quick scan of the wiki just to see, but I did not find any good info about how to recover a "RAID"-like set if degraded. Information about how to recover, and which profiles can be recovered from, would be good to have.
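Austin's three questions and their outcomes can be condensed into a small decision helper. This is purely my own summary of the procedure described in the thread, with invented names, not an official recipe - reread the full text (especially the fallbacks when `btrfs replace` refuses to run) before acting on a real volume.

```python
def recovery_plan(old_disk_readable: bool,
                  spare_bay_free: bool,
                  new_disk_big_enough: bool) -> str:
    """Condense the decision tree from the post above (hypothetical
    helper; the commands named are real btrfs-progs subcommands)."""
    if not new_disk_big_enough:
        # replace requires new >= old, so fall back to add + delete
        # (mount degraded if the old disk is dead or had to come out).
        return "btrfs device add, then btrfs device delete [missing]"
    if old_disk_readable and spare_bay_free:
        # Best case: keep the failing disk in, mount normally, replace.
        return "btrfs replace (mounted normally)"
    # Dead disk or no free bay: swap disks, mount -o degraded, and run
    # replace against the missing disk's devid; if replace refuses,
    # fall back to add + delete missing.
    return "btrfs replace by devid (mounted degraded)"

assert recovery_plan(True, True, True) == "btrfs replace (mounted normally)"
```

None of the three paths requires a balance afterwards, per the post, except the degraded two-disk interim case.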
Per subvolume "RAID" level?!
Just out of curiosity, is there any work going on to enable different "RAID" levels per subvolume?! And out of even more curiosity, how is this planned to be handled with btrfs balance?! When per-subvolume "RAID" levels are good to go, how would you then run the balance filters to convert / leave alone certain parts of the filesystem?!
Please update the BTRFS status page
The latest released kernel is 4.15
Re: degraded permanent mount option
Austin S. Hemmelgarn wrote: On 2018-01-29 12:58, Andrei Borzenkov wrote: 29.01.2018 14:24, Adam Borowski wrote: ... So any event (the user's request) has already happened. An rc system, of which systemd is one, knows whether we reached the "want root filesystem" or "want secondary filesystems" stage. Once you're there, you can issue the mount() call and let the kernel do the work. It is a btrfs choice to not expose a compound device as a separate one (like every other device manager does) Btrfs is not a device manager, it's a filesystem. it is a btrfs drawback that it doesn't provide anything else except this IOCTL with its logic How can it provide you with something it doesn't yet have? If you want the information, call mount(). And as others in this thread have mentioned, what, pray tell, would you want to know "would a mount succeed?" for if you don't want to mount? it is a btrfs drawback that there is nothing to push assembling into the "OK, going degraded" state The way to do so is to time out, then retry with -o degraded. That's a possible way to solve it. This likely requires support from mount.btrfs (or btrfs.ko) to return a proper indication that the filesystem is incomplete, so the caller can decide whether to retry or to try a degraded mount. We already do so in the accepted standard manner. If the mount fails because of a missing device, you get a very specific message in the kernel log about it, as is the case for most other common errors (for uncommon ones you usually just get a generic open_ctree error). This is really the only option too, as the mount() syscall (which the mount command calls) returns only 0 on success or -1 and an appropriate errno value on failure, and we can't exactly go about creating a half dozen new error numbers just for this (well, technically we could, but I very much doubt that they would be accepted upstream, which defeats the purpose). Or maybe mount.btrfs should implement this logic internally. 
This would really be the simplest way to make it acceptable to the other side, by not needing to accept anything :) And it would also be another layering violation, which would require a proliferation of extra mount options to control the mount command itself and adjust the timeout handling. This has been done before with mount.nfs, but for slightly different reasons (primarily to allow nested NFS mounts, since the local directory that the filesystem is being mounted on not being present is treated like a mount timeout), and it had near zero control. It works there because they push the complicated policy decisions to userspace (namely, there is no support for retrying with different options or trying a different server). I just felt like commenting a bit on this from a regular user's point of view. Remember that at some point BTRFS will probably be the default filesystem for the average penguin. BTRFS's big selling point is redundancy and a guarantee that whatever you write is the same as what you will read sometime later. Many users will probably build their BTRFS system on a redundant array of storage devices. As long as there are sufficient (not necessarily all) storage devices present they expect their system to come up and work. If the system is not able to come up in a fully operative state it must at least be able to limp along until the issue is fixed. Starting an argument about which init system is the most sane or most shiny is not helping. The truth is that systemd is not going away anytime soon, and one might as well try to become friends, if nothing else for the sake of having things working, which should be a common goal regardless of religion. I personally think the degraded mount option is a mistake, as this assumes that a lightly degraded system is not able to work, which is false. If the system can mount to some working state then it should mount, regardless of whether it is fully operative or not. 
If the array is in a bad state you need to learn about it by issuing a command or something. The same goes for an MD array (and yes, I am aware of the block layer vs filesystem thing here).
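The "time out, then retry with -o degraded" policy mentioned in the thread can be expressed as a few lines of logic. The sketch below is deliberately abstract: `try_mount` is an injected callable standing in for the real mount(2) path, since real code would also need to sleep between attempts and check the kernel log for the missing-device message, as Austin describes.

```python
def mount_with_fallback(try_mount, retries=3, allow_degraded=True):
    """Sketch of the retry-then-degrade policy discussed above.
    `try_mount(opts)` is a stand-in for mount(2): it receives the mount
    options string ("" or "degraded") and returns True on success."""
    for _ in range(retries):
        if try_mount(""):          # normal mount attempt
            return "mounted"
        # real code: sleep/wait for more devices to appear here
    if allow_degraded and try_mount("degraded"):
        return "mounted-degraded"  # limping, but up - alert the admin
    return "failed"

# A volume whose device never shows up only comes up with -o degraded:
assert mount_with_fallback(lambda opts: opts == "degraded") == "mounted-degraded"
```

Whether this policy belongs in mount.btrfs, an init system unit, or nowhere at all is exactly the disagreement in the thread; the code only shows how little logic is actually being argued about.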
Re: Superblock update: Is there really any benefits of updating synchronously?
Hans van Kranenburg wrote: On 01/23/2018 08:51 PM, waxhead wrote: Nikolay Borisov wrote: On 23.01.2018 16:20, Hans van Kranenburg wrote: [...] We also had a discussion about the "backup roots" that are stored besides the superblock, and that they are "better than nothing" to help maybe recover something from a broken fs, but never ever guarantee you will get a working filesystem back. The same holds for superblocks from a previous generation. As soon as the transaction for generation X successfully hits the disk, all space that was occupied in generation X-1 but no longer in X is available to be overwritten immediately. Ok, so this means that superblocks with an older generation are utterly useless and will lead to corruption (effectively making my argument above useless, as that would in fact assist corruption). Mostly, yes. Does this means that if disk space was allocated in X-1 and is freed in X it will unallocated if you roll back to X-1 e.g. writing to unallocated storage. Can you reword that? I can't follow that sentence. Sure, why not. I'll give it a go: Does this mean that if... * Superblock generation N-1 has range 1234-2345 allocated and used, and * Superblock generation N-0 (the current) has range 1234-2345 free because someone deleted a file or something, then there is no point in rolling back to generation N-1, because that refers to what is now essentially free "memory" which may or may not have been written over by generation N-0. And therefore N-1, which still thinks range 1234-2345 is allocated, may point to the wrong data. I hope that was easier to follow - if not, don't hold back on the expletives! :) I was under the impression that a superblock was like a "snapshot" of the entire filesystem and that rollbacks via pre-gen superblocks were possible. Am I mistaken? Yes. 
The first fundamental thing in Btrfs is COW, which makes sure that everything referenced from transaction X, from the superblock all the way down to metadata trees and actual data space, is never overwritten by changes done in transaction X+1. Perhaps a tad off topic, but assuming the (hopefully) better explanation above clears things up a bit: what happens if a block is freed in X+1 - which must mean that it can be overwritten in transaction X+1 (which I assume means a new superblock generation)? After all, without freeing and overwriting data there is no way to re-use space. For metadata trees that are NOT filesystem trees a.k.a. subvolumes, the way this is done is actually quite simple. If a block is cowed, the old location is added to a 'pinned extents' list (in memory), which is used as a blacklist for choosing space to put new writes in. After a transaction is completed on disk, that list with pinned extents is emptied and all that space is available for immediate reuse. This way we make sure that if the transaction that is ongoing is aborted, the previous one (the latest one that is completely on disk) is always still there. If the computer crashes and the in-memory list is lost, no big deal, we just continue from the latest completed transaction again after a reboot. (ignoring extra log things for simplicity) So, the only situation in which you can fully use an X-1 superblock is when none of that previously pinned space has actually been overwritten afterwards. And if any of the space was overwritten already, you can go play around with using an older superblock and your filesystem mounts and everything might look fine, until you hit that distant corner and BOOM! Got it, this takes care of my questions above, but I'll leave them in just for completeness' sake. Thanks for the good explanation. >8 Extra!! Moar!! >8 But, doing so does not give you snapshot functionality yet! 
It's more like a poor man's snapshot that can only prevent messing up the current version. Snapshot functionality is implemented only for filesystem trees (subvolumes), by adding reference counting (which does end up on disk) to the metadata blocks, and then COWing trees as a whole. If you make a snapshot of a filesystem tree, the snapshot gets a whole new tree ID! It's not a previous version of the same subvolume you're looking at, it's a clone! This is a big difference. The extent tree is always tree 2. The chunk tree is always tree 3. But your subvolume snapshot gets a new tree number. Technically, it would maybe be possible to implement reference counting and snapshots for all of the metadata trees, but it would probably mean that the whole filesystem would get stuck rewriting itself all day instead of doing any useful work. The current extent tree already has such an amount of rumination problems that the added work of keeping track of reference counts would make it completely unusable. In the wiki, it's here: https://btrfs.wiki.kernel.org/index.php/Btrfs_design#Copy_on_Write_Logging Actually, I just
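The pinned-extents mechanism Hans describes is simple enough to model in a few lines. The class below is a toy with invented names, not real btrfs code; it only captures the rule that space freed by COW during a transaction is blacklisted for new allocations until that transaction has fully committed, which is exactly why an X-1 superblock stays usable only until the first post-commit reuse.

```python
class ToyAllocator:
    """Toy model of the pinned-extents rule from the explanation above."""
    def __init__(self, free_blocks):
        self.free = set(free_blocks)
        self.pinned = set()   # the in-memory 'pinned extents' list

    def cow_block(self, old, new):
        # New contents go to `new`; the old location is pinned rather
        # than freed, so the last committed transaction stays intact.
        self.free.discard(new)
        self.pinned.add(old)

    def allocate_candidates(self):
        # Pinned space is blacklisted for new writes mid-transaction.
        return self.free - self.pinned

    def commit(self):
        # Transaction safely on disk: pinned space is reusable at once.
        self.free |= self.pinned
        self.pinned.clear()

a = ToyAllocator(free_blocks={1, 2, 3})
a.cow_block(old=10, new=1)
assert 10 not in a.allocate_candidates()   # old copy protected mid-transaction
a.commit()
assert 10 in a.allocate_candidates()       # reusable immediately after commit
```

A crash before `commit()` simply discards the in-memory pinned set, and the filesystem continues from the last committed transaction, matching the "no big deal" case in the post.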
Re: Superblock update: Is there really any benefits of updating synchronously?
Nikolay Borisov wrote: On 23.01.2018 16:20, Hans van Kranenburg wrote: On 01/23/2018 10:03 AM, Nikolay Borisov wrote: On 23.01.2018 09:03, waxhead wrote: Note: This has been mentioned before, but since I see some issues related to superblocks I think it would be good to bring up the question again. [...] https://btrfs.wiki.kernel.org/index.php/On-disk_Format#Superblock The superblocks are updated synchronously on HDDs and one after the other on SSDs. There is currently no distinction in the code between whether we are writing to SSD or HDD. So what does that line in the wiki mean, and why is it there? "btrfs normally updates all superblocks, but in SSD mode it will update only one at a time." It means the wiki is outdated. Ok, and now the wiki is updated. Great :) Also, what do you mean by synchronously? If you inspect the code in write_all_supers you will see that for every device we issue writes for every available copy of the superblock and then wait for all of them to be finished via 'wait_dev_supers'. In that regard sb writeout is asynchronous. I meant basically what you have explained. You write the same memory to all superblocks "step by step" but in one operation. Superblocks are also (to my knowledge) not protected by copy-on-write and are read-modify-update. On a storage device with >256GB of capacity there will be three superblocks. BTRFS will always prefer the superblock with the highest generation number, provided that the checksum is good. Wrong. On mount btrfs will only ever read the first superblock at 64k. If that one is corrupted it will refuse to mount; then it's expected the user will initiate the recovery procedure with btrfs-progs, which reads all supers and replaces them with the "newest" one (as decided by the generation number). So again, the line "The superblock with the highest generation is used when reading." in the wiki needs to go away then? 
Yep, for background information you can read the discussion here: https://www.spinics.net/lists/linux-btrfs/msg71878.html And the wiki is also updated... Great! On the list there seem to be a few incidents where the superblocks have gone toast, and I am pondering what (if any) benefits there are to updating the superblocks synchronously. The superblock is checkpointed every 30 seconds by default, and if someone pulls the plug (power outage) on HDDs then a synchronous write may, depending on (the quality of) your hardware, perhaps ruin all the superblock copies in one go. E.g. copies A, B and C will all be updated at 30s. On SSDs, since one superblock is updated after the other, the default 30 second checkpoint would mean copy A=30s, copy B=1m, copy C=1m30s. As explained previously there is no notion of "SSD vs HDD" modes. Ok, thanks for clearing things up. But the main thing here is that all superblocks are updated at the same time on both SSDs and HDDs. I think the question is still valid. What is there to gain by updating all of them every 30s instead of updating them one by one?! Would that not be safer, perhaps itty-bitty quicker and perhaps better in terms of recovery?! We also had a discussion about the "backup roots" that are stored beside the superblock, and that they are "better than nothing" to help maybe recover something from a broken fs, but can never guarantee you will get a working filesystem back. The same holds for superblocks from a previous generation. As soon as the transaction for generation X successfully hits the disk, all space that was occupied in generation X-1 but no longer in X is available to be overwritten immediately. Ok, so this means that superblocks with an older generation are utterly useless and will lead to corruption (effectively making my argument above useless, as that would in fact assist corruption then). 
Does this mean that if disk space was allocated in X-1 and is freed in X, it will be unallocated if you roll back to X-1, i.e. you would be writing to unallocated storage? I was under the impression that a superblock was like a "snapshot" of the entire filesystem and that rollbacks via previous-generation superblocks were possible. Am I mistaken? -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
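The btrfs-progs recovery procedure described in this thread (read every superblock copy, skip the ones with bad checksums, take the highest generation) can be sketched like this. A toy model only: real btrfs uses crc32c over the superblock body, not zlib's crc32, and the tuple layout here is invented for the example:

```python
import zlib

def pick_super(supers):
    """Among superblock copies whose checksum verifies, prefer the one
    with the highest generation number. supers: list of
    (raw_bytes, stored_checksum, generation) tuples (invented layout)."""
    best = None
    for raw, stored_csum, generation in supers:
        if zlib.crc32(raw) != stored_csum:
            continue  # corrupted copy, never a candidate
        if best is None or generation > best[1]:
            best = (raw, generation)
    return best

a = b"gen 41 state"
b = b"gen 42 state"
supers = [
    (a, zlib.crc32(a), 41),      # older but valid
    (b, zlib.crc32(b), 42),      # newest valid copy -> chosen
    (b"junk", 0xDEADBEEF, 43),   # highest generation, but bad checksum
]
assert pick_super(supers) == (b, 42)
```

Note this is the *recovery* behavior; as Nikolay points out above, a normal mount only ever reads the primary superblock at 64k.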
Superblock update: Is there really any benefit to updating synchronously?
Note: This has been mentioned before, but since I see some issues related to superblocks I think it would be good to bring up the question again. According to the information found in the wiki: https://btrfs.wiki.kernel.org/index.php/On-disk_Format#Superblock The superblocks are updated synchronously on HDDs and one after the other on SSDs. Superblocks are also (to my knowledge) not protected by copy-on-write and are read-modify-update. On a storage device >256GiB there will be three superblocks. BTRFS will always prefer the superblock with the highest generation number provided that the checksum is good. On the list there seem to be a few incidents where the superblocks have gone toast, and I am pondering what (if any) benefits there are to updating the superblocks synchronously. The superblock is checkpointed every 30 seconds by default, and if someone pulls the plug (power outage) on HDDs then a synchronous write may, depending on (the quality of) your hardware, perhaps ruin all the superblock copies in one go. E.g. copies A, B and C will all be updated at 30s. On SSDs, since one superblock is updated after the other, the default 30 second checkpoint would mean copy A=30s, copy B=1m, copy C=1m30s. Why is the SSD method not used on hard drives also?! If two superblocks are toast you would at maximum lose 1m30s by default, and if this is considered a problem then you can always adjust the commit time downwards. If it is set to 15 seconds you would still only lose 30 seconds of "action time" and would in my opinion be far better off from a reliability point of view than having to update multiple superblocks at the same time. I can't see why on earth updating all superblocks at the same time would have any benefits. So this all boils down to the questions three (ere the other side will see. :P ) 1. What are the benefits of updating all superblocks at the same time? 
(Just imagine if your memory is bad - you could risk updating all superblocks simultaneously with kebab'ed data). 2. What would the negative consequences be of using the SSD scheme for hard disks as well? Especially if the commit time is set to 15s instead of 30s. 3. In a RAID1 / 10 / 5 / 6 like setup, would a set of corrupt superblocks on a single drive be recoverable from the other disks, or do the superblocks need to be intact on the (possibly) damaged drive? (If the superblocks are needed, then why would SSD mode not be better, especially if the drive is partly working?)
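For reference, the superblock copies discussed above sit at fixed offsets (64 KiB, 64 MiB and 256 GiB, per the on-disk format wiki page linked earlier), and a device only carries the copies that physically fit on it. A small sketch of the "how many copies does my device have" arithmetic (the 4096-byte superblock size is per the same wiki page):

```python
SUPER_OFFSETS = [64 * 1024,         # primary copy at 64 KiB
                 64 * 1024 * 1024,  # second copy at 64 MiB
                 256 * 1024 ** 3]   # third copy at 256 GiB

def super_copies(dev_size_bytes):
    """How many 4096-byte superblock copies fit on a device of this size."""
    return sum(1 for off in SUPER_OFFSETS if off + 4096 <= dev_size_bytes)

assert super_copies(1 * 1024 ** 2) == 1    # 1 MiB test image: primary only
assert super_copies(1 * 1024 ** 3) == 2    # 1 GiB disk: two copies
assert super_copies(300 * 1000 ** 3) == 3  # 300 GB disk: all three
```

This is why ">256GB there will be three superblocks" above: the third copy simply does not exist on smaller devices.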
Re: Recommendations for balancing as part of regular maintenance?
Austin S. Hemmelgarn wrote: So, for a while now I've been recommending small filtered balances to people as part of regular maintenance for BTRFS filesystems, under the logic that it does help in some cases and can't really hurt (and if done right, is really inexpensive in terms of resources). This ended up partially integrated into the info text next to the BTRFS charts on netdata's dashboard, and someone has now pointed out (correctly I might add) that this is at odds with the BTRFS FAQ entry on balances. For reference, here's the bit about it in netdata: You can keep your volume healthy by running the `btrfs balance` command on it regularly (check `man btrfs-balance` for more info). And here's the FAQ entry: Q: Do I need to run a balance regularly? A: In general usage, no. A full unfiltered balance typically takes a long time, and will rewrite huge amounts of data unnecessarily. You may wish to run a balance on metadata only (see Balance_Filters) if you find you have very large amounts of metadata space allocated but unused, but this should be a last resort. I've commented in the issue in netdata's issue tracker that I feel the FAQ entry could be better worded (strictly speaking, you don't _need_ to run balances regularly, but it's usually a good idea). Looking at both though, I think they could probably both be improved, but I would like to get some input here on what people actually think the best current practices are regarding this (and ideally why they feel that way) before I go and change anything. So, on that note, how does anybody else out there feel about this? Is balancing regularly with filters restricting things to small numbers of mostly empty chunks a good thing for regular maintenance or not? -- As just a regular user I would think that the first thing you would need is an analysis that can tell you whether it is a good idea to balance in the first place. Scrub seems like a great place to start - e.g. 
scrub could auto-analyze and report back the need to balance. I also think that scrub should optionally auto-balance if needed. Balance may not be needed, but if one can determine that balancing would speed things up a bit I don't see why this can't be scheduled automatically as an option. Ideally there should be a "scrub and polish" option that would scrub, balance and perhaps even defragment in one go. In fact, the way I see it, btrfs should ideally keep track of each data/metadata chunk by itself: it should know when each chunk was last affected by a scrub, balance, defrag etc. and perform the required operations by itself based on a configuration or similar. Some may disagree for good reasons, but for me this is my wishlist for a filesystem :) e.g. a pool that just works and only annoys you with the need to replace a bad disk every now and then :)
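The "analyze first, then balance" idea above could look something like this. Everything here is hypothetical illustration: btrfs does not expose per-chunk usage in this form, and the 30% threshold and minimum-candidate count are made-up tuning values, not anything from btrfs itself:

```python
def balance_recommendation(chunks, usage_threshold=30, min_candidates=3):
    """chunks: list of (size_bytes, used_bytes) pairs, one per allocated
    chunk. Returns a suggested usage-filter percentage for a filtered
    balance, or None when balancing looks pointless."""
    sparse = [(size, used) for size, used in chunks
              if used * 100 // size <= usage_threshold]
    if len(sparse) < min_candidates:
        return None  # too few nearly-empty chunks to be worth the rewrite
    # A tool could then suggest e.g.: btrfs balance start -dusage=30 /mnt
    return usage_threshold

# Ten chunks of 1 GiB, each ~90% full: nothing to reclaim.
mostly_full = [(1 << 30, 900 << 20)] * 10
assert balance_recommendation(mostly_full) is None

# Add four nearly-empty chunks: a small filtered balance is worthwhile.
fragmented = mostly_full + [(1 << 30, 50 << 20)] * 4
assert balance_recommendation(fragmented) == 30
```

The point of the filter is exactly what the FAQ hints at: only rewrite the mostly-empty chunks, instead of a full unfiltered balance that rewrites everything.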
Re: A Big Thank You, and some Notes on Current Recovery Tools.
Qu Wenruo wrote: On 2018年01月01日 08:48, Stirling Westrup wrote: Okay, I want to start this post with a HUGE THANK YOU THANK YOU THANK YOU to Nikolay Borisov and most especially to Qu Wenruo! Thanks to their tireless help in answering all my dumb questions I have managed to get my BTRFS working again! As I speak I have the full, non-degraded quad of drives mounted and am updating my latest backup of their contents. I had a 4-drive setup with 2x4T and 2x2T drives and one of the 2T drives failed, and with help I was able to make a 100% recovery of the lost data. I do have some observations on what I went through though. Take this as constructive criticism, or as a point for discussing additions to the recovery tools: 1) I had a 2T drive die with exactly 3 hard-sector errors, and those 3 errors exactly coincided with the 3 super-blocks on the drive. WTF, why does all this corruption happen at the btrfs super blocks?! What a coincidence. The odds against this happening as random independent events are so long as to be mind-boggling. (Something like odds of 1 in 10^26.) Yep, that's also why I was thinking the corruption is much heavier than our expectation. But if this turns out to be superblocks only, then as long as the superblock can be recovered, you're OK to go. So, I'm going to guess this wasn't random chance. It's possible that something inside the drive's layers of firmware is to blame, but it seems more likely to me that there must be some BTRFS process that can, under some conditions, try to update all superblocks as quickly as possible. Btrfs only tries to update its superblocks when committing a transaction. And it's only done after all devices are flushed. AFAIK there is nothing strange. I think it must be that a drive failure during this window managed to corrupt all three superblocks. 
Maybe, but at least the first (primary) superblock is written with the FUA flag. Unless you have enabled libata FUA support (which is disabled by default) AND your drive supports native FUA (not all HDDs support it; I only have one Seagate 3.5" HDD that supports it), a FUA write will be converted to write & flush, which should be quite safe. The only timing I can think of is between the superblock write request submission and the wait for it. But anyway, btrfs superblocks are the ONLY metadata not protected by CoW, so it is possible something may go wrong at certain timings. So from what I can piece together, SSD mode is safer even for regular hard disks, correct? According to this... https://btrfs.wiki.kernel.org/index.php/On-disk_Format#Superblock - There are 3 superblocks for every device. - The superblocks are updated every 30 seconds if there are any changes... - SSD mode will not try to update all superblocks in one go, but updates them one by one every 30 seconds. So if SSD mode is enabled even for hard disks then only 60 seconds of filesystem history / activity will potentially be lost... this sounds like a reasonable trade-off compared to having your entire filesystem hampered if your hardware is perhaps not optimal (which is sort of the point of BTRFS' checksumming anyway). So would it make sense to enable SSD behavior by default for HDDs?! It may be better to perform an update-readback-compare on each superblock before moving on to the next, so as to avoid this particular failure in the future. I doubt this would slow things down much, as the superblocks must be cached in memory anyway. That should be done by the block layer, where things like dm-integrity could help. 2) The recovery tools seem too dumb while thinking they are smarter than they are. There should be some way to tell the various tools to consider some subset of the drives in a system as worth considering. 
My fault; in fact there is a -F option for dump-super to force it to recognize the bad superblock and output whatever it has. In that case at least we would be able to see if it was really corrupted or just some bitflip in the magic numbers. Not knowing that a superblock was a single 4096-byte sector, I had primed my recovery by copying a valid superblock from one drive to the clone of my broken drive before starting the ddrescue of the failing drive. I had hoped that I could piece together a valid superblock from a good drive and whatever I could recover from the failing one. In the end this turned out to be a useful strategy, but meanwhile I had two drives that both claimed to be drive 2 of 4, and no drive claiming to be drive 1 of 4. The tools completely failed to deal with this case and were consistently preferring to read the bogus drive 2 instead of the real drive 2, and it wasn't until I deliberately patched over the magic in the cloned drive that I could use the various recovery tools without bizarre and spurious errors. I understand how this was never an anticipated scenario for the recovery process, but if it's happened once, it could happen again. Just dealing with a failing drive and its clone both available in one
Re: [PATCH] Btrfs: enchanse raid1/10 balance heuristic for non rotating devices
Timofey Titovets wrote: Currently the btrfs raid1/10 balancer balances requests to mirrors based on pid % num of mirrors. Update the logic and make it understand if the underlying device is non-rotational. If one of the mirrors is non-rotational, then all read requests will be moved to the non-rotational device. And this would make reads always end up on the fastest device regardless of the PID, which sounds sane enough, but scrubbing will be even more important since there is less chance that a "random PID" will check the other copy every now and then. If both mirrors are non-rotational, calculate the sum of pending and in-flight requests for the queue on each bdev and use the device with the shortest queue length. I think this should be tried out on rotational disks as well. I am happy to test this out for you on a 7-disk server if you want. Note: I have no experience with compiling kernels and applying patches (but I do code a bit in C every now and then) so a pre-compiled kernel would be required (I believe you are on Debian as well). For rotational disks it would perhaps not be wise to use another mirror unless its queue length is significantly higher than the other's. Again, I am happy to test if tunables are provided. P.S. 
Inspired by md-raid1 read balancing

Signed-off-by: Timofey Titovets
---
 fs/btrfs/volumes.c | 59 ++
 1 file changed, 59 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 9a04245003ab..98bc2433a920 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5216,13 +5216,30 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
 	return ret;
 }
 
+static inline int bdev_get_queue_len(struct block_device *bdev)
+{
+	int sum = 0;
+	struct request_queue *rq = bdev_get_queue(bdev);
+
+	sum += rq->nr_rqs[BLK_RW_SYNC] + rq->nr_rqs[BLK_RW_ASYNC];
+	sum += rq->in_flight[BLK_RW_SYNC] + rq->in_flight[BLK_RW_ASYNC];
+
+	/*
+	 * Try prevent switch for every sneeze
+	 * By roundup output num by 2
+	 */
+	return ALIGN(sum, 2);
+}
+
 static int find_live_mirror(struct btrfs_fs_info *fs_info,
 			    struct map_lookup *map, int first, int num,
 			    int optimal, int dev_replace_is_ongoing)
 {
 	int i;
 	int tolerance;
+	struct block_device *bdev;
 	struct btrfs_device *srcdev;
+	bool all_bdev_nonrot = true;
 
 	if (dev_replace_is_ongoing &&
 	    fs_info->dev_replace.cont_reading_from_srcdev_mode ==
@@ -5231,6 +5248,48 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
 	else
 		srcdev = NULL;
 
+	/*
+	 * Optimal expected to be pid % num
+	 * That's generaly ok for spinning rust drives
+	 * But if one of mirror are non rotating,
+	 * that bdev can show better performance
+	 *
+	 * if one of disks are non rotating:
+	 *   - set optimal to non rotating device
+	 * if both disk are non rotating
+	 *   - set optimal to bdev with least queue
+	 * If both disks are spinning rust:
+	 *   - leave old pid % nu,
+	 */
+	for (i = 0; i < num; i++) {
+		bdev = map->stripes[i].dev->bdev;
+		if (!bdev)
+			continue;
+		if (blk_queue_nonrot(bdev_get_queue(bdev)))
+			optimal = i;
+		else
+			all_bdev_nonrot = false;
+	}
+
+	if (all_bdev_nonrot) {
+		int qlen;
+		/* Forse following logic choise by init with some big number */
+		int optimal_dev_rq_count = 1 << 24;
+
+		for (i = 0; i < num; i++) {
+			bdev = map->stripes[i].dev->bdev;
+			if (!bdev)
+				continue;
+
+			qlen = bdev_get_queue_len(bdev);
+
+			if (qlen < optimal_dev_rq_count) {
+				optimal = i;
+				optimal_dev_rq_count = qlen;
+			}
+		}
+	}
+
 	/*
 	 * try to avoid the drive that is the source drive for a
 	 * dev-replace procedure, only choose it if no other non-missing
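The patch's selection heuristic can be summarized in user space like this. A sketch only: the `Mirror` type and its fields are invented stand-ins for the in-kernel `struct block_device` queue state the real code inspects:

```python
from dataclasses import dataclass

@dataclass
class Mirror:
    rotational: bool
    queue_len: int  # pending + in-flight requests, rounded up to 2 in the patch

def choose_mirror(mirrors, pid):
    """Pick which copy of the data to read, following the patch's logic."""
    optimal = pid % len(mirrors)  # old behaviour: purely pid-based
    all_nonrot = True
    for i, m in enumerate(mirrors):
        if not m.rotational:
            optimal = i           # any SSD beats pid-based selection
        else:
            all_nonrot = False
    if all_nonrot:
        # All copies are non-rotational: take the shortest queue.
        optimal = min(range(len(mirrors)), key=lambda i: mirrors[i].queue_len)
    return optimal

# HDD + SSD pair: reads always land on the SSD, whatever the pid.
assert choose_mirror([Mirror(True, 0), Mirror(False, 8)], pid=42) == 1
# SSD + SSD: the emptier queue wins.
assert choose_mirror([Mirror(False, 6), Mirror(False, 2)], pid=42) == 1
# HDD + HDD: unchanged pid % num behaviour.
assert choose_mirror([Mirror(True, 0), Mirror(True, 0)], pid=3) == 1
```

This also makes the scrub remark above concrete: with an HDD+SSD pair, normal reads never touch the HDD copy, so only scrub will ever verify it.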
Re: Tiered storage?
As a regular BTRFS user I can tell you that there is no such thing as hot data tracking yet. Some people seem to use bcache together with btrfs, and they come asking for help on the mailing list. Raid5/6 have received a few fixes recently, and it *may* soon be worth trying out raid5/6 for data while keeping metadata in raid1/10 (I would rather lose a file or two than the entire filesystem). I had plans to run some tests on this a while ago, but forgot about it. Like all good citizens, remember to have good backups. Last time I tested raid5/6 I ran into issues easily. For what it's worth, raid1/10 seems pretty rock solid as long as you have sufficient disks (hint: you need more than two for raid1 if you want to stay safe). As for dedupe there is (to my knowledge) nothing fully automatic yet. You have to run a program to scan your filesystem, but all the deduplication is done in the kernel. duperemove worked quite well when I tested it, but there may be some performance implications. Roy Sigurd Karlsbakk wrote: Hi all I've been following this project on and off for quite a few years, and I wonder if anyone has looked into tiered storage on it. With tiered storage, I mean hot data lying on fast storage and cold data on slow storage. I'm not talking about caching (where you just keep a copy of the hot data on the fast storage). And btw, how far are raid[56] and block-level dedup from something useful in production? Vennlig hilsen roy -- Roy Sigurd Karlsbakk (+47) 98013356 http://blogg.karlsbakk.net/ GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt -- Hið góða skaltu í stein höggva, hið illa í snjó rita. 
Re: Several questions regarding btrfs
ST wrote: Hello, I've recently learned about btrfs and am considering utilizing it for my needs. I have several questions in this regard: I manage a dedicated server remotely and have some sort of script that installs an OS from several images. There I can define partitions and their FSs. 1. By default the script provides a small separate partition for /boot with ext3. Does it have any advantages, or can I simply have /boot within /, all on btrfs? (Note: the OS is Debian 9) I am on Debian as well and run /boot on btrfs on multiple systems without any issues. Remember to run grub-install on all your disks and update-grub if you run it in a redundant setup. That way you can lose a disk and still be happy about it. If you run a redundant setup like raid1 / raid10, make sure you have sufficient disks to avoid the filesystem entering read-only mode. See the status page for details. 2. As for the / I get ca. the following written to /etc/fstab: UUID=blah_blah /dev/sda3 / btrfs ... So the top-level volume is populated after the initial installation with the main filesystem dir-structure (/bin /usr /home, etc.). As per the btrfs wiki I would like the top-level volume to have only subvolumes (at least, the one mounted as /) and snapshots. I can make a snapshot of the top-level volume with the / structure, but how can I get rid of all the directories within the top-level volume and keep only the subvolume containing / (and later snapshots), unmount it and then mount the snapshot that I took? rm -rf / - is not a good idea... There are some tutorials floating around the web for this stuff. Just be careful; after a system update you might run into boot issues. (I suggest you try playing with this in a VM first to see what happens.) 3. In my current ext4-based setup I have two servers, where one syncs files of a certain dir to the other using lsyncd (which launches rsync on inotify events). As far as I have understood it is more efficient to use btrfs send/receive (over ssh) than rsync (over ssh) to sync two boxes. 
Do you think it would be possible to make lsyncd use btrfs for syncing instead of rsync? I.e. can btrfs work with inotify events? Did somebody try it already? Otherwise I can sync using btrfs send/receive from within cron every 10-15 minutes, but it seems less elegant. I have no idea, but since Debian uses systemd you might be able to cook up something with systemd.path (https://www.freedesktop.org/software/systemd/man/systemd.path.html). 4. In the case when compression is used - what is quota based on: (a) the amount of GBs the data actually consumes on the hard drive in compressed state, or (b) the amount of GBs the data naturally takes up in uncompressed form? I need to set quotas as in (b). Is it possible? If not - should I file a feature request? No, it seems you should not file a feature request. Look what me and Google found for you :) https://btrfs.wiki.kernel.org/index.php/Quota_support (hint: read the "using limits" section) Thank you in advance! No worries, good luck!
Re: Parity-based redundancy (RAID5/6/triple parity and beyond) on BTRFS and MDADM (Dec 2014) – Ronny Egners Blog
Dave wrote: Has this been discussed here? Has anything changed since it was written? I have (more or less) been following the mailing list since this feature was suggested. I have been drooling over it since, but not much has happened. Parity-based redundancy (RAID5/6/triple parity and beyond) on BTRFS and MDADM (Dec 2014) – Ronny Egners Blog http://blog.ronnyegner-consulting.de/2014/12/10/parity-based-redundancy-raid56triple-parity-and-beyond-on-btrfs-and-mdadm-dec-2014/comment-page-1/ TL;DR: There are patches to extend the linux kernel to support up to 6 parity disks but BTRFS does not want them because it does not fit their “business case” and MDADM would want them but somebody needs to develop patches for the MDADM component. The kernel raid implementation is ready and usable. If someone volunteers to do this kind of work I would support with equipment and myself as a test resource. -- I am just a list "stalker" and no BTRFS developer, but as others have indirectly said already: it is not so much that BTRFS doesn't want the patches as that BTRFS cannot focus on this right now due to other priorities. There were some updates to raid5/6 in kernel 4.12 that should fix (or at least improve) scrub/auto-repair. The write hole does still exist. That being said, there might be configurations where btrfs raid5/6 could be of some use. I think I read somewhere that you can set data to raid5/6 and METADATA to raid1 or 10, and you would then risk losing some data (but not the filesystem) in the event of a system crash / power failure. This sounds tempting, since in theory it would not make btrfs raid5/6 significantly less reliable than other RAIDs, which will corrupt your data if a disk happens to spit out bad bits without complaining (one possible exception that might catch this is md raid6, which I use). 
That being said, there is no way I would personally use btrfs raid5/6 at this point, even with metadata on raid1/10, without properly tested backups on standby. Anyway - I would worry more about getting raid5/6 to work properly before even thinking about multi-parity at all :)
8 disk metadata raid10 + data raid1
Hi, On one of my machines I run a BTRFS filesystem with the following configuration Kernel: 4.11.0-1-amd64 #1 SMP Debian 4.11.6-1 (2017-06-19) x86_64 GNU/Linux Disks: 8 Metadata: Raid10 Data: Raid1 One of the disks is going bad, and while the system still runs fine I ran some md5sums on a few files and hit this bug. Currently the only non-zero output from btrfs de st / is about 27 write_io_errs and 278000 read_io_errs. I do have backups (hah! you did not expect that Duncan did you!) and the data is not important on this filesystem. Even though I got thrown the below stuff in dmesg, the system keeps running fine. ...the failed device disappeared as of writing this... so now the filesystem has one missing device. ---snip--- ( https://pastebin.ca/3856147 ) [120678.569637] ? worker_thread+0x4d/0x490 [120678.570931] ? kthread+0xfc/0x130 [120678.572206] ? process_one_work+0x430/0x430 [120678.573480] ? kthread_create_on_node+0x70/0x70 [120678.574756] ? do_group_exit+0x3a/0xa0 [120678.576024] ? ret_from_fork+0x26/0x40 [120678.577287] Code: 00 00 c7 43 28 00 00 00 00 b9 01 00 00 00 31 c0 eb d8 8d 48 02 eb da 41 89 e8 48 c7 c6 d8 67 4a c0 4c 89 e7 e8 c0 b9 fa ff eb 80 <0f> 0b 66 90 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 41 57 [120678.580003] RIP: btrfs_check_repairable+0xe2/0xf0 [btrfs] RSP: ac3a419ffd60 [120678.581377] [ cut here ] [120678.581526] ---[ end trace 1f2b98046a799b47 ]--- [120678.584109] kernel BUG at /build/linux-C5oXKu/linux-4.11.6/fs/btrfs/extent_io.c:2315! 
[120678.585477] invalid opcode: [#5] SMP [120678.586821] Modules linked in: ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter cpufreq_userspace cpufreq_powersave cpufreq_conservative intel_powerclamp iTCO_wdt iTCO_vendor_support coretemp kvm_intel cdc_ether usbnet mii kvm joydev mgag200 ttm irqbypass intel_cstate evdev intel_uncore drm_kms_helper drm pcspkr i2c_algo_bit lpc_ich mfd_core ioatdma sg ipmi_si ipmi_devintf dca ipmi_msghandler i7core_edac button edac_core i5500_temp shpchp acpi_cpufreq binfmt_misc ip_tables x_tables autofs4 btrfs crc32c_generic xor raid6_pq sd_mod hid_generic usbhid hid sr_mod cdrom ata_generic crc32c_intel ata_piix i2c_i801 libata mptsas ehci_pci scsi_transport_sas uhci_hcd mptscsih mptbase ehci_hcd scsi_mod usbcore usb_common bnx2 [120678.595485] CPU: 0 PID: 15305 Comm: kworker/u32:3 Tainted: G D I 4.11.0-1-amd64 #1 Debian 4.11.6-1 [120678.596960] Hardware name: IBM Lenovo ThinkServer RD220 -[379811G]-/59Y3827 , BIOS -[D6EL28AUS-1.03]- 08/20/2009 [120678.598476] Workqueue: btrfs-endio btrfs_endio_helper [btrfs] [120678.599943] task: 9a4dc7bd6d40 task.stack: ac3a44214000 [120678.601431] RIP: 0010:btrfs_check_repairable+0xe2/0xf0 [btrfs] [120678.602867] RSP: :ac3a44217d60 EFLAGS: 00010297 [120678.604271] RAX: 0001 RBX: 9a4daa28e080 RCX: [120678.605660] RDX: 0002 RSI: RDI: 9a4de67d6e18 [120678.607017] RBP: 0001 R08: 0dbbb224 R09: 0dbbf224 [120678.608344] R10: R11: fffb R12: 9a4de67d6000 [120678.609641] R13: 9a4cfea6dda8 R14: 9a4cfea6dda8 R15: [120678.610921] FS: () GS:9a4def20() knlGS: [120678.612190] CS: 0010 DS: ES: CR0: 80050033 [120678.613435] CR2: 5627a2d96ff0 CR3: 00039310f000 CR4: 06f0 [120678.614687] Call Trace: [120678.615965] ? end_bio_extent_readpage+0x42e/0x580 [btrfs] [120678.617244] ? btrfs_scrubparity_helper+0xcf/0x300 [btrfs] [120678.618488] ? process_one_work+0x197/0x430 [120678.619728] ? worker_thread+0x4d/0x490 [120678.620960] ? kthread+0xfc/0x130 [120678.622185] ? 
process_one_work+0x430/0x430 [120678.623408] ? kthread_create_on_node+0x70/0x70 [120678.624620] ? do_group_exit+0x3a/0xa0 [120678.625817] ? ret_from_fork+0x26/0x40 [120678.626994] Code: 00 00 c7 43 28 00 00 00 00 b9 01 00 00 00 31 c0 eb d8 8d 48 02 eb da 41 89 e8 48 c7 c6 d8 67 4a c0 4c 89 e7 e8 c0 b9 fa ff eb 80 <0f> 0b 66 90 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 41 57 [120678.629525] RIP: btrfs_check_repairable+0xe2/0xf0 [btrfs] RSP: ac3a44217d60 [120678.630795] [ cut here ] [120678.630839] ---[ end trace 1f2b98046a799b48 ]--- [120678.633382] kernel BUG at /build/linux-C5oXKu/linux-4.11.6/fs/btrfs/extent_io.c:2315! [120678.634625] invalid opcode: [#6] SMP [120678.635820] Modules linked in: ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter cpufreq_userspace cpufreq_powersave cpufreq_conservative intel_powerclamp iTCO_wdt iTCO_vendor_support coretemp kvm_intel cdc_ether usbnet mii kvm joydev mgag200 ttm irqbypass intel_cstate evdev intel_uncore drm_kms_helper drm pcspkr i2c_algo_bit lpc_ich mfd_core ioatdma sg ipmi_si ipmi_devintf dca ipmi_msghandler i7core_edac button edac_core i5500_temp
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
Brendan Hide wrote: The title seems alarmist to me - and I suspect it is going to be misconstrued. :-/ From the release notes at https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/7.4_Release_Notes/chap-Red_Hat_Enterprise_Linux-7.4_Release_Notes-Deprecated_Functionality.html "Btrfs has been deprecated The Btrfs file system has been in Technology Preview state since the initial release of Red Hat Enterprise Linux 6. Red Hat will not be moving Btrfs to a fully supported feature and it will be removed in a future major release of Red Hat Enterprise Linux. The Btrfs file system did receive numerous updates from the upstream in Red Hat Enterprise Linux 7.4 and will remain available in the Red Hat Enterprise Linux 7 series. However, this is the last planned update to this feature. Red Hat will continue to invest in future technologies to address the use cases of our customers, specifically those related to snapshots, compression, NVRAM, and ease of use. We encourage feedback through your Red Hat representative on features and requirements you have for file systems and storage technology." -- First of all I am not a BTRFS dev, but I use it for various projects and have high hopes for what it can become. Now, the fact that Red Hat deprecates BTRFS does not mean that BTRFS is deprecated. It is not removed from the kernel, and so far BTRFS offers features that other filesystems don't have. ZFS is something that people brag about all the time as a viable alternative, but to me it seems a pain to manage properly. E.g. grow, add/remove devices, shrink etc... good luck doing that right! 
BTRFS' biggest problem is not that there are some bits and pieces that are thoroughly screwed up (raid5/6 (which just got some fixes, by the way)), but the fact that the documentation is rather dated. There is a simple status page here: https://btrfs.wiki.kernel.org/index.php/Status As others have pointed out already, the explanations on the status page are not exactly good. For example compression (which was also mentioned) is, as of writing this, marked as 'Mostly ok' '(needs verification and source) - auto repair and compression may crash'. Now, I am aware that many use compression without trouble. I am not sure how many have compression together with disk issues and don't have trouble, but I would at least expect to see more people yelling on the mailing list if that were the case. The problem here is that this message is rather scary and certainly does NOT sound like 'mostly ok' to most people. What exactly needs verification and a source? The 'mostly ok' statement or something else?! A more detailed explanation would be required here to avoid scaring people away. The same goes for the trim feature that is marked OK. It clearly says that it has performance implications. It is marked OK, so one would expect it not to cause the filesystem to fail, but if performance becomes so slow that the filesystem gets practically unusable it is of course not "OK". The relevant information is missing for people to make a decent choice, and I certainly don't know how serious these performance implications are, or if they are relevant at all... Most people interested in BTRFS are probably a bit more paranoid and concerned about their data than the average computer user. What people tend to forget is that other filesystems have none of the redundancy, auto-repair and other fancy features that BTRFS has. So for the compression example above... 
if you run compressed files on ext4 and your disk gets some corruption, you are in no better a state than you would be with btrfs (in fact probably worse). Also, nothing is stopping you from putting btrfs DUP on an mdadm raid5 or 6, which means you should be VERY safe. Simple documentation is the key, so HERE ARE MY DEMANDS!!!... ehhh, so here is what I think should be done: 1. The documentation needs to either be improved or have old non-relevant stuff simply removed / archived somewhere 2. The status page MUST always be up to date for the latest kernel release (it's ok so far, let's hope nobody sleeps here) 3. Proper explanations must be given so that laymen and reasonably technical people understand the risks / issues for non-ok stuff. 4. There should be links to roadmaps for each feature on the status page that clearly state what is being worked on for the NEXT kernel release
Re: btrfs raid assurance
Hugo Mills wrote: You can see about the disk usage in different scenarios with the online tool at: http://carfax.org.uk/btrfs-usage/ Hugo. As a side note, have you ever considered making this online tool (which should never go away, just for the record) part of btrfs-progs, e.g. a proper tool? I use it quite often (at least several times per month) and I would love for this to be a visual tool. 'btrfs-space-calculator' would be a great name for it I think. Imagine how nice it would be to run btrfs-space-calculator -mraid1 -draid10 /dev/sda1 /dev/sdb1 /dev/sdc2 /dev/sdd2 /dev/sde3 for example and instantly get something similar to my example below (no accuracy intended)

d=data m=metadata .=unusable
{  500mb} [|d|]                    /dev/sda1
{ 3000mb} [|d|m|m|m|m|mm...|]      /dev/sdb1
{ 3000mb} [|d|m|m|m|m|mmm..|]      /dev/sdc2
{ 5000mb} [|d|m|m|m|m|m|m|m|m|m|]  /dev/sdd2
{11500mb} Total space
usable for data (raid10): 1000mb / 2000mb
usable for metadata (raid1): 4500mb / 9000mb
unusable: 500mb

Of course this would have to change once (if ever) subvolumes can have different raid levels etc, but I would have loved using something like this instead of jumping around carfax abbey (!) at night.
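As a rough illustration of what such a tool would have to compute: the sketch below assumes a greedy chunk-allocation model (each allocation goes to the devices with the most free space, two copies of everything), which is my understanding of how the online calculator approximates btrfs behaviour. The function name and 1 MB chunk granularity are made up for illustration:

```python
def usable_space(sizes_mb, profile="raid1", chunk=1):
    """Greedy sketch: each allocation places one chunk on every chosen
    device; with 2 copies, usable space grows by half the devices used."""
    min_devs = {"raid1": 2, "raid10": 4}[profile]
    free = list(sizes_mb)
    if len(free) < min_devs:
        return 0
    usable = 0
    while True:
        free.sort(reverse=True)
        if free[min_devs - 1] < chunk:
            break                       # not enough devices with space left
        n = sum(1 for f in free if f >= chunk)
        if profile == "raid10":
            n -= n % 2                  # stripe width shrinks, but stays even
        else:
            n = 2                       # raid1: exactly two copies, two devices
        for i in range(n):
            free[i] -= chunk
        usable += (n // 2) * chunk
    return usable

# Three devices of 1000, 1000 and 2000 MB in raid1:
print(usable_space([1000, 1000, 2000]))   # 2000
```

For raid1 this converges on the familiar min(total/2, total - largest) result, which is why mixed-size device sets can leave space unusable.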
Re: Best Practice: Add new device to RAID1 pool
Chris Murphy wrote: On Mon, Jul 24, 2017 at 5:27 AM, Cloud Admin wrote: I am a little bit confused because the balance command has been running for 12 hours and only 3GB of data has been touched. That's incredibly slow. Something isn't right. Using btrfs-debug -b from btrfs-progs, I've selected a few 100% full chunks. [156777.077378] f26s.localdomain sudo[13757]:chris : TTY=pts/2 ; PWD=/home/chris ; USER=root ; COMMAND=/sbin/btrfs balance start -dvrange=157970071552..159043813376 / [156773.328606] f26s.localdomain kernel: BTRFS info (device sda1): relocating block group 157970071552 flags data [156800.408918] f26s.localdomain kernel: BTRFS info (device sda1): found 38952 extents [156861.343067] f26s.localdomain kernel: BTRFS info (device sda1): found 38951 extents That 1GiB chunk with quite a few fragments took 88s. That's 11MB/s. Even for a hard drive, that's slow. This may be a stupid question, but is your pool of butter (or BTRFS pool) by any chance hooked up via USB? If this is USB 2.0 at 480 Mbit/s then that is about 57 MB/s / 4 drives = roughly 14.25 MB/s, or about 11 MB/s if you shave off some overhead.
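The back-of-envelope numbers above check out; a quick worked version (the ~5% protocol-overhead factor is an assumption, just to land on the 57 MB/s figure quoted):

```python
usb2_mbit = 480               # USB 2.0 signalling rate, Mbit/s
raw_mb_s = usb2_mbit / 8      # 60 MB/s before protocol overhead
effective = raw_mb_s * 0.95   # ~57 MB/s, assuming ~5% overhead
per_drive = effective / 4     # four drives sharing one bus
print(per_drive)              # 14.25
```

Real-world USB 2.0 throughput is usually lower still (often 30-40 MB/s for bulk transfers), which would put the per-drive figure even closer to the observed 11 MB/s.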
Exactly what is wrong with RAID5/6
I am trying to piece together the actual status of the RAID5/6 bit of BTRFS. The wiki refers to kernel 3.19, which was released in February 2015, so I assume that the information there is a tad outdated (the last update on the wiki page was July 2016) https://btrfs.wiki.kernel.org/index.php/RAID56 Now there are four problems listed 1. Parity may be inconsistent after a crash (the "write hole"). Is this still true? If yes, would this not apply to RAID1 / RAID10 as well? How was it solved there, and why can't that be done for RAID5/6? 2. Parity data is not checksummed. Why is this a problem? Does it have to do with the design of BTRFS somehow? Parity is after all just data, and BTRFS does checksum data, so what is the reason this is a problem? 3. No support for discard? (possibly -- needs confirmation with cmason) Does this really matter that much? Is there an update on this? 4. The algorithm uses as many devices as are available: no support for a fixed-width stripe. What is the plan for this one? There were patches on the mailing list by the SnapRAID author to support up to 6 parity devices. Will the (re)design of btrfs raid5/6 support a scheme that allows for multiple parity devices? I do have a few other questions as well... 5. BTRFS does still (kernel 4.9) not seem to use the device ID to communicate with devices. If you yank out a device on a multi-device filesystem, for example /dev/sdg, and it reappears as /dev/sdx, btrfs will still happily try to write to /dev/sdg even if btrfs fi sh /mnt shows the correct device ID. What is the status of getting BTRFS to properly understand that a device is missing? 6. RAID1 needs to be able to make two copies always. E.g. if you have three disks you can lose one and it should still work. What about RAID10? If you have for example a 6 disk RAID10 array, lose one disk and reboot (due to #5 above), will RAID10 recognize that the array now is a 5 disk array and stripe+mirror over 2 disks (or possibly 2.5 disks?)
instead of 3? In other words, will it work as long as it can create a RAID10 profile that requires a minimum of four disks?
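The write hole in point 1 is easy to demonstrate in miniature. This is a generic raid5 illustration with a toy 3-device stripe (two data blocks plus XOR parity), not btrfs code:

```python
# Toy 3-device raid5 stripe: d0, d1 and parity p = d0 ^ d1.
d0, d1 = 0b1010, 0b0110
p = d0 ^ d1                    # parity written together with the data

# A later update rewrites d0 but "crashes" before updating the parity:
d0 = 0b1111                    # new data hits the disk
# p is now stale -> the stripe is silently inconsistent.
assert (d0 ^ d1) != p

# If d1 then dies, reconstructing it from d0 and the stale parity
# returns garbage rather than the original 0b0110:
d1_rebuilt = d0 ^ p
assert d1_rebuilt != 0b0110
print(bin(d1_rebuilt))         # not the lost data
```

This also shows why point 2 matters: with checksummed data btrfs can at least detect that the reconstruction is wrong, but unchecksummed parity means the stale parity itself cannot be identified as the culprit.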
Re: Home storage with btrfs
Same here, I have been using BTRFS for a 'scratch' disk since about 2014. The disk has had quite some abuse and no issues yet. I don't use compression, snapshots or any fancy features. I have recently moved all of the root filesystem to BTRFS with 5x SSD disks set up in RAID1 and everything is (still) working fine, and I have been shuffling large amounts of data on this volume. I bet the SSDs will break before BTRFS does, so the real test is yet to come I guess... I am on Debian GNU/Linux with kernel 4.9.0-2-amd64 (Debian 4.9.13-1) - btrfs-progs 4.7.3. However, keep in mind that backups are winning the fight against binary related traumas :) Peter Becker wrote: I can confirm this. I have also had no general issues over the past 2 years with BTRFS in RAID1 on 6 disks of different sizes, and also no issues with the DUP profile on a single disk. Only some performance issues with deduplication and very large files. But I also recommend using a newer kernel (4.4 or higher), or better the newest, and building a newer version of btrfs-progs from source. I use Ubuntu 16.04 and kernel 4.9 + btrfs-progs 4.9 currently. 2017-03-13 13:02 GMT+01:00 Austin S. Hemmelgarn: On 2017-03-13 07:52, Juan Orti Alcaine wrote: 2017-03-13 12:29 GMT+01:00 Hérikz Nawarro : Hello everyone, Today is it safe to use btrfs for home storage? No raid, just secure storage for some files and create snapshots from it. In my humble opinion, yes. I've been running a RAID1 btrfs at home for 5 years and I feel the most serious bugs have been fixed, because in the last two years I have not experienced any issue. In general, I'd agree. I've not seen any issues resulting from BTRFS itself for the past 2.5 years (although it's helped me find quite a lot of marginal or failing hardware over that time), but I've also not used many of the less stable features (raid56, qgroups, and a handful of other things).
One piece of advice I will give though: try to keep the total number of snapshots to a reasonably small three digit number (ideally less than 200, absolutely less than 300), otherwise performance is going to be horrible. Anyway, keeping your kernel and btrfs-progs updated is a must, and of course, having good backups. I'm using Fedora and it's fine. Also agreed, Fedora is one of the best options for a traditional distro (they're very good about staying up to date and back-porting bug-fixes from the upstream kernel). The other two I'd recommend are Arch (they actually use an almost upstream kernel and are generally the first distro to have new versions of any arbitrary software) and Gentoo (similar to Arch, but more maintenance intensive (although also more efficient (usually))).
Why does BTRFS (still) forget which device to write to?
I am doing some tests on BTRFS with both data and metadata in raid1. uname -a Linux daffy 4.9.0-1-amd64 #1 SMP Debian 4.9.6-3 (2017-01-28) x86_64 GNU/Linux btrfs --version btrfs-progs v4.7.3 01. mkfs.btrfs /dev/sd[fgh]1 02. mount /dev/sdf1 /btrfs_test/ 03. btrfs balance start -dconvert=raid1 /btrfs_test/ 04. copied a lot of 3-4MB files to it (about 40GB)... 05. Started to compress some of the files to create one larger file... 06. Pulled the (sata) plug on one of the drives... (sdf1) 07. dmesg shows that the kernel is rejecting I/O to offline device + [sdf] killing request 08. BTRFS error (device sdf1) bdev /dev/sdf1 errs: wr 0, rd 1, flush 0, corrupt 0, gen 0 09. the previous line repeats - increasing rd count 10. Reconnecting the sdf1 drive again makes it show up as sdi1 11. btrfs fi sh /btrfs_test shows sdi1 as the correct device id (1). 12. Yet dmesg shows tons of errors like this: BTRFS error (device sdf1): bdev /dev/sdi1 errs wr 37182, rd 39851, flush 1, corrupt 0, gen 0 13. and the above line repeats, increasing wr and rd errors. 14. BTRFS never seems to "get in tune again" while the filesystem is mounted. The conclusion appears to be that the device ID is back again in the btrfs pool, so why does btrfs still try to write to the wrong device (or does it?!). The good thing here is that BTRFS still works fine after an unmount and mount again. Running a scrub on the filesystem cleans up tons of errors, but no uncorrectable errors. However it says total bytes scrubbed 94.21GB with 75 errors ... and further down it says corrected errors: 72, uncorrectable errors: 0, unverified errors: 0. Why 75 vs 72 errors?! Did it correct all or not? I have recently lost 1x 5 device BTRFS filesystem as well as 2x 3 device BTRFS filesystems set up in RAID1 (both data and metadata) by toying around with them. The 2x filesystems I lost were using all bad disks (all 3 of them), but the one mentioned here uses good (but old) 400GB drives just for the record.
By lost I mean that mount does not recognize the filesystem, but btrfs fi sh does show that all devices are present. I did not make notes for those filesystems, but it appears that RAID1 is a bit fragile. I don't need to recover anything. This is just a "toy system" for playing around with btrfs and doing some tests.
Re: raid1 degraded mount still produce single chunks, writeable mount not allowed
Chris Murphy wrote: On Thu, Mar 2, 2017 at 6:48 PM, Chris Murphy wrote: Again, my data is fine. The problem I'm having is this: https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/Documentation/filesystems/btrfs.txt?id=refs/tags/v4.10.1 Which says in the first line, in part, "focusing on fault tolerance, repair and easy administration" and quite frankly this sort of enduring bug in this file system, which is nearly 10 years old now, renders that misleading, and possibly dishonest. How do we describe this file system as focusing on fault tolerance when, in the identical scenario using mdadm or LVM raid, the user's data is not mishandled like it is on Btrfs with multiple devices? I think until these problems are fixed, the Btrfs status page should describe RAID 1 and 10 as mostly OK, with this problem as the reason for it not being OK. I took the liberty of changing the status page...
Re: RAID56 status?
Hugo Mills wrote: On Sun, Jan 22, 2017 at 11:35:49PM +0100, Christoph Anton Mitterer wrote: On Sun, 2017-01-22 at 22:22 +0100, Jan Vales wrote: Therefore my question: what's the status of raid5/6 in btrfs? Is it somehow "production"-ready by now? AFAIK, what's on the - apparently already no longer updated - https://btrfs.wiki.kernel.org/index.php/Status still applies, and RAID56 is not yet usable for anything near production. It's still all valid. Nothing's changed. How would you like it to be updated? "Nope, still broken"? Hugo. I risked updating the wiki to show kernel version 4.9 instead of 4.7 then...
Re: Is stability a joke? (wiki updated)
Pasi Kärkkäinen wrote: On Mon, Sep 12, 2016 at 09:57:17PM +0200, Martin Steigerwald wrote: Great. I made two minor adaptions. I added a link to the Status page to my warning before the kernel log by feature page. And I also mentioned that at the time the page was last updated, the latest kernel version was 4.7. Yes, that's some extra work to update the kernel version, but I think it's beneficial to explicitly mention the kernel version the page talks about. Everyone who updates the page can update the version within a second. Hmm.. that will still leave people wondering "but I'm running Linux 4.4, not 4.7, I wonder what the status of feature X is.." Should we also add a column for kernel version, so we can add "feature X is known to be OK on Linux 3.18 and later".. ? Or add those to the "notes" field, where applicable? -- Pasi I think a separate column would be the best solution. For example, archiving the status page per kernel version (as I suggested) will lead to issues too. If something that appears to be just fine in 4.6 is found to be horribly broken in, say, 4.10, the archive would still indicate that it WAS ok at that time even if it perhaps was not. Then you have regressions - something that worked in 4.4 may not work in 4.9 - but I still think the best idea is to simply label the status as ok / broken since 4.x, as those who really want to use a broken feature would probably do the research to see if it used to work. Besides, if something that used to work goes haywire it should be fixed quickly :)
Re: Is stability a joke?
Zoiled wrote: Chris Mason wrote: On 09/11/2016 04:55 AM, Waxhead wrote: I have been following BTRFS for years and have recently been starting to use BTRFS more and more, and as always BTRFS' stability is a hot topic. Some say that BTRFS is a dead end research project while others claim the opposite. Taking a quick glance at the wiki does not say much about what is safe to use or not, and it also points to some who are using BTRFS in production. While BTRFS can apparently work well in production it does have some caveats, and finding out which features are safe or not can be problematic, and I especially think that new users of BTRFS can easily be bitten if they do not do a lot of research on it first. The Debian wiki for BTRFS (which is recent by the way) contains a bunch of warnings and recommendations and is for me a bit better than the official BTRFS wiki when it comes to deciding which features to use. The Nouveau graphics driver has a nice feature matrix on its webpage and I think that BTRFS perhaps should consider doing something like that on its official wiki as well. For example something along the lines of (the statuses are taken out of thin air just for demonstration purposes) The out of thin air part is a little confusing, I'm not sure if you're basing this on reports you've read? Well, to be honest I used "whatever I felt was right" more or less in that table, and as I wrote it was only for demonstration purposes, to show how such a table could look. I'm in favor of flagging device replace with raid5/6 as not supported yet. That seems to be where most of the problems are coming in. The compression framework shouldn't allow one to work well with the other unusable.
Ok, good to know. However, from the Debian wiki as well as the link to the mailing list only LZO compression is mentioned (as far as I remember), and I have no idea myself how much difference there is between LZO and the ZLIB code. There were problems with autodefrag related to snapshot-aware defrag, so Josef disabled the snapshot aware part. In general, we put btrfs through heavy use at facebook. The crcs have found serious hardware problems the other filesystems missed. We've also uncovered performance problems and some serious bugs, both in btrfs and the other filesystems. With the other filesystems the fixes were usually upstream (doubly true for the most serious problems), and with btrfs we usually had to make the fixes ourselves. -chris I'll just pop this in here since I assume most people will read the response to your comment: I think I made my point. The wiki lacks some good documentation on what's safe to use and what's not. Yesterday I (Svein Engelsgjerd) put a table on the main wiki and someone has moved it to a status page and also improved the layout a bit. It is a tad more complex than my version, but also a lot better for the slightly more advanced users, and it actually made my view on things a bit clearer as well. I am glad that by bringing this up I (hopefully) contributed slightly to improving the documentation a tiny bit! :) Just for the record - sorry for using my "crap mail" - I sometimes forget to change to the correct sender. I am therefore Svein Engelsgjerd a.k.a. Waxhead a.k.a.
"Zoiled" :) ...sorry for the confusion -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Is stability a joke?
Martin Steigerwald wrote: On Sunday, 11 September 2016, 13:43:59 CEST, Martin Steigerwald wrote: The Nouveau graphics driver has a nice feature matrix on its webpage and I think that BTRFS perhaps should consider doing something like that on its official wiki as well. BTRFS also has a feature matrix. The links to it are in the "News" section however: https://btrfs.wiki.kernel.org/index.php/Changelog#By_feature I disagree, this is not a feature / stability matrix. It is clearly a changelog by kernel version. It is a *feature* matrix. I fully said it's not about stability, but about implementation – I just wrote this a sentence after this one. There is no need whatsoever to further discuss this as I never claimed that it is a feature / stability matrix in the first place. Thing is: This just seems to be a when-has-a-feature-been-implemented matrix. Not when it is considered to be stable. I think this could be done with colors or so. Like red for not supported, yellow for implemented and green for production ready. Exactly, just like the Nouveau matrix. It clearly shows what you can expect from it. I mentioned this matrix as a good *starting* point. And I think it would be easy to extend it: Just add another column called "Production ready". Then research / ask about production stability of each feature. The only challenge is: Who is authoritative on that? I'd certainly ask the developer of a feature, but I'd also consider user reports to some extent. Maybe that's the real challenge. If you wish, I'd go through each feature there and give my own estimation. But I think there are others who are deeper into this. That is exactly the same reason I don't edit the wiki myself. I could of course get it started and hopefully someone would correct what I write, but I feel that if I start this off I don't have deep enough knowledge to do a proper start. Perhaps I will change my mind about this.
I do think for example that scrubbing and auto raid repair are stable, except for RAID 5/6. Also device statistics and RAID 0 and 1 I consider to be stable. I think RAID 10 is also stable, but as I do not run it, I don't know. For me also skinny-metadata is stable. For me so far even compress=lzo seems to be stable, but well, for others it may not be. Since what kernel version? Now, there you go. I have no idea. All I know is that I started BTRFS with kernel 2.6.38 or 2.6.39 on my laptop, but not as RAID 1 at that time. See, the implementation time of a feature is much easier to assess. Maybe that's part of the reason why there is no stability matrix: Maybe no one *exactly* knows *for sure*. How could you? So I would even put a footnote on that "production ready" column explaining "Considered to be stable by developer and user opinions". Of course additionally it would be good to read about experiences of corporate usage of BTRFS. I know at least Fujitsu, SUSE, Facebook and Oracle are using it. But I don't know in what configurations and with what experiences. One Oracle developer invests a lot of time to bring BTRFS-like features to XFS, and Red Hat still favors XFS over BTRFS; even SLES defaults to XFS for /home and other non-/ filesystems. That also tells a story. Some ideas you can get from SUSE release notes. Even if you do not want to use it, it tells something and I bet is one of the better sources of information regarding your question you can get at this time. Cause I believe SUSE developers invested some time to assess the stability of features. Cause they would carefully assess what they can support in enterprise environments. There is also someone from Fujitsu who shared experiences in a talk; I can search for the URL to the slides again. By all means, SUSE's wiki is very valuable. I just said that I *prefer* to have that stuff on the BTRFS wiki and feel that is the right place for it.
I bet Chris Mason and other BTRFS developers at Facebook have some idea of what they use within Facebook as well. To what extent they are allowed to talk about it… I don't know. My personal impression is that as soon as Chris went to Facebook he became quite quiet. Maybe just due to being busy. Maybe due to Facebook being concerned much more about the privacy of itself than of its users. Thanks,
Re: Is stability a joke?
Martin Steigerwald wrote: On Sunday, 11 September 2016, 13:21:30 CEST, Zoiled wrote: Martin Steigerwald wrote: On Sunday, 11 September 2016, 10:55:21 CEST, Waxhead wrote: I have been following BTRFS for years and have recently been starting to use BTRFS more and more, and as always BTRFS' stability is a hot topic. Some say that BTRFS is a dead end research project while others claim the opposite. First off: On my systems BTRFS definitely runs too stable for a research project. Actually: I have zero issues with the stability of BTRFS on *any* of my systems at the moment and in the last half year. The only issue I had until about half a year ago was BTRFS getting stuck seeking free space on a highly fragmented RAID 1 + compress=lzo /home. This went away with either kernel 4.4 or 4.5. Additionally, I never ever lost even a single byte of data on my own BTRFS filesystems. I had a checksum failure on one of the SSDs, but BTRFS RAID 1 repaired it. Where do I use BTRFS? 1) On this ThinkPad T520 with two SSDs. /home and / in RAID 1, another data volume as single. In case you can read German, search blog.teamix.de for BTRFS. 2) On my music box ThinkPad T42 for /home. I did not bother to change / so far and may never do so for this laptop. It has a slow 2.5 inch harddisk. 3) I used it on a workstation at work as well, for a data volume in RAID 1. But the workstation is no more (not due to a filesystem failure). 4) On a server VM for /home with Maildirs and Owncloud data. /var is still on Ext4, but I want to migrate it as well. Whether I'll ever change /, I don't know. 5) On another server VM, a backup VM which I currently use with borgbackup. With borgbackup I wouldn't really need BTRFS, but well… 6) On *all* of my external eSATA based backup harddisks for snapshotting older states of the backups.
In other words, you are one of those who claim the opposite :) I have also myself run btrfs for a "toy" filesystem since 2013 without any issues, but this is more or less irrelevant since some people have experienced data loss thanks to unstable features that are not clearly marked as such. And making the claim that you have not lost a single byte of data does not make sense; how did you test this? SHA256 against a backup? :) Do you have any proof like that with *any* other filesystem on Linux? No, my claim is a bit weaker: BTRFS' own scrubbing feature and, well, no I/O errors on rsyncing my data over to the backup drive - BTRFS checks checksums on read as well – and yes, I know BTRFS uses a weaker hashing algorithm, I think crc32c. Yet this is still more than what I can say about *any* other filesystem I have used so far. To my current knowledge neither XFS nor Ext4/3 provide data checksumming. They do have metadata checksumming, and I found contradicting information on whether XFS may support data checksumming in the future, but up to now, no *proof* *whatsoever* from the side of the filesystem that the data is what it was when I saved it initially. There may be bit errors rotting on any of your Ext4 and XFS filesystems without you even noticing for *years*. I think that's still unlikely, but it can happen; I saw this years ago after restoring a backup with bit errors from a hardware RAID controller. Of course, I rely on the checksumming feature within BTRFS – which may have errors. But even that is more than with any other filesystem I had before. And I do not scrub daily, especially not the backup disks, but for any scrubs up to now, no issues. So, granted, my claim has been a bit bold. Right now I have no up-to-this-day scrubs, so all I can say is that I am not aware of any data losses up to the point in time where I last scrubbed my devices. Just redoing the scrubbing now on my laptop. The way I see it BTRFS is the best filesystem we have got so far.
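The read-time detection being discussed can be sketched like this; note that btrfs actually uses crc32c, while Python's zlib.crc32 is plain CRC-32, used here only as a stand-in to show the principle:

```python
import zlib

# A data block and the checksum recorded for it at write time.
block = bytearray(b"some file data " * 64)
stored_csum = zlib.crc32(bytes(block))

# A single flipped bit ("bit rot") while the data sat on disk:
block[100] ^= 0x01

# On read, the recomputed checksum no longer matches, so the bad copy
# can be rejected and (with RAID 1) repaired from the good mirror.
assert zlib.crc32(bytes(block)) != stored_csum
print("corruption detected")
```

A plain filesystem without data checksums would return the flipped byte without complaint, which is the thread's point about Ext4/XFS.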
It is also the first (to my knowledge) that provides checksums of both data and metadata. My point was simply that such an extraordinary claim requires some evidence. I am not saying it is unlikely that you have never lost a byte, I am just saying that it is a fantastic thing to claim. The Debian wiki for BTRFS (which is recent by the way) contains a bunch of warnings and recommendations and is for me a bit better than the official BTRFS wiki when it comes to deciding which features to use. Nice page. I wasn't aware of this one. If you use BTRFS with Debian, I suggest usually using the recent backport kernel, currently 4.6. Hmmm, maybe I had better remove that compress=lzo mount option. Never saw any issue with it, though. Will research what they say about it. My point exactly: You did not know about this and hence the risk of your data being gnawed on. Well, I do follow the BTRFS mailing list to some extent and I recommend anyone who uses BTRFS in production to do this. And: So far I see no data loss from using that option, and for me personally it is exactly that which counts. :) Still: An information on what features are stable with what version
Is stability a joke?
I have been following BTRFS for years and have recently been starting to use BTRFS more and more, and as always BTRFS' stability is a hot topic. Some say that BTRFS is a dead end research project while others claim the opposite. Taking a quick glance at the wiki does not say much about what is safe to use or not, and it also points to some who are using BTRFS in production. While BTRFS can apparently work well in production it does have some caveats, and finding out which features are safe or not can be problematic, and I especially think that new users of BTRFS can easily be bitten if they do not do a lot of research on it first. The Debian wiki for BTRFS (which is recent by the way) contains a bunch of warnings and recommendations and is for me a bit better than the official BTRFS wiki when it comes to deciding which features to use. The Nouveau graphics driver has a nice feature matrix on its webpage and I think that BTRFS perhaps should consider doing something like that on its official wiki as well. For example something along the lines of (the statuses are taken out of thin air just for demonstration purposes):

Kernel version 4.7
+----------------------------+--------+-----+--------+--------+--------+-------+-------+
| Feature / Redundancy level | Single | Dup | Raid0  | Raid1  | Raid10 | Raid5 | Raid6 |
+----------------------------+--------+-----+--------+--------+--------+-------+-------+
| Subvolumes                 | Ok     | Ok  | Ok     | Ok     | Ok     | Bad   | Bad   |
| Snapshots                  | Ok     | Ok  | Ok     | Ok     | Ok     | Bad   | Bad   |
| LZO Compression            | Bad(1) | Bad | Bad    | Bad(2) | Bad    | Bad   | Bad   |
| ZLIB Compression           | Ok     | Ok  | Ok     | Ok     | Ok     | Bad   | Bad   |
| Autodefrag                 | Ok     | Bad | Bad(3) | Ok     | Ok     | Bad   | Bad   |
+----------------------------+--------+-----+--------+--------+--------+-------+-------+
(1) Some explanation here... (2) Some explanation there (3) And some explanation elsewhere... ...etc...etc...

I therefore would like to propose that some sort of feature / stability matrix for the latest kernel is added to the wiki, preferably somewhere where it is easy to find.
It would be nice to archive old matrices as well in case someone runs a bit older kernel (we who use Debian tend to like older kernels). In my opinion it would make things a bit easier and perhaps a bit less scary too. Remember, if you get bitten badly once you tend to stay away from it all just in case; if you on the other hand know what bites, you can safely pet the fluffy end instead :)
Re: Btrfs scrub failure for raid 6 kernel 4.3
Chris Murphy wrote: Well, all the generations on all devices are now the same, and so are the chunk trees. I haven't looked at them in detail to see if there are any discrepancies among them. If you don't care much for this file system, then you could try btrfs check --repair, using btrfs-progs 4.3.1 or the integration branch. I have no idea where btrfsck repair is at with raid56. On the one hand, corruption should be fixed by scrub. But scrub fails with a kernel trace. Maybe btrfs check --repair can fix the tree block corruption since scrub can't, and then if that corruption is fixed, possibly scrub will work. I could not care less about this particular filesystem, as I wrote in the original post. It's just for having some fun with btrfs. What I find troublesome is that corrupting one (or even two) drives in a raid6 config fails. Granted, the filesystem "works", e.g. I can mount it and access files, but I get an input/output error on a file on this filesystem and btrfs only shows warnings (not errors) on device sdg1 where the csum failed. A raid6 setup should work fine even with two disks (or in this case chunks of data) missing, and even if I don't care about this filesystem I care about btrfs getting stable ;) so if I can help I'll keep this filesystem around for a little longer!
Re: Btrfs scrub failure for raid 6 kernel 4.3
Waxhead wrote: Chris Murphy wrote: Well all the generations on all devices are now the same, and so are the chunk trees. I haven't looked at them in detail to see if there are any discrepancies among them. If you don't care much for this file system, then you could try btrfs check --repair, using btrfs-progs 4.3.1 or integration branch. I have no idea where btrfsck repair is at with raid56. On the one hand, corruption should be fixed by scrub. But scrub fails with a kernel trace. Maybe btrfs check --repair can fix the tree block corruption since scrub can't, and then if that corruption is fixed, possibly scrub will work. I could not care less about this particular filesystem as I wrote in the original post. It's just for having some fun with btrfs. What I find troublesome is that corrupting one (or even two) drives in a Raid6 config fails. Granted the filesystem "works" e.g. I can mount it and access files, but I get a input/output error on a file on this filesystem and btrfs only shows warning (not errors) on device sdg1 where the csum failed. A raid6 setup should work fine even if two missing disks (or in this case chunks of data) is missing and even if I don't care about this filesystem I care about btrfs getting stable ;) so if I can help I'll keep this filesystem around for a little longer! For your information I tried a balance on the filesystem - a new stack trace below (the system is still working). Sorry for flooding the mailinglist with the stack trace - this is what I got from dmesg , hope it is of some use... / gets used... :) [ 243.603661] CPU: 0 PID: 1182 Comm: btrfs Tainted: G W 4.3.0-1-686-pae #1 Debian 4.3.3-2 [ 243.603664] Hardware name: Acer AOA150/, BIOS v0.3310 10/06/2008 [ 243.603676] 09f7a8eb eef57990 c12ae3c5 c106685d c1614e20 [ 243.603687] 049e f86df010 190a f86350ff 0009 f86350ff f1dd8b18 [ 243.603697] 0078 eef579a0 c1066962 0009 eef57a6c f86350ff [ 243.603699] Call Trace: [ 243.603716] [] ? dump_stack+0x3e/0x59 [ 243.603724] [] ? 
warn_slowpath_common+0x8d/0xc0 [ 243.603763] [] ? __btrfs_free_extent+0xbbf/0xec0 [btrfs] [ 243.603798] [] ? __btrfs_free_extent+0xbbf/0xec0 [btrfs] [ 243.603806] [] ? warn_slowpath_null+0x22/0x30 [ 243.603837] [] ? __btrfs_free_extent+0xbbf/0xec0 [btrfs] [ 243.603877] [] ? __btrfs_run_delayed_refs+0x96e/0x11a0 [btrfs] [ 243.603889] [] ? __percpu_counter_add+0x8e/0xb0 [ 243.603930] [] ? btrfs_run_delayed_refs+0x6d/0x250 [btrfs] [ 243.603969] [] ? btrfs_should_end_transaction+0x3c/0x60 [btrfs] [ 243.604003] [] ? btrfs_drop_snapshot+0x426/0x850 [btrfs] [ 243.604110] [] ? merge_reloc_roots+0xee/0x260 [btrfs] [ 243.604152] [] ? remove_backref_node+0x67/0xe0 [btrfs] [ 243.604198] [] ? relocate_block_group+0x28f/0x750 [btrfs] [ 243.604242] [] ? btrfs_relocate_block_group+0x1d8/0x2e0 [btrfs] [ 243.604282] [] ? btrfs_relocate_chunk.isra.29+0x3d/0xf0 [btrfs] [ 243.604326] [] ? btrfs_balance+0x97c/0x12e0 [btrfs] [ 243.604338] [] ? __alloc_pages_nodemask+0x13b/0x850 [ 243.604345] [] ? get_page_from_freelist+0x3dd/0x5c0 [ 243.604391] [] ? btrfs_ioctl_balance+0x385/0x390 [btrfs] [ 243.604430] [] ? btrfs_ioctl+0x793/0x2c50 [btrfs] [ 243.604437] [] ? __alloc_pages_nodemask+0x13b/0x850 [ 243.604443] [] ? terminate_walk+0x69/0xc0 [ 243.604453] [] ? anon_vma_prepare+0xdf/0x130 [ 243.604460] [] ? page_add_new_anon_rmap+0x6c/0x90 [ 243.604468] [] ? handle_mm_fault+0xa63/0x14f0 [ 243.604476] [] ? __rb_insert_augmented+0xf3/0x1c0 [ 243.604520] [] ? update_ioctl_balance_args+0x1c0/0x1c0 [btrfs] [ 243.604527] [] ? do_vfs_ioctl+0x2e2/0x500 [ 243.604534] [] ? do_brk+0x113/0x2b0 [ 243.604542] [] ? __do_page_fault+0x1a0/0x460 [ 243.604549] [] ? SyS_ioctl+0x68/0x80 [ 243.604557] [] ? 
sysenter_do_call+0x12/0x12 [ 243.604563] ---[ end trace eb3e6200cba2a564 ]--- [ 243.604654] [ cut here ] [ 243.604695] WARNING: CPU: 0 PID: 1182 at /build/linux-P8Ifgy/linux-4.3.3/fs/btrfs/extent-tree.c:6410 __btrfs_free_extent+0xbbf/0xec0 [btrfs]() [ 243.604813] Modules linked in: cpufreq_stats cpufreq_conservative cpufreq_userspace bnep cpufreq_powersave zram zsmalloc lz4_compress nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc joydev iTCO_wdt iTCO_vendor_support sparse_keymap arc4 acerhdf coretemp pcspkr evdev psmouse serio_raw i2c_i801 uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core v4l2_common videodev media lpc_ich mfd_core btusb btrtl btbcm btintel rng_core bluetooth ath5k ath snd_hda_codec_realtek snd_hda_codec_generic mac80211 jmb38x_ms snd_hda_intel i915 cfg80211 memstick snd_hda_codec rfkill snd_hda_core snd_hwdep drm_kms_helper snd_pcm snd_timer shpchp snd soundcore drm i2c_algo_bit wmi battery video ac button acpi_cpufreq processor sg loop autofs4 uas usb_storage ext4 crc16 mbcache jbd2 crc32c_generic btrfs xor [ 243.604837]
Re: Btrfs scrub failure for raid 6 kernel 4.3
Chris Murphy wrote: On Mon, Dec 28, 2015 at 3:55 PM, Waxhead <waxh...@online.no> wrote: I tried the following btrfs-image -t4 -c9 /dev/sdb1 /btrfs_raid6.img checksum verify failed on 28734324736 found C3E98F3B wanted EB2392C6 checksum verify failed on 28734324736 found C3E98F3B wanted EB2392C6 checksum verify failed on 28734324736 found 5F516E2A wanted BBB2D39C checksum verify failed on 28734324736 found C4AA0B8D wanted 41745FB5 checksum verify failed on 28734324736 found C3E98F3B wanted EB2392C6 bytenr mismatch, want=28734324736, have=16273726433708437499 Error reading metadata block Error adding block -5 checksum verify failed on 28734324736 found C3E98F3B wanted EB2392C6 checksum verify failed on 28734324736 found C3E98F3B wanted EB2392C6 checksum verify failed on 28734324736 found 5F516E2A wanted BBB2D39C checksum verify failed on 28734324736 found C4AA0B8D wanted 41745FB5 checksum verify failed on 28734324736 found C3E98F3B wanted EB2392C6 bytenr mismatch, want=28734324736, have=16273726433708437499 Error reading metadata block Error flushing pending -5 create failed (Success) Well, I can't make out what this is supposed to mean, but no output file... Dunno. Maybe btrfs-show-super -fa for each device, along with btrfs-debug-tree output to a file might have some useful info for a dev. It could be a while before we hear from one though, considering the season. btrfs-debug-tree creates about 56.3 megabytes of text for each device; only a few checksums (marked [match]) and dev_item.uuid differ between the files for all partitions, so I will leave that out of this post.
The output of btrfs-show-super -fa for each device follows: --snip-- superblock: bytenr=65536, device=/dev/sdb1 - csum 0xdedfa466 [match] bytenr 65536 flags 0x1 ( WRITTEN ) magic _BHRfS_M [match] fsid 2832346e-0720-499f-8239-355534e5721b label generation 15529 root 28539699200 sys_array_size 257 chunk_root_generation 15524 root_level 0 chunk_root 28416425984 chunk_root_level 0 log_root 0 log_root_transid 0 log_root_level 0 total_bytes 49453105152 bytes_used 9239322624 sectorsize 4096 nodesize 16384 leafsize 16384 stripesize 4096 root_dir 6 num_devices 6 compat_flags 0x0 compat_ro_flags 0x0 incompat_flags 0xe1 ( MIXED_BACKREF | BIG_METADATA | EXTENDED_IREF | RAID56 ) csum_type 0 csum_size 4 cache_generation 15529 uuid_tree_generation 15529 dev_item.uuid c14fb599-f515-4feb-a458-227af0af683b dev_item.fsid 2832346e-0720-499f-8239-355534e5721b [match] dev_item.type 0 dev_item.total_bytes 8242184192 dev_item.bytes_used 3305111552 dev_item.io_align 4096 dev_item.io_width 4096 dev_item.sector_size 4096 dev_item.devid 1 dev_item.dev_group 0 dev_item.seek_speed 0 dev_item.bandwidth 0 dev_item.generation 0 sys_chunk_array[2048]: item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 28416409600) chunk length 67108864 owner 2 stripe_len 65536 type SYSTEM|RAID6 num_stripes 6 stripe 0 devid 5 offset 303038464 dev uuid: c292e6f4-d113-47af-96bf-1b262ef28c77 stripe 1 devid 4 offset 1048576 dev uuid: 3458b150-02a7-44a5-81e5-61a734e30439 stripe 2 devid 3 offset 1074790400 dev uuid: a1ae2083-2730-4611-b019-ba9954c8fa13 stripe 3 devid 2 offset 1074790400 dev uuid: 9a1af441-88fc-43b1-b1f8-e1d1163195ef stripe 4 devid 6 offset 1074790400 dev uuid: 094d2a3c-f538-4a4e-850a-e611f2517e7f stripe 5 devid 1 offset 1074790400 dev uuid: c14fb599-f515-4feb-a458-227af0af683b backup_roots[4]: backup 0: backup_tree_root: 28541386752 gen: 15526 level: 0 backup_chunk_root: 28416425984 gen: 15524 level: 0 backup_extent_root: 28541403136 gen: 15526 level: 1 backup_fs_root: 28496560128 gen: 15514 level: 2 backup_dev_root: 28540829696 gen: 15524 level: 0
backup_csum_root: 28541452288 gen: 15526 level: 2 backup_total_bytes: 49453105152 backup_bytes_used: 9164218368 backup_num_devices: 6 backup 1: backup_tree_root: 28502654976 gen: 15527 level: 0 backup_chunk_root: 28416425984 gen: 15524 level: 0 backup_extent_root: 28513714176 gen: 15528 level: 1 backup_fs_root: 28513566720 gen: 15528 level: 2 backup_dev_root: 28513550336 gen: 15527 level: 0 backup_csum_root: 28513615872 gen: 15528 level: 2 backup_total_bytes: 49453105152 backup_bytes_used:9206
Re: Btrfs scrub failure for raid 6 kernel 4.3
Chris Murphy wrote: On Sun, Dec 27, 2015 at 7:04 PM, Waxhead <waxh...@online.no> wrote: Since all drives register and since I can even mount the filesystem. OK so you've umounted the file system, reconnected all devices, mounted the file system normally, and there are no problems reported in dmesg? If so, yes I agree that a scrub should probably work, it should fix any problems with the simulated corrupt device, and also not crash. What if you umount, and run btrfs check without --repair, what are the results? This is btrfs-progs 4.3.1? The output from dmesg after mounting [ 546.857533] BTRFS info (device sdg1): disk space caching is enabled [ 546.872126] BTRFS: bdev /dev/sde1 errs: wr 10, rd 0, flush 0, corrupt 29094, gen 4 [ 546.872165] BTRFS: bdev /dev/sdb1 errs: wr 16, rd 7, flush 0, corrupt 0, gen 0 This is the output I get from btrfs check /dev/sdb1 > somefile.output (note the filesystem was checked in unmounted state) Checking filesystem on /dev/sdb1 UUID: 2832346e-0720-499f-8239-355534e5721b The following tree block(s) is corrupted in tree 5: tree block bytenr: 28488941568, level: 1, node key: (52273, 1, 0) The following data extent is lost in tree 5: inode: 104721, offset:0, disk_bytenr: 37828165632, disk_len: 524288 found 9161007108 bytes used err is 1 total csum bytes: 8859672 total tree bytes: 80969728 total fs tree bytes: 66633728 total extent tree bytes: 3211264 btree space waste bytes: 10638420 file data blocks allocated: 7929208832 referenced 7929208832 btrfs-progs v4.3 I also get a pretty fantastic amount of errors that will not be redirected to a file. 
---snip--- parent transid verify failed on 28597895168 wanted 371 found 339 parent transid verify failed on 28597895168 wanted 371 found 339 checksum verify failed on 28597895168 found 5D16DA87 wanted B9F56731 checksum verify failed on 28597895168 found 1183EB4E wanted C18D87AC checksum verify failed on 28597895168 found 1183EB4E wanted C18D87AC bytenr mismatch, want=28597895168, have=147474999040 Incorrect local backref count on 37826895872 root 5 owner 104850 offset 0 found 0 wanted 1 back 0x94ac688 Backref disk bytenr does not match extent record, bytenr=37826895872, ref bytenr=0 backpointer mismatch on [37826895872 131072] owner ref check failed [37826895872 131072] ref mismatch on [37827117056 475136] extent item 1, found 0 parent transid verify failed on 28597714944 wanted 371 found 339 parent transid verify failed on 28597714944 wanted 371 found 339 checksum verify failed on 28597714944 found 49CB81B9 wanted AD283C0F checksum verify failed on 28597714944 found D9F20AF8 wanted 1A5EE553 checksum verify failed on 28597714944 found D9F20AF8 wanted 1A5EE553 bytenr mismatch, want=28597714944, have=147480498944 Incorrect local backref count on 37827117056 root 5 owner 104719 offset 0 found 0 wanted 1 back 0x93688b0 Backref disk bytenr does not match extent record, bytenr=37827117056, ref bytenr=37827026944 backpointer mismatch on [37827117056 475136] owner ref check failed [37827117056 475136] ref mismatch on [37827641344 487424] extent item 1, found 0 parent transid verify failed on 28597714944 wanted 371 found 339 parent transid verify failed on 28597714944 wanted 371 found 339 checksum verify failed on 28597714944 found 49CB81B9 wanted AD283C0F checksum verify failed on 28597714944 found D9F20AF8 wanted 1A5EE553 checksum verify failed on 28597714944 found D9F20AF8 wanted 1A5EE553 bytenr mismatch, want=28597714944, have=147480498944 Incorrect local backref count on 37827641344 root 5 owner 104720 offset 0 found 0 wanted 1 back 0x94ac778 Backref disk bytenr does 
not match extent record, bytenr=37827641344, ref bytenr=0 backpointer mismatch on [37827641344 487424] owner ref check failed [37827641344 487424] ref mismatch on [37828165632 524288] extent item 1, found 0 parent transid verify failed on 28597714944 wanted 371 found 339 parent transid verify failed on 28597714944 wanted 371 found 339 checksum verify failed on 28597714944 found 49CB81B9 wanted AD283C0F checksum verify failed on 28597714944 found D9F20AF8 wanted 1A5EE553 checksum verify failed on 28597714944 found D9F20AF8 wanted 1A5EE553 bytenr mismatch, want=28597714944, have=147480498944 Incorrect local backref count on 37828165632 root 5 owner 104721 offset 0 found 0 wanted 1 back 0x94ac868 Backref disk bytenr does not match extent record, bytenr=37828165632, ref bytenr=0 backpointer mismatch on [37828165632 524288] owner ref check failed [37828165632 524288] checking free space cache checking fs roots ---snip end--- ---snippety snip--- root 5 inode 53325 errors 2001, no inode item, link count wrong unresolved ref dir 52207 index 0 namelen 14 name pthreadtypes.h filetype 1 errors 6, no dir index, no inode ref root 5 inode 53328 errors 2001, no inode item, link count wrong unresolved ref dir 52207 index 0 namelen 8 name select.h filetype 1 errors 6, no dir index, no inode ref root 5 inode 533
Re: Btrfs scrub failure for raid 6 kernel 4.3
Duncan wrote: Waxhead posted on Mon, 28 Dec 2015 03:04:33 +0100 as excerpted: Duncan wrote: Waxhead posted on Mon, 28 Dec 2015 00:06:46 +0100 as excerpted: btrfs scrub status /mnt scrub status for 2832346e-0720-499f-8239-355534e5721b scrub started at Sun Mar 29 23:21:04 2015 Now here is the first worrying part... it says that scrub started at Sun Mar 29. Hmm... The status is stored in readable plain-text files in /var/lib/ btrfs/scrub.status.*, where the * is the UUID. If you check there, the start time (t_start) seems to be in POSIX time. Is it possible you were or are running the scrub from, for instance, a rescue image that might not set the system time correctly and that falls back to, say, the date the rescue image was created, if it can't get network connectivity or some such? No I don't think so # ls -la /var/lib/btrfs/scrub.status.2832346e-0720-499f-8239-355534e5721b -rw--- 1 root root 2315 Mar 29 2015 /var/lib/btrfs/scrub.status.2832346e-0720-499f-8239-355534e5721b # cat /var/lib/btrfs/scrub.status.2832346e-0720-499f-8239-355534e5721b scrub status:1 2832346e-0720-499f-8239-355534e5721b:1|[...]|t_start:1427664064|[...] # date Mon Dec 28 02:54:11 CET 2015 Just to clear up any possible misunderstandings. I run this from a simple netbook, and I have no idea why the date is off by so much. Well, both the file time and the unix time in the file say back in March, so whatever time syncing mechanism you use on that netbook, it evidently failed the boot you did that scrub. The netbook is set up with NTP with pfSense as a host server. The pfSense is itself synched with multiple pools. Note: I have used the same USB drives (memory sticks really) to create various configs of btrfs filesystems earlier. Could it be old metadata in the filesystem that mess up things? Is not metadata stamped with the UUID of the filesystem to prevent such things? Yes, metadata is stamped with UUID. 
But one other possible explanation for the scrub time back in March might be if you were already playing with it back then, and somehow you have a USB stick with a filesystem from back then that... somehow... has the same UUID as the one you're experimenting on today. Yes, I have played around with these usb sticks for a long time, probably also before March 29. Don't ask me how it could get the same UUID. I don't understand it either. But if it did somehow happen, btrfs would be /very/ confused, and crashing scrubs and further data corruption could certainly result. What if my use of dd accidentally trashed some important part of the new filesystem and btrfs therefore thinks an older version of the filesystem is the current one? If UUIDs are in every metadata block I find that pretty hard to believe. What if the UUID == 0? Is this accounted for? Of course if you weren't experimenting with btrfs on these devices back at the end of March and there's absolutely no way they could have gotten btrfs on them until say October or whenever, then we're back to the date somehow being wrong for that scrub, and having to look elsewhere for why scrub is crashing. No, by all means - I tried a lot of weird stuff on those usb sticks way before March, so they definitely had a (multi-disk) btrfs filesystem on them before.
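The t_start value in the status file can be checked directly. Decoding 1427664064 does land exactly on the date scrub printed; note that central Europe had already switched to summer time (UTC+2) in the small hours of that very morning:

```python
from datetime import datetime, timezone, timedelta

t_start = 1427664064  # from /var/lib/btrfs/scrub.status.<UUID>
utc = datetime.fromtimestamp(t_start, tz=timezone.utc)
print(utc.strftime("%a %b %d %H:%M:%S %Y"))    # Sun Mar 29 21:21:04 2015 (UTC)

# shift to CEST (UTC+2, in effect from 01:00 UTC that same morning)
local = utc.astimezone(timezone(timedelta(hours=2)))
print(local.strftime("%a %b %d %H:%M:%S %Y"))  # Sun Mar 29 23:21:04 2015
```

So the stored timestamp, the file mtime, and the printed "Sun Mar 29 23:21:04 2015" are all internally consistent; whatever went wrong, it was the system clock at scrub time, not the status file.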
Btrfs scrub failure for raid 6 kernel 4.3
Hi, I have a "toy-array" of 6x USB drives hooked up to a hub where I made a btrfs raid 6 data+metadata filesystem. I copied some files to the filesystem, ripped out one USB drive and ruined it dd if=/dev/random to various locations on the drive. Put the USB drive back and the filesystem mounts ok. If i start scrub I after seconds get the following kernel:[ 50.844026] CPU: 1 PID: 91 Comm: kworker/u4:2 Not tainted 4.3.0-1-686-pae #1 Debian 4.3.3-2 kernel:[ 50.844026] Hardware name: Acer AOA150/, BIOS v0.3310 10/06/2008 kernel:[ 50.844026] Workqueue: btrfs-endio-raid56 btrfs_endio_raid56_helper [btrfs] kernel:[ 50.844026] task: f642c040 ti: f664c000 task.ti: f664c000 kernel:[ 50.844026] Stack: kernel:[ 50.844026] 0005 f0d20800 f664ded0 f86d0262 f664deac c109a0fc 0001 kernel:[ 50.844026] f79eac40 edb4a000 edb7a000 edb8a000 edbba000 eccc1000 ecca1000 kernel:[ 50.844026] f664de68 0003 f664de74 ecb23000 f664de5c f5cda6a4 f0d20800 kernel:[ 50.844026] Call Trace: kernel:[ 50.844026] [] ? finish_parity_scrub+0x272/0x560 [btrfs] kernel:[ 50.844026] [] ? set_next_entity+0x8c/0xba0 kernel:[ 50.844026] [] ? bio_endio+0x40/0x70 kernel:[ 50.844026] [] ? btrfs_scrubparity_helper+0xce/0x270 [btrfs] kernel:[ 50.844026] [] ? process_one_work+0x14d/0x360 kernel:[ 50.844026] [] ? worker_thread+0x39/0x440 kernel:[ 50.844026] [] ? process_one_work+0x360/0x360 kernel:[ 50.844026] [] ? kthread+0xa6/0xc0 kernel:[ 50.844026] [] ? ret_from_kernel_thread+0x21/0x30 kernel:[ 50.844026] [] ? kthread_create_on_node+0x130/0x130 kernel:[ 50.844026] Code: 6e c1 e8 ac dd f2 ff 83 c4 04 5b 5d c3 8d b6 00 00 00 00 31 c9 81 3d 84 f0 6e c1 84 f0 6e c1 0f 95 c1 eb b9 8d b4 200 00 00 00 0f 0b 8d b4 26 00 00 00 00 8d bc 27 00 kernel:[ 50.844026] EIP: [] kunmap_high+0xa8/0xc0 SS:ESP 0068:f664de40 This is only a test setup and I will keep this filesystem for a while if it can be of any use... 
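One way to make a corruption test like this reproducible (so the exact damaged locations and lengths can be reported with the bug, as requested later in the thread) is to script the dd step and record the offsets. A hypothetical helper, not anything btrfs-progs ships; it works on a scratch image file (for a raw block device you would need lseek to find the size instead of os.path.getsize):

```python
import os
import random

def corrupt(path, count=4, length=4096, seed=42):
    """Overwrite `count` random blocks of `length` bytes and return their
    offsets, so the exact damage can be included in a bug report.
    DESTRUCTIVE: point it only at a scratch image file or throwaway device."""
    random.seed(seed)                  # fixed seed -> same offsets every run
    size = os.path.getsize(path)
    hits = []
    with open(path, "r+b") as f:
        for _ in range(count):
            off = random.randrange(0, size - length)
            f.seek(off)
            f.write(os.urandom(length))  # random garbage, like dd if=/dev/random
            hits.append(off)
    return hits
```

Run against a loop-mounted image of the member being "ruined", the returned offset list is exactly the information a developer would need to replay the damage.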
Re: Btrfs scrub failure for raid 6 kernel 4.3
Duncan wrote: Waxhead posted on Mon, 28 Dec 2015 00:06:46 +0100 as excerpted: btrfs scrub status /mnt scrub status for 2832346e-0720-499f-8239-355534e5721b scrub started at Sun Mar 29 23:21:04 2015 and finished after 00:01:04 total bytes scrubbed: 1.97GiB with 14549 errors error details: super=2 csum=14547 corrected errors: 0, uncorrectable errors: 14547, unverified errors: 0 Now here is the first worrying part... it says that scrub started at Sun Mar 29. That is NOT true, the first scrub I did on this filesystem was a few days ago and it claims it is a lot of uncorrectable errors. Why? This is after all a raid6 filesystem correct?! Hmm... The status is stored in readable plain-text files in /var/lib/ btrfs/scrub.status.*, where the * is the UUID. If you check there, the start time (t_start) seems to be in POSIX time. Is it possible you were or are running the scrub from, for instance, a rescue image that might not set the system time correctly and that falls back to, say, the date the rescue image was created, if it can't get network connectivity or some such? 
No I don't think so # ls -la /var/lib/btrfs/scrub.status.2832346e-0720-499f-8239-355534e5721b -rw--- 1 root root 2315 Mar 29 2015 /var/lib/btrfs/scrub.status.2832346e-0720-499f-8239-355534e5721b # cat /var/lib/btrfs/scrub.status.2832346e-0720-499f-8239-355534e5721b scrub status:1 2832346e-0720-499f-8239-355534e5721b:1|data_extents_scrubbed:5391|tree_extents_scrubbed:21|data_bytes_scrubbed:352542720|tree_bytes_scrubbed:344064|read_errors:0|csum_errors:0|verify_errors:0|no_csum:32|csum_discards:0|super_errors:0|malloc_errors:0|uncorrectable_errors:0|corrected_errors:0|last_physical:3306160128|t_start:1427664064|t_resumed:0|duration:51|canceled:0|finished:1 2832346e-0720-499f-8239-355534e5721b:2|data_extents_scrubbed:5404|tree_extents_scrubbed:26|data_bytes_scrubbed:353517568|tree_bytes_scrubbed:425984|read_errors:0|csum_errors:0|verify_errors:0|no_csum:64|csum_discards:2|super_errors:0|malloc_errors:0|uncorrectable_errors:0|corrected_errors:0|last_physical:3306160128|t_start:1427664064|t_resumed:0|duration:51|canceled:0|finished:1 2832346e-0720-499f-8239-355534e5721b:3|data_extents_scrubbed:5396|tree_extents_scrubbed:19|data_bytes_scrubbed:352718848|tree_bytes_scrubbed:311296|read_errors:0|csum_errors:0|verify_errors:0|no_csum:48|csum_discards:2|super_errors:0|malloc_errors:0|uncorrectable_errors:0|corrected_errors:0|last_physical:3306160128|t_start:1427664064|t_resumed:0|duration:51|canceled:0|finished:1 2832346e-0720-499f-8239-355534e5721b:4|data_extents_scrubbed:5391|tree_extents_scrubbed:31|data_bytes_scrubbed:352739328|tree_bytes_scrubbed:507904|read_errors:0|csum_errors:14547|verify_errors:0|no_csum:32|csum_discards:0|super_errors:2|malloc_errors:0|uncorrectable_errors:14547|corrected_errors:0|last_physical:2282749952|t_start:1427664064|t_resumed:0|duration:64|canceled:0|finished:1 
2832346e-0720-499f-8239-355534e5721b:5|data_extents_scrubbed:5393|tree_extents_scrubbed:23|data_bytes_scrubbed:352665600|tree_bytes_scrubbed:376832|read_errors:0|csum_errors:0|verify_errors:0|no_csum:48|csum_discards:0|super_errors:0|malloc_errors:0|uncorrectable_errors:0|corrected_errors:0|last_physical:2534408192|t_start:1427664064|t_resumed:0|duration:51|canceled:0|finished:1 2832346e-0720-499f-8239-355534e5721b:6|data_extents_scrubbed:5407|tree_extents_scrubbed:33|data_bytes_scrubbed:353361920|tree_bytes_scrubbed:540672|read_errors:0|csum_errors:0|verify_errors:0|no_csum:48|csum_discards:2|super_errors:0|malloc_errors:0|uncorrectable_errors:0|corrected_errors:0|last_physical:3306160128|t_start:1427664064|t_resumed:0|duration:51|canceled:0|finished:1 # date Mon Dec 28 02:54:11 CET 2015 Just to clear up any possible misunderstandings: I run this from a simple netbook, and I have no idea why the date is off by so much. All drives register and I can even mount the filesystem. Since I can reproduce this every time I try to start a scrub, I have not tried to run balance, defrag, or just md5sum all the files on the filesystem to see if that fixes up things a bit. In a raid6 config you should be able to lose up to two drives, and honestly so far only one drive is hampered, and even if another one for any bizarre reason should contain damaged data, things should "just work", right? Note: I have used the same USB drives (memory sticks really) to create various configs of btrfs filesystems earlier. Could it be old metadata in the filesystem that messes things up? Is not metadata stamped with the UUID of the filesystem to prevent such things?
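The pipe-delimited records above are easy to pull apart per device. A small sketch (field names taken from the dump above; the file layout itself is btrfs-progs internal, so treat this as illustration, not a stable format):

```python
def parse_scrub_status(text):
    """Split a /var/lib/btrfs/scrub.status.<UUID> dump into {devid: {field: int}}.
    Each record looks like '<fsid>:<devid>|key:value|key:value|...'."""
    devices = {}
    for line in text.splitlines():
        if "|" not in line:            # skip the 'scrub status:1' header line
            continue
        head, *fields = line.split("|")
        devid = int(head.rsplit(":", 1)[1])
        devices[devid] = {k: int(v) for k, v in
                          (f.split(":", 1) for f in fields)}
    return devices

sample = ("scrub status:1\n"
          "2832346e-0720-499f-8239-355534e5721b:4|csum_errors:14547|"
          "super_errors:2|uncorrectable_errors:14547|t_start:1427664064\n")
stats = parse_scrub_status(sample)
print(stats[4]["csum_errors"])         # 14547
```

Run over the full dump, this makes the asymmetry obvious at a glance: only devid 4 carries csum_errors and super_errors, matching the one deliberately ruined stick.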
Re: Btrfs scrub failure for raid 6 kernel 4.3
Chris Murphy wrote: On Sun, Dec 27, 2015 at 6:59 AM, Waxhead <waxh...@online.no> wrote: Hi, I have a "toy-array" of 6x USB drives hooked up to a hub where I made a btrfs raid 6 data+metadata filesystem. I copied some files to the filesystem, ripped out one USB drive and ruined it dd if=/dev/random to various locations on the drive. Put the USB drive back and the filesystem mounts ok. If i start scrub I after seconds get the following kernel:[ 50.844026] CPU: 1 PID: 91 Comm: kworker/u4:2 Not tainted 4.3.0-1-686-pae #1 Debian 4.3.3-2 kernel:[ 50.844026] Hardware name: Acer AOA150/, BIOS v0.3310 10/06/2008 kernel:[ 50.844026] Workqueue: btrfs-endio-raid56 btrfs_endio_raid56_helper [btrfs] kernel:[ 50.844026] task: f642c040 ti: f664c000 task.ti: f664c000 kernel:[ 50.844026] Stack: kernel:[ 50.844026] 0005 f0d20800 f664ded0 f86d0262 f664deac c109a0fc 0001 kernel:[ 50.844026] f79eac40 edb4a000 edb7a000 edb8a000 edbba000 eccc1000 ecca1000 kernel:[ 50.844026] f664de68 0003 f664de74 ecb23000 f664de5c f5cda6a4 f0d20800 kernel:[ 50.844026] Call Trace: kernel:[ 50.844026] [] ? finish_parity_scrub+0x272/0x560 [btrfs] kernel:[ 50.844026] [] ? set_next_entity+0x8c/0xba0 kernel:[ 50.844026] [] ? bio_endio+0x40/0x70 kernel:[ 50.844026] [] ? btrfs_scrubparity_helper+0xce/0x270 [btrfs] kernel:[ 50.844026] [] ? process_one_work+0x14d/0x360 kernel:[ 50.844026] [] ? worker_thread+0x39/0x440 kernel:[ 50.844026] [] ? process_one_work+0x360/0x360 kernel:[ 50.844026] [] ? kthread+0xa6/0xc0 kernel:[ 50.844026] [] ? ret_from_kernel_thread+0x21/0x30 kernel:[ 50.844026] [] ? kthread_create_on_node+0x130/0x130 kernel:[ 50.844026] Code: 6e c1 e8 ac dd f2 ff 83 c4 04 5b 5d c3 8d b6 00 00 00 00 31 c9 81 3d 84 f0 6e c1 84 f0 6e c1 0f 95 c1 eb b9 8d b4 200 00 00 00 0f 0b 8d b4 26 00 00 00 00 8d bc 27 00 kernel:[ 50.844026] EIP: [] kunmap_high+0xa8/0xc0 SS:ESP 0068:f664de40 This is only a test setup and I will keep this filesystem for a while if it can be of any use... 
Sounds like a bug, but also might be missing functionality still. If you can include the reproduce steps, including the exact locations+lengths of the random writes, that's probably useful. More than one thing could be going on. First, I don't know that Btrfs even understands the device went missing because it doesn't yet have a concept of faulty devices, and then I've seen it get confused when drives reappear with new drive designations (not uncommon), and from your call trace we don't know if that happened because there's not enough information posted. Second, if the damage is too much on a device, it almost certainly isn't recognized when reattached. But this depends on what locations were damaged. If Btrfs doesn't recognize the drive as part of the array, then the scrub request is effectively a scrub for a volume with a missing drive which you probably wouldn't ever do, you'd first replace the missing device. Scrubs happen on normally operating arrays not degraded ones. So it's uncertain either Btrfs, or the user, had any idea what state the volume was actually in at the time. Conversely on mdadm, it knows in such a case to mark a device as faulty, the array automatically goes degraded, but when the drive is reattached it is not automatically re-added. When the user re-adds, typically a complete rebuild happens unless there's a write-intent bitmap, which isn't a default at create time. I am afraid I can't exactly include the how to reproduce steps. I do however have the filesystem in a "bad state" so if there is anything I can do - let me know. First of all ... 
a "btrfs filesystem show" does list all drives Label: none uuid: 2832346e-0720-499f-8239-355534e5721b Total devices 6 FS bytes used 8.53GiB devid 1 size 7.68GiB used 3.08GiB path /dev/sdb1 devid 2 size 7.68GiB used 3.08GiB path /dev/sdc1 devid 3 size 7.68GiB used 3.08GiB path /dev/sdd1 devid 4 size 7.68GiB used 3.08GiB path /dev/sde1 devid 5 size 7.68GiB used 3.08GiB path /dev/sdf1 devid 6 size 7.68GiB used 3.08GiB path /dev/sdg1 mount /dev/sdb1 /mnt/ btrfs filesystem df /mnt Data, RAID6: total=12.00GiB, used=8.45GiB System, RAID6: total=64.00MiB, used=16.00KiB Metadata, RAID6: total=256.00MiB, used=84.58MiB GlobalReserve, single: total=32.00MiB, used=0.00B btrfs scrub status /mnt scrub status for 2832346e-0720-499f-8239-355534e5721b scrub started at Sun Mar 29 23:21:04 2015 and finished after 00:01:04 total bytes scrubbed: 1.97GiB with 14549 errors error details: super=2 csum=14547 corrected errors: 0, uncorrectable errors: 14547, unverified errors: 0 Now here is the first worrying part... it says that scrub started at Sun Mar 29. That is NOT true, the first scrub I did on this filesystem was a few days ago and it claims it is a lot of uncorrectable errors. Why? This is after all a raid6 filesystem correct?!
Re: Hot data Tracking
David Sterba wrote: On Sat, Feb 11, 2012 at 05:49:41AM +0100, Timo Witte wrote: What happened to the hot data tracking feature in btrfs? There are a lot of old patches from Aug 2010, but it looks like the feature has been completely removed from the current version of btrfs. Is this feature still on the roadmap? Removed? AFAIK it hasn't ever been merged, though it'd be a nice feature. There were suggestions to turn it into a generic API for any filesystem to use, but this hasn't happened. The patches are quite independent and it was easy to refresh them on top of the current for-linus branch. A test run did not survive a random xfstest, 013 this time, so I probably mismerged some bits. The patchset lives in branch foreign/ibm/hotdatatrack in my git repo. david Someone recently mentioned bcache in another post, which seems to cover this subject fairly well. However, would it not make sense if btrfs actually was able to automatically take advantage of whatever disks are added to the pool? For example, if you have 10 disks of different sizes and performance in a raid5/6-like configuration, would it not be feasible if btrfs automagically (as an option) could manage its own cache? For example, it could reserve a chunk of free space as cache (based on how much space is free) and stripe data over all disks (cache). When the filesystem becomes idle, or at set intervals, it could empty the cache or move/rebalance pending writes over to the original raid5/6-like setup. As far as I remember, hot data tracking was all about moving the data over to the fastest disk. Why not utilize all disks and benefit from disks working together? Svein Engelsgjerd
Is btrfsck really required?
After playing around with btrfs for a while, reading about it and also watching Avi Miller's presentation on YouTube, I am starting to wonder why one would need btrfsck at all. I am no expert in filesystems, so I apologize if any of these questions sound a bit stupid. 1. How self-healing is btrfs really?! According to Miller's talk, btrfs makes a (circular?) backup of the root tree every 30 seconds. If I remember correctly the root tree is also mirrored in several places on disk, and on rotational media all those are updated in tandem. This leads me to believe that there should be no problem in recovering from a corruption. 2. In addition to question 1: is there some sanity checking when writing the root tree? E.g. if you write garbage to the root tree by accident, will there be some recovery mechanism there to protect you as well? 3. What is the point of mount -o recovery? If there already is a corruption, is there any reason btrfs should not recover automatically by itself? 4. If a disk responds slowly, will btrfs throw it out of a raid configuration, and if so, will a btrfsck be less strict about timeouts, and will it automatically rebalance the data from the bad disk over to other good disks?!
How well does BTRFS manage different sized disks?
Hi, Can someone shed some light on how BTRFS will manage a bunch of disks of varying size for the planned raid5/6, e.g. 3x 2TB disks and 1x 250GB disk? If using a raid5 setup, will 750GB of usable data automatically be laid out as a 4-disk raid5 while the rest is used as a 3-disk raid5?! If so, how do you control what files are on the speedy section of the volume?
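The 4-disk-then-3-disk layout the question describes is what a greedy chunk allocator produces: each new raid5 chunk is striped across every device that still has unallocated space, so the stripe width shrinks once the small disk fills. A rough back-of-the-envelope model of that behaviour (an illustration of the idea, not the kernel's actual allocator):

```python
def raid5_usable(sizes_gib, chunk=1):
    """Greedy model: stripe one chunk across every device with free space;
    one chunk per stripe is parity. Returns usable space in GiB."""
    free = list(sizes_gib)
    usable = 0
    while True:
        live = [i for i, s in enumerate(free) if s >= chunk]
        if len(live) < 2:              # a raid5 stripe needs >= 2 devices
            break
        for i in live:                 # consume one chunk on each live device
            free[i] -= chunk
        usable += chunk * (len(live) - 1)   # one chunk's worth is parity
    return usable

# 3x 2TB + 1x 250GB: first ~250GB striped 4-wide, the remainder 3-wide
print(raid5_usable([2000, 2000, 2000, 250]))  # 4250
```

In this model the first 250 GiB region yields 3 data chunks per stripe (the "speedy section") and the remaining 1750 GiB per large disk yields 2, so nothing in the format itself pins particular files to the wide stripes.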
Will BTRFS repair or restore data if corrupted?
Hi, From what I have read BTRFS does replace a bad copy of data with a known good copy (if it has one). Will BTRFS try to repair the corrupt data or will it simply silently restore the data without the user knowing that a file has been fixed?
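As far as I understand it, the behaviour is heal-on-read and it is not silent: btrfs verifies the checksum (crc32c by default) of what it read, and if a redundant copy checks out it both returns that copy and rewrites the bad one, reporting the fix in dmesg and in the device error counters. A toy illustration of the idea, with plain crc32 standing in for crc32c:

```python
import zlib

def read_block(copies, expected):
    """Return the first copy whose checksum matches, rewriting any bad
    copies from it -- repaired, and reported rather than silent."""
    good = next((c for c in copies if zlib.crc32(c) == expected), None)
    if good is None:
        raise IOError("all copies corrupt: unrecoverable")
    for i, c in enumerate(copies):
        if zlib.crc32(c) != expected:
            print(f"fixed up corrupt copy {i}")  # btrfs logs this in dmesg
            copies[i] = good
    return good

data = b"important bytes"
csum = zlib.crc32(data)
mirrors = [b"important bytez", data]   # mirror 0 has gone bad
assert read_block(mirrors, csum) == data
assert mirrors[0] == data              # bad mirror rewritten from good copy
```

So the file contents the application sees are always the verified copy; the "repair" is a rewrite of the broken replica, and the user can see it happened in the kernel log.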