Re: mkfs.btrfs limits odd [and maybe a failed phantom device?]
On Thu, 11 Dec 2014, Robert White wrote:
> On 12/11/2014 07:56 PM, Zygo Blaxell wrote:
>> RAID5 with even parity and two devices should be exactly the same as
>> RAID1 (i.e. disk1 ^ disk2 == 0, therefore disk1 == disk2; the striping
>> is irrelevant because there is no difference in disk contents, so the
>> disks are interchangeable), except with different behavior when more
>> devices are added (RAID1 will mirror chunks on pairs of disks, RAID5
>> should start writing new chunks with N stripes instead of two).
>
> That's not correct. A RAID5 with three elements presents two
> _different_ sectors in each stripe. When one element is lost, it would
> still present two different sectors, but the safety is gone.

The above quote is discussing two-device RAID5; you are discussing
three-device RAID5.

> I understand that the XOR collapses into a mirror if only two data are
> involved, but that's a mathematical fact that is irrelevant to the
> definition of a RAID5 layout. When you take a wheel off of a tricycle
> it doesn't just become a bike. And you can't make a bicycle into a
> trike by just welding on a wheel somewhere. The infrastructure of the
> two is completely different.

True. A two-device RAID5 is not the same as a degraded three-device
RAID5.

> So RAID5 with three media M is
>
>     M     M     M
>     D1    D2    P(a)
>     D3    P(b)  D4
>     P(c)  D5    D6
>
> If the third medium is lost, D1, D2, D3, and D5 are intact; D4 and D6
> can be recreated via D3^P(b) and P(c)^D5:
>
>     M     M     X
>     D1    D2    .
>     D3    P(b)  .
>     P(c)  D5    .
>
> So under _no_ circumstances would a two-disk RAID5 be the same as a
> RAID1, since a two-disk RAID5 functionally implies disk three because
> the _minimum_ arity of a RAID5 is 3. A two-disk RAID5 has _zero_ data
> protection because the minimum third element is a computational
> phantom.

You again seem to be treating a two-disk RAID5 as synonymous with your
degraded three-disk RAID5 above. It is not. RAID5 with two media M
would be:

    M     M
    D1    P(a)
    P(b)  D2
    D3    P(c)

[and each P would be identical to its corresponding D]

> In short it is irrational to have a two-disk RAID5 that is not
> degraded, in the same way you cannot have a two-wheeled tricycle
> without scraping some part of something along the asphalt.

There is nothing irrational about it at all, except that it is exactly
equivalent to two-disk RAID1. A RAID1 with two elements presents one
sector along the stripe. A RAID5 with N elements presents N-1 sectors
along the stripe, so I'm not sure what the problem is with setting N=2.

> I realize that what has been implemented is what you call a two-drive
> RAID5, and done so by really implementing a RAID1, but it's nonsense.

It's not, really; it's merely an argument of semantics if you want to
define it as nonsense.

> I mean I understand what you are saying you've done, but it makes no
> sense according to the definitions of RAID5. There is no circumstance
> where RAID5 falls back to mirroring. Trying to implement RAID5 as an
> extension of a mirroring paradigm would involve a fundamental conflict
> in definitions. Especially when you reach a failure mode.

I have no idea what you mean by a fundamental conflict in definitions.

> This is so fundamental to the design that the fast way to assemble a
> RAID5 of N-arity (minimum N being 3) is to just connect the first N-1
> elements, declare the raid valid-but-degraded using (N-1) of the
> media, and then replace the Nth phantom/missing/failed element with
> the real disk and trigger a rebuild.

This only works if you don't need the initial contents of the array to
have a specific value like zero.

> (This involves fewest reads and the array is instantly available while
> it builds.)
There is no reason you could not do exactly this with N=2. As soon as
you start writing to the array, the stripes you write repair the
extents if the repair process hasn't gotten to them yet.

> It's basically impossible to turn a mirror into a RAID5 if you _ever_
> expect the code base to be able to recover an array that's lost an
> element.

Again, I'm not really sure what you mean.

> Uh, no. A RAID6 with three drives, or even two drives, is also
> degraded because the minimum is four.

You're doing your weird semantic dance again. Just because you define
the minimum to be four does not mean that someone talking about a
three-device RAID6 is talking about a degraded four-device RAID6;
they're not. As above, a non-degraded three-device RAID6 can be
perfectly sensibly defined. Once again, it has exactly the same failure
properties as a three-device RAID1 (any two of the devices can fail),
so it's a bit pointless. But not impossible...

>     A     B     C     D
>     D1    D2    Pa    Qa
>     D3    Pb    Qb    D4
>     Pc    Qc    D5    D6
>     Qd    D7    D8    Pd
>
> You can lose one or two media but the minimum stripe is again [X1,X2]
> for any read (ABCD)(ABC.)(AB..)(A..D) etc. Minimum arity for RAID6 is
> 4, maximum lost-but-functional configuration is arity-minus-two.

    A     B     C
    D1    Pa    Qa
    Pb    Qb    D2
    Qc    D3    Pc
    D4    Pd    Qd

They're only missing if you believe the
Re: mkfs.btrfs limits odd [and maybe a failed phantom device?]
On 12/12/2014 01:06 AM, David Taylor wrote:
> The above quote is discussing two-device RAID5; you are discussing
> three-device RAID5.

Heresy! (yes, some humor is required here.)

There is no such thing as a two-device RAID5. That's what RAID1 is for.
Saying "the above quote is discussing a two-device RAID5" is exactly
like saying "the above quote is discussing a two-wheeled tricycle". You
might as well be talking about three-octet IP addresses. That is, you
could make a network address out of three octets, but it wouldn't be an
IP address. It would be something else with the wrong name attached.

I challenge you... nay, I _defy_ you... to find a single authority on
disk storage anywhere on this planet (except, apparently, this list and
its directly attached people and materials) that discusses, describes,
or acknowledges the existence of a two-device RAID5 while not
discussing a system with an arity of 3 degraded by the absence of one
medium. All these words have standardized definitions.

[That's not hyperbole. I searched for several hours and could not find
_any_ reference anywhere to construction of a RAID5 array using only
two devices that did not involve arity-3 and a dummy/missing/failed
pseudo target. So if you can find any reference to doing this
_anywhere_ outside of BTRFS I'd like to see it. Genuinely.]

THAT SAID... I really can find no reason the math wouldn't work using
only two drives. It would be a terrific waste of CPU cycles and storage
space to construct the stripe buffers and do the XORs instead of just
copying the data, but the math would work. So, um, well I'll be damned.
Perhaps it's just a tautological belief that someone here didn't buy
into. Like how people keep partitioning drives into little slices for
things because that's the preserved wisdom from the early eighties.

I think constructing a non-degraded-mode two-device thing and calling
it RAID5 will surprise virtually _everyone_ on the planet. In every
other system (and I do mean _every_ other system), if I had two media
and I put them under RAID-5 I'd be required to specify the third drive
as some sort of failed device (the block-device equivalent of
/dev/null, except that it returns errors for all operations instead of
successes). See the reserved keyword "missing" in the mdadm
documentation etc. That is, if I put two 1TiB disks into a RAID-5 I'd
expect to get a 2TiB array with no actual redundancy. As in

    mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sda missing /dev/sdc

the resulting array would be the same effective size as a stripe of the
two drives, but when the third was added later it would just slot in as
a replacement for the missing device and the arity-3 thing would
reestablish its redundancy. (This is actually what mdadm does
internally with a normal build: it blesses the first N-1 drives into an
array with a missing member, adds the Nth drive as a spare, and then
the spare is immediately adopted as a replacement for the missing
drive.)

The parity computation on a single value is just a nutty waste of time
though. Backing it out when the array is degraded is double-nuts. Maybe
everybody just decided it was too crazy to consider for the CPU time
penalty...?

So yeah, semantics... apparently...
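To make the "nutty waste of time" point concrete, here is a minimal C
sketch (illustrative only; the buffer size and values are invented)
showing that the parity pass over a single data member recomputes
exactly the bytes a mirror would have copied outright:

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    #define STRIPE 8

    int main(void)
    {
        uint8_t data[STRIPE]   = { 1, 2, 3, 4, 5, 6, 7, 8 };
        uint8_t mirror[STRIPE];
        uint8_t parity[STRIPE] = { 0 };       /* theoretically zeroed buffer */

        memcpy(mirror, data, STRIPE);         /* RAID1: just copy            */

        for (int i = 0; i < STRIPE; i++)      /* 2-disk RAID5: XOR the only  */
            parity[i] ^= data[i];             /* data member into the buffer */

        printf("parity == mirror: %s\n",
               memcmp(parity, mirror, STRIPE) == 0 ? "yes" : "no");
        return 0;
    }

Same bytes on both devices either way; the XOR pass just spends extra
cycles getting there.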
Re: mkfs.btrfs limits odd [and maybe a failed phantom device?]
On Fri, Dec 12, 2014 at 03:16:03AM -0800, Robert White wrote:
> On 12/12/2014 01:06 AM, David Taylor wrote:
>> The above quote is discussing two-device RAID5; you are discussing
>> three-device RAID5.
>
> Heresy! (yes, some humor is required here.)
>
> There is no such thing as a two-device RAID5. That's what RAID1 is
> for. Saying "the above quote is discussing a two-device RAID5" is
> exactly like saying "the above quote is discussing a two-wheeled
> tricycle". You might as well be talking about three-octet IP
> addresses. That is, you could make a network address out of three
> octets, but it wouldn't be an IP address. It would be something else
> with the wrong name attached.

OK. Sounds like I need to dust off the change-of-nomenclature patch
again. The argument here is about the 1c1s1p configuration. Is there a
problem with that?

   Hugo.

[snip remainder of quoted message]

-- 
Hugo Mills             | There's an infinite number of monkeys outside who
hugo@... carfax.org.uk | want to talk to us about this new script for Hamlet
http://carfax.org.uk/  | they've worked out!
PGP: 65E74AC0          |                                          Arthur Dent
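For readers unfamiliar with the notation: a hypothetical sketch of the
copies/stripes/parity idea behind names like "1c1s1p", describing a
profile by what it does rather than by a raid-level name. The struct
and the device arithmetic here are illustrative assumptions, not the
actual nomenclature patch:

    #include <stdio.h>

    struct profile { int copies, stripes, parity; };

    /* devices needed for one full stripe of this profile */
    static int min_devices(struct profile p)
    {
        return p.copies * (p.stripes + p.parity);
    }

    int main(void)
    {
        struct profile p = { 1, 1, 1 };  /* "1c1s1p": the contested layout */

        /* for a single-copy profile, failures tolerated = parity count */
        printf("1c1s1p: %d devices, %d data per stripe, survives %d loss\n",
               min_devices(p), p.stripes, p.parity);
        return 0;
    }

Described this way, "two devices, one data member, one parity member"
is perfectly well formed, and the argument over whether it may be
called RAID5 goes away.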
Re: mkfs.btrfs limits odd [and maybe a failed phantom device?]
On Thu, Dec 11, 2014 at 10:01:06PM -0800, Robert White wrote:
> So RAID5 with three media M is
>
>     M     M     M
>     D1    D2    P(a)
>     D3    P(b)  D4
>     P(c)  D5    D6

RAID5 with two media is well defined, and looks like this:

    M     M
    D1    P(a)
    P(b)  D2
    D3    P(c)

With even parity and N disks:

    P(a) ^ D1 [^ D2 ^ ... ^ DN] = 0

Simplifying for one data disk and one parity stripe:

    P(a) ^ D1 = 0, therefore P(a) = D1

which is effectively (and, in practice, literally) mirroring.
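A minimal C sketch of both identities in play here (all values
invented for illustration): with one data member, even parity equals
the data, i.e. a mirror; and any one lost member of a wider stripe is
the XOR of the survivors:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* two-device case: P(a) = D1, i.e. a mirror */
        uint8_t d1 = 0xA5;
        uint8_t pa = d1;                  /* even parity of one member */
        printf("P(a) ^ D1 = 0x%02X\n", (unsigned)(pa ^ d1));  /* 0x00 */

        /* three-device case: rebuild D4 from D3 ^ P(b) */
        uint8_t d3 = 0x3C, d4 = 0x5F;
        uint8_t pb = (uint8_t)(d3 ^ d4);  /* parity written at stripe time */
        printf("D4 rebuilt: 0x%02X (was 0x%02X)\n",
               (unsigned)(d3 ^ pb), (unsigned)d4);
        return 0;
    }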
Re: mkfs.btrfs limits odd [and maybe a failed phantom device?]
On 12/12/2014 08:45 AM, Zygo Blaxell wrote:
> On Thu, Dec 11, 2014 at 10:01:06PM -0800, Robert White wrote:
>> So RAID5 with three media M is
>>
>>     M     M     M
>>     D1    D2    P(a)
>>     D3    P(b)  D4
>>     P(c)  D5    D6
>
> RAID5 with two media is well defined, and looks like this:
>
>     M     M
>     D1    P(a)
>     P(b)  D2
>     D3    P(c)

Like I said in the other fork of this thread... I see (now) that the
math works, but I can find no trace of anyone having ever implemented
this "arity less than 3, RAID greater than 1" paradigm (outside btrfs
and its associated materials). It's like talking about a two-wheeled
tricycle. 8-)

I would _genuinely_ like to see any third-party discussion of this. It
just isn't done (probably because, as you've shown, it's just a really
complicated and CPU-intensive way to end up with a simple mirror). I
spent several hours looking. I can see the math works, and I understand
what you are doing (as I said at some length in the grandparent
message), but it just isn't done.

The reason I use the tricycle example is that, while most people know
this instinctively, few are aware of the fact that going from two
wheels to three-or-more wheels reverses the steering paradigm. On a
bike you push-left, lean-left, and go-left. On higher-arity vehicles
(including adding a side-car to a bike) you push-right to go left (you
lean left too, but that's just to keep from nosing over 8-). I find
that quite apt in the whole RAID1 vs RAID5 discussion, since the former
is about copying one-or-more times and the latter is about starting
with a theoretically zeroed buffer and doing reversible checksumming
into it.

I doubt that I will be the last person to be confused by BTRFS'
implementation of a two-wheeled tricycle. You're going to get a lot of
mail over the years. 8-)

MEANWHILE the system really needs to be able to explicitly express and
support the missing-media paradigm:

    M     x     M
    D1    .     P(a)
    D3    .     D4
    P(c)  .     D6

The correct logic here to remove (e.g. replace with nothing, instead of
delete) a medium just doesn't seem to exist. And it's already painfully
missing in the RAID1 situation.

If I have a system with N SATA ports, and I have connected N drives,
and device M is starting to fail... I need to be able to disconnect M
and then connect M(new). Possibly with a non-trivial amount of time in
between. For all RAID levels greater than zero this is a natural
operation in a degraded mode.

And for a nearly full filesystem, the shrink operation that is "btrfs
device delete" would not work. And for any nontrivially occupied
filesystem it would be way slow, and would need to be reversed for
another way-slow interval. So I need to be able to replace a drive with
a nothing, so that the number of active media becomes N-1 but the arity
remains N.

mdadm has the "missing" keyword. The Device Mapper has the zero target.
As near as I can tell, btrfs has got nothing in this functional slot.

Imagine, if you will, a block device that is the anti-/dev/null. All
operations on this block device return EFAULT. Let's call it
/dev/nothing. And let's say I have a /dev/sdc that has to come out
immediately (and all my stuff is RAID1/5/6). The operational chain
would be

    btrfs replace start /dev/sdc /dev/nothing /
    (time passes, physical device is removed and replaced)
    btrfs replace start /dev/nothing /dev/sdc /

Now that's good-ish, but really the first replace is pernicious. The
internal state for the filesystem should just be able to record that
device id 3 (assuming /dev/sda is devid 1 and b is 2, etc., for this
example) is just gone. The replace-with-nothing becomes more-or-less
instant.
The first replace is also pernicious if it's the second media failure
on a fully-RAID6 array, since that would mean trying to put the same
kernel-level device in the array twice.

The restore operation, the replace of the nothing with the something,
remains fully elaborate. The nothing devices need to show up in the
device id tables for a running array in their geographically correct
positions and all that.

Without this "missing" status as a first-class part of the system,
dealing with failures, and communicating about those failures with the
operator, will become vexatious.

[The use of device delete and device add as changes in arity and size,
and their inapplicability to cases where failure is being dealt with
absent a change of arity, could be clearer in the documentation.]
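A hypothetical C model of the "missing as first-class state" idea
being proposed here (the names, the table shape, and the EFAULT choice
are assumptions taken from the text above, not btrfs internals): a
device-table slot either holds a real path or an explicit missing
marker that fails all I/O, so the arity stays N while the active media
drop to N-1:

    #include <stdio.h>
    #include <errno.h>

    enum slot_state { SLOT_PRESENT, SLOT_MISSING };

    struct dev_slot {
        enum slot_state state;
        const char     *path;       /* NULL while missing */
    };

    static int slot_read(const struct dev_slot *s)
    {
        if (s->state == SLOT_MISSING)
            return -EFAULT;         /* the proposed /dev/nothing behavior */
        printf("read from %s\n", s->path);
        return 0;
    }

    int main(void)
    {
        struct dev_slot devs[] = {
            { SLOT_PRESENT, "/dev/sda" },
            { SLOT_PRESENT, "/dev/sdb" },
            { SLOT_MISSING, NULL },     /* devid 3 pulled; arity still 3 */
        };

        for (int i = 0; i < 3; i++)
            if (slot_read(&devs[i]) < 0)
                printf("devid %d missing; reconstruct from peers\n", i + 1);
        return 0;
    }

In this model the replace-with-nothing is just a state flip on the
slot, which is why it could be more-or-less instant, and the slot keeps
its geographic position for the eventual restore.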
Re: mkfs.btrfs limits odd [and maybe a failed phantom device?]
Robert White posted on Fri, 12 Dec 2014 03:16:03 -0800 as excerpted:
> Perhaps it's just a tautological belief that someone here didn't buy
> into. Like how people keep partitioning drives into little slices for
> things because that's the preserved wisdom from the early eighties.

While I absolutely agree with your raid5 sentiments (which is exactly
what I suppose they might be; I'm getting a bit of an education in that
regard myself, here)...

In the context of the 80s, or even the 90s, nothing about
multi-gigabyte could be considered little! =:^)

In fact, while it most assuredly dates me, it /still/ feels a bit odd
referring to the 1 GiB btrfs default threshold for mixed-bg-mode as
small, given that I distinctly remember wondering how long it might
take me to fill my first 1 GB (not GiB, unfortunately) drive, tho by
that time I did have enough experience to know I'd eventually be
dealing with multi-gig, as at the time I was dealing with multi-meg.

More to the point, however... Those partitions have saved my a** quite
a few times over the years.

Among other things, partitioning allows me to keep my (8 GiB) rootfs an
entirely separate filesystem that's mounted read-only by default, which
has kept it undamaged, and the tools on it still available to help
recover my other filesystems, when /var/log and /home were damaged due
to a hard shutdown recently.

And some years ago I had an AC failure here in Phoenix in the middle of
the summer, resulting in a physical head-crash and loss of the
operating partitions on the disk in use at the time, while the backup
partitions on the same device remained intact, such that after cooldown
I actually continued to use that disk for some time, mounting the
damaged partitions only to recover the most recent copies of what I
could, updating the backups, which were now promoted to operational.

Sure, technology such as LVM can do similar and is more flexible in
some ways, but unfortunately it requires userspace and thus an initr*
in order to handle a root on the same technology. Otherwise, root must
be treated differently, and then you have partitioning again.
Additionally, LVM is yet another layer of software that can and does go
wrong and itself needs fixing. Partitioning is too, to some extent, but
in practice it has been pretty bullet-proof compared to technologies
such as LVM and btrfs subvolumes. LVM has some way to go before it's as
robust as partitioning, and of course btrfs with its subvolumes isn't
really even completely stable yet.

Further, btrfs doesn't well limit damage of a subvolume to just that
subvolume (that head-crash scenario would have almost certainly been a
total loss on btrfs subvolumes), the way partitioning tends to do. And
LVM's very flexibility means it doesn't normally have that sort of
damage limitation either. It certainly can, but doing so severely
reduces its flexibility, making going back to regular partitions, to
avoid the complexity and additional points of failure entirely, a
rather viable and often better choice.

Meanwhile, technology such as EFI and GPT is breathing new life into
partitioning, making it more reliable (checksummed redundant partition
tables), more useful/flexible (killing the primary/secondary/logical
divisions and adding partition names/labels and a far larger typing
space), and creating yet more uses for partitioning in the first place,
due to separate reserved EFI and legacy-BIOS partition types.
Tho of course these days those partition slices are often tens or
hundreds of gigs, and are now sometimes teras[1], bringing up my
initial point once again: that's NOT actually so small!

But to each his own, of course, and I definitely do agree with you on
raid5, the larger point. FWIW, I still consider allowing a two-device
raid5 or a three-device raid6 a bug, particularly given that a
single-device raid1 is /not/ allowed, nor is a 3-device raid10.

---
[1] Hmm, K, megs, gigs... ters? teras? simply T, to match K???

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
Re: mkfs.btrfs limits odd [and maybe a failed phantom device?]
On Fri, Dec 12, 2014 at 02:28:06PM -0800, Robert White wrote:
> On 12/12/2014 08:45 AM, Zygo Blaxell wrote:
>> On Thu, Dec 11, 2014 at 10:01:06PM -0800, Robert White wrote:
>>> So RAID5 with three media M is
>>>
>>>     M     M     M
>>>     D1    D2    P(a)
>>>     D3    P(b)  D4
>>>     P(c)  D5    D6
>>
>> RAID5 with two media is well defined, and looks like this:
>>
>>     M     M
>>     D1    P(a)
>>     P(b)  D2
>>     D3    P(c)
>
> Like I said in the other fork of this thread... I see (now) that the
> math works but I can find no trace of anyone having ever implemented
> this "arity less than 3, RAID greater than 1" paradigm (outside btrfs
> and its associated materials).

I've set up mdadm that way (though it does ask you to use '--force'
when you set it up). mdadm will also ask for --force if you try to set
up RAID1 with one disk.

I don't know of a RAID implementation that _doesn't_ do these modes,
excluding a few ancient proprietary implementations which have no way
to change a layout once created (usually because they shoot themselves
in the foot with bad choices early on, e.g. by picking odd parity for
RAID5).

The reason to allow it is future expansion: below-3-disk RAID5 ensures
that you have the layout constraints *now* for stripe/chunk size so you
can add more disks later. If RAID5 has a 512K chunk size, and you start
with a linear or RAID1 array and add another disk later, you might lose
part of the last 512K when you switch to RAID5. So you start with RAID5
on one or two disks so you can scale up without losing any data.

Also, mdadm can grow a two-disk RAID5, but if you try to grow a
two-disk mdadm RAID1 you just get a three-disk RAID1 (i.e. two
redundant copies with no additional capacity).

btrfs doesn't really need this capability for expansion, since it can
just create new RAID5-profile chunks whenever it wants to; however, I'd
expect a complete btrfs RAID5 implementation to borrow some ideas from
ZFS, and dynamically change the number of disks per chunk to maintain
write integrity as drives are added/removed/missing. That would imply
btrfs-RAID56-profile chunks would have to be able to exist on two or
even one disk, if that was all that was available for writing at the
time. Simply using btrfs-RAID1 chunks wouldn't work, since they'd
behave the wrong way when more disks were added later.

> MEANWHILE the system really needs to be able to explicitly express and
> support the missing-media paradigm:
>
>     M     x     M
>     D1    .     P(a)
>     D3    .     D4
>     P(c)  .     D6
>
> The correct logic here to remove (e.g. replace with nothing instead of
> delete) a medium just doesn't seem to exist. And it's already
> painfully missing in the RAID1 situation.

There are a number of permanent mistakes a naive admin can make when
dealing with a broken array. I've destroyed arrays (made them
permanently read-only beyond the ability of btrfs kernel or user tools
to recover) by getting add and replace confused, or by allowing an
offline drive to rejoin an array that had been mounted
read-write,degraded for some time.

The basic functionality works. btrfs does track missing devices and can
replace them relatively quickly (not as fast as mdadm, but less than an
order of magnitude slower) in RAID1. The reporting is full of
out-of-date cached data, but when a disk is really failing, there is
usually little doubt about which one needs to be replaced.

> If I have a system with N SATA ports, and I have connected N drives,
> and device M is starting to fail... I need to be able to disconnect M
> and then connect M(new). Possibly with a non-trivial amount of time in
> between. For all RAID levels greater than zero this is a natural
> operation in a degraded mode.
>
> And for a nearly full filesystem, the shrink operation that is "btrfs
> device delete" would not work. And for any nontrivially occupied
> filesystem it would be way slow, and would need to be reversed for
> another way-slow interval. So I need to be able to replace a drive
> with a nothing, so that the number of active media becomes N-1 but the
> arity remains N.

btrfs already does that, but it sucks. In a naive RAID5 implementation,
a write in degraded mode will corrupt your data if it is interrupted.
This is a general property of all RAID5 implementations that don't have
NVRAM journalling or some other way to solve the atomic update problem.

ZFS does this well: when a device is missing, it leaves old data in
degraded mode, but writes new data striped across the existing disks in
non-degraded mode. If you have 5 disks, and one dies, your writes are
then spread across 4 disks (3 data + parity) while your reads are
reconstructed from 4 disks (4 data + 1 parity - 1 missing). This
prevents the degraded-mode write data-integrity problem. When the dead
disk is replaced you would have the 3 data + parity promoted to 4 data
+ parity, or you can elect not to replace the dead disk and get 3 data
+ parity everywhere (with a loss of capacity).

btrfs could presumably do that by allocating chunks with different
raid56 parameters, although in this
Re: mkfs.btrfs limits odd [and maybe a failed phantom device?]
On Wed, Dec 10, 2014 at 02:18:55PM -0800, Robert White wrote:
> (3) why can I make a raid5 out of two devices? (I understand that we
> are currently just making mirrors, but the standard requires three
> devices in the geometry etc. So I would expect a two-device RAID5 to
> be considered degraded, with all that entails. It just looks like it's
> asking for trouble to allow this once the support is finalized, as
> suddenly a working RAID5 that's really a mirror would become something
> that can only be mounted with the degraded flag.)

RAID5 with even parity and two devices should be exactly the same as
RAID1 (i.e. disk1 ^ disk2 == 0, therefore disk1 == disk2; the striping
is irrelevant because there is no difference in disk contents, so the
disks are interchangeable), except with different behavior when more
devices are added (RAID1 will mirror chunks on pairs of disks, RAID5
should start writing new chunks with N stripes instead of two).

> (4) Same question for raid6 but with three drives instead of the
> mandated four.

RAID6 with three devices should behave more or less like three-way
RAID1, except maybe the two parity disks might be different (I forget
how the function used to calculate the two parity stripes works, and
whether it can be defined such that F(disk1, disk2, disk3) == disk1).

> (5) If I can make a RAID5 or RAID6 device with one missing element,
> why can't I make a RAID1 out of one drive, e.g. with one missing
> element?

They're only missing if you believe the minimum number of RAID5 disks
is not two and the minimum number of RAID6 disks is not three.

> (6) If I make a RAID1 out of three devices are there three copies of
> every extent, or are there always two copies that are semi-randomly
> spread across three devices? (ibid for more than three).

There are always two copies. RAID1 on 3x1TB disks gives you 1.5TB of
mirrored storage.

> ---
>
> It seems to me (very dangerous words in computer science, I know) that
> we need a "failed" device designator so that a device can be in the
> geometry (e.g. have a device ID) but not actually exist. Reads/writes
> to the failed device would always be treated as error returns. The
> failed device would be subject to replacement with btrfs dev replace,
> and could be the source of said replacement to drop a problematic
> device out of an array.
>
> EXAMPLE:
>
>     Gust t # mkfs.btrfs -f /dev/loop0 failed -d raid1 -m raid1
>     Btrfs v3.17.1
>     See http://btrfs.wiki.kernel.org for more information.
>
>     Performing full device TRIM (2.00GiB) ...
>     Turning ON incompat feature 'extref': increased hardlink limit per file to 65536
>     Processing explicitly missing device
>     adding device (failed) id 2 (phantom device)
>
>     mount /dev/loop0 /mountpoint
>     btrfs replace start 2 /dev/loop1 /mountpoint
>
> (and so on)
>
> Being able to replace a faulty device with a phantom failed device
> would nicely disambiguate the whole device add/remove versus replace
> mistake.

It is a little odd that an array of 3 disks with one missing looks like
this:

    Label: none  uuid: dde7fa84-11ed-4281-8d90-3eb0d29b949b
            Total devices 3 FS bytes used 256.00KiB
            devid    1 size 4.00GiB used 847.12MiB path /dev/mapper/vgtester-d01
            devid    2 size 4.00GiB used 827.12MiB path /dev/mapper/vgtester-d02
            devid    3 size 4.00GiB used 0.00B path /dev/mapper/vgtester-d04

In the above, vgtester-d02 was a deleted LV and does not exist, but
you'd never know that from the output of 'btrfs fi show'...

> It would make the degraded status less mysterious.

The 'degraded' status currently protects against some significant data
corruption risks. :-O

> A filesystem with an explicitly failed element would also make the
> future roll-out of full RAID5/6 less confusing.
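On the question above of whether the RAID6 parity functions can
satisfy F(disk1, disk2, disk3) == disk1: with the usual GF(2^8)
formulation (Q is the sum of g^i * D_i over the data members; the
Linux md raid6 code uses the polynomial 0x11d), a stripe with a single
data member gives P = D and Q = g^0 * D = D, so a three-device RAID6
really does behave like three-way RAID1. A minimal sketch, values
invented:

    #include <stdio.h>
    #include <stdint.h>

    /* multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x^2 + 1 (0x11d) */
    static uint8_t gf_mul(uint8_t a, uint8_t b)
    {
        uint8_t p = 0;
        while (b) {
            if (b & 1)
                p ^= a;
            b >>= 1;
            a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
        }
        return p;
    }

    int main(void)
    {
        uint8_t d = 0xA5;             /* the lone data member       */
        uint8_t P = d;                /* XOR parity of one member   */
        uint8_t Q = gf_mul(1, d);     /* g^0 * d                    */

        printf("D=0x%02X P=0x%02X Q=0x%02X\n",
               (unsigned)d, (unsigned)P, (unsigned)Q);  /* all equal */
        return 0;
    }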
Re: mkfs.btrfs limits odd [and maybe a failed phantom device?]
Robert White posted on Wed, 10 Dec 2014 14:18:55 -0800 as excerpted:
> So I started looking at the mkfs.btrfs manual page with an eye towards
> documenting some of the tidbits, like metadata automatically switching
> from dup to raid1 when more than one device is used. In experimenting
> I ended up with some questions...
>
> (1) why is the dup profile for data restricted to only one device and
> only if it's mixed mode?
>
> (2) why is metadata dup profile restricted to only one device on
> creation when it will run that way just fine after a device add?

1 and 2 together, since they both deal with dup mode...

Dup mode was apparently originally considered purely an extra safeguard
for metadata in the single-device case, where it was made the default
(except for SSDs, which default to single-mode metadata on a
single-device filesystem, because the FTL voids any guarantees on
location anyway, and because firmware such as SandForce compresses and
dedups anyway, in which case the hardware/firmware is subverting btrfs'
efforts to do dup anyway).

In the single-device case, two copies of data was considered simply not
worth the cost, due both to doubling the size (especially on SSD, where
size is money!) and to the speed penalties on spinning rust due to
seeks between one 1-GiB data-chunk and its dup.

With multi-device, raid1 metadata, forcing one copy to each of two
different devices, was considered enough superior to make that the
default, since that provided device-loss resiliency for the
all-important metadata, thus enabling recovery of at least /some/ files
even with a device missing (single-mode data where the file's extents
all happened to be on available devices, plus of course raid1, etc,
data). Further, dup-mode metadata was considered a mistake it was
better not to even have available as an option, since loss of a single
device would likely kill the filesystem, which made dup mode little
better than single mode, without the doubled-size cost. Further, on
spinning rust there'd again be the seek penalty, to little benefit,
since dup mode provides no guarantees in case of device loss.

So multi-device defaults to raid1 metadata for safety, but single-mode
metadata remains an option (along with raid0) if you really /don't/
care about losing everything due to loss of a single device.
Single-device simply makes dup mode available (and the default) for
metadata, as a poor-man's substitute for the safety of raid1, but
single-device metadata is the only case where that poor-man's
raid1-substitute is worth the (considered extreme) cost, with usage of
that option not even available on multi-device, as it'd be a
near-certain mistake, certainly at the mkfs level. And dup mode isn't
ordinarily available for data even on single-device, because it's
considered not worth the cost.

As for dup mode working after device-add, that's simply a necessary bit
in order for device add to work from a default-dup-mode single device
at all. And it's only the existing metadata chunks on the original
device that will be dup mode. Once a second device is added, additional
metadata chunks will be written in raid1 mode, forcing the two chunk
copies to different devices, since there are multiple devices available
to allow that. The clear intent and recommendation is to do a rebalance
ASAP after a device add, to spread usage to the new device as
appropriate. And of course that rebalance will use the new raid1
metadata defaults, unless told otherwise, and I don't believe dup mode
is available to tell it otherwise there, either.
What all that original reasoning fails to account for, however, is the
btrfs data/metadata checksumming and integrity features, and the very
high (which the original btrfs mode designers obviously considered
extreme) value some users (including me) place on them. While a
multi-device dup-mode-metadata choice at mkfs is arguably still a
mistake (the cost of raid1 metadata without the benefit, near the risk
of single metadata but at double the size), dup-mode data combined with
btrfs checksumming and data-integrity features on a single device has
strong data-integrity benefits that some would definitely consider
worth it, even at the additional cost in speed on spinning rust due to
seeking, and in size on expensive SSDs.

Meanwhile, mixed-bg-mode was an afterthought, added much later (after
my own btrfs journey began) in order to make working with small
filesystems reasonable. Before mixed-bg-mode, people attempting to use
btrfs on sub-GiB devices often found they couldn't use all available
space (often 25-50% wasted!), as the separate data/metadata chunk
allocation was simply too coarse-grained to properly deal with the
small sizes involved.

And small filesystems really _were_ mixed-mode's _entire_ purpose. That
it could additionally be used to allow dup data, using the ability to
specify mixed-bg-mode even on 1 GiB filesystems where it wasn't