Re: How do really work RAID1 on btrfs?
On 12/9/20 1:31 AM, Kevin Kofler via devel wrote:

> It follows that the solutions are nonnegative under the following
> conditions:
> * a+b≥c
> * a+c≥b
> * b+c≥a
> which are quite logical. Consider a=4, b=1, and c=1, i.e., disks of 4 GB,
> 1 GB, and 1 GB. Each of the 1 GB disks can only mirror (at most) 1 of the
> 4 GB, so where would you want to mirror the remaining 2 GB to?

And without attempting a formal proof, I would suspect that there is not a unique solution for more than 3 disks, since you get a lot more freedom, but in any case the biggest disk can't be bigger than the sum of all the others, because then of course losing it would be impossible to recover from. That condition will be necessary, and I think sufficient too.

Regards.

--
Roberto Ragusa    mail at robertoragusa.it
___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
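The condition on the biggest disk can be turned into a quick capacity estimate for any number of disks. The sketch below assumes the condition really is sufficient as well as necessary, and it ignores chunk granularity and metadata overhead; `raid1_usable` is a made-up helper name, not something from btrfs-progs.

```python
from fractions import Fraction


def raid1_usable(sizes):
    """Idealized two-copy capacity for devices of the given sizes.

    Hypothetical helper for illustration only; it models the arithmetic
    from the thread, not what btrfs reports.
    """
    total = sum(sizes)
    largest = max(sizes)
    rest = total - largest
    # If the biggest device exceeds the sum of the others, its surplus
    # has nowhere to be mirrored, so only `rest` can hold two copies.
    if largest > rest:
        return Fraction(rest)
    # Otherwise (assumed sufficient) everything can be paired up.
    return Fraction(total) / 2


print(raid1_usable([4, 3, 2]))
print(raid1_usable([4, 1, 1]))
```

For 4, 3, and 2 GB this gives 9/2, and for 4, 1, and 1 GB it gives 2, in line with the examples in this thread.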
Re: How do really work RAID1 on btrfs?
On Tue, Dec 8, 2020 at 5:08 PM Kevin Kofler via devel wrote:
>
> Sergio Belkin wrote:
> > So, let's say we have 3 small disks: 4GB, 3G, and 2GB.
> >
> > If I create one file of 3GB I think that
> > 3 GB is written on 4GB disk, it leaves 1 GB free.
> > 3 GB of copy is written on 3 GB disk, it leaves 0 GB Free.
> >
> > So, I create one file of 1GB that is written on 4GB disk, it leaves 0 GB
> > free.
> > 1 GB of copy is written on 2 GB disk, so it leaves 1 GB free.
> >
> > So I've used 4GB, ok it leaves 1 GB free on only one disk, but cannot be
> > mirrored.
> >
> > However as [1] I could use 4.5 ((4GB+3GB+2GB)/2) GB instead of 4GB.
> > Surely, I'm missing or mistaking something.
> >
> > Please could you help me?
>
> The optimum size can theoretically be achieved by using the following
> physical partitioning:
> * x GB on the 4 GB disk and the 3 GB disk,
> * y GB on the 4 GB disk and the 2 GB disk, and
> * z GB on the 3 GB disk and the 2 GB disk,
> for a total of x+y+z GB, where x, y, and z solve the following system of
> equations:
> * x+y=4
> * x+z=3
> * y+z=2
> i.e., in standard form:
> * 1x+1y+0z=4
> * 1x+0y+1z=3
> * 0x+1y+1z=2
> The determinant of this system is -2, which is not 0, so this system admits
> a unique solution. It can be computed using any method to solve linear
> systems of equations, such as direct substitution (solving an equation for a
> variable and substituting it), Gauss elimination with back substitution,
> Gauss-Jordan (bidirectional) elimination, or Cramer's rule. The result is:
> * x=2.5
> * y=1.5
> * z=0.5
> for a total of x+y+z=2.5+1.5+0.5=4.5 GB.
>
> Now how btrfs actually handles this in practice is a different story.
> Judging from Chris Murphy's reply, it does not precompute the above
> repartition, but tries to dynamically select 2 disks for each newly
> allocated 1 GB block to approximate the optimal solution for large enough
> drives (which will not achieve the optimum for the sizes in your example
> because the optimum allocation is not an integer amount of gigabytes, and
> will in fact be pretty far from the optimum due to the small sizes, whereas
> the larger the disk sizes, the less noticeable the loss is).

It's a bit more complicated still. The block group size is typically 1G, but in reality it's variable, depending on file system size and the unallocated space remaining. I don't know the minimum size, although I have seen 128MB data block groups.

Block groups are not set up in advance because there are different types of block groups: data and metadata. File system blocks go in metadata block groups, and blocks for file data go in data block groups. The ratio of data to metadata usage is workload dependent: some workloads are metadata-heavy, others less so.

Why separate block groups? They can have different block sizes and redundancy profiles, e.g. by default 16KiB block size for metadata, 4KiB for data. And by default a file system on one hard drive gets dup metadata and single data, while 2+ device file systems get raid1 for metadata and single for data. It's this way for efficiency and features. I'll stop here before I fall into a balance, resize, multiple device rabbit hole.

(dup = two copies on a single device, can also apply to data)

--
Chris Murphy
Re: How do really work RAID1 on btrfs?
On Tue, Dec 8, 2020 at 20:21, Chris Murphy wrote:

> On Tue, Dec 8, 2020 at 12:22 PM Sergio Belkin wrote:
> >
> > Hi!
> > I've read the explanation about how much space is available using disks
> > with different sizes[1]. I understand the rules, but I see a contradiction
> > with definition of RAID-1 in btrfs:
> >
> > «A form of RAID which stores two complete copies of each piece of data.
> > Each copy is stored on a different device. btrfs requires a minimum of two
> > devices to use RAID-1. This is the default for btrfs's metadata on more
> > than one device.»
> >
> > So, let's say we have 3 small disks: 4GB, 3G, and 2GB.
>
> From the btrfs perspective, this is a 9G file system, with raid1
> metadata and data block groups. The "raidness" happens at the block
> group level, it is not at the device level like mdadm raid.
>
> Deep dive: Block groups are a logical range of bytes (variable size,
> typically 1G). Where and what drive a file extent actually exists on
> is a function of the block group to chunk mapping. i.e. a 1G data
> block group using raid1 profile, physically exists as two 1G chunks,
> one on each of two devices. What this means is internally to Btrfs it
> sees everything as just one copy in a virtual address space, and it's
> a function of the chunk tree and allocator to handle the details of
> exactly where it's located physically and how it's replicated. It's
> normal to not totally grok this, it's pretty esoteric, but if there's
> one complicated thing to try to get about Btrfs, it's this. Because
> once you get it, all the other unique/unusual/confusing things start
> to make sense.
>
> Because the "pool" is 9G, and each 1G of data results in two 1G
> "mirror" chunks, written on two different drives, writes consume double
> the space. Two copies for raid1. The 'btrfs filesystem usage' command
> reveals this reality. Whereas 'df' kinda lies to try and make it
> behave more like what we've come to expect with more conventional
> raid1 implementation. This lie works ok for even number of same size
> devices. It starts to fall apart [1] with odd number of drives, and
> odd sized devices. So you're likely to run up against some still
> remaining issues in 'df' reporting in this example.
>
> https://carfax.org.uk/btrfs-usage/
>
> Set three disks. On the right side, use preset raid1. Go down to
> Devices sizes and enter 4000,3000,2000. And it'll show you what
> happens.
>
> > If I create one file of 3GB I think that
> > 3 GB is written on 4GB disk, it leaves 1 GB free.
> > 3 GB of copy is written on 3 GB disk, it leaves 0 GB Free.
>
> It's more complicated than that because first it'll be broken up into
> 3 1GB block groups (possibly more and smaller block groups), and then
> the allocator tries to maintain equal free space. That means it'll
> tend to initially write to the biggest and 2nd biggest drives, but it
> won't fill either of them up. It'll start writing to the smaller
> device once it has more space than the free space in the middle
> device. And yep, it can split up chunks like this, sorta like Tetris.
>
> The example size 9G is perhaps not a great example of real world
> allocation for btrfs raid1, I'd bump that to T :) 9G is even below the
> threshold of USB sticks you can buy off the shelf these days.
>
> > So, I create one file of 1GB that is written on 4GB disk, it leaves 0 GB
> > free.
> > 1 GB of copy is written on 2 GB disk, so it leaves 1 GB free.
> >
> > So I've used 4GB, ok it leaves 1 GB free on only one disk, but cannot be
> > mirrored.
> >
> > However as [1] I could use 4.5 ((4GB+3GB+2GB)/2) GB instead of 4GB.
> > Surely, I'm missing or mistaking something.
>
> Block groups and chunks. There's lots of reused jargon in btrfs that
> sounds familiar but it's not the same as mdadm or lvm, they're just
> reused terms. Another example: raid1 or raid10 on btrfs don't work
> like you're used to with mdadm and LVM. i.e. raid10 on btrfs is not a
> "stripe of mirrored drives" it is "striped and mirrored block
> groups". man mkfs.btrfs has quite concise and important information
> about such things, and of course questions welcome.
>
> So it's worth knowing a bit about how it works differently so you can
> properly assess (a) if it fits for your use case and meets your
> expectations (b) how to maintain and manage it, in particular disaster
> recovery. Because that too is different.
>
> [1]
> https://github.com/kdave/btrfs-progs/issues/277
>
> --
> Chris Murphy

Nice. I'm ruminating on the btrfs documentation :) The disk sizes in the example were chosen just to keep the numbers small and to need only a few files. man mkfs.btrfs has a nice table of examples, but AFAIK it only covers disks of equal size; for example, under "Space Utilization" it says 50% for raid1.

--
Sergio Belkin
LPIC-2 Certified - http://www.lpi.org
Re: How do really work RAID1 on btrfs?
PS: I wrote:
> The optimum size can theoretically be achieved by using the following
> physical partitioning:
> * x GB on the 4 GB disk and the 3 GB disk,
> * y GB on the 4 GB disk and the 2 GB disk, and
> * z GB on the 3 GB disk and the 2 GB disk,
> for a total of x+y+z GB, where x, y, and z solve the following system of
> equations:
> * x+y=4
> * x+z=3
> * y+z=2
> i.e., in standard form:
> * 1x+1y+0z=4
> * 1x+0y+1z=3
> * 0x+1y+1z=2
> The determinant of this system is -2, which is not 0, so this system
> admits a unique solution.

It should be noted that the relevant determinant of the system is the determinant of the left hand side, which is independent of the actual disk sizes, so the determinant is always -2 and the system always has a unique solution.

So the last remaining question is, under what conditions are the solutions x, y, and z nonnegative? (Obviously, a solution with negative x, y, and/or z would be useless in practice.) So consider the system:
* 1x+1y+0z=a
* 1x+0y+1z=b
* 0x+1y+1z=c
(In your example, a=4, b=3, and c=2.)

The best way to get a symbolic solution to this system is Cramer's rule (though any other method will give you the same solution), which results in:
* x=(c-a-b)/(-2)=(a+b-c)/2
* y=(b-a-c)/(-2)=(a+c-b)/2
* z=(a-b-c)/(-2)=(b+c-a)/2

It follows that the solutions are nonnegative under the following conditions:
* a+b≥c
* a+c≥b
* b+c≥a
which are quite logical. Consider a=4, b=1, and c=1, i.e., disks of 4 GB, 1 GB, and 1 GB. Each of the 1 GB disks can only mirror (at most) 1 of the 4 GB, so where would you want to mirror the remaining 2 GB to?

For any system that does not satisfy the above conditions, assume without loss of generality that b+c<a [...]
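The Cramer's rule computation above can be replayed numerically. This is only a sketch for checking the arithmetic; the names `COEFFS`, `det3`, and `cramer` are made up for illustration, and exact fractions are used so that halves such as 2.5 come out without rounding.

```python
from fractions import Fraction

# Left hand side of the system: rows are x+y, x+z, y+z.
COEFFS = [[1, 1, 0],
          [1, 0, 1],
          [0, 1, 1]]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along row 0."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def cramer(sizes):
    """Solve the system for disk sizes (a, b, c); returns (x, y, z)."""
    d = det3(COEFFS)  # -2, whatever the disk sizes are
    solution = []
    for col in range(3):
        # Replace one column of the coefficients by the right hand side.
        m = [row[:] for row in COEFFS]
        for r in range(3):
            m[r][col] = sizes[r]
        solution.append(Fraction(det3(m), d))
    return tuple(solution)


x, y, z = cramer((4, 3, 2))
print(x, y, z, x + y + z)   # the 4 GB / 3 GB / 2 GB example
print(cramer((4, 1, 1)))    # z comes out negative: no valid pairing
```

For the sizes 4, 1, and 1 the third component is -1, which is the infeasible case described above.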
Re: How do really work RAID1 on btrfs?
Sergio Belkin wrote:
> So, let's say we have 3 small disks: 4GB, 3G, and 2GB.
>
> If I create one file of 3GB I think that
> 3 GB is written on 4GB disk, it leaves 1 GB free.
> 3 GB of copy is written on 3 GB disk, it leaves 0 GB Free.
>
> So, I create one file of 1GB that is written on 4GB disk, it leaves 0 GB
> free.
> 1 GB of copy is written on 2 GB disk, so it leaves 1 GB free.
>
> So I've used 4GB, ok it leaves 1 GB free on only one disk, but cannot be
> mirrored.
>
> However as [1] I could use 4.5 ((4GB+3GB+2GB)/2) GB instead of 4GB.
> Surely, I'm missing or mistaking something.
>
> Please could you help me?

The optimum size can theoretically be achieved by using the following physical partitioning:
* x GB on the 4 GB disk and the 3 GB disk,
* y GB on the 4 GB disk and the 2 GB disk, and
* z GB on the 3 GB disk and the 2 GB disk,
for a total of x+y+z GB, where x, y, and z solve the following system of equations:
* x+y=4
* x+z=3
* y+z=2
i.e., in standard form:
* 1x+1y+0z=4
* 1x+0y+1z=3
* 0x+1y+1z=2

The determinant of this system is -2, which is not 0, so this system admits a unique solution. It can be computed using any method to solve linear systems of equations, such as direct substitution (solving an equation for a variable and substituting it), Gauss elimination with back substitution, Gauss-Jordan (bidirectional) elimination, or Cramer's rule. The result is:
* x=2.5
* y=1.5
* z=0.5
for a total of x+y+z=2.5+1.5+0.5=4.5 GB.

Now how btrfs actually handles this in practice is a different story. Judging from Chris Murphy's reply, it does not precompute the above repartition, but tries to dynamically select 2 disks for each newly allocated 1 GB block to approximate the optimal solution for large enough drives (which will not achieve the optimum for the sizes in your example because the optimum allocation is not an integer amount of gigabytes, and will in fact be pretty far from the optimum due to the small sizes, whereas the larger the disk sizes, the less noticeable the loss is).

Kevin Kofler
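The effect of allocating in whole chunks can be illustrated with a toy simulation. It assumes a simple policy, namely that each new mirrored chunk goes to the two devices with the most unallocated space; that is a guess at the behaviour described in this thread, not the actual btrfs allocator, and `simulate_raid1` is a hypothetical name.

```python
def simulate_raid1(sizes, chunk):
    """Greedily place mirrored chunks; return (usable, leftover per device).

    Toy model only: every chunk is written twice, once on each of the
    two devices that currently have the most free space.
    """
    free = list(sizes)
    usable = 0
    while True:
        # Device indices ordered from most to least free space.
        order = sorted(range(len(free)), key=lambda i: free[i], reverse=True)
        first, second = order[0], order[1]
        # Stop once no second device can hold another copy.
        if free[second] < chunk:
            break
        free[first] -= chunk
        free[second] -= chunk
        usable += chunk
    return usable, free


# Sizes in GB with 1 GB chunks, then in MB with 500 MB chunks.
print(simulate_raid1([4, 3, 2], 1))
print(simulate_raid1([4000, 3000, 2000], 500))
```

Under this model the 4, 3, 2 GB example ends up with 4 GB usable and 1 GB stranded on one device when chunks are 1 GB, while 500 MB chunks reach the full 4500 MB, which matches the point that the loss comes from granularity and shrinks as the disks get larger relative to the chunk size.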
Re: How do really work RAID1 on btrfs?
On Tue, Dec 8, 2020 at 12:22 PM Sergio Belkin wrote:
>
> Hi!
> I've read the explanation about how much space is available using disks with
> different sizes[1]. I understand the rules, but I see a contradiction with
> definition of RAID-1 in btrfs:
>
> «A form of RAID which stores two complete copies of each piece of data. Each
> copy is stored on a different device. btrfs requires a minimum of two devices
> to use RAID-1. This is the default for btrfs's metadata on more than one
> device.»
>
> So, let's say we have 3 small disks: 4GB, 3G, and 2GB.

From the btrfs perspective, this is a 9G file system, with raid1 metadata and data block groups. The "raidness" happens at the block group level, it is not at the device level like mdadm raid.

Deep dive: Block groups are a logical range of bytes (variable size, typically 1G). Where and what drive a file extent actually exists on is a function of the block group to chunk mapping. i.e. a 1G data block group using raid1 profile, physically exists as two 1G chunks, one on each of two devices. What this means is internally to Btrfs it sees everything as just one copy in a virtual address space, and it's a function of the chunk tree and allocator to handle the details of exactly where it's located physically and how it's replicated. It's normal to not totally grok this, it's pretty esoteric, but if there's one complicated thing to try to get about Btrfs, it's this. Because once you get it, all the other unique/unusual/confusing things start to make sense.

Because the "pool" is 9G, and each 1G of data results in two 1G "mirror" chunks, written on two different drives, writes consume double the space. Two copies for raid1. The 'btrfs filesystem usage' command reveals this reality. Whereas 'df' kinda lies to try and make it behave more like what we've come to expect with more conventional raid1 implementation. This lie works ok for even number of same size devices. It starts to fall apart [1] with odd number of drives, and odd sized devices. So you're likely to run up against some still remaining issues in 'df' reporting in this example.

https://carfax.org.uk/btrfs-usage/

Set three disks. On the right side, use preset raid1. Go down to Devices sizes and enter 4000,3000,2000. And it'll show you what happens.

> If I create one file of 3GB I think that
> 3 GB is written on 4GB disk, it leaves 1 GB free.
> 3 GB of copy is written on 3 GB disk, it leaves 0 GB Free.

It's more complicated than that because first it'll be broken up into 3 1GB block groups (possibly more and smaller block groups), and then the allocator tries to maintain equal free space. That means it'll tend to initially write to the biggest and 2nd biggest drives, but it won't fill either of them up. It'll start writing to the smaller device once it has more space than the free space in the middle device. And yep, it can split up chunks like this, sorta like Tetris.

The example size 9G is perhaps not a great example of real world allocation for btrfs raid1, I'd bump that to T :) 9G is even below the threshold of USB sticks you can buy off the shelf these days.

> So, I create one file of 1GB that is written on 4GB disk, it leaves 0 GB free.
> 1 GB of copy is written on 2 GB disk, so it leaves 1 GB free.
>
> So I've used 4GB, ok it leaves 1 GB free on only one disk, but cannot be
> mirrored.
>
> However as [1] I could use 4.5 ((4GB+3GB+2GB)/2) GB instead of 4GB. Surely,
> I'm missing or mistaking something.

Block groups and chunks. There's lots of reused jargon in btrfs that sounds familiar but it's not the same as mdadm or lvm, they're just reused terms. Another example: raid1 or raid10 on btrfs don't work like you're used to with mdadm and LVM. i.e. raid10 on btrfs is not a "stripe of mirrored drives" it is "striped and mirrored block groups". man mkfs.btrfs has quite concise and important information about such things, and of course questions welcome.

So it's worth knowing a bit about how it works differently so you can properly assess (a) if it fits for your use case and meets your expectations (b) how to maintain and manage it, in particular disaster recovery. Because that too is different.

[1]
https://github.com/kdave/btrfs-progs/issues/277

--
Chris Murphy
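The block group to chunk mapping and the resulting "double" space consumption can be pictured with a tiny data model. This is a rough sketch using the vocabulary of the message above; the class and field names are invented for illustration and do not correspond to btrfs's on-disk structures or to any btrfs-progs API.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """One physical copy of a block group on one device (toy model)."""
    device: str
    length_gib: int


@dataclass
class BlockGroup:
    """A logical byte range plus the physical chunks backing it."""
    logical_start_gib: int
    length_gib: int
    profile: str
    chunks: list = field(default_factory=list)


def logical_used(block_groups):
    """Space as the file system sees it: one copy of everything."""
    return sum(bg.length_gib for bg in block_groups)


def raw_used(block_groups):
    """Space actually consumed on the devices, counting every copy."""
    return sum(c.length_gib for bg in block_groups for c in bg.chunks)


# Two 1 GiB raid1 data block groups on hypothetical devices A, B, and C.
groups = [
    BlockGroup(0, 1, "raid1", [Chunk("A", 1), Chunk("B", 1)]),
    BlockGroup(1, 1, "raid1", [Chunk("A", 1), Chunk("C", 1)]),
]
print(logical_used(groups), raw_used(groups))
```

Here 2 GiB of logical data occupy 4 GiB of raw device space, which is the gap between a usable-space view and a raw-space view of the same pool.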
How do really work RAID1 on btrfs?
Hi!
I've read the explanation about how much space is available using disks with different sizes[1]. I understand the rules, but I see a contradiction with the definition of RAID-1 in btrfs[2]:

«A form of RAID which stores two complete copies of each piece of data. Each copy is stored on a different device. btrfs requires a minimum of two devices to use RAID-1. This is the default for btrfs's metadata on more than one device.»

So, let's say we have 3 small disks: 4GB, 3G, and 2GB.

If I create one file of 3GB I think that
3 GB is written on 4GB disk, it leaves 1 GB free.
3 GB of copy is written on 3 GB disk, it leaves 0 GB Free.

So, I create one file of 1GB that is written on 4GB disk, it leaves 0 GB free.
1 GB of copy is written on 2 GB disk, so it leaves 1 GB free.

So I've used 4GB, ok it leaves 1 GB free on only one disk, but cannot be mirrored.

However as [1] I could use 4.5 ((4GB+3GB+2GB)/2) GB instead of 4GB. Surely, I'm missing or mistaking something.

Please could you help me?

[1]: https://btrfs.wiki.kernel.org/index.php/FAQ#How_much_space_do_I_get_with_unequal_devices_in_RAID-1_mode.3F
[2]: https://btrfs.wiki.kernel.org/index.php/Glossary

Thanks in advance!

--
Sergio Belkin
LPIC-2 Certified - http://www.lpi.org