Re: How do really work RAID1 on btrfs?

2020-12-09 Thread Roberto Ragusa

On 12/9/20 1:31 AM, Kevin Kofler via devel wrote:


> It follows that the solutions are nonnegative under the following
> conditions:
> * a+b≥c
> * a+c≥b
> * b+c≥a
> which are quite logical. Consider a=4, b=1, and c=1, i.e., disks of 4 GB,
> 1 GB, and 1 GB. Each of the 1 GB disks can only mirror (at most) 1 of the
> 4 GB, so where would you want to mirror the remaining 2 GB to?

And without attempting a formal proof, I suspect that the solution is
not unique for more than 3 disks, since you get a lot more freedom.
In any case, the biggest disk can't be bigger than the sum of all the
others, because then part of it would have nowhere to be mirrored and
could not be recovered if that disk were lost. That condition is
necessary, and I think sufficient too.
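
A minimal Python sketch of that condition, assuming space can be split
arbitrarily finely between disks (a real allocator works in chunks). The
function names are made up for illustration, and the bound
min(total/2, total - largest) is the usual one for two-copy mirroring,
not something taken from btrfs code:

```python
def raid1_usable(sizes):
    """Upper bound on two-copy raid1 capacity for the given device sizes.

    Assumes space can be split arbitrarily finely between devices.
    """
    total = sum(sizes)
    largest = max(sizes)
    # Every byte needs a partner on a different device, so the largest
    # device can use at most what all the others offer together.
    return min(total / 2, total - largest)


def fully_usable(sizes):
    """True when no space is stranded, i.e. largest <= sum of the others."""
    largest = max(sizes)
    return largest <= sum(sizes) - largest


print(raid1_usable([4, 3, 2]))   # 4.5
print(raid1_usable([4, 1, 1]))   # 2
print(fully_usable([4, 1, 1]))   # False
```

With 4, 1, and 1 GB the bound drops to 2 GB, which matches the quoted
example where 2 GB of the big disk cannot be mirrored anywhere.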

Regards.
--
   Roberto Ragusa    mail at robertoragusa.it
___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org


Re: How do really work RAID1 on btrfs?

2020-12-08 Thread Chris Murphy
On Tue, Dec 8, 2020 at 5:08 PM Kevin Kofler via devel wrote:
>
> Sergio Belkin wrote:
> > So, let's say we have 3 small disks: 4GB, 3GB, and 2GB.
> >
> > If I create one file of 3GB I think that
> > 3 GB is written on 4GB disk, it leaves 1 GB free.
> > 3 GB  of copy is written on 3 GB disk, it leaves 0 GB Free.
> >
> > So, I create one file of 1GB that is written on 4GB disk, it leaves 0 GB
> > free.
> > 1 GB of copy is written on 2 GB disk, so it leaves 1 GB free.
> >
> > So I've used 4GB, ok it leaves 1 GB free on only one disk, but cannot be
> > mirrored.
> >
> > However as [1] I could use 4.5 ((4GB+3GB+2GB)/2) GB instead of 4GB.
> > Surely, I'm missing or mistaking something.
> >
> > Please could you help me?
>
> The optimum size can theoretically be achieved by using the following
> physical partitioning:
> * x GB on the 4 GB disk and the 3 GB disk,
> * y GB on the 4 GB disk and the 2 GB disk, and
> * z GB on the 3 GB disk and the 2 GB disk,
> for a total of x+y+z GB, where x, y, and z solve the following system of
> equations:
> * x+y=4
> * x+z=3
> * y+z=2
> i.e., in standard form:
> * 1x+1y+0z=4
> * 1x+0y+1z=3
> * 0x+1y+1z=2
> The determinant of this system is -2, which is not 0, so this system admits
> a unique solution. It can be computed using any method to solve linear
> systems of equations, such as direct substitution (solving an equation for a
> variable and substituting it), Gauss elimination with back substitution,
> Gauss-Jordan (bidirectional) elimination, or Cramer's rule. The result is:
> * x=2.5
> * y=1.5
> * z=0.5
> for a total of x+y+z=2.5+1.5+0.5=4.5 GB.
>
> Now how btrfs actually handles this in practice is a different story.
> Judging from Chris Murphy's reply, it does not precompute the above
> repartition, but tries to dynamically select 2 disks for each newly
> allocated 1 GB block to approximate the optimal solution for large enough
> drives (which will not achieve the optimum for the sizes in your example
> because the optimum allocation is not an integer amount of gigabytes, and
> will in fact be pretty far from the optimum due to the small sizes, whereas
> the larger the disk sizes, the less noticeable the loss is).

It's a bit more complicated still. The block group size is typically
1G but in reality it's variable, depending on file system size, and
unallocated space remaining. I don't know the minimum size, although I
have seen 128MB data block groups.

The reason block groups are not set in advance is that there are
different types of block groups: data and metadata. File system blocks
go in metadata block groups, and blocks for file data go in data block
groups. And the ratio of data to metadata usage is workload dependent.
Some workloads produce heavy metadata, others less so.

Why separate block groups? They can have different block sizes and
redundancy profiles, e.g. by default 16KiB block size for metadata,
4KiB for data. And by default hard drives have dup metadata, single
data; and 2+ device file systems will get raid1 for metadata and
single for data. But it's this way for efficiency and features. I'll
stop here before I fall into a balance, resize, multiple device rabbit
hole.

(dup = two copies on a single device, can also apply to data)

-- 
Chris Murphy


Re: How do really work RAID1 on btrfs?

2020-12-08 Thread Sergio Belkin
On Tue, Dec 8, 2020 at 20:21, Chris Murphy wrote:

> On Tue, Dec 8, 2020 at 12:22 PM Sergio Belkin  wrote:
> >
> > Hi!
> > I've read the explanation about how much space is available using disks
> > with different sizes[1]. I understand the rules, but I see a contradiction
> > with definition of RAID-1 in btrfs:
> >
> > «A form of RAID which stores two complete copies of each piece of data.
> > Each copy is stored on a different device. btrfs requires a minimum of two
> > devices to use RAID-1. This is the default for btrfs's metadata on more
> > than one device.»
> >
> > So, let's say we have 3 small disks: 4GB, 3GB, and 2GB.
>
> From the btrfs perspective, this is a 9G file system, with raid1
> metadata and data block groups. The "raidness" happens at the block
> group level, it is not at the device level like mdadm raid.
>
> Deep dive: Block groups are a logical range of bytes (variable size,
> typically 1G). Where and what drive a file extent actually exists on
> is a function of the block group to chunk mapping, i.e. a 1G data
> block group using raid1 profile physically exists as two 1G chunks,
> one on each of two devices. What this means is internally to Btrfs it
> sees everything as just one copy in a virtual address space, and it's
> a function of the chunk tree and allocator to handle the details of
> exactly where it's located physically and how it's replicated. It's
> normal to not totally grok this, it's pretty esoteric, but if there's
> one complicated thing to try to get about Btrfs, it's this. Because
> once you get it, all the other unique/unusual/confusing things start
> to make sense.
>
> Because the "pool" is 9G, and each 1G of data results in two 1G
> "mirror" chunks, one on each of two drives, writes consume double the
> space. Two copies for raid1. The 'btrfs filesystem usage' command
> reveals this reality. Whereas 'df' kinda lies to try and make it
> behave more like what we've come to expect with more conventional
> raid1 implementation. This lie works ok for even number of same size
> devices. It starts to fall apart [1] with odd number of drives, and
> odd sized devices. So you're likely to run up against some still
> remaining issues in 'df' reporting in this example.
>
> https://carfax.org.uk/btrfs-usage/
>
> Set three disks. On the right side, use preset raid1. Go down to
> Devices sizes and enter 4000,3000,2000. And it'll show you what
> happens.
>
>
>
> > If I create one file of 3GB I think that
> > 3 GB is written on 4GB disk, it leaves 1 GB free.
> > 3 GB  of copy is written on 3 GB disk, it leaves 0 GB Free.
>
> It's more complicated than that because first it'll be broken up into
> three 1GB block groups (possibly more and smaller block groups), and then
> the allocator tries to maintain equal free space. That means it'll
> tend to initially write to the biggest and 2nd biggest drives, but it
> won't fill either of them up. It'll start writing to the smallest
> device once that device has more free space than the middle
> device. And yep, it can split up chunks like this, sorta like Tetris.
>
> The example size 9G is perhaps not a great example of real world
> allocation for btrfs raid1, I'd bump that to T :) 9G is even below the
> threshold of USB sticks you can buy off the shelf these days.
>
> >
> > So, I create one file of 1GB that is written on 4GB disk, it leaves 0 GB
> > free.
> > 1 GB of copy is written on 2 GB disk, so it leaves 1 GB free.
> >
> > So I've used 4GB, ok it leaves 1 GB free on only one disk, but cannot be
> > mirrored.
> >
> > However as [1] I could use 4.5 ((4GB+3GB+2GB)/2) GB instead of 4GB.
> > Surely, I'm missing or mistaking something.
>
> Block groups and chunks. There's lots of reused jargon in btrfs that
> sounds familiar but it's not the same as mdadm or lvm, they're just
> reused terms. Another example: raid1 or raid10 on btrfs don't work
> like you're used to with mdadm and LVM, i.e. raid10 on btrfs is not a
> "stripe of mirrored drives", it is "striped and mirrored block
> groups". man mkfs.btrfs has quite concise and important information
> about such things, and of course questions welcome.
>
> So it's worth knowing a bit about how it works differently so you can
> properly assess (a) if it fits for your use case and meets your
> expectations (b) how to maintain and manage it, in particular disaster
> recovery. Because that too is different.
>
>
> [1]
> https://github.com/kdave/btrfs-progs/issues/277
>
> --
> Chris Murphy
>

Nice. I'm ruminating on the btrfs documentation :)
I chose small disk sizes in the example just to keep the numbers small and
use only a few files.
man mkfs.btrfs has a nice table of examples, but AFAIK it only covers disks
of equal size; for example, under "Space Utilization" it says 50% for raid1.

-- 
--
Sergio Belkin
LPIC-2 Certified - http://www.lpi.org

Re: How do really work RAID1 on btrfs?

2020-12-08 Thread Kevin Kofler via devel
PS:

I wrote:
> The optimum size can theoretically be achieved by using the following
> physical partitioning:
> * x GB on the 4 GB disk and the 3 GB disk,
> * y GB on the 4 GB disk and the 2 GB disk, and
> * z GB on the 3 GB disk and the 2 GB disk,
> for a total of x+y+z GB, where x, y, and z solve the following system of
> equations:
> * x+y=4
> * x+z=3
> * y+z=2
> i.e., in standard form:
> * 1x+1y+0z=4
> * 1x+0y+1z=3
> * 0x+1y+1z=2
> The determinant of this system is -2, which is not 0, so this system
> admits a unique solution.

It shall be noted that the relevant determinant of the system is the 
determinant of the left hand side, which is independent of the actual disk 
sizes, so the determinant is always -2 and the system always has a unique 
solution.

So the last remaining question is, under what conditions are the solutions 
x, y, and z nonnegative? (Obviously, a solution with negative x, y, and/or z 
would be useless in practice.)

So consider the system:
* 1x+1y+0z=a
* 1x+0y+1z=b
* 0x+1y+1z=c
(In your example, a=4, b=3, and c=2.)

The best way to get a symbolic solution to this system is Cramer's rule 
(though any other method will give you the same solution), which results in:
* x=(c-a-b)/(-2)=(a+b-c)/2
* y=(b-a-c)/(-2)=(a+c-b)/2
* z=(a-b-c)/(-2)=(b+c-a)/2

It follows that the solutions are nonnegative under the following 
conditions:
* a+b≥c
* a+c≥b
* b+c≥a
which are quite logical. Consider a=4, b=1, and c=1, i.e., disks of 4 GB,
1 GB, and 1 GB. Each of the 1 GB disks can only mirror (at most) 1 of the
4 GB, so where would you want to mirror the remaining 2 GB to?
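
As a rough sketch of the closed-form solution and the nonnegativity check
above, here is a small Python function. The name pair_split is made up
for illustration, and exact fractions are used only to avoid rounding:

```python
from fractions import Fraction


def pair_split(a, b, c):
    """Return (x, y, z) for disks a, b, c, or None if any share is negative.

    x is mirrored across disks a and b, y across a and c, z across b and c.
    """
    a, b, c = Fraction(a), Fraction(b), Fraction(c)
    x = (a + b - c) / 2
    y = (a + c - b) / 2
    z = (b + c - a) / 2
    if min(x, y, z) < 0:
        return None
    return x, y, z


print(pair_split(4, 3, 2))  # (Fraction(5, 2), Fraction(3, 2), Fraction(1, 2))
print(pair_split(4, 1, 1))  # None
```

For 4, 1, and 1 GB the z share would be -1, which is why no valid
partitioning exists in that case.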

For any system that does not satisfy the above conditions, assume without
loss of generality that b+c < a [...]


Re: How do really work RAID1 on btrfs?

2020-12-08 Thread Kevin Kofler via devel
Sergio Belkin wrote:
> So, let's say we have 3 small disks: 4GB, 3GB, and 2GB.
> 
> If I create one file of 3GB I think that
> 3 GB is written on 4GB disk, it leaves 1 GB free.
> 3 GB  of copy is written on 3 GB disk, it leaves 0 GB Free.
> 
> So, I create one file of 1GB that is written on 4GB disk, it leaves 0 GB
> free.
> 1 GB of copy is written on 2 GB disk, so it leaves 1 GB free.
> 
> So I've used 4GB, ok it leaves 1 GB free on only one disk, but cannot be
> mirrored.
> 
> However as [1] I could use 4.5 ((4GB+3GB+2GB)/2) GB instead of 4GB.
> Surely, I'm missing or mistaking something.
> 
> Please could you help me?

The optimum size can theoretically be achieved by using the following 
physical partitioning:
* x GB on the 4 GB disk and the 3 GB disk,
* y GB on the 4 GB disk and the 2 GB disk, and
* z GB on the 3 GB disk and the 2 GB disk,
for a total of x+y+z GB, where x, y, and z solve the following system of 
equations:
* x+y=4
* x+z=3
* y+z=2
i.e., in standard form:
* 1x+1y+0z=4
* 1x+0y+1z=3
* 0x+1y+1z=2
The determinant of this system is -2, which is not 0, so this system admits 
a unique solution. It can be computed using any method to solve linear 
systems of equations, such as direct substitution (solving an equation for a 
variable and substituting it), Gauss elimination with back substitution, 
Gauss-Jordan (bidirectional) elimination, or Cramer's rule. The result is:
* x=2.5
* y=1.5
* z=0.5
for a total of x+y+z=2.5+1.5+0.5=4.5 GB.
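
For anyone who wants to check the numbers, here is a short Python sketch
of Cramer's rule applied to this system. The helper names det3 and
cramer3 are mine, and any of the other methods mentioned above would give
the same result:

```python
from fractions import Fraction


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along row 0."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def cramer3(coeffs, rhs):
    """Solve a 3x3 linear system with Cramer's rule, in exact arithmetic."""
    d = det3(coeffs)
    if d == 0:
        raise ValueError("singular system")
    solution = []
    for col in range(3):
        # Replace one column of the coefficient matrix with the right-hand side.
        m = [list(row) for row in coeffs]
        for r in range(3):
            m[r][col] = rhs[r]
        solution.append(Fraction(det3(m), d))
    return solution


A = [[1, 1, 0],
     [1, 0, 1],
     [0, 1, 1]]
x, y, z = cramer3(A, [4, 3, 2])
print(det3(A))              # -2
print(x, y, z, x + y + z)   # 5/2 3/2 1/2 9/2
```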

Now how btrfs actually handles this in practice is a different story.
Judging from Chris Murphy's reply, it does not precompute the above
partitioning, but dynamically selects 2 disks for each newly allocated
1 GB block, which approximates the optimal solution for large enough
drives. It will not achieve the optimum for the sizes in your example,
because the optimum allocation is not an integer number of gigabytes. With
such small sizes the result will in fact be pretty far from the optimum,
whereas the larger the disks, the less noticeable the loss is.

Kevin Kofler


Re: How do really work RAID1 on btrfs?

2020-12-08 Thread Chris Murphy
On Tue, Dec 8, 2020 at 12:22 PM Sergio Belkin  wrote:
>
> Hi!
> I've read the explanation about how much space is available using disks with
> different sizes[1]. I understand the rules, but I see a contradiction with
> definition of RAID-1 in btrfs:
>
> «A form of RAID which stores two complete copies of each piece of data. Each
> copy is stored on a different device. btrfs requires a minimum of two devices
> to use RAID-1. This is the default for btrfs's metadata on more than one
> device.»
>
> So, let's say we have 3 small disks: 4GB, 3GB, and 2GB.

From the btrfs perspective, this is a 9G file system, with raid1
metadata and data block groups. The "raidness" happens at the block
group level, it is not at the device level like mdadm raid.

Deep dive: Block groups are a logical range of bytes (variable size,
typically 1G). Where and what drive a file extent actually exists on
is a function of the block group to chunk mapping, i.e. a 1G data
block group using raid1 profile physically exists as two 1G chunks,
one on each of two devices. What this means is internally to Btrfs it
sees everything as just one copy in a virtual address space, and it's
a function of the chunk tree and allocator to handle the details of
exactly where it's located physically and how it's replicated. It's
normal to not totally grok this, it's pretty esoteric, but if there's
one complicated thing to try to get about Btrfs, it's this. Because
once you get it, all the other unique/unusual/confusing things start
to make sense.
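
To make the mapping idea concrete, here is a toy Python model. It is only
an illustration under my own assumptions: the class and field names
(BlockGroup, Stripe, devid, physical) are hypothetical and do not mirror
the real btrfs on-disk structures or kernel code.

```python
from dataclasses import dataclass
from typing import List

GiB = 1024 ** 3


@dataclass
class Stripe:
    devid: int      # which device holds this copy
    physical: int   # byte offset of the chunk on that device


@dataclass
class BlockGroup:
    logical: int            # start in the single virtual address space
    length: int
    profile: str            # e.g. "raid1"
    stripes: List[Stripe]   # raid1: two copies on two different devices

    def locate(self, logical_addr):
        """Map a logical address to (devid, physical offset) for every copy."""
        if not self.logical <= logical_addr < self.logical + self.length:
            raise ValueError("address outside this block group")
        delta = logical_addr - self.logical
        return [(s.devid, s.physical + delta) for s in self.stripes]


bg = BlockGroup(logical=5 * GiB, length=1 * GiB, profile="raid1",
                stripes=[Stripe(devid=1, physical=2 * GiB),
                         Stripe(devid=2, physical=0)])
print(bg.locate(5 * GiB + 4096))
```

The file system above this layer only deals with the logical address. The
lookup is what turns one logical location into two physical ones.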

Because the "pool" is 9G, and each 1G of data results in two 1G
"mirror" chunks, one on each of two drives, writes consume double the
space. Two copies for raid1. The 'btrfs filesystem usage' command
reveals this reality. Whereas 'df' kinda lies to try and make it
behave more like what we've come to expect with more conventional
raid1 implementation. This lie works ok for even number of same size
devices. It starts to fall apart [1] with odd number of drives, and
odd sized devices. So you're likely to run up against some still
remaining issues in 'df' reporting in this example.

https://carfax.org.uk/btrfs-usage/

Set three disks. On the right side, use preset raid1. Go down to
Devices sizes and enter 4000,3000,2000. And it'll show you what
happens.



> If I create one file of 3GB I think that
> 3 GB is written on 4GB disk, it leaves 1 GB free.
> 3 GB  of copy is written on 3 GB disk, it leaves 0 GB Free.

It's more complicated than that because first it'll be broken up into
three 1GB block groups (possibly more and smaller block groups), and then
the allocator tries to maintain equal free space. That means it'll
tend to initially write to the biggest and 2nd biggest drives, but it
won't fill either of them up. It'll start writing to the smallest
device once that device has more free space than the middle
device. And yep, it can split up chunks like this, sorta like Tetris.
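
A simplified simulation of that behavior, in Python. This is a sketch
under stated assumptions, not the real allocator: it uses fixed-size
chunks, ignores metadata block groups, and always picks the two devices
with the most free space. The function name simulate_raid1 is made up.

```python
def simulate_raid1(sizes, chunk=1):
    """Allocate raid1 chunks greedily; return (usable, leftover free space).

    Each step takes one chunk from the two devices with the most free space.
    """
    free = list(sizes)
    usable = 0
    while True:
        # Indices of devices sorted by free space, largest first.
        order = sorted(range(len(free)), key=lambda i: free[i], reverse=True)
        first, second = order[0], order[1]
        if free[second] < chunk:
            break  # no second device can hold the mirror copy
        free[first] -= chunk
        free[second] -= chunk
        usable += chunk
    return usable, free


print(simulate_raid1([4, 3, 2]))            # (4, [0, 0, 1])
print(simulate_raid1([4000, 3000, 2000]))   # (4500, [0, 0, 0])
```

With 1 GB chunks on 4, 3, and 2 GB devices the model strands 1 GB, while
the same ratio at 4000, 3000, and 2000 units uses everything.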

The example size 9G is perhaps not a great example of real world
allocation for btrfs raid1, I'd bump that to T :) 9G is even below the
threshold of USB sticks you can buy off the shelf these days.

>
> So, I create one file of 1GB that is written on 4GB disk, it leaves 0 GB free.
> 1 GB of copy is written on 2 GB disk, so it leaves 1 GB free.
>
> So I've used 4GB, ok it leaves 1 GB free on only one disk, but cannot be 
> mirrored.
>
> However as [1] I could use 4.5 ((4GB+3GB+2GB)/2) GB instead of 4GB. Surely, 
> I'm missing or mistaking something.

Block groups and chunks. There's lots of reused jargon in btrfs that
sounds familiar but it's not the same as mdadm or lvm, they're just
reused terms. Another example: raid1 or raid10 on btrfs don't work
like you're used to with mdadm and LVM, i.e. raid10 on btrfs is not a
"stripe of mirrored drives", it is "striped and mirrored block
groups". man mkfs.btrfs has quite concise and important information
about such things, and of course questions welcome.

So it's worth knowing a bit about how it works differently so you can
properly assess (a) if it fits for your use case and meets your
expectations (b) how to maintain and manage it, in particular disaster
recovery. Because that too is different.


[1]
https://github.com/kdave/btrfs-progs/issues/277

-- 
Chris Murphy


How do really work RAID1 on btrfs?

2020-12-08 Thread Sergio Belkin
Hi!
I've read the explanation about how much space is available using disks
with different sizes[1]. I understand the rules, but I see a contradiction
with the definition of RAID-1 in btrfs:

«A form of *RAID* which stores two complete copies of each piece of data.
Each copy is stored on a different *device*. btrfs requires a minimum of
two devices to use RAID-1. This is the default for btrfs's *metadata* on
more than one device.»

So, let's say we have 3 small disks: 4GB, 3GB, and 2GB.

If I create one 3 GB file, I think that
3 GB is written to the 4 GB disk, leaving 1 GB free.
The 3 GB copy is written to the 3 GB disk, leaving 0 GB free.

Then I create one 1 GB file, which is written to the 4 GB disk, leaving
0 GB free.
The 1 GB copy is written to the 2 GB disk, leaving 1 GB free.

So I've used 4 GB; 1 GB remains free on only one disk, so it cannot be
mirrored.

However, according to [1], I could use 4.5 GB ((4GB+3GB+2GB)/2) instead of
4 GB. Surely I'm missing or mistaking something.

Please could you help me?

[1]:
https://btrfs.wiki.kernel.org/index.php/FAQ#How_much_space_do_I_get_with_unequal_devices_in_RAID-1_mode.3F
[2]: https://btrfs.wiki.kernel.org/index.php/Glossary

Thanks in advance!
-- 
--
Sergio Belkin
LPIC-2 Certified - http://www.lpi.org