Re: [PATCH] btrfs: fix max chunk size on dup

2017-10-11 Thread Hans van Kranenburg
On 10/11/2017 11:58 AM, Naohiro Aota wrote:
> Hi,
> 
> You may notice my @wdc.com address is bounced. As my internship job at
> wdc is finished, the address is already expired.
> Please contact me on this @elisp.net address.
> 
> 2017-10-11 2:42 GMT+09:00 Hans van Kranenburg:
>> Sorry for the mail spam, it's an interesting code puzzle... :)
>>
>> On 10/10/2017 07:22 PM, Hans van Kranenburg wrote:
>>> On 10/10/2017 07:07 PM, Hans van Kranenburg wrote:
>>>> On 10/10/2017 01:31 PM, David Sterba wrote:
>>>>> Hi,
>>>>>
>>>>> On Fri, Sep 29, 2017 at 04:20:51PM +0900, Naohiro Aota wrote:
>>>>>> Balancing a fresh METADATA=dup btrfs file system (with size < 50G)
>>>>>> generates a 128MB sized block group. While we set max_stripe_size =
>>>>>> max_chunk_size = 256MB, we get this half sized block group:
>>>>>>
>>>>>> $ btrfs ins dump-t -t CHUNK_TREE btrfs.img|grep length
>>>>>> length 8388608 owner 2 stripe_len 65536 type DATA
>>>>>> length 33554432 owner 2 stripe_len 65536 type SYSTEM|DUP
>>>>>> length 134217728 owner 2 stripe_len 65536 type METADATA|DUP
>>>>>>
>>>>>> Before commit 86db25785a6e ("Btrfs: fix max chunk size on raid5/6"), we
>>>>>> used "stripe_size * ndevs > max_chunk_size * ncopies" to check the max
>>>>>> chunk size. Since stripe_size = 256MB * dev_stripes (= 2) = 512MB, ndevs
>>>>>> = 1, max_chunk_size = 256MB, and ncopies = 2, we allowed a 256MB
>>>>>> METADATA|DUP block group.
>>>>>>
>>>>>> But now we use "stripe_size * data_stripes > max_chunk_size". Since
>>>>>> data_stripes = 1 on DUP, it prevents the block group from growing beyond
>>>>>> 128MB. What's missing here is "dev_stripes". The proper logical space used
>>>>>> by the block group is "stripe_size * data_stripes / dev_stripes". Tweak
>>>>>> the equations to use the right value.
>>>>>
>>>>> I started looking into it and still don't fully understand it. Changes
>>>>> deep in the allocator can easily break some block group combinations, so
>>>>> I'm rather conservative here.
>>>>
>>>> I think that the added usage of data_stripes in 86db25785a6e is the
>>>> problematic change. data_stripes is something that was introduced as
>>>> part of RAID56 in 53b381b3a and clearly only has a meaning that was
>>>> properly thought through for RAID56. The RAID56 commit already notes "this
>>>> will have to be fixed for RAID1 and RAID10 over more drives", but the
>>>> author didn't catch the DUP case, which already breaks at that point.
>>>>
> 
> Thank you for explaining in detail :) Yes, the allocator was already
> broken by using data_stripes.

Well, technically, data_stripes is right (I was confused about it at
first yesterday). But it's something specific to parity raid.

> (and it will also "break" existing file systems, in terms of making a
> DUP|META block group smaller)

Yes, and on a fs larger than 50GiB I indeed see 512MB DUP metadata
chunks, instead of 1GiB like the code says.
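(assuming the usual 1GiB metadata max_chunk_size on a filesystem larger
than 50GiB, that's the same halving: 1GiB / dev_stripes (2) = 512MiB)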

> I believe this patch fixes things.
> 
>>>> At the beginning it says:
>>>>
>>>> int data_stripes; /* number of stripes that count for block group size */
>>>>
>>>> For the example:
>>>>
>>>> This is DUP:
>>>>
>>>> .sub_stripes        = 1,
>>>> .dev_stripes        = 2,
>>>> .devs_max           = 1,
>>>> .devs_min           = 1,
>>>> .tolerated_failures = 0,
>>>> .devs_increment     = 1,
>>>> .ncopies            = 2,
>>>>
>>>> In the code:
>>>>
>>>> max_stripe_size = SZ_256M
>>>> max_chunk_size = max_stripe_size  -> SZ_256M
>>>>
>>>> Then we have find_free_dev_extent:
>>>> max_stripe_size * dev_stripes  -> SZ_256M * 2 -> 512M
>>>>
>>>> So we'd like to find 512M on a disk, to stuff 2 stripes of 256M inside for
>>>> the DUP. (remember: The two parts of DUP *never* end up on a different
>>>> disk, even if you have multiple)
>>>>
>>
>> Another, better fix would be to make a change here...
>>
>>>> If we find one:
>>>> stripe_size = devices_info[ndevs-1].max_avail  -> 512M, yay
>>
>> ...because this is not yay. The 512M is max_avail, which needs to hold
>> *2* stripes, not 1. So stripe_size is suddenly changed to twice the
>> stripe size for DUP.
>>
>> So, dividing again by dev_stripes would translate the
>> max_avail on device ndevs-1 into the stripe size to use.
> 
> This catches the point. Notice stripe_size is divided by dev_stripes
> after the lines touched by the patch.

Yes,

stripe_size = div_u64(stripe_size, dev_stripes);

Anyone who reads this should think.. "So, ok, what IS a stripe then". O_o

> It means that in this region (from "stripe_size =
> devices_info[ndevs-1].max_avail;" to "stripe_size =
> div_u64(stripe_size, dev_stripes);"),
> "stripe_size" stands for the stripe size written on one device. IOW,
> stripe_size (here, the size of the region on one device) = dev_stripes *
> stripe_size (final, per chunk).
> These two "stripe_size" values are the same except for DUP, because we
> set dev_stripes > 1 only for DUP.
> 
> We are using this per-device stripe_size to calculate the logical size.
> This is correct if we actually striped writes across the stripes on one device.

Re: [PATCH] btrfs: fix max chunk size on dup

2017-10-11 Thread Naohiro Aota
Hi,

You may notice my @wdc.com address is bounced. As my internship job at
wdc is finished, the address is already expired.
Please contact me on this @elisp.net address.

2017-10-11 2:42 GMT+09:00 Hans van Kranenburg:
> Sorry for the mail spam, it's an interesting code puzzle... :)
>
> On 10/10/2017 07:22 PM, Hans van Kranenburg wrote:
>> On 10/10/2017 07:07 PM, Hans van Kranenburg wrote:
>>> On 10/10/2017 01:31 PM, David Sterba wrote:
>>>> Hi,
>>>>
>>>> On Fri, Sep 29, 2017 at 04:20:51PM +0900, Naohiro Aota wrote:
>>>>> Balancing a fresh METADATA=dup btrfs file system (with size < 50G)
>>>>> generates a 128MB sized block group. While we set max_stripe_size =
>>>>> max_chunk_size = 256MB, we get this half sized block group:
>>>>>
>>>>> $ btrfs ins dump-t -t CHUNK_TREE btrfs.img|grep length
>>>>> length 8388608 owner 2 stripe_len 65536 type DATA
>>>>> length 33554432 owner 2 stripe_len 65536 type SYSTEM|DUP
>>>>> length 134217728 owner 2 stripe_len 65536 type METADATA|DUP
>>>>>
>>>>> Before commit 86db25785a6e ("Btrfs: fix max chunk size on raid5/6"), we
>>>>> used "stripe_size * ndevs > max_chunk_size * ncopies" to check the max
>>>>> chunk size. Since stripe_size = 256MB * dev_stripes (= 2) = 512MB, ndevs
>>>>> = 1, max_chunk_size = 256MB, and ncopies = 2, we allowed a 256MB
>>>>> METADATA|DUP block group.
>>>>>
>>>>> But now we use "stripe_size * data_stripes > max_chunk_size". Since
>>>>> data_stripes = 1 on DUP, it prevents the block group from growing beyond
>>>>> 128MB. What's missing here is "dev_stripes". The proper logical space used
>>>>> by the block group is "stripe_size * data_stripes / dev_stripes". Tweak
>>>>> the equations to use the right value.
>>>>
>>>> I started looking into it and still don't fully understand it. Changes
>>>> deep in the allocator can easily break some block group combinations, so
>>>> I'm rather conservative here.
>>>
>>> I think that the added usage of data_stripes in 86db25785a6e is the
>>> problematic change. data_stripes is something that was introduced as
>>> part of RAID56 in 53b381b3a and clearly only has a meaning that was
>>> properly thought through for RAID56. The RAID56 commit already notes "this
>>> will have to be fixed for RAID1 and RAID10 over more drives", but the
>>> author didn't catch the DUP case, which already breaks at that point.
>>>

Thank you for explaining in detail :) Yes, the allocator was already
broken by using data_stripes.
(and it will also "break" existing file systems, in terms of making a
DUP|META block group smaller)
I believe this patch fixes things.

>>> At the beginning it says:
>>>
>>> int data_stripes; /* number of stripes that count for block group size */
>>>
>>> For the example:
>>>
>>> This is DUP:
>>>
>>> .sub_stripes        = 1,
>>> .dev_stripes        = 2,
>>> .devs_max           = 1,
>>> .devs_min           = 1,
>>> .tolerated_failures = 0,
>>> .devs_increment     = 1,
>>> .ncopies            = 2,
>>>
>>> In the code:
>>>
>>> max_stripe_size = SZ_256M
>>> max_chunk_size = max_stripe_size  -> SZ_256M
>>>
>>> Then we have find_free_dev_extent:
>>> max_stripe_size * dev_stripes   -> SZ_256M * 2 -> 512M
>>>
>>> So we'd like to find 512M on a disk, to stuff 2 stripes of 256M inside for
>>> the DUP. (remember: The two parts of DUP *never* end up on a different
>>> disk, even if you have multiple)
>>>
>
> Another, better fix would be to make a change here...
>
>>> If we find one:
>>> stripe_size = devices_info[ndevs-1].max_avail   -> 512M, yay
>
> ...because this is not yay. The 512M is max_avail, which needs to hold
> *2* stripes, not 1. So stripe_size is suddenly changed to twice the
> stripe size for DUP.
>
> So, dividing again by dev_stripes would translate the
> max_avail on device ndevs-1 into the stripe size to use.

This catches the point. Notice stripe_size is divided by dev_stripes
after the lines touched by the patch.
It means that in this region (from "stripe_size =
devices_info[ndevs-1].max_avail;" to "stripe_size =
div_u64(stripe_size, dev_stripes);"),
"stripe_size" stands for the stripe size written on one device. IOW,
stripe_size (here, the size of the region on one device) = dev_stripes *
stripe_size (final, per chunk).
These two "stripe_size" values are the same except for DUP, because we
set dev_stripes > 1 only for DUP.

We are using this per-device stripe_size to calculate the logical size.
This is correct if we actually striped writes across the stripes on one
device. But it is actually copied stripes, at least with the current
implementation. So the if-condition
"if (stripe_size * data_stripes > max_chunk_size)"
is doing the wrong calculation.

So the solution would be:

1. as in my patch, consider dev_stripes in the if-condition,
or
2. as Hans suggests, move the division by dev_stripes above the if-condition,
   and properly re-calculate stripe_size if the logical size > max_chunk_size
   (sketch below)
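
A minimal sketch of the 2nd option (untested, just to show where the
division would move; the mask and round-up lines are the existing ones):

    /* divide out dev_stripes first, so stripe_size means one stripe */
    stripe_size = div_u64(devices_info[ndevs - 1].max_avail, dev_stripes);

    if (stripe_size * data_stripes > max_chunk_size) {
            u64 mask = (1ULL << 24) - 1;

            /* stripe_size is per-stripe now, so no dev_stripes here */
            stripe_size = div_u64(max_chunk_size, data_stripes);

            /* bump the answer up to a 16MB boundary */
            stripe_size = (stripe_size + mask) & ~mask;
    }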

Now, I think taking the 2nd solution would make the code clearer,
because it eliminates 

Re: [PATCH] btrfs: fix max chunk size on dup

2017-10-10 Thread Hans van Kranenburg
Sorry for the mail spam, it's an interesting code puzzle... :)

On 10/10/2017 07:22 PM, Hans van Kranenburg wrote:
> On 10/10/2017 07:07 PM, Hans van Kranenburg wrote:
>> On 10/10/2017 01:31 PM, David Sterba wrote:
>>> Hi,
>>>
>>> On Fri, Sep 29, 2017 at 04:20:51PM +0900, Naohiro Aota wrote:
>>>> Balancing a fresh METADATA=dup btrfs file system (with size < 50G)
>>>> generates a 128MB sized block group. While we set max_stripe_size =
>>>> max_chunk_size = 256MB, we get this half sized block group:
>>>>
>>>> $ btrfs ins dump-t -t CHUNK_TREE btrfs.img|grep length
>>>> length 8388608 owner 2 stripe_len 65536 type DATA
>>>> length 33554432 owner 2 stripe_len 65536 type SYSTEM|DUP
>>>> length 134217728 owner 2 stripe_len 65536 type METADATA|DUP
>>>>
>>>> Before commit 86db25785a6e ("Btrfs: fix max chunk size on raid5/6"), we
>>>> used "stripe_size * ndevs > max_chunk_size * ncopies" to check the max
>>>> chunk size. Since stripe_size = 256MB * dev_stripes (= 2) = 512MB, ndevs
>>>> = 1, max_chunk_size = 256MB, and ncopies = 2, we allowed a 256MB
>>>> METADATA|DUP block group.
>>>>
>>>> But now we use "stripe_size * data_stripes > max_chunk_size". Since
>>>> data_stripes = 1 on DUP, it prevents the block group from growing beyond
>>>> 128MB. What's missing here is "dev_stripes". The proper logical space used
>>>> by the block group is "stripe_size * data_stripes / dev_stripes". Tweak
>>>> the equations to use the right value.
>>>
>>> I started looking into it and still don't fully understand it. Changes
>>> deep in the allocator can easily break some block group combinations, so
>>> I'm rather conservative here.
>>
>> I think that the added usage of data_stripes in 86db25785a6e is the
>> problematic change. data_stripes is something that was introduced as
>> part of RAID56 in 53b381b3a and clearly only has a meaning that was
>> properly thought through for RAID56. The RAID56 commit already notes "this
>> will have to be fixed for RAID1 and RAID10 over more drives", but the
>> author didn't catch the DUP case, which already breaks at that point.
>>
>> At the beginning it says:
>>
>> int data_stripes; /* number of stripes that count for block group size */
>>
>> For the example:
>>
>> This is DUP:
>>
>> .sub_stripes        = 1,
>> .dev_stripes        = 2,
>> .devs_max           = 1,
>> .devs_min           = 1,
>> .tolerated_failures = 0,
>> .devs_increment     = 1,
>> .ncopies            = 2,
>>
>> In the code:
>>
>> max_stripe_size = SZ_256M
>> max_chunk_size = max_stripe_size  -> SZ_256M
>>
>> Then we have find_free_dev_extent:
>> max_stripe_size * dev_stripes   -> SZ_256M * 2 -> 512M
>>
>> So we'd like to find 512M on a disk, to stuff 2 stripes of 256M inside for
>> the DUP. (remember: The two parts of DUP *never* end up on a different
>> disk, even if you have multiple)
>>

Another, better fix would be to make a change here...

>> If we find one:
>> stripe_size = devices_info[ndevs-1].max_avail   -> 512M, yay

...because this is not yay. The 512M is max_avail, which needs to hold
*2* stripes, not 1. So stripe_size is suddenly changed to twice the
stripe size for DUP.

So, dividing again by dev_stripes would translate the
max_avail on device ndevs-1 into the stripe size to use.

>>
>> num_stripes = ndevs * dev_stripes -> 1 * 2  -> 2
>>
>> data_stripes = num_stripes / ncopies = 2 / 2 = 1
> 
> Oh, wow, this is not true of course, because the "number of stripes that
> count for block group size" should still be 1 for DUP...
> 
>> BOOM! There's the problem. The data_stripes only thinks about data that
>> is horizontally spread over disks, and not vertically spread...
> 
> Hm... no Hans, you're wrong.
> 
>> What I would propose is changing...
>>    the data_stripes = ... calculation and the ifs that try to correct
>> it afterwards
>> ...to...
>>    a switch/case thing where the explicit logic is put in to get the
>> right value for each specific raid type.
>>
>> In case of DUP this simply means data_stripes = 2, because there is no
>> need for fancy calculations about spreading DUP data over X devices.
>> It's always 2 copies on 1 device.
> 
> Eh, nope. 1.
> 
> So, then I end up at the "stripe_size * data_stripes > max_chunk_size"
> again.
> 
> So, yes, Naohiro is right, and DUP is the only case in which this logic
> breaks. DUP is the only one in which this change makes a difference,
> because it's the only one which has dev_stripes set to something other
> than 1.

If stripe_size didn't incorrectly get set to dev_stripes * stripe_size
by the max_avail assignment above, the if would work.

The comments below still stand, because doing all those calculations
with stripes and number of devices for something as predictable as DUP
is only confusing. :D

> 
> \o/
> 
>> ...
>>
>> My general feeling when looking at the code is that this single part of
>> code is responsible for too many different cases, or more cases than a
>> developer can reason about at once "in his head" when working on it.
>>
>> 7 raid options * 3 different types (data, metadata, system) = 21

Re: [PATCH] btrfs: fix max chunk size on dup

2017-10-10 Thread Hans van Kranenburg
On 10/10/2017 07:07 PM, Hans van Kranenburg wrote:
> On 10/10/2017 01:31 PM, David Sterba wrote:
>> Hi,
>>
>> On Fri, Sep 29, 2017 at 04:20:51PM +0900, Naohiro Aota wrote:
>>> Balancing a fresh METADATA=dup btrfs file system (with size < 50G)
>>> generates a 128MB sized block group. While we set max_stripe_size =
>>> max_chunk_size = 256MB, we get this half sized block group:
>>>
>>> $ btrfs ins dump-t -t CHUNK_TREE btrfs.img|grep length
>>> length 8388608 owner 2 stripe_len 65536 type DATA
>>> length 33554432 owner 2 stripe_len 65536 type SYSTEM|DUP
>>> length 134217728 owner 2 stripe_len 65536 type METADATA|DUP
>>>
>>> Before commit 86db25785a6e ("Btrfs: fix max chunk size on raid5/6"), we
>>> used "stripe_size * ndevs > max_chunk_size * ncopies" to check the max
>>> chunk size. Since stripe_size = 256MB * dev_stripes (= 2) = 512MB, ndevs
>>> = 1, max_chunk_size = 256MB, and ncopies = 2, we allowed a 256MB
>>> METADATA|DUP block group.
>>>
>>> But now we use "stripe_size * data_stripes > max_chunk_size". Since
>>> data_stripes = 1 on DUP, it prevents the block group from growing beyond
>>> 128MB. What's missing here is "dev_stripes". The proper logical space used
>>> by the block group is "stripe_size * data_stripes / dev_stripes". Tweak
>>> the equations to use the right value.
>>
>> I started looking into it and still don't fully understand it. Changes
>> deep in the allocator can easily break some block group combinations, so
>> I'm rather conservative here.
> 
> I think that the added usage of data_stripes in 86db25785a6e is the
> problematic change. data_stripes is something that was introduced as
> part of RAID56 in 53b381b3a and clearly only has a meaning that was
> properly thought through for RAID56. The RAID56 commit already notes "this
> will have to be fixed for RAID1 and RAID10 over more drives", but the
> author didn't catch the DUP case, which already breaks at that point.
> 
> At the beginning it says:
> 
> int data_stripes; /* number of stripes that count for block group size */
> 
> For the example:
> 
> This is DUP:
> 
> .sub_stripes        = 1,
> .dev_stripes        = 2,
> .devs_max           = 1,
> .devs_min           = 1,
> .tolerated_failures = 0,
> .devs_increment     = 1,
> .ncopies            = 2,
> 
> In the code:
> 
> max_stripe_size = SZ_256M
> max_chunk_size = max_stripe_size  -> SZ_256M
> 
> Then we have find_free_dev_extent:
> max_stripe_size * dev_stripes   -> SZ_256M * 2 -> 512M
> 
> So we'd like to find 512M on a disk, to stuff 2 stripes of 256M inside for
> the DUP. (remember: The two parts of DUP *never* end up on a different
> disk, even if you have multiple)
> 
> If we find one:
> stripe_size = devices_info[ndevs-1].max_avail   -> 512M, yay
> 
> ndevs = min(ndevs, devs_max)  -> min($whatever, 1)   -> 1
> num_stripes = ndevs * dev_stripes -> 1 * 2  -> 2
> 
> data_stripes = num_stripes / ncopies = 2 / 2 = 1

Oh, wow, this is not true of course, because the "number of stripes that
count for block group size" should still be 1 for DUP...

> BOOM! There's the problem. The data_stripes only thinks about data that
> is horizontally spread over disks, and not vertically spread...

Hm... no Hans, you're wrong.

> What I would propose is changing...
>    the data_stripes = ... calculation and the ifs that try to correct
> it afterwards
> ...to...
>    a switch/case thing where the explicit logic is put in to get the
> right value for each specific raid type.
> 
> In case of DUP this simply means data_stripes = 2, because there is no
> need for fancy calculations about spreading DUP data over X devices.
> It's always 2 copies on 1 device.

Eh, nope. 1.

So, then I end up at the "stripe_size * data_stripes > max_chunk_size"
again.

So, yes, Naohiro is right, and DUP is the only case in which this logic
breaks. DUP is the only one in which this change makes a difference,
because it's the only one which has dev_stripes set to something other
than 1.
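
For reference, the dev_stripes values in the btrfs_raid_array in
volumes.c (quoting from memory, so double-check the source):

    RAID10: 1, RAID1: 1, DUP: 2, RAID0: 1, SINGLE: 1, RAID5: 1, RAID6: 1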

\o/

> ...
> 
> My general feeling when looking at the code is that this single part of
> code is responsible for too many different cases, or more cases than a
> developer can reason about at once "in his head" when working on it.
> 
> 7 raid options * 3 different types (data, metadata, system) = 21
> already... Some parts of the algorithm only make sense for a subset of
> the combinations, but they're still part of the computation, which
> sometimes "by accident" results in the correct outcome. :)
> 
> If it can't be done in a way that's easier to understand when reading
> the code, it should have unit tests with a list of known inputs and
> outputs to detect unwanted changes.
> 


-- 
Hans van Kranenburg


Re: [PATCH] btrfs: fix max chunk size on dup

2017-10-10 Thread Hans van Kranenburg
On 10/10/2017 01:31 PM, David Sterba wrote:
> Hi,
> 
> On Fri, Sep 29, 2017 at 04:20:51PM +0900, Naohiro Aota wrote:
>> Balancing a fresh METADATA=dup btrfs file system (with size < 50G)
>> generates a 128MB sized block group. While we set max_stripe_size =
>> max_chunk_size = 256MB, we get this half sized block group:
>>
>> $ btrfs ins dump-t -t CHUNK_TREE btrfs.img|grep length
>> length 8388608 owner 2 stripe_len 65536 type DATA
>> length 33554432 owner 2 stripe_len 65536 type SYSTEM|DUP
>> length 134217728 owner 2 stripe_len 65536 type METADATA|DUP
>>
>> Before commit 86db25785a6e ("Btrfs: fix max chunk size on raid5/6"), we
>> used "stripe_size * ndevs > max_chunk_size * ncopies" to check the max
>> chunk size. Since stripe_size = 256MB * dev_stripes (= 2) = 512MB, ndevs
>> = 1, max_chunk_size = 256MB, and ncopies = 2, we allowed a 256MB
>> METADATA|DUP block group.
>>
>> But now we use "stripe_size * data_stripes > max_chunk_size". Since
>> data_stripes = 1 on DUP, it prevents the block group from growing beyond
>> 128MB. What's missing here is "dev_stripes". The proper logical space used
>> by the block group is "stripe_size * data_stripes / dev_stripes". Tweak
>> the equations to use the right value.
> 
> I started looking into it and still don't fully understand it. Changes
> deep in the allocator can easily break some block group combinations, so
> I'm rather conservative here.

I think that the added usage of data_stripes in 86db25785a6e is the
problematic change. data_stripes is something that was introduced as
part of RAID56 in 53b381b3a and clearly only has a meaning that was
properly thought through for RAID56. The RAID56 commit already notes "this
will have to be fixed for RAID1 and RAID10 over more drives", but the
author didn't catch the DUP case, which already breaks at that point.

At the beginning it says:

int data_stripes; /* number of stripes that count for block group size */

For the example:

This is DUP:

.sub_stripes        = 1,
.dev_stripes        = 2,
.devs_max           = 1,
.devs_min           = 1,
.tolerated_failures = 0,
.devs_increment     = 1,
.ncopies            = 2,

In the code:

max_stripe_size = SZ_256M
max_chunk_size = max_stripe_size  -> SZ_256M

Then we have find_free_dev_extent:
max_stripe_size * dev_stripes   -> SZ_256M * 2 -> 512M

So we'd like to find 512M on a disk, to stuff 2 stripes of 256M inside for
the DUP. (remember: The two parts of DUP *never* end up on a different
disk, even if you have multiple)

If we find one:
stripe_size = devices_info[ndevs-1].max_avail   -> 512M, yay

ndevs = min(ndevs, devs_max)  -> min($whatever, 1)   -> 1
num_stripes = ndevs * dev_stripes -> 1 * 2  -> 2

data_stripes = num_stripes / ncopies = 2 / 2 = 1

BOOM! There's the problem. The data_stripes only thinks about data that
is horizontally spread over disks, and not vertically spread...
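
Putting the numbers together (my paper walkthrough of the current code,
so double-check it against the source):

    stripe_size = max_avail                              -> 512M
    512M * data_stripes (1) > max_chunk_size (256M)      -> true, so:
    stripe_size = div_u64(max_chunk_size, data_stripes)  -> 256M
    ...later...
    stripe_size = div_u64(stripe_size, dev_stripes)      -> 128M

    block group length = stripe_size * data_stripes      -> 128M

...which is exactly the 134217728 METADATA|DUP length from the report.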

What I would propose is changing...
   the data_stripes = ... calculation and the ifs that try to correct it
afterwards
...to...
   a switch/case thing where the explicit logic is put in to get the
right value for each specific raid type.

In case of DUP this simply means data_stripes = 2, because there is no
need for fancy calculations about spreading DUP data over X devices.
It's always 2 copies on 1 device. (rough sketch below)
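
Roughly like this (untested sketch; and per the follow-ups in this
thread, the DUP value for "stripes that count for block group size"
actually has to be 1, not 2):

    switch (type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
    case BTRFS_BLOCK_GROUP_RAID5:
            data_stripes = num_stripes - 1;
            break;
    case BTRFS_BLOCK_GROUP_RAID6:
            data_stripes = num_stripes - 2;
            break;
    case BTRFS_BLOCK_GROUP_DUP:
            /* 2 copies on 1 device, only 1 counts for the size */
            data_stripes = 1;
            break;
    default:
            data_stripes = num_stripes / ncopies;
            break;
    }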

...

My general feeling when looking at the code is that this single part of
code is responsible for too many different cases, or more cases than a
developer can reason about at once "in his head" when working on it.

7 raid options * 3 different types (data, metadata, system) = 21
already... Some parts of the algorithm only make sense for a subset of
the combinations, but they're still part of the computation, which
sometimes "by accident" results in the correct outcome. :)

If it can't be done in a way that's easier to understand when reading
the code, it should have unit tests with a list of known inputs and
outputs to detect unwanted changes.
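
For example, something table-driven in this shape (hypothetical names,
not an existing kernel self-test API):

    struct chunk_alloc_case {
            u64 flags;        /* block group type | profile */
            int ndevs;        /* devices with free space */
            u64 dev_free;     /* free bytes per device */
            u64 expected_len; /* expected block group length */
    };

    static const struct chunk_alloc_case cases[] = {
            /* DUP metadata on one small device: expect 256M, not 128M */
            { BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_DUP,
              1, SZ_1G, SZ_256M },
    };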

-- 
Hans van Kranenburg


Re: [PATCH] btrfs: fix max chunk size on dup

2017-10-10 Thread David Sterba
Hi,

On Fri, Sep 29, 2017 at 04:20:51PM +0900, Naohiro Aota wrote:
> Balancing a fresh METADATA=dup btrfs file system (with size < 50G)
> generates a 128MB sized block group. While we set max_stripe_size =
> max_chunk_size = 256MB, we get this half sized block group:
> 
> $ btrfs ins dump-t -t CHUNK_TREE btrfs.img|grep length
> length 8388608 owner 2 stripe_len 65536 type DATA
> length 33554432 owner 2 stripe_len 65536 type SYSTEM|DUP
> length 134217728 owner 2 stripe_len 65536 type METADATA|DUP
> 
> Before commit 86db25785a6e ("Btrfs: fix max chunk size on raid5/6"), we
> used "stripe_size * ndevs > max_chunk_size * ncopies" to check the max
> chunk size. Since stripe_size = 256MB * dev_stripes (= 2) = 512MB, ndevs
> = 1, max_chunk_size = 256MB, and ncopies = 2, we allowed a 256MB
> METADATA|DUP block group.
> 
> But now we use "stripe_size * data_stripes > max_chunk_size". Since
> data_stripes = 1 on DUP, it prevents the block group from growing beyond
> 128MB. What's missing here is "dev_stripes". The proper logical space used
> by the block group is "stripe_size * data_stripes / dev_stripes". Tweak
> the equations to use the right value.

I started looking into it and still don't fully understand it. Changes
deep in the allocator can easily break some block group combinations, so
I'm rather conservative here.


[PATCH] btrfs: fix max chunk size on dup

2017-09-29 Thread Naohiro Aota
Balancing a fresh METADATA=dup btrfs file system (with size < 50G)
generates a 128MB sized block group. While we set max_stripe_size =
max_chunk_size = 256MB, we get this half sized block group:

$ btrfs ins dump-t -t CHUNK_TREE btrfs.img|grep length
length 8388608 owner 2 stripe_len 65536 type DATA
length 33554432 owner 2 stripe_len 65536 type SYSTEM|DUP
length 134217728 owner 2 stripe_len 65536 type METADATA|DUP

Before commit 86db25785a6e ("Btrfs: fix max chunk size on raid5/6"), we
used "stripe_size * ndevs > max_chunk_size * ncopies" to check the max
chunk size. Since stripe_size = 256MB * dev_stripes (= 2) = 512MB, ndevs
= 1, max_chunk_size = 256MB, and ncopies = 2, we allowed a 256MB
METADATA|DUP block group.

But now we use "stripe_size * data_stripes > max_chunk_size". Since
data_stripes = 1 on DUP, it prevents the block group from growing beyond
128MB. What's missing here is "dev_stripes". The proper logical space used
by the block group is "stripe_size * data_stripes / dev_stripes". Tweak
the equations to use the right value.

Signed-off-by: Naohiro Aota 
---
 fs/btrfs/volumes.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 0e8f16c305df..d1ac3e1fb753 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4751,10 +4751,11 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
          * is really going to be in terms of logical address space,
          * and compare that answer with the max chunk size
          */
-        if (stripe_size * data_stripes > max_chunk_size) {
+        if (stripe_size * data_stripes > max_chunk_size * dev_stripes) {
                 u64 mask = (1ULL << 24) - 1;
 
-                stripe_size = div_u64(max_chunk_size, data_stripes);
+                stripe_size = div_u64(max_chunk_size * dev_stripes,
+                                      data_stripes);
 
                 /* bump the answer up to a 16MB boundary */
                 stripe_size = (stripe_size + mask) & ~mask;
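
Plugging in the DUP numbers from the discussion as a sanity check of the
new condition (paper math):

    old: 512M * 1 > 256M      -> true,  stripe_size cut to 256M, then
                                 halved by dev_stripes -> 128M block group
    new: 512M * 1 > 256M * 2  -> false, stripe_size stays 512M, then
                                 halved by dev_stripes -> 256M block group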
