Re: [PATCH v3 1/3] btrfs: discard relocated block groups

2021-04-14 Thread Filipe Manana
On Wed, Apr 14, 2021 at 2:00 PM Johannes Thumshirn
 wrote:
>
> On 14/04/2021 13:23, Johannes Thumshirn wrote:
> > From how I understand the code, yes. It's a quick test, so let's just do it
> > and see what breaks.
> >
> > I'd prefer to just drop the ->ro check, it's less special casing for zoned
> > btrfs that we have to keep in mind when changing things.
>
> OK, no this doesn't work, because btrfs_start_trans_remove_block_group() has
> this ASSERT(em && em->start == chunk_offset);, but btrfs_remove_block_group()
> from relocation has already called remove_extent_mapping().

Ah bummer.
Well your previous proposal is reasonable.

Thanks for checking it.

>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH v3 1/3] btrfs: discard relocated block groups

2021-04-14 Thread Johannes Thumshirn
On 14/04/2021 13:23, Johannes Thumshirn wrote:
> From how I understand the code, yes. It's a quick test, so let's just do it
> and see what breaks.
> 
> I'd prefer to just drop the ->ro check, it's less special casing for zoned
> btrfs that we have to keep in mind when changing things.

OK, no this doesn't work, because btrfs_start_trans_remove_block_group() has
this ASSERT(em && em->start == chunk_offset);, but btrfs_remove_block_group()
from relocation has already called remove_extent_mapping().



Re: [PATCH v3 1/3] btrfs: discard relocated block groups

2021-04-14 Thread Filipe Manana
On Wed, Apr 14, 2021 at 12:22 PM Johannes Thumshirn
 wrote:
>
> On 14/04/2021 13:17, Filipe Manana wrote:
> > Yep, that's what puzzled me, why the need to do it for non-zoned file
> > systems when using -o discard=sync.
> > I assumed you ran into a case where discard was not happening due to
> > some bug bug in the extent pinning/unpinning mechanism.
> >
> >> So the correct fix would
> >> be to get the block group into the 'trans->transaction->deleted_bgs' list
> >> after relocation, which would work if we wouldn't check for 
> >> block_group->ro in
> >> btrfs_delete_unused_bgs(), but I suppose this check is there for a reason.
> >
> > Actually the check for ->ro does not make sense anymore since I
> > introduced the delete_unused_bgs_mutex in commit
> > 67c5e7d464bc466471b05e027abe8a6b29687ebd.
> >
> > When the ->ro check was added
> > (47ab2a6c689913db23ccae38349714edf8365e0a), it was meant to prevent
> > the cleaner kthread and relocation tasks from calling
> > btrfs_remove_chunk() concurrently, but checking for ->ro only was
> > buggy, hence the addition of delete_unused_bgs_mutex later.
> >
>
>
> I'll have a look at these commits.
>
>
> >>
> >> How about changing the patch to the following:
> >
> > Looks good.
> > However would just removing the ->ro check by enough as well?
>
> From how I understand the code, yes. It's a quick test, so let's just do it
> and see what breaks.

The thing most worth checking is if concurrent scrubs can cause any
issue with deletion of unused block groups.
I think they won't, but there were races in the past I remember fixing.

Several test cases from btrfs/060 to btrfs/074 exercise scrub running
with fsstress, balance, etc.
They are a good way to stress test this code.

>
> I'd prefer to just drop the ->ro check, it's less special casing for zoned
> btrfs that we have to keep in mind when changing things.
>
> Thanks for helping me with this,
> Johannes



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH v3 1/3] btrfs: discard relocated block groups

2021-04-14 Thread Johannes Thumshirn
On 14/04/2021 13:17, Filipe Manana wrote:
> Yep, that's what puzzled me, why the need to do it for non-zoned file
> systems when using -o discard=sync.
> I assumed you ran into a case where discard was not happening due to
> some bug bug in the extent pinning/unpinning mechanism.
> 
>> So the correct fix would
>> be to get the block group into the 'trans->transaction->deleted_bgs' list
>> after relocation, which would work if we wouldn't check for block_group->ro 
>> in
>> btrfs_delete_unused_bgs(), but I suppose this check is there for a reason.
> 
> Actually the check for ->ro does not make sense anymore since I
> introduced the delete_unused_bgs_mutex in commit
> 67c5e7d464bc466471b05e027abe8a6b29687ebd.
> 
> When the ->ro check was added
> (47ab2a6c689913db23ccae38349714edf8365e0a), it was meant to prevent
> the cleaner kthread and relocation tasks from calling
> btrfs_remove_chunk() concurrently, but checking for ->ro only was
> buggy, hence the addition of delete_unused_bgs_mutex later.
>


I'll have a look at these commits.

 
>>
>> How about changing the patch to the following:
> 
> Looks good.
> However would just removing the ->ro check by enough as well?

>From how I understand the code, yes. It's a quick test, so let's just do it
and see what breaks.

I'd prefer to just drop the ->ro check, it's less special casing for zoned
btrfs that we have to keep in mind when changing things.

Thanks for helping me with this,
Johannes


Re: [PATCH v3 1/3] btrfs: discard relocated block groups

2021-04-14 Thread Filipe Manana
On Tue, Apr 13, 2021 at 6:48 PM Johannes Thumshirn
 wrote:
>
> On 13/04/2021 14:57, Filipe Manana wrote:
> > And what about the other mechanism that triggers discards on pinned
> > extents, after the transaction commits the super blocks?
> > Why isn't that happening (with -o discard=sync)? We create the delayed
> > references to drop extents from the relocated block group, which
> > results in pinning extents.
> > This is the case that surprised me that it isn't working for you.
>
> I think this is the case. I would have expected to end up in this
> part of btrfs_finish_extent_commit():
>
>
> /*
>  * Transaction is finished.  We don't need the lock anymore.  We
>  * do need to clean up the block groups in case of a transaction
>  * abort.
>  */
> deleted_bgs = &trans->transaction->deleted_bgs;
> list_for_each_entry_safe(block_group, tmp, deleted_bgs, bg_list) {
> u64 trimmed = 0;
>
> ret = -EROFS;
> if (!TRANS_ABORTED(trans))
> ret = btrfs_discard_extent(fs_info,
>block_group->start,
>block_group->length,
>&trimmed);
>
> list_del_init(&block_group->bg_list);
> btrfs_unfreeze_block_group(block_group);
> btrfs_put_block_group(block_group);
>
> if (ret) {
> const char *errstr = btrfs_decode_error(ret);
> btrfs_warn(fs_info,
>"discard failed while removing blockgroup: 
> errno=%d %s",
>ret, errstr);
> }
> }
>
> and the btrfs_discard_extent() over the whole block group would then trigger a
> REQ_OP_ZONE_RESET operation, resetting the device's zone.
>
> But as btrfs_delete_unused_bgs() doesn't add the block group to the
> ->deleted_bgs list, we're not reaching above code. I /think/ (i.e. 
> verification
> pending) the -o discard=sync case works for regular block devices, as each 
> extent
> is discarded on it's own, by this (also in btrfs_finish_extent_commit()):
>
> while (!TRANS_ABORTED(trans)) {
> struct extent_state *cached_state = NULL;
>
> mutex_lock(&fs_info->unused_bg_unpin_mutex);
> ret = find_first_extent_bit(unpin, 0, &start, &end,
> EXTENT_DIRTY, &cached_state);
> if (ret) {
> mutex_unlock(&fs_info->unused_bg_unpin_mutex);
> break;
> }
>
> if (btrfs_test_opt(fs_info, DISCARD_SYNC))
> ret = btrfs_discard_extent(fs_info, start,
>end + 1 - start, NULL);
>
> clear_extent_dirty(unpin, start, end, &cached_state);
> unpin_extent_range(fs_info, start, end, true);
> mutex_unlock(&fs_info->unused_bg_unpin_mutex);
> free_extent_state(cached_state);
> cond_resched();
> }
>
> If this is the case, my patch will essentially discard the data twice, for a
> non-zoned block device, which is certainly not ideal.

Yep, that's what puzzled me, why the need to do it for non-zoned file
systems when using -o discard=sync.
I assumed you ran into a case where discard was not happening due to
some bug bug in the extent pinning/unpinning mechanism.

> So the correct fix would
> be to get the block group into the 'trans->transaction->deleted_bgs' list
> after relocation, which would work if we wouldn't check for block_group->ro in
> btrfs_delete_unused_bgs(), but I suppose this check is there for a reason.

Actually the check for ->ro does not make sense anymore since I
introduced the delete_unused_bgs_mutex in commit
67c5e7d464bc466471b05e027abe8a6b29687ebd.

When the ->ro check was added
(47ab2a6c689913db23ccae38349714edf8365e0a), it was meant to prevent
the cleaner kthread and relocation tasks from calling
btrfs_remove_chunk() concurrently, but checking for ->ro only was
buggy, hence the addition of delete_unused_bgs_mutex later.

>
> How about changing the patch to the following:

Looks good.
However would just removing the ->ro check by enough as well?

Thanks Johannes.

>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 6d9b2369f17a..ba13b2ea3c6f 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -3103,6 +3103,9 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info 
> *fs_info, u64 chunk_offset)
> struct btrfs_root *root = fs_info->chunk_root;
> struct btrfs_trans_handle *trans;
> struct btrfs_block_group *block_group;
> +   u64 length;
> int ret;
>
> /*
> @@ -3130,8 +3133,16 @@ static int btrfs_relocate_chunk(struct btrf

Re: [PATCH v3 1/3] btrfs: discard relocated block groups

2021-04-13 Thread Johannes Thumshirn
On 13/04/2021 14:57, Filipe Manana wrote:
> And what about the other mechanism that triggers discards on pinned
> extents, after the transaction commits the super blocks?
> Why isn't that happening (with -o discard=sync)? We create the delayed
> references to drop extents from the relocated block group, which
> results in pinning extents.
> This is the case that surprised me that it isn't working for you.

I think this is the case. I would have expected to end up in this
part of btrfs_finish_extent_commit():

  
/*  
 
 * Transaction is finished.  We don't need the lock anymore.  We
 
 * do need to clean up the block groups in case of a transaction
 
 * abort.   
 
 */ 
 
deleted_bgs = &trans->transaction->deleted_bgs; 
 
list_for_each_entry_safe(block_group, tmp, deleted_bgs, bg_list) {  
 
u64 trimmed = 0;
 

 
ret = -EROFS;   
 
if (!TRANS_ABORTED(trans))  
 
ret = btrfs_discard_extent(fs_info, 
 
   block_group->start,  
 
   block_group->length, 
 
   &trimmed);   
 

 
list_del_init(&block_group->bg_list);   
 
btrfs_unfreeze_block_group(block_group);
 
btrfs_put_block_group(block_group); 
 

 
if (ret) {  
 
const char *errstr = btrfs_decode_error(ret);   
 
btrfs_warn(fs_info, 
 
   "discard failed while removing blockgroup: errno=%d 
%s",
   ret, errstr);
 
}   
 
}

and the btrfs_discard_extent() over the whole block group would then trigger a
REQ_OP_ZONE_RESET operation, resetting the device's zone.

But as btrfs_delete_unused_bgs() doesn't add the block group to the 
->deleted_bgs list, we're not reaching above code. I /think/ (i.e. verification
pending) the -o discard=sync case works for regular block devices, as each 
extent
is discarded on it's own, by this (also in btrfs_finish_extent_commit()):

while (!TRANS_ABORTED(trans)) { 
 
struct extent_state *cached_state = NULL;   
 

 
mutex_lock(&fs_info->unused_bg_unpin_mutex);
 
ret = find_first_extent_bit(unpin, 0, &start, &end, 
 
EXTENT_DIRTY, &cached_state);   
 
if (ret) {  
 
mutex_unlock(&fs_info->unused_bg_unpin_mutex);  
 
break;  
 
}   
 

 
if (btrfs_test_opt(fs_info, DISCARD_SYNC))  
 
ret = btrfs_discard_extent(fs_info, start,  
 
   end + 1 - start, NULL);  
 

 
clear_extent_dirty(unpin, start, end, &cached_state);   
 
unpin_extent_range(fs_info, start, end, true);  
 
mutex_unlock(&fs_info->unused_bg_unpin_mutex);  
 
free_extent_state(cached_state);
 
cond_resched(); 
 
}

If this is the case, my patch will essentially discard the data twice, for a
non-zoned block device, which is cert

Re: [PATCH v3 1/3] btrfs: discard relocated block groups

2021-04-13 Thread Filipe Manana
On Tue, Apr 13, 2021 at 1:44 PM Johannes Thumshirn
 wrote:
>
> On 12/04/2021 16:21, Johannes Thumshirn wrote:
> >> If it affects only the zoned case, I also don't see why doing it when
> >> not in zoned mode (and -o discard=sync is set).
> >> Unless of course the default discard mechanism is broken somehow, in
> >> which case we need to find out why and fix it.
> >
> > I'm at it.
> >
>
> OK I've found out what's wrong. In btrfs_relocate_block_group() we're calling
> btrfs_inc_block_group_ro(). btrfs_update_block_group() will call
> btrfs_mark_bg_unused() to add the block group to the 'fs_info->unused_bgs' 
> list.
>
> But in btrfs_delete_unused_bgs() we have this check:
> if (block_group->reserved || block_group->pinned ||
> block_group->used || block_group->ro ||
> list_is_singular(&block_group->list)) {
> /*
>  * We want to bail if we made new allocations or have
>  * outstanding allocations in this block group.  We do
>  * the ro check in case balance is currently acting on
>  * this block group.
>  */
> trace_btrfs_skip_unused_block_group(block_group);
> spin_unlock(&block_group->lock);
> up_write(&space_info->groups_sem);
> goto next;
> }
>
> So we're skipping over the removal of the block group. I've instrumented the
> kernel and also tested with non-zoned devices, this skip is always present
> with block_group->ro = 1.

Yep, expected, see below.

>
> Also the auto-relocation patch has a problem, not decrementing block_group->ro
> and so it's left at ro = 2 after auto relocation got triggered.
>
> So the question is, why are we leaving block_group->ro at 1 after relocation
> finished? Probably that the allocator doesn't pick the block group for the 
> next
> allocation.

Yes. Otherwise after relocating (or even during relocation), a task
allocates an extent from the block group, and then blindly after we
delete the block group - this would be racy and cause corruption.

>
> What am I missing?

And what about the other mechanism that triggers discards on pinned
extents, after the transaction commits the super blocks?
Why isn't that happening (with -o discard=sync)? We create the delayed
references to drop extents from the relocated block group, which
results in pinning extents.
This is the case that surprised me that it isn't working for you.

Thanks.

>
> Thanks,
> Johannes



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH v3 1/3] btrfs: discard relocated block groups

2021-04-13 Thread Johannes Thumshirn
On 12/04/2021 16:21, Johannes Thumshirn wrote:
>> If it affects only the zoned case, I also don't see why doing it when
>> not in zoned mode (and -o discard=sync is set).
>> Unless of course the default discard mechanism is broken somehow, in
>> which case we need to find out why and fix it.
> 
> I'm at it.
> 

OK I've found out what's wrong. In btrfs_relocate_block_group() we're calling
btrfs_inc_block_group_ro(). btrfs_update_block_group() will call 
btrfs_mark_bg_unused() to add the block group to the 'fs_info->unused_bgs' list.

But in btrfs_delete_unused_bgs() we have this check:
if (block_group->reserved || block_group->pinned || 

 
block_group->used || block_group->ro || 

 
list_is_singular(&block_group->list)) { 

 
/*  

 
 * We want to bail if we made new allocations or have   

 
 * outstanding allocations in this block group.  We do  

 
 * the ro check in case balance is currently acting on  

 
 * this block group.

 
 */ 

 
trace_btrfs_skip_unused_block_group(block_group);   

 
spin_unlock(&block_group->lock);

 
up_write(&space_info->groups_sem);  

 
goto next;  

 
}   

So we're skipping over the removal of the block group. I've instrumented the
kernel and also tested with non-zoned devices, this skip is always present
with block_group->ro = 1.

Also the auto-relocation patch has a problem, not decrementing block_group->ro
and so it's left at ro = 2 after auto relocation got triggered.

So the question is, why are we leaving block_group->ro at 1 after relocation 
finished? Probably that the allocator doesn't pick the block group for the next
allocation.

What am I missing?

Thanks,
Johannes


Re: [PATCH v3 1/3] btrfs: discard relocated block groups

2021-04-12 Thread Johannes Thumshirn
On 12/04/2021 16:09, Filipe Manana wrote:
>> [   81.014752] BTRFS info (device nullb0): reclaiming chunk 1073741824
>> [   81.015732] BTRFS info (device nullb0): relocating block group 1073741824 
>> flags data
>> [   81.798090] BTRFS info (device nullb0): found 12452 extents, stage: move 
>> data extents
>> [   82.314562] BTRFS info (device nullb0): found 12452 extents, stage: 
>> update data pointers
>> # blkzone report -o $((1073741824 >> 9)) -c 1 /dev/nullb0
>>   start: 0x00020, len 0x08, cap 0x08, wptr 0x0799a0 reset:0 
>> non-seq:0, zcond: 2(oi) [type: 2(SEQ_WRITE_REQUIRED)]
> 
> Not familiar with zoned device details, but what you are passing to
> blkzone is a logical address, in non-zoned btrfs it does not always
> matches the physical address on disk.
> So maybe that is not a reliable way to check it.

Yep need to find a more reliable way to use in fstests later, but as of now, it 
works for debugging.
 
>>
>> Whereas when the this patch is applied as well:
>> [   85.126542] BTRFS info (device nullb0): reclaiming chunk 1073741824
>> [   85.128211] BTRFS info (device nullb0): relocating block group 1073741824 
>> flags data
>> [   86.061831] BTRFS info (device nullb0): found 12452 extents, stage: move 
>> data extents
>> [   86.766549] BTRFS info (device nullb0): found 12452 extents, stage: 
>> update data pointers
>> # blkzone report -c 1 -o $((1073741824 >> 9)) /dev/nullb0
>>   start: 0x00020, len 0x08, cap 0x08, wptr 0x00 reset:0 
>> non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
>>
>> As a positive side effect of this, I now have code that can be used in
>> xfstests to verify the patchset.
> 
> Ok.
> Maybe the zone isn't reset properly because the default mechanism is
> doing smaller discards, on a per extent basis, and perhaps the order
> of those discards matters?
> 
> If it affects only the zoned case, I also don't see why doing it when
> not in zoned mode (and -o discard=sync is set).
> Unless of course the default discard mechanism is broken somehow, in
> which case we need to find out why and fix it.

I'm at it.


Re: [PATCH v3 1/3] btrfs: discard relocated block groups

2021-04-12 Thread Filipe Manana
On Mon, Apr 12, 2021 at 2:49 PM Johannes Thumshirn
 wrote:
>
> On 09/04/2021 13:37, Filipe Manana wrote:
> > On Fri, Apr 9, 2021 at 11:54 AM Johannes Thumshirn
> >  wrote:
> >>
> >> When relocating a block group the freed up space is not discarded. On
> >> devices like SSDs this hint is useful to tell the device the space is
> >> freed now. On zoned block devices btrfs' discard code will reset the zone
> >> the block group is on, freeing up the occupied space.
> >>
> >> Signed-off-by: Johannes Thumshirn 
> >> ---
> >>  fs/btrfs/volumes.c | 13 +
> >>  1 file changed, 13 insertions(+)
> >>
> >> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> >> index 6d9b2369f17a..d9ef8bce0cde 100644
> >> --- a/fs/btrfs/volumes.c
> >> +++ b/fs/btrfs/volumes.c
> >> @@ -3103,6 +3103,10 @@ static int btrfs_relocate_chunk(struct 
> >> btrfs_fs_info *fs_info, u64 chunk_offset)
> >> struct btrfs_root *root = fs_info->chunk_root;
> >> struct btrfs_trans_handle *trans;
> >> struct btrfs_block_group *block_group;
> >> +   const bool trim = btrfs_is_zoned(fs_info) ||
> >> +   btrfs_test_opt(fs_info, DISCARD_SYNC);
> >> +   u64 trimmed;
> >> +   u64 length;
> >> int ret;
> >>
> >> /*
> >> @@ -3130,6 +3134,7 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info 
> >> *fs_info, u64 chunk_offset)
> >> if (!block_group)
> >> return -ENOENT;
> >> btrfs_discard_cancel_work(&fs_info->discard_ctl, block_group);
> >> +   length = block_group->length;
> >> btrfs_put_block_group(block_group);
> >>
> >> trans = btrfs_start_trans_remove_block_group(root->fs_info,
> >> @@ -3144,6 +3149,14 @@ static int btrfs_relocate_chunk(struct 
> >> btrfs_fs_info *fs_info, u64 chunk_offset)
> >>  * step two, delete the device extents and the
> >>  * chunk tree entries
> >>  */
> >> +   if (trim) {
> >> +   ret = btrfs_discard_extent(fs_info, chunk_offset, length,
> >> +  &trimmed);
> >
> > Ideally we do IO and potentially slow operations such as discard while
> > not holding a transaction handle open, to avoid slowing down anyone
> > trying to commit the transaction.
> >
> >> +   if (ret) {
> >> +   btrfs_abort_transaction(trans, ret);
> >
> > This is leaking the transaction, still need to call btrfs_end_transaction().
> >
> > Making the discard before joining/starting transaction, would fix both 
> > things.
>
> Note taken.
>
> >
> > Now more importantly, I don't see why the freed space isn't already
> > discarded at this point.
> > Relocation creates delayed references to drop extents from the block
> > group, and when the delayed references are run, we pin the extents
> > through:
> >
> > __btrfs_free_extent() -> btrfs_update_block_group()
> >
> > Then at the very end of the transaction commit, we discard pinned
> > extents and then unpin them, at btrfs_finish_extent_commit().
> > Relocation commits the transaction at the end of
> > relocate_block_group(), so the delayed references are fun, and then we
> > discard and unpin their extents.
> > So that's why I don't get it why the discard is needed here (at least
> > when we are not on a zoned filesystem).
>
> This is a good point to investigate, but as of now, the zone isn't reset.
> Zone reset handling in btrfs is piggy backed onto discard handling, but
> from testing the patchset I see the zone isn't reset:
>
> [   81.014752] BTRFS info (device nullb0): reclaiming chunk 1073741824
> [   81.015732] BTRFS info (device nullb0): relocating block group 1073741824 
> flags data
> [   81.798090] BTRFS info (device nullb0): found 12452 extents, stage: move 
> data extents
> [   82.314562] BTRFS info (device nullb0): found 12452 extents, stage: update 
> data pointers
> # blkzone report -o $((1073741824 >> 9)) -c 1 /dev/nullb0
>   start: 0x00020, len 0x08, cap 0x08, wptr 0x0799a0 reset:0 
> non-seq:0, zcond: 2(oi) [type: 2(SEQ_WRITE_REQUIRED)]

Not familiar with zoned device details, but what you are passing to
blkzone is a logical address, in non-zoned btrfs it does not always
matches the physical address on disk.
So maybe that is not a reliable way to check it.

>
> Whereas when the this patch is applied as well:
> [   85.126542] BTRFS info (device nullb0): reclaiming chunk 1073741824
> [   85.128211] BTRFS info (device nullb0): relocating block group 1073741824 
> flags data
> [   86.061831] BTRFS info (device nullb0): found 12452 extents, stage: move 
> data extents
> [   86.766549] BTRFS info (device nullb0): found 12452 extents, stage: update 
> data pointers
> # blkzone report -c 1 -o $((1073741824 >> 9)) /dev/nullb0
>   start: 0x00020, len 0x08, cap 0x08, wptr 0x00 reset:0 
> non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
>
> As a positive side effect of this, I now have code that can be used in
> xfstests to verify

Re: [PATCH v3 1/3] btrfs: discard relocated block groups

2021-04-12 Thread Johannes Thumshirn
On 09/04/2021 13:37, Filipe Manana wrote:
> On Fri, Apr 9, 2021 at 11:54 AM Johannes Thumshirn
>  wrote:
>>
>> When relocating a block group the freed up space is not discarded. On
>> devices like SSDs this hint is useful to tell the device the space is
>> freed now. On zoned block devices btrfs' discard code will reset the zone
>> the block group is on, freeing up the occupied space.
>>
>> Signed-off-by: Johannes Thumshirn 
>> ---
>>  fs/btrfs/volumes.c | 13 +
>>  1 file changed, 13 insertions(+)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 6d9b2369f17a..d9ef8bce0cde 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -3103,6 +3103,10 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info 
>> *fs_info, u64 chunk_offset)
>> struct btrfs_root *root = fs_info->chunk_root;
>> struct btrfs_trans_handle *trans;
>> struct btrfs_block_group *block_group;
>> +   const bool trim = btrfs_is_zoned(fs_info) ||
>> +   btrfs_test_opt(fs_info, DISCARD_SYNC);
>> +   u64 trimmed;
>> +   u64 length;
>> int ret;
>>
>> /*
>> @@ -3130,6 +3134,7 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info 
>> *fs_info, u64 chunk_offset)
>> if (!block_group)
>> return -ENOENT;
>> btrfs_discard_cancel_work(&fs_info->discard_ctl, block_group);
>> +   length = block_group->length;
>> btrfs_put_block_group(block_group);
>>
>> trans = btrfs_start_trans_remove_block_group(root->fs_info,
>> @@ -3144,6 +3149,14 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info 
>> *fs_info, u64 chunk_offset)
>>  * step two, delete the device extents and the
>>  * chunk tree entries
>>  */
>> +   if (trim) {
>> +   ret = btrfs_discard_extent(fs_info, chunk_offset, length,
>> +  &trimmed);
> 
> Ideally we do IO and potentially slow operations such as discard while
> not holding a transaction handle open, to avoid slowing down anyone
> trying to commit the transaction.
> 
>> +   if (ret) {
>> +   btrfs_abort_transaction(trans, ret);
> 
> This is leaking the transaction, still need to call btrfs_end_transaction().
> 
> Making the discard before joining/starting transaction, would fix both things.

Note taken.

> 
> Now more importantly, I don't see why the freed space isn't already
> discarded at this point.
> Relocation creates delayed references to drop extents from the block
> group, and when the delayed references are run, we pin the extents
> through:
> 
> __btrfs_free_extent() -> btrfs_update_block_group()
> 
> Then at the very end of the transaction commit, we discard pinned
> extents and then unpin them, at btrfs_finish_extent_commit().
> Relocation commits the transaction at the end of
> relocate_block_group(), so the delayed references are fun, and then we
> discard and unpin their extents.
> So that's why I don't get it why the discard is needed here (at least
> when we are not on a zoned filesystem).

This is a good point to investigate, but as of now, the zone isn't reset.
Zone reset handling in btrfs is piggy backed onto discard handling, but
from testing the patchset I see the zone isn't reset:

[   81.014752] BTRFS info (device nullb0): reclaiming chunk 1073741824
[   81.015732] BTRFS info (device nullb0): relocating block group 1073741824 
flags data
[   81.798090] BTRFS info (device nullb0): found 12452 extents, stage: move 
data extents
[   82.314562] BTRFS info (device nullb0): found 12452 extents, stage: update 
data pointers 
# blkzone report -o $((1073741824 >> 9)) -c 1 /dev/nullb0
  start: 0x00020, len 0x08, cap 0x08, wptr 0x0799a0 reset:0 
non-seq:0, zcond: 2(oi) [type: 2(SEQ_WRITE_REQUIRED)]

Whereas when the this patch is applied as well:
[   85.126542] BTRFS info (device nullb0): reclaiming chunk 1073741824
[   85.128211] BTRFS info (device nullb0): relocating block group 1073741824 
flags data
[   86.061831] BTRFS info (device nullb0): found 12452 extents, stage: move 
data extents
[   86.766549] BTRFS info (device nullb0): found 12452 extents, stage: update 
data pointers
# blkzone report -c 1 -o $((1073741824 >> 9)) /dev/nullb0
  start: 0x00020, len 0x08, cap 0x08, wptr 0x00 reset:0 
non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]

As a positive side effect of this, I now have code that can be used in 
xfstests to verify the patchset.

Byte,
Johannes


Re: [PATCH v3 1/3] btrfs: discard relocated block groups

2021-04-09 Thread Filipe Manana
On Fri, Apr 9, 2021 at 11:54 AM Johannes Thumshirn
 wrote:
>
> When relocating a block group the freed up space is not discarded. On
> devices like SSDs this hint is useful to tell the device the space is
> freed now. On zoned block devices btrfs' discard code will reset the zone
> the block group is on, freeing up the occupied space.
>
> Signed-off-by: Johannes Thumshirn 
> ---
>  fs/btrfs/volumes.c | 13 +
>  1 file changed, 13 insertions(+)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 6d9b2369f17a..d9ef8bce0cde 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -3103,6 +3103,10 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info 
> *fs_info, u64 chunk_offset)
> struct btrfs_root *root = fs_info->chunk_root;
> struct btrfs_trans_handle *trans;
> struct btrfs_block_group *block_group;
> +   const bool trim = btrfs_is_zoned(fs_info) ||
> +   btrfs_test_opt(fs_info, DISCARD_SYNC);
> +   u64 trimmed;
> +   u64 length;
> int ret;
>
> /*
> @@ -3130,6 +3134,7 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info 
> *fs_info, u64 chunk_offset)
> if (!block_group)
> return -ENOENT;
> btrfs_discard_cancel_work(&fs_info->discard_ctl, block_group);
> +   length = block_group->length;
> btrfs_put_block_group(block_group);
>
> trans = btrfs_start_trans_remove_block_group(root->fs_info,
> @@ -3144,6 +3149,14 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info 
> *fs_info, u64 chunk_offset)
>  * step two, delete the device extents and the
>  * chunk tree entries
>  */
> +   if (trim) {
> +   ret = btrfs_discard_extent(fs_info, chunk_offset, length,
> +  &trimmed);

Ideally we do IO and potentially slow operations such as discard while
not holding a transaction handle open, to avoid slowing down anyone
trying to commit the transaction.

> +   if (ret) {
> +   btrfs_abort_transaction(trans, ret);

This is leaking the transaction, still need to call btrfs_end_transaction().

Making the discard before joining/starting transaction, would fix both things.

Now more importantly, I don't see why the freed space isn't already
discarded at this point.
Relocation creates delayed references to drop extents from the block
group, and when the delayed references are run, we pin the extents
through:

__btrfs_free_extent() -> btrfs_update_block_group()

Then at the very end of the transaction commit, we discard pinned
extents and then unpin them, at btrfs_finish_extent_commit().
Relocation commits the transaction at the end of
relocate_block_group(), so the delayed references are fun, and then we
discard and unpin their extents.
So that's why I don't get it why the discard is needed here (at least
when we are not on a zoned filesystem).

Thanks.




> +   return ret;
> +   }
> +   }
> ret = btrfs_remove_chunk(trans, chunk_offset);
> btrfs_end_transaction(trans);
> return ret;
> --
> 2.30.0
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


[PATCH v3 1/3] btrfs: discard relocated block groups

2021-04-09 Thread Johannes Thumshirn
When relocating a block group the freed up space is not discarded. On
devices like SSDs this hint is useful to tell the device the space is
freed now. On zoned block devices btrfs' discard code will reset the zone
the block group is on, freeing up the occupied space.

Signed-off-by: Johannes Thumshirn 
---
 fs/btrfs/volumes.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 6d9b2369f17a..d9ef8bce0cde 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3103,6 +3103,10 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info 
*fs_info, u64 chunk_offset)
struct btrfs_root *root = fs_info->chunk_root;
struct btrfs_trans_handle *trans;
struct btrfs_block_group *block_group;
+   const bool trim = btrfs_is_zoned(fs_info) ||
+   btrfs_test_opt(fs_info, DISCARD_SYNC);
+   u64 trimmed;
+   u64 length;
int ret;
 
/*
@@ -3130,6 +3134,7 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info 
*fs_info, u64 chunk_offset)
if (!block_group)
return -ENOENT;
btrfs_discard_cancel_work(&fs_info->discard_ctl, block_group);
+   length = block_group->length;
btrfs_put_block_group(block_group);
 
trans = btrfs_start_trans_remove_block_group(root->fs_info,
@@ -3144,6 +3149,14 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info 
*fs_info, u64 chunk_offset)
 * step two, delete the device extents and the
 * chunk tree entries
 */
+   if (trim) {
+   ret = btrfs_discard_extent(fs_info, chunk_offset, length,
+  &trimmed);
+   if (ret) {
+   btrfs_abort_transaction(trans, ret);
+   return ret;
+   }
+   }
ret = btrfs_remove_chunk(trans, chunk_offset);
btrfs_end_transaction(trans);
return ret;
-- 
2.30.0