Re: [PATCH] btrfs: zoned: automatically reclaim zones

2021-03-17 Thread Josef Bacik

On 3/17/21 6:38 AM, Johannes Thumshirn wrote:

When a file gets deleted on a zoned file system, the space freed is not
returned to the block group's free space, but is migrated to
zone_unusable.

As this zone_unusable space is behind the current write pointer it is not
possible to use it for new allocations. In the current implementation a
zone is reset once all of the block group's space is accounted as zone
unusable.

This behaviour can lead to premature ENOSPC errors on a busy file system.

Instead of only reclaiming the zone once it is completely unusable,
kick off a reclaim job once the amount of unusable bytes exceeds a user
configurable threshold between 51% and 100%.

Similar to reclaiming unused block groups, these dirty block groups are
added to a to_reclaim list, and on transaction commit the reclaim process
is triggered, but only after unused block groups have been deleted, which
frees up space for the relocation process.
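
To make the mechanism a bit more concrete, queueing a block group for reclaim
could look roughly like the sketch below, mirroring how btrfs_mark_bg_unused()
queues empty block groups. The helper name, the bg_reclaim_threshold field and
the use of unused_bgs_lock are assumptions for illustration, not taken from
the patch:

static void maybe_queue_bg_for_reclaim(struct btrfs_block_group *bg)
{
	struct btrfs_fs_info *fs_info = bg->fs_info;
	u64 thresh;

	/* Hypothetical per-fs knob, exposed via the new sysfs tunable. */
	thresh = div_u64(bg->length * fs_info->bg_reclaim_threshold, 100);

	if (bg->zone_unusable < thresh)
		return;

	spin_lock(&fs_info->unused_bgs_lock);
	if (list_empty(&bg->bg_list)) {
		btrfs_get_block_group(bg);
		list_add_tail(&bg->bg_list, &fs_info->reclaim_bgs);
	}
	spin_unlock(&fs_info->unused_bgs_lock);
}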

Signed-off-by: Johannes Thumshirn 


I think this is generally useful, but we should indicate to the user that it's 
happening via automatic cleanup and not because of somebody using btrfs balance 
start.
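
One way to do that could be an info-level message in the reclaim loop right
before the relocation starts, for example (just a sketch, not a proposal for
exact wording):

	/* Tell the user this relocation is automatic reclaim, not balance. */
	btrfs_info(fs_info,
		   "auto reclaiming block group %llu (%llu bytes zone_unusable)",
		   bg->start, bg->zone_unusable);
	ret = btrfs_relocate_chunk(fs_info, bg->start);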


As for doing it on non-zoned, I'm still hesitant, simply because I haven't
been able to finish working on the relocation stuff.  Once my current set of
patches gets merged I'll pick it back up, but there still remain some pretty
unpleasant side-effects of balance.


Chris and I have been discussing your RAID5/6 plan, and we both really, really 
like the idea of the stripe root from the point of view that it makes relocation 
insanely straightforward and simple.  I think once we have that in place I'd 
feel a lot more comfortable running relocation automatically everywhere, as 
it'll be a much faster and less error prone operation.  Thanks,


Josef


Re: [PATCH] btrfs: zoned: automatically reclaim zones

2021-03-17 Thread Filipe Manana
On Wed, Mar 17, 2021 at 10:40 AM Johannes Thumshirn wrote:
>
> When a file gets deleted on a zoned file system, the space freed is not
> returned to the block group's free space, but is migrated to
> zone_unusable.
>
> As this zone_unusable space is behind the current write pointer it is not
> possible to use it for new allocations. In the current implementation a
> zone is reset once all of the block group's space is accounted as zone
> unusable.
>
> This behaviour can lead to premature ENOSPC errors on a busy file system.
>
> Instead of only reclaiming the zone once it is completely unusable,
> kick off a reclaim job once the amount of unusable bytes exceeds a user
> configurable threshold between 51% and 100%.
>
> Similar to reclaiming unused block groups, these dirty block groups are
> added to a to_reclaim list, and on transaction commit the reclaim process
> is triggered, but only after unused block groups have been deleted, which
> frees up space for the relocation process.
>
> Signed-off-by: Johannes Thumshirn 
> ---
>  fs/btrfs/block-group.c   | 85 
>  fs/btrfs/block-group.h   |  2 +
>  fs/btrfs/ctree.h |  5 +++
>  fs/btrfs/disk-io.c   | 11 +
>  fs/btrfs/free-space-cache.c  |  9 +++-
>  fs/btrfs/sysfs.c | 35 +++
>  fs/btrfs/volumes.c   |  2 +-
>  fs/btrfs/volumes.h   |  1 +
>  include/trace/events/btrfs.h | 12 +
>  9 files changed, 160 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index 85077c95b4f7..7f8ea2888e4c 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -1485,6 +1485,81 @@ void btrfs_mark_bg_unused(struct btrfs_block_group *bg)
> spin_unlock(&fs_info->unused_bgs_lock);
>  }
>
> +void btrfs_reclaim_bgs(struct btrfs_fs_info *fs_info)
> +{
> +   struct btrfs_block_group *bg;
> +   struct btrfs_space_info *space_info;
> +   int ret = 0;
> +
> +   if (!test_bit(BTRFS_FS_OPEN, &fs_info->flags))
> +   return;
> +
> +   if (!btrfs_exclop_start(fs_info, BTRFS_EXCLOP_BALANCE))
> +   return;
> +
> +   mutex_lock(&fs_info->delete_unused_bgs_mutex);
> +   mutex_lock(&fs_info->reclaim_bgs_lock);

Could we just use delete_unused_bgs_mutex? I think you are using it
only because btrfs_relocate_chunk() asserts it's being held.

Just renaming delete_unused_bgs_mutex to a more generic name like
reclaim_bgs_lock, and using that, should work.
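
With that rename, the prologue of btrfs_reclaim_bgs() would take a single
mutex; a sketch of the suggestion (assuming the assertion in
btrfs_relocate_chunk() is updated to check the renamed lock):

	mutex_lock(&fs_info->reclaim_bgs_lock);	/* was delete_unused_bgs_mutex */
	while (!list_empty(&fs_info->reclaim_bgs)) {
		bg = list_first_entry(&fs_info->reclaim_bgs,
				      struct btrfs_block_group, bg_list);
		list_del_init(&bg->bg_list);
		/* ... mark read-only and relocate, as in the patch ... */
	}
	mutex_unlock(&fs_info->reclaim_bgs_lock);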

> +   while (!list_empty(&fs_info->reclaim_bgs)) {
> +   bg = list_first_entry(&fs_info->reclaim_bgs,
> + struct btrfs_block_group,
> + bg_list);
> +   list_del_init(&bg->bg_list);
> +
> +   space_info = bg->space_info;
> +   mutex_unlock(&fs_info->reclaim_bgs_lock);
> +
> +   down_write(&space_info->groups_sem);

A comment on why we lock groups_sem would be good to have here, just
like at btrfs_delete_unused_bgs().
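
For reference, the shape of that comment in btrfs_delete_unused_bgs() is
roughly the following (wording paraphrased from memory, treat it as an
assumption):

	/* Don't want to race with allocators, so take the groups_sem. */
	down_write(&space_info->groups_sem);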

> +
> +   spin_lock(&bg->lock);
> +   if (bg->reserved || bg->pinned || bg->ro ||
> +   list_is_singular(&bg->list)) {
> +   /*
> +* We want to bail if we made new allocations or have
> +* outstanding allocations in this block group.  We do
> +* the ro check in case balance is currently acting on
> +* this block group.

Why is the list_is_singular() check there?

This was copied from the empty block group removal case -
btrfs_delete_unused_bgs(), but here I don't see why it's needed, since
we relocate rather than remove block groups.
The list_is_singular() from btrfs_delete_unused_bgs() is there to
prevent losing the raid profile for data when the last data block
group is removed (which implies not using space cache v1).
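
If the check is indeed unnecessary here, the bail-out condition would reduce
to the following (a sketch of the suggested simplification, otherwise
identical to the patch):

	spin_lock(&bg->lock);
	if (bg->reserved || bg->pinned || bg->ro) {
		/* New allocations, or balance is acting on this group. */
		spin_unlock(&bg->lock);
		up_write(&space_info->groups_sem);
		goto next;
	}
	spin_unlock(&bg->lock);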

Other than that, overall it looks good.

Thanks.

> +*/
> +   spin_unlock(&bg->lock);
> +   up_write(&space_info->groups_sem);
> +   goto next;
> +   }
> +   spin_unlock(&bg->lock);
> +
> +   ret = inc_block_group_ro(bg, 0);
> +   up_write(&space_info->groups_sem);
> +   if (ret < 0) {
> +   ret = 0;
> +   goto next;
> +   }
> +
> +   trace_btrfs_reclaim_block_group(bg);
> +   ret = btrfs_relocate_chunk(fs_info, bg->start);
> +   if (ret)
> +   btrfs_err(fs_info, "error relocating chunk %llu",
> + bg->start);
> +
> +next:
> +   btrfs_put_block_group(bg);
> +   mutex_lock(&fs_info->reclaim_bgs_lock);
> +   }
> +   mutex_unlock(&fs_info->reclaim_bgs_lock);
> +   mutex_unlock(&fs_info->delete_unused_bgs_mutex);

Re: [PATCH] btrfs: zoned: automatically reclaim zones

2021-03-17 Thread Filipe Manana
On Wed, Mar 17, 2021 at 3:31 PM Johannes Thumshirn wrote:
>
> On 17/03/2021 16:26, Filipe Manana wrote:
> >> +void btrfs_reclaim_bgs(struct btrfs_fs_info *fs_info)
> >> +{
> >> +   struct btrfs_block_group *bg;
> >> +   struct btrfs_space_info *space_info;
> >> +   int ret = 0;
> >> +
> >> +   if (!test_bit(BTRFS_FS_OPEN, &fs_info->flags))
> >> +   return;
> >> +
> >> +   if (!btrfs_exclop_start(fs_info, BTRFS_EXCLOP_BALANCE))
> >> +   return;
> >> +
> >> +   mutex_lock(&fs_info->delete_unused_bgs_mutex);
> >> +   mutex_lock(&fs_info->reclaim_bgs_lock);
> >
> > Could we just use delete_unused_bgs_mutex? I think you are using it
> > only because btrfs_relocate_chunk() asserts it's being held.
> >
> > Just renaming delete_unused_bgs_mutex to a more generic name like
> > reclaim_bgs_lock, and using that, should work.
>
> Do I understand you correctly, that btrfs_delete_unused_bgs and btrfs_reclaim_bgs
> should use the same mutex? I was thinking about that but then didn't want to
> have one mutex protecting two lists.

Correct, that's what I meant.

Since btrfs_delete_unused_bgs() and btrfs_reclaim_bgs() can never run
in parallel, and since they are already locking
delete_unused_bgs_mutex,
I don't see a reason not to do it.

The other cases that take the mutex, adding block groups to one of
the lists, are fast and relatively rare, especially for empty block
groups.

>
> >
> >> +   while (!list_empty(&fs_info->reclaim_bgs)) {
> >> +   bg = list_first_entry(&fs_info->reclaim_bgs,
> >> + struct btrfs_block_group,
> >> + bg_list);
> >> +   list_del_init(&bg->bg_list);
> >> +
> >> +   space_info = bg->space_info;
> >> +   mutex_unlock(&fs_info->reclaim_bgs_lock);
> >> +
> >> +   down_write(&space_info->groups_sem);
> >
> > A comment on why we lock groups_sem would be good to have here, just
> > like at btrfs_delete_unused_bgs().
> >
>
> OK
>
> >> +
> >> +   spin_lock(&bg->lock);
> >> +   if (bg->reserved || bg->pinned || bg->ro ||
> >> +   list_is_singular(&bg->list)) {
> >> +   /*
> >> +* We want to bail if we made new allocations or have
> >> +* outstanding allocations in this block group.  We do
> >> +* the ro check in case balance is currently acting on
> >> +* this block group.
> >
> > Why is the list_is_singular() check there?
> >
> > This was copied from the empty block group removal case -
> > btrfs_delete_unused_bgs(), but here I don't see why it's needed, since
> > we relocate rather than remove block groups.
> > The list_is_singular() from btrfs_delete_unused_bgs() is there to
> > prevent losing the raid profile for data when the last data block
> > group is removed (which implies not using space cache v1).
> >
> > Other than that, overall it looks good.
>
> Ah OK, yes this was a copy.
>
>
> >> +   /*
> >> +* Reclaim block groups in the reclaim_bgs list after we deleted
> >> +* all unused block_groups. This possibly gives us some more free
> >> +* space.
> >> +*/
> >> +   btrfs_reclaim_bgs(fs_info);
> >
> > Now with the cleaner kthread doing relocation, which can be a very
> > slow operation, we end up delaying iputs and other work delegated to
> > it for a very long time.
> > Not saying it needs to be fixed right away, perhaps it's not that bad,
> > I'm just not sure at the moment. We already do subvolume deletion in
> > this kthread, which is also a slow operation.
>
> Noted.
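
To make the ordering concrete, the relevant part of the cleaner kthread's
loop would look roughly like this (heavily simplified; error handling and the
other cleaner duties are omitted):

	btrfs_run_delayed_iputs(fs_info);
	/* Delete empty block groups first so relocation has space to work with. */
	btrfs_delete_unused_bgs(fs_info);
	/*
	 * Then relocate the queued block groups. Relocation can take a long
	 * time, so delayed iputs and snapshot cleanup only get another chance
	 * on the next loop iteration, which is the delay pointed out above.
	 */
	btrfs_reclaim_bgs(fs_info);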



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH] btrfs: zoned: automatically reclaim zones

2021-03-17 Thread Johannes Thumshirn
On 17/03/2021 16:26, Filipe Manana wrote:
>> +void btrfs_reclaim_bgs(struct btrfs_fs_info *fs_info)
>> +{
>> +   struct btrfs_block_group *bg;
>> +   struct btrfs_space_info *space_info;
>> +   int ret = 0;
>> +
>> +   if (!test_bit(BTRFS_FS_OPEN, &fs_info->flags))
>> +   return;
>> +
>> +   if (!btrfs_exclop_start(fs_info, BTRFS_EXCLOP_BALANCE))
>> +   return;
>> +
>> +   mutex_lock(&fs_info->delete_unused_bgs_mutex);
>> +   mutex_lock(&fs_info->reclaim_bgs_lock);
> 
> Could we just use delete_unused_bgs_mutex? I think you are using it
> only because btrfs_relocate_chunk() asserts it's being held.
> 
> Just renaming delete_unused_bgs_mutex to a more generic name like
> reclaim_bgs_lock, and using that, should work.

Do I understand you correctly, that btrfs_delete_unused_bgs and btrfs_reclaim_bgs
should use the same mutex? I was thinking about that but then didn't want to
have one mutex protecting two lists.

> 
>> +   while (!list_empty(&fs_info->reclaim_bgs)) {
>> +   bg = list_first_entry(&fs_info->reclaim_bgs,
>> + struct btrfs_block_group,
>> + bg_list);
>> +   list_del_init(&bg->bg_list);
>> +
>> +   space_info = bg->space_info;
>> +   mutex_unlock(&fs_info->reclaim_bgs_lock);
>> +
>> +   down_write(&space_info->groups_sem);
> 
> A comment on why we lock groups_sem would be good to have here, just
> like at btrfs_delete_unused_bgs().
> 

OK

>> +
>> +   spin_lock(&bg->lock);
>> +   if (bg->reserved || bg->pinned || bg->ro ||
>> +   list_is_singular(&bg->list)) {
>> +   /*
>> +* We want to bail if we made new allocations or have
>> +* outstanding allocations in this block group.  We do
>> +* the ro check in case balance is currently acting on
>> +* this block group.
> 
> Why is the list_is_singular() check there?
> 
> This was copied from the empty block group removal case -
> btrfs_delete_unused_bgs(), but here I don't see why it's needed, since
> we relocate rather than remove block groups.
> The list_is_singular() from btrfs_delete_unused_bgs() is there to
> prevent losing the raid profile for data when the last data block
> group is removed (which implies not using space cache v1).
> 
> Other than that, overall it looks good.

Ah OK, yes this was a copy.


>> +   /*
>> +* Reclaim block groups in the reclaim_bgs list after we deleted
>> +* all unused block_groups. This possibly gives us some more free
>> +* space.
>> +*/
>> +   btrfs_reclaim_bgs(fs_info);
> 
> Now with the cleaner kthread doing relocation, which can be a very
> slow operation, we end up delaying iputs and other work delegated to
> it for a very long time.
> Not saying it needs to be fixed right away, perhaps it's not that bad,
> I'm just not sure at the moment. We already do subvolume deletion in
> this kthread, which is also a slow operation.

Noted.


Re: [PATCH] btrfs: zoned: automatically reclaim zones

2021-03-17 Thread Johannes Thumshirn
On 17/03/2021 15:33, David Sterba wrote:
> On Wed, Mar 17, 2021 at 07:38:11PM +0900, Johannes Thumshirn wrote:
>> When a file gets deleted on a zoned file system, the space freed is not
>> returned to the block group's free space, but is migrated to
>> zone_unusable.
>>
>> As this zone_unusable space is behind the current write pointer it is not
>> possible to use it for new allocations. In the current implementation a
>> zone is reset once all of the block group's space is accounted as zone
>> unusable.
>>
>> This behaviour can lead to premature ENOSPC errors on a busy file system.
>>
>> Instead of only reclaiming the zone once it is completely unusable,
>> kick off a reclaim job once the amount of unusable bytes exceeds a user
>> configurable threshold between 51% and 100%.
>>
>> Similar to reclaiming unused block groups, these dirty block groups are
>> added to a to_reclaim list, and on transaction commit the reclaim process
>> is triggered, but only after unused block groups have been deleted, which
>> frees up space for the relocation process.
>>
>> Signed-off-by: Johannes Thumshirn 
> 
> I'll add it as a topic branch to for-next, but I haven't reviewed it yet,
> and the first thing I see missing is a mention of the sysfs tunable in the
> changelog.
> 

Thanks, will add the sysfs tunable (and fix the typo in line 1 of the
commit msg as well).

Can we also kick off a discussion on whether this is generally useful for
btrfs and not just for a zoned FS? While we don't have low-hanging fruit like
zone_unusable as an indicator of when to balance, I think we can work against
fragmentation this way.


Re: [PATCH] btrfs: zoned: automatically reclaim zones

2021-03-17 Thread David Sterba
On Wed, Mar 17, 2021 at 07:38:11PM +0900, Johannes Thumshirn wrote:
> When a file gets deleted on a zoned file system, the space freed is not
> returned to the block group's free space, but is migrated to
> zone_unusable.
> 
> As this zone_unusable space is behind the current write pointer it is not
> possible to use it for new allocations. In the current implementation a
> zone is reset once all of the block group's space is accounted as zone
> unusable.
> 
> This behaviour can lead to premature ENOSPC errors on a busy file system.
> 
> Instead of only reclaiming the zone once it is completely unusable,
> kick off a reclaim job once the amount of unusable bytes exceeds a user
> configurable threshold between 51% and 100%.
> 
> Similar to reclaiming unused block groups, these dirty block groups are
> added to a to_reclaim list, and on transaction commit the reclaim process
> is triggered, but only after unused block groups have been deleted, which
> frees up space for the relocation process.
> 
> Signed-off-by: Johannes Thumshirn 

I'll add it as a topic branch to for-next, but I haven't reviewed it yet,
and the first thing I see missing is a mention of the sysfs tunable in the
changelog.
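
For context, a sysfs knob for this kind of threshold could look roughly like
the sketch below; the attribute name, the bg_reclaim_threshold field and its
placement in btrfs_fs_info are assumptions for illustration (the patch's
actual sysfs.c hunk is not quoted here), and the attribute would still need
to be wired up through the existing BTRFS_ATTR_RW() machinery in
fs/btrfs/sysfs.c:

static ssize_t bg_reclaim_threshold_show(struct kobject *kobj,
					 struct kobj_attribute *a, char *buf)
{
	struct btrfs_fs_info *fs_info = to_fs_info(kobj);

	return sysfs_emit(buf, "%d\n", READ_ONCE(fs_info->bg_reclaim_threshold));
}

static ssize_t bg_reclaim_threshold_store(struct kobject *kobj,
					  struct kobj_attribute *a,
					  const char *buf, size_t len)
{
	struct btrfs_fs_info *fs_info = to_fs_info(kobj);
	int thresh;
	int ret;

	ret = kstrtoint(buf, 10, &thresh);
	if (ret)
		return ret;
	/* Per the changelog, only thresholds between 51% and 100% make sense. */
	if (thresh < 51 || thresh > 100)
		return -EINVAL;
	WRITE_ONCE(fs_info->bg_reclaim_threshold, thresh);
	return len;
}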