Re: [RFC PATCH] btrfs: drop file privileges in btrfs_clone_files

2018-11-27 Thread Nikolay Borisov



On 28.11.18 г. 9:46 ч., Christoph Hellwig wrote:
> On Wed, Nov 28, 2018 at 09:44:59AM +0200, Nikolay Borisov wrote:
>>
>>
>> On 28.11.18 г. 5:07 ч., Lu Fengqi wrote:
>>> The generic/513 test reports that cloning into a file does not strip
>>> security privileges (suid, capabilities) the way a regular write would.
>>>
>>> Signed-off-by: Lu Fengqi 
>>> ---
>>> xfs and ocfs2 call generic_remap_file_range_prep to drop file
>>> privileges; I'm not sure whether btrfs should do the same thing.
>>
>> Why do you think btrfs shouldn't do the same thing? Looking at
>> remap_file_range_prep it seems that btrfs is missing a ton of useful
>> checks, e.g. immutable files, aligned offsets, etc.
> 
> Any chance we could move btrfs over to use remap_file_range_prep so that
> all file systems share the exact same checks?

I'm not very familiar with that code; Filipe is, so adding him to CC.
But IMO we should do that provided there are no blockers.

Filipe, what do you think, is it feasible?

> 


Re: [RFC PATCH] btrfs: drop file privileges in btrfs_clone_files

2018-11-27 Thread Christoph Hellwig
On Wed, Nov 28, 2018 at 09:44:59AM +0200, Nikolay Borisov wrote:
> 
> 
> On 28.11.18 г. 5:07 ч., Lu Fengqi wrote:
> > The generic/513 test reports that cloning into a file does not strip
> > security privileges (suid, capabilities) the way a regular write would.
> > 
> > Signed-off-by: Lu Fengqi 
> > ---
> > xfs and ocfs2 call generic_remap_file_range_prep to drop file
> > privileges; I'm not sure whether btrfs should do the same thing.
> 
> Why do you think btrfs shouldn't do the same thing? Looking at
> remap_file_range_prep it seems that btrfs is missing a ton of useful
> checks, e.g. immutable files, aligned offsets, etc.

Any chance we could move btrfs over to use remap_file_range_prep so that
all file systems share the exact same checks?
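
For reference, a rough sketch of what such a conversion could look like.
This is illustrative only, not the btrfs code at this point; the helper name
and signature are those of generic_remap_file_range_prep() in v4.20, while
the wrapper name and the btrfs-specific part are made up:

static int btrfs_remap_file_range_prep_sketch(struct file *file_in,
		loff_t pos_in, struct file *file_out, loff_t pos_out,
		loff_t *len, unsigned int remap_flags)
{
	int ret;

	/*
	 * The generic helper performs the shared checks: offset/length
	 * alignment, EOF handling, overlapping ranges, and it strips
	 * suid/caps on the destination via file_remove_privs().
	 */
	ret = generic_remap_file_range_prep(file_in, pos_in, file_out,
					    pos_out, len, remap_flags);
	if (ret < 0 || *len == 0)
		return ret;

	/* btrfs-specific restrictions (same filesystem, etc.) would follow. */
	return 0;
}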


Re: [RFC PATCH] btrfs: drop file privileges in btrfs_clone_files

2018-11-27 Thread Nikolay Borisov



On 28.11.18 г. 5:07 ч., Lu Fengqi wrote:
> The generic/513 test reports that cloning into a file does not strip
> security privileges (suid, capabilities) the way a regular write would.
> 
> Signed-off-by: Lu Fengqi 
> ---
> xfs and ocfs2 call generic_remap_file_range_prep to drop file
> privileges; I'm not sure whether btrfs should do the same thing.

Why do you think btrfs shouldn't do the same thing? Looking at
remap_file_range_prep it seems that btrfs is missing a ton of useful
checks, e.g. immutable files, aligned offsets, etc.


> 
> Any suggestion?
> 
>  fs/btrfs/ioctl.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 410c7e007ba8..bc33c480603b 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -4312,6 +4312,10 @@ static noinline int btrfs_clone_files(struct file 
> *file, struct file *file_src,
>   goto out_unlock;
>   }
>  
> + ret = file_remove_privs(file);
> + if (ret)
> + goto out_unlock;
> +
>   if (destoff > inode->i_size) {
>   ret = btrfs_cont_expand(inode, inode->i_size, destoff);
>   if (ret)
> 


Re: Balance: invalid convert data profile raid10

2018-11-27 Thread Mikko Merikivi
Well, excuse me for thinking it wouldn't need that many devices, since in md-raid it worked.
https://wiki.archlinux.org/index.php/RAID#RAID_level_comparison

Anyway, the error message is truly confusing for someone who doesn't
know about btrfs's implementation. I suppose in md-raid the near
layout is actually RAID 1 and far2 uses twice as much space.
https://en.wikipedia.org/wiki/Non-standard_RAID_levels#LINUX-MD-RAID-10

Well, I guess I'll go with RAID 1, then. Thanks for clearing up the confusion.
ke 28. marrask. 2018 klo 3.14 Qu Wenruo (quwenruo.bt...@gmx.com) kirjoitti:
>
>
>
> On 2018/11/28 上午5:16, Mikko Merikivi wrote:
> > I seem unable to convert an existing btrfs device array to RAID 10.
> > Since it's pretty much RAID 0 and 1 combined, and 5 and 6 are
> > unstable, it's what I would like to use.
> >
> > After I tried this with 4.19.2-arch1-1-ARCH and btrfs-progs v4.19,
> > I updated my system and tried btrfs balance again with this system
> > information:
> > [mikko@localhost lvdata]$ uname -a
> > Linux localhost 4.19.4-arch1-1-ARCH #1 SMP PREEMPT Fri Nov 23 09:06:58
> > UTC 2018 x86_64 GNU/Linux
> > [mikko@localhost lvdata]$ btrfs --version
> > btrfs-progs v4.19
> > [mikko@localhost lvdata]$ sudo btrfs fi show
> > Label: 'main1'  uuid: c7cbb9c3-8c55-45f1-b03c-48992efe2f11
> > Total devices 1 FS bytes used 2.90TiB
> > devid1 size 3.64TiB used 2.91TiB path /dev/mapper/main
> >
> > Label: 'red'  uuid: f3c781a8-0f3e-4019-acbf-0b783cf566d0
> > Total devices 2 FS bytes used 640.00KiB
> > devid1 size 931.51GiB used 2.03GiB path /dev/mapper/red1
> > devid2 size 931.51GiB used 2.03GiB path /dev/mapper/red2
>
> RAID10 needs at least 4 devices.
>
> Thanks,
> Qu
>
> > [mikko@localhost lvdata]$ btrfs fi df /mnt/red/
> > Data, RAID1: total=1.00GiB, used=512.00KiB
> > System, RAID1: total=32.00MiB, used=16.00KiB
> > Metadata, RAID1: total=1.00GiB, used=112.00KiB
> > GlobalReserve, single: total=16.00MiB, used=0.00B
> >
> > ---
> >
> > Here are the steps I originally used:
> >
> > [mikko@localhost lvdata]$ sudo cryptsetup luksFormat -s 512
> > --use-random /dev/sdc
> > [mikko@localhost lvdata]$ sudo cryptsetup luksFormat -s 512
> > --use-random /dev/sdd
> > [mikko@localhost lvdata]$ sudo cryptsetup luksOpen /dev/sdc red1
> > [mikko@localhost lvdata]$ sudo cryptsetup luksOpen /dev/sdd red2
> > [mikko@localhost lvdata]$ sudo mkfs.btrfs -L red /dev/mapper/red1
> > btrfs-progs v4.19
> > See http://btrfs.wiki.kernel.org for more information.
> >
> > Label:  red
> > UUID:   f3c781a8-0f3e-4019-acbf-0b783cf566d0
> > Node size:  16384
> > Sector size:4096
> > Filesystem size:931.51GiB
> > Block group profiles:
> >   Data: single8.00MiB
> >   Metadata: DUP   1.00GiB
> >   System:   DUP   8.00MiB
> > SSD detected:   no
> > Incompat features:  extref, skinny-metadata
> > Number of devices:  1
> > Devices:
> >IDSIZE  PATH
> > 1   931.51GiB  /dev/mapper/red1
> >
> > [mikko@localhost lvdata]$ sudo mount -t btrfs -o
> > defaults,noatime,nodiratime,autodefrag,compress=lzo /dev/mapper/red1
> > /mnt/red
> > [mikko@localhost lvdata]$ sudo btrfs device add /dev/mapper/red2 /mnt/red
> > [mikko@localhost lvdata]$ sudo btrfs balance start -dconvert=raid10
> > -mconvert=raid10 /mnt/red
> > ERROR: error during balancing '/mnt/red': Invalid argument
> > There may be more info in syslog - try dmesg | tail
> > code 1
> >
> > [mikko@localhost lvdata]$ dmesg | tail
> > [12026.263243] BTRFS info (device dm-1): disk space caching is enabled
> > [12026.263244] BTRFS info (device dm-1): has skinny extents
> > [12026.263245] BTRFS info (device dm-1): flagging fs with big metadata 
> > feature
> > [12026.275153] BTRFS info (device dm-1): checking UUID tree
> > [12195.431766] BTRFS info (device dm-1): enabling auto defrag
> > [12195.431769] BTRFS info (device dm-1): use lzo compression, level 0
> > [12195.431770] BTRFS info (device dm-1): disk space caching is enabled
> > [12195.431771] BTRFS info (device dm-1): has skinny extents
> > [12205.815941] BTRFS info (device dm-1): disk added /dev/mapper/red2
> > [12744.788747] BTRFS error (device dm-1): balance: invalid convert
> > data profile raid10
> >
> > Converting to RAID 1 did work but what can I do to make it RAID 10?
> > With the up-to-date system it still says "invalid convert data profile
> > raid10".
> >
>


Re: [PATCH] btrfs: adjust order of unlocks in do_trimming()

2018-11-27 Thread Nikolay Borisov



On 28.11.18 г. 5:21 ч., Su Yue wrote:
> In function do_trimming(), block_group->lock should be unlocked first.
> 
> Fixes: 7fe1e6415026 ("Btrfs: rewrite btrfs_trim_block_group()")
> Signed-off-by: Su Yue 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/free-space-cache.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index 4ba0aedc878b..3731fa92df56 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -3149,8 +3149,8 @@ static int do_trimming(struct btrfs_block_group_cache 
> *block_group,
>   space_info->bytes_readonly += reserved_bytes;
>   block_group->reserved -= reserved_bytes;
>   space_info->bytes_reserved -= reserved_bytes;
> - spin_unlock(&space_info->lock);
>   spin_unlock(&block_group->lock);
> + spin_unlock(&space_info->lock);
>   }
>  
>   return ret;
> 


Re: [PATCH 3/3] btrfs: remove redundant nowait check for buffered_write

2018-11-27 Thread Nikolay Borisov



On 28.11.18 г. 5:23 ч., Lu Fengqi wrote:
> The generic_write_checks will check the combination of IOCB_NOWAIT and
> !IOCB_DIRECT.

True, however btrfs will return EOPNOTSUPP whereas the generic code returns
EINVAL. I guess this is not a big deal and it's likely the generic code is
correct, so:

Reviewed-by: Nikolay Borisov 
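
For reference, the check in generic_write_checks() that makes the btrfs copy
redundant looks roughly like this (paraphrased from mm/filemap.c around
v4.20, so treat the exact shape as an assumption); note it returns -EINVAL
rather than -EOPNOTSUPP:

	/* NOWAIT writes are only supported for direct I/O */
	if ((iocb->ki_flags & IOCB_NOWAIT) &&
	    !(iocb->ki_flags & IOCB_DIRECT))
		return -EINVAL;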

> 
> Signed-off-by: Lu Fengqi 
> ---
>  fs/btrfs/file.c | 4 
>  1 file changed, 4 deletions(-)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 3835bb8c146d..190db9a685a2 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1889,10 +1889,6 @@ static ssize_t btrfs_file_write_iter(struct kiocb 
> *iocb,
>   loff_t oldsize;
>   int clean_page = 0;
>  
> - if (!(iocb->ki_flags & IOCB_DIRECT) &&
> - (iocb->ki_flags & IOCB_NOWAIT))
> - return -EOPNOTSUPP;
> -
>   if (!inode_trylock(inode)) {
>   if (iocb->ki_flags & IOCB_NOWAIT)
>   return -EAGAIN;
> 


Re: [PATCH 1/3] btrfs: remove always true if branch in find_delalloc_range

2018-11-27 Thread Nikolay Borisov



On 28.11.18 г. 5:21 ч., Lu Fengqi wrote:
> The @found is always false when it comes to the if branch. Besides, the
> bool type is more suitable for @found.

Well, if you are changing the type of the found variable it also makes sense
to change the return value of the function to bool as well.
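
A sketch of that suggestion (illustrative, on top of the posted patch and not
part of it; the parameter list is abbreviated from memory):

static noinline bool find_delalloc_range(struct extent_io_tree *tree,
					 u64 *start, u64 *end, u64 max_bytes,
					 struct extent_state **cached_state)
{
	bool found = false;

	/* body as in the patch below */

	return found;	/* callers can then test the result directly */
}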

> 
> Signed-off-by: Lu Fengqi 
> ---
>  fs/btrfs/extent_io.c | 7 +++
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 582b4b1c41e0..b4ee3399be96 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1461,7 +1461,7 @@ static noinline u64 find_delalloc_range(struct 
> extent_io_tree *tree,
>   struct rb_node *node;
>   struct extent_state *state;
>   u64 cur_start = *start;
> - u64 found = 0;
> + bool found = false;
>   u64 total_bytes = 0;
>  
>   spin_lock(&tree->lock);
> @@ -1472,8 +1472,7 @@ static noinline u64 find_delalloc_range(struct 
> extent_io_tree *tree,
>*/
>   node = tree_search(tree, cur_start);
>   if (!node) {
> - if (!found)
> - *end = (u64)-1;
> + *end = (u64)-1;
>   goto out;
>   }
>  
> @@ -1493,7 +1492,7 @@ static noinline u64 find_delalloc_range(struct 
> extent_io_tree *tree,
>   *cached_state = state;
>   refcount_inc(&state->refs);
>   }
> - found++;
> + found = true;
>   *end = state->end;
>   cur_start = state->end + 1;
>   node = rb_next(node);
> 


Re: [PATCH 2/3] btrfs: cleanup the useless DEFINE_WAIT in cleanup_transaction

2018-11-27 Thread Nikolay Borisov



On 28.11.18 г. 5:22 ч., Lu Fengqi wrote:
> It has been useless ever since it was introduced in commit f094ac32aba3
> ("Btrfs: fix NULL pointer after aborting a transaction").
> 
> Signed-off-by: Lu Fengqi 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/transaction.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index f92c0a88c4ad..67e84939b758 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1840,7 +1840,6 @@ static void cleanup_transaction(struct 
> btrfs_trans_handle *trans, int err)
>  {
>   struct btrfs_fs_info *fs_info = trans->fs_info;
>   struct btrfs_transaction *cur_trans = trans->transaction;
> - DEFINE_WAIT(wait);
>  
>   WARN_ON(refcount_read(&cur_trans->use_count) > 1);
>  
> 


[PATCH] btrfs: qgroup: Introduce more accurate early reloc tree detection

2018-11-27 Thread Qu Wenruo
The biggest challenge for qgroup to skip reloc tree extents is to detect
correct owner of reloc tree blocks owner.

Unlike most data operations, the root of a reloc tree block can't be
easily detected.

For example, for relocation we call btrfs_copy_root to init reloc tree:

btrfs_copy_root(root=257, new_root_objectid=RELOC)
|- btrfs_inc_ref(trans, root=257, cow, 1)
   | << From this point, we won't know this eb will be used for reloc >>
   |- __btrfs_mod_ref(root=257)
      |- btrfs_inc_extent_ref()

In the above case, at the time btrfs_inc_ref() is called, none of the later
functions are aware that the extent buffer is used for a reloc tree.

This makes it extremely hard for qgroup code to detect tree block
allocated for reloc tree.

The good news is, in btrfs_copy_root(), if we find @new_root_objectid ==
RELOC, we set BTRFS_HEADER_FLAG_RELOC on that extent buffer.

We can use that flag to detect reloc tree blocks, but then we need to
modify the following functions to take an extra parameter:
- btrfs_inc_extent_ref
- btrfs_free_extent
- add_delayed_tree_ref

This parameter change affects a lot of callers, but is needed for qgroup
to reduce balance overhead.
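
Illustrative only (this particular hunk is not in the patch below): once
btrfs_copy_root() has set BTRFS_HEADER_FLAG_RELOC on the new root, a caller
that still has the extent buffer at hand could derive the new parameter from
the header flag, e.g.:

	bool for_reloc = btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC);

	ret = btrfs_inc_extent_ref(trans, root, bytenr, num_bytes, parent,
				   root_objectid, owner, offset, for_reloc);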

For benchmark, still the same memory backed VM, 4G subvolume, 16
snapshots:

                      | v4.20-rc1 + delayed* | w/ patch | diff
 ---------------------+----------------------+----------+-------
 relocated extents    | 22703                | 22610    | -0.0%
 qgroup dirty extents | 74938                | 69292    | -7.5%
 time (real)          | 24.567s              | 23.546s  | -4.1%

*: With delayed subtree scan and "btrfs: qgroup: Skip delayed data ref
for reloc trees"

Signed-off-by: Qu Wenruo 
---
For the delayed ref API parameters mess, it needs an interface refactor
to make things tidy.
In fact, with the current interface we don't even have a method to know the
real owner of a delayed ref.
It will definitely cause problems for the later qgroup + balance
optimization.
---
 fs/btrfs/ctree.h   |  5 +++--
 fs/btrfs/delayed-ref.c |  5 +++--
 fs/btrfs/delayed-ref.h |  3 ++-
 fs/btrfs/extent-tree.c | 24 +---
 fs/btrfs/file.c| 10 +-
 fs/btrfs/inode.c   |  4 ++--
 fs/btrfs/ioctl.c   |  3 ++-
 fs/btrfs/relocation.c  | 16 +---
 fs/btrfs/tree-log.c|  2 +-
 9 files changed, 44 insertions(+), 28 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index e32fcf211c8a..6f4b1e605736 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2673,7 +2673,7 @@ int btrfs_set_disk_extent_flags(struct btrfs_trans_handle 
*trans,
 int btrfs_free_extent(struct btrfs_trans_handle *trans,
  struct btrfs_root *root,
  u64 bytenr, u64 num_bytes, u64 parent, u64 root_objectid,
- u64 owner, u64 offset);
+ u64 owner, u64 offset, bool for_reloc);
 
 int btrfs_free_reserved_extent(struct btrfs_fs_info *fs_info,
   u64 start, u64 len, int delalloc);
@@ -2684,7 +2684,8 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle 
*trans);
 int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
 struct btrfs_root *root,
 u64 bytenr, u64 num_bytes, u64 parent,
-u64 root_objectid, u64 owner, u64 offset);
+u64 root_objectid, u64 owner, u64 offset,
+bool for_reloc);
 
 int btrfs_start_dirty_block_groups(struct btrfs_trans_handle *trans);
 int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 269bd6ecb8f3..685b21b2dc24 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -718,7 +718,8 @@ static void init_delayed_ref_common(struct btrfs_fs_info 
*fs_info,
  */
 int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
   u64 bytenr, u64 num_bytes, u64 parent,
-  u64 ref_root,  int level, int action,
+  u64 ref_root,  int level, bool for_reloc,
+  int action,
   struct btrfs_delayed_extent_op *extent_op,
   int *old_ref_mod, int *new_ref_mod)
 {
@@ -744,7 +745,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle 
*trans,
}
 
if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) &&
-   is_fstree(ref_root)) {
+   is_fstree(ref_root) && !for_reloc) {
record = kmalloc(sizeof(*record), GFP_NOFS);
if (!record) {
kmem_cache_free(btrfs_delayed_tree_ref_cachep, ref);
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 6c60737e55d6..35b38410e6bf 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -236,7 +236,8 @@ static inline void btrfs_put_delayed_ref_head(struct 

Re: [RFC PATCH 00/17] btrfs: implementation of priority aware allocator

2018-11-27 Thread Qu Wenruo



On 2018/11/28 上午11:11, Su Yue wrote:
> This patchset can be fetched from repo:
> https://github.com/Damenly/btrfs-devel/commits/priority_aware_allocator.
> The patchset 'btrfs: Refactor find_free_extent()' does nice work
> to simplify find_free_extent(); this patchset depends on that refactor.
> The base is the commit in kdave/misc-next:
> 
> commit fcaaa1dfa81f2f87ad88cbe0ab86a07f9f76073c (kdave/misc-next)
> Author: Nikolay Borisov 
> Date:   Tue Nov 6 16:40:20 2018 +0200
> 
> btrfs: Always try all copies when reading extent buffers
> 
> 
> This patchset introduces a new mount option named 'priority_alloc=%s',
> %s currently supports "usage" and "off". The mount option changes
> the way find_free_extent() searches block groups.
> 
> Previously, block groups are stored in a list in btrfs_space_info,
> ordered by start position. When find_free_extent() is called with no
> hint, block groups are searched one by one.
> 
> Design of priority aware allocator:
> Each block group has its own priority. Priorities are split into many
> levels, and block groups are placed into different trees according to
> their priorities. Those trees are sorted by their levels and stored in
> space_info.
> Once find_free_extent() is called, block groups in a higher priority
> level are searched before those in a lower level, so a block group with
> higher priority is more likely to be used.
> 
> Pros:
> 1) Reduce the frequency of balance.
>The block group with a higher usage rate will be used preferentially
>for allocating extents. Empty block groups whose pinned bytes are
>non-zero can then be freed.[1]
> 
> 2) The priority of an empty block group whose pinned bytes are non-zero
>will be set to the lowest.
>
> 3) Support zoned block device.[2]
>For metadata allocation, the block group in conventional zones
>will be used as much as possible regardless of usage rate.
>Will do it in future.

Personally I'm a big fan of the priority aware extent allocator.

So nice job!

>
> Cons:
> 1) Expected performance regression.
>The degree of the decline is currently unknown.
>The user can disable block group priority to get full performance.
> 
> TESTS:
> 
> If usage is used as the priority (the only available option), an empty
> block group is much harder to reuse.
> 
> About block group usage:
>   Disk: 4 x 1T HDD gathered in LVM.
> 
>   Run a script to create and delete files randomly in a loop.
>   The number of files created is double the number deleted.
> 
>   Default mount option result:
>   https://i.loli.net/2018/11/28/5bfdfdf08c760.png
> 
>   Priority aware allocator(usage) result:
>   https://i.loli.net/2018/11/28/5bfdfdf0c1b11.png
> 
>   X coordinate means total disk usage, Y coordinate means avg block
>   group usage.
> 
>   Due to fragmentation of extents, the difference is not obvious,
>   only about a 1% improvement.

I think you're using the wrong indicator to show the difference.

The real indicator should not be overall block group usage, but:
1) Number of block groups
2) Usage distribution of the block groups

If the number of block groups isn't much different, then we should go
check the distribution.
E.g. all bgs at 97% usage is not as good as mostly 100% bgs plus several
near 10% bgs.

And we should check the usage distribution between metadata and data bgs.
For data bgs we could hit some fragmentation problems, while for metadata
bgs all extents are the same size, so metadata may show better results.

Thus we could do better for the test result.

>  
> Performance regression:
>I have run sysbench on our machine with an SSD in multiple combinations;
>no obvious regression was found.
>However in theory, the new allocator may cost more time in some
>cases.

Isn't that good news? :)

> 
> [1] https://www.spinics.net/lists/linux-btrfs/msg79508.html
> [2] https://lkml.org/lkml/2018/8/16/174
> 
> ---
> Due to some reasons, including time and hardware, the use case is not
> outstanding enough.

As discussed offline, another cause would be data extent fragmentation.
E.g. we have a lot of small 4K holes but the request is a big 128M.
In that case btrfs_reserve_extent() could still trigger a new data chunk
rather than returning the 4K holes found.

Thanks,
Qu

> And some code is dirty but I couldn't find another
> way, so I named it an RFC.
>  Any comments and suggestions are welcome.
>  
> Su Yue (17):
>   btrfs: priority alloc: prepare of priority aware allocator
>   btrfs: add mount definition BTRFS_MOUNT_PRIORITY_USAGE
>   btrfs: priority alloc: introduce compute_block_group_priority/usage
>   btrfs: priority alloc: add functions to create/remove priority trees
>   btrfs: priority alloc: introduce functions to add block group to
> priority tree
>   btrfs: priority alloc: introduce three macros to mark block group
> status
>   btrfs: priority alloc: add functions to remove block group from
> priority tree
>   btrfs: priority alloc: add 

[PATCH 3/3] btrfs: remove redundant nowait check for buffered_write

2018-11-27 Thread Lu Fengqi
The generic_write_checks will check the combination of IOCB_NOWAIT and
!IOCB_DIRECT.

Signed-off-by: Lu Fengqi 
---
 fs/btrfs/file.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 3835bb8c146d..190db9a685a2 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1889,10 +1889,6 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
loff_t oldsize;
int clean_page = 0;
 
-   if (!(iocb->ki_flags & IOCB_DIRECT) &&
-   (iocb->ki_flags & IOCB_NOWAIT))
-   return -EOPNOTSUPP;
-
if (!inode_trylock(inode)) {
if (iocb->ki_flags & IOCB_NOWAIT)
return -EAGAIN;
-- 
2.19.2





[PATCH 2/3] btrfs: cleanup the useless DEFINE_WAIT in cleanup_transaction

2018-11-27 Thread Lu Fengqi
It has been useless ever since it was introduced in commit f094ac32aba3
("Btrfs: fix NULL pointer after aborting a transaction").

Signed-off-by: Lu Fengqi 
---
 fs/btrfs/transaction.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index f92c0a88c4ad..67e84939b758 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1840,7 +1840,6 @@ static void cleanup_transaction(struct btrfs_trans_handle 
*trans, int err)
 {
struct btrfs_fs_info *fs_info = trans->fs_info;
struct btrfs_transaction *cur_trans = trans->transaction;
-   DEFINE_WAIT(wait);
 
WARN_ON(refcount_read(&cur_trans->use_count) > 1);
 
-- 
2.19.2





[PATCH 1/3] btrfs: remove always true if branch in find_delalloc_range

2018-11-27 Thread Lu Fengqi
The @found is always false when it comes to the if branch. Besides, the
bool type is more suitable for @found.

Signed-off-by: Lu Fengqi 
---
 fs/btrfs/extent_io.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 582b4b1c41e0..b4ee3399be96 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1461,7 +1461,7 @@ static noinline u64 find_delalloc_range(struct 
extent_io_tree *tree,
struct rb_node *node;
struct extent_state *state;
u64 cur_start = *start;
-   u64 found = 0;
+   bool found = false;
u64 total_bytes = 0;
 
spin_lock(&tree->lock);
@@ -1472,8 +1472,7 @@ static noinline u64 find_delalloc_range(struct 
extent_io_tree *tree,
 */
node = tree_search(tree, cur_start);
if (!node) {
-   if (!found)
-   *end = (u64)-1;
+   *end = (u64)-1;
goto out;
}
 
@@ -1493,7 +1492,7 @@ static noinline u64 find_delalloc_range(struct 
extent_io_tree *tree,
*cached_state = state;
refcount_inc(&state->refs);
}
-   found++;
+   found = true;
*end = state->end;
cur_start = state->end + 1;
node = rb_next(node);
-- 
2.19.2





[PATCH] btrfs: adjust order of unlocks in do_trimming()

2018-11-27 Thread Su Yue
In function do_trimming(), block_group->lock should be unlocked first.

Fixes: 7fe1e6415026 ("Btrfs: rewrite btrfs_trim_block_group()")
Signed-off-by: Su Yue 
---
 fs/btrfs/free-space-cache.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 4ba0aedc878b..3731fa92df56 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -3149,8 +3149,8 @@ static int do_trimming(struct btrfs_block_group_cache 
*block_group,
space_info->bytes_readonly += reserved_bytes;
block_group->reserved -= reserved_bytes;
space_info->bytes_reserved -= reserved_bytes;
-   spin_unlock(&space_info->lock);
spin_unlock(&block_group->lock);
+   spin_unlock(&space_info->lock);
}
 
return ret;
-- 
2.19.1





[RFC PATCH] btrfs: drop file privileges in btrfs_clone_files

2018-11-27 Thread Lu Fengqi
The generic/513 test reports that cloning into a file does not strip
security privileges (suid, capabilities) the way a regular write would.

Signed-off-by: Lu Fengqi 
---
xfs and ocfs2 call generic_remap_file_range_prep to drop file
privileges; I'm not sure whether btrfs should do the same thing.

Any suggestion?

 fs/btrfs/ioctl.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 410c7e007ba8..bc33c480603b 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4312,6 +4312,10 @@ static noinline int btrfs_clone_files(struct file *file, 
struct file *file_src,
goto out_unlock;
}
 
+   ret = file_remove_privs(file);
+   if (ret)
+   goto out_unlock;
+
if (destoff > inode->i_size) {
ret = btrfs_cont_expand(inode, inode->i_size, destoff);
if (ret)
-- 
2.19.2





[RFC PATCH 12/17] btrfs: priority alloc: introduce find_free_extent_search()

2018-11-27 Thread Su Yue
Originally, find_free_extent() just searches block groups in space_info
one by one.

In the priority aware allocator, we search block groups in a higher
priority tree before those in a lower priority tree.

This helper unifies the above two ways for further use.
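
The intended call pattern in find_free_extent() then becomes roughly the
following (simplified sketch; see patch 13/17 for the real change):

	struct btrfs_priority_tree *pt = NULL;
	struct btrfs_block_group_cache *bg = NULL;

	while ((bg = find_free_extent_search(space_info, index, bg,
					     use_priority, &pt))) {
		/*
		 * Try to allocate from bg.  In priority mode the returned
		 * tree is held read-locked, so the caller must
		 * up_read(&pt->groups_sem) before bailing out.
		 */
	}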

Signed-off-by: Su Yue 
---
 fs/btrfs/extent-tree.c | 77 ++
 1 file changed, 77 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 74955f79fcce..5484256169dd 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7331,6 +7331,83 @@ struct find_free_extent_ctl {
 };
 
 
+/*
+ * Helper function for find_free_extent().
+ *
+ * Return next block_group_cache to use.
+ * Return NULL mean no more block group to use.
+ *
+ * If use_priority is false: return the next of @block_group in
+ * space_info->block_groups[index] or if @blockgroup is NULL, return first
+ * block group in the list.
+ *
+ * If use_priority is true: if *pt_ret is NULL, return first block group in
+ * highest priority level. If *pt_ret is NOT NULL, return the next of
+ * @block_group or if @block_group is NULL, return first available block group
+ * which located in priority level <= *pt_ret->level.
+ * The tree will be down_read if return value is not NULL.
+ *
+ */
+static struct btrfs_block_group_cache *
+find_free_extent_search(struct btrfs_space_info *space_info, int index,
+   struct btrfs_block_group_cache *block_group, bool use_priority,
+   struct btrfs_priority_tree **pt_ret)
+{
+
+   struct list_head *lentry, *head;
+   struct rb_root *root;
+   struct rb_node *node, *bg_node;
+   struct btrfs_priority_tree *pt;
+   bool read;
+
+   if (!use_priority) {
+   head = &space_info->block_groups[index];
+   if (!block_group)
+   lentry = head->next;
+   else
+   lentry = block_group->list.next;
+   if (lentry == head)
+   return NULL;
+   return list_entry(lentry, struct btrfs_block_group_cache,
+ list);
+   }
+
+   root = &space_info->priority_trees[index];
+   pt = pt_ret ? *pt_ret : NULL;
+
+   if (!pt) {
+   node = rb_first(root);
+   read = true;
+   } else {
+   node = &pt->node;
+   read = false;
+
+   if (block_group)
+   bg_node = rb_next(&block_group->node);
+   else
+   bg_node = rb_first(&pt->block_groups);
+   }
+
+   for (; node; node = rb_next(node)) {
+   pt = rb_entry(node, struct btrfs_priority_tree, node);
+   if (read) {
+   down_read(&pt->groups_sem);
+   bg_node = rb_first(&pt->block_groups);
+   }
+   for (; bg_node; bg_node = rb_next(bg_node)) {
+   if (bg_node) {
+   *pt_ret = pt;
+   return rb_entry(bg_node,
+   struct btrfs_block_group_cache, node);
+   }
+   }
+
+   up_read(&pt->groups_sem);
+   read = true;
+   }
+
+   return NULL;
+}
 /*
  * Helper function for find_free_extent().
  *
-- 
2.19.1





[RFC PATCH 16/17] btrfs: priority alloc: write bg->priority_tree->groups_sem to avoid race in btrfs_delete_unused_bgs()

2018-11-27 Thread Su Yue
If the priority aware allocator is used, bg->priority_tree->groups_sem
should be write-locked instead of space_info->groups_sem.

Signed-off-by: Su Yue 
---
 fs/btrfs/extent-tree.c | 60 +++---
 1 file changed, 44 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 6627bbe56ad5..5c9536609621 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10944,6 +10944,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info 
*fs_info)
struct btrfs_block_group_cache *block_group;
struct btrfs_space_info *space_info;
struct btrfs_trans_handle *trans;
+   bool use_priority = is_priority_alloc_enabled(fs_info);
int ret = 0;
 
if (!test_bit(BTRFS_FS_OPEN, &fs_info->flags))
@@ -10953,6 +10954,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info 
*fs_info)
while (!list_empty(&fs_info->unused_bgs)) {
u64 start, end;
int trimming;
+   struct btrfs_priority_tree *pt;
 
block_group = list_first_entry(&fs_info->unused_bgs,
   struct btrfs_block_group_cache,
@@ -10969,29 +10971,55 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info 
*fs_info)
 
mutex_lock(&fs_info->delete_unused_bgs_mutex);
 
-   /* Don't want to race with allocators so take the groups_sem */
-   down_write(&space_info->groups_sem);
-   spin_lock(&block_group->lock);
-   if (block_group->reserved || block_group->pinned ||
-   btrfs_block_group_used(&block_group->item) ||
-   block_group->ro ||
-   list_is_singular(&block_group->list)) {
+   if (use_priority) {
+   spin_lock(&block_group->lock);
+   if (block_group->reserved || block_group->pinned ||
+   btrfs_block_group_used(&block_group->item) ||
+   block_group->ro ||
+   block_group->priority != 0) {
+   trace_btrfs_skip_unused_block_group(
+   block_group);
+   spin_unlock(&block_group->lock);
+   goto next;
+   }
+   block_group->priority = PRIORITY_BG_BUSY;
+   spin_unlock(&block_group->lock);
+   pt = block_group->priority_tree;
+   down_write(&pt->groups_sem);
+   } else {
/*
-* We want to bail if we made new allocations or have
-* outstanding allocations in this block group.  We do
-* the ro check in case balance is currently acting on
-* this block group.
+* Don't want to race with allocators so take the
+* groups_sem
 */
-   trace_btrfs_skip_unused_block_group(block_group);
+   down_write(&space_info->groups_sem);
+   spin_lock(&block_group->lock);
+   if (block_group->reserved || block_group->pinned ||
+   btrfs_block_group_used(&block_group->item) ||
+   block_group->ro ||
+   list_is_singular(&block_group->list)) {
+   /*
+* We want to bail if we made new allocations
+* or have outstanding allocations in this
+* block group.  We do the ro check in case
+* balance is currently acting on this block
+* group.
+*/
+   trace_btrfs_skip_unused_block_group(
+   block_group);
+   spin_unlock(&block_group->lock);
+   up_write(&space_info->groups_sem);
+   goto next;
+   }
spin_unlock(&block_group->lock);
-   up_write(&space_info->groups_sem);
-   goto next;
}
-   spin_unlock(&block_group->lock);
 
/* We don't want to force the issue, only flip if it's ok. */
ret = inc_block_group_ro(block_group, 0);
-   up_write(&space_info->groups_sem);
+   if (use_priority)
+   up_write(&pt->groups_sem);
+   else
+   up_write(&space_info->groups_sem);
+
if (ret < 0) {
ret = 0;
goto next;
-- 
2.19.1





[RFC PATCH 08/17] btrfs: priority alloc: add btrfs_update_block_group_priority()

2018-11-27 Thread Su Yue
Introduce btrfs_update_block_group_priority() to update the block
group's priority. It will move the block group from the old tree
to the new tree if needed.

Signed-off-by: Su Yue 
---
 fs/btrfs/extent-tree.c | 76 ++
 1 file changed, 76 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index b559c9a9afc6..5ea1f2e40701 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -11466,3 +11466,79 @@ void btrfs_remove_block_group_priority(struct 
btrfs_block_group_cache *cache)
spin_unlock(>lock);
up_write(>groups_sem);
 }
+
+static inline struct btrfs_priority_tree *
+find_priority_tree(struct btrfs_space_info *sinfo, u64 flags, int level)
+{
+   int index = btrfs_bg_flags_to_raid_index(flags);
+
+   return search_priority_tree(&sinfo->priority_trees[index], level);
+}
+
+void btrfs_update_block_group_priority(struct btrfs_block_group_cache *cache)
+{
+   struct btrfs_priority_tree *old_tree, *new_tree;
+   struct rw_semaphore *front_sem, *back_sem;
+   long priority;
+   int old_level, new_level;
+
+   if (!is_priority_alloc_enabled(cache->fs_info))
+   return;
+
+   spin_lock(&cache->lock);
+   if (cache->priority != PRIORITY_BG_UPDATING) {
+   spin_unlock(&cache->lock);
+   return;
+   }
+
+   old_level = cache->priority_tree->level;
+   priority = compute_block_group_priority(cache);
+   new_level = compute_priority_level(cache->fs_info, priority);
+
+   /* no need to move the block group */
+   if (old_level == new_level) {
+   cache->priority = priority;
+   spin_unlock(&cache->lock);
+   return;
+   }
+   spin_unlock(&cache->lock);
+
+   old_tree = cache->priority_tree;
+   new_tree = find_priority_tree(cache->space_info, cache->flags,
+ new_level);
+
+   if (!old_tree || !new_tree) {
+   btrfs_err(cache->fs_info,
+"can't found priority tree %d for block_group %llu old level %d new priority 
%ld",
+ old_tree ? new_level : old_level,
+ cache->key.objectid, old_level, priority);
+   BUG();
+   }
+
+   if (old_level > new_level) {
+   front_sem = _tree->groups_sem;
+   back_sem = _tree->groups_sem;
+   } else {
+   front_sem = _tree->groups_sem;
+   back_sem = _tree->groups_sem;
+   }
+
+   down_write(front_sem);
+   down_write(back_sem);
+   spin_lock(&cache->lock);
+
+   /* the block group was removed/is removing */
+   if (cache->priority != PRIORITY_BG_UPDATING ||
+   priority != compute_block_group_priority(cache) ||
+   old_tree != cache->priority_tree)
+   goto out;
+
+   unlink_block_group_priority(old_tree, cache);
+   cache->priority = priority;
+   link_block_group_priority(new_tree, cache);
+   cache->priority_tree = new_tree;
+out:
+   spin_unlock(&cache->lock);
+   up_write(front_sem);
+   up_write(back_sem);
+}
-- 
2.19.1





[RFC PATCH 04/17] btrfs: priority alloc: add functions to create/remove priority trees

2018-11-27 Thread Su Yue
Introduce create_priority_trees() to create priority trees in
space_info.
Introduce remove_priority_trees() to remove priority trees in
space_info.
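
For reference, the structure this series relies on (pieced together from the
patches; the real definition is introduced elsewhere in the series, so treat
this as an approximation) looks roughly like:

struct btrfs_priority_tree {
	/* block groups at this priority level, indexed by key.objectid */
	struct rb_root block_groups;
	struct rw_semaphore groups_sem;
	int level;
	/* linked into space_info->priority_trees[raid index] */
	struct rb_node node;
};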

Signed-off-by: Su Yue 
---
 fs/btrfs/extent-tree.c | 94 ++
 1 file changed, 94 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 0f4c5b1e0bcc..787a68b5bdcb 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -11250,3 +11250,97 @@ static int compute_priority_level(struct btrfs_fs_info 
*fs_info,
 
return level;
 }
+
+static inline bool
+is_priority_alloc_enabled(struct btrfs_fs_info *fs_info)
+{
+   if (btrfs_test_opt(fs_info, PRIORITY_USAGE))
+   return true;
+   return false;
+}
+
+static void init_priority_tree(struct btrfs_priority_tree *pt, int level)
+{
+   pt->block_groups = RB_ROOT;
+   init_rwsem(&pt->groups_sem);
+   pt->level = level;
+}
+
+static void remove_priority_trees(struct btrfs_fs_info *fs_info,
+ struct btrfs_space_info *space_info)
+{
+   struct rb_root *root;
+   struct btrfs_priority_tree *pt, *next_pt;
+   int i;
+
+   if (!is_priority_alloc_enabled(fs_info))
+   return;
+
+   for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
+   root = &space_info->priority_trees[i];
+   rbtree_postorder_for_each_entry_safe(pt, next_pt, root, node) {
+   kfree(pt);
+   }
+   space_info->priority_trees[i] = RB_ROOT;
+   }
+}
+
+static int insert_priority_tree(struct rb_root *root,
+   struct btrfs_priority_tree *pt)
+{
+   struct rb_node **p = &root->rb_node;
+   struct rb_node *parent = NULL;
+   struct btrfs_priority_tree *tmp;
+
+   while (*p) {
+   parent = *p;
+   tmp = rb_entry(parent, struct btrfs_priority_tree, node);
+   if (pt->level > tmp->level)
+   p = &(*p)->rb_left;
+   else if (pt->level < tmp->level)
+   p = &(*p)->rb_right;
+   else
+   return -EEXIST;
+   }
+
+   rb_link_node(&pt->node, parent, p);
+   rb_insert_color(&pt->node, root);
+   return 0;
+}
+
+static int create_priority_trees(struct btrfs_fs_info *fs_info,
+struct btrfs_space_info *space_info)
+{
+   struct rb_root *root;
+   struct btrfs_priority_tree *pt;
+   int ret = 0;
+   int i, level, max_level;
+   u64 priority;
+
+   if (!is_priority_alloc_enabled(fs_info))
+   return 0;
+
+   if (btrfs_test_opt(fs_info, PRIORITY_USAGE)) {
+   priority = (u8)100 << PRIORITY_USAGE_SHIFT;
+   max_level = compute_priority_level(fs_info, priority);
+   }
+   for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
+   root = &space_info->priority_trees[i];
+
+   for (level = 0; level <= max_level; level++) {
+   pt = kzalloc(sizeof(*pt), GFP_NOFS);
+   if (!pt) {
+   ret = -ENOMEM;
+   break;
+   }
+   init_priority_tree(pt, level);
+   ret = insert_priority_tree(root, pt);
+   if (ret)
+   break;
+   }
+   }
+
+   if (ret)
+   remove_priority_trees(fs_info, space_info);
+   return ret;
+}
-- 
2.19.1





[RFC PATCH 17/17] btrfs: add mount option "priority_alloc=%s"

2018-11-27 Thread Su Yue
Add mount option "priority_alloc=%s"; currently %s only supports "usage" and
"off". The latter is used for remount.
"priority_alloc=usage" will activate the priority aware allocator.
This mount option changes the way find_free_extent() searches
block groups and may cost more time.
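
For example (device and mount point names illustrative), "mount -o
priority_alloc=usage /dev/sdX /mnt" enables the usage-based allocator, and
remounting with "priority_alloc=off" switches back to the default search
order, as the parsing code below shows.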

Signed-off-by: Su Yue 
---
 fs/btrfs/super.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index cbc9d0d2c12d..4a6ccd4c29fd 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -326,6 +326,7 @@ enum {
Opt_treelog, Opt_notreelog,
Opt_usebackuproot,
Opt_user_subvol_rm_allowed,
+   Opt_priority_allocator,
 
/* Deprecated options */
Opt_alloc_start,
@@ -393,6 +394,7 @@ static const match_table_t tokens = {
{Opt_notreelog, "notreelog"},
{Opt_usebackuproot, "usebackuproot"},
{Opt_user_subvol_rm_allowed, "user_subvol_rm_allowed"},
+   {Opt_priority_allocator, "priority_alloc=%s"},
 
/* Deprecated options */
{Opt_alloc_start, "alloc_start=%s"},
@@ -765,6 +767,18 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char 
*options,
case Opt_skip_balance:
btrfs_set_opt(info->mount_opt, SKIP_BALANCE);
break;
+   case Opt_priority_allocator:
+   if (strcmp(args[0].from, "usage") == 0) {
+   btrfs_set_and_info(info, PRIORITY_USAGE,
+  "using priority usage-aware allocator");
+   } else if (strcmp(args[0].from, "off") == 0) {
+   btrfs_clear_and_info(info, PRIORITY_USAGE,
+  "priority awareallocator disabled");
+   } else {
+   ret = -EINVAL;
+   goto out;
+   }
+   break;
 #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
case Opt_check_integrity_including_extent_data:
btrfs_info(info,
@@ -1337,6 +1351,10 @@ static int btrfs_show_options(struct seq_file *seq, 
struct dentry *dentry)
seq_puts(seq, ",inode_cache");
if (btrfs_test_opt(info, SKIP_BALANCE))
seq_puts(seq, ",skip_balance");
+   if (btrfs_test_opt(info, PRIORITY_USAGE))
+   seq_puts(seq, ",priority_alloc=usage");
+   else
+   seq_puts(seq, ",priority_alloc=off");
 #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
if (btrfs_test_opt(info, CHECK_INTEGRITY_INCLUDING_EXTENT_DATA))
seq_puts(seq, ",check_int_data");
-- 
2.19.1





[RFC PATCH 13/17] btrfs: priority alloc: modify find_free_extent() to fit priority allocator

2018-11-27 Thread Su Yue
Add member priority_tree to find_free_extent_ctl to represents the
tree using.

Modify find_free_extent to use find_free_extent_search, so it can
work in default mount option and priorit aware allocator.

Signed-off-by: Su Yue 
---
 fs/btrfs/extent-tree.c | 114 -
 1 file changed, 91 insertions(+), 23 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 5484256169dd..4c76677a54a9 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7328,6 +7328,8 @@ struct find_free_extent_ctl {
 
/* Found result */
u64 found_offset;
+
+   struct btrfs_priority_tree *priority_tree;
 };
 
 
@@ -7692,6 +7694,10 @@ static int find_free_extent_update_loop(struct 
btrfs_fs_info *fs_info,
return -ENOSPC;
 }
 
+static inline bool
+is_priority_alloc_enabled(struct btrfs_fs_info *fs_info);
+
+static long compute_block_group_priority(struct btrfs_block_group_cache *bg);
 /*
  * walks the btree of allocated extents and find a hole of a given size.
  * The key ins is changed to record the hole:
@@ -7729,6 +7735,7 @@ static noinline int find_free_extent(struct btrfs_fs_info 
*fs_info,
struct btrfs_space_info *space_info;
bool use_cluster = true;
bool full_search = false;
+   bool use_priority = is_priority_alloc_enabled(fs_info);
 
WARN_ON(num_bytes < fs_info->sectorsize);
 
@@ -7744,6 +7751,7 @@ static noinline int find_free_extent(struct btrfs_fs_info 
*fs_info,
ffe_ctl.have_caching_bg = false;
ffe_ctl.orig_have_caching_bg = false;
ffe_ctl.found_offset = 0;
+   ffe_ctl.priority_tree = NULL;
 
ins->type = BTRFS_EXTENT_ITEM_KEY;
ins->objectid = 0;
@@ -7813,40 +7821,82 @@ static noinline int find_free_extent(struct 
btrfs_fs_info *fs_info,
 */
if (block_group && block_group_bits(block_group, flags) &&
block_group->cached != BTRFS_CACHE_NO) {
-   down_read(&space_info->groups_sem);
-   if (list_empty(&block_group->list) ||
-   block_group->ro) {
-   /*
-* someone is removing this block group,
-* we can't jump into the have_block_group
-* target because our list pointers are not
-* valid
-*/
-   btrfs_put_block_group(block_group);
-   up_read(&space_info->groups_sem);
+   if (use_priority) {
+   spin_lock(&block_group->lock);
+   if (block_group->priority ==
+   PRIORITY_BG_DELETED ||
+   block_group->priority ==
+   PRIORITY_BG_BUSY ||
+   block_group->ro) {
+   spin_unlock(&block_group->lock);
+   btrfs_put_block_group(block_group);
+   goto search;
+   }
+
+   block_group->priority = PRIORITY_BG_BUSY;
+   spin_unlock(&block_group->lock);
+   ffe_ctl.priority_tree =
+   block_group->priority_tree;
+   down_read(&ffe_ctl.priority_tree->groups_sem);
} else {
-   ffe_ctl.index = btrfs_bg_flags_to_raid_index(
-   block_group->flags);
-   btrfs_lock_block_group(block_group, delalloc);
-   goto have_block_group;
+   down_read(&space_info->groups_sem);
+   if (list_empty(&block_group->list) ||
+   block_group->ro) {
+   /*
+* someone is removing this block
+* group, we can't jump into the
+* have_block_group target because our
+* list pointers are not valid
+*/
+   btrfs_put_block_group(block_group);
+   up_read(&space_info->groups_sem);
+   goto search;
+   }
}
+   ffe_ctl.index = btrfs_bg_flags_to_raid_index(
+   block_group->flags);
+   btrfs_lock_block_group(block_group, delalloc);
+   goto have_block_group;
} else if 

[RFC PATCH 06/17] btrfs: priority alloc: introduce three macros to mark block group status

2018-11-27 Thread Su Yue
In the original design, the order of space_info::block_groups changes only
when block groups are added or removed.

In the design of the priority aware allocator, block groups may be moved
from one priority tree to another. The operation is usually:
1)  lock block_group
down_write first_tree
down_write second_tree
do unlink

do link
up_write second_tree
up_write first_tree
unlock block_group
However, the expected order in find_free_extent() is
2)  down_read tree
lock block_group
...
unlock block_group
up_read tree

Obviously, the lock orders of operation 1 and operation 2 are opposite
and will cause a deadlock.

The ugly method is to introduce special status values to mark the block
group's state.
Then:

1) After the priority changed:
lock block_group
mark block_group to be updated
unlock block_group

down_write first_tree
down_write second_tree

lock block_group
check block group should be moved
do unlink

do link
unlock block_group
up_write second_tree
up_write first_tree

2) find_free_extent():
down_read tree
lock block_group
check the block group is not in special status
...
unlock block_group
up_read tree

This patch introduces three macros to represent that a block group is
being removed / needs to be updated / is busy.
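
The resulting reader-side pattern in find_free_extent() then looks roughly
like this (sketch only; see patch 13/17 for the real code):

	down_read(&pt->groups_sem);
	spin_lock(&block_group->lock);
	if (block_group->priority == PRIORITY_BG_DELETED ||
	    block_group->priority == PRIORITY_BG_BUSY) {
		/* busy or going away, skip it without touching the tree */
		spin_unlock(&block_group->lock);
	} else {
		/* safe to use this block group */
		spin_unlock(&block_group->lock);
	}
	up_read(&pt->groups_sem);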

Signed-off-by: Su Yue 
---
 fs/btrfs/extent-tree.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index d63078930a1e..5bae757786dc 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -31,6 +31,15 @@
 
 #undef SCRAMBLE_DELAYED_REFS
 
+/* The block group is used by find_free_extent() */
+#define PRIORITY_BG_BUSY -1
+
+/* The block group is in remove or removed */
+#define PRIORITY_BG_DELETED -2
+
+/* The block group' priority needs to be updated */
+#define PRIORITY_BG_UPDATING -3
+
 /*
  * control flags for do_chunk_alloc's force field
  * CHUNK_ALLOC_NO_FORCE means to only allocate a chunk
-- 
2.19.1





[RFC PATCH 03/17] btrfs: priority alloc: introduce compute_block_group_priority/usage

2018-11-27 Thread Su Yue
Introduce compute_block_group_usage() and compute_block_group_priority().
And call the latter in btrfs_make_block_group() and
btrfs_read_block_groups().

compute_priority_level() uses ilog2(free) to compute the priority level.
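
As a worked example of the mapping (derived from compute_priority_level()
below, and assuming the higher-level-first search order of patch 12/17): a
block group at 97% usage has free = 3, ilog2(3) = 1, so it lands at level
ilog2(100) + 1 - 1 = 6 and is tried early, while a completely unused block
group (free = 100) maps to level 0 and is only tried last.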

Signed-off-by: Su Yue 
---
 fs/btrfs/extent-tree.c | 60 ++
 1 file changed, 60 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index d242a1174e50..0f4c5b1e0bcc 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10091,6 +10091,7 @@ static int check_chunk_block_group_mappings(struct 
btrfs_fs_info *fs_info)
return ret;
 }
 
+static long compute_block_group_priority(struct btrfs_block_group_cache *bg);
 int btrfs_read_block_groups(struct btrfs_fs_info *info)
 {
struct btrfs_path *path;
@@ -10224,6 +10225,8 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 
link_block_group(cache);
 
+   cache->priority = compute_block_group_priority(cache);
+
set_avail_alloc_bits(info, cache->flags);
if (btrfs_chunk_readonly(info, cache->key.objectid)) {
inc_block_group_ro(cache, 1);
@@ -10373,6 +10376,8 @@ int btrfs_make_block_group(struct btrfs_trans_handle 
*trans, u64 bytes_used,
 
link_block_group(cache);
 
+   cache->priority = compute_block_group_priority(cache);
+
list_add_tail(&cache->bg_list, &trans->new_bgs);
 
set_avail_alloc_bits(fs_info, type);
@@ -11190,3 +11195,58 @@ void btrfs_mark_bg_unused(struct 
btrfs_block_group_cache *bg)
}
spin_unlock(&fs_info->unused_bgs_lock);
 }
+
+enum btrfs_priority_shift {
+   PRIORITY_USAGE_SHIFT = 0
+};
+
+static inline u8
+compute_block_group_usage(struct btrfs_block_group_cache *cache)
+{
+   u64 num_bytes;
+   u8 usage;
+
+   num_bytes = cache->reserved + cache->bytes_super +
+   btrfs_block_group_used(&cache->item);
+
+   usage = div_u64(num_bytes, div_factor_fine(cache->key.offset, 1));
+
+   return usage;
+}
+
+static long compute_block_group_priority(struct btrfs_block_group_cache *bg)
+{
+   u8 usage;
+   long priority = 0;
+
+   if (btrfs_test_opt(bg->fs_info, PRIORITY_USAGE)) {
+   usage = compute_block_group_usage(bg);
+   priority |= usage << PRIORITY_USAGE_SHIFT;
+   }
+
+   return priority;
+}
+
+static int compute_priority_level(struct btrfs_fs_info *fs_info,
+ long priority)
+{
+   int level = 0;
+
+   if (btrfs_test_opt(fs_info, PRIORITY_USAGE)) {
+   u8 free;
+
+   WARN_ON(priority < 0);
+   free = 100 - (priority >> PRIORITY_USAGE_SHIFT);
+
+   if (free == 0)
+   level = 0;
+   else if (free == 100)
+   level = ilog2(free) + 1;
+   else
+   level = ilog2(free);
+   /* log2(1) == 0, leave level 0 for unused block_groups */
+   level = ilog2(100) + 1 - level;
+   }
+
+   return level;
+}
-- 
2.19.1





[RFC PATCH 11/17] btrfs: priority alloc: remove block group from priority tree while removing block group

2018-11-27 Thread Su Yue
Export btrfs_remove_block_group_priority() to header ctree.h.
Call btrfs_remove_block_group_priority() while deleting
transaction->deleted_bgs, btrfs_free_block_groups() and
btrfs_remove_block_group().

Signed-off-by: Su Yue 
---
 fs/btrfs/ctree.h   | 1 +
 fs/btrfs/extent-tree.c | 3 +++
 fs/btrfs/transaction.c | 1 +
 3 files changed, 5 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4c56baf9f7cf..091b878e326c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2752,6 +2752,7 @@ u64 btrfs_data_alloc_profile(struct btrfs_fs_info 
*fs_info);
 u64 btrfs_metadata_alloc_profile(struct btrfs_fs_info *fs_info);
 u64 btrfs_system_alloc_profile(struct btrfs_fs_info *fs_info);
 void btrfs_clear_space_info_full(struct btrfs_fs_info *info);
+void btrfs_remove_block_group_priority(struct btrfs_block_group_cache *cache);
 
 enum btrfs_reserve_flush_enum {
/* If we are in the transaction, we can't flush anything.*/
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index fc40901b4772..74955f79fcce 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9896,6 +9896,7 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
list_del(&block_group->list);
up_write(&block_group->space_info->groups_sem);
 
+   btrfs_remove_block_group_priority(block_group);
/*
 * We haven't cached this block group, which means we could
 * possibly have excluded extents on this block group.
@@ -10571,6 +10572,8 @@ int btrfs_remove_block_group(struct btrfs_trans_handle 
*trans,
clear_avail_alloc_bits(fs_info, block_group->flags);
}
up_write(&block_group->space_info->groups_sem);
+   btrfs_remove_block_group_priority(block_group);
+
if (kobj) {
kobject_del(kobj);
kobject_put(kobj);
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index f92c0a88c4ad..74234de9304a 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -72,6 +72,7 @@ void btrfs_put_transaction(struct btrfs_transaction 
*transaction)
 struct btrfs_block_group_cache,
 bg_list);
list_del_init(&cache->bg_list);
+   btrfs_remove_block_group_priority(cache);
btrfs_put_block_group_trimming(cache);
btrfs_put_block_group(cache);
}
-- 
2.19.1





[RFC PATCH 02/17] btrfs: add mount definition BTRFS_MOUNT_PRIORITY_USAGE

2018-11-27 Thread Su Yue
Now the implementation of the priority allocator only supports the usage
option. Add BTRFS_MOUNT_PRIORITY_USAGE for further commits.

Signed-off-by: Su Yue 
---
 fs/btrfs/ctree.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 5c4651d8a524..4c56baf9f7cf 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1406,6 +1406,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct 
btrfs_fs_info *info)
 #define BTRFS_MOUNT_FREE_SPACE_TREE(1 << 26)
 #define BTRFS_MOUNT_NOLOGREPLAY(1 << 27)
 #define BTRFS_MOUNT_REF_VERIFY (1 << 28)
+#define BTRFS_MOUNT_PRIORITY_USAGE  (1 << 29)
 
 #define BTRFS_DEFAULT_COMMIT_INTERVAL  (30)
 #define BTRFS_DEFAULT_MAX_INLINE   (2048)
-- 
2.19.1





[RFC PATCH 15/17] btrfs: priority alloc: write bg->priority_groups_sem while waiting reservation

2018-11-27 Thread Su Yue
If the priority allocator is used, we should down/up_write()
bg->priority_tree->groups_sem instead.

Signed-off-by: Su Yue 
---
 fs/btrfs/extent-tree.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f530a4344368..6627bbe56ad5 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6425,6 +6425,8 @@ void btrfs_dec_block_group_reservations(struct 
btrfs_fs_info *fs_info,
btrfs_put_block_group(bg);
 }
 
+static inline bool
+is_priority_alloc_enabled(struct btrfs_fs_info *fs_info);
 void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg)
 {
struct btrfs_space_info *space_info = bg->space_info;
@@ -6433,7 +6435,11 @@ void btrfs_wait_block_group_reservations(struct 
btrfs_block_group_cache *bg)
 
if (!(bg->flags & BTRFS_BLOCK_GROUP_DATA))
return;
-
+   if (is_priority_alloc_enabled(bg->fs_info)) {
+   down_write(&bg->priority_tree->groups_sem);
+   up_write(&bg->priority_tree->groups_sem);
+   goto wait;
+   }
/*
 * Our block group is read only but before we set it to read only,
 * some task might have had allocated an extent from it already, but it
@@ -6446,7 +6452,7 @@ void btrfs_wait_block_group_reservations(struct 
btrfs_block_group_cache *bg)
 */
down_write(&space_info->groups_sem);
up_write(&space_info->groups_sem);
-
+wait:
wait_var_event(&bg->reservations, !atomic_read(&bg->reservations));
 }
 
-- 
2.19.1





[RFC PATCH 10/17] btrfs: priority alloc: call add_block_group_priority while reading or making block group

2018-11-27 Thread Su Yue
Add the block group to the priority tree in btrfs_read_block_groups()
and btrfs_make_block_group().

Signed-off-by: Su Yue 
---
 fs/btrfs/extent-tree.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 2dec02782df1..fc40901b4772 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10117,6 +10117,7 @@ static int check_chunk_block_group_mappings(struct 
btrfs_fs_info *fs_info)
 }
 
 static long compute_block_group_priority(struct btrfs_block_group_cache *bg);
+static void add_block_group_priority(struct btrfs_block_group_cache *cache);
 int btrfs_read_block_groups(struct btrfs_fs_info *info)
 {
struct btrfs_path *path;
@@ -10251,6 +10252,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
link_block_group(cache);
 
cache->priority = compute_block_group_priority(cache);
+   add_block_group_priority(cache);
 
set_avail_alloc_bits(info, cache->flags);
if (btrfs_chunk_readonly(info, cache->key.objectid)) {
@@ -10402,6 +10404,7 @@ int btrfs_make_block_group(struct btrfs_trans_handle 
*trans, u64 bytes_used,
link_block_group(cache);
 
cache->priority = compute_block_group_priority(cache);
+   add_block_group_priority(cache);
 
list_add_tail(&cache->bg_list, &trans->new_bgs);
 
-- 
2.19.1





[RFC PATCH 14/17] btrfs: priority alloc: introduce btrfs_set_bg_updating and call btrfs_update_block_group_priority

2018-11-27 Thread Su Yue
For usage as priority, the variables in block groups we are concerned
about are reserved, bytes_super and btrfs_block_group_used(&bg->item).

This patch calls btrfs_set_bg_priority_updating() in the locations where
the above three variables change, to mark that the block group needs to
be updated, then calls btrfs_update_block_group_priority() to update the
priority tree if needed.
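
A sketch of the marker helper this relies on (hypothetical body; its actual
definition is in an earlier patch of the series and is not shown here):

void btrfs_set_bg_priority_updating(struct btrfs_block_group_cache *cache)
{
	/*
	 * Called with cache->lock held: flag the block group so that a later
	 * btrfs_update_block_group_priority() call re-computes its priority
	 * and moves it between priority trees if the level changed.
	 */
	if (!is_priority_alloc_enabled(cache->fs_info))
		return;
	if (cache->priority >= 0)
		cache->priority = PRIORITY_BG_UPDATING;
}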

Signed-off-by: Su Yue 
---
 fs/btrfs/ctree.h|  2 ++
 fs/btrfs/extent-tree.c  | 40 +
 fs/btrfs/free-space-cache.c |  3 +++
 3 files changed, 45 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 091b878e326c..f1ab0310da08 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2753,6 +2753,8 @@ u64 btrfs_metadata_alloc_profile(struct btrfs_fs_info 
*fs_info);
 u64 btrfs_system_alloc_profile(struct btrfs_fs_info *fs_info);
 void btrfs_clear_space_info_full(struct btrfs_fs_info *info);
 void btrfs_remove_block_group_priority(struct btrfs_block_group_cache *cache);
+void btrfs_set_bg_priority_updating(struct btrfs_block_group_cache *cache);
+void btrfs_update_block_group_priority(struct btrfs_block_group_cache *cache);
 
 enum btrfs_reserve_flush_enum {
/* If we are in the transaction, we can't flush anything.*/
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 4c76677a54a9..f530a4344368 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6183,6 +6183,7 @@ static int update_block_group(struct btrfs_trans_handle 
*trans,
cache->space_info->bytes_reserved -= num_bytes;
cache->space_info->bytes_used += num_bytes;
cache->space_info->disk_used += num_bytes * factor;
+   btrfs_set_bg_priority_updating(cache);
spin_unlock(&cache->lock);
spin_unlock(&cache->space_info->lock);
} else {
@@ -6192,6 +6193,7 @@ static int update_block_group(struct btrfs_trans_handle 
*trans,
update_bytes_pinned(cache->space_info, num_bytes);
cache->space_info->bytes_used -= num_bytes;
cache->space_info->disk_used -= num_bytes * factor;
+   btrfs_set_bg_priority_updating(cache);
spin_unlock(&cache->lock);
spin_unlock(&cache->space_info->lock);
 
@@ -6205,6 +6207,7 @@ static int update_block_group(struct btrfs_trans_handle 
*trans,
 bytenr, bytenr + num_bytes - 1,
 GFP_NOFS | __GFP_NOFAIL);
}
+   btrfs_update_block_group_priority(cache);
 
spin_lock(&trans->transaction->dirty_bgs_lock);
if (list_empty(&cache->dirty_list)) {
@@ -6264,6 +6267,7 @@ static int pin_down_extent(struct btrfs_fs_info *fs_info,
if (reserved) {
cache->reserved -= num_bytes;
cache->space_info->bytes_reserved -= num_bytes;
+   btrfs_set_bg_priority_updating(cache);
}
spin_unlock(&cache->lock);
spin_unlock(&cache->space_info->lock);
@@ -6274,6 +6278,8 @@ static int pin_down_extent(struct btrfs_fs_info *fs_info,
num_bytes, BTRFS_TOTAL_BYTES_PINNED_BATCH);
set_extent_dirty(fs_info->pinned_extents, bytenr,
 bytenr + num_bytes - 1, GFP_NOFS | __GFP_NOFAIL);
+
+   btrfs_update_block_group_priority(cache);
return 0;
 }
 
@@ -6472,6 +6478,12 @@ static int btrfs_add_reserved_bytes(struct 
btrfs_block_group_cache *cache,
update_bytes_may_use(space_info, -ram_bytes);
if (delalloc)
cache->delalloc_bytes += num_bytes;
+   /*
+* Since this is called from find_free_extent(),
+* call btrfs_update_block_group_priority() in the outer
+* caller to avoid a deadlock.
+*/
+   btrfs_set_bg_priority_updating(cache);
}
spin_unlock(&cache->lock);
spin_unlock(&space_info->lock);
@@ -6502,11 +6514,14 @@ static void btrfs_free_reserved_bytes(struct 
btrfs_block_group_cache *cache,
cache->reserved -= num_bytes;
space_info->bytes_reserved -= num_bytes;
space_info->max_extent_size = 0;
+   btrfs_set_bg_priority_updating(cache);
 
if (delalloc)
cache->delalloc_bytes -= num_bytes;
spin_unlock(&cache->lock);
spin_unlock(&space_info->lock);
+
+   btrfs_update_block_group_priority(cache);
 }
 void btrfs_prepare_extent_commit(struct btrfs_fs_info *fs_info)
 {
@@ -8025,6 +8040,7 @@ static noinline int find_free_extent(struct btrfs_fs_info 
*fs_info,
else
up_read(_info->groups_sem);
btrfs_release_block_group(block_group, delalloc);
+   btrfs_update_block_group_priority(block_group);
}
ret = find_free_extent_update_loop(fs_info, last_ptr, ins, _ctl,
  

[RFC PATCH 05/17] btrfs: priority alloc: introduce functions to add block group to priority tree

2018-11-27 Thread Su Yue
Introduce compute_priority_level() to compute the priority level from a
block group's priority; for now it just divides by PRIORITY_USAGE_FACOTR.

Introduce add_block_group_priority() to add block groups to the
priority tree.
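
A minimal sketch of what the level computation might look like, inferred only
from the description above (the actual implementation is in patch 03/17 and is
not shown here; the macro name is taken verbatim from the commit message):

	static int compute_priority_level(struct btrfs_fs_info *fs_info, long priority)
	{
		/* usage-based priority: bucket it into coarse levels */
		if (btrfs_test_opt(fs_info, PRIORITY_USAGE))
			return (int)(priority / PRIORITY_USAGE_FACOTR);
		return 0;
	}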

Signed-off-by: Su Yue 
---
 fs/btrfs/extent-tree.c | 76 ++
 1 file changed, 76 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 787a68b5bdcb..d63078930a1e 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -11344,3 +11344,79 @@ static int create_priority_trees(struct btrfs_fs_info 
*fs_info,
remove_priority_trees(fs_info, space_info);
return ret;
 }
+
+static int link_block_group_priority(struct btrfs_priority_tree *tree,
+struct btrfs_block_group_cache *cache)
+{
+   struct rb_node **p = &tree->block_groups.rb_node;
+   struct rb_node *parent = NULL;
+   struct btrfs_block_group_cache *tmp;
+
+   while (*p) {
+   parent = *p;
+   tmp = rb_entry(parent, struct btrfs_block_group_cache, node);
+   if (cache->key.objectid < tmp->key.objectid)
+   p = &(*p)->rb_left;
+   else if (cache->key.objectid > tmp->key.objectid)
+   p = &(*p)->rb_right;
+   else
+   return -EEXIST;
+   }
+
+   rb_link_node(&cache->node, parent, p);
+   rb_insert_color(&cache->node, &tree->block_groups);
+   return 0;
+}
+
+static struct btrfs_priority_tree *
+search_priority_tree(struct rb_root *root, int level)
+{
+   struct rb_node *node = root->rb_node;
+   struct btrfs_priority_tree *pt;
+
+   while (node) {
+   pt = rb_entry(node, struct btrfs_priority_tree, node);
+
+   if (level > pt->level)
+   node = node->rb_left;
+   else if (level < pt->level)
+   node = node->rb_right;
+   else
+   return pt;
+   }
+
+   return NULL;
+}
+
+static void add_block_group_priority(struct btrfs_block_group_cache *cache)
+{
+   struct btrfs_priority_tree *pt;
+   int index = btrfs_bg_flags_to_raid_index(cache->flags);
+   int level, max_level, min_level;
+   int ret;
+
+   if (!is_priority_alloc_enabled(cache->fs_info))
+   return;
+
+   if (btrfs_test_opt(cache->fs_info, PRIORITY_USAGE)) {
+   max_level = compute_priority_level(cache->fs_info,
+  (u8)100 << PRIORITY_USAGE_SHIFT);
+   min_level = 0;
+   }
+
+   level = compute_priority_level(cache->fs_info, cache->priority);
+   if (level > max_level || level < min_level) {
+   WARN_ON(level);
+   return;
+   }
+
+   pt = search_priority_tree(&cache->space_info->priority_trees[index],
+ level);
+   BUG_ON(pt == NULL);
+
+   down_write(&pt->groups_sem);
+   cache->priority_tree = pt;
+   ret = link_block_group_priority(pt, cache);
+   up_write(&pt->groups_sem);
+   BUG_ON(ret);
+}
-- 
2.19.1





[RFC PATCH 07/17] btrfs: priority alloc: add functions to remove block group from priority tree

2018-11-27 Thread Su Yue
Introduce btrfs_remove_block_group_priority() to remove block group
from priority tree.

Signed-off-by: Su Yue 
---
 fs/btrfs/extent-tree.c | 37 +
 1 file changed, 37 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 5bae757786dc..b559c9a9afc6 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -11429,3 +11429,40 @@ static void add_block_group_priority(struct 
btrfs_block_group_cache *cache)
up_write(&pt->groups_sem);
BUG_ON(ret);
 }
+
+static void unlink_block_group_priority(struct btrfs_priority_tree *pt,
+   struct btrfs_block_group_cache *cache)
+{
+   rb_erase(&cache->node, &pt->block_groups);
+   RB_CLEAR_NODE(&cache->node);
+}
+
+void btrfs_remove_block_group_priority(struct btrfs_block_group_cache *cache)
+{
+   struct btrfs_priority_tree *pt;
+
+   if (!is_priority_alloc_enabled(cache->fs_info))
+   return;
+
+   spin_lock(&cache->lock);
+   if (cache->priority_tree == NULL) {
+   spin_unlock(&cache->lock);
+   return;
+   }
+
+   pt = cache->priority_tree;
+   cache->priority = PRIORITY_BG_DELETED;
+   spin_unlock(&cache->lock);
+
+   down_write(&pt->groups_sem);
+   spin_lock(&cache->lock);
+
+   if (cache->priority_tree == NULL)
+   goto out;
+
+   unlink_block_group_priority(pt, cache);
+   cache->priority_tree = NULL;
+out:
+   spin_unlock(&cache->lock);
+   up_write(&pt->groups_sem);
+}
-- 
2.19.1





[RFC PATCH 09/17] btrfs: priority alloc: call create/remove_priority_trees in space_info

2018-11-27 Thread Su Yue
Call create_priority_trees() in create_space_info().
Call remove_priority_trees() before free of space_info.

Signed-off-by: Su Yue 
---
 fs/btrfs/extent-tree.c | 21 +
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 5ea1f2e40701..2dec02782df1 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3940,6 +3940,10 @@ static const char *alloc_name(u64 flags)
};
 }
 
+static int create_priority_trees(struct btrfs_fs_info *fs_info,
+struct btrfs_space_info *space_info);
+static void remove_priority_trees(struct btrfs_fs_info *fs_info,
+ struct btrfs_space_info *space_info);
 static int create_space_info(struct btrfs_fs_info *info, u64 flags)
 {
 
@@ -3958,8 +3962,17 @@ static int create_space_info(struct btrfs_fs_info *info, 
u64 flags)
return ret;
}
 
-   for (i = 0; i < BTRFS_NR_RAID_TYPES; i++)
+   for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
INIT_LIST_HEAD(&space_info->block_groups[i]);
+   space_info->priority_trees[i] = RB_ROOT;
+   }
+
+   ret = create_priority_trees(info, space_info);
+   if (ret) {
+   kfree(space_info);
+   return ret;
+   }
+
init_rwsem(&space_info->groups_sem);
spin_lock_init(&space_info->lock);
space_info->flags = flags & BTRFS_BLOCK_GROUP_TYPE_MASK;
@@ -3974,6 +3987,7 @@ static int create_space_info(struct btrfs_fs_info *info, 
u64 flags)
alloc_name(space_info->flags));
if (ret) {
percpu_counter_destroy(&space_info->total_bytes_pinned);
+   remove_priority_trees(info, space_info);
kfree(space_info);
return ret;
}
@@ -9928,6 +9942,8 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
space_info->bytes_may_use > 0))
dump_space_info(info, space_info, 0, 0);
list_del(&space_info->list);
+   remove_priority_trees(info, space_info);
+
for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
struct kobject *kobj;
kobj = space_info->block_group_kobjs[i];
@@ -11282,9 +11298,6 @@ static void remove_priority_trees(struct btrfs_fs_info 
*fs_info,
struct btrfs_priority_tree *pt, *next_pt;
int i;
 
-   if (!is_priority_alloc_enabled(fs_info))
-   return;
-
for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
root = &space_info->priority_trees[i];
rbtree_postorder_for_each_entry_safe(pt, next_pt, root, node) {
-- 
2.19.1





[RFC PATCH 01/17] btrfs: priority alloc: prepare of priority aware allocator

2018-11-27 Thread Su Yue
To implement the priority aware allocator, this patch:
Introduces struct btrfs_priority_tree, which contains the block groups
of the same level.
Adds a member priority to struct btrfs_block_group_cache and a pointer
to the priority tree it is located in.

Adds a member priority_trees to struct btrfs_space_info to represent the
priority trees of the different raid types.

Signed-off-by: Su Yue 
---
 fs/btrfs/ctree.h | 24 
 1 file changed, 24 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index e62824cae00a..5c4651d8a524 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -437,6 +437,8 @@ struct btrfs_space_info {
struct rw_semaphore groups_sem;
/* for block groups in our same type */
struct list_head block_groups[BTRFS_NR_RAID_TYPES];
+   /* for priority trees in our same type */
+   struct rb_root priority_trees[BTRFS_NR_RAID_TYPES];
wait_queue_head_t wait;
 
struct kobject kobj;
@@ -558,6 +560,21 @@ struct btrfs_full_stripe_locks_tree {
struct mutex lock;
 };
 
+/*
+ * Tree to record all block_groups in same priority level.
+ * Only used in priority aware allocator.
+ */
+struct btrfs_priority_tree {
+   /* protected by groups_sem */
+   struct rb_root block_groups;
+   struct rw_semaphore groups_sem;
+
+   /* for different level priority trees in same index*/
+   struct rb_node node;
+
+   int level;
+};
+
 struct btrfs_block_group_cache {
struct btrfs_key key;
struct btrfs_block_group_item item;
@@ -571,6 +588,8 @@ struct btrfs_block_group_cache {
u64 flags;
u64 cache_generation;
 
+   /* It's used only when priority aware allocator is enabled. */
+   long priority;
/*
 * If the free space extent count exceeds this number, convert the block
 * group to bitmaps.
@@ -616,6 +635,9 @@ struct btrfs_block_group_cache {
/* for block groups in the same raid type */
struct list_head list;
 
+   /* for block groups in the same priority level */
+   struct rb_node node;
+
/* usage count */
atomic_t count;
 
@@ -670,6 +692,8 @@ struct btrfs_block_group_cache {
 
/* Record locked full stripes for RAID5/6 block group */
struct btrfs_full_stripe_locks_tree full_stripe_locks_root;
+
+   struct btrfs_priority_tree *priority_tree;
 };
 
 /* delayed seq elem */
-- 
2.19.1





[RFC PATCH 00/17] btrfs: implementation of priority aware allocator

2018-11-27 Thread Su Yue
This patchset can be fetched from repo:
https://github.com/Damenly/btrfs-devel/commits/priority_aware_allocator.
The patchset 'btrfs: Refactor find_free_extent()' does nice work simplifying
find_free_extent(), so this patchset depends on that refactor.
The base is the commit in kdave/misc-next:

commit fcaaa1dfa81f2f87ad88cbe0ab86a07f9f76073c (kdave/misc-next)
Author: Nikolay Borisov 
Date:   Tue Nov 6 16:40:20 2018 +0200

btrfs: Always try all copies when reading extent buffers


This patchset introduces a new mount option named 'priority_alloc=%s';
%s currently accepts "usage" and "off". The mount option changes how
find_free_extent() searches block groups.

Previously, block groups were stored in a list in btrfs_space_info,
ordered by start position. When find_free_extent() is called with no
hint, block groups are searched one by one.

Design of the priority aware allocator:
Every block group has its own priority. Priorities are split into levels,
and block groups are placed in different trees according to their
priorities. Those trees are sorted by level and stored in the space_info.
When find_free_extent() is called, block groups at higher priority levels
are searched before those at lower levels, so a block group with a higher
priority is more likely to be used (see the sketch below).

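A conceptual sketch of that search order (the helper names below are made up
for illustration only and are not part of the series):

	static struct btrfs_block_group_cache *
	priority_aware_search(struct btrfs_space_info *space_info, int raid_index)
	{
		struct btrfs_priority_tree *pt;
		struct btrfs_block_group_cache *bg;

		/* walk the priority trees from the highest level down */
		for (pt = highest_priority_tree(space_info, raid_index); pt;
		     pt = next_lower_priority_tree(pt)) {
			/* within one level, try every block group it contains */
			for (bg = first_block_group_in_tree(pt); bg;
			     bg = next_block_group_in_tree(bg)) {
				if (block_group_can_satisfy_allocation(bg))
					return bg;
			}
		}
		return NULL;	/* nothing found; caller may allocate a new chunk */
	}
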
Pros:
1) Reduce the frequency of balance.
   Block groups with a higher usage rate will be used preferentially
   for allocating extents, and empty block groups with non-zero pinned
   bytes can then be freed.[1]

2) The priority of an empty block group with non-zero pinned bytes
   will be set to the lowest level.

3) Support zoned block devices.[2]
   For metadata allocation, block groups in conventional zones
   will be used as much as possible regardless of usage rate.
   This will be done in the future.
   
Cons:
1) A performance regression is to be expected.
   The degree of the decline is currently unknown.
   The user can disable block group priority to get full performance.

TESTS:

If usage is used as the priority (the only available option), an empty block
group is much less likely to be reused.

About block group usage:
  Disk: 4 x 1T HDD gathered in LVM.

  A script creates and deletes files randomly in a loop.
  The number of files created is double the number deleted.

  Default mount option result:
  https://i.loli.net/2018/11/28/5bfdfdf08c760.png

  Priority aware allocator(usage) result:
  https://i.loli.net/2018/11/28/5bfdfdf0c1b11.png

  The X axis is total disk usage, the Y axis is the average block
  group usage.

  Due to fragmentation of extents, the difference is not obvious,
  only about a 1% improvement.
   
Performance regression:
   I have run sysbench on our machine with an SSD in multiple
   combinations and found no obvious regression.
   However, in theory the new allocator may cost more time in some
   cases.

[1] https://www.spinics.net/lists/linux-btrfs/msg79508.html
[2] https://lkml.org/lkml/2018/8/16/174

---
Due to constraints of time and hardware, the use case is not demonstrated
as well as it could be. Some of the code is dirty, but I could not find
another way, so I have marked this as an RFC.
 Any comments and suggestions are welcome.
 
Su Yue (17):
  btrfs: priority alloc: prepare of priority aware allocator
  btrfs: add mount definition BTRFS_MOUNT_PRIORITY_USAGE
  btrfs: priority alloc: introduce compute_block_group_priority/usage
  btrfs: priority alloc: add functions to create/remove priority trees
  btrfs: priority alloc: introduce functions to add block group to
priority tree
  btrfs: priority alloc: introduce three macros to mark block group
status
  btrfs: priority alloc: add functions to remove block group from
priority tree
  btrfs: priority alloc: add btrfs_update_block_group_priority()
  btrfs: priority alloc: call create/remove_priority_trees in space_info
  btrfs: priority alloc: call add_block_group_priority while reading or
making block group
  btrfs: priority alloc: remove block group from priority tree while
removing block group
  btrfs: priority alloc: introduce find_free_extent_search()
  btrfs: priority alloc: modify find_free_extent() to fit priority
allocator
  btrfs: priority alloc: introduce btrfs_set_bg_updating and call
btrfs_update_block_group_prioriy
  btrfs: priority alloc: write bg->priority_groups_sem while waiting
reservation
  btrfs: priority alloc: write bg->priority_tree->groups_sem to avoid
race in btrfs_delete_unused_bgs()
  btrfs: add mount option "priority_alloc=%s"

 fs/btrfs/ctree.h|  28 ++
 fs/btrfs/extent-tree.c  | 672 +---
 fs/btrfs/free-space-cache.c |   3 +
 fs/btrfs/super.c|  18 +
 fs/btrfs/transaction.c  |   1 +
 5 files changed, 681 insertions(+), 41 deletions(-)

-- 
2.19.1





Re: [PATCH 9/9] btrfs: drop extra enum initialization where using defaults

2018-11-27 Thread Qu Wenruo


On 2018/11/28 上午3:53, David Sterba wrote:
> The first auto-assigned value to enum is 0, we can use that and not
> initialize all members where the auto-increment does the same. This is
> used for values that are not part of on-disk format.
> 
> Signed-off-by: David Sterba 

Reviewed-by: Qu Wenruo 

Thanks,
Qu

> ---
>  fs/btrfs/btrfs_inode.h |  2 +-
>  fs/btrfs/ctree.h   | 28 ++--
>  fs/btrfs/disk-io.h | 10 +-
>  fs/btrfs/qgroup.h  |  2 +-
>  fs/btrfs/sysfs.h   |  2 +-
>  fs/btrfs/transaction.h | 14 +++---
>  6 files changed, 29 insertions(+), 29 deletions(-)
> 
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index 4de321aee7a5..fc25607304f2 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -20,7 +20,7 @@
>   * new data the application may have written before commit.
>   */
>  enum {
> - BTRFS_INODE_ORDERED_DATA_CLOSE = 0,
> + BTRFS_INODE_ORDERED_DATA_CLOSE,
>   BTRFS_INODE_DUMMY,
>   BTRFS_INODE_IN_DEFRAG,
>   BTRFS_INODE_HAS_ASYNC_EXTENT,
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 4bb0ac3050ff..f1d1c6ba3aa1 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -334,7 +334,7 @@ struct btrfs_node {
>   * The slots array records the index of the item or block pointer
>   * used while walking the tree.
>   */
> -enum { READA_NONE = 0, READA_BACK, READA_FORWARD };
> +enum { READA_NONE, READA_BACK, READA_FORWARD };
>  struct btrfs_path {
>   struct extent_buffer *nodes[BTRFS_MAX_LEVEL];
>   int slots[BTRFS_MAX_LEVEL];
> @@ -532,18 +532,18 @@ struct btrfs_free_cluster {
>  };
>  
>  enum btrfs_caching_type {
> - BTRFS_CACHE_NO  = 0,
> - BTRFS_CACHE_STARTED = 1,
> - BTRFS_CACHE_FAST= 2,
> - BTRFS_CACHE_FINISHED= 3,
> - BTRFS_CACHE_ERROR   = 4,
> + BTRFS_CACHE_NO,
> + BTRFS_CACHE_STARTED,
> + BTRFS_CACHE_FAST,
> + BTRFS_CACHE_FINISHED,
> + BTRFS_CACHE_ERROR,
>  };
>  
>  enum btrfs_disk_cache_state {
> - BTRFS_DC_WRITTEN= 0,
> - BTRFS_DC_ERROR  = 1,
> - BTRFS_DC_CLEAR  = 2,
> - BTRFS_DC_SETUP  = 3,
> + BTRFS_DC_WRITTEN,
> + BTRFS_DC_ERROR,
> + BTRFS_DC_CLEAR,
> + BTRFS_DC_SETUP,
>  };
>  
>  struct btrfs_caching_control {
> @@ -2621,10 +2621,10 @@ static inline gfp_t btrfs_alloc_write_mask(struct 
> address_space *mapping)
>  /* extent-tree.c */
>  
>  enum btrfs_inline_ref_type {
> - BTRFS_REF_TYPE_INVALID = 0,
> - BTRFS_REF_TYPE_BLOCK =   1,
> - BTRFS_REF_TYPE_DATA =2,
> - BTRFS_REF_TYPE_ANY = 3,
> + BTRFS_REF_TYPE_INVALID,
> + BTRFS_REF_TYPE_BLOCK,
> + BTRFS_REF_TYPE_DATA,
> + BTRFS_REF_TYPE_ANY,
>  };
>  
>  int btrfs_get_extent_inline_ref_type(const struct extent_buffer *eb,
> diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
> index 4cccba22640f..987a64bc0c66 100644
> --- a/fs/btrfs/disk-io.h
> +++ b/fs/btrfs/disk-io.h
> @@ -21,11 +21,11 @@
>  #define BTRFS_BDEV_BLOCKSIZE (4096)
>  
>  enum btrfs_wq_endio_type {
> - BTRFS_WQ_ENDIO_DATA = 0,
> - BTRFS_WQ_ENDIO_METADATA = 1,
> - BTRFS_WQ_ENDIO_FREE_SPACE = 2,
> - BTRFS_WQ_ENDIO_RAID56 = 3,
> - BTRFS_WQ_ENDIO_DIO_REPAIR = 4,
> + BTRFS_WQ_ENDIO_DATA,
> + BTRFS_WQ_ENDIO_METADATA,
> + BTRFS_WQ_ENDIO_FREE_SPACE,
> + BTRFS_WQ_ENDIO_RAID56,
> + BTRFS_WQ_ENDIO_DIO_REPAIR,
>  };
>  
>  static inline u64 btrfs_sb_offset(int mirror)
> diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
> index d8f78f5ab854..e4e6ee44073a 100644
> --- a/fs/btrfs/qgroup.h
> +++ b/fs/btrfs/qgroup.h
> @@ -70,7 +70,7 @@ struct btrfs_qgroup_extent_record {
>   *   be converted into META_PERTRANS.
>   */
>  enum btrfs_qgroup_rsv_type {
> - BTRFS_QGROUP_RSV_DATA = 0,
> + BTRFS_QGROUP_RSV_DATA,
>   BTRFS_QGROUP_RSV_META_PERTRANS,
>   BTRFS_QGROUP_RSV_META_PREALLOC,
>   BTRFS_QGROUP_RSV_LAST,
> diff --git a/fs/btrfs/sysfs.h b/fs/btrfs/sysfs.h
> index c6ee600aff89..40716b357c1d 100644
> --- a/fs/btrfs/sysfs.h
> +++ b/fs/btrfs/sysfs.h
> @@ -9,7 +9,7 @@
>  extern u64 btrfs_debugfs_test;
>  
>  enum btrfs_feature_set {
> - FEAT_COMPAT = 0,
> + FEAT_COMPAT,
>   FEAT_COMPAT_RO,
>   FEAT_INCOMPAT,
>   FEAT_MAX
> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
> index 703d5116a2fc..f1ba78949d1b 100644
> --- a/fs/btrfs/transaction.h
> +++ b/fs/btrfs/transaction.h
> @@ -12,13 +12,13 @@
>  #include "ctree.h"
>  
>  enum btrfs_trans_state {
> - TRANS_STATE_RUNNING = 0,
> - TRANS_STATE_BLOCKED = 1,
> - TRANS_STATE_COMMIT_START= 2,
> - TRANS_STATE_COMMIT_DOING= 3,
> - TRANS_STATE_UNBLOCKED   = 4,
> - TRANS_STATE_COMPLETED   = 5,
> - TRANS_STATE_MAX = 6,
> + TRANS_STATE_RUNNING,
> + TRANS_STATE_BLOCKED,
> + TRANS_STATE_COMMIT_START,
> + 

Re: [PATCH 0/9] Switch defines to enums

2018-11-27 Thread Qu Wenruo


On 2018/11/28 上午3:53, David Sterba wrote:
> This is motivated by a merging mistake that happened a few releases ago.
> Two patches updated BTRFS_FS_* flags independently to the same value,
> git did not see any direct merge conflict. The two values got mixed at
> runtime and caused crash.
> 
> Update all #define sequential values, the above merging problem would
> not happen as there would be a conflict and the enum value
> auto-increment would prevent duplicated values anyway.

Just one small question about the bitmap usage.

With an enum we won't directly know what the last value is. My concern is:
if we're using a u16 as a bitmap and some enum value goes above 15, will we
get a warning at compile time, or would such a bug just sneak into the
kernel?
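
(One way to get a compile-time guard, sketched here purely as an
illustration and not something present in this series, is to add a
terminating enum member and assert against the bitmap width:)

	enum {
		BTRFS_FS_BARRIER,
		/* ... */
		BTRFS_FS_BALANCE_RUNNING,
		BTRFS_FS_FLAG_COUNT	/* hypothetical terminator, not in the patch */
	};

	/* in some init function; fails the build if the flags outgrow a u16 */
	BUILD_BUG_ON(BTRFS_FS_FLAG_COUNT > 16);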

Thanks,
Qu

> 
> David Sterba (9):
>   btrfs: switch BTRFS_FS_STATE_* to enums
>   btrfs: switch BTRFS_BLOCK_RSV_* to enums
>   btrfs: switch BTRFS_FS_* to enums
>   btrfs: switch BTRFS_ROOT_* to enums
>   btrfs: swtich EXTENT_BUFFER_* to enums
>   btrfs: switch EXTENT_FLAG_* to enums
>   btrfs: switch BTRFS_*_LOCK to enums
>   btrfs: switch BTRFS_ORDERED_* to enums
>   btrfs: drop extra enum initialization where using defaults
> 
>  fs/btrfs/btrfs_inode.h  |   2 +-
>  fs/btrfs/ctree.h| 168 ++--
>  fs/btrfs/disk-io.h  |  10 +--
>  fs/btrfs/extent_io.h|  28 ---
>  fs/btrfs/extent_map.h   |  21 +++--
>  fs/btrfs/locking.h  |  10 ++-
>  fs/btrfs/ordered-data.h |  45 ++-
>  fs/btrfs/qgroup.h   |   2 +-
>  fs/btrfs/sysfs.h|   2 +-
>  fs/btrfs/transaction.h  |  14 ++--
>  10 files changed, 169 insertions(+), 133 deletions(-)
> 



signature.asc
Description: OpenPGP digital signature


Re: [PATCH 8/9] btrfs: switch BTRFS_ORDERED_* to enums

2018-11-27 Thread Qu Wenruo


On 2018/11/28 上午3:53, David Sterba wrote:
> We can use simple enum for values that are not part of on-disk format:
> ordered extent flags.
> 
> Signed-off-by: David Sterba 

Reviewed-by: Qu Wenruo 

Thanks,
Qu

> ---
>  fs/btrfs/ordered-data.h | 45 +++--
>  1 file changed, 25 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
> index b10e6765d88f..fb9a161f0215 100644
> --- a/fs/btrfs/ordered-data.h
> +++ b/fs/btrfs/ordered-data.h
> @@ -37,26 +37,31 @@ struct btrfs_ordered_sum {
>   * rbtree, just before waking any waiters.  It is used to indicate the
>   * IO is done and any metadata is inserted into the tree.
>   */
> -#define BTRFS_ORDERED_IO_DONE 0 /* set when all the pages are written */
> -
> -#define BTRFS_ORDERED_COMPLETE 1 /* set when removed from the tree */
> -
> -#define BTRFS_ORDERED_NOCOW 2 /* set when we want to write in place */
> -
> -#define BTRFS_ORDERED_COMPRESSED 3 /* writing a zlib compressed extent */
> -
> -#define BTRFS_ORDERED_PREALLOC 4 /* set when writing to preallocated extent 
> */
> -
> -#define BTRFS_ORDERED_DIRECT 5 /* set when we're doing DIO with this extent 
> */
> -
> -#define BTRFS_ORDERED_IOERR 6 /* We had an io error when writing this out */
> -
> -#define BTRFS_ORDERED_UPDATED_ISIZE 7 /* indicates whether this ordered 
> extent
> -* has done its due diligence in updating
> -* the isize. */
> -#define BTRFS_ORDERED_TRUNCATED 8 /* Set when we have to truncate an extent 
> */
> -
> -#define BTRFS_ORDERED_REGULAR 10 /* Regular IO for COW */
> +enum {
> + /* set when all the pages are written */
> + BTRFS_ORDERED_IO_DONE,
> + /* set when removed from the tree */
> + BTRFS_ORDERED_COMPLETE,
> + /* set when we want to write in place */
> + BTRFS_ORDERED_NOCOW,
> + /* writing a zlib compressed extent */
> + BTRFS_ORDERED_COMPRESSED,
> + /* set when writing to preallocated extent */
> + BTRFS_ORDERED_PREALLOC,
> + /* set when we're doing DIO with this extent */
> + BTRFS_ORDERED_DIRECT,
> + /* We had an io error when writing this out */
> + BTRFS_ORDERED_IOERR,
> + /*
> +  * indicates whether this ordered extent has done its due diligence in
> +  * updating the isize
> +  */
> + BTRFS_ORDERED_UPDATED_ISIZE,
> + /* Set when we have to truncate an extent */
> + BTRFS_ORDERED_TRUNCATED,
> + /* Regular IO for COW */
> + BTRFS_ORDERED_REGULAR,
> +};
>  
>  struct btrfs_ordered_extent {
>   /* logical offset in the file */
> 



signature.asc
Description: OpenPGP digital signature


RE: [PATCH RESEND 0/8] btrfs-progs: sub: Relax the privileges of "subvolume list/show"

2018-11-27 Thread misono.tomoh...@fujitsu.com
Hi,

> -Original Message-
> From: Martin Steigerwald [mailto:mar...@lichtvoll.de]
> Sent: Tuesday, November 27, 2018 6:48 PM
> To: Misono, Tomohiro 
> Cc: linux-btrfs@vger.kernel.org
> Subject: Re: [PATCH RESEND 0/8] btrfs-progs: sub: Relax the privileges
> of "subvolume list/show"
> 
> Misono Tomohiro - 27.11.18, 06:24:
> > Importantly, in order to make output consistent for both root and
> > non-privileged user, this changes the behavior of "subvolume list":
> >  - (default) Only list in subvolume under the specified path.
> >Path needs to be a subvolume.
> 
> Does that work recursively?

Not in this version.

The previous version had an -f option which recursively searches for and lists subvolumes
(only if they have the same btrfs fsid):
https://lore.kernel.org/linux-btrfs/84d06767762b4285ddefec0392ee16e2d7e06f62.1529310485.git.misono.tomoh...@jp.fujitsu.com/

However, the current "sub list" command already has many options and we cannot
add new ones at random, so I dropped the patches which add new options from this version.
(In other words, the previous version can be divided into two parts:
  1. Relax the privileges of sub list/show
  2. Add new options to sub list
and this version contains only part 1.)

> 
> I wound find it quite unexpected if I did btrfs subvol list in or on the
> root directory of a BTRFS filesystem would not display any subvolumes
> on
> that filesystem no matter where they are.

Yes, I think the output of the -f option is more readable in such a case.
Once agreement has been reached about how "sub list" should really work,
I can update and send the patches again.

Thanks,
Misono


Re: [PATCH 7/9] btrfs: switch BTRFS_*_LOCK to enums

2018-11-27 Thread Qu Wenruo


On 2018/11/28 上午3:53, David Sterba wrote:
> We can use simple enum for values that are not part of on-disk format:
> tree lock types.
> 
> Signed-off-by: David Sterba 

Reviewed-by: Qu Wenruo 

Thanks,
Qu

> ---
>  fs/btrfs/locking.h | 10 ++
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/locking.h b/fs/btrfs/locking.h
> index 29135def468e..684d0ef4faa4 100644
> --- a/fs/btrfs/locking.h
> +++ b/fs/btrfs/locking.h
> @@ -6,10 +6,12 @@
>  #ifndef BTRFS_LOCKING_H
>  #define BTRFS_LOCKING_H
>  
> -#define BTRFS_WRITE_LOCK 1
> -#define BTRFS_READ_LOCK 2
> -#define BTRFS_WRITE_LOCK_BLOCKING 3
> -#define BTRFS_READ_LOCK_BLOCKING 4
> +enum {
> + BTRFS_WRITE_LOCK,
> + BTRFS_READ_LOCK,
> + BTRFS_WRITE_LOCK_BLOCKING,
> + BTRFS_READ_LOCK_BLOCKING,
> +};
>  
>  void btrfs_tree_lock(struct extent_buffer *eb);
>  void btrfs_tree_unlock(struct extent_buffer *eb);
> 



signature.asc
Description: OpenPGP digital signature


Re: [PATCH 6/9] btrfs: switch EXTENT_FLAG_* to enums

2018-11-27 Thread Qu Wenruo


On 2018/11/28 上午3:53, David Sterba wrote:
> We can use simple enum for values that are not part of on-disk format:
> extent map flags.
> 
> Signed-off-by: David Sterba 

Reviewed-by: Qu Wenruo 

Thanks,
Qu

> ---
>  fs/btrfs/extent_map.h | 21 ++---
>  1 file changed, 14 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
> index 31977ffd6190..ef05a0121652 100644
> --- a/fs/btrfs/extent_map.h
> +++ b/fs/btrfs/extent_map.h
> @@ -11,13 +11,20 @@
>  #define EXTENT_MAP_INLINE ((u64)-2)
>  #define EXTENT_MAP_DELALLOC ((u64)-1)
>  
> -/* bits for the flags field */
> -#define EXTENT_FLAG_PINNED 0 /* this entry not yet on disk, don't free it */
> -#define EXTENT_FLAG_COMPRESSED 1
> -#define EXTENT_FLAG_PREALLOC 3 /* pre-allocated extent */
> -#define EXTENT_FLAG_LOGGING 4 /* Logging this extent */
> -#define EXTENT_FLAG_FILLING 5 /* Filling in a preallocated extent */
> -#define EXTENT_FLAG_FS_MAPPING 6 /* filesystem extent mapping type */
> +/* bits for the extent_map::flags field */
> +enum {
> + /* this entry not yet on disk, don't free it */
> + EXTENT_FLAG_PINNED,
> + EXTENT_FLAG_COMPRESSED,
> + /* pre-allocated extent */
> + EXTENT_FLAG_PREALLOC,
> + /* Logging this extent */
> + EXTENT_FLAG_LOGGING,
> + /* Filling in a preallocated extent */
> + EXTENT_FLAG_FILLING,
> + /* filesystem extent mapping type */
> + EXTENT_FLAG_FS_MAPPING,
> +};
>  
>  struct extent_map {
>   struct rb_node rb_node;
> 



signature.asc
Description: OpenPGP digital signature


Re: [PATCH 5/9] btrfs: swtich EXTENT_BUFFER_* to enums

2018-11-27 Thread Qu Wenruo


On 2018/11/28 上午3:53, David Sterba wrote:
> We can use simple enum for values that are not part of on-disk format:
> extent buffer flags;
> 
> Signed-off-by: David Sterba 

Reviewed-by: Qu Wenruo 

Thanks,
Qu

> ---
>  fs/btrfs/extent_io.h | 28 
>  1 file changed, 16 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
> index a1d3ea5a0d32..fd42492e62e5 100644
> --- a/fs/btrfs/extent_io.h
> +++ b/fs/btrfs/extent_io.h
> @@ -37,18 +37,22 @@
>  #define EXTENT_BIO_COMPRESSED 1
>  #define EXTENT_BIO_FLAG_SHIFT 16
>  
> -/* these are bit numbers for test/set bit */
> -#define EXTENT_BUFFER_UPTODATE 0
> -#define EXTENT_BUFFER_DIRTY 2
> -#define EXTENT_BUFFER_CORRUPT 3
> -#define EXTENT_BUFFER_READAHEAD 4/* this got triggered by readahead */
> -#define EXTENT_BUFFER_TREE_REF 5
> -#define EXTENT_BUFFER_STALE 6
> -#define EXTENT_BUFFER_WRITEBACK 7
> -#define EXTENT_BUFFER_READ_ERR 8/* read IO error */
> -#define EXTENT_BUFFER_UNMAPPED 9
> -#define EXTENT_BUFFER_IN_TREE 10
> -#define EXTENT_BUFFER_WRITE_ERR 11/* write IO error */
> +enum {
> + EXTENT_BUFFER_UPTODATE,
> + EXTENT_BUFFER_DIRTY,
> + EXTENT_BUFFER_CORRUPT,
> + /* this got triggered by readahead */
> + EXTENT_BUFFER_READAHEAD,
> + EXTENT_BUFFER_TREE_REF,
> + EXTENT_BUFFER_STALE,
> + EXTENT_BUFFER_WRITEBACK,
> + /* read IO error */
> + EXTENT_BUFFER_READ_ERR,
> + EXTENT_BUFFER_UNMAPPED,
> + EXTENT_BUFFER_IN_TREE,
> + /* write IO error */
> + EXTENT_BUFFER_WRITE_ERR,
> +};
>  
>  /* these are flags for __process_pages_contig */
>  #define PAGE_UNLOCK  (1 << 0)
> 



signature.asc
Description: OpenPGP digital signature


Re: [PATCH 4/9] btrfs: switch BTRFS_ROOT_* to enums

2018-11-27 Thread Qu Wenruo


On 2018/11/28 上午3:53, David Sterba wrote:
> We can use simple enum for values that are not part of on-disk format:
> root tree flags.
> 
> Signed-off-by: David Sterba 

Reviewed-by: Qu Wenruo 

Thanks,
Qu

> ---
>  fs/btrfs/ctree.h | 33 +
>  1 file changed, 17 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 7176b95b40e7..4bb0ac3050ff 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1180,22 +1180,23 @@ struct btrfs_subvolume_writers {
>  /*
>   * The state of btrfs root
>   */
> -/*
> - * btrfs_record_root_in_trans is a multi-step process,
> - * and it can race with the balancing code.   But the
> - * race is very small, and only the first time the root
> - * is added to each transaction.  So IN_TRANS_SETUP
> - * is used to tell us when more checks are required
> - */
> -#define BTRFS_ROOT_IN_TRANS_SETUP0
> -#define BTRFS_ROOT_REF_COWS  1
> -#define BTRFS_ROOT_TRACK_DIRTY   2
> -#define BTRFS_ROOT_IN_RADIX  3
> -#define BTRFS_ROOT_ORPHAN_ITEM_INSERTED  4
> -#define BTRFS_ROOT_DEFRAG_RUNNING5
> -#define BTRFS_ROOT_FORCE_COW 6
> -#define BTRFS_ROOT_MULTI_LOG_TASKS   7
> -#define BTRFS_ROOT_DIRTY 8
> +enum {
> + /*
> +  * btrfs_record_root_in_trans is a multi-step process, and it can race
> +  * with the balancing code.   But the race is very small, and only the
> +  * first time the root is added to each transaction.  So IN_TRANS_SETUP
> +  * is used to tell us when more checks are required
> +  */
> + BTRFS_ROOT_IN_TRANS_SETUP,
> + BTRFS_ROOT_REF_COWS,
> + BTRFS_ROOT_TRACK_DIRTY,
> + BTRFS_ROOT_IN_RADIX,
> + BTRFS_ROOT_ORPHAN_ITEM_INSERTED,
> + BTRFS_ROOT_DEFRAG_RUNNING,
> + BTRFS_ROOT_FORCE_COW,
> + BTRFS_ROOT_MULTI_LOG_TASKS,
> + BTRFS_ROOT_DIRTY,
> +};
>  
>  /*
>   * in ram representation of the tree.  extent_root is used for all 
> allocations
> 



signature.asc
Description: OpenPGP digital signature


Re: [PATCH 3/9] btrfs: switch BTRFS_FS_* to enums

2018-11-27 Thread Qu Wenruo


On 2018/11/28 上午3:53, David Sterba wrote:
> We can use simple enum for values that are not part of on-disk format:
> internal filesystem states.
> 
> Signed-off-by: David Sterba 

Reviewed-by: Qu Wenruo 

Thanks,
Qu

> ---
>  fs/btrfs/ctree.h | 63 
>  1 file changed, 31 insertions(+), 32 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 40c405d74a01..7176b95b40e7 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -757,38 +757,37 @@ struct btrfs_swapfile_pin {
>  
>  bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr);
>  
> -#define BTRFS_FS_BARRIER 1
> -#define BTRFS_FS_CLOSING_START   2
> -#define BTRFS_FS_CLOSING_DONE3
> -#define BTRFS_FS_LOG_RECOVERING  4
> -#define BTRFS_FS_OPEN5
> -#define BTRFS_FS_QUOTA_ENABLED   6
> -#define BTRFS_FS_UPDATE_UUID_TREE_GEN9
> -#define BTRFS_FS_CREATING_FREE_SPACE_TREE10
> -#define BTRFS_FS_BTREE_ERR   11
> -#define BTRFS_FS_LOG1_ERR12
> -#define BTRFS_FS_LOG2_ERR13
> -#define BTRFS_FS_QUOTA_OVERRIDE  14
> -/* Used to record internally whether fs has been frozen */
> -#define BTRFS_FS_FROZEN  15
> -
> -/*
> - * Indicate that a whole-filesystem exclusive operation is running
> - * (device replace, resize, device add/delete, balance)
> - */
> -#define BTRFS_FS_EXCL_OP 16
> -
> -/*
> - * To info transaction_kthread we need an immediate commit so it doesn't
> - * need to wait for commit_interval
> - */
> -#define BTRFS_FS_NEED_ASYNC_COMMIT   17
> -
> -/*
> - * Indicate that balance has been set up from the ioctl and is in the main
> - * phase. The fs_info::balance_ctl is initialized.
> - */
> -#define BTRFS_FS_BALANCE_RUNNING 18
> +enum {
> + BTRFS_FS_BARRIER,
> + BTRFS_FS_CLOSING_START,
> + BTRFS_FS_CLOSING_DONE,
> + BTRFS_FS_LOG_RECOVERING,
> + BTRFS_FS_OPEN,
> + BTRFS_FS_QUOTA_ENABLED,
> + BTRFS_FS_UPDATE_UUID_TREE_GEN,
> + BTRFS_FS_CREATING_FREE_SPACE_TREE,
> + BTRFS_FS_BTREE_ERR,
> + BTRFS_FS_LOG1_ERR,
> + BTRFS_FS_LOG2_ERR,
> + BTRFS_FS_QUOTA_OVERRIDE,
> + /* Used to record internally whether fs has been frozen */
> + BTRFS_FS_FROZEN,
> + /*
> +  * Indicate that a whole-filesystem exclusive operation is running
> +  * (device replace, resize, device add/delete, balance)
> +  */
> + BTRFS_FS_EXCL_OP,
> + /*
> +  * To info transaction_kthread we need an immediate commit so it
> +  * doesn't need to wait for commit_interval
> +  */
> + BTRFS_FS_NEED_ASYNC_COMMIT,
> + /*
> +  * Indicate that balance has been set up from the ioctl and is in the
> +  * main phase. The fs_info::balance_ctl is initialized.
> +  */
> + BTRFS_FS_BALANCE_RUNNING,
> +};
>  
>  struct btrfs_fs_info {
>   u8 chunk_tree_uuid[BTRFS_UUID_SIZE];
> 



signature.asc
Description: OpenPGP digital signature


Re: [PATCH 2/9] btrfs: switch BTRFS_BLOCK_RSV_* to enums

2018-11-27 Thread Qu Wenruo


On 2018/11/28 上午3:53, David Sterba wrote:
> We can use simple enum for values that are not part of on-disk format:
> block reserve types.
> 
> Signed-off-by: David Sterba 

Reviewed-by: Qu Wenruo 

However, more comments would always be a good thing.

Thanks,
Qu

> ---
>  fs/btrfs/ctree.h | 19 ---
>  1 file changed, 12 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index f82ec5e41b0c..40c405d74a01 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -461,13 +461,18 @@ struct btrfs_space_info {
>   struct kobject *block_group_kobjs[BTRFS_NR_RAID_TYPES];
>  };
>  
> -#define  BTRFS_BLOCK_RSV_GLOBAL  1
> -#define  BTRFS_BLOCK_RSV_DELALLOC2
> -#define  BTRFS_BLOCK_RSV_TRANS   3
> -#define  BTRFS_BLOCK_RSV_CHUNK   4
> -#define  BTRFS_BLOCK_RSV_DELOPS  5
> -#define  BTRFS_BLOCK_RSV_EMPTY   6
> -#define  BTRFS_BLOCK_RSV_TEMP7
> +/*
> + * Types of block reserves
> + */
> +enum {
> + BTRFS_BLOCK_RSV_GLOBAL,
> + BTRFS_BLOCK_RSV_DELALLOC,
> + BTRFS_BLOCK_RSV_TRANS,
> + BTRFS_BLOCK_RSV_CHUNK,
> + BTRFS_BLOCK_RSV_DELOPS,
> + BTRFS_BLOCK_RSV_EMPTY,
> + BTRFS_BLOCK_RSV_TEMP,
> +};
>  
>  struct btrfs_block_rsv {
>   u64 size;
> 



signature.asc
Description: OpenPGP digital signature


Re: [PATCH 1/9] btrfs: switch BTRFS_FS_STATE_* to enums

2018-11-27 Thread Qu Wenruo


On 2018/11/28 上午3:53, David Sterba wrote:
> We can use simple enum for values that are not part of on-disk format:
> global filesystem states.
> 
> Signed-off-by: David Sterba 

Good comment.

Reviewed-by: Qu Wenruo 

Thanks,
Qu

> ---
>  fs/btrfs/ctree.h | 25 +++--
>  1 file changed, 19 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index a98507fa9192..f82ec5e41b0c 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -109,13 +109,26 @@ static inline unsigned long btrfs_chunk_item_size(int 
> num_stripes)
>  }
>  
>  /*
> - * File system states
> + * Runtime (in-memory) states of filesystem
>   */
> -#define BTRFS_FS_STATE_ERROR 0
> -#define BTRFS_FS_STATE_REMOUNTING1
> -#define BTRFS_FS_STATE_TRANS_ABORTED 2
> -#define BTRFS_FS_STATE_DEV_REPLACING 3
> -#define BTRFS_FS_STATE_DUMMY_FS_INFO 4
> +enum {
> + /* Global indicator of serious filesysystem errors */
> + BTRFS_FS_STATE_ERROR,
> + /*
> +  * Filesystem is being remounted, allow to skip some operations, like
> +  * defrag
> +  */
> + BTRFS_FS_STATE_REMOUNTING,
> + /* Track if the transaction abort has been reported */
> + BTRFS_FS_STATE_TRANS_ABORTED,
> + /*
> +  * Indicate that replace source or target device state is changed and
> +  * allow to block bio operations
> +  */
> + BTRFS_FS_STATE_DEV_REPLACING,
> + /* The btrfs_fs_info created for self-tests */
> + BTRFS_FS_STATE_DUMMY_FS_INFO,
> +};
>  
>  #define BTRFS_BACKREF_REV_MAX256
>  #define BTRFS_BACKREF_REV_SHIFT  56
> 



signature.asc
Description: OpenPGP digital signature


Re: Balance: invalid convert data profile raid10

2018-11-27 Thread Qu Wenruo


On 2018/11/28 上午5:16, Mikko Merikivi wrote:
> I seem unable to convert an existing btrfs device array to RAID 10.
> Since it's pretty much RAID 0 and 1 combined, and 5 and 6 are
> unstable, it's what I would like to use.
> 
> After I did tried this with 4.19.2-arch1-1-ARCH and btrfs-progs v4.19,
> I updated my system and tried btrfs balance again with this system
> information:
> [mikko@localhost lvdata]$ uname -a
> Linux localhost 4.19.4-arch1-1-ARCH #1 SMP PREEMPT Fri Nov 23 09:06:58
> UTC 2018 x86_64 GNU/Linux
> [mikko@localhost lvdata]$ btrfs --version
> btrfs-progs v4.19
> [mikko@localhost lvdata]$ sudo btrfs fi show
> Label: 'main1'  uuid: c7cbb9c3-8c55-45f1-b03c-48992efe2f11
> Total devices 1 FS bytes used 2.90TiB
> devid1 size 3.64TiB used 2.91TiB path /dev/mapper/main
> 
> Label: 'red'  uuid: f3c781a8-0f3e-4019-acbf-0b783cf566d0
> Total devices 2 FS bytes used 640.00KiB
> devid1 size 931.51GiB used 2.03GiB path /dev/mapper/red1
> devid2 size 931.51GiB used 2.03GiB path /dev/mapper/red2

RAID10 needs at least 4 devices.
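
(Purely illustrative, not from the original reply: assuming two more
LUKS-mapped devices, e.g. /dev/mapper/red3 and /dev/mapper/red4, were
prepared the same way, the conversion would then be expected to work:)

  sudo btrfs device add /dev/mapper/red3 /dev/mapper/red4 /mnt/red
  sudo btrfs balance start -dconvert=raid10 -mconvert=raid10 /mnt/red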

Thanks,
Qu

> [mikko@localhost lvdata]$ btrfs fi df /mnt/red/
> Data, RAID1: total=1.00GiB, used=512.00KiB
> System, RAID1: total=32.00MiB, used=16.00KiB
> Metadata, RAID1: total=1.00GiB, used=112.00KiB
> GlobalReserve, single: total=16.00MiB, used=0.00B
> 
> ---
> 
> Here are the steps I originally used:
> 
> [mikko@localhost lvdata]$ sudo cryptsetup luksFormat -s 512
> --use-random /dev/sdc
> [mikko@localhost lvdata]$ sudo cryptsetup luksFormat -s 512
> --use-random /dev/sdd
> [mikko@localhost lvdata]$ sudo cryptsetup luksOpen /dev/sdc red1
> [mikko@localhost lvdata]$ sudo cryptsetup luksOpen /dev/sdd red2
> [mikko@localhost lvdata]$ sudo mkfs.btrfs -L red /dev/mapper/red1
> btrfs-progs v4.19
> See http://btrfs.wiki.kernel.org for more information.
> 
> Label:  red
> UUID:   f3c781a8-0f3e-4019-acbf-0b783cf566d0
> Node size:  16384
> Sector size:4096
> Filesystem size:931.51GiB
> Block group profiles:
>   Data: single8.00MiB
>   Metadata: DUP   1.00GiB
>   System:   DUP   8.00MiB
> SSD detected:   no
> Incompat features:  extref, skinny-metadata
> Number of devices:  1
> Devices:
>IDSIZE  PATH
> 1   931.51GiB  /dev/mapper/red1
> 
> [mikko@localhost lvdata]$ sudo mount -t btrfs -o
> defaults,noatime,nodiratime,autodefrag,compress=lzo /dev/mapper/red1
> /mnt/red
> [mikko@localhost lvdata]$ sudo btrfs device add /dev/mapper/red2 /mnt/red
> [mikko@localhost lvdata]$ sudo btrfs balance start -dconvert=raid10
> -mconvert=raid10 /mnt/red
> ERROR: error during balancing '/mnt/red': Invalid argument
> There may be more info in syslog - try dmesg | tail
> code 1
> 
> [mikko@localhost lvdata]$ dmesg | tail
> [12026.263243] BTRFS info (device dm-1): disk space caching is enabled
> [12026.263244] BTRFS info (device dm-1): has skinny extents
> [12026.263245] BTRFS info (device dm-1): flagging fs with big metadata feature
> [12026.275153] BTRFS info (device dm-1): checking UUID tree
> [12195.431766] BTRFS info (device dm-1): enabling auto defrag
> [12195.431769] BTRFS info (device dm-1): use lzo compression, level 0
> [12195.431770] BTRFS info (device dm-1): disk space caching is enabled
> [12195.431771] BTRFS info (device dm-1): has skinny extents
> [12205.815941] BTRFS info (device dm-1): disk added /dev/mapper/red2
> [12744.788747] BTRFS error (device dm-1): balance: invalid convert
> data profile raid10
> 
> Converting to RAID 1 did work but what can I do to make it RAID 10?
> With the up-to-date system it still says "invalid convert data profile
> raid10".
> 



signature.asc
Description: OpenPGP digital signature


Re: [PATCH 9/9] btrfs: drop extra enum initialization where using defaults

2018-11-27 Thread Omar Sandoval
On Tue, Nov 27, 2018 at 08:53:59PM +0100, David Sterba wrote:
> The first auto-assigned value to enum is 0, we can use that and not
> initialize all members where the auto-increment does the same. This is
> used for values that are not part of on-disk format.

Reviewed-by: Omar Sandoval 

> Signed-off-by: David Sterba 
> ---
>  fs/btrfs/btrfs_inode.h |  2 +-
>  fs/btrfs/ctree.h   | 28 ++--
>  fs/btrfs/disk-io.h | 10 +-
>  fs/btrfs/qgroup.h  |  2 +-
>  fs/btrfs/sysfs.h   |  2 +-
>  fs/btrfs/transaction.h | 14 +++---
>  6 files changed, 29 insertions(+), 29 deletions(-)
> 
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index 4de321aee7a5..fc25607304f2 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -20,7 +20,7 @@
>   * new data the application may have written before commit.
>   */
>  enum {
> - BTRFS_INODE_ORDERED_DATA_CLOSE = 0,
> + BTRFS_INODE_ORDERED_DATA_CLOSE,
>   BTRFS_INODE_DUMMY,
>   BTRFS_INODE_IN_DEFRAG,
>   BTRFS_INODE_HAS_ASYNC_EXTENT,
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 4bb0ac3050ff..f1d1c6ba3aa1 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -334,7 +334,7 @@ struct btrfs_node {
>   * The slots array records the index of the item or block pointer
>   * used while walking the tree.
>   */
> -enum { READA_NONE = 0, READA_BACK, READA_FORWARD };
> +enum { READA_NONE, READA_BACK, READA_FORWARD };
>  struct btrfs_path {
>   struct extent_buffer *nodes[BTRFS_MAX_LEVEL];
>   int slots[BTRFS_MAX_LEVEL];
> @@ -532,18 +532,18 @@ struct btrfs_free_cluster {
>  };
>  
>  enum btrfs_caching_type {
> - BTRFS_CACHE_NO  = 0,
> - BTRFS_CACHE_STARTED = 1,
> - BTRFS_CACHE_FAST= 2,
> - BTRFS_CACHE_FINISHED= 3,
> - BTRFS_CACHE_ERROR   = 4,
> + BTRFS_CACHE_NO,
> + BTRFS_CACHE_STARTED,
> + BTRFS_CACHE_FAST,
> + BTRFS_CACHE_FINISHED,
> + BTRFS_CACHE_ERROR,
>  };
>  
>  enum btrfs_disk_cache_state {
> - BTRFS_DC_WRITTEN= 0,
> - BTRFS_DC_ERROR  = 1,
> - BTRFS_DC_CLEAR  = 2,
> - BTRFS_DC_SETUP  = 3,
> + BTRFS_DC_WRITTEN,
> + BTRFS_DC_ERROR,
> + BTRFS_DC_CLEAR,
> + BTRFS_DC_SETUP,
>  };
>  
>  struct btrfs_caching_control {
> @@ -2621,10 +2621,10 @@ static inline gfp_t btrfs_alloc_write_mask(struct 
> address_space *mapping)
>  /* extent-tree.c */
>  
>  enum btrfs_inline_ref_type {
> - BTRFS_REF_TYPE_INVALID = 0,
> - BTRFS_REF_TYPE_BLOCK =   1,
> - BTRFS_REF_TYPE_DATA =2,
> - BTRFS_REF_TYPE_ANY = 3,
> + BTRFS_REF_TYPE_INVALID,
> + BTRFS_REF_TYPE_BLOCK,
> + BTRFS_REF_TYPE_DATA,
> + BTRFS_REF_TYPE_ANY,
>  };
>  
>  int btrfs_get_extent_inline_ref_type(const struct extent_buffer *eb,
> diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
> index 4cccba22640f..987a64bc0c66 100644
> --- a/fs/btrfs/disk-io.h
> +++ b/fs/btrfs/disk-io.h
> @@ -21,11 +21,11 @@
>  #define BTRFS_BDEV_BLOCKSIZE (4096)
>  
>  enum btrfs_wq_endio_type {
> - BTRFS_WQ_ENDIO_DATA = 0,
> - BTRFS_WQ_ENDIO_METADATA = 1,
> - BTRFS_WQ_ENDIO_FREE_SPACE = 2,
> - BTRFS_WQ_ENDIO_RAID56 = 3,
> - BTRFS_WQ_ENDIO_DIO_REPAIR = 4,
> + BTRFS_WQ_ENDIO_DATA,
> + BTRFS_WQ_ENDIO_METADATA,
> + BTRFS_WQ_ENDIO_FREE_SPACE,
> + BTRFS_WQ_ENDIO_RAID56,
> + BTRFS_WQ_ENDIO_DIO_REPAIR,
>  };
>  
>  static inline u64 btrfs_sb_offset(int mirror)
> diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
> index d8f78f5ab854..e4e6ee44073a 100644
> --- a/fs/btrfs/qgroup.h
> +++ b/fs/btrfs/qgroup.h
> @@ -70,7 +70,7 @@ struct btrfs_qgroup_extent_record {
>   *   be converted into META_PERTRANS.
>   */
>  enum btrfs_qgroup_rsv_type {
> - BTRFS_QGROUP_RSV_DATA = 0,
> + BTRFS_QGROUP_RSV_DATA,
>   BTRFS_QGROUP_RSV_META_PERTRANS,
>   BTRFS_QGROUP_RSV_META_PREALLOC,
>   BTRFS_QGROUP_RSV_LAST,
> diff --git a/fs/btrfs/sysfs.h b/fs/btrfs/sysfs.h
> index c6ee600aff89..40716b357c1d 100644
> --- a/fs/btrfs/sysfs.h
> +++ b/fs/btrfs/sysfs.h
> @@ -9,7 +9,7 @@
>  extern u64 btrfs_debugfs_test;
>  
>  enum btrfs_feature_set {
> - FEAT_COMPAT = 0,
> + FEAT_COMPAT,
>   FEAT_COMPAT_RO,
>   FEAT_INCOMPAT,
>   FEAT_MAX
> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
> index 703d5116a2fc..f1ba78949d1b 100644
> --- a/fs/btrfs/transaction.h
> +++ b/fs/btrfs/transaction.h
> @@ -12,13 +12,13 @@
>  #include "ctree.h"
>  
>  enum btrfs_trans_state {
> - TRANS_STATE_RUNNING = 0,
> - TRANS_STATE_BLOCKED = 1,
> - TRANS_STATE_COMMIT_START= 2,
> - TRANS_STATE_COMMIT_DOING= 3,
> - TRANS_STATE_UNBLOCKED   = 4,
> - TRANS_STATE_COMPLETED   = 5,
> - TRANS_STATE_MAX = 6,
> + TRANS_STATE_RUNNING,
> + TRANS_STATE_BLOCKED,
> + TRANS_STATE_COMMIT_START,
> + 

Re: [PATCH 8/9] btrfs: switch BTRFS_ORDERED_* to enums

2018-11-27 Thread Omar Sandoval
On Tue, Nov 27, 2018 at 08:53:57PM +0100, David Sterba wrote:
> We can use simple enum for values that are not part of on-disk format:
> ordered extent flags.

Reviewed-by: Omar Sandoval 

> Signed-off-by: David Sterba 
> ---
>  fs/btrfs/ordered-data.h | 45 +++--
>  1 file changed, 25 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
> index b10e6765d88f..fb9a161f0215 100644
> --- a/fs/btrfs/ordered-data.h
> +++ b/fs/btrfs/ordered-data.h
> @@ -37,26 +37,31 @@ struct btrfs_ordered_sum {
>   * rbtree, just before waking any waiters.  It is used to indicate the
>   * IO is done and any metadata is inserted into the tree.
>   */
> -#define BTRFS_ORDERED_IO_DONE 0 /* set when all the pages are written */
> -
> -#define BTRFS_ORDERED_COMPLETE 1 /* set when removed from the tree */
> -
> -#define BTRFS_ORDERED_NOCOW 2 /* set when we want to write in place */
> -
> -#define BTRFS_ORDERED_COMPRESSED 3 /* writing a zlib compressed extent */
> -
> -#define BTRFS_ORDERED_PREALLOC 4 /* set when writing to preallocated extent 
> */
> -
> -#define BTRFS_ORDERED_DIRECT 5 /* set when we're doing DIO with this extent 
> */
> -
> -#define BTRFS_ORDERED_IOERR 6 /* We had an io error when writing this out */
> -
> -#define BTRFS_ORDERED_UPDATED_ISIZE 7 /* indicates whether this ordered 
> extent
> -* has done its due diligence in updating
> -* the isize. */
> -#define BTRFS_ORDERED_TRUNCATED 8 /* Set when we have to truncate an extent 
> */
> -
> -#define BTRFS_ORDERED_REGULAR 10 /* Regular IO for COW */
> +enum {
> + /* set when all the pages are written */
> + BTRFS_ORDERED_IO_DONE,
> + /* set when removed from the tree */
> + BTRFS_ORDERED_COMPLETE,
> + /* set when we want to write in place */
> + BTRFS_ORDERED_NOCOW,
> + /* writing a zlib compressed extent */
> + BTRFS_ORDERED_COMPRESSED,
> + /* set when writing to preallocated extent */
> + BTRFS_ORDERED_PREALLOC,
> + /* set when we're doing DIO with this extent */
> + BTRFS_ORDERED_DIRECT,
> + /* We had an io error when writing this out */
> + BTRFS_ORDERED_IOERR,
> + /*
> +  * indicates whether this ordered extent has done its due diligence in
> +  * updating the isize
> +  */
> + BTRFS_ORDERED_UPDATED_ISIZE,
> + /* Set when we have to truncate an extent */
> + BTRFS_ORDERED_TRUNCATED,
> + /* Regular IO for COW */
> + BTRFS_ORDERED_REGULAR,
> +};
>  
>  struct btrfs_ordered_extent {
>   /* logical offset in the file */
> -- 
> 2.19.1
> 


Re: [PATCH 7/9] btrfs: switch BTRFS_*_LOCK to enums

2018-11-27 Thread Omar Sandoval
On Tue, Nov 27, 2018 at 08:53:55PM +0100, David Sterba wrote:
> We can use simple enum for values that are not part of on-disk format:
> tree lock types.
> 
> Signed-off-by: David Sterba 
> ---
>  fs/btrfs/locking.h | 10 ++
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/locking.h b/fs/btrfs/locking.h
> index 29135def468e..684d0ef4faa4 100644
> --- a/fs/btrfs/locking.h
> +++ b/fs/btrfs/locking.h
> @@ -6,10 +6,12 @@
>  #ifndef BTRFS_LOCKING_H
>  #define BTRFS_LOCKING_H
>  
> -#define BTRFS_WRITE_LOCK 1
> -#define BTRFS_READ_LOCK 2
> -#define BTRFS_WRITE_LOCK_BLOCKING 3
> -#define BTRFS_READ_LOCK_BLOCKING 4
> +enum {
> + BTRFS_WRITE_LOCK,

See btrfs_set_path_blocking() and btrfs_release_path(); 0 means no lock,
so this needs to be BTRFS_WRITE_LOCK = 1. I imagine that lockdep would
catch this.
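
(A minimal sketch of the fix being suggested: keep 0 reserved for "no lock"
by pinning the first enumerator to 1. This is an assumption about a
follow-up version, not a posted patch.)

	enum {
		BTRFS_WRITE_LOCK = 1,
		BTRFS_READ_LOCK,
		BTRFS_WRITE_LOCK_BLOCKING,
		BTRFS_READ_LOCK_BLOCKING,
	};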

> + BTRFS_READ_LOCK,
> + BTRFS_WRITE_LOCK_BLOCKING,
> + BTRFS_READ_LOCK_BLOCKING,
> +};
>  
>  void btrfs_tree_lock(struct extent_buffer *eb);
>  void btrfs_tree_unlock(struct extent_buffer *eb);
> -- 
> 2.19.1
> 


Re: [PATCH 6/9] btrfs: switch EXTENT_FLAG_* to enums

2018-11-27 Thread Omar Sandoval
On Tue, Nov 27, 2018 at 08:53:52PM +0100, David Sterba wrote:
> We can use simple enum for values that are not part of on-disk format:
> extent map flags.

Reviewed-by: Omar Sandoval 

> Signed-off-by: David Sterba 
> ---
>  fs/btrfs/extent_map.h | 21 ++---
>  1 file changed, 14 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
> index 31977ffd6190..ef05a0121652 100644
> --- a/fs/btrfs/extent_map.h
> +++ b/fs/btrfs/extent_map.h
> @@ -11,13 +11,20 @@
>  #define EXTENT_MAP_INLINE ((u64)-2)
>  #define EXTENT_MAP_DELALLOC ((u64)-1)
>  
> -/* bits for the flags field */
> -#define EXTENT_FLAG_PINNED 0 /* this entry not yet on disk, don't free it */
> -#define EXTENT_FLAG_COMPRESSED 1
> -#define EXTENT_FLAG_PREALLOC 3 /* pre-allocated extent */
> -#define EXTENT_FLAG_LOGGING 4 /* Logging this extent */
> -#define EXTENT_FLAG_FILLING 5 /* Filling in a preallocated extent */
> -#define EXTENT_FLAG_FS_MAPPING 6 /* filesystem extent mapping type */
> +/* bits for the extent_map::flags field */
> +enum {
> + /* this entry not yet on disk, don't free it */
> + EXTENT_FLAG_PINNED,
> + EXTENT_FLAG_COMPRESSED,
> + /* pre-allocated extent */
> + EXTENT_FLAG_PREALLOC,
> + /* Logging this extent */
> + EXTENT_FLAG_LOGGING,
> + /* Filling in a preallocated extent */
> + EXTENT_FLAG_FILLING,
> + /* filesystem extent mapping type */
> + EXTENT_FLAG_FS_MAPPING,
> +};
>  
>  struct extent_map {
>   struct rb_node rb_node;
> -- 
> 2.19.1
> 


Re: [PATCH 5/9] btrfs: swtich EXTENT_BUFFER_* to enums

2018-11-27 Thread Omar Sandoval
On Tue, Nov 27, 2018 at 08:53:50PM +0100, David Sterba wrote:
> We can use simple enum for values that are not part of on-disk format:
> extent buffer flags;

This one has a "swtich" typo in the subject. Otherwise,

Reviewed-by: Omar Sandoval 

> Signed-off-by: David Sterba 
> ---
>  fs/btrfs/extent_io.h | 28 
>  1 file changed, 16 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
> index a1d3ea5a0d32..fd42492e62e5 100644
> --- a/fs/btrfs/extent_io.h
> +++ b/fs/btrfs/extent_io.h
> @@ -37,18 +37,22 @@
>  #define EXTENT_BIO_COMPRESSED 1
>  #define EXTENT_BIO_FLAG_SHIFT 16
>  
> -/* these are bit numbers for test/set bit */
> -#define EXTENT_BUFFER_UPTODATE 0
> -#define EXTENT_BUFFER_DIRTY 2
> -#define EXTENT_BUFFER_CORRUPT 3
> -#define EXTENT_BUFFER_READAHEAD 4/* this got triggered by readahead */
> -#define EXTENT_BUFFER_TREE_REF 5
> -#define EXTENT_BUFFER_STALE 6
> -#define EXTENT_BUFFER_WRITEBACK 7
> -#define EXTENT_BUFFER_READ_ERR 8/* read IO error */
> -#define EXTENT_BUFFER_UNMAPPED 9
> -#define EXTENT_BUFFER_IN_TREE 10
> -#define EXTENT_BUFFER_WRITE_ERR 11/* write IO error */
> +enum {
> + EXTENT_BUFFER_UPTODATE,
> + EXTENT_BUFFER_DIRTY,
> + EXTENT_BUFFER_CORRUPT,
> + /* this got triggered by readahead */
> + EXTENT_BUFFER_READAHEAD,
> + EXTENT_BUFFER_TREE_REF,
> + EXTENT_BUFFER_STALE,
> + EXTENT_BUFFER_WRITEBACK,
> + /* read IO error */
> + EXTENT_BUFFER_READ_ERR,
> + EXTENT_BUFFER_UNMAPPED,
> + EXTENT_BUFFER_IN_TREE,
> + /* write IO error */
> + EXTENT_BUFFER_WRITE_ERR,
> +};
>  
>  /* these are flags for __process_pages_contig */
>  #define PAGE_UNLOCK  (1 << 0)
> -- 
> 2.19.1
> 


Re: [PATCH 4/9] btrfs: switch BTRFS_ROOT_* to enums

2018-11-27 Thread Omar Sandoval
On Tue, Nov 27, 2018 at 08:53:48PM +0100, David Sterba wrote:
> We can use simple enum for values that are not part of on-disk format:
> root tree flags.

Reviewed-by: Omar Sandoval 

> Signed-off-by: David Sterba 
> ---
>  fs/btrfs/ctree.h | 33 +
>  1 file changed, 17 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 7176b95b40e7..4bb0ac3050ff 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1180,22 +1180,23 @@ struct btrfs_subvolume_writers {
>  /*
>   * The state of btrfs root
>   */
> -/*
> - * btrfs_record_root_in_trans is a multi-step process,
> - * and it can race with the balancing code.   But the
> - * race is very small, and only the first time the root
> - * is added to each transaction.  So IN_TRANS_SETUP
> - * is used to tell us when more checks are required
> - */
> -#define BTRFS_ROOT_IN_TRANS_SETUP0
> -#define BTRFS_ROOT_REF_COWS  1
> -#define BTRFS_ROOT_TRACK_DIRTY   2
> -#define BTRFS_ROOT_IN_RADIX  3
> -#define BTRFS_ROOT_ORPHAN_ITEM_INSERTED  4
> -#define BTRFS_ROOT_DEFRAG_RUNNING5
> -#define BTRFS_ROOT_FORCE_COW 6
> -#define BTRFS_ROOT_MULTI_LOG_TASKS   7
> -#define BTRFS_ROOT_DIRTY 8
> +enum {
> + /*
> +  * btrfs_record_root_in_trans is a multi-step process, and it can race
> +  * with the balancing code.   But the race is very small, and only the
> +  * first time the root is added to each transaction.  So IN_TRANS_SETUP
> +  * is used to tell us when more checks are required
> +  */
> + BTRFS_ROOT_IN_TRANS_SETUP,
> + BTRFS_ROOT_REF_COWS,
> + BTRFS_ROOT_TRACK_DIRTY,
> + BTRFS_ROOT_IN_RADIX,
> + BTRFS_ROOT_ORPHAN_ITEM_INSERTED,
> + BTRFS_ROOT_DEFRAG_RUNNING,
> + BTRFS_ROOT_FORCE_COW,
> + BTRFS_ROOT_MULTI_LOG_TASKS,
> + BTRFS_ROOT_DIRTY,
> +};
>  
>  /*
>   * in ram representation of the tree.  extent_root is used for all 
> allocations
> -- 
> 2.19.1
> 


Re: [PATCH 3/9] btrfs: switch BTRFS_FS_* to enums

2018-11-27 Thread Omar Sandoval
On Tue, Nov 27, 2018 at 08:53:45PM +0100, David Sterba wrote:
> We can use simple enum for values that are not part of on-disk format:
> internal filesystem states.

Hah, looks like we never had a bit 0 ;)

Reviewed-by: Omar Sandoval 

> Signed-off-by: David Sterba 
> ---
>  fs/btrfs/ctree.h | 63 
>  1 file changed, 31 insertions(+), 32 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 40c405d74a01..7176b95b40e7 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -757,38 +757,37 @@ struct btrfs_swapfile_pin {
>  
>  bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr);
>  
> -#define BTRFS_FS_BARRIER 1
> -#define BTRFS_FS_CLOSING_START   2
> -#define BTRFS_FS_CLOSING_DONE3
> -#define BTRFS_FS_LOG_RECOVERING  4
> -#define BTRFS_FS_OPEN5
> -#define BTRFS_FS_QUOTA_ENABLED   6
> -#define BTRFS_FS_UPDATE_UUID_TREE_GEN9
> -#define BTRFS_FS_CREATING_FREE_SPACE_TREE10
> -#define BTRFS_FS_BTREE_ERR   11
> -#define BTRFS_FS_LOG1_ERR12
> -#define BTRFS_FS_LOG2_ERR13
> -#define BTRFS_FS_QUOTA_OVERRIDE  14
> -/* Used to record internally whether fs has been frozen */
> -#define BTRFS_FS_FROZEN  15
> -
> -/*
> - * Indicate that a whole-filesystem exclusive operation is running
> - * (device replace, resize, device add/delete, balance)
> - */
> -#define BTRFS_FS_EXCL_OP 16
> -
> -/*
> - * To info transaction_kthread we need an immediate commit so it doesn't
> - * need to wait for commit_interval
> - */
> -#define BTRFS_FS_NEED_ASYNC_COMMIT   17
> -
> -/*
> - * Indicate that balance has been set up from the ioctl and is in the main
> - * phase. The fs_info::balance_ctl is initialized.
> - */
> -#define BTRFS_FS_BALANCE_RUNNING 18
> +enum {
> + BTRFS_FS_BARRIER,
> + BTRFS_FS_CLOSING_START,
> + BTRFS_FS_CLOSING_DONE,
> + BTRFS_FS_LOG_RECOVERING,
> + BTRFS_FS_OPEN,
> + BTRFS_FS_QUOTA_ENABLED,
> + BTRFS_FS_UPDATE_UUID_TREE_GEN,
> + BTRFS_FS_CREATING_FREE_SPACE_TREE,
> + BTRFS_FS_BTREE_ERR,
> + BTRFS_FS_LOG1_ERR,
> + BTRFS_FS_LOG2_ERR,
> + BTRFS_FS_QUOTA_OVERRIDE,
> + /* Used to record internally whether fs has been frozen */
> + BTRFS_FS_FROZEN,
> + /*
> +  * Indicate that a whole-filesystem exclusive operation is running
> +  * (device replace, resize, device add/delete, balance)
> +  */
> + BTRFS_FS_EXCL_OP,
> + /*
> +  * To info transaction_kthread we need an immediate commit so it
> +  * doesn't need to wait for commit_interval
> +  */
> + BTRFS_FS_NEED_ASYNC_COMMIT,
> + /*
> +  * Indicate that balance has been set up from the ioctl and is in the
> +  * main phase. The fs_info::balance_ctl is initialized.
> +  */
> + BTRFS_FS_BALANCE_RUNNING,
> +};
>  
>  struct btrfs_fs_info {
>   u8 chunk_tree_uuid[BTRFS_UUID_SIZE];
> -- 
> 2.19.1
> 


Re: [PATCH 2/9] btrfs: switch BTRFS_BLOCK_RSV_* to enums

2018-11-27 Thread Omar Sandoval
On Tue, Nov 27, 2018 at 08:53:43PM +0100, David Sterba wrote:
> We can use simple enum for values that are not part of on-disk format:
> block reserve types.

Reviewed-by: Omar Sandoval 

> Signed-off-by: David Sterba 
> ---
>  fs/btrfs/ctree.h | 19 ---
>  1 file changed, 12 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index f82ec5e41b0c..40c405d74a01 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -461,13 +461,18 @@ struct btrfs_space_info {
>   struct kobject *block_group_kobjs[BTRFS_NR_RAID_TYPES];
>  };
>  
> -#define  BTRFS_BLOCK_RSV_GLOBAL  1
> -#define  BTRFS_BLOCK_RSV_DELALLOC2
> -#define  BTRFS_BLOCK_RSV_TRANS   3
> -#define  BTRFS_BLOCK_RSV_CHUNK   4
> -#define  BTRFS_BLOCK_RSV_DELOPS  5
> -#define  BTRFS_BLOCK_RSV_EMPTY   6
> -#define  BTRFS_BLOCK_RSV_TEMP7
> +/*
> + * Types of block reserves
> + */
> +enum {
> + BTRFS_BLOCK_RSV_GLOBAL,
> + BTRFS_BLOCK_RSV_DELALLOC,
> + BTRFS_BLOCK_RSV_TRANS,
> + BTRFS_BLOCK_RSV_CHUNK,
> + BTRFS_BLOCK_RSV_DELOPS,
> + BTRFS_BLOCK_RSV_EMPTY,
> + BTRFS_BLOCK_RSV_TEMP,
> +};
>  
>  struct btrfs_block_rsv {
>   u64 size;
> -- 
> 2.19.1
> 


Re: [PATCH 1/9] btrfs: switch BTRFS_FS_STATE_* to enums

2018-11-27 Thread Omar Sandoval
On Tue, Nov 27, 2018 at 08:53:41PM +0100, David Sterba wrote:
> We can use simple enum for values that are not part of on-disk format:
> global filesystem states.

Reviewed-by: Omar Sandoval 

Some typos/wording suggestions below.

> Signed-off-by: David Sterba 
> ---
>  fs/btrfs/ctree.h | 25 +++--
>  1 file changed, 19 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index a98507fa9192..f82ec5e41b0c 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -109,13 +109,26 @@ static inline unsigned long btrfs_chunk_item_size(int 
> num_stripes)
>  }
>  
>  /*
> - * File system states
> + * Runtime (in-memory) states of filesystem
>   */
> -#define BTRFS_FS_STATE_ERROR 0
> -#define BTRFS_FS_STATE_REMOUNTING1
> -#define BTRFS_FS_STATE_TRANS_ABORTED 2
> -#define BTRFS_FS_STATE_DEV_REPLACING 3
> -#define BTRFS_FS_STATE_DUMMY_FS_INFO 4
> +enum {
> + /* Global indicator of serious filesysystem errors */

filesysystem -> filesystem

> + BTRFS_FS_STATE_ERROR,
> + /*
> +  * Filesystem is being remounted, allow to skip some operations, like
> +  * defrag
> +  */
> + BTRFS_FS_STATE_REMOUNTING,
> + /* Track if the transaction abort has been reported */

Which one is "the" transaction abort? This gives me the impression that
this is a flag on the transaction, but it's actually filesystem state.
Maybe "Track if a transaction abort has been reported on this
filesystem"?

> + BTRFS_FS_STATE_TRANS_ABORTED,
> + /*
> +  * Indicate that replace source or target device state is changed and
> +  * allow to block bio operations
> +  */

Again, this makes it sound like it's device state, but it's actually
filesystem state. How about "Bio operations should be blocked on this
filesystem because a source or target device is being destroyed as part
of a device replace"?

> + BTRFS_FS_STATE_DEV_REPLACING,
> + /* The btrfs_fs_info created for self-tests */
> + BTRFS_FS_STATE_DUMMY_FS_INFO,
> +};
>  
>  #define BTRFS_BACKREF_REV_MAX256
>  #define BTRFS_BACKREF_REV_SHIFT  56
> -- 
> 2.19.1
> 


Balance: invalid convert data profile raid10

2018-11-27 Thread Mikko Merikivi
I seem unable to convert an existing btrfs device array to RAID 10.
Since it's pretty much RAID 0 and 1 combined, and 5 and 6 are
unstable, it's what I would like to use.

After I first tried this with 4.19.2-arch1-1-ARCH and btrfs-progs v4.19,
I updated my system and tried btrfs balance again with this system
information:
[mikko@localhost lvdata]$ uname -a
Linux localhost 4.19.4-arch1-1-ARCH #1 SMP PREEMPT Fri Nov 23 09:06:58
UTC 2018 x86_64 GNU/Linux
[mikko@localhost lvdata]$ btrfs --version
btrfs-progs v4.19
[mikko@localhost lvdata]$ sudo btrfs fi show
Label: 'main1'  uuid: c7cbb9c3-8c55-45f1-b03c-48992efe2f11
Total devices 1 FS bytes used 2.90TiB
devid    1 size 3.64TiB used 2.91TiB path /dev/mapper/main

Label: 'red'  uuid: f3c781a8-0f3e-4019-acbf-0b783cf566d0
Total devices 2 FS bytes used 640.00KiB
devid    1 size 931.51GiB used 2.03GiB path /dev/mapper/red1
devid    2 size 931.51GiB used 2.03GiB path /dev/mapper/red2
[mikko@localhost lvdata]$ btrfs fi df /mnt/red/
Data, RAID1: total=1.00GiB, used=512.00KiB
System, RAID1: total=32.00MiB, used=16.00KiB
Metadata, RAID1: total=1.00GiB, used=112.00KiB
GlobalReserve, single: total=16.00MiB, used=0.00B

---

Here are the steps I originally used:

[mikko@localhost lvdata]$ sudo cryptsetup luksFormat -s 512
--use-random /dev/sdc
[mikko@localhost lvdata]$ sudo cryptsetup luksFormat -s 512
--use-random /dev/sdd
[mikko@localhost lvdata]$ sudo cryptsetup luksOpen /dev/sdc red1
[mikko@localhost lvdata]$ sudo cryptsetup luksOpen /dev/sdd red2
[mikko@localhost lvdata]$ sudo mkfs.btrfs -L red /dev/mapper/red1
btrfs-progs v4.19
See http://btrfs.wiki.kernel.org for more information.

Label:  red
UUID:   f3c781a8-0f3e-4019-acbf-0b783cf566d0
Node size:  16384
Sector size:4096
Filesystem size:931.51GiB
Block group profiles:
  Data:             single            8.00MiB
  Metadata:         DUP               1.00GiB
  System:           DUP               8.00MiB
SSD detected:   no
Incompat features:  extref, skinny-metadata
Number of devices:  1
Devices:
    ID        SIZE  PATH
1   931.51GiB  /dev/mapper/red1

[mikko@localhost lvdata]$ sudo mount -t btrfs -o
defaults,noatime,nodiratime,autodefrag,compress=lzo /dev/mapper/red1
/mnt/red
[mikko@localhost lvdata]$ sudo btrfs device add /dev/mapper/red2 /mnt/red
[mikko@localhost lvdata]$ sudo btrfs balance start -dconvert=raid10
-mconvert=raid10 /mnt/red
ERROR: error during balancing '/mnt/red': Invalid argument
There may be more info in syslog - try dmesg | tail
code 1

[mikko@localhost lvdata]$ dmesg | tail
[12026.263243] BTRFS info (device dm-1): disk space caching is enabled
[12026.263244] BTRFS info (device dm-1): has skinny extents
[12026.263245] BTRFS info (device dm-1): flagging fs with big metadata feature
[12026.275153] BTRFS info (device dm-1): checking UUID tree
[12195.431766] BTRFS info (device dm-1): enabling auto defrag
[12195.431769] BTRFS info (device dm-1): use lzo compression, level 0
[12195.431770] BTRFS info (device dm-1): disk space caching is enabled
[12195.431771] BTRFS info (device dm-1): has skinny extents
[12205.815941] BTRFS info (device dm-1): disk added /dev/mapper/red2
[12744.788747] BTRFS error (device dm-1): balance: invalid convert
data profile raid10

Converting to RAID 1 did work but what can I do to make it RAID 10?
With the up-to-date system it still says "invalid convert data profile
raid10".


Re: [PATCH 3/3] btrfs: document extent mapping assumptions in checksum

2018-11-27 Thread Noah Massey
On Tue, Nov 27, 2018 at 2:32 PM Nikolay Borisov  wrote:
>
> On 27.11.18 г. 21:08 ч., Noah Massey wrote:
> > On Tue, Nov 27, 2018 at 11:43 AM Nikolay Borisov  wrote:
> >>
> >> On 27.11.18 г. 18:00 ч., Johannes Thumshirn wrote:
> >>> Document why map_private_extent_buffer() cannot return '1' (i.e. the map
> >>> spans two pages) for the csum_tree_block() case.
> >>>
> >>> The current algorithm for detecting a page boundary crossing in
> >>> map_private_extent_buffer() will return a '1' *IFF* the product of the
> >>
> >> I think the word product must be replaced with 'sum', since product
> >> implies multiplication :)
> >>
> >
> > doesn't 'sum' imply addition? How about 'output'?
>
> It does, and the code indeed sums the values rather than multiplying them,
> hence my suggestion.
>

I'm sorry, I didn't phrase that well.

Since 'sum' already implies addition, it gets confusing with the
mathematical operators used later in the description. So, if an
objective noun is required, a generic term such as 'output' or
'result' reads more cleanly for me. OTOH, dropping that and creating
an actual expression

*IFF* the extent buffer's offset in the page + the offset passed in by
csum_tree_block() + the minimal length passed in by csum_tree_block()
- 1 > PAGE_SIZE.

is also straightforward.
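
Either way, the condition being described boils down to whether the first
and last byte of the mapped range land in the same page. A minimal sketch
with illustrative names (not the kernel implementation; PAGE_SIZE as in the
kernel):

/* Illustrative only: the map spans two pages iff the first and the last
 * byte of the requested range fall into different pages. */
static int spans_two_pages(unsigned long eb_offset_in_page,
                           unsigned long offset, unsigned long min_len)
{
        unsigned long first = (eb_offset_in_page + offset) / PAGE_SIZE;
        unsigned long last = (eb_offset_in_page + offset + min_len - 1) /
                             PAGE_SIZE;

        return first != last;
}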


Re: [PATCH 2/3] btrfs: wakeup cleaner thread when adding delayed iput

2018-11-27 Thread Josef Bacik
On Tue, Nov 27, 2018 at 07:59:42PM +, Chris Mason wrote:
> On 27 Nov 2018, at 14:54, Josef Bacik wrote:
> 
> > On Tue, Nov 27, 2018 at 10:26:15AM +0200, Nikolay Borisov wrote:
> >>
> >>
> >> On 21.11.18 г. 21:09 ч., Josef Bacik wrote:
> >>> The cleaner thread usually takes care of delayed iputs, with the
> >>> exception of the btrfs_end_transaction_throttle path.  The cleaner
> >>> thread only gets woken up every 30 seconds, so instead wake it up to 
> >>> do
> >>> its work so that we can free up that space as quickly as possible.
> >>
> >> Have you done any measurements how this affects the overall system. I
> >> suspect this introduces a lot of noise since now we are going to be
> >> doing a thread wakeup on every iput, does this give a chance to have
> >> nice, large batches of iputs that  the cost of wake up can be 
> >> amortized
> >> across?
> >
> > I ran the whole patchset with our A/B testing stuff and the patchset 
> > was a 5%
> > win overall, so I'm inclined to think it's fine.  Thanks,
> 
> It's a good point though, a delayed wakeup may be less overhead.

Sure, but how do we go about that without it sometimes messing up?  In practice
the only time we're doing this is at the end of finish_ordered_io, so likely to
not be a constant stream.  I suppose since we have places where we force it to
run that we don't really need this.  IDK, I'm fine with dropping it.  Thanks,

Josef


Re: [PATCH 3/3] btrfs: replace cleaner_delayed_iput_mutex with a waitqueue

2018-11-27 Thread Josef Bacik
On Tue, Nov 27, 2018 at 10:29:57AM +0200, Nikolay Borisov wrote:
> 
> 
> On 21.11.18 г. 21:09 ч., Josef Bacik wrote:
> > The throttle path doesn't take cleaner_delayed_iput_mutex, which means
> 
> Which one is the throttle path? btrfs_end_transaction_throttle is only
> called during snapshot drop and relocation? What scenario triggered your
> observation and prompted this fix?
> 

One of my enospc tests runs snapshot creation/deletion in the background.

> > we could think we're done flushing iputs in the data space reservation
> > path when we could have a throttler doing an iput.  There's no real
> > reason to serialize the delayed iput flushing, so instead of taking the
> > cleaner_delayed_iput_mutex whenever we flush the delayed iputs just
> > replace it with an atomic counter and a waitqueue.  This removes the
> > short (or long depending on how big the inode is) window where we think
> > there are no more pending iputs when there really are some.
> > 
> > Signed-off-by: Josef Bacik 
> > ---
> >  fs/btrfs/ctree.h   |  4 +++-
> >  fs/btrfs/disk-io.c |  5 ++---
> >  fs/btrfs/extent-tree.c |  9 +
> >  fs/btrfs/inode.c   | 21 +
> >  4 files changed, 31 insertions(+), 8 deletions(-)
> > 
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index 709de7471d86..a835fe7076eb 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -912,7 +912,8 @@ struct btrfs_fs_info {
> >  
> > spinlock_t delayed_iput_lock;
> > struct list_head delayed_iputs;
> > -   struct mutex cleaner_delayed_iput_mutex;
> > +   atomic_t nr_delayed_iputs;
> > +   wait_queue_head_t delayed_iputs_wait;
> >  
> > /* this protects tree_mod_seq_list */
> > spinlock_t tree_mod_seq_lock;
> > @@ -3237,6 +3238,7 @@ int btrfs_orphan_cleanup(struct btrfs_root *root);
> >  int btrfs_cont_expand(struct inode *inode, loff_t oldsize, loff_t size);
> >  void btrfs_add_delayed_iput(struct inode *inode);
> >  void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info);
> > +int btrfs_wait_on_delayed_iputs(struct btrfs_fs_info *fs_info);
> >  int btrfs_prealloc_file_range(struct inode *inode, int mode,
> >   u64 start, u64 num_bytes, u64 min_size,
> >   loff_t actual_len, u64 *alloc_hint);
> > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > index c5918ff8241b..3f81dfaefa32 100644
> > --- a/fs/btrfs/disk-io.c
> > +++ b/fs/btrfs/disk-io.c
> > @@ -1692,9 +1692,7 @@ static int cleaner_kthread(void *arg)
> > goto sleep;
> > }
> >  
> > -   mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
> > btrfs_run_delayed_iputs(fs_info);
> > -   mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
> >  
> > again = btrfs_clean_one_deleted_snapshot(root);
> > mutex_unlock(&fs_info->cleaner_mutex);
> > @@ -2651,7 +2649,6 @@ int open_ctree(struct super_block *sb,
> > mutex_init(&fs_info->delete_unused_bgs_mutex);
> > mutex_init(&fs_info->reloc_mutex);
> > mutex_init(&fs_info->delalloc_root_mutex);
> > -   mutex_init(&fs_info->cleaner_delayed_iput_mutex);
> > seqlock_init(&fs_info->profiles_lock);
> >  
> > INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots);
> > @@ -2673,6 +2670,7 @@ int open_ctree(struct super_block *sb,
> > atomic_set(&fs_info->defrag_running, 0);
> > atomic_set(&fs_info->qgroup_op_seq, 0);
> > atomic_set(&fs_info->reada_works_cnt, 0);
> > +   atomic_set(&fs_info->nr_delayed_iputs, 0);
> > atomic64_set(&fs_info->tree_mod_seq, 0);
> > fs_info->sb = sb;
> > fs_info->max_inline = BTRFS_DEFAULT_MAX_INLINE;
> > @@ -2750,6 +2748,7 @@ int open_ctree(struct super_block *sb,
> > init_waitqueue_head(&fs_info->transaction_wait);
> > init_waitqueue_head(&fs_info->transaction_blocked_wait);
> > init_waitqueue_head(&fs_info->async_submit_wait);
> > +   init_waitqueue_head(&fs_info->delayed_iputs_wait);
> >  
> > INIT_LIST_HEAD(&fs_info->pinned_chunks);
> >  
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index 3a90dc1d6b31..36f43876be22 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -4272,8 +4272,9 @@ int btrfs_alloc_data_chunk_ondemand(struct 
> > btrfs_inode *inode, u64 bytes)
> >  * operations. Wait for it to finish so that
> >  * more space is released.
> >  */
> > -   mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
> > -   mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
> > +   ret = btrfs_wait_on_delayed_iputs(fs_info);
> > +   if (ret)
> > +   return ret;
> > goto again;
> > } else {
> > btrfs_end_transaction(trans);
> > @@ -4838,9 +4839,9 @@ static int may_commit_transaction(struct 
> > btrfs_fs_info *fs_info,
> 
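
For context, a sketch of how the new counter and waitqueue are meant to pair
up (illustrative only, not necessarily the exact hunks; the helper name
account_delayed_iput_done is made up):

int btrfs_wait_on_delayed_iputs(struct btrfs_fs_info *fs_info)
{
        int ret = wait_event_killable(fs_info->delayed_iputs_wait,
                        atomic_read(&fs_info->nr_delayed_iputs) == 0);

        if (ret)
                return -EINTR;
        return 0;
}

/* called once per delayed iput that has actually been processed */
static void account_delayed_iput_done(struct btrfs_fs_info *fs_info)
{
        if (atomic_dec_and_test(&fs_info->nr_delayed_iputs))
                wake_up(&fs_info->delayed_iputs_wait);
}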

Re: [PATCH 2/3] btrfs: wakeup cleaner thread when adding delayed iput

2018-11-27 Thread Chris Mason
On 27 Nov 2018, at 14:54, Josef Bacik wrote:

> On Tue, Nov 27, 2018 at 10:26:15AM +0200, Nikolay Borisov wrote:
>>
>>
>> On 21.11.18 г. 21:09 ч., Josef Bacik wrote:
>>> The cleaner thread usually takes care of delayed iputs, with the
>>> exception of the btrfs_end_transaction_throttle path.  The cleaner
>>> thread only gets woken up every 30 seconds, so instead wake it up to 
>>> do
>>> its work so that we can free up that space as quickly as possible.
>>
>> Have you done any measurements how this affects the overall system. I
>> suspect this introduces a lot of noise since now we are going to be
>> doing a thread wakeup on every iput, does this give a chance to have
>> nice, large batches of iputs that  the cost of wake up can be 
>> amortized
>> across?
>
> I ran the whole patchset with our A/B testing stuff and the patchset 
> was a 5%
> win overall, so I'm inclined to think it's fine.  Thanks,

It's a good point though, a delayed wakeup may be less overhead.
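
One possible shape of such a delayed wakeup (illustrative only; the
iput_wakeup_work member is hypothetical and does not exist in struct
btrfs_fs_info):

static void btrfs_delayed_iput_wakeup(struct work_struct *work)
{
        struct btrfs_fs_info *fs_info = container_of(to_delayed_work(work),
                        struct btrfs_fs_info, iput_wakeup_work);

        wake_up_process(fs_info->cleaner_kthread);
}

static void kick_cleaner_batched(struct btrfs_fs_info *fs_info)
{
        /* coalesce all iputs queued within ~10ms (arbitrary) into one wakeup */
        schedule_delayed_work(&fs_info->iput_wakeup_work, msecs_to_jiffies(10));
}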

-chris


Re: [PATCH] btrfs: only run delayed refs if we're committing

2018-11-27 Thread Josef Bacik
On Tue, Nov 27, 2018 at 07:43:39PM +, Filipe Manana wrote:
> On Tue, Nov 27, 2018 at 7:22 PM Josef Bacik  wrote:
> >
> > On Fri, Nov 23, 2018 at 04:59:32PM +, Filipe Manana wrote:
> > > On Thu, Nov 22, 2018 at 12:35 AM Josef Bacik  wrote:
> > > >
> > > > I noticed in a giant dbench run that we spent a lot of time on lock
> > > > contention while running transaction commit.  This is because dbench
> > > > results in a lot of fsync()'s that do a btrfs_transaction_commit(), and
> > > > they all run the delayed refs first thing, so they all contend with
> > > > each other.  This leads to seconds of 0 throughput.  Change this to only
> > > > run the delayed refs if we're the ones committing the transaction.  This
> > > > makes the latency go away and we get no more lock contention.
> > >
> > > Can you share the following in the changelog please?
> > >
> > > 1) How did you ran dbench (parameters, config).
> > >
> > > 2) What results did you get before and after this change. So that we all 
> > > get
> > > an idea of how good the impact is.
> > >
> > > While the reduced contention makes all sense and seems valid, I'm not
> > > sure this is always a win.
> > > It certainly is when multiple tasks are calling
> > > btrfs_commit_transaction() simultaneously, but,
> > > what about when only one does it?
> > >
> > > By running all delayed references inside the critical section of the
> > > transaction commit
> > > (state == TRANS_STATE_COMMIT_START), instead of running most of them
> > > outside/before,
> > > we will be blocking for a longer time other tasks calling
> > > btrfs_start_transaction() (which is used
> > > a lot - creating files, unlinking files, adding links, etc, and even 
> > > fsync).
> > >
> > > Won't there be any other types of workload and tests other than dbench
> > > that can get increased
> > > latency and/or smaller throughput?
> > >
> > > I find that sort of information useful to have in the changelog. If
> > > you verified that or you think
> > > it's irrelevant to measure/consider, it would be great to have it
> > > mentioned in the changelog
> > > (and explained).
> > >
> >
> > Yeah I thought about the delayed refs being run in the critical section now,
> > that's not awesome.  I'll drop this for now, I think just having a mutex 
> > around
> > running delayed refs will be good enough, since we want people who care 
> > about
> > flushing delayed refs to wait around for that to finish happening.  Thanks,
> 
> Well, I think we can have a solution that doesn't bring such trade-off
> nor introducing a mutex.
> We could do like what is currently done for writing space caches, to
> make sure only the first task
> calling commit transaction does the work and all others do nothing
> except waiting for the commit to finish:
> 
> btrfs_commit_transaction()
>if (!test_and_set_bit(BTRFS_TRANS_COMMIT_START, &cur_trans->flags)) {
>run delayed refs before entering critical section
>}
> 

That was my first inclination but then one of the other committers goes past
this and gets into the start TRANS_STATE_COMMIT_START place and has to wait for
the guy running the delayed refs to finish before moving forward, essentially
extending the time in the critical section for the same reason.  The mutex means
that if 9000 people try to commit the transaction, 1 guy gets to run the delayed
refs and everybody else waits for him to finish, and in the meantime the
critical section is small.  Thanks,

Josef


Re: [PATCH 2/3] btrfs: wakeup cleaner thread when adding delayed iput

2018-11-27 Thread Josef Bacik
On Tue, Nov 27, 2018 at 10:26:15AM +0200, Nikolay Borisov wrote:
> 
> 
> On 21.11.18 г. 21:09 ч., Josef Bacik wrote:
> > The cleaner thread usually takes care of delayed iputs, with the
> > exception of the btrfs_end_transaction_throttle path.  The cleaner
> > thread only gets woken up every 30 seconds, so instead wake it up to do
> > its work so that we can free up that space as quickly as possible.
> 
> Have you done any measurements how this affects the overall system. I
> suspect this introduces a lot of noise since now we are going to be
> doing a thread wakeup on every iput, does this give a chance to have
> nice, large batches of iputs that  the cost of wake up can be amortized
> across?

I ran the whole patchset with our A/B testing stuff and the patchset was a 5%
win overall, so I'm inclined to think it's fine.  Thanks,

Josef
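
For reference, the core of the change being measured is just an immediate
wakeup when a delayed iput is queued; a minimal sketch (hypothetical helper
name, not the exact patch hunk):

static void queue_delayed_iput_and_kick(struct btrfs_fs_info *fs_info,
                                        struct btrfs_inode *binode)
{
        spin_lock(&fs_info->delayed_iput_lock);
        list_add_tail(&binode->delayed_iput, &fs_info->delayed_iputs);
        spin_unlock(&fs_info->delayed_iput_lock);

        /* don't leave the iput for the cleaner's periodic ~30s wakeup */
        wake_up_process(fs_info->cleaner_kthread);
}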


[PATCH 9/9] btrfs: drop extra enum initialization where using defaults

2018-11-27 Thread David Sterba
The first value auto-assigned to an enum is 0; we can use that and not
initialize all members where the auto-increment does the same. This is
used for values that are not part of on-disk format.
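
As a trivial standalone illustration of the default numbering relied on here
(example names only):

enum { EXAMPLE_FIRST, EXAMPLE_SECOND, EXAMPLE_THIRD };

_Static_assert(EXAMPLE_FIRST == 0 && EXAMPLE_SECOND == 1 &&
               EXAMPLE_THIRD == 2,
               "enum auto-increment starts at 0 and adds 1 per member");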

Signed-off-by: David Sterba 
---
 fs/btrfs/btrfs_inode.h |  2 +-
 fs/btrfs/ctree.h   | 28 ++--
 fs/btrfs/disk-io.h | 10 +-
 fs/btrfs/qgroup.h  |  2 +-
 fs/btrfs/sysfs.h   |  2 +-
 fs/btrfs/transaction.h | 14 +++---
 6 files changed, 29 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 4de321aee7a5..fc25607304f2 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -20,7 +20,7 @@
  * new data the application may have written before commit.
  */
 enum {
-   BTRFS_INODE_ORDERED_DATA_CLOSE = 0,
+   BTRFS_INODE_ORDERED_DATA_CLOSE,
BTRFS_INODE_DUMMY,
BTRFS_INODE_IN_DEFRAG,
BTRFS_INODE_HAS_ASYNC_EXTENT,
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4bb0ac3050ff..f1d1c6ba3aa1 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -334,7 +334,7 @@ struct btrfs_node {
  * The slots array records the index of the item or block pointer
  * used while walking the tree.
  */
-enum { READA_NONE = 0, READA_BACK, READA_FORWARD };
+enum { READA_NONE, READA_BACK, READA_FORWARD };
 struct btrfs_path {
struct extent_buffer *nodes[BTRFS_MAX_LEVEL];
int slots[BTRFS_MAX_LEVEL];
@@ -532,18 +532,18 @@ struct btrfs_free_cluster {
 };
 
 enum btrfs_caching_type {
-   BTRFS_CACHE_NO  = 0,
-   BTRFS_CACHE_STARTED = 1,
-   BTRFS_CACHE_FAST= 2,
-   BTRFS_CACHE_FINISHED= 3,
-   BTRFS_CACHE_ERROR   = 4,
+   BTRFS_CACHE_NO,
+   BTRFS_CACHE_STARTED,
+   BTRFS_CACHE_FAST,
+   BTRFS_CACHE_FINISHED,
+   BTRFS_CACHE_ERROR,
 };
 
 enum btrfs_disk_cache_state {
-   BTRFS_DC_WRITTEN= 0,
-   BTRFS_DC_ERROR  = 1,
-   BTRFS_DC_CLEAR  = 2,
-   BTRFS_DC_SETUP  = 3,
+   BTRFS_DC_WRITTEN,
+   BTRFS_DC_ERROR,
+   BTRFS_DC_CLEAR,
+   BTRFS_DC_SETUP,
 };
 
 struct btrfs_caching_control {
@@ -2621,10 +2621,10 @@ static inline gfp_t btrfs_alloc_write_mask(struct 
address_space *mapping)
 /* extent-tree.c */
 
 enum btrfs_inline_ref_type {
-   BTRFS_REF_TYPE_INVALID = 0,
-   BTRFS_REF_TYPE_BLOCK =   1,
-   BTRFS_REF_TYPE_DATA =2,
-   BTRFS_REF_TYPE_ANY = 3,
+   BTRFS_REF_TYPE_INVALID,
+   BTRFS_REF_TYPE_BLOCK,
+   BTRFS_REF_TYPE_DATA,
+   BTRFS_REF_TYPE_ANY,
 };
 
 int btrfs_get_extent_inline_ref_type(const struct extent_buffer *eb,
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index 4cccba22640f..987a64bc0c66 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -21,11 +21,11 @@
 #define BTRFS_BDEV_BLOCKSIZE   (4096)
 
 enum btrfs_wq_endio_type {
-   BTRFS_WQ_ENDIO_DATA = 0,
-   BTRFS_WQ_ENDIO_METADATA = 1,
-   BTRFS_WQ_ENDIO_FREE_SPACE = 2,
-   BTRFS_WQ_ENDIO_RAID56 = 3,
-   BTRFS_WQ_ENDIO_DIO_REPAIR = 4,
+   BTRFS_WQ_ENDIO_DATA,
+   BTRFS_WQ_ENDIO_METADATA,
+   BTRFS_WQ_ENDIO_FREE_SPACE,
+   BTRFS_WQ_ENDIO_RAID56,
+   BTRFS_WQ_ENDIO_DIO_REPAIR,
 };
 
 static inline u64 btrfs_sb_offset(int mirror)
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index d8f78f5ab854..e4e6ee44073a 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -70,7 +70,7 @@ struct btrfs_qgroup_extent_record {
  * be converted into META_PERTRANS.
  */
 enum btrfs_qgroup_rsv_type {
-   BTRFS_QGROUP_RSV_DATA = 0,
+   BTRFS_QGROUP_RSV_DATA,
BTRFS_QGROUP_RSV_META_PERTRANS,
BTRFS_QGROUP_RSV_META_PREALLOC,
BTRFS_QGROUP_RSV_LAST,
diff --git a/fs/btrfs/sysfs.h b/fs/btrfs/sysfs.h
index c6ee600aff89..40716b357c1d 100644
--- a/fs/btrfs/sysfs.h
+++ b/fs/btrfs/sysfs.h
@@ -9,7 +9,7 @@
 extern u64 btrfs_debugfs_test;
 
 enum btrfs_feature_set {
-   FEAT_COMPAT = 0,
+   FEAT_COMPAT,
FEAT_COMPAT_RO,
FEAT_INCOMPAT,
FEAT_MAX
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 703d5116a2fc..f1ba78949d1b 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -12,13 +12,13 @@
 #include "ctree.h"
 
 enum btrfs_trans_state {
-   TRANS_STATE_RUNNING = 0,
-   TRANS_STATE_BLOCKED = 1,
-   TRANS_STATE_COMMIT_START= 2,
-   TRANS_STATE_COMMIT_DOING= 3,
-   TRANS_STATE_UNBLOCKED   = 4,
-   TRANS_STATE_COMPLETED   = 5,
-   TRANS_STATE_MAX = 6,
+   TRANS_STATE_RUNNING,
+   TRANS_STATE_BLOCKED,
+   TRANS_STATE_COMMIT_START,
+   TRANS_STATE_COMMIT_DOING,
+   TRANS_STATE_UNBLOCKED,
+   TRANS_STATE_COMPLETED,
+   TRANS_STATE_MAX,
 };
 
 #define BTRFS_TRANS_HAVE_FREE_BGS  0
-- 
2.19.1



[PATCH 4/9] btrfs: switch BTRFS_ROOT_* to enums

2018-11-27 Thread David Sterba
We can use simple enum for values that are not part of on-disk format:
root tree flags.

Signed-off-by: David Sterba 
---
 fs/btrfs/ctree.h | 33 +
 1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7176b95b40e7..4bb0ac3050ff 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1180,22 +1180,23 @@ struct btrfs_subvolume_writers {
 /*
  * The state of btrfs root
  */
-/*
- * btrfs_record_root_in_trans is a multi-step process,
- * and it can race with the balancing code.   But the
- * race is very small, and only the first time the root
- * is added to each transaction.  So IN_TRANS_SETUP
- * is used to tell us when more checks are required
- */
-#define BTRFS_ROOT_IN_TRANS_SETUP  0
-#define BTRFS_ROOT_REF_COWS1
-#define BTRFS_ROOT_TRACK_DIRTY 2
-#define BTRFS_ROOT_IN_RADIX3
-#define BTRFS_ROOT_ORPHAN_ITEM_INSERTED4
-#define BTRFS_ROOT_DEFRAG_RUNNING  5
-#define BTRFS_ROOT_FORCE_COW   6
-#define BTRFS_ROOT_MULTI_LOG_TASKS 7
-#define BTRFS_ROOT_DIRTY   8
+enum {
+   /*
+* btrfs_record_root_in_trans is a multi-step process, and it can race
+* with the balancing code.   But the race is very small, and only the
+* first time the root is added to each transaction.  So IN_TRANS_SETUP
+* is used to tell us when more checks are required
+*/
+   BTRFS_ROOT_IN_TRANS_SETUP,
+   BTRFS_ROOT_REF_COWS,
+   BTRFS_ROOT_TRACK_DIRTY,
+   BTRFS_ROOT_IN_RADIX,
+   BTRFS_ROOT_ORPHAN_ITEM_INSERTED,
+   BTRFS_ROOT_DEFRAG_RUNNING,
+   BTRFS_ROOT_FORCE_COW,
+   BTRFS_ROOT_MULTI_LOG_TASKS,
+   BTRFS_ROOT_DIRTY,
+};
 
 /*
  * in ram representation of the tree.  extent_root is used for all allocations
-- 
2.19.1



[PATCH 3/9] btrfs: switch BTRFS_FS_* to enums

2018-11-27 Thread David Sterba
We can use simple enum for values that are not part of on-disk format:
internal filesystem states.

Signed-off-by: David Sterba 
---
 fs/btrfs/ctree.h | 63 
 1 file changed, 31 insertions(+), 32 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 40c405d74a01..7176b95b40e7 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -757,38 +757,37 @@ struct btrfs_swapfile_pin {
 
 bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr);
 
-#define BTRFS_FS_BARRIER   1
-#define BTRFS_FS_CLOSING_START 2
-#define BTRFS_FS_CLOSING_DONE  3
-#define BTRFS_FS_LOG_RECOVERING4
-#define BTRFS_FS_OPEN  5
-#define BTRFS_FS_QUOTA_ENABLED 6
-#define BTRFS_FS_UPDATE_UUID_TREE_GEN  9
-#define BTRFS_FS_CREATING_FREE_SPACE_TREE  10
-#define BTRFS_FS_BTREE_ERR 11
-#define BTRFS_FS_LOG1_ERR  12
-#define BTRFS_FS_LOG2_ERR  13
-#define BTRFS_FS_QUOTA_OVERRIDE14
-/* Used to record internally whether fs has been frozen */
-#define BTRFS_FS_FROZEN15
-
-/*
- * Indicate that a whole-filesystem exclusive operation is running
- * (device replace, resize, device add/delete, balance)
- */
-#define BTRFS_FS_EXCL_OP   16
-
-/*
- * To info transaction_kthread we need an immediate commit so it doesn't
- * need to wait for commit_interval
- */
-#define BTRFS_FS_NEED_ASYNC_COMMIT 17
-
-/*
- * Indicate that balance has been set up from the ioctl and is in the main
- * phase. The fs_info::balance_ctl is initialized.
- */
-#define BTRFS_FS_BALANCE_RUNNING   18
+enum {
+   BTRFS_FS_BARRIER,
+   BTRFS_FS_CLOSING_START,
+   BTRFS_FS_CLOSING_DONE,
+   BTRFS_FS_LOG_RECOVERING,
+   BTRFS_FS_OPEN,
+   BTRFS_FS_QUOTA_ENABLED,
+   BTRFS_FS_UPDATE_UUID_TREE_GEN,
+   BTRFS_FS_CREATING_FREE_SPACE_TREE,
+   BTRFS_FS_BTREE_ERR,
+   BTRFS_FS_LOG1_ERR,
+   BTRFS_FS_LOG2_ERR,
+   BTRFS_FS_QUOTA_OVERRIDE,
+   /* Used to record internally whether fs has been frozen */
+   BTRFS_FS_FROZEN,
+   /*
+* Indicate that a whole-filesystem exclusive operation is running
+* (device replace, resize, device add/delete, balance)
+*/
+   BTRFS_FS_EXCL_OP,
+   /*
+* To info transaction_kthread we need an immediate commit so it
+* doesn't need to wait for commit_interval
+*/
+   BTRFS_FS_NEED_ASYNC_COMMIT,
+   /*
+* Indicate that balance has been set up from the ioctl and is in the
+* main phase. The fs_info::balance_ctl is initialized.
+*/
+   BTRFS_FS_BALANCE_RUNNING,
+};
 
 struct btrfs_fs_info {
u8 chunk_tree_uuid[BTRFS_UUID_SIZE];
-- 
2.19.1



[PATCH 5/9] btrfs: switch EXTENT_BUFFER_* to enums

2018-11-27 Thread David Sterba
We can use simple enum for values that are not part of on-disk format:
extent buffer flags;

Signed-off-by: David Sterba 
---
 fs/btrfs/extent_io.h | 28 
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index a1d3ea5a0d32..fd42492e62e5 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -37,18 +37,22 @@
 #define EXTENT_BIO_COMPRESSED 1
 #define EXTENT_BIO_FLAG_SHIFT 16
 
-/* these are bit numbers for test/set bit */
-#define EXTENT_BUFFER_UPTODATE 0
-#define EXTENT_BUFFER_DIRTY 2
-#define EXTENT_BUFFER_CORRUPT 3
-#define EXTENT_BUFFER_READAHEAD 4  /* this got triggered by readahead */
-#define EXTENT_BUFFER_TREE_REF 5
-#define EXTENT_BUFFER_STALE 6
-#define EXTENT_BUFFER_WRITEBACK 7
-#define EXTENT_BUFFER_READ_ERR 8/* read IO error */
-#define EXTENT_BUFFER_UNMAPPED 9
-#define EXTENT_BUFFER_IN_TREE 10
-#define EXTENT_BUFFER_WRITE_ERR 11/* write IO error */
+enum {
+   EXTENT_BUFFER_UPTODATE,
+   EXTENT_BUFFER_DIRTY,
+   EXTENT_BUFFER_CORRUPT,
+   /* this got triggered by readahead */
+   EXTENT_BUFFER_READAHEAD,
+   EXTENT_BUFFER_TREE_REF,
+   EXTENT_BUFFER_STALE,
+   EXTENT_BUFFER_WRITEBACK,
+   /* read IO error */
+   EXTENT_BUFFER_READ_ERR,
+   EXTENT_BUFFER_UNMAPPED,
+   EXTENT_BUFFER_IN_TREE,
+   /* write IO error */
+   EXTENT_BUFFER_WRITE_ERR,
+};
 
 /* these are flags for __process_pages_contig */
 #define PAGE_UNLOCK(1 << 0)
-- 
2.19.1



[PATCH 8/9] btrfs: switch BTRFS_ORDERED_* to enums

2018-11-27 Thread David Sterba
We can use simple enum for values that are not part of on-disk format:
ordered extent flags.

Signed-off-by: David Sterba 
---
 fs/btrfs/ordered-data.h | 45 +++--
 1 file changed, 25 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index b10e6765d88f..fb9a161f0215 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -37,26 +37,31 @@ struct btrfs_ordered_sum {
  * rbtree, just before waking any waiters.  It is used to indicate the
  * IO is done and any metadata is inserted into the tree.
  */
-#define BTRFS_ORDERED_IO_DONE 0 /* set when all the pages are written */
-
-#define BTRFS_ORDERED_COMPLETE 1 /* set when removed from the tree */
-
-#define BTRFS_ORDERED_NOCOW 2 /* set when we want to write in place */
-
-#define BTRFS_ORDERED_COMPRESSED 3 /* writing a zlib compressed extent */
-
-#define BTRFS_ORDERED_PREALLOC 4 /* set when writing to preallocated extent */
-
-#define BTRFS_ORDERED_DIRECT 5 /* set when we're doing DIO with this extent */
-
-#define BTRFS_ORDERED_IOERR 6 /* We had an io error when writing this out */
-
-#define BTRFS_ORDERED_UPDATED_ISIZE 7 /* indicates whether this ordered extent
-  * has done its due diligence in updating
-  * the isize. */
-#define BTRFS_ORDERED_TRUNCATED 8 /* Set when we have to truncate an extent */
-
-#define BTRFS_ORDERED_REGULAR 10 /* Regular IO for COW */
+enum {
+   /* set when all the pages are written */
+   BTRFS_ORDERED_IO_DONE,
+   /* set when removed from the tree */
+   BTRFS_ORDERED_COMPLETE,
+   /* set when we want to write in place */
+   BTRFS_ORDERED_NOCOW,
+   /* writing a zlib compressed extent */
+   BTRFS_ORDERED_COMPRESSED,
+   /* set when writing to preallocated extent */
+   BTRFS_ORDERED_PREALLOC,
+   /* set when we're doing DIO with this extent */
+   BTRFS_ORDERED_DIRECT,
+   /* We had an io error when writing this out */
+   BTRFS_ORDERED_IOERR,
+   /*
+* indicates whether this ordered extent has done its due diligence in
+* updating the isize
+*/
+   BTRFS_ORDERED_UPDATED_ISIZE,
+   /* Set when we have to truncate an extent */
+   BTRFS_ORDERED_TRUNCATED,
+   /* Regular IO for COW */
+   BTRFS_ORDERED_REGULAR,
+};
 
 struct btrfs_ordered_extent {
/* logical offset in the file */
-- 
2.19.1



[PATCH 7/9] btrfs: switch BTRFS_*_LOCK to enums

2018-11-27 Thread David Sterba
We can use simple enum for values that are not part of on-disk format:
tree lock types.

Signed-off-by: David Sterba 
---
 fs/btrfs/locking.h | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/locking.h b/fs/btrfs/locking.h
index 29135def468e..684d0ef4faa4 100644
--- a/fs/btrfs/locking.h
+++ b/fs/btrfs/locking.h
@@ -6,10 +6,12 @@
 #ifndef BTRFS_LOCKING_H
 #define BTRFS_LOCKING_H
 
-#define BTRFS_WRITE_LOCK 1
-#define BTRFS_READ_LOCK 2
-#define BTRFS_WRITE_LOCK_BLOCKING 3
-#define BTRFS_READ_LOCK_BLOCKING 4
+enum {
+   BTRFS_WRITE_LOCK,
+   BTRFS_READ_LOCK,
+   BTRFS_WRITE_LOCK_BLOCKING,
+   BTRFS_READ_LOCK_BLOCKING,
+};
 
 void btrfs_tree_lock(struct extent_buffer *eb);
 void btrfs_tree_unlock(struct extent_buffer *eb);
-- 
2.19.1



[PATCH 6/9] btrfs: switch EXTENT_FLAG_* to enums

2018-11-27 Thread David Sterba
We can use simple enum for values that are not part of on-disk format:
extent map flags.

Signed-off-by: David Sterba 
---
 fs/btrfs/extent_map.h | 21 ++---
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
index 31977ffd6190..ef05a0121652 100644
--- a/fs/btrfs/extent_map.h
+++ b/fs/btrfs/extent_map.h
@@ -11,13 +11,20 @@
 #define EXTENT_MAP_INLINE ((u64)-2)
 #define EXTENT_MAP_DELALLOC ((u64)-1)
 
-/* bits for the flags field */
-#define EXTENT_FLAG_PINNED 0 /* this entry not yet on disk, don't free it */
-#define EXTENT_FLAG_COMPRESSED 1
-#define EXTENT_FLAG_PREALLOC 3 /* pre-allocated extent */
-#define EXTENT_FLAG_LOGGING 4 /* Logging this extent */
-#define EXTENT_FLAG_FILLING 5 /* Filling in a preallocated extent */
-#define EXTENT_FLAG_FS_MAPPING 6 /* filesystem extent mapping type */
+/* bits for the extent_map::flags field */
+enum {
+   /* this entry not yet on disk, don't free it */
+   EXTENT_FLAG_PINNED,
+   EXTENT_FLAG_COMPRESSED,
+   /* pre-allocated extent */
+   EXTENT_FLAG_PREALLOC,
+   /* Logging this extent */
+   EXTENT_FLAG_LOGGING,
+   /* Filling in a preallocated extent */
+   EXTENT_FLAG_FILLING,
+   /* filesystem extent mapping type */
+   EXTENT_FLAG_FS_MAPPING,
+};
 
 struct extent_map {
struct rb_node rb_node;
-- 
2.19.1



[PATCH 1/9] btrfs: switch BTRFS_FS_STATE_* to enums

2018-11-27 Thread David Sterba
We can use simple enum for values that are not part of on-disk format:
global filesystem states.

Signed-off-by: David Sterba 
---
 fs/btrfs/ctree.h | 25 +++--
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index a98507fa9192..f82ec5e41b0c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -109,13 +109,26 @@ static inline unsigned long btrfs_chunk_item_size(int 
num_stripes)
 }
 
 /*
- * File system states
+ * Runtime (in-memory) states of filesystem
  */
-#define BTRFS_FS_STATE_ERROR   0
-#define BTRFS_FS_STATE_REMOUNTING  1
-#define BTRFS_FS_STATE_TRANS_ABORTED   2
-#define BTRFS_FS_STATE_DEV_REPLACING   3
-#define BTRFS_FS_STATE_DUMMY_FS_INFO   4
+enum {
+   /* Global indicator of serious filesysystem errors */
+   BTRFS_FS_STATE_ERROR,
+   /*
+* Filesystem is being remounted, allow to skip some operations, like
+* defrag
+*/
+   BTRFS_FS_STATE_REMOUNTING,
+   /* Track if the transaction abort has been reported */
+   BTRFS_FS_STATE_TRANS_ABORTED,
+   /*
+* Indicate that replace source or target device state is changed and
+* allow to block bio operations
+*/
+   BTRFS_FS_STATE_DEV_REPLACING,
+   /* The btrfs_fs_info created for self-tests */
+   BTRFS_FS_STATE_DUMMY_FS_INFO,
+};
 
 #define BTRFS_BACKREF_REV_MAX  256
 #define BTRFS_BACKREF_REV_SHIFT56
-- 
2.19.1



[PATCH 2/9] btrfs: switch BTRFS_BLOCK_RSV_* to enums

2018-11-27 Thread David Sterba
We can use simple enum for values that are not part of on-disk format:
block reserve types.

Signed-off-by: David Sterba 
---
 fs/btrfs/ctree.h | 19 ---
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index f82ec5e41b0c..40c405d74a01 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -461,13 +461,18 @@ struct btrfs_space_info {
struct kobject *block_group_kobjs[BTRFS_NR_RAID_TYPES];
 };
 
-#defineBTRFS_BLOCK_RSV_GLOBAL  1
-#defineBTRFS_BLOCK_RSV_DELALLOC2
-#defineBTRFS_BLOCK_RSV_TRANS   3
-#defineBTRFS_BLOCK_RSV_CHUNK   4
-#defineBTRFS_BLOCK_RSV_DELOPS  5
-#defineBTRFS_BLOCK_RSV_EMPTY   6
-#defineBTRFS_BLOCK_RSV_TEMP7
+/*
+ * Types of block reserves
+ */
+enum {
+   BTRFS_BLOCK_RSV_GLOBAL,
+   BTRFS_BLOCK_RSV_DELALLOC,
+   BTRFS_BLOCK_RSV_TRANS,
+   BTRFS_BLOCK_RSV_CHUNK,
+   BTRFS_BLOCK_RSV_DELOPS,
+   BTRFS_BLOCK_RSV_EMPTY,
+   BTRFS_BLOCK_RSV_TEMP,
+};
 
 struct btrfs_block_rsv {
u64 size;
-- 
2.19.1



[PATCH 0/9] Switch defines to enums

2018-11-27 Thread David Sterba
This is motivated by a merging mistake that happened a few releases ago.
Two patches updated the BTRFS_FS_* flags independently to the same value,
so git did not see any direct merge conflict. The two values got mixed at
runtime and caused a crash.

Update all sequential #define values to enums. With that, the above
merging problem would not happen: the edits would produce a textual
conflict, and the enum auto-increment would prevent duplicated values
anyway.
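
To illustrate the hazard, with hypothetical flag names (not the patches
that actually collided): two branches can each append a #define and merge
cleanly while reusing the same bit, whereas enum members keep distinct
values automatically.

/* Hypothetical illustration only, the flag names are made up. */
#define BTRFS_FS_HYPOTHETICAL_A         19      /* added by branch A */
#define BTRFS_FS_HYPOTHETICAL_B         19      /* added by branch B, merges
                                                 * cleanly and silently reuses
                                                 * bit 19 */

/*
 * With an enum both branches append at the same place, so the merge
 * conflicts textually, and even if both members end up in the list the
 * auto-increment assigns them distinct values:
 */
enum {
        BTRFS_FS_HYPOTHETICAL_ENUM_A,   /* 0 */
        BTRFS_FS_HYPOTHETICAL_ENUM_B,   /* 1 */
};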

David Sterba (9):
  btrfs: switch BTRFS_FS_STATE_* to enums
  btrfs: switch BTRFS_BLOCK_RSV_* to enums
  btrfs: switch BTRFS_FS_* to enums
  btrfs: switch BTRFS_ROOT_* to enums
  btrfs: switch EXTENT_BUFFER_* to enums
  btrfs: switch EXTENT_FLAG_* to enums
  btrfs: switch BTRFS_*_LOCK to enums
  btrfs: switch BTRFS_ORDERED_* to enums
  btrfs: drop extra enum initialization where using defaults

 fs/btrfs/btrfs_inode.h  |   2 +-
 fs/btrfs/ctree.h| 168 ++--
 fs/btrfs/disk-io.h  |  10 +--
 fs/btrfs/extent_io.h|  28 ---
 fs/btrfs/extent_map.h   |  21 +++--
 fs/btrfs/locking.h  |  10 ++-
 fs/btrfs/ordered-data.h |  45 ++-
 fs/btrfs/qgroup.h   |   2 +-
 fs/btrfs/sysfs.h|   2 +-
 fs/btrfs/transaction.h  |  14 ++--
 10 files changed, 169 insertions(+), 133 deletions(-)

-- 
2.19.1



Re: [PATCH 5/8] btrfs: don't enospc all tickets on flush failure

2018-11-27 Thread Josef Bacik
On Mon, Nov 26, 2018 at 02:25:52PM +0200, Nikolay Borisov wrote:
> 
> 
> On 21.11.18 г. 21:03 ч., Josef Bacik wrote:
> > With the introduction of the per-inode block_rsv it became possible to
> > have really really large reservation requests made because of data
> > fragmentation.  Since the ticket stuff assumed that we'd always have
> > relatively small reservation requests it just killed all tickets if we
> > were unable to satisfy the current request.  However this is generally
> > not the case anymore.  So fix this logic to instead see if we had a
> > ticket that we were able to give some reservation to, and if we were
> > continue the flushing loop again.  Likewise we make the tickets use the
> > space_info_add_old_bytes() method of returning what reservation they did
> > receive in hopes that it could satisfy reservations down the line.
> 
> 
> The logic of the patch can be summarised as follows:
> 
> If no progress is made for a ticket, then start failing all tickets until
> the first one that has progress made on its reservation (inclusive). In
> this case this first ticket will be failed, but at least its space will
> be reused via space_info_add_old_bytes.
> 
> Frankly this seems really arbitrary.

It's not though.  The tickets are in order of who requested the reservation.
Because we will backfill reservations for things like hugely fragmented files or
large amounts of delayed refs we can have spikes where we're trying to reserve
100mb's of metadata space.  We may fill 50mb of that before we run out of space.
Well so we can't satisfy that reservation, but the small 100k reservations that
are waiting to be serviced can be satisfied and they can run.  The alternative
is you get ENOSPC and then you can turn around and touch a file no problem
because it's a small reservation and there was room for it.  This patch enables
better behavior for the user.  Thanks,

Josef


Re: [PATCH] btrfs: only run delayed refs if we're committing

2018-11-27 Thread Filipe Manana
On Tue, Nov 27, 2018 at 7:22 PM Josef Bacik  wrote:
>
> On Fri, Nov 23, 2018 at 04:59:32PM +, Filipe Manana wrote:
> > On Thu, Nov 22, 2018 at 12:35 AM Josef Bacik  wrote:
> > >
> > > I noticed in a giant dbench run that we spent a lot of time on lock
> > > contention while running transaction commit.  This is because dbench
> > > results in a lot of fsync()'s that do a btrfs_transaction_commit(), and
> > > they all run the delayed refs first thing, so they all contend with
> > > each other.  This leads to seconds of 0 throughput.  Change this to only
> > > run the delayed refs if we're the ones committing the transaction.  This
> > > makes the latency go away and we get no more lock contention.
> >
> > Can you share the following in the changelog please?
> >
> > 1) How did you ran dbench (parameters, config).
> >
> > 2) What results did you get before and after this change. So that we all get
> > an idea of how good the impact is.
> >
> > While the reduced contention makes all sense and seems valid, I'm not
> > sure this is always a win.
> > It certainly is when multiple tasks are calling
> > btrfs_commit_transaction() simultaneously, but,
> > what about when only one does it?
> >
> > By running all delayed references inside the critical section of the
> > transaction commit
> > (state == TRANS_STATE_COMMIT_START), instead of running most of them
> > outside/before,
> > we will be blocking for a longer time other tasks calling
> > btrfs_start_transaction() (which is used
> > a lot - creating files, unlinking files, adding links, etc, and even fsync).
> >
> > Won't there be any other types of workload and tests other than dbench
> > that can get increased
> > latency and/or smaller throughput?
> >
> > I find that sort of information useful to have in the changelog. If
> > you verified that or you think
> > it's irrelevant to measure/consider, it would be great to have it
> > mentioned in the changelog
> > (and explained).
> >
>
> Yeah I thought about the delayed refs being run in the critical section now,
> that's not awesome.  I'll drop this for now, I think just having a mutex 
> around
> running delayed refs will be good enough, since we want people who care about
> flushing delayed refs to wait around for that to finish happening.  Thanks,

Well, I think we can have a solution that doesn't bring such trade-off
nor introducing a mutex.
We could do like what is currently done for writing space caches, to
make sure only the first task
calling commit transaction does the work and all others do nothing
except waiting for the commit to finish:

btrfs_commit_transaction()
   if (!test_and_set_bit(BTRFS_TRANS_COMMIT_START, &cur_trans->flags)) {
   run delayed refs before entering critical section
   }
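
Slightly fleshed out (BTRFS_TRANS_COMMIT_START here would be a new
per-transaction flag, it does not exist today):

int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
{
        struct btrfs_transaction *cur_trans = trans->transaction;

        if (!test_and_set_bit(BTRFS_TRANS_COMMIT_START, &cur_trans->flags)) {
                /* only the first committer pays for this, and it happens
                 * before we enter the critical section */
                int ret = btrfs_run_delayed_refs(trans, 0);

                if (ret)
                        return ret;
        }

        /* ... rest of the commit path unchanged ... */
        return 0;
}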

thanks

>
> Josef



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH 3/3] btrfs: document extent mapping assumptions in checksum

2018-11-27 Thread Nikolay Borisov



On 27.11.18 г. 21:08 ч., Noah Massey wrote:
> On Tue, Nov 27, 2018 at 11:43 AM Nikolay Borisov  wrote:
>>
>> On 27.11.18 г. 18:00 ч., Johannes Thumshirn wrote:
>>> Document why map_private_extent_buffer() cannot return '1' (i.e. the map
>>> spans two pages) for the csum_tree_block() case.
>>>
>>> The current algorithm for detecting a page boundary crossing in
>>> map_private_extent_buffer() will return a '1' *IFF* the product of the
>>
>> I think the word product must be replaced with 'sum', since product
>> implies multiplication :)
>>
> 
> doesn't 'sum' imply addition? How about 'output'?

It does, and the code indeed sums the values rather than multiplying them,
hence my suggestion.

> 
>>> extent buffer's offset in the page + the offset passed in by
>>> csum_tree_block() and the minimal length passed in by csum_tree_block() - 1
>>> are bigger than PAGE_SIZE.
>>>
>>> We always pass BTRFS_CSUM_SIZE (32) as offset and a minimal length of 32
>>> and the current extent buffer allocator always guarantees page aligned
>>> extends, so the above condition can't be true.
>>>
>>> Signed-off-by: Johannes Thumshirn 
>>
>> With that wording changed:
>>
>> Reviewed-by: Nikolay Borisov 
>>
>>> ---
>>>  fs/btrfs/disk-io.c | 6 ++
>>>  1 file changed, 6 insertions(+)
>>>
>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>> index 4bc270ef29b4..14d355d0cb7a 100644
>>> --- a/fs/btrfs/disk-io.c
>>> +++ b/fs/btrfs/disk-io.c
>>> @@ -279,6 +279,12 @@ static int csum_tree_block(struct btrfs_fs_info 
>>> *fs_info,
>>>
>>>   len = buf->len - offset;
>>>   while (len > 0) {
>>> + /*
>>> +  * Note: we don't need to check for the err == 1 case here, as
>>> +  * with the given combination of 'start = BTRFS_CSUM_SIZE 
>>> (32)'
>>> +  * and 'min_len = 32' and the currently implemented mapping
>>> +  * algorithm we cannot cross a page boundary.
>>> +  */
>>>   err = map_private_extent_buffer(buf, offset, 32,
>>>   , _start, _len);
>>>   if (err)
>>>
> 


Re: [PATCH] btrfs: only run delayed refs if we're committing

2018-11-27 Thread Josef Bacik
On Fri, Nov 23, 2018 at 04:59:32PM +, Filipe Manana wrote:
> On Thu, Nov 22, 2018 at 12:35 AM Josef Bacik  wrote:
> >
> > I noticed in a giant dbench run that we spent a lot of time on lock
> > contention while running transaction commit.  This is because dbench
> > results in a lot of fsync()'s that do a btrfs_transaction_commit(), and
> > they all run the delayed refs first thing, so they all contend with
> > each other.  This leads to seconds of 0 throughput.  Change this to only
> > run the delayed refs if we're the ones committing the transaction.  This
> > makes the latency go away and we get no more lock contention.
> 
> Can you share the following in the changelog please?
> 
> 1) How did you ran dbench (parameters, config).
> 
> 2) What results did you get before and after this change. So that we all get
> an idea of how good the impact is.
> 
> While the reduced contention makes all sense and seems valid, I'm not
> sure this is always a win.
> It certainly is when multiple tasks are calling
> btrfs_commit_transaction() simultaneously, but,
> what about when only one does it?
> 
> By running all delayed references inside the critical section of the
> transaction commit
> (state == TRANS_STATE_COMMIT_START), instead of running most of them
> outside/before,
> we will be blocking for a longer time other tasks calling
> btrfs_start_transaction() (which is used
> a lot - creating files, unlinking files, adding links, etc, and even fsync).
> 
> Won't there be any other types of workload and tests other than dbench
> that can get increased
> latency and/or smaller throughput?
> 
> I find that sort of information useful to have in the changelog. If
> you verified that or you think
> it's irrelevant to measure/consider, it would be great to have it
> mentioned in the changelog
> (and explained).
> 

Yeah I thought about the delayed refs being run in the critical section now,
that's not awesome.  I'll drop this for now, I think just having a mutex around
running delayed refs will be good enough, since we want people who care about
flushing delayed refs to wait around for that to finish happening.  Thanks,

Josef


Re: [PATCH 5/6] btrfs: introduce delayed_refs_rsv

2018-11-27 Thread Josef Bacik
On Mon, Nov 26, 2018 at 11:14:12AM +0200, Nikolay Borisov wrote:
> 
> 
> On 21.11.18 г. 20:59 ч., Josef Bacik wrote:
> > From: Josef Bacik 
> > 
> > Traditionally we've had voodoo in btrfs to account for the space that
> > delayed refs may take up by having a global_block_rsv.  This works most
> > of the time, except when it doesn't.  We've had issues reported and seen
> > in production where sometimes the global reserve is exhausted during
> > transaction commit before we can run all of our delayed refs, resulting
> > in an aborted transaction.  Because of this voodoo we have equally
> > dubious flushing semantics around throttling delayed refs which we often
> > get wrong.
> > 
> > So instead give them their own block_rsv.  This way we can always know
> > exactly how much outstanding space we need for delayed refs.  This
> > allows us to make sure we are constantly filling that reservation up
> > with space, and allows us to put more precise pressure on the enospc
> > system.  Instead of doing math to see if its a good time to throttle,
> > the normal enospc code will be invoked if we have a lot of delayed refs
> > pending, and they will be run via the normal flushing mechanism.
> > 
> > For now the delayed_refs_rsv will hold the reservations for the delayed
> > refs, the block group updates, and deleting csums.  We could have a
> > separate rsv for the block group updates, but the csum deletion stuff is
> > still handled via the delayed_refs so that will stay there.
> 
> Couldn't the same "no premature ENOSPC in critical section" effect be
> achieved if we simply allocate 2* num bytes in start transaction without
> adding the additional granularity for delayed refs? There seems to be a
> lot of supporting code added and this obfuscates the simple idea that
> now we do 2* reservation and put it in a separate block_rsv structure.
> 
> > 
> > Signed-off-by: Josef Bacik 
> > ---
> >  fs/btrfs/ctree.h |  26 ++--
> >  fs/btrfs/delayed-ref.c   |  28 -
> >  fs/btrfs/disk-io.c   |   4 +
> >  fs/btrfs/extent-tree.c   | 279 
> > +++
> >  fs/btrfs/inode.c |   4 +-
> >  fs/btrfs/transaction.c   |  77 ++--
> >  include/trace/events/btrfs.h |   2 +
> >  7 files changed, 313 insertions(+), 107 deletions(-)
> > 
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index 8b41ec42f405..0c6d589c8ce4 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -448,8 +448,9 @@ struct btrfs_space_info {
> >  #defineBTRFS_BLOCK_RSV_TRANS   3
> >  #defineBTRFS_BLOCK_RSV_CHUNK   4
> >  #defineBTRFS_BLOCK_RSV_DELOPS  5
> > -#defineBTRFS_BLOCK_RSV_EMPTY   6
> > -#defineBTRFS_BLOCK_RSV_TEMP7
> > +#define BTRFS_BLOCK_RSV_DELREFS6
> > +#defineBTRFS_BLOCK_RSV_EMPTY   7
> > +#defineBTRFS_BLOCK_RSV_TEMP8
> >  
> >  struct btrfs_block_rsv {
> > u64 size;
> > @@ -812,6 +813,8 @@ struct btrfs_fs_info {
> > struct btrfs_block_rsv chunk_block_rsv;
> > /* block reservation for delayed operations */
> > struct btrfs_block_rsv delayed_block_rsv;
> > +   /* block reservation for delayed refs */
> > +   struct btrfs_block_rsv delayed_refs_rsv;
> >  
> > struct btrfs_block_rsv empty_block_rsv;
> >  
> > @@ -2628,7 +2631,7 @@ static inline u64 
> > btrfs_calc_trunc_metadata_size(struct btrfs_fs_info *fs_info,
> >  }
> >  
> >  int btrfs_should_throttle_delayed_refs(struct btrfs_trans_handle *trans);
> > -int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans);
> > +bool btrfs_check_space_for_delayed_refs(struct btrfs_fs_info *fs_info);
> >  void btrfs_dec_block_group_reservations(struct btrfs_fs_info *fs_info,
> >  const u64 start);
> >  void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache 
> > *bg);
> > @@ -2742,10 +2745,12 @@ enum btrfs_reserve_flush_enum {
> >  enum btrfs_flush_state {
> > FLUSH_DELAYED_ITEMS_NR  =   1,
> > FLUSH_DELAYED_ITEMS =   2,
> > -   FLUSH_DELALLOC  =   3,
> > -   FLUSH_DELALLOC_WAIT =   4,
> > -   ALLOC_CHUNK =   5,
> > -   COMMIT_TRANS=   6,
> > +   FLUSH_DELAYED_REFS_NR   =   3,
> > +   FLUSH_DELAYED_REFS  =   4,
> > +   FLUSH_DELALLOC  =   5,
> > +   FLUSH_DELALLOC_WAIT =   6,
> > +   ALLOC_CHUNK =   7,
> > +   COMMIT_TRANS=   8,
> >  };
> >  
> >  int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes);
> > @@ -2796,6 +2801,13 @@ int btrfs_cond_migrate_bytes(struct btrfs_fs_info 
> > *fs_info,
> >  void btrfs_block_rsv_release(struct btrfs_fs_info *fs_info,
> >  struct btrfs_block_rsv *block_rsv,
> >  u64 num_bytes);
> > +void btrfs_delayed_refs_rsv_release(struct btrfs_fs_info *fs_info, int nr);
> > +void 

Re: [PATCH 3/3] btrfs: document extent mapping assumptions in checksum

2018-11-27 Thread Noah Massey
On Tue, Nov 27, 2018 at 11:43 AM Nikolay Borisov  wrote:
>
> On 27.11.18 г. 18:00 ч., Johannes Thumshirn wrote:
> > Document why map_private_extent_buffer() cannot return '1' (i.e. the map
> > spans two pages) for the csum_tree_block() case.
> >
> > The current algorithm for detecting a page boundary crossing in
> > map_private_extent_buffer() will return a '1' *IFF* the product of the
>
> I think the word product must be replaced with 'sum', since product
> implies multiplication :)
>

doesn't 'sum' imply addition? How about 'output'?

> > extent buffer's offset in the page + the offset passed in by
> > csum_tree_block() and the minimal length passed in by csum_tree_block() - 1
> > are bigger than PAGE_SIZE.
> >
> > We always pass BTRFS_CSUM_SIZE (32) as offset and a minimal length of 32
> > and the current extent buffer allocator always guarantees page aligned
> > extends, so the above condition can't be true.
> >
> > Signed-off-by: Johannes Thumshirn 
>
> With that wording changed:
>
> Reviewed-by: Nikolay Borisov 
>
> > ---
> >  fs/btrfs/disk-io.c | 6 ++
> >  1 file changed, 6 insertions(+)
> >
> > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > index 4bc270ef29b4..14d355d0cb7a 100644
> > --- a/fs/btrfs/disk-io.c
> > +++ b/fs/btrfs/disk-io.c
> > @@ -279,6 +279,12 @@ static int csum_tree_block(struct btrfs_fs_info 
> > *fs_info,
> >
> >   len = buf->len - offset;
> >   while (len > 0) {
> > + /*
> > +  * Note: we don't need to check for the err == 1 case here, as
> > +  * with the given combination of 'start = BTRFS_CSUM_SIZE 
> > (32)'
> > +  * and 'min_len = 32' and the currently implemented mapping
> > +  * algorithm we cannot cross a page boundary.
> > +  */
> >   err = map_private_extent_buffer(buf, offset, 32,
> >   , _start, _len);
> >   if (err)
> >


[PATCH] btrfs: Refactor btrfs_merge_bio_hook

2018-11-27 Thread Nikolay Borisov
This function really checks whether adding more data to the bio will
straddle a stripe/chunk. So first let's give it a more appropriate
name - btrfs_bio_fits_in_stripe. Secondly, the offset parameter was
never used, so just remove it. Thirdly, pages are submitted to either
btree or data inodes, so it's guaranteed that tree->ops is set; replace
the check with an ASSERT. Finally, document the parameters of the
function. No functional changes.
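
For reference, the core of what the function checks, paraphrased (not a
verbatim copy of the btrfs code):

static int bio_would_straddle_stripe(struct btrfs_fs_info *fs_info,
                                     struct bio *bio, size_t new_size)
{
        u64 logical = (u64)bio->bi_iter.bi_sector << 9;
        u64 length = bio->bi_iter.bi_size;
        u64 map_length = length;
        int ret;

        ret = btrfs_map_block(fs_info, btrfs_op(bio), logical, &map_length,
                              NULL, 0);
        if (ret < 0)
                return ret;

        /* adding new_size would make the bio cross a stripe/chunk boundary */
        return map_length < length + new_size;
}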

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/compression.c |  7 ---
 fs/btrfs/ctree.h   |  5 ++---
 fs/btrfs/extent_io.c   |  4 ++--
 fs/btrfs/inode.c   | 19 ---
 4 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 34d50bc5c10d..dba59ae914b8 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -332,7 +332,8 @@ blk_status_t btrfs_submit_compressed_write(struct inode 
*inode, u64 start,
page = compressed_pages[pg_index];
page->mapping = inode->i_mapping;
if (bio->bi_iter.bi_size)
-   submit = btrfs_merge_bio_hook(page, 0, PAGE_SIZE, bio, 
0);
+   submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio,
+ 0);
 
page->mapping = NULL;
if (submit || bio_add_page(bio, page, PAGE_SIZE, 0) <
@@ -610,8 +611,8 @@ blk_status_t btrfs_submit_compressed_read(struct inode 
*inode, struct bio *bio,
page->index = em_start >> PAGE_SHIFT;
 
if (comp_bio->bi_iter.bi_size)
-   submit = btrfs_merge_bio_hook(page, 0, PAGE_SIZE,
-   comp_bio, 0);
+   submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE,
+ comp_bio, 0);
 
page->mapping = NULL;
if (submit || bio_add_page(comp_bio, page, PAGE_SIZE, 0) <
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index a98507fa9192..791112a82775 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3191,9 +3191,8 @@ void btrfs_merge_delalloc_extent(struct inode *inode, 
struct extent_state *new,
 struct extent_state *other);
 void btrfs_split_delalloc_extent(struct inode *inode,
 struct extent_state *orig, u64 split);
-int btrfs_merge_bio_hook(struct page *page, unsigned long offset,
-size_t size, struct bio *bio,
-unsigned long bio_flags);
+int btrfs_bio_fits_in_stripe(struct page *page, size_t size, struct bio *bio,
+unsigned long bio_flags);
 void btrfs_set_range_writeback(struct extent_io_tree *tree, u64 start, u64 
end);
 vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf);
 int btrfs_readpage(struct file *file, struct page *page);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 582b4b1c41e0..19f4b8fd654f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2759,8 +2759,8 @@ static int submit_extent_page(unsigned int opf, struct 
extent_io_tree *tree,
else
contig = bio_end_sector(bio) == sector;
 
-   if (tree->ops && btrfs_merge_bio_hook(page, offset, page_size,
- bio, bio_flags))
+   ASSERT(tree->ops);
+   if (btrfs_bio_fits_in_stripe(page, page_size, bio, bio_flags))
can_merge = false;
 
if (prev_bio_flags != bio_flags || !contig || !can_merge ||
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index be7d43c42779..11c533db2db7 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1881,16 +1881,21 @@ void btrfs_clear_delalloc_extent(struct inode 
*vfs_inode,
 }
 
 /*
- * Merge bio hook, this must check the chunk tree to make sure we don't create
- * bios that span stripes or chunks
+ * btrfs_bio_fits_in_stripe - Checks whether the size of the given bio will fit
+ * in a chunk's stripe. This function ensures that bios do not span a
+ * stripe/chunk
  *
- * return 1 if page cannot be merged to bio
- * return 0 if page can be merged to bio
+ * @page - The page we are about to add to the bio
+ * @size - size we want to add to the bio
+ * @bio - bio we want to ensure is smaller than a stripe
+ * @bio_flags - flags of the bio
+ *
+ * return 1 if page cannot be added to the bio
+ * return 0 if page can be added to the bio
  * return error otherwise
  */
-int btrfs_merge_bio_hook(struct page *page, unsigned long offset,
-size_t size, struct bio *bio,
-unsigned long bio_flags)
+int btrfs_bio_fits_in_stripe(struct page *page, size_t size, struct bio *bio,
+unsigned long bio_flags)
 {
struct inode *inode = page->mapping->host;
struct btrfs_fs_info *fs_info = 

Re: [PATCH 3/3] btrfs: document extent mapping assumptions in checksum

2018-11-27 Thread Nikolay Borisov



On 27.11.18 г. 18:00 ч., Johannes Thumshirn wrote:
> Document why map_private_extent_buffer() cannot return '1' (i.e. the map
> spans two pages) for the csum_tree_block() case.
> 
> The current algorithm for detecting a page boundary crossing in
> map_private_extent_buffer() will return a '1' *IFF* the product of the

I think the word product must be replaced with 'sum', since product
implies multiplication :)

> extent buffer's offset in the page + the offset passed in by
> csum_tree_block() and the minimal length passed in by csum_tree_block() - 1
> are bigger than PAGE_SIZE.
> 
> We always pass BTRFS_CSUM_SIZE (32) as offset and a minimal length of 32
> and the current extent buffer allocator always guarantees page aligned
> extends, so the above condition can't be true.
> 
> Signed-off-by: Johannes Thumshirn 

With that wording changed:

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/disk-io.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 4bc270ef29b4..14d355d0cb7a 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -279,6 +279,12 @@ static int csum_tree_block(struct btrfs_fs_info *fs_info,
>  
>   len = buf->len - offset;
>   while (len > 0) {
> + /*
> +  * Note: we don't need to check for the err == 1 case here, as
> +  * with the given combination of 'start = BTRFS_CSUM_SIZE (32)'
> +  * and 'min_len = 32' and the currently implemented mapping
> +  * algorithm we cannot cross a page boundary.
> +  */
>   err = map_private_extent_buffer(buf, offset, 32,
>   &kaddr, &map_start, &map_len);
>   if (err)
> 


Re: [PATCH 2/3] btrfs: use offset_in_page for start_offset in map_private_extent_buffer()

2018-11-27 Thread Nikolay Borisov



On 27.11.18 г. 18:00 ч., Johannes Thumshirn wrote:
> In map_private_extent_buffer() use offset_in_page() to initialize
> 'start_offset' instead of open-coding it.
> 
> Signed-off-by: Johannes Thumshirn 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/extent_io.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 7aafdec49dc3..85cd3975c680 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -5383,7 +5383,7 @@ int map_private_extent_buffer(const struct 
> extent_buffer *eb,
>   size_t offset;
>   char *kaddr;
>   struct page *p;
> - size_t start_offset = eb->start & ((u64)PAGE_SIZE - 1);
> + size_t start_offset = offset_in_page(eb->start);
>   unsigned long i = (start_offset + start) >> PAGE_SHIFT;
>   unsigned long end_i = (start_offset + start + min_len - 1) >>
>   PAGE_SHIFT;
> 


Re: [PATCH 1/3] btrfs: don't initialize 'offset' in map_private_extent_buffer()

2018-11-27 Thread Nikolay Borisov



On 27.11.18 г. 18:00 ч., Johannes Thumshirn wrote:
> In map_private_extent_buffer() the 'offset' variable is initialized to a
> page aligned version of the 'start' parameter.
> 
> But later on it is overwritten with either the offset from the extent
> buffer's start or 0.
> 
> So get rid of the initial initialization.
> 
> Signed-off-by: Johannes Thumshirn 

You know, the fastest/cleanest code is the code that is deleted, so:

Reviewed-by: Nikolay Borisov 


> ---
>  fs/btrfs/extent_io.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 582b4b1c41e0..7aafdec49dc3 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -5380,7 +5380,7 @@ int map_private_extent_buffer(const struct 
> extent_buffer *eb,
> char **map, unsigned long *map_start,
> unsigned long *map_len)
>  {
> - size_t offset = start & (PAGE_SIZE - 1);
> + size_t offset;
>   char *kaddr;
>   struct page *p;
>   size_t start_offset = eb->start & ((u64)PAGE_SIZE - 1);
> 


[PATCH 0/3] Misc cosmetic changes for map_private_extent_buffer

2018-11-27 Thread Johannes Thumshirn
While trying to understand the checksum code I came across some oddities
regarding map_private_extent_buffer() and its interaction with
csum_tree_block().

These patches address them but are either purely cosmetic or only add a
comment documenting behaviour.

Johannes Thumshirn (3):
  btrfs: don't initialize 'offset' in map_private_extent_buffer()
  btrfs: use offset_in_page for start_offset in
map_private_extent_buffer()
  btrfs: document extent mapping assumptions in checksum

 fs/btrfs/disk-io.c   | 6 ++
 fs/btrfs/extent_io.c | 4 ++--
 2 files changed, 8 insertions(+), 2 deletions(-)

-- 
2.16.4



[PATCH 2/3] btrfs: use offset_in_page for start_offset in map_private_extent_buffer()

2018-11-27 Thread Johannes Thumshirn
In map_private_extent_buffer() use offset_in_page() to initialize
'start_offset' instead of open-coding it.
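
For reference, offset_in_page() is currently defined in include/linux/mm.h
(please double-check against your tree) as:

	#define offset_in_page(p)	((unsigned long)(p) & ~PAGE_MASK)

so the open-coded expression and the helper give the same result here:

	size_t a = eb->start & ((u64)PAGE_SIZE - 1);
	size_t b = offset_in_page(eb->start);	/* a == b */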

Signed-off-by: Johannes Thumshirn 
---
 fs/btrfs/extent_io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 7aafdec49dc3..85cd3975c680 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5383,7 +5383,7 @@ int map_private_extent_buffer(const struct extent_buffer 
*eb,
size_t offset;
char *kaddr;
struct page *p;
-   size_t start_offset = eb->start & ((u64)PAGE_SIZE - 1);
+   size_t start_offset = offset_in_page(eb->start);
unsigned long i = (start_offset + start) >> PAGE_SHIFT;
unsigned long end_i = (start_offset + start + min_len - 1) >>
PAGE_SHIFT;
-- 
2.16.4



[PATCH 3/3] btrfs: document extent mapping assumptions in checksum

2018-11-27 Thread Johannes Thumshirn
Document why map_private_extent_buffer() cannot return '1' (i.e. the map
spans two pages) for the csum_tree_block() case.

The current algorithm for detecting a page boundary crossing in
map_private_extent_buffer() will return a '1' *IFF* the product of the
extent buffer's offset in the page + the offset passed in by
csum_tree_block() and the minimal length passed in by csum_tree_block() - 1
are bigger than PAGE_SIZE.

We always pass BTRFS_CSUM_SIZE (32) as offset and a minimal length of 32
and the current extent buffer allocator always guarantees page aligned
extends, so the above condition can't be true.
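
Spelled out (illustration only, using the names visible in
map_private_extent_buffer()), the boundary check amounts to:

	size_t start_offset = eb->start & ((u64)PAGE_SIZE - 1); /* 0, ebs are page aligned */
	unsigned long i     = (start_offset + start) >> PAGE_SHIFT;
	unsigned long end_i = (start_offset + start + min_len - 1) >> PAGE_SHIFT;

	if (i != end_i)
		return 1;	/* the requested range crosses a page boundary */

With start == BTRFS_CSUM_SIZE (32) and min_len == 32 this gives
end_i == (0 + 32 + 32 - 1) >> PAGE_SHIFT == 0 == i, so 1 is never returned
for this caller.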

Signed-off-by: Johannes Thumshirn 
---
 fs/btrfs/disk-io.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 4bc270ef29b4..14d355d0cb7a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -279,6 +279,12 @@ static int csum_tree_block(struct btrfs_fs_info *fs_info,
 
len = buf->len - offset;
while (len > 0) {
+   /*
+* Note: we don't need to check for the err == 1 case here, as
+* with the given combination of 'start = BTRFS_CSUM_SIZE (32)'
+* and 'min_len = 32' and the currently implemented mapping
+* algorithm we cannot cross a page boundary.
+*/
err = map_private_extent_buffer(buf, offset, 32,
&kaddr, &map_start, &map_len);
if (err)
-- 
2.16.4



[PATCH 1/3] btrfs: don't initialize 'offset' in map_private_extent_buffer()

2018-11-27 Thread Johannes Thumshirn
In map_private_extent_buffer() the 'offset' variable is initialized to a
page aligned version of the 'start' parameter.

But later on it is overwritten with either the offset from the extent
buffer's start or 0.

So get rid of the initial initialization.

Signed-off-by: Johannes Thumshirn 
---
 fs/btrfs/extent_io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 582b4b1c41e0..7aafdec49dc3 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5380,7 +5380,7 @@ int map_private_extent_buffer(const struct extent_buffer 
*eb,
  char **map, unsigned long *map_start,
  unsigned long *map_len)
 {
-   size_t offset = start & (PAGE_SIZE - 1);
+   size_t offset;
char *kaddr;
struct page *p;
size_t start_offset = eb->start & ((u64)PAGE_SIZE - 1);
-- 
2.16.4



Re: [PATCH 5/6] btrfs: introduce delayed_refs_rsv

2018-11-27 Thread David Sterba
On Mon, Nov 26, 2018 at 11:14:12AM +0200, Nikolay Borisov wrote:
> > -   if (global_rsv->space_info->full) {
> > -   num_dirty_bgs_bytes <<= 1;
> > -   num_bytes <<= 1;
> > -   }
> > +   struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
> > +   struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv;
> > +   u64 reserved;
> > +   bool ret = false;
> 
> nit: Reverse christmas tree:

Let it be known that the reverse-xmas-tree declaration style will _not_
be enforced in new patches, and cleanup patches that only reorder
declarations will not be accepted.

You're free to use it but I may edit your patch so the declarations
conform to the style used in the file or to a pattern that's common in
the subsystem, eg. global pointers like fs_info or root first in the list.
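
To illustrate the two orderings (hypothetical declarations, for the sake of
the example only):

	/* reverse christmas tree - longest declaration first */
	struct btrfs_delayed_ref_root *delayed_refs;
	struct btrfs_trans_handle *trans;
	struct btrfs_fs_info *fs_info;
	u64 num_bytes;
	int ret;

	/* vs. the pattern mentioned above - fs_info/root pointers first */
	struct btrfs_fs_info *fs_info;
	struct btrfs_trans_handle *trans;
	struct btrfs_delayed_ref_root *delayed_refs;
	u64 num_bytes;
	int ret;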


Re: [PATCH 3/6] btrfs: cleanup extent_op handling

2018-11-27 Thread Josef Bacik
On Thu, Nov 22, 2018 at 12:09:34PM +0200, Nikolay Borisov wrote:
> 
> 
> On 21.11.18 г. 20:59 ч., Josef Bacik wrote:
> > From: Josef Bacik 
> > 
> > The cleanup_extent_op function actually would run the extent_op if it
> > needed running, which made the name sort of a misnomer.  Change it to
> > run_and_cleanup_extent_op, and move the actual cleanup work to
> > cleanup_extent_op so it can be used by check_ref_cleanup() in order to
> > unify the extent op handling.
> 
> The whole name extent_op is actually a misnomer since AFAIR this is some
> sort of modification of the references of metadata nodes. I don't see
> why it can't be made as yet another type of reference which is run for a
> given node.
> 

It would change the key for a metadata extent reference for non-skinny metadata,
and it sets the FULL_BACKREF flag.  Since it really only changes flags now we
could probably roll that into its own thing, but that's out of scope for this
stuff.  Thanks,

Josef


Re: [PATCH] Proposal for more detail in scrub doc

2018-11-27 Thread Nikolay Borisov



On 27.11.18 г. 16:14 ч., Andrea Gelmini wrote:
> Wise words from Qu:
> https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg82557.html
> ---
>  Documentation/btrfs-scrub.asciidoc | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/btrfs-scrub.asciidoc 
> b/Documentation/btrfs-scrub.asciidoc
> index 4c49269..1fc085c 100644
> --- a/Documentation/btrfs-scrub.asciidoc
> +++ b/Documentation/btrfs-scrub.asciidoc
> @@ -16,7 +16,8 @@ and metadata blocks from all devices and verify checksums. 
> Automatically repair
>  corrupted blocks if there's a correct copy available.
>  
>  NOTE: Scrub is not a filesystem checker (fsck) and does not verify nor repair
> -structural damage in the filesystem.
> +structural damage in the filesystem. It only checks csum of data and tree 
> blocks,
> +it doesn't ensure the content of tree blocks are OK.

I would rephrase this as:

"It only ensures that the checksum of given data/metadata blocks match
but doesn't guarantee that the contents themselves are consistent"

It sounds a bit more formal which I think is more appropriate for a man
page.


>  
>  The user is supposed to run it manually or via a periodic system service. The
>  recommended period is a month but could be less. The estimated device 
> bandwidth
> 


Re: Linux-next regression?

2018-11-27 Thread Qu Wenruo


On 2018/11/27 下午10:11, Andrea Gelmini wrote:
> On Tue, Nov 27, 2018 at 09:13:02AM +0800, Qu Wenruo wrote:
>>
>>
>> On 2018/11/26 下午11:01, Andrea Gelmini wrote:
>>>   One question: I can completely trust the ok return status of scrub? I 
>>> know is made for this, but shit happens...
>>
>> No, scrub only checks csum of data and tree blocks, it doesn't ensure
>> the content of tree blocks are OK.
> 
> Hi Qu,
>   and thanks a lot, really. Your answers are always the best: short,
>   detailed and very kind. You rock.
> 
>   I'm going to send a patch to propose to add your explanation above
>   on the relative man page, if you agree.
> 
>> For comprehensive check, go "btrfs check --readonly".
> 
>   I'll do it.
> 
>   At the moment I just compared the file existance between my laptop and
>   latest backup. Everything is fine.
> 
>>
>> However I don't think it's something "btrfs check --readonly" would
>> report, but some strange behavior, maybe from LVM or cryptsetup.
> 
>   Well, I'm using this setup with ext4 and xfs, on same machine, without
>   troubles.

Then it indeed looks like something is going wrong in linux-next.

I would recommend doing a bisect if possible.

As you compared all your data with your laptop, that confirms your
csum/file trees are OK, thus no corruption in those trees.
But still something doesn't look right in the extent tree.

But it's less of a concern since it doesn't reach the latest RC, so
if you can reproduce it reliably, I'd recommend doing a bisect.

Thanks,
Qu

>   I've got files checksummed on the backup machine, so I can be sure about
>   comparing integrity.
> 
> Anyway, thanks a lot again,
> Andrea
> 





[PATCH] Proposal for more detail in scrub doc

2018-11-27 Thread Andrea Gelmini
Wise words from Qu:
https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg82557.html
---
 Documentation/btrfs-scrub.asciidoc | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/Documentation/btrfs-scrub.asciidoc 
b/Documentation/btrfs-scrub.asciidoc
index 4c49269..1fc085c 100644
--- a/Documentation/btrfs-scrub.asciidoc
+++ b/Documentation/btrfs-scrub.asciidoc
@@ -16,7 +16,8 @@ and metadata blocks from all devices and verify checksums. 
Automatically repair
 corrupted blocks if there's a correct copy available.
 
 NOTE: Scrub is not a filesystem checker (fsck) and does not verify nor repair
-structural damage in the filesystem.
+structural damage in the filesystem. It only checks csum of data and tree 
blocks,
+it doesn't ensure the content of tree blocks are OK.
 
 The user is supposed to run it manually or via a periodic system service. The
 recommended period is a month but could be less. The estimated device bandwidth
-- 
2.20.0.rc1.28.g8a4ee09a8a



Re: [PATCH] Fix typos in docs (second try...)

2018-11-27 Thread Nikolay Borisov



On 27.11.18 г. 15:56 ч., Andrea Gelmini wrote:
> Signed-off-by: Andrea Gelmini 

Reviewed-by: Nikolay Borisov 

> ---
>  Documentation/DocConventions  | 4 ++--
>  Documentation/ReleaseChecklist| 4 ++--
>  Documentation/btrfs-man5.asciidoc | 8 
>  Documentation/btrfs-property.asciidoc | 2 +-
>  Documentation/btrfs-qgroup.asciidoc   | 4 ++--
>  Documentation/mkfs.btrfs.asciidoc | 2 +-
>  6 files changed, 12 insertions(+), 12 deletions(-)
> 
> diff --git a/Documentation/DocConventions b/Documentation/DocConventions
> index e84ed7a..969209c 100644
> --- a/Documentation/DocConventions
> +++ b/Documentation/DocConventions
> @@ -23,7 +23,7 @@ Quotation in subcommands:
>  - command reference: bold *btrfs fi show*
>  - section references: italics 'EXAMPLES'
>  
> -- argument name in option desciption: caps in angle brackets <NAME>
> +- argument name in option description: caps in angle brackets <NAME>
>- reference in help text: caps NAME
>  also possible: caps italics 'NAME'
>  
> @@ -34,6 +34,6 @@ Quotation in subcommands:
>- optional parameter with argument: [-p ]
>  
>  
> -Refrences:
> +References:
>  - full asciidoc syntax: http://asciidoc.org/userguide.html
>  - cheatsheet: http://powerman.name/doc/asciidoc
> diff --git a/Documentation/ReleaseChecklist b/Documentation/ReleaseChecklist
> index d8bf50c..ebe251d 100644
> --- a/Documentation/ReleaseChecklist
> +++ b/Documentation/ReleaseChecklist
> @@ -4,7 +4,7 @@ Release checklist
>  Last code touches:
>  
>  * make the code ready, collect patches queued for the release
> -* look to mailinglist for any relevant last-minute fixes
> +* look to mailing list for any relevant last-minute fixes
>  
>  Pre-checks:
>  
> @@ -31,7 +31,7 @@ Release:
>  
>  Post-release:
>  
> -* write and send announcement mail to mailinglist
> +* write and send announcement mail to mailing list
>  * update wiki://Main_page#News
>  * update wiki://Changelog#btrfs-progs
>  * update title on IRC
> diff --git a/Documentation/btrfs-man5.asciidoc 
> b/Documentation/btrfs-man5.asciidoc
> index 448710a..4a269e2 100644
> --- a/Documentation/btrfs-man5.asciidoc
> +++ b/Documentation/btrfs-man5.asciidoc
> @@ -156,7 +156,7 @@ under 'nodatacow' are also set the NOCOW file attribute 
> (see `chattr`(1)).
>  NOTE: If 'nodatacow' or 'nodatasum' are enabled, compression is disabled.
>  +
>  Updates in-place improve performance for workloads that do frequent 
> overwrites,
> -at the cost of potential partial writes, in case the write is interruted
> +at the cost of potential partial writes, in case the write is interrupted
>  (system crash, device failure).
>  
>  *datasum*::
> @@ -171,7 +171,7 @@ corresponding file attribute (see `chattr`(1)).
>  NOTE: If 'nodatacow' or 'nodatasum' are enabled, compression is disabled.
>  +
>  There is a slight performance gain when checksums are turned off, the
> -correspoinding metadata blocks holding the checksums do not need to updated.
> +corresponding metadata blocks holding the checksums do not need to be 
> updated.
>  The cost of checksumming of the blocks in memory is much lower than the IO,
>  modern CPUs feature hardware support of the checksumming algorithm.
>  
> @@ -185,7 +185,7 @@ missing, for example if a stripe member is completely 
> missing from RAID0.
>  Since 4.14, the constraint checks have been improved and are verified on the
>  chunk level, not an the device level. This allows degraded mounts of
>  filesystems with mixed RAID profiles for data and metadata, even if the
> -device number constraints would not be satisfied for some of the prifles.
> +device number constraints would not be satisfied for some of the profiles.
>  +
>  Example: metadata -- raid1, data -- single, devices -- /dev/sda, /dev/sdb
>  +
> @@ -649,7 +649,7 @@ inherent limit of btrfs is 2^64^ (16 EiB) but the linux 
> VFS limit is 2^63^ (8 Ei
>  
>  maximum number of subvolumes::
>  2^64^ but depends on the available metadata space, the space consumed by all
> -subvolume metadata includes bookeeping of the shared extents can be large 
> (MiB,
> +subvolume metadata includes bookkeeping of the shared extents can be large 
> (MiB,
>  GiB)
>  
>  maximum number of hardlinks of a file in a directory::
> diff --git a/Documentation/btrfs-property.asciidoc 
> b/Documentation/btrfs-property.asciidoc
> index b562717..4bad88b 100644
> --- a/Documentation/btrfs-property.asciidoc
> +++ b/Documentation/btrfs-property.asciidoc
> @@ -34,7 +34,7 @@ specify what type of object you meant. This is only needed 
> when a
>  property could be set for more then one object type.
>  +
>  Possible types are 's[ubvol]', 'f[ilesystem]', 'i[node]' and 'd[evice]', 
> where
> -the first lettes is a shortcut.
> +the first letter is a shortcut.
>  +
>  Set the name of property by 'name'. If no 'name' is specified,
>  all properties for the given object are printed. 'name' is one of
> diff --git a/Documentation/btrfs-qgroup.asciidoc 
> b/Documentation/btrfs-qgroup.asciidoc
> 

Re: Linux-next regression?

2018-11-27 Thread Andrea Gelmini
On Tue, Nov 27, 2018 at 09:13:02AM +0800, Qu Wenruo wrote:
> 
> 
> On 2018/11/26 下午11:01, Andrea Gelmini wrote:
> >   One question: I can completely trust the ok return status of scrub? I 
> > know is made for this, but shit happens...
> 
> No, scrub only checks csum of data and tree blocks, it doesn't ensure
> the content of tree blocks are OK.

Hi Qu,
  and thanks a lot, really. Your answers are always the best: short,
  detailed and very kind. You rock.

  I'm going to send a patch proposing to add your explanation above
  on the relevant man page, if you agree.

> For comprehensive check, go "btrfs check --readonly".

  I'll do it.

  At the moment I just compared the file existence between my laptop and
  latest backup. Everything is fine.

> 
> However I don't think it's something "btrfs check --readonly" would
> report, but some strange behavior, maybe from LVM or cryptsetup.

  Well, I'm using this setup with ext4 and xfs, on the same machine, without
  trouble.
  I've got files checksummed on the backup machine, so I can be sure about
  comparing integrity.

Anyway, thanks a lot again,
Andrea




[PATCH] Fix typos in docs (second try...)

2018-11-27 Thread Andrea Gelmini
Signed-off-by: Andrea Gelmini 
---
 Documentation/DocConventions  | 4 ++--
 Documentation/ReleaseChecklist| 4 ++--
 Documentation/btrfs-man5.asciidoc | 8 
 Documentation/btrfs-property.asciidoc | 2 +-
 Documentation/btrfs-qgroup.asciidoc   | 4 ++--
 Documentation/mkfs.btrfs.asciidoc | 2 +-
 6 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/Documentation/DocConventions b/Documentation/DocConventions
index e84ed7a..969209c 100644
--- a/Documentation/DocConventions
+++ b/Documentation/DocConventions
@@ -23,7 +23,7 @@ Quotation in subcommands:
 - command reference: bold *btrfs fi show*
 - section references: italics 'EXAMPLES'
 
-- argument name in option desciption: caps in angle brackets <NAME>
+- argument name in option description: caps in angle brackets <NAME>
   - reference in help text: caps NAME
 also possible: caps italics 'NAME'
 
@@ -34,6 +34,6 @@ Quotation in subcommands:
   - optional parameter with argument: [-p ]
 
 
-Refrences:
+References:
 - full asciidoc syntax: http://asciidoc.org/userguide.html
 - cheatsheet: http://powerman.name/doc/asciidoc
diff --git a/Documentation/ReleaseChecklist b/Documentation/ReleaseChecklist
index d8bf50c..ebe251d 100644
--- a/Documentation/ReleaseChecklist
+++ b/Documentation/ReleaseChecklist
@@ -4,7 +4,7 @@ Release checklist
 Last code touches:
 
 * make the code ready, collect patches queued for the release
-* look to mailinglist for any relevant last-minute fixes
+* look to mailing list for any relevant last-minute fixes
 
 Pre-checks:
 
@@ -31,7 +31,7 @@ Release:
 
 Post-release:
 
-* write and send announcement mail to mailinglist
+* write and send announcement mail to mailing list
 * update wiki://Main_page#News
 * update wiki://Changelog#btrfs-progs
 * update title on IRC
diff --git a/Documentation/btrfs-man5.asciidoc 
b/Documentation/btrfs-man5.asciidoc
index 448710a..4a269e2 100644
--- a/Documentation/btrfs-man5.asciidoc
+++ b/Documentation/btrfs-man5.asciidoc
@@ -156,7 +156,7 @@ under 'nodatacow' are also set the NOCOW file attribute 
(see `chattr`(1)).
 NOTE: If 'nodatacow' or 'nodatasum' are enabled, compression is disabled.
 +
 Updates in-place improve performance for workloads that do frequent overwrites,
-at the cost of potential partial writes, in case the write is interruted
+at the cost of potential partial writes, in case the write is interrupted
 (system crash, device failure).
 
 *datasum*::
@@ -171,7 +171,7 @@ corresponding file attribute (see `chattr`(1)).
 NOTE: If 'nodatacow' or 'nodatasum' are enabled, compression is disabled.
 +
 There is a slight performance gain when checksums are turned off, the
-correspoinding metadata blocks holding the checksums do not need to updated.
+corresponding metadata blocks holding the checksums do not need to be updated.
 The cost of checksumming of the blocks in memory is much lower than the IO,
 modern CPUs feature hardware support of the checksumming algorithm.
 
@@ -185,7 +185,7 @@ missing, for example if a stripe member is completely 
missing from RAID0.
 Since 4.14, the constraint checks have been improved and are verified on the
 chunk level, not an the device level. This allows degraded mounts of
 filesystems with mixed RAID profiles for data and metadata, even if the
-device number constraints would not be satisfied for some of the prifles.
+device number constraints would not be satisfied for some of the profiles.
 +
 Example: metadata -- raid1, data -- single, devices -- /dev/sda, /dev/sdb
 +
@@ -649,7 +649,7 @@ inherent limit of btrfs is 2^64^ (16 EiB) but the linux VFS 
limit is 2^63^ (8 Ei
 
 maximum number of subvolumes::
 2^64^ but depends on the available metadata space, the space consumed by all
-subvolume metadata includes bookeeping of the shared extents can be large (MiB,
+subvolume metadata includes bookkeeping of the shared extents can be large 
(MiB,
 GiB)
 
 maximum number of hardlinks of a file in a directory::
diff --git a/Documentation/btrfs-property.asciidoc 
b/Documentation/btrfs-property.asciidoc
index b562717..4bad88b 100644
--- a/Documentation/btrfs-property.asciidoc
+++ b/Documentation/btrfs-property.asciidoc
@@ -34,7 +34,7 @@ specify what type of object you meant. This is only needed 
when a
 property could be set for more then one object type.
 +
 Possible types are 's[ubvol]', 'f[ilesystem]', 'i[node]' and 'd[evice]', where
-the first lettes is a shortcut.
+the first letter is a shortcut.
 +
 Set the name of property by 'name'. If no 'name' is specified,
 all properties for the given object are printed. 'name' is one of
diff --git a/Documentation/btrfs-qgroup.asciidoc 
b/Documentation/btrfs-qgroup.asciidoc
index dff0867..dc4c93b 100644
--- a/Documentation/btrfs-qgroup.asciidoc
+++ b/Documentation/btrfs-qgroup.asciidoc
@@ -52,7 +52,7 @@ assignment would lead to quota inconsistency. See 'QUOTA 
RESCAN' for more
 information.
 --no-rescan
 Explicitly ask not to do a rescan, even if the assignment will make the quotas

Re: [PATCH] Fix typos in docs

2018-11-27 Thread Andrea Gelmini
On Tue, Nov 27, 2018 at 03:41:55PM +0200, Nikolay Borisov wrote:
> On 27.11.18 г. 15:37 ч., Andrea Gelmini wrote:
> 
> Wrong position of space, should be mailing list

Thanks a lot Nikolay for the review.

I'm going to send new version.

Thanks again,
Andrea


Re: [PATCH] Fix typos in docs

2018-11-27 Thread Nikolay Borisov



On 27.11.18 г. 15:37 ч., Andrea Gelmini wrote:
> Signed-off-by: Andrea Gelmini 
> ---
>  Documentation/DocConventions  | 4 ++--
>  Documentation/ReleaseChecklist| 4 ++--
>  Documentation/btrfs-man5.asciidoc | 8 
>  Documentation/btrfs-property.asciidoc | 2 +-
>  Documentation/btrfs-qgroup.asciidoc   | 4 ++--
>  Documentation/mkfs.btrfs.asciidoc | 2 +-
>  6 files changed, 12 insertions(+), 12 deletions(-)
> 
> diff --git a/Documentation/DocConventions b/Documentation/DocConventions
> index e84ed7a..969209c 100644
> --- a/Documentation/DocConventions
> +++ b/Documentation/DocConventions
> @@ -23,7 +23,7 @@ Quotation in subcommands:
>  - command reference: bold *btrfs fi show*
>  - section references: italics 'EXAMPLES'
>  
> -- argument name in option desciption: caps in angle brackets <NAME>
> +- argument name in option description: caps in angle brackets <NAME>
>- reference in help text: caps NAME
>  also possible: caps italics 'NAME'
>  
> @@ -34,6 +34,6 @@ Quotation in subcommands:
>- optional parameter with argument: [-p ]
>  
>  
> -Refrences:
> +References:
>  - full asciidoc syntax: http://asciidoc.org/userguide.html
>  - cheatsheet: http://powerman.name/doc/asciidoc
> diff --git a/Documentation/ReleaseChecklist b/Documentation/ReleaseChecklist
> index d8bf50c..7fdbddf 100644
> --- a/Documentation/ReleaseChecklist
> +++ b/Documentation/ReleaseChecklist
> @@ -4,7 +4,7 @@ Release checklist
>  Last code touches:
>  
>  * make the code ready, collect patches queued for the release
> -* look to mailinglist for any relevant last-minute fixes
> +* look to mailin glist for any relevant last-minute fixes

Wrong position of space, should be mailing list
>  
>  Pre-checks:
>  
> @@ -31,7 +31,7 @@ Release:
>  
>  Post-release:
>  
> -* write and send announcement mail to mailinglist
> +* write and send announcement mail to mailing list
>  * update wiki://Main_page#News
>  * update wiki://Changelog#btrfs-progs
>  * update title on IRC
> diff --git a/Documentation/btrfs-man5.asciidoc 
> b/Documentation/btrfs-man5.asciidoc
> index 448710a..c358cef 100644
> --- a/Documentation/btrfs-man5.asciidoc
> +++ b/Documentation/btrfs-man5.asciidoc
> @@ -156,7 +156,7 @@ under 'nodatacow' are also set the NOCOW file attribute 
> (see `chattr`(1)).
>  NOTE: If 'nodatacow' or 'nodatasum' are enabled, compression is disabled.
>  +
>  Updates in-place improve performance for workloads that do frequent 
> overwrites,
> -at the cost of potential partial writes, in case the write is interruted
> +at the cost of potential partial writes, in case the write is interrupted
>  (system crash, device failure).
>  
>  *datasum*::
> @@ -171,7 +171,7 @@ corresponding file attribute (see `chattr`(1)).
>  NOTE: If 'nodatacow' or 'nodatasum' are enabled, compression is disabled.
>  +
>  There is a slight performance gain when checksums are turned off, the
> -correspoinding metadata blocks holding the checksums do not need to updated.
> +corresponding metadata blocks holding the checksums do not need to updated.

You've missed one grammatical error - the end of the sentence should
say "do not need to be updated".

>  The cost of checksumming of the blocks in memory is much lower than the IO,
>  modern CPUs feature hardware support of the checksumming algorithm.
>  




Re: [PATCH 1/2] btrfs: scrub: maintain the unlock order in scrub thread

2018-11-27 Thread Anand Jain




On 11/26/2018 05:47 PM, Nikolay Borisov wrote:



On 26.11.18 г. 11:07 ч., Anand Jain wrote:

The fs_info::device_list_mutex and fs_info::scrub_lock create nested
locking in btrfs_scrub_dev(). During lock acquisition the hierarchy is
fs_info::device_list_mutex and then fs_info::scrub_lock, so follow the
same reverse order during unlock, that is fs_info::scrub_lock and then
fs_info::device_list_mutex.
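
In other words (illustration only, not part of the patch):

	mutex_lock(&fs_info->fs_devices->device_list_mutex);	/* acquired first */
	mutex_lock(&fs_info->scrub_lock);			/* acquired second */
	...
	mutex_unlock(&fs_info->scrub_lock);			/* so released first */
	mutex_unlock(&fs_info->fs_devices->device_list_mutex);	/* and released last */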

Signed-off-by: Anand Jain 
---
  fs/btrfs/scrub.c | 16 +++-
  1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 902819d3cf41..b1c2d1cdbd4b 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3865,7 +3865,6 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 
devid, u64 start,
}
sctx->readonly = readonly;
dev->scrub_ctx = sctx;
-   mutex_unlock(&fs_info->fs_devices->device_list_mutex);
  
  	/*

 * checking @scrub_pause_req here, we can avoid
@@ -3875,15 +3874,14 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 
devid, u64 start,
atomic_inc(&fs_info->scrubs_running);
mutex_unlock(&fs_info->scrub_lock);
  
-	if (!is_dev_replace) {

-   /*
-* by holding device list mutex, we can
-* kick off writing super in log tree sync.
-*/
-   mutex_lock(&fs_info->fs_devices->device_list_mutex);
+   /*
+* by holding device list mutex, we can kick off writing super in log
+* tree sync.
+*/
+   if (!is_dev_replace)
ret = scrub_supers(sctx, dev);
-   mutex_unlock(&fs_info->fs_devices->device_list_mutex);
-   }
+
+   mutex_unlock(&fs_info->fs_devices->device_list_mutex);


Have you considered whether this change will have any negative impact
due to the fact that __scrtub_blocked_if_needed can go to sleep for
arbitrary time with device_list_mutex held now ?


 You are right, I missed that point - we must not block while holding the
 device_list_mutex. In fact we don't need the nested device_list_mutex and
 scrub_lock here at all. I have come up with a new fix [1] below,
 separating them.

[1]
-
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 902819d3cf41..db895ad23eda 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3830,42 +3830,37 @@ int btrfs_scrub_dev(struct btrfs_fs_info 
*fs_info, u64 devid, u64 start,

return -EROFS;
}

-   mutex_lock(&fs_info->scrub_lock);
if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &dev->dev_state) ||
test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &dev->dev_state)) {
-   mutex_unlock(&fs_info->scrub_lock);
mutex_unlock(&fs_info->fs_devices->device_list_mutex);
return -EIO;
}
+   mutex_unlock(&fs_info->fs_devices->device_list_mutex);

btrfs_dev_replace_read_lock(&fs_info->dev_replace);
if (dev->scrub_ctx ||
(!is_dev_replace &&
 btrfs_dev_replace_is_ongoing(&fs_info->dev_replace))) {
btrfs_dev_replace_read_unlock(&fs_info->dev_replace);
-   mutex_unlock(&fs_info->scrub_lock);
-   mutex_unlock(&fs_info->fs_devices->device_list_mutex);
return -EINPROGRESS;
}
btrfs_dev_replace_read_unlock(&fs_info->dev_replace);

+   mutex_lock(&fs_info->scrub_lock);
ret = scrub_workers_get(fs_info, is_dev_replace);
if (ret) {
mutex_unlock(&fs_info->scrub_lock);
-   mutex_unlock(&fs_info->fs_devices->device_list_mutex);
return ret;
}

sctx = scrub_setup_ctx(dev, is_dev_replace);
if (IS_ERR(sctx)) {
mutex_unlock(&fs_info->scrub_lock);
-   mutex_unlock(&fs_info->fs_devices->device_list_mutex);
scrub_workers_put(fs_info);
return PTR_ERR(sctx);
}
sctx->readonly = readonly;
dev->scrub_ctx = sctx;
-   mutex_unlock(&fs_info->fs_devices->device_list_mutex);

/*
 * checking @scrub_pause_req here, we can avoid


Will send v2.

Thanks, Anand





  
  	if (!ret)

ret = scrub_enumerate_chunks(sctx, dev, start, end);



[PATCH] Fix typos in docs

2018-11-27 Thread Andrea Gelmini
Signed-off-by: Andrea Gelmini 
---
 Documentation/DocConventions  | 4 ++--
 Documentation/ReleaseChecklist| 4 ++--
 Documentation/btrfs-man5.asciidoc | 8 
 Documentation/btrfs-property.asciidoc | 2 +-
 Documentation/btrfs-qgroup.asciidoc   | 4 ++--
 Documentation/mkfs.btrfs.asciidoc | 2 +-
 6 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/Documentation/DocConventions b/Documentation/DocConventions
index e84ed7a..969209c 100644
--- a/Documentation/DocConventions
+++ b/Documentation/DocConventions
@@ -23,7 +23,7 @@ Quotation in subcommands:
 - command reference: bold *btrfs fi show*
 - section references: italics 'EXAMPLES'
 
-- argument name in option desciption: caps in angle brackets <NAME>
+- argument name in option description: caps in angle brackets <NAME>
   - reference in help text: caps NAME
 also possible: caps italics 'NAME'
 
@@ -34,6 +34,6 @@ Quotation in subcommands:
   - optional parameter with argument: [-p ]
 
 
-Refrences:
+References:
 - full asciidoc syntax: http://asciidoc.org/userguide.html
 - cheatsheet: http://powerman.name/doc/asciidoc
diff --git a/Documentation/ReleaseChecklist b/Documentation/ReleaseChecklist
index d8bf50c..7fdbddf 100644
--- a/Documentation/ReleaseChecklist
+++ b/Documentation/ReleaseChecklist
@@ -4,7 +4,7 @@ Release checklist
 Last code touches:
 
 * make the code ready, collect patches queued for the release
-* look to mailinglist for any relevant last-minute fixes
+* look to mailin glist for any relevant last-minute fixes
 
 Pre-checks:
 
@@ -31,7 +31,7 @@ Release:
 
 Post-release:
 
-* write and send announcement mail to mailinglist
+* write and send announcement mail to mailing list
 * update wiki://Main_page#News
 * update wiki://Changelog#btrfs-progs
 * update title on IRC
diff --git a/Documentation/btrfs-man5.asciidoc 
b/Documentation/btrfs-man5.asciidoc
index 448710a..c358cef 100644
--- a/Documentation/btrfs-man5.asciidoc
+++ b/Documentation/btrfs-man5.asciidoc
@@ -156,7 +156,7 @@ under 'nodatacow' are also set the NOCOW file attribute 
(see `chattr`(1)).
 NOTE: If 'nodatacow' or 'nodatasum' are enabled, compression is disabled.
 +
 Updates in-place improve performance for workloads that do frequent overwrites,
-at the cost of potential partial writes, in case the write is interruted
+at the cost of potential partial writes, in case the write is interrupted
 (system crash, device failure).
 
 *datasum*::
@@ -171,7 +171,7 @@ corresponding file attribute (see `chattr`(1)).
 NOTE: If 'nodatacow' or 'nodatasum' are enabled, compression is disabled.
 +
 There is a slight performance gain when checksums are turned off, the
-correspoinding metadata blocks holding the checksums do not need to updated.
+corresponding metadata blocks holding the checksums do not need to updated.
 The cost of checksumming of the blocks in memory is much lower than the IO,
 modern CPUs feature hardware support of the checksumming algorithm.
 
@@ -185,7 +185,7 @@ missing, for example if a stripe member is completely 
missing from RAID0.
 Since 4.14, the constraint checks have been improved and are verified on the
 chunk level, not an the device level. This allows degraded mounts of
 filesystems with mixed RAID profiles for data and metadata, even if the
-device number constraints would not be satisfied for some of the prifles.
+device number constraints would not be satisfied for some of the profiles.
 +
 Example: metadata -- raid1, data -- single, devices -- /dev/sda, /dev/sdb
 +
@@ -649,7 +649,7 @@ inherent limit of btrfs is 2^64^ (16 EiB) but the linux VFS 
limit is 2^63^ (8 Ei
 
 maximum number of subvolumes::
 2^64^ but depends on the available metadata space, the space consumed by all
-subvolume metadata includes bookeeping of the shared extents can be large (MiB,
+subvolume metadata includes bookkeeping of the shared extents can be large 
(MiB,
 GiB)
 
 maximum number of hardlinks of a file in a directory::
diff --git a/Documentation/btrfs-property.asciidoc 
b/Documentation/btrfs-property.asciidoc
index b562717..4bad88b 100644
--- a/Documentation/btrfs-property.asciidoc
+++ b/Documentation/btrfs-property.asciidoc
@@ -34,7 +34,7 @@ specify what type of object you meant. This is only needed 
when a
 property could be set for more then one object type.
 +
 Possible types are 's[ubvol]', 'f[ilesystem]', 'i[node]' and 'd[evice]', where
-the first lettes is a shortcut.
+the first letter is a shortcut.
 +
 Set the name of property by 'name'. If no 'name' is specified,
 all properties for the given object are printed. 'name' is one of
diff --git a/Documentation/btrfs-qgroup.asciidoc 
b/Documentation/btrfs-qgroup.asciidoc
index dff0867..dc4c93b 100644
--- a/Documentation/btrfs-qgroup.asciidoc
+++ b/Documentation/btrfs-qgroup.asciidoc
@@ -52,7 +52,7 @@ assignment would lead to quota inconsistency. See 'QUOTA 
RESCAN' for more
 information.
 --no-rescan
 Explicitly ask not to do a rescan, even if the assignment will make the quotas

Re: [PATCH RESEND 0/8] btrfs-progs: sub: Relax the privileges of "subvolume list/show"

2018-11-27 Thread Martin Steigerwald
Misono Tomohiro - 27.11.18, 06:24:
> Importantly, in order to make output consistent for both root and
> non-privileged user, this changes the behavior of "subvolume list":
>  - (default) Only list in subvolume under the specified path.
>Path needs to be a subvolume.

Does that work recursively?

I would find it quite unexpected if running btrfs subvol list in or on the
root directory of a BTRFS filesystem did not display any subvolumes on
that filesystem, no matter where they are.

Thanks,
-- 
Martin




Re: [PATCH v2 1/5] btrfs-progs: image: Refactor fixup_devices() to fixup_chunks_and_devices()

2018-11-27 Thread Nikolay Borisov



On 27.11.18 г. 10:50 ч., Qu Wenruo wrote:
> 
> 
> On 2018/11/27 下午4:46, Nikolay Borisov wrote:
>>
>>
>> On 27.11.18 г. 10:38 ч., Qu Wenruo wrote:
>>> Current fixup_devices() will only remove DEV_ITEMs and reset DEV_ITEM
>>> size.
>>> Later we need to do more fixup works, so change the name to
>>> fixup_chunks_and_devices() and refactor the original device size fixup
>>> to fixup_device_size().
>>>
>>> Signed-off-by: Qu Wenruo 
>>
>> Reviewed-by: Nikolay Borisov 
>>
>> However, one minor nit below.
>>
>>> ---
>>>  image/main.c | 52 
>>>  1 file changed, 36 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/image/main.c b/image/main.c
>>> index c680ab19de6c..bbfcf8f19298 100644
>>> --- a/image/main.c
>>> +++ b/image/main.c
>>> @@ -2084,28 +2084,19 @@ static void remap_overlapping_chunks(struct 
>>> mdrestore_struct *mdres)
>>> }
>>>  }
>>>  
>>> -static int fixup_devices(struct btrfs_fs_info *fs_info,
>>> -struct mdrestore_struct *mdres, off_t dev_size)
>>> +static int fixup_device_size(struct btrfs_trans_handle *trans,
>>> +struct mdrestore_struct *mdres,
>>> +off_t dev_size)
>>>  {
>>> -   struct btrfs_trans_handle *trans;
>>> +   struct btrfs_fs_info *fs_info = trans->fs_info;
>>> struct btrfs_dev_item *dev_item;
>>> struct btrfs_path path;
>>> -   struct extent_buffer *leaf;
>>> struct btrfs_root *root = fs_info->chunk_root;
>>> struct btrfs_key key;
>>> +   struct extent_buffer *leaf;
>>
>> nit: Unnecessary change
> 
> Doesn't it look better when all btrfs_ prefix get batched together? :)

Didn't even realize this was the intended effect. IMO it doesn't make much
difference; what does, though, is probably reverse christmas tree, i.e.

longer variable names
come before shorter
ones

But I guess this is a matter of taste, no need to resend.

> 
> Thanks,
> Qu
> 
>>
>>> u64 devid, cur_devid;
>>> int ret;
>>>  
>>> -   if (btrfs_super_log_root(fs_info->super_copy)) {
>>> -   warning(
>>> -   "log tree detected, its generation will not match superblock");
>>> -   }
>>> -   trans = btrfs_start_transaction(fs_info->tree_root, 1);
>>> -   if (IS_ERR(trans)) {
>>> -   error("cannot starting transaction %ld", PTR_ERR(trans));
>>> -   return PTR_ERR(trans);
>>> -   }
>>> -
>>> dev_item = &fs_info->super_copy->dev_item;
>>>  
>>> devid = btrfs_stack_device_id(dev_item);
>>> @@ -2123,7 +2114,7 @@ again:
>>> ret = btrfs_search_slot(trans, root, &key, &path, -1, 1);
>>> if (ret < 0) {
>>> error("search failed: %d", ret);
>>> -   exit(1);
>>> +   return ret;
>>> }
>>>  
>>> while (1) {
>>> @@ -2170,12 +2161,41 @@ again:
>>> }
>>>  
>>> btrfs_release_path(&path);
>>> +   return 0;
>>> +}
>>> +
>>> +static int fixup_chunks_and_devices(struct btrfs_fs_info *fs_info,
>>> +struct mdrestore_struct *mdres, off_t dev_size)
>>> +{
>>> +   struct btrfs_trans_handle *trans;
>>> +   int ret;
>>> +
>>> +   if (btrfs_super_log_root(fs_info->super_copy)) {
>>> +   warning(
>>> +   "log tree detected, its generation will not match superblock");
>>> +   }
>>> +   trans = btrfs_start_transaction(fs_info->tree_root, 1);
>>> +   if (IS_ERR(trans)) {
>>> +   error("cannot starting transaction %ld", PTR_ERR(trans));
>>> +   return PTR_ERR(trans);
>>> +   }
>>> +
>>> +   ret = fixup_device_size(trans, mdres, dev_size);
>>> +   if (ret < 0)
>>> +   goto error;
>>> +
>>> ret = btrfs_commit_transaction(trans, fs_info->tree_root);
>>> if (ret) {
>>> error("unable to commit transaction: %d", ret);
>>> return ret;
>>> }
>>> return 0;
>>> +error:
>>> +   error(
>>> +"failed to fix chunks and devices mapping, the fs may not be mountable: 
>>> %s",
>>> +   strerror(-ret));
>>> +   btrfs_abort_transaction(trans, ret);
>>> +   return ret;
>>>  }
>>>  
>>>  static int restore_metadump(const char *input, FILE *out, int old_restore,
>>> @@ -2282,7 +2302,7 @@ static int restore_metadump(const char *input, FILE 
>>> *out, int old_restore,
>>> return 1;
>>> }
>>>  
>>> -   ret = fixup_devices(info, &mdrestore, st.st_size);
>>> +   ret = fixup_chunks_and_devices(info, &mdrestore, st.st_size);
>>> close_ctree(info->chunk_root);
>>> if (ret)
>>> goto out;
>>>
> 


Re: [PATCH v2] btrfs: improve error handling of btrfs_add_link()

2018-11-27 Thread Johannes Thumshirn
On 26/11/2018 19:37, David Sterba wrote:
> I thought a transaction abort would be here, as the state is not
> consistent. Also I'm not sure what I as a user would get from such error
> message after calling link(). If the error handling in the error
> handling fails, there's not much left to do and the abort either
> happened earlier in the callees or is necessary here.

OK, now you've lost me. Isn't that what my previous patch did?

Byte,
Johannes
-- 
Johannes ThumshirnSUSE Labs
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

