Re: [RFC PATCH] btrfs: Speedup btrfs_read_block_groups()

2018-02-23 Thread Qu Wenruo


On 2018年02月24日 00:29, Ellis H. Wilson III wrote:
> On 02/22/2018 06:37 PM, Qu Wenruo wrote:
>> On 2018年02月23日 00:31, Ellis H. Wilson III wrote:
>>> On 02/21/2018 11:56 PM, Qu Wenruo wrote:
 On 2018年02月22日 12:52, Qu Wenruo wrote:
> btrfs_read_block_groups() is used to build up the block group cache
> for
> all block groups, so it will iterate all block group items in extent
> tree.
>
> For a large filesystem (TB level), it will search for BLOCK_GROUP_ITEM
> thousands of times, which is the most time-consuming part of mounting
> btrfs.
>
> So this patch will try to speed it up by:
>
> 1) Avoid unnecessary readahead
>  We were using READA_FORWARD to search for block group items.
>  However, block group items are in fact scattered across quite a lot
>  of leaves. Doing readahead will just waste our IO (especially
>  important for HDD).
>
>  In a real-world case, a filesystem with 3T of used space would have
>  about 50K extent tree leaves but only 3K block group items, meaning
>  we need to iterate 16 leaves on average to meet one block group.
>
>  So readahead won't help but will waste slow HDD seeks.
>
> 2) Use chunk mapping to locate block group items
>  Since one block group item always has one corresponding chunk item,
>  we could use the chunk mapping to get the block group item size.
>
>  With the block group item size, we can do a pinpoint tree search
>  instead of searching with some uncertain value and doing a forward
>  search.
>
>  In some cases, e.g. when the next BLOCK_GROUP_ITEM is in the next
>  leaf of the current path, we can save such an unnecessary tree block
>  read.
>
> Cc: Ellis H. Wilson III 

 Hi Ellis,

 Would you please try this patch to see if it helps to speedup the mount
 of your large filesystem?
>>>
>>> I will try either tomorrow or over the weekend.  I'm waiting on
>>> hardware that I can build and load a custom kernel on.
>>
>> If you're using Archlinux, I could build the package for you.
>>
>> (For other distributions, unfortunately I'm not that familiar with)
>>
>> Thanks,
>> Qu
> 
> No sweat.  I'm not running arch anywhere, so was glad to handle this
> myself.
> 
> Short story: It doesn't appear to have any notable impact on mount time.
> 
> Long story:
> #Built a modern kernel:
> git clone https://github.com/kdave/btrfs-devel
> cd'd into btrfs-devel
> copied my current kernel config in /boot to .config
> make olddefconfig
> make -j16
> make modules_install
> make install
> grub2-mkconfig -o /boot/grub/grub.cfg
> reboot
> 
> #Reran tests with vanilla 4.16.0-rc1+ kernel
> As root, of the form: time mount /dev/sdb /mnt/btrfs
> 5 iteration average: 16.869s
> 
> #Applied your patch, rebuilt, switched kernel module
> wget -O - 'https://patchwork.kernel.org/patch/10234619/mbox' | git am -
> make -j16
> make modules_install
> rmmod btrfs
> modprobe btrfs
> 
> #Reran tests with patched 4.16.0-rc1+ kernel
> As root, of the form: time mount /dev/sdb /mnt/btrfs
> 5 iteration average: 16.642s
> 
> So, there's a slight improvement against vanilla 4.16.0-rc1+, but it's
> still slightly slower than my original runs on 4.5.5, which got me
> 16.553s.  In any event, most of this is statistically insignificant,
> since the standard deviation is about two tenths of a second.
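As an aside, the mean and standard deviation quoted above can be computed from the raw timings with a short awk pipeline. The five timings below are made-up placeholders, not the actual measurements; in practice collect them from repeated `time mount` runs:

```shell
# Hypothetical per-run mount timings in seconds (placeholders); reduce them
# to a mean and population standard deviation with awk.
printf '%s\n' 16.7 16.9 16.8 17.1 16.9 | awk '
    { sum += $1; sumsq += $1 * $1; n++ }
    END {
        mean = sum / n
        printf "mean=%.3f stddev=%.3f\n", mean, sqrt(sumsq / n - mean * mean)
    }'
# prints: mean=16.880 stddev=0.133
```

A standard deviation that dwarfs the difference between two means, as reported here, is the usual sign that more iterations are needed before drawing conclusions.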

Yep, I also saw similar reports when the first version was sent.

Aside from the readahead change, the patch can only reduce disk reads
where block group items are located at the 1st slot of a leaf.

If all block group items are located from the 2nd slot of leaves onward,
then it shouldn't have much effect.
I think your fs is already in such a state, so it doesn't see much
speedup.

BTW, you could also verify this with btrfs-debug-tree.

To get the total number of block group items:
# btrfs-debug-tree -t extent <device> | grep BLOCK_GROUP_ITEM | \
  grep item | wc -l

To get the number of block group items located at the 1st slot:
# btrfs-debug-tree -t extent <device> | grep BLOCK_GROUP_ITEM | \
  grep "item 0" | wc -l

The ratio between the two should indicate how effectively the patch will
speed things up.
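To make the counting concrete, the same pipeline can be demonstrated on a tiny hand-written fragment in btrfs-debug-tree's leaf-item format. The item lines below are fabricated for the demo; in practice pipe in the real dump of the extent tree:

```shell
# Fabricated sample in btrfs-debug-tree leaf-item format, for demo only.
dump='	item 0 key (13631488 BLOCK_GROUP_ITEM 8388608) itemoff 16259 itemsize 24
	item 1 key (20971520 EXTENT_ITEM 16384) itemoff 16206 itemsize 53
	item 3 key (29360128 BLOCK_GROUP_ITEM 1073741824) itemoff 16120 itemsize 24'
# Count all block group items, then only those sitting at slot 0 of a leaf.
total=$(printf '%s\n' "$dump" | grep BLOCK_GROUP_ITEM | grep -c 'item')
first=$(printf '%s\n' "$dump" | grep BLOCK_GROUP_ITEM | grep -c 'item 0 ')
echo "block group items: $total, at 1st slot: $first"
# prints: block group items: 2, at 1st slot: 1
```

Note the trailing space in `'item 0 '`, which keeps the pattern from also matching hypothetical `item 01`, `item 02`, etc.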

>
> So, my conclusion here is that this problem needs to be handled at an
> architectural level to be truly solved (read: have mounts that take a
> few seconds at worst), which requires either:
> a) On-disk format changes like you (Qu) suggested some time back for a
> tree of block groups, or
> b) Lazy block group walking post-mount, and algorithms that can cope
> with making sub-optimal choices.  One would likely want to stonewall
> certain operations until the lazy post-mount walk completed, like
> balance, defrag, etc., that have more reason to require complete
> knowledge of the usage of each block group.
>
> I may take a stab at b), but I'm first going to do the tests I promised
> relating to how mount times scale with increased capacity.

Please update the BTRFS status page

2018-02-23 Thread waxhead

The latest released kernel is 4.15
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: Add nossd_spread mount option

2018-02-23 Thread Howard McLauchlan
On 02/21/2018 05:20 PM, Hans van Kranenburg wrote:
> On 02/22/2018 12:31 AM, Howard McLauchlan wrote:
>> Btrfs has two mount options for SSD optimizations: ssd and ssd_spread.
>> Presently there is an option to disable all SSD optimizations, but there
>> isn't an option to disable just ssd_spread.
>>
>> This patch adds a mount option nossd_spread that disables ssd_spread
>> only.
>>
>> Signed-off-by: Howard McLauchlan 
>> ---
>>  fs/btrfs/super.c | 13 +
>>  1 file changed, 9 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
>> index 6e71a2a78363..4c0fcf5b3e7e 100644
>> --- a/fs/btrfs/super.c
>> +++ b/fs/btrfs/super.c
>> @@ -310,10 +310,10 @@ static void btrfs_put_super(struct super_block *sb)
>>  enum {
>>  Opt_degraded, Opt_subvol, Opt_subvolid, Opt_device, Opt_nodatasum,
>>  Opt_nodatacow, Opt_max_inline, Opt_alloc_start, Opt_nobarrier, Opt_ssd,
>> -Opt_nossd, Opt_ssd_spread, Opt_thread_pool, Opt_noacl, Opt_compress,
>> -Opt_compress_type, Opt_compress_force, Opt_compress_force_type,
>> -Opt_notreelog, Opt_ratio, Opt_flushoncommit, Opt_discard,
>> -Opt_space_cache, Opt_space_cache_version, Opt_clear_cache,
>> +Opt_nossd, Opt_ssd_spread, Opt_nossd_spread, Opt_thread_pool, Opt_noacl,
>> +Opt_compress, Opt_compress_type, Opt_compress_force,
>> +Opt_compress_force_type, Opt_notreelog, Opt_ratio, Opt_flushoncommit,
>> +Opt_discard, Opt_space_cache, Opt_space_cache_version, Opt_clear_cache,
>>  Opt_user_subvol_rm_allowed, Opt_enospc_debug, Opt_subvolrootid,
>>  Opt_defrag, Opt_inode_cache, Opt_no_space_cache, Opt_recovery,
>>  Opt_skip_balance, Opt_check_integrity,
>> @@ -353,6 +353,7 @@ static const match_table_t tokens = {
>>  {Opt_ssd, "ssd"},
>>  {Opt_ssd_spread, "ssd_spread"},
>>  {Opt_nossd, "nossd"},
>> +{Opt_nossd_spread, "nossd_spread"},
>>  {Opt_acl, "acl"},
>>  {Opt_noacl, "noacl"},
>>  {Opt_notreelog, "notreelog"},
> 
> .oO(Why doesn't the enum just have one option per line, so the changelog
> is less invasive?)
> 
>> @@ -582,6 +583,10 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
>>  btrfs_clear_and_info(info, SSD_SPREAD,
>>   "not using spread ssd allocation scheme");
>>  break;
>> +case Opt_nossd_spread:
>> +btrfs_clear_and_info(info, SSD_SPREAD,
>> + "not using spread ssd allocation scheme");
>> +break;
>>  case Opt_barrier:
>>  btrfs_clear_and_info(info, NOBARRIER,
>>   "turning on barriers");
>>
> 
> Related:
> * https://www.spinics.net/lists/linux-btrfs/msg64247.html
> * https://www.spinics.net/lists/linux-btrfs/msg64277.html
> * https://www.spinics.net/lists/linux-btrfs/msg64499.html
> 
> Apparently that discussion never resulted in actual changes, so thanks
> for continuing it now.
> 
> The mount options are a bit weird, because ssd_spread also includes ssd,
> but doesn't show it.
> 
> I personally don't like any of them at all, and I should really finish
> and send my proposal to get them replaced by options that can choose
> the extent allocator for data and metadata individually (instead of one
> setting that changes them both at the same time), because there are
> real-life situations that e.g. benefit from the nossd-style 'tetris'
> data allocator combined with the ssd_spread-style 'contiguous' metadata
> extent allocator.
> 
I agree; this change is just something very specific that came up in prod for 
us, so I decided to push a patch for it. 


Re: [PATCH 2/2] btrfs: drop optimal argument from find_live_mirror()

2018-02-23 Thread David Sterba
On Tue, Jan 30, 2018 at 02:28:31PM +0800, Anand Jain wrote:
> Drop optimal argument from the function find_live_mirror()
> as we can deduce it in the function itself.

Yeah the argument is not necessary. It's another misleading variable
name, as it's not 'optimal' but a fallback. Please rename it.


Re: [PATCH 1/2] btrfs: drop num argument from find_live_mirror()

2018-02-23 Thread David Sterba
On Tue, Jan 30, 2018 at 02:28:30PM +0800, Anand Jain wrote:
> Obtain the stripe info from the map directly, so there is no need to
> pass it as an argument.
> 
> Signed-off-by: Anand Jain 
> ---
>  fs/btrfs/volumes.c | 14 +++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index f7147740b68e..9c9d987838c2 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -5252,13 +5252,22 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
>  }
>  
>  static int find_live_mirror(struct btrfs_fs_info *fs_info,
> - struct map_lookup *map, int first, int num,
> + struct map_lookup *map, int first,
>   int optimal, int dev_replace_is_ongoing)
>  {
>   int i;
> + int num;

Can you please rename it to something more descriptive? Eg. num_stripes.

>   int tolerance;
>   struct btrfs_device *srcdev;
>  
> + ASSERT((map->type &
> +  (BTRFS_BLOCK_GROUP_RAID1 | BTRFS_BLOCK_GROUP_RAID10)));
> +
> + if (map->type & BTRFS_BLOCK_GROUP_RAID10)
> + num = map->sub_stripes;
> + else
> + num = map->num_stripes;
> +
>   if (dev_replace_is_ongoing &&
>   fs_info->dev_replace.cont_reading_from_srcdev_mode ==
>BTRFS_DEV_REPLACE_ITEM_CONT_READING_FROM_SRCDEV_MODE_AVOID)
> @@ -5812,7 +5821,6 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
>   stripe_index = mirror_num - 1;
>   else {
>   stripe_index = find_live_mirror(fs_info, map, 0,
> - map->num_stripes,
>   current->pid % map->num_stripes,
>   dev_replace_is_ongoing);
>   mirror_num = stripe_index + 1;
> @@ -5841,7 +5849,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
>   int old_stripe_index = stripe_index;
>   stripe_index = find_live_mirror(fs_info, map,
> stripe_index,
> -   map->sub_stripes, stripe_index +
> +   stripe_index +
> current->pid % map->sub_stripes,
> dev_replace_is_ongoing);
>   mirror_num = stripe_index - old_stripe_index + 1;
> -- 
> 2.7.0
> 


Re: [PATCH] btrfs: Relax memory barrier in btrfs_tree_unlock

2018-02-23 Thread David Sterba
On Wed, Feb 14, 2018 at 02:37:26PM +0200, Nikolay Borisov wrote:
> When performing an unlock on an extent buffer we'd like to order the
> decrement of extent_buffer::blocking_writers with waking up any
> waiters. In such situations it's sufficient to use smp_mb__after_atomic
> rather than the heavy smp_mb. On architectures where atomic operations
> are fully ordered (such as x86 or s390) unconditionally executing
> a heavyweight smp_mb instruction causes a severe hit to performance
> while bringin no improvements in terms of correctness.

Have you measured this severe performance hit? There is an impact, but I
doubt you'll ever notice it in the profiles given where the
btrfs_tree_unlock appears.

> The better thing is to use the appropriate smp_mb__after_atomic routine
> which will do the correct thing (invoke a full smp_mb or, in the case
> of ordered atomics, insert a compiler barrier). Put another way,
> an RMW atomic op + smp_mb__after_atomic is equivalent, in terms of
> semantics, to a full smp_mb. This ensures that none of the problems
> described in the accompanying comment of waitqueue_active occur.
> No functional changes.

I tend to agree.


Re: fs was hung and sort of full

2018-02-23 Thread Rich Rauenzahn
It is still hanging after all.  Currently, it hangs on mount:

[154422.778624] mount   D0 27894  27742 0x0084
[154422.778625] Call Trace:
[154422.779018]  __schedule+0x28d/0x880
[154422.779494]  schedule+0x36/0x80
[154422.779886]  io_schedule+0x16/0x40
[154422.780288]  do_read_cache_page+0x3da/0x5c0
[154422.780669]  ? blkdev_writepages+0x10/0x10
[154422.781052]  ? mntput+0x24/0x40
[154422.781457]  ? page_cache_tree_insert+0xd0/0xd0
[154422.781849]  read_cache_page_gfp+0x1b/0x20
[154422.782272]  btrfs_scan_one_device+0x71/0x260 [btrfs]
[154422.782658]  ? __free_pages+0x25/0x30
[154422.783052]  ? free_pages.part.94+0x40/0x50
[154422.783521]  ? free_pages+0x13/0x20
[154422.783940]  btrfs_mount+0x2cd/0xfa0 [btrfs]
[154422.784353]  ? find_next_bit+0xb/0x10
[154422.784736]  ? pcpu_next_unpop+0x3c/0x50
[154422.785124]  ? find_next_bit+0xb/0x10
[154422.785565]  mount_fs+0x39/0x150
[154422.785949]  ? __alloc_percpu+0x15/0x20
[154422.786349]  vfs_kern_mount+0x67/0x130
[154422.786732]  btrfs_mount+0x19d/0xfa0 [btrfs]
[154422.787120]  ? find_next_bit+0xb/0x10
[154422.787586]  ? pcpu_next_unpop+0x3c/0x50
[154422.787966]  mount_fs+0x39/0x150
[154422.788360]  ? __alloc_percpu+0x15/0x20
[154422.788732]  vfs_kern_mount+0x67/0x130
[154422.789109]  do_mount+0x1f5/0xca0
[154422.789525]  SyS_mount+0x83/0xd0
[154422.789900]  do_syscall_64+0x74/0x1b0
[154422.790313]  entry_SYSCALL_64_after_hwframe+0x21/0x86

I do have a disk 100% busy (only one of the pair ...) ... so maybe
it's "working"?

Running 4.15.4-1.el7.elrepo.x86_64


On Thu, Feb 22, 2018 at 3:52 PM, Chris Murphy  wrote:
> If there's no hung task listed in dmesg you could try to do sysrq+t to
> find out what everything's up to, although then you have to learn how
> to parse the result.
>
>
> Chris Murphy


Re: [PATCH] btrfs: Remove redundant memory barriers

2018-02-23 Thread David Sterba
On Wed, Feb 14, 2018 at 10:53:36AM +0200, Nikolay Borisov wrote:
> Using any kind of memory barrier around atomic operations which have
> a return value is redundant, since those operations themselves are
> fully ordered. atomic_t.txt states:
> 
> - RMW operations that have a return value are fully ordered;
> 
> Fully ordered primitives are ordered against everything prior and
> everything subsequent. Therefore a fully ordered primitive is like
> having an smp_mb() before and an smp_mb() after the primitive.
> 
> Given this let's replace the extra memory barriers with comments.
> 
> Signed-off-by: Nikolay Borisov 

Reviewed-by: David Sterba 


Re: [PATCH v2 14/27] libbtrfsutil: add btrfs_util_deleted_subvolumes()

2018-02-23 Thread Omar Sandoval
On Fri, Feb 23, 2018 at 11:12:56AM +0900, Misono, Tomohiro wrote:
> 
> On 2018/02/16 4:04, Omar Sandoval wrote:
> > From: Omar Sandoval 
> > 
> > Signed-off-by: Omar Sandoval 
> > ---
> >  libbtrfsutil/btrfsutil.h| 21 +++
> >  libbtrfsutil/python/btrfsutilpy.h   |  3 +
> >  libbtrfsutil/python/module.c| 30 ++
> >  libbtrfsutil/python/qgroup.c| 17 +-
> >  libbtrfsutil/python/subvolume.c | 30 ++
> >  libbtrfsutil/python/tests/test_subvolume.py |  8 +++
> >  libbtrfsutil/subvolume.c| 89 
> > +
> >  7 files changed, 183 insertions(+), 15 deletions(-)
> > 
> > diff --git a/libbtrfsutil/btrfsutil.h b/libbtrfsutil/btrfsutil.h
> > index 00c86174..677ab3c1 100644
> > --- a/libbtrfsutil/btrfsutil.h
> > +++ b/libbtrfsutil/btrfsutil.h
> > @@ -534,6 +534,27 @@ enum btrfs_util_error btrfs_util_subvolume_iterator_next_info(struct btrfs_util_
> >   char **path_ret,
> >   struct 
> > btrfs_util_subvolume_info *subvol);
> >  
> > +/**
> > + * btrfs_util_deleted_subvolumes() - Get a list of subvolumes which
> > + * have been deleted but not yet cleaned up.
> > + * deleted but not yet cleaned up.
> > + * @path: Path on a Btrfs filesystem.
> > + * @ids: Returned array of subvolume IDs.
> > + * @n: Returned number of IDs in the @ids array.
> > + *
> > + * This requires appropriate privilege (CAP_SYS_ADMIN).
> > + *
> > + * Return: %BTRFS_UTIL_OK on success, non-zero error code on failure.
> > + */
> > +enum btrfs_util_error btrfs_util_deleted_subvolumes(const char *path,
> > +   uint64_t **ids,
> > +   size_t *n);
> > +
> > +/**
> > + * btrfs_util_deleted_subvolumes_fd() - See btrfs_util_deleted_subvolumes().
> > + */
> > +enum btrfs_util_error btrfs_util_deleted_subvolumes_fd(int fd, uint64_t 
> > **ids,
> > +  size_t *n);
> > +
> >  /**
> >   * btrfs_util_create_qgroup_inherit() - Create a qgroup inheritance specifier
> >   * for btrfs_util_create_subvolume() or btrfs_util_create_snapshot().
> > diff --git a/libbtrfsutil/python/btrfsutilpy.h 
> > b/libbtrfsutil/python/btrfsutilpy.h
> > index b3ec047f..be5122e2 100644
> > --- a/libbtrfsutil/python/btrfsutilpy.h
> > +++ b/libbtrfsutil/python/btrfsutilpy.h
> > @@ -54,6 +54,8 @@ struct path_arg {
> >  int path_converter(PyObject *o, void *p);
> >  void path_cleanup(struct path_arg *path);
> >  
> > +PyObject *list_from_uint64_array(const uint64_t *arr, size_t n);
> > +
> >  void SetFromBtrfsUtilError(enum btrfs_util_error err);
> >  void SetFromBtrfsUtilErrorWithPath(enum btrfs_util_error err,
> >struct path_arg *path);
> > @@ -72,6 +74,7 @@ PyObject *set_default_subvolume(PyObject *self, PyObject 
> > *args, PyObject *kwds);
> >  PyObject *create_subvolume(PyObject *self, PyObject *args, PyObject *kwds);
> >  PyObject *create_snapshot(PyObject *self, PyObject *args, PyObject *kwds);
> >  PyObject *delete_subvolume(PyObject *self, PyObject *args, PyObject *kwds);
> > +PyObject *deleted_subvolumes(PyObject *self, PyObject *args, PyObject 
> > *kwds);
> >  
> >  void add_module_constants(PyObject *m);
> >  
> > diff --git a/libbtrfsutil/python/module.c b/libbtrfsutil/python/module.c
> > index e995a1be..eaa062ac 100644
> > --- a/libbtrfsutil/python/module.c
> > +++ b/libbtrfsutil/python/module.c
> > @@ -125,6 +125,29 @@ err:
> > return 0;
> >  }
> >  
> > +PyObject *list_from_uint64_array(const uint64_t *arr, size_t n)
> > +{
> > +PyObject *ret;
> > +size_t i;
> > +
> > +ret = PyList_New(n);
> > +if (!ret)
> > +   return NULL;
> > +
> > +for (i = 0; i < n; i++) {
> > +   PyObject *tmp;
> > +
> > +   tmp = PyLong_FromUnsignedLongLong(arr[i]);
> > +   if (!tmp) {
> > +   Py_DECREF(ret);
> > +   return NULL;
> > +   }
> > +   PyList_SET_ITEM(ret, i, tmp);
> > +}
> > +
> > +return ret;
> > +}
> > +
> >  void path_cleanup(struct path_arg *path)
> >  {
> > Py_CLEAR(path->object);
> > @@ -214,6 +237,13 @@ static PyMethodDef btrfsutil_methods[] = {
> >  "path -- string, bytes, or path-like object\n"
> >  "recursive -- if the given subvolume has child subvolumes, delete\n"
> >  "them instead of failing"},
> > +   {"deleted_subvolumes", (PyCFunction)deleted_subvolumes,
> > +METH_VARARGS | METH_KEYWORDS,
> > +"deleted_subvolumes(path)\n\n"
> > +"Get the list of subvolume IDs which have been deleted but not yet\n"
> > +"cleaned up\n\n"
> > +"Arguments:\n"
> > +"path -- string, bytes, path-like object, or open file descriptor"},
> > {},
> >  };
> >  
> > diff --git a/libbtrfsutil/python/qgroup.c b/libbtrfsutil/python/qgroup.c
> > index 

Re: [PATCH v2 23/27] btrfs-progs: use libbtrfsutil for subvol sync

2018-02-23 Thread David Sterba
On Fri, Feb 23, 2018 at 02:41:27PM -0800, Omar Sandoval wrote:
> > >  static int cmd_subvol_sync(int argc, char **argv)
> > >  {
> > >   int fd = -1;
> > > - int i;
> > >   int ret = 1;
> > >   DIR *dirstream = NULL;
> > > - u64 *ids = NULL;
> > > - int id_count;
> > > + uint64_t *ids;
> > 
> > "ids" should be initialized to prevent to be used uninitialized for free()
> > in error path (clang build warning).
> 
> I'm disappointed that GCC didn't also warn here, but thanks, fixed.

It's not reported even with the make W=3 level, gcc version 7.3.0.


Re: [PATCH] Fix NULL pointer exception in find_bio_stripe()

2018-02-23 Thread David Sterba
On Fri, Feb 16, 2018 at 07:51:38PM +, Dmitriy Gorokh wrote:
> On detaching a disk which is part of a RAID6 filesystem, the following
> kernel OOPS may happen:
> 
> [63122.680461] BTRFS error (device sdo): bdev /dev/sdo errs: wr 0, rd 0, 
> flush 1, corrupt 0, gen 0 
> [63122.719584] BTRFS warning (device sdo): lost page write due to IO error on 
> /dev/sdo 
> [63122.719587] BTRFS error (device sdo): bdev /dev/sdo errs: wr 1, rd 0, 
> flush 1, corrupt 0, gen 0 
> [63122.803516] BTRFS warning (device sdo): lost page write due to IO error on 
> /dev/sdo 
> [63122.803519] BTRFS error (device sdo): bdev /dev/sdo errs: wr 2, rd 0, 
> flush 1, corrupt 0, gen 0 
> [63122.863902] BTRFS critical (device sdo): fatal error on device /dev/sdo 
> [63122.935338] BUG: unable to handle kernel NULL pointer dereference at 
> 0080 
> [63122.946554] IP: fail_bio_stripe+0x58/0xa0 [btrfs] 
> [63122.958185] PGD 9ecda067 P4D 9ecda067 PUD b2b37067 PMD 0 
> [63122.971202] Oops:  [#1] SMP 
> [63122.990786] Modules linked in: libcrc32c dlm configfs cpufreq_userspace 
> cpufreq_powersave cpufreq_conservative softdog nfsd auth_rpcgss nfs_acl nfs 
> lockd grace fscache sunrpc bonding ipmi_devintf ipmi_msghandler joydev 
> snd_intel8x0 snd_ac97_codec snd_pcm snd_timer snd psmouse evdev parport_pc 
> soundcore serio_raw battery pcspkr video ac97_bus ac parport ohci_pci 
> ohci_hcd i2c_piix4 button crc32c_generic crc32c_intel btrfs xor 
> zstd_decompress zstd_compress xxhash raid6_pq dm_mod dax raid1 md_mod 
> hid_generic usbhid hid xhci_pci xhci_hcd ehci_pci ehci_hcd usbcore sg sd_mod 
> sr_mod cdrom ata_generic ahci libahci ata_piix libata e1000 scsi_mod [last 
> unloaded: scst] 
> [63123.006760] CPU: 0 PID: 3979 Comm: kworker/u8:9 Tainted: G W 
> 4.14.2-16-scst34x+ #8 
> [63123.007091] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS 
> VirtualBox 12/01/2006 
> [63123.007402] Workqueue: btrfs-worker btrfs_worker_helper [btrfs] 
> [63123.007595] task: 880036ea4040 task.stack: c90006384000 
> [63123.007796] RIP: 0010:fail_bio_stripe+0x58/0xa0 [btrfs] 
> [63123.007968] RSP: 0018:c90006387ad8 EFLAGS: 00010287 
> [63123.008140] RAX: 0002 RBX: 88004beaa0b8 RCX: 
> 8800b2bd5690 
> [63123.008359] RDX:  RSI: 88007bb43500 RDI: 
> 88004beaa000 
> [63123.008621] RBP: c90006387ae8 R08: 9910 R09: 
> 8800b2bd5600 
> [63123.008840] R10: 0004 R11: 0001 R12: 
> 88007bb43500 
> [63123.009059] R13: fffb R14: 880036fc5180 R15: 
> 0004 
> [63123.009278] FS: () GS:8800b700() 
> knlGS: 
> [63123.009564] CS: 0010 DS:  ES:  CR0: 80050033 
> [63123.009748] CR2: 0080 CR3: b0866000 CR4: 
> 000406f0 
> [63123.009969] Call Trace: 
> [63123.010085] raid_write_end_io+0x7e/0x80 [btrfs] 
> [63123.010251] bio_endio+0xa1/0x120 
> [63123.010378] generic_make_request+0x218/0x270 
> [63123.010921] submit_bio+0x66/0x130 
> [63123.011073] finish_rmw+0x3fc/0x5b0 [btrfs] 
> [63123.011245] full_stripe_write+0x96/0xc0 [btrfs] 
> [63123.011428] raid56_parity_write+0x117/0x170 [btrfs] 
> [63123.011604] btrfs_map_bio+0x2ec/0x320 [btrfs] 
> [63123.011759] ? ___cache_free+0x1c5/0x300 
> [63123.011909] __btrfs_submit_bio_done+0x26/0x50 [btrfs] 
> [63123.012087] run_one_async_done+0x9c/0xc0 [btrfs] 
> [63123.012257] normal_work_helper+0x19e/0x300 [btrfs] 
> [63123.012429] btrfs_worker_helper+0x12/0x20 [btrfs] 
> [63123.012656] process_one_work+0x14d/0x350 
> [63123.012888] worker_thread+0x4d/0x3a0 
> [63123.013026] ? _raw_spin_unlock_irqrestore+0x15/0x20 
> [63123.013192] kthread+0x109/0x140 
> [63123.013315] ? process_scheduled_works+0x40/0x40 
> [63123.013472] ? kthread_stop+0x110/0x110 
> [63123.013610] ret_from_fork+0x25/0x30 
> [63123.013741] Code: 7e 43 31 c0 48 63 d0 48 8d 14 52 49 8d 4c d1 60 48 8b 51 
> 08 49 39 d0 72 1f 4c 63 1b 4c 01 da 49 39 d0 73 14 48 8b 11 48 8b 52 68 <48> 
> 8b 8a 80 00 00 00 48 39 4e 08 74 14 83 c0 01 44 39 d0 75 c4 
> [63123.014469] RIP: fail_bio_stripe+0x58/0xa0 [btrfs] RSP: c90006387ad8 
> [63123.014678] CR2: 0080 
> [63123.016590] ---[ end trace a295ea7259c17880 ]--- 
> 
> This is reproducible in a cycle where a series of writes is followed by
> a SCSI device delete command.  The test may take up to a few minutes.
> 
> Fixes: commit 74d46992e0d9dee7f1f376de0d56d31614c8a17a ("block: replace 
> bi_bdev with a gendisk pointer and partitions index")

Right, the commit introduced dereference of stripe->dev->bdev so we have
to check it first.

Please update the Fixes: tag, the word 'commit' should not be there and
the sha1 length can be shortened to 12 digits.

Also please add your Signed-off-by. Thanks.
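For what it's worth, git can emit the tag in exactly the requested format. The sketch below uses a throwaway repo with a stand-in commit; in the real kernel tree you would run only the final `git log` command against the actual sha1:

```shell
# Scratch repo standing in for the kernel tree, for demonstration only.
repo=$(mktemp -d)
git init -q "$repo" && cd "$repo"
git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty \
    -m 'block: replace bi_bdev with a gendisk pointer and partitions index'
# %h honors --abbrev, giving the 12-digit sha1 without a leading "commit".
git log -1 --abbrev=12 --format='Fixes: %h ("%s")'
# prints e.g.: Fixes: 74d46992e0d9 ("block: replace bi_bdev with a gendisk pointer and partitions index")
```

This avoids both problems David points out: the stray "commit" word and the over-long 40-character hash.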

Re: [PATCH v2 26/27] btrfs-progs: use libbtrfsutil for subvolume list

2018-02-23 Thread Omar Sandoval
On Fri, Feb 23, 2018 at 11:26:35AM +0900, Misono, Tomohiro wrote:
> 
> 
> On 2018/02/16 4:05, Omar Sandoval wrote:
> > From: Omar Sandoval 
> 
> > +static struct subvol_list *btrfs_list_deleted_subvols(int fd,
> > +  struct btrfs_list_filter_set *filter_set)
> > +{
> > +   struct subvol_list *subvols = NULL;
> > +   uint64_t *ids = NULL;
> > +   size_t i, n;
> > +   enum btrfs_util_error err;
> > +   int ret = -1;
> > +
> > +   err = btrfs_util_deleted_subvolumes_fd(fd, &ids, &n);
> > +   if (err) {
> > +   error_btrfs_util(err);
> > +   return NULL;
> > +   }
> > +
> > +   subvols = malloc(sizeof(*subvols) + n * sizeof(subvols->subvols[0]));
> > +   if (!subvols) {
> > +   error("out of memory");
> > +   goto out;
> > +   }
> > +
> > +   subvols->num = 0;
> > +   for (i = 0; i < n; i++) {
> > +   struct listed_subvol *subvol = &subvols->subvols[subvols->num];
> > +
> > +   err = btrfs_util_subvolume_info_fd(fd, ids[i], &subvol->info);
> > +   if (err) {
> 
> I think there is a small chance that a subvolume could be removed from
> the tree between btrfs_util_deleted_subvolumes_fd() and
> btrfs_util_subvolume_info_fd().
> So, a BTRFS_UTIL_ERROR_SUBVOLUME_NOT_FOUND error should be ignored.

Thanks, since this patch isn't in the devel branch yet, I'll fold the
fix in.


Re: [PATCH 1/1] btrfs: fix NPD when target device is missing

2018-02-23 Thread David Sterba
On Tue, Feb 20, 2018 at 10:46:25PM +0800, Anand Jain wrote:
> The replace target device can be missing, in which case we don't
> allocate a missing btrfs_device when mounted with -o degraded.
> So check the device before accessing it.
> 
> BUG: unable to handle kernel NULL pointer dereference at 00b0

Please don't use uncommon acronyms, NPD is quite confusing, null pointer
deref should be fine.

> IP: btrfs_destroy_dev_replace_tgtdev+0x43/0xf0 [btrfs]
> Call Trace:
> btrfs_dev_replace_cancel+0x15f/0x180 [btrfs]
> btrfs_ioctl+0x2216/0x2590 [btrfs]
> do_vfs_ioctl+0x625/0x650
> SyS_ioctl+0x4e/0x80
> do_syscall_64+0x5d/0x160
> entry_SYSCALL64_slow_path+0x25/0x25

Do you have a reproducer for that?


Re: [PATCH v2 06/27] libbtrfsutil: add btrfs_util_create_subvolume()

2018-02-23 Thread Omar Sandoval
On Fri, Feb 23, 2018 at 05:24:04PM +0900, Misono, Tomohiro wrote:
> On 2018/02/16 4:04, Omar Sandoval wrote:
> > From: Omar Sandoval 
> 
> > +static enum btrfs_util_error openat_parent_and_name(int dirfd, const char 
> > *path,
> > +   char *name, size_t name_len,
> > +   int *fd)
> > +{
> > +   char *tmp_path, *slash, *dirname, *basename;
> > +   size_t len;
> > +
> > +   /* Ignore trailing slashes. */
> > +   len = strlen(path);
> > +   while (len > 1 && path[len - 1] == '/')
> > +   len--;
> > +
> > +   tmp_path = malloc(len + 1);
> > +   if (!tmp_path)
> > +   return BTRFS_UTIL_ERROR_NO_MEMORY;
> > +   memcpy(tmp_path, path, len);
> > +   tmp_path[len] = '\0';
> > +
> > +   slash = memrchr(tmp_path, '/', len);
> > +   if (slash == tmp_path) {
> > +   dirname = "/";
> > +   basename = tmp_path + 1;
> > +   } else if (slash) {
> > +   *slash = '\0';
> > +   dirname = tmp_path;
> > +   basename = slash + 1;
> > +   } else {
> > +   dirname = ".";
> > +   basename = tmp_path;
> > +   }
> > +
> > +   len = strlen(basename);
> > +   if (len >= name_len) {
> > +   errno = ENAMETOOLONG;
> 
> tmp_path should be also freed here.

Another good catch, thanks.


Re: [PATCH 1/1] btrfs: remove assert in btrfs_init_dev_replace_tgtdev()

2018-02-23 Thread David Sterba
On Tue, Feb 20, 2018 at 10:50:36PM +0800, Anand Jain wrote:
> In the same function we just ran btrfs_alloc_device(), which means
> btrfs_device::resized_list is sure to be empty, and we are protected
> by btrfs_fs_info::volume_mutex.
> 
> Signed-off-by: Anand Jain 

Reviewed-by: David Sterba 


Re: [PATCH v2 11/27] libbtrfsutil: add subvolume iterator helpers

2018-02-23 Thread Omar Sandoval
On Fri, Feb 23, 2018 at 04:40:45PM +0900, Misono, Tomohiro wrote:
> On 2018/02/16 4:04, Omar Sandoval wrote:
> > From: Omar Sandoval 
> 
> > +PUBLIC enum btrfs_util_error btrfs_util_create_subvolume_iterator(const 
> > char *path,
> > + uint64_t top,
> > + int flags,
> > + struct 
> > btrfs_util_subvolume_iterator **ret)
> > +{
> > +   enum btrfs_util_error err;
> > +   int fd;
> > +
> > +   fd = open(path, O_RDONLY);
> > +   if (fd == -1)
> > +   return BTRFS_UTIL_ERROR_OPEN_FAILED;
> > +
> > +   err = btrfs_util_create_subvolume_iterator_fd(fd, top, flags, ret);
> > +   if (err == BTRFS_UTIL_OK)
> > +   (*ret)->flags |= BTRFS_UTIL_SUBVOLUME_ITERATOR_CLOSE_FD;
> 
> If btrfs_util_create_subvolume_iterator_fd() returns error, 'fd' remains open.
> So, fd should be closed here.

Good catch, fixed.


Re: [PATCH v2 07/27] libbtrfsutil: add btrfs_util_subvolume_path()

2018-02-23 Thread Omar Sandoval
On Fri, Feb 23, 2018 at 03:27:52PM +0900, Misono, Tomohiro wrote:
> On 2018/02/16 4:04, Omar Sandoval wrote:
> > From: Omar Sandoval 
> 
> > +PUBLIC enum btrfs_util_error btrfs_util_subvolume_path_fd(int fd, uint64_t 
> > id,
> > + char **path_ret)
> > +{
> > +   char *path, *p;
> > +   size_t capacity = 4096;
> > +
> > +   path = malloc(capacity);
> > +   if (!path)
> > +   return BTRFS_UTIL_ERROR_NO_MEMORY;
> > +   p = path + capacity - 1;
> > +   p[0] = '\0';
> > +
> > +   if (id == 0) {
> > +   enum btrfs_util_error err;
> > +
> > +   err = btrfs_util_is_subvolume_fd(fd);
> > +   if (err)
> > +   return err;
> 
> 'path' should be freed here and below.
> 
> > +
> > +   err = btrfs_util_subvolume_id_fd(fd, &id);
> > +   if (err)
> > +   return err;
> > +   }

Indeed, although I'll just change it to allocate path after this
instead.


Re: [PATCH v2 16/27] btrfs-progs: use libbtrfsutil for read-only property

2018-02-23 Thread Omar Sandoval
On Thu, Feb 22, 2018 at 01:23:45PM +0900, Misono, Tomohiro wrote:
> 
> 
> On 2018/02/16 4:05, Omar Sandoval wrote:
> > From: Omar Sandoval 
> > 
> > Signed-off-by: Omar Sandoval 
> > ---
> >  messages.h | 13 
> >  props.c| 69 
> > +++---
> >  2 files changed, 38 insertions(+), 44 deletions(-)
> > 
> > diff --git a/messages.h b/messages.h
> > index 4999c7b9..004d5167 100644
> > --- a/messages.h
> > +++ b/messages.h
> > @@ -54,6 +54,19 @@
> > DO_ABORT_ON_ERROR;  \
> > } while (0)
> >  
> > +#define error_btrfs_util(err)  
> > \
> > +   do {\
> > +   const char *errno_str = strerror(errno);\
> > +   const char *lib_str = btrfs_util_strerror(err)  \
> 
> "make D=trace" fails because ";" is missing here.

That's embarrassing, thanks!


Re: [PATCH v2 23/27] btrfs-progs: use libbtrfsutil for subvol sync

2018-02-23 Thread Omar Sandoval
On Thu, Feb 22, 2018 at 11:03:05AM +0900, Misono, Tomohiro wrote:
> 
> 
> On 2018/02/16 4:05, Omar Sandoval wrote:
> > From: Omar Sandoval 
> > 
> > btrfs_util_deleted_subvolumes() replaces enumerate_dead_subvols() and
> > btrfs_util_subvolume_info() replaces is_subvolume_cleaned().
> > 
> > Signed-off-by: Omar Sandoval 
> > ---
> >  cmds-subvolume.c | 217 
> > ++-
> >  1 file changed, 21 insertions(+), 196 deletions(-)
> > 
> > diff --git a/cmds-subvolume.c b/cmds-subvolume.c
> > index 49c9c8cf..9bab9312 100644
> > --- a/cmds-subvolume.c
> > +++ b/cmds-subvolume.c
> > @@ -42,38 +42,11 @@
> >  #include "utils.h"
> >  #include "help.h"
> >  
> > -static int is_subvolume_cleaned(int fd, u64 subvolid)
> > +static int wait_for_subvolume_cleaning(int fd, size_t count, uint64_t *ids,
> > +  int sleep_interval)
> >  {
> > -   int ret;
> > -   struct btrfs_ioctl_search_args args;
> > -   struct btrfs_ioctl_search_key *sk = &args.key;
> > -
> > -   sk->tree_id = BTRFS_ROOT_TREE_OBJECTID;
> > -   sk->min_objectid = subvolid;
> > -   sk->max_objectid = subvolid;
> > -   sk->min_type = BTRFS_ROOT_ITEM_KEY;
> > -   sk->max_type = BTRFS_ROOT_ITEM_KEY;
> > -   sk->min_offset = 0;
> > -   sk->max_offset = (u64)-1;
> > -   sk->min_transid = 0;
> > -   sk->max_transid = (u64)-1;
> > -   sk->nr_items = 1;
> > -
> > -   ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args);
> > -   if (ret < 0)
> > -   return -errno;
> > -
> > -   if (sk->nr_items == 0)
> > -   return 1;
> > -
> > -   return 0;
> > -}
> > -
> > -static int wait_for_subvolume_cleaning(int fd, int count, u64 *ids,
> > -   int sleep_interval)
> > -{
> > -   int ret;
> > -   int i;
> > +   size_t i;
> > +   enum btrfs_util_error err;
> >  
> > while (1) {
> > int clean = 1;
> > @@ -81,16 +54,14 @@ static int wait_for_subvolume_cleaning(int fd, int 
> > count, u64 *ids,
> > for (i = 0; i < count; i++) {
> > if (!ids[i])
> > continue;
> > -   ret = is_subvolume_cleaned(fd, ids[i]);
> > -   if (ret < 0) {
> > -   error(
> > -   "cannot read status of dead subvolume %llu: %s",
> > -   (unsigned long long)ids[i], 
> > strerror(-ret));
> > -   return ret;
> > -   }
> > -   if (ret) {
> > -   printf("Subvolume id %llu is gone\n", ids[i]);
> > +   err = btrfs_util_subvolume_info_fd(fd, ids[i], NULL);
> > +   if (err == BTRFS_UTIL_ERROR_SUBVOLUME_NOT_FOUND) {
> > +   printf("Subvolume id %" PRIu64 " is gone\n",
> > +  ids[i]);
> > ids[i] = 0;
> > +   } else if (err) {
> > +   error_btrfs_util(err);
> > +   return -errno;
> > } else {
> > clean = 0;
> > }
> > @@ -1028,160 +999,15 @@ static const char * const cmd_subvol_sync_usage[] = 
> > {
> > NULL
> >  };
> >  
> > -#if 0
> > -/*
> > - * If we're looking for any dead subvolume, take a shortcut and look
> > - * for any ORPHAN_ITEMs in the tree root
> > - */
> > -static int fs_has_dead_subvolumes(int fd)
> > -{
> > -   int ret;
> > -   struct btrfs_ioctl_search_args args;
> > -   struct btrfs_ioctl_search_key *sk = &args.key;
> > -   struct btrfs_ioctl_search_header sh;
> > -   u64 min_subvolid = 0;
> > -
> > -again:
> > -   sk->tree_id = BTRFS_ROOT_TREE_OBJECTID;
> > -   sk->min_objectid = BTRFS_ORPHAN_OBJECTID;
> > -   sk->max_objectid = BTRFS_ORPHAN_OBJECTID;
> > -   sk->min_type = BTRFS_ORPHAN_ITEM_KEY;
> > -   sk->max_type = BTRFS_ORPHAN_ITEM_KEY;
> > -   sk->min_offset = min_subvolid;
> > -   sk->max_offset = (u64)-1;
> > -   sk->min_transid = 0;
> > -   sk->max_transid = (u64)-1;
> > -   sk->nr_items = 1;
> > -
> > -   ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args);
> > -   if (ret < 0)
> > -   return -errno;
> > -
> > -   if (!sk->nr_items)
> > -   return 0;
> > -
> > -   memcpy(&sh, args.buf, sizeof(sh));
> > -   min_subvolid = sh.offset;
> > -
> > -   /*
> > -* Verify that the root item is really there and we haven't hit
> > -* a stale orphan
> > -*/
> > -   sk->tree_id = BTRFS_ROOT_TREE_OBJECTID;
> > -   sk->min_objectid = min_subvolid;
> > -   sk->max_objectid = min_subvolid;
> > -   sk->min_type = BTRFS_ROOT_ITEM_KEY;
> > -   sk->max_type = BTRFS_ROOT_ITEM_KEY;
> > -   sk->min_offset = 0;
> > -   sk->max_offset = (u64)-1;
> > -   sk->min_transid = 0;
> > -   sk->max_transid = (u64)-1;
> > -   sk->nr_items = 1;
> > -
> > -   ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args);
> > -   if (ret < 0)
> > -   return -errno;
> > -
> > -   /*
> > -* Stale orphan, try the next one
> 

Re: [PATCH v2 10/27] libbtrfsutil: add btrfs_util_[gs]et_default_subvolume()

2018-02-23 Thread Omar Sandoval
On Thu, Feb 22, 2018 at 10:55:48AM +0900, Misono, Tomohiro wrote:
> On 2018/02/16 4:04, Omar Sandoval wrote:
> > From: Omar Sandoval 
> > 
> > set_default_subvolume() is a trivial ioctl(), but there's no ioctl() for
> > get_default_subvolume(), so we need to search the root tree.
> > 
> > Signed-off-by: Omar Sandoval 
> > ---
> >  libbtrfsutil/btrfsutil.h|  41 ++
> >  libbtrfsutil/python/btrfsutilpy.h   |   2 +
> >  libbtrfsutil/python/module.c|  14 
> >  libbtrfsutil/python/subvolume.c |  50 
> >  libbtrfsutil/python/tests/test_subvolume.py |  14 
> >  libbtrfsutil/subvolume.c| 113 
> > 
> >  6 files changed, 234 insertions(+)
> > 
> > diff --git a/libbtrfsutil/btrfsutil.h b/libbtrfsutil/btrfsutil.h
> > index 8bd2b847..54777f1d 100644
> > --- a/libbtrfsutil/btrfsutil.h
> > +++ b/libbtrfsutil/btrfsutil.h
> > @@ -256,6 +256,8 @@ enum btrfs_util_error 
> > btrfs_util_get_subvolume_read_only_fd(int fd, bool *ret);
> >   * @path: Subvolume path.
> >   * @read_only: New value of read-only flag.
> >   *
> > + * This requires appropriate privilege (CAP_SYS_ADMIN).
> > + *
> >   * Return: %BTRFS_UTIL_OK on success, non-zero error code on failure.
> >   */
> >  enum btrfs_util_error btrfs_util_set_subvolume_read_only(const char *path,
> > @@ -268,6 +270,45 @@ enum btrfs_util_error 
> > btrfs_util_set_subvolume_read_only(const char *path,
> >  enum btrfs_util_error btrfs_util_set_subvolume_read_only_fd(int fd,
> > bool read_only);
> >  
> > +/**
> > + * btrfs_util_get_default_subvolume() - Get the default subvolume for a
> > + * filesystem.
> > + * @path: Path on a Btrfs filesystem.
> > + * @id_ret: Returned subvolume ID.
> > + *
> > + * This requires appropriate privilege (CAP_SYS_ADMIN).
> > + *
> > + * Return: %BTRFS_UTIL_OK on success, non-zero error code on failure.
> > + */
> > +enum btrfs_util_error btrfs_util_get_default_subvolume(const char *path,
> > +  uint64_t *id_ret);
> > +
> > +/**
> > + * btrfs_util_get_default_subvolume_fd() - See
> > + * btrfs_util_get_default_subvolume().
> > + */
> > +enum btrfs_util_error btrfs_util_get_default_subvolume_fd(int fd,
> > + uint64_t *id_ret);
> > +
> > +/**
> > + * btrfs_util_set_default_subvolume() - Set the default subvolume for a
> > + * filesystem.
> > + * @path: Path in a Btrfs filesystem. This may be any path in the 
> > filesystem; it
> > + * does not have to refer to a subvolume unless @id is zero.
> > + * @id: ID of subvolume to set as the default. If zero is given, the 
> > subvolume
> > + * ID of @path is used.
> 
> The line "This requires appropriate privilege (CAP_SYS_ADMIN)." is missing 
> here.

Good catch, thanks, fixed.


Re: [PATCH] btrfs: fix endianness in super_copy

2018-02-23 Thread David Sterba
On Thu, Feb 22, 2018 at 09:58:42PM +0800, Anand Jain wrote:
> Moving between opposite endianness will report bogus numbers in sysfs,
> and mount may fail as the root will not be restored correctly. If the
> filesystem is always used on a same endian host, this will not be a
> problem.
> Fix this by using the btrfs_set_super...() functions to set
> fs_info::super_copy values, and for the sysfs, use the cached
> fs_info::nodesize,sectorsize values.
> 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Anand Jain 

Reviewed-by: David Sterba 

Updated changelog and Fixes: tag.


Re: [PATCH v2] btrfs: fix bare unsigned declarations

2018-02-23 Thread David Sterba
On Thu, Feb 15, 2018 at 01:13:00AM +0800, Anand Jain wrote:
> We have btrfs_fs_info::data_chunk_allocations and
> btrfs_fs_info::metadata_ratio declared as unsigned which would
> be unsigned int, and kernel style prefers unsigned int over bare
> unsigned. So this patch changes them to u32.
> 
> Signed-off-by: Anand Jain 

Please also update the temporary variable old_metadata_ratio in
btrfs_remount that get assigned from one of the variables.


Re: [PATCH v2] btrfs: verify subvolid mount parameter

2018-02-23 Thread David Sterba
On Thu, Feb 15, 2018 at 01:11:37AM +0800, Anand Jain wrote:
> We aren't verifying the parameter passed to the subvolid mount option,
> so we won't report and fail the mount if a junk value is specified for
> example, -o subvolid=abc.
> This patch verifies the subvolid option with match_u64.
> 
> Signed-off-by: Anand Jain 

Reviewed-by: David Sterba 

Changelog updated to mention the previous memparse effects.


Re: [PATCH v2 00/27] btrfs-progs: introduce libbtrfsutil, "btrfs-progs as a library"

2018-02-23 Thread David Sterba
On Wed, Feb 21, 2018 at 10:50:32AM -0800, Omar Sandoval wrote:
> On Wed, Feb 21, 2018 at 04:13:38PM +0100, David Sterba wrote:
> > On Tue, Feb 20, 2018 at 07:50:48PM +0100, David Sterba wrote:
> > > I have more comments or maybe questions about the future development
> > > workflow, but at this point the patchset is in a good shape for
> > > incremental merge.
> > 
> > After removnig the first patch adding subvolume.c (with
> > linux/btrfs_tree.h) and what depends on it, I'm left with:
> > 
> > Omar Sandoval (4):
> >   Add libbtrfsutil
> >   libbtrfsutil: add Python bindings
> >   libbtrfsutil: add qgroup inheritance helpers
> >   libbtrfsutil: add filesystem sync helpers
> > 
> > with some context updates. That builds and passes the CI tests.
> 
> Great. Does the CI system run the Python tests yet?

Tested here: https://travis-ci.org/kdave/btrfs-progs/jobs/345410536, and it
does not pass.


test_start_sync (test_filesystem.TestSubvolume) ... mkfs.btrfs: invalid option -- 'q'
usage: mkfs.btrfs [options] dev [ dev ... ]


Looks like it tries to use the system mkfs.btrfs, which is too old.


Re: lots of data written, but no space occupied

2018-02-23 Thread André Weidemann


On 23.02.2018 00:50, Chris Murphy wrote:

The only thing I can think of is something's updating metadata due to
relatime mount option. Maybe try noatime? At 9GiB, really it's 4.5GiB
because whatever is being written is being doubled by raid1 profile
and multiple devices. There is a case of wandering trees where a
little bit of change can get amplified, but this requires an
instigator, the file system won't do it on its own.

What were the mkfs options? Are you using the default 16KiB nodesize?


I found the cause...  And "pebkac" is the only way to describe it.

As I wrote in my initial email, I installed collectd. And I was sure to 
have disabled the rrdtool plugin, which apparently I hadn't. So there 
was no additional space occupied, because collectd only updated already 
existing files.


In the end it only took a small effort to figure it out.
I knew that calling "echo 1 > /proc/sys/vm/block_dump" would dump all 
file access to syslog.
Since /var/log/syslog was on the affected drives, I created a ram-disk 
and had rsyslog log to the RAM-disk. I did this so I wouldn't see my own 
writes to the log while running "grep sda".

I then grepped the syslog using:
"tail -f /var/log-ram/syslog|egrep -v "btrfs-transacti|kwork"|grep sda"
until something like this fell out:
Feb 23 15:05:35 server2 kernel: [496223.186943] collectd(604): dirtied 
inode 870979 (smart_attribute-temperature-celsius-2.rrd) on sda2




Thanks anyway.
  André






Re: btrfs crash consistency bug : Blocks allocated beyond eof are lost

2018-02-23 Thread Filipe Manana
On Fri, Feb 23, 2018 at 4:35 PM, Jayashree Mohan
 wrote:
> Hi,
>
> [Fsync issue in btrfs]
> In addition to the above, I would like to bring to your notice that :
> After doing a fallocate or fallocate zero_range with keep size option,
> a fsync() operation would have no effect at all. If we crash after the
> fsync, on recovery the blocks allocated due to the fallocate call are
> lost. This aligns with a patch work here[1] for a similar issue with
> punch_hole. A simple test scenario that reproduces this bug is :
>
> 1. write (0-16K) to a file foo
> 2. sync()
> 3. fallocate keep_size (16K - 20K)
> 4. fsync(foo)
> 5. Crash
>
> On recovery, all blocks allocated in step 3 are lost (this is true
> even when fallocate is replaced by the zero_range operation supported in
> kernel 4.16).
> Could you explain why a fsync() of the file would still not persist
> the metadata(block count in this case), across power failures?

In a very short explanation, because it thinks, on log recovery, that
a shrinking truncate happened before the file was fsync'ed, so it
drops the extents allocated by fallocate() after it replayed them from
the log.
I had seen this a year or 2 ago but never managed to fix it due to
other more important things, but I'll try to fix it soon.

>
> [1] https://patchwork.kernel.org/patch/5830801/
>
>
> Thanks,
> Jayashree Mohan
>
>
>
>
>
> On Wed, Feb 21, 2018 at 8:23 PM, Jayashree Mohan
>  wrote:
>> Hi,
>>
>> On btrfs (as of kernel 4.15), say we fallocate a file with keep_size
>> option, followed by fdatasync() or fsync(). If we now crash, on
>> recovery we see a wrong block count and all the blocks allocated
>> beyond the eof are lost. This bug was reported(xfstest generic/468)
>> and patched on ext4[1], and a variant of this, that did not recover
>> the correct file size was patched in f2fs[2]. I am wondering why this
>> is still not fixed in btrfs. You can reproduce this bug on btrfs using
>> a tool called CrashMonkey that we are building at UT Austin, which is
>> a test harness for filesystem crash consistency checks[3]
>>
>> To reproduce the bug, simply run :
>>  ./c_harness -f /dev/sda -d /dev/cow_ram0 -t btrfs -e 102400  -v
>> tests/generic_468.so
>>
>> Is there a reason why this is not yet patched in btrfs? I don't see
>> why even after a fsync(), losing the blocks allocated beyond the eof
>> are acceptable.
>>
>> [1] https://patchwork.kernel.org/patch/10120293/
>> [2] https://sourceforge.net/p/linux-f2fs/mailman/message/36104201/
>> [3] https://github.com/utsaslab/crashmonkey
>>
>> Thanks,
>>
>> Jayashree Mohan
>> 2nd Year PhD in Computer Science
>> University of Texas at Austin.



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”




Re: [RFC PATCH] btrfs: Speedup btrfs_read_block_groups()

2018-02-23 Thread Ellis H. Wilson III

On 02/22/2018 06:37 PM, Qu Wenruo wrote:

On 2018年02月23日 00:31, Ellis H. Wilson III wrote:

On 02/21/2018 11:56 PM, Qu Wenruo wrote:

On 2018年02月22日 12:52, Qu Wenruo wrote:

btrfs_read_block_groups() is used to build up the block group cache for
all block groups, so it will iterate all block group items in extent
tree.

For a large filesystem (TB level), it will search for BLOCK_GROUP_ITEM
thousands of times, which is the most time-consuming part of mounting
btrfs.

So this patch will try to speed it up by:

1) Avoid unnecessary readahead
     We were using READA_FORWARD to search for block group item.
     However block group items are in fact scattered across quite a
lot of
     leaves. Doing readahead will just waste our IO (especially important
     for HDD).

     In a real-world case, for a filesystem with 3T used space, it would
     have about 50K extent tree leaves, but only about 3K block group
     items, meaning we need to iterate 16 leaves to find one block group
     on average.

     So readahead won't help but waste slow HDD seeks.

2) Use chunk mapping to locate block group items
     Since one block group item always has one corresponding chunk item,
     we could use chunk mapping to get the block group item size.

     With block group item size, we can do a pinpoint tree search,
instead
     of searching with some uncertain value and do forward search.

     In some cases, like when the next BLOCK_GROUP_ITEM is in the next leaf
     of the current path, we can save such an unnecessary tree block read.

Cc: Ellis H. Wilson III 


Hi Ellis,

Would you please try this patch to see if it helps to speedup the mount
of your large filesystem?


I will try either tomorrow or over the weekend.  I'm waiting on hardware
to be able to build and load a custom kernel on.


If you're using Archlinux, I could build the package for you.

(For other distributions, unfortunately I'm not that familiar with)

Thanks,
Qu


No sweat.  I'm not running arch anywhere, so was glad to handle this myself.

Short story: It doesn't appear to have any notable impact on mount time.

Long story:
#Built a modern kernel:
git clone https://github.com/kdave/btrfs-devel
cd'd into btrfs-devel
copied my current kernel config in /boot to .config
make olddefconfig
make -j16
make modules_install
make install
grub2-mkconfig -o /boot/grub/grub.cfg
reboot

#Reran tests with vanilla 4.16.0-rc1+ kernel
As root, of the form: time mount /dev/sdb /mnt/btrfs
5 iteration average: 16.869s

#Applied your patch, rebuild, switched kernel module
wget -O - 'https://patchwork.kernel.org/patch/10234619/mbox' | git am -
make -j16
make modules_install
rmmod btrfs
modprobe btrfs

#Reran tests with patched 4.16.0-rc1+ kernel
As root, of the form: time mount /dev/sdb /mnt/btrfs
5 iteration average: 16.642s

So, there's a slight improvement against vanilla 4.16.0-rc1+, but it's 
still slightly slower than my original runs in 4.5.5, which got me 
16.553s.  In any event, most of this is statistically insignificant 
since the standard deviation is about two tenths of a second.


So, my conclusion here is this problem needs to be handled at an 
architectural level to be truly solved (read: have mounts that take a few 
seconds at worst), which requires either:
a) On-disk format changes like you (Qu) suggested some time back for a 
tree of block groups or
b) Lazy block group walking post-mount and algorithms that can cope with 
making sub-optimal choices.  One would likely want to stonewall out 
certain operations until the lazy post-mount walk completed like 
balance, defrag, etc, that have more reason to require complete 
knowledge of the usage of each block group.


I may take a stab at b), but I'm first going to do the tests I promised 
relating to how mount times scale with increased capacity consumption 
for varying filesizes.


Best,

ellis


Re: [PATCH v2 10/10] btrfs: qgroup: Use independent and accurate per inode qgroup rsv

2018-02-23 Thread Jeff Mahoney
On 2/22/18 6:34 PM, Qu Wenruo wrote:
> 
> 
> On 2018年02月23日 06:44, Jeff Mahoney wrote:
>> On 12/22/17 1:18 AM, Qu Wenruo wrote:
>>> Unlike reservation calculation used in inode rsv for metadata, qgroup
> >>> doesn't really need to care about things like csum size or extent usage for
>>> whole tree COW.
>>>
> >>> Qgroup cares more about the net change of extent usage.
>>> That's to say, if we're going to insert one file extent, it will mostly
>>> find its place in CoWed tree block, leaving no change in extent usage.
>>> Or cause leaf split, result one new net extent, increasing qgroup number
>>> by nodesize.
>>> (Or even more rare case, increase the tree level, increasing qgroup
>>> number by 2 * nodesize)
>>>
> >>> So, instead of the deliberately overestimated calculation used by the
> >>> extent allocator, which cares more about accuracy and avoiding errors,
> >>> qgroup doesn't need such an over-calculated reservation.
>>>
>>> This patch will maintain 2 new members in btrfs_block_rsv structure for
>>> qgroup, using much smaller calculation for qgroup rsv, reducing false
>>> EDQUOT.
>>
>>
>> I think this is the right idea but, rather than being the last step in a
>> qgroup rework, it should be the first.
> 
> That's right.
> 
> Although putting it as 1st patch needs extra work to co-operate with
> later type separation.
> 
>>  Don't qgroup reservation
>> lifetimes match the block reservation lifetimes?
> 
> Also correct, but...
> 
>>  We wouldn't want a
>> global reserve and we wouldn't track usage on a per-block basis, but
>> they should otherwise match.  We already have clear APIs surrounding the
>> use of block reservations, so it seems to me that it'd make sense to
>> extend those to cover the qgroup cases as well.  Otherwise, the rest of
>> your patchset makes a parallel reservation system with a different API.
>> That keeps the current state of qgroups as a bolt-on that always needs
>> to be tracked separately (and which developers need to ensure they get
>> right).
> 
> The problem is, block reservation is designed to ensure every CoW could
> be fulfilled without error.
> 
> That's to say, for a case like a CoW write with csum, we need to care about
> space reservation not only for the EXTENT_DATA in the fs tree, but also the
> later EXTENT_ITEM for the extent tree and the CSUM_ITEM for the csum tree.
> 
> However, the extent tree and csum tree don't contribute to quota at all.
> If we follow such over-reservation, it would be much easier to hit
> false EDQUOT early.

I'm not suggesting a 1:1 mapping between block reservations and qgroup
reservations.  If that were possible, we wouldn't need separate
reservations at all.  What we can do is only use bytes from the qgroup
reservation when we allocate the leaf nodes belonging to the root we're
tracking.  Everywhere else we can migrate bytes normally between
reservations the same way we do for block reservations.  As we discussed
offline yesterday, I'll work up something along what I have in mind and
see if it works out.

-Jeff

> That's the main reason for a separate (and slightly simplified) block
> rsv tracking system.
> 
> And if there is better solution, I'm all ears.
> 
> Thanks,
> Qu
> 
>>
>> -Jeff
>>
>>> Signed-off-by: Qu Wenruo 
>>> ---
>>>  fs/btrfs/ctree.h   | 18 +
>>>  fs/btrfs/extent-tree.c | 55 
>>> --
>>>  2 files changed, 62 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>> index 0c58f92c2d44..97783ba91e00 100644
>>> --- a/fs/btrfs/ctree.h
>>> +++ b/fs/btrfs/ctree.h
>>> @@ -467,6 +467,24 @@ struct btrfs_block_rsv {
>>> unsigned short full;
>>> unsigned short type;
>>> unsigned short failfast;
>>> +
>>> +   /*
>>> +* Qgroup equivalent for @size @reserved
>>> +*
> >>> +* Unlike the normal @size/@reserved for inode rsv,
>>> +* qgroup doesn't care about things like csum size nor how many tree
>>> +* blocks it will need to reserve.
>>> +*
>>> +* Qgroup cares more about *NET* change of extent usage.
>>> +* So for ONE newly inserted file extent, in worst case it will cause
>>> +* leaf split and level increase, nodesize for each file extent
>>> +* is already way overkilled.
>>> +*
>>> +* In short, qgroup_size/reserved is the up limit of possible needed
>>> +* qgroup metadata reservation.
>>> +*/
>>> +   u64 qgroup_rsv_size;
>>> +   u64 qgroup_rsv_reserved;
>>>  };
>>>  
>>>  /*
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index 986660f301de..9d579c7bcf7f 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -5565,14 +5565,18 @@ static void space_info_add_new_bytes(struct 
>>> btrfs_fs_info *fs_info,
>>>  
>>>  static u64 block_rsv_release_bytes(struct btrfs_fs_info *fs_info,
>>> struct btrfs_block_rsv *block_rsv,
>>> -   struct btrfs_block_rsv *dest, u64 num_bytes)
>>> +   struct 

Btrfs remounted read-only due to ENOSPC in btrfs_run_delayed_refs

2018-02-23 Thread Martin Svec
Hello,

we have a btrfs-based backup system using btrfs snapshots and rsync. Sometimes,
we hit an ENOSPC bug and the filesystem is remounted read-only. However, there's
still plenty of unallocated space according to "btrfs fi usage". So I think this
isn't another edge condition where btrfs runs out of space due to fragmented
chunks, but a bug in the disk space allocation code. It suffices to umount the
filesystem and remount it, and everything works fine again. The frequency of
ENOSPC seems to depend on metadata chunk usage. When there's a lot of free
space in existing
metadata chunks, the bug doesn't happen for months. If most metadata chunks are
above ~98%, we hit the bug every few days. Below are details regarding the 
backup
server and btrfs.

The backup works as follows: 

  * Every night, we create a btrfs snapshot on the backup server and rsync data
    from a production server into it. This snapshot is then marked read-only and
will be used as a base subvolume for the next backup snapshot.
  * Every day, expired snapshots are removed and their space is freed. Cleanup
is scheduled in such a way that it doesn't interfere with the backup window.
  * Multiple production servers are backed up in parallel to one backup server.
  * The backed up servers are mostly webhosting servers and mail servers, i.e.
hundreds of billions of small files. (Yes, we push btrfs to the limits :-))
  * Backup server contains ~1080 snapshots, Zlib compression is enabled.
  * Rsync is configured to use whole file copying.

System configuration:

Debian Stretch, vanilla stable 4.14.20 kernel with one custom btrfs patch
(see below) and Nikolay's patch 1b816c23e9 ("btrfs: Add enospc_debug printing
in metadata_reserve_bytes")

btrfs mount options: noatime,compress=zlib,enospc_debug,space_cache=v2,commit=15

$ btrfs fi df /backup:

Data, single: total=28.05TiB, used=26.37TiB
System, single: total=32.00MiB, used=3.53MiB
Metadata, single: total=255.00GiB, used=250.73GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

$ btrfs fi show /backup:

Label: none  uuid: a52501a9-651c-4712-a76b-7b4238cfff63
Total devices 2 FS bytes used 26.62TiB
devid    1 size 416.62GiB used 255.03GiB path /dev/sdb
devid    2 size  36.38TiB used  28.05TiB path /dev/sdc

$ btrfs fi usage /backup:

Overall:
    Device size:          36.79TiB
    Device allocated:     28.30TiB
    Device unallocated:    8.49TiB
    Device missing:          0.00B
    Used:                 26.62TiB
    Free (estimated):     10.17TiB  (min: 10.17TiB)
    Data ratio:               1.00
    Metadata ratio:           1.00
    Global reserve:      512.00MiB  (used: 0.00B)

Data,single: Size:28.05TiB, Used:26.37TiB
   /dev/sdc   28.05TiB

Metadata,single: Size:255.00GiB, Used:250.73GiB
   /dev/sdb  255.00GiB

System,single: Size:32.00MiB, Used:3.53MiB
   /dev/sdb   32.00MiB

Unallocated:
   /dev/sdb  161.59GiB
   /dev/sdc8.33TiB

The btrfs filesystem uses two logical drives in single mode, backed by a
hardware RAID controller (PERC H710); /dev/sdb is a HW RAID1 of two SATA
SSDs and /dev/sdc is a HW RAID6 SATA volume.

Please note that we have a simple custom patch in btrfs which ensures
that metadata chunks are allocated preferably on the SSD volume and data
chunks are allocated only on the SATA volume. The patch slightly modifies
__btrfs_alloc_chunk() so that its loop over devices ignores rotating
devices when a metadata chunk is requested, and vice versa. However,
I'm quite sure this patch doesn't cause the reported bug: we log every
call of the modified code and there are no __btrfs_alloc_chunk() calls
when ENOSPC is triggered. Moreover, we observed the same bug before we
developed the patch. (IIRC, Chris Mason mentioned that they work on a
similar feature at Facebook, but I've found no official patches yet.)

Dmesg dump:

[285167.750763] use_block_rsv: 62468 callbacks suppressed
[285167.750764] BTRFS: block rsv returned -28
[285167.750789] [ cut here ]
[285167.750822] WARNING: CPU: 5 PID: 443 at fs/btrfs/extent-tree.c:8463 
btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
[285167.750823] Modules linked in: binfmt_misc xt_comment xt_tcpudp 
iptable_filter nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack iptable_raw 
ip6table_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat 
nf_conntr
[285167.750853]  zstd_compress xxhash raid6_pq sd_mod crc32c_intel psmouse 
uhci_hcd ehci_pci ehci_hcd megaraid_sas usbcore scsi_mod bnx2
[285167.750861] CPU: 5 PID: 443 Comm: btrfs-transacti Tainted: GW I 
4.14.20-znr1+ #69
[285167.750862] Hardware name: Dell Inc. PowerEdge R510/0DPRKF, BIOS 1.6.3 
02/01/2011
[285167.750863] task: 9c4a1740e280 task.stack: ba48c1ecc000
[285167.750878] RIP: 0010:btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
[285167.750879] RSP: 0018:ba48c1ecf958 EFLAGS: 00010282

Re: Btrfs occupies more space than du reports...

2018-02-23 Thread Austin S. Hemmelgarn

On 2018-02-23 06:21, Shyam Prasad N wrote:

Hi,

Can someone explain to me why there is a difference in the number of
blocks reported by the df and du commands below?

=
# df -h /dc
Filesystem  Size  Used Avail Use% Mounted on
/dev/drbd1  746G  519G  225G  70% /dc

# btrfs filesystem df -h /dc/
Data, single: total=518.01GiB, used=516.58GiB
System, DUP: total=8.00MiB, used=80.00KiB
Metadata, DUP: total=2.00GiB, used=1019.72MiB
GlobalReserve, single: total=352.00MiB, used=0.00B

# du -sh /dc
467G/dc
=

df shows 519G used, while a recursive check using du shows only 467G.
The filesystem doesn't contain any snapshots or extra subvolumes,
nor any filesystems mounted under /dc.
I also considered that it could be space held by an open FD of some
process, so I rebooted the system. Still no change.

The situation is even worse on a few other systems with similar configuration.



At least part of this is a difference in how each tool computes space usage.

* `df` calls `statvfs` to get its data, which tries to count physical 
allocation accounting for replication profiles.  In other words, data in 
chunks with the dup, raid1, and raid10 profiles gets counted twice, data 
in raid5 and raid6 chunks gets counted with a bit of extra space for the 
parity, etc.


* `btrfs fi df` looks directly at the filesystem itself and counts how 
much space is available to each chunk type in the `total` values and how 
much space is used in each chunk type in the `used` values, after 
replication.  If you add together the data used value and twice the 
system and metadata used values, you get the used value reported by 
regular `df` (well, close to it that is, `df` rounds at a lower 
precision than `btrfs fi df` does).


* `du` scans the directory tree and looks at the file allocation values 
returned from `stat` calls (or just at file sizes if you pass the 
`--apparent-size` flag).  While it, like `btrfs fi df`, reports values 
after replication, it has a couple of nasty caveats on BTRFS: it 
reports sizes for natively compressed files _before_ compression, and 
counts reflinked blocks once for each link.


Now, this doesn't explain the entirety of the discrepancy with `du`, but 
it should cover the whole difference between `df` and `btrfs fi df`.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 10/10] btrfs: qgroup: Use independent and accurate per inode qgroup rsv

2018-02-23 Thread Qu Wenruo
[snip]
>>
>> We don't need to do such check at call site.
>>
>> Just do the calculation (which should be really simple, as simple as
>> nodesize * nr_items_to_add), and pass it to btrfs_qgroup_reserve_* APIs,
>> which would handle the quota enabled check.
>>
>>>
>>> be contained into the block rsv apis. This will ensure lifetime of
>>> blockrsv/qgroups is always consistent. I think this only applies to
>>> qgroup metadata reservations.
>>
>> Yep, and only applies to PREALLOC type metadata reservation.
>> For per-transaction rsv, it's handled by PERTRANS type.
> 
> And what about the btrfs_qgroup_reserve_meta()-type reservations, this
> function doesn't take a transaction, what kind of reservation is that o_O

We only have two functions to reserve metadata:
1) btrfs_qgroup_reserve_meta_pertrans
2) btrfs_qgroup_reserve_meta_prealloc

There is no 3rd option, and each function name explains itself.

For the pertrans rsv, we don't really need the @trans parameter in fact;
we only need to tell qgroup to reserve some meta space of the pertrans type.

And there is only one caller for pertrans, the one in transaction.c.

> 
> To sum it up we have  PERTRANS, PREALLOC and  btrfs_qgroup_reserve_meta
> and those are 3 distinct type of reservations, correct?

Only two, in fact.

Thanks,
Qu

>>
>> Thanks,
>> Qu
>>
>>>
>>>
>>>
>>>

 Thanks,
 Qu

>
> -Jeff
>
>> Signed-off-by: Qu Wenruo 
>> ---
>>  fs/btrfs/ctree.h   | 18 +
>>  fs/btrfs/extent-tree.c | 55 
>> --
>>  2 files changed, 62 insertions(+), 11 deletions(-)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 0c58f92c2d44..97783ba91e00 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -467,6 +467,24 @@ struct btrfs_block_rsv {
>>  unsigned short full;
>>  unsigned short type;
>>  unsigned short failfast;
>> +
>> +/*
>> + * Qgroup equivalent for @size @reserved
>> + *
>> + * Unlike normal @size/@reserved for inode rsv,
>> + * qgroup doesn't care about things like csum size nor how many 
>> tree
>> + * blocks it will need to reserve.
>> + *
>> + * Qgroup cares more about *NET* change of extent usage.
>> + * So for ONE newly inserted file extent, in worst case it will 
>> cause
>> + * leaf split and level increase, nodesize for each file extent
>> + * is already way overkilled.
>> + *
>> + * In short, qgroup_size/reserved is the up limit of possible 
>> needed
>> + * qgroup metadata reservation.
>> + */
>> +u64 qgroup_rsv_size;
>> +u64 qgroup_rsv_reserved;
>>  };
>>  
>>  /*
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 986660f301de..9d579c7bcf7f 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -5565,14 +5565,18 @@ static void space_info_add_new_bytes(struct 
>> btrfs_fs_info *fs_info,
>>  
>>  static u64 block_rsv_release_bytes(struct btrfs_fs_info *fs_info,
>>  struct btrfs_block_rsv *block_rsv,
>> -struct btrfs_block_rsv *dest, u64 
>> num_bytes)
>> +struct btrfs_block_rsv *dest, u64 
>> num_bytes,
>> +u64 *qgroup_to_release_ret)
>>  {
>>  struct btrfs_space_info *space_info = block_rsv->space_info;
>> +u64 qgroup_to_release = 0;
>>  u64 ret;
>>  
>>  spin_lock(&block_rsv->lock);
>> -if (num_bytes == (u64)-1)
>> +if (num_bytes == (u64)-1) {
>>  num_bytes = block_rsv->size;
>> +qgroup_to_release = block_rsv->qgroup_rsv_size;
>> +}
>>  block_rsv->size -= num_bytes;
>>  if (block_rsv->reserved >= block_rsv->size) {
>>  num_bytes = block_rsv->reserved - block_rsv->size;
>> @@ -5581,6 +5585,12 @@ static u64 block_rsv_release_bytes(struct 
>> btrfs_fs_info *fs_info,
>>  } else {
>>  num_bytes = 0;
>>  }
>> +if (block_rsv->qgroup_rsv_reserved >= 
>> block_rsv->qgroup_rsv_size) {
>> +qgroup_to_release = block_rsv->qgroup_rsv_reserved -
>> +block_rsv->qgroup_rsv_size;
>> +block_rsv->qgroup_rsv_reserved = 
>> block_rsv->qgroup_rsv_size;
>> +} else
>> +qgroup_to_release = 0;
>>  spin_unlock(&block_rsv->lock);
>>  
>>  ret = num_bytes;
>> @@ -5603,6 +5613,8 @@ static u64 block_rsv_release_bytes(struct 
>> 

Btrfs occupies more space than du reports...

2018-02-23 Thread Shyam Prasad N
Hi,

Can someone explain to me why there is a difference in the number of
blocks reported by the df and du commands below?

=
# df -h /dc
Filesystem  Size  Used Avail Use% Mounted on
/dev/drbd1  746G  519G  225G  70% /dc

# btrfs filesystem df -h /dc/
Data, single: total=518.01GiB, used=516.58GiB
System, DUP: total=8.00MiB, used=80.00KiB
Metadata, DUP: total=2.00GiB, used=1019.72MiB
GlobalReserve, single: total=352.00MiB, used=0.00B

# du -sh /dc
467G/dc
=

df shows 519G used, while a recursive check using du shows only 467G.
The filesystem doesn't contain any snapshots or extra subvolumes,
nor any filesystems mounted under /dc.
I also considered that it could be space held by an open FD of some
process, so I rebooted the system. Still no change.

The situation is even worse on a few other systems with similar configuration.

-- 
-Shyam


Re: [PATCH v2 10/10] btrfs: qgroup: Use independent and accurate per inode qgroup rsv

2018-02-23 Thread Nikolay Borisov


On 23.02.2018 11:06, Qu Wenruo wrote:
> 
> 
> On 2018年02月23日 16:14, Nikolay Borisov wrote:
>>
>>
>> On 23.02.2018 01:34, Qu Wenruo wrote:
>>>
>>>
>>> On 2018年02月23日 06:44, Jeff Mahoney wrote:
 On 12/22/17 1:18 AM, Qu Wenruo wrote:
> Unlike reservation calculation used in inode rsv for metadata, qgroup
> doesn't really need to care things like csum size or extent usage for
> whole tree COW.
>
> Qgroup care more about net change of extent usage.
> That's to say, if we're going to insert one file extent, it will mostly
> find its place in CoWed tree block, leaving no change in extent usage.
> Or cause leaf split, result one new net extent, increasing qgroup number
> by nodesize.
> (Or even more rare case, increase the tree level, increasing qgroup
> number by 2 * nodesize)
>
> So here instead of using the way overkilled calculation for extent
> allocator, which cares more about accurate and no error, qgroup doesn't
> need that over-calculated reservation.
>
> This patch will maintain 2 new members in btrfs_block_rsv structure for
> qgroup, using much smaller calculation for qgroup rsv, reducing false
> EDQUOT.


 I think this is the right idea but, rather than being the last step in a
 qgroup rework, it should be the first.
>>>
>>> That's right.
>>>
>>> Although putting it as 1st patch needs extra work to co-operate with
>>> later type seperation.
>>>
  Don't qgroup reservation
 lifetimes match the block reservation lifetimes?
>>>
>>> Also correct, but...
>>>
  We wouldn't want a
 global reserve and we wouldn't track usage on a per-block basis, but
 they should otherwise match.  We already have clear APIs surrounding the
 use of block reservations, so it seems to me that it'd make sense to
 extend those to cover the qgroup cases as well.  Otherwise, the rest of
 your patchset makes a parallel reservation system with a different API.
 That keeps the current state of qgroups as a bolt-on that always needs
 to be tracked separately (and which developers need to ensure they get
 right).
>>>
>>> The problem is, block reservation is designed to ensure every CoW could
>>> be fulfilled without error.
>>>
>>> That's to say, for case like CoW write with csum, we need to care about
>>> space reservation not only for EXTENT_DATA in fs tree, but also later
>>> EXTENT_ITEM for extent tree, and CSUM_ITEM for csum tree.
>>>
>>> However extent tree and csum tree doesn't contribute to quota at all.
>>> If we follow such over-reservation, it would be much much easier to hit
>>> false EDQUOTA early.
>>>
>>> That's the main reason why a separate (and a little simplified) block
>>> rsv tracking system.
>>>
>>> And if there is better solution, I'm all ears.
>>
>> So looking at the code the main hurdles seems to be the fact that the
>> btrfs_block_rsv_* API's take a raw byte count, which is derived from
>> btrfs_calc_trans_metadata_size, which in turn is passed the number of
>> items we want.
>>
>> So what if we extend the block_rsv_* apis to take an additional
>> 'quota_bytes' or somesuch argument which would represent the amount we
>> would like to be charged to the qgroups. This will force us to revisit
>> every call site of block_rsv API's and adjust the code accordingly so
>> that callers pass the correct number. This will lead to code such as:
>>
>> if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags)) { blah }
> 
> We don't need to do such check at call site.
> 
> Just do the calculation (which should be really simple, as simple as
> nodesize * nr_items_to_add), and pass it to btrfs_qgroup_reserve_* APIs,
> which would handle the quota enabled check.
> 
>>
>> be contained into the block rsv apis. This will ensure lifetime of
>> blockrsv/qgroups is always consistent. I think this only applies to
>> qgroup metadata reservations.
> 
> Yep, and only applies to PREALLOC type metadata reservation.
> For per-transaction rsv, it's handled by PERTRANS type.

And what about the btrfs_qgroup_reserve_meta()-type reservations, this
function doesn't take a transaction, what kind of reservation is that o_O

To sum it up we have  PERTRANS, PREALLOC and  btrfs_qgroup_reserve_meta
and those are 3 distinct type of reservations, correct?
> 
> Thanks,
> Qu
> 
>>
>>
>>
>>
>>>
>>> Thanks,
>>> Qu
>>>

 -Jeff

> Signed-off-by: Qu Wenruo 
> ---
>  fs/btrfs/ctree.h   | 18 +
>  fs/btrfs/extent-tree.c | 55 
> --
>  2 files changed, 62 insertions(+), 11 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 0c58f92c2d44..97783ba91e00 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -467,6 +467,24 @@ struct btrfs_block_rsv {
>   unsigned short full;
>   unsigned short type;
>   unsigned short failfast;
> +
> + /*
> +  * 

Re: Fwd: [Bug 1547319] 4.16.0-0.rc1.git4.1.fc28.x86_64 #1 Not tainted: possible recursive locking detected

2018-02-23 Thread Tomasz Kłoczko
On 22 February 2018 at 23:57, David Sterba  wrote:
[..]
>> https://bugzilla.redhat.com/show_bug.cgi?id=1547319
>>
>> Laura Abbott  changed:
>>
>>What|Removed |Added
>> 
>>  CC||labb...@redhat.com
>>
>> --- Comment #1 from Laura Abbott  ---
>> This is btrfs which is pretty far down on our priority list. This is best
>> reported to the upstream btrfs devs.
>
> Quick and wrong conclusions, it's not a btrfs bug.
>
> The fix:
>
> https://patchwork.kernel.org/patch/10207013/
>
> but I don't see it merged in master branch.

Thank you for the reply :)
I'll keep my ticket open until the fix is merged.

kloczek
-- 
Tomasz Kłoczko | LinkedIn: http://lnkd.in/FXPWxH


Re: [PATCH v2 10/10] btrfs: qgroup: Use independent and accurate per inode qgroup rsv

2018-02-23 Thread Qu Wenruo


On 2018年02月23日 16:14, Nikolay Borisov wrote:
> 
> 
> On 23.02.2018 01:34, Qu Wenruo wrote:
>>
>>
>> On 2018年02月23日 06:44, Jeff Mahoney wrote:
>>> On 12/22/17 1:18 AM, Qu Wenruo wrote:
 Unlike reservation calculation used in inode rsv for metadata, qgroup
 doesn't really need to care things like csum size or extent usage for
 whole tree COW.

 Qgroup care more about net change of extent usage.
 That's to say, if we're going to insert one file extent, it will mostly
 find its place in CoWed tree block, leaving no change in extent usage.
 Or cause leaf split, result one new net extent, increasing qgroup number
 by nodesize.
 (Or even more rare case, increase the tree level, increasing qgroup
 number by 2 * nodesize)

 So here instead of using the way overkilled calculation for extent
 allocator, which cares more about accurate and no error, qgroup doesn't
 need that over-calculated reservation.

 This patch will maintain 2 new members in btrfs_block_rsv structure for
 qgroup, using much smaller calculation for qgroup rsv, reducing false
 EDQUOT.
>>>
>>>
>>> I think this is the right idea but, rather than being the last step in a
>>> qgroup rework, it should be the first.
>>
>> That's right.
>>
>> Although putting it as 1st patch needs extra work to co-operate with
>> later type seperation.
>>
>>>  Don't qgroup reservation
>>> lifetimes match the block reservation lifetimes?
>>
>> Also correct, but...
>>
>>>  We wouldn't want a
>>> global reserve and we wouldn't track usage on a per-block basis, but
>>> they should otherwise match.  We already have clear APIs surrounding the
>>> use of block reservations, so it seems to me that it'd make sense to
>>> extend those to cover the qgroup cases as well.  Otherwise, the rest of
>>> your patchset makes a parallel reservation system with a different API.
>>> That keeps the current state of qgroups as a bolt-on that always needs
>>> to be tracked separately (and which developers need to ensure they get
>>> right).
>>
>> The problem is, block reservation is designed to ensure every CoW could
>> be fulfilled without error.
>>
>> That's to say, for case like CoW write with csum, we need to care about
>> space reservation not only for EXTENT_DATA in fs tree, but also later
>> EXTENT_ITEM for extent tree, and CSUM_ITEM for csum tree.
>>
>> However extent tree and csum tree doesn't contribute to quota at all.
>> If we follow such over-reservation, it would be much much easier to hit
>> false EDQUOTA early.
>>
>> That's the main reason why a separate (and a little simplified) block
>> rsv tracking system.
>>
>> And if there is better solution, I'm all ears.
> 
> So looking at the code the main hurdles seems to be the fact that the
> btrfs_block_rsv_* API's take a raw byte count, which is derived from
> btrfs_calc_trans_metadata_size, which in turn is passed the number of
> items we want.
> 
> So what if we extend the block_rsv_* apis to take an additional
> 'quota_bytes' or somesuch argument which would represent the amount we
> would like to be charged to the qgroups. This will force us to revisit
> every call site of block_rsv API's and adjust the code accordingly so
> that callers pass the correct number. This will lead to code such as:
> 
> if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags)) { blah }

We don't need to do such a check at the call site.

Just do the calculation (which should be really simple, as simple as
nodesize * nr_items_to_add) and pass it to the btrfs_qgroup_reserve_* APIs,
which handle the quota-enabled check themselves.

> 
> be contained into the block rsv apis. This will ensure lifetime of
> blockrsv/qgroups is always consistent. I think this only applies to
> qgroup metadata reservations.

Yep, and only applies to PREALLOC type metadata reservation.
For per-transaction rsv, it's handled by PERTRANS type.

Thanks,
Qu

> 
> 
> 
> 
>>
>> Thanks,
>> Qu
>>
>>>
>>> -Jeff
>>>
 Signed-off-by: Qu Wenruo 
 ---
  fs/btrfs/ctree.h   | 18 +
  fs/btrfs/extent-tree.c | 55 
 --
  2 files changed, 62 insertions(+), 11 deletions(-)

 diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
 index 0c58f92c2d44..97783ba91e00 100644
 --- a/fs/btrfs/ctree.h
 +++ b/fs/btrfs/ctree.h
 @@ -467,6 +467,24 @@ struct btrfs_block_rsv {
unsigned short full;
unsigned short type;
unsigned short failfast;
 +
 +  /*
 +   * Qgroup equivalent for @size @reserved
 +   *
 +   * Unlike normal @size/@reserved for inode rsv,
 +   * qgroup doesn't care about things like csum size nor how many tree
 +   * blocks it will need to reserve.
 +   *
 +   * Qgroup cares more about *NET* change of extent usage.
 +   * So for ONE newly inserted file extent, in worst case it will cause
 +   * leaf split and level increase, nodesize for each 

Re: [PATCH v2 06/27] libbtrfsutil: add btrfs_util_create_subvolume()

2018-02-23 Thread Misono, Tomohiro
On 2018/02/16 4:04, Omar Sandoval wrote:
> From: Omar Sandoval 

> +static enum btrfs_util_error openat_parent_and_name(int dirfd, const char 
> *path,
> + char *name, size_t name_len,
> + int *fd)
> +{
> + char *tmp_path, *slash, *dirname, *basename;
> + size_t len;
> +
> + /* Ignore trailing slashes. */
> + len = strlen(path);
> + while (len > 1 && path[len - 1] == '/')
> + len--;
> +
> + tmp_path = malloc(len + 1);
> + if (!tmp_path)
> + return BTRFS_UTIL_ERROR_NO_MEMORY;
> + memcpy(tmp_path, path, len);
> + tmp_path[len] = '\0';
> +
> + slash = memrchr(tmp_path, '/', len);
> + if (slash == tmp_path) {
> + dirname = "/";
> + basename = tmp_path + 1;
> + } else if (slash) {
> + *slash = '\0';
> + dirname = tmp_path;
> + basename = slash + 1;
> + } else {
> + dirname = ".";
> + basename = tmp_path;
> + }
> +
> + len = strlen(basename);
> + if (len >= name_len) {
> + errno = ENAMETOOLONG;

tmp_path should also be freed here, otherwise it leaks on the ENAMETOOLONG path.

> + return BTRFS_UTIL_ERROR_INVALID_ARGUMENT;
> + }
> + memcpy(name, basename, len);
> + name[len] = '\0';
> +
> + *fd = openat(dirfd, dirname, O_RDONLY | O_DIRECTORY);
> + if (*fd == -1) {
> + free(tmp_path);
> + return BTRFS_UTIL_ERROR_OPEN_FAILED;
> + }
> +
> + free(tmp_path);
> + return BTRFS_UTIL_OK;
> +}




Re: [PATCH v2 10/10] btrfs: qgroup: Use independent and accurate per inode qgroup rsv

2018-02-23 Thread Nikolay Borisov


On 23.02.2018 01:34, Qu Wenruo wrote:
> 
> 
> On 2018年02月23日 06:44, Jeff Mahoney wrote:
>> On 12/22/17 1:18 AM, Qu Wenruo wrote:
>>> Unlike reservation calculation used in inode rsv for metadata, qgroup
>>> doesn't really need to care things like csum size or extent usage for
>>> whole tree COW.
>>>
>>> Qgroup care more about net change of extent usage.
>>> That's to say, if we're going to insert one file extent, it will mostly
>>> find its place in CoWed tree block, leaving no change in extent usage.
>>> Or cause leaf split, result one new net extent, increasing qgroup number
>>> by nodesize.
>>> (Or even more rare case, increase the tree level, increasing qgroup
>>> number by 2 * nodesize)
>>>
>>> So here instead of using the way overkilled calculation for extent
>>> allocator, which cares more about accurate and no error, qgroup doesn't
>>> need that over-calculated reservation.
>>>
>>> This patch will maintain 2 new members in btrfs_block_rsv structure for
>>> qgroup, using much smaller calculation for qgroup rsv, reducing false
>>> EDQUOT.
>>
>>
>> I think this is the right idea but, rather than being the last step in a
>> qgroup rework, it should be the first.
> 
> That's right.
> 
> Although putting it as 1st patch needs extra work to co-operate with
> later type seperation.
> 
>>  Don't qgroup reservation
>> lifetimes match the block reservation lifetimes?
> 
> Also correct, but...
> 
>>  We wouldn't want a
>> global reserve and we wouldn't track usage on a per-block basis, but
>> they should otherwise match.  We already have clear APIs surrounding the
>> use of block reservations, so it seems to me that it'd make sense to
>> extend those to cover the qgroup cases as well.  Otherwise, the rest of
>> your patchset makes a parallel reservation system with a different API.
>> That keeps the current state of qgroups as a bolt-on that always needs
>> to be tracked separately (and which developers need to ensure they get
>> right).
> 
> The problem is, block reservation is designed to ensure every CoW could
> be fulfilled without error.
> 
> That's to say, for case like CoW write with csum, we need to care about
> space reservation not only for EXTENT_DATA in fs tree, but also later
> EXTENT_ITEM for extent tree, and CSUM_ITEM for csum tree.
> 
> However extent tree and csum tree doesn't contribute to quota at all.
> If we follow such over-reservation, it would be much much easier to hit
> false EDQUOTA early.
> 
> That's the main reason why a separate (and a little simplified) block
> rsv tracking system.
> 
> And if there is better solution, I'm all ears.

So looking at the code, the main hurdle seems to be the fact that the
btrfs_block_rsv_* APIs take a raw byte count, which is derived from
btrfs_calc_trans_metadata_size, which in turn is passed the number of
items we want.

So what if we extend the block_rsv_* apis to take an additional
'quota_bytes' or somesuch argument which would represent the amount we
would like to be charged to the qgroups. This will force us to revisit
every call site of block_rsv API's and adjust the code accordingly so
that callers pass the correct number. This will lead to code such as:

if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags)) { blah }

be contained into the block rsv apis. This will ensure lifetime of
blockrsv/qgroups is always consistent. I think this only applies to
qgroup metadata reservations.




> 
> Thanks,
> Qu
> 
>>
>> -Jeff
>>
>>> Signed-off-by: Qu Wenruo 
>>> ---
>>>  fs/btrfs/ctree.h   | 18 +
>>>  fs/btrfs/extent-tree.c | 55 
>>> --
>>>  2 files changed, 62 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>> index 0c58f92c2d44..97783ba91e00 100644
>>> --- a/fs/btrfs/ctree.h
>>> +++ b/fs/btrfs/ctree.h
>>> @@ -467,6 +467,24 @@ struct btrfs_block_rsv {
>>> unsigned short full;
>>> unsigned short type;
>>> unsigned short failfast;
>>> +
>>> +   /*
>>> +* Qgroup equivalent for @size @reserved
>>> +*
>>> +* Unlike normal @size/@reserved for inode rsv,
>>> +* qgroup doesn't care about things like csum size nor how many tree
>>> +* blocks it will need to reserve.
>>> +*
>>> +* Qgroup cares more about *NET* change of extent usage.
>>> +* So for ONE newly inserted file extent, in worst case it will cause
>>> +* leaf split and level increase, nodesize for each file extent
>>> +* is already way overkilled.
>>> +*
>>> +* In short, qgroup_size/reserved is the up limit of possible needed
>>> +* qgroup metadata reservation.
>>> +*/
>>> +   u64 qgroup_rsv_size;
>>> +   u64 qgroup_rsv_reserved;
>>>  };
>>>  
>>>  /*
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index 986660f301de..9d579c7bcf7f 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -5565,14 +5565,18 @@ static void space_info_add_new_bytes(struct 
>>>