Re: [RFC PATCH] btrfs: Speedup btrfs_read_block_groups()
On 2018年02月24日 00:29, Ellis H. Wilson III wrote:
> On 02/22/2018 06:37 PM, Qu Wenruo wrote:
>> On 2018年02月23日 00:31, Ellis H. Wilson III wrote:
>>> On 02/21/2018 11:56 PM, Qu Wenruo wrote:
On 2018年02月22日 12:52, Qu Wenruo wrote:
> btrfs_read_block_groups() is used to build up the block group cache for
> all block groups, so it will iterate all block group items in the extent
> tree.
>
> For a large filesystem (TB level), it will search for BLOCK_GROUP_ITEM
> thousands of times, which is the most time consuming part of mounting
> btrfs.
>
> So this patch will try to speed it up by:
>
> 1) Avoid unnecessary readahead
>    We were using READA_FORWARD to search for block group items.
>    However block group items are in fact scattered across quite a lot of
>    leaves. Doing readahead will just waste our IO (especially important
>    for HDD).
>
>    In a real world case, for a filesystem with 3T used space, it would
>    have about 50K extent tree leaves, but only have 3K block group
>    items. Meaning we need to iterate 16 leaves to meet one block group
>    on average.
>
>    So readahead won't help but waste slow HDD seeks.
>
> 2) Use chunk mapping to locate block group items
>    Since one block group item always has one corresponding chunk item,
>    we could use chunk mapping to get the block group item size.
>
>    With the block group item size, we can do a pinpoint tree search,
>    instead of searching with some uncertain value and doing a forward
>    search.
>
>    In some cases, like when the next BLOCK_GROUP_ITEM is in the next
>    leaf of the current path, we could save such unnecessary tree block
>    reads.
>
> Cc: Ellis H. Wilson III

Hi Ellis,

Would you please try this patch to see if it helps to speed up the mount
of your large filesystem?

>>>
>>> I will try either tomorrow or over the weekend. I'm waiting on hardware
>>> to be able to build and load a custom kernel on.
>>
>> If you're using Archlinux, I could build the package for you.
>>
>> (For other distributions, unfortunately I'm not that familiar with)
>>
>> Thanks,
>> Qu
>
> No sweat. I'm not running arch anywhere, so was glad to handle this
> myself.
>
> Short story: It doesn't appear to have any notable impact on mount time.
>
> Long story:
> #Built a modern kernel:
> git clone https://github.com/kdave/btrfs-devel
> cd'd into btrfs-devel
> copied my current kernel config in /boot to .config
> make olddefconfig
> make -j16
> make modules_install
> make install
> grub2-mkconfig -o /boot/grub/grub.cfg
> reboot
>
> #Reran tests with vanilla 4.16.0-rc1+ kernel
> As root, of the form: time mount /dev/sdb /mnt/btrfs
> 5 iteration average: 16.869s
>
> #Applied your patch, rebuilt, switched kernel module
> wget -O - 'https://patchwork.kernel.org/patch/10234619/mbox' | git am -
> make -j16
> make modules_install
> rmmod btrfs
> modprobe btrfs
>
> #Reran tests with patched 4.16.0-rc1+ kernel
> As root, of the form: time mount /dev/sdb /mnt/btrfs
> 5 iteration average: 16.642s
>
> So, there's a slight improvement against vanilla 4.16.0-rc1+, but it's
> still slightly slower than my original runs in 4.5.5, which got me
> 16.553s. In any event, most of this is statistically insignificant
> since the standard deviation is about two tenths of a second.

Yep, I also saw similar reports from other people when the first version
was sent.

Apart from the readahead change, the patch can only reduce disk reads
where block group items are located at the 1st slot of a leaf. If all
block group items are located from the 2nd slot of leaves, then it
shouldn't have much effect.

And I think your fs is already in such a state, so it doesn't see much
speedup.

BTW, you could also verify this with btrfs-debug-tree.
To get the total number of block group items:
# btrfs-debug-tree -t extent | grep BLOCK_GROUP_ITEM | grep item | wc -l

To get the block group items located at the 1st slot:
# btrfs-debug-tree -t extent | grep BLOCK_GROUP_ITEM | grep "item 0" | wc -l

And the ratio should imply how effectively the patch will speed things up.

>
> So, my conclusion here is this problem needs to be handled at an
> architectural level to be truly solved (read: have mounts that take a
> few seconds at worst), which either requires:
> a) On-disk format changes like you (Qu) suggested some time back for a
> tree of block groups, or
> b) Lazy block group walking post-mount and algorithms that can cope with
> making sub-optimal choices. One would likely want to stonewall out
> certain operations until the lazy post-mount walk completed, like
> balance, defrag, etc., that have more reason to require complete
> knowledge of the usage of each block group.
>
> I may take a stab at b), but I'm first going to do the tests I promised
> relating to how mount times scale with increased capacity.
Please update the BTRFS status page
The latest released kernel is 4.15 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: Add nossd_spread mount option
On 02/21/2018 05:20 PM, Hans van Kranenburg wrote:
> On 02/22/2018 12:31 AM, Howard McLauchlan wrote:
>> Btrfs has two mount options for SSD optimizations: ssd and ssd_spread.
>> Presently there is an option to disable all SSD optimizations, but there
>> isn't an option to disable just ssd_spread.
>>
>> This patch adds a mount option nossd_spread that disables ssd_spread
>> only.
>>
>> Signed-off-by: Howard McLauchlan
>> ---
>>  fs/btrfs/super.c | 13 +
>>  1 file changed, 9 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
>> index 6e71a2a78363..4c0fcf5b3e7e 100644
>> --- a/fs/btrfs/super.c
>> +++ b/fs/btrfs/super.c
>> @@ -310,10 +310,10 @@ static void btrfs_put_super(struct super_block *sb)
>>  enum {
>>  	Opt_degraded, Opt_subvol, Opt_subvolid, Opt_device, Opt_nodatasum,
>>  	Opt_nodatacow, Opt_max_inline, Opt_alloc_start, Opt_nobarrier, Opt_ssd,
>> -	Opt_nossd, Opt_ssd_spread, Opt_thread_pool, Opt_noacl, Opt_compress,
>> -	Opt_compress_type, Opt_compress_force, Opt_compress_force_type,
>> -	Opt_notreelog, Opt_ratio, Opt_flushoncommit, Opt_discard,
>> -	Opt_space_cache, Opt_space_cache_version, Opt_clear_cache,
>> +	Opt_nossd, Opt_ssd_spread, Opt_nossd_spread, Opt_thread_pool, Opt_noacl,
>> +	Opt_compress, Opt_compress_type, Opt_compress_force,
>> +	Opt_compress_force_type, Opt_notreelog, Opt_ratio, Opt_flushoncommit,
>> +	Opt_discard, Opt_space_cache, Opt_space_cache_version, Opt_clear_cache,
>>  	Opt_user_subvol_rm_allowed, Opt_enospc_debug, Opt_subvolrootid,
>>  	Opt_defrag, Opt_inode_cache, Opt_no_space_cache, Opt_recovery,
>>  	Opt_skip_balance, Opt_check_integrity,
>> @@ -353,6 +353,7 @@ static const match_table_t tokens = {
>>  	{Opt_ssd, "ssd"},
>>  	{Opt_ssd_spread, "ssd_spread"},
>>  	{Opt_nossd, "nossd"},
>> +	{Opt_nossd_spread, "nossd_spread"},
>>  	{Opt_acl, "acl"},
>>  	{Opt_noacl, "noacl"},
>>  	{Opt_notreelog, "notreelog"},
>
> .oO(Why doesn't the enum just have one option per line, so the changelog
> is less invasive?)
>
>> @@ -582,6 +583,10 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
>>  		btrfs_clear_and_info(info, SSD_SPREAD,
>>  				     "not using spread ssd allocation scheme");
>>  		break;
>> +	case Opt_nossd_spread:
>> +		btrfs_clear_and_info(info, SSD_SPREAD,
>> +				     "not using spread ssd allocation scheme");
>> +		break;
>>  	case Opt_barrier:
>>  		btrfs_clear_and_info(info, NOBARRIER,
>>  				     "turning on barriers");
>>
>
> Related:
> * https://www.spinics.net/lists/linux-btrfs/msg64247.html
> * https://www.spinics.net/lists/linux-btrfs/msg64277.html
> * https://www.spinics.net/lists/linux-btrfs/msg64499.html
>
> Apparently that discussion never resulted in actual changes, so thanks
> for continuing it now.
>
> The mount options are a bit weird, because ssd_spread also includes ssd,
> but doesn't show it.
>
> I personally don't like them at all, and I should really finish and send
> my proposal to get them replaced by options that can choose the extent
> allocator for data and metadata individually (instead of one setting
> that changes them both at the same time), because there are real life
> situations that e.g. benefit from the nossd style 'tetris' data
> allocator combined with the ssd_spread style 'contiguous' metadata
> extent allocator.
> I agree; this change is just something very specific that came up in prod for us, so I decided to push a patch for it.
Re: [PATCH 2/2] btrfs: drop optimal argument from find_live_mirror()
On Tue, Jan 30, 2018 at 02:28:31PM +0800, Anand Jain wrote:
> Drop optimal argument from the function find_live_mirror()
> as we can deduce it in the function itself.

Yeah, the argument is not necessary. It's another misleading variable name, as it's not 'optimal' but a fallback. Please rename it.
Re: [PATCH 1/2] btrfs: drop num argument from find_live_mirror()
On Tue, Jan 30, 2018 at 02:28:30PM +0800, Anand Jain wrote:
> Obtain the stripes info from the map directly and so no need
> to pass it as an argument.
>
> Signed-off-by: Anand Jain
> ---
>  fs/btrfs/volumes.c | 14 +++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index f7147740b68e..9c9d987838c2 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -5252,13 +5252,22 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
>  }
>
>  static int find_live_mirror(struct btrfs_fs_info *fs_info,
> -			    struct map_lookup *map, int first, int num,
> +			    struct map_lookup *map, int first,
>  			    int optimal, int dev_replace_is_ongoing)
>  {
>  	int i;
> +	int num;

Can you please rename it to something more descriptive? Eg. num_stripes.

>  	int tolerance;
>  	struct btrfs_device *srcdev;
>
> +	ASSERT((map->type &
> +		(BTRFS_BLOCK_GROUP_RAID1 | BTRFS_BLOCK_GROUP_RAID10)));
> +
> +	if (map->type & BTRFS_BLOCK_GROUP_RAID10)
> +		num = map->sub_stripes;
> +	else
> +		num = map->num_stripes;
> +
>  	if (dev_replace_is_ongoing &&
>  	    fs_info->dev_replace.cont_reading_from_srcdev_mode ==
>  	    BTRFS_DEV_REPLACE_ITEM_CONT_READING_FROM_SRCDEV_MODE_AVOID)
> @@ -5812,7 +5821,6 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
>  		stripe_index = mirror_num - 1;
>  	else {
>  		stripe_index = find_live_mirror(fs_info, map, 0,
> -				    map->num_stripes,
>  				    current->pid % map->num_stripes,
>  				    dev_replace_is_ongoing);
>  		mirror_num = stripe_index + 1;
> @@ -5841,7 +5849,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
>  			int old_stripe_index = stripe_index;
>  			stripe_index = find_live_mirror(fs_info, map,
>  					      stripe_index,
> -					      map->sub_stripes, stripe_index +
> +					      stripe_index +
>  					      current->pid % map->sub_stripes,
>  					      dev_replace_is_ongoing);
>  			mirror_num = stripe_index - old_stripe_index + 1;
> --
> 2.7.0
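The deduction the patch performs (RAID10 exposes map->sub_stripes mirror candidates per stripe, RAID1 exposes map->num_stripes) can be sketched in user space. The struct and flag values below are simplified stand-ins for illustration, not the real btrfs definitions:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative flag values; the real ones live in the btrfs headers. */
#define BLOCK_GROUP_RAID1  (1ULL << 4)
#define BLOCK_GROUP_RAID10 (1ULL << 6)

struct map_lookup_sketch {
	uint64_t type;     /* RAID profile flags of the chunk */
	int num_stripes;   /* total stripes in the chunk */
	int sub_stripes;   /* mirrored copies within one RAID10 stripe */
};

/*
 * Number of mirror candidates to probe: for RAID10 only sub_stripes
 * copies of a given stripe exist; for RAID1 every stripe is a full copy.
 */
static int live_mirror_candidates(const struct map_lookup_sketch *map)
{
	assert(map->type & (BLOCK_GROUP_RAID1 | BLOCK_GROUP_RAID10));

	if (map->type & BLOCK_GROUP_RAID10)
		return map->sub_stripes;
	return map->num_stripes;
}
```

Since the chunk map already carries both counts, callers no longer need to pass one in, which is exactly why the argument could be dropped.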
Re: [PATCH] btrfs: Relax memory barrier in btrfs_tree_unlock
On Wed, Feb 14, 2018 at 02:37:26PM +0200, Nikolay Borisov wrote:
> When performing an unlock on an extent buffer we'd like to order the
> decrement of extent_buffer::blocking_writers with waking up any
> waiters. In such situations it's sufficient to use smp_mb__after_atomic
> rather than the heavy smp_mb. On architectures where atomic operations
> are fully ordered (such as x86 or s390) unconditionally executing
> a heavyweight smp_mb instruction causes a severe hit to performance
> while bringing no improvements in terms of correctness.

Have you measured this severe performance hit? There is an impact, but I doubt you'll ever notice it in the profiles given where the btrfs_tree_unlock appears.

> The better thing is to use the appropriate smp_mb__after_atomic routine
> which will do the correct thing (invoke a full smp_mb or in the case
> of ordered atomics insert a compiler barrier). Put another way,
> an RMW atomic op + smp_mb__after_atomic equals, in terms of
> semantics, a full smp_mb. This ensures that none of the problems
> described in the accompanying comment of waitqueue_active occur.
> No functional changes.

I tend to agree.
Re: fs was hung and sort of full
It is still hanging after all. Currently, it hangs on mount:

[154422.778624] mount D0 27894 27742 0x0084
[154422.778625] Call Trace:
[154422.779018] __schedule+0x28d/0x880
[154422.779494] schedule+0x36/0x80
[154422.779886] io_schedule+0x16/0x40
[154422.780288] do_read_cache_page+0x3da/0x5c0
[154422.780669] ? blkdev_writepages+0x10/0x10
[154422.781052] ? mntput+0x24/0x40
[154422.781457] ? page_cache_tree_insert+0xd0/0xd0
[154422.781849] read_cache_page_gfp+0x1b/0x20
[154422.782272] btrfs_scan_one_device+0x71/0x260 [btrfs]
[154422.782658] ? __free_pages+0x25/0x30
[154422.783052] ? free_pages.part.94+0x40/0x50
[154422.783521] ? free_pages+0x13/0x20
[154422.783940] btrfs_mount+0x2cd/0xfa0 [btrfs]
[154422.784353] ? find_next_bit+0xb/0x10
[154422.784736] ? pcpu_next_unpop+0x3c/0x50
[154422.785124] ? find_next_bit+0xb/0x10
[154422.785565] mount_fs+0x39/0x150
[154422.785949] ? __alloc_percpu+0x15/0x20
[154422.786349] vfs_kern_mount+0x67/0x130
[154422.786732] btrfs_mount+0x19d/0xfa0 [btrfs]
[154422.787120] ? find_next_bit+0xb/0x10
[154422.787586] ? pcpu_next_unpop+0x3c/0x50
[154422.787966] mount_fs+0x39/0x150
[154422.788360] ? __alloc_percpu+0x15/0x20
[154422.788732] vfs_kern_mount+0x67/0x130
[154422.789109] do_mount+0x1f5/0xca0
[154422.789525] SyS_mount+0x83/0xd0
[154422.789900] do_syscall_64+0x74/0x1b0
[154422.790313] entry_SYSCALL_64_after_hwframe+0x21/0x86

I do have a disk 100% busy (only one of the pair ...) ... so maybe it's "working"?

Running 4.15.4-1.el7.elrepo.x86_64

On Thu, Feb 22, 2018 at 3:52 PM, Chris Murphy wrote:
> If there's no hung task listed in dmesg you could try to do sysrq+t to
> find out what everything's up to, although then you have to learn how
> to parse the result.
>
> Chris Murphy
Re: [PATCH] btrfs: Remove redundant memory barriers
On Wed, Feb 14, 2018 at 10:53:36AM +0200, Nikolay Borisov wrote:
> Using any kind of memory barriers around atomic operations which have
> a return value is redundant, since those operations themselves are
> fully ordered. atomic_t.txt states:
>
>  - RMW operations that have a return value are fully ordered;
>
>  Fully ordered primitives are ordered against everything prior and
>  everything subsequent. Therefore a fully ordered primitive is like
>  having an smp_mb() before and an smp_mb() after the primitive.
>
> Given this let's replace the extra memory barriers with comments.
>
> Signed-off-by: Nikolay Borisov

Reviewed-by: David Sterba
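The rule quoted from atomic_t.txt has a direct C11 analogue that can be checked in user space: a value-returning read-modify-write such as atomic_fetch_sub() defaults to memory_order_seq_cst, so it already orders everything before and after it, and wrapping it in extra fences adds cost but no correctness. A sketch of the pattern, not the kernel code itself:

```c
#include <stdatomic.h>

/*
 * A value-returning RMW is seq_cst by default in C11: it behaves as if
 * a full barrier sat on either side of it, so no explicit fence is
 * needed around the operation (mirroring the kernel rule above).
 */
static int put_ref(atomic_int *refs)
{
	/* Returns 1 when this call dropped the last reference. */
	return atomic_fetch_sub(refs, 1) == 1;
}
```

Any waiter that observes the count at zero is therefore guaranteed to also observe every write made before the final put_ref() call.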
Re: [PATCH v2 14/27] libbtrfsutil: add btrfs_util_deleted_subvolumes()
On Fri, Feb 23, 2018 at 11:12:56AM +0900, Misono, Tomohiro wrote: > > On 2018/02/16 4:04, Omar Sandoval wrote: > > From: Omar Sandoval> > > > Signed-off-by: Omar Sandoval > > --- > > libbtrfsutil/btrfsutil.h| 21 +++ > > libbtrfsutil/python/btrfsutilpy.h | 3 + > > libbtrfsutil/python/module.c| 30 ++ > > libbtrfsutil/python/qgroup.c| 17 +- > > libbtrfsutil/python/subvolume.c | 30 ++ > > libbtrfsutil/python/tests/test_subvolume.py | 8 +++ > > libbtrfsutil/subvolume.c| 89 > > + > > 7 files changed, 183 insertions(+), 15 deletions(-) > > > > diff --git a/libbtrfsutil/btrfsutil.h b/libbtrfsutil/btrfsutil.h > > index 00c86174..677ab3c1 100644 > > --- a/libbtrfsutil/btrfsutil.h > > +++ b/libbtrfsutil/btrfsutil.h > > @@ -534,6 +534,27 @@ enum btrfs_util_error > > btrfs_util_subvolume_iterator_next_info(struct btrfs_util_ > > char **path_ret, > > struct > > btrfs_util_subvolume_info *subvol); > > > > +/** > > + * btrfs_util_deleted_subvolumes() - Get a list of subvolume which have > > been > > + * deleted but not yet cleaned up. > > + * @path: Path on a Btrfs filesystem. > > + * @ids: Returned array of subvolume IDs. > > + * @n: Returned number of IDs in the @ids array. > > + * > > + * This requires appropriate privilege (CAP_SYS_ADMIN). > > + * > > + * Return: %BTRFS_UTIL_OK on success, non-zero error code on failure. > > + */ > > +enum btrfs_util_error btrfs_util_deleted_subvolumes(const char *path, > > + uint64_t **ids, > > + size_t *n); > > + > > +/** > > + * btrfs_util_deleted_subvolumes_fd() - See > > btrfs_util_deleted_subvolumes(). > > + */ > > +enum btrfs_util_error btrfs_util_deleted_subvolumes_fd(int fd, uint64_t > > **ids, > > + size_t *n); > > + > > /** > > * btrfs_util_create_qgroup_inherit() - Create a qgroup inheritance > > specifier > > * for btrfs_util_create_subvolume() or btrfs_util_create_snapshot(). 
> > diff --git a/libbtrfsutil/python/btrfsutilpy.h > > b/libbtrfsutil/python/btrfsutilpy.h > > index b3ec047f..be5122e2 100644 > > --- a/libbtrfsutil/python/btrfsutilpy.h > > +++ b/libbtrfsutil/python/btrfsutilpy.h > > @@ -54,6 +54,8 @@ struct path_arg { > > int path_converter(PyObject *o, void *p); > > void path_cleanup(struct path_arg *path); > > > > +PyObject *list_from_uint64_array(const uint64_t *arr, size_t n); > > + > > void SetFromBtrfsUtilError(enum btrfs_util_error err); > > void SetFromBtrfsUtilErrorWithPath(enum btrfs_util_error err, > >struct path_arg *path); > > @@ -72,6 +74,7 @@ PyObject *set_default_subvolume(PyObject *self, PyObject > > *args, PyObject *kwds); > > PyObject *create_subvolume(PyObject *self, PyObject *args, PyObject *kwds); > > PyObject *create_snapshot(PyObject *self, PyObject *args, PyObject *kwds); > > PyObject *delete_subvolume(PyObject *self, PyObject *args, PyObject *kwds); > > +PyObject *deleted_subvolumes(PyObject *self, PyObject *args, PyObject > > *kwds); > > > > void add_module_constants(PyObject *m); > > > > diff --git a/libbtrfsutil/python/module.c b/libbtrfsutil/python/module.c > > index e995a1be..eaa062ac 100644 > > --- a/libbtrfsutil/python/module.c > > +++ b/libbtrfsutil/python/module.c > > @@ -125,6 +125,29 @@ err: > > return 0; > > } > > > > +PyObject *list_from_uint64_array(const uint64_t *arr, size_t n) > > +{ > > +PyObject *ret; > > +size_t i; > > + > > +ret = PyList_New(n); > > +if (!ret) > > + return NULL; > > + > > +for (i = 0; i < n; i++) { > > + PyObject *tmp; > > + > > + tmp = PyLong_FromUnsignedLongLong(arr[i]); > > + if (!tmp) { > > + Py_DECREF(ret); > > + return NULL; > > + } > > + PyList_SET_ITEM(ret, i, tmp); > > +} > > + > > +return ret; > > +} > > + > > void path_cleanup(struct path_arg *path) > > { > > Py_CLEAR(path->object); > > @@ -214,6 +237,13 @@ static PyMethodDef btrfsutil_methods[] = { > > "path -- string, bytes, or path-like object\n" > > "recursive -- if the given subvolume has child 
subvolumes, delete\n" > > "them instead of failing"}, > > + {"deleted_subvolumes", (PyCFunction)deleted_subvolumes, > > +METH_VARARGS | METH_KEYWORDS, > > +"deleted_subvolumes(path)\n\n" > > +"Get the list of subvolume IDs which have been deleted but not yet\n" > > +"cleaned up\n\n" > > +"Arguments:\n" > > +"path -- string, bytes, path-like object, or open file descriptor"}, > > {}, > > }; > > > > diff --git a/libbtrfsutil/python/qgroup.c b/libbtrfsutil/python/qgroup.c > > index
Re: [PATCH v2 23/27] btrfs-progs: use libbtrfsutil for subvol sync
On Fri, Feb 23, 2018 at 02:41:27PM -0800, Omar Sandoval wrote:
> > >  static int cmd_subvol_sync(int argc, char **argv)
> > >  {
> > >  	int fd = -1;
> > > -	int i;
> > >  	int ret = 1;
> > >  	DIR *dirstream = NULL;
> > > -	u64 *ids = NULL;
> > > -	int id_count;
> > > +	uint64_t *ids;
> >
> > "ids" should be initialized to prevent it from being used uninitialized
> > by free() in the error path (clang build warning).
>
> I'm disappointed that GCC didn't also warn here, but thanks, fixed.

It's not reported even with the make W=3 level, gcc version 7.3.0.
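The warning class is worth a standalone illustration: with a single cleanup label, every pointer that can reach it unassigned must start at NULL, because free(NULL) is defined to be a no-op. A minimal sketch with invented names, not the btrfs-progs code:

```c
#include <stdlib.h>
#include <stdint.h>

/*
 * 'ids' must be initialized to NULL; otherwise an early failure jumps
 * to 'out' and free() sees an indeterminate pointer (the clang warning).
 */
static int get_ids(int simulate_failure, size_t *n_ret)
{
	uint64_t *ids = NULL;	/* the fix: start as NULL */
	int ret = -1;

	if (simulate_failure)
		goto out;	/* free(NULL) below is a harmless no-op */

	ids = malloc(3 * sizeof(*ids));
	if (!ids)
		goto out;
	*n_ret = 3;
	ret = 0;
out:
	free(ids);
	return ret;
}
```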
Re: [PATCH] Fix NULL pointer exception in find_bio_stripe()
On Fri, Feb 16, 2018 at 07:51:38PM +, Dmitriy Gorokh wrote: > On detaching of a disk which is a part of a RAID6 filesystem, the following > kernel OOPS may happen: > > [63122.680461] BTRFS error (device sdo): bdev /dev/sdo errs: wr 0, rd 0, > flush 1, corrupt 0, gen 0 > [63122.719584] BTRFS warning (device sdo): lost page write due to IO error on > /dev/sdo > [63122.719587] BTRFS error (device sdo): bdev /dev/sdo errs: wr 1, rd 0, > flush 1, corrupt 0, gen 0 > [63122.803516] BTRFS warning (device sdo): lost page write due to IO error on > /dev/sdo > [63122.803519] BTRFS error (device sdo): bdev /dev/sdo errs: wr 2, rd 0, > flush 1, corrupt 0, gen 0 > [63122.863902] BTRFS critical (device sdo): fatal error on device /dev/sdo > [63122.935338] BUG: unable to handle kernel NULL pointer dereference at > 0080 > [63122.946554] IP: fail_bio_stripe+0x58/0xa0 [btrfs] > [63122.958185] PGD 9ecda067 P4D 9ecda067 PUD b2b37067 PMD 0 > [63122.971202] Oops: [#1] SMP > [63122.990786] Modules linked in: libcrc32c dlm configfs cpufreq_userspace > cpufreq_powersave cpufreq_conservative softdog nfsd auth_rpcgss nfs_acl nfs > lockd grace fscache sunrpc bonding ipmi_devintf ipmi_msghandler joydev > snd_intel8x0 snd_ac97_codec snd_pcm snd_timer snd psmouse evdev parport_pc > soundcore serio_raw battery pcspkr video ac97_bus ac parport ohci_pci > ohci_hcd i2c_piix4 button crc32c_generic crc32c_intel btrfs xor > zstd_decompress zstd_compress xxhash raid6_pq dm_mod dax raid1 md_mod > hid_generic usbhid hid xhci_pci xhci_hcd ehci_pci ehci_hcd usbcore sg sd_mod > sr_mod cdrom ata_generic ahci libahci ata_piix libata e1000 scsi_mod [last > unloaded: scst] > [63123.006760] CPU: 0 PID: 3979 Comm: kworker/u8:9 Tainted: G W > 4.14.2-16-scst34x+ #8 > [63123.007091] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS > VirtualBox 12/01/2006 > [63123.007402] Workqueue: btrfs-worker btrfs_worker_helper [btrfs] > [63123.007595] task: 880036ea4040 task.stack: c90006384000 > [63123.007796] RIP: 
0010:fail_bio_stripe+0x58/0xa0 [btrfs] > [63123.007968] RSP: 0018:c90006387ad8 EFLAGS: 00010287 > [63123.008140] RAX: 0002 RBX: 88004beaa0b8 RCX: > 8800b2bd5690 > [63123.008359] RDX: RSI: 88007bb43500 RDI: > 88004beaa000 > [63123.008621] RBP: c90006387ae8 R08: 9910 R09: > 8800b2bd5600 > [63123.008840] R10: 0004 R11: 0001 R12: > 88007bb43500 > [63123.009059] R13: fffb R14: 880036fc5180 R15: > 0004 > [63123.009278] FS: () GS:8800b700() > knlGS: > [63123.009564] CS: 0010 DS: ES: CR0: 80050033 > [63123.009748] CR2: 0080 CR3: b0866000 CR4: > 000406f0 > [63123.009969] Call Trace: > [63123.010085] raid_write_end_io+0x7e/0x80 [btrfs] > [63123.010251] bio_endio+0xa1/0x120 > [63123.010378] generic_make_request+0x218/0x270 > [63123.010921] submit_bio+0x66/0x130 > [63123.011073] finish_rmw+0x3fc/0x5b0 [btrfs] > [63123.011245] full_stripe_write+0x96/0xc0 [btrfs] > [63123.011428] raid56_parity_write+0x117/0x170 [btrfs] > [63123.011604] btrfs_map_bio+0x2ec/0x320 [btrfs] > [63123.011759] ? ___cache_free+0x1c5/0x300 > [63123.011909] __btrfs_submit_bio_done+0x26/0x50 [btrfs] > [63123.012087] run_one_async_done+0x9c/0xc0 [btrfs] > [63123.012257] normal_work_helper+0x19e/0x300 [btrfs] > [63123.012429] btrfs_worker_helper+0x12/0x20 [btrfs] > [63123.012656] process_one_work+0x14d/0x350 > [63123.012888] worker_thread+0x4d/0x3a0 > [63123.013026] ? _raw_spin_unlock_irqrestore+0x15/0x20 > [63123.013192] kthread+0x109/0x140 > [63123.013315] ? process_scheduled_works+0x40/0x40 > [63123.013472] ? 
kthread_stop+0x110/0x110
> [63123.013610] ret_from_fork+0x25/0x30
> [63123.013741] Code: 7e 43 31 c0 48 63 d0 48 8d 14 52 49 8d 4c d1 60 48 8b 51 08 49 39 d0 72 1f 4c 63 1b 4c 01 da 49 39 d0 73 14 48 8b 11 48 8b 52 68 <48> 8b 8a 80 00 00 00 48 39 4e 08 74 14 83 c0 01 44 39 d0 75 c4
> [63123.014469] RIP: fail_bio_stripe+0x58/0xa0 [btrfs] RSP: c90006387ad8
> [63123.014678] CR2: 0080
> [63123.016590] ---[ end trace a295ea7259c17880 ]---
>
> This is reproducible in a cycle, where a series of writes is followed by a
> SCSI device delete command. The test may take up to a few minutes.
>
> Fixes: commit 74d46992e0d9dee7f1f376de0d56d31614c8a17a ("block: replace
> bi_bdev with a gendisk pointer and partitions index")

Right, the commit introduced a dereference of stripe->dev->bdev so we have to check it first.

Please update the Fixes: tag, the word 'commit' should not be there and the sha1 length can be shortened to 12 digits. Also please add your Signed-off-by. Thanks.
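The guard David confirms is needed can be shown with a reduced model: once a disk is detached, the bdev pointer in the chain becomes NULL, so any stripe-versus-bio comparison must test the chain before dereferencing it. The structs here are invented stand-ins for the kernel ones:

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel structures involved. */
struct bdev_sketch { int id; };
struct device_sketch { struct bdev_sketch *bdev; };
struct stripe_sketch { struct device_sketch *dev; };

/*
 * A detached disk leaves stripe->dev->bdev NULL, so check the pointer
 * chain first instead of dereferencing it unconditionally (the crash
 * in fail_bio_stripe was exactly such an unchecked dereference).
 */
static int stripe_matches_bdev(const struct stripe_sketch *stripe,
			       const struct bdev_sketch *bdev)
{
	if (!stripe->dev || !stripe->dev->bdev)
		return 0;	/* a missing device can never match */
	return stripe->dev->bdev == bdev;
}
```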
Re: [PATCH v2 26/27] btrfs-progs: use libbtrfsutil for subvolume list
On Fri, Feb 23, 2018 at 11:26:35AM +0900, Misono, Tomohiro wrote:
>
> On 2018/02/16 4:05, Omar Sandoval wrote:
> > From: Omar Sandoval
> >
> > +static struct subvol_list *btrfs_list_deleted_subvols(int fd,
> > +						      struct btrfs_list_filter_set *filter_set)
> > +{
> > +	struct subvol_list *subvols = NULL;
> > +	uint64_t *ids = NULL;
> > +	size_t i, n;
> > +	enum btrfs_util_error err;
> > +	int ret = -1;
> > +
> > +	err = btrfs_util_deleted_subvolumes_fd(fd, &ids, &n);
> > +	if (err) {
> > +		error_btrfs_util(err);
> > +		return NULL;
> > +	}
> > +
> > +	subvols = malloc(sizeof(*subvols) + n * sizeof(subvols->subvols[0]));
> > +	if (!subvols) {
> > +		error("out of memory");
> > +		goto out;
> > +	}
> > +
> > +	subvols->num = 0;
> > +	for (i = 0; i < n; i++) {
> > +		struct listed_subvol *subvol = &subvols->subvols[subvols->num];
> > +
> > +		err = btrfs_util_subvolume_info_fd(fd, ids[i], &subvol->info);
> > +		if (err) {
>
> I think there is a small chance that the subvolume would be removed from
> the tree between btrfs_util_deleted_subvolumes_fd() and
> btrfs_util_subvolume_info_fd().
> So, an error of BTRFS_UTIL_ERROR_SUBVOLUME_NOT_FOUND should be ignored.

Thanks, since this patch isn't in the devel branch yet, I'll fold the fix in.
Re: [PATCH 1/1] btrfs: fix NPD when target device is missing
On Tue, Feb 20, 2018 at 10:46:25PM +0800, Anand Jain wrote:
> The replace target device can be missing in which case we don't
> allocate a missing btrfs_device when mounted with the -o degraded.
> So check the device before access.
>
> BUG: unable to handle kernel NULL pointer dereference at 00b0

Please don't use uncommon acronyms, NPD is quite confusing, null pointer deref should be fine.

> IP: btrfs_destroy_dev_replace_tgtdev+0x43/0xf0 [btrfs]
> Call Trace:
>  btrfs_dev_replace_cancel+0x15f/0x180 [btrfs]
>  btrfs_ioctl+0x2216/0x2590 [btrfs]
>  do_vfs_ioctl+0x625/0x650
>  SyS_ioctl+0x4e/0x80
>  do_syscall_64+0x5d/0x160
>  entry_SYSCALL64_slow_path+0x25/0x25

Do you have a reproducer for that?
Re: [PATCH v2 06/27] libbtrfsutil: add btrfs_util_create_subvolume()
On Fri, Feb 23, 2018 at 05:24:04PM +0900, Misono, Tomohiro wrote:
> On 2018/02/16 4:04, Omar Sandoval wrote:
> > From: Omar Sandoval
> >
> > +static enum btrfs_util_error openat_parent_and_name(int dirfd, const char *path,
> > +						    char *name, size_t name_len,
> > +						    int *fd)
> > +{
> > +	char *tmp_path, *slash, *dirname, *basename;
> > +	size_t len;
> > +
> > +	/* Ignore trailing slashes. */
> > +	len = strlen(path);
> > +	while (len > 1 && path[len - 1] == '/')
> > +		len--;
> > +
> > +	tmp_path = malloc(len + 1);
> > +	if (!tmp_path)
> > +		return BTRFS_UTIL_ERROR_NO_MEMORY;
> > +	memcpy(tmp_path, path, len);
> > +	tmp_path[len] = '\0';
> > +
> > +	slash = memrchr(tmp_path, '/', len);
> > +	if (slash == tmp_path) {
> > +		dirname = "/";
> > +		basename = tmp_path + 1;
> > +	} else if (slash) {
> > +		*slash = '\0';
> > +		dirname = tmp_path;
> > +		basename = slash + 1;
> > +	} else {
> > +		dirname = ".";
> > +		basename = tmp_path;
> > +	}
> > +
> > +	len = strlen(basename);
> > +	if (len >= name_len) {
> > +		errno = ENAMETOOLONG;
>
> tmp_path should also be freed here.

Another good catch, thanks.
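The leak is easy to demonstrate in a standalone analogue of the function above. This sketch simplifies the signature (no dirfd/openat, a plain int return instead of the libbtrfsutil error enum) but keeps the same split logic and adds the free() Misono asked for on the too-long-name path:

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Split 'path' into dirname/basename, ignoring trailing slashes.
 * On success, *buf_ret owns the backing storage (caller frees) and
 * *dirname_ret/*basename_ret point into it or at static strings.
 * Returns -1 with errno set on failure.
 */
static int split_parent_and_name(const char *path, char **buf_ret,
				 const char **dirname_ret,
				 const char **basename_ret, size_t name_max)
{
	char *tmp, *slash;
	size_t len = strlen(path);

	while (len > 1 && path[len - 1] == '/')	/* ignore trailing slashes */
		len--;

	tmp = malloc(len + 1);
	if (!tmp)
		return -1;
	memcpy(tmp, path, len);
	tmp[len] = '\0';

	slash = strrchr(tmp, '/');
	if (slash == tmp) {		/* "/name": parent is the root */
		*dirname_ret = "/";
		*basename_ret = tmp + 1;
	} else if (slash) {		/* "dir/name" */
		*slash = '\0';
		*dirname_ret = tmp;
		*basename_ret = slash + 1;
	} else {			/* bare "name": parent is "." */
		*dirname_ret = ".";
		*basename_ret = tmp;
	}

	if (strlen(*basename_ret) >= name_max) {
		free(tmp);		/* the fix: don't leak on this path */
		errno = ENAMETOOLONG;
		return -1;
	}
	*buf_ret = tmp;
	return 0;
}
```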
Re: [PATCH 1/1] btrfs: remove assert in btrfs_init_dev_replace_tgtdev()
On Tue, Feb 20, 2018 at 10:50:36PM +0800, Anand Jain wrote:
> In the same function we just ran btrfs_alloc_device() which means the
> btrfs_device::resized_list is sure to be empty and we are protected
> with the btrfs_fs_info::volume_mutex.
>
> Signed-off-by: Anand Jain

Reviewed-by: David Sterba
Re: [PATCH v2 11/27] libbtrfsutil: add subvolume iterator helpers
On Fri, Feb 23, 2018 at 04:40:45PM +0900, Misono, Tomohiro wrote:
> On 2018/02/16 4:04, Omar Sandoval wrote:
> > From: Omar Sandoval
> >
> > +PUBLIC enum btrfs_util_error btrfs_util_create_subvolume_iterator(const char *path,
> > +								  uint64_t top,
> > +								  int flags,
> > +								  struct btrfs_util_subvolume_iterator **ret)
> > +{
> > +	enum btrfs_util_error err;
> > +	int fd;
> > +
> > +	fd = open(path, O_RDONLY);
> > +	if (fd == -1)
> > +		return BTRFS_UTIL_ERROR_OPEN_FAILED;
> > +
> > +	err = btrfs_util_create_subvolume_iterator_fd(fd, top, flags, ret);
> > +	if (err == BTRFS_UTIL_OK)
> > +		(*ret)->flags |= BTRFS_UTIL_SUBVOLUME_ITERATOR_CLOSE_FD;
>
> If btrfs_util_create_subvolume_iterator_fd() returns error, 'fd' remains open.
> So, fd should be closed here.

Good catch, fixed.
Re: [PATCH v2 07/27] libbtrfsutil: add btrfs_util_subvolume_path()
On Fri, Feb 23, 2018 at 03:27:52PM +0900, Misono, Tomohiro wrote:
> On 2018/02/16 4:04, Omar Sandoval wrote:
> > From: Omar Sandoval
> >
> > +PUBLIC enum btrfs_util_error btrfs_util_subvolume_path_fd(int fd, uint64_t id,
> > +							  char **path_ret)
> > +{
> > +	char *path, *p;
> > +	size_t capacity = 4096;
> > +
> > +	path = malloc(capacity);
> > +	if (!path)
> > +		return BTRFS_UTIL_ERROR_NO_MEMORY;
> > +	p = path + capacity - 1;
> > +	p[0] = '\0';
> > +
> > +	if (id == 0) {
> > +		enum btrfs_util_error err;
> > +
> > +		err = btrfs_util_is_subvolume_fd(fd);
> > +		if (err)
> > +			return err;
>
> 'path' should be freed here and below.
>
> > +
> > +		err = btrfs_util_subvolume_id_fd(fd, &id);
> > +		if (err)
> > +			return err;
> > +	}

Indeed, although I'll just change it to allocate path after this instead.
Re: [PATCH v2 16/27] btrfs-progs: use libbtrfsutil for read-only property
On Thu, Feb 22, 2018 at 01:23:45PM +0900, Misono, Tomohiro wrote: > > > On 2018/02/16 4:05, Omar Sandoval wrote: > > From: Omar Sandoval> > > > Signed-off-by: Omar Sandoval > > --- > > messages.h | 13 > > props.c| 69 > > +++--- > > 2 files changed, 38 insertions(+), 44 deletions(-) > > > > diff --git a/messages.h b/messages.h > > index 4999c7b9..004d5167 100644 > > --- a/messages.h > > +++ b/messages.h > > @@ -54,6 +54,19 @@ > > DO_ABORT_ON_ERROR; \ > > } while (0) > > > > +#define error_btrfs_util(err) > > \ > > + do {\ > > + const char *errno_str = strerror(errno);\ > > + const char *lib_str = btrfs_util_strerror(err) \ > > "make D=trace" fails because ";" is missing here. That's embarrassing, thanks!
Re: [PATCH v2 23/27] btrfs-progs: use libbtrfsutil for subvol sync
On Thu, Feb 22, 2018 at 11:03:05AM +0900, Misono, Tomohiro wrote: > > > On 2018/02/16 4:05, Omar Sandoval wrote: > > From: Omar Sandoval> > > > btrfs_util_f_deleted_subvolumes() replaces enumerate_dead_subvols() and > > btrfs_util_f_subvolume_info() replaces is_subvolume_cleaned(). > > > > Signed-off-by: Omar Sandoval > > --- > > cmds-subvolume.c | 217 > > ++- > > 1 file changed, 21 insertions(+), 196 deletions(-) > > > > diff --git a/cmds-subvolume.c b/cmds-subvolume.c > > index 49c9c8cf..9bab9312 100644 > > --- a/cmds-subvolume.c > > +++ b/cmds-subvolume.c > > @@ -42,38 +42,11 @@ > > #include "utils.h" > > #include "help.h" > > > > -static int is_subvolume_cleaned(int fd, u64 subvolid) > > +static int wait_for_subvolume_cleaning(int fd, size_t count, uint64_t *ids, > > + int sleep_interval) > > { > > - int ret; > > - struct btrfs_ioctl_search_args args; > > - struct btrfs_ioctl_search_key *sk = > > - > > - sk->tree_id = BTRFS_ROOT_TREE_OBJECTID; > > - sk->min_objectid = subvolid; > > - sk->max_objectid = subvolid; > > - sk->min_type = BTRFS_ROOT_ITEM_KEY; > > - sk->max_type = BTRFS_ROOT_ITEM_KEY; > > - sk->min_offset = 0; > > - sk->max_offset = (u64)-1; > > - sk->min_transid = 0; > > - sk->max_transid = (u64)-1; > > - sk->nr_items = 1; > > - > > - ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, ); > > - if (ret < 0) > > - return -errno; > > - > > - if (sk->nr_items == 0) > > - return 1; > > - > > - return 0; > > -} > > - > > -static int wait_for_subvolume_cleaning(int fd, int count, u64 *ids, > > - int sleep_interval) > > -{ > > - int ret; > > - int i; > > + size_t i; > > + enum btrfs_util_error err; > > > > while (1) { > > int clean = 1; > > @@ -81,16 +54,14 @@ static int wait_for_subvolume_cleaning(int fd, int > > count, u64 *ids, > > for (i = 0; i < count; i++) { > > if (!ids[i]) > > continue; > > - ret = is_subvolume_cleaned(fd, ids[i]); > > - if (ret < 0) { > > - error( > > - "cannot read status of dead subvolume %llu: %s", > > - (unsigned long long)ids[i], > > 
strerror(-ret)); > > - return ret; > > - } > > - if (ret) { > > - printf("Subvolume id %llu is gone\n", ids[i]); > > + err = btrfs_util_subvolume_info_fd(fd, ids[i], NULL); > > + if (err == BTRFS_UTIL_ERROR_SUBVOLUME_NOT_FOUND) { > > + printf("Subvolume id %" PRIu64 " is gone\n", > > + ids[i]); > > ids[i] = 0; > > + } else if (err) { > > + error_btrfs_util(err); > > + return -errno; > > } else { > > clean = 0; > > } > > @@ -1028,160 +999,15 @@ static const char * const cmd_subvol_sync_usage[] = > > { > > NULL > > }; > > > > -#if 0 > > -/* > > - * If we're looking for any dead subvolume, take a shortcut and look > > - * for any ORPHAN_ITEMs in the tree root > > - */ > > -static int fs_has_dead_subvolumes(int fd) > > -{ > > - int ret; > > - struct btrfs_ioctl_search_args args; > > - struct btrfs_ioctl_search_key *sk = > > - struct btrfs_ioctl_search_header sh; > > - u64 min_subvolid = 0; > > - > > -again: > > - sk->tree_id = BTRFS_ROOT_TREE_OBJECTID; > > - sk->min_objectid = BTRFS_ORPHAN_OBJECTID; > > - sk->max_objectid = BTRFS_ORPHAN_OBJECTID; > > - sk->min_type = BTRFS_ORPHAN_ITEM_KEY; > > - sk->max_type = BTRFS_ORPHAN_ITEM_KEY; > > - sk->min_offset = min_subvolid; > > - sk->max_offset = (u64)-1; > > - sk->min_transid = 0; > > - sk->max_transid = (u64)-1; > > - sk->nr_items = 1; > > - > > - ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, ); > > - if (ret < 0) > > - return -errno; > > - > > - if (!sk->nr_items) > > - return 0; > > - > > - memcpy(, args.buf, sizeof(sh)); > > - min_subvolid = sh.offset; > > - > > - /* > > -* Verify that the root item is really there and we haven't hit > > -* a stale orphan > > -*/ > > - sk->tree_id = BTRFS_ROOT_TREE_OBJECTID; > > - sk->min_objectid = min_subvolid; > > - sk->max_objectid = min_subvolid; > > - sk->min_type = BTRFS_ROOT_ITEM_KEY; > > - sk->max_type = BTRFS_ROOT_ITEM_KEY; > > - sk->min_offset = 0; > > - sk->max_offset = (u64)-1; > > - sk->min_transid = 0; > > - sk->max_transid = (u64)-1; > > - sk->nr_items = 1; > > - > > - ret = 
ioctl(fd, BTRFS_IOC_TREE_SEARCH, ); > > - if (ret < 0) > > - return -errno; > > - > > - /* > > -* Stale orphan, try the next one >
Re: [PATCH v2 10/27] libbtrfsutil: add btrfs_util_[gs]et_default_subvolume()
On Thu, Feb 22, 2018 at 10:55:48AM +0900, Misono, Tomohiro wrote: > On 2018/02/16 4:04, Omar Sandoval wrote: > > From: Omar Sandoval> > > > set_default_subvolume() is a trivial ioctl(), but there's no ioctl() for > > get_default_subvolume(), so we need to search the root tree. > > > > Signed-off-by: Omar Sandoval > > --- > > libbtrfsutil/btrfsutil.h| 41 ++ > > libbtrfsutil/python/btrfsutilpy.h | 2 + > > libbtrfsutil/python/module.c| 14 > > libbtrfsutil/python/subvolume.c | 50 > > libbtrfsutil/python/tests/test_subvolume.py | 14 > > libbtrfsutil/subvolume.c| 113 > > > > 6 files changed, 234 insertions(+) > > > > diff --git a/libbtrfsutil/btrfsutil.h b/libbtrfsutil/btrfsutil.h > > index 8bd2b847..54777f1d 100644 > > --- a/libbtrfsutil/btrfsutil.h > > +++ b/libbtrfsutil/btrfsutil.h > > @@ -256,6 +256,8 @@ enum btrfs_util_error > > btrfs_util_get_subvolume_read_only_fd(int fd, bool *ret); > > * @path: Subvolume path. > > * @read_only: New value of read-only flag. > > * > > + * This requires appropriate privilege (CAP_SYS_ADMIN). > > + * > > * Return: %BTRFS_UTIL_OK on success, non-zero error code on failure. > > */ > > enum btrfs_util_error btrfs_util_set_subvolume_read_only(const char *path, > > @@ -268,6 +270,45 @@ enum btrfs_util_error > > btrfs_util_set_subvolume_read_only(const char *path, > > enum btrfs_util_error btrfs_util_set_subvolume_read_only_fd(int fd, > > bool read_only); > > > > +/** > > + * btrfs_util_get_default_subvolume() - Get the default subvolume for a > > + * filesystem. > > + * @path: Path on a Btrfs filesystem. > > + * @id_ret: Returned subvolume ID. > > + * > > + * This requires appropriate privilege (CAP_SYS_ADMIN). > > + * > > + * Return: %BTRFS_UTIL_OK on success, non-zero error code on failure. > > + */ > > +enum btrfs_util_error btrfs_util_get_default_subvolume(const char *path, > > + uint64_t *id_ret); > > + > > +/** > > + * btrfs_util_get_default_subvolume_fd() - See > > + * btrfs_util_get_default_subvolume(). 
> > + */ > > +enum btrfs_util_error btrfs_util_get_default_subvolume_fd(int fd, > > + uint64_t *id_ret); > > + > > +/** > > + * btrfs_util_set_default_subvolume() - Set the default subvolume for a > > + * filesystem. > > + * @path: Path in a Btrfs filesystem. This may be any path in the > > filesystem; it > > + * does not have to refer to a subvolume unless @id is zero. > > + * @id: ID of subvolume to set as the default. If zero is given, the > > subvolume > > + * ID of @path is used. > > The line "This requires appropriate privilege (CAP_SYS_ADMIN)." is missing > here. Good catch, thanks, fixed.
Re: [PATCH] btrfs: fix endianness in super_copy
On Thu, Feb 22, 2018 at 09:58:42PM +0800, Anand Jain wrote: > Moving between opposite endianness will report bogus numbers in sysfs, > and mount may fail as the root will not be restored correctly. If the > filesystem is always used on a same-endian host, this will not be a > problem. > Fix this by using the btrfs_set_super...() functions to set > fs_info::super_copy values, and for the sysfs, use the cached > fs_info::nodesize,sectorsize values. > > Cc: sta...@vger.kernel.org > Signed-off-by: Anand Jain

Reviewed-by: David Sterba

Updated changelog and Fixes: tag.
Re: [PATCH v2] btrfs: fix bare unsigned declarations
On Thu, Feb 15, 2018 at 01:13:00AM +0800, Anand Jain wrote: > We have btrfs_fs_info::data_chunk_allocations and > btrfs_fs_info::metadata_ratio declared as unsigned, which would > be unsigned int, and kernel style prefers unsigned int over bare > unsigned. So this patch changes them to u32. > > Signed-off-by: Anand Jain

Please also update the temporary variable old_metadata_ratio in btrfs_remount that gets assigned from one of the variables.
Re: [PATCH v2] btrfs: verify subvolid mount parameter
On Thu, Feb 15, 2018 at 01:11:37AM +0800, Anand Jain wrote: > We aren't verifying the parameter passed to the subvolid mount option, > so we won't report and fail the mount if a junk value is specified, for > example -o subvolid=abc. > This patch verifies the subvolid option with match_u64. > > Signed-off-by: Anand Jain

Reviewed-by: David Sterba

Changelog updated to mention the previous memparse effects.
Re: [PATCH v2 00/27] btrfs-progs: introduce libbtrfsutil, "btrfs-progs as a library"
On Wed, Feb 21, 2018 at 10:50:32AM -0800, Omar Sandoval wrote: > On Wed, Feb 21, 2018 at 04:13:38PM +0100, David Sterba wrote: > > On Tue, Feb 20, 2018 at 07:50:48PM +0100, David Sterba wrote: > > > I have more comments or maybe questions about the future development > > > workflow, but at this point the patchset is in a good shape for > > > incremental merge. > > > > After removing the first patch adding subvolume.c (with > > linux/btrfs_tree.h) and what depends on it, I'm left with: > > > > Omar Sandoval (4): > > Add libbtrfsutil > > libbtrfsutil: add Python bindings > > libbtrfsutil: add qgroup inheritance helpers > > libbtrfsutil: add filesystem sync helpers > > > > with some context updates. That builds and passes the CI tests. > > Great. Does the CI system run the Python tests yet?

Tested here https://travis-ci.org/kdave/btrfs-progs/jobs/345410536 , does not pass:

test_start_sync (test_filesystem.TestSubvolume) ...
mkfs.btrfs: invalid option -- 'q'
usage: mkfs.btrfs [options] dev [ dev ... ]

Looks like it tries to use the system mkfs.btrfs, which is old.
Re: lots of data written, but no space occupied
On 23.02.2018 00:50, Chris Murphy wrote: The only thing I can think of is something's updating metadata due to relatime mount option. Maybe try noatime? At 9GiB, really it's 4.5GiB because whatever is being written is being doubled by raid1 profile and multiple devices. There is a case of wandering trees where a little bit of change can get amplified, but this requires an instigator, the file system won't do it on its own. What were the mkfs options? Are you using the default 16KiB nodesize?

I found the cause... And "pebkac" is the only way to describe it. As I wrote in my initial email, I installed collectd. And I was sure to have disabled the rrdtool plugin, which apparently I hadn't. So there was no additional space occupied, because collectd only updated already existing files.

In the end it only took a small effort to figure it out. I knew that calling "echo 1 > /proc/sys/vm/block_dump" would dump all file access to syslog. Since /var/log/syslog was on the affected drives, I created a RAM disk and had rsyslog log to it. I did this so I wouldn't see my own writes to the log while running "grep sda". I then grepped the syslog using "tail -f /var/log-ram/syslog|egrep -v "btrfs-transacti|kwork"|grep sda" until something like this fell out:

Feb 23 15:05:35 server2 kernel: [496223.186943] collectd(604): dirtied inode 870979 (smart_attribute-temperature-celsius-2.rrd) on sda2

Thanks anyway.
André
Re: btrfs crash consistency bug : Blocks allocated beyond eof are lost
On Fri, Feb 23, 2018 at 4:35 PM, Jayashree Mohanwrote: > Hi, > > [Fsync issue in btrfs] > In addition to the above, I would like to bring to your notice that : > After doing a fallocate or fallocate zero_range with keep size option, > a fsync() operation would have no effect at all. If we crash after the > fsync, on recovery the blocks allocated due to the fallocate call are > lost. This aligns with a patch work here[1] for a similar issue with > punch_hole. A simple test scenario that reproduces this bug is : > > 1. write (0-16K) to a file foo > 2. sync() > 3. fallocate keep_size (16K - 20K) > 4. fsync(foo) > 5. Crash > > On recovery, all blocks allocated in step 3 are lost (this is the true > even when fallocate is replaced by zero_range operation supported in > kernel 4.16 ) > Could you explain why a fsync() of the file would still not persist > the metadata(block count in this case), across power failures? In a very short explanation, because it thinks, on log recovery, that a shrinking truncate happened before the file was fsync'ed, so it drops the extents allocated by fallocate() after it replayed them from the log. I had seen this a year or 2 ago but never managed to fix it due to other more important things, but I'll try to fix it soon. > > [1] https://patchwork.kernel.org/patch/5830801/ > > > Thanks, > Jayashree Mohan > > > > > > On Wed, Feb 21, 2018 at 8:23 PM, Jayashree Mohan > wrote: >> Hi, >> >> On btrfs (as of kernel 4.15), say we fallocate a file with keep_size >> option, followed by fdatasync() or fsync(). If we now crash, on >> recovery we see a wrong block count and all the blocks allocated >> beyond the eof are lost. This bug was reported(xfstest generic/468) >> and patched on ext4[1], and a variant of this, that did not recover >> the correct file size was patched in f2fs[2]. I am wondering why this >> is still not fixed in btrfs. 
You can reproduce this bug on btrfs using >> a tool called CrashMonkey that we are building at UT Austin, which is >> a test harness for filesystem crash consistency checks[3] >> >> To reproduce the bug, simply run : >> ./c_harness -f /dev/sda -d /dev/cow_ram0 -t btrfs -e 102400 -v >> tests/generic_468.so >> >> Is there a reason why this is not yet patched in btrfs? I don't see >> why even after a fsync(), losing the blocks allocated beyond the eof >> are acceptable. >> >> [1] https://patchwork.kernel.org/patch/10120293/ >> [2] https://sourceforge.net/p/linux-f2fs/mailman/message/36104201/ >> [3] https://github.com/utsaslab/crashmonkey >> >> Thanks, >> >> Jayashree Mohan >> 2nd Year PhD in Computer Science >> University of Texas at Austin.

--
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”
Re: [RFC PATCH] btrfs: Speedup btrfs_read_block_groups()
On 02/22/2018 06:37 PM, Qu Wenruo wrote: On 2018年02月23日 00:31, Ellis H. Wilson III wrote: On 02/21/2018 11:56 PM, Qu Wenruo wrote: On 2018年02月22日 12:52, Qu Wenruo wrote:

btrfs_read_block_groups() is used to build up the block group cache for all block groups, so it will iterate all block group items in the extent tree. For a large filesystem (TB level), it will search for BLOCK_GROUP_ITEM thousands of times, which is the most time-consuming part of mounting btrfs. So this patch will try to speed it up by:

1) Avoid unnecessary readahead
We were using READA_FORWARD to search for block group items. However, block group items are in fact scattered across quite a lot of leaves. Doing readahead will just waste our IO (especially important for HDD). In a real-world case, a filesystem with 3T used space would have about 50K extent tree leaves, but only 3K block group items, meaning we need to iterate 16 leaves to meet one block group on average. So readahead won't help but will waste slow HDD seeks.

2) Use chunk mapping to locate block group items
Since one block group item always has one corresponding chunk item, we can use the chunk mapping to get the block group item size. With the block group item size, we can do a pinpoint tree search instead of searching with some uncertain value and doing a forward search. In some cases, like when the next BLOCK_GROUP_ITEM is in the next leaf of the current path, we can save such an unnecessary tree block read.

Cc: Ellis H. Wilson III

Hi Ellis, Would you please try this patch to see if it helps to speed up the mount of your large filesystem?

I will try either tomorrow or over the weekend. I'm waiting on hardware to be able to build and load a custom kernel on.

If you're using Archlinux, I could build the package for you. (For other distributions, unfortunately I'm not that familiar with them.) Thanks, Qu

No sweat. I'm not running arch anywhere, so was glad to handle this myself.

Short story: It doesn't appear to have any notable impact on mount time.
Long story:

#Built a modern kernel:
git clone https://github.com/kdave/btrfs-devel
cd'd into btrfs-devel
copied my current kernel config in /boot to .config
make olddefconfig
make -j16
make modules_install
make install
grub2-mkconfig -o /boot/grub/grub.cfg
reboot

#Reran tests with vanilla 4.16.0-rc1+ kernel
As root, of the form: time mount /dev/sdb /mnt/btrfs
5 iteration average: 16.869s

#Applied your patch, rebuilt, switched kernel module
wget -O - 'https://patchwork.kernel.org/patch/10234619/mbox' | git am -
make -j16
make modules_install
rmmod btrfs
modprobe btrfs

#Reran tests with patched 4.16.0-rc1+ kernel
As root, of the form: time mount /dev/sdb /mnt/btrfs
5 iteration average: 16.642s

So, there's a slight improvement against vanilla 4.16.0-rc1+, but it's still slightly slower than my original runs in 4.5.5, which got me 16.553s. In any event, most of this is statistically insignificant since the standard deviation is about two tenths of a second.

So, my conclusion here is that this problem needs to be handled at an architectural level to be truly solved (read: have mounts that take a few seconds at worst), which requires either: a) on-disk format changes like you (Qu) suggested some time back for a tree of block groups, or b) lazy block group walking post-mount and algorithms that can cope with making sub-optimal choices. For b), one would likely want to stonewall certain operations until the lazy post-mount walk completed, like balance, defrag, etc., which have more reason to require complete knowledge of the usage of each block group.

I may take a stab at b), but I'm first going to do the tests I promised relating to how mount times scale with increased capacity consumption for varying file sizes.

Best,

ellis
Re: [PATCH v2 10/10] btrfs: qgroup: Use independent and accurate per inode qgroup rsv
On 2/22/18 6:34 PM, Qu Wenruo wrote: > > > On 2018年02月23日 06:44, Jeff Mahoney wrote: >> On 12/22/17 1:18 AM, Qu Wenruo wrote: >>> Unlike reservation calculation used in inode rsv for metadata, qgroup >>> doesn't really need to care things like csum size or extent usage for >>> whole tree COW. >>> >>> Qgroup care more about net change of extent usage. >>> That's to say, if we're going to insert one file extent, it will mostly >>> find its place in CoWed tree block, leaving no change in extent usage. >>> Or cause leaf split, result one new net extent, increasing qgroup number >>> by nodesize. >>> (Or even more rare case, increase the tree level, increasing qgroup >>> number by 2 * nodesize) >>> >>> So here instead of using the way overkilled calculation for extent >>> allocator, which cares more about accurate and no error, qgroup doesn't >>> need that over-calculated reservation. >>> >>> This patch will maintain 2 new members in btrfs_block_rsv structure for >>> qgroup, using much smaller calculation for qgroup rsv, reducing false >>> EDQUOT. >> >> >> I think this is the right idea but, rather than being the last step in a >> qgroup rework, it should be the first. > > That's right. > > Although putting it as 1st patch needs extra work to co-operate with > later type seperation. > >> Don't qgroup reservation >> lifetimes match the block reservation lifetimes? > > Also correct, but... > >> We wouldn't want a >> global reserve and we wouldn't track usage on a per-block basis, but >> they should otherwise match. We already have clear APIs surrounding the >> use of block reservations, so it seems to me that it'd make sense to >> extend those to cover the qgroup cases as well. Otherwise, the rest of >> your patchset makes a parallel reservation system with a different API. >> That keeps the current state of qgroups as a bolt-on that always needs >> to be tracked separately (and which developers need to ensure they get >> right). 
> > The problem is, block reservation is designed to ensure every CoW could > be fulfilled without error. > > That's to say, for case like CoW write with csum, we need to care about > space reservation not only for EXTENT_DATA in fs tree, but also later > EXTENT_ITEM for extent tree, and CSUM_ITEM for csum tree. > > However extent tree and csum tree doesn't contribute to quota at all. > If we follow such over-reservation, it would be much much easier to hit > false EDQUOTA early. I'm not suggesting a 1:1 mapping between block reservations and qgroup reservations. If that were possible, we wouldn't need separate reservations at all. What we can do is only use bytes from the qgroup reservation when we allocate the leaf nodes belonging to the root we're tracking. Everywhere else we can migrate bytes normally between reservations the same way we do for block reservations. As we discussed offline yesterday, I'll work up something along what I have in mind and see if it works out. -Jeff > That's the main reason why a separate (and a little simplified) block > rsv tracking system. > > And if there is better solution, I'm all ears. > > Thanks, > Qu > >> >> -Jeff >> >>> Signed-off-by: Qu Wenruo>>> --- >>> fs/btrfs/ctree.h | 18 + >>> fs/btrfs/extent-tree.c | 55 >>> -- >>> 2 files changed, 62 insertions(+), 11 deletions(-) >>> >>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h >>> index 0c58f92c2d44..97783ba91e00 100644 >>> --- a/fs/btrfs/ctree.h >>> +++ b/fs/btrfs/ctree.h >>> @@ -467,6 +467,24 @@ struct btrfs_block_rsv { >>> unsigned short full; >>> unsigned short type; >>> unsigned short failfast; >>> + >>> + /* >>> +* Qgroup equivalent for @size @reserved >>> +* >>> +* Unlike normal normal @size/@reserved for inode rsv, >>> +* qgroup doesn't care about things like csum size nor how many tree >>> +* blocks it will need to reserve. >>> +* >>> +* Qgroup cares more about *NET* change of extent usage. 
>>> +* So for ONE newly inserted file extent, in worst case it will cause >>> +* leaf split and level increase, nodesize for each file extent >>> +* is already way overkilled. >>> +* >>> +* In short, qgroup_size/reserved is the up limit of possible needed >>> +* qgroup metadata reservation. >>> +*/ >>> + u64 qgroup_rsv_size; >>> + u64 qgroup_rsv_reserved; >>> }; >>> >>> /* >>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >>> index 986660f301de..9d579c7bcf7f 100644 >>> --- a/fs/btrfs/extent-tree.c >>> +++ b/fs/btrfs/extent-tree.c >>> @@ -5565,14 +5565,18 @@ static void space_info_add_new_bytes(struct >>> btrfs_fs_info *fs_info, >>> >>> static u64 block_rsv_release_bytes(struct btrfs_fs_info *fs_info, >>> struct btrfs_block_rsv *block_rsv, >>> - struct btrfs_block_rsv *dest, u64 num_bytes) >>> + struct
Btrfs remounted read-only due to ENOSPC in btrfs_run_delayed_refs
Hello, we have a btrfs-based backup system using btrfs snapshots and rsync. Sometimes we hit an ENOSPC bug and the filesystem is remounted read-only. However, there's still plenty of unallocated space according to "btrfs fi usage". So I think this isn't another edge condition where btrfs runs out of space due to fragmented chunks, but a bug in the disk space allocation code. It suffices to umount the filesystem and remount it, and it works fine again.

The frequency of ENOSPC seems to depend on metadata chunk usage. When there's a lot of free space in existing metadata chunks, the bug doesn't happen for months. If most metadata chunks are above ~98%, we hit the bug every few days. Below are details regarding the backup server and btrfs.

The backup works as follows:
* Every night, we create a btrfs snapshot on the backup server and rsync data from a production server into it. This snapshot is then marked read-only and will be used as a base subvolume for the next backup snapshot.
* Every day, expired snapshots are removed and their space is freed. Cleanup is scheduled in such a way that it doesn't interfere with the backup window.
* Multiple production servers are backed up in parallel to one backup server.
* The backed up servers are mostly webhosting servers and mail servers, i.e. hundreds of billions of small files. (Yes, we push btrfs to the limits :-))
* Backup server contains ~1080 snapshots, Zlib compression is enabled.
* Rsync is configured to use whole file copying.
System configuration: Debian Stretch, vanilla stable 4.14.20 kernel with one custom btrfs patch (see below) and Nikolay's patch 1b816c23e9 (btrfs: Add enospc_debug printing in metadata_reserve_bytes)

btrfs mount options: noatime,compress=zlib,enospc_debug,space_cache=v2,commit=15

$ btrfs fi df /backup:
Data, single: total=28.05TiB, used=26.37TiB
System, single: total=32.00MiB, used=3.53MiB
Metadata, single: total=255.00GiB, used=250.73GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

$ btrfs fi show /backup:
Label: none  uuid: a52501a9-651c-4712-a76b-7b4238cfff63
	Total devices 2  FS bytes used 26.62TiB
	devid 1  size 416.62GiB  used 255.03GiB  path /dev/sdb
	devid 2  size 36.38TiB   used 28.05TiB   path /dev/sdc

$ btrfs fi usage /backup:
Overall:
    Device size:          36.79TiB
    Device allocated:     28.30TiB
    Device unallocated:    8.49TiB
    Device missing:          0.00B
    Used:                 26.62TiB
    Free (estimated):     10.17TiB  (min: 10.17TiB)
    Data ratio:               1.00
    Metadata ratio:           1.00
    Global reserve:      512.00MiB  (used: 0.00B)

Data,single: Size:28.05TiB, Used:26.37TiB
   /dev/sdc   28.05TiB

Metadata,single: Size:255.00GiB, Used:250.73GiB
   /dev/sdb  255.00GiB

System,single: Size:32.00MiB, Used:3.53MiB
   /dev/sdb   32.00MiB

Unallocated:
   /dev/sdb  161.59GiB
   /dev/sdc    8.33TiB

Btrfs filesystem uses two logical drives in single mode, backed by hardware RAID controller PERC H710; /dev/sdb is HW RAID1 consisting of two SATA SSDs and /dev/sdc is HW RAID6 SATA volume. Please note that we have a simple custom patch in btrfs which ensures that metadata chunks are allocated preferably on SSD volume and data chunks are allocated only on SATA volume. The patch slightly modifies __btrfs_alloc_chunk() so that its loop over devices ignores rotating devices when a metadata chunk is requested and vice versa. However, I'm quite sure that this patch doesn't cause the reported bug because we log every call of the modified code and there're no __btrfs_alloc_chunk() calls when ENOSPC is triggered. Moreover, we observed the same bug before we developed the patch.
(IIRC, Chris Mason mentioned that they work on a similar feature in facebook, but I've found no official patches yet.)

Dmesg dump:
[285167.750763] use_block_rsv: 62468 callbacks suppressed
[285167.750764] BTRFS: block rsv returned -28
[285167.750789] [ cut here ]
[285167.750822] WARNING: CPU: 5 PID: 443 at fs/btrfs/extent-tree.c:8463 btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
[285167.750823] Modules linked in: binfmt_misc xt_comment xt_tcpudp iptable_filter nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack iptable_raw ip6table_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntr
[285167.750853] zstd_compress xxhash raid6_pq sd_mod crc32c_intel psmouse uhci_hcd ehci_pci ehci_hcd megaraid_sas usbcore scsi_mod bnx2
[285167.750861] CPU: 5 PID: 443 Comm: btrfs-transacti Tainted: G W I 4.14.20-znr1+ #69
[285167.750862] Hardware name: Dell Inc. PowerEdge R510/0DPRKF, BIOS 1.6.3 02/01/2011
[285167.750863] task: 9c4a1740e280 task.stack: ba48c1ecc000
[285167.750878] RIP: 0010:btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
[285167.750879] RSP: 0018:ba48c1ecf958 EFLAGS: 00010282
Re: Btrfs occupies more space than du reports...
On 2018-02-23 06:21, Shyam Prasad N wrote:
> Hi, Can someone explain to me why there is a difference in the number of blocks reported by the df and du commands below?
>
> # df -h /dc
> Filesystem Size Used Avail Use% Mounted on
> /dev/drbd1 746G 519G 225G 70% /dc
> # btrfs filesystem df -h /dc/
> Data, single: total=518.01GiB, used=516.58GiB
> System, DUP: total=8.00MiB, used=80.00KiB
> Metadata, DUP: total=2.00GiB, used=1019.72MiB
> GlobalReserve, single: total=352.00MiB, used=0.00B
> # du -sh /dc
> 467G /dc
>
> df shows 519G is used, while a recursive check using du shows only 467G. The filesystem doesn't contain any snapshots/extra subvolumes. Neither does it contain any mounted filesystem under /dc. I also considered that it could be a void left behind by one of the open FDs held by a process, so I rebooted the system. Still no changes. The situation is even worse on a few other systems with similar configuration.

At least part of this is a difference in how each tool computes space usage.

* `df` calls `statvfs` to get its data, which tries to count physical allocation accounting for replication profiles. In other words, data in chunks with the dup, raid1, and raid10 profiles gets counted twice, data in raid5 and raid6 chunks gets counted with a bit of extra space for the parity, etc.
* `btrfs fi df` looks directly at the filesystem itself and counts how much space is available to each chunk type in the `total` values and how much space is used in each chunk type in the `used` values, after replication. If you add together the data used value and twice the system and metadata used values, you get the used value reported by regular `df` (well, close to it, that is; `df` rounds at a lower precision than `btrfs fi df` does).
* `du` scans the directory tree and looks at the file allocation values returned from `stat` calls (or just looks at file sizes if you pass the `--apparent-size` flag to it).
Like `btrfs fi df`, it reports values after replication, but it has a couple of nasty caveats on BTRFS: namely, it will report sizes for natively compressed files _before_ compression, and will count reflinked blocks once for each link. Now, this doesn't explain the entirety of the discrepancy with `du`, but it should cover the whole difference between `df` and `btrfs fi df`. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
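The accounting rule described above can be checked against the numbers posted in this thread. A quick sketch (plain Python; the values are copied from the `df`/`btrfs fi df` output quoted above):

```python
# df counts DUP-profile chunks twice, so for this filesystem:
#   df_used ~= data_used + 2 * metadata_used + 2 * system_used
GIB = 1024 ** 3
MIB = 1024 ** 2
KIB = 1024

data_used = 516.58 * GIB        # Data, single: used
metadata_used = 1019.72 * MIB   # Metadata, DUP: used (counted twice by df)
system_used = 80 * KIB          # System, DUP: used (counted twice by df)

df_used = data_used + 2 * metadata_used + 2 * system_used
print(f"{df_used / GIB:.2f} GiB")  # 518.57 GiB, which df -h shows as 519G
```

So the `df` vs `btrfs fi df` gap is fully explained by DUP double-counting plus rounding; only the `du` gap remains open.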
Re: [PATCH v2 10/10] btrfs: qgroup: Use independent and accurate per inode qgroup rsv
[snip] >> >> We don't need to do such a check at the call site. >> >> Just do the calculation (which should be really simple, as simple as >> nodesize * nr_items_to_add), and pass it to btrfs_qgroup_reserve_* APIs, >> which would handle the quota enabled check. >> >>> >>> be contained into the block rsv apis. This will ensure lifetime of >>> blockrsv/qgroups is always consistent. I think this only applies to >>> qgroup metadata reservations. >> >> Yep, and only applies to PREALLOC type metadata reservation. >> For per-transaction rsv, it's handled by PERTRANS type. > > And what about the btrfs_qgroup_reserve_meta()-type reservations, this > function doesn't take a transaction, what kind of reservation is that o_O We only have two functions to reserve metadata: 1) btrfs_qgroup_reserve_meta_pertrans 2) btrfs_qgroup_reserve_meta_prealloc There is no 3rd option, and each function name should explain itself. For pertrans rsv, we don't really need the @trans parameter in fact. We only need to tell qgroup to reserve some meta space of the pertrans type. And there is only one caller for pertrans, which is in transaction.c. > > To sum it up we have PERTRANS, PREALLOC and btrfs_qgroup_reserve_meta > and those are 3 distinct types of reservations, correct? Only 2 in fact. Thanks, Qu >> >> Thanks, >> Qu >> >>> >>> >>> >>> Thanks, Qu > > -Jeff > >> Signed-off-by: Qu Wenruo>> --- >> fs/btrfs/ctree.h | 18 + >> fs/btrfs/extent-tree.c | 55 >> -- >> 2 files changed, 62 insertions(+), 11 deletions(-) >> >> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h >> index 0c58f92c2d44..97783ba91e00 100644 >> --- a/fs/btrfs/ctree.h >> +++ b/fs/btrfs/ctree.h >> @@ -467,6 +467,24 @@ struct btrfs_block_rsv { >> unsigned short full; >> unsigned short type; >> unsigned short failfast; >> + >> +/* >> + * Qgroup equivalent for @size @reserved >> + * >> + * Unlike normal normal @size/@reserved for inode rsv, >> + * qgroup doesn't care about things like csum size nor how many >> tree >> + * blocks it will need to reserve. 
>> + * >> + * Qgroup cares more about *NET* change of extent usage. >> + * So for ONE newly inserted file extent, in worst case it will >> cause >> + * leaf split and level increase, nodesize for each file extent >> + * is already way overkilled. >> + * >> + * In short, qgroup_size/reserved is the up limit of possible >> needed >> + * qgroup metadata reservation. >> + */ >> +u64 qgroup_rsv_size; >> +u64 qgroup_rsv_reserved; >> }; >> >> /* >> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >> index 986660f301de..9d579c7bcf7f 100644 >> --- a/fs/btrfs/extent-tree.c >> +++ b/fs/btrfs/extent-tree.c >> @@ -5565,14 +5565,18 @@ static void space_info_add_new_bytes(struct >> btrfs_fs_info *fs_info, >> >> static u64 block_rsv_release_bytes(struct btrfs_fs_info *fs_info, >> struct btrfs_block_rsv *block_rsv, >> -struct btrfs_block_rsv *dest, u64 >> num_bytes) >> +struct btrfs_block_rsv *dest, u64 >> num_bytes, >> +u64 *qgroup_to_release_ret) >> { >> struct btrfs_space_info *space_info = block_rsv->space_info; >> +u64 qgroup_to_release = 0; >> u64 ret; >> >> spin_lock(_rsv->lock); >> -if (num_bytes == (u64)-1) >> +if (num_bytes == (u64)-1) { >> num_bytes = block_rsv->size; >> +qgroup_to_release = block_rsv->qgroup_rsv_size; >> +} >> block_rsv->size -= num_bytes; >> if (block_rsv->reserved >= block_rsv->size) { >> num_bytes = block_rsv->reserved - block_rsv->size; >> @@ -5581,6 +5585,12 @@ static u64 block_rsv_release_bytes(struct >> btrfs_fs_info *fs_info, >> } else { >> num_bytes = 0; >> } >> +if (block_rsv->qgroup_rsv_reserved >= >> block_rsv->qgroup_rsv_size) { >> +qgroup_to_release = block_rsv->qgroup_rsv_reserved - >> +block_rsv->qgroup_rsv_size; >> +block_rsv->qgroup_rsv_reserved = >> block_rsv->qgroup_rsv_size; >> +} else >> +qgroup_to_release = 0; >> spin_unlock(_rsv->lock); >> >> ret = num_bytes; >> @@ -5603,6 +5613,8 @@ static u64 block_rsv_release_bytes(struct >>
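The qgroup half of block_rsv_release_bytes() in the quoted patch can be paraphrased as follows (a hypothetical Python stand-in, not kernel code; the dict keys mirror the new struct fields):

```python
def qgroup_release(rsv, shrink_all=False):
    """Paraphrase of the qgroup logic added to block_rsv_release_bytes():
    whatever is reserved beyond the current qgroup_rsv_size target gets
    released back to the qgroup."""
    if shrink_all:
        # the num_bytes == (u64)-1 case: release the whole reservation
        rsv["qgroup_rsv_size"] = 0
    if rsv["qgroup_rsv_reserved"] >= rsv["qgroup_rsv_size"]:
        to_release = rsv["qgroup_rsv_reserved"] - rsv["qgroup_rsv_size"]
        rsv["qgroup_rsv_reserved"] = rsv["qgroup_rsv_size"]
    else:
        to_release = 0
    return to_release

nodesize = 16384
rsv = {"qgroup_rsv_reserved": 3 * nodesize, "qgroup_rsv_size": nodesize}
print(qgroup_release(rsv))  # 32768: everything above the target size
```

The point of the design is visible here: the qgroup reservation shrinks toward its own (much smaller) target size, independently of the byte-accurate block rsv accounting.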
Btrfs occupies more space than du reports...
Hi, Can someone explain to me why there is a difference in the number of blocks reported by the df and du commands below?

# df -h /dc
Filesystem Size Used Avail Use% Mounted on
/dev/drbd1 746G 519G 225G 70% /dc
# btrfs filesystem df -h /dc/
Data, single: total=518.01GiB, used=516.58GiB
System, DUP: total=8.00MiB, used=80.00KiB
Metadata, DUP: total=2.00GiB, used=1019.72MiB
GlobalReserve, single: total=352.00MiB, used=0.00B
# du -sh /dc
467G /dc

df shows 519G is used, while a recursive check using du shows only 467G. The filesystem doesn't contain any snapshots/extra subvolumes. Neither does it contain any mounted filesystem under /dc. I also considered that it could be a void left behind by one of the open FDs held by a process, so I rebooted the system. Still no changes. The situation is even worse on a few other systems with similar configuration. -- -Shyam
Re: [PATCH v2 10/10] btrfs: qgroup: Use independent and accurate per inode qgroup rsv
On 23.02.2018 11:06, Qu Wenruo wrote: > > > On 2018年02月23日 16:14, Nikolay Borisov wrote: >> >> >> On 23.02.2018 01:34, Qu Wenruo wrote: >>> >>> >>> On 2018年02月23日 06:44, Jeff Mahoney wrote: On 12/22/17 1:18 AM, Qu Wenruo wrote: > Unlike reservation calculation used in inode rsv for metadata, qgroup > doesn't really need to care things like csum size or extent usage for > whole tree COW. > > Qgroup care more about net change of extent usage. > That's to say, if we're going to insert one file extent, it will mostly > find its place in CoWed tree block, leaving no change in extent usage. > Or cause leaf split, result one new net extent, increasing qgroup number > by nodesize. > (Or even more rare case, increase the tree level, increasing qgroup > number by 2 * nodesize) > > So here instead of using the way overkilled calculation for extent > allocator, which cares more about accurate and no error, qgroup doesn't > need that over-calculated reservation. > > This patch will maintain 2 new members in btrfs_block_rsv structure for > qgroup, using much smaller calculation for qgroup rsv, reducing false > EDQUOT. I think this is the right idea but, rather than being the last step in a qgroup rework, it should be the first. >>> >>> That's right. >>> >>> Although putting it as 1st patch needs extra work to co-operate with >>> later type seperation. >>> Don't qgroup reservation lifetimes match the block reservation lifetimes? >>> >>> Also correct, but... >>> We wouldn't want a global reserve and we wouldn't track usage on a per-block basis, but they should otherwise match. We already have clear APIs surrounding the use of block reservations, so it seems to me that it'd make sense to extend those to cover the qgroup cases as well. Otherwise, the rest of your patchset makes a parallel reservation system with a different API. That keeps the current state of qgroups as a bolt-on that always needs to be tracked separately (and which developers need to ensure they get right). 
>>> >>> The problem is, block reservation is designed to ensure every CoW could >>> be fulfilled without error. >>> >>> That's to say, for case like CoW write with csum, we need to care about >>> space reservation not only for EXTENT_DATA in fs tree, but also later >>> EXTENT_ITEM for extent tree, and CSUM_ITEM for csum tree. >>> >>> However extent tree and csum tree doesn't contribute to quota at all. >>> If we follow such over-reservation, it would be much much easier to hit >>> false EDQUOTA early. >>> >>> That's the main reason why a separate (and a little simplified) block >>> rsv tracking system. >>> >>> And if there is better solution, I'm all ears. >> >> So looking at the code the main hurdles seems to be the fact that the >> btrfs_block_rsv_* API's take a raw byte count, which is derived from >> btrfs_calc_trans_metadata_size, which in turn is passed the number of >> items we want. >> >> So what if we extend the block_rsv_* apis to take an additional >> 'quota_bytes' or somesuch argument which would represent the amount we >> would like to be charged to the qgroups. This will force us to revisit >> every call site of block_rsv API's and adjust the code accordingly so >> that callers pass the correct number. This will lead to code such as: >> >> if (test_bit(BTRFS_FS_QUOTA_ENABLED, _info->flags)) { blah } > > We don't need to do such check at call site. > > Just do the calculation (which should be really simple, as simple as > nodesize * nr_items_to_add), and pass it to btrfs_qgroup_reserve_* APIs, > which would handle the quota enabled check. > >> >> be contained into the block rsv apis. This will ensure lifetime of >> blockrsv/qgroups is always consistent. I think this only applies to >> qgroup metadata reservations. > > Yep, and only applies to PREALLOC type metadata reservation. > For per-transaction rsv, it's handled by PERTRANS type. 
And what about the btrfs_qgroup_reserve_meta()-type reservations, this function doesn't take a transaction, what kind of reservation is that o_O To sum it up we have PERTRANS, PREALLOC and btrfs_qgroup_reserve_meta and those are 3 distinct type of reservations, correct? > > Thanks, > Qu > >> >> >> >> >>> >>> Thanks, >>> Qu >>> -Jeff > Signed-off-by: Qu Wenruo> --- > fs/btrfs/ctree.h | 18 + > fs/btrfs/extent-tree.c | 55 > -- > 2 files changed, 62 insertions(+), 11 deletions(-) > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > index 0c58f92c2d44..97783ba91e00 100644 > --- a/fs/btrfs/ctree.h > +++ b/fs/btrfs/ctree.h > @@ -467,6 +467,24 @@ struct btrfs_block_rsv { > unsigned short full; > unsigned short type; > unsigned short failfast; > + > + /* > + *
Re: Fwd: [Bug 1547319] 4.16.0-0.rc1.git4.1.fc28.x86_64 #1 Not tainted: possible recursive locking detected
On 22 February 2018 at 23:57, David Sterba wrote: [..] >> https://bugzilla.redhat.com/show_bug.cgi?id=1547319 >> >> Laura Abbott changed: >> >> What |Removed |Added >> >> CC| |labb...@redhat.com >> >> --- Comment #1 from Laura Abbott --- >> This is btrfs which is pretty far down on our priority list. This is best >> reported to the upstream btrfs devs. > > Quick and wrong conclusions, it's not a btrfs bug. > > The fix: > > https://patchwork.kernel.org/patch/10207013/ > > but I don't see it merged in master branch. Thank you for the reply :) I'll keep my ticket open until it is merged. kloczek -- Tomasz Kłoczko | LinkedIn: http://lnkd.in/FXPWxH
Re: [PATCH v2 10/10] btrfs: qgroup: Use independent and accurate per inode qgroup rsv
On 2018年02月23日 16:14, Nikolay Borisov wrote: > > > On 23.02.2018 01:34, Qu Wenruo wrote: >> >> >> On 2018年02月23日 06:44, Jeff Mahoney wrote: >>> On 12/22/17 1:18 AM, Qu Wenruo wrote: Unlike reservation calculation used in inode rsv for metadata, qgroup doesn't really need to care things like csum size or extent usage for whole tree COW. Qgroup care more about net change of extent usage. That's to say, if we're going to insert one file extent, it will mostly find its place in CoWed tree block, leaving no change in extent usage. Or cause leaf split, result one new net extent, increasing qgroup number by nodesize. (Or even more rare case, increase the tree level, increasing qgroup number by 2 * nodesize) So here instead of using the way overkilled calculation for extent allocator, which cares more about accurate and no error, qgroup doesn't need that over-calculated reservation. This patch will maintain 2 new members in btrfs_block_rsv structure for qgroup, using much smaller calculation for qgroup rsv, reducing false EDQUOT. >>> >>> >>> I think this is the right idea but, rather than being the last step in a >>> qgroup rework, it should be the first. >> >> That's right. >> >> Although putting it as 1st patch needs extra work to co-operate with >> later type seperation. >> >>> Don't qgroup reservation >>> lifetimes match the block reservation lifetimes? >> >> Also correct, but... >> >>> We wouldn't want a >>> global reserve and we wouldn't track usage on a per-block basis, but >>> they should otherwise match. We already have clear APIs surrounding the >>> use of block reservations, so it seems to me that it'd make sense to >>> extend those to cover the qgroup cases as well. Otherwise, the rest of >>> your patchset makes a parallel reservation system with a different API. >>> That keeps the current state of qgroups as a bolt-on that always needs >>> to be tracked separately (and which developers need to ensure they get >>> right). 
>> >> The problem is, block reservation is designed to ensure every CoW could >> be fulfilled without error. >> >> That's to say, for case like CoW write with csum, we need to care about >> space reservation not only for EXTENT_DATA in fs tree, but also later >> EXTENT_ITEM for extent tree, and CSUM_ITEM for csum tree. >> >> However extent tree and csum tree doesn't contribute to quota at all. >> If we follow such over-reservation, it would be much much easier to hit >> false EDQUOTA early. >> >> That's the main reason why a separate (and a little simplified) block >> rsv tracking system. >> >> And if there is better solution, I'm all ears. > > So looking at the code the main hurdle seems to be the fact that the > btrfs_block_rsv_* API's take a raw byte count, which is derived from > btrfs_calc_trans_metadata_size, which in turn is passed the number of > items we want. > > So what if we extend the block_rsv_* apis to take an additional > 'quota_bytes' or somesuch argument which would represent the amount we > would like to be charged to the qgroups. This will force us to revisit > every call site of block_rsv API's and adjust the code accordingly so > that callers pass the correct number. This will lead to code such as: > > if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags)) { blah } We don't need to do such a check at the call site. Just do the calculation (which should be really simple, as simple as nodesize * nr_items_to_add), and pass it to btrfs_qgroup_reserve_* APIs, which would handle the quota enabled check. > > be contained into the block rsv apis. This will ensure lifetime of > blockrsv/qgroups is always consistent. I think this only applies to > qgroup metadata reservations. Yep, and only applies to PREALLOC type metadata reservation. For per-transaction rsv, it's handled by PERTRANS type. 
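The "really simple" calculation Qu describes, set next to the traditional over-reservation, might be sketched as follows (illustrative Python; the exact shape of btrfs_calc_trans_metadata_size() is an assumption based on this discussion, not a copy of the kernel source):

```python
# Two reservation estimates for inserting nr_items metadata items.
# The block rsv must guarantee CoW never fails, so it assumes every item
# may CoW a full tree path in several trees; the qgroup rsv only tracks
# the worst-case *net* growth of extent usage.

def calc_trans_metadata_size(nodesize, nr_items, max_level=8, split_factor=2):
    # rough shape of the traditional estimate: a full path of tree
    # blocks, doubled for splits, per item (assumed, not kernel-exact)
    return nodesize * split_factor * max_level * nr_items

def calc_qgroup_meta_size(nodesize, nr_items):
    # worst net change per inserted item is one new leaf (the rare level
    # bump adds 2 * nodesize), so nodesize per item is already an upper
    # bound in the common case
    return nodesize * nr_items

nodesize = 16384
print(calc_trans_metadata_size(nodesize, 1))  # 262144 bytes held by block rsv
print(calc_qgroup_meta_size(nodesize, 1))     # 16384 bytes charged to qgroup
```

The 16x gap between the two numbers is exactly why charging qgroups with the block rsv figure produces false EDQUOT so easily.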
Thanks, Qu > > > > >> >> Thanks, >> Qu >> >>> >>> -Jeff >>> Signed-off-by: Qu Wenruo--- fs/btrfs/ctree.h | 18 + fs/btrfs/extent-tree.c | 55 -- 2 files changed, 62 insertions(+), 11 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 0c58f92c2d44..97783ba91e00 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -467,6 +467,24 @@ struct btrfs_block_rsv { unsigned short full; unsigned short type; unsigned short failfast; + + /* + * Qgroup equivalent for @size @reserved + * + * Unlike normal normal @size/@reserved for inode rsv, + * qgroup doesn't care about things like csum size nor how many tree + * blocks it will need to reserve. + * + * Qgroup cares more about *NET* change of extent usage. + * So for ONE newly inserted file extent, in worst case it will cause + * leaf split and level increase, nodesize for each
Re: [PATCH v2 06/27] libbtrfsutil: add btrfs_util_create_subvolume()
On 2018/02/16 4:04, Omar Sandoval wrote:
> From: Omar Sandoval
> +static enum btrfs_util_error openat_parent_and_name(int dirfd, const char *path,
> + char *name, size_t name_len,
> + int *fd)
> +{
> + char *tmp_path, *slash, *dirname, *basename;
> + size_t len;
> +
> + /* Ignore trailing slashes. */
> + len = strlen(path);
> + while (len > 1 && path[len - 1] == '/')
> + len--;
> +
> + tmp_path = malloc(len + 1);
> + if (!tmp_path)
> + return BTRFS_UTIL_ERROR_NO_MEMORY;
> + memcpy(tmp_path, path, len);
> + tmp_path[len] = '\0';
> +
> + slash = memrchr(tmp_path, '/', len);
> + if (slash == tmp_path) {
> + dirname = "/";
> + basename = tmp_path + 1;
> + } else if (slash) {
> + *slash = '\0';
> + dirname = tmp_path;
> + basename = slash + 1;
> + } else {
> + dirname = ".";
> + basename = tmp_path;
> + }
> +
> + len = strlen(basename);
> + if (len >= name_len) {
> + errno = ENAMETOOLONG;

tmp_path should also be freed here.

> + return BTRFS_UTIL_ERROR_INVALID_ARGUMENT;
> + }
> + memcpy(name, basename, len);
> + name[len] = '\0';
> +
> + *fd = openat(dirfd, dirname, O_RDONLY | O_DIRECTORY);
> + if (*fd == -1) {
> + free(tmp_path);
> + return BTRFS_UTIL_ERROR_OPEN_FAILED;
> + }
> +
> + free(tmp_path);
> + return BTRFS_UTIL_OK;
> +}
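For readers following the review, the split logic of openat_parent_and_name() can be mirrored in Python (a sketch of the semantics only; the free()-on-error leak pointed out above is C-specific and doesn't arise here):

```python
def parent_and_name(path):
    """Mirror of the dirname/basename split in openat_parent_and_name():
    strip trailing slashes, then split at the last '/' using the same
    three cases as the C code."""
    # ignore trailing slashes, but keep a lone "/"
    end = len(path)
    while end > 1 and path[end - 1] == "/":
        end -= 1
    path = path[:end]

    slash = path.rfind("/")
    if slash == 0:
        # "/name": the parent is the root directory
        return "/", path[1:]
    elif slash > 0:
        # "dir/name": split at the last slash
        return path[:slash], path[slash + 1:]
    else:
        # bare "name": the parent is the current directory
        return ".", path

print(parent_and_name("/mnt/btrfs/subvol/"))  # ('/mnt/btrfs', 'subvol')
```

The C version then openat()s the parent with O_DIRECTORY and copies the name into a caller-supplied buffer, which is where the ENAMETOOLONG path (and the noted leak) lives.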
Re: [PATCH v2 10/10] btrfs: qgroup: Use independent and accurate per inode qgroup rsv
On 23.02.2018 01:34, Qu Wenruo wrote: > > > On 2018年02月23日 06:44, Jeff Mahoney wrote: >> On 12/22/17 1:18 AM, Qu Wenruo wrote: >>> Unlike reservation calculation used in inode rsv for metadata, qgroup >>> doesn't really need to care things like csum size or extent usage for >>> whole tree COW. >>> >>> Qgroup care more about net change of extent usage. >>> That's to say, if we're going to insert one file extent, it will mostly >>> find its place in CoWed tree block, leaving no change in extent usage. >>> Or cause leaf split, result one new net extent, increasing qgroup number >>> by nodesize. >>> (Or even more rare case, increase the tree level, increasing qgroup >>> number by 2 * nodesize) >>> >>> So here instead of using the way overkilled calculation for extent >>> allocator, which cares more about accurate and no error, qgroup doesn't >>> need that over-calculated reservation. >>> >>> This patch will maintain 2 new members in btrfs_block_rsv structure for >>> qgroup, using much smaller calculation for qgroup rsv, reducing false >>> EDQUOT. >> >> >> I think this is the right idea but, rather than being the last step in a >> qgroup rework, it should be the first. > > That's right. > > Although putting it as 1st patch needs extra work to co-operate with > later type seperation. > >> Don't qgroup reservation >> lifetimes match the block reservation lifetimes? > > Also correct, but... > >> We wouldn't want a >> global reserve and we wouldn't track usage on a per-block basis, but >> they should otherwise match. We already have clear APIs surrounding the >> use of block reservations, so it seems to me that it'd make sense to >> extend those to cover the qgroup cases as well. Otherwise, the rest of >> your patchset makes a parallel reservation system with a different API. >> That keeps the current state of qgroups as a bolt-on that always needs >> to be tracked separately (and which developers need to ensure they get >> right). 
> > The problem is, block reservation is designed to ensure every CoW could > be fulfilled without error. > > That's to say, for case like CoW write with csum, we need to care about > space reservation not only for EXTENT_DATA in fs tree, but also later > EXTENT_ITEM for extent tree, and CSUM_ITEM for csum tree. > > However extent tree and csum tree doesn't contribute to quota at all. > If we follow such over-reservation, it would be much much easier to hit > false EDQUOTA early. > > That's the main reason why a separate (and a little simplified) block > rsv tracking system. > > And if there is better solution, I'm all ears. So looking at the code the main hurdle seems to be the fact that the btrfs_block_rsv_* API's take a raw byte count, which is derived from btrfs_calc_trans_metadata_size, which in turn is passed the number of items we want. So what if we extend the block_rsv_* apis to take an additional 'quota_bytes' or somesuch argument which would represent the amount we would like to be charged to the qgroups. This will force us to revisit every call site of block_rsv API's and adjust the code accordingly so that callers pass the correct number. This will allow code such as: if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags)) { blah } to be contained in the block rsv apis. This will ensure lifetime of blockrsv/qgroups is always consistent. I think this only applies to qgroup metadata reservations. 
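The API shape proposed here, combined with the refinement elsewhere in the thread that the quota-enabled check belongs inside the qgroup helpers rather than at every call site, might look like this (purely illustrative Python; none of these names are the eventual kernel API):

```python
# Sketch of the discussed design: callers pass both the raw byte count
# and the (much smaller) qgroup charge; the quota-enabled check is
# hidden inside the qgroup helper so call sites stay branch-free.

QUOTA_ENABLED = True  # stands in for test_bit(BTRFS_FS_QUOTA_ENABLED, ...)

def qgroup_reserve_meta_prealloc(qgroup_bytes):
    # quota check lives here, not at every block rsv call site
    if not QUOTA_ENABLED:
        return 0
    return qgroup_bytes

def block_rsv_add(rsv, num_bytes, qgroup_bytes):
    rsv["reserved"] += num_bytes
    rsv["qgroup_rsv_reserved"] += qgroup_reserve_meta_prealloc(qgroup_bytes)

rsv = {"reserved": 0, "qgroup_rsv_reserved": 0}
nodesize = 16384
# reserve for one item: full CoW estimate for the rsv, one nodesize for qgroup
block_rsv_add(rsv, num_bytes=nodesize * 2 * 8, qgroup_bytes=nodesize)
print(rsv)
```

This keeps the lifetimes of the block rsv and its qgroup counterpart tied together, which is the consistency property the reviewers are after.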
> > Thanks, > Qu > >> >> -Jeff >> >>> Signed-off-by: Qu Wenruo>>> --- >>> fs/btrfs/ctree.h | 18 + >>> fs/btrfs/extent-tree.c | 55 >>> -- >>> 2 files changed, 62 insertions(+), 11 deletions(-) >>> >>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h >>> index 0c58f92c2d44..97783ba91e00 100644 >>> --- a/fs/btrfs/ctree.h >>> +++ b/fs/btrfs/ctree.h >>> @@ -467,6 +467,24 @@ struct btrfs_block_rsv { >>> unsigned short full; >>> unsigned short type; >>> unsigned short failfast; >>> + >>> + /* >>> +* Qgroup equivalent for @size @reserved >>> +* >>> +* Unlike normal normal @size/@reserved for inode rsv, >>> +* qgroup doesn't care about things like csum size nor how many tree >>> +* blocks it will need to reserve. >>> +* >>> +* Qgroup cares more about *NET* change of extent usage. >>> +* So for ONE newly inserted file extent, in worst case it will cause >>> +* leaf split and level increase, nodesize for each file extent >>> +* is already way overkilled. >>> +* >>> +* In short, qgroup_size/reserved is the up limit of possible needed >>> +* qgroup metadata reservation. >>> +*/ >>> + u64 qgroup_rsv_size; >>> + u64 qgroup_rsv_reserved; >>> }; >>> >>> /* >>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >>> index 986660f301de..9d579c7bcf7f 100644 >>> --- a/fs/btrfs/extent-tree.c >>> +++ b/fs/btrfs/extent-tree.c >>> @@ -5565,14 +5565,18 @@ static void space_info_add_new_bytes(struct >>>