[PATCH] btrfs-progs: mkfs/rootdir: Fix memory leak in traverse_directory()
The bug is exposed by mkfs test case 009 with D=asan. We are leaking the
memory of parent_dir_entry->path, which, except for the rootdir, is
allocated by strdup().

Before fixing it, unify the allocation of parent_dir_entry->path to heap
allocation. Then fix the leak by adding "free(parent_dir_entry->path);"
to traverse_directory() and to the error handler.

Signed-off-by: Qu Wenruo
---
 mkfs/rootdir.c | 5 +++--
 mkfs/rootdir.h | 2 +-
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/mkfs/rootdir.c b/mkfs/rootdir.c
index 9be4bcbfd0b5..f9472e126674 100644
--- a/mkfs/rootdir.c
+++ b/mkfs/rootdir.c
@@ -453,7 +453,6 @@ static int traverse_directory(struct btrfs_trans_handle *trans,
 	ino_t parent_inum, cur_inum;
 	ino_t highest_inum = 0;
 	const char *parent_dir_name;
-	char real_path[PATH_MAX];
 	struct btrfs_path path;
 	struct extent_buffer *leaf;
 	struct btrfs_key root_dir_key;
@@ -464,7 +463,7 @@ static int traverse_directory(struct btrfs_trans_handle *trans,
 	if (!dir_entry)
 		return -ENOMEM;
 	dir_entry->dir_name = dir_name;
-	dir_entry->path = realpath(dir_name, real_path);
+	dir_entry->path = realpath(dir_name, NULL);
 	if (!dir_entry->path) {
 		error("realpath failed for %s: %s", dir_name, strerror(errno));
 		ret = -1;
@@ -616,6 +615,7 @@ static int traverse_directory(struct btrfs_trans_handle *trans,
 	}

 	free_namelist(files, count);
+	free(parent_dir_entry->path);
 	free(parent_dir_entry);

 	index_cnt = 2;
@@ -686,6 +686,7 @@ fail:
 		dir_entry = list_entry(dir_head.list.next,
 				       struct directory_name_entry, list);
 		list_del(&dir_entry->list);
+		free(dir_entry->path);
 		free(dir_entry);
 	}
 out:

diff --git a/mkfs/rootdir.h b/mkfs/rootdir.h
index d0fc2eb58563..f06c7dd16e3c 100644
--- a/mkfs/rootdir.h
+++ b/mkfs/rootdir.h
@@ -23,7 +23,7 @@ struct directory_name_entry {
 	const char *dir_name;
-	const char *path;
+	char *path;
 	ino_t inum;
 	struct list_head list;
 };
--
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] btrfs-progs: Fix read beyond boundary bug in build_roots_info_cache()
This bug is exposed by fsck-test with D=asan, hit by test case 020, with
the following error report:

==10740==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x62161580 at pc 0x56051f0db6cd bp 0x7ffe170f3e20 sp 0x7ffe170f3e10
READ of size 1 at 0x62161580 thread T0
    #0 0x56051f0db6cc in btrfs_extent_inline_ref_type /home/adam/btrfs/btrfs-progs/ctree.h:1727
    #1 0x56051f13b669 in build_roots_info_cache /home/adam/btrfs/btrfs-progs/cmds-check.c:14306
    #2 0x56051f13c86a in repair_root_items /home/adam/btrfs/btrfs-progs/cmds-check.c:14450
    #3 0x56051f13ea89 in cmd_check /home/adam/btrfs/btrfs-progs/cmds-check.c:14965
    #4 0x56051efe75bb in main /home/adam/btrfs/btrfs-progs/btrfs.c:302
    #5 0x7f04ddbb0f49 in __libc_start_main (/usr/lib/libc.so.6+0x20f49)
    #6 0x56051efe68c9 in _start (/home/adam/btrfs/btrfs-progs/btrfs+0x5b8c9)

0x62161580 is located 0 bytes to the right of 4224-byte region [0x62160500,0x62161580)
allocated by thread T0 here:
    #0 0x7f04ded50ce1 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cc:70
    #1 0x56051f04685e in __alloc_extent_buffer /home/adam/btrfs/btrfs-progs/extent_io.c:553
    #2 0x56051f047563 in alloc_extent_buffer /home/adam/btrfs/btrfs-progs/extent_io.c:687
    #3 0x56051efff1d1 in btrfs_find_create_tree_block /home/adam/btrfs/btrfs-progs/disk-io.c:187
    #4 0x56051f000133 in read_tree_block /home/adam/btrfs/btrfs-progs/disk-io.c:327
    #5 0x56051efeddb8 in read_node_slot /home/adam/btrfs/btrfs-progs/ctree.c:652
    #6 0x56051effb0d9 in btrfs_next_leaf /home/adam/btrfs/btrfs-progs/ctree.c:2853
    #7 0x56051f13b343 in build_roots_info_cache /home/adam/btrfs/btrfs-progs/cmds-check.c:14267
    #8 0x56051f13c86a in repair_root_items /home/adam/btrfs/btrfs-progs/cmds-check.c:14450
    #9 0x56051f13ea89 in cmd_check /home/adam/btrfs/btrfs-progs/cmds-check.c:14965
    #10 0x56051efe75bb in main /home/adam/btrfs/btrfs-progs/btrfs.c:302
    #11 0x7f04ddbb0f49 in __libc_start_main (/usr/lib/libc.so.6+0x20f49)

It's entirely possible that an extent/metadata item has no inline
reference, but build_roots_info_cache() doesn't check for that.
Fix it by checking @iref against the item end to avoid reading beyond
the leaf boundary.

Signed-off-by: Qu Wenruo
---
I'm pretty sure this fix will conflict with the original mode code split.
Maybe it's better to merge the original code split first?
---
 cmds-check.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/cmds-check.c b/cmds-check.c
index 99fbafc5538c..b15802887e70 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -14255,6 +14255,7 @@ static int build_roots_info_cache(struct btrfs_fs_info *info)
 		struct btrfs_key found_key;
 		struct btrfs_extent_item *ei;
 		struct btrfs_extent_inline_ref *iref;
+		unsigned long item_end;
 		int slot = path.slots[0];
 		int type;
 		u64 flags;
@@ -14283,6 +14284,7 @@ static int build_roots_info_cache(struct btrfs_fs_info *info)

 		ei = btrfs_item_ptr(leaf, slot, struct btrfs_extent_item);
 		flags = btrfs_extent_flags(leaf, ei);
+		item_end = (unsigned long)ei + btrfs_item_size_nr(leaf, slot);

 		if (found_key.type == BTRFS_EXTENT_ITEM_KEY &&
 		    !(flags & BTRFS_EXTENT_FLAG_TREE_BLOCK))
@@ -14299,6 +14301,15 @@ static int build_roots_info_cache(struct btrfs_fs_info *info)
 			level = btrfs_tree_block_level(leaf, binfo);
 		}

+		/*
+		 * It's valid for one extent/metadata item to have no inline
+		 * ref, but a SHARED_BLOCK_REF or other shared reference
+		 * instead.  So we need an extra check to avoid reading
+		 * beyond the leaf boundary.
+		 */
+		if ((unsigned long)iref >= item_end)
+			goto next;
+
 		/*
 		 * For a root extent, it must be of the following type and the
 		 * first (and only one) iref in the item.
--
2.16.1
Re: btrfs-cleaner / snapshot performance analysis
On Sun, Feb 11, 2018 at 02:40:16PM +0800, Qu Wenruo wrote:
> On 2018年02月10日 00:45, Ellis H. Wilson III wrote:
> > Hi all,
> >
> > I am trying to better understand how the cleaner kthread (btrfs-cleaner)
> > impacts foreground performance, specifically during snapshot deletion.
> > My experience so far has been that it can be dramatically disruptive to
> > foreground I/O.
> >
> > Looking through the wiki at kernel.org I have not yet stumbled onto any
> > analysis that would shed light on this specific problem.  I have found
> > numerous complaints about btrfs-cleaner online, especially relating to
> > quotas being enabled.  This has proven thus far less than helpful, as
> > the response tends to be "use less snapshots," or "disable quotas," both
> > of which strike me as intellectually unsatisfying answers, especially
> > the former in a filesystem where snapshots are supposed to be
> > "first-class citizens."
>
> Yes, snapshots in btrfs really are "first-class citizens".
> Tons of designs are all biased towards snapshots.
>
> But one should be clear about one thing:
> snapshot creation and backref walk (used in qgroup, relocation and
> extent deletion) are in fact two conflicting workloads.
>
> Btrfs puts snapshot creation on a very high priority, so it greatly
> degrades the performance of the backref walk (used in snapshot deletion,
> relocation and the extent exclusive/shared calculation of qgroup).
>
> Let me explain this problem in detail.
>
> Just as explained by Peter Grandi, for any snapshot system (or any
> system that supports reflink) there must be a reverse mapping tree, to
> tell which extent is used by whom.
>
> It's very critical to determine if an extent is shared, so we can
> decide whether we need to do CoW.
>
> There are several different ways to implement it, and this hugely
> affects snapshot creation performance.
>
> 1) Direct mapping record
>    Just records exactly which extent is used by whom, directly.
>    So when we need to check the owner, we just search the tree ONCE,
>    then we get it.
>
>    This is simple and it seems that LVM thin-provision and LVM
>    traditional targets are all using it.
>    (Maybe XFS also follows this way?)

Yes, it does.

>    Pros:
>    *FAST* backref walk, which means quick extent deletion and CoW
>    condition check.
>
>    Cons:
>    *SLOW* snapshot creation.
>    Each snapshot creation needs to insert the new owner relationship
>    into the tree.  This modification grows with the size of the
>    snapshot source.

...of course xfs also doesn't support snapshots. :)

--D

> 2) Indirect mapping record
>    Records the upper-level referencer only.
>
>    To get all direct owners of an extent, it needs multiple lookups
>    in the reverse mapping tree.
>
>    And obviously, btrfs is using this method.
>
>    Pros:
>    *FAST* owner inheritance, which means snapshot creation.
>    (Well, the only advantage I can think of)
>
>    Cons:
>    *VERY SLOW* backref walk, used by extent deletion, relocation, qgroup
>    and the CoW condition check.
>    (That may also be why btrfs defaults to CoW for data, so that it can
>     skip the costly backref walk)
>
> And a more detailed example of the difference between them will be:
>
> [Basic tree layout]
>         Tree X
>         node A
>        /      \
>    node B    node C
>    /    \    /    \
> leaf D leaf E leaf F leaf G
>
> Use the above tree X as the snapshot source.
>
> [Snapshot creation: Direct mapping]
> Then for direct mapping record, if we are going to create snapshot Y
> then we would get:
>
>     Tree X    Tree Y
>         node A
>        | \   / |
>        |   X   |
>        | /   \ |
>     node B   node C
>     /    \   /    \
> leaf D leaf E leaf F leaf G
>
> We need to create new node H, and update the owner for node B/C/D/E/F/G.
>
> That is to say, we need to create 1 new node, and update 6 references of
> existing nodes/leaves.
> And this will grow rapidly if the tree is large, but it should still be
> a linear increase.
>
> [Snapshot creation: Indirect mapping]
> And if using an indirect mapping tree, firstly, the reverse mapping tree
> doesn't record exactly the owner of each leaf/node, but only records
> its parent(s).
>
> So even when tree X exists alone, without snapshot Y, if we need to know
> the owner of leaf D, we only know its only parent is node B.
> We do the same query on node B until we reach node A and know it's
> owned by tree X.
>
> Tree X            ^
> node A            ^ Look upward until
>    /|               we reach tree root
> node B            | to search the owner
>  / |                of a leaf/node
>
Re: mount btrfs takes 30 minutes, btrfs check runs out of memory
On 2018年02月14日 00:24, Holger Hoffstätte wrote:
> On 02/13/18 13:54, Qu Wenruo wrote:
>> On 2018年02月13日 20:26, Holger Hoffstätte wrote:
>>> On 02/13/18 12:40, Qu Wenruo wrote:
>> The problem is not about how much space it takes, but how many extents
>> are here in the filesystem.
>>>
>>> I have no idea why btrfs' mount even needs to touch all block groups to
>>> get going (which seems to be the root of the problem), but here's a
>>> not so crazy idea for more "mechanical sympathy". Feel free to mock
>>> me if this is terribly wrong or not possible. ;)
>>>
>>> Mounting of even large filesystems (with many extents) seems to be fine
>>> on SSDs, but not so fine on rotational storage. We've heard that from
>>> several people with large (multi-TB) filesystems, and obviously it's
>>> even more terrible on 5400RPM drives because their seeks are sooo sloow.
>>>
>>> If the problem is that the bgs are touched/iterated in "tree order",
>>> would it then not be possible to sort the block groups in physical order
>>> before trying to load whatever mount needs to load?
>>
>> This is in fact a good idea.
>> Make block group into its own tree.
>
> Well, that's not what I was thinking about at all..yet. :)
> (keep in mind I'm not really that familiar with the internals).
>
> Out of curiosity I ran a bit of perf on my own mount process, which is
> fast (~700 ms) despite being a ~1.1TB fs, a mixture of lots of large and
> small files. Unfortunately it's also very fresh since I recreated it just
> this weekend, so everything is neatly packed together and fast.
>
> In contrast a friend's fs is ~800 GB, but has 11 GB metadata and is pretty
> old and fragmented (but running an up-to-date kernel). His fs mounts in ~5s.
>
> My perf run shows that the only interesting part responsible for mount time
> is the nested loop in btrfs_read_block_groups calling find_first_block_group
> (which got inlined & is not in the perf callgraph) over and over again,
> accounting for 75% of the time spent.
>
> I now understand your comment that the real solution to this problem
> is to move bgs into their own tree, and agree: both kitchens and databases
> figured out a long time ago that the key to fast scan and lookup
> performance is to not put different things in the same storage container;
> in the case of analytical DBMS this is columnar storage. :)
>
> But what I originally meant was something much simpler and more
> brute-force-ish. I see that btrfs_read_block_groups adds readahead
> (is that actually effective?), but what I was looking for was the equivalent
> of a DBMS' sequential scan. Right now finding (and loading) a bg seems to
> involve a nested loop of tree lookups. It seems easier to rip through the
> entire tree in nice 8MB chunks and discard what you don't need instead
> of seeking around trying to find all the right bits in scattered order.

The problem is, the tree (extent tree) containing the block groups is
very, very large. It's a tree shared by all subvolumes. And since tree
nodes and leaves can be scattered around the whole disk, it's pretty
hard to do batch readahead.

> Could we alleviate cold mounts by starting more readaheads in
> btrfs_read_block_groups, so that the extent tree is scanned more linearly?

Since the extent tree is not linear, it won't be as effective as we believe.

Thanks,
Qu

> cheers,
> Holger
Re: [PATCH 1/2] btrfs: Remove invalid null checks from btrfs_cleanup_dirty_bgs
On Thu, Feb 08, 2018 at 10:57:11AM -0800, Liu Bo wrote:
> On Thu, Feb 08, 2018 at 06:25:17PM +0200, Nikolay Borisov wrote:
> > list_first_entry is essentially a wrapper over container_of. The latter
> > can never return null even if it's working on an inconsistent list,
> > since it will either crash or return some offset in the wrong struct.
> > Additionally, for the dirty_bgs list the iteration is done under
> > dirty_bgs_lock, which ensures consistency of the list.
>
> Reviewed-by: Liu Bo

Added to next, thanks.
Re: [PATCH v2] btrfs: fix endianness compatibility during the SB RW
On Tue, Feb 13, 2018 at 06:27:13PM +0800, Anand Jain wrote:
> On 02/13/2018 05:01 PM, Qu Wenruo wrote:
> > On 2018年02月13日 11:00, Anand Jain wrote:
> > > Fixes the endianness bug in fs_info::super_copy by using its
> > > btrfs_set_super...() functions to set values in the SB, as these
> > > functions manage the endianness compatibility nicely.
> > >
> > > Signed-off-by: Anand Jain
> >
> > Also went through all btrfs_super_block SETGET functions, grepping
> > using \>, seems that there are still some left here:
> >
> > fs/btrfs/sysfs.c:
> > In both btrfs_sectorsize_show() and btrfs_clone_alignment_show():
> > 	return snprintf(buf, PAGE_SIZE, "%u\n",
> > 			fs_info->super_copy->sectorsize);
> >
> > In btrfs_nodesize_show():
> > 	return snprintf(buf, PAGE_SIZE, "%u\n", fs_info->super_copy->nodesize);
>
> Oh. Thanks. Will fix. Maybe it's a good idea to add the sysfs fixes
> in a new patch.

I'd prefer a single patch, as it fixes the same problem for one
structure; the context of use is not that important to justify two
patches.

I went through the possible uses of the superblock again and did not
find anything else than update_super_roots and the sysfs read handlers.
There are some direct uses of superblock members, like label, sys_array
and uuid, that are passed unconverted and must be accessed via the
set/get helpers. In some places the superblock is put into a temporary
variable, so a simple grep may miss these.

> > And what about cc'ing this to the stable kernel?
> > IIRC it's a very critical problem for btrfs.

If the filesystem is always used on hosts of the same endianness, this
will not be a problem. Moving between hosts of opposite endianness will
report bogus numbers in sysfs, and the backup root would not be restored
correctly. As this is not common, I'd rate that as a bugfix for stable,
but "only" a serious one.

> > Maybe cc: sta...@vger.kernel.org # v3.2+?
>
> Thanks for the suggestion. Will do. Any idea what happens if the patch
> which applied on mainline ends up conflicting on LTS — should I write a
> separate patch for stable?

If the patch does not apply to some older stable branch, all involved
people get a mail from the stable team and have an opportunity to send
an updated version of the patch. Regarding the long-term branches, I
would consider 4.4 and up. Anything older is a plus, but fixing merge
conflicts is more likely there, so I think the respective maintainers
would either fix it by themselves or ask for help.
Re: [PATCH v2] btrfs: log, when replace, is canceled by the user
On Tue, Feb 13, 2018 at 11:53:43AM +0800, Anand Jain wrote:
> For forensic investigations of issues, we would want to know
> if and when the user cancels a replace.
>
> Signed-off-by: Anand Jain

Reviewed-by: David Sterba

I've updated the changelog a bit.
Re: [PATCH] btrfs: fix bare unsigned declarations
On Tue, Feb 13, 2018 at 05:50:48PM +0800, Anand Jain wrote:
> Kernel style prefers "unsigned int" over "unsigned"
> and "signed int" over "signed".

The changelog does not match the changes, you're switching to u32. I
agree u32 looks more suitable and consistent with the other changes.
Please update the changelog and fix up the temporary variable in
btrfs_remount. Thanks.
Re: [PATCH] btrfs: extent_buffer_uptodate() make it static and inline
On Tue, Feb 13, 2018 at 09:18:51AM +0200, Nikolay Borisov wrote:
> On 13.02.2018 06:35, Anand Jain wrote:
> > extent_buffer_uptodate() is a trivial wrapper around test_bit()
> > and nothing else. So make it static and inline, saving on code
> > space and call indirection.
> >
> > Before:
> >    text    data     bss     dec     hex filename
> > 1131257   82898   18992 1233147  12d0fb fs/btrfs/btrfs.ko
> >
> > After:
> >    text    data     bss     dec     hex filename
> > 1131090   82898   18992 1232980  12d054 fs/btrfs/btrfs.ko
> >
> > Signed-off-by: Anand Jain
>
> Reviewed-by: Nikolay Borisov, although there is one nit below
>
> > ---
> >  fs/btrfs/extent_io.c | 5 -----
> >  fs/btrfs/extent_io.h | 5 ++++-
> >  2 files changed, 4 insertions(+), 6 deletions(-)
> >
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index bae57b408901..f300edae044b 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -5229,11 +5229,6 @@ void set_extent_buffer_uptodate(struct extent_buffer *eb)
> >  	}
> >  }
> >
> > -int extent_buffer_uptodate(struct extent_buffer *eb)
> > -{
> > -	return test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
> > -}
> > -
> >  int read_extent_buffer_pages(struct extent_io_tree *tree,
> >  			     struct extent_buffer *eb, int wait, int mirror_num)
> >  {
> > diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
> > index 72e5af2965a8..953d94662b8e 100644
> > --- a/fs/btrfs/extent_io.h
> > +++ b/fs/btrfs/extent_io.h
> > @@ -489,7 +489,10 @@ void clear_extent_buffer_dirty(struct extent_buffer *eb);
> >  int set_extent_buffer_dirty(struct extent_buffer *eb);
> >  void set_extent_buffer_uptodate(struct extent_buffer *eb);
> >  void clear_extent_buffer_uptodate(struct extent_buffer *eb);
> > -int extent_buffer_uptodate(struct extent_buffer *eb);
> > +static inline int extent_buffer_uptodate(struct extent_buffer *eb)
> > +{
> > +	return test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
> > +}
>
> nit: It just doesn't look good to have a static inline function
> definition among a list of function declarations. A quick look at
> extent_io.h shows that this function could have been put below the
> inline def of extent_buffer_get.

I will move it.
Re: [PATCH v2 1/2] btrfs: declare max_inline as u32
On Tue, Feb 13, 2018 at 05:49:49PM +0800, Anand Jain wrote:
> As of now btrfs_fs_info::max_inline is u64, which can't be
> larger than btrfs_fs_info::sectorsize which is defined as
> u32, so make btrfs_fs_info::max_inline u32.
>
> Signed-off-by: Anand Jain

Reviewed-by: David Sterba
Re: [PATCH v2 2/2] btrfs: verify max_inline mount parameter
On Tue, Feb 13, 2018 at 05:49:50PM +0800, Anand Jain wrote:
> We aren't verifying the parameter passed to the max_inline mount option,
> so we won't report and fail the mount if a junk value is specified, for
> example -o max_inline=abc.
> This patch converts the max_inline option to %d and checks that it's a
> number >= 0.

As max_inline is a size, the suffixes are allowed here and this is
documented in the btrfs(5) page. I've checked all current options and
max_inline should be the only one where we want the suffixes.
Re: mount btrfs takes 30 minutes, btrfs check runs out of memory
On 02/13/18 13:54, Qu Wenruo wrote:
> On 2018年02月13日 20:26, Holger Hoffstätte wrote:
>> On 02/13/18 12:40, Qu Wenruo wrote:
> The problem is not about how much space it takes, but how many extents
> are here in the filesystem.
>>
>> I have no idea why btrfs' mount even needs to touch all block groups to
>> get going (which seems to be the root of the problem), but here's a
>> not so crazy idea for more "mechanical sympathy". Feel free to mock
>> me if this is terribly wrong or not possible. ;)
>>
>> Mounting of even large filesystems (with many extents) seems to be fine
>> on SSDs, but not so fine on rotational storage. We've heard that from
>> several people with large (multi-TB) filesystems, and obviously it's
>> even more terrible on 5400RPM drives because their seeks are sooo sloow.
>>
>> If the problem is that the bgs are touched/iterated in "tree order",
>> would it then not be possible to sort the block groups in physical order
>> before trying to load whatever mount needs to load?
>
> This is in fact a good idea.
> Make block group into its own tree.

Well, that's not what I was thinking about at all..yet. :)
(keep in mind I'm not really that familiar with the internals).

Out of curiosity I ran a bit of perf on my own mount process, which is
fast (~700 ms) despite being a ~1.1TB fs, a mixture of lots of large and
small files. Unfortunately it's also very fresh since I recreated it just
this weekend, so everything is neatly packed together and fast.

In contrast a friend's fs is ~800 GB, but has 11 GB metadata and is pretty
old and fragmented (but running an up-to-date kernel). His fs mounts in ~5s.

My perf run shows that the only interesting part responsible for mount time
is the nested loop in btrfs_read_block_groups calling find_first_block_group
(which got inlined & is not in the perf callgraph) over and over again,
accounting for 75% of the time spent.

I now understand your comment that the real solution to this problem
is to move bgs into their own tree, and agree: both kitchens and databases
figured out a long time ago that the key to fast scan and lookup
performance is to not put different things in the same storage container;
in the case of analytical DBMS this is columnar storage. :)

But what I originally meant was something much simpler and more
brute-force-ish. I see that btrfs_read_block_groups adds readahead
(is that actually effective?), but what I was looking for was the equivalent
of a DBMS' sequential scan. Right now finding (and loading) a bg seems to
involve a nested loop of tree lookups. It seems easier to rip through the
entire tree in nice 8MB chunks and discard what you don't need instead
of seeking around trying to find all the right bits in scattered order.

Could we alleviate cold mounts by starting more readaheads in
btrfs_read_block_groups, so that the extent tree is scanned more linearly?

cheers,
Holger
Re: [PATCH] btrfs: manage subvolid mount option as %u
On Tue, Feb 13, 2018 at 05:50:43PM +0800, Anand Jain wrote:
> As the -o subvolid mount option is a u64, manage it as %u for
> token verification instead of %s.
>
> Signed-off-by: Anand Jain
> ---
>  fs/btrfs/super.c | 23 ++++++++++-------------
>  1 file changed, 10 insertions(+), 13 deletions(-)
>
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 8112619cac95..bf629c8a6a47 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -336,7 +336,7 @@ enum {
>  static const match_table_t tokens = {
>  	{Opt_degraded, "degraded"},
>  	{Opt_subvol, "subvol=%s"},
> -	{Opt_subvolid, "subvolid=%s"},
> +	{Opt_subvolid, "subvolid=%u"},

This gets implemented as simple_strtoul; deep in the parser it uses an
unsigned long long and then casts to unsigned long. Long and long-long
are the same only on a subset of architectures, so we'd need to either
extend the parser capabilities to really do u64, or we'd have to use
the %s and match_u64.

Though a subvolid larger than a full 32bit number is unlikely, for the
sake of correctness I think we should stick to the constraints of the
subvolid, which allows u64.

>  	{Opt_device, "device=%s"},
>  	{Opt_nodatasum, "nodatasum"},
>  	{Opt_datasum, "datasum"},
> @@ -964,8 +964,8 @@ static int btrfs_parse_subvol_options(const char *options, fmode_t flags,
>  {
>  	substring_t args[MAX_OPT_ARGS];
>  	char *opts, *orig, *p;
> -	char *num = NULL;
>  	int error = 0;
> +	u64 subvolid;
>
>  	if (!options)
>  		return 0;
> @@ -995,18 +995,15 @@ static int btrfs_parse_subvol_options(const char *options, fmode_t flags,
>  			}
>  			break;
>  		case Opt_subvolid:
> -			num = match_strdup(&args[0]);
> -			if (num) {
> -				*subvol_objectid = memparse(num, NULL);
> -				kfree(num);
> -				/* we want the original fs_tree */
> -				if (!*subvol_objectid)
> -					*subvol_objectid =
> -						BTRFS_FS_TREE_OBJECTID;
> -			} else {
> -				error = -EINVAL;
> +			error = match_u64(&args[0], &subvolid);

So this is the right way (with the %s), as memparse accepts the size
suffixes (K/M/G/...) that we don't want for a subvolume id.

> +			if (error)
>  				goto out;
> -			}
> +
> +			/* we want the original fs_tree */
> +			if (subvolid == 0)
> +				subvolid = BTRFS_FS_TREE_OBJECTID;
> +
> +			*subvol_objectid = subvolid;
>  			break;
>  		case Opt_subvolrootid:
>  			pr_warn("BTRFS: 'subvolrootid' mount option is deprecated and has no effect\n");
> --
> 2.7.0
Re: [PATCH] btrfs: manage thread_pool mount option as %u
On Tue, Feb 13, 2018 at 05:34:56PM +0200, Nikolay Borisov wrote:
> >>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> >>> index 02c7766e6849..8112619cac95 100644
> >>> --- a/fs/btrfs/super.c
> >>> +++ b/fs/btrfs/super.c
> >>> @@ -346,7 +346,7 @@ static const match_table_t tokens = {
> >>>  	{Opt_barrier, "barrier"},
> >>>  	{Opt_max_inline, "max_inline=%u"},
> >>>  	{Opt_alloc_start, "alloc_start=%s"},
> >>> -	{Opt_thread_pool, "thread_pool=%d"},
> >>> +	{Opt_thread_pool, "thread_pool=%u"},
> >>>  	{Opt_compress, "compress"},
> >>>  	{Opt_compress_type, "compress=%s"},
> >>>  	{Opt_compress_force, "compress-force"},
> >>> @@ -596,12 +596,11 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
> >>>  			ret = match_int(&args[0], &intarg);
> >>>  			if (ret) {
> >>>  				goto out;
> >>> -			} else if (intarg > 0) {
> >>> -				info->thread_pool_size = intarg;
> >>> -			} else {
> >>> +			} else if (intarg == 0) {
> >>
> >> One thing I'm worried about is the fact that match_int parses a signed
> >> int. So if someone passes -1 then it would be parsed to -1, but when you
> >> set it to thread_pool_size the actual value is going to be the 2's
> >> complement of the value, i.e. a very large number. So a check for intarg
> >> < 0 is required to avoid that. Same applies to your other patches.
> >
> > That's not true. When -o thread_pool=-1 is passed it would fail
> > to match any token. And the same applies to the other patches too.
>
> Indeed, the subtleties of the matching machinery. In that case for the
> whole series:

Yeah, it's not very clear from the sources, but with %u the parser
refuses negative numbers. Even with match_int, the typecasts will not
damage the result when assigning to u32 from intarg.
Re: [PATCH] btrfs: manage thread_pool mount option as %u
On 13.02.2018 17:18, Anand Jain wrote:
> >>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> >>> index 02c7766e6849..8112619cac95 100644
> >>> --- a/fs/btrfs/super.c
> >>> +++ b/fs/btrfs/super.c
> >>> @@ -346,7 +346,7 @@ static const match_table_t tokens = {
> >>>  	{Opt_barrier, "barrier"},
> >>>  	{Opt_max_inline, "max_inline=%u"},
> >>>  	{Opt_alloc_start, "alloc_start=%s"},
> >>> -	{Opt_thread_pool, "thread_pool=%d"},
> >>> +	{Opt_thread_pool, "thread_pool=%u"},
> >>>  	{Opt_compress, "compress"},
> >>>  	{Opt_compress_type, "compress=%s"},
> >>>  	{Opt_compress_force, "compress-force"},
> >>> @@ -596,12 +596,11 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
> >>>  			ret = match_int(&args[0], &intarg);
> >>>  			if (ret) {
> >>>  				goto out;
> >>> -			} else if (intarg > 0) {
> >>> -				info->thread_pool_size = intarg;
> >>> -			} else {
> >>> +			} else if (intarg == 0) {
> >>
> >> One thing I'm worried about is the fact that match_int parses a signed
> >> int. So if someone passes -1 then it would be parsed to -1, but when you
> >> set it to thread_pool_size the actual value is going to be the 2's
> >> complement of the value, i.e. a very large number. So a check for intarg
> >> < 0 is required to avoid that. Same applies to your other patches.
> >
> > That's not true. When -o thread_pool=-1 is passed it would fail
> > to match any token. And the same applies to the other patches too.

Indeed, the subtleties of the matching machinery. In that case for the
whole series:

Reviewed-by: Nikolay Borisov

> Thanks, Anand
Re: btrfs-progs: compile with musl libc
> Thanks, this has been fixed in
> https://github.com/kdave/btrfs-progs/commit/3aa1bbdd89ee9c9c48d260a6192fae08328f1b2f

Ah, my bad. Didn't check the dev branch. Thanks for the quick response!

Cheers,
Dan
Re: [PATCH] btrfs: manage thread_pool mount option as %u
>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
>> index 02c7766e6849..8112619cac95 100644
>> --- a/fs/btrfs/super.c
>> +++ b/fs/btrfs/super.c
>> @@ -346,7 +346,7 @@ static const match_table_t tokens = {
>>      {Opt_barrier, "barrier"},
>>      {Opt_max_inline, "max_inline=%u"},
>>      {Opt_alloc_start, "alloc_start=%s"},
>> -    {Opt_thread_pool, "thread_pool=%d"},
>> +    {Opt_thread_pool, "thread_pool=%u"},
>>      {Opt_compress, "compress"},
>>      {Opt_compress_type, "compress=%s"},
>>      {Opt_compress_force, "compress-force"},
>> @@ -596,12 +596,11 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
>>              ret = match_int(&args[0], &intarg);
>>              if (ret) {
>>                      goto out;
>> -            } else if (intarg > 0) {
>> -                    info->thread_pool_size = intarg;
>> -            } else {
>> +            } else if (intarg == 0) {
>
> One thing I'm worried about is the fact that match_int parses a signed
> int. So if someone passes -1 then it would be parsed to -1, but when you
> set it to thread_pool_size the actual value is going to be the 2's
> complement of the value, i.e. a very large number. So a check for
> intarg < 0 is required to avoid that. Same applies to your other patches.

That's not true. When -o thread_pool=-1 is passed it would fail
to match to any token. And same applies to other patches too.

Thanks, Anand
Re: btrfs-progs: compile with musl libc
On Tue, Feb 13, 2018 at 01:25:10PM +, Dan Robertson wrote:
> Ensure that limits.h is included in mkfs/main.c as PATH_MAX is used. The lack
> of this currently causes btrfs-progs to fail to compile when the systems libc
> is musl.
>
> From 7859e0d01778e844ae0fcbefe55581277ce7cab3 Mon Sep 17 00:00:00 2001
> From: Dan Robertson
> Date: Tue, 13 Feb 2018 13:04:29 +
> Subject: [PATCH] btrfs-progs: compile with musl libc
>
> Ensure that limits.h is included in mkfs/main.c as PATH_MAX is used.
>
> Signed-off-by: Dan Robertson

Thanks, this has been fixed in
https://github.com/kdave/btrfs-progs/commit/3aa1bbdd89ee9c9c48d260a6192fae08328f1b2f
and the musl build check is now in place so the breakage should not happen
anymore. I'll do a release this week with the fix included.
Re: [PATCH] btrfs: manage thread_pool mount option as %u
On 13.02.2018 11:50, Anand Jain wrote:
> -o thread_pool is always unsigned. Manage it that way all around.
>
> Signed-off-by: Anand Jain
> ---
>  fs/btrfs/ctree.h   |  2 +-
>  fs/btrfs/disk-io.c |  4 ++--
>  fs/btrfs/super.c   | 13 ++++++-------
>  3 files changed, 9 insertions(+), 10 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 0b738f0a9a23..23b7e6756b15 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -935,7 +935,7 @@ struct btrfs_fs_info {
>  	struct btrfs_workqueue *extent_workers;
>  	struct task_struct *transaction_kthread;
>  	struct task_struct *cleaner_kthread;
> -	int thread_pool_size;
> +	u32 thread_pool_size;
>
>  	struct kobject *space_info_kobj;
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 87d24c903152..3cec2455b603 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2180,7 +2180,7 @@ static void btrfs_init_qgroup(struct btrfs_fs_info *fs_info)
>  static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info,
>  		struct btrfs_fs_devices *fs_devices)
>  {
> -	int max_active = fs_info->thread_pool_size;
> +	u32 max_active = fs_info->thread_pool_size;
>  	unsigned int flags = WQ_MEM_RECLAIM | WQ_FREEZABLE | WQ_UNBOUND;
>
>  	fs_info->workers =
> @@ -2401,7 +2401,7 @@ int open_ctree(struct super_block *sb,
>  	int err = -EINVAL;
>  	int num_backups_tried = 0;
>  	int backup_index = 0;
> -	int max_active;
> +	u32 max_active;
>  	int clear_free_space_tree = 0;
>
>  	tree_root = fs_info->tree_root = btrfs_alloc_root(fs_info, GFP_KERNEL);
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 02c7766e6849..8112619cac95 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -346,7 +346,7 @@ static const match_table_t tokens = {
>  	{Opt_barrier, "barrier"},
>  	{Opt_max_inline, "max_inline=%u"},
>  	{Opt_alloc_start, "alloc_start=%s"},
> -	{Opt_thread_pool, "thread_pool=%d"},
> +	{Opt_thread_pool, "thread_pool=%u"},
>  	{Opt_compress, "compress"},
>  	{Opt_compress_type, "compress=%s"},
>  	{Opt_compress_force, "compress-force"},
> @@ -596,12 +596,11 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
>  			ret = match_int(&args[0], &intarg);
>  			if (ret) {
>  				goto out;
> -			} else if (intarg > 0) {
> -				info->thread_pool_size = intarg;
> -			} else {
> +			} else if (intarg == 0) {

One thing I'm worried about is the fact that match_int parses a signed
int. So if someone passes -1 then it would be parsed to -1, but when you
set it to thread_pool_size the actual value is going to be the 2's
complement of the value, i.e. a very large number. So a check for
intarg < 0 is required to avoid that. Same applies to your other patches.

>  				ret = -EINVAL;
>  				goto out;
>  			}
> +			info->thread_pool_size = intarg;
>  			break;
>  		case Opt_max_inline:
>  			ret = match_int(&args[0], &intarg);
> @@ -1317,7 +1316,7 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
>  		seq_printf(seq, ",max_inline=%u", info->max_inline);
>  	if (info->thread_pool_size != min_t(unsigned long,
>  					    num_online_cpus() + 2, 8))
> -		seq_printf(seq, ",thread_pool=%d", info->thread_pool_size);
> +		seq_printf(seq, ",thread_pool=%u", info->thread_pool_size);
>  	if (btrfs_test_opt(info, COMPRESS)) {
>  		compress_type = btrfs_compress_type2str(info->compress_type);
>  		if (btrfs_test_opt(info, FORCE_COMPRESS))
> @@ -1723,7 +1722,7 @@ static struct dentry *btrfs_mount(struct file_system_type *fs_type, int flags,
>  }
>
>  static void btrfs_resize_thread_pool(struct btrfs_fs_info *fs_info,
> -		int new_pool_size, int old_pool_size)
> +		u32 new_pool_size, u32 old_pool_size)
>  {
>  	if (new_pool_size == old_pool_size)
>  		return;
> @@ -1791,7 +1790,7 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
>  	unsigned long old_opts = fs_info->mount_opt;
>  	unsigned long old_compress_type = fs_info->compress_type;
>  	u32 old_max_inline = fs_info->max_inline;
> -	int old_thread_pool_size = fs_info->thread_pool_size;
> +	u32 old_thread_pool_size = fs_info->thread_pool_size;
>  	unsigned int old_metadata_ratio = fs_info->metadata_ratio;
>  	int ret;
>
[PATCH 1/2] btrfs: Remove extra run_delayed_items call
Before commit 581227d0d2b8 ("Btrfs: remove the time check in btrfs_commit_transaction()") there was a loop in the transaction commit path that would go on until all existing writers ended their transaction handles. This loop tried to flush as much pending work as possible before entering the transaction critical section. However, the aforementioned commit removed the loop, yet kept calling this "extra" code for the sake of optimisation.

As it stands, the rules for running the delayed items are a bit of a mess. First we have the one call which this patch removes; then we have another one, acting as a "safety net" by catching any delayed items introduced after the previous call has finished and extwriters has become 0 (meaning no more TRANS_START/TRANS_ATTACH handles are possible, only joins). Then we have the delayed items running as part of creating a snapshot (once per pending snapshot). Finally, delayed items are run once more _after_ snapshots have been created. All in all this amounts to 3 + N calls (N = number of snapshots slated for creation).

I think this could really be reduced to 2 calls - 1 before create_pending_snapshots is called and 1 after it's finished. This suffices to ensure consistent fs state during snapshot creation and after they are created. I did a full xfstests run to ensure no consistency issues came up. Then I picked the tests which exhibited rather long runtimes and looked closer to see if it was a result of this commit - which it wasn't.
Finally I did an artificial test with fsstress, with only those operations that put stress on the delayed items code:

fsstress -z -f mkdir=25% -f rmdir=25% -f creat=25% -f unlink=25% -f attr_set=25% -p5 -n 25000

and also didn't observe any performance regressions.

Signed-off-by: Nikolay Borisov
---
 fs/btrfs/transaction.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 2cdf7be02f41..02bc1e6212e6 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -2066,10 +2066,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 	if (ret)
 		goto cleanup_transaction;
 
-	ret = btrfs_run_delayed_items(trans);
-	if (ret)
-		goto cleanup_transaction;
-
 	wait_event(cur_trans->writer_wait,
 		   extwriter_counter_read(cur_trans) == 0);
-- 
2.7.4
[PATCH 2/2] btrfs: Refactor running of delayed inode items during transaction commit
Currently, calls to btrfs_run_delayed_items have been scattered around the transaction commit code with no real design argument for when they should be executed. We have one call after transaction writers go to 0. Then we have the delayed items running as part of creating a snapshot (once per pending snapshot). Finally, delayed items are run once more _after_ snapshots have been created. All in all this amounts to 2 + N calls (N = number of snapshots slated for creation).

In reality we need to flush delayed items once before create_pending_snapshots is called, to ensure snapshots are consistent with inode data, and once after snapshots are created, so that inode items newly introduced during the snapshot creation process are correctly persisted on disk. This patch brings the total executions of run_delayed_items down to just 2. This survived multiple xfstests runs.

Signed-off-by: Nikolay Borisov
---
 fs/btrfs/transaction.c | 31 ++++++++++----------------------
 1 file changed, 10 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 02bc1e6212e6..b32d3136f36c 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1524,18 +1524,6 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
 	}
 	btrfs_release_path(path);
 
-	/*
-	 * pull in the delayed directory update
-	 * and the delayed inode item
-	 * otherwise we corrupt the FS during
-	 * snapshot
-	 */
-	ret = btrfs_run_delayed_items(trans);
-	if (ret) {	/* Transaction aborted */
-		btrfs_abort_transaction(trans, ret);
-		goto fail;
-	}
-
 	record_root_in_trans(trans, root, 0);
 	btrfs_set_root_last_snapshot(&root->root_item, trans->transid);
 	memcpy(new_root_item, &root->root_item, sizeof(*new_root_item));
@@ -2069,10 +2057,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 	wait_event(cur_trans->writer_wait,
 		   extwriter_counter_read(cur_trans) == 0);
 
-	/* some pending stuffs might be added after the previous flush. */
-	ret = btrfs_run_delayed_items(trans);
-	if (ret)
-		goto cleanup_transaction;
 
 	btrfs_wait_delalloc_flush(fs_info);
 
@@ -2095,6 +2079,16 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 		ret = cur_trans->aborted;
 		goto scrub_continue;
 	}
+
+	/*
+	 * Run delayed items because we need them to be consistent on-disk
+	 * so that snapshots created in create_pending_snapshots don't corrupt
+	 * the filesystem. At this point we are the one doing the transaction
+	 * commit and no one can come and introduce new delayed inode items.
+	 */
+	ret = btrfs_run_delayed_items(trans);
+	if (ret)
+		goto scrub_continue;
 	/*
 	 * the reloc mutex makes sure that we stop
 	 * the balancing code from coming in and moving
@@ -2102,11 +2096,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 	 */
 	mutex_lock(&fs_info->reloc_mutex);
 
-	/*
-	 * We needn't worry about the delayed items because we will
-	 * deal with them in create_pending_snapshot(), which is the
-	 * core function of the snapshot creation.
-	 */
 	ret = create_pending_snapshots(trans);
 	if (ret) {
 		mutex_unlock(&fs_info->reloc_mutex);
-- 
2.7.4
Re: btrfs-cleaner / snapshot performance analysis
On Mon, Feb 12, 2018 at 10:37 AM, Ellis H. Wilson III wrote:
> On 02/11/2018 01:24 PM, Hans van Kranenburg wrote:
>>
>> Why not just use `btrfs fi du ` now and then and
>> update your administration with the results? .. Instead of putting the
>> burden of keeping track of all administration during every tiny change
>> all day long?
>
> I will look into that if using built-in group capacity functionality proves
> to be truly untenable. Thanks!
>
>>> CoW is still valuable for us as we're shooting to support on the order
>>> of hundreds of snapshots per subvolume,
>>
>> Hundreds will get you into trouble even without qgroups.
>
> I should have been more specific. We are looking to use up to a few dozen
> snapshots per subvolume, but will have many (tens to hundreds of) discrete
> subvolumes (each with up to a few dozen snapshots) in a BTRFS filesystem.
> If I have it wrong and the scalability issues in BTRFS do not solely apply
> to subvolumes and their snapshot counts, please let me know.
>
> I will note you focused on my tiny desktop filesystem when making some of
> your previous comments -- this is why I didn't want to share specific
> details. Our filesystem will be RAID0 with six large HDDs (12TB each).
> Reliability concerns do not apply to our situation for technical reasons,
> but if there are capacity scaling issues with BTRFS I should be made aware
> of, I'd be glad to hear them. I have not seen any in technical
> documentation of such a limit, and experiments so far on 6x6TB arrays have
> not shown any performance problems, so I'm inclined to believe the only
> scaling issue exists with reflinks. Correct me if I'm wrong.
>
> Thanks,
>
> ellis

When testing btrfs on large volumes, especially with metadata heavy
operations, I'd suggest you match the node size of your mkfs.btrfs (-n)
with the stripe size used in creating your RAID array. Also, use the
ssd_spread mount option as discussed in a previous thread. It makes a big
difference on arrays. It allocates much more space for metadata, but it
greatly reduces fragmentation over time.
btrfs-progs: compile with musl libc
Ensure that limits.h is included in mkfs/main.c as PATH_MAX is used. The lack
of this currently causes btrfs-progs to fail to compile when the systems libc
is musl.

From 7859e0d01778e844ae0fcbefe55581277ce7cab3 Mon Sep 17 00:00:00 2001
From: Dan Robertson
Date: Tue, 13 Feb 2018 13:04:29 +
Subject: [PATCH] btrfs-progs: compile with musl libc

Ensure that limits.h is included in mkfs/main.c as PATH_MAX is used.

Signed-off-by: Dan Robertson
---
 mkfs/main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mkfs/main.c b/mkfs/main.c
index ea65c6d8..f44520aa 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -27,6 +27,7 @@
 /* #include included via androidcompat.h */
 #include
 #include
+#include <limits.h>
 #include
 #include
 #include
-- 
2.16.1
Re: [PATCH] btrfs: Verify extent allocated by find_free_extent() won't overlap with extents from previous transaction
On 13.02.2018 13:07, Qu Wenruo wrote:
> On 2018年02月13日 09:13, Qu Wenruo wrote:
>> There are reports in the mailing list that even with the latest mainline
>> kernel, btrfs can't survive a power loss.
>>
>> Unlike journal based filesystems, btrfs doesn't use a journal for such
>> work. (The log tree is an optimization for fsync, not to keep the fs
>> healthy.) In btrfs we use metadata CoW to ensure all tree blocks are as
>> atomic as the superblock.
>>
>> This leads to an obvious assumption: some code breaking such metadata
>> CoW makes btrfs no longer bullet-proof against power loss.
>>
>> This patch adds an extra runtime selftest to find_free_extent(), which
>> will check the range in the commit root of the extent tree to ensure
>> there is no overlap at all.
>>
>> And hopefully this could help us to catch the cause of the problem.
>>
>> Signed-off-by: Qu Wenruo
>> ---
>> Unfortunately, no new problem exposed yet.
>
> Interestingly, I hit a bug during looping xfstests, where
> btrfs_free_path() in check_extent_conflicts() caused an rcu error.
>
> This seems quite interesting and may be the clue to the problem.
>
> BTW, for anyone interested in this problem, the backtrace is:
>
> ODEBUG: activate active (active state 1) object type: rcu_head hint:
> (null)
> WARNING: CPU: 2 PID: 15435 at lib/debugobjects.c:291
> debug_print_object+0xc1/0xe0
> Modules linked in: dm_flakey ext4 mbcache jbd2 btrfs(O) xor
> zstd_decompress zstd_compress xxhash raid6_pq efivarfs xfs dm_mod
> virtio_scsi
> CPU: 2 PID: 15435 Comm: kworker/u16:12 Tainted: G O 4.15.0+ #32
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> Workqueue: writeback wb_workfn (flush-btrfs-367)
> RIP: 0010:debug_print_object+0xc1/0xe0
> RSP: 0018:88001819d1c0 EFLAGS: 00010086
> RAX: dc08 RBX: 0003 RCX: 0006
> RDX: 0007 RSI: 110003033a01 RDI: 88006abde6d0
> RBP: 0001 R08: R09: 0001
> R10: 88001819d0b0 R11: R12: 82472120
> R13: 8255cae0 R14: R15: 880054737268
> FS: () GS:88006aa0() knlGS:
> CS: 0010 DS: ES: CR0: 80050033
> CR2: 7f5e42c8d6c0 CR3: 608aa000 CR4: 003406e0
> Call Trace:
> debug_object_activate+0x309/0x360

Can you get a line number of that function?

So in release_extent_buffer you call_rcu, which calls debug_rcu_head_queue
in __call_rcu. What those debug checks are supposed to do (I haven't
looked too deeply into the code) seems to be to detect if we are doing a
double free, i.e. calling call_rcu twice for the same extent buffer.

Looking at debug_object_activate it seems there are 2 cases where we can
print the ODEBUG stuff: if the object is already in STATE_ACTIVE and if
the object has been freed, i.e. state ODEBUG_STATE_DESTROYED. However,
judging by the "activate active" I'd say we hit the ODEBUG_STATE_ACTIVE
case.

And since we don't have a ->fixup_activate function, then we should
return -EINVAL and the following WARN_ONCE should be triggered:

WARN_ONCE(1, "__call_rcu(): Double-freed CB %p->%pF()!!!\n",
	  head, head->func);

> __call_rcu.constprop.72+0x135/0xc50
> release_extent_buffer+0xf0/0x280 [btrfs]
> free_extent_buffer+0x16d/0x210 [btrfs]
> btrfs_release_path+0x2d/0xe0 [btrfs]
> btrfs_free_path.part.28+0xe/0x30 [btrfs]
> find_free_extent+0x1082/0x2440 [btrfs]
> btrfs_reserve_extent+0xda/0x1f0 [btrfs]
> cow_file_range.isra.67+0x223/0x5d0 [btrfs]
> run_delalloc_range+0x1cc/0x610 [btrfs]
> writepage_delalloc+0x1bb/0x2c0 [btrfs]
> __extent_writepage+0x422/0x630 [btrfs]
> extent_write_cache_pages.constprop.56+0x2f7/0x7b0 [btrfs]
> extent_writepages+0xce/0x110 [btrfs]
> do_writepages+0x8e/0xb0
> __writeback_single_inode+0x155/0xbf0
> writeback_sb_inodes+0x462/0x8e0
> wb_writeback+0x313/0x970
> wb_workfn+0x217/0xc80
> process_one_work+0x740/0xe40
> worker_thread+0x1c6/0xce0
> kthread+0x1ba/0x1e0
> ret_from_fork+0x24/0x30
>
> Thanks,
> Qu
>
>> ---
>> fs/btrfs/extent-tree.c | 118 +
>> 1 file changed, 118 insertions(+)
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 2f4328511ac8..3b3cd82bce3a 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -7458,6 +7458,114 @@ btrfs_release_block_group(struct btrfs_block_group_cache *cache,
>>  	btrfs_put_block_group(cache);
>>  }
>>
>> +#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
>> +/*
>> + * Verify if the given extent range [@start, @start + @len) conflicts with any
>> + * existing extent in the commit root.
>> + *
>> + * Btrfs doesn't use a journal, but depends on metadata (and data) CoW to keep
>> + * the whole filesystem consistent against power loss.
>> + * If we have overwritten any extent used by the previous trans (commit root),
>> + * and a power loss happens, we will corrupt our filesystem.
>> + *
Re: mount btrfs takes 30 minutes, btrfs check runs out of memory
On 2018年02月13日 20:26, Holger Hoffstätte wrote:
> On 02/13/18 12:40, Qu Wenruo wrote:
>>>> The problem is not about how much space it takes, but how many extents
>>>> are here in the filesystem.
>
> I have no idea why btrfs' mount even needs to touch all block groups to
> get going (which seems to be the root of the problem), but here's a
> not so crazy idea for more "mechanical sympathy". Feel free to mock
> me if this is terribly wrong or not possible. ;)
>
> Mounting of even large filesystems (with many extents) seems to be fine
> on SSDs, but not so fine on rotational storage. We've heard that from
> several people with large (multi-TB) filesystems, and obviously it's
> even more terrible on 5400RPM drives because their seeks are sooo sloow.
>
> If the problem is that the bgs are touched/iterated in "tree order",
> would it then not be possible to sort the block groups in physical order
> before trying to load whatever mount needs to load?

This is in fact a good idea: make block groups into their own tree.

But it will take a lot of work to do, since we are modifying the on-disk
format.

In that case, a leaf with the default leaf size (16K) can store 678 block
group items. And that many block groups can contain data between 169G
(256M metadata size) to 1.6T (10G for max data chunk size).

And even for tens of terabytes, a level-2 tree should handle it without
problem, and searching it should be quite fast.

The only problem is, I'm not sure if there will be enough developers
interested in this idea, and this idea may have extra problems hidden.

Thanks,
Qu

> That way the entire
> process would involve less seeking (no backward seeks for one) and the
> drive could very likely get more done during a rotation before stepping
> further.
>
> cheers,
> Holger
Re: mount btrfs takes 30 minutes, btrfs check runs out of memory
On Tue, Feb 13, 2018 at 4:46 AM, Qu Wenruo wrote:
> On 2018年02月13日 20:06, John Ettedgui wrote:
>>>> That's fairly straightforward to do, though it should be quite slow so
>>>> I'd hope not to have to do that too often.
>>>
>>> Then it could be tried on the most frequently updated files then.
>>
>> That's an interesting idea.
>> More than 3/4 of the data is just storage, so that should be very ok.
>
> BTW, how was the initial data created?
>
> If the initial data is all written once and doesn't get modified later,
> then the problem may not be fragments.

Mostly at once when I recreated the FS a few years ago, and then adding
on to it slowly. Though I do try to somewhat balance the free space on
all partitions of similar drives, so it may be a tad further from its
original condition.

>>> And since you don't use snapshots, locating such files and then
>>> "chattr +C" would make them nodatacow, reducing later fragments.
>>
>> I don't understand, why would that reduce later fragments?
>
> Later overwrites will not create new extents, but overwrite existing
> extents, rather than CoW causing new extents (fragments).
>
> Although an expanding write will still cause a new extent, that's
> unavoidable anyway.

That's why I didn't understand. Fair enough!

Thank you!

John
Re: mount btrfs takes 30 minutes, btrfs check runs out of memory
On 2018年02月13日 20:06, John Ettedgui wrote:
>>> That's fairly straightforward to do, though it should be quite slow so
>>> I'd hope not to have to do that too often.
>>
>> Then it could be tried on the most frequently updated files then.
>
> That's an interesting idea.
> More than 3/4 of the data is just storage, so that should be very ok.

BTW, how was the initial data created?

If the initial data is all written once and doesn't get modified later,
then the problem may not be fragments.

>> And since you don't use snapshots, locating such files and then
>> "chattr +C" would make them nodatacow, reducing later fragments.
>
> I don't understand, why would that reduce later fragments?

Later overwrites will not create new extents, but overwrite existing
extents, rather than CoW causing new extents (fragments).

Although an expanding write will still cause a new extent, that's
unavoidable anyway.

Thanks,
Qu

>> This should act much better than traditional defrag, although it's
>> time-consuming and makes snapshots completely meaningless.
>> (and since you're already hitting ENOSPC, I don't think the idea is
>> really working for you)
>>
>> And since you're already hitting ENOSPC, either it's caused by
>> unbalanced meta/data usage, or it's really going to hit the limit; I
>> would recommend to enlarge the fs or delete some files to see if it
>> helps.
>
> Yup, I usually either slowly ramp up the {d,m}usage to pass it, or
> when that does not work I free some space, then balance will finish.
> Or did you mean to free some space to see about mount speed?

Kind of, just do such freeing in advance, and try to make btrfs always
have unallocated space in case.

> I actually have very little free space on those partitions, usually
> under 90Gb, maybe that's part of my problem.

And finally, use the latest kernel if possible.
IIRC old kernels don't have empty block group auto remove, which makes
the user need to manually balance to free some space.

Thanks,
Qu

> I am on 4.15 so no problem there.
>
> So manual defrag and new FS to try.
>
> Thank you!
Re: mount btrfs takes 30 minutes, btrfs check runs out of memory
On 02/13/18 12:40, Qu Wenruo wrote:
>>> The problem is not about how much space it takes, but how many extents
>>> are here in the filesystem.

I have no idea why btrfs' mount even needs to touch all block groups to
get going (which seems to be the root of the problem), but here's a
not so crazy idea for more "mechanical sympathy". Feel free to mock
me if this is terribly wrong or not possible. ;)

Mounting of even large filesystems (with many extents) seems to be fine
on SSDs, but not so fine on rotational storage. We've heard that from
several people with large (multi-TB) filesystems, and obviously it's
even more terrible on 5400RPM drives because their seeks are sooo sloow.

If the problem is that the bgs are touched/iterated in "tree order",
would it then not be possible to sort the block groups in physical order
before trying to load whatever mount needs to load? That way the entire
process would involve less seeking (no backward seeks for one) and the
drive could very likely get more done during a rotation before stepping
further.

cheers,
Holger
Re: mount btrfs takes 30 minutes, btrfs check runs out of memory
On Tue, Feb 13, 2018 at 3:40 AM, Qu Wenruo wrote:
> On 2018年02月13日 19:25, John Ettedgui wrote:
>> On Tue, Feb 13, 2018 at 3:04 AM, Qu Wenruo wrote:
>>>
>>> The problem is not about how much space it takes, but how many extents
>>> are here in the filesystem.
>>>
>>> For a new fs filled with normal data, I'm pretty sure data extents will
>>> be as large as their maximum size, causing very little or even no
>>> pressure to block group search.
>>>
>> What do you mean by "new fs",
>
> I mean the 4TB partition on that 5400rpm HDD.
>
>> was there any change that would improve
>> the behavior if I were to recreate the FS?
>
> If you backed up your fs, and recreated a new, empty btrfs on your
> original SSD, then copied all data back, I believe it would be much
> faster to mount.

Alright, I'll have to wait on getting some more drives for that but I
look forward to trying that.

>> Last time we talked I believe max extent was 128M for non-compressed
>> files, so maybe there's been some good change.
>
> My fault, 128M is correct.
>
>>> And since I went to SUSE, some mail/info was lost during the procedure.
>> I still have all mails, if you want it. No dump left though.
>>>
>>> Despite that, I have several more assumptions about this problem:
>>>
>>> 1) Metadata usage bumped by inline files
>> What are inline files? Should I view this as inline in C, in that the
>> small files are stored in the tree directly?
>
> Exactly.
>
>>>    If there are a lot of small files (<2K as default),
>> Of the slow to mount partitions:
>> 2 partitions have less than a dozen files smaller than 2K.
>> 1 has about 5 thousand and the last one 15 thousand.
>> Are the latter considered a lot?
>
> If using the default 16K nodesize, 8 small files take one leaf.
> And 15K small files means about 2K tree extents.
>
> Not that much in my opinion, can't even fill half of a metadata chunk.
>
>>> and your metadata
>>>    usage is quite high (generally speaking, the meta:data ratio should
>>>    be way below 1:8), that may be the cause.
>>>
>> The ratio is about 1:900 on average so that should be ok I guess.
>
> Yep, that should be fine.
> So metadata is not to blame.
>
> Then purely fragmented data extents.
>
>>>    If so, try mounting the fs with "max_inline=0" and then
>>>    try to rewrite such small files.
>>>
>> Should I try that?
>
> No need, it won't cause much difference.

Alright!

>>> 2) SSD write amplification along with dynamic remapping
>>>    To be honest, I'm not really buying this idea, since mount doesn't
>>>    have anything related to write.
>>>    But running fstrim won't harm anyway.
>>>
>> Oh I am not complaining about slow SSDs mounting. I was just amazed
>> that a partition on a slow HDD mounted faster.
>> Without any specific work, my SSD partitions tend to mount in around 1
>> second or so.
>> Of course I'd be happy to worry about them once all the partitions on
>> HDDs mount in a handful of ms :)
>>
>>> 3) Rewrite the existing files (extreme defrag)
>>>    In fact, defrag doesn't work well if there are subvolumes/snapshots
>> I have no subvolume or snapshot so that's not a problem.
>>>    /reflinks involved.
>>>    The most stupid and mindless way is to write a small script, find
>>>    all regular files, read them out and rewrite them back.
>>>
>> That's fairly straightforward to do, though it should be quite slow so
>> I'd hope not to have to do that too often.
>
> Then it could be tried on the most frequently updated files then.

That's an interesting idea.
More than 3/4 of the data is just storage, so that should be very ok.

> And since you don't use snapshots, locating such files and then
> "chattr +C" would make them nodatacow, reducing later fragments.

I don't understand, why would that reduce later fragments?

>>>    This should act much better than traditional defrag, although it's
>>>    time-consuming and makes snapshots completely meaningless.
>>>    (and since you're already hitting ENOSPC, I don't think the idea is
>>>    really working for you)
>>>
>>> And since you're already hitting ENOSPC, either it's caused by
>>> unbalanced meta/data usage, or it's really going to hit the limit; I
>>> would recommend to enlarge the fs or delete some files to see if it
>>> helps.
>>>
>> Yup, I usually either slowly ramp up the {d,m}usage to pass it, or
>> when that does not work I free some space, then balance will finish.
>> Or did you mean to free some space to see about mount speed?
>
> Kind of, just do such freeing in advance, and try to make btrfs always
> have unallocated space in case.

I actually have very little free space on those partitions, usually
under 90Gb, maybe that's part of my problem.

> And finally, use the latest kernel if possible.
> IIRC old kernels don't have empty block group auto remove, which makes
> the user need to manually balance to free some space.
>
> Thanks,
> Qu

I am on 4.15 so no problem there.

So manual defrag and new FS to try.

Thank you!
Re: mount btrfs takes 30 minutes, btrfs check runs out of memory
On 2018年02月13日 19:25, John Ettedgui wrote:
> On Tue, Feb 13, 2018 at 3:04 AM, Qu Wenruo wrote:
>>
>> On 2018年02月13日 18:21, John Ettedgui wrote:
>>> Hello there,
>>>
>>> have you found anything good since then?
>>
>> Unfortunately, not really much to speed it up.
>
> Oh :/
>
>> This reminds me of the old (and crazy) idea to skip block group build
>> for RO mounts.
>> But not really helpful for it.
>>
>>> With a default system, the behavior is pretty much still the same,
>>> though I have not recreated the partitions since.
>>>
>>> Defrag helps, but I think balance helps even more.
>>> clear_cache may help too, but I'm not really sure as I've not tried
>>> it on its own.
>>> I was actually able to get a 4TB partition on a 5400rpm HDD to mount
>>> in around 500ms, quite faster than even some GB partitions I have on
>>> SSDs! Alas I wrote some files to it and it's taking over a second
>>> again, so no more magic there.
>>
>> The problem is not about how much space it takes, but how many
>> extents are in the filesystem.
>>
>> For a new fs filled with normal data, I'm pretty sure data extents
>> will be as large as their maximum size (256M), causing very little or
>> even no pressure on the block group search.
>
> What do you mean by "new fs",

I mean the 4TB partition on that 5400rpm HDD.

> was there any change that would improve
> the behavior if I were to recreate the FS?

If you backed up your fs, recreated a new, empty btrfs on your original
SSD, then copied all the data back, I believe it would be much faster
to mount.

> Last time we talked I believe max extent was 128M for non-compressed
> files, so maybe there's been some good change.

My fault, 128M is correct.

>>> The workarounds do work, so it's still not a major issue, but
>>> they're slow and sometimes I have to work around the "no space left
>>> on device" which then takes even more time.
>>
>> And since I went to SUSE, some mail/info was lost during the
>> procedure.
>
> I still have all mails, if you want them.
> No dump left though.

>> Despite that, I have several more assumptions about this problem:
>>
>> 1) Metadata usage bumped by inline files
>
> What are inline files? Should I view this as inline in C, in that the
> small files are stored in the tree directly?

Exactly.

>>    If there are a lot of small files (<2K as default),
>
> Of the slow to mount partitions:
> 2 partitions have less than a dozen files smaller than 2K.
> 1 has about 5 thousand and the last one 15 thousand.
> Are the latter considered a lot?

If using the default 16K nodesize, 8 small files take one leaf.
And 15K small files means about 2K tree extents.
Not that much in my opinion, it can't even fill half of a metadata
chunk.

>>    and your metadata usage is quite high (generally speaking, the
>>    meta:data ratio should be way below 1:8), that may be the cause.
>
> The ratio is about 1:900 on average so that should be ok I guess.

Yep, that should be fine.
So not metadata to blame.

Then purely fragmented data extents.

>>    If so, try mounting the fs with the "max_inline=0" mount option
>>    and then try to rewrite such small files.
>
> Should I try that?

No need, it won't cause much difference.

>> 2) SSD write amplification along with dynamic remapping
>>    To be honest, I'm not really buying this idea, since mount doesn't
>>    have anything related to write.
>>    But running fstrim won't harm anyway.
>
> Oh I am not complaining about slow SSDs mounting. I was just amazed
> that a partition on a slow HDD mounted faster.
> Without any specific work, my SSD partitions tend to mount in around
> 1 sec or so.
> Of course I'd be happy to worry about them once all the partitions on
> HDDs mount in a handful of ms :)
>
>> 3) Rewrite the existing files (extreme defrag)
>>    In fact, defrag doesn't work well if there are subvolumes/snapshots
>>    /reflink involved.
>
> I have no subvolume or snapshot so that's not a problem.
>
>>    The most stupid and mindless way is to write a small script, find
>>    all regular files, read them out and rewrite them back.
>
> That's fairly straightforward to do, though it should be quite slow so
> I'd hope not to have to do that too often.

Then it could be tried on the most frequently updated files.

And since you don't use snapshots, locating such files and then
"chattr +C" would make them nodatacow, reducing later fragments.

>>    This should act much better than traditional defrag, although it's
>>    time-consuming and makes snapshots completely meaningless.
>>    (and since you're already hitting ENOSPC, I don't think the idea
>>    is really working for you)
>>
>> And since you're already hitting ENOSPC, either it's caused by
>> unbalanced meta/data usage, or it's really going to hit the limit; I
>> would recommend to enlarge the fs or delete some files to see if it
>> helps.
>
> Yup, I usually either slowly ramp up the {d,m}usage to pass it, or
> when that does not work I free some space, then balance will
Re: mount btrfs takes 30 minutes, btrfs check runs out of memory
On Tue, Feb 13, 2018 at 3:04 AM, Qu Wenruo wrote:
>
> On 2018年02月13日 18:21, John Ettedgui wrote:
>> Hello there,
>>
>> have you found anything good since then?
>
> Unfortunately, not really much to speed it up.

Oh :/

> This reminds me of the old (and crazy) idea to skip block group build
> for RO mounts.
> But not really helpful for it.
>
>> With a default system, the behavior is pretty much still the same,
>> though I have not recreated the partitions since.
>>
>> Defrag helps, but I think balance helps even more.
>> clear_cache may help too, but I'm not really sure as I've not tried
>> it on its own.
>> I was actually able to get a 4TB partition on a 5400rpm HDD to mount
>> in around 500ms, quite faster than even some GB partitions I have on
>> SSDs! Alas I wrote some files to it and it's taking over a second
>> again, so no more magic there.
>
> The problem is not about how much space it takes, but how many extents
> are in the filesystem.
>
> For a new fs filled with normal data, I'm pretty sure data extents
> will be as large as their maximum size (256M), causing very little or
> even no pressure on the block group search.

What do you mean by "new fs", was there any change that would improve
the behavior if I were to recreate the FS?
Last time we talked I believe the max extent was 128M for
non-compressed files, so maybe there's been some good change.

>> The workarounds do work, so it's still not a major issue, but they're
>> slow and sometimes I have to work around the "no space left on
>> device" which then takes even more time.
>
> And since I went to SUSE, some mail/info was lost during the
> procedure.

I still have all mails, if you want them. No dump left though.

> Despite that, I have several more assumptions about this problem:
>
> 1) Metadata usage bumped by inline files

What are inline files? Should I view this as inline in C, in that the
small files are stored in the tree directly?

>    If there are a lot of small files (<2K as default),

Of the slow to mount partitions:
2 partitions have less than a dozen files smaller than 2K.
1 has about 5 thousand and the last one 15 thousand.
Are the latter considered a lot?

>    and your metadata usage is quite high (generally speaking, the
>    meta:data ratio should be way below 1:8), that may be the cause.

The ratio is about 1:900 on average so that should be ok I guess.

>    If so, try mounting the fs with the "max_inline=0" mount option and
>    then try to rewrite such small files.

Should I try that?

> 2) SSD write amplification along with dynamic remapping
>    To be honest, I'm not really buying this idea, since mount doesn't
>    have anything related to write.
>    But running fstrim won't harm anyway.

Oh I am not complaining about slow SSDs mounting. I was just amazed
that a partition on a slow HDD mounted faster.
Without any specific work, my SSD partitions tend to mount in around
1 sec or so.
Of course I'd be happy to worry about them once all the partitions on
HDDs mount in a handful of ms :)

> 3) Rewrite the existing files (extreme defrag)
>    In fact, defrag doesn't work well if there are subvolumes/snapshots
>    /reflink involved.

I have no subvolume or snapshot so that's not a problem.

>    The most stupid and mindless way is to write a small script, find
>    all regular files, read them out and rewrite them back.

That's fairly straightforward to do, though it should be quite slow so
I'd hope not to have to do that too often.

>    This should act much better than traditional defrag, although it's
>    time-consuming and makes snapshots completely meaningless.
>    (and since you're already hitting ENOSPC, I don't think the idea is
>    really working for you)
>
> And since you're already hitting ENOSPC, either it's caused by
> unbalanced meta/data usage, or it's really going to hit the limit; I
> would recommend to enlarge the fs or delete some files to see if it
> helps.

Yup, I usually either slowly ramp up the {d,m}usage to pass it, or when
that does not work I free some space, then balance will finish.
Or did you mean to free some space to see about mount speed?

> Thanks,
> Qu

Thank you for the quick reply!
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
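The "{d,m}usage" ramp John describes is easy to script; a minimal sketch (the mount point and the usage steps are illustrative assumptions, and the `DRYRUN=echo` prefix makes it only print the balance commands instead of running them):

```shell
#!/bin/sh
# Ramp btrfs balance usage filters up step by step: cheap passes over
# mostly-empty block groups first, so relocation is less likely to hit
# ENOSPC.  /mnt and the step values are made-up examples.
MNT=/mnt
DRYRUN=echo    # delete "echo" here to really run the balance

ramp_balance() {
    for usage in 5 10 20 40 60 80; do
        # Only relocate block groups that are at most $usage% used.
        $DRYRUN btrfs balance start -dusage="$usage" -musage="$usage" "$MNT" \
            || return 1
    done
}

ramp_balance
```

With `DRYRUN=echo` the loop prints the six commands it would run; once a low-usage pass frees some chunks, the later, more expensive passes have room to work.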
Re: [PATCH] btrfs: Verify extent allocated by find_free_extent() won't overlap with extents from previous transaction
On 2018年02月13日 09:13, Qu Wenruo wrote:
> There are reports in the mailing list that even with the latest
> mainline kernel, btrfs can't survive a power loss.
>
> Unlike journal-based filesystems, btrfs doesn't use a journal for such
> work (the log tree is an optimization for fsync, not a way to keep the
> fs healthy).
> In btrfs we use metadata CoW to ensure all tree blocks are as atomic
> as the superblock.
>
> This leads to an obvious assumption: some code breaking such metadata
> CoW makes btrfs no longer bullet-proof against power loss.
>
> This patch adds an extra runtime selftest to find_free_extent(), which
> will check the range in the commit root of the extent tree to ensure
> there is no overlap at all.
>
> And hopefully this could help us to catch the cause of the problem.
>
> Signed-off-by: Qu Wenruo
> ---
> Unfortunately, no new problem exposed yet.

Interestingly, I hit a bug during looping xfstests, where
btrfs_free_path() in check_extent_conflicts() caused an rcu error.

This seems quite interesting and may be the clue to the problem.
BTW, for anyone interested in this problem, the backtrace is:

ODEBUG: activate active (active state 1) object type: rcu_head hint: (null)
WARNING: CPU: 2 PID: 15435 at lib/debugobjects.c:291 debug_print_object+0xc1/0xe0
Modules linked in: dm_flakey ext4 mbcache jbd2 btrfs(O) xor zstd_decompress zstd_compress xxhash raid6_pq efivarfs xfs dm_mod virtio_scsi
CPU: 2 PID: 15435 Comm: kworker/u16:12 Tainted: G O 4.15.0+ #32
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Workqueue: writeback wb_workfn (flush-btrfs-367)
RIP: 0010:debug_print_object+0xc1/0xe0
RSP: 0018:88001819d1c0 EFLAGS: 00010086
RAX: dc08 RBX: 0003 RCX: 0006
RDX: 0007 RSI: 110003033a01 RDI: 88006abde6d0
RBP: 0001 R08: R09: 0001
R10: 88001819d0b0 R11: R12: 82472120
R13: 8255cae0 R14: R15: 880054737268
FS: () GS:88006aa0() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 7f5e42c8d6c0 CR3: 608aa000 CR4: 003406e0
Call Trace:
 debug_object_activate+0x309/0x360
 __call_rcu.constprop.72+0x135/0xc50
 release_extent_buffer+0xf0/0x280 [btrfs]
 free_extent_buffer+0x16d/0x210 [btrfs]
 btrfs_release_path+0x2d/0xe0 [btrfs]
 btrfs_free_path.part.28+0xe/0x30 [btrfs]
 find_free_extent+0x1082/0x2440 [btrfs]
 btrfs_reserve_extent+0xda/0x1f0 [btrfs]
 cow_file_range.isra.67+0x223/0x5d0 [btrfs]
 run_delalloc_range+0x1cc/0x610 [btrfs]
 writepage_delalloc+0x1bb/0x2c0 [btrfs]
 __extent_writepage+0x422/0x630 [btrfs]
 extent_write_cache_pages.constprop.56+0x2f7/0x7b0 [btrfs]
 extent_writepages+0xce/0x110 [btrfs]
 do_writepages+0x8e/0xb0
 __writeback_single_inode+0x155/0xbf0
 writeback_sb_inodes+0x462/0x8e0
 wb_writeback+0x313/0x970
 wb_workfn+0x217/0xc80
 process_one_work+0x740/0xe40
 worker_thread+0x1c6/0xce0
 kthread+0x1ba/0x1e0
 ret_from_fork+0x24/0x30

Thanks,
Qu

> ---
>  fs/btrfs/extent-tree.c | 118 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 118 insertions(+)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 2f4328511ac8..3b3cd82bce3a 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -7458,6 +7458,114 @@ btrfs_release_block_group(struct btrfs_block_group_cache *cache,
>  	btrfs_put_block_group(cache);
>  }
>
> +#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
> +/*
> + * Verify if the given extent range [@start, @start + @len) conflicts
> + * with any existing extent in the commit root.
> + *
> + * Btrfs doesn't use a journal, but depends on metadata (and data) CoW
> + * to keep the whole filesystem consistent against power loss.
> + * If we have overwritten any extent used by the previous trans (commit
> + * root), and a power loss happens, we will corrupt our filesystem.
> + *
> + * Return 0 if nothing is wrong.
> + * Return <0 (including ENOMEM) means we have something wrong.
> + * Except ENOMEM, this normally means we have extent conflicts with the
> + * previous transaction.
> + */
> +static int check_extent_conflicts(struct btrfs_fs_info *fs_info,
> +				  u64 start, u64 len)
> +{
> +	struct btrfs_key key;
> +	struct btrfs_path *path;
> +	struct btrfs_root *extent_root = fs_info->extent_root;
> +	u64 extent_start;
> +	u64 extent_len;
> +	int ret;
> +
> +	path = btrfs_alloc_path();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	key.objectid = start + len;
> +	key.type = 0;
> +	key.offset = 0;
> +
> +	/*
> +	 * Here we use the extent commit root to search for any conflicts.
> +	 * If the extent commit tree is already overwritten, we will get a
> +	 * transid error and error out anyway.
> +	 * If the extent commit tree is OK, but another extent conflicts,
> +	 * we will find it.
> +	 * So anyway, such search should be OK to
Re: mount btrfs takes 30 minutes, btrfs check runs out of memory
On 2018年02月13日 18:21, John Ettedgui wrote:
> On Thu, Jul 21, 2016 at 1:19 AM, Qu Wenruo wrote:
>>
>> No more.
>>
>> The dump is already good enough for me to dig into for some time.
>>
>> We don't usually get such a large extent tree dump from a real world
>> use case.
>>
>> It would help us in several ways, from determining how fragmented a
>> block group is to determining if a defrag will help.
>>
>> Thanks,
>> Qu
>
> Hello there,
>
> have you found anything good since then?

Unfortunately, not really much to speed it up.

This reminds me of the old (and crazy) idea to skip block group build
for RO mounts.
But not really helpful for it.

> With a default system, the behavior is pretty much still the same,
> though I have not recreated the partitions since.
>
> Defrag helps, but I think balance helps even more.
> clear_cache may help too, but I'm not really sure as I've not tried it
> on its own.
> I was actually able to get a 4TB partition on a 5400rpm HDD to mount
> in around 500ms, quite faster than even some GB partitions I have on
> SSDs! Alas I wrote some files to it and it's taking over a second
> again, so no more magic there.

The problem is not about how much space it takes, but how many extents
are in the filesystem.

For a new fs filled with normal data, I'm pretty sure data extents will
be as large as their maximum size (256M), causing very little or even
no pressure on the block group search.

> The workarounds do work, so it's still not a major issue, but they're
> slow and sometimes I have to work around the "no space left on device"
> which then takes even more time.

And since I went to SUSE, some mail/info was lost during the procedure.

Despite that, I have several more assumptions about this problem:

1) Metadata usage bumped by inline files
   If there are a lot of small files (<2K as default), and your
   metadata usage is quite high (generally speaking, the meta:data
   ratio should be way below 1:8), that may be the cause.

   If so, try mounting the fs with the "max_inline=0" mount option and
   then try to rewrite such small files.

2) SSD write amplification along with dynamic remapping
   To be honest, I'm not really buying this idea, since mount doesn't
   have anything related to write.
   But running fstrim won't harm anyway.

3) Rewrite the existing files (extreme defrag)
   In fact, defrag doesn't work well if there are subvolumes/snapshots
   /reflink involved.
   The most stupid and mindless way is to write a small script, find
   all regular files, read them out and rewrite them back.

   This should act much better than traditional defrag, although it's
   time-consuming and makes snapshots completely meaningless.
   (and since you're already hitting ENOSPC, I don't think the idea is
   really working for you)

And since you're already hitting ENOSPC, either it's caused by
unbalanced meta/data usage, or it's really going to hit the limit; I
would recommend to enlarge the fs or delete some files to see if it
helps.

Thanks,
Qu

> Thank you!
> John
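The "small script" from point 3) can be sketched as follows; the target directory is a made-up example, and this is a rough illustration of the rewrite idea rather than a recommended tool:

```shell
#!/bin/sh
# Rewrite every regular file under a directory so btrfs allocates fresh
# (ideally contiguous) extents for it.
rewrite_tree() {
    find "$1" -type f | while IFS= read -r f; do
        tmp="$f.rewrite.$$"
        # Copy the data out, then rename the copy back over the original,
        # so an interrupted run never leaves a half-written file behind.
        cp --preserve=all "$f" "$tmp" && mv "$tmp" "$f"
    done
}

if [ -d /mnt/data ]; then   # hypothetical path
    rewrite_tree /mnt/data
fi
```

As the thread notes, this breaks reflinks and makes existing snapshots meaningless; it should also be run on a quiesced directory, and the read loop does not handle filenames containing newlines.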
Re: [PATCH v2] btrfs: fix endianness compatibility during the SB RW
On 2018年02月13日 18:27, Anand Jain wrote:
>
> On 02/13/2018 05:01 PM, Qu Wenruo wrote:
>>
>> On 2018年02月13日 11:00, Anand Jain wrote:
>>> Fixes the endianness bug in fs_info::super_copy by using the
>>> btrfs_set_super...() functions to set values in the SB, as these
>>> functions manage the endianness compatibility nicely.
>>>
>>> Signed-off-by: Anand Jain
>>
>> Also went through all btrfs_super_block SETGET functions, grepping
>> using \>, and it seems that there are still some left here:
>>
>> fs/btrfs/sysfs.c:
>> In both btrfs_sectorsize_show() and btrfs_clone_alignment_show():
>> 	return snprintf(buf, PAGE_SIZE, "%u\n",
>> 			fs_info->super_copy->sectorsize);
>>
>> In btrfs_nodesize_show():
>> 	return snprintf(buf, PAGE_SIZE, "%u\n",
>> 			fs_info->super_copy->nodesize);
>
> Oh. Thanks. Will fix. Maybe it's a good idea to add the sysfs fixes
> in a new patch.
>
>> And what about cc'ing this to the stable kernel?
>> IIRC it's a very critical problem for btrfs.
>>
>> Maybe cc: sta...@vger.kernel.org # v3.2+?
>
> Thanks for the suggestion. Will do. Any idea what to do if the patch
> applied on mainline ends up conflicting on LTS? Should I write a
> separate patch for stable?

In fact I'm not sure about this either.

But considering most of the offending code hasn't changed since 2009,
it shouldn't cause much conflict IIRC.

Thanks,
Qu

> Thanks, Anand
>
>> Thanks,
>> Qu
>>
>>> ---
>>> v1->v2: Update change log. Update $Subject.
>>> Old:
>>> [PATCH] btrfs: use set functions to update latest refs to the SB
>>>
>>>  fs/btrfs/transaction.c | 20 ++++++++++++--------
>>>  1 file changed, 12 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>>> index 04f07144b45c..9220f004001c 100644
>>> --- a/fs/btrfs/transaction.c
>>> +++ b/fs/btrfs/transaction.c
>>> @@ -1722,19 +1722,23 @@ static void update_super_roots(struct btrfs_fs_info *fs_info)
>>>  	super = fs_info->super_copy;
>>>
>>> +	/* update latest btrfs_super_block::chunk_root refs */
>>>  	root_item = &fs_info->chunk_root->root_item;
>>> -	super->chunk_root = root_item->bytenr;
>>> -	super->chunk_root_generation = root_item->generation;
>>> -	super->chunk_root_level = root_item->level;
>>> +	btrfs_set_super_chunk_root(super, root_item->bytenr);
>>> +	btrfs_set_super_chunk_root_generation(super, root_item->generation);
>>> +	btrfs_set_super_chunk_root_level(super, root_item->level);
>>>
>>> +	/* update latest btrfs_super_block::root refs */
>>>  	root_item = &fs_info->tree_root->root_item;
>>> -	super->root = root_item->bytenr;
>>> -	super->generation = root_item->generation;
>>> -	super->root_level = root_item->level;
>>> +	btrfs_set_super_root(super, root_item->bytenr);
>>> +	btrfs_set_super_generation(super, root_item->generation);
>>> +	btrfs_set_super_root_level(super, root_item->level);
>>> +
>>>  	if (btrfs_test_opt(fs_info, SPACE_CACHE))
>>> -		super->cache_generation = root_item->generation;
>>> +		btrfs_set_super_cache_generation(super, root_item->generation);
>>>  	if (test_bit(BTRFS_FS_UPDATE_UUID_TREE_GEN, &fs_info->flags))
>>> -		super->uuid_tree_generation = root_item->generation;
>>> +		btrfs_set_super_uuid_tree_generation(super,
>>> +						     root_item->generation);
>>>  }
>>>
>>>  int btrfs_transaction_in_commit(struct btrfs_fs_info *info)
Re: [PATCH v2] btrfs: fix endianness compatibility during the SB RW
On 02/13/2018 05:01 PM, Qu Wenruo wrote:
> On 2018年02月13日 11:00, Anand Jain wrote:
>> Fixes the endianness bug in fs_info::super_copy by using the
>> btrfs_set_super...() functions to set values in the SB, as these
>> functions manage the endianness compatibility nicely.
>>
>> Signed-off-by: Anand Jain
>
> Also went through all btrfs_super_block SETGET functions, grepping
> using \>, and it seems that there are still some left here:
>
> fs/btrfs/sysfs.c:
> In both btrfs_sectorsize_show() and btrfs_clone_alignment_show():
> 	return snprintf(buf, PAGE_SIZE, "%u\n",
> 			fs_info->super_copy->sectorsize);
>
> In btrfs_nodesize_show():
> 	return snprintf(buf, PAGE_SIZE, "%u\n",
> 			fs_info->super_copy->nodesize);

Oh. Thanks. Will fix. Maybe it's a good idea to add the sysfs fixes
in a new patch.

> And what about cc'ing this to the stable kernel?
> IIRC it's a very critical problem for btrfs.
>
> Maybe cc: sta...@vger.kernel.org # v3.2+?

Thanks for the suggestion. Will do. Any idea what to do if the patch
applied on mainline ends up conflicting on LTS? Should I write a
separate patch for stable?

Thanks, Anand

> Thanks,
> Qu

>> ---
>> v1->v2: Update change log. Update $Subject.
>> Old:
>> [PATCH] btrfs: use set functions to update latest refs to the SB
>>
>>  fs/btrfs/transaction.c | 20 ++++++++++++--------
>>  1 file changed, 12 insertions(+), 8 deletions(-)
>>
>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>> index 04f07144b45c..9220f004001c 100644
>> --- a/fs/btrfs/transaction.c
>> +++ b/fs/btrfs/transaction.c
>> @@ -1722,19 +1722,23 @@ static void update_super_roots(struct btrfs_fs_info *fs_info)
>>  	super = fs_info->super_copy;
>>
>> +	/* update latest btrfs_super_block::chunk_root refs */
>>  	root_item = &fs_info->chunk_root->root_item;
>> -	super->chunk_root = root_item->bytenr;
>> -	super->chunk_root_generation = root_item->generation;
>> -	super->chunk_root_level = root_item->level;
>> +	btrfs_set_super_chunk_root(super, root_item->bytenr);
>> +	btrfs_set_super_chunk_root_generation(super, root_item->generation);
>> +	btrfs_set_super_chunk_root_level(super, root_item->level);
>>
>> +	/* update latest btrfs_super_block::root refs */
>>  	root_item = &fs_info->tree_root->root_item;
>> -	super->root = root_item->bytenr;
>> -	super->generation = root_item->generation;
>> -	super->root_level = root_item->level;
>> +	btrfs_set_super_root(super, root_item->bytenr);
>> +	btrfs_set_super_generation(super, root_item->generation);
>> +	btrfs_set_super_root_level(super, root_item->level);
>> +
>>  	if (btrfs_test_opt(fs_info, SPACE_CACHE))
>> -		super->cache_generation = root_item->generation;
>> +		btrfs_set_super_cache_generation(super, root_item->generation);
>>  	if (test_bit(BTRFS_FS_UPDATE_UUID_TREE_GEN, &fs_info->flags))
>> -		super->uuid_tree_generation = root_item->generation;
>> +		btrfs_set_super_uuid_tree_generation(super,
>> +						     root_item->generation);
>>  }
>>
>>  int btrfs_transaction_in_commit(struct btrfs_fs_info *info)
Re: mount btrfs takes 30 minutes, btrfs check runs out of memory
On Thu, Jul 21, 2016 at 1:19 AM, Qu Wenruo wrote:
>
> No more.
>
> The dump is already good enough for me to dig into for some time.
>
> We don't usually get such a large extent tree dump from a real world
> use case.
>
> It would help us in several ways, from determining how fragmented a
> block group is to determining if a defrag will help.
>
> Thanks,
> Qu

Hello there,

have you found anything good since then?

With a default system, the behavior is pretty much still the same,
though I have not recreated the partitions since.

Defrag helps, but I think balance helps even more.
clear_cache may help too, but I'm not really sure as I've not tried it
on its own.
I was actually able to get a 4TB partition on a 5400rpm HDD to mount in
around 500ms, quite faster than even some GB partitions I have on SSDs!
Alas I wrote some files to it and it's taking over a second again, so
no more magic there.

The workarounds do work, so it's still not a major issue, but they're
slow and sometimes I have to work around the "no space left on device"
which then takes even more time.

Thank you!
John
Re: [PATCH v2] btrfs: log, when replace, is canceled by the user
On 2018年02月13日 17:08, Nikolay Borisov wrote:
>
> On 13.02.2018 11:05, Qu Wenruo wrote:
>>
>> On 2018年02月13日 11:53, Anand Jain wrote:
>>> For forensic investigations of issues, we would want to know
>>> if and when the user cancels the replace.
>>>
>>> Signed-off-by: Anand Jain
>>> ---
>>> v1->v2: use btrfs_dev_name() instead of rcu_str_deref(),
>>> as btrfs_dev_name() also provides the "missing" string,
>>> if needed.
>>
>> Would you please provide the branch you're using?
>>
>> As I couldn't find the function or macro btrfs_dev_name().
>
> I have it in my misc-next checkout. It got introduced by
> 3c958bd23b60 ("btrfs: add helper for device path or missing"),
> which is in 4.16-rc1, so pretty recent.

I was on v4.15, no wonder there is no such function.

Thank you very much,
Qu

>> Thanks,
>> Qu
>>
>>>  fs/btrfs/dev-replace.c | 7 +++++++
>>>  1 file changed, 7 insertions(+)
>>>
>>> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
>>> index dd717e204b5e..3b0760f7ec8a 100644
>>> --- a/fs/btrfs/dev-replace.c
>>> +++ b/fs/btrfs/dev-replace.c
>>> @@ -703,6 +703,7 @@ int btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info)
>>>  {
>>>  	struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
>>>  	struct btrfs_device *tgt_device = NULL;
>>> +	struct btrfs_device *src_device = NULL;
>>>  	struct btrfs_trans_handle *trans;
>>>  	struct btrfs_root *root = fs_info->tree_root;
>>>  	int result;
>>> @@ -724,6 +725,7 @@ int btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info)
>>>  	case BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED:
>>>  		result = BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_ERROR;
>>>  		tgt_device = dev_replace->tgtdev;
>>> +		src_device = dev_replace->srcdev;
>>>  		dev_replace->tgtdev = NULL;
>>>  		dev_replace->srcdev = NULL;
>>>  		break;
>>> @@ -741,6 +743,11 @@ int btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info)
>>>  	}
>>>  	ret = btrfs_commit_transaction(trans);
>>>  	WARN_ON(ret);
>>> +
>>> +	btrfs_info(fs_info, "dev_replace from %s (devid %llu) to %s canceled",
>>> +		   btrfs_dev_name(src_device), src_device->devid,
>>> +		   btrfs_dev_name(tgt_device));
>>> +
>>>  	if (tgt_device)
>>>  		btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
[PATCH] btrfs: manage subvolid mount option as %u
As the -o subvolid mount option is a u64, manage it as %u for token
verification, instead of %s.

Signed-off-by: Anand Jain
---
 fs/btrfs/super.c | 23 ++++++++++-------------
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 8112619cac95..bf629c8a6a47 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -336,7 +336,7 @@ enum {
 static const match_table_t tokens = {
 	{Opt_degraded, "degraded"},
 	{Opt_subvol, "subvol=%s"},
-	{Opt_subvolid, "subvolid=%s"},
+	{Opt_subvolid, "subvolid=%u"},
 	{Opt_device, "device=%s"},
 	{Opt_nodatasum, "nodatasum"},
 	{Opt_datasum, "datasum"},
@@ -964,8 +964,8 @@ static int btrfs_parse_subvol_options(const char *options, fmode_t flags,
 {
 	substring_t args[MAX_OPT_ARGS];
 	char *opts, *orig, *p;
-	char *num = NULL;
 	int error = 0;
+	u64 subvolid;
 
 	if (!options)
 		return 0;
@@ -995,18 +995,15 @@ static int btrfs_parse_subvol_options(const char *options, fmode_t flags,
 			}
 			break;
 		case Opt_subvolid:
-			num = match_strdup(&args[0]);
-			if (num) {
-				*subvol_objectid = memparse(num, NULL);
-				kfree(num);
-				/* we want the original fs_tree */
-				if (!*subvol_objectid)
-					*subvol_objectid =
-						BTRFS_FS_TREE_OBJECTID;
-			} else {
-				error = -EINVAL;
+			error = match_u64(&args[0], &subvolid);
+			if (error)
 				goto out;
-			}
+
+			/* we want the original fs_tree */
+			if (subvolid == 0)
+				subvolid = BTRFS_FS_TREE_OBJECTID;
+
+			*subvol_objectid = subvolid;
 			break;
 		case Opt_subvolrootid:
 			pr_warn("BTRFS: 'subvolrootid' mount option is deprecated and has no effect\n");
-- 
2.7.0
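For reference, the option this patch parses is used like this from userspace (the device, id, and mount points are made-up examples; the sketch exits quietly when the example device is absent):

```shell
#!/bin/sh
# Usage sketch for -o subvolid; /dev/sdb1 and id 257 are hypothetical.
[ -b /dev/sdb1 ] || exit 0

btrfs subvolume list /mnt    # e.g. "ID 257 gen 10 top level 5 path home"
mount -o subvolid=257 /dev/sdb1 /mnt/home
# subvolid=0 is mapped to BTRFS_FS_TREE_OBJECTID (5), i.e. the
# top-level fs tree, matching the fallback in the parser above.
```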
[PATCH] btrfs: manage commit mount option as %u
As the -o commit mount option is unsigned, manage it as %u for token
verification, instead of %d.

Signed-off-by: Anand Jain
---
 fs/btrfs/ctree.h |  2 +-
 fs/btrfs/super.c | 26 ++++++++++----------------
 2 files changed, 11 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 23b7e6756b15..80d1d4d12f9d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -788,7 +788,7 @@ struct btrfs_fs_info {
 	unsigned long pending_changes;
 	unsigned long compress_type:4;
 	unsigned int compress_level;
-	int commit_interval;
+	u32 commit_interval;
 	/*
 	 * It is a suggestive number, the read side is safe even it gets a
 	 * wrong number because we will write out the data into a regular
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 83e11bbacc17..0a299a0ded3a 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -385,7 +385,7 @@ static const match_table_t tokens = {
 	{Opt_check_integrity_print_mask, "check_int_print_mask=%u"},
 	{Opt_rescan_uuid_tree, "rescan_uuid_tree"},
 	{Opt_fatal_errors, "fatal_errors=%s"},
-	{Opt_commit_interval, "commit=%d"},
+	{Opt_commit_interval, "commit=%u"},
 #ifdef CONFIG_BTRFS_DEBUG
 	{Opt_fragment_data, "fragment=data"},
 	{Opt_fragment_metadata, "fragment=metadata"},
@@ -777,24 +777,18 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 		case Opt_commit_interval:
 			intarg = 0;
 			ret = match_int(&args[0], &intarg);
-			if (ret < 0) {
-				btrfs_err(info, "invalid commit interval");
-				ret = -EINVAL;
+			if (ret)
 				goto out;
-			}
-			if (intarg > 0) {
-				if (intarg > 300) {
-					btrfs_warn(info,
-						"excessive commit interval %d",
-						intarg);
-				}
-				info->commit_interval = intarg;
-			} else {
+			if (intarg == 0) {
 				btrfs_info(info,
-					   "using default commit interval %ds",
+					   "using default commit interval %us",
 					   BTRFS_DEFAULT_COMMIT_INTERVAL);
-				info->commit_interval = BTRFS_DEFAULT_COMMIT_INTERVAL;
+				intarg = BTRFS_DEFAULT_COMMIT_INTERVAL;
+			} else if (intarg > 300) {
+				btrfs_warn(info, "excessive commit interval %d",
+					   intarg);
 			}
+			info->commit_interval = intarg;
 			break;
 #ifdef CONFIG_BTRFS_DEBUG
 		case Opt_fragment_all:
@@ -1363,7 +1357,7 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 	if (btrfs_test_opt(info, PANIC_ON_FATAL_ERROR))
 		seq_puts(seq, ",fatal_errors=panic");
 	if (info->commit_interval != BTRFS_DEFAULT_COMMIT_INTERVAL)
-		seq_printf(seq, ",commit=%d", info->commit_interval);
+		seq_printf(seq, ",commit=%u", info->commit_interval);
 #ifdef CONFIG_BTRFS_DEBUG
 	if (btrfs_test_opt(info, FRAGMENT_DATA))
 		seq_puts(seq, ",fragment=data");
-- 
2.7.0
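For reference, usage of the commit= option handled above (the device and mount point are made-up examples; the sketch exits quietly when the example device is absent):

```shell
#!/bin/sh
# Usage sketch for -o commit; /dev/sdc1 is a hypothetical device.
[ -b /dev/sdc1 ] || exit 0

mount -o commit=120 /dev/sdc1 /mnt   # commit transactions every 120s
grep ' /mnt ' /proc/mounts           # shows ",commit=120" (non-default)
# commit=0 selects the default interval (BTRFS_DEFAULT_COMMIT_INTERVAL,
# 30s); values above 300 are accepted but log an "excessive commit
# interval" warning, per the parser above.
```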
[PATCH] btrfs: manage thread_pool mount option as %u
-o thread_pool is always unsigned. Manage it that way all around.

Signed-off-by: Anand Jain
---
 fs/btrfs/ctree.h   |  2 +-
 fs/btrfs/disk-io.c |  4 ++--
 fs/btrfs/super.c   | 13 ++++++-------
 3 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0b738f0a9a23..23b7e6756b15 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -935,7 +935,7 @@ struct btrfs_fs_info {
        struct btrfs_workqueue *extent_workers;
        struct task_struct *transaction_kthread;
        struct task_struct *cleaner_kthread;
-       int thread_pool_size;
+       u32 thread_pool_size;

        struct kobject *space_info_kobj;

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 87d24c903152..3cec2455b603 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2180,7 +2180,7 @@ static void btrfs_init_qgroup(struct btrfs_fs_info *fs_info)
 static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info,
                struct btrfs_fs_devices *fs_devices)
 {
-       int max_active = fs_info->thread_pool_size;
+       u32 max_active = fs_info->thread_pool_size;
        unsigned int flags = WQ_MEM_RECLAIM | WQ_FREEZABLE | WQ_UNBOUND;

        fs_info->workers =
@@ -2401,7 +2401,7 @@ int open_ctree(struct super_block *sb,
        int err = -EINVAL;
        int num_backups_tried = 0;
        int backup_index = 0;
-       int max_active;
+       u32 max_active;
        int clear_free_space_tree = 0;

        tree_root = fs_info->tree_root = btrfs_alloc_root(fs_info, GFP_KERNEL);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 02c7766e6849..8112619cac95 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -346,7 +346,7 @@ static const match_table_t tokens = {
        {Opt_barrier, "barrier"},
        {Opt_max_inline, "max_inline=%u"},
        {Opt_alloc_start, "alloc_start=%s"},
-       {Opt_thread_pool, "thread_pool=%d"},
+       {Opt_thread_pool, "thread_pool=%u"},
        {Opt_compress, "compress"},
        {Opt_compress_type, "compress=%s"},
        {Opt_compress_force, "compress-force"},
@@ -596,12 +596,11 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
                        ret = match_int(&args[0], &intarg);
                        if (ret) {
                                goto out;
-                       } else if (intarg > 0) {
-                               info->thread_pool_size = intarg;
-                       } else {
+                       } else if (intarg == 0) {
                                ret = -EINVAL;
                                goto out;
                        }
+                       info->thread_pool_size = intarg;
                        break;
                case Opt_max_inline:
                        ret = match_int(&args[0], &intarg);
@@ -1317,7 +1316,7 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
                seq_printf(seq, ",max_inline=%u", info->max_inline);
        if (info->thread_pool_size !=  min_t(unsigned long,
                                             num_online_cpus() + 2, 8))
-               seq_printf(seq, ",thread_pool=%d", info->thread_pool_size);
+               seq_printf(seq, ",thread_pool=%u", info->thread_pool_size);
        if (btrfs_test_opt(info, COMPRESS)) {
                compress_type = btrfs_compress_type2str(info->compress_type);
                if (btrfs_test_opt(info, FORCE_COMPRESS))
@@ -1723,7 +1722,7 @@ static struct dentry *btrfs_mount(struct file_system_type *fs_type, int flags,
 }

 static void btrfs_resize_thread_pool(struct btrfs_fs_info *fs_info,
-                                    int new_pool_size, int old_pool_size)
+                                    u32 new_pool_size, u32 old_pool_size)
 {
        if (new_pool_size == old_pool_size)
                return;
@@ -1791,7 +1790,7 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
        unsigned long old_opts = fs_info->mount_opt;
        unsigned long old_compress_type = fs_info->compress_type;
        u32 old_max_inline = fs_info->max_inline;
-       int old_thread_pool_size = fs_info->thread_pool_size;
+       u32 old_thread_pool_size = fs_info->thread_pool_size;
        unsigned int old_metadata_ratio = fs_info->metadata_ratio;
        int ret;
--
2.7.0
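The behavioral change here is small but worth spelling out: with `%u` the parser only ever hands over non-negative numbers, so the only value left to reject by hand is an explicit 0. A hedged userspace sketch (the function name is hypothetical, not kernel API):

```c
#include <assert.h>

/* Illustrative sketch of thread_pool= validation after the patch:
 * returns 0 on success, -1 (standing in for -EINVAL) when the user
 * asks for a zero-sized pool. The existing pool size is untouched on
 * failure. */
static int set_thread_pool_size(unsigned int arg, unsigned int *pool_size)
{
	if (arg == 0)
		return -1; /* -EINVAL in the kernel code */
	*pool_size = arg;
	return 0;
}
```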
[PATCH] btrfs: manage check_int_print_mask mount option as %u
As the -o check_int_print_mask mount option is unsigned, manage it as %u
for token verification instead of %d.

Signed-off-by: Anand Jain
---
 fs/btrfs/super.c | 16 +++++-----------
 1 file changed, 5 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index b925a649a01a..83e11bbacc17 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -382,7 +382,7 @@ static const match_table_t tokens = {
        {Opt_skip_balance, "skip_balance"},
        {Opt_check_integrity, "check_int"},
        {Opt_check_integrity_including_extent_data, "check_int_data"},
-       {Opt_check_integrity_print_mask, "check_int_print_mask=%d"},
+       {Opt_check_integrity_print_mask, "check_int_print_mask=%u"},
        {Opt_rescan_uuid_tree, "rescan_uuid_tree"},
        {Opt_fatal_errors, "fatal_errors=%s"},
        {Opt_commit_interval, "commit=%d"},
@@ -747,17 +747,11 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
                        break;
                case Opt_check_integrity_print_mask:
                        ret = match_int(&args[0], &intarg);
-                       if (ret) {
-                               goto out;
-                       } else if (intarg >= 0) {
-                               info->check_integrity_print_mask = intarg;
-                               btrfs_info(info,
-                                          "check_integrity_print_mask 0x%x",
-                                          info->check_integrity_print_mask);
-                       } else {
-                               ret = -EINVAL;
+                       if (ret)
                                goto out;
-                       }
+                       info->check_integrity_print_mask = intarg;
+                       btrfs_info(info, "check_integrity_print_mask 0x%x",
+                                  info->check_integrity_print_mask);
                        break;
 #else
                case Opt_check_integrity_including_extent_data:
--
2.7.0
[PATCH] btrfs: manage metadata_ratio mount option as %u
As the -o metadata_ratio mount option is unsigned, manage it as %u for
token verification instead of %d.

Signed-off-by: Anand Jain
---
 fs/btrfs/super.c | 17 ++++++-----------
 1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index bf629c8a6a47..b925a649a01a 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -362,7 +362,7 @@ static const match_table_t tokens = {
        {Opt_norecovery, "norecovery"},
        {Opt_flushoncommit, "flushoncommit"},
        {Opt_noflushoncommit, "noflushoncommit"},
-       {Opt_ratio, "metadata_ratio=%d"},
+       {Opt_ratio, "metadata_ratio=%u"},
        {Opt_discard, "discard"},
        {Opt_nodiscard, "nodiscard"},
        {Opt_space_cache, "space_cache"},
@@ -648,16 +648,11 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
                        break;
                case Opt_ratio:
                        ret = match_int(&args[0], &intarg);
-                       if (ret) {
-                               goto out;
-                       } else if (intarg >= 0) {
-                               info->metadata_ratio = intarg;
-                               btrfs_info(info, "metadata ratio %d",
-                                          info->metadata_ratio);
-                       } else {
-                               ret = -EINVAL;
+                       if (ret)
                                goto out;
-                       }
+                       info->metadata_ratio = intarg;
+                       btrfs_info(info, "metadata ratio %u",
+                                  info->metadata_ratio);
                        break;
                case Opt_discard:
                        btrfs_set_and_info(info, DISCARD,
@@ -1369,7 +1364,7 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
                           info->check_integrity_print_mask);
 #endif
        if (info->metadata_ratio)
-               seq_printf(seq, ",metadata_ratio=%d",
+               seq_printf(seq, ",metadata_ratio=%u",
                           info->metadata_ratio);
        if (btrfs_test_opt(info, PANIC_ON_FATAL_ERROR))
                seq_puts(seq, ",fatal_errors=panic");
--
2.7.0
[PATCH] btrfs: fix bare unsigned declarations
Kernel style prefers "unsigned int" over bare "unsigned", and
"signed int" over bare "signed".

Signed-off-by: Anand Jain
---
 fs/btrfs/ctree.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 80d1d4d12f9d..c6de1a5281ca 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -993,8 +993,8 @@ struct btrfs_fs_info {
        struct btrfs_balance_control *balance_ctl;
        wait_queue_head_t balance_wait_q;

-       unsigned data_chunk_allocations;
-       unsigned metadata_ratio;
+       u32 data_chunk_allocations;
+       u32 metadata_ratio;

        void *bdev_holder;
--
2.7.0
[PATCH] btrfs: add a comment to mark the deprecated mount option
-o alloc_start and -o subvolrootid are deprecated mount options; comment
them in the tokens list and otherwise leave them as they are. No
functional changes.

Signed-off-by: Anand Jain
---
 fs/btrfs/super.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 0a299a0ded3a..7771e690460c 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -345,7 +345,7 @@ static const match_table_t tokens = {
        {Opt_nobarrier, "nobarrier"},
        {Opt_barrier, "barrier"},
        {Opt_max_inline, "max_inline=%u"},
-       {Opt_alloc_start, "alloc_start=%s"},
+       {Opt_alloc_start, "alloc_start=%s"}, /* deprecated */
        {Opt_thread_pool, "thread_pool=%u"},
        {Opt_compress, "compress"},
        {Opt_compress_type, "compress=%s"},
@@ -371,7 +371,7 @@ static const match_table_t tokens = {
        {Opt_user_subvol_rm_allowed, "user_subvol_rm_allowed"},
        {Opt_enospc_debug, "enospc_debug"},
        {Opt_noenospc_debug, "noenospc_debug"},
-       {Opt_subvolrootid, "subvolrootid=%d"},
+       {Opt_subvolrootid, "subvolrootid=%d"}, /* deprecated */
        {Opt_defrag, "autodefrag"},
        {Opt_nodefrag, "noautodefrag"},
        {Opt_inode_cache, "inode_cache"},
--
2.7.0
[PATCH v2 1/2] btrfs: declare max_inline as u32
As of now btrfs_fs_info::max_inline is u64, but it can never be larger
than btrfs_fs_info::sectorsize, which is defined as u32, so declare
btrfs_fs_info::max_inline as u32 as well.

Signed-off-by: Anand Jain
---
v1->v2: Born in v2.

 fs/btrfs/ctree.h | 2 +-
 fs/btrfs/super.c | 8 ++++----
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4759e988b0df..0b738f0a9a23 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -795,7 +795,7 @@ struct btrfs_fs_info {
         * extent. The write side(mount/remount) is under ->s_umount lock,
         * so it is also safe.
         */
-       u64 max_inline;
+       u32 max_inline;

        struct btrfs_transaction *running_transaction;
        wait_queue_head_t transaction_throttle;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 383be3609cc9..b13d68506f15 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -610,11 +610,11 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
                                kfree(num);

                                if (info->max_inline) {
-                                       info->max_inline = min_t(u64,
+                                       info->max_inline = min_t(u32,
                                                info->max_inline,
                                                info->sectorsize);
                                }
-                               btrfs_info(info, "max_inline at %llu",
+                               btrfs_info(info, "max_inline at %u",
                                           info->max_inline);
                        } else {
                                ret = -ENOMEM;
@@ -1325,7 +1325,7 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
        if (btrfs_test_opt(info, NOBARRIER))
                seq_puts(seq, ",nobarrier");
        if (info->max_inline != BTRFS_DEFAULT_MAX_INLINE)
-               seq_printf(seq, ",max_inline=%llu", info->max_inline);
+               seq_printf(seq, ",max_inline=%u", info->max_inline);
        if (info->thread_pool_size !=  min_t(unsigned long,
                                             num_online_cpus() + 2, 8))
                seq_printf(seq, ",thread_pool=%d", info->thread_pool_size);
@@ -1801,7 +1801,7 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
        unsigned old_flags = sb->s_flags;
        unsigned long old_opts = fs_info->mount_opt;
        unsigned long old_compress_type = fs_info->compress_type;
-       u64 old_max_inline = fs_info->max_inline;
+       u32 old_max_inline = fs_info->max_inline;
        int old_thread_pool_size = fs_info->thread_pool_size;
        unsigned int old_metadata_ratio = fs_info->metadata_ratio;
        int ret;
--
2.7.0
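The clamping that justifies the narrower type can be sketched in userspace. This is an illustrative sketch: `min_u32()` and `clamp_max_inline()` are hypothetical names standing in for the kernel's `min_t(u32, ...)` idiom, and the point is simply that the stored value never exceeds a u32 sector size.

```c
#include <assert.h>
#include <stdint.h>

/* min_u32() stands in for the kernel's min_t(u32, a, b). */
static uint32_t min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/* Clamp the requested inline-extent limit to the sector size; since
 * sectorsize is u32, the result always fits in u32 as well. */
static uint32_t clamp_max_inline(uint32_t requested, uint32_t sectorsize)
{
	return min_u32(requested, sectorsize);
}
```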
[PATCH v2 2/2] btrfs: verify max_inline mount parameter
We aren't verifying the parameter passed to the max_inline mount option,
so we won't report and fail the mount if a junk value is specified, for
example -o max_inline=abc. This patch converts the max_inline option to
%u so that such values are rejected by the parser.

Signed-off-by: Anand Jain
---
v1->v2: use match_int ret value if error
        use %u instead of %d for parser

 fs/btrfs/super.c | 23 ++++++-----------------
 1 file changed, 6 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index b13d68506f15..02c7766e6849 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -344,7 +344,7 @@ static const match_table_t tokens = {
        {Opt_datacow, "datacow"},
        {Opt_nobarrier, "nobarrier"},
        {Opt_barrier, "barrier"},
-       {Opt_max_inline, "max_inline=%s"},
+       {Opt_max_inline, "max_inline=%u"},
        {Opt_alloc_start, "alloc_start=%s"},
        {Opt_thread_pool, "thread_pool=%d"},
        {Opt_compress, "compress"},
@@ -407,7 +407,7 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
                        unsigned long new_flags)
 {
        substring_t args[MAX_OPT_ARGS];
-       char *p, *num;
+       char *p;
        u64 cache_gen;
        int intarg;
        int ret = 0;
@@ -604,22 +604,11 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
                        }
                        break;
                case Opt_max_inline:
-                       num = match_strdup(&args[0]);
-                       if (num) {
-                               info->max_inline = memparse(num, NULL);
-                               kfree(num);
-
-                               if (info->max_inline) {
-                                       info->max_inline = min_t(u32,
-                                               info->max_inline,
-                                               info->sectorsize);
-                               }
-                               btrfs_info(info, "max_inline at %u",
-                                          info->max_inline);
-                       } else {
-                               ret = -ENOMEM;
+                       ret = match_int(&args[0], &intarg);
+                       if (ret)
                                goto out;
-                       }
+                       info->max_inline = min_t(u32, intarg, info->sectorsize);
+                       btrfs_info(info, "max_inline at %u", info->max_inline);
                        break;
                case Opt_alloc_start:
                        btrfs_info(info,
--
2.7.0
Re: [PATCH] btrfs: verify max_inline mount parameter
On 02/13/2018 01:13 AM, David Sterba wrote:
> On Mon, Feb 12, 2018 at 11:35:46PM +0800, Anand Jain wrote:
>> We aren't verifying the parameter passed to the max_inline mount
>> option, so we won't report and fail the mount if a junk value is
>> specified for example, -o max_inline=abc. This patch converts the
>> max_inline option to %d and checks if it's a number >= 0.
>
> Oh right, the parser can verify that for us. As the %d gets stored to
> int, there's no reason to store fs_info::max_inline as u64.

  fixed this in the patch set sent out.

> Looking at the parser capabilities, it accepts %u and matches an
> unsigned type, which we use for all our options. It would be good to
> unify that too.

  fixed this too.

> Please update this patch and possibly send more cleaning up the rest
> of the match strings, and the max_inline type. Thanks.

  Sure. will do.

Thanks, Anand
Re: [PATCH v2] btrfs: log, when replace, is canceled by the user
On 13.02.2018 11:05, Qu Wenruo wrote:
> On 2018年02月13日 11:53, Anand Jain wrote:
>> For forensic investigations of issues, we would want to know
>> if and when the user cancels the replace.
>>
>> Signed-off-by: Anand Jain
>> ---
>> v1->v2: use btrfs_dev_name() instead of rcu_str_deref()
>> as btrfs_dev_name() also provides "missing" string,
>> if needed.
>
> Would you please provide the branch you're using?
>
> As I couldn't find the function or macro of btrfs_dev_name()

I have it in my misc-next checkout. It got introduced by 3c958bd23b60
("btrfs: add helper for device path or missing") which is in 4.16-rc1,
so pretty recent.

> Thanks,
> Qu
>
>> fs/btrfs/dev-replace.c | 7 +++++++
>> 1 file changed, 7 insertions(+)
>>
>> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
>> index dd717e204b5e..3b0760f7ec8a 100644
>> --- a/fs/btrfs/dev-replace.c
>> +++ b/fs/btrfs/dev-replace.c
>> @@ -703,6 +703,7 @@ int btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info)
>> {
>>         struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
>>         struct btrfs_device *tgt_device = NULL;
>> +       struct btrfs_device *src_device = NULL;
>>         struct btrfs_trans_handle *trans;
>>         struct btrfs_root *root = fs_info->tree_root;
>>         int result;
>> @@ -724,6 +725,7 @@ int btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info)
>>         case BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED:
>>                 result = BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_ERROR;
>>                 tgt_device = dev_replace->tgtdev;
>> +               src_device = dev_replace->srcdev;
>>                 dev_replace->tgtdev = NULL;
>>                 dev_replace->srcdev = NULL;
>>                 break;
>> @@ -741,6 +743,11 @@ int btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info)
>>         }
>>         ret = btrfs_commit_transaction(trans);
>>         WARN_ON(ret);
>> +
>> +       btrfs_info(fs_info, "dev_replace from %s (devid %llu) to %s canceled",
>> +                  btrfs_dev_name(src_device), src_device->devid,
>> +                  btrfs_dev_name(tgt_device));
>> +
>>         if (tgt_device)
>>                 btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
Re: [PATCH v2] btrfs: log, when replace, is canceled by the user
On 2018年02月13日 11:53, Anand Jain wrote:
> For forensic investigations of issues, we would want to know
> if and when the user cancels the replace.
>
> Signed-off-by: Anand Jain
> ---
> v1->v2: use btrfs_dev_name() instead of rcu_str_deref()
> as btrfs_dev_name() also provides "missing" string,
> if needed.

Would you please provide the branch you're using?

As I couldn't find the function or macro of btrfs_dev_name()

Thanks,
Qu

> fs/btrfs/dev-replace.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> index dd717e204b5e..3b0760f7ec8a 100644
> --- a/fs/btrfs/dev-replace.c
> +++ b/fs/btrfs/dev-replace.c
> @@ -703,6 +703,7 @@ int btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info)
> {
>         struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
>         struct btrfs_device *tgt_device = NULL;
> +       struct btrfs_device *src_device = NULL;
>         struct btrfs_trans_handle *trans;
>         struct btrfs_root *root = fs_info->tree_root;
>         int result;
> @@ -724,6 +725,7 @@ int btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info)
>         case BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED:
>                 result = BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_ERROR;
>                 tgt_device = dev_replace->tgtdev;
> +               src_device = dev_replace->srcdev;
>                 dev_replace->tgtdev = NULL;
>                 dev_replace->srcdev = NULL;
>                 break;
> @@ -741,6 +743,11 @@ int btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info)
>         }
>         ret = btrfs_commit_transaction(trans);
>         WARN_ON(ret);
> +
> +       btrfs_info(fs_info, "dev_replace from %s (devid %llu) to %s canceled",
> +                  btrfs_dev_name(src_device), src_device->devid,
> +                  btrfs_dev_name(tgt_device));
> +
>         if (tgt_device)
>                 btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
Re: [PATCH v2] btrfs: fix endianness compatibility during the SB RW
On 2018年02月13日 11:00, Anand Jain wrote:
> Fixes the endianness bug in the fs_info::super_copy by using its
> btrfs_set_super...() function to set values in the SB, as these
> functions manage the endianness compatibility nicely.
>
> Signed-off-by: Anand Jain

Also went through all btrfs_super_block SETGET functions, grepping using
\>, and it seems that there are still some left here:

fs/btrfs/sysfs.c:

In both btrfs_sectorsize_show() and btrfs_clone_alignment_show():
	return snprintf(buf, PAGE_SIZE, "%u\n", fs_info->super_copy->sectorsize);

In btrfs_nodesize_show():
	return snprintf(buf, PAGE_SIZE, "%u\n", fs_info->super_copy->nodesize);

And what about cc this to the stable kernel? IIRC it's a very critical
problem for btrfs.

Maybe cc: sta...@vger.kernel.org # v3.2+?

Thanks,
Qu

> ---
> v1->v2: Update change log. Update $Subject.
> Old:
> [PATCH] btrfs: use set functions to update latest refs to the SB
>
> fs/btrfs/transaction.c | 20 ++++++++++++--------
> 1 file changed, 12 insertions(+), 8 deletions(-)
>
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 04f07144b45c..9220f004001c 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1722,19 +1722,23 @@ static void update_super_roots(struct btrfs_fs_info *fs_info)
>
>         super = fs_info->super_copy;
>
> +       /* update latest btrfs_super_block::chunk_root refs */
>         root_item = &fs_info->chunk_root->root_item;
> -       super->chunk_root = root_item->bytenr;
> -       super->chunk_root_generation = root_item->generation;
> -       super->chunk_root_level = root_item->level;
> +       btrfs_set_super_chunk_root(super, root_item->bytenr);
> +       btrfs_set_super_chunk_root_generation(super, root_item->generation);
> +       btrfs_set_super_chunk_root_level(super, root_item->level);
>
> +       /* update latest btrfs_super_block::root refs */
>         root_item = &fs_info->tree_root->root_item;
> -       super->root = root_item->bytenr;
> -       super->generation = root_item->generation;
> -       super->root_level = root_item->level;
> +       btrfs_set_super_root(super, root_item->bytenr);
> +       btrfs_set_super_generation(super, root_item->generation);
> +       btrfs_set_super_root_level(super, root_item->level);
> +
>         if (btrfs_test_opt(fs_info, SPACE_CACHE))
> -               super->cache_generation = root_item->generation;
> +               btrfs_set_super_cache_generation(super, root_item->generation);
>         if (test_bit(BTRFS_FS_UPDATE_UUID_TREE_GEN, &fs_info->flags))
> -               super->uuid_tree_generation = root_item->generation;
> +               btrfs_set_super_uuid_tree_generation(super,
> +                                                    root_item->generation);
> }
>
> int btrfs_transaction_in_commit(struct btrfs_fs_info *info)
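To see why the setter helpers matter, here is a minimal userspace sketch. The struct and function names are hypothetical, not the kernel's SETGET macros: the point is only that on-disk superblock fields are little-endian, so every store must go through an explicit byte-order conversion rather than a plain assignment like `super->chunk_root = root_item->bytenr`, which would write native-endian bytes on big-endian hosts.

```c
#include <assert.h>
#include <stdint.h>

/* Store a 64-bit value little-endian, byte by byte, independent of the
 * host's native byte order. */
static void set_le64(uint8_t dst[8], uint64_t v)
{
	for (int i = 0; i < 8; i++)
		dst[i] = (uint8_t)(v >> (8 * i)); /* byte 0 = least significant */
}

/* Read it back, again independent of host endianness. */
static uint64_t get_le64(const uint8_t src[8])
{
	uint64_t v = 0;
	for (int i = 0; i < 8; i++)
		v |= (uint64_t)src[i] << (8 * i);
	return v;
}

/* Hypothetical stand-in for the on-disk superblock and its setter. */
struct fake_super_block {
	uint8_t chunk_root[8]; /* little-endian on disk */
};

static void fake_set_super_chunk_root(struct fake_super_block *sb,
				      uint64_t bytenr)
{
	set_le64(sb->chunk_root, bytenr);
}
```

On a little-endian host a plain assignment happens to produce the same bytes, which is exactly why this class of bug survives testing on x86 and only bites on big-endian machines.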
Re: [PATCH] btrfs: Verify extent allocated by find_free_extent() won't overlap with extents from previous transaction
On 2018年02月13日 15:20, Nikolay Borisov wrote:
> On 13.02.2018 03:13, Qu Wenruo wrote:
>> There are reports in the mail list that even with the latest mainline
>> kernel, btrfs can't survive a power loss.
>>
>> Unlike journal based filesystems, btrfs doesn't use a journal for such
>> work. (The log tree is an optimization for fsync, not to keep the fs
>> healthy.) In btrfs we use metadata CoW to ensure all tree blocks are as
>> atomic as the superblock.
>>
>> This leads to an obvious assumption: some code breaking such metadata
>> CoW makes btrfs no longer bullet-proof against power loss.
>>
>> This patch adds an extra runtime selftest to find_free_extent(), which
>> will check the range in the commit root of the extent tree to ensure
>> there is no overlap at all.
>>
>> And hopefully this could help us to catch the cause of the problem.
>>
>> Signed-off-by: Qu Wenruo
>> ---
>> Unfortunately, no new problem exposed yet.
>> ---
>>  fs/btrfs/extent-tree.c | 118 +++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 118 insertions(+)
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 2f4328511ac8..3b3cd82bce3a 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -7458,6 +7458,114 @@ btrfs_release_block_group(struct btrfs_block_group_cache *cache,
>>         btrfs_put_block_group(cache);
>>  }
>>
>> +#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
>> +/*
>> + * Verify if the given extent range [@start, @start + @len) conflicts with
>> + * any existing extent in the commit root.
>> + *
>> + * Btrfs doesn't use a journal, but depends on metadata (and data) CoW to
>> + * keep the whole filesystem consistent against power loss.
>> + * If we have overwritten any extent used by the previous trans (commit
>> + * root), and a power loss happens, we will corrupt our filesystem.
>
> Currently do we know if the corruption is caused due to extent overlap
> or some other subtle bug?

No direct evidence yet.

But some strange behavior, like a valid tree block with a wrong level
(parent level is 1, leaf level is still 1, but otherwise completely
valid), happens after power loss.

And I strongly suspect transid errors after power loss can be related to
this.

Overwriting the commit root is just one possible cause. This patch is
just trying to catch some clue.

I am also working on a new test case using dm-flakey and dm-log to try
to create some crazy power-loss test cases to try to reproduce it.

> I.e is there a guarantee that this self-check should trigger?

No guarantee yet.

But at least for normal tree allocation and new data extent allocation,
it goes through such a call chain and should trigger this self-check:

For tree allocation:
btrfs_alloc_tree_block()
|- btrfs_reserve_extent()
   |- find_free_extent()
      |- check_extent_conflicts()

For data extents, there are several cases:

For prealloc:
btrfs_prealloc_file_range() or btrfs_prealloc_file_range_trans()
|- __btrfs_prealloc_file_range()
   |- btrfs_reserve_extent()

For data CoW:
cow_file_range()
|- btrfs_reserve_extent()

For compression:
submit_compressed_extents()
|- btrfs_reserve_extent()

For direct IO:
btrfs_new_extent_direct()
|- btrfs_reserve_extent()

So btrfs_reserve_extent() should be a quite good entry point, although
I'm still not 100% sure if there is any other case which could reserve
an extent without calling btrfs_reserve_extent().

If we could find that, it may be the cause.

Thanks,
Qu

>> + *
>> + * Return 0 if nothing is wrong.
>> + * Return <0 (including -ENOMEM) means we have something wrong.
>> + * Except -ENOMEM, this normally means we have extent conflicts with a
>> + * previous transaction.
>> + */
>> +static int check_extent_conflicts(struct btrfs_fs_info *fs_info,
>> +                                 u64 start, u64 len)
>> +{
>> +       struct btrfs_key key;
>> +       struct btrfs_path *path;
>> +       struct btrfs_root *extent_root = fs_info->extent_root;
>> +       u64 extent_start;
>> +       u64 extent_len;
>> +       int ret;
>> +
>> +       path = btrfs_alloc_path();
>> +       if (!path)
>> +               return -ENOMEM;
>> +
>> +       key.objectid = start + len;
>> +       key.type = 0;
>> +       key.offset = 0;
>> +
>> +       /*
>> +        * Here we use the extent commit root to search for any conflicts.
>> +        * If the extent commit tree is already overwritten, we will get a
>> +        * transid error and error out anyway.
>> +        * If the extent commit tree is OK, but another extent conflicts,
>> +        * we will find it.
>> +        * So anyway, such a search should be OK to find the conflicts.
>> +        */
>> +       path->search_commit_root = true;
>> +       ret = btrfs_search_slot(NULL, extent_root, &key, path, 0, 0);
>> +       if (ret < 0)
>> +               goto out;
>> +       /* Impossible, as no type/offset should be (u64)-1 */
>> +       if (ret == 0) {
>> +               ret = -EUCLEAN;
>> +               goto out;
>> +       }
>> +       ret = btrfs_previous_extent_item(extent_root, path, start);
>> +       if (ret < 0)
>> +               goto out;
>> +       if (ret == 0) {
>> +
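The conflict test at the heart of the selftest can be sketched in userspace. This is an illustrative sketch, not the kernel code: a plain array stands in for the extent tree's commit root, and `range_conflicts()` is a hypothetical name. The essential logic is the same half-open interval-overlap check the patch performs on [start, start + len).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A committed extent: [start, start + len), as recorded in the
 * previous transaction's commit root. */
struct extent {
	uint64_t start;
	uint64_t len;
};

/* Return 1 if a proposed allocation [start, start + len) overlaps any
 * committed extent, 0 otherwise. Two half-open intervals overlap iff
 * each one starts before the other ends. */
static int range_conflicts(const struct extent *committed, size_t n,
			   uint64_t start, uint64_t len)
{
	for (size_t i = 0; i < n; i++) {
		if (start < committed[i].start + committed[i].len &&
		    committed[i].start < start + len)
			return 1; /* would overwrite a committed extent */
	}
	return 0;
}
```

Any allocation for which this returns 1 would overwrite data the previous transaction still depends on, which is exactly the metadata-CoW violation the selftest is hunting for.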