[PATCH] btrfs-progs: mkfs/rootdir: Fix memory leak in traverse_directory()

2018-02-13 Thread Qu Wenruo
The bug is exposed by mkfs test case 009, with D=asan.

We are leaking the memory of parent_dir_entry->path which, except for
the rootdir, is allocated by strdup().

Before fixing it, unify the allocation of parent_dir_entry->path so
that it is always heap allocated.

Then fix it by calling "free(parent_dir_entry->path);" in
traverse_directory() and in the error handler.

Signed-off-by: Qu Wenruo 
---
 mkfs/rootdir.c | 5 +++--
 mkfs/rootdir.h | 2 +-
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/mkfs/rootdir.c b/mkfs/rootdir.c
index 9be4bcbfd0b5..f9472e126674 100644
--- a/mkfs/rootdir.c
+++ b/mkfs/rootdir.c
@@ -453,7 +453,6 @@ static int traverse_directory(struct btrfs_trans_handle *trans,
ino_t parent_inum, cur_inum;
ino_t highest_inum = 0;
const char *parent_dir_name;
-   char real_path[PATH_MAX];
struct btrfs_path path;
struct extent_buffer *leaf;
struct btrfs_key root_dir_key;
@@ -464,7 +463,7 @@ static int traverse_directory(struct btrfs_trans_handle *trans,
if (!dir_entry)
return -ENOMEM;
dir_entry->dir_name = dir_name;
-   dir_entry->path = realpath(dir_name, real_path);
+   dir_entry->path = realpath(dir_name, NULL);
if (!dir_entry->path) {
error("realpath failed for %s: %s", dir_name, strerror(errno));
ret = -1;
@@ -616,6 +615,7 @@ static int traverse_directory(struct btrfs_trans_handle *trans,
}
 
free_namelist(files, count);
+   free(parent_dir_entry->path);
free(parent_dir_entry);
 
index_cnt = 2;
@@ -686,6 +686,7 @@ fail:
dir_entry = list_entry(dir_head.list.next,
   struct directory_name_entry, list);
list_del(&dir_entry->list);
+   free(dir_entry->path);
free(dir_entry);
}
 out:
diff --git a/mkfs/rootdir.h b/mkfs/rootdir.h
index d0fc2eb58563..f06c7dd16e3c 100644
--- a/mkfs/rootdir.h
+++ b/mkfs/rootdir.h
@@ -23,7 +23,7 @@
 
 struct directory_name_entry {
const char *dir_name;
-   const char *path;
+   char *path;
ino_t inum;
struct list_head list;
 };
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs-progs: Fix read beyond boundary bug in build_roots_info_cache()

2018-02-13 Thread Qu Wenruo
This bug is exposed by fsck-test with D=asan, hit by test case 020, with
the following error report:

=
==10740==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x62161580 at pc 0x56051f0db6cd bp 0x7ffe170f3e20 sp 0x7ffe170f3e10
READ of size 1 at 0x62161580 thread T0
#0 0x56051f0db6cc in btrfs_extent_inline_ref_type /home/adam/btrfs/btrfs-progs/ctree.h:1727
#1 0x56051f13b669 in build_roots_info_cache /home/adam/btrfs/btrfs-progs/cmds-check.c:14306
#2 0x56051f13c86a in repair_root_items /home/adam/btrfs/btrfs-progs/cmds-check.c:14450
#3 0x56051f13ea89 in cmd_check /home/adam/btrfs/btrfs-progs/cmds-check.c:14965
#4 0x56051efe75bb in main /home/adam/btrfs/btrfs-progs/btrfs.c:302
#5 0x7f04ddbb0f49 in __libc_start_main (/usr/lib/libc.so.6+0x20f49)
#6 0x56051efe68c9 in _start (/home/adam/btrfs/btrfs-progs/btrfs+0x5b8c9)

0x62161580 is located 0 bytes to the right of 4224-byte region [0x62160500,0x62161580)
allocated by thread T0 here:
#0 0x7f04ded50ce1 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cc:70
#1 0x56051f04685e in __alloc_extent_buffer /home/adam/btrfs/btrfs-progs/extent_io.c:553
#2 0x56051f047563 in alloc_extent_buffer /home/adam/btrfs/btrfs-progs/extent_io.c:687
#3 0x56051efff1d1 in btrfs_find_create_tree_block /home/adam/btrfs/btrfs-progs/disk-io.c:187
#4 0x56051f000133 in read_tree_block /home/adam/btrfs/btrfs-progs/disk-io.c:327
#5 0x56051efeddb8 in read_node_slot /home/adam/btrfs/btrfs-progs/ctree.c:652
#6 0x56051effb0d9 in btrfs_next_leaf /home/adam/btrfs/btrfs-progs/ctree.c:2853
#7 0x56051f13b343 in build_roots_info_cache /home/adam/btrfs/btrfs-progs/cmds-check.c:14267
#8 0x56051f13c86a in repair_root_items /home/adam/btrfs/btrfs-progs/cmds-check.c:14450
#9 0x56051f13ea89 in cmd_check /home/adam/btrfs/btrfs-progs/cmds-check.c:14965
#10 0x56051efe75bb in main /home/adam/btrfs/btrfs-progs/btrfs.c:302
#11 0x7f04ddbb0f49 in __libc_start_main (/usr/lib/libc.so.6+0x20f49)

It's entirely possible for one extent/metadata item to have no inline
reference at all, but build_roots_info_cache() doesn't check for that.

Fix it by checking @iref against the item end to avoid reading beyond
the leaf.

Signed-off-by: Qu Wenruo 
---

I'm pretty sure this fix will conflict with the original-mode code
split. Maybe it's better to merge that code split first?

---
 cmds-check.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/cmds-check.c b/cmds-check.c
index 99fbafc5538c..b15802887e70 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -14255,6 +14255,7 @@ static int build_roots_info_cache(struct btrfs_fs_info *info)
struct btrfs_key found_key;
struct btrfs_extent_item *ei;
struct btrfs_extent_inline_ref *iref;
+   unsigned long item_end;
int slot = path.slots[0];
int type;
u64 flags;
@@ -14283,6 +14284,7 @@ static int build_roots_info_cache(struct btrfs_fs_info *info)
 
ei = btrfs_item_ptr(leaf, slot, struct btrfs_extent_item);
flags = btrfs_extent_flags(leaf, ei);
+   item_end = (unsigned long)ei + btrfs_item_size_nr(leaf, slot);
 
if (found_key.type == BTRFS_EXTENT_ITEM_KEY &&
!(flags & BTRFS_EXTENT_FLAG_TREE_BLOCK))
@@ -14299,6 +14301,15 @@ static int build_roots_info_cache(struct btrfs_fs_info *info)
level = btrfs_tree_block_level(leaf, binfo);
}
 
+   /*
+    * It's valid for one extent/metadata item to have no inline ref,
+    * but a SHARED_BLOCK_REF or other shared reference instead.
+    * So we need to do an extra check to avoid reading beyond the
+    * leaf boundary.
+    */
+   if ((unsigned long)iref >= item_end)
+   goto next;
+
/*
 * For a root extent, it must be of the following type and the
 * first (and only one) iref in the item.
-- 
2.16.1
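The shape of the boundary check the patch adds can be shown in isolation. The names below are illustrative, not the real btrfs structures; the sketch only demonstrates the pointer-arithmetic pattern of comparing a would-be inline ref against the item end:

```c
#include <assert.h>

/*
 * An inline ref only exists if its address still lies strictly below
 * the end of the extent item.  Reading at or past item_end walks off
 * the item (and possibly off the leaf), which is exactly the
 * heap-buffer-overflow that ASAN reported.
 */
static int ref_within_item(const void *item_start, unsigned long item_size,
			   const void *iref)
{
	unsigned long item_end = (unsigned long)item_start + item_size;

	return (unsigned long)iref < item_end;
}
```

When the item carries only keyed (non-inline) references, `iref` computed from the item start plus the fixed headers lands exactly at `item_end`, so the check correctly reports that no inline ref is present.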



Re: btrfs-cleaner / snapshot performance analysis

2018-02-13 Thread Darrick J. Wong
On Sun, Feb 11, 2018 at 02:40:16PM +0800, Qu Wenruo wrote:
> 
> 
> On 2018年02月10日 00:45, Ellis H. Wilson III wrote:
> > Hi all,
> > 
> > I am trying to better understand how the cleaner kthread (btrfs-cleaner)
> > impacts foreground performance, specifically during snapshot deletion.
> > My experience so far has been that it can be dramatically disruptive to
> > foreground I/O.
> > 
> > Looking through the wiki at kernel.org I have not yet stumbled onto any
> > analysis that would shed light on this specific problem.  I have found
> > numerous complaints about btrfs-cleaner online, especially relating to
> > quotas being enabled.  This has proven thus far less than helpful, as
> > the response tends to be "use less snapshots," or "disable quotas," both
> > of which strike me as intellectually unsatisfying answers, especially
> > the former in a filesystem where snapshots are supposed to be
> > "first-class citizens."
> 
> Yes, snapshots in btrfs really are "first-class citizens".
> Tons of design decisions are biased toward snapshots.
> 
> But one should be clear about one thing:
> Snapshot creation and backref walk (used in qgroup, relocation and
> extent deletion) are in fact two conflicting workloads.
> 
> Btrfs puts snapshot creation on a very high priority, so that it greatly
> degrades the performance of backref walk (used in snapshot deletion,
> relocation and extent exclusive/shared calculation of qgroup).
> 
> Let me explain this problem in detail.
> 
> Just as explained by Peter Grandi, for any snapshot system (or any
> system supporting reflink) there must be a reverse mapping tree, to
> tell which extent is used by whom.
> 
> It's critical for determining whether an extent is shared, so that we
> know whether we need to do CoW.
> 
> There are several different ways to implement it, and this hugely
> affects snapshot creation performance.
> 
> 1) Direct mapping record
>Just records exactly which extent is used by whom, directly.
>So when we need to check the owner, we just search the tree ONCE, and
>we get it.
> 
>This is simple, and it seems that the LVM thin-provisioning and
>traditional LVM targets both use it.
>(Maybe XFS also follows this way?)

Yes, it does.

>Pros:
>*FAST* backref walk, which means quick extent deletion and CoW
>condition check.
> 
> 
>Cons:
>*SLOW* snapshot creation.
>Each snapshot creation needs to insert a new owner relationship into
>the tree. This modification grows with the size of the snapshot source.

...of course xfs also doesn't support snapshots. :)

--D

> 2) Indirect mapping record
>Records the upper-level referencer only.
> 
>To get all direct owners of an extent, it needs multiple lookups in
>the reverse mapping tree.
> 
>And obviously, btrfs is using this method.
> 
>Pros:
>*FAST* owner inheritance, which means snapshot creation.
>(Well, the only advantage I can think of)
> 
>Cons:
>*VERY SLOW* backref walk, used by extent deletion, relocation, qgroup
>and CoW condition check.
>(That may also be why btrfs defaults to CoW data, so that it can skip
> the costly backref walk)
> 
> And a more detailed example of the difference between them will be:
> 
> [Basic tree layout]
>  Tree X
>  node A
>/\
> node B node C
> / \   /  \
>  leaf D  leaf E  leaf F  leaf G
> 
> Use above tree X as snapshot source.
> 
> [Snapshot creation: Direct mapping]
> Then for direct mapping record, if we are going to create snapshot Y
> then we would get:
> 
> Tree X  Tree Y
> node A  node H
>  |  \ / |
>  |   X  |
>  |  / \ |
> node B  node C
>  /  \  / \
>   leaf D  leaf E   leaf F   leaf G
> 
> We need to create new node H, and update the owner for node B/C/D/E/F/G.
> 
> That's to say, we need to create 1 new node, and update 6 references of
> existing nodes/leaves.
> And this will grow rapidly if the tree is large, but it should still
> be a linear increase.
> 
> 
> [Snapshot creation: Indirect mapping]
> And if using the indirect mapping scheme, firstly, the reverse mapping
> tree doesn't record exactly the owner of each leaf/node, but only
> records its parent(s).
> 
> So even when tree X exists alone, without snapshot Y, if we need to
> know the owner of leaf D, we only know that its only parent is node B.
> And we do the same query on node B until we reach node A and know it's
> owned by tree X.
> 
>  Tree X ^
>  node A ^ Look upward until
>/| we reach tree root
> node B  | to search the owner
> /   | of a leaf/node
>  
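The upward walk described above can be modeled with a toy parent-pointer sketch. This is illustrative only: real btrfs backrefs are keyed items in the extent tree and an extent may have many parents, so the real walk fans out instead of following a single chain. The sketch only shows the per-level lookup cost that a direct owner record would avoid:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of the indirect (parent-pointer) scheme: each node records
 * only its parent, so finding the owning tree of a leaf means walking
 * upward to the root, one lookup per level.
 */
struct toy_node {
	struct toy_node *parent;	/* NULL for a tree root */
	int owner_tree;			/* only meaningful at the root */
};

static int find_owner(const struct toy_node *n)
{
	while (n->parent)
		n = n->parent;		/* one reverse-mapping lookup per level */
	return n->owner_tree;
}
```

In the direct scheme the owner is stored with the extent itself, so `find_owner` would be a single lookup; the trade-off is that snapshot creation must then rewrite the owner record of every shared node, which is the cost btrfs avoids.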

Re: mount btrfs takes 30 minutes, btrfs check runs out of memory

2018-02-13 Thread Qu Wenruo


On 2018年02月14日 00:24, Holger Hoffstätte wrote:
> On 02/13/18 13:54, Qu Wenruo wrote:
>> On 2018年02月13日 20:26, Holger Hoffstätte wrote:
>>> On 02/13/18 12:40, Qu Wenruo wrote:
>> The problem is not about how much space it takes, but how many extents
>> are here in the filesystem.
>>>
>>> I have no idea why btrfs' mount even needs to touch all block groups to
>>> get going (which seems to be the root of the problem), but here's a
>>> not so crazy idea for more "mechanical sympathy". Feel free to mock
>>> me if this is terribly wrong or not possible. ;)
>>>
>>> Mounting of even large filesystems (with many extents) seems to be fine
>>> on SSDS, but not so fine on rotational storage. We've heard that from
>>> several people with large (multi-TB) filesystems, and obviously it's
>>> even more terrible on 5400RPM drives because their seeks are sooo sloow.
>>>
>>> If the problem is that the bgs are touched/iterated in "tree order",
>>> would it then not be possible to sort the block groups in physical order
>>> before trying to load whatever mount needs to load?
>>
>> This is in fact a good idea.
>> Make block group into its own tree.
> 
> Well, that's not what I was thinking about at all..yet. :)
> (keep in mind I'm not really that familiar with the internals).
> 
> Out of curiosity I ran a bit of perf on my own mount process, which is
> fast (~700 ms) despite being a ~1.1TB fs, mixture of lots of large and
> small files. Unfortunately it's also very fresh since I recreated it just
> this weekend, so everything is neatly packed together and fast.
> 
> In contrast a friend's fs is ~800 GB, but has 11 GB metadata and is pretty
> old and fragmented (but running an up-to-date kernel). His fs mounts in ~5s.
> 
> My perf run shows that the only interesting part responsible for mount time
> is the nested loop in btrfs_read_block_groups calling find_first_block_group
> (which got inlined & is not in the perf callgraph) over and over again,
> accounting for 75% of time spent.
> 
> I now understand your comment why the real solution to this problem
> is to move bgs into their own tree, and agree: both kitchens and databases
> have figured out a long time ago that the key to fast scan and lookup
> performance is to not put different things in the same storage container;
> in the case of analytical DBMS this is columnar storage. :)
> 
> But what I originally meant was something much simpler and more
> brute-force-ish. I see that btrfs_read_block_groups adds readahead
> (is that actually effective?) but what I was looking for was the equivalent
> of a DBMS' sequential scan. Right now finding (and loading) a bg seems to
> involve a nested loop of tree lookups. It seems easier to rip through the
> entire tree in nice 8MB chunks and discard what you don't need instead
> of seeking around trying to find all the right bits in scattered order.

The problem is that the tree (the extent tree) containing the block
groups is very, very large.

It's a tree shared by all subvolumes. And since tree nodes and leaves
can be scattered around the whole disk, it's pretty hard to do batch
readahead.

> 
> Could we alleviate cold mounts by starting more readaheads in
> btrfs_read_block_groups, so that the extent tree is scanned more linearly?

Since the extent tree is not laid out linearly, it won't be as effective as we'd hope.

Thanks,
Qu

> 
> cheers,
> Holger
> 





Re: [PATCH 1/2] btrfs: Remove invalid null checks from btrfs_cleanup_dirty_bgs

2018-02-13 Thread David Sterba
On Thu, Feb 08, 2018 at 10:57:11AM -0800, Liu Bo wrote:
> On Thu, Feb 08, 2018 at 06:25:17PM +0200, Nikolay Borisov wrote:
> > list_first_entry is essentially a wrapper over container_of. The latter
> > can never return NULL, even if it's working on an inconsistent list, since
> > it will either crash or return some offset into the wrong struct.
> > Additionally, for the dirty_bgs list the iteration is done under
> > dirty_bgs_lock which ensures consistency of the list.
> > 
> 
> Reviewed-by: Liu Bo 

Added to next, thanks.


Re: [PATCH v2] btrfs: fix endianness compatibility during the SB RW

2018-02-13 Thread David Sterba
On Tue, Feb 13, 2018 at 06:27:13PM +0800, Anand Jain wrote:
> On 02/13/2018 05:01 PM, Qu Wenruo wrote:
> > On 2018年02月13日 11:00, Anand Jain wrote:
> >> Fixes the endianness bug in the fs_info::super_copy by using its
> >> btrfs_set_super...() function to set values in the SB, as these
> >> functions manage the endianness compatibility nicely.
> >>
> >> Signed-off-by: Anand Jain 
> > 
> > Also went through all btrfs_super_block SETGET functions, greping using
> > \>, seems that there are still some left here:
> > 
> > fs/btrfs/sysfs.c:
> > In both btrfs_sectorsize_show() and btrfs_clone_alignment_show():
> > return snprintf(buf, PAGE_SIZE, "%u\n",
> > fs_info->super_copy->sectorsize);
> > 
> > In btrfs_nodesize_show():
> > return snprintf(buf, PAGE_SIZE, "%u\n", fs_info->super_copy->nodesize);
> 
>   Oh. Thanks. Will fix. Maybe it's a good idea to add sysfs fixes
>   into a new patch.

I'd prefer a single patch as it fixes the same problem for one
structure, the context of use is not that important to justify 2
patches.

I went through the possible uses of superblock again and did not find
anything else than the update_super_roots and sysfs read handlers. There
are some direct uses of super block members, like label, sys_array, uuid
that are passed unconverted and must be accessed via the set/get
helpers.

In some places the superblock is put to a temporary variable so simple
grep may miss these.

> > And what about cc this to stable kernel?
> > IIRC it's a very critical problem for btrfs.

If the filesystem is always used on a same endianity host, this will not
be a problem. Moving between opposite endianity hosts will report bogus
numbers in sysfs and the backup root would not be restored correctly.

As this is not common, I'd rate that as a bugfix for stable, but "only"
a serious one.

> > Maybe cc: sta...@vger.kernel.org # v3.2+?
> 
>   Thanks for the suggestion. Will do. What happens if the patch applied
>   on mainline ends up conflicting on an LTS branch? Should I write a
>   separate patch for stable?

If the patch does not apply to some older stable branch, all involved
people get a mail from stable team and have an opportunity to send an
updated version of the patch.

Regarding the long-term branches, I would consider 4.4 and up. Anything
older is a plus but fixing merge conflicts is more likely there so I
think the respective maintainers would either fix it by themselves or
ask for help.


Re: [PATCH v2] btrfs: log, when replace, is canceled by the user

2018-02-13 Thread David Sterba
On Tue, Feb 13, 2018 at 11:53:43AM +0800, Anand Jain wrote:
> For forensic investigations of issues, we would want to know
> if and when the user cancels the replace.
> 
> Signed-off-by: Anand Jain 

Reviewed-by: David Sterba 

I've updated the changelog a bit.


Re: [PATCH] btrfs: fix bare unsigned declarations

2018-02-13 Thread David Sterba
On Tue, Feb 13, 2018 at 05:50:48PM +0800, Anand Jain wrote:
> Kernel style prefers "unsigned int" over "unsigned"
> and "signed int" over "signed".

The changelog does not match the changes, you're switching to u32. I
agree u32 looks more suitable and consistent with the other changes.
Please update the changelog and fixup the temporary variable in
btrfs_remount. Thanks.


Re: [PATCH] btrfs: extent_buffer_uptodate() make it static and inline

2018-02-13 Thread David Sterba
On Tue, Feb 13, 2018 at 09:18:51AM +0200, Nikolay Borisov wrote:
> 
> 
> On 13.02.2018 06:35, Anand Jain wrote:
> > extent_buffer_uptodate() is a trivial wrapper around test_bit()
> > and nothing else. So make it static and inline, save on code
> > space and call indirection.
> > 
> > Before:
> >textdata bss dec hex filename
> > 1131257   82898   18992 1233147  12d0fb fs/btrfs/btrfs.ko
> > 
> > After:
> >textdata bss dec hex filename
> > 1131090   82898   18992 1232980  12d054 fs/btrfs/btrfs.ko
> > 
> > Signed-off-by: Anand Jain 
> 
> Reviewed-by: Nikolay Borisov  although there is one
> nit below
> 
> > ---
> >  fs/btrfs/extent_io.c | 5 -
> >  fs/btrfs/extent_io.h | 5 -
> >  2 files changed, 4 insertions(+), 6 deletions(-)
> > 
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index bae57b408901..f300edae044b 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -5229,11 +5229,6 @@ void set_extent_buffer_uptodate(struct extent_buffer *eb)
> > }
> >  }
> >  
> > -int extent_buffer_uptodate(struct extent_buffer *eb)
> > -{
> > -   return test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
> > -}
> > -
> >  int read_extent_buffer_pages(struct extent_io_tree *tree,
> >  struct extent_buffer *eb, int wait, int mirror_num)
> >  {
> > diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
> > index 72e5af2965a8..953d94662b8e 100644
> > --- a/fs/btrfs/extent_io.h
> > +++ b/fs/btrfs/extent_io.h
> > @@ -489,7 +489,10 @@ void clear_extent_buffer_dirty(struct extent_buffer *eb);
> >  int set_extent_buffer_dirty(struct extent_buffer *eb);
> >  void set_extent_buffer_uptodate(struct extent_buffer *eb);
> >  void clear_extent_buffer_uptodate(struct extent_buffer *eb);
> > -int extent_buffer_uptodate(struct extent_buffer *eb);
> > +static inline int extent_buffer_uptodate(struct extent_buffer *eb)
> > +{
> > +   return test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
> > +}
> 
> nit: It just doesn't look good to have a static inline function
> definition among a list of function declarations. A quick look at
> extent_io.h shows that this function could have been put below the
> inline def of extent_buffer_get.

I will move it.


Re: [PATCH v2 1/2] btrfs: declare max_inline as u32

2018-02-13 Thread David Sterba
On Tue, Feb 13, 2018 at 05:49:49PM +0800, Anand Jain wrote:
> As of now btrfs_fs_info::max_inline is u64, but it can't be larger
> than btrfs_fs_info::sectorsize, which is defined as u32, so make
> btrfs_fs_info::max_inline u32 as well.
> Signed-off-by: Anand Jain 

Reviewed-by: David Sterba 


Re: [PATCH v2 2/2] btrfs: verify max_inline mount parameter

2018-02-13 Thread David Sterba
On Tue, Feb 13, 2018 at 05:49:50PM +0800, Anand Jain wrote:
> We aren't verifying the parameter passed to the max_inline mount option,
> so we won't report and fail the mount if a junk value is specified for
> example, -o max_inline=abc.
> This patch converts the max_inline option to %d and checks if it's a
> number >= 0.

As the max_inline is a size, the suffixes are allowed here and this is
documented in the btrfs(5) page. I've checked all current options and
max_inline should be the only one where we want the suffixes.


Re: mount btrfs takes 30 minutes, btrfs check runs out of memory

2018-02-13 Thread Holger Hoffstätte
On 02/13/18 13:54, Qu Wenruo wrote:
> On 2018年02月13日 20:26, Holger Hoffstätte wrote:
>> On 02/13/18 12:40, Qu Wenruo wrote:
> The problem is not about how much space it takes, but how many extents
> are here in the filesystem.
>>
>> I have no idea why btrfs' mount even needs to touch all block groups to
>> get going (which seems to be the root of the problem), but here's a
>> not so crazy idea for more "mechanical sympathy". Feel free to mock
>> me if this is terribly wrong or not possible. ;)
>>
>> Mounting of even large filesystems (with many extents) seems to be fine
>> on SSDS, but not so fine on rotational storage. We've heard that from
>> several people with large (multi-TB) filesystems, and obviously it's
>> even more terrible on 5400RPM drives because their seeks are sooo sloow.
>>
>> If the problem is that the bgs are touched/iterated in "tree order",
>> would it then not be possible to sort the block groups in physical order
>> before trying to load whatever mount needs to load?
> 
> This is in fact a good idea.
> Make block group into its own tree.

Well, that's not what I was thinking about at all..yet. :)
(keep in mind I'm not really that familiar with the internals).

Out of curiosity I ran a bit of perf on my own mount process, which is
fast (~700 ms) despite being a ~1.1TB fs, mixture of lots of large and
small files. Unfortunately it's also very fresh since I recreated it just
this weekend, so everything is neatly packed together and fast.

In contrast a friend's fs is ~800 GB, but has 11 GB metadata and is pretty
old and fragmented (but running an up-to-date kernel). His fs mounts in ~5s.

My perf run shows that the only interesting part responsible for mount time
is the nested loop in btrfs_read_block_groups calling find_first_block_group
(which got inlined & is not in the perf callgraph) over and over again,
accounting for 75% of time spent.

I now understand your comment why the real solution to this problem
is to move bgs into their own tree, and agree: both kitchens and databases
have figured out a long time ago that the key to fast scan and lookup
performance is to not put different things in the same storage container;
in the case of analytical DBMS this is columnar storage. :)

But what I originally meant was something much simpler and more
brute-force-ish. I see that btrfs_read_block_groups adds readahead
(is that actually effective?) but what I was looking for was the equivalent
of a DBMS' sequential scan. Right now finding (and loading) a bg seems to
involve a nested loop of tree lookups. It seems easier to rip through the
entire tree in nice 8MB chunks and discard what you don't need instead
of seeking around trying to find all the right bits in scattered order.

Could we alleviate cold mounts by starting more readaheads in
btrfs_read_block_groups, so that the extent tree is scanned more linearly?

cheers,
Holger





Re: [PATCH] btrfs: manage subvolid mount option as %u

2018-02-13 Thread David Sterba
On Tue, Feb 13, 2018 at 05:50:43PM +0800, Anand Jain wrote:
> As -o subvolid mount option is an u64 manage it as %u for
> token verifications, instead of %s.
> 
> Signed-off-by: Anand Jain 
> ---
>  fs/btrfs/super.c | 23 ++-
>  1 file changed, 10 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 8112619cac95..bf629c8a6a47 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -336,7 +336,7 @@ enum {
>  static const match_table_t tokens = {
>   {Opt_degraded, "degraded"},
>   {Opt_subvol, "subvol=%s"},
> - {Opt_subvolid, "subvolid=%s"},
> + {Opt_subvolid, "subvolid=%u"},

This gets implemented as simple_strtoul; deep in the parser it uses an
unsigned long long and then casts to unsigned long. Long and long-long
are the same only on a subset of architectures, so we'd need to either
extend the parser capabilities to really do u64, or use %s and
match_u64.

Though a subvolid larger than a full 32bit number is unlikely, for the
sake of correctness I think we should stick to the constraints of
subvolid, which allows the full u64 range.

>   {Opt_device, "device=%s"},
>   {Opt_nodatasum, "nodatasum"},
>   {Opt_datasum, "datasum"},
> @@ -964,8 +964,8 @@ static int btrfs_parse_subvol_options(const char 
> *options, fmode_t flags,
>  {
>   substring_t args[MAX_OPT_ARGS];
>   char *opts, *orig, *p;
> - char *num = NULL;
>   int error = 0;
> + u64 subvolid;
>  
>   if (!options)
>   return 0;
> @@ -995,18 +995,15 @@ static int btrfs_parse_subvol_options(const char 
> *options, fmode_t flags,
>   }
>   break;
>   case Opt_subvolid:
> - num = match_strdup(&args[0]);
> - if (num) {
> - *subvol_objectid = memparse(num, NULL);
> - kfree(num);
> - /* we want the original fs_tree */
> - if (!*subvol_objectid)
> - *subvol_objectid =
> - BTRFS_FS_TREE_OBJECTID;
> - } else {
> - error = -EINVAL;
> + error = match_u64(&args[0], &subvolid);

So this is the right way (with the %s), as memparse accepts the size
suffixes (K/M/G/...), that we don't want for a subvolume id.

> + if (error)
>   goto out;
> - }
> +
> + /* we want the original fs_tree */
> + if (subvolid == 0)
> + subvolid = BTRFS_FS_TREE_OBJECTID;
> +
> + *subvol_objectid = subvolid;
>   break;
>   case Opt_subvolrootid:
> pr_warn("BTRFS: 'subvolrootid' mount option is deprecated and has no effect\n");
> -- 
> 2.7.0
> 


Re: [PATCH] btrfs: manage thread_pool mount option as %u

2018-02-13 Thread David Sterba
On Tue, Feb 13, 2018 at 05:34:56PM +0200, Nikolay Borisov wrote:
> >>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> >>> index 02c7766e6849..8112619cac95 100644
> >>> --- a/fs/btrfs/super.c
> >>> +++ b/fs/btrfs/super.c
> >>> @@ -346,7 +346,7 @@ static const match_table_t tokens = {
> >>>   {Opt_barrier, "barrier"},
> >>>   {Opt_max_inline, "max_inline=%u"},
> >>>   {Opt_alloc_start, "alloc_start=%s"},
> >>> -    {Opt_thread_pool, "thread_pool=%d"},
> >>> +    {Opt_thread_pool, "thread_pool=%u"},
> >>>   {Opt_compress, "compress"},
> >>>   {Opt_compress_type, "compress=%s"},
> >>>   {Opt_compress_force, "compress-force"},
> >>> @@ -596,12 +596,11 @@ int btrfs_parse_options(struct btrfs_fs_info
> >>> *info, char *options,
> >>>   ret = match_int(&args[0], &intarg);
> >>>   if (ret) {
> >>>   goto out;
> >>> -    } else if (intarg > 0) {
> >>> -    info->thread_pool_size = intarg;
> >>> -    } else {
> >>> +    } else if (intarg == 0) {
> >>
> >> One thing I'm worried about is the fact that match_int parses a signed
> >> int. So if someone passes -1, it would be parsed to -1, but when you
> >> set it to thread_pool_size the actual value is going to be the 2's
> >> complement of the value, i.e. a very large number. So a check for intarg
> >> < 0 is required to avoid that. Same applies to your other patches.
> > 
> >  That's not true. When -o thread_pool=-1 is passed, it fails to
> >  match any token. And the same applies to the other patches too.
> 
> Indeed, the subtleties of the matching machinery. In that case for the
> whole series:

Yeah, it's not very clear from the sources, but with %u the parser
refuses negative numbers. Even with match_int, the typecasts will not
damage the result when assigning to u32 from intarg.


Re: [PATCH] btrfs: manage thread_pool mount option as %u

2018-02-13 Thread Nikolay Borisov


On 13.02.2018 17:18, Anand Jain wrote:
>>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
>>> index 02c7766e6849..8112619cac95 100644
>>> --- a/fs/btrfs/super.c
>>> +++ b/fs/btrfs/super.c
>>> @@ -346,7 +346,7 @@ static const match_table_t tokens = {
>>>   {Opt_barrier, "barrier"},
>>>   {Opt_max_inline, "max_inline=%u"},
>>>   {Opt_alloc_start, "alloc_start=%s"},
>>> -    {Opt_thread_pool, "thread_pool=%d"},
>>> +    {Opt_thread_pool, "thread_pool=%u"},
>>>   {Opt_compress, "compress"},
>>>   {Opt_compress_type, "compress=%s"},
>>>   {Opt_compress_force, "compress-force"},
>>> @@ -596,12 +596,11 @@ int btrfs_parse_options(struct btrfs_fs_info
>>> *info, char *options,
>>>   ret = match_int(&args[0], &intarg);
>>>   if (ret) {
>>>   goto out;
>>> -    } else if (intarg > 0) {
>>> -    info->thread_pool_size = intarg;
>>> -    } else {
>>> +    } else if (intarg == 0) {
>>
>> One thing I'm worried about is the fact that match_int parses a signed
>> int. So if someone passes -1, it would be parsed to -1, but when you
>> set it to thread_pool_size the actual value is going to be the 2's
>> complement of the value, i.e. a very large number. So a check for intarg
>> < 0 is required to avoid that. Same applies to your other patches.
> 
>  That's not true. When -o thread_pool=-1 is passed, it fails to
>  match any token. And the same applies to the other patches too.

Indeed, the subtleties of the matching machinery. In that case for the
whole series:

Reviewed-by: Nikolay Borisov 

> 
> Thanks, Anand


Re: btrfs-progs: compile with musl libc

2018-02-13 Thread Dan Robertson
> Thanks, this has been fixed in
> https://github.com/kdave/btrfs-progs/commit/3aa1bbdd89ee9c9c48d260a6192fae08328f1b2f

Ah, my bad. Didn't check the dev branch. Thanks for the quick response!

Cheers,

Dan




Re: [PATCH] btrfs: manage thread_pool mount option as %u

2018-02-13 Thread Anand Jain








diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 02c7766e6849..8112619cac95 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -346,7 +346,7 @@ static const match_table_t tokens = {
{Opt_barrier, "barrier"},
{Opt_max_inline, "max_inline=%u"},
{Opt_alloc_start, "alloc_start=%s"},
-   {Opt_thread_pool, "thread_pool=%d"},
+   {Opt_thread_pool, "thread_pool=%u"},
{Opt_compress, "compress"},
{Opt_compress_type, "compress=%s"},
{Opt_compress_force, "compress-force"},
@@ -596,12 +596,11 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char 
*options,
ret = match_int(&args[0], &intarg);
if (ret) {
goto out;
-   } else if (intarg > 0) {
-   info->thread_pool_size = intarg;
-   } else {
+   } else if (intarg == 0) {


One thing I'm worried about is the fact that match_int parses a signed
int. So if someone passes -1 then it would be parsed as -1, but when you
set it to thread_pool_size the actual value is going to be the 2's
complement of the value, i.e. a very large number. So a check for intarg
< 0 is required to avoid that. Same applies to your other patches


 That's not true. When -o thread_pool=-1 is passed it would fail
 to match any token. And the same applies to the other patches too.

Thanks, Anand


Re: btrfs-progs: compile with musl libc

2018-02-13 Thread David Sterba
On Tue, Feb 13, 2018 at 01:25:10PM +, Dan Robertson wrote:
> Ensure that limits.h is included in mkfs/main.c as PATH_MAX is used. The lack
> of this currently causes btrfs-progs to fail to compile when the system's
> libc is musl.

> From 7859e0d01778e844ae0fcbefe55581277ce7cab3 Mon Sep 17 00:00:00 2001
> From: Dan Robertson 
> Date: Tue, 13 Feb 2018 13:04:29 +
> Subject: [PATCH] btrfs-progs: compile with musl libc
> 
> Ensure that limits.h is included in mkfs/main.c as PATH_MAX is used.
> 
> Signed-off-by: Dan Robertson 

Thanks, this has been fixed in
https://github.com/kdave/btrfs-progs/commit/3aa1bbdd89ee9c9c48d260a6192fae08328f1b2f

and the musl build check is now in place so the breakage should not
happen anymore. I'll do a release this week with the fix included.


Re: [PATCH] btrfs: manage thread_pool mount option as %u

2018-02-13 Thread Nikolay Borisov


On 13.02.2018 11:50, Anand Jain wrote:
> -o thread_pool is always unsigned. Manage it that way all around.
> 
> Signed-off-by: Anand Jain 
> ---
>  fs/btrfs/ctree.h   |  2 +-
>  fs/btrfs/disk-io.c |  4 ++--
>  fs/btrfs/super.c   | 13 ++---
>  3 files changed, 9 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 0b738f0a9a23..23b7e6756b15 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -935,7 +935,7 @@ struct btrfs_fs_info {
>   struct btrfs_workqueue *extent_workers;
>   struct task_struct *transaction_kthread;
>   struct task_struct *cleaner_kthread;
> - int thread_pool_size;
> + u32 thread_pool_size;
>  
>   struct kobject *space_info_kobj;
>  
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 87d24c903152..3cec2455b603 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2180,7 +2180,7 @@ static void btrfs_init_qgroup(struct btrfs_fs_info 
> *fs_info)
>  static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info,
>   struct btrfs_fs_devices *fs_devices)
>  {
> - int max_active = fs_info->thread_pool_size;
> + u32 max_active = fs_info->thread_pool_size;
>   unsigned int flags = WQ_MEM_RECLAIM | WQ_FREEZABLE | WQ_UNBOUND;
>  
>   fs_info->workers =
> @@ -2401,7 +2401,7 @@ int open_ctree(struct super_block *sb,
>   int err = -EINVAL;
>   int num_backups_tried = 0;
>   int backup_index = 0;
> - int max_active;
> + u32 max_active;
>   int clear_free_space_tree = 0;
>  
>   tree_root = fs_info->tree_root = btrfs_alloc_root(fs_info, GFP_KERNEL);
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 02c7766e6849..8112619cac95 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -346,7 +346,7 @@ static const match_table_t tokens = {
>   {Opt_barrier, "barrier"},
>   {Opt_max_inline, "max_inline=%u"},
>   {Opt_alloc_start, "alloc_start=%s"},
> - {Opt_thread_pool, "thread_pool=%d"},
> + {Opt_thread_pool, "thread_pool=%u"},
>   {Opt_compress, "compress"},
>   {Opt_compress_type, "compress=%s"},
>   {Opt_compress_force, "compress-force"},
> @@ -596,12 +596,11 @@ int btrfs_parse_options(struct btrfs_fs_info *info, 
> char *options,
>   ret = match_int(&args[0], &intarg);
>   if (ret) {
>   goto out;
> - } else if (intarg > 0) {
> - info->thread_pool_size = intarg;
> - } else {
> + } else if (intarg == 0) {

One thing I'm worried about is the fact that match_int parses a signed
int. So if someone passes -1 then it would be parsed as -1, but when you
set it to thread_pool_size the actual value is going to be the 2's
complement of the value, i.e. a very large number. So a check for intarg
< 0 is required to avoid that. Same applies to your other patches

>   ret = -EINVAL;
>   goto out;
>   }
> + info->thread_pool_size = intarg;
>   break;
>   case Opt_max_inline:
>   ret = match_int(&args[0], &intarg);
> @@ -1317,7 +1316,7 @@ static int btrfs_show_options(struct seq_file *seq, 
> struct dentry *dentry)
>   seq_printf(seq, ",max_inline=%u", info->max_inline);
>   if (info->thread_pool_size !=  min_t(unsigned long,
>num_online_cpus() + 2, 8))
> - seq_printf(seq, ",thread_pool=%d", info->thread_pool_size);
> + seq_printf(seq, ",thread_pool=%u", info->thread_pool_size);
>   if (btrfs_test_opt(info, COMPRESS)) {
>   compress_type = btrfs_compress_type2str(info->compress_type);
>   if (btrfs_test_opt(info, FORCE_COMPRESS))
> @@ -1723,7 +1722,7 @@ static struct dentry *btrfs_mount(struct 
> file_system_type *fs_type, int flags,
>  }
>  
>  static void btrfs_resize_thread_pool(struct btrfs_fs_info *fs_info,
> -  int new_pool_size, int old_pool_size)
> +  u32 new_pool_size, u32 old_pool_size)
>  {
>   if (new_pool_size == old_pool_size)
>   return;
> @@ -1791,7 +1790,7 @@ static int btrfs_remount(struct super_block *sb, int 
> *flags, char *data)
>   unsigned long old_opts = fs_info->mount_opt;
>   unsigned long old_compress_type = fs_info->compress_type;
>   u32 old_max_inline = fs_info->max_inline;
> - int old_thread_pool_size = fs_info->thread_pool_size;
> + u32 old_thread_pool_size = fs_info->thread_pool_size;
>   unsigned int old_metadata_ratio = fs_info->metadata_ratio;
>   int ret;
>  
> 


[PATCH 1/2] btrfs: Remove extra run_delayed_items call

2018-02-13 Thread Nikolay Borisov
Before commit 581227d0d2b8 ("Btrfs: remove the time check in
btrfs_commit_transaction()") there was a loop in the transaction
commit path that would go on until all existing writers ended their
transaction handles. This loop would try to flush as much pending stuff
as possible before going into the transaction critical section. However,
the aforementioned commit removed the loop, yet kept calling this
"extra" code as an optimisation.

As it stands the rules for running the delayed items are a bit of
a mess. First we have the one call which this patch removes, then we
have another one, acting as a "safety net" by catching any delayed
items introduced after the previous call is finished and extwriters
becoming 0 (meaning no more TRANS_START/TRANS_ATTACHS are possible, only
joins). Then we have the delayed items running as part of creating a
snapshot (once per pending snapshot). Finally, delayed items are run
once more _after_ snapshots have been created. All in all this amounts
to 3 + N (N = number of snapshots slated for creation). I think this
could really be reduced to 2 calls - 1 before create_pending_snapshots
is called and 1 after it's finished. This suffices to ensure consistent
fs state during snapshot creation and after they are created.

I did a full xfstest run to ensure no consistency issues came up. Then
I picked the tests which exhibited rather long runtimes and looked
closer to see whether that was a result of this commit - it wasn't.
Finally I did an artificial test with fsstress using only those
operations that put stress on the delayed items code:

fsstress -z -f mkdir=25% -f rmdir=25% -f creat=25% -f unlink=25%
-f attr_set=25% -p5 -n 25000

and also didn't observe any performance regressions.

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/transaction.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 2cdf7be02f41..02bc1e6212e6 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -2066,10 +2066,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans)
if (ret)
goto cleanup_transaction;
 
-   ret = btrfs_run_delayed_items(trans);
-   if (ret)
-   goto cleanup_transaction;
-
wait_event(cur_trans->writer_wait,
   extwriter_counter_read(cur_trans) == 0);
 
-- 
2.7.4



[PATCH 2/2] btrfs: Refactor running of delayed inode items during transaction commit

2018-02-13 Thread Nikolay Borisov
Currently, calls to btrfs_run_delayed_items have been scattered
around the transaction commit code with no real design argument as to
when they should be executed.

We have one call after transaction writers go to 0. Then we have the delayed
items running as part of creating a snapshot (once per pending snapshot).
Finally, delayed items are run once more _after_ snapshots have been created.
All in all this amounts to 2 + N (N = number of snapshots slated for creation).
In reality we need to flush delayed items once before
create_pending_snapshots is called, to ensure snapshots are consistent with
inode data, and once after snapshots are created, so that inode items newly
introduced during the snapshot creation process are correctly persisted on
disk. This patch brings the total executions of run_delayed_items down to
just 2.

This survived multiple xfstest runs. 

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/transaction.c | 31 ++-
 1 file changed, 10 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 02bc1e6212e6..b32d3136f36c 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1524,18 +1524,6 @@ static noinline int create_pending_snapshot(struct 
btrfs_trans_handle *trans,
}
btrfs_release_path(path);
 
-   /*
-* pull in the delayed directory update
-* and the delayed inode item
-* otherwise we corrupt the FS during
-* snapshot
-*/
-   ret = btrfs_run_delayed_items(trans);
-   if (ret) {  /* Transaction aborted */
-   btrfs_abort_transaction(trans, ret);
-   goto fail;
-   }
-
record_root_in_trans(trans, root, 0);
btrfs_set_root_last_snapshot(&root->root_item, trans->transid);
memcpy(new_root_item, &root->root_item, sizeof(*new_root_item));
@@ -2069,10 +2057,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans)
wait_event(cur_trans->writer_wait,
   extwriter_counter_read(cur_trans) == 0);
 
-   /* some pending stuffs might be added after the previous flush. */
-   ret = btrfs_run_delayed_items(trans);
-   if (ret)
-   goto cleanup_transaction;
 
btrfs_wait_delalloc_flush(fs_info);
 
@@ -2095,6 +2079,16 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans)
ret = cur_trans->aborted;
goto scrub_continue;
}
+
+   /*
+* Run delayed items because we need them to be consistent on-disk
+* so that snapshots created in create_pending_snapshots don't corrupt
+* the filesystem. At this point we are the one doing transaction
+   * commit and no one can come and introduce new delayed inode items
+*/
+   ret = btrfs_run_delayed_items(trans);
+   if (ret)
+   goto scrub_continue;
/*
 * the reloc mutex makes sure that we stop
 * the balancing code from coming in and moving
@@ -2102,11 +2096,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans)
 */
mutex_lock(&fs_info->reloc_mutex);
 
-   /*
-* We needn't worry about the delayed items because we will
-* deal with them in create_pending_snapshot(), which is the
-* core function of the snapshot creation.
-*/
ret = create_pending_snapshots(trans);
if (ret) {
mutex_unlock(&fs_info->reloc_mutex);
-- 
2.7.4



Re: btrfs-cleaner / snapshot performance analysis

2018-02-13 Thread E V
On Mon, Feb 12, 2018 at 10:37 AM, Ellis H. Wilson III
 wrote:
> On 02/11/2018 01:24 PM, Hans van Kranenburg wrote:
>>
>> Why not just use `btrfs fi du   ` now and then and
>> update your administration with the results? .. Instead of putting the
>> burden of keeping track of all administration during every tiny change
>> all day long?
>
>
> I will look into that if using built-in group capacity functionality proves
> to be truly untenable.  Thanks!
>
>>> CoW is still valuable for us as we're shooting to support on the order
>>> of hundreds of snapshots per subvolume,
>>
>>
>> Hundreds will get you into trouble even without qgroups.
>
>
> I should have been more specific.  We are looking to use up to a few dozen
> snapshots per subvolume, but will have many (tens to hundreds of) discrete
> subvolumes (each with up to a few dozen snapshots) in a BTRFS filesystem.
> If I have it wrong and the scalability issues in BTRFS do not solely apply
> to subvolumes and their snapshot counts, please let me know.
>
> I will note you focused on my tiny desktop filesystem when making some of
> your previous comments -- this is why I didn't want to share specific
> details.  Our filesystem will be RAID0 with six large HDDs (12TB each).
> Reliability concerns do not apply to our situation for technical reasons,
> but if there are capacity scaling issues with BTRFS I should be made aware
> of, I'd be glad to hear them.  I have not seen any in technical
> documentation of such a limit, and experiments so far on 6x6TB arrays has
> not shown any performance problems, so I'm inclined to believe the only
> scaling issue exists with reflinks.  Correct me if I'm wrong.
>
> Thanks,
>
> ellis
>

When testing btrfs on large volumes, especially with metadata heavy
operations, I'd suggest you match the node size of your mkfs.btrfs
(-n) with the stripe size used in creating your RAID array. Also, use
the ssd_spread mount option as discussed in a previous thread. It makes
a big difference on arrays. It allocates much more space for metadata,
but it greatly reduces fragmentation over time.


btrfs-progs: compile with musl libc

2018-02-13 Thread Dan Robertson
Ensure that limits.h is included in mkfs/main.c as PATH_MAX is used. The lack
of this currently causes btrfs-progs to fail to compile when the system's
libc is musl.
From 7859e0d01778e844ae0fcbefe55581277ce7cab3 Mon Sep 17 00:00:00 2001
From: Dan Robertson 
Date: Tue, 13 Feb 2018 13:04:29 +
Subject: [PATCH] btrfs-progs: compile with musl libc

Ensure that limits.h is included in mkfs/main.c as PATH_MAX is used.

Signed-off-by: Dan Robertson 
---
 mkfs/main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mkfs/main.c b/mkfs/main.c
index ea65c6d8..f44520aa 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -27,6 +27,7 @@
 /* #include  included via androidcompat.h */
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
-- 
2.16.1





Re: [PATCH] btrfs: Verify extent allocated by find_free_extent() won't overlap with extents from previous transaction

2018-02-13 Thread Nikolay Borisov


On 13.02.2018 13:07, Qu Wenruo wrote:
> 
> 
> On 2018年02月13日 09:13, Qu Wenruo wrote:
>> There are reports in mail list, even with latest mainline kernel, btrfs
>> can't survive a power loss.
>>
>> Unlike journal based filesystem, btrfs doesn't use journal for such
>> work. (log tree is an optimization for fsync, not to keep fs healthy)
>> In btrfs we use metadata CoW to ensure all tree blocks are as atomic as
>> superblock.
>>
>> This leads to an obvious assumption, some code breaks such metadata CoW
>> makes btrfs no longer bullet-proof against power loss.
>>
>> This patch adds extra runtime selftest to find_free_extent(), which
>> will check the range in commit root of extent tree to ensure there is no
>> overlap at all.
>>
>> And hopes this could help us to catch the cause of the problem.
>>
>> Signed-off-by: Qu Wenruo 
>> ---
>> Unfortunately, no new problem exposed yet.
> 
> Interestingly, I hit a bug during looping xfstest, where
> btrfs_free_path() in check_extent_conflicts() caused rcu error.
> 
> This seems quite interesting and may be the clue to the problem.
> 
> BTW, for anyone interesting with this problem, the backtrace is:
> 
> ODEBUG: activate active (active state 1) object type: rcu_head hint:
>   (null)
> WARNING: CPU: 2 PID: 15435 at lib/debugobjects.c:291
> debug_print_object+0xc1/0xe0
> Modules linked in: dm_flakey ext4 mbcache jbd2 btrfs(O) xor
> zstd_decompress zstd_compress xxhash raid6_pq efivarfs xfs dm_mod
> virtio_scsi
> CPU: 2 PID: 15435 Comm: kworker/u16:12 Tainted: G   O
> 4.15.0+ #32
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> Workqueue: writeback wb_workfn (flush-btrfs-367)
> RIP: 0010:debug_print_object+0xc1/0xe0
> RSP: 0018:88001819d1c0 EFLAGS: 00010086
> RAX: dc08 RBX: 0003 RCX: 0006
> RDX: 0007 RSI: 110003033a01 RDI: 88006abde6d0
> RBP: 0001 R08:  R09: 0001
> R10: 88001819d0b0 R11:  R12: 82472120
> R13: 8255cae0 R14:  R15: 880054737268
> FS:  () GS:88006aa0() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 7f5e42c8d6c0 CR3: 608aa000 CR4: 003406e0
> Call Trace:
>  debug_object_activate+0x309/0x360

Can you get a line number for that function? So in release_extent_buffer
you call_rcu, which calls debug_rcu_head_queue in __call_rcu. What
those debug checks are supposed to do (I haven't looked too deeply into
the code) seems to be to detect whether we are doing a double free, i.e.
calling call_rcu twice for the same extent buffer.

Looking at debug_object_activate it seems there are 2 cases where we can
print the ODEBUG stuff: if the object is already in ODEBUG_STATE_ACTIVE,
and if the object has been freed, i.e. state ODEBUG_STATE_DESTROYED.
However, judging by the "activate active" I'd say we hit the
ODEBUG_STATE_ACTIVE case. And since we don't have a ->fixup_activate
function, we should return -EINVAL and the following WARN_ONCE should be
triggered:

WARN_ONCE(1, "__call_rcu(): Double-freed CB %p->%pF()!!!\n",
  head, head->func);



>  __call_rcu.constprop.72+0x135/0xc50
>  release_extent_buffer+0xf0/0x280 [btrfs]
>  free_extent_buffer+0x16d/0x210 [btrfs]
>  btrfs_release_path+0x2d/0xe0 [btrfs]
>  btrfs_free_path.part.28+0xe/0x30 [btrfs]
>  find_free_extent+0x1082/0x2440 [btrfs]
>  btrfs_reserve_extent+0xda/0x1f0 [btrfs]
>  cow_file_range.isra.67+0x223/0x5d0 [btrfs]
>  run_delalloc_range+0x1cc/0x610 [btrfs]
>  writepage_delalloc+0x1bb/0x2c0 [btrfs]
>  __extent_writepage+0x422/0x630 [btrfs]
>  extent_write_cache_pages.constprop.56+0x2f7/0x7b0 [btrfs]
>  extent_writepages+0xce/0x110 [btrfs]
>  do_writepages+0x8e/0xb0
>  __writeback_single_inode+0x155/0xbf0
>  writeback_sb_inodes+0x462/0x8e0
>  wb_writeback+0x313/0x970
>  wb_workfn+0x217/0xc80
>  process_one_work+0x740/0xe40
>  worker_thread+0x1c6/0xce0
>  kthread+0x1ba/0x1e0
>  ret_from_fork+0x24/0x30
> 
> Thanks,
> Qu
> 
>> ---
>>  fs/btrfs/extent-tree.c | 118 
>> +
>>  1 file changed, 118 insertions(+)
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 2f4328511ac8..3b3cd82bce3a 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -7458,6 +7458,114 @@ btrfs_release_block_group(struct 
>> btrfs_block_group_cache *cache,
>>  btrfs_put_block_group(cache);
>>  }
>>  
>> +#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
>> +/*
>> + * Verify if the given extent range [@start, @start + @len) conflicts any
>> + * existing extent in commit root.
>> + *
>> + * Btrfs doesn't use journal, but depends on metadata (and data) CoW to keep
>> + * the whole filesystem consistent against powerloss.
>> + * If we have overwritten any extent used by previous trans (commit root),
>> + * and powerloss happen we will corrupt our filesystem.
>> + *

Re: mount btrfs takes 30 minutes, btrfs check runs out of memory

2018-02-13 Thread Qu Wenruo


On 2018年02月13日 20:26, Holger Hoffstätte wrote:
> On 02/13/18 12:40, Qu Wenruo wrote:
 The problem is not about how much space it takes, but how many extents
 are here in the filesystem.
> 
> I have no idea why btrfs' mount even needs to touch all block groups to
> get going (which seems to be the root of the problem), but here's a
> not so crazy idea for more "mechanical sympathy". Feel free to mock
> me if this is terribly wrong or not possible. ;)
> 
> Mounting of even large filesystems (with many extents) seems to be fine
> on SSDS, but not so fine on rotational storage. We've heard that from
> several people with large (multi-TB) filesystems, and obviously it's
> even more terrible on 5400RPM drives because their seeks are sooo sloow.
> 
> If the problem is that the bgs are touched/iterated in "tree order",
> would it then not be possible to sort the block groups in physical order
> before trying to load whatever mount needs to load?

This is in fact a good idea:
make block groups into their own tree.

But it will take a lot of work to do, since we are modifying the on-disk
format.

In that case, a leaf with the default leaf size (16K) can store 678 block
group items.
And that many block groups can contain between 169G (256M metadata chunk
size) and 1.6T (10G max data chunk size) of data.

And even for tens of terabytes, a level-2 tree should handle it without
problem, and searching it should be quite fast.

The only problem is, I'm not sure there will be enough developer
interest in this idea, and the idea may have extra problems hidden.

Thanks,
Qu

> That way the entire
> process would involve less seeking (no backward seeks for one) and the
> drive could very likely get more done during a rotation before stepping
> further.
> 
> cheers,
> Holger
> 





Re: mount btrfs takes 30 minutes, btrfs check runs out of memory

2018-02-13 Thread John Ettedgui
On Tue, Feb 13, 2018 at 4:46 AM, Qu Wenruo  wrote:
>
>
> On 2018年02月13日 20:06, John Ettedgui wrote:
>
 That's fairly straightforward to do, though it should be quite slow so
 I'd hope not to have to do that too often.
>>>
>>> Then it could be tried on the most frequently updated files then.
>>
>> That's an interesting idea.
>> More than 3/4 of the data is just storage, so that should be very ok.
>
> BTW, how was the initial data created?
>
> If the initial data is all written once and doesn't get modified later,
> then the problem may not be fragments.
>
Mostly at once, when I recreated the FS a few years ago, and then
adding on to it slowly.
Though I do try to somewhat balance the free space on all partitions
of similar drives, so it may be a bit further from its original
condition.

>>
>>>
>>> And since you don't use snapshot, locate such files and then "chattr +C"
>>> would make them nodatacow, reducing later fragments.
>>
>> I don't understand, why would that reduce later fragments?
>
> Later overwrites will not create new extents, but overwrite existing
> extents, rather than CoWing and creating new extents (fragments).
>
> Although an expanding write will still create a new extent, that's
> unavoidable anyway.
>
That's why I didn't understand.
Fair enough!

Thank you!
John


Re: mount btrfs takes 30 minutes, btrfs check runs out of memory

2018-02-13 Thread Qu Wenruo


On 2018年02月13日 20:06, John Ettedgui wrote:

>>> That's fairly straightforward to do, though it should be quite slow so
>>> I'd hope not to have to do that too often.
>>
>> Then it could be tried on the most frequently updated files then.
> 
> That's an interesting idea.
> More than 3/4 of the data is just storage, so that should be very ok.

BTW, how was the initial data created?

If the initial data is all written once and doesn't get modified later,
then the problem may not be fragments.

> 
>>
>> And since you don't use snapshot, locate such files and then "chattr +C"
>> would make them nodatacow, reducing later fragments.
> 
> I don't understand, why would that reduce later fragments?

Later overwrites will not create new extents, but overwrite existing
extents, rather than CoWing and creating new extents (fragments).

Although an expanding write will still create a new extent, that's
unavoidable anyway.

Thanks,
Qu

> 
>>
This should act much better than traditional defrag, although it's
time-consuming and makes snapshots completely meaningless.
(and since you're already hitting ENOSPC, I don't think the idea is
 really working for you)

 And since you're already hitting ENOSPC, either it's caused by
 unbalanced meta/data usage, or it's really going to hit the limit; I would
 recommend enlarging the fs or deleting some files to see if it helps.

>>> Yup, I usually either slowly ramp up the {d,m}usage to pass it, or
>>> when that does not work I free some space, then balance will finish.
>>> Or did you mean to free some space to see about mount speed?
>>
>> Kind of, just do such freeing in advance, and try to make btrfs always
>> have unallocated space in case.
>>
> 
> I actually have very little free space on those partitions, usually
> under 90Gb, maybe that's part of my problem.
> 
>> And finally, use latest kernel if possible.
IIRC old kernels don't have empty block group auto-removal, which makes
users need to manually balance to free some space.
>>
>> Thanks,
>> Qu
>>
> 
> I am on 4.15 so no problem there.
> 
> So manual defrag and new FS to try.
> 
> Thank you!
> 





Re: mount btrfs takes 30 minutes, btrfs check runs out of memory

2018-02-13 Thread Holger Hoffstätte
On 02/13/18 12:40, Qu Wenruo wrote:
>>> The problem is not about how much space it takes, but how many extents
>>> are here in the filesystem.

I have no idea why btrfs' mount even needs to touch all block groups to
get going (which seems to be the root of the problem), but here's a
not so crazy idea for more "mechanical sympathy". Feel free to mock
me if this is terribly wrong or not possible. ;)

Mounting of even large filesystems (with many extents) seems to be fine
on SSDS, but not so fine on rotational storage. We've heard that from
several people with large (multi-TB) filesystems, and obviously it's
even more terrible on 5400RPM drives because their seeks are sooo sloow.

If the problem is that the bgs are touched/iterated in "tree order",
would it then not be possible to sort the block groups in physical order
before trying to load whatever mount needs to load? That way the entire
process would involve less seeking (no backward seeks for one) and the
drive could very likely get more done during a rotation before stepping
further.

cheers,
Holger





Re: mount btrfs takes 30 minutes, btrfs check runs out of memory

2018-02-13 Thread John Ettedgui
On Tue, Feb 13, 2018 at 3:40 AM, Qu Wenruo  wrote:
>
>
> On 2018年02月13日 19:25, John Ettedgui wrote:
>> On Tue, Feb 13, 2018 at 3:04 AM, Qu Wenruo  wrote:
>>>
>>>
>>>
>>> The problem is not about how much space it takes, but how many extents
>>> are here in the filesystem.
>>>
>>> For new fs filled with normal data, I'm pretty sure data extents will be
>>> as large as its maximum size (256M), causing very little or even no
>>> pressure to block group search.
>>>
>> What do you mean by "new fs",
>
> I mean the 4TB partition on that 5400rpm HDD.
>
>> was there any change that would improve
>> the behavior if I were to recreate the FS?
>
> If you backed up your fs, and recreate a new, empty btrfs on your
> original SSD, then copying all data back, I believe it would be much
> faster to mount.
>
Alright, I'll have to wait on getting some more drives for that but I
look forward to trying that.

>> Last time we talked I believe max extent was 128M for non-compressed
>> files, so maybe there's been some good change.
>
> My fault, 128M is correct.
>
>>> And since I went to SUSE, some mail/info was lost in the process.
>> I still have all mails, if you want it. No dump left though.
>>>
>>> Despite that, I have several more assumption to this problem:
>>>
>>> 1) Metadata usage bumped by inline files
>> What are inline files? Should I view this as inline in C, in that the
>> small files are stored in the tree directly?
>
> Exactly.
>
>>>If there are a lot of small files (<2K as default),
>> Of the slow to mount partitions:
>> 2 partitions have less than a dozen files smaller than 2K.
>> 1 has about 5 thousand and the last one 15 thousand.
>> Are the latter considered a lot?
>
> If using the default 16K nodesize, 8 small files take one leaf.
> And 15K small files means about 2K tree extents.
>
> Not that much in my opinion, can't even fill half of a metadata chunk.
>
>>> and your metadata
>>>usage is quite high (generally speaking, the meta:data ratio should be
>>>way below 1:8), that may be the cause.
>>>
>> The ratio is about 1:900 on average so that should be ok I guess.
>
> Yep, that should be fine.
> So not metadata to blame.
>
> Then purely fragmented data extents.
>
>>>If so, try mount the fs with "max_inline=0" mount option and then
>>>try to rewrite such small files.
>>>
>> Should I try that?
>
> No need, it won't cause much difference.

Alright!

>>> 2) SSD write amplification along with dynamic remapping
>>>To be honest, I'm not really buying this idea, since mount doesn't
>>>have anything related to write.
>>>But running fstrim won't harm anyway.
>>>
>> Oh I am not complaining about slow SSDs mounting. I was just amazed
>> that a partition on a slow HDD mounted faster.
>> Without any specific work, my SSD partitions tend to mount around 1 sec or 
>> so.
>> Of course I'd be happy to worry about them once all the partitions on
>> HDDs mount in a handful of ms :)
>>
>>> 3) Rewrite the existing files (extreme defrag)
>>>In fact, defrag doesn't work well if there are subvolumes/snapshots
>> I have no subvolume or snapshot so that's not a problem.
>>>/reflink involved.
>>>The most stupid and mindless way, is to write a small script and find
>>>all regular files, read them out and rewrite it back.
>>>
>> That's fairly straightforward to do, though it should be quite slow so
>> I'd hope not to have to do that too often.
>
> Then it could be tried on the most frequently updated files.

That's an interesting idea.
More than 3/4 of the data is just storage, so that should be very ok.

>
> And since you don't use snapshot, locate such files and then "chattr +C"
> would make them nodatacow, reducing later fragments.

I don't understand, why would that reduce later fragments?

>
>>>This should act much better than traditional defrag, although it's
>>>time-consuming and makes snapshot completely meaningless.
>>>(and since you're already hitting ENOSPC, I don't think the idea is
>>> really working for you)
>>>
>>> And since you're already hitting ENOSPC, either it's caused by
>>> unbalanced meta/data usage, or it's really going to hit the limit, I would
>>> recommend to enlarge the fs or delete some files to see if it helps.
>>>
>> Yup, I usually either slowly ramp up the {d,m}usage to pass it, or
>> when that does not work I free some space, then balance will finish.
>> Or did you mean to free some space to see about mount speed?
>
> Kind of, just do such freeing in advance, and try to make btrfs always
> have unallocated space in case.
>

I actually have very little free space on those partitions, usually
under 90Gb, maybe that's part of my problem.

> And finally, use latest kernel if possible.
> IIRC old kernel doesn't have empty block group auto remove, which makes
> user need to manually balance to free some space.
>
> Thanks,
> Qu
>

I am on 4.15 so no problem there.

So manual defrag and new FS to 

Re: mount btrfs takes 30 minutes, btrfs check runs out of memory

2018-02-13 Thread Qu Wenruo


On 2018-02-13 19:25, John Ettedgui wrote:
> On Tue, Feb 13, 2018 at 3:04 AM, Qu Wenruo  wrote:
>>
>>
>> On 2018-02-13 18:21, John Ettedgui wrote:
>>> Hello there,
>>>
>>> have you found anything good since then?
>>
>> Unfortunately, not really much to speed it up.
> Oh :/
>>
>> This reminds me of the old (and crazy) idea to skip block group build
>> for RO mount.
>> But not really helpful for it.
>>
>>> With a default system, the behavior is pretty much still the same,
>>> though I have not recreated the partitions since.
>>>
>>> Defrag helps, but I think balance helps even more.
>>> clear_cache may help too, but I'm not really sure as I've not tried it
>>> on its own.
>>> I was actually able to get a 4TB partition on a 5400rpm HDD to mount
>>> in around 500ms, quite a bit faster than even some GB partitions I have on
>>> SSDs! Alas I wrote some files to it and it's taking over a second
>>> again, so no more magic there.
>>
>> The problem is not about how much space it takes, but how many extents
>> are here in the filesystem.
>>
>> For new fs filled with normal data, I'm pretty sure data extents will be
>> as large as its maximum size (256M), causing very little or even no
>> pressure to block group search.
>>
> What do you mean by "new fs",

I mean the 4TB partition on that 5400rpm HDD.

> was there any change that would improve
> the behavior if I were to recreate the FS?

If you backed up your fs, and recreate a new, empty btrfs on your
original SSD, then copying all data back, I believe it would be much
faster to mount.

> Last time we talked I believe max extent was 128M for non-compressed
> files, so maybe there's been some good change.

My fault, 128M is correct.

>>>
>>> The workarounds do work, so it's still not a major issue, but they're
>>> slow and sometimes I have to workaround the "no space left on device"
>>> which then takes even more time.
>>
>> And since I went to SUSE, some mail/info is lost during the procedure.
> I still have all mails, if you want it. No dump left though.
>>
>> Despite that, I have several more assumptions about this problem:
>>
>> 1) Metadata usage bumped by inline files
> What are inline files? Should I view this as inline in C, in that the
> small files are stored in the tree directly?

Exactly.

>>If there are a lot of small files (<2K as default),
> Of the slow to mount partitions:
> 2 partitions have less than a dozen files smaller than 2K.
> 1 has about 5 thousand and the last one 15 thousand.
> Are the latter considered a lot?

If using the default 16K nodesize, 8 small files take one leaf.
And 15K small files means about 2K tree extents.

Not that much in my opinion, can't even fill half of a metadata chunk.
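The leaf estimate above can be reproduced with quick shell arithmetic. The numbers are the defaults assumed in this thread (16K nodesize, 2K max_inline), not measured values, and the calculation ignores leaf header/item overhead:

```shell
# Rough check of the leaf estimate above, ignoring leaf header/item overhead.
nodesize=16384     # default btrfs nodesize assumed in the thread
inline_max=2048    # default max_inline assumed in the thread
files=15000        # John's reported count of small files

per_leaf=$((nodesize / inline_max))              # inline files per 16K leaf
leaves=$(( (files + per_leaf - 1) / per_leaf ))  # leaves needed, rounded up

echo "$per_leaf files per leaf, about $leaves leaves"
```

So roughly 2K tree blocks of metadata, matching the "about 2K tree extents" figure above.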

>> and your metadata
>>usage is quite high (generally speaking, the meta:data ratio should be
>>way below 1:8), that may be the cause.
>>
> The ratio is about 1:900 on average so that should be ok I guess.

Yep, that should be fine.
So not metadata to blame.

Then purely fragmented data extents.

>>If so, try mounting the fs with the "max_inline=0" mount option and then
>>try to rewrite such small files.
>>
> Should I try that?

No need, it won't cause much difference.

>> 2) SSD write amplification along with dynamic remapping
>>To be honest, I'm not really buying this idea, since mount doesn't
>>have anything related to write.
>>But running fstrim won't harm anyway.
>>
> Oh I am not complaining about slow SSDs mounting. I was just amazed
> that a partition on a slow HDD mounted faster.
> Without any specific work, my SSDs partitions tend to mount around 1 sec or 
> so.
> Of course I'd be happy to worry about them once all the partitions on
> HDDs mount in a handful of ms :)
> 
>> 3) Rewrite the existing files (extreme defrag)
>>In fact, defrag doesn't work well if there are subvolumes/snapshots
> I have no subvolume or snapshot so that's not a problem.
>>/reflink involved.
>>The most stupid and mindless way, is to write a small script and find
>>all regular files, read them out and rewrite it back.
>>
> That's fairly straightforward to do, though it should be quite slow so
> I'd hope not to have to do that too often.

Then it could be tried on the most frequently updated files.

And since you don't use snapshot, locate such files and then "chattr +C"
would make them nodatacow, reducing later fragments.
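One catch with the "chattr +C" suggestion: the No_COW attribute only takes effect on empty (ideally newly created) files, so an existing file has to be copied into a fresh NOCOW file. A minimal sketch (the helper name is mine, not from the thread):

```shell
# Replace a file with a NOCOW copy of itself, so later overwrites happen
# in place instead of allocating new extents (reducing future fragments).
# chattr +C fails outside btrfs; it is deliberately ignored here.
make_nocow() {
    src="$1"
    tmp="$src.nocow.$$"
    touch "$tmp"                          # +C must be set while the file is empty
    chattr +C "$tmp" 2>/dev/null || true  # no-op on non-btrfs filesystems
    cat "$src" > "$tmp"                   # rewrite the data into the NOCOW file
    mv "$tmp" "$src"                      # replace the original
}
```

Note that nodatacow also disables checksumming and compression for those files, which is the usual trade-off for in-place writes.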

>>This should act much better than traditional defrag, although it's
>>time-consuming and makes snapshot completely meaningless.
>>(and since you're already hitting ENOSPC, I don't think the idea is
>> really working for you)
>>
>> And since you're already hitting ENOSPC, either it's caused by
>> unbalanced meta/data usage, or it's really going to hit the limit, I would
>> recommend to enlarge the fs or delete some files to see if it helps.
>>
> Yup, I usually either slowly ramp up the {d,m}usage to pass it, or
> when that does not work I free some space, then balance will 

Re: mount btrfs takes 30 minutes, btrfs check runs out of memory

2018-02-13 Thread John Ettedgui
On Tue, Feb 13, 2018 at 3:04 AM, Qu Wenruo  wrote:
>
>
> On 2018-02-13 18:21, John Ettedgui wrote:
>> Hello there,
>>
>> have you found anything good since then?
>
> Unfortunately, not really much to speed it up.
Oh :/
>
> This reminds me of the old (and crazy) idea to skip block group build
> for RO mount.
> But not really helpful for it.
>
>> With a default system, the behavior is pretty much still the same,
>> though I have not recreated the partitions since.
>>
>> Defrag helps, but I think balance helps even more.
>> clear_cache may help too, but I'm not really sure as I've not tried it
>> on its own.
>> I was actually able to get a 4TB partition on a 5400rpm HDD to mount
>> in around 500ms, quite a bit faster than even some GB partitions I have on
>> SSDs! Alas I wrote some files to it and it's taking over a second
>> again, so no more magic there.
>
> The problem is not about how much space it takes, but how many extents
> are here in the filesystem.
>
> For new fs filled with normal data, I'm pretty sure data extents will be
> as large as its maximum size (256M), causing very little or even no
> pressure to block group search.
>
What do you mean by "new fs", was there any change that would improve
the behavior if I were to recreate the FS?
Last time we talked I believe max extent was 128M for non-compressed
files, so maybe there's been some good change.
>>
>> The workarounds do work, so it's still not a major issue, but they're
>> slow and sometimes I have to workaround the "no space left on device"
>> which then takes even more time.
>
> And since I went to SUSE, some mail/info is lost during the procedure.
I still have all mails, if you want it. No dump left though.
>
> Despite that, I have several more assumptions about this problem:
>
> 1) Metadata usage bumped by inline files
What are inline files? Should I view this as inline in C, in that the
small files are stored in the tree directly?
>If there are a lot of small files (<2K as default),
Of the slow to mount partitions:
2 partitions have less than a dozen files smaller than 2K.
1 has about 5 thousand and the last one 15 thousand.
Are the latter considered a lot?
> and your metadata
>usage is quite high (generally speaking, the meta:data ratio should be
>way below 1:8), that may be the cause.
>
The ratio is about 1:900 on average so that should be ok I guess.
>If so, try mounting the fs with the "max_inline=0" mount option and then
>try to rewrite such small files.
>
Should I try that?
> 2) SSD write amplification along with dynamic remapping
>To be honest, I'm not really buying this idea, since mount doesn't
>have anything related to write.
>But running fstrim won't harm anyway.
>
Oh I am not complaining about slow SSDs mounting. I was just amazed
that a partition on a slow HDD mounted faster.
Without any specific work, my SSDs partitions tend to mount around 1 sec or so.
Of course I'd be happy to worry about them once all the partitions on
HDDs mount in a handful of ms :)

> 3) Rewrite the existing files (extreme defrag)
>In fact, defrag doesn't work well if there are subvolumes/snapshots
I have no subvolume or snapshot so that's not a problem.
>/reflink involved.
>The most stupid and mindless way, is to write a small script and find
>all regular files, read them out and rewrite it back.
>
That's fairly straightforward to do, though it should be quite slow so
I'd hope not to have to do that too often.
>This should act much better than traditional defrag, although it's
>time-consuming and makes snapshot completely meaningless.
>(and since you're already hitting ENOSPC, I don't think the idea is
> really working for you)
>
> And since you're already hitting ENOSPC, either it's caused by
> unbalanced meta/data usage, or it's really going to hit the limit, I would
> recommend to enlarge the fs or delete some files to see if it helps.
>
Yup, I usually either slowly ramp up the {d,m}usage to pass it, or
when that does not work I free some space, then balance will finish.
Or did you mean to free some space to see about mount speed?
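The "slowly ramp up the {d,m}usage" workaround can be scripted as a loop. This is a sketch (the helper name and the dry-run hook are mine), shown dry-runnable since a real run needs a mounted btrfs:

```shell
# Retry balance with increasingly aggressive usage filters: the early cheap
# passes relocate nearly-empty chunks and free space for the later passes.
# Pass "echo btrfs" as the second argument to dry-run the commands.
ramp_balance() {
    mnt="$1"
    run="${2:-btrfs}"
    for pct in 5 10 20 40 60 80; do
        $run balance start "-dusage=$pct" "-musage=$pct" "$mnt" || return 1
    done
}
```

Each pass only touches chunks whose usage is below the given percentage, so an early pass rarely hits ENOSPC and its freed space lets the fuller passes succeed.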
> Thanks,
> Qu
>

Thank you for the quick reply!
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: Verify extent allocated by find_free_extent() won't overlap with extents from previous transaction

2018-02-13 Thread Qu Wenruo


On 2018-02-13 09:13, Qu Wenruo wrote:
> There are reports on the mailing list that, even with the latest mainline
> kernel, btrfs can't survive a power loss.
> 
> Unlike journal-based filesystems, btrfs doesn't use a journal for such
> work. (the log tree is an optimization for fsync, not to keep the fs healthy)
> In btrfs we use metadata CoW to ensure all tree blocks are as atomic as
> the superblock.
> 
> This leads to an obvious assumption: some code breaking such metadata CoW
> makes btrfs no longer bullet-proof against power loss.
> 
> This patch adds extra runtime selftest to find_free_extent(), which
> will check the range in commit root of extent tree to ensure there is no
> overlap at all.
> 
> And hopefully this could help us catch the cause of the problem.
> 
> Signed-off-by: Qu Wenruo 
> ---
> Unfortunately, no new problem exposed yet.

Interestingly, I hit a bug while looping xfstests, where
btrfs_free_path() in check_extent_conflicts() caused an RCU error.

This seems quite interesting and may be the clue to the problem.

BTW, for anyone interested in this problem, the backtrace is:

ODEBUG: activate active (active state 1) object type: rcu_head hint:
  (null)
WARNING: CPU: 2 PID: 15435 at lib/debugobjects.c:291
debug_print_object+0xc1/0xe0
Modules linked in: dm_flakey ext4 mbcache jbd2 btrfs(O) xor
zstd_decompress zstd_compress xxhash raid6_pq efivarfs xfs dm_mod
virtio_scsi
CPU: 2 PID: 15435 Comm: kworker/u16:12 Tainted: G   O
4.15.0+ #32
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Workqueue: writeback wb_workfn (flush-btrfs-367)
RIP: 0010:debug_print_object+0xc1/0xe0
RSP: 0018:88001819d1c0 EFLAGS: 00010086
RAX: dc08 RBX: 0003 RCX: 0006
RDX: 0007 RSI: 110003033a01 RDI: 88006abde6d0
RBP: 0001 R08:  R09: 0001
R10: 88001819d0b0 R11:  R12: 82472120
R13: 8255cae0 R14:  R15: 880054737268
FS:  () GS:88006aa0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7f5e42c8d6c0 CR3: 608aa000 CR4: 003406e0
Call Trace:
 debug_object_activate+0x309/0x360
 __call_rcu.constprop.72+0x135/0xc50
 release_extent_buffer+0xf0/0x280 [btrfs]
 free_extent_buffer+0x16d/0x210 [btrfs]
 btrfs_release_path+0x2d/0xe0 [btrfs]
 btrfs_free_path.part.28+0xe/0x30 [btrfs]
 find_free_extent+0x1082/0x2440 [btrfs]
 btrfs_reserve_extent+0xda/0x1f0 [btrfs]
 cow_file_range.isra.67+0x223/0x5d0 [btrfs]
 run_delalloc_range+0x1cc/0x610 [btrfs]
 writepage_delalloc+0x1bb/0x2c0 [btrfs]
 __extent_writepage+0x422/0x630 [btrfs]
 extent_write_cache_pages.constprop.56+0x2f7/0x7b0 [btrfs]
 extent_writepages+0xce/0x110 [btrfs]
 do_writepages+0x8e/0xb0
 __writeback_single_inode+0x155/0xbf0
 writeback_sb_inodes+0x462/0x8e0
 wb_writeback+0x313/0x970
 wb_workfn+0x217/0xc80
 process_one_work+0x740/0xe40
 worker_thread+0x1c6/0xce0
 kthread+0x1ba/0x1e0
 ret_from_fork+0x24/0x30

Thanks,
Qu

> ---
>  fs/btrfs/extent-tree.c | 118 
> +
>  1 file changed, 118 insertions(+)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 2f4328511ac8..3b3cd82bce3a 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -7458,6 +7458,114 @@ btrfs_release_block_group(struct btrfs_block_group_cache *cache,
>   btrfs_put_block_group(cache);
>  }
>  
> +#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
> +/*
> + * Verify if the given extent range [@start, @start + @len) conflicts with
> + * any existing extent in the commit root.
> + *
> + * Btrfs doesn't use a journal, but depends on metadata (and data) CoW to keep
> + * the whole filesystem consistent against power loss.
> + * If we have overwritten any extent used by the previous trans (commit root),
> + * and a power loss happens, we will corrupt our filesystem.
> + *
> + * Return 0 if nothing is wrong.
> + * Return <0 (including ENOMEM) when something is wrong.
> + * Except for ENOMEM, this normally means we have extent conflicts with the
> + * previous transaction.
> + */
> +static int check_extent_conflicts(struct btrfs_fs_info *fs_info,
> +   u64 start, u64 len)
> +{
> + struct btrfs_key key;
> + struct btrfs_path *path;
> + struct btrfs_root *extent_root = fs_info->extent_root;
> + u64 extent_start;
> + u64 extent_len;
> + int ret;
> +
> + path = btrfs_alloc_path();
> + if (!path)
> + return -ENOMEM;
> +
> +
> + key.objectid = start + len;
> + key.type = 0;
> + key.offset = 0;
> +
> + /*
> +  * Here we use extent commit root to search for any conflicts.
> +  * If extent commit tree is already overwritten, we will get transid
> +  * error and error out anyway.
> +  * If extent commit tree is OK, but other extent conflicts, we will
> +  * find it.
> +  * So anyway, such search should be OK to 

Re: mount btrfs takes 30 minutes, btrfs check runs out of memory

2018-02-13 Thread Qu Wenruo


On 2018-02-13 18:21, John Ettedgui wrote:
> On Thu, Jul 21, 2016 at 1:19 AM, Qu Wenruo  wrote:
>>
>>
>> No more.
>>
>> The dump is already good enough for me to dig for some time.
>>
>> We don't usually get such large extent tree dump from a real world use case.
>>
>> It would help us in several ways, from determine how fragmented a block
>> group is to determine if a defrag will help.
>>
>> Thanks,
>> Qu
>>
>>
> 
> 
> Hello there,
> 
> have you found anything good since then?

Unfortunately, not really much to speed it up.

This reminds me of the old (and crazy) idea to skip block group build
for RO mount.
But not really helpful for it.

> With a default system, the behavior is pretty much still the same,
> though I have not recreated the partitions since.
> 
> Defrag helps, but I think balance helps even more.
> clear_cache may help too, but I'm not really sure as I've not tried it
> on its own.
> I was actually able to get a 4TB partition on a 5400rpm HDD to mount
> in around 500ms, quite a bit faster than even some GB partitions I have on
> SSDs! Alas I wrote some files to it and it's taking over a second
> again, so no more magic there.

The problem is not about how much space it takes, but how many extents
are there in the filesystem.

For a new fs filled with normal data, I'm pretty sure data extents will be
as large as its maximum size (256M), causing very little or even no
pressure to block group search.

> 
> The workarounds do work, so it's still not a major issue, but they're
> slow and sometimes I have to workaround the "no space left on device"
> which then takes even more time.

And since I moved to SUSE, some mail/info was lost in the process.

Despite that, I have several more assumptions about this problem:

1) Metadata usage bumped by inline files
   If there are a lot of small files (<2K as default), and your metadata
   usage is quite high (generally speaking, the meta:data ratio should be
   way below 1:8), that may be the cause.

   If so, try mounting the fs with the "max_inline=0" mount option and then
   try to rewrite such small files.

2) SSD write amplification along with dynamic remapping
   To be honest, I'm not really buying this idea, since mount doesn't
   have anything related to write.
   But running fstrim won't harm anyway.

3) Rewrite the existing files (extreme defrag)
   In fact, defrag doesn't work well if there are subvolumes/snapshots
   /reflink involved.
   The most stupid and mindless way, is to write a small script and find
   all regular files, read them out and rewrite it back.

   This should act much better than traditional defrag, although it's
   time-consuming and makes snapshot completely meaningless.
   (and since you're already hitting ENOSPC, I don't think the idea is
really working for you)
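The "small script" from item 3) could look like the sketch below. It assumes, as in this thread, that there are no snapshots or reflinks, so rewriting every file in place loses no sharing (the helper name is mine):

```shell
# "Extreme defrag": copy every regular file aside and move it back,
# forcing btrfs to allocate fresh (ideally large, contiguous) data extents.
rewrite_tree() {
    find "$1" -type f | while IFS= read -r f; do
        tmp="$f.rewrite.$$"
        cp -p -- "$f" "$tmp" || continue  # read old extents, write new ones
        mv -- "$tmp" "$f"                 # replace the fragmented copy
    done
}
```

As noted above, this is time-consuming and would break snapshot sharing, so it only makes sense on filesystems that have none.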

And since you're already hitting ENOSPC, either it's caused by
unbalanced meta/data usage, or it's really going to hit the limit, I would
recommend to enlarge the fs or delete some files to see if it helps.

Thanks,
Qu

> 
> Thank you!
> John
> 





Re: [PATCH v2] btrfs: fix endianness compatibility during the SB RW

2018-02-13 Thread Qu Wenruo


On 2018-02-13 18:27, Anand Jain wrote:
> 
> 
> On 02/13/2018 05:01 PM, Qu Wenruo wrote:
>>
>>
>> On 2018-02-13 11:00, Anand Jain wrote:
>>> Fixes the endianness bug in the fs_info::super_copy by using its
>>> btrfs_set_super...() function to set values in the SB, as these
>>> functions manage the endianness compatibility nicely.
>>>
>>> Signed-off-by: Anand Jain 
>>
>> Also went through all btrfs_super_block SETGET functions, grepping using
>> \>, seems that there are still some left here:
>>
>> fs/btrfs/sysfs.c:
>> In both btrfs_sectorsize_show() and btrfs_clone_alignment_show():
>> return snprintf(buf, PAGE_SIZE, "%u\n",
>>     fs_info->super_copy->sectorsize);
>>
>> In btrfs_nodesize_show():
>> return snprintf(buf, PAGE_SIZE, "%u\n",
>> fs_info->super_copy->nodesize);
> 
>  Oh. Thanks. Will fix. Maybe it's a good idea to add sysfs fixes
>  into a new patch.
> 
>> And what about cc this to stable kernel?
>> IIRC it's a very critical problem for btrfs.
>>
>> Maybe cc: sta...@vger.kernel.org # v3.2+?
> 
>  Thanks for the suggestion. Will do. Any idea what to do if the patch
>  applied on mainline ends up conflicting on LTS? Should I write a separate
>  patch for stable?

In fact I'm not sure about this either.

But considering most of the offending code hasn't changed since 2009, it
shouldn't cause many conflicts IIRC.

Thanks,
Qu

> 
> Thanks, Anand
> 
> 
>> Thanks,
>> Qu
>>
>>> ---
>>> v1->v2: Update change log. Update $Subject.
>>>  Old:
>>>  [PATCH] btrfs: use set functions to update latest refs to
>>> the SB
>>>   fs/btrfs/transaction.c | 20 
>>>   1 file changed, 12 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>>> index 04f07144b45c..9220f004001c 100644
>>> --- a/fs/btrfs/transaction.c
>>> +++ b/fs/btrfs/transaction.c
>>> @@ -1722,19 +1722,23 @@ static void update_super_roots(struct
>>> btrfs_fs_info *fs_info)
>>>     super = fs_info->super_copy;
>>>   +    /* update latest btrfs_super_block::chunk_root refs */
>>>   root_item = &fs_info->chunk_root->root_item;
>>> -    super->chunk_root = root_item->bytenr;
>>> -    super->chunk_root_generation = root_item->generation;
>>> -    super->chunk_root_level = root_item->level;
>>> +    btrfs_set_super_chunk_root(super, root_item->bytenr);
>>> +    btrfs_set_super_chunk_root_generation(super,
>>> root_item->generation);
>>> +    btrfs_set_super_chunk_root_level(super, root_item->level);
>>>   +    /* update latest btrfs_super_block::root refs */
>>>   root_item = &fs_info->tree_root->root_item;
>>> -    super->root = root_item->bytenr;
>>> -    super->generation = root_item->generation;
>>> -    super->root_level = root_item->level;
>>> +    btrfs_set_super_root(super, root_item->bytenr);
>>> +    btrfs_set_super_generation(super, root_item->generation);
>>> +    btrfs_set_super_root_level(super, root_item->level);
>>> +
>>>   if (btrfs_test_opt(fs_info, SPACE_CACHE))
>>> -    super->cache_generation = root_item->generation;
>>> +    btrfs_set_super_cache_generation(super, root_item->generation);
>>>   if (test_bit(BTRFS_FS_UPDATE_UUID_TREE_GEN, &fs_info->flags))
>>> -    super->uuid_tree_generation = root_item->generation;
>>> +    btrfs_set_super_uuid_tree_generation(super,
>>> + root_item->generation);
>>>   }
>>>     int btrfs_transaction_in_commit(struct btrfs_fs_info *info)
>>>
>>





Re: [PATCH v2] btrfs: fix endianness compatibility during the SB RW

2018-02-13 Thread Anand Jain



On 02/13/2018 05:01 PM, Qu Wenruo wrote:



On 2018-02-13 11:00, Anand Jain wrote:

Fixes the endianness bug in the fs_info::super_copy by using its
btrfs_set_super...() functions to set values in the SB, as these
functions manage the endianness compatibility nicely.

Signed-off-by: Anand Jain 


Also went through all btrfs_super_block SETGET functions, grepping using
\>, seems that there are still some left here:

fs/btrfs/sysfs.c:
In both btrfs_sectorsize_show() and btrfs_clone_alignment_show():
return snprintf(buf, PAGE_SIZE, "%u\n",
fs_info->super_copy->sectorsize);

In btrfs_nodesize_show():
return snprintf(buf, PAGE_SIZE, "%u\n", fs_info->super_copy->nodesize);


 Oh. Thanks. Will fix. Maybe it's a good idea to add sysfs fixes
 into a new patch.


And what about cc this to stable kernel?
IIRC it's a very critical problem for btrfs.

Maybe cc: sta...@vger.kernel.org # v3.2+?


 Thanks for the suggestion. Will do. Any idea what to do if the patch
 applied on mainline ends up conflicting on LTS? Should I write a separate
 patch for stable?

Thanks, Anand



Thanks,
Qu


---
v1->v2: Update change log. Update $Subject.
 Old:
 [PATCH] btrfs: use set functions to update latest refs to the SB
  fs/btrfs/transaction.c | 20 
  1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 04f07144b45c..9220f004001c 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1722,19 +1722,23 @@ static void update_super_roots(struct btrfs_fs_info 
*fs_info)
  
  	super = fs_info->super_copy;
  
+	/* update latest btrfs_super_block::chunk_root refs */

root_item = &fs_info->chunk_root->root_item;
-   super->chunk_root = root_item->bytenr;
-   super->chunk_root_generation = root_item->generation;
-   super->chunk_root_level = root_item->level;
+   btrfs_set_super_chunk_root(super, root_item->bytenr);
+   btrfs_set_super_chunk_root_generation(super, root_item->generation);
+   btrfs_set_super_chunk_root_level(super, root_item->level);
  
+	/* update latest btrfs_super_block::root refs */

root_item = &fs_info->tree_root->root_item;
-   super->root = root_item->bytenr;
-   super->generation = root_item->generation;
-   super->root_level = root_item->level;
+   btrfs_set_super_root(super, root_item->bytenr);
+   btrfs_set_super_generation(super, root_item->generation);
+   btrfs_set_super_root_level(super, root_item->level);
+
if (btrfs_test_opt(fs_info, SPACE_CACHE))
-   super->cache_generation = root_item->generation;
+   btrfs_set_super_cache_generation(super, root_item->generation);
	if (test_bit(BTRFS_FS_UPDATE_UUID_TREE_GEN, &fs_info->flags))
-   super->uuid_tree_generation = root_item->generation;
+   btrfs_set_super_uuid_tree_generation(super,
+root_item->generation);
  }
  
  int btrfs_transaction_in_commit(struct btrfs_fs_info *info)







Re: mount btrfs takes 30 minutes, btrfs check runs out of memory

2018-02-13 Thread John Ettedgui
On Thu, Jul 21, 2016 at 1:19 AM, Qu Wenruo  wrote:
>
>
> No more.
>
> The dump is already good enough for me to dig for some time.
>
> We don't usually get such large extent tree dump from a real world use case.
>
> It would help us in several ways, from determine how fragmented a block
> group is to determine if a defrag will help.
>
> Thanks,
> Qu
>
>


Hello there,

have you found anything good since then?
With a default system, the behavior is pretty much still the same,
though I have not recreated the partitions since.

Defrag helps, but I think balance helps even more.
clear_cache may help too, but I'm not really sure as I've not tried it
on its own.
I was actually able to get a 4TB partition on a 5400rpm HDD to mount
in around 500ms, quite a bit faster than even some GB partitions I have on
SSDs! Alas I wrote some files to it and it's taking over a second
again, so no more magic there.

The workarounds do work, so it's still not a major issue, but they're
slow and sometimes I have to workaround the "no space left on device"
which then takes even more time.

Thank you!
John


Re: [PATCH v2] btrfs: log, when replace, is canceled by the user

2018-02-13 Thread Qu Wenruo


On 2018-02-13 17:08, Nikolay Borisov wrote:
> 
> 
> On 13.02.2018 11:05, Qu Wenruo wrote:
>>
>>
>> On 2018-02-13 11:53, Anand Jain wrote:
>>> For forensic investigations of issues, we would want to know
>>> if and when the user cancels the replace.
>>>
>>> Signed-off-by: Anand Jain 
>>> ---
>>> v1->v2: use btrfs_dev_name() instead of rcu_str_deref()
>>> as btrfs_dev_name() also provides "missing" string,
>>> if needed.
>>
>> Would you please provide the branch you're using?
>>
>> As I couldn't find the function or macro of btrfs_dev_name()
> 
> 
> I have it in my misc-next checkout. It got introduced by
> 3c958bd23b60 ("btrfs: add helper for device path or missing")
> 
> which is in 4.16-rc1 so pretty recent.

I was on v4.15, no wonder no such function.

Thank you very much,
Qu

> 
> 
>>
>> Thanks,
>> Qu
>>
>>>
>>>  fs/btrfs/dev-replace.c | 7 +++
>>>  1 file changed, 7 insertions(+)
>>>
>>> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
>>> index dd717e204b5e..3b0760f7ec8a 100644
>>> --- a/fs/btrfs/dev-replace.c
>>> +++ b/fs/btrfs/dev-replace.c
>>> @@ -703,6 +703,7 @@ int btrfs_dev_replace_cancel(struct btrfs_fs_info 
>>> *fs_info)
>>>  {
>>> struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
>>> struct btrfs_device *tgt_device = NULL;
>>> +   struct btrfs_device *src_device = NULL;
>>> struct btrfs_trans_handle *trans;
>>> struct btrfs_root *root = fs_info->tree_root;
>>> int result;
>>> @@ -724,6 +725,7 @@ int btrfs_dev_replace_cancel(struct btrfs_fs_info 
>>> *fs_info)
>>> case BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED:
>>> result = BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_ERROR;
>>> tgt_device = dev_replace->tgtdev;
>>> +   src_device = dev_replace->srcdev;
>>> dev_replace->tgtdev = NULL;
>>> dev_replace->srcdev = NULL;
>>> break;
>>> @@ -741,6 +743,11 @@ int btrfs_dev_replace_cancel(struct btrfs_fs_info 
>>> *fs_info)
>>> }
>>> ret = btrfs_commit_transaction(trans);
>>> WARN_ON(ret);
>>> +
>>> +   btrfs_info(fs_info, "dev_replace from %s (devid %llu) to %s canceled",
>>> +  btrfs_dev_name(src_device), src_device->devid,
>>> +  btrfs_dev_name(tgt_device));
>>> +
>>> if (tgt_device)
>>> btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
>>>  
>>>
>>
> 





[PATCH] btrfs: manage subvolid mount option as %u

2018-02-13 Thread Anand Jain
As the -o subvolid mount option is a u64, manage it as %u for
token verification instead of %s.

Signed-off-by: Anand Jain 
---
 fs/btrfs/super.c | 23 ++-
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 8112619cac95..bf629c8a6a47 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -336,7 +336,7 @@ enum {
 static const match_table_t tokens = {
{Opt_degraded, "degraded"},
{Opt_subvol, "subvol=%s"},
-   {Opt_subvolid, "subvolid=%s"},
+   {Opt_subvolid, "subvolid=%u"},
{Opt_device, "device=%s"},
{Opt_nodatasum, "nodatasum"},
{Opt_datasum, "datasum"},
@@ -964,8 +964,8 @@ static int btrfs_parse_subvol_options(const char *options, 
fmode_t flags,
 {
substring_t args[MAX_OPT_ARGS];
char *opts, *orig, *p;
-   char *num = NULL;
int error = 0;
+   u64 subvolid;
 
if (!options)
return 0;
@@ -995,18 +995,15 @@ static int btrfs_parse_subvol_options(const char 
*options, fmode_t flags,
}
break;
case Opt_subvolid:
-   num = match_strdup(&args[0]);
-   if (num) {
-   *subvol_objectid = memparse(num, NULL);
-   kfree(num);
-   /* we want the original fs_tree */
-   if (!*subvol_objectid)
-   *subvol_objectid =
-   BTRFS_FS_TREE_OBJECTID;
-   } else {
-   error = -EINVAL;
+   error = match_u64(&args[0], &subvolid);
+   if (error)
goto out;
-   }
+
+   /* we want the original fs_tree */
+   if (subvolid == 0)
+   subvolid = BTRFS_FS_TREE_OBJECTID;
+
+   *subvol_objectid = subvolid;
break;
case Opt_subvolrootid:
pr_warn("BTRFS: 'subvolrootid' mount option is 
deprecated and has no effect\n");
-- 
2.7.0



[PATCH] btrfs: manage commit mount option as %u

2018-02-13 Thread Anand Jain
As the -o commit mount option is unsigned, manage it as %u
for token verification, instead of %d.

Signed-off-by: Anand Jain 
---
 fs/btrfs/ctree.h |  2 +-
 fs/btrfs/super.c | 26 ++
 2 files changed, 11 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 23b7e6756b15..80d1d4d12f9d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -788,7 +788,7 @@ struct btrfs_fs_info {
unsigned long pending_changes;
unsigned long compress_type:4;
unsigned int compress_level;
-   int commit_interval;
+   u32 commit_interval;
/*
 * It is a suggestive number, the read side is safe even it gets a
 * wrong number because we will write out the data into a regular
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 83e11bbacc17..0a299a0ded3a 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -385,7 +385,7 @@ static const match_table_t tokens = {
{Opt_check_integrity_print_mask, "check_int_print_mask=%u"},
{Opt_rescan_uuid_tree, "rescan_uuid_tree"},
{Opt_fatal_errors, "fatal_errors=%s"},
-   {Opt_commit_interval, "commit=%d"},
+   {Opt_commit_interval, "commit=%u"},
 #ifdef CONFIG_BTRFS_DEBUG
{Opt_fragment_data, "fragment=data"},
{Opt_fragment_metadata, "fragment=metadata"},
@@ -777,24 +777,18 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
case Opt_commit_interval:
intarg = 0;
ret = match_int(&args[0], &intarg);
-   if (ret < 0) {
-   btrfs_err(info, "invalid commit interval");
-   ret = -EINVAL;
+   if (ret)
goto out;
-   }
-   if (intarg > 0) {
-   if (intarg > 300) {
-   btrfs_warn(info,
-   "excessive commit interval %d",
-   intarg);
-   }
-   info->commit_interval = intarg;
-   } else {
+   if (intarg == 0) {
btrfs_info(info,
-  "using default commit interval %ds",
+  "using default commit interval %us",
   BTRFS_DEFAULT_COMMIT_INTERVAL);
-   info->commit_interval = BTRFS_DEFAULT_COMMIT_INTERVAL;
+   intarg = BTRFS_DEFAULT_COMMIT_INTERVAL;
+   } else if (intarg > 300) {
+   btrfs_warn(info, "excessive commit interval %d",
+  intarg);
}
+   info->commit_interval = intarg;
break;
 #ifdef CONFIG_BTRFS_DEBUG
case Opt_fragment_all:
@@ -1363,7 +1357,7 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
if (btrfs_test_opt(info, PANIC_ON_FATAL_ERROR))
seq_puts(seq, ",fatal_errors=panic");
if (info->commit_interval != BTRFS_DEFAULT_COMMIT_INTERVAL)
-   seq_printf(seq, ",commit=%d", info->commit_interval);
+   seq_printf(seq, ",commit=%u", info->commit_interval);
 #ifdef CONFIG_BTRFS_DEBUG
if (btrfs_test_opt(info, FRAGMENT_DATA))
seq_puts(seq, ",fragment=data");
-- 
2.7.0



[PATCH] btrfs: manage thread_pool mount option as %u

2018-02-13 Thread Anand Jain
-o thread_pool is always unsigned. Manage it that way all around.

Signed-off-by: Anand Jain 
---
 fs/btrfs/ctree.h   |  2 +-
 fs/btrfs/disk-io.c |  4 ++--
 fs/btrfs/super.c   | 13 ++---
 3 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0b738f0a9a23..23b7e6756b15 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -935,7 +935,7 @@ struct btrfs_fs_info {
struct btrfs_workqueue *extent_workers;
struct task_struct *transaction_kthread;
struct task_struct *cleaner_kthread;
-   int thread_pool_size;
+   u32 thread_pool_size;
 
struct kobject *space_info_kobj;
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 87d24c903152..3cec2455b603 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2180,7 +2180,7 @@ static void btrfs_init_qgroup(struct btrfs_fs_info *fs_info)
 static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info,
struct btrfs_fs_devices *fs_devices)
 {
-   int max_active = fs_info->thread_pool_size;
+   u32 max_active = fs_info->thread_pool_size;
unsigned int flags = WQ_MEM_RECLAIM | WQ_FREEZABLE | WQ_UNBOUND;
 
fs_info->workers =
@@ -2401,7 +2401,7 @@ int open_ctree(struct super_block *sb,
int err = -EINVAL;
int num_backups_tried = 0;
int backup_index = 0;
-   int max_active;
+   u32 max_active;
int clear_free_space_tree = 0;
 
tree_root = fs_info->tree_root = btrfs_alloc_root(fs_info, GFP_KERNEL);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 02c7766e6849..8112619cac95 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -346,7 +346,7 @@ static const match_table_t tokens = {
{Opt_barrier, "barrier"},
{Opt_max_inline, "max_inline=%u"},
{Opt_alloc_start, "alloc_start=%s"},
-   {Opt_thread_pool, "thread_pool=%d"},
+   {Opt_thread_pool, "thread_pool=%u"},
{Opt_compress, "compress"},
{Opt_compress_type, "compress=%s"},
{Opt_compress_force, "compress-force"},
@@ -596,12 +596,11 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
ret = match_int(&args[0], &intarg);
if (ret) {
goto out;
-   } else if (intarg > 0) {
-   info->thread_pool_size = intarg;
-   } else {
+   } else if (intarg == 0) {
ret = -EINVAL;
goto out;
}
+   info->thread_pool_size = intarg;
break;
case Opt_max_inline:
ret = match_int(&args[0], &intarg);
@@ -1317,7 +1316,7 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
seq_printf(seq, ",max_inline=%u", info->max_inline);
if (info->thread_pool_size !=  min_t(unsigned long,
 num_online_cpus() + 2, 8))
-   seq_printf(seq, ",thread_pool=%d", info->thread_pool_size);
+   seq_printf(seq, ",thread_pool=%u", info->thread_pool_size);
if (btrfs_test_opt(info, COMPRESS)) {
compress_type = btrfs_compress_type2str(info->compress_type);
if (btrfs_test_opt(info, FORCE_COMPRESS))
@@ -1723,7 +1722,7 @@ static struct dentry *btrfs_mount(struct file_system_type *fs_type, int flags,
 }
 
 static void btrfs_resize_thread_pool(struct btrfs_fs_info *fs_info,
-int new_pool_size, int old_pool_size)
+u32 new_pool_size, u32 old_pool_size)
 {
if (new_pool_size == old_pool_size)
return;
@@ -1791,7 +1790,7 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
unsigned long old_opts = fs_info->mount_opt;
unsigned long old_compress_type = fs_info->compress_type;
u32 old_max_inline = fs_info->max_inline;
-   int old_thread_pool_size = fs_info->thread_pool_size;
+   u32 old_thread_pool_size = fs_info->thread_pool_size;
unsigned int old_metadata_ratio = fs_info->metadata_ratio;
int ret;
 
-- 
2.7.0



[PATCH] btrfs: manage check_int_print_mask mount option as %u

2018-02-13 Thread Anand Jain
As the -o check_int_print_mask mount option is unsigned, manage it as %u
for token verification, instead of %d.

Signed-off-by: Anand Jain 
---
 fs/btrfs/super.c | 16 +---
 1 file changed, 5 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index b925a649a01a..83e11bbacc17 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -382,7 +382,7 @@ static const match_table_t tokens = {
{Opt_skip_balance, "skip_balance"},
{Opt_check_integrity, "check_int"},
{Opt_check_integrity_including_extent_data, "check_int_data"},
-   {Opt_check_integrity_print_mask, "check_int_print_mask=%d"},
+   {Opt_check_integrity_print_mask, "check_int_print_mask=%u"},
{Opt_rescan_uuid_tree, "rescan_uuid_tree"},
{Opt_fatal_errors, "fatal_errors=%s"},
{Opt_commit_interval, "commit=%d"},
@@ -747,17 +747,11 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
break;
case Opt_check_integrity_print_mask:
ret = match_int(&args[0], &intarg);
-   if (ret) {
-   goto out;
-   } else if (intarg >= 0) {
-   info->check_integrity_print_mask = intarg;
-   btrfs_info(info,
-  "check_integrity_print_mask 0x%x",
-  info->check_integrity_print_mask);
-   } else {
-   ret = -EINVAL;
+   if (ret)
goto out;
-   }
+   info->check_integrity_print_mask = intarg;
+   btrfs_info(info, "check_integrity_print_mask 0x%x",
+  info->check_integrity_print_mask);
break;
 #else
case Opt_check_integrity_including_extent_data:
-- 
2.7.0



[PATCH] btrfs: manage metadata_ratio mount option as %u

2018-02-13 Thread Anand Jain
As the -o metadata_ratio mount option is unsigned, manage it as %u for
token verification, instead of %d.

Signed-off-by: Anand Jain 
---
 fs/btrfs/super.c | 17 ++---
 1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index bf629c8a6a47..b925a649a01a 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -362,7 +362,7 @@ static const match_table_t tokens = {
{Opt_norecovery, "norecovery"},
{Opt_flushoncommit, "flushoncommit"},
{Opt_noflushoncommit, "noflushoncommit"},
-   {Opt_ratio, "metadata_ratio=%d"},
+   {Opt_ratio, "metadata_ratio=%u"},
{Opt_discard, "discard"},
{Opt_nodiscard, "nodiscard"},
{Opt_space_cache, "space_cache"},
@@ -648,16 +648,11 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
break;
case Opt_ratio:
ret = match_int(&args[0], &intarg);
-   if (ret) {
-   goto out;
-   } else if (intarg >= 0) {
-   info->metadata_ratio = intarg;
-   btrfs_info(info, "metadata ratio %d",
-  info->metadata_ratio);
-   } else {
-   ret = -EINVAL;
+   if (ret)
goto out;
-   }
+   info->metadata_ratio = intarg;
+   btrfs_info(info, "metadata ratio %u",
+  info->metadata_ratio);
break;
case Opt_discard:
btrfs_set_and_info(info, DISCARD,
@@ -1369,7 +1364,7 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
info->check_integrity_print_mask);
 #endif
if (info->metadata_ratio)
-   seq_printf(seq, ",metadata_ratio=%d",
+   seq_printf(seq, ",metadata_ratio=%u",
info->metadata_ratio);
if (btrfs_test_opt(info, PANIC_ON_FATAL_ERROR))
seq_puts(seq, ",fatal_errors=panic");
-- 
2.7.0



[PATCH] btrfs: fix bare unsigned declarations

2018-02-13 Thread Anand Jain
Kernel style prefers "unsigned int" over bare "unsigned",
and "signed int" over bare "signed".

Signed-off-by: Anand Jain 
---
 fs/btrfs/ctree.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 80d1d4d12f9d..c6de1a5281ca 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -993,8 +993,8 @@ struct btrfs_fs_info {
struct btrfs_balance_control *balance_ctl;
wait_queue_head_t balance_wait_q;
 
-   unsigned data_chunk_allocations;
-   unsigned metadata_ratio;
+   u32 data_chunk_allocations;
+   u32 metadata_ratio;
 
void *bdev_holder;
 
-- 
2.7.0



[PATCH] btrfs: add a comment to mark the deprecated mount option

2018-02-13 Thread Anand Jain
-o alloc_start and -o subvolrootid are deprecated mount options;
mark them with a comment in the tokens list. Otherwise leave them
as they are. No functional changes.

Signed-off-by: Anand Jain 
---
 fs/btrfs/super.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 0a299a0ded3a..7771e690460c 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -345,7 +345,7 @@ static const match_table_t tokens = {
{Opt_nobarrier, "nobarrier"},
{Opt_barrier, "barrier"},
{Opt_max_inline, "max_inline=%u"},
-   {Opt_alloc_start, "alloc_start=%s"},
+   {Opt_alloc_start, "alloc_start=%s"}, /* deprecated */
{Opt_thread_pool, "thread_pool=%u"},
{Opt_compress, "compress"},
{Opt_compress_type, "compress=%s"},
@@ -371,7 +371,7 @@ static const match_table_t tokens = {
{Opt_user_subvol_rm_allowed, "user_subvol_rm_allowed"},
{Opt_enospc_debug, "enospc_debug"},
{Opt_noenospc_debug, "noenospc_debug"},
-   {Opt_subvolrootid, "subvolrootid=%d"},
+   {Opt_subvolrootid, "subvolrootid=%d"}, /* deprecated */
{Opt_defrag, "autodefrag"},
{Opt_nodefrag, "noautodefrag"},
{Opt_inode_cache, "inode_cache"},
-- 
2.7.0



[PATCH v2 1/2] btrfs: declare max_inline as u32

2018-02-13 Thread Anand Jain
As of now btrfs_fs_info::max_inline is u64, but it can't be
larger than btrfs_fs_info::sectorsize, which is defined as
u32, so make btrfs_fs_info::max_inline u32.
Signed-off-by: Anand Jain 
---
v1->v2: Born in v2.
 fs/btrfs/ctree.h | 2 +-
 fs/btrfs/super.c | 8 
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4759e988b0df..0b738f0a9a23 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -795,7 +795,7 @@ struct btrfs_fs_info {
 * extent. The write side(mount/remount) is under ->s_umount lock,
 * so it is also safe.
 */
-   u64 max_inline;
+   u32 max_inline;
 
struct btrfs_transaction *running_transaction;
wait_queue_head_t transaction_throttle;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 383be3609cc9..b13d68506f15 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -610,11 +610,11 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
kfree(num);
 
if (info->max_inline) {
-   info->max_inline = min_t(u64,
+   info->max_inline = min_t(u32,
info->max_inline,
info->sectorsize);
}
-   btrfs_info(info, "max_inline at %llu",
+   btrfs_info(info, "max_inline at %u",
   info->max_inline);
} else {
ret = -ENOMEM;
@@ -1325,7 +1325,7 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
if (btrfs_test_opt(info, NOBARRIER))
seq_puts(seq, ",nobarrier");
if (info->max_inline != BTRFS_DEFAULT_MAX_INLINE)
-   seq_printf(seq, ",max_inline=%llu", info->max_inline);
+   seq_printf(seq, ",max_inline=%u", info->max_inline);
if (info->thread_pool_size !=  min_t(unsigned long,
 num_online_cpus() + 2, 8))
seq_printf(seq, ",thread_pool=%d", info->thread_pool_size);
@@ -1801,7 +1801,7 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
unsigned old_flags = sb->s_flags;
unsigned long old_opts = fs_info->mount_opt;
unsigned long old_compress_type = fs_info->compress_type;
-   u64 old_max_inline = fs_info->max_inline;
+   u32 old_max_inline = fs_info->max_inline;
int old_thread_pool_size = fs_info->thread_pool_size;
unsigned int old_metadata_ratio = fs_info->metadata_ratio;
int ret;
-- 
2.7.0



[PATCH v2 2/2] btrfs: verify max_inline mount parameter

2018-02-13 Thread Anand Jain
We aren't verifying the parameter passed to the max_inline mount option,
so we won't report and fail the mount if a junk value is specified for
example, -o max_inline=abc.
This patch converts the max_inline option to %u and checks if it's a
number >= 0.

Signed-off-by: Anand Jain 
---
v1->v2: use match_int ret value if error
use %u instead of %d for parser

 fs/btrfs/super.c | 23 ++-
 1 file changed, 6 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index b13d68506f15..02c7766e6849 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -344,7 +344,7 @@ static const match_table_t tokens = {
{Opt_datacow, "datacow"},
{Opt_nobarrier, "nobarrier"},
{Opt_barrier, "barrier"},
-   {Opt_max_inline, "max_inline=%s"},
+   {Opt_max_inline, "max_inline=%u"},
{Opt_alloc_start, "alloc_start=%s"},
{Opt_thread_pool, "thread_pool=%d"},
{Opt_compress, "compress"},
@@ -407,7 +407,7 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
unsigned long new_flags)
 {
substring_t args[MAX_OPT_ARGS];
-   char *p, *num;
+   char *p;
u64 cache_gen;
int intarg;
int ret = 0;
@@ -604,22 +604,11 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
}
break;
case Opt_max_inline:
-   num = match_strdup(&args[0]);
-   if (num) {
-   info->max_inline = memparse(num, NULL);
-   kfree(num);
-
-   if (info->max_inline) {
-   info->max_inline = min_t(u32,
-   info->max_inline,
-   info->sectorsize);
-   }
-   btrfs_info(info, "max_inline at %u",
-  info->max_inline);
-   } else {
-   ret = -ENOMEM;
+   ret = match_int(&args[0], &intarg);
+   if (ret)
goto out;
-   }
+   info->max_inline = min_t(u32, intarg, info->sectorsize);
+   btrfs_info(info, "max_inline at %u", info->max_inline);
break;
case Opt_alloc_start:
btrfs_info(info,
-- 
2.7.0



Re: [PATCH] btrfs: verify max_inline mount parameter

2018-02-13 Thread Anand Jain



On 02/13/2018 01:13 AM, David Sterba wrote:

On Mon, Feb 12, 2018 at 11:35:46PM +0800, Anand Jain wrote:

We aren't verifying the parameter passed to the max_inline mount option,
so we won't report and fail the mount if a junk value is specified for
example, -o max_inline=abc.
This patch converts the max_inline option to %d and checks if it's a
number >= 0.


Oh right, the parser can verify that for us. As the %d gets stored to
int, there's no reason to store fs_info::max_inline as u64.


 fixed this in the patch set sent out.


Looking at the parser capabilities, it accepts %u and matches an
unsigned type, which we use for all our options. It would be good to
unify that too.


 fixed this too.


Please update this patch and possibly send more cleaning up the rest of
the match strings, and the max_inline type. Thanks.


 Sure. will do.

Thanks, Anand




Re: [PATCH v2] btrfs: log, when replace, is canceled by the user

2018-02-13 Thread Nikolay Borisov


On 13.02.2018 11:05, Qu Wenruo wrote:
> 
> 
> On 2018年02月13日 11:53, Anand Jain wrote:
>> For forensic investigations of issues, we would want to know
>> if and when the user cancels the replace.
>>
>> Signed-off-by: Anand Jain
>> ---
>> v1->v2: use btrfs_dev_name() instead of rcu_str_deref()
>> as btrfs_dev_name() also provides "missing" string,
>> if needed.
> 
> Would you please provide the branch you're using?
> 
> As I couldn't find the function or macro of btrfs_dev_name()


I have it in my misc-next checkout. It got introduced by
3c958bd23b60 ("btrfs: add helper for device path or missing")

which is in 4.16-rc1 so pretty recent.


> 
> Thanks,
> Qu
> 
>>
>>  fs/btrfs/dev-replace.c | 7 +++
>>  1 file changed, 7 insertions(+)
>>
>> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
>> index dd717e204b5e..3b0760f7ec8a 100644
>> --- a/fs/btrfs/dev-replace.c
>> +++ b/fs/btrfs/dev-replace.c
>> @@ -703,6 +703,7 @@ int btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info)
>>  {
>>  struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
>>  struct btrfs_device *tgt_device = NULL;
>> +struct btrfs_device *src_device = NULL;
>>  struct btrfs_trans_handle *trans;
>>  struct btrfs_root *root = fs_info->tree_root;
>>  int result;
>> @@ -724,6 +725,7 @@ int btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info)
>>  case BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED:
>>  result = BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_ERROR;
>>  tgt_device = dev_replace->tgtdev;
>> +src_device = dev_replace->srcdev;
>>  dev_replace->tgtdev = NULL;
>>  dev_replace->srcdev = NULL;
>>  break;
>> @@ -741,6 +743,11 @@ int btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info)
>>  }
>>  ret = btrfs_commit_transaction(trans);
>>  WARN_ON(ret);
>> +
>> +btrfs_info(fs_info, "dev_replace from %s (devid %llu) to %s canceled",
>> +   btrfs_dev_name(src_device), src_device->devid,
>> +   btrfs_dev_name(tgt_device));
>> +
>>  if (tgt_device)
>>  btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
>>  
>>
> 


Re: [PATCH v2] btrfs: log, when replace, is canceled by the user

2018-02-13 Thread Qu Wenruo


On 2018年02月13日 11:53, Anand Jain wrote:
> For forensic investigations of issues, we would want to know
> if and when the user cancels the replace.
> 
> Signed-off-by: Anand Jain
> ---
> v1->v2: use btrfs_dev_name() instead of rcu_str_deref()
> as btrfs_dev_name() also provides "missing" string,
> if needed.

Would you please provide the branch you're using?

As I couldn't find the function or macro of btrfs_dev_name()

Thanks,
Qu

> 
>  fs/btrfs/dev-replace.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> index dd717e204b5e..3b0760f7ec8a 100644
> --- a/fs/btrfs/dev-replace.c
> +++ b/fs/btrfs/dev-replace.c
> @@ -703,6 +703,7 @@ int btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info)
>  {
>   struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
>   struct btrfs_device *tgt_device = NULL;
> + struct btrfs_device *src_device = NULL;
>   struct btrfs_trans_handle *trans;
>   struct btrfs_root *root = fs_info->tree_root;
>   int result;
> @@ -724,6 +725,7 @@ int btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info)
>   case BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED:
>   result = BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_ERROR;
>   tgt_device = dev_replace->tgtdev;
> + src_device = dev_replace->srcdev;
>   dev_replace->tgtdev = NULL;
>   dev_replace->srcdev = NULL;
>   break;
> @@ -741,6 +743,11 @@ int btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info)
>   }
>   ret = btrfs_commit_transaction(trans);
>   WARN_ON(ret);
> +
> + btrfs_info(fs_info, "dev_replace from %s (devid %llu) to %s canceled",
> +btrfs_dev_name(src_device), src_device->devid,
> +btrfs_dev_name(tgt_device));
> +
>   if (tgt_device)
>   btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
>  
> 





Re: [PATCH v2] btrfs: fix endianness compatibility during the SB RW

2018-02-13 Thread Qu Wenruo


On 2018年02月13日 11:00, Anand Jain wrote:
> Fixes the endianness bug in the fs_info::super_copy by using its
> btrfs_set_super...() function to set values in the SB, as these
> functions manage the endianness compatibility nicely.
> 
> Signed-off-by: Anand Jain 

Also went through all btrfs_super_block SETGET functions, grepping using
\>, and it seems that there are still some left here:

fs/btrfs/sysfs.c:
In both btrfs_sectorsize_show() and btrfs_clone_alignment_show():
return snprintf(buf, PAGE_SIZE, "%u\n",
fs_info->super_copy->sectorsize);

In btrfs_nodesize_show():
return snprintf(buf, PAGE_SIZE, "%u\n", fs_info->super_copy->nodesize);


And what about cc this to stable kernel?
IIRC it's a very critical problem for btrfs.

Maybe cc: sta...@vger.kernel.org # v3.2+?

Thanks,
Qu

> ---
> v1->v2: Update change log. Update $Subject.
> Old:
> [PATCH] btrfs: use set functions to update latest refs to the SB
>  fs/btrfs/transaction.c | 20 
>  1 file changed, 12 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 04f07144b45c..9220f004001c 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1722,19 +1722,23 @@ static void update_super_roots(struct btrfs_fs_info *fs_info)
>  
>   super = fs_info->super_copy;
>  
> + /* update latest btrfs_super_block::chunk_root refs */
>   root_item = &fs_info->chunk_root->root_item;
> - super->chunk_root = root_item->bytenr;
> - super->chunk_root_generation = root_item->generation;
> - super->chunk_root_level = root_item->level;
> + btrfs_set_super_chunk_root(super, root_item->bytenr);
> + btrfs_set_super_chunk_root_generation(super, root_item->generation);
> + btrfs_set_super_chunk_root_level(super, root_item->level);
>  
> + /* update latest btrfs_super_block::root refs */
>   root_item = &fs_info->tree_root->root_item;
> - super->root = root_item->bytenr;
> - super->generation = root_item->generation;
> - super->root_level = root_item->level;
> + btrfs_set_super_root(super, root_item->bytenr);
> + btrfs_set_super_generation(super, root_item->generation);
> + btrfs_set_super_root_level(super, root_item->level);
> +
>   if (btrfs_test_opt(fs_info, SPACE_CACHE))
> - super->cache_generation = root_item->generation;
> + btrfs_set_super_cache_generation(super, root_item->generation);
>   if (test_bit(BTRFS_FS_UPDATE_UUID_TREE_GEN, &fs_info->flags))
> - super->uuid_tree_generation = root_item->generation;
> + btrfs_set_super_uuid_tree_generation(super,
> +  root_item->generation);
>  }
>  
>  int btrfs_transaction_in_commit(struct btrfs_fs_info *info)
> 





Re: [PATCH] btrfs: Verify extent allocated by find_free_extent() won't overlap with extents from previous transaction

2018-02-13 Thread Qu Wenruo


On 2018年02月13日 15:20, Nikolay Borisov wrote:
> 
> 
> On 13.02.2018 03:13, Qu Wenruo wrote:
>> There are reports on the mailing list that even with the latest mainline
>> kernel, btrfs can't survive a power loss.
>>
>> Unlike journal based filesystem, btrfs doesn't use journal for such
>> work. (log tree is an optimization for fsync, not to keep fs healthy)
>> In btrfs we use metadata CoW to ensure all tree blocks are as atomic as
>> superblock.
>>
>> This leads to an obvious assumption, some code breaks such metadata CoW
>> makes btrfs no longer bullet-proof against power loss.
>>
>> This patch adds extra runtime selftest to find_free_extent(), which
>> will check the range in commit root of extent tree to ensure there is no
>> overlap at all.
>>
>> And hopefully this could help us catch the cause of the problem.
>>
>> Signed-off-by: Qu Wenruo 
>> ---
>> Unfortunately, no new problem exposed yet.
>> ---
>>  fs/btrfs/extent-tree.c | 118 +
>>  1 file changed, 118 insertions(+)
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 2f4328511ac8..3b3cd82bce3a 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -7458,6 +7458,114 @@ btrfs_release_block_group(struct btrfs_block_group_cache *cache,
>>  btrfs_put_block_group(cache);
>>  }
>>  
>> +#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
>> +/*
>> + * Verify if the given extent range [@start, @start + @len) conflicts any
>> + * existing extent in commit root.
>> + *
>> + * Btrfs doesn't use journal, but depends on metadata (and data) CoW to keep
>> + * the whole filesystem consistent against powerloss.
>> + * If we have overwritten any extent used by the previous trans (commit root),
>> + * and a power loss happens, we will corrupt our filesystem.
> 
> Currently do we know if the corruption is caused due to extent overlap
> or some other subtle bug?

No direct evidence yet.

But some strange behavior, like a valid tree block with a wrong level
(parent level is 1, leaf level is still 1, but otherwise completely
valid), happens after power loss.

And I strongly suspect transid error after powerloss can be related to this.

Overwriting commit root is just one possible cause.
This patch is just trying to catch some clue.

I am also working on new test case using dm-flakey and dm-log to try to
create some crazy powerloss test case to try to reproduce it.

> I.e. is there a guarantee that this self-check
> should trigger?

No guarantee yet.

But at least normal tree block allocation and new data extent allocation
go through the following call chains and should trigger this self-check:

For tree allocation:

btrfs_alloc_tree_block()
|- btrfs_reserve_extent()
   |- find_free_extent()
  |- check_extent_conflicts()

For data extent, there are several cases:

For prealloc
btrfs_prealloc_file_range() or btrfs_prealloc_file_range_trans()
|- __btrfs_prealloc_file_range()
   |- btrfs_reserve_extent()

For data CoW:
cow_file_range()
|- btrfs_reserve_extent()

For compression:
submit_compressed_extents()
|- btrfs_reserve_extent()

For direct:
btrfs_new_extent_direct()
|- btrfs_reserve_extent()

So btrfs_reserve_extent() should be a quite good entry point, although
I'm still not 100% sure whether there is any other case which could
reserve an extent without calling btrfs_reserve_extent().

If we can find such a case, it may be the cause.

Thanks,
Qu
> 
>> + *
>> + * Return 0 if nothing wrong.
>> + * Return <0 (including ENOMEM) means we have something wrong.
>> + * Except ENOMEM, this normally means we have extent conflicts with previous
>> + * transaction.
>> + */
>> +static int check_extent_conflicts(struct btrfs_fs_info *fs_info,
>> +  u64 start, u64 len)
>> +{
>> +struct btrfs_key key;
>> +struct btrfs_path *path;
>> +struct btrfs_root *extent_root = fs_info->extent_root;
>> +u64 extent_start;
>> +u64 extent_len;
>> +int ret;
>> +
>> +path = btrfs_alloc_path();
>> +if (!path)
>> +return -ENOMEM;
>> +
>> +
>> +key.objectid = start + len;
>> +key.type = 0;
>> +key.offset = 0;
>> +
>> +/*
>> + * Here we use extent commit root to search for any conflicts.
>> + * If extent commit tree is already overwritten, we will get transid
>> + * error and error out any way.
>> + * If extent commit tree is OK, but other extent conflicts, we will
>> + * find it.
>> + * So anyway, such search should be OK to find the conflicts.
>> + */
>> +path->search_commit_root = true;
>> +ret = btrfs_search_slot(NULL, extent_root, &key, path, 0, 0);
>> +
>> +if (ret < 0)
>> +goto out;
>> +/* Impossible as no type/offset should be (u64)-1 */
>> +if (ret == 0) {
>> +ret = -EUCLEAN;
>> +goto out;
>> +}
>> +ret = btrfs_previous_extent_item(extent_root, path, start);
>> +if (ret < 0)
>> +goto out;
>> +if (ret == 0) {
>> +