[PATCH v2 5/7] Btrfs: remove BUG() in add_data_reference
Now that we have a helper to report invalid value of extent inline ref type, we need to quit gracefully instead of throwing out a kernel panic. Signed-off-by: Liu Bo --- fs/btrfs/relocation.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c index f0bef3c..4806e78 100644 --- a/fs/btrfs/relocation.c +++ b/fs/btrfs/relocation.c @@ -3774,7 +3774,10 @@ int add_data_references(struct reloc_control *rc, ret = find_data_references(rc, extent_key, eb, dref, blocks); } else { - BUG(); + ret = -EINVAL; + WARN(1, +"extent %llu slot %d has an invalid inline ref type\n", +eb->start, path->slots[0]); } if (ret) { err = ret; -- 2.9.4 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 7/7] Btrfs: add one more sanity check for shared ref type
Every shared ref has a parent tree block, which can be get from btrfs_extent_inline_ref_offset(). And the tree block must be aligned to the nodesize, so we'd know this inline ref is not valid if this block's bytenr is not aligned to the nodesize, in which case, most likely the ref type has been misused. This adds the above mentioned check and also updates print_extent_item() called by btrfs_print_leaf() to point out the invalid ref while printing the tree structure. Signed-off-by: Liu Bo --- fs/btrfs/extent-tree.c | 29 + fs/btrfs/print-tree.c | 27 +-- 2 files changed, 46 insertions(+), 10 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 80b4db3c..61603ec 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -1158,19 +1158,40 @@ int btrfs_get_extent_inline_ref_type(struct extent_buffer *eb, enum btrfs_inline_ref_types is_data) { int type = btrfs_extent_inline_ref_type(eb, iref); + u64 offset = btrfs_extent_inline_ref_offset(eb, iref); if (type == BTRFS_TREE_BLOCK_REF_KEY || type == BTRFS_SHARED_BLOCK_REF_KEY || type == BTRFS_SHARED_DATA_REF_KEY || type == BTRFS_EXTENT_DATA_REF_KEY) { if (is_data == BTRFS_REF_TYPE_BLOCK) { - if (type == BTRFS_TREE_BLOCK_REF_KEY || - type == BTRFS_SHARED_BLOCK_REF_KEY) + if (type == BTRFS_TREE_BLOCK_REF_KEY) return type; + if (type == BTRFS_SHARED_BLOCK_REF_KEY) { + ASSERT(eb->fs_info); + /* +* Every shared one has parent tree +* block, which must be aligned to +* nodesize. +*/ + if (offset && + IS_ALIGNED(offset, eb->fs_info->nodesize)) + return type; + } } else if (is_data == BTRFS_REF_TYPE_DATA) { - if (type == BTRFS_EXTENT_DATA_REF_KEY || - type == BTRFS_SHARED_DATA_REF_KEY) + if (type == BTRFS_EXTENT_DATA_REF_KEY) return type; + if (type == BTRFS_SHARED_DATA_REF_KEY) { + ASSERT(eb->fs_info); + /* +* Every shared one has parent tree +* block, which must be aligned to +* nodesize. +*/ + if (offset && + IS_ALIGNED(offset, eb->fs_info->nodesize)) + return type; + } } else { ASSERT(is_data == BTRFS_REF_TYPE_ANY); return type; diff --git a/fs/btrfs/print-tree.c b/fs/btrfs/print-tree.c index 0e52e47..9f8c5ee 100644 --- a/fs/btrfs/print-tree.c +++ b/fs/btrfs/print-tree.c @@ -44,7 +44,7 @@ static void print_dev_item(struct extent_buffer *eb, static void print_extent_data_ref(struct extent_buffer *eb, struct btrfs_extent_data_ref *ref) { - pr_info("\t\textent data backref root %llu objectid %llu offset %llu count %u\n", + pr_cont("extent data backref root %llu objectid %llu offset %llu count %u\n", btrfs_extent_data_ref_root(eb, ref), btrfs_extent_data_ref_objectid(eb, ref), btrfs_extent_data_ref_offset(eb, ref), @@ -63,6 +63,7 @@ static void print_extent_item(struct extent_buffer *eb, int slot, int type) u32 item_size = btrfs_item_size_nr(eb, slot); u64 flags; u64 offset; + int ref_index = 0; if (item_size < sizeof(*ei)) { #ifdef BTRFS_COMPAT_EXTENT_TREE_V0 @@ -104,12 +105,20 @@ static void print_extent_item(struct extent_buffer *eb, int slot, int type) iref = (struct btrfs_extent_inline_ref *)ptr; type = btrfs_extent_inline_ref_type(eb, iref); offset = btrfs_extent_inline_ref_offset(eb, iref); + pr_info("\t\tref#%d: ", ref_index++); switch (type) { case BTRFS_TREE_BLOCK_REF_KEY: - pr_info("\t\ttree block backref root %llu\n", offset); + pr_cont("tree block backref root %llu\n", offset); break; case BTRFS_SHARED_BLOCK_REF_KEY: - pr_info("\t\tshared block backref parent %llu\n", offset); + pr_cont("shared block backref parent %llu\n", offset); + /* +
[PATCH v2 1/7] Btrfs: add a helper to retrive extent inline ref type
An invalid value of extent inline ref type may be read from a malicious image which may force btrfs to crash. This adds a helper which does sanity check for the ref type, so we can know if it's sane, return type if so, otherwise return an error. v2: add enum type and return BTRFS_REF_TYPE_INVALID instead of -EINVAL Signed-off-by: Liu Bo --- fs/btrfs/ctree.h | 11 +++ fs/btrfs/extent-tree.c | 36 2 files changed, 47 insertions(+) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 3f3eb7b..9866420 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -2561,6 +2561,17 @@ static inline gfp_t btrfs_alloc_write_mask(struct address_space *mapping) /* extent-tree.c */ +enum btrfs_inline_ref_types { + BTRFS_REF_TYPE_INVALID = 0, + BTRFS_REF_TYPE_BLOCK = 1, + BTRFS_REF_TYPE_DATA =2, + BTRFS_REF_TYPE_ANY = 3, +}; + +int btrfs_get_extent_inline_ref_type(struct extent_buffer *eb, +struct btrfs_extent_inline_ref *iref, +enum btrfs_inline_ref_types is_data); + u64 btrfs_csum_bytes_to_leaves(struct btrfs_fs_info *fs_info, u64 csum_bytes); static inline u64 btrfs_calc_trans_metadata_size(struct btrfs_fs_info *fs_info, diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index e3b0b41..83a3cdc 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -1148,6 +1148,42 @@ static int convert_extent_item_v0(struct btrfs_trans_handle *trans, } #endif +/* + * is_data == BTRFS_REF_TYPE_BLOCK, tree block type is required, + * is_data == BTRFS_REF_TYPE_DATA, data type is requried, + * is_data == BTRFS_REF_TYPE_ANY, either type is OK. + */ +int btrfs_get_extent_inline_ref_type(struct extent_buffer *eb, +struct btrfs_extent_inline_ref *iref, +enum btrfs_inline_ref_types is_data) +{ + int type = btrfs_extent_inline_ref_type(eb, iref); + + if (type == BTRFS_TREE_BLOCK_REF_KEY || + type == BTRFS_SHARED_BLOCK_REF_KEY || + type == BTRFS_SHARED_DATA_REF_KEY || + type == BTRFS_EXTENT_DATA_REF_KEY) { + if (is_data == BTRFS_REF_TYPE_BLOCK) { + if (type == BTRFS_TREE_BLOCK_REF_KEY || + type == BTRFS_SHARED_BLOCK_REF_KEY) + return type; + } else if (is_data == BTRFS_REF_TYPE_DATA) { + if (type == BTRFS_EXTENT_DATA_REF_KEY || + type == BTRFS_SHARED_DATA_REF_KEY) + return type; + } else { + ASSERT(is_data == BTRFS_REF_TYPE_ANY); + return type; + } + } + + btrfs_print_leaf(eb->fs_info, eb); + WARN(1, "eb %llu invalid extent inline ref type %d\n", +eb->start, type); + + return BTRFS_REF_TYPE_INVALID; +} + static u64 hash_extent_data_ref(u64 root_objectid, u64 owner, u64 offset) { u32 high_crc = ~(u32)0; -- 2.9.4 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 2/7] Btrfs: convert to use btrfs_get_extent_inline_ref_type
Since we have a helper which can do sanity check, this converts all btrfs_extent_inline_ref_type to it. Signed-off-by: Liu Bo --- fs/btrfs/backref.c | 11 +-- fs/btrfs/extent-tree.c | 36 ++-- fs/btrfs/relocation.c | 13 +++-- 3 files changed, 50 insertions(+), 10 deletions(-) diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c index f723c11..1b3d9df 100644 --- a/fs/btrfs/backref.c +++ b/fs/btrfs/backref.c @@ -1012,7 +1012,11 @@ static int __add_inline_refs(struct btrfs_path *path, u64 bytenr, int type; iref = (struct btrfs_extent_inline_ref *)ptr; - type = btrfs_extent_inline_ref_type(leaf, iref); + type = btrfs_get_extent_inline_ref_type(leaf, iref, + BTRFS_REF_TYPE_ANY); + if (type == BTRFS_REF_TYPE_INVALID) + return -EINVAL; + offset = btrfs_extent_inline_ref_offset(leaf, iref); switch (type) { @@ -1908,7 +1912,10 @@ static int __get_extent_inline_ref(unsigned long *ptr, struct extent_buffer *eb, end = (unsigned long)ei + item_size; *out_eiref = (struct btrfs_extent_inline_ref *)(*ptr); - *out_type = btrfs_extent_inline_ref_type(eb, *out_eiref); + *out_type = btrfs_get_extent_inline_ref_type(eb, *out_eiref, +BTRFS_REF_TYPE_ANY); + if (*out_type == BTRFS_REF_TYPE_INVALID) + return -EINVAL; *ptr += btrfs_extent_inline_ref_size(*out_type); WARN_ON(*ptr > end); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 83a3cdc..80b4db3c 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -1453,12 +1453,18 @@ static noinline u32 extent_data_ref_count(struct btrfs_path *path, struct btrfs_extent_data_ref *ref1; struct btrfs_shared_data_ref *ref2; u32 num_refs = 0; + int type; leaf = path->nodes[0]; btrfs_item_key_to_cpu(leaf, &key, path->slots[0]); if (iref) { - if (btrfs_extent_inline_ref_type(leaf, iref) == - BTRFS_EXTENT_DATA_REF_KEY) { + /* +* If type is invalid, we should have bailed out earlier than +* this call. +*/ + type = btrfs_get_extent_inline_ref_type(leaf, iref, BTRFS_REF_TYPE_DATA); + ASSERT(type != BTRFS_REF_TYPE_INVALID); + if (type == BTRFS_EXTENT_DATA_REF_KEY) { ref1 = (struct btrfs_extent_data_ref *)(&iref->offset); num_refs = btrfs_extent_data_ref_count(leaf, ref1); } else { @@ -1619,6 +1625,7 @@ int lookup_inline_extent_backref(struct btrfs_trans_handle *trans, int ret; int err = 0; bool skinny_metadata = btrfs_fs_incompat(fs_info, SKINNY_METADATA); + int needed; key.objectid = bytenr; key.type = BTRFS_EXTENT_ITEM_KEY; @@ -1710,6 +1717,11 @@ int lookup_inline_extent_backref(struct btrfs_trans_handle *trans, BUG_ON(ptr > end); } + if (owner >= BTRFS_FIRST_FREE_OBJECTID) + needed = BTRFS_REF_TYPE_DATA; + else + needed = BTRFS_REF_TYPE_BLOCK; + err = -ENOENT; while (1) { if (ptr >= end) { @@ -1717,7 +1729,12 @@ int lookup_inline_extent_backref(struct btrfs_trans_handle *trans, break; } iref = (struct btrfs_extent_inline_ref *)ptr; - type = btrfs_extent_inline_ref_type(leaf, iref); + type = btrfs_get_extent_inline_ref_type(leaf, iref, needed); + if (type == BTRFS_REF_TYPE_INVALID) { + err = -EINVAL; + goto out; + } + if (want < type) break; if (want > type) { @@ -1909,7 +1926,12 @@ void update_inline_extent_backref(struct btrfs_fs_info *fs_info, if (extent_op) __run_delayed_extent_op(extent_op, leaf, ei); - type = btrfs_extent_inline_ref_type(leaf, iref); + /* +* If type is invalid, we should have bailed out after +* lookup_inline_extent_backref(). +*/ + type = btrfs_get_extent_inline_ref_type(leaf, iref, BTRFS_REF_TYPE_ANY); + ASSERT(type != BTRFS_REF_TYPE_INVALID); if (type == BTRFS_EXTENT_DATA_REF_KEY) { dref = (struct btrfs_extent_data_ref *)(&iref->offset); @@ -3194,6 +3216,7 @@ static noinline int check_committed_ref(struct btrfs_root *root, struct btrfs_extent_item *ei; struct btrfs_key key; u32 item_size; + int type; int ret; key.objectid = bytenr; @@ -3235,8 +3258,9 @@ static noinline int check_committed_ref(struct btrfs_root *root,
[PATCH v2 6/7] Btrfs: remove BUG_ON in __add_tree_block
The BUG_ON() can be triggered when the caller is processing an invalid extent inline ref, e.g. a shared data ref is offered instead of a extent data ref, such that it tries to find a non-exist tree block and then btrfs_search_slot returns 1 for no such item. This replaces the BUG_ON() with a WARN() followed by calling btrfs_print_leaf() to show more details about what's going on and returning -EINVAL to upper callers. Signed-off-by: Liu Bo --- fs/btrfs/relocation.c | 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c index 4806e78..cc6150e 100644 --- a/fs/btrfs/relocation.c +++ b/fs/btrfs/relocation.c @@ -32,6 +32,7 @@ #include "free-space-cache.h" #include "inode-map.h" #include "qgroup.h" +#include "print-tree.h" /* * backref_node, mapping_node and tree_block start with this @@ -3485,7 +3486,15 @@ static int __add_tree_block(struct reloc_control *rc, goto again; } } - BUG_ON(ret); + if (ret) { + ASSERT(ret == 1); + btrfs_print_leaf(rc->extent_root->fs_info, path->nodes[0]); + WARN(1, +"tree block extent item (%llu) is not found in extent tree\n", +bytenr); + ret = -EINVAL; + goto out; + } ret = add_tree_block(rc, &key, path, blocks); out: -- 2.9.4 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 4/7] Btrfs: remove BUG() in print_extent_item
btrfs_print_leaf() is used in btrfs_get_extent_inline_ref_type, so here we really want to print the invalid value of ref type instead of causing a kernel panic. Signed-off-by: Liu Bo --- fs/btrfs/print-tree.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/print-tree.c b/fs/btrfs/print-tree.c index fcae61e..0e52e47 100644 --- a/fs/btrfs/print-tree.c +++ b/fs/btrfs/print-tree.c @@ -121,7 +121,10 @@ static void print_extent_item(struct extent_buffer *eb, int slot, int type) offset, btrfs_shared_data_ref_count(eb, sref)); break; default: - BUG(); + btrfs_err(eb->fs_info, + "extent %llu has invalid ref type %d\n", + eb->start, type); + return; } ptr += btrfs_extent_inline_ref_size(type); } -- 2.9.4 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 3/7] Btrfs: remove BUG() in btrfs_extent_inline_ref_size
Now that btrfs_get_extent_inline_ref_type() can report if type is a valid one and all callers can gracefully deal with that, we don't need to crash here. Signed-off-by: Liu Bo --- fs/btrfs/ctree.h | 1 - 1 file changed, 1 deletion(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 9866420..c915f33 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1799,7 +1799,6 @@ static inline u32 btrfs_extent_inline_ref_size(int type) if (type == BTRFS_EXTENT_DATA_REF_KEY) return sizeof(struct btrfs_extent_data_ref) + offsetof(struct btrfs_extent_inline_ref, offset); - BUG(); return 0; } -- 2.9.4 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 0/7] add sanity check for extent inline ref type
An invalid extent inline ref type could be read from a btrfs image and it ends up with a panic[1], this set is to deal with the insane value gracefully in patch 1-2 and clean up BUG() in the code in patch 3-6. Patch 7 adds one more check to see if the ref is a valid shared one. I'm not sure in the real world what may result in this corruption, but I've seen several reports on the ML about __btrfs_free_extent saying something was missing (or simply wrong), while testing this set with btrfs-corrupt-block, I found that switching ref type could end up that situation as well, eg. a data extent's ref type (BTRFS_EXTENT_DATA_REF_KEY) is switched to (BTRFS_TREE_BLOCK_REF_KEY). Hopefully this can give people more sights next time when that happens. [1]:https://www.spinics.net/lists/linux-btrfs/msg65646.html v2: - add enum type and return BTRFS_REF_TYPE_INVALID instead of -EINVAL. - remove one more BUG_ON which is in __add_tree_block. - add validation check for shared refs. - improve btrfs_print_leaf to show which refs has something wrong. Liu Bo (7): Btrfs: add a helper to retrive extent inline ref type Btrfs: convert to use btrfs_get_extent_inline_ref_type Btrfs: remove BUG() in btrfs_extent_inline_ref_size Btrfs: remove BUG() in print_extent_item Btrfs: remove BUG() in add_data_reference Btrfs: remove BUG_ON in __add_tree_block Btrfs: add one more sanity check for shared ref type fs/btrfs/backref.c | 11 -- fs/btrfs/ctree.h | 12 ++- fs/btrfs/extent-tree.c | 93 ++ fs/btrfs/print-tree.c | 28 --- fs/btrfs/relocation.c | 29 +--- 5 files changed, 155 insertions(+), 18 deletions(-) -- 2.9.4 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kernel BUG at /build/linux-H5UzH8/linux-4.10.0/fs/btrfs/extent_io.c:2318
Piotr Pawłow posted on Mon, 07 Aug 2017 15:26:16 +0200 as excerpted: > # uname -a > Linux pps 4.10.0-30-generic #34-Ubuntu SMP > Mon Jul 31 19:38:17 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux This is a general principles reply and chances are wouldn't help with your issue since the spread isn't /too/ large yet and I've no knowledge of a specific fix for your issue in newer kernels, but FWIW... This being the btrfs development list and btrfs, while stabilizing (but not yet to be considered fully stable) still being under rather intense development, with significant changes every kernel cycle, list focus tends to be quite forward looking, with most interest and the best chance of fixes on relatively current kernels. What we try to support, in addition to development kernels, is the latest two release kernel series in two tracks, (mainline/Linus) current and LTS. On the current track, 4.12 is the newest release so 4.12 and 4.11 are best supported, tho the 4.11 series is already listed as EOL on kernel.org. On the LTS track, 4.9 and 4.4 are the latest LTS releases, with 4.1 the previous one but already well out of btrfs focus range. That puts 4.10 out of the list's best-supported focus range, with 4.11 still in focus but EOL. Beyond that we still try, but you're not as likely to get interest from the developers themselves, and for the other btrfs-user-regulars here, as the spread gets wider, the support gets more difficult as it's harder to remember where things were back then. So the recommendation would be to either upgrade to 4.12 current if you want to stay on current track, or downgrade to the latest 4.9 release if you prefer stable track. Other alternatives would include getting support from your distro if you're running a distro kernel (as seems to be your case), since they know what they've backported and what they haven't, and are thus in a better position to support it, and of course, simply sticking with what you have and accepting that you won't get quite the support or interest on your list posts since you're out of primary focus range. Of course you can also keep a current kernel available and boot to it to try to duplicate any issues before reporting, while otherwise sticking to your older kernel, but unless you have a specific reason to do that, it would seem more trouble than simply running the current kernel by default. So 4.10 isn't /too/ far out of range yet, but I'd strongly consider upgrading (or downgrading to 4.9 LTS) as soon as it's reasonably convenient, before 4.13 in any case. Unless you prefer to go the distro support route, of course. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs: fix out of bounds array access while reading extent buffer
There is a cornel case that slip through the checkers in functions reading extent buffer, ie. if (start < eb->len) and (start + len > eb->len), then a) map_private_extent_buffer() returns immediately because it's thinking the range spans across two pages, b) and the checkers in read_extent_buffer(), WARN_ON(start > eb->len) and WARN_ON(start + len > eb->start + eb->len), both are OK in this corner case, but it'd actually try to access the eb->pages out of bounds because of (start + len > eb->len). The case is found by switching extent inline ref type from shared data ref to non-shared data ref. This is adding proper checks in order to avoid invalid memory access, ie. 'general protection', before it's too late. Signed-off-by: Liu Bo --- fs/btrfs/extent_io.c | 22 ++ 1 file changed, 14 insertions(+), 8 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 0aff9b2..d198e87 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -5416,13 +5416,19 @@ void read_extent_buffer(struct extent_buffer *eb, void *dstv, char *dst = (char *)dstv; size_t start_offset = eb->start & ((u64)PAGE_SIZE - 1); unsigned long i = (start_offset + start) >> PAGE_SHIFT; + unsigned long num_pages = num_extent_pages(eb->start, eb->len); - WARN_ON(start > eb->len); - WARN_ON(start + len > eb->start + eb->len); + if (start + len > eb->len) { + WARN(1, KERN_ERR "btrfs bad mapping eb start %llu len %lu, wanted %lu %lu\n", +eb->start, eb->len, start, len); + memset(dst, 0, len); + return; + } offset = (start_offset + start) & (PAGE_SIZE - 1); while (len > 0) { + ASSERT(i < num_pages); page = eb->pages[i]; cur = min(len, (PAGE_SIZE - offset)); @@ -5491,6 +5497,12 @@ int map_private_extent_buffer(struct extent_buffer *eb, unsigned long start, unsigned long end_i = (start_offset + start + min_len - 1) >> PAGE_SHIFT; + if (start + min_len > eb->len) { + WARN(1, KERN_ERR "btrfs bad mapping eb start %llu len %lu, wanted %lu %lu\n", + eb->start, eb->len, start, min_len); + return -EINVAL; + } + if (i != end_i) return 1; @@ -5502,12 +5514,6 @@ int map_private_extent_buffer(struct extent_buffer *eb, unsigned long start, *map_start = ((u64)i << PAGE_SHIFT) - start_offset; } - if (start + min_len > eb->len) { - WARN(1, KERN_ERR "btrfs bad mapping eb start %llu len %lu, wanted %lu %lu\n", - eb->start, eb->len, start, min_len); - return -EINVAL; - } - p = eb->pages[i]; kaddr = page_address(p); *map = kaddr + offset; -- 2.9.4 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs umount hang
On 8/7/17 1:19 PM, Jeff Mahoney wrote: > On 8/7/17 10:12 AM, Angel Shtilianov wrote: >> Hi there, >> I'm investigating sporadic hanging during btrfs umount. The FS is >> contained in a loop mounted file. >> I have no reproduction scenario and the issue may happen once a day or >> once a month. It is rare, but frustrating. >> I have a crashdump (the server has been manually crashed and collected >> a crashdump), so I could take look through the data structures. >> What happens is that umount is getting in D state and a the kernel >> complains about hung tasks. We are using kernel 4.4.y The actual back >> trace is from 4.4.70, but this happens with all the 4.4 kernels I've >> used (4.4.30 through 4.4.70). >> Tasks like: >> INFO: task kworker/u32:9:27574 blocked for more than 120 seconds. >> INFO: task kworker/u32:12:27575 blocked for more than 120 seconds. >> INFO: task btrfs-transacti:31625 blocked for more than 120 seconds. >> are getting blocked waiting for btrfs_tree_read_lock, which is owned >> by task umount:31696 (which is also blocked for more than 120 seconds) >> regarding the lock debug. >> >> umount is hung in "cache_block_group", see the '>' mark: >>while (cache->cached == BTRFS_CACHE_FAST) { >> struct btrfs_caching_control *ctl; >> >> ctl = cache->caching_ctl; >> atomic_inc(&ctl->count); >> prepare_to_wait(&ctl->wait, &wait, TASK_UNINTERRUPTIBLE); >> spin_unlock(&cache->lock); >> >>>schedule(); >> >> finish_wait(&ctl->wait, &wait); >> put_caching_control(ctl); >> spin_lock(&cache->lock); >> } >> >> The complete backtraces could be found in the attached log. >> >> Do you have any ideas ? > > Hi Angel - > > In your log, it says lockdep is disabled. What tripped it earlier? > Lockdep really should be catching locking deadlocks in situations like > this, if that's really the underlying cause. Actually, I'm not sure if lockdep would catch this one. Here's my hypothesis: kworker/u32:9 is waiting on a read lock while reading the free space cache, which means it owns the cache->cached value and will issue the wakeup when it completes. umount is waiting on for the wakeup from kworker/u32:9 but is holding some tree locks in write mode. If kworker/u32:9 is waiting on the locks that umount holds, we have a deadlock. Can you dump the extent buffer that kworker/u32:9 is waiting on? Part of that will contain the PID of the holder, and if matches umount, we found the cause. -Jeff -- Jeff Mahoney SUSE Labs signature.asc Description: OpenPGP digital signature
Re: Btrfs umount hang
On 8/7/17 10:12 AM, Angel Shtilianov wrote: > Hi there, > I'm investigating sporadic hanging during btrfs umount. The FS is > contained in a loop mounted file. > I have no reproduction scenario and the issue may happen once a day or > once a month. It is rare, but frustrating. > I have a crashdump (the server has been manually crashed and collected > a crashdump), so I could take look through the data structures. > What happens is that umount is getting in D state and a the kernel > complains about hung tasks. We are using kernel 4.4.y The actual back > trace is from 4.4.70, but this happens with all the 4.4 kernels I've > used (4.4.30 through 4.4.70). > Tasks like: > INFO: task kworker/u32:9:27574 blocked for more than 120 seconds. > INFO: task kworker/u32:12:27575 blocked for more than 120 seconds. > INFO: task btrfs-transacti:31625 blocked for more than 120 seconds. > are getting blocked waiting for btrfs_tree_read_lock, which is owned > by task umount:31696 (which is also blocked for more than 120 seconds) > regarding the lock debug. > > umount is hung in "cache_block_group", see the '>' mark: >while (cache->cached == BTRFS_CACHE_FAST) { > struct btrfs_caching_control *ctl; > > ctl = cache->caching_ctl; > atomic_inc(&ctl->count); > prepare_to_wait(&ctl->wait, &wait, TASK_UNINTERRUPTIBLE); > spin_unlock(&cache->lock); > >>schedule(); > > finish_wait(&ctl->wait, &wait); > put_caching_control(ctl); > spin_lock(&cache->lock); > } > > The complete backtraces could be found in the attached log. > > Do you have any ideas ? Hi Angel - In your log, it says lockdep is disabled. What tripped it earlier? Lockdep really should be catching locking deadlocks in situations like this, if that's really the underlying cause. -Jeff -- Jeff Mahoney SUSE Labs signature.asc Description: OpenPGP digital signature
Re: Btrfs umount hang
On 7.08.2017 17:12, Angel Shtilianov wrote: > Hi there, > I'm investigating sporadic hanging during btrfs umount. The FS is > contained in a loop mounted file. > I have no reproduction scenario and the issue may happen once a day or > once a month. It is rare, but frustrating. > I have a crashdump (the server has been manually crashed and collected > a crashdump), so I could take look through the data structures. > What happens is that umount is getting in D state and a the kernel > complains about hung tasks. We are using kernel 4.4.y The actual back > trace is from 4.4.70, but this happens with all the 4.4 kernels I've > used (4.4.30 through 4.4.70). > Tasks like: > INFO: task kworker/u32:9:27574 blocked for more than 120 seconds. > INFO: task kworker/u32:12:27575 blocked for more than 120 seconds. > INFO: task btrfs-transacti:31625 blocked for more than 120 seconds. > are getting blocked waiting for btrfs_tree_read_lock, which is owned > by task umount:31696 (which is also blocked for more than 120 seconds) > regarding the lock debug. > > umount is hung in "cache_block_group", see the '>' mark: >while (cache->cached == BTRFS_CACHE_FAST) { > struct btrfs_caching_control *ctl; > > ctl = cache->caching_ctl; > atomic_inc(&ctl->count); > prepare_to_wait(&ctl->wait, &wait, TASK_UNINTERRUPTIBLE); > spin_unlock(&cache->lock); > >>schedule(); > > finish_wait(&ctl->wait, &wait); > put_caching_control(ctl); > spin_lock(&cache->lock); > } > > The complete backtraces could be found in the attached log. > > Do you have any ideas ? > Any help will be greatly appreciated. So by the looks of it while writing dirty bgs and requiring a free block for the CoW process, cache_block_group() kicks off a caching thread which should just go and read in the respective block group. So the newly spawned caching_thread should actually wake up cache_block_group either due to success, if it manages to find 2megs: if (total_found > (1024 * 1024 * 2)) { or in case of failure after the out label. But in both cases it will set cache->cached to something different than BTRFS_CACHE_FAST and it ought to exit the loop. But from your description of the issue I take it the process never comes back from the schedule, meaning it missed the wakeup from caching_thread and atm I cannot see how this could happen. Can you print the state of the 'cache' parameter of cache_block_group ? > > Best regards, > Angel Shtilianov > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
On Fri, Aug 4, 2017 at 8:05 AM, Qu Wenruo wrote: > > For example, if one day there is some dm-csum to support verify csum of > given ranges (and skip unrelated ones specified by higher levels), btrfs > support for data csum is no longer an exclusive feature. How would dm-csum differ from dm-integrity? https://www.kernel.org/doc/Documentation/device-mapper/dm-integrity.txt By that description it uses a journal to guarantee atomicity. If multiqueue maybe the performance implications are neutral. But certainly on spinning drives that would slow things down, especially if the file system is also journaling, and the workload is metadata heavy. -- Chris Murphy -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Btrfs umount hang
Hi there, I'm investigating sporadic hanging during btrfs umount. The FS is contained in a loop mounted file. I have no reproduction scenario and the issue may happen once a day or once a month. It is rare, but frustrating. I have a crashdump (the server has been manually crashed and collected a crashdump), so I could take look through the data structures. What happens is that umount is getting in D state and a the kernel complains about hung tasks. We are using kernel 4.4.y The actual back trace is from 4.4.70, but this happens with all the 4.4 kernels I've used (4.4.30 through 4.4.70). Tasks like: INFO: task kworker/u32:9:27574 blocked for more than 120 seconds. INFO: task kworker/u32:12:27575 blocked for more than 120 seconds. INFO: task btrfs-transacti:31625 blocked for more than 120 seconds. are getting blocked waiting for btrfs_tree_read_lock, which is owned by task umount:31696 (which is also blocked for more than 120 seconds) regarding the lock debug. umount is hung in "cache_block_group", see the '>' mark: while (cache->cached == BTRFS_CACHE_FAST) { struct btrfs_caching_control *ctl; ctl = cache->caching_ctl; atomic_inc(&ctl->count); prepare_to_wait(&ctl->wait, &wait, TASK_UNINTERRUPTIBLE); spin_unlock(&cache->lock); >schedule(); finish_wait(&ctl->wait, &wait); put_caching_control(ctl); spin_lock(&cache->lock); } The complete backtraces could be found in the attached log. Do you have any ideas ? Any help will be greatly appreciated. Best regards, Angel Shtilianov Jul 21 22:30:34 smallvault22 kernel: [377419.794212] BTRFS info (device loop0): disk space caching is enabled Jul 21 22:30:34 smallvault22 kernel: [377419.794449] BTRFS: has skinny extents Jul 21 22:30:43 smallvault22 kernel: [377429.488153] BTRFS info (device loop2): disk space caching is enabled Jul 21 22:30:43 smallvault22 kernel: [377429.488389] BTRFS: has skinny extents Jul 21 22:30:44 smallvault22 kernel: [377429.710041] BTRFS info (device loop2): disk space caching is enabled Jul 21 22:30:44 smallvault22 kernel: [377429.710279] BTRFS: has skinny extents Jul 21 22:30:44 smallvault22 kernel: [377429.961130] BTRFS info (device loop0): disk space caching is enabled Jul 21 22:30:44 smallvault22 kernel: [377429.961367] BTRFS: has skinny extents Jul 21 22:32:04 smallvault22 kernel: [377510.277526] BTRFS: device fsid 7d53957c-847b-4dd6-8a70-63870a34c7f7 devid 1 transid 1222 / dev/loop0 Jul 21 22:32:04 smallvault22 kernel: [377510.294142] BTRFS info (device loop0): disk space caching is enabled Jul 21 22:32:04 smallvault22 kernel: [377510.294387] BTRFS: has skinny extents Jul 21 22:32:08 smallvault22 kernel: [377514.199173] BTRFS info (device loop0): new size for /dev/loop0 is 214748364800 Jul 21 22:32:09 smallvault22 kernel: [377514.722687] BTRFS info (device loop0): disk space caching is enabled Jul 21 22:32:09 smallvault22 kernel: [377514.722924] BTRFS: has skinny extents Jul 21 22:32:09 smallvault22 kernel: [377514.833162] INFO: task kworker/u32:9:27574 blocked for more than 120 seconds. Jul 21 22:32:09 smallvault22 kernel: [377514.833543] Tainted: G O4.4.70-backup2 #93 Jul 21 22:32:09 smallvault22 kernel: [377514.833767] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jul 21 22:32:09 smallvault22 kernel: [377514.834153] kworker/u32:9 D 880046142ec8 0 27574 2 0x Jul 21 22:32:09 smallvault22 kernel: [377514.834536] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs] Jul 21 22:32:09 smallvault22 kernel: [377514.834825] 880046142ec8 8802598b62f0 880046142e90 Jul 21 22:32:09 smallvault22 kernel: [377514.835468] 88046bb19a00 88038b469a00 880046144000 8802598b62f0 Jul 21 22:32:09 smallvault22 kernel: [377514.836097] 88038b469a00 88038b469a00 880046142f08 880046142ee0 Jul 21 22:32:09 smallvault22 kernel: [377514.836724] Call Trace: Jul 21 22:32:09 smallvault22 kernel: [377514.836947] [] schedule+0x3c/0x90 Jul 21 22:32:09 smallvault22 kernel: [377514.837206] [] btrfs_tree_read_lock+0xe5/0x140 [btrfs] Jul 21 22:32:09 smallvault22 kernel: [377514.837437] [] ? wait_woken+0xb0/0xb0 Jul 21 22:32:09 smallvault22 kernel: [377514.837669] [] btrfs_read_lock_root_node+0x34/0x50 [btrfs] Jul 21 22:32:09 smallvault22 kernel: [377514.838054] [] btrfs_search_slot+0x70b/0x9d0 [btrfs] Jul 21 22:32:09 smallvault22 kernel: [377514.838289] [] ? trace_hardirqs_on+0xd/0x10 Jul 21 22:32:09 smallvault22 kernel: [377514.838524] [] ? btrfs_alloc_path+0x1a/0x20 [btrfs] Jul 21 22:32:09 smallvault22 kernel: [377514.838760] [] btrfs_lookup_file_extent+0x37/0x40 [btrfs] Jul 21 22:32:09 smallvault22 kernel: [377514.838999] [] btrfs_get_extent+0x155/0xb30 [btrfs] Jul 21 22:32:09 smallvault22 kernel: [377514.839249] [] ? __set_extent_bit+0x15a/0x580 [btrfs] Jul 21
kernel BUG at /build/linux-H5UzH8/linux-4.10.0/fs/btrfs/extent_io.c:2318
Hello, my btrfs raid1 fs with 4 drives crashed after only one drive developed a couple of bad blocks. Seems similar to https://bugzilla.kernel.org/show_bug.cgi?id=196251 but it happened during normal usage instead of during replace. As a sidenote, I tried to make the system less noisy during the day, when it's not doing nightly backups, by setting drives to spin down after a few minutes. Unfortunately btrfs kept them spinning for no reason, so I tried systemd's automount to unmount the fs when not in use. It worked, but it seems it uncovered some kind of umount/mount race condition, where mount unexpectedly failed with: sie 06 03:47:39 pps kernel: BTRFS error (device dm-1): super_total_bytes 3855366488064 mismatch with fs_devices total_rw_bytes 7710732976128 sie 06 03:47:39 pps kernel: BTRFS error (device dm-1): failed to read chunk tree: -22 sie 06 03:47:39 pps kernel: BTRFS error (device dm-1): open_ctree failed ...and then the next mount worked properly as if nothing had happened. Full log: https://pp.siedziba.pl/btrfs_raid1_fail/kernel.log # uname -a Linux pps 4.10.0-30-generic #34-Ubuntu SMP Mon Jul 31 19:38:17 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux # btrfs device usage /backup /dev/mapper/pps-backup1, ID: 2 Device size: 465.76GiB Device slack: 0.00B Data,RAID1:387.00GiB Metadata,RAID1: 16.00GiB Unallocated:62.76GiB /dev/mapper/pps-backup2, ID: 8 Device size: 1.81TiB Device slack: 0.00B Data,RAID1: 1.48TiB Metadata,RAID1: 26.00GiB System,RAID1: 32.00MiB Unallocated: 314.29GiB /dev/mapper/pps-backup3, ID: 6 Device size: 334.69GiB Device slack: 0.00B Data,RAID1:268.00GiB Metadata,RAID1: 5.00GiB System,RAID1: 32.00MiB Unallocated:61.66GiB /dev/mapper/pps-backup4, ID: 7 Device size: 931.51GiB Device slack: 0.00B Data,RAID1:863.00GiB Metadata,RAID1: 5.00GiB System,RAID1: 64.00MiB Unallocated:63.45GiB # btrfs device stats /backup [/dev/mapper/pps-backup1].write_io_errs0 [/dev/mapper/pps-backup1].read_io_errs 0 [/dev/mapper/pps-backup1].flush_io_errs0 [/dev/mapper/pps-backup1].corruption_errs 0 [/dev/mapper/pps-backup1].generation_errs 0 [/dev/mapper/pps-backup3].write_io_errs0 [/dev/mapper/pps-backup3].read_io_errs 0 [/dev/mapper/pps-backup3].flush_io_errs0 [/dev/mapper/pps-backup3].corruption_errs 0 [/dev/mapper/pps-backup3].generation_errs 0 [/dev/mapper/pps-backup4].write_io_errs0 [/dev/mapper/pps-backup4].read_io_errs 0 [/dev/mapper/pps-backup4].flush_io_errs0 [/dev/mapper/pps-backup4].corruption_errs 0 [/dev/mapper/pps-backup4].generation_errs 0 [/dev/mapper/pps-backup2].write_io_errs1 [/dev/mapper/pps-backup2].read_io_errs 22 [/dev/mapper/pps-backup2].flush_io_errs0 [/dev/mapper/pps-backup2].corruption_errs 0 [/dev/mapper/pps-backup2].generation_errs 0 sie 07 04:45:05 pps kernel: [ cut here ] sie 07 04:45:05 pps kernel: BTRFS critical (device dm-1): stripe index math went horribly wrong, got stripe_index=2176876031, num_stripes=2 sie 07 04:45:05 pps kernel: BTRFS critical (device dm-1): stripe index math went horribly wrong, got stripe_index=2176876031, num_stripes=2 sie 07 04:45:05 pps kernel: BTRFS critical (device dm-1): stripe index math went horribly wrong, got stripe_index=2176876031, num_stripes=2 sie 07 04:45:05 pps kernel: BTRFS critical (device dm-1): stripe index math went horribly wrong, got stripe_index=2176876031, num_stripes=2 sie 07 04:45:05 pps kernel: BTRFS critical (device dm-1): stripe index math went horribly wrong, got stripe_index=2176876031, num_stripes=2 sie 07 04:45:05 pps kernel: BTRFS critical (device dm-1): stripe index math went horribly wrong, got stripe_index=2176876031, num_stripes=2 sie 07 04:45:05 pps kernel: BTRFS critical (device dm-1): stripe index math went horribly wrong, got stripe_index=2176876031, num_stripes=2 sie 07 04:45:05 pps kernel: BTRFS critical (device dm-1): stripe index math went horribly wrong, got stripe_index=2176876031, num_stripes=2 sie 07 04:45:05 pps kernel: BTRFS critical (device dm-1): stripe index math went horribly wrong, got stripe_index=2176876031, num_stripes=2 sie 07 04:45:05 pps kernel: BTRFS critical (device dm-1): stripe index math went horribly wrong, got stripe_index=2176876031, num_stripes=2 sie 07 04:45:05 pps kernel: BTRFS critical (device dm-1): stripe index math went horribly wrong, got stripe_index=2176876031, num_stripes=2 sie 07 04:45:05 pps kernel: BTRFS critical (device dm-1): stripe index math went horribly wrong, got stripe_index=2176876031, num_stripes=2 sie 07 04:45:05 pps kernel: BTRFS critical (device dm-1): stripe index math went horribly wrong, got stripe_index=2176876031, num_stripes=2
Re: Crashed filesystem, nothing helps
On Mon, Aug 7, 2017 at 7:12 AM, Thomas Wurfbaum wrote: > Now i do a btrfs-find-root, but it runs now since 5 day without a result. > How long should i wait? Or is it already to late to hope? > > mainframe:~ # btrfs-find-root.static /dev/sdb1 > parent transid verify failed on 29376512 wanted 1327723 found 1489835 > parent transid verify failed on 29376512 wanted 1327723 found 1489835 > parent transid verify failed on 29376512 wanted 1327723 found 1489835 > parent transid verify failed on 29376512 wanted 1327723 found 1489835 > Ignoring transid failure > Superblock thinks the generation is 1490226 > Superblock thinks the level is 1 > > The process is still running... I also have a filesystem that has a big difference in wanted and found transid. I my case the wanted is much higher than the found. I also tried to repair it for more than 2 weeks in total, but both progs and kernel fail. I haven't seen that you mounted it just with only ro mount option. I think in your case chances are low, but it worked in my case. I could copy all data (7TB unique non-shared thorugh snapshot or reflink) to a new fs with rsync. Btrfs send and other action done by btrfs progs quite soon failed on 1 or several parent transid inconsistencies. Others on this forum have stated that when there is such a big difference in transid, the fs is simple unrepairable. I must admit that this is the truth in practice. Restore with btrfs-restore I tried once on a other fs that was damaged and at that time the restore gave quite different content for some file compared to a copy of the same file from the fs mouted via the kernel. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Power down tests...
Hi Chris, Good points that you make. We're making use of btrfs raid only. (One of the reasons we want to move to btrfs) However, during this test, we haven't run with multi-disk btrfs raid. We just have one disk. (This test setup doesn't have too many disks) We do have our metadata replicated as well. For data, we do have regular async backups. However, this is something that we noticed (somewhat more frequently) while testing out btrfs as our data store (as compared to ext4). We've tests going on with flushoncommit and recovery mount options running on the same setup. Hopefully, we'll have near-ext4-like behaviour with this, w.r.t power off recovery. Regards, Shyam On Mon, Aug 7, 2017 at 7:52 AM, Chris Murphy wrote: > On Thu, Aug 3, 2017 at 11:51 PM, Shyam Prasad N > wrote: >> Hi all, >> >> We're running a couple of experiments on our servers with btrfs >> (kernel version 4.4). >> And we're running some abrupt power-off tests for a couple of scenarios: >> >> 1. We have a filesystem on top of two different btrfs filesystems >> (distributed across N disks). > > What's the layout from physical devices all the way to your 16M file? > This is hardware raid, lvm linear, Btrfs raid? All of that matters. > > Do the drives have write caching disabled? You might be better off > with the drive write cache disabled, and then add bcache or dm-cache > and an SSD to compensate. But that's just speculation on my part. The > write cache in the drives is definitely volatile. And disabling them > will definitely make writes slower. So, you might have slightly better > luck with another layout. > > But the bottom line is, you need to figure out a way to avoid *any* > data loss in your files because otherwise that means the 2nd file > system has data loss and even corruption. This is not something a file > system choice can solve. You need reliable power and reliable > shutdown. And you may also need a cluster file system like ceph or > glusterfs instead of depending on a single box to stay upright. > > > > -- > Chris Murphy -- -Shyam -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html