Re: [PATCH v2] Btrfs: return failure if btrfs_dev_replace_finishing() failed
comments below..

On 10/13/14 12:42, Eryu Guan wrote:
> device replace could fail due to another running scrub process or any
> other errors btrfs_scrub_dev() may hit, but this failure doesn't get
> returned to userspace. The following steps could reproduce this issue:
>
>   mkfs -t btrfs -f /dev/sdb1 /dev/sdb2
>   mount /dev/sdb1 /mnt/btrfs
>   while true; do btrfs scrub start -B /mnt/btrfs >/dev/null 2>&1; done &
>   btrfs replace start -Bf /dev/sdb2 /dev/sdb3 /mnt/btrfs
>   # if this replace succeeded, do the following and repeat until
>   # you see this log in dmesg
>   # BTRFS: btrfs_scrub_dev(/dev/sdb2, 2, /dev/sdb3) failed -115
>   # btrfs replace start -Bf /dev/sdb3 /dev/sdb2 /mnt/btrfs
>   # once you see the error log in dmesg, check the return value of
>   # the replace
>   echo $?
>
> Introduce a new dev replace result
> BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS to catch -EINPROGRESS
> explicitly and return other errors directly to userspace.
>
> Signed-off-by: Eryu Guan <guane...@gmail.com>
> ---
> v2:
> - set result to SCRUB_INPROGRESS if btrfs_scrub_dev returned
>   -EINPROGRESS and return 0, as Miao Xie suggested
>
>  fs/btrfs/dev-replace.c     | 12 +++++++++---
>  include/uapi/linux/btrfs.h |  1 +
>  2 files changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> index eea26e1..a141f8b 100644
> --- a/fs/btrfs/dev-replace.c
> +++ b/fs/btrfs/dev-replace.c
> @@ -418,9 +418,15 @@ int btrfs_dev_replace_start(struct btrfs_root *root,
>  			      &dev_replace->scrub_progress, 0, 1);
>
>  	ret = btrfs_dev_replace_finishing(root->fs_info, ret);
> -	WARN_ON(ret);
> +	/* don't warn if EINPROGRESS, someone else might be running scrub */
> +	if (ret == -EINPROGRESS) {
> +		args->result = BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS;
> +		ret = 0;
> +	} else {
> +		WARN_ON(ret);
> +	}

looks like we are trying to manage EINPROGRESS returned by
btrfs_dev_replace_finishing(). In btrfs_dev_replace_finishing(), which
specific func call is returning EINPROGRESS? I didn't go deep enough.

And how do we handle it if the replace is intervened by balance instead
of scrub?
sorry if I missed something.

Anand

> -	return 0;
> +	return ret;
>
>  leave:
>  	dev_replace->srcdev = NULL;
> @@ -538,7 +544,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
>  		btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
>  		mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
>
> -		return 0;
> +		return scrub_ret;
>  	}
>
>  	printk_in_rcu(KERN_INFO
>
> diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
> index 2f47824..611e1c5 100644
> --- a/include/uapi/linux/btrfs.h
> +++ b/include/uapi/linux/btrfs.h
> @@ -157,6 +157,7 @@ struct btrfs_ioctl_dev_replace_status_params {
>  #define BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_ERROR		0
>  #define BTRFS_IOCTL_DEV_REPLACE_RESULT_NOT_STARTED	1
>  #define BTRFS_IOCTL_DEV_REPLACE_RESULT_ALREADY_STARTED	2
> +#define BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS	3
>
>  struct btrfs_ioctl_dev_replace_args {
>  	__u64 cmd;			/* in */
>  	__u64 result;			/* out */

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] Btrfs: return failure if btrfs_dev_replace_finishing() failed
On Mon, Oct 13, 2014 at 02:23:57PM +0800, Anand Jain wrote:
>  comments below..
>
> On 10/13/14 12:42, Eryu Guan wrote:
>> device replace could fail due to another running scrub process or any
>> other errors btrfs_scrub_dev() may hit, but this failure doesn't get
>> returned to userspace. The following steps could reproduce this issue:
>>
>>   mkfs -t btrfs -f /dev/sdb1 /dev/sdb2
>>   mount /dev/sdb1 /mnt/btrfs
>>   while true; do btrfs scrub start -B /mnt/btrfs >/dev/null 2>&1; done &
>>   btrfs replace start -Bf /dev/sdb2 /dev/sdb3 /mnt/btrfs
>>   # if this replace succeeded, do the following and repeat until
>>   # you see this log in dmesg
>>   # BTRFS: btrfs_scrub_dev(/dev/sdb2, 2, /dev/sdb3) failed -115
>>   # btrfs replace start -Bf /dev/sdb3 /dev/sdb2 /mnt/btrfs
>>   # once you see the error log in dmesg, check the return value of
>>   # the replace
>>   echo $?
>>
>> Introduce a new dev replace result
>> BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS to catch -EINPROGRESS
>> explicitly and return other errors directly to userspace.
>>
>> Signed-off-by: Eryu Guan <guane...@gmail.com>
>> ---
>> v2:
>> - set result to SCRUB_INPROGRESS if btrfs_scrub_dev returned
>>   -EINPROGRESS and return 0, as Miao Xie suggested
>>
>>  fs/btrfs/dev-replace.c     | 12 +++++++++---
>>  include/uapi/linux/btrfs.h |  1 +
>>  2 files changed, 10 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
>> index eea26e1..a141f8b 100644
>> --- a/fs/btrfs/dev-replace.c
>> +++ b/fs/btrfs/dev-replace.c
>> @@ -418,9 +418,15 @@ int btrfs_dev_replace_start(struct btrfs_root *root,
>>  			      &dev_replace->scrub_progress, 0, 1);
>>
>>  	ret = btrfs_dev_replace_finishing(root->fs_info, ret);
>> -	WARN_ON(ret);
>> +	/* don't warn if EINPROGRESS, someone else might be running scrub */
>> +	if (ret == -EINPROGRESS) {
>> +		args->result = BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS;
>> +		ret = 0;
>> +	} else {
>> +		WARN_ON(ret);
>> +	}
>
>  looks like we are trying to manage EINPROGRESS returned by

Yes, that's right.

>  btrfs_dev_replace_finishing(). In btrfs_dev_replace_finishing(), which
>  specific func call is returning EINPROGRESS? I didn't go deep enough.
btrfs_dev_replace_finishing() will check scrub_ret (the last argument),
and return scrub_ret if it is non-zero. It was returning 0
unconditionally before this patch.

btrfs_dev_replace_start@fs/btrfs/dev-replace.c
416         ret = btrfs_scrub_dev(fs_info, src_device->devid, 0,
417                               src_device->total_bytes,
418                               &dev_replace->scrub_progress, 0, 1);
419
420         ret = btrfs_dev_replace_finishing(root->fs_info, ret);

and btrfs_dev_replace_finishing@fs/btrfs/dev-replace.c
529         if (!scrub_ret) {
530                 btrfs_dev_replace_update_device_in_mapping_tree(fs_info,
531                                                                 src_device,
532                                                                 tgt_device);
533         } else {
...
547                 return scrub_ret;
548         }

>  And how do we handle it if the replace is intervened by balance
>  instead of scrub?

Based on my test, the replace ioctl would return -ENOENT if balance is
running:

  ERROR: ioctl(DEV_REPLACE_START) failed on /mnt/testarea/scratch: No
  such file or directory, no error

(I haven't gone through this codepath yet and don't know where -ENOENT
comes from, but I don't think it's a proper errno;
/mnt/testarea/scratch is definitely there)

>  sorry if I missed something.
>
> Anand

Thanks for the review!
Eryu

>> -	return 0;
>> +	return ret;
>>
>>  leave:
>>  	dev_replace->srcdev = NULL;
>> @@ -538,7 +544,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
>>  		btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
>>  		mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
>>
>> -		return 0;
>> +		return scrub_ret;
>>  	}
>>
>>  	printk_in_rcu(KERN_INFO
>>
>> diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
>> index 2f47824..611e1c5 100644
>> --- a/include/uapi/linux/btrfs.h
>> +++ b/include/uapi/linux/btrfs.h
>> @@ -157,6 +157,7 @@ struct btrfs_ioctl_dev_replace_status_params {
>>  #define BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_ERROR		0
>>  #define BTRFS_IOCTL_DEV_REPLACE_RESULT_NOT_STARTED	1
>>  #define BTRFS_IOCTL_DEV_REPLACE_RESULT_ALREADY_STARTED	2
>> +#define BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS	3
>>
>>  struct btrfs_ioctl_dev_replace_args {
>>  	__u64 cmd;			/* in */
>>  	__u64 result;			/* out */
Re: [PATCH v2] Btrfs: return failure if btrfs_dev_replace_finishing() failed
On 10/13/14 14:59, Eryu Guan wrote:
> On Mon, Oct 13, 2014 at 02:23:57PM +0800, Anand Jain wrote:
>>  comments below..
>>
>> On 10/13/14 12:42, Eryu Guan wrote:
>>> device replace could fail due to another running scrub process or any
>>> other errors btrfs_scrub_dev() may hit, but this failure doesn't get
>>> returned to userspace. The following steps could reproduce this issue:
>>>
>>>   mkfs -t btrfs -f /dev/sdb1 /dev/sdb2
>>>   mount /dev/sdb1 /mnt/btrfs
>>>   while true; do btrfs scrub start -B /mnt/btrfs >/dev/null 2>&1; done &
>>>   btrfs replace start -Bf /dev/sdb2 /dev/sdb3 /mnt/btrfs
>>>   # if this replace succeeded, do the following and repeat until
>>>   # you see this log in dmesg
>>>   # BTRFS: btrfs_scrub_dev(/dev/sdb2, 2, /dev/sdb3) failed -115
>>>   # btrfs replace start -Bf /dev/sdb3 /dev/sdb2 /mnt/btrfs
>>>   # once you see the error log in dmesg, check the return value of
>>>   # the replace
>>>   echo $?
>>>
>>> Introduce a new dev replace result
>>> BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS to catch -EINPROGRESS
>>> explicitly and return other errors directly to userspace.
>>>
>>> Signed-off-by: Eryu Guan <guane...@gmail.com>
>>> ---
>>> v2:
>>> - set result to SCRUB_INPROGRESS if btrfs_scrub_dev returned
>>>   -EINPROGRESS and return 0, as Miao Xie suggested
>>>
>>>  fs/btrfs/dev-replace.c     | 12 +++++++++---
>>>  include/uapi/linux/btrfs.h |  1 +
>>>  2 files changed, 10 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
>>> index eea26e1..a141f8b 100644
>>> --- a/fs/btrfs/dev-replace.c
>>> +++ b/fs/btrfs/dev-replace.c
>>> @@ -418,9 +418,15 @@ int btrfs_dev_replace_start(struct btrfs_root *root,
>>>  			      &dev_replace->scrub_progress, 0, 1);
>>>
>>>  	ret = btrfs_dev_replace_finishing(root->fs_info, ret);
>>> -	WARN_ON(ret);
>>> +	/* don't warn if EINPROGRESS, someone else might be running scrub */
>>> +	if (ret == -EINPROGRESS) {
>>> +		args->result = BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS;
>>> +		ret = 0;
>>> +	} else {
>>> +		WARN_ON(ret);
>>> +	}

I am a bit concerned: why aren't these racing threads here excluding
each other using mutually_exclusive_operation_running, as most of the
other device operation threads do?
Thanks, Anand

>>  looks like we are trying to manage EINPROGRESS returned by
>
> Yes, that's right.
>
>>  btrfs_dev_replace_finishing(). In btrfs_dev_replace_finishing(), which
>>  specific func call is returning EINPROGRESS? I didn't go deep enough.
>
> btrfs_dev_replace_finishing() will check scrub_ret (the last argument),
> and return scrub_ret if it is non-zero. It was returning 0
> unconditionally before this patch.
>
> btrfs_dev_replace_start@fs/btrfs/dev-replace.c
> 416         ret = btrfs_scrub_dev(fs_info, src_device->devid, 0,
> 417                               src_device->total_bytes,
> 418                               &dev_replace->scrub_progress, 0, 1);
> 419
> 420         ret = btrfs_dev_replace_finishing(root->fs_info, ret);
>
> and btrfs_dev_replace_finishing@fs/btrfs/dev-replace.c
> 529         if (!scrub_ret) {
> 530                 btrfs_dev_replace_update_device_in_mapping_tree(fs_info,
> 531                                                                 src_device,
> 532                                                                 tgt_device);
> 533         } else {
> ...
> 547                 return scrub_ret;
> 548         }
>
>>  And how do we handle it if the replace is intervened by balance
>>  instead of scrub?
>
> Based on my test, the replace ioctl would return -ENOENT if balance is
> running:
>
>   ERROR: ioctl(DEV_REPLACE_START) failed on /mnt/testarea/scratch: No
>   such file or directory, no error
>
> (I haven't gone through this codepath yet and don't know where -ENOENT
> comes from, but I don't think it's a proper errno;
> /mnt/testarea/scratch is definitely there)
>
>>  sorry if I missed something.
>>
>> Anand
>
> Thanks for the review!
> Eryu
>
>>> -	return 0;
>>> +	return ret;
>>>
>>>  leave:
>>>  	dev_replace->srcdev = NULL;
>>> @@ -538,7 +544,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
>>>  		btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
>>>  		mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
>>>
>>> -		return 0;
>>> +		return scrub_ret;
>>>  	}
>>>
>>>  	printk_in_rcu(KERN_INFO
>>>
>>> diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
>>> index 2f47824..611e1c5 100644
>>> --- a/include/uapi/linux/btrfs.h
>>> +++ b/include/uapi/linux/btrfs.h
>>> @@ -157,6 +157,7 @@ struct btrfs_ioctl_dev_replace_status_params {
>>>  #define BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_ERROR		0
>>>  #define BTRFS_IOCTL_DEV_REPLACE_RESULT_NOT_STARTED	1
>>>  #define BTRFS_IOCTL_DEV_REPLACE_RESULT_ALREADY_STARTED	2
>>> +#define BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS	3
>>>
>>>  struct btrfs_ioctl_dev_replace_args {
>>>  	__u64 cmd;			/* in */
>>>  	__u64 result;			/* out */
[PATCH 1/3] Btrfs: deal with convert_extent_bit errors to avoid fs corruption
When committing a transaction or a log, we look for btree extents that
need to be durably persisted by searching for ranges in an io tree that
have some bits set (EXTENT_DIRTY or EXTENT_NEW). We then attempt to
clear those bits and set the EXTENT_NEED_WAIT bit, with calls to the
function convert_extent_bit, and then start writeback for the extents.

That function however can return an error (at the moment only -ENOMEM
is possible, especially when it does GFP_ATOMIC allocation requests
through alloc_extent_state_atomic). That means the ranges didn't get
the EXTENT_NEED_WAIT bit set (or at least not for the whole range),
which in turn means a call to btrfs_wait_marked_extents() won't find
those ranges for which we started writeback, causing a transaction
commit or a log commit to persist a new superblock without waiting for
the writeback of extents in that range to finish first.

Therefore if a crash happens after persisting the new superblock and
before writeback finishes, we have a superblock pointing to roots that
weren't fully persisted, or roots that point to nodes or leafs that
weren't fully persisted, causing all sorts of unexpected/bad behaviour
as we end up reading garbage from disk or the content of some node/leaf
from a past generation that got cowed or deleted and is no longer valid
(for this latter case we end up getting error messages like "parent
transid verify failed on X wanted Y found Z" when reading btree
nodes/leafs from disk).
Signed-off-by: Filipe Manana <fdman...@suse.com>
---
 fs/btrfs/transaction.c | 92 ++++++++++++++++++++++++++++++++++----------------
 fs/btrfs/transaction.h |  2 --
 2 files changed, 76 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 8f1a408..cb673d4 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -76,6 +76,32 @@ void btrfs_put_transaction(struct btrfs_transaction *transaction)
 	}
 }
 
+static void clear_btree_io_tree(struct extent_io_tree *tree)
+{
+	spin_lock(&tree->lock);
+	while (!RB_EMPTY_ROOT(&tree->state)) {
+		struct rb_node *node;
+		struct extent_state *state;
+
+		node = rb_first(&tree->state);
+		state = rb_entry(node, struct extent_state, rb_node);
+		rb_erase(&state->rb_node, &tree->state);
+		RB_CLEAR_NODE(&state->rb_node);
+		/*
+		 * btree io trees aren't supposed to have tasks waiting for
+		 * changes in the flags of extent states ever.
+		 */
+		ASSERT(!waitqueue_active(&state->wq));
+		free_extent_state(state);
+		if (need_resched()) {
+			spin_unlock(&tree->lock);
+			cond_resched();
+			spin_lock(&tree->lock);
+		}
+	}
+	spin_unlock(&tree->lock);
+}
+
 static noinline void switch_commit_roots(struct btrfs_transaction *trans,
 					 struct btrfs_fs_info *fs_info)
 {
@@ -89,6 +115,7 @@ static noinline void switch_commit_roots(struct btrfs_transaction *trans,
 		root->commit_root = btrfs_root_node(root);
 		if (is_fstree(root->objectid))
 			btrfs_unpin_free_ino(root);
+		clear_btree_io_tree(&root->dirty_log_pages);
 	}
 	up_write(&fs_info->commit_root_sem);
 }
@@ -827,17 +854,38 @@ int btrfs_write_marked_extents(struct btrfs_root *root,
 
 	while (!find_first_extent_bit(dirty_pages, start, &start, &end,
 				      mark, &cached_state)) {
-		convert_extent_bit(dirty_pages, start, end, EXTENT_NEED_WAIT,
-				   mark, &cached_state, GFP_NOFS);
-		cached_state = NULL;
-		err = filemap_fdatawrite_range(mapping, start, end);
+		bool wait_writeback = false;
+
+		err = convert_extent_bit(dirty_pages, start, end,
+					 EXTENT_NEED_WAIT,
+					 mark, &cached_state, GFP_NOFS);
+		/*
+		 * convert_extent_bit can return -ENOMEM, which is most of the
+		 * time a temporary error. So when it happens, ignore the error
+		 * and wait for writeback of this range to finish - because we
+		 * failed to set the bit EXTENT_NEED_WAIT for the range, a call
+		 * to btrfs_wait_marked_extents() would not know that writeback
+		 * for this range started and therefore wouldn't wait for it to
+		 * finish - we don't want to commit a superblock that points to
+		 * btree nodes/leafs for which writeback hasn't finished yet
+		 * (and without errors).
+		 * We cleanup any entries left in the io tree when committing
+		 * the transaction (through clear_btree_io_tree()).
+		 */
+		if (err == -ENOMEM) {
[PATCH 3/3] Btrfs: avoid returning -ENOMEM in convert_extent_bit() too early
We try to allocate an extent state before acquiring the tree's spinlock
just in case we end up needing to split an existing extent state into
two. If that allocation failed, we would return -ENOMEM.

However, our single caller (the transaction/log commit code) passes in
an extent state that was cached from a call to find_first_extent_bit()
and that has a very high chance to match exactly the input range
(always true for a transaction commit and very often, but not always,
true for a log commit) - in this case we end up not needing at all that
initial extent state used for an eventual split. Therefore just don't
return -ENOMEM if we can't allocate the temporary extent state, since
we might not need it at all, and if we end up needing one, we'll do it
later anyway.

Signed-off-by: Filipe Manana <fdman...@suse.com>
---
 fs/btrfs/extent_io.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 0d931b1..654ed3d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1066,13 +1066,21 @@ int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 	int err = 0;
 	u64 last_start;
 	u64 last_end;
+	bool first_iteration = true;
 
 	btrfs_debug_check_extent_io_range(tree, start, end);
 
 again:
 	if (!prealloc && (mask & __GFP_WAIT)) {
+		/*
+		 * Best effort, don't worry if extent state allocation fails
+		 * here for the first iteration. We might have a cached state
+		 * that matches exactly the target range, in which case no
+		 * extent state allocations are needed. We'll only know this
+		 * after locking the tree.
+		 */
 		prealloc = alloc_extent_state(mask);
-		if (!prealloc)
+		if (!prealloc && !first_iteration)
 			return -ENOMEM;
 	}
 
@@ -1242,6 +1250,7 @@ search_again:
 	spin_unlock(&tree->lock);
 	if (mask & __GFP_WAIT)
 		cond_resched();
+	first_iteration = false;
 	goto again;
 }
-- 
1.9.1
[PATCH 2/3] Btrfs: make find_first_extent_bit be able to cache any state
Right now the only caller of find_first_extent_bit() that is interested
in caching extent states (transaction or log commit) never gets an
extent state cached. This is because find_first_extent_bit() only
caches states that have at least one of the flags EXTENT_IOBITS or
EXTENT_BOUNDARY, and the transaction/log commit caller always passes a
tree that doesn't ever have extent states with any of those flags (they
can only have one of the following flags: EXTENT_DIRTY, EXTENT_NEW or
EXTENT_NEED_WAIT).

This change, together with the following one in the patch series
(titled "Btrfs: avoid returning -ENOMEM in convert_extent_bit() too
early"), will help significantly reduce the chances of calls to
convert_extent_bit() failing with -ENOMEM when called from the
transaction/log commit code.

Signed-off-by: Filipe Manana <fdman...@suse.com>
---
 fs/btrfs/extent_io.c   | 16 ++++++++++++----
 fs/btrfs/transaction.c |  3 +++
 2 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 420fe26..0d931b1 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -796,17 +796,25 @@ static void set_state_bits(struct extent_io_tree *tree,
 	state->state |= bits_to_set;
 }
 
-static void cache_state(struct extent_state *state,
-			struct extent_state **cached_ptr)
+static void cache_state_if_flags(struct extent_state *state,
+				 struct extent_state **cached_ptr,
+				 const u64 flags)
 {
 	if (cached_ptr && !(*cached_ptr)) {
-		if (state->state & (EXTENT_IOBITS | EXTENT_BOUNDARY)) {
+		if (!flags || (state->state & flags)) {
 			*cached_ptr = state;
 			atomic_inc(&state->refs);
 		}
 	}
 }
 
+static void cache_state(struct extent_state *state,
+			struct extent_state **cached_ptr)
+{
+	return cache_state_if_flags(state, cached_ptr,
+				    EXTENT_IOBITS | EXTENT_BOUNDARY);
+}
+
 /*
  * set some bits on a range in the tree.  This may require allocations or
  * sleeping, so the gfp mask is used to indicate what is allowed.
@@ -1482,7 +1490,7 @@ int find_first_extent_bit(struct extent_io_tree *tree, u64 start,
 		state = find_first_extent_bit_state(tree, start, bits);
 got_it:
 	if (state) {
-		cache_state(state, cached_state);
+		cache_state_if_flags(state, cached_state, 0);
 		*start_ret = state->start;
 		*end_ret = state->end;
 		ret = 0;
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index cb673d4..396ae8b 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -882,6 +882,7 @@ int btrfs_write_marked_extents(struct btrfs_root *root,
 			werr = err;
 		else if (wait_writeback)
 			werr = filemap_fdatawait_range(mapping, start, end);
+		free_extent_state(cached_state);
 		cached_state = NULL;
 		cond_resched();
 		start = end + 1;
@@ -926,6 +927,8 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
 		err = filemap_fdatawait_range(mapping, start, end);
 		if (err)
 			werr = err;
+		free_extent_state(cached_state);
+		cached_state = NULL;
 		cond_resched();
 		start = end + 1;
 	}
-- 
1.9.1
Re: What is the vision for btrfs fs repair?
On 2014-10-10 18:05, Eric Sandeen wrote:
> On 10/10/14 2:35 PM, Austin S Hemmelgarn wrote:
>> On 2014-10-10 13:43, Bob Marley wrote:
>>> On 10/10/2014 16:37, Chris Murphy wrote:
>>>> The fail safe behavior is to treat the known good tree root as the
>>>> default tree root, and bypass the bad tree root if it cannot be
>>>> repaired, so that the volume can be mounted with default mount
>>>> options (i.e. the ones in fstab). Otherwise it's a filesystem that
>>>> isn't well suited for general purpose use as rootfs, let alone for
>>>> boot.
>>>
>>> A filesystem which is suited for general purpose use is a filesystem
>>> which honors fsync, and doesn't *ever* auto-roll-back without user
>>> intervention. Anything different is not suited for database
>>> transactions at all. Any paid service which has the users' database
>>> on btrfs is going to be at risk of losing payments, and probably
>>> without the company even knowing.
>>>
>>> If btrfs goes this way, I hope a big warning is written on the wiki
>>> and on the manpages telling that this filesystem is totally
>>> unsuitable for hosting databases performing transactions.
>>
>> If they need reliability, they should have some form of redundancy in
>> place and/or run the database directly on the block device; because
>> even ext4, XFS, and pretty much every other filesystem can lose data
>> sometimes,
>
> Not if i.e. fsync returns. If the data is gone later, it's a hardware
> problem, or occasionally a bug - bugs that are usually found & fixed
> pretty quickly.

Yes, barring bugs and hardware problems they won't lose data.

>> the difference being that those tend to give worse results when
>> hardware is misbehaving than BTRFS does, because BTRFS usually has an
>> old copy of whatever data structure gets corrupted to fall back on.
>
> I'm curious, is that based on conjecture or real-world testing?

I wouldn't really call it testing, but based on personal experience I
know that ext4 can lose whole directory sub-trees if it gets a single
corrupt sector in the wrong place.

I've also had that happen on FAT32 and (somewhat interestingly) HFS+
with failing/misbehaving hardware; and I've actually had individual
files disappear on HFS+ without any discernible hardware issues. I
don't have as much experience with XFS, but would assume based on what
I do know of it that it could have similar issues.

As for BTRFS, I've only ever had any issues with it 3 times: one was
due to the kernel panicking during resume from S1, and the other two
were due to hardware problems that would have caused issues on most
other filesystems as well. In both cases of hardware issues, while the
filesystem was initially unmountable, it was relatively simple to fix
once I knew how. I tried to fix an ext4 fs that had become unmountable
due to dropped writes once, and that was anything but simple, even with
the much greater amount of documentation.
Re: What is the vision for btrfs fs repair?
On 2014-10-12 06:14, Martin Steigerwald wrote:
> On Friday, 10 October 2014, 10:37:44, Chris Murphy wrote:
>> On Oct 10, 2014, at 6:53 AM, Bob Marley <bobmar...@shiftmail.org>
>> wrote:
>>> On 10/10/2014 03:58, Chris Murphy wrote:
>>>> * mount -o recovery
>>>>   Enable autorecovery attempts if a bad tree root is found at
>>>>   mount time.
>>>>
>>>> I'm confused why it's not the default yet. Maybe it's continuing to
>>>> evolve at a pace that suggests something could sneak in that makes
>>>> things worse? It is almost an oxymoron in that I'm manually
>>>> enabling an autorecovery. If true, maybe the closest indication
>>>> we'd get of btrfs stability is the default enabling of
>>>> autorecovery.
>>>
>>> No way! I wouldn't want a default like that.
>>>
>>> If you think at distributed transactions: suppose a sync was issued
>>> on both sides of a distributed transaction, then power was lost on
>>> one side, then btrfs had corruption. When I remount it, definitely
>>> the worst thing that can happen is that it auto-rolls-back to a
>>> previous known-good state.
>>
>> For a general purpose file system, losing 30 seconds (or less) of
>> questionably committed data, likely corrupt, is preferable to a file
>> system that won't mount without user intervention, which requires a
>> secret decoder ring to get it to mount at all. And may require the
>> use of specialized tools to retrieve that data in any case.
>>
>> The fail safe behavior is to treat the known good tree root as the
>> default tree root, and bypass the bad tree root if it cannot be
>> repaired, so that the volume can be mounted with default mount
>> options (i.e. the ones in fstab). Otherwise it's a filesystem that
>> isn't well suited for general purpose use as rootfs, let alone for
>> boot.
>
> To understand this a bit better: What can be the reasons a recent tree
> gets corrupted?

Well, so far I have had the following cause corrupted trees:
1. Kernel panic during resume from ACPI S1 (suspend to RAM), which just
   happened to be in the middle of a tree commit.
2. Generic power loss during a tree commit.
3. A device not properly honoring write barriers (the operations
   immediately adjacent to the write barrier weren't being ordered
   correctly all the time).

Based on what I know about BTRFS, the following could also cause
problems:
1. A single-event upset somewhere in the write path.
2. The kernel issuing a write to the wrong device (I haven't had this
   happen to me, but know people who have).

In general, any of these will cause problems for pretty much any
filesystem, not just BTRFS.

> I always thought with a controller and device and driver combination
> that honors fsync, with BTRFS it would either be the new state or the
> last known good state *anyway*. So where does the need to rollback
> arise from?

I think that in this case the term rollback is a bit ambiguous; here it
means from the point of view of userspace, which sees the FS as having
'rolled back' from the most recent state to the last known good state.

> That said, all journalling filesystems have some sort of rollback as
> far as I understand: if the last journal entry is incomplete they
> discard it on journal replay. So even there you lose the last seconds
> of write activity.
>
> But in case fsync() returns, the data needs to be safe on disk. I
> always thought BTRFS honors this under *any* circumstance. If some
> proposed autorollback breaks this guarantee, I think something is
> broken elsewhere.
>
> And fsync is an fsync is an fsync. Its semantics are clear as crystal.
> There is nothing, absolutely nothing to discuss about it. An fsync
> completes if the device itself reported "Yeah, I have the data on
> disk, all safe and cool to go."
>
> Anything else is a bug IMO.

Or a hardware issue; most filesystems need disks to properly honor
write barriers to provide guaranteed semantics on an fsync, and many
consumer disk drives still don't honor them consistently.
Re: What is the vision for btrfs fs repair?
On Sun, Oct 12, 2014 at 6:14 AM, Martin Steigerwald
<mar...@lichtvoll.de> wrote:
> On Friday, 10 October 2014, 10:37:44, Chris Murphy wrote:
>> On Oct 10, 2014, at 6:53 AM, Bob Marley <bobmar...@shiftmail.org>
>> wrote:
>>> On 10/10/2014 03:58, Chris Murphy wrote:
>>>> * mount -o recovery
>>>>   Enable autorecovery attempts if a bad tree root is found at
>>>>   mount time.
>>>>
>>>> I'm confused why it's not the default yet. Maybe it's continuing to
>>>> evolve at a pace that suggests something could sneak in that makes
>>>> things worse? It is almost an oxymoron in that I'm manually
>>>> enabling an autorecovery. If true, maybe the closest indication
>>>> we'd get of btrfs stability is the default enabling of
>>>> autorecovery.
>>>
>>> No way! I wouldn't want a default like that.
>>>
>>> If you think at distributed transactions: suppose a sync was issued
>>> on both sides of a distributed transaction, then power was lost on
>>> one side, then btrfs had corruption. When I remount it, definitely
>>> the worst thing that can happen is that it auto-rolls-back to a
>>> previous known-good state.
>>
>> For a general purpose file system, losing 30 seconds (or less) of
>> questionably committed data, likely corrupt, is preferable to a file
>> system that won't mount without user intervention, which requires a
>> secret decoder ring to get it to mount at all. And may require the
>> use of specialized tools to retrieve that data in any case.
>>
>> The fail safe behavior is to treat the known good tree root as the
>> default tree root, and bypass the bad tree root if it cannot be
>> repaired, so that the volume can be mounted with default mount
>> options (i.e. the ones in fstab). Otherwise it's a filesystem that
>> isn't well suited for general purpose use as rootfs, let alone for
>> boot.
>
> To understand this a bit better: What can be the reasons a recent tree
> gets corrupted?
>
> I always thought with a controller and device and driver combination
> that honors fsync, with BTRFS it would either be the new state or the
> last known good state *anyway*. So where does the need to rollback
> arise from?
In theory the recovery option should never be necessary. Btrfs makes
all the guarantees everybody wants it to - when the data is fsynced,
then it will never be lost.

The question is what should happen when a corrupted tree root, which
should never happen, happens anyway. The options are to refuse to mount
the filesystem by default, or to mount it by default, discarding about
30-60s worth of writes. And yes, when this situation happens (whether
it mounts by default or not) btrfs has broken its promise of data being
written after a successful fsync return.

As has been pointed out, braindead drive firmware is the most likely
cause of this sort of issue. However, there are a number of other
hardware and software errors that could cause it, including errors in
linux outside of btrfs, and of course bugs in btrfs as well. In an
ideal world no filesystem would need any kind of recovery/repair tools;
needing them can often mean that the fsync promise was broken. The real
question is, once that has happened, how do you move on?

I think the best default is to auto-recover, but to have better
facilities for reporting errors to the user. Right now btrfs is very
quiet about failures - maybe a cryptic message in dmesg, and nobody
reads all of that unless they're looking for something. If btrfs could
report significant issues, that might mitigate the impact of an
auto-recovery.

Also, another thing to consider during recovery is whether the damaged
data could optionally be stored in a snapshot of some kind - maybe in
the way that ext3/4 rollback data after conversion gets stored in a
snapshot. My knowledge of the underlying structures is weak, but I'd
think that a corrupted tree root practically is a snapshot already, and
turning it into one might even be easier than cleaning it up. Of
course, we would need to ensure the snapshot could be deleted without
further error. Doing anything with the snapshot might require special
tools, but if people want to do disk scraping they could.
--
Rich
Re: btrfs send and kernel 3.17
Actually it seems strange that a send operation could corrupt the
source subvolume or fs. Why would the send modify the source subvolume
in any significant way?

The only way I can find to reconcile your observations with mine is
that maybe the snapshots get corrupted not by the send operation by
itself, but when they are generated with -r (readonly, as it is needed
to send them).

Are the corrupted snapshots you have on machine 2 (the one on which
send was never used) readonly?
Re: btrfs balance segfault, kernel BUG at fs/btrfs/extent-tree.c:7727
On Thu, Oct 9, 2014 at 10:19 AM, Petr Janecek jane...@ucw.cz wrote: I have trouble finishing btrfs balance on five disk raid10 fs. I added a disk to 4x3TB raid10 fs and run btrfs balance start /mnt/b3, which segfaulted after few hours, probably because of the BUG below. btrfs check does not find any errors, both before the balance and after reboot (the fs becomes un-umountable). [22744.238559] WARNING: CPU: 0 PID: 4211 at fs/btrfs/extent-tree.c:876 btrfs_lookup_extent_info+0x292/0x30a [btrfs]() [22744.532378] kernel BUG at fs/btrfs/extent-tree.c:7727! I am running into something similar. I just added a 3TB drive to my raid1 btrfs and started a balance. The balance segfaulted, and I find this in dmesg: [453046.291762] BTRFS info (device sde2): relocating block group 10367073779712 flags 17 [453062.494151] BTRFS info (device sde2): found 13 extents [453069.283368] [ cut here ] [453069.283468] kernel BUG at /data/src/linux-3.17.0-gentoo/fs/btrfs/relocation.c:931! [453069.283590] invalid opcode: [#1] SMP [453069.283666] Modules linked in: vhost_net vhost macvtap macvlan tun ipt_MASQUERADE xt_conntrack veth nfsd auth_rpcgss oid_registry lockd iptable_mangle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_filter ip_tables it87 hwmon_vid hid_logitech_dj nxt200x cx88_dvb videobuf_dvb dvb_core cx88_vp3054_i2c tuner_simple tuner_types tuner mousedev hid_generic usbhid cx88_alsa radeon cx8800 cx8802 cx88xx snd_hda_codec_realtek btcx_risc snd_hda_codec_generic videobuf_dma_sg videobuf_core kvm_amd tveeprom kvm rc_core v4l2_common cfbfillrect fbcon videodev cfbimgblt snd_hda_intel bitblit snd_hda_controller cfbcopyarea softcursor font tileblit i2c_algo_bit k10temp snd_hda_codec backlight drm_kms_helper snd_hwdep i2c_piix4 ttm snd_pcm snd_timer drm snd soundcore 8250 evdev [453069.285043] serial_core ext4 crc16 jbd2 mbcache zram lz4_compress zsmalloc ata_generic pata_acpi btrfs xor zlib_deflate atkbd raid6_pq ohci_pci firewire_ohci 
firewire_core crc_itu_t pata_atiixp ehci_pci ohci_hcd ehci_hcd usbcore usb_common r8169 mii sunrpc dm_mirror dm_region_hash dm_log dm_mod [453069.285552] CPU: 1 PID: 17270 Comm: btrfs Not tainted 3.17.0-gentoo #1 [453069.285657] Hardware name: Gigabyte Technology Co., Ltd. GA-880GM-UD2H/GA-880GM-UD2H, BIOS F8 10/11/2010 [453069.285806] task: 88040ec556e0 ti: 88010cf94000 task.ti: 88010cf94000 [453069.285925] RIP: 0010:[a02ddd62] [a02ddd62] build_backref_tree+0x1152/0x11b0 [btrfs] [453069.286137] RSP: 0018:88010cf97848 EFLAGS: 00010206 [453069.286223] RAX: 8800ae67c800 RBX: 880122e94000 RCX: 880122e949c0 [453069.286336] RDX: 09270788d000 RSI: 880054c3fbc0 RDI: 8800ae67c800 [453069.286449] RBP: 88010cf97958 R08: 000159a0 R09: 880122e94000 [453069.286561] R10: 0003 R11: R12: 8802da313000 [453069.286674] R13: 8802da313c60 R14: 880122e94780 R15: 88040c277000 [453069.286787] FS: 7f175ac51880() GS:880427c4() knlGS:f7333b40 [453069.286913] CS: 0010 DS: ES: CR0: 8005003b [453069.287005] CR2: 7f208de58000 CR3: 0003b0a9c000 CR4: 07e0 [453069.287116] Stack: [453069.287151] 88010cf97868 880122e94000 01ff880122e94300 880342156060 [453069.287282] 880122e94780 8802da313c60 880122e94600 8800ae67c800 [453069.287412] 880122e947c0 8802da313000 88040c277120 88010005 [453069.287542] Call Trace: [453069.287640] [a02ddfa3] relocate_tree_blocks+0x1e3/0x630 [btrfs] [453069.287796] [a02e0550] relocate_block_group+0x3d0/0x650 [btrfs] [453069.287951] [a02e0958] btrfs_relocate_block_group+0x188/0x2a0 [btrfs] [453069.288113] [a02b48f0] btrfs_relocate_chunk.isra.61+0x70/0x780 [btrfs] [453069.288276] [a02c7fd0] ? btrfs_set_lock_blocking_rw+0x70/0xc0 [btrfs] [453069.288438] [a02b0e79] ? free_extent_buffer+0x59/0xb0 [btrfs] [453069.288590] [a02b8e99] btrfs_balance+0x829/0xf40 [btrfs] [453069.288738] [a02bf80f] btrfs_ioctl_balance+0x1af/0x510 [btrfs] [453069.288890] [a02c59e4] btrfs_ioctl+0xa54/0x2950 [btrfs] [453069.288995] [8111d016] ? 
lru_cache_add_active_or_unevictable+0x26/0x90 [453069.289119] [8113a061] ? handle_mm_fault+0xbe1/0xdb0 [453069.289219] [811ffdce] ? cred_has_capability+0x5e/0x100 [453069.289323] [8104065c] ? __do_page_fault+0x1fc/0x4f0 [453069.289422] [8117d80e] do_vfs_ioctl+0x7e/0x4f0 [453069.289513] [811ff64f] ? file_has_perm+0x8f/0xa0 [453069.289606] [8117dd09] SyS_ioctl+0x89/0xa0 [453069.289692] [81040a1c] ? do_page_fault+0xc/0x10 [453069.289785] [814f5752] system_call_fastpath+0x16/0x1b [453069.289881] Code: ff ff 48 8b 9d 20 ff ff ff e9 11 ff ff ff 0f 0b be ec 03 00 00 48 c7 c7 d0 f0 30 a0 e8 28 00 d7 e0 e9 06 f3 ff ff e8 c4 42
Re: 3.17.0-rc7: kernel BUG at fs/btrfs/relocation.c:931!
On Thu, Oct 2, 2014 at 3:27 AM, Tomasz Chmielewski t...@virtall.com wrote: Got this when running balance with 3.17.0-rc7: [173475.410717] kernel BUG at fs/btrfs/relocation.c:931! I just started a post on another thread with this exact same issue on 3.17.0. I started a balance after adding a new drive. [453046.291762] BTRFS info (device sde2): relocating block group 10367073779712 flags 17 [453062.494151] BTRFS info (device sde2): found 13 extents [453069.283368] [ cut here ] [453069.283468] kernel BUG at /data/src/linux-3.17.0-gentoo/fs/btrfs/relocation.c:931! [453069.283590] invalid opcode: [#1] SMP [453069.283666] Modules linked in: vhost_net vhost macvtap macvlan tun ipt_MASQUERADE xt_conntrack veth nfsd auth_rpcgss oid_registry lockd iptable_mangle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_filter ip_tables it87 hwmon_vid hid_logitech_dj nxt200x cx88_dvb videobuf_dvb dvb_core cx88_vp3054_i2c tuner_simple tuner_types tuner mousedev hid_generic usbhid cx88_alsa radeon cx8800 cx8802 cx88xx snd_hda_codec_realtek btcx_risc snd_hda_codec_generic videobuf_dma_sg videobuf_core kvm_amd tveeprom kvm rc_core v4l2_common cfbfillrect fbcon videodev cfbimgblt snd_hda_intel bitblit snd_hda_controller cfbcopyarea softcursor font tileblit i2c_algo_bit k10temp snd_hda_codec backlight drm_kms_helper snd_hwdep i2c_piix4 ttm snd_pcm snd_timer drm snd soundcore 8250 evdev [453069.285043] serial_core ext4 crc16 jbd2 mbcache zram lz4_compress zsmalloc ata_generic pata_acpi btrfs xor zlib_deflate atkbd raid6_pq ohci_pci firewire_ohci firewire_core crc_itu_t pata_atiixp ehci_pci ohci_hcd ehci_hcd usbcore usb_common r8169 mii sunrpc dm_mirror dm_region_hash dm_log dm_mod [453069.285552] CPU: 1 PID: 17270 Comm: btrfs Not tainted 3.17.0-gentoo #1 [453069.285657] Hardware name: Gigabyte Technology Co., Ltd. 
GA-880GM-UD2H/GA-880GM-UD2H, BIOS F8 10/11/2010 [453069.285806] task: 88040ec556e0 ti: 88010cf94000 task.ti: 88010cf94000 [453069.285925] RIP: 0010:[a02ddd62] [a02ddd62] build_backref_tree+0x1152/0x11b0 [btrfs] [453069.286137] RSP: 0018:88010cf97848 EFLAGS: 00010206 [453069.286223] RAX: 8800ae67c800 RBX: 880122e94000 RCX: 880122e949c0 [453069.286336] RDX: 09270788d000 RSI: 880054c3fbc0 RDI: 8800ae67c800 [453069.286449] RBP: 88010cf97958 R08: 000159a0 R09: 880122e94000 [453069.286561] R10: 0003 R11: R12: 8802da313000 [453069.286674] R13: 8802da313c60 R14: 880122e94780 R15: 88040c277000 [453069.286787] FS: 7f175ac51880() GS:880427c4() knlGS:f7333b40 [453069.286913] CS: 0010 DS: ES: CR0: 8005003b [453069.287005] CR2: 7f208de58000 CR3: 0003b0a9c000 CR4: 07e0 [453069.287116] Stack: [453069.287151] 88010cf97868 880122e94000 01ff880122e94300 880342156060 [453069.287282] 880122e94780 8802da313c60 880122e94600 8800ae67c800 [453069.287412] 880122e947c0 8802da313000 88040c277120 88010005 [453069.287542] Call Trace: [453069.287640] [a02ddfa3] relocate_tree_blocks+0x1e3/0x630 [btrfs] [453069.287796] [a02e0550] relocate_block_group+0x3d0/0x650 [btrfs] [453069.287951] [a02e0958] btrfs_relocate_block_group+0x188/0x2a0 [btrfs] [453069.288113] [a02b48f0] btrfs_relocate_chunk.isra.61+0x70/0x780 [btrfs] [453069.288276] [a02c7fd0] ? btrfs_set_lock_blocking_rw+0x70/0xc0 [btrfs] [453069.288438] [a02b0e79] ? free_extent_buffer+0x59/0xb0 [btrfs] [453069.288590] [a02b8e99] btrfs_balance+0x829/0xf40 [btrfs] [453069.288738] [a02bf80f] btrfs_ioctl_balance+0x1af/0x510 [btrfs] [453069.288890] [a02c59e4] btrfs_ioctl+0xa54/0x2950 [btrfs] [453069.288995] [8111d016] ? lru_cache_add_active_or_unevictable+0x26/0x90 [453069.289119] [8113a061] ? handle_mm_fault+0xbe1/0xdb0 [453069.289219] [811ffdce] ? cred_has_capability+0x5e/0x100 [453069.289323] [8104065c] ? __do_page_fault+0x1fc/0x4f0 [453069.289422] [8117d80e] do_vfs_ioctl+0x7e/0x4f0 [453069.289513] [811ff64f] ? 
file_has_perm+0x8f/0xa0 [453069.289606] [8117dd09] SyS_ioctl+0x89/0xa0 [453069.289692] [81040a1c] ? do_page_fault+0xc/0x10 [453069.289785] [814f5752] system_call_fastpath+0x16/0x1b [453069.289881] Code: ff ff 48 8b 9d 20 ff ff ff e9 11 ff ff ff 0f 0b be ec 03 00 00 48 c7 c7 d0 f0 30 a0 e8 28 00 d7 e0 e9 06 f3 ff ff e8 c4 42 02 00 0f 0b 3c b0 0f 84 72 f1 ff ff be 22 03 00 00 48 c7 c7 d0 f0 30 [453069.290429] RIP [a02ddd62] build_backref_tree+0x1152/0x11b0 [btrfs] [453069.290591] RSP 88010cf97848 [453069.316194] ---[ end trace 5fdc0af4cc62bf41 ]---
Re: btrfs send and kernel 3.17
On 10/13/2014 02:40 PM, john terragon wrote: Actually it seems strange that a send operation could corrupt the source subvolume or fs. Why would the send modify the source subvolume in any significant way? The only way I can find to reconcile your observations with mine is that maybe the snapshots get corrupted not by the send operation by itself but when they are generated with -r (readonly, as it is needed to send them). Are the corrupted snapshots you have in machine 2 (the one in which send was never used) readonly? Yes, on both machines there are only readonly snapshots.
Re: btrfs send and kernel 3.17
On Sun, Oct 12, 2014 at 7:11 AM, David Arendt ad...@prnet.org wrote: This weekend I finally had time to try btrfs send again on the newly created fs. Now I am running into another problem: btrfs send returns: ERROR: send ioctl failed with -12: Cannot allocate memory In dmesg I see only the following output: parent transid verify failed on 21325004800 wanted 2620 found 8325 I'm not using send at all, but I've been running into parent transid verify failed messages where the wanted is way smaller than the found when trying to balance a raid1 after adding a new drive. Originally I had gotten a BUG, and after reboot the drive finished balancing (interestingly enough without moving any chunks to the new drive - just consolidating everything on the old drives), and then when I try to do another balance I get: [ 4426.987177] BTRFS info (device sdc2): relocating block group 10367073779712 flags 17 [ 4446.287998] BTRFS info (device sdc2): found 13 extents [ 4451.330887] parent transid verify failed on 10063286579200 wanted 987432 found 993678 [ 4451.350663] parent transid verify failed on 10063286579200 wanted 987432 found 993678 The btrfs program itself outputs: btrfs balance start -v /data Dumping filters: flags 0x7, state 0x0, force is off DATA (flags 0x0): balancing METADATA (flags 0x0): balancing SYSTEM (flags 0x0): balancing ERROR: error during balancing '/data' - Cannot allocate memory There may be more info in syslog - try dmesg | tail This is also on 3.17. This may be completely unrelated, but it seemed similar enough to be worth mentioning. 
The filesystem otherwise seems to work fine, other than the new drive not having any data on it: Label: 'datafs' uuid: cd074207-9bc3-402d-bee8-6a8c77d56959 Total devices 6 FS bytes used 2.16TiB devid1 size 2.73TiB used 2.40TiB path /dev/sdc2 devid2 size 931.32GiB used 695.03GiB path /dev/sda2 devid3 size 931.32GiB used 700.00GiB path /dev/sdb2 devid4 size 931.32GiB used 700.00GiB path /dev/sdd2 devid5 size 931.32GiB used 699.00GiB path /dev/sde2 devid6 size 2.73TiB used 0.00 path /dev/sdf2 This is btrfs-progs-3.16.2. -- Rich
Re: what is the best way to monitor raid1 drive failures?
I had progs 3.12 and updated to the latest from git (3.16). With this update, btrfs fi show reports there is a missing device immediately after I pull it out. Thanks! I am using virtualbox to test this. So, I am detaching the drive like so: vboxmanage storageattach vm --storagectl controller --port port --device device --medium none Next I am going to try and test a more realistic scenario where a harddrive is not pulled out, but is damaged. Can/does btrfs mark a filesystem (say, a 2-drive raid1) degraded or unhealthy automatically when one drive is damaged badly enough that it cannot be written to or read from reliably? Suman On Sun, Oct 12, 2014 at 7:21 PM, Anand Jain anand.j...@oracle.com wrote: Suman, To simulate the failure, I detached one of the drives from the system. After that, I see no sign of a problem except for these errors: Are you physically pulling out the device ? I wonder if lsblk or blkid shows the error ? The device-missing reporting logic is in the progs (so have the latest) and it works provided user scripts such as blkid/lsblk also report the problem. OR for soft-detach tests you could use devmgt at http://github.com/anajain/devmgt Also, I am trying to get a device management framework into btrfs with better device management and reporting. Thanks, Anand On 10/13/14 07:50, Suman C wrote: Hi, I am testing some disk failure scenarios in a 2 drive raid1 mirror. They are 4GB each, virtual SATA drives inside virtualbox. To simulate the failure, I detached one of the drives from the system. After that, I see no sign of a problem except for these errors: Oct 12 15:37:14 rock-dev kernel: btrfs: bdev /dev/sdb errs: wr 0, rd 0, flush 1, corrupt 0, gen 0 Oct 12 15:37:14 rock-dev kernel: lost page write due to I/O error on /dev/sdb /dev/sdb is gone from the system, but btrfs fi show still lists it.
Label: raid1pool uuid: 4e5d8b43-1d34-4672-8057-99c51649b7c6 Total devices 2 FS bytes used 1.46GiB devid1 size 4.00GiB used 2.45GiB path /dev/sdb devid2 size 4.00GiB used 2.43GiB path /dev/sdc I am able to read and write just fine, but do see the above errors in dmesg. What is the best way to find out that one of the drives has gone bad? Suman
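The per-device error counters in the dmesg line quoted above ("errs: wr 0, rd 0, flush 1, ...") are also queryable from userspace, which may be the simplest monitoring hook. A minimal sketch, assuming a btrfs-progs version that provides the btrfs device stats subcommand; the flag_btrfs_errors name and the canned sample lines are mine, for illustration only:

```shell
# Flag any non-zero btrfs per-device error counter on stdin.
# "btrfs device stats <mnt>" prints one counter per line, e.g.:
#   [/dev/sdb].write_io_errs   0
flag_btrfs_errors() {
  awk '$2 != 0 { print "WARNING:", $0 }'
}

# Demo with canned input; a real invocation would be:
#   btrfs device stats /mnt | flag_btrfs_errors
printf '%s\n' \
  '[/dev/sdb].write_io_errs 0' \
  '[/dev/sdb].flush_io_errs 1' | flag_btrfs_errors
```

Run from cron, a non-empty output (or a mail on non-empty output) would have caught the flush error above even though the filesystem still appeared to read and write fine.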
Re: btrfs random filesystem corruption in kernel 3.17
From my own experience and based on what other people are saying, I think there is a random btrfs filesystem corruption problem in kernel 3.17 at least related to snapshots, therefore I decided to post using another subject to draw attention from people not concerned about btrfs send to it. More information can be found in the btrfs send posts. Did the filesystem you tried to balance contain snapshots? Read-only ones? On 10/13/2014 07:22 PM, Rich Freeman wrote: On Sun, Oct 12, 2014 at 7:11 AM, David Arendt ad...@prnet.org wrote: This weekend I finally had time to try btrfs send again on the newly created fs. Now I am running into another problem: btrfs send returns: ERROR: send ioctl failed with -12: Cannot allocate memory In dmesg I see only the following output: parent transid verify failed on 21325004800 wanted 2620 found 8325 I'm not using send at all, but I've been running into parent transid verify failed messages where the wanted is way smaller than the found when trying to balance a raid1 after adding a new drive. Originally I had gotten a BUG, and after reboot the drive finished balancing (interestingly enough without moving any chunks to the new drive - just consolidating everything on the old drives), and then when I try to do another balance I get: [ 4426.987177] BTRFS info (device sdc2): relocating block group 10367073779712 flags 17 [ 4446.287998] BTRFS info (device sdc2): found 13 extents [ 4451.330887] parent transid verify failed on 10063286579200 wanted 987432 found 993678 [ 4451.350663] parent transid verify failed on 10063286579200 wanted 987432 found 993678 The btrfs program itself outputs: btrfs balance start -v /data Dumping filters: flags 0x7, state 0x0, force is off DATA (flags 0x0): balancing METADATA (flags 0x0): balancing SYSTEM (flags 0x0): balancing ERROR: error during balancing '/data' - Cannot allocate memory There may be more info in syslog - try dmesg | tail This is also on 3.17. This may be completely unrelated, but it seemed similar enough to be worth mentioning.
This may be completely unrelated, but it seemed similar enough to be worth mentioning. The filesystem otherwise seems to work fine, other than the new drive not having any data on it: Label: 'datafs' uuid: cd074207-9bc3-402d-bee8-6a8c77d56959 Total devices 6 FS bytes used 2.16TiB devid1 size 2.73TiB used 2.40TiB path /dev/sdc2 devid2 size 931.32GiB used 695.03GiB path /dev/sda2 devid3 size 931.32GiB used 700.00GiB path /dev/sdb2 devid4 size 931.32GiB used 700.00GiB path /dev/sdd2 devid5 size 931.32GiB used 699.00GiB path /dev/sde2 devid6 size 2.73TiB used 0.00 path /dev/sdf2 This is btrfs-progs-3.16.2. -- Rich
Re: btrfs random filesystem corruption in kernel 3.17
On Mon, Oct 13, 2014 at 4:27 PM, David Arendt ad...@prnet.org wrote: From my own experience and based on what other people are saying, I think there is a random btrfs filesystem corruption problem in kernel 3.17 at least related to snapshots, therefore I decided to post using another subject to draw attention from people not concerned about btrfs send to it. More information can be found in the btrfs send posts. Did the filesystem you tried to balance contain snapshots? Read-only ones? The filesystem contains numerous subvolumes and snapshots, many of which are read-only. I'm managing many with snapper. The similarity of the transid verify errors made me think this issue is related, and the root cause may have nothing to do with btrfs send. As far as I can tell these errors aren't having any effect on my data - hopefully the system is catching the problems before there are actual disk writes/etc. -- Rich
Re: btrfs random filesystem corruption in kernel 3.17
I think I just found a consistent simple way to trigger the problem (at least on my system). And, as I guessed before, it seems to be related just to readonly snapshots: 1) I create a readonly snapshot 2) I do some changes on the source subvolume for the snapshot (I'm not sure changes are strictly needed) 3) reboot (or probably just unmount and remount. I reboot because the fs I've problems with contains my root subvolume) After the rebooting (or the remount) I consistently have the corruption with the usual multitude of these in dmesg parent transid verify failed on 902316032 wanted 2484 found 4101 and the characteristic ls -la output drwxr-xr-x 1 root root 250 Oct 10 15:37 root d? ? ?? ?? root-b2 drwxr-xr-x 1 root root 250 Oct 10 15:37 root-b3 d? ? ?? ?? root-backup root-backup and root-b2 are both readonly whereas root-b3 is rw (and it didn't get corrupted). David, maybe you can try the same steps on one of your machines? John
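For anyone retesting the three steps above against other kernels, they can be sketched as a script. This is a dry run that only prints the commands (the /mnt/scratch paths and /dev/sdX are placeholders of mine); pipe its output to sh as root, and only against a scratch filesystem you can afford to lose:

```shell
# Dry-run sketch of the reproduction steps above (hypothetical paths,
# /dev/sdX is a placeholder). Pipe the output to "sh" as root on a
# *scratch* btrfs filesystem to actually run it.
REPRO='btrfs subvolume snapshot -r /mnt/scratch/subvol /mnt/scratch/snap-ro  # 1) readonly snapshot
touch /mnt/scratch/subvol/some-change                                 # 2) change the source
umount /mnt/scratch && mount /dev/sdX /mnt/scratch                    # 3) remount (or reboot)
ls -la /mnt/scratch   # a corrupted snapshot shows up as "d? ? ? ..."'
printf '%s\n' "$REPRO"
```

Whether step 2 is strictly needed, as John notes, is still an open question; dropping the touch line would test that.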
Re: btrfs random filesystem corruption in kernel 3.17
On Mon, Oct 13, 2014 at 4:48 PM, john terragon jterra...@gmail.com wrote: I think I just found a consistent simple way to trigger the problem (at least on my system). And, as I guessed before, it seems to be related just to readonly snapshots: 1) I create a readonly snapshot 2) I do some changes on the source subvolume for the snapshot (I'm not sure changes are strictly needed) 3) reboot (or probably just unmount and remount. I reboot because the fs I've problems with contains my root subvolume) After the rebooting (or the remount) I consistently have the corruption with the usual multitude of these in dmesg parent transid verify failed on 902316032 wanted 2484 found 4101 and the characteristic ls -la output drwxr-xr-x 1 root root 250 Oct 10 15:37 root d? ? ?? ?? root-b2 drwxr-xr-x 1 root root 250 Oct 10 15:37 root-b3 d? ? ?? ?? root-backup root-backup and root-b2 are both readonly whereas root-b3 is rw (and it didn't get corrupted). David, maybe you can try the same steps on one of your machines? Look at that. I didn't realize it, but indeed I have a corrupted snapshot: /data/.snapshots/5338/: ls: cannot access /data/.snapshots/5338/snapshot: Cannot allocate memory total 4 drwxr-xr-x 1 root root 32 Oct 11 06:09 . drwxr-x--- 1 root root 32 Oct 11 07:42 .. -rw--- 1 root root 135 Oct 11 06:09 info.xml d? ? ?? ?? snapshot Several older snapshots are fine, and those predate my 3.17 upgrade. I noticed that this corrupted snapshot isn't even listed in my snapper lists. btrfs su delete /data/.snapshots/5338/snapshot Transaction commit: none (default) ERROR: error accessing '/data/.snapshots/5338/snapshot' Removing them appears to be problematic as well. I might just disable compress=lzo and go back to 3.16 to see how that goes. -- Rich
Re: btrfs random filesystem corruption in kernel 3.17
On Mon, Oct 13, 2014 at 4:55 PM, Rich Freeman r-bt...@thefreemanclan.net wrote: On Mon, Oct 13, 2014 at 4:48 PM, john terragon jterra...@gmail.com wrote: After the rebooting (or the remount) I consistently have the corruption with the usual multitude of these in dmesg parent transid verify failed on 902316032 wanted 2484 found 4101 and the characteristic ls -la output Sorry to double-reply, but I left this out. I have a long string of these early in boot as well that I never noticed before. -- Rich
Re: What is the vision for btrfs fs repair?
On 10/08/2014 03:11 PM, Eric Sandeen wrote: I was looking at Marc's post: http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html and it feels like there isn't exactly a cohesive, overarching vision for repair of a corrupted btrfs filesystem. In other words - I'm an admin cruising along, when the kernel throws some fs corruption error, or for whatever reason btrfs fails to mount. What should I do? Marc lays out several steps, but to me this highlights that there seem to be a lot of disjoint mechanisms out there to deal with these problems; mostly from Marc's blog, with some bits of my own:

* btrfs scrub: errors are corrected along the way if possible (what *is* possible?)
* mount -o recovery: enable autorecovery attempts if a bad tree root is found at mount time.
* mount -o degraded: allow mounts to continue with missing devices. (This isn't really a way to recover from corruption, right?)
* btrfs-zero-log: remove the log tree if the log tree is corrupt.
* btrfs rescue (chunk-recover, super-recover): recover a damaged btrfs filesystem. How does this relate to btrfs check?
* btrfs check (--repair, --init-csum-tree, --init-extent-tree): repair a btrfs filesystem. How does this relate to btrfs rescue?
* btrfs restore: try to salvage files from a damaged filesystem (not really repair, it's disk-scraping).

What's the vision for, say, scrub vs. check vs. rescue? Should they repair the same errors, only online vs. offline? If not, what class of errors does one fix vs. the other? How would an admin know? Can btrfs check recover a bad tree root in the same way that mount -o recovery does? How would I know if I should use --init-*-tree, or chunk-recover, and what are the ramifications of using these options?
It feels like recovery tools have been badly splintered, and if there's an overarching design or vision for btrfs fs repair, I can't tell what it is. Can anyone help me? We probably should just consolidate under 3 commands: one for online checking, one for offline repair, and one for pulling stuff off of the disk when things go to hell. A lot of these tools were born out of the fact that we didn't have a fsck tool for a long time, so there were these stop gaps put into place, and now it's time to go back and clean it up. I'll try and do this after I finish my cleanup/sync between kernel and progs work, and fill out the documentation a little better so it's clear when to use what. Thanks, Josef
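Until that consolidation lands, a rough escalation order distilled from the tools Eric lists might look like the checklist below. To be clear, this is one plausible ordering, not an official procedure (device names are placeholders): image the device first, and stop at the first step that gets the data back.

```shell
# Checklist sketch, least to most invasive. /dev/sdX is a placeholder.
print_triage() {
  cat <<'EOF'
1. mount -o ro,recovery /dev/sdX /mnt   # try fallback tree roots, read-only
2. btrfs rescue super-recover /dev/sdX  # restore superblock from its copies
3. btrfs-zero-log /dev/sdX              # only if a corrupt log tree is implicated
4. btrfs check --repair /dev/sdX        # offline repair, last resort
5. btrfs restore /dev/sdX /recovery     # scrape files out; repairs nothing
EOF
}
print_triage
```

Scrub and mount -o degraded are deliberately absent: scrub is routine online maintenance rather than crash recovery, and degraded addresses missing devices, not corruption.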
Re: btrfs random filesystem corruption in kernel 3.17
I'm using compress=no so compression doesn't seem to be related, at least in my case. Just read-only snapshots on 3.17 (although I haven't tried 3.16). John
Re: btrfs random filesystem corruption in kernel 3.17
As these two machines are running as servers for different purposes (yes, I know that btrfs is unstable and any corruption or data loss is at my own risk, therefore I have good backups), I don't want to reboot them more than necessary. However I tried to bring my reboot times in relation with corruptions: machine 1: d? ? ? ? ?? root.20141009.000503.backup reboot system boot 3.17.0 Thu Oct 9 23:20 still running reboot system boot 3.17.0 Tue Oct 7 21:25 - 23:18 (2+01:53) reboot system boot 3.17.0 Mon Oct 6 22:47 - 23:18 (3+00:31) For this machine, corruption seems to have occurred for a snapshot created after a reboot. machine 2: d? ? ?? ?? root.20141006.003239.backup d? ? ?? ?? root.20141007.001616.backup d? ? ?? ?? root.20141008.000501.backup d? ? ?? ?? root.20141009.052436.backup reboot system boot 3.17.0 Thu Oct 9 21:31 still running reboot system boot 3.17.0 Tue Oct 7 21:27 - 21:30 (2+00:03) reboot system boot 3.17.0 Tue Oct 7 17:51 - 21:26 (03:34) reboot system boot 3.17.0 Sun Oct 5 23:50 - 17:50 (1+17:59) reboot system boot 3.17.0 Sun Oct 5 23:47 - 23:49 (00:01) During the next days, I will set up a virtual machine to do more tests. On 10/13/2014 10:48 PM, john terragon wrote: I think I just found a consistent simple way to trigger the problem (at least on my system). And, as I guessed before, it seems to be related just to readonly snapshots: 1) I create a readonly snapshot 2) I do some changes on the source subvolume for the snapshot (I'm not sure changes are strictly needed) 3) reboot (or probably just unmount and remount. I reboot because the fs I've problems with contains my root subvolume) After the rebooting (or the remount) I consistently have the corruption with the usual multitude of these in dmesg parent transid verify failed on 902316032 wanted 2484 found 4101 and the characteristic ls -la output drwxr-xr-x 1 root root 250 Oct 10 15:37 root d? ? ?? ?? root-b2 drwxr-xr-x 1 root root 250 Oct 10 15:37 root-b3 d? ? ?? ??
root-backup root-backup and root-b2 are both readonly whereas root-b3 is rw (and it didn't get corrupted). David, maybe you can try the same steps on one of your machines? John
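The "d? ? ??" entries David and John are grepping ls output for are directory entries whose inodes can no longer be stat()ed, so they can be found mechanically instead of by eye. A small sketch (the scan_dir name and the example path are mine; it works on any directory, so point it at wherever the snapshots live):

```shell
# Print entries in a directory whose metadata cannot be read --
# the symptom behind the "d? ? ?? ??" lines in the ls -la output above.
scan_dir() {
  for entry in "$1"/*; do
    # An empty directory leaves the glob unexpanded; skip that case.
    [ "$entry" = "$1/*" ] && continue
    if ! stat "$entry" >/dev/null 2>&1; then
      echo "possibly corrupted: $entry"
    fi
  done
}

scan_dir "${1:-.}"   # e.g. scan_dir /data/.snapshots
```

Run nightly, this would have flagged the backup snapshots on both machines without waiting for a manual ls after each reboot.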
Re: btrfs random filesystem corruption in kernel 3.17
I'm also using no compression. On 10/13/2014 11:22 PM, john terragon wrote: I'm using compress=no so compression doesn't seem to be related, at least in my case. Just read-only snapshots on 3.17 (although I haven't tried 3.16). John
Re: btrfs random filesystem corruption in kernel 3.17
David Arendt posted on Mon, 13 Oct 2014 23:25:23 +0200 as excerpted: I'm also using no compression. On 10/13/2014 11:22 PM, john terragon wrote: I'm using compress=no so compression doesn't seem to be related, at least in my case. Just read-only snapshots on 3.17 (although I haven't tried 3.16). While I'm not a mind-reader and thus don't know for sure, Rich's reference to 3.16 and compression might not be related to this bug at all. In 3.15 and early 3.16, there was a different bug related to compression, tho IIRC it was patched in 3.16.2 and 3.17-rc2 (or maybe .3 and rc3, it's patched in the latest 3.16.x anyway, and in 3.17). So how I read his comment was that he was considering going back to 3.16 and disabling compression to deal with that bug (he may not know the patch was marked for stable and is in current 3.16.x), rather than stay on 3.17, since this bug hasn't even been traced yet, let alone patched. Meanwhile, this bug makes me glad my use-case doesn't involve snapshots, and I've seen nothing of it. =:^) -- Duncan - List replies preferred. No HTML msgs. Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
Re: btrfs random filesystem corruption in kernel 3.17
Rich Freeman posted on Mon, 13 Oct 2014 16:42:14 -0400 as excerpted: On Mon, Oct 13, 2014 at 4:27 PM, David Arendt ad...@prnet.org wrote: From my own experience and based on what other people are saying, I think there is a random btrfs filesystem corruption problem in kernel 3.17 at least related to snapshots, therefore I decided to post using another subject to draw attention from people not concerned about btrfs send to it. More information can be found in the brtfs send posts. Did the filesystem you tried to balance contain snapshots ? Read only ones ? The filesystem contains numerous subvolumes and snapshots, many of which are read-only. I'm managing many with snapper. The similarity of the transid verify errors made me think this issue is related, and the root cause may have nothing to do with btrfs send. As far as I can tell these errors aren't having any affect on my data - hopefully the system is catching the problems before there are actual disk writes/etc. Summarizing what I've seen on the threads... 1) The bug seems to be read-only snapshot related. The connection to send is that send creates read-only snapshots, but people creating read- only snapshots for other purposes are now reporting the same problem, so it's not send, it's the read-only snapshots. 2) Writable snapshots haven't been implicated yet, and the working set from which the snapshots are taken doesn't seem to be affected, either. So in that sense it's not affecting ordinary usage, only the read-only snapshots themselves. 3) More problematic, however, is the fact that these apparently corrupted read-only snapshots often are not listed properly and can't be deleted, tho I'm not sure if that's /all/ the corrupted snapshots or only part of them. 
So while it may not affect ordinary operation in the short term, over time until there's a fix, people routinely doing read-only snapshots are going to be getting more and more of these undeletable snapshots, and depending on whether the eventual patch only prevents more or can actually fix the bad ones (possibly via btrfs check or the like), affected filesystems may ultimately have to be blown away and recreated with a fresh mkfs, in order to kill the currently undeletable snapshots.

So the first thing to do would be to shut off whatever's making read-only snapshots, so you don't make the problem worse while it's being investigated. For those who can do that without too big an interruption to their normal routine (who don't depend on send/receive, for instance), just keep it off for the time being. For those who depend on read-only snapshots (send/receive for backup and the data is too valuable to not do the backups for a few days), consider switching back to 3.16-stable -- from 3.16.3 at least, the patch for the compress bug is there, so that shouldn't be a problem.

And if you're affected, be aware that until we have a fix, we don't know if it'll be possible to remove the affected and currently undeletable snapshots. If it's not, at some point you'll need to do a fresh mkfs.btrfs, to get rid of the damage. Since the bug doesn't appear to affect writable snapshots or the head from which snapshots are made, it's not urgent, and a full fix is likely to include a patch to detect and fix the problem as well, but until we know what the problem is we can't be sure of that, so be prepared to do that mkfs at some point, as at this point it's possible that's the only way you'll be able to kill the corrupted snapshots.
4) Total speculation on my part, but given the wanted transid (aka generation, in different contexts) is significantly lower than the found transid, and the fact that the problem appears to be limited to /read-only/ snapshots, my first suspicion is that something's getting updated that would normally apply to all snapshots, but the read-only nature of the snapshots is preventing the full update there. The transid of the block is updated, but the snapshot being read-only is preventing update of the pointer in that snapshot accordingly. What I do /not/ know is whether the bug is that something's getting updated that should NOT be, and it's simply the read-only snapshots letting us know about it since the writable snapshots are fully updated, even if that breaks the snapshot (breaking writable snapshots in a different and currently undetected way), or if instead, it's a legitimate update, like a balance simply moving the snapshot around but not affecting it otherwise, and the bug is that the read-only snapshots aren't allowing the legitimate update. Either way, this more or less developed over the weekend, and it's Monday now, so the devs should be on it. If it's anything like the 3.15/3.16 compression bug, it'll take some time for them to properly trace it, and then to figure out an appropriate fix, but they will. Chances are we'll have at least some decent progress on a trace by Friday, and maybe
Re: btrfs random filesystem corruption in kernel 3.17
On Mon, Oct 13, 2014 at 5:22 PM, john terragon jterra...@gmail.com wrote: I'm using compress=no so compression doesn't seem to be related, at least in my case. Just read-only snapshots on 3.17 (although I haven't tried 3.16). I was using lzo compression, and hence my comment about turning it off before going back to 3.16 (not realizing that 3.16 has subsequently been fixed). Ironically enough, I discovered this as I was about to migrate my ext4 backup drive into my btrfs raid1. Maybe I'll go ahead and wait on that and have an rsync backup of the filesystem handy (minus snapshots) just in case. :) I'd switch to 3.16, but it sounds like there is no way to remove the snapshots at the moment, and I can live for a while without the ability to create new ones. Interestingly enough, it doesn't look like ALL snapshots are affected. I checked and some of the snapshots I made last weekend while doing system updates look accessible. They are significantly smaller, and the subvolumes they were made from are also fairly new - though I have no idea if that is related. The subvolumes do show up in btrfs su list. They cannot be examined using btrfs su show. It would be VERY nice to have a way of cleaning this up without blowing away the entire filesystem... -- Rich
Re: btrfs random filesystem corruption in kernel 3.17
And another worrying thing I didn't notice before. Two snapshots have dates that do not make sense. root-b3 and root-b4 have been created Oct 14th (and btw root's modification time was also on Oct the 14th). So why do they show Oct 10th? And root-prov has actually been created on Oct 10 15:37, as it correctly shows, so it's like btrfs sub snap picks up old stale data from who knows where or when or for what reason. Moreover, root-b4 was created with 3.16.5. Not good.

drwxrwsr-x 1 root staff  30 Sep 11 16:15 home
d? ? ?? ?? home-backup
drwxr-xr-x 1 root root  250 Oct 14 03:02 root
d? ? ?? ?? root-b2
drwxr-xr-x 1 root root  250 Oct 10 15:37 root-b3
drwxr-xr-x 1 root root  250 Oct 10 15:37 root-b4
drwxr-xr-x 1 root root  250 Oct 14 03:02 root-b5
drwxr-xr-x 1 root root  250 Oct 14 03:02 root-b6
d? ? ?? ?? root-backup
drwxr-xr-x 1 root root  250 Oct 10 15:37 root-prov
drwxr-xr-x 1 root root   88 Sep 15 16:02 vms

On Tue, Oct 14, 2014 at 1:18 AM, Rich Freeman r-bt...@thefreemanclan.net wrote: On Mon, Oct 13, 2014 at 5:22 PM, john terragon jterra...@gmail.com wrote: I'm using compress=no so compression doesn't seem to be related, at least in my case. Just read-only snapshots on 3.17 (although I haven't tried 3.16). I was using lzo compression, and hence my comment about turning it off before going back to 3.16 (not realizing that 3.16 has subsequently been fixed). Ironically enough I discovered this as I was about to migrate my ext4 backup drive into my btrfs raid1. Maybe I'll go ahead and wait on that and have an rsync backup of the filesystem handy (minus snapshots) just in case. :) I'd switch to 3.16, but it sounds like there is no way to remove the snapshots at the moment, and I can live for a while without the ability to create new ones. interestingly enough it doesn't look like ALL snapshots are affected. I checked and some of the snapshots I made last weekend while doing system updates look accessible.
They are significantly smaller, and the subvolumes they were made from are also fairly new - though I have no idea if that is related. The subvolumes do show up in btrfs su list. They cannot be examined using btrfs su show. It would be VERY nice to have a way of cleaning this up without blowing away the entire filesystem... -- Rich
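The symptom Rich describes above (snapshots that still appear in `btrfs su list` but fail `btrfs su show`) can be checked for in bulk. Below is a hedged sketch: the `audit_snapshots` helper and the `BTRFS_CMD` override are hypothetical, added only so the example can be exercised without a real btrfs filesystem; real use would simply call the btrfs binary.

```shell
# Hypothetical audit helper: list subvolumes under a mountpoint and
# report any that "subvolume show" can no longer examine.
# BTRFS_CMD defaults to the real btrfs binary; it is overridable so the
# sketch can be tested without a filesystem.
BTRFS_CMD=${BTRFS_CMD:-btrfs}

audit_snapshots() {
    mnt=$1
    # "btrfs subvolume list" prints "... path <name>"; take the last field.
    "$BTRFS_CMD" subvolume list "$mnt" | awk '{print $NF}' |
    while read -r path; do
        "$BTRFS_CMD" subvolume show "$mnt/$path" >/dev/null 2>&1 ||
            echo "suspect: $path"
    done
}

# Example usage (paths are illustrative):
#   audit_snapshots /mnt
```

Snapshots flagged as "suspect" would be the ones worth examining in dmesg for transid verify errors before deciding whether a fresh mkfs is unavoidable.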
Re: what is the best way to monitor raid1 drive failures?
On 10/14/14 03:50, Suman C wrote: I had progs 3.12 and updated to the latest from git (3.16). With this update, btrfs fi show reports there is a missing device immediately after I pull it out. Thanks! I am using virtualbox to test this. So, I am detaching the drive like so: vboxmanage storageattach vm --storagectl controller --port port --device device --medium none Next I am going to try and test a more realistic scenario where a harddrive is not pulled out, but is damaged. Can/does btrfs mark a filesystem (say, a 2-drive raid1) degraded or unhealthy automatically when one drive is damaged badly enough that it cannot be written to or read from reliably? There are some gaps compared to an enterprise volume manager, which are being fixed, but please do report what you find. Thanks, Anand Suman On Sun, Oct 12, 2014 at 7:21 PM, Anand Jain anand.j...@oracle.com wrote: Suman, To simulate the failure, I detached one of the drives from the system. After that, I see no sign of a problem except for these errors: Are you physically pulling out the device ? I wonder if lsblk or blkid shows the error ? The device-missing reporting logic is in the progs (so have the latest), and it works provided user scripts such as blkid/lsblk also report the problem. OR for soft-detach tests you could use devmgt at http://github.com/anajain/devmgt Also I am trying to get a device management framework into btrfs, with better device management and reporting. Thanks, Anand On 10/13/14 07:50, Suman C wrote: Hi, I am testing some disk failure scenarios in a 2 drive raid1 mirror. They are 4GB each, virtual SATA drives inside virtualbox. To simulate the failure, I detached one of the drives from the system.
After that, I see no sign of a problem except for these errors:

Oct 12 15:37:14 rock-dev kernel: btrfs: bdev /dev/sdb errs: wr 0, rd 0, flush 1, corrupt 0, gen 0
Oct 12 15:37:14 rock-dev kernel: lost page write due to I/O error on /dev/sdb

/dev/sdb is gone from the system, but btrfs fi show still lists it.

Label: raid1pool  uuid: 4e5d8b43-1d34-4672-8057-99c51649b7c6
	Total devices 2 FS bytes used 1.46GiB
	devid 1 size 4.00GiB used 2.45GiB path /dev/sdb
	devid 2 size 4.00GiB used 2.43GiB path /dev/sdc

I am able to read and write just fine, but do see the above errors in dmesg. What is the best way to find out that one of the drives has gone bad? Suman
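On the question of how to find out that a drive has gone bad: besides watching dmesg, btrfs-progs exposes per-device error counters via `btrfs device stats`. A small sketch that flags any nonzero counter follows; the `flag_bad_counters` helper is hypothetical, and the sample lines are made-up data that merely mimic the stats output format.

```shell
# Hypothetical helper: read "btrfs device stats"-style output on stdin
# and print every error counter that is greater than zero.
flag_bad_counters() {
    awk '$2 > 0 {
        split($1, a, /\]\./)   # "[/dev/sdb].flush_io_errs" -> dev, counter
        sub(/^\[/, "", a[1])   # strip the leading "["
        printf "%s: %s=%s\n", a[1], a[2], $2
    }'
}

# Made-up sample input; in real use:
#   btrfs device stats /mnt/raid1pool | flag_bad_counters
printf '%s\n' \
    '[/dev/sdb].write_io_errs   0' \
    '[/dev/sdb].flush_io_errs   1' \
    '[/dev/sdc].write_io_errs   0' | flag_bad_counters
```

A cron job running something like this would have caught the `flush 1` counter from the dmesg line above without anyone reading logs.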
Re: [PATCH v2] Btrfs: return failure if btrfs_dev_replace_finishing() failed
On Mon, Oct 13, 2014 at 06:18:04PM +0800, Anand Jain wrote: On 10/13/14 14:59, Eryu Guan wrote: On Mon, Oct 13, 2014 at 02:23:57PM +0800, Anand Jain wrote: comments below.. On 10/13/14 12:42, Eryu Guan wrote: device replace could fail due to another running scrub process or any other errors btrfs_scrub_dev() may hit, but this failure doesn't get returned to userspace. The following steps could reproduce this issue:

mkfs -t btrfs -f /dev/sdb1 /dev/sdb2
mount /dev/sdb1 /mnt/btrfs
while true; do btrfs scrub start -B /mnt/btrfs >/dev/null 2>&1; done
btrfs replace start -Bf /dev/sdb2 /dev/sdb3 /mnt/btrfs
# if this replace succeeded, do the following and repeat until
# you see this log in dmesg
# BTRFS: btrfs_scrub_dev(/dev/sdb2, 2, /dev/sdb3) failed -115
btrfs replace start -Bf /dev/sdb3 /dev/sdb2 /mnt/btrfs
# once you see the error log in dmesg, check return value of
# replace
echo $?

Introduce a new dev replace result BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS to catch -EINPROGRESS explicitly and return other errors directly to userspace.

Signed-off-by: Eryu Guan guane...@gmail.com
---
v2:
- set result to SCRUB_INPROGRESS if btrfs_scrub_dev returned -EINPROGRESS and return 0 as Miao Xie suggested

 fs/btrfs/dev-replace.c | 12 +---
 include/uapi/linux/btrfs.h | 1 +
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index eea26e1..a141f8b 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -418,9 +418,15 @@ int btrfs_dev_replace_start(struct btrfs_root *root,
 			      dev_replace->scrub_progress, 0, 1);
 	ret = btrfs_dev_replace_finishing(root->fs_info, ret);
-	WARN_ON(ret);
+	/* don't warn if EINPROGRESS, someone else might be running scrub */
+	if (ret == -EINPROGRESS) {
+		args->result = BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS;
+		ret = 0;
+	} else {
+		WARN_ON(ret);
+	}

I am a bit concerned why these racing threads here aren't excluding each other using mutually_exclusive_operation_running
as most of the other device operation threads do. Thanks, Anand btrfs_ioctl_scrub() doesn't use mutually_exclusive_operation_running as other device operations do; I'm not sure if it should (it seems to me scrub should do it too). But I think that's a different problem from the one I'm trying to fix here. The main purpose is to return an error to userspace when btrfs_scrub_dev() hits some error. Dealing with -EINPROGRESS is to match the current behavior (replace and scrub can run at the same time). Thanks, Eryu looks like we are trying to manage EINPROGRESS returned by Yes, that's right. btrfs_dev_replace_finishing(). In btrfs_dev_replace_finishing() which specific func call is returning EINPROGRESS ? I didn't go deep enough. btrfs_dev_replace_finishing() will check scrub_ret (the last argument), and return scrub_ret if it is non-zero. It was returning 0 unconditionally before this patch.

btrfs_dev_replace_start@fs/btrfs/dev-replace.c
416	ret = btrfs_scrub_dev(fs_info, src_device->devid, 0,
417			      src_device->total_bytes,
418			      dev_replace->scrub_progress, 0, 1);
419
420	ret = btrfs_dev_replace_finishing(root->fs_info, ret);

and btrfs_dev_replace_finishing@fs/btrfs/dev-replace.c
529	if (!scrub_ret) {
530		btrfs_dev_replace_update_device_in_mapping_tree(fs_info,
531								src_device,
532								tgt_device);
533	} else {
..
547		return scrub_ret;
548	}

And how do we handle if replace is intervened by balance instead of scrub ? Based on my test, the replace ioctl would return -ENOENT if balance is running:

ERROR: ioctl(DEV_REPLACE_START) failed on /mnt/testarea/scratch: No such file or directory, no error

(I haven't gone through this codepath yet and don't know where -ENOENT comes from, but I don't think it's a proper errno; /mnt/testarea/scratch is definitely there) sorry if I missed something. Anand Thanks for the review!
Eryu

-	return 0;
+	return ret;

 leave:
 	dev_replace->srcdev = NULL;
@@ -538,7 +544,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 	btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
 	mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
-	return 0;
+	return scrub_ret;
 }

 	printk_in_rcu(KERN_INFO
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 2f47824..611e1c5 100644
--- a/include/uapi/linux/btrfs.h
+++
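The user-visible point of the patch discussed above is that a failed replace should now surface through the ioctl and hence through the btrfs command's exit status, instead of only as a dmesg line. A hedged sketch of how a script might consume that — `check_replace` is a hypothetical wrapper, and the device paths are the ones from the reproducer, not a recommendation:

```shell
# Hypothetical wrapper: run the given command (intended to be a
# "btrfs replace start -B ..." invocation), report failure on stderr,
# and propagate the exit status unchanged.
check_replace() {
    "$@"
    rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "replace failed (rc=$rc)" >&2
    fi
    return "$rc"
}

# Example usage (paths from the reproducer above):
#   check_replace btrfs replace start -Bf /dev/sdb2 /dev/sdb3 /mnt/btrfs
```

With the old behavior the wrapper would never fire, since the ioctl reported success even when btrfs_scrub_dev() had failed with -115 (-EHOSTDOWN is not implied; per the thread, the -115 here is -EINPROGRESS).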