[PATCH 2/2 v12] Btrfs: be aware of btree inode write errors to avoid fs corruption
While we have a transaction ongoing, the VM might decide at any time to call btree_inode-i_mapping-a_ops-writepages(), which will start writeback of dirty pages belonging to btree nodes/leafs. This call might return an error or the writeback might finish with an error before we attempt to commit the running transaction. If this happens, we might have no way of knowing that such error happened when we are committing the transaction - because the pages might no longer be marked dirty nor tagged for writeback (if a subsequent modification to the extent buffer didn't happen before the transaction commit) which makes filemap_fdata[write|wait]_range unable to find such pages (even if they're marked with SetPageError). So if this happens we must abort the transaction, otherwise we commit a super block with btree roots that point to btree nodes/leafs whose content on disk is invalid - either garbage or the content of some node/leaf from a past generation that got cowed or deleted and is no longer valid (for this later case we end up getting error messages like parent transid verify failed on 10826481664 wanted 25748 found 29562 when reading btree nodes/leafs from disk). Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's i_mapping would not be enough because we need to distinguish between log tree extents (not fatal) vs non-log tree extents (fatal) and because the next call to filemap_fdatawait_range() will catch and clear such errors in the mapping - and that call might be from a log sync and not from a transaction commit, which means we would not know about the error at transaction commit time. Also, checking for the eb flag EXTENT_BUFFER_IOERR at transaction commit time isn't done and would not be completely reliable, as the eb might be removed from memory and read back when trying to get it, which clears that flag right before reading the eb's pages from disk, making us not know about the previous write error. Using the new 3 flags for the btree inode also makes us achieve the goal of AS_EIO/AS_ENOSPC when writepages() returns success, started writeback for all dirty pages and before filemap_fdatawait_range() is called, the writeback for all dirty pages had already finished with errors - because we were not using AS_EIO/AS_ENOSPC, filemap_fdatawait_range() would return success, as it could not know that writeback errors happened (the pages were no longer tagged for writeback). Signed-off-by: Filipe Manana fdman...@suse.com --- V2: If an extent buffer's write failed but it's also deleted from the tree before the transaction commits, don't abort the transaction with -EIO, since the unwritten node/leaf it represents can't be pointed to by any other node in a tree. V3: Correct V2, missed unstaged changes. V4: Use root's key to figure out which counter to update. V5: Decrement the error counters too when an eb is made dirty again (the next write attempt might succeed). V6: Moved counters from transaction struct to fs_info struct, because there's a (short) time window where fs_info-running_transaction is NULL. There's now 2 counters for log extent buffers too, each one representing a different log transaction. V7: Track the eb's log index in the eb itself, otherwise it wasn't possible to find it when writeback triggered from a transaction commit. V8: Track the log eb write errors per root instead, and reset them on a transaction commit. V9: Don't decrement the error counters if the eb is deleted or re-written. It is not safe because there's a time window when committing a transaction, between setting fs_info-current_transaction to NULL and checking the error counters in btrfs_write_and_wait_transaction(), where a new transaction can start and delete or re-write an eb that has the write error flag set. If this happens it means the previous transaction can write a superblock that refers to trees that point to unwritten nodes. Replaced the counters with simple flags in the btree inode's runtime flags - essentially back to V1 but accounting for the 2 different log sub-transactions. Removed access to an eb's parent root through BTRFS_I(eb-pages[0]-mapping-host)-root since it was not correct, as this always gives us the btree inode's root (objectid 1ULL). Instead use the field eb-log_index to know wether it's a log btree eb (and which sub- -transaction) or a non-log btree eb. V10: Clear the log eb write error flags in a more logical place (transaction commit function). V11: Updated commit message and a comment, replaced an ASSERT() with a BUG() and changed eb-lock_nested to a short to keep the structure size. V12: Removed leftovers from previous versions (no longer necessary #include and prototype in extent_io.h of no longer existing function) and updated parts from a comment that apply only to some past versions. Rebased against latest integration branch (didn't apply
Re: [PATCH] Btrfs-progs: let btrfs-image actually work on a balanced fs
On 09/25/2014 05:44 PM, Zach Brown wrote: --- a/btrfs-image.c +++ b/btrfs-image.c @@ -1020,6 +1020,9 @@ static int copy_tree_blocks(struct btrfs_root *root, struct extent_buffer *eb, int i = 0; int ret; + if (btrfs_header_bytenr(eb) == 65536) + printf(We have bytenr 65536, belongs to %llu, level %d\n, + btrfs_header_owner(eb), btrfs_header_level(eb)); ret = add_extent(btrfs_header_bytenr(eb), root-leafsize, metadump, 0); if (ret) { fprintf(stderr, Error adding metadata block\n); Did you mean to leave that in? The raw super offset sure makes it look like debugging output. Sigh sorry, thought I culled everything. Thanks, Josef -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs-progs: let btrfs-image actually work on a balanced fs V2
We use the read extent buffer infrastructure to read the super block when we are creating a btrfs-image. This works out fine most of the time except when the fs has been balanced, then it fails to map the super block. So we could fix btrfs-image to read in the super in a special way, but thats more code. So instead just check in the eb reading code if we are reading the super and then don't bother mapping the block, just read the actual offset. This fixed some poor guy who was trying to btrfs-image his fs that had been balanced. Thanks, Signed-off-by: Josef Bacik jba...@fb.com --- V1-V2: Remove the debugging stuff that was left accidentally. disk-io.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/disk-io.c b/disk-io.c index 26a532e..34c0a97 100644 --- a/disk-io.c +++ b/disk-io.c @@ -201,7 +201,8 @@ int read_whole_eb(struct btrfs_fs_info *info, struct extent_buffer *eb, int mirr read_len = bytes_left; device = NULL; - if (!info-on_restoring) { + if (!info-on_restoring + eb-start != BTRFS_SUPER_INFO_OFFSET) { ret = btrfs_map_block(info-mapping_tree, READ, eb-start + offset, read_len, multi, mirror, NULL); -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 3.16 Managed to ENOSPC with 80% used
On Thu, Sep 25, 2014 at 5:21 PM, Holger Hoffstätte holger.hoffstae...@googlemail.com wrote: That's why I mentioned adding a second device - that will immediately allow cleanup with headroom. An additional 8GB tmpfs volume can works wonders. If you add a single 8GB tmpfs to a RAID1 btrfs array, is it safe to assume that you'll still always have a redundant copy of everything on a disk somewhere during the recovery? Would only a single tmpfs volume actually help in this case? I get a bit nervous about doing a cleanup that involves moving metadata to tmpfs of all places, since some kind of deadlock/etc could result in unrecoverable data loss. Doing the same thing with an actual hard drive would concern me less. -- Rich -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs: fix race in WAIT_SYNC ioctl
We check whether transid is already committed via last_trans_committed and then search through trans_list for pending transactions. If last_trans_committed is updated by btrfs_commit_transaction after we check it (there is no locking), we will fail to find the committed transaction and return EINVAL to the caller. This has been observed occasionally by ceph-osd (which uses this ioctl heavily). Fix by rechecking whether the provided transid = last_trans_committed after the search fails, and if so return 0. Signed-off-by: Sage Weil s...@redhat.com --- fs/btrfs/transaction.c | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index d89c6d3..98a25df 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -609,7 +609,6 @@ int btrfs_wait_for_commit(struct btrfs_root *root, u64 transid) if (transid = root-fs_info-last_trans_committed) goto out; - ret = -EINVAL; /* find specified transaction */ spin_lock(root-fs_info-trans_lock); list_for_each_entry(t, root-fs_info-trans_list, list) { @@ -625,9 +624,16 @@ int btrfs_wait_for_commit(struct btrfs_root *root, u64 transid) } } spin_unlock(root-fs_info-trans_lock); - /* The specified transaction doesn't exist */ - if (!cur_trans) + + /* +* The specified transaction doesn't exist, or we +* raced with btrfs_commit_transaction +*/ + if (!cur_trans) { + if (transid root-fs_info-last_trans_committed) + ret = -EINVAL; goto out; + } } else { /* find newest transaction that is committing | committed */ spin_lock(root-fs_info-trans_lock); -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 2/2] Move BTRFS RCU string to common library
On Fri, Sep 19, 2014 at 02:01:30AM -0700, Omar Sandoval wrote: The RCU-friendy string API used internally by BTRFS is generic enough for common use. This doesn't add any new functionality, but instead just moves the code and documents the existing API. Signed-off-by: Omar Sandoval osan...@osandov.com One nit: shouldn't the returned rcu_string pointer have an __rcu address space annotation? With that fixed: Reviewed-by: Josh Triplett j...@joshtriplett.org fs/btrfs/check-integrity.c | 6 +-- fs/btrfs/dev-replace.c | 19 +- fs/btrfs/disk-io.c | 6 +-- fs/btrfs/extent_io.c | 4 +- fs/btrfs/ioctl.c | 4 +- fs/btrfs/raid56.c | 2 +- fs/btrfs/rcu-string.h | 56 fs/btrfs/scrub.c | 15 fs/btrfs/super.c | 2 +- fs/btrfs/volumes.c | 14 +++ include/linux/rcustring.h | 91 ++ 11 files changed, 128 insertions(+), 91 deletions(-) delete mode 100644 fs/btrfs/rcu-string.h create mode 100644 include/linux/rcustring.h diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c index ce92ae3..4ccd7da 100644 --- a/fs/btrfs/check-integrity.c +++ b/fs/btrfs/check-integrity.c @@ -94,6 +94,7 @@ #include linux/mutex.h #include linux/genhd.h #include linux/blkdev.h +#include linux/rcustring.h #include ctree.h #include disk-io.h #include hash.h @@ -103,7 +104,6 @@ #include print-tree.h #include locking.h #include check-integrity.h -#include rcu-string.h #define BTRFSIC_BLOCK_HASHTABLE_SIZE 0x1 #define BTRFSIC_BLOCK_LINK_HASHTABLE_SIZE 0x1 @@ -851,8 +851,8 @@ static int btrfsic_process_superblock_dev_mirror( printk_in_rcu(KERN_INFO New initial S-block (bdev %p, %s) @%llu (%s/%llu/%d)\n, superblock_bdev, - rcu_str_deref(device-name), dev_bytenr, - dev_state-name, dev_bytenr, + rcu_string_dereference(device-name), + dev_bytenr, dev_state-name, dev_bytenr, superblock_mirror_num); list_add(superblock_tmp-all_blocks_node, state-all_blocks_list); diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c index eea26e1..87d10cc 100644 --- a/fs/btrfs/dev-replace.c +++ b/fs/btrfs/dev-replace.c @@ -25,6 +25,7 @@ #include linux/capability.h #include linux/kthread.h #include linux/math64.h +#include linux/rcustring.h #include asm/div64.h #include ctree.h #include extent_map.h @@ -34,7 +35,6 @@ #include volumes.h #include async-thread.h #include check-integrity.h -#include rcu-string.h #include dev-replace.h #include sysfs.h @@ -376,9 +376,9 @@ int btrfs_dev_replace_start(struct btrfs_root *root, printk_in_rcu(KERN_INFO BTRFS: dev_replace from %s (devid %llu) to %s started\n, src_device-missing ? missing disk : - rcu_str_deref(src_device-name), + rcu_string_dereference(src_device-name), src_device-devid, - rcu_str_deref(tgt_device-name)); + rcu_string_dereference(tgt_device-name)); tgt_device-total_bytes = src_device-total_bytes; tgt_device-disk_total_bytes = src_device-disk_total_bytes; @@ -528,9 +528,10 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info, printk_in_rcu(KERN_ERR BTRFS: btrfs_scrub_dev(%s, %llu, %s) failed %d\n, src_device-missing ? missing disk : - rcu_str_deref(src_device-name), + rcu_string_dereference(src_device-name), src_device-devid, - rcu_str_deref(tgt_device-name), scrub_ret); + rcu_string_dereference(tgt_device-name), + scrub_ret); btrfs_dev_replace_unlock(dev_replace); mutex_unlock(root-fs_info-fs_devices-device_list_mutex); mutex_unlock(root-fs_info-chunk_mutex); @@ -544,9 +545,9 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info, printk_in_rcu(KERN_INFO BTRFS: dev_replace from %s (devid %llu) to %s) finished\n, src_device-missing ? missing disk : - rcu_str_deref(src_device-name), + rcu_string_dereference(src_device-name), src_device-devid, - rcu_str_deref(tgt_device-name)); + rcu_string_dereference(tgt_device-name)); tgt_device-is_tgtdev_for_dev_replace = 0; tgt_device-devid =
Re: 3.16 Managed to ENOSPC with 80% used
Rich Freeman posted on Fri, 26 Sep 2014 10:18:37 -0400 as excerpted: On Thu, Sep 25, 2014 at 5:21 PM, Holger Hoffstätte holger.hoffstae...@googlemail.com wrote: That's why I mentioned adding a second device - that will immediately allow cleanup with headroom. An additional 8GB tmpfs volume can works wonders. If you add a single 8GB tmpfs to a RAID1 btrfs array, is it safe to assume that you'll still always have a redundant copy of everything on a disk somewhere during the recovery? Would only a single tmpfs volume actually help in this case? I get a bit nervous about doing a cleanup that involves moving metadata to tmpfs of all places, since some kind of deadlock/etc could result in unrecoverable data loss. Doing the same thing with an actual hard drive would concern me less. That has been my concern too, and why I'd be leery about using a loopback on tmpfs, even for the few minutes (more like seconds since I'm on SSD and we /are/ talking memory-backed tmpfs) it'd take to free a minimal number of chunks (say usage=2% or 5% or whatever, the smallest number that actually frees anything). With SSD and with backups I'd probably do it, but it's not something I could recommend, and I'm not sure I'd do it on slower spinning rust, just because the time is longer. I'd probably use a thumb drive or the like, instead, and would certainly recommend that to others, altho if they're comfortable with it and want to risk it, a loopback file on tmpfs should work fine, provided the power doesn't go out in the middle or something. -- Duncan - List replies preferred. No HTML msgs. Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 0/2] Move BTRFS RCU string to common library
This patch series makes the generic RCU string library used internally by BTRFS accessible by anyone. The first patch makes printk_ratelimited pass on the return value from printk. Version 3 gives the temporary return variable a unique name to avoid name clashes. The second patch moves the RCU string header and updates the BTRFS code that depends on it. Version 3 adds the __rcu annotation to the relevant functions. Version 3 also adds Paul's ack and Josh's review. Omar Sandoval (2): Return a value from printk_ratelimited Move BTRFS RCU string to common library fs/btrfs/check-integrity.c | 6 +-- fs/btrfs/dev-replace.c | 19 +- fs/btrfs/disk-io.c | 6 +-- fs/btrfs/extent_io.c | 4 +- fs/btrfs/ioctl.c | 4 +- fs/btrfs/raid56.c | 2 +- fs/btrfs/rcu-string.h | 56 fs/btrfs/scrub.c | 15 fs/btrfs/super.c | 2 +- fs/btrfs/volumes.c | 14 +++ include/linux/printk.h | 4 +- include/linux/rcustring.h | 92 ++ 12 files changed, 132 insertions(+), 92 deletions(-) delete mode 100644 fs/btrfs/rcu-string.h create mode 100644 include/linux/rcustring.h -- 2.1.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 1/2] Return a value from printk_ratelimited
printk returns an integer; there's no reason for printk_ratelimited to swallow it. Signed-off-by: Omar Sandoval osan...@osandov.com Acked-by: Paul E. McKenney paul...@linux.vnet.ibm.com --- include/linux/printk.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/include/linux/printk.h b/include/linux/printk.h index d78125f..89bb7ab 100644 --- a/include/linux/printk.h +++ b/include/linux/printk.h @@ -343,12 +343,14 @@ extern asmlinkage void dump_stack(void) __cold; #ifdef CONFIG_PRINTK #define printk_ratelimited(fmt, ...) \ ({ \ + int __ret_printk_ratelimited = 0; \ static DEFINE_RATELIMIT_STATE(_rs, \ DEFAULT_RATELIMIT_INTERVAL, \ DEFAULT_RATELIMIT_BURST); \ \ if (__ratelimit(_rs)) \ - printk(fmt, ##__VA_ARGS__); \ + __ret_printk_ratelimited = printk(fmt, ##__VA_ARGS__); \ + __ret_printk_ratelimited; \ }) #else #define printk_ratelimited(fmt, ...) \ -- 2.1.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 0/2] Move BTRFS RCU string to common library
On Fri, Sep 26, 2014 at 10:36:47PM -0700, Omar Sandoval wrote: This patch series makes the generic RCU string library used internally by BTRFS accessible by anyone. The first patch makes printk_ratelimited pass on the return value from printk. Version 3 gives the temporary return variable a unique name to avoid name clashes. The second patch moves the RCU string header and updates the BTRFS code that depends on it. Version 3 adds the __rcu annotation to the relevant functions. Version 3 also adds Paul's ack and Josh's review. Omar Sandoval (2): Return a value from printk_ratelimited Move BTRFS RCU string to common library fs/btrfs/check-integrity.c | 6 +-- fs/btrfs/dev-replace.c | 19 +- fs/btrfs/disk-io.c | 6 +-- fs/btrfs/extent_io.c | 4 +- fs/btrfs/ioctl.c | 4 +- fs/btrfs/raid56.c | 2 +- fs/btrfs/rcu-string.h | 56 fs/btrfs/scrub.c | 15 fs/btrfs/super.c | 2 +- fs/btrfs/volumes.c | 14 +++ include/linux/printk.h | 4 +- include/linux/rcustring.h | 92 ++ 12 files changed, 132 insertions(+), 92 deletions(-) delete mode 100644 fs/btrfs/rcu-string.h create mode 100644 include/linux/rcustring.h I forgot to mention: this patch series is against 3.17-rc6. -- Omar -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 1/2] Return a value from printk_ratelimited
On Fri, 2014-09-26 at 22:36 -0700, Omar Sandoval wrote: printk returns an integer; there's no reason for printk_ratelimited to swallow it. Signed-off-by: Omar Sandoval osan...@osandov.com Acked-by: Paul E. McKenney paul...@linux.vnet.ibm.com --- I'd prefer to keep it the way it is actually. I've submitted several patches to convert the int return to void for printk derived functions recently. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html