[PATCH 2/2 v12] Btrfs: be aware of btree inode write errors to avoid fs corruption

2014-09-26 Thread Filipe Manana
While we have a transaction ongoing, the VM might decide at any time
to call btree_inode-i_mapping-a_ops-writepages(), which will start
writeback of dirty pages belonging to btree nodes/leafs. This call
might return an error or the writeback might finish with an error
before we attempt to commit the running transaction. If this happens,
we might have no way of knowing that such error happened when we are
committing the transaction - because the pages might no longer be
marked dirty nor tagged for writeback (if a subsequent modification
to the extent buffer didn't happen before the transaction commit) which
makes filemap_fdata[write|wait]_range unable to find such pages (even
if they're marked with SetPageError).
So if this happens we must abort the transaction, otherwise we commit
a super block with btree roots that point to btree nodes/leafs whose
content on disk is invalid - either garbage or the content of some
node/leaf from a past generation that got cowed or deleted and is no
longer valid (for this later case we end up getting error messages like
parent transid verify failed on 10826481664 wanted 25748 found 29562
when reading btree nodes/leafs from disk).

Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
i_mapping would not be enough because we need to distinguish between
log tree extents (not fatal) vs non-log tree extents (fatal) and
because the next call to filemap_fdatawait_range() will catch and clear
such errors in the mapping - and that call might be from a log sync and
not from a transaction commit, which means we would not know about the
error at transaction commit time. Also, checking for the eb flag
EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
not be completely reliable, as the eb might be removed from memory and
read back when trying to get it, which clears that flag right before
reading the eb's pages from disk, making us not know about the previous
write error.

Using the new 3 flags for the btree inode also makes us achieve the
goal of AS_EIO/AS_ENOSPC when writepages() returns success, started
writeback for all dirty pages and before filemap_fdatawait_range() is
called, the writeback for all dirty pages had already finished with
errors - because we were not using AS_EIO/AS_ENOSPC,
filemap_fdatawait_range() would return success, as it could not know
that writeback errors happened (the pages were no longer tagged for
writeback).

Signed-off-by: Filipe Manana fdman...@suse.com
---

V2: If an extent buffer's write failed but it's also deleted from the tree
before the transaction commits, don't abort the transaction with -EIO,
since the unwritten node/leaf it represents can't be pointed to by any
other node in a tree.

V3: Correct V2, missed unstaged changes.

V4: Use root's key to figure out which counter to update.

V5: Decrement the error counters too when an eb is made dirty again (the
next write attempt might succeed).

V6: Moved counters from transaction struct to fs_info struct, because there's
a (short) time window where fs_info-running_transaction is NULL.
There's now 2 counters for log extent buffers too, each one representing
a different log transaction.

V7: Track the eb's log index in the eb itself, otherwise it wasn't possible
to find it when writeback triggered from a transaction commit.

V8: Track the log eb write errors per root instead, and reset them on a
transaction commit.

V9: Don't decrement the error counters if the eb is deleted or re-written.
It is not safe because there's a time window when committing a transaction,
between setting fs_info-current_transaction to NULL and checking the
error counters in btrfs_write_and_wait_transaction(), where a new 
transaction
can start and delete or re-write an eb that has the write error flag set.
If this happens it means the previous transaction can write a superblock
that refers to trees that point to unwritten nodes.
Replaced the counters with simple flags in the btree inode's runtime
flags - essentially back to V1 but accounting for the 2 different log
sub-transactions.
Removed access to an eb's parent root through
BTRFS_I(eb-pages[0]-mapping-host)-root since it was not correct, as this
always gives us the btree inode's root (objectid 1ULL). Instead use the
field eb-log_index to know wether it's a log btree eb (and which sub-
-transaction) or a non-log btree eb.

V10: Clear the log eb write error flags in a more logical place (transaction
 commit function).

V11: Updated commit message and a comment, replaced an ASSERT() with a BUG()
 and changed eb-lock_nested to a short to keep the structure size.

V12: Removed leftovers from previous versions (no longer necessary #include and
 prototype in extent_io.h of no longer existing function) and updated parts
 from a comment that apply only to some past versions.
 Rebased against latest integration branch (didn't apply 

Re: [PATCH] Btrfs-progs: let btrfs-image actually work on a balanced fs

2014-09-26 Thread Josef Bacik

On 09/25/2014 05:44 PM, Zach Brown wrote:

--- a/btrfs-image.c
+++ b/btrfs-image.c
@@ -1020,6 +1020,9 @@ static int copy_tree_blocks(struct btrfs_root *root, 
struct extent_buffer *eb,
int i = 0;
int ret;

+   if (btrfs_header_bytenr(eb) == 65536)
+   printf(We have bytenr 65536, belongs to %llu, level %d\n,
+  btrfs_header_owner(eb), btrfs_header_level(eb));
ret = add_extent(btrfs_header_bytenr(eb), root-leafsize, metadump, 0);
if (ret) {
fprintf(stderr, Error adding metadata block\n);


Did you mean to leave that in?  The raw super offset sure makes it look
like debugging output.



Sigh sorry, thought I culled everything.  Thanks,

Josef

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs-progs: let btrfs-image actually work on a balanced fs V2

2014-09-26 Thread Josef Bacik
We use the read extent buffer infrastructure to read the super block when we are
creating a btrfs-image.  This works out fine most of the time except when the fs
has been balanced, then it fails to map the super block.  So we could fix
btrfs-image to read in the super in a special way, but thats more code.  So
instead just check in the eb reading code if we are reading the super and then
don't bother mapping the block, just read the actual offset.  This fixed some
poor guy who was trying to btrfs-image his fs that had been balanced.  Thanks,

Signed-off-by: Josef Bacik jba...@fb.com
---
V1-V2: Remove the debugging stuff that was left accidentally.

 disk-io.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/disk-io.c b/disk-io.c
index 26a532e..34c0a97 100644
--- a/disk-io.c
+++ b/disk-io.c
@@ -201,7 +201,8 @@ int read_whole_eb(struct btrfs_fs_info *info, struct 
extent_buffer *eb, int mirr
read_len = bytes_left;
device = NULL;
 
-   if (!info-on_restoring) {
+   if (!info-on_restoring 
+   eb-start != BTRFS_SUPER_INFO_OFFSET) {
ret = btrfs_map_block(info-mapping_tree, READ,
  eb-start + offset, read_len, 
multi,
  mirror, NULL);
-- 
1.9.3

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 3.16 Managed to ENOSPC with 80% used

2014-09-26 Thread Rich Freeman
On Thu, Sep 25, 2014 at 5:21 PM, Holger Hoffstätte
holger.hoffstae...@googlemail.com wrote:
 That's why I mentioned adding a second device - that will immediately
 allow cleanup with headroom. An additional 8GB tmpfs volume can works
 wonders.


If you add a single 8GB tmpfs to a RAID1 btrfs array, is it safe to
assume that you'll still always have a redundant copy of everything on
a disk somewhere during the recovery?  Would only a single tmpfs
volume actually help in this case?  I get a bit nervous about doing a
cleanup that involves moving metadata to tmpfs of all places, since
some kind of deadlock/etc could result in unrecoverable data loss.

Doing the same thing with an actual hard drive would concern me less.

--
Rich
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: fix race in WAIT_SYNC ioctl

2014-09-26 Thread Sage Weil
We check whether transid is already committed via last_trans_committed and
then search through trans_list for pending transactions.  If
last_trans_committed is updated by btrfs_commit_transaction after we check
it (there is no locking), we will fail to find the committed transaction
and return EINVAL to the caller.  This has been observed occasionally by
ceph-osd (which uses this ioctl heavily).

Fix by rechecking whether the provided transid = last_trans_committed
after the search fails, and if so return 0.

Signed-off-by: Sage Weil s...@redhat.com
---
 fs/btrfs/transaction.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index d89c6d3..98a25df 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -609,7 +609,6 @@ int btrfs_wait_for_commit(struct btrfs_root *root, u64 
transid)
if (transid = root-fs_info-last_trans_committed)
goto out;
 
-   ret = -EINVAL;
/* find specified transaction */
spin_lock(root-fs_info-trans_lock);
list_for_each_entry(t, root-fs_info-trans_list, list) {
@@ -625,9 +624,16 @@ int btrfs_wait_for_commit(struct btrfs_root *root, u64 
transid)
}
}
spin_unlock(root-fs_info-trans_lock);
-   /* The specified transaction doesn't exist */
-   if (!cur_trans)
+
+   /*
+* The specified transaction doesn't exist, or we
+* raced with btrfs_commit_transaction
+*/
+   if (!cur_trans) {
+   if (transid  root-fs_info-last_trans_committed)
+   ret = -EINVAL;
goto out;
+   }
} else {
/* find newest transaction that is committing | committed */
spin_lock(root-fs_info-trans_lock);
-- 
1.9.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/2] Move BTRFS RCU string to common library

2014-09-26 Thread josh
On Fri, Sep 19, 2014 at 02:01:30AM -0700, Omar Sandoval wrote:
 The RCU-friendy string API used internally by BTRFS is generic enough for
 common use. This doesn't add any new functionality, but instead just moves the
 code and documents the existing API.
 
 Signed-off-by: Omar Sandoval osan...@osandov.com

One nit: shouldn't the returned rcu_string pointer have an __rcu
address space annotation?

With that fixed:
Reviewed-by: Josh Triplett j...@joshtriplett.org

  fs/btrfs/check-integrity.c |  6 +--
  fs/btrfs/dev-replace.c | 19 +-
  fs/btrfs/disk-io.c |  6 +--
  fs/btrfs/extent_io.c   |  4 +-
  fs/btrfs/ioctl.c   |  4 +-
  fs/btrfs/raid56.c  |  2 +-
  fs/btrfs/rcu-string.h  | 56 
  fs/btrfs/scrub.c   | 15 
  fs/btrfs/super.c   |  2 +-
  fs/btrfs/volumes.c | 14 +++
  include/linux/rcustring.h  | 91 
 ++
  11 files changed, 128 insertions(+), 91 deletions(-)
  delete mode 100644 fs/btrfs/rcu-string.h
  create mode 100644 include/linux/rcustring.h
 
 diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
 index ce92ae3..4ccd7da 100644
 --- a/fs/btrfs/check-integrity.c
 +++ b/fs/btrfs/check-integrity.c
 @@ -94,6 +94,7 @@
  #include linux/mutex.h
  #include linux/genhd.h
  #include linux/blkdev.h
 +#include linux/rcustring.h
  #include ctree.h
  #include disk-io.h
  #include hash.h
 @@ -103,7 +104,6 @@
  #include print-tree.h
  #include locking.h
  #include check-integrity.h
 -#include rcu-string.h
  
  #define BTRFSIC_BLOCK_HASHTABLE_SIZE 0x1
  #define BTRFSIC_BLOCK_LINK_HASHTABLE_SIZE 0x1
 @@ -851,8 +851,8 @@ static int btrfsic_process_superblock_dev_mirror(
   printk_in_rcu(KERN_INFO New initial S-block (bdev %p, 
 %s)
 @%llu (%s/%llu/%d)\n,
superblock_bdev,
 -  rcu_str_deref(device-name), dev_bytenr,
 -  dev_state-name, dev_bytenr,
 +  rcu_string_dereference(device-name),
 +  dev_bytenr, dev_state-name, dev_bytenr,
superblock_mirror_num);
   list_add(superblock_tmp-all_blocks_node,
state-all_blocks_list);
 diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
 index eea26e1..87d10cc 100644
 --- a/fs/btrfs/dev-replace.c
 +++ b/fs/btrfs/dev-replace.c
 @@ -25,6 +25,7 @@
  #include linux/capability.h
  #include linux/kthread.h
  #include linux/math64.h
 +#include linux/rcustring.h
  #include asm/div64.h
  #include ctree.h
  #include extent_map.h
 @@ -34,7 +35,6 @@
  #include volumes.h
  #include async-thread.h
  #include check-integrity.h
 -#include rcu-string.h
  #include dev-replace.h
  #include sysfs.h
  
 @@ -376,9 +376,9 @@ int btrfs_dev_replace_start(struct btrfs_root *root,
   printk_in_rcu(KERN_INFO
 BTRFS: dev_replace from %s (devid %llu) to %s started\n,
 src_device-missing ? missing disk :
 - rcu_str_deref(src_device-name),
 +   rcu_string_dereference(src_device-name),
 src_device-devid,
 -   rcu_str_deref(tgt_device-name));
 +   rcu_string_dereference(tgt_device-name));
  
   tgt_device-total_bytes = src_device-total_bytes;
   tgt_device-disk_total_bytes = src_device-disk_total_bytes;
 @@ -528,9 +528,10 @@ static int btrfs_dev_replace_finishing(struct 
 btrfs_fs_info *fs_info,
   printk_in_rcu(KERN_ERR
 BTRFS: btrfs_scrub_dev(%s, %llu, %s) failed 
 %d\n,
 src_device-missing ? missing disk :
 - rcu_str_deref(src_device-name),
 +   rcu_string_dereference(src_device-name),
 src_device-devid,
 -   rcu_str_deref(tgt_device-name), scrub_ret);
 +   rcu_string_dereference(tgt_device-name),
 +   scrub_ret);
   btrfs_dev_replace_unlock(dev_replace);
   mutex_unlock(root-fs_info-fs_devices-device_list_mutex);
   mutex_unlock(root-fs_info-chunk_mutex);
 @@ -544,9 +545,9 @@ static int btrfs_dev_replace_finishing(struct 
 btrfs_fs_info *fs_info,
   printk_in_rcu(KERN_INFO
 BTRFS: dev_replace from %s (devid %llu) to %s) 
 finished\n,
 src_device-missing ? missing disk :
 - rcu_str_deref(src_device-name),
 +   rcu_string_dereference(src_device-name),
 src_device-devid,
 -   rcu_str_deref(tgt_device-name));
 +   rcu_string_dereference(tgt_device-name));
   tgt_device-is_tgtdev_for_dev_replace = 0;
   tgt_device-devid = 

Re: 3.16 Managed to ENOSPC with 80% used

2014-09-26 Thread Duncan
Rich Freeman posted on Fri, 26 Sep 2014 10:18:37 -0400 as excerpted:

 On Thu, Sep 25, 2014 at 5:21 PM, Holger Hoffstätte
 holger.hoffstae...@googlemail.com wrote:
 That's why I mentioned adding a second device - that will immediately
 allow cleanup with headroom. An additional 8GB tmpfs volume can works
 wonders.


 If you add a single 8GB tmpfs to a RAID1 btrfs array, is it safe to
 assume that you'll still always have a redundant copy of everything on a
 disk somewhere during the recovery?  Would only a single tmpfs volume
 actually help in this case?  I get a bit nervous about doing a cleanup
 that involves moving metadata to tmpfs of all places, since some kind of
 deadlock/etc could result in unrecoverable data loss.
 
 Doing the same thing with an actual hard drive would concern me less.

That has been my concern too, and why I'd be leery about using a loopback 
on tmpfs, even for the few minutes (more like seconds since I'm on SSD 
and we /are/ talking memory-backed tmpfs) it'd take to free a minimal 
number of chunks (say usage=2% or 5% or whatever, the smallest number 
that actually frees anything).

With SSD and with backups I'd probably do it, but it's not something I 
could recommend, and I'm not sure I'd do it on slower spinning rust, just 
because the time is longer.

I'd probably use a thumb drive or the like, instead, and would certainly 
recommend that to others, altho if they're comfortable with it and want 
to risk it, a loopback file on tmpfs should work fine, provided the power 
doesn't go out in the middle or something.

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 0/2] Move BTRFS RCU string to common library

2014-09-26 Thread Omar Sandoval
This patch series makes the generic RCU string library used internally by BTRFS
accessible by anyone.

The first patch makes printk_ratelimited pass on the return value from printk.
Version 3 gives the temporary return variable a unique name to avoid name
clashes.

The second patch moves the RCU string header and updates the BTRFS code that
depends on it. Version 3 adds the __rcu annotation to the relevant functions.

Version 3 also adds Paul's ack and Josh's review.

Omar Sandoval (2):
  Return a value from printk_ratelimited
  Move BTRFS RCU string to common library

 fs/btrfs/check-integrity.c |  6 +--
 fs/btrfs/dev-replace.c | 19 +-
 fs/btrfs/disk-io.c |  6 +--
 fs/btrfs/extent_io.c   |  4 +-
 fs/btrfs/ioctl.c   |  4 +-
 fs/btrfs/raid56.c  |  2 +-
 fs/btrfs/rcu-string.h  | 56 
 fs/btrfs/scrub.c   | 15 
 fs/btrfs/super.c   |  2 +-
 fs/btrfs/volumes.c | 14 +++
 include/linux/printk.h |  4 +-
 include/linux/rcustring.h  | 92 ++
 12 files changed, 132 insertions(+), 92 deletions(-)
 delete mode 100644 fs/btrfs/rcu-string.h
 create mode 100644 include/linux/rcustring.h

-- 
2.1.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 1/2] Return a value from printk_ratelimited

2014-09-26 Thread Omar Sandoval
printk returns an integer; there's no reason for printk_ratelimited to swallow
it.

Signed-off-by: Omar Sandoval osan...@osandov.com
Acked-by: Paul E. McKenney paul...@linux.vnet.ibm.com
---
 include/linux/printk.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/printk.h b/include/linux/printk.h
index d78125f..89bb7ab 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -343,12 +343,14 @@ extern asmlinkage void dump_stack(void) __cold;
 #ifdef CONFIG_PRINTK
 #define printk_ratelimited(fmt, ...)   \
 ({ \
+   int __ret_printk_ratelimited = 0;   \
static DEFINE_RATELIMIT_STATE(_rs,  \
  DEFAULT_RATELIMIT_INTERVAL,   \
  DEFAULT_RATELIMIT_BURST); \
\
if (__ratelimit(_rs))  \
-   printk(fmt, ##__VA_ARGS__); \
+   __ret_printk_ratelimited = printk(fmt, ##__VA_ARGS__);  \
+   __ret_printk_ratelimited;   \
 })
 #else
 #define printk_ratelimited(fmt, ...)   \
-- 
2.1.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 0/2] Move BTRFS RCU string to common library

2014-09-26 Thread Omar Sandoval
On Fri, Sep 26, 2014 at 10:36:47PM -0700, Omar Sandoval wrote:
 This patch series makes the generic RCU string library used internally by 
 BTRFS
 accessible by anyone.
 
 The first patch makes printk_ratelimited pass on the return value from printk.
 Version 3 gives the temporary return variable a unique name to avoid name
 clashes.
 
 The second patch moves the RCU string header and updates the BTRFS code that
 depends on it. Version 3 adds the __rcu annotation to the relevant functions.
 
 Version 3 also adds Paul's ack and Josh's review.
 
 Omar Sandoval (2):
   Return a value from printk_ratelimited
   Move BTRFS RCU string to common library
 
  fs/btrfs/check-integrity.c |  6 +--
  fs/btrfs/dev-replace.c | 19 +-
  fs/btrfs/disk-io.c |  6 +--
  fs/btrfs/extent_io.c   |  4 +-
  fs/btrfs/ioctl.c   |  4 +-
  fs/btrfs/raid56.c  |  2 +-
  fs/btrfs/rcu-string.h  | 56 
  fs/btrfs/scrub.c   | 15 
  fs/btrfs/super.c   |  2 +-
  fs/btrfs/volumes.c | 14 +++
  include/linux/printk.h |  4 +-
  include/linux/rcustring.h  | 92 
 ++
  12 files changed, 132 insertions(+), 92 deletions(-)
  delete mode 100644 fs/btrfs/rcu-string.h
  create mode 100644 include/linux/rcustring.h
 

I forgot to mention: this patch series is against 3.17-rc6.

-- 
Omar
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 1/2] Return a value from printk_ratelimited

2014-09-26 Thread Joe Perches
On Fri, 2014-09-26 at 22:36 -0700, Omar Sandoval wrote:
 printk returns an integer; there's no reason for printk_ratelimited to swallow
 it.
 
 Signed-off-by: Omar Sandoval osan...@osandov.com
 Acked-by: Paul E. McKenney paul...@linux.vnet.ibm.com
 ---

I'd prefer to keep it the way it is actually.

I've submitted several patches to convert the int return
to void for printk derived functions recently.


--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html