Re: [PATCH v2] Btrfs: return failure if btrfs_dev_replace_finishing() failed

2014-10-13 Thread Anand Jain



comments below..


On 10/13/14 12:42, Eryu Guan wrote:

device replace could fail due to another running scrub process or any
other errors btrfs_scrub_dev() may hit, but this failure doesn't get
returned to userspace.

The following steps could reproduce this issue

mkfs -t btrfs -f /dev/sdb1 /dev/sdb2
mount /dev/sdb1 /mnt/btrfs
while true; do btrfs scrub start -B /mnt/btrfs >/dev/null 2>&1; done &
btrfs replace start -Bf /dev/sdb2 /dev/sdb3 /mnt/btrfs
# if this replace succeeded, do the following and repeat until
# you see this log in dmesg
# BTRFS: btrfs_scrub_dev(/dev/sdb2, 2, /dev/sdb3) failed -115
# btrfs replace start -Bf /dev/sdb3 /dev/sdb2 /mnt/btrfs

# once you see the error log in dmesg, check return value of
# replace
echo $?

Introduce a new dev replace result

BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS

to catch -EINPROGRESS explicitly and return other errors directly to
userspace.
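
(Illustration only, not part of the patch: a minimal sketch of how a
userspace caller could consume the new result code. The ioctl, struct and
constant names are the existing UAPI ones; the mount point and device
paths are made up.)

	#include <stdio.h>
	#include <string.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/btrfs.h>

	int main(void)
	{
		struct btrfs_ioctl_dev_replace_args args;
		int fd = open("/mnt/btrfs", O_RDONLY);

		if (fd < 0)
			return 1;
		memset(&args, 0, sizeof(args));
		args.cmd = BTRFS_IOCTL_DEV_REPLACE_CMD_START;
		strcpy((char *)args.start.srcdev_name, "/dev/sdb2");
		strcpy((char *)args.start.tgtdev_name, "/dev/sdb3");

		/* with this patch, real errors now fail the ioctl itself */
		if (ioctl(fd, BTRFS_IOC_DEV_REPLACE, &args) < 0)
			perror("ioctl");
		else if (args.result ==
			 BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS)
			fprintf(stderr, "replace aborted: scrub running\n");
		close(fd);
		return 0;
	}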

Signed-off-by: Eryu Guan guane...@gmail.com
---

v2:
- set result to SCRUB_INPROGRESS if btrfs_scrub_dev returned -EINPROGRESS
   and return 0 as Miao Xie suggested

  fs/btrfs/dev-replace.c     | 12 +++++++++---
  include/uapi/linux/btrfs.h |  1 +
  2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index eea26e1..a141f8b 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -418,9 +418,15 @@ int btrfs_dev_replace_start(struct btrfs_root *root,
 			      &dev_replace->scrub_progress, 0, 1);
 
 	ret = btrfs_dev_replace_finishing(root->fs_info, ret);
-	WARN_ON(ret);
+	/* don't warn if EINPROGRESS, someone else might be running scrub */
+	if (ret == -EINPROGRESS) {
+		args->result = BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS;
+		ret = 0;
+	} else {
+		WARN_ON(ret);
+	}



 looks like we are trying to manage the EINPROGRESS returned by
 btrfs_dev_replace_finishing(). In btrfs_dev_replace_finishing(),
 which specific func call is returning EINPROGRESS? I didn't go
 deep enough.

 And how do we handle if replace is intervened by balance
 instead of scrub ?

 sorry if I missed something.

Anand



-	return 0;
+	return ret;
 
 leave:
 	dev_replace->srcdev = NULL;
@@ -538,7 +544,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 	btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
 	mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
 
-	return 0;
+	return scrub_ret;
 }

printk_in_rcu(KERN_INFO
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 2f47824..611e1c5 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -157,6 +157,7 @@ struct btrfs_ioctl_dev_replace_status_params {
 #define BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_ERROR		0
 #define BTRFS_IOCTL_DEV_REPLACE_RESULT_NOT_STARTED		1
 #define BTRFS_IOCTL_DEV_REPLACE_RESULT_ALREADY_STARTED	2
+#define BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS	3
  struct btrfs_ioctl_dev_replace_args {
__u64 cmd;  /* in */
__u64 result;   /* out */




Re: [PATCH v2] Btrfs: return failure if btrfs_dev_replace_finishing() failed

2014-10-13 Thread Eryu Guan
On Mon, Oct 13, 2014 at 02:23:57PM +0800, Anand Jain wrote:
 
 
 comments below..
 
 
 On 10/13/14 12:42, Eryu Guan wrote:
 device replace could fail due to another running scrub process or any
 other errors btrfs_scrub_dev() may hit, but this failure doesn't get
 returned to userspace.
 
 The following steps could reproduce this issue
 
  mkfs -t btrfs -f /dev/sdb1 /dev/sdb2
  mount /dev/sdb1 /mnt/btrfs
 while true; do btrfs scrub start -B /mnt/btrfs >/dev/null 2>&1; done &
  btrfs replace start -Bf /dev/sdb2 /dev/sdb3 /mnt/btrfs
  # if this replace succeeded, do the following and repeat until
  # you see this log in dmesg
  # BTRFS: btrfs_scrub_dev(/dev/sdb2, 2, /dev/sdb3) failed -115
 # btrfs replace start -Bf /dev/sdb3 /dev/sdb2 /mnt/btrfs
 
  # once you see the error log in dmesg, check return value of
  # replace
  echo $?
 
 Introduce a new dev replace result
 
 BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS
 
 to catch -EINPROGRESS explicitly and return other errors directly to
 userspace.
 
 Signed-off-by: Eryu Guan guane...@gmail.com
 ---
 
 v2:
 - set result to SCRUB_INPROGRESS if btrfs_scrub_dev returned -EINPROGRESS
and return 0 as Miao Xie suggested
 
   fs/btrfs/dev-replace.c     | 12 +++++++++---
   include/uapi/linux/btrfs.h |  1 +
   2 files changed, 10 insertions(+), 3 deletions(-)
 
 diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
 index eea26e1..a141f8b 100644
 --- a/fs/btrfs/dev-replace.c
 +++ b/fs/btrfs/dev-replace.c
 @@ -418,9 +418,15 @@ int btrfs_dev_replace_start(struct btrfs_root *root,
 			      &dev_replace->scrub_progress, 0, 1);
 
 	ret = btrfs_dev_replace_finishing(root->fs_info, ret);
 -	WARN_ON(ret);
 +	/* don't warn if EINPROGRESS, someone else might be running scrub */
 +	if (ret == -EINPROGRESS) {
 +		args->result = BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS;
 +		ret = 0;
 +	} else {
 +		WARN_ON(ret);
 +	}
 
 
  looks like we are trying to manage the EINPROGRESS returned by

Yes, that's right.

  btrfs_dev_replace_finishing(). In btrfs_dev_replace_finishing()
  which specific func call is returning EINPROGRESS ? I didn't go
  deep enough.

btrfs_dev_replace_finishing() checks scrub_ret (the last argument) and
returns scrub_ret when it is non-zero. It was returning 0
unconditionally before this patch.

btrfs_dev_replace_start@fs/btrfs/dev-replace.c
   416	ret = btrfs_scrub_dev(fs_info, src_device->devid, 0,
   417			      src_device->total_bytes,
   418			      &dev_replace->scrub_progress, 0, 1);
   419
   420	ret = btrfs_dev_replace_finishing(root->fs_info, ret);

and btrfs_dev_replace_finishing@fs/btrfs/dev-replace.c
   529	if (!scrub_ret) {
   530		btrfs_dev_replace_update_device_in_mapping_tree(fs_info,
   531								src_device,
   532								tgt_device);
   533	} else {
	..
   547		return scrub_ret;
   548	}
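
(For completeness: the -EINPROGRESS itself originates in btrfs_scrub_dev(),
which bails out when a scrub is already running on the device. From memory
of the 3.17-era fs/btrfs/scrub.c, so only approximately, the check is:

	if (dev->scrub_device ||
	    (!is_dev_replace &&
	     btrfs_dev_replace_is_ongoing(&fs_info->dev_replace))) {
		btrfs_dev_replace_unlock(&fs_info->dev_replace);
		mutex_unlock(&fs_info->fs_devices->device_list_mutex);
		mutex_unlock(&fs_info->scrub_lock);
		return -EINPROGRESS;
	}

A concurrent scrub on the source device therefore makes btrfs_scrub_dev()
return -EINPROGRESS, which the patch then propagates through
btrfs_dev_replace_finishing().)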

 
  And how do we handle if replace is intervened by balance
  instead of scrub ?

Based on my test, replace ioctl would return -ENOENT if balance is
running

ERROR: ioctl(DEV_REPLACE_START) failed on /mnt/testarea/scratch: No such file 
or directory, no error

(I haven't gone through this codepath yet and don't know where -ENOENT
comes from, but I don't think it's a proper errno,
/mnt/testarea/scratch is definitely there)
 
  sorry if I missed something.
 
 Anand

Thanks for the review!

Eryu
 
 
 -	return 0;
 +	return ret;
 
  leave:
 	dev_replace->srcdev = NULL;
 @@ -538,7 +544,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 	btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
 	mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
 
 -	return 0;
 +	return scrub_ret;
  }
 
  printk_in_rcu(KERN_INFO
 diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
 index 2f47824..611e1c5 100644
 --- a/include/uapi/linux/btrfs.h
 +++ b/include/uapi/linux/btrfs.h
 @@ -157,6 +157,7 @@ struct btrfs_ioctl_dev_replace_status_params {
  #define BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_ERROR		0
  #define BTRFS_IOCTL_DEV_REPLACE_RESULT_NOT_STARTED		1
  #define BTRFS_IOCTL_DEV_REPLACE_RESULT_ALREADY_STARTED	2
 +#define BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS	3
   struct btrfs_ioctl_dev_replace_args {
  __u64 cmd;  /* in */
  __u64 result;   /* out */
 


Re: [PATCH v2] Btrfs: return failure if btrfs_dev_replace_finishing() failed

2014-10-13 Thread Anand Jain



On 10/13/14 14:59, Eryu Guan wrote:

On Mon, Oct 13, 2014 at 02:23:57PM +0800, Anand Jain wrote:



comments below..


On 10/13/14 12:42, Eryu Guan wrote:

device replace could fail due to another running scrub process or any
other errors btrfs_scrub_dev() may hit, but this failure doesn't get
returned to userspace.

The following steps could reproduce this issue

mkfs -t btrfs -f /dev/sdb1 /dev/sdb2
mount /dev/sdb1 /mnt/btrfs
while true; do btrfs scrub start -B /mnt/btrfs >/dev/null 2>&1; done &
btrfs replace start -Bf /dev/sdb2 /dev/sdb3 /mnt/btrfs
# if this replace succeeded, do the following and repeat until
# you see this log in dmesg
# BTRFS: btrfs_scrub_dev(/dev/sdb2, 2, /dev/sdb3) failed -115
# btrfs replace start -Bf /dev/sdb3 /dev/sdb2 /mnt/btrfs

# once you see the error log in dmesg, check return value of
# replace
echo $?

Introduce a new dev replace result

BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS

to catch -EINPROGRESS explicitly and return other errors directly to
userspace.

Signed-off-by: Eryu Guan guane...@gmail.com
---

v2:
- set result to SCRUB_INPROGRESS if btrfs_scrub_dev returned -EINPROGRESS
   and return 0 as Miao Xie suggested

  fs/btrfs/dev-replace.c     | 12 +++++++++---
  include/uapi/linux/btrfs.h |  1 +
  2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index eea26e1..a141f8b 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -418,9 +418,15 @@ int btrfs_dev_replace_start(struct btrfs_root *root,
 			      &dev_replace->scrub_progress, 0, 1);
 
 	ret = btrfs_dev_replace_finishing(root->fs_info, ret);
-	WARN_ON(ret);
+	/* don't warn if EINPROGRESS, someone else might be running scrub */
+	if (ret == -EINPROGRESS) {
+		args->result = BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS;
+		ret = 0;
+	} else {
+		WARN_ON(ret);
+	}



 I am a bit concerned: why aren't these racing threads excluding
 each other using mutually_exclusive_operation_running, as most of
 the other device operation threads do?

Thanks, Anand
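
(For reference, the exclusion pattern the other device operations use, as
seen in e.g. btrfs_ioctl_add_dev() in the 3.17-era fs/btrfs/ioctl.c,
reproduced approximately:

	if (atomic_xchg(&root->fs_info->mutually_exclusive_operation_running,
			1)) {
		return BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS;
	}

	/* ... do the actual device operation ... */

	atomic_set(&root->fs_info->mutually_exclusive_operation_running, 0);

Scrub is not covered by this flag, which is presumably why a replace and a
scrub can race at all.)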


  looks like we are trying to manage the EINPROGRESS returned by


Yes, that's right.


  btrfs_dev_replace_finishing(). In btrfs_dev_replace_finishing()
  which specific func call is returning EINPROGRESS ? I didn't go
  deep enough.


btrfs_dev_replace_finishing() checks scrub_ret (the last argument) and
returns scrub_ret when it is non-zero. It was returning 0
unconditionally before this patch.

btrfs_dev_replace_start@fs/btrfs/dev-replace.c
416	ret = btrfs_scrub_dev(fs_info, src_device->devid, 0,
417			      src_device->total_bytes,
418			      &dev_replace->scrub_progress, 0, 1);
419
420	ret = btrfs_dev_replace_finishing(root->fs_info, ret);

and btrfs_dev_replace_finishing@fs/btrfs/dev-replace.c
529	if (!scrub_ret) {
530		btrfs_dev_replace_update_device_in_mapping_tree(fs_info,
531								src_device,
532								tgt_device);
533	} else {
	..
547		return scrub_ret;
548	}








  And how do we handle if replace is intervened by balance
  instead of scrub ?


Based on my test, replace ioctl would return -ENOENT if balance is
running

ERROR: ioctl(DEV_REPLACE_START) failed on /mnt/testarea/scratch: No such file 
or directory, no error

(I haven't gone through this codepath yet and don't know where -ENOENT
comes from, but I don't think it's a proper errno,
/mnt/testarea/scratch is definitely there)


  sorry if I missed something.

Anand


Thanks for the review!

Eryu




-	return 0;
+	return ret;

 leave:
	dev_replace->srcdev = NULL;
@@ -538,7 +544,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
	btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
	mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);

-	return 0;
+	return scrub_ret;
 }

printk_in_rcu(KERN_INFO
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 2f47824..611e1c5 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -157,6 +157,7 @@ struct btrfs_ioctl_dev_replace_status_params {
 #define BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_ERROR		0
 #define BTRFS_IOCTL_DEV_REPLACE_RESULT_NOT_STARTED		1
 #define BTRFS_IOCTL_DEV_REPLACE_RESULT_ALREADY_STARTED	2
+#define BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS	3
  struct btrfs_ioctl_dev_replace_args {
__u64 cmd;  /* in */
__u64 result;   /* out */



[PATCH 1/3] Btrfs: deal with convert_extent_bit errors to avoid fs corruption

2014-10-13 Thread Filipe Manana
When committing a transaction or a log, we look for btree extents that
need to be durably persisted by searching for ranges in an io tree that
have some bits set (EXTENT_DIRTY or EXTENT_NEW). We then attempt to clear
those bits and set the EXTENT_NEED_WAIT bit, with calls to the function
convert_extent_bit, and then start writeback for the extents.

That function however can return an error (at the moment only -ENOMEM
is possible, especially when it does GFP_ATOMIC allocation requests
through alloc_extent_state_atomic) - that means the ranges didn't get
the EXTENT_NEED_WAIT bit set (or at least not for the whole range),
which in turn means a call to btrfs_wait_marked_extents() won't find
those ranges for which we started writeback, causing a transaction
commit or a log commit to persist a new superblock without waiting
for the writeback of extents in that range to finish first.

Therefore if a crash happens after persisting the new superblock and
before writeback finishes, we have a superblock pointing to roots that
weren't fully persisted or roots that point to nodes or leafs that weren't
fully persisted, causing all sorts of unexpected/bad behaviour as we end up
reading garbage from disk or the content of some node/leaf from a past
generation that got cowed or deleted and is no longer valid (for this latter
case we end up getting error messages like "parent transid verify failed on
X wanted Y found Z" when reading btree nodes/leafs from disk).

Signed-off-by: Filipe Manana fdman...@suse.com
---
 fs/btrfs/transaction.c | 92 +-
 fs/btrfs/transaction.h |  2 --
 2 files changed, 76 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 8f1a408..cb673d4 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -76,6 +76,32 @@ void btrfs_put_transaction(struct btrfs_transaction *transaction)
 	}
 }
 
+static void clear_btree_io_tree(struct extent_io_tree *tree)
+{
+	spin_lock(&tree->lock);
+	while (!RB_EMPTY_ROOT(&tree->state)) {
+		struct rb_node *node;
+		struct extent_state *state;
+
+		node = rb_first(&tree->state);
+		state = rb_entry(node, struct extent_state, rb_node);
+		rb_erase(&state->rb_node, &tree->state);
+		RB_CLEAR_NODE(&state->rb_node);
+		/*
+		 * btree io trees aren't supposed to have tasks waiting for
+		 * changes in the flags of extent states ever.
+		 */
+		ASSERT(!waitqueue_active(&state->wq));
+		free_extent_state(state);
+		if (need_resched()) {
+			spin_unlock(&tree->lock);
+			cond_resched();
+			spin_lock(&tree->lock);
+		}
+	}
+	spin_unlock(&tree->lock);
+}
+
 static noinline void switch_commit_roots(struct btrfs_transaction *trans,
 					 struct btrfs_fs_info *fs_info)
 {
@@ -89,6 +115,7 @@ static noinline void switch_commit_roots(struct btrfs_transaction *trans,
 		root->commit_root = btrfs_root_node(root);
 		if (is_fstree(root->objectid))
 			btrfs_unpin_free_ino(root);
+		clear_btree_io_tree(&root->dirty_log_pages);
 	}
 	up_write(&fs_info->commit_root_sem);
 }
@@ -827,17 +854,38 @@ int btrfs_write_marked_extents(struct btrfs_root *root,
 
 	while (!find_first_extent_bit(dirty_pages, start, &start, &end,
 				      mark, &cached_state)) {
-		convert_extent_bit(dirty_pages, start, end, EXTENT_NEED_WAIT,
-				   mark, &cached_state, GFP_NOFS);
-		cached_state = NULL;
-		err = filemap_fdatawrite_range(mapping, start, end);
+		bool wait_writeback = false;
+
+		err = convert_extent_bit(dirty_pages, start, end,
+					 EXTENT_NEED_WAIT,
+					 mark, &cached_state, GFP_NOFS);
+		/*
+		 * convert_extent_bit can return -ENOMEM, which is most of the
+		 * time a temporary error. So when it happens, ignore the error
+		 * and wait for writeback of this range to finish - because we
+		 * failed to set the bit EXTENT_NEED_WAIT for the range, a call
+		 * to btrfs_wait_marked_extents() would not know that writeback
+		 * for this range started and therefore wouldn't wait for it to
+		 * finish - we don't want to commit a superblock that points to
+		 * btree nodes/leafs for which writeback hasn't finished yet
+		 * (and without errors).
+		 * We cleanup any entries left in the io tree when committing
+		 * the transaction (through clear_btree_io_tree()).
+		 */
+		if (err == -ENOMEM) {

[PATCH 3/3] Btrfs: avoid returning -ENOMEM in convert_extent_bit() too early

2014-10-13 Thread Filipe Manana
We try to allocate an extent state before acquiring the tree's spinlock
just in case we end up needing to split an existing extent state into two.
If that allocation failed, we would return -ENOMEM.
However, our one and only caller (the transaction/log commit code) passes in
an extent state that was cached from a call to find_first_extent_bit() and
that has a very high chance to match exactly the input range (always true
for a transaction commit and very often, but not always, true for a log
commit) - in this case we end up not needing that initial extent state,
meant for an eventual split, at all. Therefore just don't return -ENOMEM if
we can't allocate the temporary extent state, since we might not need it
at all, and if we end up needing one, we'll allocate it later anyway.

Signed-off-by: Filipe Manana fdman...@suse.com
---
 fs/btrfs/extent_io.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 0d931b1..654ed3d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1066,13 +1066,21 @@ int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
int err = 0;
u64 last_start;
u64 last_end;
+   bool first_iteration = true;
 
btrfs_debug_check_extent_io_range(tree, start, end);
 
 again:
	if (!prealloc && (mask & __GFP_WAIT)) {
+   /*
+* Best effort, don't worry if extent state allocation fails
+* here for the first iteration. We might have a cached state
+* that matches exactly the target range, in which case no
+* extent state allocations are needed. We'll only know this
+* after locking the tree.
+*/
prealloc = alloc_extent_state(mask);
-   if (!prealloc)
+		if (!prealloc && !first_iteration)
return -ENOMEM;
}
 
@@ -1242,6 +1250,7 @@ search_again:
spin_unlock(tree-lock);
	if (mask & __GFP_WAIT)
cond_resched();
+   first_iteration = false;
goto again;
 }
 
-- 
1.9.1



[PATCH 2/3] Btrfs: make find_first_extent_bit be able to cache any state

2014-10-13 Thread Filipe Manana
Right now the only caller of find_first_extent_bit() that is interested
in caching extent states (transaction or log commit), never gets an extent
state cached. This is because find_first_extent_bit() only caches states
that have at least one of the flags EXTENT_IOBITS or EXTENT_BOUNDARY, and
the transaction/log commit caller always passes a tree that never has
extent states with any of those flags (they can only have one of the
following flags: EXTENT_DIRTY, EXTENT_NEW or EXTENT_NEED_WAIT).

This change, together with the following one in the patch series (titled
"Btrfs: avoid returning -ENOMEM in convert_extent_bit() too early"), will
help significantly reduce the chances that calls to convert_extent_bit()
fail with -ENOMEM when called from the transaction/log commit code.

Signed-off-by: Filipe Manana fdman...@suse.com
---
 fs/btrfs/extent_io.c   | 16 ++++++++++++----
 fs/btrfs/transaction.c |  3 +++
 2 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 420fe26..0d931b1 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -796,17 +796,25 @@ static void set_state_bits(struct extent_io_tree *tree,
 	state->state |= bits_to_set;
 }
 
-static void cache_state(struct extent_state *state,
-			struct extent_state **cached_ptr)
+static void cache_state_if_flags(struct extent_state *state,
+				 struct extent_state **cached_ptr,
+				 const u64 flags)
 {
 	if (cached_ptr && !(*cached_ptr)) {
-		if (state->state & (EXTENT_IOBITS | EXTENT_BOUNDARY)) {
+		if (!flags || (state->state & flags)) {
 			*cached_ptr = state;
 			atomic_inc(&state->refs);
 		}
 	}
 }
 
+static void cache_state(struct extent_state *state,
+			struct extent_state **cached_ptr)
+{
+	return cache_state_if_flags(state, cached_ptr,
+				    EXTENT_IOBITS | EXTENT_BOUNDARY);
+}
+
 /*
  * set some bits on a range in the tree.  This may require allocations or
  * sleeping, so the gfp mask is used to indicate what is allowed.
@@ -1482,7 +1490,7 @@ int find_first_extent_bit(struct extent_io_tree *tree, u64 start,
 	state = find_first_extent_bit_state(tree, start, bits);
 got_it:
 	if (state) {
-		cache_state(state, cached_state);
+		cache_state_if_flags(state, cached_state, 0);
 		*start_ret = state->start;
 		*end_ret = state->end;
 		ret = 0;
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index cb673d4..396ae8b 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -882,6 +882,7 @@ int btrfs_write_marked_extents(struct btrfs_root *root,
 			werr = err;
 		else if (wait_writeback)
 			werr = filemap_fdatawait_range(mapping, start, end);
+		free_extent_state(cached_state);
 		cached_state = NULL;
 		cond_resched();
 		start = end + 1;
@@ -926,6 +927,8 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
 		err = filemap_fdatawait_range(mapping, start, end);
 		if (err)
 			werr = err;
+		free_extent_state(cached_state);
+		cached_state = NULL;
 		cond_resched();
 		start = end + 1;
 	}
-- 
1.9.1



Re: What is the vision for btrfs fs repair?

2014-10-13 Thread Austin S Hemmelgarn

On 2014-10-10 18:05, Eric Sandeen wrote:

On 10/10/14 2:35 PM, Austin S Hemmelgarn wrote:

On 2014-10-10 13:43, Bob Marley wrote:

On 10/10/2014 16:37, Chris Murphy wrote:

The fail safe behavior is to treat the known good tree root as
the default tree root, and bypass the bad tree root if it cannot
be repaired, so that the volume can be mounted with default mount
options (i.e. the ones in fstab). Otherwise it's a filesystem
that isn't well suited for general purpose use as rootfs let
alone for boot.



A filesystem which is suited for general purpose use is a
filesystem which honors fsync, and doesn't *ever* auto-roll-back
without user intervention.

Anything different is not suited for database transactions at all.
Any paid service which has the users database on btrfs is going to
be at risk of losing payments, and probably without the company
even knowing. If btrfs goes this way I hope a big warning is
written on the wiki and on the manpages telling that this
filesystem is totally unsuitable for hosting databases performing
transactions.

If they need reliability, they should have some form of redundancy
in-place and/or run the database directly on the block device;
because even ext4, XFS, and pretty much every other filesystem can
lose data sometimes,


Not if i.e. fsync returns.  If the data is gone later, it's a hardware
problem, or occasionally a bug - bugs that are usually found & fixed
pretty quickly.

Yes, barring bugs and hardware problems they won't lose data.



the difference being that those tend to give
worse results when hardware is misbehaving than BTRFS does, because
BTRFS usually has an old copy of whatever data structure gets
corrupted to fall back on.


I'm curious, is that based on conjecture or real-world testing?

I wouldn't really call it testing, but based on personal experience I 
know that ext4 can lose whole directory sub-trees if it gets a single 
corrupt sector in the wrong place.  I've also had that happen on FAT32 
and (somewhat interestingly) HFS+ with failing/misbehaving hardware; and 
I've actually had individual files disappear on HFS+ without any 
discernible hardware issues.  I don't have as much experience with XFS, 
but would assume based on what I do know of it that it could have 
similar issues.  As for BTRFS, I've only ever had any issues with it 3 
times, one was due to the kernel panicking during resume from S1, and 
the other two were due to hardware problems that would have caused 
issues on most other filesystems as well.  In both cases of hardware 
issues, while the filesystem was initially unmountable, it was 
relatively simple to fix once I knew how.  I tried to fix an ext4 fs 
that had become unmountable due to dropped writes once, and that was 
anything but simple, even with the much greater amount of documentation.






Re: What is the vision for btrfs fs repair?

2014-10-13 Thread Austin S Hemmelgarn

On 2014-10-12 06:14, Martin Steigerwald wrote:

Am Freitag, 10. Oktober 2014, 10:37:44 schrieb Chris Murphy:

On Oct 10, 2014, at 6:53 AM, Bob Marley bobmar...@shiftmail.org wrote:

On 10/10/2014 03:58, Chris Murphy wrote:

* mount -o recovery

Enable autorecovery attempts if a bad tree root is found at mount
time.


I'm confused why it's not the default yet. Maybe it's continuing to
evolve at a pace that suggests something could sneak in that makes
things worse? It is almost an oxymoron in that I'm manually enabling an
autorecovery

If true, maybe the closest indication we'd get of btrfs stability is the
default enabling of autorecovery.

No way!
I wouldn't want a default like that.

If you think at distributed transactions: suppose a sync was issued on
both sides of a distributed transaction, then power was lost on one side,
then btrfs had corruption. When I remount it, definitely the worst thing
that can happen is that it auto-rolls-back to a previous known-good
state.

For a general purpose file system, losing 30 seconds (or less) of
questionably committed data, likely corrupt, is a file system that won't
mount without user intervention, which requires a secret decoder ring to
get it to mount at all. And may require the use of specialized tools to
retrieve that data in any case.

The fail safe behavior is to treat the known good tree root as the default
tree root, and bypass the bad tree root if it cannot be repaired, so that
the volume can be mounted with default mount options (i.e. the ones in
fstab). Otherwise it's a filesystem that isn't well suited for general
purpose use as rootfs let alone for boot.


To understand this a bit better:

What can be the reasons a recent tree gets corrupted?


Well, so far I have had the following cause corrupted trees:
1. Kernel panic during resume from ACPI S1 (suspend to RAM), which just 
happened to be in the middle of a tree commit.

2. Generic power loss during a tree commit.
3. A device not properly honoring write-barriers (the operations 
immediately adjacent to the write barrier weren't being ordered 
correctly all the time).


Based on what I know about BTRFS, the following could also cause problems:
1. A single-event-upset somewhere in the write path.
2. The kernel issuing a write to the wrong device (I haven't had this 
happen to me, but know people who have).


In general, any of these will cause problems for pretty much any 
filesystem, not just BTRFS.

I always thought with a controller and device and driver combination that
honors fsync, with BTRFS it would either be the new state or the last known
good state *anyway*. So where does the need to rollback arise from?

I think that in this case the term rollback is a bit ambiguous; here it
means from the point of view of userspace, which sees the FS as having 
'rolled-back' from the most recent state to the last known good state.

That said all journalling filesystems have some sort of rollback as far as I
understand: If the last journal entry is incomplete they discard it on journal
replay. So even there you lose the last seconds of write activity.

But in case fsync() returns the data needs to be safe on disk. I always
thought BTRFS honors this under *any* circumstance. If some proposed
autorollback breaks this guarentee, I think something is broke elsewhere.

And fsync is an fsync is an fsync. Its semantics are clear as crystal. There
is nothing, absolutely nothing to discuss about it.

An fsync completes if the device itself reported "Yeah, I have the data on
disk, all safe and cool to go." Anything else is a bug IMO.

Or a hardware issue; most filesystems need disks to properly honor write
barriers to provide guaranteed semantics on an fsync, and many consumer
disk drives still don't honor them consistently.
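
To make the contract concrete, here is a minimal userspace sketch of what
"fsync returned, so the data is durable" means. This is generic POSIX usage,
not btrfs-specific code, and the file name is made up:

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("record.db", O_WRONLY | O_CREAT | O_APPEND, 0644);
		const char buf[] = "committed record\n";

		if (fd < 0)
			return 1;
		if (write(fd, buf, sizeof(buf) - 1) != (ssize_t)(sizeof(buf) - 1))
			return 1;
		/*
		 * Only after fsync() returns 0 may the application assume the
		 * record survives a crash or power loss. A drive that lies
		 * about write barriers / cache flushes silently breaks this
		 * guarantee on any filesystem, btrfs included.
		 */
		if (fsync(fd) != 0) {
			perror("fsync");
			return 1;
		}
		close(fd);
		return 0;
	}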






Re: What is the vision for btrfs fs repair?

2014-10-13 Thread Rich Freeman
On Sun, Oct 12, 2014 at 6:14 AM, Martin Steigerwald mar...@lichtvoll.de wrote:
 Am Freitag, 10. Oktober 2014, 10:37:44 schrieb Chris Murphy:
 On Oct 10, 2014, at 6:53 AM, Bob Marley bobmar...@shiftmail.org wrote:
  On 10/10/2014 03:58, Chris Murphy wrote:
  * mount -o recovery
 
Enable autorecovery attempts if a bad tree root is found at mount
time.
 
  I'm confused why it's not the default yet. Maybe it's continuing to
  evolve at a pace that suggests something could sneak in that makes
  things worse? It is almost an oxymoron in that I'm manually enabling an
  autorecovery
 
  If true, maybe the closest indication we'd get of btrfs stability is the
  default enabling of autorecovery.
  No way!
  I wouldn't want a default like that.
 
  If you think at distributed transactions: suppose a sync was issued on
  both sides of a distributed transaction, then power was lost on one side,
  than btrfs had corruption. When I remount it, definitely the worst thing
  that can happen is that it auto-rolls-back to a previous known-good
  state.
 For a general purpose file system, losing 30 seconds (or less) of
 questionably committed data, likely corrupt, is a file system that won't
 mount without user intervention, which requires a secret decoder ring to
 get it to mount at all. And may require the use of specialized tools to
 retrieve that data in any case.

 The fail safe behavior is to treat the known good tree root as the default
 tree root, and bypass the bad tree root if it cannot be repaired, so that
 the volume can be mounted with default mount options (i.e. the ones in
 fstab). Otherwise it's a filesystem that isn't well suited for general
 purpose use as rootfs let alone for boot.

 To understand this a bit better:

 What can be the reasons a recent tree gets corrupted?

 I always thought with a controller and device and driver combination that
 honors fsync with BTRFS it would either be the new state of the last known
 good state *anyway*. So where does the need to rollback arise from?


In theory the recover option should never be necessary.  Btrfs makes
all the guarantees everybody wants it to - when the data is fsynced
then it will never be lost.

The question is what should happen when a corrupted tree root, which
should never happen, happens anyway.  The options are to refuse to
mount the filesystem by default, or mount it by default discarding
about 30-60s worth of writes.  And yes, when this situation happens
(whether it mounts by default or not) btrfs has broken its promise of
data being written after a successful fsync return.

As has been pointed out, braindead drive firmware is the most likely
cause of this sort of issue.  However, there are a number of other
hardware and software errors that could cause it, including errors in
linux outside of btrfs, and of course bugs in btrfs as well.

In an ideal world no filesystem would need any kind of recovery/repair
tools.  They can often mean that the fsync promise was broken.  The
real question is, once that has happened, how do you move on?

I think the best default is to auto-recover, but to have better
facilities for reporting errors to the user.  Right now btrfs is very
quiet about failures - maybe a cryptic message in dmesg, and nobody
reads all of that unless they're looking for something.  If btrfs
could report significant issues that might mitigate the impact of an
auto-recovery.

Also, another thing to consider during recovery is whether the damaged
data could be optionally stored in a snapshot of some kind - maybe in
the way that ext3/4 rollback data after conversion gets stored in a
snapshot.  My knowledge of the underlying structures is weak, but I'd
think that a corrupted tree root practically is a snapshot already,
and turning it into one might even be easier than cleaning it up.  Of
course, we would need to ensure the snapshot could be deleted without
further error.  Doing anything with the snapshot might require special
tools, but if people want to do disk scraping they could.

--
Rich


Re: btrfs send and kernel 3.17

2014-10-13 Thread john terragon
Actually it seems strange that a send operation could corrupt the
source subvolume or fs. Why would the send modify the source subvolume
in any significant way? The only way I can find to reconcile your
observations with mine is that maybe the snapshots get corrupted not
by the send operation by itself but when they are generated with -r
(readonly, as it is needed to send them). Are the corrupted snapshots
you have in machine 2 (the one in which send was never used) readonly?


Re: btrfs balance segfault, kernel BUG at fs/btrfs/extent-tree.c:7727

2014-10-13 Thread Rich Freeman
On Thu, Oct 9, 2014 at 10:19 AM, Petr Janecek jane...@ucw.cz wrote:

   I have trouble finishing btrfs balance on a five disk raid10 fs.
 I added a disk to a 4x3TB raid10 fs and ran btrfs balance start
 /mnt/b3, which segfaulted after a few hours, probably because of the BUG
 below. btrfs check does not find any errors, both before the balance
 and after reboot (the fs becomes un-umountable).

 [22744.238559] WARNING: CPU: 0 PID: 4211 at fs/btrfs/extent-tree.c:876 
 btrfs_lookup_extent_info+0x292/0x30a [btrfs]()

 [22744.532378] kernel BUG at fs/btrfs/extent-tree.c:7727!

I am running into something similar. I just added a 3TB drive to my
raid1 btrfs and started a balance.  The balance segfaulted, and I find
this in dmesg:


[453046.291762] BTRFS info (device sde2): relocating block group
10367073779712 flags 17
[453062.494151] BTRFS info (device sde2): found 13 extents
[453069.283368] [ cut here ]
[453069.283468] kernel BUG at
/data/src/linux-3.17.0-gentoo/fs/btrfs/relocation.c:931!
[453069.283590] invalid opcode:  [#1] SMP
[453069.283666] Modules linked in: vhost_net vhost macvtap macvlan tun
ipt_MASQUERADE xt_conntrack veth nfsd auth_rpcgss oid_registry lockd
iptable_mangle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4
nf_nat_ipv4 nf_nat nf_conntrack iptable_filter ip_tables it87
hwmon_vid hid_logitech_dj nxt200x cx88_dvb videobuf_dvb dvb_core
cx88_vp3054_i2c tuner_simple tuner_types tuner mousedev hid_generic
usbhid cx88_alsa radeon cx8800 cx8802 cx88xx snd_hda_codec_realtek
btcx_risc snd_hda_codec_generic videobuf_dma_sg videobuf_core kvm_amd
tveeprom kvm rc_core v4l2_common cfbfillrect fbcon videodev cfbimgblt
snd_hda_intel bitblit snd_hda_controller cfbcopyarea softcursor font
tileblit i2c_algo_bit k10temp snd_hda_codec backlight drm_kms_helper
snd_hwdep i2c_piix4 ttm snd_pcm snd_timer drm snd soundcore 8250 evdev
[453069.285043]  serial_core ext4 crc16 jbd2 mbcache zram lz4_compress
zsmalloc ata_generic pata_acpi btrfs xor zlib_deflate atkbd raid6_pq
ohci_pci firewire_ohci firewire_core crc_itu_t pata_atiixp ehci_pci
ohci_hcd ehci_hcd usbcore usb_common r8169 mii sunrpc dm_mirror
dm_region_hash dm_log dm_mod
[453069.285552] CPU: 1 PID: 17270 Comm: btrfs Not tainted 3.17.0-gentoo #1
[453069.285657] Hardware name: Gigabyte Technology Co., Ltd.
GA-880GM-UD2H/GA-880GM-UD2H, BIOS F8 10/11/2010
[453069.285806] task: 88040ec556e0 ti: 88010cf94000 task.ti:
88010cf94000
[453069.285925] RIP: 0010:[a02ddd62]  [a02ddd62]
build_backref_tree+0x1152/0x11b0 [btrfs]
[453069.286137] RSP: 0018:88010cf97848  EFLAGS: 00010206
[453069.286223] RAX: 8800ae67c800 RBX: 880122e94000 RCX:
880122e949c0
[453069.286336] RDX: 09270788d000 RSI: 880054c3fbc0 RDI:
8800ae67c800
[453069.286449] RBP: 88010cf97958 R08: 000159a0 R09:
880122e94000
[453069.286561] R10: 0003 R11:  R12:
8802da313000
[453069.286674] R13: 8802da313c60 R14: 880122e94780 R15:
88040c277000
[453069.286787] FS:  7f175ac51880() GS:880427c4()
knlGS:f7333b40
[453069.286913] CS:  0010 DS:  ES:  CR0: 8005003b
[453069.287005] CR2: 7f208de58000 CR3: 0003b0a9c000 CR4:
07e0
[453069.287116] Stack:
[453069.287151]  88010cf97868 880122e94000 01ff880122e94300
880342156060
[453069.287282]  880122e94780 8802da313c60 880122e94600
8800ae67c800
[453069.287412]  880122e947c0 8802da313000 88040c277120
88010005
[453069.287542] Call Trace:
[453069.287640]  [a02ddfa3] relocate_tree_blocks+0x1e3/0x630 [btrfs]
[453069.287796]  [a02e0550] relocate_block_group+0x3d0/0x650 [btrfs]
[453069.287951]  [a02e0958]
btrfs_relocate_block_group+0x188/0x2a0 [btrfs]
[453069.288113]  [a02b48f0]
btrfs_relocate_chunk.isra.61+0x70/0x780 [btrfs]
[453069.288276]  [a02c7fd0] ?
btrfs_set_lock_blocking_rw+0x70/0xc0 [btrfs]
[453069.288438]  [a02b0e79] ? free_extent_buffer+0x59/0xb0 [btrfs]
[453069.288590]  [a02b8e99] btrfs_balance+0x829/0xf40 [btrfs]
[453069.288738]  [a02bf80f] btrfs_ioctl_balance+0x1af/0x510 [btrfs]
[453069.288890]  [a02c59e4] btrfs_ioctl+0xa54/0x2950 [btrfs]
[453069.288995]  [8111d016] ?
lru_cache_add_active_or_unevictable+0x26/0x90
[453069.289119]  [8113a061] ? handle_mm_fault+0xbe1/0xdb0
[453069.289219]  [811ffdce] ? cred_has_capability+0x5e/0x100
[453069.289323]  [8104065c] ? __do_page_fault+0x1fc/0x4f0
[453069.289422]  [8117d80e] do_vfs_ioctl+0x7e/0x4f0
[453069.289513]  [811ff64f] ? file_has_perm+0x8f/0xa0
[453069.289606]  [8117dd09] SyS_ioctl+0x89/0xa0
[453069.289692]  [81040a1c] ? do_page_fault+0xc/0x10
[453069.289785]  [814f5752] system_call_fastpath+0x16/0x1b
[453069.289881] Code: ff ff 48 8b 9d 20 ff ff ff e9 11 ff ff ff 0f 0b
be ec 03 00 00 48 c7 c7 d0 f0 30 a0 e8 28 00 d7 e0 e9 06 f3 ff ff e8
c4 42 

Re: 3.17.0-rc7: kernel BUG at fs/btrfs/relocation.c:931!

2014-10-13 Thread Rich Freeman
On Thu, Oct 2, 2014 at 3:27 AM, Tomasz Chmielewski t...@virtall.com wrote:
 Got this when running balance with 3.17.0-rc7:

 [173475.410717] kernel BUG at fs/btrfs/relocation.c:931!

I just started a post on another thread with this exact same issue on
3.17.0. I started a balance after adding a new drive.

[453046.291762] BTRFS info (device sde2): relocating block group
10367073779712 flags 17
[453062.494151] BTRFS info (device sde2): found 13 extents
[453069.283368] [ cut here ]
[453069.283468] kernel BUG at
/data/src/linux-3.17.0-gentoo/fs/btrfs/relocation.c:931!
[453069.283590] invalid opcode:  [#1] SMP
[453069.283666] Modules linked in: vhost_net vhost macvtap macvlan tun
ipt_MASQUERADE xt_conntrack veth nfsd auth_rpcgss oid_registry lockd
iptable_mangle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4
nf_nat_ipv4 nf_nat nf_conntrack iptable_filter ip_tables it87
hwmon_vid hid_logitech_dj nxt200x cx88_dvb videobuf_dvb dvb_core
cx88_vp3054_i2c tuner_simple tuner_types tuner mousedev hid_generic
usbhid cx88_alsa radeon cx8800 cx8802 cx88xx snd_hda_codec_realtek
btcx_risc snd_hda_codec_generic videobuf_dma_sg videobuf_core kvm_amd
tveeprom kvm rc_core v4l2_common cfbfillrect fbcon videodev cfbimgblt
snd_hda_intel bitblit snd_hda_controller cfbcopyarea softcursor font
tileblit i2c_algo_bit k10temp snd_hda_codec backlight drm_kms_helper
snd_hwdep i2c_piix4 ttm snd_pcm snd_timer drm snd soundcore 8250 evdev
[453069.285043]  serial_core ext4 crc16 jbd2 mbcache zram lz4_compress
zsmalloc ata_generic pata_acpi btrfs xor zlib_deflate atkbd raid6_pq
ohci_pci firewire_ohci firewire_core crc_itu_t pata_atiixp ehci_pci
ohci_hcd ehci_hcd usbcore usb_common r8169 mii sunrpc dm_mirror
dm_region_hash dm_log dm_mod
[453069.285552] CPU: 1 PID: 17270 Comm: btrfs Not tainted 3.17.0-gentoo #1
[453069.285657] Hardware name: Gigabyte Technology Co., Ltd.
GA-880GM-UD2H/GA-880GM-UD2H, BIOS F8 10/11/2010
[453069.285806] task: 88040ec556e0 ti: 88010cf94000 task.ti:
88010cf94000
[453069.285925] RIP: 0010:[a02ddd62]  [a02ddd62]
build_backref_tree+0x1152/0x11b0 [btrfs]
[453069.286137] RSP: 0018:88010cf97848  EFLAGS: 00010206
[453069.286223] RAX: 8800ae67c800 RBX: 880122e94000 RCX:
880122e949c0
[453069.286336] RDX: 09270788d000 RSI: 880054c3fbc0 RDI:
8800ae67c800
[453069.286449] RBP: 88010cf97958 R08: 000159a0 R09:
880122e94000
[453069.286561] R10: 0003 R11:  R12:
8802da313000
[453069.286674] R13: 8802da313c60 R14: 880122e94780 R15:
88040c277000
[453069.286787] FS:  7f175ac51880() GS:880427c4()
knlGS:f7333b40
[453069.286913] CS:  0010 DS:  ES:  CR0: 8005003b
[453069.287005] CR2: 7f208de58000 CR3: 0003b0a9c000 CR4:
07e0
[453069.287116] Stack:
[453069.287151]  88010cf97868 880122e94000 01ff880122e94300
880342156060
[453069.287282]  880122e94780 8802da313c60 880122e94600
8800ae67c800
[453069.287412]  880122e947c0 8802da313000 88040c277120
88010005
[453069.287542] Call Trace:
[453069.287640]  [a02ddfa3] relocate_tree_blocks+0x1e3/0x630 [btrfs]
[453069.287796]  [a02e0550] relocate_block_group+0x3d0/0x650 [btrfs]
[453069.287951]  [a02e0958]
btrfs_relocate_block_group+0x188/0x2a0 [btrfs]
[453069.288113]  [a02b48f0]
btrfs_relocate_chunk.isra.61+0x70/0x780 [btrfs]
[453069.288276]  [a02c7fd0] ?
btrfs_set_lock_blocking_rw+0x70/0xc0 [btrfs]
[453069.288438]  [a02b0e79] ? free_extent_buffer+0x59/0xb0 [btrfs]
[453069.288590]  [a02b8e99] btrfs_balance+0x829/0xf40 [btrfs]
[453069.288738]  [a02bf80f] btrfs_ioctl_balance+0x1af/0x510 [btrfs]
[453069.288890]  [a02c59e4] btrfs_ioctl+0xa54/0x2950 [btrfs]
[453069.288995]  [8111d016] ?
lru_cache_add_active_or_unevictable+0x26/0x90
[453069.289119]  [8113a061] ? handle_mm_fault+0xbe1/0xdb0
[453069.289219]  [811ffdce] ? cred_has_capability+0x5e/0x100
[453069.289323]  [8104065c] ? __do_page_fault+0x1fc/0x4f0
[453069.289422]  [8117d80e] do_vfs_ioctl+0x7e/0x4f0
[453069.289513]  [811ff64f] ? file_has_perm+0x8f/0xa0
[453069.289606]  [8117dd09] SyS_ioctl+0x89/0xa0
[453069.289692]  [81040a1c] ? do_page_fault+0xc/0x10
[453069.289785]  [814f5752] system_call_fastpath+0x16/0x1b
[453069.289881] Code: ff ff 48 8b 9d 20 ff ff ff e9 11 ff ff ff 0f 0b
be ec 03 00 00 48 c7 c7 d0 f0 30 a0 e8 28 00 d7 e0 e9 06 f3 ff ff e8
c4 42 02 00 0f 0b 3c b0 0f 84 72 f1 ff ff be 22 03 00 00 48 c7 c7 d0
f0 30
[453069.290429] RIP  [a02ddd62]
build_backref_tree+0x1152/0x11b0 [btrfs]
[453069.290591]  RSP 88010cf97848
[453069.316194] ---[ end trace 5fdc0af4cc62bf41 ]---

Re: btrfs send and kernel 3.17

2014-10-13 Thread David Arendt
On 10/13/2014 02:40 PM, john terragon wrote:
 Actually it seems strange that a send operation could corrupt the
 source subvolume or fs. Why would the send modify the source subvolume
 in any significant way? The only way I can find to reconcile your
 observations with mine is that maybe the snapshots get corrupted not
 by the send operation by itself but when they are generated with -r
 (readonly, as it is needed to send them). Are the corrupted snapshots
 you have in machine 2 (the one in which send was never used) readonly?
Yes, on both machines there are only readonly snapshots.


Re: btrfs send and kernel 3.17

2014-10-13 Thread Rich Freeman
On Sun, Oct 12, 2014 at 7:11 AM, David Arendt ad...@prnet.org wrote:
 This weekend I finally had time to try btrfs send again on the newly
 created fs. Now I am running into another problem:

 btrfs send returns: ERROR: send ioctl failed with -12: Cannot allocate
 memory

 In dmesg I see only the following output:

 parent transid verify failed on 21325004800 wanted 2620 found 8325


I'm not using send at all, but I've been running into "parent transid
verify failed" messages where the wanted is way smaller than the found
when trying to balance a raid1 after adding a new drive.  Originally I
had gotten a BUG, and after reboot the drive finished balancing
(interestingly enough without moving any chunks to the new drive -
just consolidating everything on the old drives), and then when I try
to do another balance I get:
[ 4426.987177] BTRFS info (device sdc2): relocating block group
10367073779712 flags 17
[ 4446.287998] BTRFS info (device sdc2): found 13 extents
[ 4451.330887] parent transid verify failed on 10063286579200 wanted
987432 found 993678
[ 4451.350663] parent transid verify failed on 10063286579200 wanted
987432 found 993678

The btrfs program itself outputs:
btrfs balance start -v /data
Dumping filters: flags 0x7, state 0x0, force is off
  DATA (flags 0x0): balancing
  METADATA (flags 0x0): balancing
  SYSTEM (flags 0x0): balancing
ERROR: error during balancing '/data' - Cannot allocate memory
There may be more info in syslog - try dmesg | tail

This is also on 3.17.  This may be completely unrelated, but it seemed
similar enough to be worth mentioning.

The filesystem otherwise seems to work fine, other than the new drive
not having any data on it:
Label: 'datafs'  uuid: cd074207-9bc3-402d-bee8-6a8c77d56959
Total devices 6 FS bytes used 2.16TiB
devid1 size 2.73TiB used 2.40TiB path /dev/sdc2
devid2 size 931.32GiB used 695.03GiB path /dev/sda2
devid3 size 931.32GiB used 700.00GiB path /dev/sdb2
devid4 size 931.32GiB used 700.00GiB path /dev/sdd2
devid5 size 931.32GiB used 699.00GiB path /dev/sde2
devid6 size 2.73TiB used 0.00 path /dev/sdf2

This is btrfs-progs-3.16.2.

--
Rich


Re: what is the best way to monitor raid1 drive failures?

2014-10-13 Thread Suman C
I had progs 3.12 and updated to the latest from git (3.16). With this
update, btrfs fi show reports there is a missing device immediately
after I pull it out. Thanks!

I am using virtualbox to test this. So, I am detaching the drive like so:

vboxmanage storageattach vm --storagectl controller --port port
--device device --medium none

Next I am going to try and test a more realistic scenario where a
harddrive is not pulled out, but is damaged.

Can/does btrfs mark a filesystem (say, a 2-drive raid1) degraded or
unhealthy automatically when one drive is damaged badly enough that it
cannot be written to or read from reliably?

Suman

On Sun, Oct 12, 2014 at 7:21 PM, Anand Jain anand.j...@oracle.com wrote:

 Suman,

 To simulate the failure, I detached one of the drives from the system.
 After that, I see no sign of a problem except for these errors:

  Are you physically pulling out the device? I wonder if lsblk or blkid
  shows the error? The device-missing reporting logic is in the progs (so
  have the latest) and it works provided user scripts such as blkid/lsblk
  also report the problem. OR for soft-detach tests you could use
  devmgt at http://github.com/anajain/devmgt

  Also I am trying to get a device management framework into btrfs,
  with better device management and reporting.

 Thanks,  Anand



 On 10/13/14 07:50, Suman C wrote:

 Hi,

 I am testing some disk failure scenarios in a 2 drive raid1 mirror.
 They are 4GB each, virtual SATA drives inside virtualbox.

 To simulate the failure, I detached one of the drives from the system.
 After that, I see no sign of a problem except for these errors:

 Oct 12 15:37:14 rock-dev kernel: btrfs: bdev /dev/sdb errs: wr 0, rd
 0, flush 1, corrupt 0, gen 0
 Oct 12 15:37:14 rock-dev kernel: lost page write due to I/O error on
 /dev/sdb

 /dev/sdb is gone from the system, but btrfs fi show still lists it.

 Label: raid1pool  uuid: 4e5d8b43-1d34-4672-8057-99c51649b7c6
  Total devices 2 FS bytes used 1.46GiB
  devid1 size 4.00GiB used 2.45GiB path /dev/sdb
  devid2 size 4.00GiB used 2.43GiB path /dev/sdc

 I am able to read and write just fine, but do see the above errors in
 dmesg.

 What is the best way to find out that one of the drives has gone bad?

 Suman




Re: btrfs random filesystem corruption in kernel 3.17

2014-10-13 Thread David Arendt
From my own experience and based on what other people are saying, I
think there is a random btrfs filesystem corruption problem in kernel
3.17, at least related to snapshots, therefore I decided to post under
another subject to draw attention from people not concerned with btrfs
send. More information can be found in the btrfs send posts.

Did the filesystem you tried to balance contain snapshots? Read-only ones?

On 10/13/2014 07:22 PM, Rich Freeman wrote:
 On Sun, Oct 12, 2014 at 7:11 AM, David Arendt ad...@prnet.org wrote:
 This weekend I finally had time to try btrfs send again on the newly
 created fs. Now I am running into another problem:

 btrfs send returns: ERROR: send ioctl failed with -12: Cannot allocate
 memory

 In dmesg I see only the following output:

 parent transid verify failed on 21325004800 wanted 2620 found 8325

 I'm not using send at all, but I've been running into parent transid
 verify failed messages where the wanted is way smaller than the found
 when trying to balance a raid1 after adding a new drive.  Originally I
 had gotten a BUG, and after reboot the drive finished balancing
 (interestingly enough without moving any chunks to the new drive -
 just consolidating everything on the old drives), and then when I try
 to do another balance I get:
 [ 4426.987177] BTRFS info (device sdc2): relocating block group
 10367073779712 flags 17
 [ 4446.287998] BTRFS info (device sdc2): found 13 extents
 [ 4451.330887] parent transid verify failed on 10063286579200 wanted
 987432 found 993678
 [ 4451.350663] parent transid verify failed on 10063286579200 wanted
 987432 found 993678

 The btrfs program itself outputs:
 btrfs balance start -v /data
 Dumping filters: flags 0x7, state 0x0, force is off
   DATA (flags 0x0): balancing
   METADATA (flags 0x0): balancing
   SYSTEM (flags 0x0): balancing
 ERROR: error during balancing '/data' - Cannot allocate memory
 There may be more info in syslog - try dmesg | tail

 This is also on 3.17.  This may be completely unrelated, but it seemed
 similar enough to be worth mentioning.

 The filesystem otherwise seems to work fine, other than the new drive
 not having any data on it:
 Label: 'datafs'  uuid: cd074207-9bc3-402d-bee8-6a8c77d56959
 Total devices 6 FS bytes used 2.16TiB
 devid1 size 2.73TiB used 2.40TiB path /dev/sdc2
 devid2 size 931.32GiB used 695.03GiB path /dev/sda2
 devid3 size 931.32GiB used 700.00GiB path /dev/sdb2
 devid4 size 931.32GiB used 700.00GiB path /dev/sdd2
 devid5 size 931.32GiB used 699.00GiB path /dev/sde2
 devid6 size 2.73TiB used 0.00 path /dev/sdf2

 This is btrfs-progs-3.16.2.

 --
 Rich



Re: btrfs random filesystem corruption in kernel 3.17

2014-10-13 Thread Rich Freeman
On Mon, Oct 13, 2014 at 4:27 PM, David Arendt ad...@prnet.org wrote:
 From my own experience and based on what other people are saying, I
 think there is a random btrfs filesystem corruption problem in kernel
 3.17 at least related to snapshots, therefore I decided to post using
 another subject to draw attention from people not concerned about btrfs
 send to it. More information can be found in the brtfs send posts.

 Did the filesystem you tried to balance contain snapshots ? Read only ones ?

The filesystem contains numerous subvolumes and snapshots, many of
which are read-only.  I'm managing many with snapper.

The similarity of the transid verify errors made me think this issue
is related, and the root cause may have nothing to do with btrfs send.

As far as I can tell these errors aren't having any effect on my data
- hopefully the system is catching the problems before there are
actual disk writes/etc.

--
Rich


Re: btrfs random filesystem corruption in kernel 3.17

2014-10-13 Thread john terragon
I think I just found a consistent simple way to trigger the problem
(at least on my system). And, as I guessed before, it seems to be
related just to readonly snapshots:

1) I create a readonly snapshot
2) I do some changes on the source subvolume for the snapshot (I'm not
sure changes are strictly needed)
3) reboot (or probably just unmount and remount. I reboot because the
fs I've problems with contains my root subvolume)

After the rebooting (or the remount) I consistently have the corruption
with the usual multitude of these in dmesg
parent transid verify failed on 902316032 wanted 2484 found 4101
and the characteristic ls -la output

drwxr-xr-x 1 root root  250 Oct 10 15:37 root
d? ? ??   ?? root-b2
drwxr-xr-x 1 root root  250 Oct 10 15:37 root-b3
d? ? ??   ?? root-backup

root-backup and root-b2 are both readonly whereas root-b3 is rw (and
it didn't get corrupted).

David, maybe you can try the same steps on one of your machines?

John


Re: btrfs random filesystem corruption in kernel 3.17

2014-10-13 Thread Rich Freeman
On Mon, Oct 13, 2014 at 4:48 PM, john terragon jterra...@gmail.com wrote:
 I think I just found a consistent simple way to trigger the problem
 (at least on my system). And, as I guessed before, it seems to be
 related just to readonly snapshots:

 1) I create a readonly snapshot
 2) I do some changes on the source subvolume for the snapshot (I'm not
 sure changes are strictly needed)
 3) reboot (or probably just unmount and remount. I reboot because the
 fs I've problems with contains my root subvolume)

 After the rebooting (or the remount) I consistently have the corruption
 with the usual multitude of these in dmesg
 parent transid verify failed on 902316032 wanted 2484 found 4101
 and the characteristic ls -la output

 drwxr-xr-x 1 root root  250 Oct 10 15:37 root
 d? ? ??   ?? root-b2
 drwxr-xr-x 1 root root  250 Oct 10 15:37 root-b3
 d? ? ??   ?? root-backup

 root-backup and root-b2 are both readonly whereas root-b3 is rw (and
 it didn't get corrupted).

 David, maybe you can try the same steps on one of your machines?


Look at that.  I didn't realize it, but indeed I have a corrupted snapshot:
/data/.snapshots/5338/:
ls: cannot access /data/.snapshots/5338/snapshot: Cannot allocate memory
total 4
drwxr-xr-x 1 root root  32 Oct 11 06:09 .
drwxr-x--- 1 root root  32 Oct 11 07:42 ..
-rw--- 1 root root 135 Oct 11 06:09 info.xml
d? ? ??  ?? snapshot

Several older snapshots are fine, and those predate my 3.17 upgrade.

I noticed that this corrupted snapshot isn't even listed in my snapper lists.

btrfs su delete /data/.snapshots/5338/snapshot
Transaction commit: none (default)
ERROR: error accessing '/data/.snapshots/5338/snapshot'

Removing them appears to be problematic as well.  I might just disable
compress=lzo and go back to 3.16 to see how that goes.

--
Rich


Re: btrfs random filesystem corruption in kernel 3.17

2014-10-13 Thread Rich Freeman
On Mon, Oct 13, 2014 at 4:55 PM, Rich Freeman
r-bt...@thefreemanclan.net wrote:
 On Mon, Oct 13, 2014 at 4:48 PM, john terragon jterra...@gmail.com wrote:

 After the rebooting (or the remount) I consistently have the corruption
 with the usual multitude of these in dmesg
 parent transid verify failed on 902316032 wanted 2484 found 4101
 and the characteristic ls -la output

Sorry to double-reply, but I left this out.  I have a long string of
these early in boot as well that I never noticed before.

--
Rich


Re: What is the vision for btrfs fs repair?

2014-10-13 Thread Josef Bacik

On 10/08/2014 03:11 PM, Eric Sandeen wrote:

I was looking at Marc's post:

http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html

and it feels like there isn't exactly a cohesive, overarching vision for
repair of a corrupted btrfs filesystem.

In other words - I'm an admin cruising along, when the kernel throws some
fs corruption error, or for whatever reason btrfs fails to mount.
What should I do?

Marc lays out several steps, but to me this highlights that there seem to
be a lot of disjoint mechanisms out there to deal with these problems;
mostly from Marc's blog, with some bits of my own:

* btrfs scrub
Errors are corrected along the way if possible (what *is* possible?)
* mount -o recovery
Enable autorecovery attempts if a bad tree root is found at mount 
time.
* mount -o degraded
Allow mounts to continue with missing devices.
(This isn't really a way to recover from corruption, right?)
* btrfs-zero-log
remove the log tree if log tree is corrupt
* btrfs rescue
Recover a damaged btrfs filesystem
chunk-recover
super-recover
How does this relate to btrfs check?
* btrfs check
repair a btrfs filesystem
--repair
--init-csum-tree
--init-extent-tree
How does this relate to btrfs rescue?
* btrfs restore
try to salvage files from a damaged filesystem
(not really repair, it's disk-scraping)
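
For reference, typical invocations of the tools above look something
like this (illustrative only; option spellings are from the manpages of
roughly this progs era, and /dev/sdX and /mnt are placeholders):

	btrfs scrub start -B /mnt             # online; fix checksum errors from a good copy
	mount -o recovery /dev/sdX /mnt       # try older tree roots at mount time
	mount -o degraded /dev/sdX /mnt       # mount despite missing devices
	btrfs-zero-log /dev/sdX               # offline; clear a corrupt log tree
	btrfs rescue super-recover /dev/sdX   # offline; restore a good superblock copy
	btrfs rescue chunk-recover /dev/sdX   # offline; rebuild a damaged chunk tree
	btrfs check --repair /dev/sdX         # offline fsck-style repair
	btrfs restore /dev/sdX /some/dir      # offline; scrape files out, no repair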


What's the vision for, say, scrub vs. check vs. rescue?  Should they repair
the same errors, only online vs. offline?  If not, what class of errors does
one fix vs. the other?  How would an admin know?  Can btrfs check recover a
bad tree root in the same way that mount -o recovery does?  How would I know
if I should use --init-*-tree, or chunk-recover, and what are the
ramifications of using these options?

It feels like recovery tools have been badly splintered, and if there's an
overarching design or vision for btrfs fs repair, I can't tell what it is.
Can anyone help me?



We probably should just consolidate under 3 commands, one for online 
checking, one for offline repair, and one for pulling stuff off of the 
disk when things go to hell.  A lot of these tools were born out of the 
fact that we didn't have a fsck tool for a long time, so there were these 
stopgaps put into place; now it's time to go back and clean it up.


I'll try and do this after I finish my cleanup/sync between kernel and 
progs work, and fill out the documentation a little better so it's clear 
when to use what.  Thanks,


Josef



Re: btrfs random filesystem corruption in kernel 3.17

2014-10-13 Thread john terragon
I'm using compress=no so compression doesn't seem to be related, at
least in my case. Just read-only snapshots on 3.17 (although I haven't
tried 3.16).

John


Re: btrfs random filesystem corruption in kernel 3.17

2014-10-13 Thread David Arendt
As these two machines are running as servers for different purposes (yes,
I know that btrfs is unstable and any corruption or data loss is at my
own risk, therefore I have good backups), I want to reboot them no more
than necessary.

However, I tried to correlate my reboot times with the corruptions:

machine 1:

d? ? ?  ? ?? root.20141009.000503.backup

reboot   system boot  3.17.0   Thu Oct  9 23:20   still running
reboot   system boot  3.17.0   Tue Oct  7 21:25 - 23:18 (2+01:53)
reboot   system boot  3.17.0   Mon Oct  6 22:47 - 23:18 (3+00:31)

For this machine, corruption seems to have occurred for a snapshot
created after a reboot.


machine 2:

d? ? ??  ?? root.20141006.003239.backup
d? ? ??  ?? root.20141007.001616.backup
d? ? ??  ?? root.20141008.000501.backup
d? ? ??  ?? root.20141009.052436.backup

reboot   system boot  3.17.0   Thu Oct  9 21:31   still running
reboot   system boot  3.17.0   Tue Oct  7 21:27 - 21:30 (2+00:03)
reboot   system boot  3.17.0   Tue Oct  7 17:51 - 21:26  (03:34)
reboot   system boot  3.17.0   Sun Oct  5 23:50 - 17:50 (1+17:59)
reboot   system boot  3.17.0   Sun Oct  5 23:47 - 23:49  (00:01)

Over the next few days, I will set up a virtual machine to do more tests.

On 10/13/2014 10:48 PM, john terragon wrote:
 I think I just found a consistent simple way to trigger the problem
 (at least on my system). And, as I guessed before, it seems to be
 related just to readonly snapshots:

 1) I create a readonly snapshot
 2) I do some changes on the source subvolume for the snapshot (I'm not
 sure changes are strictly needed)
 3) reboot (or probably just unmount and remount. I reboot because the
 fs I've problems with contains my root subvolume)

 After the rebooting (or the remount) I consistently have the corruption
 with the usual multitude of these in dmesg
 parent transid verify failed on 902316032 wanted 2484 found 4101
 and the characteristic ls -la output

 drwxr-xr-x 1 root root  250 Oct 10 15:37 root
 d? ? ??   ?? root-b2
 drwxr-xr-x 1 root root  250 Oct 10 15:37 root-b3
 d? ? ??   ?? root-backup

 root-backup and root-b2 are both readonly whereas root-b3 is rw (and
 it didn't get corrupted).

 David, maybe you can try the same steps on one of your machines?

 John



Re: btrfs random filesystem corruption in kernel 3.17

2014-10-13 Thread David Arendt
I'm also using no compression.

On 10/13/2014 11:22 PM, john terragon wrote:
 I'm using compress=no so compression doesn't seem to be related, at
 least in my case. Just read-only snapshots on 3.17 (although I haven't
 tried 3.16).

 John



Re: btrfs random filesystem corruption in kernel 3.17

2014-10-13 Thread Duncan
David Arendt posted on Mon, 13 Oct 2014 23:25:23 +0200 as excerpted:

 I'm also using no compression.
 
 On 10/13/2014 11:22 PM, john terragon wrote:
 I'm using compress=no so compression doesn't seem to be related, at
 least in my case. Just read-only snapshots on 3.17 (although I haven't
 tried 3.16).

While I'm not a mind-reader and thus don't know for sure, Rich's 
reference to 3.16 and compression might not be related to this bug at 
all.  In 3.15 and early 3.16, there was a different bug related to 
compression, tho IIRC it was patched in 3.16.2 and 3.17-rc2 (or maybe .3 
and rc3, it's patched in the latest 3.16.x anyway, and in 3.17).  So how 
I read his comment was that he was considering going back to 3.16 and 
disabling compression to deal with that bug (he may not know the patch 
was marked for stable and is in current 3.16.x), rather than stay on 
3.17, since this bug hasn't even been traced yet, let alone patched.

Meanwhile, this bug makes me glad my use-case doesn't involve snapshots, 
and I've seen nothing of it. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



Re: btrfs random filesystem corruption in kernel 3.17

2014-10-13 Thread Duncan
Rich Freeman posted on Mon, 13 Oct 2014 16:42:14 -0400 as excerpted:

 On Mon, Oct 13, 2014 at 4:27 PM, David Arendt ad...@prnet.org wrote:
 From my own experience and based on what other people are saying, I
 think there is a random btrfs filesystem corruption problem in kernel
 3.17, at least related to snapshots, therefore I decided to post under
 another subject to draw attention to it from people not concerned about
 btrfs send. More information can be found in the btrfs send posts.

 Did the filesystem you tried to balance contain snapshots? Read-only
 ones?
 
 The filesystem contains numerous subvolumes and snapshots, many of which
 are read-only.  I'm managing many with snapper.
 
 The similarity of the transid verify errors made me think this issue is
 related, and the root cause may have nothing to do with btrfs send.
 
 As far as I can tell these errors aren't having any effect on my data -
 hopefully the system is catching the problems before there are actual
 disk writes/etc.

Summarizing what I've seen on the threads...

1) The bug seems to be read-only snapshot related.  The connection to 
send is that send creates read-only snapshots, but people creating read-
only snapshots for other purposes are now reporting the same problem, so 
it's not send, it's the read-only snapshots.

2) Writable snapshots haven't been implicated yet, and the working set 
from which the snapshots are taken doesn't seem to be affected, either.  
So in that sense it's not affecting ordinary usage, only the read-only 
snapshots themselves.

3) More problematic, however, is the fact that these apparently corrupted 
read-only snapshots often are not listed properly and can't be deleted, 
tho I'm not sure if that's /all/ the corrupted snapshots or only part of 
them. So while it may not affect ordinary operation in the short term, 
over time until there's a fix, people routinely doing read-only snapshots 
are going to be getting more and more of these undeletable snapshots, and 
depending on whether the eventual patch only prevents more or can 
actually fix the bad ones (possibly via btrfs check or the like), 
affected filesystems may ultimately have to be blown away and recreated 
with a fresh mkfs, in order to kill the currently undeletable snapshots.

So the first thing to do would be to shut off whatever's making read-only 
snapshots, so you don't make the problem worse while it's being 
investigated.  For those who can do that without too big an interruption 
to their normal routine (who don't depend on send/receive, for instance), 
just keep it off for the time being.  For those who depend on read-only 
snapshots (send-receive for backup and the data is too valuable to not do 
the backups for a few days), consider switching back to 3.16-stable -- 
from 3.16.3 at least, the patch for the compress bug is there, so that 
shouldn't be a problem.
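
To gauge the exposure, listing the read-only snapshots a filesystem is 
already carrying is easy enough (illustrative; /mnt stands in for your 
mountpoint, and -r limits the listing to read-only subvolumes):

	btrfs subvolume list -r /mnt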

And if you're affected, be aware that until we have a fix, we don't know 
if it'll be possible to remove the affected and currently undeletable 
snapshots.  If it's not, at some point you'll need to do a fresh 
mkfs.btrfs, to get rid of the damage.  Since the bug doesn't appear to 
affect writable snapshots or the head from which snapshots are made, 
it's not urgent, and a full fix is likely to include a patch to detect 
and fix the problem as well, but until we know what the problem is we 
can't be sure of that, so be prepared to do that mkfs at some point, as 
at this point it's possible that's the only way you'll be able to kill 
the corrupted snapshots.

4) Total speculation on my part, but given the wanted transid (aka 
generation, in different contexts) is significantly lower than the found 
transid, and the fact that the problem appears to be limited to
/read-only/ snapshots, my first suspicion is that something's getting 
updated that would normally apply to all snapshots, but the read-only 
nature of the snapshots is preventing the full update there.  The transid 
of the block is updated, but the snapshot being read-only is preventing 
update of the pointer in that snapshot accordingly.

What I do /not/ know is whether the bug is that something's getting 
updated that should NOT be, and it's simply the read-only snapshots 
letting us know about it since the writable snapshots are fully updated, 
even if that breaks the snapshot (breaking writable snapshots in a 
different and currently undetected way), or if instead, it's a legitimate 
update, like a balance simply moving the snapshot around but not 
affecting it otherwise, and the bug is that the read-only snapshots 
aren't allowing the legitimate update.

Either way, this more or less developed over the weekend, and it's Monday 
now, so the devs should be on it.  If it's anything like the 3.15/3.16 
compression bug, it'll take some time for them to properly trace it, and 
then to figure out an appropriate fix, but they will.  Chances are we'll 
have at least some decent progress on a trace by Friday, and maybe 

Re: btrfs random filesystem corruption in kernel 3.17

2014-10-13 Thread Rich Freeman
On Mon, Oct 13, 2014 at 5:22 PM, john terragon jterra...@gmail.com wrote:
 I'm using compress=no so compression doesn't seem to be related, at
 least in my case. Just read-only snapshots on 3.17 (although I haven't
 tried 3.16).

I was using lzo compression, and hence my comment about turning it off
before going back to 3.16 (not realizing that 3.16 has subsequently
been fixed).

Ironically enough I discovered this as I was about to migrate my ext4
backup drive into my btrfs raid1.  Maybe I'll go ahead and wait on
that and have an rsync backup of the filesystem handy (minus
snapshots) just in case.  :)

I'd switch to 3.16, but it sounds like there is no way to remove the
snapshots at the moment, and I can live for a while without the
ability to create new ones.

Interestingly enough, it doesn't look like ALL snapshots are affected.
I checked and some of the snapshots I made last weekend while doing
system updates look accessible.  They are significantly smaller, and
the subvolumes they were made from are also fairly new - though I have
no idea if that is related.

The subvolumes do show up in btrfs su list.  They cannot be examined
using btrfs su show.

It would be VERY nice to have a way of cleaning this up without
blowing away the entire filesystem...

--
Rich


Re: btrfs random filesystem corruption in kernel 3.17

2014-10-13 Thread john terragon
And another worrying thing I didn't notice before: two snapshots have
dates that do not make sense. root-b3 and root-b4 were created on
Oct 14th (and btw, root's modification time was also Oct 14th).
So why do they show Oct 10th? And root-prov was actually created
on Oct 10 15:37, as it correctly shows, so it's like btrfs sub snap
picks up old stale data from who knows where, or when, or for what
reason. Moreover, root-b4 was created with 3.16.5... not good.

drwxrwsr-x 1 root staff  30 Sep 11 16:15 home
d? ? ??   ?? home-backup
drwxr-xr-x 1 root root  250 Oct 14 03:02 root
d? ? ??   ?? root-b2
drwxr-xr-x 1 root root  250 Oct 10 15:37 root-b3
drwxr-xr-x 1 root root  250 Oct 10 15:37 root-b4
drwxr-xr-x 1 root root  250 Oct 14 03:02 root-b5
drwxr-xr-x 1 root root  250 Oct 14 03:02 root-b6
d? ? ??   ?? root-backup
drwxr-xr-x 1 root root  250 Oct 10 15:37 root-prov
drwxr-xr-x 1 root root   88 Sep 15 16:02 vms

On Tue, Oct 14, 2014 at 1:18 AM, Rich Freeman
r-bt...@thefreemanclan.net wrote:
 On Mon, Oct 13, 2014 at 5:22 PM, john terragon jterra...@gmail.com wrote:
 I'm using compress=no so compression doesn't seem to be related, at
 least in my case. Just read-only snapshots on 3.17 (although I haven't
 tried 3.16).

 I was using lzo compression, and hence my comment about turning it off
 before going back to 3.16 (not realizing that 3.16 has subsequently
 been fixed).

 Ironically enough I discovered this as I was about to migrate my ext4
 backup drive into my btrfs raid1.  Maybe I'll go ahead and wait on
 that and have an rsync backup of the filesystem handy (minus
 snapshots) just in case.  :)

 I'd switch to 3.16, but it sounds like there is no way to remove the
 snapshots at the moment, and I can live for a while without the
 ability to create new ones.

 interestingly enough it doesn't look like ALL snapshots are affected.
 I checked and some of the snapshots I made last weekend while doing
 system updates look accessible.  They are significantly smaller, and
 the subvolumes they were made from are also fairly new - though I have
 no idea if that is related.

 The subvolumes do show up in btrfs su list.  They cannot be examined
 using btrfs su show.

 It would be VERY nice to have a way of cleaning this up without
 blowing away the entire filesystem...

 --
 Rich


Re: what is the best way to monitor raid1 drive failures?

2014-10-13 Thread Anand Jain




On 10/14/14 03:50, Suman C wrote:

I had progs 3.12 and updated to the latest from git (3.16). With this
update, btrfs fi show reports there is a missing device immediately
after I pull it out. Thanks!

I am using virtualbox to test this. So, I am detaching the drive like so:

vboxmanage storageattach <vm> --storagectl <controller> --port <port>
--device <device> --medium none

Next I am going to try and test a more realistic scenario where a
hard drive is not pulled out, but is damaged.
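
One way to simulate the damaged-but-present case without real broken
hardware is device-mapper's error/flakey targets (a sketch; "baddisk",
"flakydisk" and /dev/sdc are placeholder names, and this hasn't been
tested here against btrfs):

SECTORS=$(blockdev --getsz /dev/sdc)
# every I/O to the mapped device fails immediately:
dmsetup create baddisk --table "0 $SECTORS error"
# or: the device works for 5 seconds, then fails I/O for 5, repeating:
dmsetup create flakydisk --table "0 $SECTORS flakey /dev/sdc 0 5 5"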




Can/does btrfs mark a filesystem (say, a 2-drive raid1) degraded or
unhealthy automatically when one drive is damaged badly enough that it
cannot be written to or read from reliably?


 There are some gaps compared to an enterprise volume manager, which
 are being fixed.  But please do report what you find.

Thanks, Anand



Suman

On Sun, Oct 12, 2014 at 7:21 PM, Anand Jain anand.j...@oracle.com wrote:


Suman,


To simulate the failure, I detached one of the drives from the system.
After that, I see no sign of a problem except for these errors:


  Are you physically pulling out the device? I wonder if lsblk or blkid
  shows the error? The device-missing reporting logic is in the progs (so
  have the latest), and it works provided user tools such as blkid/lsblk
  also report the problem. Or, for soft-detach tests, you could use
  devmgt at http://github.com/anajain/devmgt

  Also, I am trying to get a device management framework into btrfs,
  with better device management and reporting.

Thanks,  Anand



On 10/13/14 07:50, Suman C wrote:


Hi,

I am testing some disk failure scenarios in a 2 drive raid1 mirror.
They are 4GB each, virtual SATA drives inside virtualbox.

To simulate the failure, I detached one of the drives from the system.
After that, I see no sign of a problem except for these errors:

Oct 12 15:37:14 rock-dev kernel: btrfs: bdev /dev/sdb errs: wr 0, rd
0, flush 1, corrupt 0, gen 0
Oct 12 15:37:14 rock-dev kernel: lost page write due to I/O error on
/dev/sdb

/dev/sdb is gone from the system, but btrfs fi show still lists it.

Label: raid1pool  uuid: 4e5d8b43-1d34-4672-8057-99c51649b7c6
  Total devices 2 FS bytes used 1.46GiB
  devid    1 size 4.00GiB used 2.45GiB path /dev/sdb
  devid    2 size 4.00GiB used 2.43GiB path /dev/sdc

I am able to read and write just fine, but do see the above errors in
dmesg.

What is the best way to find out that one of the drives has gone bad?

Suman








Re: [PATCH v2] Btrfs: return failure if btrfs_dev_replace_finishing() failed

2014-10-13 Thread Eryu Guan
On Mon, Oct 13, 2014 at 06:18:04PM +0800, Anand Jain wrote:
 
 
 On 10/13/14 14:59, Eryu Guan wrote:
 On Mon, Oct 13, 2014 at 02:23:57PM +0800, Anand Jain wrote:
 
 
 comments below..
 
 
 On 10/13/14 12:42, Eryu Guan wrote:
 device replace could fail due to another running scrub process or any
 other errors btrfs_scrub_dev() may hit, but this failure doesn't get
 returned to userspace.
 
 The following steps could reproduce this issue
 
mkfs -t btrfs -f /dev/sdb1 /dev/sdb2
mount /dev/sdb1 /mnt/btrfs
   while true; do btrfs scrub start -B /mnt/btrfs >/dev/null 2>&1; done &
btrfs replace start -Bf /dev/sdb2 /dev/sdb3 /mnt/btrfs
# if this replace succeeded, do the following and repeat until
# you see this log in dmesg
# BTRFS: btrfs_scrub_dev(/dev/sdb2, 2, /dev/sdb3) failed -115
#btrfs replace start -Bf /dev/sdb3 /dev/sdb2 /mnt/btrfs
 
# once you see the error log in dmesg, check return value of
# replace
echo $?
 
 Introduce a new dev replace result
 
 BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS
 
 to catch -EINPROGRESS explicitly and return other errors directly to
 userspace.
 
 Signed-off-by: Eryu Guan guane...@gmail.com
 ---
 
 v2:
 - set result to SCRUB_INPROGRESS if btrfs_scrub_dev returned -EINPROGRESS
and return 0 as Miao Xie suggested
 
    fs/btrfs/dev-replace.c     | 12 +++++++++---
    include/uapi/linux/btrfs.h |  1 +
   2 files changed, 10 insertions(+), 3 deletions(-)
 
 diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
 index eea26e1..a141f8b 100644
 --- a/fs/btrfs/dev-replace.c
 +++ b/fs/btrfs/dev-replace.c
 @@ -418,9 +418,15 @@ int btrfs_dev_replace_start(struct btrfs_root *root,
 			      dev_replace->scrub_progress, 0, 1);
 
 	ret = btrfs_dev_replace_finishing(root->fs_info, ret);
 -	WARN_ON(ret);
 +	/* don't warn if EINPROGRESS, someone else might be running scrub */
 +	if (ret == -EINPROGRESS) {
 +		args->result = BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS;
 +		ret = 0;
 +	} else {
 +		WARN_ON(ret);
 +	}
 
 
  I am a bit concerned: why aren't these racing threads excluding
  each other using mutually_exclusive_operation_running, as most
  of the other device operation threads do?
 
 Thanks, Anand

btrfs_ioctl_scrub() doesn't use mutually_exclusive_operation_running
as other device operations do; I'm not sure if it should (it seems to
me that scrub should do it too).

But I think that's a different problem from the one I'm trying to fix
here. The main purpose is to return an error to userspace when
btrfs_scrub_dev() hits some error. Dealing with -EINPROGRESS is to
match the current behavior (replace and scrub can run at the same
time).

Thanks,
Eryu
 
   looks like we are trying to manage EINPROGRESS returned by
 
 Yes, that's right.
 
   btrfs_dev_replace_finishing(). In btrfs_dev_replace_finishing()
   which specific func call is returning EINPROGRESS? I didn't go
   deep enough.
 
 btrfs_dev_replace_finishing() will check scrub_ret (the last
 argument), and now returns scrub_ret when it is non-zero. It was
 returning 0 unconditionally before this patch.
 
 btrfs_dev_replace_start@fs/btrfs/dev-replace.c
  416		ret = btrfs_scrub_dev(fs_info, src_device->devid, 0,
  417				      src_device->total_bytes,
  418				      dev_replace->scrub_progress, 0, 1);
  419
  420		ret = btrfs_dev_replace_finishing(root->fs_info, ret);
 
 and btrfs_dev_replace_finishing@fs/btrfs/dev-replace.c
  529		if (!scrub_ret) {
  530			btrfs_dev_replace_update_device_in_mapping_tree(fs_info,
  531									src_device,
  532									tgt_device);
  533		} else {
  ...
  547			return scrub_ret;
  548		}
 
 
 
 
 
 
   And how do we handle it if replace is interrupted by balance
   instead of scrub?
 
 Based on my test, the replace ioctl would return -ENOENT if balance is
 running
 
 ERROR: ioctl(DEV_REPLACE_START) failed on /mnt/testarea/scratch: No such 
 file or directory, no error
 
 (I haven't gone through this codepath yet and don't know where -ENOENT
 comes from, but I don't think it's a proper errno,
 /mnt/testarea/scratch is definitely there)
 
   sorry if I missed something.
 
 Anand
 
 Thanks for the review!
 
 Eryu
 
 
 -	return 0;
 +	return ret;
 
   leave:
 	dev_replace->srcdev = NULL;
 @@ -538,7 +544,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 	btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
 	mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
 
 -	return 0;
 +	return scrub_ret;
 	}
 
 	printk_in_rcu(KERN_INFO
 diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
 index 2f47824..611e1c5 100644
 --- a/include/uapi/linux/btrfs.h
 +++