Re: Corrupt btrfs filesystem recovery... What best instructions?

2013-09-30 Thread Duncan
Martin posted on Sun, 29 Sep 2013 22:55:43 +0100 as excerpted:

 On 29/09/13 22:29, Martin wrote:
 
 Looking up what's available for Gentoo, the maintainers there look to
 be nicely sharp with multiple versions available all the way up to
 kernel 3.11.2...

Cool, another gentooer! =:^)

 That is being pulled in now as expected:
 
 sys-kernel/gentoo-sources-3.11.2

FWIW, I've been doing my own kernels (mainline) since back on mandrake a 
decade ago, and I just changed up my scripts a bit when I switched to 
gentoo.  Then later on I changed them up a bit more to be able to run git 
kernels.  These days I normally first try (and switch to, if no serious 
bugs) the dev kernel around -rc2 or so, by which point I figure 
anything likely to eat my system for breakfast should be either worked 
thru, or at least news of it available.  As a non-dev, it's very cool 
being able to spot and report bugs, possibly bisecting to a specific 
commit, and watching them get fixed before general kernel release.  Just 
one way I as a non-dev can contribute back.  =:^)

To take care of packages that depend on a kernel package, I used to have 
a kernel (gentoo-sources-2.6.x or some such, back then; now of course 
it'd be 3.x) in package.provided, but these days I don't even need 
that. =:^)

 There's also the latest available from btrfs tools with
 sys-fs/btrfs-progs ...
 
 Oddly, that caused emerge to report:
 
 [ebuild UD ] sys-fs/btrfs-progs-0.19.11 [0.20_rc1_p358] 0 kB
 
 which is a *downgrade*. Hence, I'm keeping with the 0.20_rc1_p358.

btrfs-progs-9999 is available, but as a live package, it's masked in 
keeping with Gentoo policy.  So to get it you'd need to unmask it.
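
That's typically something like the following, assuming the usual 
live-ebuild keyword masking (the exact file layout varies by setup):

  # /etc/portage/package.accept_keywords
  sys-fs/btrfs-progs **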

But 0.20_rc1_p358 shouldn't be /too/ far back.  In fact, I'm guessing the 
p-number is the (serialized) patch sequence number indicating the number 
of commits forward from the rc1 tag.  And on the (locally unmasked) -9999 
version here, git describe --tags gives me

v0.20-rc1-358-g194aa4a

... so 0.20_rc1_p358 is very likely identical to the live version at this 
point, and it makes no difference, except that the non-live version is a 
stable snapshot instead of a version that might change every time you 
merge it, if upstream has done any further commits.

So btrfs-progs-0.20_rc1_p358 should be fine.  And you were updating the 
kernel to 3.11.2, so that should be fine by the time you read this, as 
well. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



[PATCH] Btrfs: fix crash of compressed writes

2013-09-30 Thread Liu Bo
The crash[1] was found by xfstests/generic/208 with -o compress; it's
not reproduced every time, but it does panic.

The bug is quite interesting: it's actually introduced by a recent commit
(573aecafca1cf7a974231b759197a1aebcf39c2a,
Btrfs: actually limit the size of delalloc range).

Btrfs implements delayed allocation, so during writeback, we
(1) get a page A and lock it
(2) search the state tree for delalloc bytes and lock all pages within the range
(3) process the delalloc range, including finding disk space, creating the
ordered extent, and so on
(4) submit the page A.

It runs well in normal cases, but if we're in a racy case, e.g.
buffered compressed writes and aio-dio writes,
sometimes we may fail to lock all pages in the 'delalloc' range,
in which case we need to fall back to searching the state tree again with
a smaller range limit (max_bytes = PAGE_CACHE_SIZE - offset).

The mentioned commit has a side effect: in the fallback case,
we can find delalloc bytes before the index of the page we already have locked,
so we end up in the (delalloc_end <= *start) case and return with (found > 0).

This ends with not locking the delalloc pages but having ->writepage still
process them, and the crash happens.

This fixes it by not enforcing the 'max_bytes' limit in the special fallback
case.
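
In rough pseudo-C, the path looks something like this (a simplified
sketch of find_lock_delalloc_range(), not the literal extent_io.c code):

	/* sketch -- locking details and loop bookkeeping elided */
	found = find_delalloc_range(tree, &delalloc_start, &delalloc_end,
				    max_bytes, &cached_state);
	ret = lock_delalloc_pages(inode, locked_page,
				  delalloc_start, delalloc_end);
	if (ret == -EAGAIN) {
		/* someone raced us for a page: retry with a small window */
		max_bytes = PAGE_CACHE_SIZE - offset;
		goto again;
	}
	/*
	 * bug: on the retry, the clamped max_bytes can make
	 * find_delalloc_range() return a range ending before *start,
	 * so found > 0 but no delalloc pages actually get locked
	 */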

[1]:
[ cut here ]
kernel BUG at mm/page-writeback.c:2170!
[...]
CPU: 2 PID: 11755 Comm: btrfs-delalloc- Tainted: G   O 3.11.0+ #8
[...]
RIP: 0010:[810f5093]  [810f5093] 
clear_page_dirty_for_io+0x1e/0x83
[...]
[ 4934.248731] Stack:
[ 4934.248731]  8801477e5dc8 ea00049b9f00 8801869f9ce8 
a02b841a
[ 4934.248731]    0fff 
0620
[ 4934.248731]  88018db59c78 ea0005da8d40 a02ff860 
0001810016c0
[ 4934.248731] Call Trace:
[ 4934.248731]  [a02b841a] extent_range_clear_dirty_for_io+0xcf/0xf5 
[btrfs]
[ 4934.248731]  [a02a8889] compress_file_range+0x1dc/0x4cb [btrfs]
[ 4934.248731]  [8104f7af] ? detach_if_pending+0x22/0x4b
[ 4934.248731]  [a02a8bad] async_cow_start+0x35/0x53 [btrfs]
[ 4934.248731]  [a02c694b] worker_loop+0x14b/0x48c [btrfs]
[ 4934.248731]  [a02c6800] ? btrfs_queue_worker+0x25c/0x25c [btrfs]
[ 4934.248731]  [810608f5] kthread+0x8d/0x95
[ 4934.248731]  [81060868] ? kthread_freezable_should_stop+0x43/0x43
[ 4934.248731]  [814fe09c] ret_from_fork+0x7c/0xb0
[ 4934.248731]  [81060868] ? kthread_freezable_should_stop+0x43/0x43
[ 4934.248731] Code: ff 85 c0 0f 94 c0 0f b6 c0 59 5b 5d c3 0f 1f 44 00 00 55 
48 89 e5 41 54 53 48 89 fb e8 2c de 00 00 49 89 c4 48 8b 03 a8 01 75 02 0f 0b 
4d 85 e4 74 52 49 8b 84 24 80 00 00 00 f6 40 20 01 75 44
[ 4934.248731] RIP  [810f5093] clear_page_dirty_for_io+0x1e/0x83
[ 4934.248731]  RSP 8801869f9c48
[ 4934.280307] ---[ end trace 36f06d3f8750236a ]---

Signed-off-by: Liu Bo bo.li@oracle.com
---
 fs/btrfs/extent_io.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c09a40d..76a977e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1483,7 +1483,12 @@ static noinline u64 find_delalloc_range(struct extent_io_tree *tree,
 		node = rb_next(node);
 		total_bytes += state->end - state->start + 1;
 		if (total_bytes >= max_bytes) {
-			*end = *start + max_bytes - 1;
+			/*
+			 * tell if we fall back from the failure of
+			 * lock_delalloc_pages()
+			 */
+			if (max_bytes > PAGE_CACHE_SIZE)
+				*end = *start + max_bytes - 1;
 			break;
 		}
 		if (!node)
-- 
1.8.1.4



Re: Progress of device deletion?

2013-09-30 Thread Chris Murphy

On Sep 29, 2013, at 1:13 AM, Fredrik Tolf fred...@dolda2000.com wrote:
 
 Is there any way I can find out what's going on?

For whatever reason, it started out with every drive practically full, in terms 
of chunk allocation.

e.g. devid 5 size 2.73TB used 2.71TB path /dev/sdh1

I don't know if the code works this way, but it needs to do a balance (or 
partial one) to make room before it can start migrating actual complete chunks 
from 4 disks to 2 disks.  And my guess is that it still isn't done balancing 
sdg1.

Post kernel version, btrfs-progs version, and dmesg from the time the delete 
command was initiated.
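
For example:

  uname -r                # kernel version
  btrfs --version         # btrfs-progs version
  dmesg | grep -i btrfs   # kernel messages around the delete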


Chris Murphy


Re: [PATCH] Btrfs: don't delete ordered roots from list during cleanup

2013-09-30 Thread Josef Bacik
On Fri, Sep 27, 2013 at 04:18:14PM -0700, Zach Brown wrote:
 On Fri, Sep 27, 2013 at 04:37:46PM -0400, Josef Bacik wrote:
  During transaction cleanup after an abort we are just removing roots from 
  the
  ordered roots list which is incorrect.  We have a BUG_ON() to make sure 
  that the
  root is still part of the ordered roots list when we put our ordered extent
  which we were tripping in this case.  So do like we do everywhere else and 
  just
  move it to the tail of the ordered roots list and allow the normal cleanup 
  to
  take care of stuff.  Thanks,
  
  Signed-off-by: Josef Bacik jba...@fusionio.com
  ---
   fs/btrfs/disk-io.c | 3 ++-
   1 file changed, 2 insertions(+), 1 deletion(-)
  
  diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
  index f38211f..872b4ce 100644
  --- a/fs/btrfs/disk-io.c
  +++ b/fs/btrfs/disk-io.c
  @@ -3835,7 +3835,8 @@ static void btrfs_destroy_all_ordered_extents(struct btrfs_fs_info *fs_info)
  	while (!list_empty(&splice)) {
  		root = list_first_entry(&splice, struct btrfs_root,
  					ordered_root);
  -		list_del_init(&root->ordered_root);
  +		list_move_tail(&root->ordered_root,
  +			       &fs_info->ordered_roots);
 
 This function basically only does:
 
   lock
   list_for_each
   lock
   list_for_each
   set_bit
 
 Could we instead add a bit to the root or trans or fs_info or anything
 else that could be trivially set in _destroy_all_ordered_extents and
 tested in _finish_ordered_io()?  It'd remove a bunch of tedious locking
 and iteration here.
 
 The similar metaphor in the core page cache is (address_space->flags |
 AS_EIO).
 
 Would that be too coarse or racy?

So I _think_ we may need to truncate the ordered range in the inode as well, but
I haven't had a consistent reproducer for this case.  I want to leave it like
this for now until I'm sure we don't need the truncate and then we could
probably just replace this with a test for FS_ERROR in finish_ordered_io.
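
Roughly something like this, assuming the existing fs_state error bit is
what ends up set on abort (a sketch only, not a tested patch):

	/* in btrfs_finish_ordered_io(), sketch of the eventual check */
	if (test_bit(BTRFS_FS_STATE_ERROR, &root->fs_info->fs_state)) {
		ret = -EIO;	/* fail the ordered extent instead of
				 * walking the ordered-root lists */
		goto out;
	}
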
Thanks,

Josef


[PATCH] Btrfs: fix two use-after-free bugs with transaction cleanup

2013-09-30 Thread Josef Bacik
I was noticing the slab redzone stuff going off every once in a while during
transaction aborts.  This was caused by two things:

1) We would walk the pending snapshots and set their error to -ECANCELED.  We
don't need to do this; the snapshot stuff waits for a transaction commit, and if
there is a problem we just free our pending snapshot object and exit.  Doing
this was causing us to touch the pending snapshot object after the thing had
already been freed.

2) We were freeing the transaction manually with wanton disregard for its
use_count reference counter.  To fix this I cleaned up the transaction freeing
loop to either wait for the transaction commit to finish if it was in the middle
of that (since it will be cleaned and freed up there) or to do the cleanup
ourselves.

I also moved the global kill-all-things-dirty-everywhere stuff outside of the
transaction cleanup loop since that only needs to be done once.  With this patch
I'm no longer seeing slab corruption because of use-after-frees.  Thanks,

Signed-off-by: Josef Bacik jba...@fusionio.com
---
 fs/btrfs/disk-io.c | 111 ++---
 fs/btrfs/transaction.c |  22 +-
 fs/btrfs/transaction.h |   1 +
 3 files changed, 52 insertions(+), 82 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 872b4ce..8aa93a9 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -64,7 +64,6 @@ static void btrfs_destroy_ordered_operations(struct btrfs_transaction *t,
 static void btrfs_destroy_ordered_extents(struct btrfs_root *root);
 static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
  struct btrfs_root *root);
-static void btrfs_evict_pending_snapshots(struct btrfs_transaction *t);
 static void btrfs_destroy_delalloc_inodes(struct btrfs_root *root);
 static int btrfs_destroy_marked_extents(struct btrfs_root *root,
struct extent_io_tree *dirty_pages,
@@ -3914,24 +3913,6 @@ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
return ret;
 }
 
-static void btrfs_evict_pending_snapshots(struct btrfs_transaction *t)
-{
-	struct btrfs_pending_snapshot *snapshot;
-	struct list_head splice;
-
-	INIT_LIST_HEAD(&splice);
-
-	list_splice_init(&t->pending_snapshots, &splice);
-
-	while (!list_empty(&splice)) {
-		snapshot = list_entry(splice.next,
-				      struct btrfs_pending_snapshot,
-				      list);
-		snapshot->error = -ECANCELED;
-		list_del_init(&snapshot->list);
-	}
-}
-
 static void btrfs_destroy_delalloc_inodes(struct btrfs_root *root)
 {
 	struct btrfs_inode *btrfs_inode;
@@ -4061,6 +4042,8 @@ again:
 void btrfs_cleanup_one_transaction(struct btrfs_transaction *cur_trans,
 				   struct btrfs_root *root)
 {
+	btrfs_destroy_ordered_operations(cur_trans, root);
+
 	btrfs_destroy_delayed_refs(cur_trans, root);
 	btrfs_block_rsv_release(root, &root->fs_info->trans_block_rsv,
 				cur_trans->dirty_pages.dirty_bytes);
@@ -4068,8 +4051,6 @@ void btrfs_cleanup_one_transaction(struct btrfs_transaction *cur_trans,
 	cur_trans->state = TRANS_STATE_COMMIT_START;
 	wake_up(&root->fs_info->transaction_blocked_wait);
 
-	btrfs_evict_pending_snapshots(cur_trans);
-
 	cur_trans->state = TRANS_STATE_UNBLOCKED;
 	wake_up(&root->fs_info->transaction_wait);
 
@@ -4093,63 +4074,51 @@ void btrfs_cleanup_one_transaction(struct btrfs_transaction *cur_trans,
 static int btrfs_cleanup_transaction(struct btrfs_root *root)
 {
 	struct btrfs_transaction *t;
-	LIST_HEAD(list);
 
 	mutex_lock(&root->fs_info->transaction_kthread_mutex);
 
 	spin_lock(&root->fs_info->trans_lock);
-	list_splice_init(&root->fs_info->trans_list, &list);
-	root->fs_info->running_transaction = NULL;
-	spin_unlock(&root->fs_info->trans_lock);
-
-	while (!list_empty(&list)) {
-		t = list_entry(list.next, struct btrfs_transaction, list);
-
-		btrfs_destroy_ordered_operations(t, root);
-
-		btrfs_destroy_all_ordered_extents(root->fs_info);
-
-		btrfs_destroy_delayed_refs(t, root);
-
-		/*
-		 *  FIXME: cleanup wait for commit
-		 *  We needn't acquire the lock here, because we are during
-		 *  the umount, there is no other task which will change it.
-		 */
-		t->state = TRANS_STATE_COMMIT_START;
-		smp_mb();
-		if (waitqueue_active(&root->fs_info->transaction_blocked_wait))
-			wake_up(&root->fs_info->transaction_blocked_wait);
-
-		btrfs_evict_pending_snapshots(t);
-
-		t->state = TRANS_STATE_UNBLOCKED;
-		smp_mb();
-		if 

Re: [PATCH] Btrfs: fix crash of compressed writes

2013-09-30 Thread Josef Bacik
On Mon, Sep 30, 2013 at 08:39:57PM +0800, Liu Bo wrote:
 The crash[1] is found by xfstests/generic/208 with -o compress,
 it's not reproduced everytime, but it does panic.
 
 The bug is quite interesting, it's actually introduced by a recent commit
 (573aecafca1cf7a974231b759197a1aebcf39c2a,
 Btrfs: actually limit the size of delalloc range).
 
 Btrfs implements delay allocation, so during writeback, we
 (1) get a page A and lock it
 (2) search the state tree for delalloc bytes and lock all pages within the 
 range
 (3) process the delalloc range, including find disk space and create
 ordered extent and so on.
 (4) submit the page A.
 
 It runs well in normal cases, but if we're in a racy case, eg.
 buffered compressed writes and aio-dio writes,
 sometimes we may fail to lock all pages in the 'delalloc' range,
 in which case, we need to fall back to search the state tree again with
 a smaller range limit(max_bytes = PAGE_CACHE_SIZE - offset).
 
 The mentioned commit has a side effect, that is, in the fallback case,
 we can find delalloc bytes before the index of the page we already have 
 locked,
 so we're in the case of (delalloc_end <= *start) and return with (found > 0).
 
 This ends with not locking delalloc pages but making ->writepage still
 process them, and the crash happens.
 
 This fixes it by not enforcing the 'max_bytes' limit in the special fallback
 case.
 

Great analysis, thank you for that; however, I don't like the fix ;).  Instead
we need to change the

if (!found || delalloc_end <= *start)

to always return 0, since we should not call fill_delalloc if the delalloc range
doesn't start at our current offset.  Secondly, the

max_bytes = PAGE_CACHE_SIZE - offset;

is completely crap since we are talking about the entire page/sector.  We should
simply set max_bytes to sectorsize or PAGE_CACHE_SIZE; that way we avoid this
issue.
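
Something like this (an untested sketch against find_lock_delalloc_range()
in fs/btrfs/extent_io.c, not a submitted patch):

	found = find_delalloc_range(tree, &delalloc_start, &delalloc_end,
				    max_bytes, &cached_state);
	if (!found || delalloc_end <= *start) {
		*start = delalloc_start;
		*end = delalloc_end;
		free_extent_state(cached_state);
		return 0;	/* instead of "return found;" */
	}
	...
	/* and in the lock_delalloc_pages() -EAGAIN fallback: */
	max_bytes = PAGE_CACHE_SIZE;	/* instead of PAGE_CACHE_SIZE - offset */

Thanks,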

Josef


Re: [PATCH] Btrfs: don't delete ordered roots from list during cleanup

2013-09-30 Thread Zach Brown
 So I _think_ we may need to truncate the ordered range in the inode as well, 
 but
 I haven't had a consistent reproducer for this case.  I want to leave it like
 this for now until I'm sure we don't need the truncate and then we could
 probably just replace this with a test for FS_ERROR in finish_ordered_io.

*nod*, added to the list :)

- z


[PATCH] Btrfs: fix transid verify errors when recovering log tree

2013-09-30 Thread Josef Bacik
If we crash with a log, remount and recover that log, and then crash before we
can commit another transaction, we will get transid verify errors on the next
mount.  This is because we were not zeroing out the log when we committed the
transaction after recovery.  This is ok as long as we commit another transaction
at some point in the future, but if you abort or something else goes wrong, you
can end up in this weird state because the recovery code expects the tree log to
have a generation one greater than the super's generation, which won't be the
case for the transaction that was started for recovery.  Fix this by removing
the check and _always_ zeroing out the log portion of the super when we commit a
transaction.  This fixes the transid verify issues I was seeing with my
force-errors tests.  Thanks,

Signed-off-by: Josef Bacik jba...@fusionio.com
---
 fs/btrfs/transaction.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index b903192..134039f 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1845,11 +1845,8 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 	assert_qgroups_uptodate(trans);
 	update_super_roots(root);
 
-	if (!root->fs_info->log_root_recovering) {
-		btrfs_set_super_log_root(root->fs_info->super_copy, 0);
-		btrfs_set_super_log_root_level(root->fs_info->super_copy, 0);
-	}
-
+	btrfs_set_super_log_root(root->fs_info->super_copy, 0);
+	btrfs_set_super_log_root_level(root->fs_info->super_copy, 0);
 	memcpy(root->fs_info->super_for_commit, root->fs_info->super_copy,
 	       sizeof(*root->fs_info->super_copy));
 
-- 
1.8.3.1



Re: Questions regarding logging upon fsync in btrfs

2013-09-30 Thread Aastha Mehta
On 29 September 2013 15:12, Josef Bacik jba...@fusionio.com wrote:
 On Sun, Sep 29, 2013 at 11:22:36AM +0200, Aastha Mehta wrote:
 Thank you very much for the reply. That clarifies a lot of things.

 I was trying a small test case that opens a file, writes a block of
 data, calls fsync and then closes the file. If I understand correctly,
 fsync would return only after all in-memory buffers have been
 committed to disk. I have added few print statements in the
 __extent_writepage function, and I notice that the function gets
 called a bit later after fsync returns. It seems that I am not
 guaranteed to see the data going to disk by the time fsync returns.

 Am I doing something wrong, or am I looking at the wrong place for
 disk write? This happens both with tree logging enabled as well as
 with notreelog.


 So 3.1 was a long time ago and to be sure it had issues I don't think it was
 _that_ broken.  You are probably better off instrumenting a recent kernel, 
 3.11
 or just build btrfs-next from git.  But if I were to make a guess I'd say that
 __extent_writepage was how both data and metadata was written out at the time 
 (I
 don't think I changed it until 3.2 or something later) so what you are likely
 seeing is the normal transaction commit after the fsync.  In the case of
 notreelog we are likely starting another transaction and you are seeing that
 commit (at the time the transaction kthread would start a transaction even if
 none had been started yet.)  Thanks,

 Josef

Is there any special handling for very small file writes, less than 4K?  As
I understand it, there is an optimization to inline the first extent in a file
if it is smaller than 4K; does it affect the writeback on fsync as well?  I did
set the max_inline mount option to 0, but even then there seems to be some
difference in fsync behaviour between writing a first extent of less than 4K
and writing 4K or more.
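
(For reference, max_inline is set at mount time, along the lines of

  mount -o max_inline=0 /dev/sdX /mnt

with a placeholder device and mount point.)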

Thanks,
Aastha.


-- 
Aastha Mehta
MPI-SWS, Germany
E-mail: aasth...@mpi-sws.org


Re: Questions regarding logging upon fsync in btrfs

2013-09-30 Thread Josef Bacik
On Mon, Sep 30, 2013 at 09:32:54PM +0200, Aastha Mehta wrote:
 On 29 September 2013 15:12, Josef Bacik jba...@fusionio.com wrote:
  On Sun, Sep 29, 2013 at 11:22:36AM +0200, Aastha Mehta wrote:
  Thank you very much for the reply. That clarifies a lot of things.
 
  I was trying a small test case that opens a file, writes a block of
  data, calls fsync and then closes the file. If I understand correctly,
  fsync would return only after all in-memory buffers have been
  committed to disk. I have added few print statements in the
  __extent_writepage function, and I notice that the function gets
  called a bit later after fsync returns. It seems that I am not
  guaranteed to see the data going to disk by the time fsync returns.
 
  Am I doing something wrong, or am I looking at the wrong place for
  disk write? This happens both with tree logging enabled as well as
  with notreelog.
 
 
  So 3.1 was a long time ago and to be sure it had issues I don't think it was
  _that_ broken.  You are probably better off instrumenting a recent kernel, 
  3.11
  or just build btrfs-next from git.  But if I were to make a guess I'd say 
  that
  __extent_writepage was how both data and metadata was written out at the 
  time (I
  don't think I changed it until 3.2 or something later) so what you are 
  likely
  seeing is the normal transaction commit after the fsync.  In the case of
  notreelog we are likely starting another transaction and you are seeing that
  commit (at the time the transaction kthread would start a transaction even 
  if
  none had been started yet.)  Thanks,
 
  Josef
 
 Is there any special handling for very small file write, less than 4K? As
 I understand there is an optimization to inline the first extent in a file if
 it is smaller than 4K, does it affect the writeback on fsync as well? I did
 set the max_inline mount option to 0, but even then it seems there is
 some difference in fsync behaviour for writing first extent of less than 4K
 size and writing 4K or more.
 

Yeah, if the file is an inline extent then it will be copied into the log
directly and the log will be written out, without going through the data write
path at all.  max_inline == 0 should make it so we don't inline, so if it isn't
honoring that then that may be a bug.  Thanks,

Josef


[PATCH] Btrfs-progs: make the repair option a global static var

2013-09-30 Thread Josef Bacik
It's just annoying to have to pass it around everywhere.  Thanks,

Signed-off-by: Josef Bacik jba...@fusionio.com
---
 cmds-check.c | 23 +++++++++++------------
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index f05c73e..6da35ea 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -49,6 +49,7 @@ static u64 data_bytes_allocated = 0;
 static u64 data_bytes_referenced = 0;
 static int found_old_backref = 0;
 static LIST_HEAD(duplicate_extents);
+static int repair = 0;
 
 struct extent_backref {
struct list_head list;
@@ -1404,7 +1405,7 @@ out:
 }
 
 static int check_inode_recs(struct btrfs_root *root,
-   struct cache_tree *inode_cache, int repair)
+   struct cache_tree *inode_cache)
 {
struct cache_extent *cache;
struct ptr_node *node;
@@ -1809,7 +1810,7 @@ static int process_root_ref(struct extent_buffer *eb, int slot,
 
 static int check_fs_root(struct btrfs_root *root,
 struct cache_tree *root_cache,
-struct walk_control *wc, int repair)
+struct walk_control *wc)
 {
int ret = 0;
int wret;
@@ -1879,7 +1880,7 @@ static int check_fs_root(struct btrfs_root *root,
root_node.current);
}
 
-	ret = check_inode_recs(root, &root_node.inode_cache, repair);
+	ret = check_inode_recs(root, &root_node.inode_cache);
return ret;
 }
 
@@ -1895,8 +1896,7 @@ static int fs_root_objectid(u64 objectid)
 }
 
 static int check_fs_roots(struct btrfs_root *root,
- struct cache_tree *root_cache,
- int repair)
+ struct cache_tree *root_cache)
 {
struct btrfs_path path;
struct btrfs_key key;
@@ -1939,7 +1939,7 @@ static int check_fs_roots(struct btrfs_root *root,
err = 1;
goto next;
}
-	ret = check_fs_root(tmp_root, root_cache, &wc, repair);
+	ret = check_fs_root(tmp_root, root_cache, &wc);
if (ret)
err = 1;
btrfs_free_fs_root(tmp_root);
@@ -5009,7 +5009,7 @@ static void reset_cached_block_groups(struct btrfs_fs_info *fs_info)
 
 static int check_extent_refs(struct btrfs_trans_handle *trans,
 struct btrfs_root *root,
-struct cache_tree *extent_cache, int repair)
+struct cache_tree *extent_cache)
 {
struct extent_record *rec;
struct cache_extent *cache;
@@ -5403,7 +5403,7 @@ static int check_devices(struct rb_root *dev_cache,
return ret;
 }
 
-static int check_chunks_and_extents(struct btrfs_root *root, int repair)
+static int check_chunks_and_extents(struct btrfs_root *root)
 {
struct rb_root dev_cache;
struct cache_tree chunk_cache;
@@ -5510,7 +5510,7 @@ again:
break;
}
 
-	ret = check_extent_refs(trans, root, &extent_cache, repair);
+	ret = check_extent_refs(trans, root, &extent_cache);
if (ret == -EAGAIN) {
ret = btrfs_commit_transaction(trans, root);
if (ret)
@@ -5982,7 +5982,6 @@ int cmd_check(int argc, char **argv)
char uuidbuf[37];
int ret;
int num;
-   int repair = 0;
int option_index = 0;
int init_csum_tree = 0;
int init_extent_tree = 0;
@@ -6084,7 +6083,7 @@ int cmd_check(int argc, char **argv)
exit(1);
goto out;
}
-   ret = check_chunks_and_extents(root, repair);
+   ret = check_chunks_and_extents(root);
if (ret)
 		fprintf(stderr, "Errors found in extent allocation tree or "
 			"chunk allocation\n");
 
@@ -6094,7 +6093,7 @@ int cmd_check(int argc, char **argv)
goto out;
 
 	fprintf(stderr, "checking fs roots\n");
-	ret = check_fs_roots(root, &root_cache, repair);
+	ret = check_fs_roots(root, &root_cache);
if (ret)
goto out;
 
-- 
1.8.3.1



Re: Questions regarding logging upon fsync in btrfs

2013-09-30 Thread Aastha Mehta
On 30 September 2013 22:11, Josef Bacik jba...@fusionio.com wrote:
 On Mon, Sep 30, 2013 at 09:32:54PM +0200, Aastha Mehta wrote:
 On 29 September 2013 15:12, Josef Bacik jba...@fusionio.com wrote:
  On Sun, Sep 29, 2013 at 11:22:36AM +0200, Aastha Mehta wrote:
  Thank you very much for the reply. That clarifies a lot of things.
 
  I was trying a small test case that opens a file, writes a block of
  data, calls fsync and then closes the file. If I understand correctly,
  fsync would return only after all in-memory buffers have been
  committed to disk. I have added few print statements in the
  __extent_writepage function, and I notice that the function gets
  called a bit later after fsync returns. It seems that I am not
  guaranteed to see the data going to disk by the time fsync returns.
 
  Am I doing something wrong, or am I looking at the wrong place for
  disk write? This happens both with tree logging enabled as well as
  with notreelog.
 
 
  So 3.1 was a long time ago and to be sure it had issues I don't think it 
  was
  _that_ broken.  You are probably better off instrumenting a recent kernel, 
  3.11
  or just build btrfs-next from git.  But if I were to make a guess I'd say 
  that
  __extent_writepage was how both data and metadata was written out at the 
  time (I
  don't think I changed it until 3.2 or something later) so what you are 
  likely
  seeing is the normal transaction commit after the fsync.  In the case of
  notreelog we are likely starting another transaction and you are seeing 
  that
  commit (at the time the transaction kthread would start a transaction even 
  if
  none had been started yet.)  Thanks,
 
  Josef

 Is there any special handling for very small file write, less than 4K? As
 I understand there is an optimization to inline the first extent in a file if
 it is smaller than 4K, does it affect the writeback on fsync as well? I did
 set the max_inline mount option to 0, but even then it seems there is
 some difference in fsync behaviour for writing first extent of less than 4K
 size and writing 4K or more.


 Yeah if the file is an inline extent then it will be copied into the log
 directly and the log will be written out, no going through the data write path
 at all.  Max inline == 0 should make it so we don't inline, so if it isn't
 honoring that then that may be a bug.  Thanks,

 Josef

I tried it on the 3.12-rc2 release, and it seems there is a bug then.
Please find the attached logs to confirm.
It's probably present on the older release as well.

Thanks,
Aastha.

-- 
Aastha Mehta
MPI-SWS, Germany
E-mail: aasth...@mpi-sws.org
[ 2808.125838] [do_writepages] inum: 260, nr_to_write: 9223372036854775807, 
writepages fn: 8129ce40
[ 2808.125977] CPU: 0 PID: 10215 Comm: bb Not tainted 3.12.0-rc2-latest+ #4
[ 2808.125980] Hardware name: innotek GmbH VirtualBox, BIOS VirtualBox 
12/01/2006
[ 2808.125983]  88003d6b5e60 88003d6b5e28 817a8a6f 
88003fc0eb48
[ 2808.125987]  88002e325b48 88003d6b5e48 81105c1f 
88002e325a00
[ 2808.125990]  880037b1f300 88003d6b5e88 810fb689 
88003fc12c00
[ 2808.125993] Call Trace:
[ 2808.126003]  [817a8a6f] dump_stack+0x46/0x58
[ 2808.126010]  [81105c1f] do_writepages+0x4f/0x90
[ 2808.126013]  [810fb689] __filemap_fdatawrite_range+0x49/0x50
[ 2808.126013]  [810fc41e] filemap_fdatawrite_range+0xe/0x10
[ 2808.126013]  [812aafdc] btrfs_sync_file+0xac/0x380
[ 2808.126098]  [817aedd2] ? __schedule+0x692/0x730
[ 2808.126115]  [811752d8] do_fsync+0x58/0x90
[ 2808.126149]  [81148a6d] ? SyS_write+0x4d/0xa0
[ 2808.126165]  [8117565b] SyS_fsync+0xb/0x10
[ 2808.126182]  [817b83a2] system_call_fastpath+0x16/0x1b
[ 2808.127447] CPU: 0 PID: 10215 Comm: bb Not tainted 3.12.0-rc2-latest+ #4
[ 2808.127450] Hardware name: innotek GmbH VirtualBox, BIOS VirtualBox 
12/01/2006
[ 2808.127452]  ea802440 88003d6b5ba8 817a8a6f 
88003d6b5de0
[ 2808.127456]  88002e325870 88003d6b5cb8 812b7c9f 
88003e2083d0
[ 2808.127459]   88003d6b5c48 817aedd2 
88003c3707b0
[ 2808.127462] Call Trace:
[ 2808.127467]  [817a8a6f] dump_stack+0x46/0x58
[ 2808.127473]  [812b7c9f] __extent_writepage+0x6f/0x750
[ 2808.127479]  [817aedd2] ? __schedule+0x692/0x730
[ 2808.127483]  [810fac71] ? find_get_pages_tag+0x121/0x160
[ 2808.127487]  [812b8552] 
extent_write_cache_pages.isra.24.constprop.31+0x1d2/0x380
[ 2808.127492]  [81005877] ? show_trace_log_lvl+0x57/0x70
[ 2808.127496]  [812b89d9] extent_writepages+0x49/0x60
[ 2808.127501]  [8129f160] ? btrfs_submit_direct+0x6c0/0x6c0
[ 2808.127505]  [8129ce63] btrfs_writepages+0x23/0x30
[ 2808.127510]  [81105c3e] do_writepages+0x6e/0x90
[ 2808.127513]  [810fb689] __filemap_fdatawrite_range+0x49/0x50
[ 2808.127517]  [810fc41e] filemap_fdatawrite_range+0xe/0x10

Re: Questions regarding logging upon fsync in btrfs

2013-09-30 Thread Josef Bacik
On Mon, Sep 30, 2013 at 10:30:59PM +0200, Aastha Mehta wrote:
 On 30 September 2013 22:11, Josef Bacik jba...@fusionio.com wrote:
  On Mon, Sep 30, 2013 at 09:32:54PM +0200, Aastha Mehta wrote:
  On 29 September 2013 15:12, Josef Bacik jba...@fusionio.com wrote:
   On Sun, Sep 29, 2013 at 11:22:36AM +0200, Aastha Mehta wrote:
   Thank you very much for the reply. That clarifies a lot of things.
  
   I was trying a small test case that opens a file, writes a block of
   data, calls fsync and then closes the file. If I understand correctly,
   fsync would return only after all in-memory buffers have been
   committed to disk. I have added few print statements in the
   __extent_writepage function, and I notice that the function gets
   called a bit later after fsync returns. It seems that I am not
   guaranteed to see the data going to disk by the time fsync returns.
  
   Am I doing something wrong, or am I looking at the wrong place for
   disk write? This happens both with tree logging enabled as well as
   with notreelog.
  
  
   So 3.1 was a long time ago and to be sure it had issues I don't think it 
   was
   _that_ broken.  You are probably better off instrumenting a recent 
   kernel, 3.11
   or just build btrfs-next from git.  But if I were to make a guess I'd 
   say that
   __extent_writepage was how both data and metadata was written out at the 
   time (I
   don't think I changed it until 3.2 or something later) so what you are 
   likely
   seeing is the normal transaction commit after the fsync.  In the case of
   notreelog we are likely starting another transaction and you are seeing 
   that
   commit (at the time the transaction kthread would start a transaction 
   even if
   none had been started yet.)  Thanks,
  
   Josef
 
  Is there any special handling for very small file write, less than 4K? As
  I understand there is an optimization to inline the first extent in a file 
  if
  it is smaller than 4K, does it affect the writeback on fsync as well? I did
  set the max_inline mount option to 0, but even then it seems there is
  some difference in fsync behaviour for writing first extent of less than 4K
  size and writing 4K or more.
 
 
  Yeah if the file is an inline extent then it will be copied into the log
  directly and the log will be written out, no going through the data write 
  path
  at all.  Max inline == 0 should make it so we don't inline, so if it isn't
  honoring that then that may be a bug.  Thanks,
 
  Josef
 
 I tried it on 3.12-rc2 release, and it seems there is a bug then.
 Please find attached logs to confirm.
 Also, probably on the older release.
 

Oooh ok I understand, you have your printk's in the wrong place ;).
do_writepages doesn't necessarily mean you are writing something.  If you want
to see if stuff got written to the disk, I'd put a printk in run_delalloc_range
and have it spit out the range it is writing out, since that's what we think is
actually dirty.
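
E.g. a minimal sketch, against a 3.12-era tree (the placement and format
string are illustrative):

	/* at the top of run_delalloc_range() in fs/btrfs/inode.c */
	printk(KERN_INFO "btrfs: delalloc ino %llu range [%llu, %llu]\n",
	       (unsigned long long)btrfs_ino(inode),
	       (unsigned long long)start, (unsigned long long)end);

Thanks,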

Josef


Re: Questions regarding logging upon fsync in btrfs

2013-09-30 Thread Aastha Mehta
On 30 September 2013 22:47, Josef Bacik jba...@fusionio.com wrote:
 On Mon, Sep 30, 2013 at 10:30:59PM +0200, Aastha Mehta wrote:
 On 30 September 2013 22:11, Josef Bacik jba...@fusionio.com wrote:
  On Mon, Sep 30, 2013 at 09:32:54PM +0200, Aastha Mehta wrote:
  On 29 September 2013 15:12, Josef Bacik jba...@fusionio.com wrote:
   On Sun, Sep 29, 2013 at 11:22:36AM +0200, Aastha Mehta wrote:
   Thank you very much for the reply. That clarifies a lot of things.
  
   I was trying a small test case that opens a file, writes a block of
   data, calls fsync and then closes the file. If I understand correctly,
   fsync would return only after all in-memory buffers have been
   committed to disk. I have added few print statements in the
   __extent_writepage function, and I notice that the function gets
   called a bit later after fsync returns. It seems that I am not
   guaranteed to see the data going to disk by the time fsync returns.
  
   Am I doing something wrong, or am I looking at the wrong place for
   disk write? This happens both with tree logging enabled as well as
   with notreelog.
  
  
   So 3.1 was a long time ago and to be sure it had issues I don't think 
   it was
   _that_ broken.  You are probably better off instrumenting a recent 
   kernel, 3.11
   or just build btrfs-next from git.  But if I were to make a guess I'd 
   say that
   __extent_writepage was how both data and metadata was written out at 
   the time (I
   don't think I changed it until 3.2 or something later) so what you are 
   likely
   seeing is the normal transaction commit after the fsync.  In the case of
   notreelog we are likely starting another transaction and you are seeing 
   that
   commit (at the time the transaction kthread would start a transaction 
   even if
   none had been started yet.)  Thanks,
  
   Josef
 
  Is there any special handling for very small file write, less than 4K? As
  I understand there is an optimization to inline the first extent in a 
  file if
  it is smaller than 4K, does it affect the writeback on fsync as well? I 
  did
  set the max_inline mount option to 0, but even then it seems there is
  some difference in fsync behaviour for writing first extent of less than 
  4K
  size and writing 4K or more.
 
 
  Yeah if the file is an inline extent then it will be copied into the log
  directly and the log will be written out, no going through the data write 
  path
  at all.  Max inline == 0 should make it so we don't inline, so if it isn't
  honoring that then that may be a bug.  Thanks,
 
  Josef

 I tried it on 3.12-rc2 release, and it seems there is a bug then.
 Please find attached logs to confirm.
 Also, probably on the older release.


 Oooh ok I understand, you have your printk's in the wrong place ;).
 do_writepages doesn't necessarily mean you are writing something.  If you want
 to see if stuff got written to the disk I'd put a printk at run_delalloc_range
 and have it spit out the range it is writing out since thats what we think is
 actually dirty.  Thanks,

 Josef

No, but I also placed a dump_stack() at the beginning of
__extent_writepage.  run_delalloc_range is called only from
__extent_writepage; if it were called, wouldn't the dump_stack() at the
top of __extent_writepage have printed as well?

Thanks

-- 
Aastha Mehta
MPI-SWS, Germany
E-mail: aasth...@mpi-sws.org


Re: Questions regarding logging upon fsync in btrfs

2013-09-30 Thread Josef Bacik
On Mon, Sep 30, 2013 at 11:07:20PM +0200, Aastha Mehta wrote:
 On 30 September 2013 22:47, Josef Bacik jba...@fusionio.com wrote:
  On Mon, Sep 30, 2013 at 10:30:59PM +0200, Aastha Mehta wrote:
  On 30 September 2013 22:11, Josef Bacik jba...@fusionio.com wrote:
   On Mon, Sep 30, 2013 at 09:32:54PM +0200, Aastha Mehta wrote:
   On 29 September 2013 15:12, Josef Bacik jba...@fusionio.com wrote:
On Sun, Sep 29, 2013 at 11:22:36AM +0200, Aastha Mehta wrote:
Thank you very much for the reply. That clarifies a lot of things.
   
I was trying a small test case that opens a file, writes a block of
data, calls fsync and then closes the file. If I understand 
correctly,
fsync would return only after all in-memory buffers have been
committed to disk. I have added few print statements in the
__extent_writepage function, and I notice that the function gets
called a bit later after fsync returns. It seems that I am not
guaranteed to see the data going to disk by the time fsync returns.
   
Am I doing something wrong, or am I looking at the wrong place for
disk write? This happens both with tree logging enabled as well as
with notreelog.
   
   
So 3.1 was a long time ago and to be sure it had issues I don't think 
it was
_that_ broken.  You are probably better off instrumenting a recent 
kernel, 3.11
or just build btrfs-next from git.  But if I were to make a guess I'd 
say that
__extent_writepage was how both data and metadata was written out at 
the time (I
don't think I changed it until 3.2 or something later) so what you 
are likely
seeing is the normal transaction commit after the fsync.  In the case 
of
notreelog we are likely starting another transaction and you are 
seeing that
commit (at the time the transaction kthread would start a transaction 
even if
none had been started yet.)  Thanks,
   
Josef
  
   Is there any special handling for very small file write, less than 4K? 
   As
   I understand there is an optimization to inline the first extent in a 
   file if
   it is smaller than 4K, does it affect the writeback on fsync as well? I 
   did
   set the max_inline mount option to 0, but even then it seems there is
   some difference in fsync behaviour for writing first extent of less 
   than 4K
   size and writing 4K or more.
  
  
   Yeah if the file is an inline extent then it will be copied into the log
   directly and the log will be written out, no going through the data 
   write path
   at all.  Max inline == 0 should make it so we don't inline, so if it 
   isn't
   honoring that then that may be a bug.  Thanks,
  
   Josef
 
  I tried it on 3.12-rc2 release, and it seems there is a bug then.
  Please find attached logs to confirm.
  Also, probably on the older release.
 
 
  Oooh ok I understand, you have your printk's in the wrong place ;).
  do_writepages doesn't necessarily mean you are writing something.  If you 
  want
  to see if stuff got written to the disk I'd put a printk at 
  run_delalloc_range
  and have it spit out the range it is writing out since thats what we think 
  is
  actually dirty.  Thanks,
 
  Josef
 
 No, but I also placed dump_stack() in the beginning of
 __extent_writepage. run_delalloc_range is being called only from
 __extent_writepage, if it were to be called, the dump_stack() at the
 top of __extent_writepage would have printed as well, no?
 

Yeah, so I don't know what's going on, and I'm in the middle of something; I'll
look at it tomorrow and see if I can't figure out what is going on.  I'm sure
it's working: we have an xfstest to test this sort of thing and it's passing, so
we're definitely getting the data to disk properly.  I'm probably just missing
some piece around here somewhere.  Thanks,

Josef


Re: [PATCH] btrfs: init device stats for new devices

2013-09-30 Thread Ondřej Kunc
Hi Zach,

thank you for your answer and clarification.  I cannot just unmount and
mount that filesystem, because it is running a busy NFS server now, so I
will just try it on some testbench server.  Would mount -o remount be
sufficient (to avoid stopping the service, umounting, mounting, and
restarting the service)?

Thank you

Ondřej

2013/9/30 Zach Brown z...@redhat.com:
 I discovered one minor bug in BTRFS filesystem.

 You sure did.

 ERROR: ioctl(BTRFS_IOC_GET_DEV_STATS) on /dev/sde failed: No such device

 But this is not true ... all specified devices exist and are members
 of btrfs filesystem. In dmesg I see this:
 ...
 [973077.099118] btrfs: get dev_stats failed, not yet valid
 

 What makes device statistics valid ? I tried doing full filesystem
 scrub ... but it did not fix that issue.

 The stats are only initialized (considered valid) for devices that are
 known at mount.  You could unmount and mount after adding (or replacing)
 new devices and they'd start returning stats.

 The following (bad) patch illustrates the problem, but the code should
 be restructured so stats are reliably read as devices are added.

 - z

 From: Zach Brown z...@redhat.com
 Date: Mon, 30 Sep 2013 17:48:05 -0400
 Subject: [PATCH] btrfs: init device stats for new devices

 Device stats are only initialized (read from tree items) on mount.
 Trying to read device stats after adding or replacing new devices will
 return errors.

 This cheesy patch demonstrates the problem, but this should really be a
 natural side-effect of adding devices to the fs_devices list.  We have
 evidence that trying to do it by hand doesn't work.

 Any preferences for how to restructure this?
 ---
  fs/btrfs/dev-replace.c | 4 +++-
  fs/btrfs/volumes.c | 6 ++++++
  2 files changed, 9 insertions(+), 1 deletion(-)

 diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
 index 5d84443..7309096 100644
 --- a/fs/btrfs/dev-replace.c
 +++ b/fs/btrfs/dev-replace.c
 @@ -556,7 +556,9 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 
 	mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
 
 -	return 0;
 +	ret = btrfs_init_dev_stats(root->fs_info);
 +
 +	return ret;
  }

  static void btrfs_dev_replace_update_device_in_mapping_tree(
 diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
 index 0431147..e4ccc9b 100644
 --- a/fs/btrfs/volumes.c
 +++ b/fs/btrfs/volumes.c
 @@ -2126,6 +2126,9 @@ int btrfs_init_new_device(struct btrfs_root *root, char *device_path)
 ret = btrfs_commit_transaction(trans, root);
 }

 +   if (!ret)
 +	ret = btrfs_init_dev_stats(root->fs_info);
 +
 return ret;

  error_trans:
 @@ -6060,6 +6063,9 @@ int btrfs_init_dev_stats(struct btrfs_fs_info *fs_info)
 int item_size;
 struct btrfs_dev_stats_item *ptr;

 +	if (device->dev_stats_valid)
 +   continue;
 +
 key.objectid = 0;
 key.type = BTRFS_DEV_STATS_KEY;
 	key.offset = device->devid;
 --
 1.7.11.7



-- 
Ondřej Kunc


Re: [PATCH] btrfs: init device stats for new devices

2013-09-30 Thread Zach Brown
On Tue, Oct 01, 2013 at 12:03:05AM +0200, Ondřej Kunc wrote:
 Hi Zach,
 
 thank you for your answer and clarification. I cannot just unmount and
 mount that filesystem, because it is running busy NFS server now, so I
 will just try it on some testbench server. Can mount -o remount be
 sufficient (to prevent stopping service, umount, mount and starting
 service) ?

Sadly, no, remounting won't initialize the stats.

- z


Re: Progress of device deletion?

2013-09-30 Thread Chris Murphy

On Sep 30, 2013, at 8:27 AM, Chris Murphy li...@colorremedies.com wrote:

 
 On Sep 29, 2013, at 1:13 AM, Fredrik Tolf fred...@dolda2000.com wrote:
 
 Is there any way I can find out what's going on?
 
 For whatever reason, it started out with every drive practically full, in 
 terms of chunk allocation.
 
 e.g.devid5 size 2.73TB used 2.71TB path /dev/sdh1
 
 I don't know if the code works this way, but it needs to do a balance (or 
 partial one) to make room before it can start migrating actual complete 
 chunks from 4 disks to 2 disks. And my guess is that it's not yet done 
 balancing sdg1 still.
 
 Post kernel version, btrfs-progs version, and dmesg from the time the delete 
 command was initiated.

Without knowing more about how it's expected to behave (now and in the near 
future), I think if I were you I would have added a couple of drives to 
the volume first so that it had more maneuvering room.  It probably seems weird 
to add drives in order to remove drives, but sometimes (always?) Btrfs really 
gets a bit piggish about allocating a lot more chunks than there is data.  Or 
maybe it's not deallocating space as aggressively as it could.  So it can get 
to a point where, even though there isn't that much data in the volume (in your 
case 1.5x the drive size, and you have 4 drives), all of it is effectively 
allocated.  So backing out of that takes free space.  Then once the chunks are 
better allocated, you'd have been able to remove the drives.  It's an open 
question whether I would have removed 2, then another 2, or removed all 4.

And for all I know adding one drive might be enough even though it's raid0. 
*shrug* New territory.



Chris Murphy


Re: [PATCH] Btrfs: fix crash of compressed writes

2013-09-30 Thread Liu Bo
On Mon, Sep 30, 2013 at 01:02:49PM -0400, Josef Bacik wrote:
 On Mon, Sep 30, 2013 at 08:39:57PM +0800, Liu Bo wrote:
  The crash[1] is found by xfstests/generic/208 with -o compress,
  it's not reproduced everytime, but it does panic.
  
  The bug is quite interesting, it's actually introduced by a recent commit
  (573aecafca1cf7a974231b759197a1aebcf39c2a,
  Btrfs: actually limit the size of delalloc range).
  
  Btrfs implements delay allocation, so during writeback, we
  (1) get a page A and lock it
  (2) search the state tree for delalloc bytes and lock all pages within the 
  range
  (3) process the delalloc range, including find disk space and create
  ordered extent and so on.
  (4) submit the page A.
  
  It runs well in normal cases, but if we're in a racy case, eg.
  buffered compressed writes and aio-dio writes,
  sometimes we may fail to lock all pages in the 'delalloc' range,
  in which case, we need to fall back to search the state tree again with
  a smaller range limit(max_bytes = PAGE_CACHE_SIZE - offset).
  
  The mentioned commit has a side effect, that is, in the fallback case,
  we can find delalloc bytes before the index of the page we already have 
  locked,
  so we're in the case of (delalloc_end <= *start) and return with (found > 0).
  
  This ends with not locking delalloc pages but making ->writepage still
  process them, and the crash happens.
  
  This fixes it by not enforcing the 'max_bytes' limit in the special fallback
  case.
  
 
 Great analysis, thank you for that, however I don't like the fix ;).  Instead 
 We
 need to change the
 
 if (!found || delalloc_end <= *start)
 
 to always return 0 since we should not call fill delalloc if the delalloc 
 range
 doesn't start at our current offset.  Secondly the

Setting 'found = 0' was also the first idea that came to my mind :), but I'm
not sure it's right.

I've checked it out: commit 70b99e6959a4c28ae1b314985eca731f3db72f1d
(Btrfs: Compression corner fixes,
Signed-off-by: Chris Mason chris.ma...@oracle.com)
introduced the (delalloc_end <= *start) check.

The commit log says, "Make sure we lock pages in the correct order during
delalloc.  The search into the state tree for delalloc bytes can return
bytes before the page we already have locked."

But I think if we find bytes before the page we locked, then we
should get (found == 0).

So I don't know why we need the (delalloc_end <= *start) check;
maybe some corner cases need it?  Chris, can you explain a bit?

 
 max_bytes = PAGE_CACHE_SIZE - offset;
 
 is completely crap since we are talking about the entire page/sector.  We 
 should
 simply set max_bytes to sectorszie or PAGE_CACHE_SIZE that way we avoid this
 issue.  Thanks,

This might be due to historical reasons, but for now, it doesn't cause
problems as the passed @start is page aligned and @offset is always 0.

thanks,
-liubo


Re: Progress of device deletion?

2013-09-30 Thread Duncan
Chris Murphy posted on Mon, 30 Sep 2013 19:05:36 -0600 as excerpted:

 It probably seems weird to add drives to remove drives, but sometimes
 (always?) Btrfs really gets a bit piggish about allocating a lot more
 chunks than there is data. Or maybe it's not deallocating space as
 aggressively as it could. So it can get to a point where even though
 there isn't that much data in the volume (in your case 1.5x the drive
 size, and you have 4 drives) yet all of it's effectively allocated. So
 to back out of that takes free space. Then once the chunks are better
 allocated, you'd have been able to remove the drives.

As I understand things and from what I've actually observed here, btrfs 
only allocates chunks on-demand, but doesn't normally DEallocate them at 
all, except during balance, etc, when it rewrites all the (meta)data that 
matches the filters, compacting all those data holes that were opened 
up thru deletion in the process, by filling chunks as it rewrites the 
(meta)data.

So effectively, allocated chunks should always be the high-water-mark 
(rounded up to the nearest chunk size) of usage since the last balance 
effectively compacted chunk usage, because chunk allocation is automatic 
but chunk deallocation requires a balance.
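
For example, something like this reclaims mostly-empty chunks without 
rewriting everything (the usage thresholds are illustrative):

  btrfs balance start -dusage=5 -musage=5 /mnt
  btrfs filesystem df /mnt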

This is actually a fairly reasonable approach in the normal case, since 
it's reasonable to assume that even if the size of the data has reduced 
substantially, if it once reached a particular size, it's likely to reach 
it again, and particularly the deallocation process has a serious time 
cost to rewrite the remaining active data to other chunks, so best to 
just let it be unless an admin decides it's worth eating that cost to get 
the lower chunk allocation and invokes a balance to effect that.


So as you were saying, the most efficient way to delete a device could be 
to add one first if chunk allocation is almost maxed out and well above 
actual (meta)data size, then do a balance to rewrite all those nearly 
empty chunks to nearly full ones and shrink the number of allocated 
chunks to something reasonable as a result, and only THEN, when there's 
some reasonable amount of unallocated space available, attempt the device 
delete.
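
In command terms, something like this (device names and mount point are 
placeholders):

  btrfs device add /dev/sdX /mnt
  btrfs balance start /mnt
  btrfs device delete /dev/sdY /mnt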


Meanwhile, I really do have to question the use case where the risks of a 
single dead device killing a raid0 (or for that matter, running still-
experimental btrfs) are fine, but spending days doing data maintenance on 
data not valuable enough to put on anything but experimental btrfs raid0 
is warranted over simply blowing the data away and starting with brand-new 
mkfs-ed filesystems.  That's a strong hint to me that either the raid0 
use case is wrong, or the days of data move and reshape (instead of 
blowing it away and recreating brand new filesystems) are wrong, and that 
one or the other should be reevaluated.  However, I'm sure there must be 
use cases for which it's appropriate, and I simply don't have a 
sufficiently creative imagination, so I'll admit I could be wildly wrong 
on that.  If a sysadmin is sure he's on solid ground with his use case, 
for him, he very well could be.  =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



Re: Progress of device deletion?

2013-09-30 Thread Chris Murphy

On Sep 30, 2013, at 10:43 PM, Duncan 1i5t5.dun...@cox.net wrote:
 
 Meanwhile, I really do have to question the use case where the risks of a 
 single dead device killing a raid0 (or for that matter, running still 
 experimental btrfs) are fine, but spending days doing data maintenance on 
 data not valuable enough to put on anything but experimental btrfs raid0, 
 is warranted over simply blowing the data away and starting with brand 
 new mkfs-ed filesystems.

Yes of course. It must be a test case, and I think for non-experimental stable 
Btrfs it's reasonable to have reliable device delete regardless of the raid 
level because it's offered. And after all maybe the use case involves 
enterprise SSDs, each of which should have a less than 1% chance of failing 
during their service life. (Naturally, that's going to go a lot faster than 
days.)


 That a strong hint to me that either the raid0 
 use case is wrong, or the days of data move and reshape instead of 
 blowing it away and recreating brand new filesystems, is wrong, and that 
 one or the other should be reevaluated.

I think it's the wrong use case today, except for testing it.  It's legit to try 
and blow things up, simply because it's offered functionality, so long as the 
idea is "I really would like to have this workflow actually work in 2-5 years." 
Otherwise it is sort of a rat hole.

The other thing: clearly the OP is surprised it's taking anywhere near this 
long.  Had he known in advance, he probably would have made a different choice.

Chris Murphy
