Re: Corrupt btrfs filesystem recovery... What best instructions?
Martin posted on Sun, 29 Sep 2013 22:55:43 +0100 as excerpted:

> On 29/09/13 22:29, Martin wrote:
>> Looking up what's available for Gentoo, the maintainers there look
>> to be nicely sharp, with multiple versions available all the way up
>> to kernel 3.11.2...

Cool, another gentooer! =:^)

> That is being pulled in now as expected:
>
> sys-kernel/gentoo-sources-3.11.2

FWIW, I've been building my own kernels (mainline) since back on mandrake
a decade ago, and I just changed up my scripts a bit when I switched to
gentoo. Then later on I changed them up a bit more to be able to run git
kernels. These days I normally first try (and switch to, if no serious
bugs) the dev kernel around -rc2 or so, by which point I figure anything
likely to eat my system for breakfast should be either worked thru, or at
least news of it available. As a non-dev, it's very cool being able to
spot and report bugs, possibly bisecting to a specific commit, and watch
them get fixed before the general kernel release. Just one way I as a
non-dev can contribute back. =:^)

To take care of packages that depend on a kernel package, I used to have
a kernel (gentoo-sources-2.6.x or some such back then; now of course it'd
be 3.x) in package.provided, but these days I don't even need that. =:^)

> There's also the latest available from btrfs tools with
> sys-fs/btrfs-progs ...
>
> Oddly, that caused emerge to report:
>
> [ebuild UD ] sys-fs/btrfs-progs-0.19.11 [0.20_rc1_p358] 0 kB
>
> which is a *downgrade*. Hence, I'm keeping with the 0.20_rc1_p358.

btrfs-progs-9999 is available, but as a live package, it's masked in
keeping with gentoo policy. So to get it you'd need to unmask it.

But 0.20_rc1_p358 shouldn't be /too/ far back. In fact, I'm guessing the
p-number is the (serialized) patch sequence number indicating the number
of commits forward from the rc1 tag. And on the (locally unmasked) -9999
version here, a git describe --tags gives me v0.20-rc1-358-g194aa4a ...
so 0.20_rc1_p358 is very likely identical to the live version at this
point, and it makes no difference, except that the non-live version is a
stable snapshot instead of a version that might change every time you
merge it, if upstream has done any further commits.

So btrfs-progs-0.20_rc1_p358 should be fine. And you were updating the
kernel to 3.11.2, so that should be fine by the time you read this, as
well. =:^)

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
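[For anyone wanting to verify that tag-to-snapshot mapping themselves, it
can be reproduced from a btrfs-progs checkout. A sketch only: the
repository URL below is the historical kernel.org path for Chris Mason's
tree and may since have moved.

    # clone the btrfs-progs tree and ask git how far HEAD is past the tag
    git clone git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs-unstable.git
    cd btrfs-progs-unstable
    git describe --tags
    # prints e.g. v0.20-rc1-358-g194aa4a:
    #   v0.20-rc1 = nearest tag, 358 = commits since it, g194aa4a = HEAD sha
    # which lines up exactly with Gentoo's 0.20_rc1_p358 snapshot naming
]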
[PATCH] Btrfs: fix crash of compressed writes
The crash[1] is found by xfstests/generic/208 with -o compress; it's not
reproduced every time, but it does panic.

The bug is quite interesting: it's actually introduced by a recent commit
(573aecafca1cf7a974231b759197a1aebcf39c2a, "Btrfs: actually limit the
size of delalloc range").

Btrfs implements delayed allocation, so during writeback we

(1) get a page A and lock it
(2) search the state tree for delalloc bytes and lock all pages within
    the range
(3) process the delalloc range, including finding disk space and
    creating the ordered extent and so on
(4) submit page A.

It runs well in normal cases, but if we're in a racy case, eg. buffered
compressed writes and aio-dio writes, sometimes we may fail to lock all
pages in the 'delalloc' range, in which case we need to fall back to
searching the state tree again with a smaller range limit
(max_bytes = PAGE_CACHE_SIZE - offset).

The mentioned commit has a side effect: in the fallback case we can find
delalloc bytes before the index of the page we already have locked, so we
end up in the (delalloc_end <= *start) case and return with
(found == 0). This ends with not locking the delalloc pages but letting
->writepage still process them, and the crash happens.

This fixes it by not enforcing the 'max_bytes' limit in the special
fallback case.

[1]:
------------[ cut here ]------------
kernel BUG at mm/page-writeback.c:2170!
[...]
CPU: 2 PID: 11755 Comm: btrfs-delalloc- Tainted: G O 3.11.0+ #8
[...]
RIP: 0010:[810f5093] [810f5093] clear_page_dirty_for_io+0x1e/0x83
[...]
[ 4934.248731] Stack:
[ 4934.248731]  8801477e5dc8 ea00049b9f00 8801869f9ce8 a02b841a
[ 4934.248731]  0fff 0620
[ 4934.248731]  88018db59c78 ea0005da8d40 a02ff860 0001810016c0
[ 4934.248731] Call Trace:
[ 4934.248731]  [a02b841a] extent_range_clear_dirty_for_io+0xcf/0xf5 [btrfs]
[ 4934.248731]  [a02a8889] compress_file_range+0x1dc/0x4cb [btrfs]
[ 4934.248731]  [8104f7af] ? detach_if_pending+0x22/0x4b
[ 4934.248731]  [a02a8bad] async_cow_start+0x35/0x53 [btrfs]
[ 4934.248731]  [a02c694b] worker_loop+0x14b/0x48c [btrfs]
[ 4934.248731]  [a02c6800] ? btrfs_queue_worker+0x25c/0x25c [btrfs]
[ 4934.248731]  [810608f5] kthread+0x8d/0x95
[ 4934.248731]  [81060868] ? kthread_freezable_should_stop+0x43/0x43
[ 4934.248731]  [814fe09c] ret_from_fork+0x7c/0xb0
[ 4934.248731]  [81060868] ? kthread_freezable_should_stop+0x43/0x43
[ 4934.248731] Code: ff 85 c0 0f 94 c0 0f b6 c0 59 5b 5d c3 0f 1f 44 00 00 55 48 89 e5 41 54 53 48 89 fb e8 2c de 00 00 49 89 c4 48 8b 03 a8 01 75 02 <0f> 0b 4d 85 e4 74 52 49 8b 84 24 80 00 00 00 f6 40 20 01 75 44
[ 4934.248731] RIP [810f5093] clear_page_dirty_for_io+0x1e/0x83
[ 4934.248731]  RSP 8801869f9c48
[ 4934.280307] ---[ end trace 36f06d3f8750236a ]---

Signed-off-by: Liu Bo bo.li@oracle.com
---
 fs/btrfs/extent_io.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c09a40d..76a977e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1483,7 +1483,12 @@ static noinline u64 find_delalloc_range(struct extent_io_tree *tree,
 		node = rb_next(node);
 		total_bytes += state->end - state->start + 1;
 		if (total_bytes >= max_bytes) {
-			*end = *start + max_bytes - 1;
+			/*
+			 * tell if we fall back from the failure of
+			 * lock_delalloc_pages()
+			 */
+			if (max_bytes < PAGE_CACHE_SIZE)
+				*end = *start + max_bytes - 1;
 			break;
 		}
 		if (!node)
-- 
1.8.1.4
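[To try to reproduce this, the test named in the commit message can be
run in a loop. A sketch only, assuming an xfstests checkout; the device
paths and mount points are placeholders:

    cd xfstests
    # generic/208 exercises aio-dio writes racing with buffered I/O
    MOUNT_OPTIONS="-o compress" FSTYP=btrfs \
        TEST_DEV=/dev/sdb1 TEST_DIR=/mnt/test \
        SCRATCH_DEV=/dev/sdc1 SCRATCH_MNT=/mnt/scratch \
        ./check generic/208
    # not deterministic -- repeat until it trips the BUG_ON
]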
Re: Progress of device deletion?
On Sep 29, 2013, at 1:13 AM, Fredrik Tolf fred...@dolda2000.com wrote:
> Is there any way I can find out what's going on?

For whatever reason, it started out with every drive practically full,
in terms of chunk allocation, e.g.:

    devid 5 size 2.73TB used 2.71TB path /dev/sdh1

I don't know if the code works this way, but it needs to do a balance
(or a partial one) to make room before it can start migrating actual
complete chunks from 4 disks to 2 disks. And my guess is that it's not
yet done balancing sdg1 still.

Post kernel version, btrfs-progs version, and dmesg from the time the
delete command was initiated.

Chris Murphy
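[Until the kernel grows a proper progress interface, a running delete can
at least be watched indirectly. A sketch, assuming the volume is mounted
at /mnt:

    # per-device "used" on the outgoing drive should shrink over time
    watch -n 60 btrfs filesystem show
    # chunk-level allocation totals by type
    btrfs filesystem df /mnt
    # each chunk moved off the device is logged by the relocation code
    dmesg | grep 'relocating block group'
]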
Re: [PATCH] Btrfs: don't delete ordered roots from list during cleanup
On Fri, Sep 27, 2013 at 04:18:14PM -0700, Zach Brown wrote:
> On Fri, Sep 27, 2013 at 04:37:46PM -0400, Josef Bacik wrote:
>> During transaction cleanup after an abort we are just removing roots
>> from the ordered roots list, which is incorrect. We have a BUG_ON()
>> to make sure that the root is still part of the ordered roots list
>> when we put our ordered extent, which we were tripping in this case.
>> So do like we do everywhere else and just move it to the tail of the
>> ordered roots list and allow the normal cleanup to take care of
>> stuff. Thanks,
>>
>> Signed-off-by: Josef Bacik jba...@fusionio.com
>> ---
>>  fs/btrfs/disk-io.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index f38211f..872b4ce 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -3835,7 +3835,8 @@ static void btrfs_destroy_all_ordered_extents(struct btrfs_fs_info *fs_info)
>>  	while (!list_empty(&splice)) {
>>  		root = list_first_entry(&splice, struct btrfs_root,
>>  					ordered_root);
>> -		list_del_init(&root->ordered_root);
>> +		list_move_tail(&root->ordered_root,
>> +			       &fs_info->ordered_roots);
>
> This function basically only does:
>
>   lock
>   list_for_each
>     lock
>     list_for_each
>       set_bit
>
> Could we instead add a bit to the root or trans or fs_info or anything
> else that could be trivially set in _destroy_all_ordered_extents and
> tested in _finish_ordered_io()? It'd remove a bunch of tedious locking
> and iteration here.
>
> The similar metaphor in the core page cache is
> (address_space->flags | AS_EIO).
>
> Would that be too coarse or racey?

So I _think_ we may need to truncate the ordered range in the inode as
well, but I haven't had a consistent reproducer for this case. I want to
leave it like this for now until I'm sure we don't need the truncate,
and then we could probably just replace this with a test for FS_ERROR in
finish_ordered_io. Thanks,

Josef
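[For what it's worth, the bit-test idea Zach floats might look something
like the sketch below. The BTRFS_FS_TRANS_ABORTED bit and the flags word
on fs_info are invented for illustration; they are not names from the
actual tree, and the real hook points would be
btrfs_destroy_all_ordered_extents() and btrfs_finish_ordered_io():

	/* one hypothetical bit in a hypothetical fs_info->flags word */
	#define BTRFS_FS_TRANS_ABORTED	0

	static void mark_all_ordered_extents_dead(struct btrfs_fs_info *fs_info)
	{
		/* one atomic store instead of the lock/iterate/lock dance */
		set_bit(BTRFS_FS_TRANS_ABORTED, &fs_info->flags);
	}

	static bool ordered_io_aborted(struct btrfs_fs_info *fs_info)
	{
		/*
		 * checked from the ordered-extent completion path, so it
		 * can bail out itself instead of being torn down from
		 * the outside
		 */
		return test_bit(BTRFS_FS_TRANS_ABORTED, &fs_info->flags);
	}
]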
[PATCH] Btrfs: fix two use-after-free bugs with transaction cleanup
I was noticing the slab redzone stuff going off every once in a while
during transaction aborts. This was caused by two things:

1) We would walk the pending snapshots and set their error to
-ECANCELED. We don't need to do this; the snapshot stuff waits for a
transaction commit, and if there is a problem we just free our pending
snapshot object and exit. Doing this was causing us to touch the pending
snapshot object after the thing had already been freed.

2) We were freeing the transaction manually with wanton disregard for
its use_count reference counter. To fix this I cleaned up the
transaction freeing loop to either wait for the transaction commit to
finish if it was in the middle of that (since it will be cleaned and
freed up there) or to do the cleanup ourselves.

I also moved the global "kill all things dirty everywhere" stuff outside
of the transaction cleanup loop, since that only needs to be done once.
With this patch I'm no longer seeing slab corruption because of use
after frees. Thanks,

Signed-off-by: Josef Bacik jba...@fusionio.com
---
 fs/btrfs/disk-io.c     | 111 ++---
 fs/btrfs/transaction.c |  22 +-
 fs/btrfs/transaction.h |   1 +
 3 files changed, 52 insertions(+), 82 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 872b4ce..8aa93a9 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -64,7 +64,6 @@ static void btrfs_destroy_ordered_operations(struct btrfs_transaction *t,
 static void btrfs_destroy_ordered_extents(struct btrfs_root *root);
 static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
 				      struct btrfs_root *root);
-static void btrfs_evict_pending_snapshots(struct btrfs_transaction *t);
 static void btrfs_destroy_delalloc_inodes(struct btrfs_root *root);
 static int btrfs_destroy_marked_extents(struct btrfs_root *root,
 					struct extent_io_tree *dirty_pages,
@@ -3914,24 +3913,6 @@ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
 	return ret;
 }
 
-static void btrfs_evict_pending_snapshots(struct btrfs_transaction *t)
-{
-	struct btrfs_pending_snapshot *snapshot;
-	struct list_head splice;
-
-	INIT_LIST_HEAD(&splice);
-
-	list_splice_init(&t->pending_snapshots, &splice);
-
-	while (!list_empty(&splice)) {
-		snapshot = list_entry(splice.next,
-				      struct btrfs_pending_snapshot,
-				      list);
-		snapshot->error = -ECANCELED;
-		list_del_init(&snapshot->list);
-	}
-}
-
 static void btrfs_destroy_delalloc_inodes(struct btrfs_root *root)
 {
 	struct btrfs_inode *btrfs_inode;
@@ -4061,6 +4042,8 @@ again:
 void btrfs_cleanup_one_transaction(struct btrfs_transaction *cur_trans,
 				   struct btrfs_root *root)
 {
+	btrfs_destroy_ordered_operations(cur_trans, root);
+
 	btrfs_destroy_delayed_refs(cur_trans, root);
 
 	btrfs_block_rsv_release(root, &root->fs_info->trans_block_rsv,
 				cur_trans->dirty_pages.dirty_bytes);
@@ -4068,8 +4051,6 @@ void btrfs_cleanup_one_transaction(struct btrfs_transaction *cur_trans,
 	cur_trans->state = TRANS_STATE_COMMIT_START;
 	wake_up(&root->fs_info->transaction_blocked_wait);
 
-	btrfs_evict_pending_snapshots(cur_trans);
-
 	cur_trans->state = TRANS_STATE_UNBLOCKED;
 	wake_up(&root->fs_info->transaction_wait);
 
@@ -4093,63 +4074,51 @@ void btrfs_cleanup_one_transaction(struct btrfs_transaction *cur_trans,
 static int btrfs_cleanup_transaction(struct btrfs_root *root)
 {
 	struct btrfs_transaction *t;
-	LIST_HEAD(list);
 
 	mutex_lock(&root->fs_info->transaction_kthread_mutex);
 
 	spin_lock(&root->fs_info->trans_lock);
-	list_splice_init(&root->fs_info->trans_list, &list);
-	root->fs_info->running_transaction = NULL;
-	spin_unlock(&root->fs_info->trans_lock);
-
-	while (!list_empty(&list)) {
-		t = list_entry(list.next, struct btrfs_transaction, list);
-
-		btrfs_destroy_ordered_operations(t, root);
-
-		btrfs_destroy_all_ordered_extents(root->fs_info);
-
-		btrfs_destroy_delayed_refs(t, root);
-
-		/*
-		 * FIXME: cleanup wait for commit
-		 * We needn't acquire the lock here, because we are during
-		 * the umount, there is no other task which will change it.
-		 */
-		t->state = TRANS_STATE_COMMIT_START;
-		smp_mb();
-		if (waitqueue_active(&root->fs_info->transaction_blocked_wait))
-			wake_up(&root->fs_info->transaction_blocked_wait);
-
-		btrfs_evict_pending_snapshots(t);
-
-		t->state = TRANS_STATE_UNBLOCKED;
-		smp_mb();
-		if
[...]
Re: [PATCH] Btrfs: fix crash of compressed writes
On Mon, Sep 30, 2013 at 08:39:57PM +0800, Liu Bo wrote:
> The crash[1] is found by xfstests/generic/208 with -o compress; it's
> not reproduced every time, but it does panic.
>
> [...]
>
> It runs well in normal cases, but if we're in a racy case, eg.
> buffered compressed writes and aio-dio writes, sometimes we may fail
> to lock all pages in the 'delalloc' range, in which case we need to
> fall back to searching the state tree again with a smaller range
> limit (max_bytes = PAGE_CACHE_SIZE - offset).
>
> The mentioned commit has a side effect: in the fallback case we can
> find delalloc bytes before the index of the page we already have
> locked, so we end up in the (delalloc_end <= *start) case and return
> with (found == 0). This ends with not locking the delalloc pages but
> letting ->writepage still process them, and the crash happens.
>
> This fixes it by not enforcing the 'max_bytes' limit in the special
> fallback case.

Great analysis, thank you for that; however, I don't like the fix ;).
Instead we need to change the

	if (!found || delalloc_end <= *start)

case to always return 0, since we should not call fill_delalloc if the
delalloc range doesn't start at our current offset. Secondly, the

	max_bytes = PAGE_CACHE_SIZE - offset;

is completely crap, since we are talking about the entire page/sector.
We should simply set max_bytes to sectorsize or PAGE_CACHE_SIZE; that
way we avoid this issue. Thanks,

Josef
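[Spelled out against the 3.11-era find_lock_delalloc_range() in
fs/btrfs/extent_io.c, the suggestion would look roughly like this — a
sketch of the review comment, not a tested patch; only the two spots
under discussion are shown:

	/* in find_lock_delalloc_range() */
	found = find_delalloc_range(tree, &delalloc_start, &delalloc_end,
				    max_bytes, &cached_state);
	if (!found || delalloc_end <= *start) {
		*start = delalloc_start;
		*end = delalloc_end;
		free_extent_state(cached_state);
		return 0;	/* was "return found"; never report a
				 * range that doesn't cover the locked
				 * page */
	}

	/* ... and in the lock_delalloc_pages() failure fallback ... */
	max_bytes = PAGE_CACHE_SIZE;	/* was PAGE_CACHE_SIZE - offset */
	goto again;
]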
Re: [PATCH] Btrfs: don't delete ordered roots from list during cleanup
> So I _think_ we may need to truncate the ordered range in the inode as
> well, but I haven't had a consistent reproducer for this case. I want
> to leave it like this for now until I'm sure we don't need the
> truncate, and then we could probably just replace this with a test for
> FS_ERROR in finish_ordered_io.

*nod*, added to the list :)

- z
[PATCH] Btrfs: fix transid verify errors when recovering log tree
If we crash with a log, remount and recover that log, and then crash
before we can commit another transaction, we will get transid verify
errors on the next mount. This is because we were not zeroing out the
log when we committed the transaction after recovery. This is ok as long
as we commit another transaction at some point in the future, but if you
abort or something else goes wrong, you can end up in this weird state
because the recovery stuff says that the tree log should have a
generation+1 of the super generation, which won't be the case of the
transaction that was started for recovery. Fix this by removing the
check and _always_ zeroing out the log portion of the super when we
commit a transaction. This fixes the transid verify issues I was seeing
with my force-errors tests. Thanks,

Signed-off-by: Josef Bacik jba...@fusionio.com
---
 fs/btrfs/transaction.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index b903192..134039f 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1845,11 +1845,8 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 	assert_qgroups_uptodate(trans);
 
 	update_super_roots(root);
 
-	if (!root->fs_info->log_root_recovering) {
-		btrfs_set_super_log_root(root->fs_info->super_copy, 0);
-		btrfs_set_super_log_root_level(root->fs_info->super_copy, 0);
-	}
-
+	btrfs_set_super_log_root(root->fs_info->super_copy, 0);
+	btrfs_set_super_log_root_level(root->fs_info->super_copy, 0);
 	memcpy(root->fs_info->super_for_commit, root->fs_info->super_copy,
 	       sizeof(*root->fs_info->super_copy));
-- 
1.8.3.1
Re: Questions regarding logging upon fsync in btrfs
On 29 September 2013 15:12, Josef Bacik jba...@fusionio.com wrote:
> On Sun, Sep 29, 2013 at 11:22:36AM +0200, Aastha Mehta wrote:
>> Thank you very much for the reply. That clarifies a lot of things.
>>
>> I was trying a small test case that opens a file, writes a block of
>> data, calls fsync and then closes the file. If I understand
>> correctly, fsync would return only after all in-memory buffers have
>> been committed to disk. I have added a few print statements in the
>> __extent_writepage function, and I notice that the function gets
>> called a bit later after fsync returns. It seems that I am not
>> guaranteed to see the data going to disk by the time fsync returns.
>> Am I doing something wrong, or am I looking at the wrong place for
>> the disk write? This happens both with tree logging enabled as well
>> as with notreelog.
>
> So 3.1 was a long time ago, and while to be sure it had issues, I
> don't think it was _that_ broken. You are probably better off
> instrumenting a recent kernel, 3.11, or just building btrfs-next from
> git. But if I were to make a guess I'd say that __extent_writepage
> was how both data and metadata were written out at the time (I don't
> think I changed it until 3.2 or something later), so what you are
> likely seeing is the normal transaction commit after the fsync. In
> the case of notreelog we are likely starting another transaction and
> you are seeing that commit (at the time the transaction kthread would
> start a transaction even if none had been started yet). Thanks,
>
> Josef

Is there any special handling for a very small file write, less than 4K?
As I understand it, there is an optimization to inline the first extent
in a file if it is smaller than 4K; does it affect the writeback on
fsync as well? I did set the max_inline mount option to 0, but even then
it seems there is some difference in fsync behaviour between writing a
first extent of less than 4K size and writing 4K or more.

Thanks,
Aastha.

-- 
Aastha Mehta
MPI-SWS, Germany
E-mail: aasth...@mpi-sws.org
Re: Questions regarding logging upon fsync in btrfs
On Mon, Sep 30, 2013 at 09:32:54PM +0200, Aastha Mehta wrote:
> On 29 September 2013 15:12, Josef Bacik jba...@fusionio.com wrote:
>> [...]
>
> Is there any special handling for a very small file write, less than
> 4K? As I understand it, there is an optimization to inline the first
> extent in a file if it is smaller than 4K; does it affect the
> writeback on fsync as well? I did set the max_inline mount option to
> 0, but even then it seems there is some difference in fsync behaviour
> between writing a first extent of less than 4K size and writing 4K or
> more.

Yeah, if the file is an inline extent then it will be copied into the
log directly and the log will be written out, no going through the data
write path at all. max_inline == 0 should make it so we don't inline, so
if it isn't honoring that then that may be a bug. Thanks,

Josef
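[A minimal userspace version of the test case being discussed, for anyone
following along; the file path and sizes are arbitrary. Run it once with
a sub-4K length and once with 4K or more against a btrfs mounted with
-o max_inline=0 to compare the writeback behaviour:

	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		const char *path = argc > 1 ? argv[1] : "/mnt/btrfs/testfile";
		size_t len = argc > 2 ? strtoul(argv[2], NULL, 0) : 1024;
		char *buf = malloc(len);
		int fd;

		memset(buf, 'a', len);
		fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
		if (fd < 0) {
			perror("open");
			return 1;
		}
		if (write(fd, buf, len) != (ssize_t)len)
			perror("write");
		/* data and metadata must be durable when this returns */
		if (fsync(fd))
			perror("fsync");
		close(fd);
		free(buf);
		return 0;
	}
]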
[PATCH] Btrfs-progs: make the repair option a global static var
It's just annoying to have to pass it around everywhere. Thanks,

Signed-off-by: Josef Bacik jba...@fusionio.com
---
 cmds-check.c | 23 +++++++++++------------
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index f05c73e..6da35ea 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -49,6 +49,7 @@ static u64 data_bytes_allocated = 0;
 static u64 data_bytes_referenced = 0;
 static int found_old_backref = 0;
 static LIST_HEAD(duplicate_extents);
+static int repair = 0;
 
 struct extent_backref {
 	struct list_head list;
@@ -1404,7 +1405,7 @@ out:
 }
 
 static int check_inode_recs(struct btrfs_root *root,
-			    struct cache_tree *inode_cache, int repair)
+			    struct cache_tree *inode_cache)
 {
 	struct cache_extent *cache;
 	struct ptr_node *node;
@@ -1809,7 +1810,7 @@ static int process_root_ref(struct extent_buffer *eb, int slot,
 
 static int check_fs_root(struct btrfs_root *root,
 			 struct cache_tree *root_cache,
-			 struct walk_control *wc, int repair)
+			 struct walk_control *wc)
 {
 	int ret = 0;
 	int wret;
@@ -1879,7 +1880,7 @@ static int check_fs_root(struct btrfs_root *root,
 			       root_node.current);
 	}
 
-	ret = check_inode_recs(root, root_node.inode_cache, repair);
+	ret = check_inode_recs(root, root_node.inode_cache);
 	return ret;
 }
 
@@ -1895,8 +1896,7 @@ static int fs_root_objectid(u64 objectid)
 }
 
 static int check_fs_roots(struct btrfs_root *root,
-			  struct cache_tree *root_cache,
-			  int repair)
+			  struct cache_tree *root_cache)
 {
 	struct btrfs_path path;
 	struct btrfs_key key;
@@ -1939,7 +1939,7 @@ static int check_fs_roots(struct btrfs_root *root,
 			err = 1;
 			goto next;
 		}
-		ret = check_fs_root(tmp_root, root_cache, wc, repair);
+		ret = check_fs_root(tmp_root, root_cache, wc);
 		if (ret)
 			err = 1;
 		btrfs_free_fs_root(tmp_root);
@@ -5009,7 +5009,7 @@ static void reset_cached_block_groups(struct btrfs_fs_info *fs_info)
 
 static int check_extent_refs(struct btrfs_trans_handle *trans,
 			     struct btrfs_root *root,
-			     struct cache_tree *extent_cache, int repair)
+			     struct cache_tree *extent_cache)
 {
 	struct extent_record *rec;
 	struct cache_extent *cache;
@@ -5403,7 +5403,7 @@ static int check_devices(struct rb_root *dev_cache,
 	return ret;
 }
 
-static int check_chunks_and_extents(struct btrfs_root *root, int repair)
+static int check_chunks_and_extents(struct btrfs_root *root)
 {
 	struct rb_root dev_cache;
 	struct cache_tree chunk_cache;
@@ -5510,7 +5510,7 @@ again:
 		break;
 	}
 
-	ret = check_extent_refs(trans, root, extent_cache, repair);
+	ret = check_extent_refs(trans, root, extent_cache);
 	if (ret == -EAGAIN) {
 		ret = btrfs_commit_transaction(trans, root);
 		if (ret)
@@ -5982,7 +5982,6 @@ int cmd_check(int argc, char **argv)
 	char uuidbuf[37];
 	int ret;
 	int num;
-	int repair = 0;
 	int option_index = 0;
 	int init_csum_tree = 0;
 	int init_extent_tree = 0;
@@ -6084,7 +6083,7 @@ int cmd_check(int argc, char **argv)
 		exit(1);
 		goto out;
 	}
-	ret = check_chunks_and_extents(root, repair);
+	ret = check_chunks_and_extents(root);
 	if (ret)
 		fprintf(stderr, "Errors found in extent allocation tree or chunk allocation\n");
 
@@ -6094,7 +6093,7 @@ int cmd_check(int argc, char **argv)
 		goto out;
 
 	fprintf(stderr, "checking fs roots\n");
-	ret = check_fs_roots(root, root_cache, repair);
+	ret = check_fs_roots(root, root_cache);
 	if (ret)
 		goto out;
-- 
1.8.3.1
Re: Questions regarding logging upon fsync in btrfs
On 30 September 2013 22:11, Josef Bacik jba...@fusionio.com wrote:
> On Mon, Sep 30, 2013 at 09:32:54PM +0200, Aastha Mehta wrote:
>> [...]
>>
>> Is there any special handling for a very small file write, less than
>> 4K? [...]
>
> Yeah, if the file is an inline extent then it will be copied into the
> log directly and the log will be written out, no going through the
> data write path at all. max_inline == 0 should make it so we don't
> inline, so if it isn't honoring that then that may be a bug. Thanks,
>
> Josef

I tried it on the 3.12-rc2 release, and it seems there is a bug then.
Please find the attached logs to confirm. Probably it is also present on
the older release.

Thanks,
Aastha.

-- 
Aastha Mehta
MPI-SWS, Germany
E-mail: aasth...@mpi-sws.org

[ 2808.125838] [do_writepages] inum: 260, nr_to_write: 9223372036854775807, writepages fn: 8129ce40
[ 2808.125977] CPU: 0 PID: 10215 Comm: bb Not tainted 3.12.0-rc2-latest+ #4
[ 2808.125980] Hardware name: innotek GmbH VirtualBox, BIOS VirtualBox 12/01/2006
[ 2808.125983]  88003d6b5e60 88003d6b5e28 817a8a6f 88003fc0eb48
[ 2808.125987]  88002e325b48 88003d6b5e48 81105c1f 88002e325a00
[ 2808.125990]  880037b1f300 88003d6b5e88 810fb689 88003fc12c00
[ 2808.125993] Call Trace:
[ 2808.126003]  [817a8a6f] dump_stack+0x46/0x58
[ 2808.126010]  [81105c1f] do_writepages+0x4f/0x90
[ 2808.126013]  [810fb689] __filemap_fdatawrite_range+0x49/0x50
[ 2808.126013]  [810fc41e] filemap_fdatawrite_range+0xe/0x10
[ 2808.126013]  [812aafdc] btrfs_sync_file+0xac/0x380
[ 2808.126098]  [817aedd2] ? __schedule+0x692/0x730
[ 2808.126115]  [811752d8] do_fsync+0x58/0x90
[ 2808.126149]  [81148a6d] ? SyS_write+0x4d/0xa0
[ 2808.126165]  [8117565b] SyS_fsync+0xb/0x10
[ 2808.126182]  [817b83a2] system_call_fastpath+0x16/0x1b
[ 2808.127447] CPU: 0 PID: 10215 Comm: bb Not tainted 3.12.0-rc2-latest+ #4
[ 2808.127450] Hardware name: innotek GmbH VirtualBox, BIOS VirtualBox 12/01/2006
[ 2808.127452]  ea802440 88003d6b5ba8 817a8a6f 88003d6b5de0
[ 2808.127456]  88002e325870 88003d6b5cb8 812b7c9f 88003e2083d0
[ 2808.127459]  88003d6b5c48 817aedd2 88003c3707b0
[ 2808.127462] Call Trace:
[ 2808.127467]  [817a8a6f] dump_stack+0x46/0x58
[ 2808.127473]  [812b7c9f] __extent_writepage+0x6f/0x750
[ 2808.127479]  [817aedd2] ? __schedule+0x692/0x730
[ 2808.127483]  [810fac71] ? find_get_pages_tag+0x121/0x160
[ 2808.127487]  [812b8552] extent_write_cache_pages.isra.24.constprop.31+0x1d2/0x380
[ 2808.127492]  [81005877] ? show_trace_log_lvl+0x57/0x70
[ 2808.127496]  [812b89d9] extent_writepages+0x49/0x60
[ 2808.127501]  [8129f160] ? btrfs_submit_direct+0x6c0/0x6c0
[ 2808.127505]  [8129ce63] btrfs_writepages+0x23/0x30
[ 2808.127510]  [81105c3e] do_writepages+0x6e/0x90
[ 2808.127513]  [810fb689] __filemap_fdatawrite_range+0x49/0x50
[ 2808.127517]  [810fc41e] filemap_fdatawrite_range+0xe/0x10
[...]
Re: Questions regarding logging upon fsync in btrfs
On Mon, Sep 30, 2013 at 10:30:59PM +0200, Aastha Mehta wrote:
> [...]
>
> I tried it on the 3.12-rc2 release, and it seems there is a bug then.
> Please find the attached logs to confirm. Probably it is also present
> on the older release.

Oooh, ok, I understand: you have your printk's in the wrong place ;).
do_writepages doesn't necessarily mean you are writing something. If you
want to see if stuff got written to the disk, I'd put a printk in
run_delalloc_range and have it spit out the range it is writing out,
since that's what we think is actually dirty. Thanks,

Josef
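[The instrumentation Josef suggests would be a one-liner at the top of
run_delalloc_range() in fs/btrfs/inode.c — a sketch against the
3.12-rc2 source; the message format is made up, only the function and
its arguments are real:

	static int run_delalloc_range(struct inode *inode, struct page *locked_page,
				      u64 start, u64 end, int *page_started,
				      unsigned long *nr_written)
	{
		/* log exactly which dirty range btrfs is about to write */
		printk(KERN_INFO "btrfs: run_delalloc_range ino %llu [%llu, %llu]\n",
		       btrfs_ino(inode), start, end);
		/* ... existing body unchanged ... */
	}
]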
Re: Questions regarding logging upon fsync in btrfs
On 30 September 2013 22:47, Josef Bacik jba...@fusionio.com wrote:
> [...]
>
> Oooh, ok, I understand: you have your printk's in the wrong place ;).
> do_writepages doesn't necessarily mean you are writing something. If
> you want to see if stuff got written to the disk, I'd put a printk in
> run_delalloc_range and have it spit out the range it is writing out,
> since that's what we think is actually dirty. Thanks,
>
> Josef

No, but I also placed dump_stack() at the beginning of
__extent_writepage. run_delalloc_range is called only from
__extent_writepage; if it were called, the dump_stack() at the top of
__extent_writepage would have printed as well, no?

Thanks

-- 
Aastha Mehta
MPI-SWS, Germany
E-mail: aasth...@mpi-sws.org
Re: Questions regarding logging upon fsync in btrfs
On Mon, Sep 30, 2013 at 11:07:20PM +0200, Aastha Mehta wrote:
> [...]
>
> No, but I also placed dump_stack() at the beginning of
> __extent_writepage. run_delalloc_range is called only from
> __extent_writepage; if it were called, the dump_stack() at the top of
> __extent_writepage would have printed as well, no?

Yeah, so I don't know what's going on, and I'm in the middle of
something; I'll look at it tomorrow and see if I can't figure out what
is going on. I'm sure it's working: we have an xfstest to test this sort
of thing and it's passing, so we're definitely getting the data to disk
properly. I'm probably just missing some piece around here somewhere.
Thanks,

Josef
Re: [PATCH] btrfs: init device stats for new devices
Hi Zach,

thank you for your answer and clarification. I cannot just unmount and
mount that filesystem, because it is running a busy NFS server now, so I
will just try it on some testbench server. Can mount -o remount be
sufficient (to prevent stopping the service, umount, mount and starting
the service)?

Thank you,
Ondřej

2013/9/30 Zach Brown z...@redhat.com:
>> I discovered one minor bug in the BTRFS filesystem.
>
> You sure did.
>
>> ERROR: ioctl(BTRFS_IOC_GET_DEV_STATS) on /dev/sde failed: No such device
>>
>> But this is not true ... all specified devices exist and are members
>> of the btrfs filesystem. In dmesg I see this: ...
>> [973077.099118] btrfs: get dev_stats failed, not yet valid
>>
>> What makes device statistics valid? I tried doing a full filesystem
>> scrub ... but it did not fix that issue.
>
> The stats are only initialized (considered valid) for devices that are
> known at mount. You could unmount and mount after adding (or
> replacing) new devices and they'd start returning stats.
>
> The following (bad) patch illustrates the problem, but the code should
> be restructured so stats are reliably read as devices are added.
>
> - z
>
> From: Zach Brown z...@redhat.com
> Date: Mon, 30 Sep 2013 17:48:05 -0400
> Subject: [PATCH] btrfs: init device stats for new devices
>
> Device stats are only initialized (read from tree items) on mount.
> Trying to read device stats after adding or replacing new devices will
> return errors.
>
> This cheesy patch demonstrates the problem, but this should really be
> a natural side-effect of adding devices to the fs_devices list. We
> have evidence that trying to do it by hand doesn't work. Any
> preferences for how to restructure this?
> ---
>  fs/btrfs/dev-replace.c | 4 +++-
>  fs/btrfs/volumes.c     | 6 ++++++
>  2 files changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> index 5d84443..7309096 100644
> --- a/fs/btrfs/dev-replace.c
> +++ b/fs/btrfs/dev-replace.c
> @@ -556,7 +556,9 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
>  
>  	mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
>  
> -	return 0;
> +	ret = btrfs_init_dev_stats(root->fs_info);
> +
> +	return ret;
>  }
>  
>  static void btrfs_dev_replace_update_device_in_mapping_tree(
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 0431147..e4ccc9b 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -2126,6 +2126,9 @@ int btrfs_init_new_device(struct btrfs_root *root, char *device_path)
>  		ret = btrfs_commit_transaction(trans, root);
>  	}
>  
> +	if (!ret)
> +		ret = btrfs_init_dev_stats(root->fs_info);
> +
>  	return ret;
>  
>  error_trans:
> @@ -6060,6 +6063,9 @@ int btrfs_init_dev_stats(struct btrfs_fs_info *fs_info)
>  	int item_size;
>  	struct btrfs_dev_stats_item *ptr;
>  
> +	if (device->dev_stats_valid)
> +		continue;
> +
>  	key.objectid = 0;
>  	key.type = BTRFS_DEV_STATS_KEY;
>  	key.offset = device->devid;
> -- 
> 1.7.11.7

-- 
Ondřej Kunc
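[To see the failure — and, with a patch like Zach's, the fix — from
userspace; paths and device names here are examples only:

    btrfs device add /dev/sde /mnt    # or a btrfs replace operation
    btrfs device stats /mnt
    # unpatched: ERROR: ioctl(BTRFS_IOC_GET_DEV_STATS) on /dev/sde failed
    # patched:   zeroed counters for the new device, no remount needed
]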
Re: [PATCH] btrfs: init device stats for new devices
On Tue, Oct 01, 2013 at 12:03:05AM +0200, Ondřej Kunc wrote:
> Hi Zach,
>
> thank you for your answer and clarification. I cannot just unmount and
> mount that filesystem, because it is running a busy NFS server now, so
> I will just try it on some testbench server. Can mount -o remount be
> sufficient (to prevent stopping the service, umount, mount and
> starting the service)?

Sadly, no, remounting won't initialize the stats.

- z
Re: Progress of device deletion?
On Sep 30, 2013, at 8:27 AM, Chris Murphy li...@colorremedies.com wrote:
> On Sep 29, 2013, at 1:13 AM, Fredrik Tolf fred...@dolda2000.com wrote:
>> Is there any way I can find out what's going on?
>
> For whatever reason, it started out with every drive practically full,
> in terms of chunk allocation, e.g.:
>
>     devid 5 size 2.73TB used 2.71TB path /dev/sdh1
>
> I don't know if the code works this way, but it needs to do a balance
> (or a partial one) to make room before it can start migrating actual
> complete chunks from 4 disks to 2 disks. And my guess is that it's not
> yet done balancing sdg1 still.
>
> Post kernel version, btrfs-progs version, and dmesg from the time the
> delete command was initiated.

Without knowing more about how it's expected to behave (now and in the
near future), I think if I were you I would have added a couple of
drives to the volume first so that it had more maneuvering room. It
probably seems weird to add drives in order to remove drives, but
sometimes (always?) Btrfs really gets a bit piggish about allocating a
lot more chunks than there is data. Or maybe it's not deallocating space
as aggressively as it could. So it can get to a point where, even though
there isn't that much data in the volume (in your case 1.5x the drive
size, and you have 4 drives), all of it is effectively allocated. So to
back out of that takes free space. Then once the chunks are better
allocated, you'd have been able to remove the drives.

Open question whether I would have removed 2, then another 2, or removed
4. And for all I know, adding one drive might be enough even though it's
raid0. *shrug* New territory.

Chris Murphy
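[In command form, the workaround Chris describes would be something like
the following — a hypothetical sequence; device names and mount point
are examples, and whether one spare is enough for raid0 is exactly the
open question above:

    btrfs device add /dev/sdi1 /mnt       # temporary spare for maneuvering room
    btrfs filesystem balance /mnt         # compact chunks so real free space appears
    btrfs device delete /dev/sdg1 /mnt    # now the shrink has somewhere to put data
    btrfs device delete /dev/sdh1 /mnt
    btrfs device delete /dev/sdi1 /mnt    # finally evict the temporary spare
]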
Re: [PATCH] Btrfs: fix crash of compressed writes
On Mon, Sep 30, 2013 at 01:02:49PM -0400, Josef Bacik wrote: On Mon, Sep 30, 2013 at 08:39:57PM +0800, Liu Bo wrote: The crash[1] is found by xfstests/generic/208 with -o compress, it's not reproduced everytime, but it does panic. The bug is quite interesting, it's actually introduced by a recent commit (573aecafca1cf7a974231b759197a1aebcf39c2a, Btrfs: actually limit the size of delalloc range). Btrfs implements delay allocation, so during writeback, we (1) get a page A and lock it (2) search the state tree for delalloc bytes and lock all pages within the range (3) process the delalloc range, including find disk space and create ordered extent and so on. (4) submit the page A. It runs well in normal cases, but if we're in a racy case, eg. buffered compressed writes and aio-dio writes, sometimes we may fail to lock all pages in the 'delalloc' range, in which case, we need to fall back to search the state tree again with a smaller range limit(max_bytes = PAGE_CACHE_SIZE - offset). The mentioned commit has a side effect, that is, in the fallback case, we can find delalloc bytes before the index of the page we already have locked, so we're in the case of (delalloc_end = *start) and return with (found 0). This ends with not locking delalloc pages but making -writepage still process them, and the crash happens. This fixes it by not enforcing the 'max_bytes' limit in the special fallback case. Great analysis, thank you for that, however I don't like the fix ;). Instead We need to change the if (!found || delalloc_end = *start) to always return 0 since we should not call fill delalloc if the delalloc range doesn't start at our current offset. Secondly the Setting 'found = 0' is also the first idea into my mind :), but I'm not sure if it's right. I've check it out, commit 70b99e6959a4c28ae1b314985eca731f3db72f1d (Btrfs: Compression corner fixes, Signed-off-by: Chris Mason chris.ma...@oracle.com) introduced the (delalloc_end = *start). The commit log says, Make sure we lock pages in the correct order during delalloc. The search into the state tree for delalloc bytes can return bytes before the page we already have locked. But I think if we find bytes before the page we locked, then we should get (found = 0). So I don't know why we need the check (delalloc_end = *start), maybe some corner cases need that? Chris can explain a bit? max_bytes = PAGE_CACHE_SIZE - offset; is completely crap since we are talking about the entire page/sector. We should simply set max_bytes to sectorszie or PAGE_CACHE_SIZE that way we avoid this issue. Thanks, This might be due to historical reasons, but for now, it doesn't cause problems as the passed @start is page aligned and @offset is always 0. thanks, -liubo -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Progress of device deletion?
Chris Murphy posted on Mon, 30 Sep 2013 19:05:36 -0600 as excerpted:

> It probably seems weird to add drives in order to remove drives, but
> sometimes (always?) Btrfs really gets a bit piggish about allocating a
> lot more chunks than there is data. Or maybe it's not deallocating
> space as aggressively as it could. So it can get to a point where,
> even though there isn't that much data in the volume (in your case
> 1.5x the drive size, and you have 4 drives), all of it is effectively
> allocated. So to back out of that takes free space. Then once the
> chunks are better allocated, you'd have been able to remove the
> drives.

As I understand things, and from what I've actually observed here, btrfs
only allocates chunks on demand, but doesn't normally DEallocate them at
all, except during a balance, etc., when it rewrites all the (meta)data
that matches the filters, compacting all those data holes that were
opened up thru deletion in the process, by filling chunks as it rewrites
the (meta)data.

So effectively, allocated chunks should always be the high-water mark
(rounded up to the nearest chunk size) of usage since the last balance
compacted chunk usage, because chunk allocation is automatic but chunk
deallocation requires a balance.

This is actually a fairly reasonable approach in the normal case: it's
reasonable to assume that even if the size of the data has reduced
substantially, if it once reached a particular size it's likely to reach
it again, and particularly since the deallocation process has a serious
time cost to rewrite the remaining active data to other chunks, so best
to just let it be, unless an admin decides it's worth eating that cost
to get the lower chunk allocation and invokes a balance to effect that.

So as you were saying, the most efficient way to delete a device could
be to add one first if chunk allocation is almost maxed out and well
above the actual (meta)data size, then do a balance to rewrite all those
nearly empty chunks into nearly full ones and shrink the number of
allocated chunks to something reasonable as a result, and only THEN,
when there's some reasonable amount of unallocated space available,
attempt the device delete.

Meanwhile, I really do have to question the use case where the risks of
a single dead device killing a raid0 (or for that matter, running
still-experimental btrfs) are fine, but spending days doing data
maintenance on data not valuable enough to put on anything but
experimental btrfs raid0 is warranted over simply blowing the data away
and starting with brand new mkfs-ed filesystems. That's a strong hint to
me that either the raid0 use case is wrong, or the days of data move and
reshape instead of blowing it away and recreating brand new filesystems
are wrong, and that one or the other should be reevaluated.

However, I'm sure there must be use cases for which it's appropriate and
I simply don't have a sufficiently creative imagination, so I'll admit I
could be wildly wrong on that. If a sysadmin is sure he's on solid
ground with his use case, for him, he very well could be. =:^)

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
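[Since a balance rewrites only what matches its filters, the compaction
step doesn't have to touch everything. A sketch, assuming a kernel and
btrfs-progs new enough for balance filters (3.3+); the mount point is an
example:

    # rewrite only data/metadata chunks that are <= 5% full
    btrfs balance start -dusage=5 -musage=5 /mnt
    # afterwards, allocated chunks should track actual usage again
    btrfs filesystem df /mnt
]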
Re: Progress of device deletion?
On Sep 30, 2013, at 10:43 PM, Duncan 1i5t5.dun...@cox.net wrote:
> Meanwhile, I really do have to question the use case where the risks
> of a single dead device killing a raid0 (or for that matter, running
> still-experimental btrfs) are fine, but spending days doing data
> maintenance on data not valuable enough to put on anything but
> experimental btrfs raid0 is warranted over simply blowing the data
> away and starting with brand new mkfs-ed filesystems.

Yes, of course. It must be a test case, and I think for non-experimental
stable Btrfs it's reasonable to have reliable device delete regardless
of the raid level, because it's offered. And after all, maybe the use
case involves enterprise SSDs, each of which should have a less than 1%
chance of failing during its service life. (Naturally, that's going to
go a lot faster than days.)

> That's a strong hint to me that either the raid0 use case is wrong, or
> the days of data move and reshape instead of blowing it away and
> recreating brand new filesystems are wrong, and that one or the other
> should be reevaluated.

I think it's the wrong use case today, except for testing it. It's legit
to try and blow things up, simply because it's offered functionality, so
long as the idea is "I really would like to have this workflow actually
work in 2-5 years." Otherwise it is sort of a rat hole.

The other thing: clearly the OP is surprised it's taking anywhere near
this long. Had he known in advance, he probably would have made a
different choice.

Chris Murphy