Re: (renamed thread) btrfs metrics
I am looking at what metrics are needed to monitor btrfs in production. I maintain the ganglia-modules-linux package, which includes some FS space metrics, but I figure that btrfs throws all of that out the window. Can you suggest metrics that would be meaningful? Do I look in /proc, or use syscalls? Is there any code I could look at as an example of how to extract them from C? Ideally, Ganglia runs without root privileges too, so please let me know whether btrfs will let me access them that way.

It depends on what you want to know, really. If the question is "how close am I to a full filesystem?", then the output of df will give you a measure, even if it can be off by up to a factor of two -- you can still use it for predictive planning, as it will be near zero when the FS runs out of space.

Maybe look at it from the point of view of the sysadmin and think about the questions he might want to ask:

a) how much space would I reclaim if I deleted snapshot X?
b) how much space would I reclaim if I deleted all snapshots?
c) how much space would I need if I start making 4 snapshots a day and keeping them for 48 hours?

(Ganglia also sums data across the enterprise, so if such metrics are implemented, the admin can quickly see the sum of all snapshot usage on his grid/cluster.)

And also:

d) what metrics would be useful for someone developing or testing a solution involving btrfs?

The Ganglia framework uses rrdtool to graph metric data and present it alongside other system data (e.g. to see disk IO rates, CPU load and cache activity all on the same graph), so it could provide some useful insights into any performance testing of btrfs. Even better, with rrdtool you can overlay a btrfs metric on the same graph as a system metric (e.g. IO request sizes).

If you really want to, you can get your hands into the filesystem structures on a read-only (and non-locked) basis using the BTRFS_IOC_TREE_SEARCH ioctl; the FS structures are documented at [1].
However, that's generally going to be pretty ugly, and most likely pretty slow for many operations at the subvolume level. If you want anything on a per-subvolume basis, you'll have to wait for Arne to finish the work on quotas.

Initially, I could start with some simple metric (even just retrieving the btrfs UUID) as a proof of concept, and then add more later -- for example, once Arne's quota work is in stable form.

-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Checksums when converting from ext[234] to Btrfs
Hi,

Could someone clarify whether the btrfs-convert tool creates checksums for the blocks of the existing ext[234] filesystem? Also, is there any experience of how the size and the utilization of the filesystem (used vs. total disk space) affect conversion time?

Thanks, Gábor
Re: [PATCH v2] xfstests: new check 278 to ensure btrfs backref integrity
On Thu, Dec 22, 2011 at 12:08:58PM +0100, Jan Schmidt wrote:
> This is a btrfs specific scratch test checking the backref walker. It
> creates a file system with compressed and uncompressed data extents,
> picks files randomly and uses filefrag to get their extents. It then
> asks the btrfs utility (inspect-internal) to do the backref resolving
> from fs-logical address (the one filefrag calls physical) back to the
> inode number and file-logical offset, verifying the result.

I was about to apply this, but for some reason it fails for me when running xfstests on xfs:

276	[failed, exit status 1] - output mismatch (see 276.out.bad)
--- 276.out	2012-01-04 16:14:36.0 +
+++ 276.out.bad	2012-01-04 16:32:26.0 +
@@ -1,4 +1,5 @@
 QA output created by 276
-*** test backref walking
-*** done
+common.rc: Error: $TEST_DEV (/dev/vdb1) is not a MOUNTED btrfs filesystem
+Filesystem    Type 1K-blocks  Used Available Use% Mounted on
+/dev/vdb1     xfs   39042944 32928  39010016   1% /mnt/test
 *** unmount

which is a bit confusing
Re: [PATCH v2] xfstests: new check 278 to ensure btrfs backref integrity
On 1/4/12 10:39 AM, Christoph Hellwig wrote:
> On Thu, Dec 22, 2011 at 12:08:58PM +0100, Jan Schmidt wrote:
>> This is a btrfs specific scratch test checking the backref walker. It
>> creates a file system with compressed and uncompressed data extents,
>> picks files randomly and uses filefrag to get their extents. It then
>> asks the btrfs utility (inspect-internal) to do the backref resolving
>> from fs-logical address (the one filefrag calls physical) back to the
>> inode number and file-logical offset, verifying the result.
>
> I was about to apply this, but for some reason it fails for me when
> running xfstests on xfs:
>
> 276	[failed, exit status 1] - output mismatch (see 276.out.bad)
> --- 276.out	2012-01-04 16:14:36.0 +
> +++ 276.out.bad	2012-01-04 16:32:26.0 +
> @@ -1,4 +1,5 @@
>  QA output created by 276
> -*** test backref walking
> -*** done
> +common.rc: Error: $TEST_DEV (/dev/vdb1) is not a MOUNTED btrfs filesystem
> +Filesystem    Type 1K-blocks  Used Available Use% Mounted on
> +/dev/vdb1     xfs   39042944 32928  39010016   1% /mnt/test
>  *** unmount
>
> which is a bit confusing

276 got merged on Dec 28, before my requests for fixup, I guess? And it explicitly sets FSTYP=btrfs, which is why it fails. The 278 patch v2 in this thread works fine for me, so munging the 278 patch here into the existing 276 should be the right approach.

-Eric
Re: [PATCH v2] xfstests: new check 278 to ensure btrfs backref integrity
On 04.01.2012 18:01, Eric Sandeen wrote:
> 276 got merged on Dec 28, before my requests for fixup, I guess? And it
> explicitly sets FSTYP=btrfs, which is why it fails. The 278 patch v2 in
> this thread works fine for me, so munging the 278 patch here into the
> existing 276 should be the right approach.

Yeah, we figured that out on IRC some minutes ago :-) I'm currently building v2 as an incremental patch to 276 (without the rename to 278) and will send it as [PATCH] xfstests: fixup check 276 soon.

-Jan
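For reference, the usual way an xfstests case avoids running on the wrong filesystem is a guard on $FSTYP early in the script. This is only a hedged sketch of that pattern -- real tests use the _supported_fs and _notrun helpers that xfstests ships in common.rc rather than defining their own:

```shell
#!/bin/sh
# Sketch of the xfstests per-fs guard. The helper names mirror the
# common.rc convention; the bodies here are simplified stand-ins.

_notrun()
{
    echo "$*" >&2
    exit 0              # "not run" is a skip, not a failure
}

_supported_fs()
{
    for f in "$@"; do
        [ "$f" = "$FSTYP" ] && return 0
    done
    _notrun "not suitable for this filesystem type: $FSTYP"
}

FSTYP=${FSTYP:-btrfs}   # xfstests sets this from its config
_supported_fs btrfs     # skip cleanly everywhere except btrfs
echo "running btrfs-specific checks on $FSTYP"
```

With such a guard, running the btrfs-only case against an xfs $TEST_DEV produces a "not run" instead of the output mismatch seen above.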
[PATCH v2 00/10] Btrfs: backref walking rewrite
This patch series is a major rewrite of the backref walking code. The patch series Arne sent some weeks ago for quota groups had a very interesting function, find_all_roots. I took it from him, together with the bits needed for find_all_roots to work, and replaced a major part of the code in backref.c with it.

It can be pulled from
  git://git.jan-o-sch.net/btrfs-unstable for-chris
There's also a gitweb for that repo at
  http://git.jan-o-sch.net/?p=btrfs-unstable

My old backref code had several problems:
- it relied on a consistent state of the trees in memory
- it ignored delayed refs
- it only featured rudimentary locking
- it could miss some references, depending on the tree layout

The biggest advantage is that we're now able to do reliable backref resolving, even on busy file systems. So we've got benefits for:
- the existing btrfs inspect-internal commands
- the aforementioned qgroups (patches on the list)
- btrfs send (currently in development)
- snapshot-aware defrag
- ... possibly more to come

Splitting the needed bits out of Arne's code was quite an intrusive operation. In case this goes into 3.3, one of us will soon make a rebased version of the qgroup patch set.

Things corrected/changed in Arne's code along the way:
- don't assume INODE_ITEMs and the corresponding EXTENT_DATA items are in the same leaf (use the correct EXTENT_DATA_KEY for tree searches)
- don't assume all EXTENT_DATA items with the same backref for the same inode are in the same leaf (__resolve_indirect_refs can now add more refs)
- added missing key and level to prelim lists for shared block refs
- delayed ref sequence locking ability without wasting sequence numbers
- waitqueue instead of busy waiting for more delayed refs

As this touches a critical part of the file system, I also did some speed benchmarks. It turns out that dbench shows no performance decrease on my hardware. I can do more tests if desired.
By the way: this patch series fixes xfstest 278 (to be published soon) :-)

-Jan

---
Changelog v1->v2:
- nested locking is now allowed implicitly, not only when path->nested is set
- force-pushed new version to the mentioned git repo
---

Arne Jansen (6):
  Btrfs: generic data structure to build unique lists
  Btrfs: mark delayed refs as for cow
  Btrfs: always save ref_root in delayed refs
  Btrfs: add nested locking mode for paths
  Btrfs: add sequence numbers to delayed refs
  Btrfs: put back delayed refs that are too new

Jan Schmidt (4):
  Btrfs: added helper btrfs_next_item()
  Btrfs: add waitqueue instead of doing busy waiting for more delayed refs
  Btrfs: added btrfs_find_all_roots()
  Btrfs: new backref walking code

 fs/btrfs/Makefile      |    2 +-
 fs/btrfs/backref.c     | 1131 +---
 fs/btrfs/backref.h     |    5 +
 fs/btrfs/ctree.c       |   42 +-
 fs/btrfs/ctree.h       |   24 +-
 fs/btrfs/delayed-ref.c |  153 +--
 fs/btrfs/delayed-ref.h |  104 -
 fs/btrfs/disk-io.c     |    3 +-
 fs/btrfs/extent-tree.c |  187 ++--
 fs/btrfs/extent_io.c   |    1 +
 fs/btrfs/extent_io.h   |    2 +
 fs/btrfs/file.c        |   10 +-
 fs/btrfs/inode.c       |    2 +-
 fs/btrfs/ioctl.c       |   13 +-
 fs/btrfs/locking.c     |   53 +++-
 fs/btrfs/relocation.c  |   18 +-
 fs/btrfs/scrub.c       |    7 +-
 fs/btrfs/transaction.c |    9 +-
 fs/btrfs/tree-log.c    |    2 +-
 fs/btrfs/ulist.c       |  220 ++
 fs/btrfs/ulist.h       |   68 +++
 21 files changed, 1638 insertions(+), 418 deletions(-)
 create mode 100644 fs/btrfs/ulist.c
 create mode 100644 fs/btrfs/ulist.h

-- 
1.7.3.4
[PATCH v2 02/10] Btrfs: added helper btrfs_next_item()
btrfs_next_item() makes the btrfs path point to the next item, crossing leaf boundaries if needed.

Signed-off-by: Arne Jansen sensi...@gmx.net
Signed-off-by: Jan Schmidt list.bt...@jan-o-sch.net
---
 fs/btrfs/ctree.h |    7 +++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 50634abe..3e4a07b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2482,6 +2482,13 @@ static inline int btrfs_insert_empty_item(struct btrfs_trans_handle *trans,
 }
 int btrfs_next_leaf(struct btrfs_root *root, struct btrfs_path *path);
+static inline int btrfs_next_item(struct btrfs_root *root, struct btrfs_path *p)
+{
+	++p->slots[0];
+	if (p->slots[0] >= btrfs_header_nritems(p->nodes[0]))
+		return btrfs_next_leaf(root, p);
+	return 0;
+}
 int btrfs_prev_leaf(struct btrfs_root *root, struct btrfs_path *path);
 int btrfs_leaf_free_space(struct btrfs_root *root, struct extent_buffer *leaf);
 void btrfs_drop_snapshot(struct btrfs_root *root,
-- 
1.7.3.4
[PATCH v2 09/10] Btrfs: added btrfs_find_all_roots()
This function gets a byte number (a data extent), collects all the leafs pointing to it and walks up the trees to find all fs roots pointing to those leafs. It also returns the list of all leafs pointing to that extent. It does proper locking for the involved trees, can be used on busy file systems and honors delayed refs. Signed-off-by: Arne Jansen sensi...@gmx.net Signed-off-by: Jan Schmidt list.bt...@jan-o-sch.net --- fs/btrfs/backref.c | 783 fs/btrfs/backref.h |5 + 2 files changed, 788 insertions(+), 0 deletions(-) diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c index 22c64ff..03c30a1 100644 --- a/fs/btrfs/backref.c +++ b/fs/btrfs/backref.c @@ -19,6 +19,9 @@ #include ctree.h #include disk-io.h #include backref.h +#include ulist.h +#include transaction.h +#include delayed-ref.h struct __data_ref { struct list_head list; @@ -32,6 +35,786 @@ struct __shared_ref { u64 disk_byte; }; +/* + * this structure records all encountered refs on the way up to the root + */ +struct __prelim_ref { + struct list_head list; + u64 root_id; + struct btrfs_key key; + int level; + int count; + u64 parent; + u64 wanted_disk_byte; +}; + +static int __add_prelim_ref(struct list_head *head, u64 root_id, + struct btrfs_key *key, int level, u64 parent, + u64 wanted_disk_byte, int count) +{ + struct __prelim_ref *ref; + + /* in case we're adding delayed refs, we're holding the refs spinlock */ + ref = kmalloc(sizeof(*ref), GFP_ATOMIC); + if (!ref) + return -ENOMEM; + + ref-root_id = root_id; + if (key) + ref-key = *key; + else + memset(ref-key, 0, sizeof(ref-key)); + + ref-level = level; + ref-count = count; + ref-parent = parent; + ref-wanted_disk_byte = wanted_disk_byte; + list_add_tail(ref-list, head); + + return 0; +} + +static int add_all_parents(struct btrfs_root *root, struct btrfs_path *path, + struct ulist *parents, + struct extent_buffer *eb, int level, + u64 wanted_objectid, u64 wanted_disk_byte) +{ + int ret; + int slot; + struct btrfs_file_extent_item *fi; + struct 
btrfs_key key; + u64 disk_byte; + +add_parent: + ret = ulist_add(parents, eb-start, 0, GFP_NOFS); + if (ret 0) + return ret; + + if (level != 0) + return 0; + + /* +* if the current leaf is full with EXTENT_DATA items, we must +* check the next one if that holds a reference as well. +* ref-count cannot be used to skip this check. +* repeat this until we don't find any additional EXTENT_DATA items. +*/ + while (1) { + ret = btrfs_next_leaf(root, path); + if (ret 0) + return ret; + if (ret) + return 0; + + eb = path-nodes[0]; + for (slot = 0; slot btrfs_header_nritems(eb); ++slot) { + btrfs_item_key_to_cpu(eb, key, slot); + if (key.objectid != wanted_objectid || + key.type != BTRFS_EXTENT_DATA_KEY) + return 0; + fi = btrfs_item_ptr(eb, slot, + struct btrfs_file_extent_item); + disk_byte = btrfs_file_extent_disk_bytenr(eb, fi); + if (disk_byte == wanted_disk_byte) + goto add_parent; + } + } + + return 0; +} + +/* + * resolve an indirect backref in the form (root_id, key, level) + * to a logical address + */ +static int __resolve_indirect_ref(struct btrfs_fs_info *fs_info, + struct __prelim_ref *ref, + struct ulist *parents) +{ + struct btrfs_path *path; + struct btrfs_root *root; + struct btrfs_key root_key; + struct btrfs_key key = {0}; + struct extent_buffer *eb; + int ret = 0; + int root_level; + int level = ref-level; + + path = btrfs_alloc_path(); + if (!path) + return -ENOMEM; + + root_key.objectid = ref-root_id; + root_key.type = BTRFS_ROOT_ITEM_KEY; + root_key.offset = (u64)-1; + root = btrfs_read_fs_root_no_name(fs_info, root_key); + if (IS_ERR(root)) { + ret = PTR_ERR(root); + goto out; + } + + rcu_read_lock(); + root_level = btrfs_header_level(root-node); + rcu_read_unlock(); + + if (root_level + 1 == level) + goto out; + + path-lowest_level = level; + ret = btrfs_search_slot(NULL, root, ref-key,
[PATCH v2 03/10] Btrfs: mark delayed refs as for cow
From: Arne Jansen sensi...@gmx.net Add a for_cow parameter to add_delayed_*_ref and pass the appropriate value from every call site. The for_cow parameter will later on be used to determine if a ref will change anything with respect to qgroups. Delayed refs coming from relocation are always counted as for_cow, as they don't change subvol quota. Also pass in the fs_info for later use. btrfs_find_all_roots() will use this as an optimization, as changes that are for_cow will not change anything with respect to which root points to a certain leaf. Thus, we don't need to add the current sequence number to those delayed refs. Signed-off-by: Arne Jansen sensi...@gmx.net Signed-off-by: Jan Schmidt list.bt...@jan-o-sch.net --- fs/btrfs/ctree.c | 42 ++-- fs/btrfs/ctree.h | 17 fs/btrfs/delayed-ref.c | 50 +++- fs/btrfs/delayed-ref.h | 15 +-- fs/btrfs/disk-io.c |3 +- fs/btrfs/extent-tree.c | 101 --- fs/btrfs/file.c| 10 ++-- fs/btrfs/inode.c |2 +- fs/btrfs/ioctl.c |5 +- fs/btrfs/relocation.c | 18 + fs/btrfs/transaction.c |4 +- fs/btrfs/tree-log.c|2 +- 12 files changed, 155 insertions(+), 114 deletions(-) diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c index dede441..0639a55 100644 --- a/fs/btrfs/ctree.c +++ b/fs/btrfs/ctree.c @@ -240,7 +240,7 @@ int btrfs_copy_root(struct btrfs_trans_handle *trans, cow = btrfs_alloc_free_block(trans, root, buf-len, 0, new_root_objectid, disk_key, level, -buf-start, 0); +buf-start, 0, 1); if (IS_ERR(cow)) return PTR_ERR(cow); @@ -261,9 +261,9 @@ int btrfs_copy_root(struct btrfs_trans_handle *trans, WARN_ON(btrfs_header_generation(buf) trans-transid); if (new_root_objectid == BTRFS_TREE_RELOC_OBJECTID) - ret = btrfs_inc_ref(trans, root, cow, 1); + ret = btrfs_inc_ref(trans, root, cow, 1, 1); else - ret = btrfs_inc_ref(trans, root, cow, 0); + ret = btrfs_inc_ref(trans, root, cow, 0, 1); if (ret) return ret; @@ -350,14 +350,14 @@ static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans, if ((owner == root-root_key.objectid || 
root-root_key.objectid == BTRFS_TREE_RELOC_OBJECTID) !(flags BTRFS_BLOCK_FLAG_FULL_BACKREF)) { - ret = btrfs_inc_ref(trans, root, buf, 1); + ret = btrfs_inc_ref(trans, root, buf, 1, 1); BUG_ON(ret); if (root-root_key.objectid == BTRFS_TREE_RELOC_OBJECTID) { - ret = btrfs_dec_ref(trans, root, buf, 0); + ret = btrfs_dec_ref(trans, root, buf, 0, 1); BUG_ON(ret); - ret = btrfs_inc_ref(trans, root, cow, 1); + ret = btrfs_inc_ref(trans, root, cow, 1, 1); BUG_ON(ret); } new_flags |= BTRFS_BLOCK_FLAG_FULL_BACKREF; @@ -365,9 +365,9 @@ static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans, if (root-root_key.objectid == BTRFS_TREE_RELOC_OBJECTID) - ret = btrfs_inc_ref(trans, root, cow, 1); + ret = btrfs_inc_ref(trans, root, cow, 1, 1); else - ret = btrfs_inc_ref(trans, root, cow, 0); + ret = btrfs_inc_ref(trans, root, cow, 0, 1); BUG_ON(ret); } if (new_flags != 0) { @@ -381,11 +381,11 @@ static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans, if (flags BTRFS_BLOCK_FLAG_FULL_BACKREF) { if (root-root_key.objectid == BTRFS_TREE_RELOC_OBJECTID) - ret = btrfs_inc_ref(trans, root, cow, 1); + ret = btrfs_inc_ref(trans, root, cow, 1, 1); else - ret = btrfs_inc_ref(trans, root, cow, 0); + ret = btrfs_inc_ref(trans, root, cow, 0, 1); BUG_ON(ret); - ret = btrfs_dec_ref(trans, root, buf, 1); + ret = btrfs_dec_ref(trans, root, buf, 1, 1); BUG_ON(ret); } clean_tree_block(trans, root, buf); @@ -446,7 +446,7 @@ static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans, cow =
[PATCH v2 04/10] Btrfs: always save ref_root in delayed refs
From: Arne Jansen sensi...@gmx.net

For consistent backref walking and (later) qgroup calculation, the information to which root a delayed ref belongs is useful even for shared refs.

Signed-off-by: Arne Jansen sensi...@gmx.net
Signed-off-by: Jan Schmidt list.bt...@jan-o-sch.net
---
 fs/btrfs/delayed-ref.c |   18 --
 fs/btrfs/delayed-ref.h |   12 
 2 files changed, 12 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 3a0f0ab..babd37b 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -495,13 +495,12 @@ static noinline int add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
 	ref->in_tree = 1;
 
 	full_ref = btrfs_delayed_node_to_tree_ref(ref);
-	if (parent) {
-		full_ref->parent = parent;
+	full_ref->parent = parent;
+	full_ref->root = ref_root;
+	if (parent)
 		ref->type = BTRFS_SHARED_BLOCK_REF_KEY;
-	} else {
-		full_ref->root = ref_root;
+	else
 		ref->type = BTRFS_TREE_BLOCK_REF_KEY;
-	}
 	full_ref->level = level;
 
 	trace_btrfs_delayed_tree_ref(ref, full_ref, action);
@@ -551,13 +550,12 @@ static noinline int add_delayed_data_ref(struct btrfs_fs_info *fs_info,
 	ref->in_tree = 1;
 
 	full_ref = btrfs_delayed_node_to_data_ref(ref);
-	if (parent) {
-		full_ref->parent = parent;
+	full_ref->parent = parent;
+	full_ref->root = ref_root;
+	if (parent)
 		ref->type = BTRFS_SHARED_DATA_REF_KEY;
-	} else {
-		full_ref->root = ref_root;
+	else
 		ref->type = BTRFS_EXTENT_DATA_REF_KEY;
-	}
 	full_ref->objectid = owner;
 	full_ref->offset = offset;
 
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 8316bff..a5fb2bc 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -98,19 +98,15 @@ struct btrfs_delayed_ref_head {
 
 struct btrfs_delayed_tree_ref {
 	struct btrfs_delayed_ref_node node;
-	union {
-		u64 root;
-		u64 parent;
-	};
+	u64 root;
+	u64 parent;
 	int level;
 };
 
 struct btrfs_delayed_data_ref {
 	struct btrfs_delayed_ref_node node;
-	union {
-		u64 root;
-		u64 parent;
-	};
+	u64 root;
+	u64 parent;
 	u64 objectid;
 	u64 offset;
 };
-- 
1.7.3.4

-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 10/10] Btrfs: new backref walking code
The old backref iteration code could only safely be used on commit roots. Besides this limitation, it had bugs in finding the roots for these references. This commit replaces large parts of it by btrfs_find_all_roots() which a) really finds all roots and the correct roots, b) works correctly under heavy file system load, c) considers delayed refs. Signed-off-by: Jan Schmidt list.bt...@jan-o-sch.net --- fs/btrfs/backref.c | 354 +++- fs/btrfs/ioctl.c |8 +- fs/btrfs/scrub.c |7 +- 3 files changed, 107 insertions(+), 262 deletions(-) diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c index 03c30a1..c4c3d5d 100644 --- a/fs/btrfs/backref.c +++ b/fs/btrfs/backref.c @@ -23,18 +23,6 @@ #include transaction.h #include delayed-ref.h -struct __data_ref { - struct list_head list; - u64 inum; - u64 root; - u64 extent_data_item_offset; -}; - -struct __shared_ref { - struct list_head list; - u64 disk_byte; -}; - /* * this structure records all encountered refs on the way up to the root */ @@ -964,8 +952,11 @@ int extent_from_logical(struct btrfs_fs_info *fs_info, u64 logical, btrfs_item_key_to_cpu(path-nodes[0], found_key, path-slots[0]); if (found_key-type != BTRFS_EXTENT_ITEM_KEY || found_key-objectid logical || - found_key-objectid + found_key-offset = logical) + found_key-objectid + found_key-offset = logical) { + pr_debug(logical %llu is not within any extent\n, +(unsigned long long)logical); return -ENOENT; + } eb = path-nodes[0]; item_size = btrfs_item_size_nr(eb, path-slots[0]); @@ -974,6 +965,13 @@ int extent_from_logical(struct btrfs_fs_info *fs_info, u64 logical, ei = btrfs_item_ptr(eb, path-slots[0], struct btrfs_extent_item); flags = btrfs_extent_flags(eb, ei); + pr_debug(logical %llu is at position %llu within the extent (%llu +EXTENT_ITEM %llu) flags %#llx size %u\n, +(unsigned long long)logical, +(unsigned long long)(logical - found_key-objectid), +(unsigned long long)found_key-objectid, +(unsigned long long)found_key-offset, +(unsigned long long)flags, 
item_size); if (flags BTRFS_EXTENT_FLAG_TREE_BLOCK) return BTRFS_EXTENT_FLAG_TREE_BLOCK; if (flags BTRFS_EXTENT_FLAG_DATA) @@ -1070,128 +1068,11 @@ int tree_backref_for_extent(unsigned long *ptr, struct extent_buffer *eb, return 0; } -static int __data_list_add(struct list_head *head, u64 inum, - u64 extent_data_item_offset, u64 root) -{ - struct __data_ref *ref; - - ref = kmalloc(sizeof(*ref), GFP_NOFS); - if (!ref) - return -ENOMEM; - - ref-inum = inum; - ref-extent_data_item_offset = extent_data_item_offset; - ref-root = root; - list_add_tail(ref-list, head); - - return 0; -} - -static int __data_list_add_eb(struct list_head *head, struct extent_buffer *eb, - struct btrfs_extent_data_ref *dref) -{ - return __data_list_add(head, btrfs_extent_data_ref_objectid(eb, dref), - btrfs_extent_data_ref_offset(eb, dref), - btrfs_extent_data_ref_root(eb, dref)); -} - -static int __shared_list_add(struct list_head *head, u64 disk_byte) -{ - struct __shared_ref *ref; - - ref = kmalloc(sizeof(*ref), GFP_NOFS); - if (!ref) - return -ENOMEM; - - ref-disk_byte = disk_byte; - list_add_tail(ref-list, head); - - return 0; -} - -static int __iter_shared_inline_ref_inodes(struct btrfs_fs_info *fs_info, - u64 logical, u64 inum, - u64 extent_data_item_offset, - u64 extent_offset, - struct btrfs_path *path, - struct list_head *data_refs, - iterate_extent_inodes_t *iterate, - void *ctx) -{ - u64 ref_root; - u32 item_size; - struct btrfs_key key; - struct extent_buffer *eb; - struct btrfs_extent_item *ei; - struct btrfs_extent_inline_ref *eiref; - struct __data_ref *ref; - int ret; - int type; - int last; - unsigned long ptr = 0; - - WARN_ON(!list_empty(data_refs)); - ret = extent_from_logical(fs_info, logical, path, key); - if (ret BTRFS_EXTENT_FLAG_DATA) - ret = -EIO; - if (ret 0) - goto out; - - eb = path-nodes[0]; - ei = btrfs_item_ptr(eb, path-slots[0], struct btrfs_extent_item); - item_size = btrfs_item_size_nr(eb,
[PATCH v2 06/10] Btrfs: add sequence numbers to delayed refs
From: Arne Jansen sensi...@gmx.net Sequence numbers are needed to reconstruct the backrefs of a given extent to a certain point in time. The total set of backrefs consist of the set of backrefs recorded on disk plus the enqueued delayed refs for it that existed at that moment. This patch also adds a list that records all delayed refs which are currently in the process of being added. When walking all refs of an extent in btrfs_find_all_roots(), we freeze the current state of delayed refs, honor anythinh up to this point and prevent processing newer delayed refs to assert consistency. Signed-off-by: Arne Jansen sensi...@gmx.net Signed-off-by: Jan Schmidt list.bt...@jan-o-sch.net --- fs/btrfs/delayed-ref.c | 34 +++ fs/btrfs/delayed-ref.h | 70 fs/btrfs/transaction.c |4 +++ 3 files changed, 108 insertions(+), 0 deletions(-) diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index babd37b..a405db0 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -101,6 +101,11 @@ static int comp_entry(struct btrfs_delayed_ref_node *ref2, return -1; if (ref1-type ref2-type) return 1; + /* merging of sequenced refs is not allowed */ + if (ref1-seq ref2-seq) + return -1; + if (ref1-seq ref2-seq) + return 1; if (ref1-type == BTRFS_TREE_BLOCK_REF_KEY || ref1-type == BTRFS_SHARED_BLOCK_REF_KEY) { return comp_tree_refs(btrfs_delayed_node_to_tree_ref(ref2), @@ -209,6 +214,24 @@ int btrfs_delayed_ref_lock(struct btrfs_trans_handle *trans, return 0; } +int btrfs_check_delayed_seq(struct btrfs_delayed_ref_root *delayed_refs, + u64 seq) +{ + struct seq_list *elem; + + assert_spin_locked(delayed_refs-lock); + if (list_empty(delayed_refs-seq_head)) + return 0; + + elem = list_first_entry(delayed_refs-seq_head, struct seq_list, list); + if (seq = elem-seq) { + pr_debug(holding back delayed_ref %llu, lowest is %llu (%p)\n, +seq, elem-seq, delayed_refs); + return 1; + } + return 0; +} + int btrfs_find_ref_cluster(struct btrfs_trans_handle *trans, struct list_head *cluster, 
u64 start) { @@ -438,6 +461,7 @@ static noinline int add_delayed_ref_head(struct btrfs_fs_info *fs_info, ref-action = 0; ref-is_head = 1; ref-in_tree = 1; + ref-seq = 0; head_ref = btrfs_delayed_node_to_head(ref); head_ref-must_insert_reserved = must_insert_reserved; @@ -479,6 +503,7 @@ static noinline int add_delayed_tree_ref(struct btrfs_fs_info *fs_info, struct btrfs_delayed_ref_node *existing; struct btrfs_delayed_tree_ref *full_ref; struct btrfs_delayed_ref_root *delayed_refs; + u64 seq = 0; if (action == BTRFS_ADD_DELAYED_EXTENT) action = BTRFS_ADD_DELAYED_REF; @@ -494,6 +519,10 @@ static noinline int add_delayed_tree_ref(struct btrfs_fs_info *fs_info, ref-is_head = 0; ref-in_tree = 1; + if (need_ref_seq(for_cow, ref_root)) + seq = inc_delayed_seq(delayed_refs); + ref-seq = seq; + full_ref = btrfs_delayed_node_to_tree_ref(ref); full_ref-parent = parent; full_ref-root = ref_root; @@ -534,6 +563,7 @@ static noinline int add_delayed_data_ref(struct btrfs_fs_info *fs_info, struct btrfs_delayed_ref_node *existing; struct btrfs_delayed_data_ref *full_ref; struct btrfs_delayed_ref_root *delayed_refs; + u64 seq = 0; if (action == BTRFS_ADD_DELAYED_EXTENT) action = BTRFS_ADD_DELAYED_REF; @@ -549,6 +579,10 @@ static noinline int add_delayed_data_ref(struct btrfs_fs_info *fs_info, ref-is_head = 0; ref-in_tree = 1; + if (need_ref_seq(for_cow, ref_root)) + seq = inc_delayed_seq(delayed_refs); + ref-seq = seq; + full_ref = btrfs_delayed_node_to_data_ref(ref); full_ref-parent = parent; full_ref-root = ref_root; diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h index a5fb2bc..174416f 100644 --- a/fs/btrfs/delayed-ref.h +++ b/fs/btrfs/delayed-ref.h @@ -33,6 +33,9 @@ struct btrfs_delayed_ref_node { /* the size of the extent */ u64 num_bytes; + /* seq number to keep track of insertion order */ + u64 seq; + /* ref count on this data structure */ atomic_t refs; @@ -136,6 +139,20 @@ struct btrfs_delayed_ref_root { int flushing; u64 run_delayed_start; + + /* +* seq 
number of delayed refs. We need to know if a backref was being +* added before the currently processed ref or afterwards. +*/ + u64 seq; + + /* +* seq_list holds a list of all seq numbers that are
[PATCH v2 05/10] Btrfs: add nested locking mode for paths
From: Arne Jansen sensi...@gmx.net This patch adds the possibilty to read-lock an extent even if it is already write-locked from the same thread. btrfs_find_all_roots() needs this capability. Signed-off-by: Arne Jansen sensi...@gmx.net Signed-off-by: Jan Schmidt list.bt...@jan-o-sch.net --- fs/btrfs/extent_io.c |1 + fs/btrfs/extent_io.h |2 + fs/btrfs/locking.c | 53 - 3 files changed, 54 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index be1bf62..dd8d140 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -3571,6 +3571,7 @@ static struct extent_buffer *__alloc_extent_buffer(struct extent_io_tree *tree, atomic_set(eb-blocking_writers, 0); atomic_set(eb-spinning_readers, 0); atomic_set(eb-spinning_writers, 0); + eb-lock_nested = 0; init_waitqueue_head(eb-write_lock_wq); init_waitqueue_head(eb-read_lock_wq); diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h index 7604c30..bc6a042cb 100644 --- a/fs/btrfs/extent_io.h +++ b/fs/btrfs/extent_io.h @@ -129,6 +129,7 @@ struct extent_buffer { struct list_head leak_list; struct rcu_head rcu_head; atomic_t refs; + pid_t lock_owner; /* count of read lock holders on the extent buffer */ atomic_t write_locks; @@ -137,6 +138,7 @@ struct extent_buffer { atomic_t blocking_readers; atomic_t spinning_readers; atomic_t spinning_writers; + int lock_nested; /* protects write locks */ rwlock_t lock; diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c index d77b67c..5e178d8 100644 --- a/fs/btrfs/locking.c +++ b/fs/btrfs/locking.c @@ -33,6 +33,14 @@ void btrfs_assert_tree_read_locked(struct extent_buffer *eb); */ void btrfs_set_lock_blocking_rw(struct extent_buffer *eb, int rw) { + if (eb-lock_nested) { + read_lock(eb-lock); + if (eb-lock_nested current-pid == eb-lock_owner) { + read_unlock(eb-lock); + return; + } + read_unlock(eb-lock); + } if (rw == BTRFS_WRITE_LOCK) { if (atomic_read(eb-blocking_writers) == 0) { WARN_ON(atomic_read(eb-spinning_writers) != 1); @@ -57,6 +65,14 @@ 
void btrfs_set_lock_blocking_rw(struct extent_buffer *eb, int rw) */ void btrfs_clear_lock_blocking_rw(struct extent_buffer *eb, int rw) { + if (eb-lock_nested) { + read_lock(eb-lock); + if (eb-lock_nested current-pid == eb-lock_owner) { + read_unlock(eb-lock); + return; + } + read_unlock(eb-lock); + } if (rw == BTRFS_WRITE_LOCK_BLOCKING) { BUG_ON(atomic_read(eb-blocking_writers) != 1); write_lock(eb-lock); @@ -81,12 +97,25 @@ void btrfs_clear_lock_blocking_rw(struct extent_buffer *eb, int rw) void btrfs_tree_read_lock(struct extent_buffer *eb) { again: + read_lock(eb-lock); + if (atomic_read(eb-blocking_writers) + current-pid == eb-lock_owner) { + /* +* This extent is already write-locked by our thread. We allow +* an additional read lock to be added because it's for the same +* thread. btrfs_find_all_roots() depends on this as it may be +* called on a partly (write-)locked tree. +*/ + BUG_ON(eb-lock_nested); + eb-lock_nested = 1; + read_unlock(eb-lock); + return; + } + read_unlock(eb-lock); wait_event(eb-write_lock_wq, atomic_read(eb-blocking_writers) == 0); read_lock(eb-lock); if (atomic_read(eb-blocking_writers)) { read_unlock(eb-lock); - wait_event(eb-write_lock_wq, - atomic_read(eb-blocking_writers) == 0); goto again; } atomic_inc(eb-read_locks); @@ -129,6 +158,7 @@ int btrfs_try_tree_write_lock(struct extent_buffer *eb) } atomic_inc(eb-write_locks); atomic_inc(eb-spinning_writers); + eb-lock_owner = current-pid; return 1; } @@ -137,6 +167,15 @@ int btrfs_try_tree_write_lock(struct extent_buffer *eb) */ void btrfs_tree_read_unlock(struct extent_buffer *eb) { + if (eb-lock_nested) { + read_lock(eb-lock); + if (eb-lock_nested current-pid == eb-lock_owner) { + eb-lock_nested = 0; + read_unlock(eb-lock); + return; + } + read_unlock(eb-lock); + } btrfs_assert_tree_read_locked(eb); WARN_ON(atomic_read(eb-spinning_readers) == 0); atomic_dec(eb-spinning_readers); @@ -149,6 +188,15 @@ void btrfs_tree_read_unlock(struct
[PATCH v2 08/10] Btrfs: add waitqueue instead of doing busy waiting for more delayed refs
Now that we may be holding back delayed refs for a limited period, we might end up having no runnable delayed refs. Without this commit, we'd do busy waiting in that thread until another (runnable) ref arrives. Instead, we're detecting this situation and use a waitqueue, such that we only try to run more refs after a) another runnable ref was added or b) delayed refs are no longer held back Signed-off-by: Jan Schmidt list.bt...@jan-o-sch.net --- fs/btrfs/delayed-ref.c | 8 ++ fs/btrfs/delayed-ref.h | 7 + fs/btrfs/extent-tree.c | 59 +++- fs/btrfs/transaction.c | 1 + 4 files changed, 74 insertions(+), 1 deletions(-) diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index ee18198..66e4f29 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -664,6 +664,9 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info, num_bytes, parent, ref_root, level, action, for_cow); BUG_ON(ret); + if (!need_ref_seq(for_cow, ref_root) && + waitqueue_active(&delayed_refs->seq_wait)) + wake_up(&delayed_refs->seq_wait); spin_unlock(&delayed_refs->lock); return 0; } @@ -712,6 +715,9 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info, num_bytes, parent, ref_root, owner, offset, action, for_cow); BUG_ON(ret); + if (!need_ref_seq(for_cow, ref_root) && + waitqueue_active(&delayed_refs->seq_wait)) + wake_up(&delayed_refs->seq_wait); spin_unlock(&delayed_refs->lock); return 0; } @@ -739,6 +745,8 @@ int btrfs_add_delayed_extent_op(struct btrfs_fs_info *fs_info, extent_op->is_data); BUG_ON(ret); + if (waitqueue_active(&delayed_refs->seq_wait)) + wake_up(&delayed_refs->seq_wait); spin_unlock(&delayed_refs->lock); return 0; } diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h index 174416f..d8f244d 100644 --- a/fs/btrfs/delayed-ref.h +++ b/fs/btrfs/delayed-ref.h @@ -153,6 +153,12 @@ struct btrfs_delayed_ref_root { * as it might influence the outcome of the walk.
*/ struct list_head seq_head; + + /* + * when the only refs we have in the list must not be processed, we want + * to wait for more refs to show up or for the end of backref walking. + */ + wait_queue_head_t seq_wait; }; static inline void btrfs_put_delayed_ref(struct btrfs_delayed_ref_node *ref) @@ -216,6 +222,7 @@ btrfs_put_delayed_seq(struct btrfs_delayed_ref_root *delayed_refs, { spin_lock(&delayed_refs->lock); list_del(&elem->list); + wake_up(&delayed_refs->seq_wait); spin_unlock(&delayed_refs->lock); } diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index bbcca12..0a435e2 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2300,7 +2300,12 @@ static noinline int run_clustered_refs(struct btrfs_trans_handle *trans, ref->in_tree = 0; rb_erase(&ref->rb_node, &delayed_refs->root); delayed_refs->num_entries--; - + /* + * we modified num_entries, but as we're currently running + * delayed refs, skip + * wake_up(&delayed_refs->seq_wait); + * here. + */ spin_unlock(&delayed_refs->lock); ret = run_one_delayed_ref(trans, root, ref, extent_op, @@ -2317,6 +2322,23 @@ static noinline int run_clustered_refs(struct btrfs_trans_handle *trans, return count; } + +static void wait_for_more_refs(struct btrfs_delayed_ref_root *delayed_refs, + unsigned long num_refs) +{ + struct list_head *first_seq = delayed_refs->seq_head.next; + + spin_unlock(&delayed_refs->lock); + pr_debug("waiting for more refs (num %ld, first %p)\n", + num_refs, first_seq); + wait_event(delayed_refs->seq_wait, + num_refs != delayed_refs->num_entries || + delayed_refs->seq_head.next != first_seq); + pr_debug("done waiting for more refs (num %ld, first %p)\n", + delayed_refs->num_entries, delayed_refs->seq_head.next); + spin_lock(&delayed_refs->lock); +} + /* * this starts processing the delayed reference count updates and * extent insertions we have queued up so far.
count can be @@ -2332,8 +2354,11 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, struct btrfs_delayed_ref_node *ref; struct list_head cluster; int ret; + u64 delayed_start; int run_all = count == (unsigned long)-1; int run_most = 0; + unsigned long num_refs = 0; + int consider_waiting; if (root == root->fs_info->extent_root)
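The wait/wake pattern in this patch has a standard userspace analog using a condition variable. Here is a sketch with invented names (the kernel side uses wait_event()/wake_up() on seq_wait rather than pthreads):

```c
#include <pthread.h>

/*
 * Userspace analog of replacing busy waiting with a waitqueue: the caller
 * snapshots the number of delayed refs and sleeps on a condition variable
 * until that number changes, instead of spinning. Hypothetical names.
 */
struct refq {
	pthread_mutex_t lock;
	pthread_cond_t more_refs;	/* plays the role of seq_wait */
	unsigned long num_entries;
};

#define REFQ_INIT { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 }

/* block until num_entries differs from the caller's snapshot num_refs */
static void wait_for_more_refs(struct refq *q, unsigned long num_refs)
{
	pthread_mutex_lock(&q->lock);
	while (q->num_entries == num_refs)
		pthread_cond_wait(&q->more_refs, &q->lock);
	pthread_mutex_unlock(&q->lock);
}

/* add a ref and wake waiters, mirroring wake_up() after ref insertion */
static void add_ref(struct refq *q)
{
	pthread_mutex_lock(&q->lock);
	q->num_entries++;
	pthread_cond_broadcast(&q->more_refs);
	pthread_mutex_unlock(&q->lock);
}
```

As in the kernel patch, the waiter re-checks its condition under the lock, so a wakeup that races with another change is harmless.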
[PATCH v2 01/10] Btrfs: generic data structure to build unique lists
From: Arne Jansen sensi...@gmx.net ulist is a generic data structure to hold a collection of unique u64 values. The only operations it supports are adding to the list and enumerating it. It is possible to store an auxiliary value along with the key. The implementation is preliminary and can probably be sped up significantly. It is used by btrfs_find_all_roots() and quota to translate recursions into iterative loops. Signed-off-by: Arne Jansen sensi...@gmx.net Signed-off-by: Jan Schmidt list.bt...@jan-o-sch.net --- fs/btrfs/Makefile | 2 +- fs/btrfs/ulist.c | 220 + fs/btrfs/ulist.h | 68 3 files changed, 289 insertions(+), 1 deletions(-) diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile index c0ddfd2..7079840 100644 --- a/fs/btrfs/Makefile +++ b/fs/btrfs/Makefile @@ -8,6 +8,6 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \ extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \ export.o tree-log.o free-space-cache.o zlib.o lzo.o \ compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \ - reada.o backref.o + reada.o backref.o ulist.o btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o diff --git a/fs/btrfs/ulist.c b/fs/btrfs/ulist.c new file mode 100644 index 000..12f5147 --- /dev/null +++ b/fs/btrfs/ulist.c @@ -0,0 +1,220 @@ +/* + * Copyright (C) 2011 STRATO AG + * written by Arne Jansen sensi...@gmx.net + * Distributed under the GNU GPL license version 2. + */ + +#include <linux/slab.h> +#include <linux/module.h> +#include "ulist.h" + +/* + * ulist is a generic data structure to hold a collection of unique u64 + * values. The only operations it supports are adding to the list and + * enumerating it. + * It is possible to store an auxiliary value along with the key. + * + * The implementation is preliminary and can probably be sped up + * significantly. A first step would be to store the values in an rbtree + * as soon as ULIST_SIZE is exceeded.
+ * + * A sample usage for ulists is the enumeration of directed graphs without + * visiting a node twice. The pseudo-code could look like this: + * + * ulist = ulist_alloc(); + * ulist_add(ulist, root); + * elem = NULL; + * + * while ((elem = ulist_next(ulist, elem))) { + * for (all child nodes n in elem) + * ulist_add(ulist, n); + * do something useful with the node; + * } + * ulist_free(ulist); + * + * This assumes the graph nodes are addressable by u64. This stems from the + * usage for tree enumeration in btrfs, where the logical addresses are + * 64 bit. + * + * It is also useful for tree enumeration which could be done elegantly + * recursively, but is not possible due to kernel stack limitations. The + * loop would be similar to the above. + */ + +/** + * ulist_init - freshly initialize a ulist + * @ulist: the ulist to initialize + * + * Note: don't use this function to init an already used ulist, use + * ulist_reinit instead. + */ +void ulist_init(struct ulist *ulist) +{ + ulist->nnodes = 0; + ulist->nodes = ulist->int_nodes; + ulist->nodes_alloced = ULIST_SIZE; +} +EXPORT_SYMBOL(ulist_init); + +/** + * ulist_fini - free up additionally allocated memory for the ulist + * @ulist: the ulist from which to free the additional memory + * + * This is useful in cases where the base 'struct ulist' has been statically + * allocated. + */ +void ulist_fini(struct ulist *ulist) +{ + /* + * The first ULIST_SIZE elements are stored inline in struct ulist. + * Only if more elements are allocated they need to be freed. + */ + if (ulist->nodes_alloced > ULIST_SIZE) + kfree(ulist->nodes); + ulist->nodes_alloced = 0; /* in case ulist_fini is called twice */ +} +EXPORT_SYMBOL(ulist_fini); + +/** + * ulist_reinit - prepare a ulist for reuse + * @ulist: ulist to be reused + * + * Free up all additional memory allocated for the list elements and reinit + * the ulist.
+ */ +void ulist_reinit(struct ulist *ulist) +{ + ulist_fini(ulist); + ulist_init(ulist); +} +EXPORT_SYMBOL(ulist_reinit); + +/** + * ulist_alloc - dynamically allocate a ulist + * @gfp_mask: allocation flags to use for the base allocation + * + * The allocated ulist will be returned in an initialized state. + */ +struct ulist *ulist_alloc(unsigned long gfp_mask) +{ + struct ulist *ulist = kmalloc(sizeof(*ulist), gfp_mask); + + if (!ulist) + return NULL; + + ulist_init(ulist); + + return ulist; +} +EXPORT_SYMBOL(ulist_alloc); + +/** + * ulist_free - free dynamically allocated ulist + * @ulist: ulist to free + * + * It is not necessary to call ulist_fini before. + */ +void ulist_free(struct ulist *ulist) +{ + if (!ulist) + return; + ulist_fini(ulist); + kfree(ulist); } +EXPORT_SYMBOL(ulist_free); + +/** + * ulist_add - add an element to the ulist + * @ulist:
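For illustration, the unique-list pattern above can be reproduced in plain userspace C. This is not the kernel code — the names and the index-based iterator are invented; returning elements by value keeps enumeration safe even if an add reallocates the array:

```c
#include <stdlib.h>

typedef unsigned long long u64;

/* Minimal userspace sketch of the ulist idea: unique u64 keys, add +
 * enumerate only. O(n) add, like the preliminary kernel version. */
struct uniq_list {
	u64 *vals;
	size_t nvals, cap;
};

/* returns 1 if val was newly added, 0 if it was already present,
 * -1 on allocation failure */
static int uniq_add(struct uniq_list *ul, u64 val)
{
	size_t i;

	for (i = 0; i < ul->nvals; i++)
		if (ul->vals[i] == val)
			return 0;
	if (ul->nvals == ul->cap) {
		size_t ncap = ul->cap ? 2 * ul->cap : 16;
		u64 *nv = realloc(ul->vals, ncap * sizeof(*nv));
		if (!nv)
			return -1;
		ul->vals = nv;
		ul->cap = ncap;
	}
	ul->vals[ul->nvals++] = val;
	return 1;
}

/* copies the next element to *out; returns 0 at the end of the list.
 * *iter must start at 0; elements added during iteration are visited
 * too, which is what the graph-enumeration pseudo-code relies on. */
static int uniq_next(struct uniq_list *ul, size_t *iter, u64 *out)
{
	if (*iter >= ul->nvals)
		return 0;
	*out = ul->vals[(*iter)++];
	return 1;
}
```

The enumeration loop then matches the pseudo-code in the patch comment: seed the list with the root, and keep pulling elements while adding their children; duplicates are silently dropped, so every node is visited once.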
[PATCH v2 07/10] Btrfs: put back delayed refs that are too new
From: Arne Jansen sensi...@gmx.net When processing a delayed ref, first check if there are still old refs in the process of being added. If so, put this ref back to the tree. To avoid looping on this ref, choose a newer one in the next loop. btrfs_find_ref_cluster has to take care of that. Signed-off-by: Arne Jansen sensi...@gmx.net Signed-off-by: Jan Schmidt list.bt...@jan-o-sch.net --- fs/btrfs/delayed-ref.c | 43 +-- fs/btrfs/extent-tree.c | 27 ++- 2 files changed, 47 insertions(+), 23 deletions(-) diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index a405db0..ee18198 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -155,16 +155,22 @@ static struct btrfs_delayed_ref_node *tree_insert(struct rb_root *root, /* * find a head entry based on bytenr. This returns the delayed ref - * head if it was able to find one, or NULL if nothing was in that spot + * head if it was able to find one, or NULL if nothing was in that spot. + * If return_bigger is given, the next bigger entry is returned if no exact + * match is found.
*/ static struct btrfs_delayed_ref_node *find_ref_head(struct rb_root *root, u64 bytenr, - struct btrfs_delayed_ref_node **last) + struct btrfs_delayed_ref_node **last, + int return_bigger) { - struct rb_node *n = root->rb_node; + struct rb_node *n; struct btrfs_delayed_ref_node *entry; - int cmp; + int cmp = 0; +again: + n = root->rb_node; + entry = NULL; while (n) { entry = rb_entry(n, struct btrfs_delayed_ref_node, rb_node); WARN_ON(!entry->in_tree); @@ -187,6 +193,19 @@ static struct btrfs_delayed_ref_node *find_ref_head(struct rb_root *root, else return entry; } + if (entry && return_bigger) { + if (cmp > 0) { + n = rb_next(&entry->rb_node); + if (!n) + n = rb_first(root); + entry = rb_entry(n, struct btrfs_delayed_ref_node, + rb_node); + bytenr = entry->bytenr; + return_bigger = 0; + goto again; + } + return entry; + } return NULL; } @@ -246,20 +265,8 @@ int btrfs_find_ref_cluster(struct btrfs_trans_handle *trans, node = rb_first(&delayed_refs->root); } else { ref = NULL; - find_ref_head(&delayed_refs->root, start, &ref); + find_ref_head(&delayed_refs->root, start + 1, &ref, 1); if (ref) { - struct btrfs_delayed_ref_node *tmp; - - node = rb_prev(&ref->rb_node); - while (node) { - tmp = rb_entry(node, - struct btrfs_delayed_ref_node, - rb_node); - if (tmp->bytenr < start) - break; - ref = tmp; - node = rb_prev(&ref->rb_node); - } node = &ref->rb_node; } else node = rb_first(&delayed_refs->root); @@ -748,7 +755,7 @@ btrfs_find_delayed_ref_head(struct btrfs_trans_handle *trans, u64 bytenr) struct btrfs_delayed_ref_root *delayed_refs; delayed_refs = &trans->transaction->delayed_refs; - ref = find_ref_head(&delayed_refs->root, bytenr, NULL); + ref = find_ref_head(&delayed_refs->root, bytenr, NULL, 0); if (ref) return btrfs_delayed_node_to_head(ref); return NULL; diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index dc8b9a8..bbcca12 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2237,6 +2237,28 @@ static noinline int run_clustered_refs(struct btrfs_trans_handle
*trans, } /* + * locked_ref is the head node, so we have to go one + * node back for any delayed ref updates + */ + ref = select_delayed_ref(locked_ref); + + if (ref && ref->seq && + btrfs_check_delayed_seq(delayed_refs, ref->seq)) { + /* + * there are still refs with lower seq numbers in the + * process of being added. Don't run this ref yet. + */ + list_del_init(&locked_ref->cluster); + mutex_unlock(&locked_ref->mutex); + locked_ref = NULL; + delayed_refs->num_heads_ready++;
[PATCH] xfstests: fixup check 276
This commit fixes bd8ee45c. Changes: - added a _require_btrfs helper function - check for filefrag with _require_command - always use _fail in case of errors - added some comments - removed $fresh code - don't set FSTYP Signed-off-by: Jan Schmidt list.bt...@jan-o-sch.net --- 276 | 119 + common.rc | 12 ++ 2 files changed, 84 insertions(+), 47 deletions(-) diff --git a/276 b/276 index f22d089..082f943 100755 --- a/276 +++ b/276 @@ -1,5 +1,29 @@ #! /bin/bash - +# FSQA Test No. 276 +# +# Run fsstress to create a reasonably strange file system, make a +# snapshot and run more fsstress. Then select some files from that fs, +# run filefrag to get the extent mapping and follow the backrefs. +# We check to end up back at the original file with the correct offset. +# +#--- +# Copyright (C) 2011 STRATO. All rights reserved. +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License as +# published by the Free Software Foundation. +# +# This program is distributed in the hope that it would be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +# +#--- +# # creator owner=list.bt...@jan-o-sch.net @@ -7,18 +31,13 @@ seq=`basename $0` echo "QA output created by $seq" here=`pwd` -# 1=production, 0=avoid touching the scratch dev (no mount/umount, no writes) -fresh=1 tmp=/tmp/$$ status=1 -FSTYP=btrfs _cleanup() { - if [ $fresh -ne 0 ]; then - echo "*** unmount" - umount $SCRATCH_MNT 2>/dev/null - fi + echo "*** unmount" + umount $SCRATCH_MNT 2>/dev/null rm -f $tmp.* } trap "_cleanup; exit \$status" 0 1 2 3 15 @@ -28,21 +47,14 @@ trap "_cleanup; exit \$status" 0 1 2 3 15 .
./common.filter # real QA test starts here +_need_to_be_root _supported_fs btrfs _supported_os Linux - -if [ $fresh -ne 0 ]; then - _require_scratch -fi +_require_scratch _require_nobigloopfs - -[ -n "$BTRFS_UTIL_PROG" ] || _notrun "btrfs executable not found" -$BTRFS_UTIL_PROG inspect-internal --help > /dev/null 2>&1 -[ $? -eq 0 ] || _notrun "btrfs executable too old" -which filefrag > /dev/null 2>&1 -[ $? -eq 0 ] || _notrun "filefrag missing" - +_require_btrfs inspect-internal +_require_command /usr/sbin/filefrag rm -f $seq.full @@ -52,6 +64,10 @@ FILEFRAG_FILTER='if (/, blocksize (\d+)/) {$blocksize = $1; next} ($ext, '\ '/(?:^|,)inline(?:,|$)/ and next; print $physical * $blocksize, "#", '\ '$length * $blocksize, "#", $logical * $blocksize, ' +# this makes filefrag output script readable by using a perl helper. +# output is one extent per line, with three numbers separated by '#' +# the numbers are: physical, length, logical (all in bytes) +# sample output: 1234#10#5678 - physical 1234, length 10, logical 5678 _filter_extents() { tee -a $seq.full | $PERL_PROG -ne "$FILEFRAG_FILTER" @@ -70,6 +86,9 @@ _check_file_extents() return 0 } +# use a logical address and walk the backrefs back to the inode. +# compare to the expected result. +# returns 0 on success, 1 on error (with output made) _btrfs_inspect_addr() { mp=$1 @@ -101,6 +120,9 @@ _btrfs_inspect_addr() return 1 } +# use an inode number and walk the backrefs back to the file name. +# compare to the expected result. +# returns 0 on success, 1 on error (with output made) _btrfs_inspect_inum() { file=$1 @@ -134,14 +156,13 @@ _btrfs_inspect_check() echo "# $cmd" >> $seq.full inum=`$cmd` echo $inum >> $seq.full - _btrfs_inspect_addr $SCRATCH_MNT/$snap_name $physical $logical $inum \ - $file + _btrfs_inspect_addr $SCRATCH_MNT $physical $logical $inum $file ret=$? if [ $ret -eq 0 ]; then _btrfs_inspect_inum $file $inum $snap_name ret=$? fi - return $?
+ return $ret } run_check() @@ -157,30 +178,34 @@ workout() procs=$3 snap_name=$4 - if [ $fresh -ne 0 ]; then - umount $SCRATCH_DEV > /dev/null 2>&1 - echo "*** mkfs -dsize=$fsz" >> $seq.full - echo "" >> $seq.full - _scratch_mkfs_sized $fsz >> $seq.full 2>&1 \ - || _fail "size=$fsz mkfs failed" - _scratch_mount >> $seq.full 2>&1 \ - || _fail "mount failed" - # -w ensures that the only ops are ones which cause
Re: btrfs-tools in Debian squeeze-backports?
On 02.01.2012 16:01, Daniel Pocock wrote: One thing I've already noticed in 2.6.39 (and both versions of the tools) is that df results are misleading. E.g. if I run regular df (not btrfs fi df), I am seeing the same amount of available space for all filesystems. Is there currently a way to see space used by each subvolume and snapshot, and which kernel and tools versions might be needed? If you're interested (and brave), you might give the subvolume quota patches a try. Arne sent them last October, subject "[PATCH v0 00/18] btfs: Subvolume Quota Groups". Be warned that this is a v0 patch; it's not been tested too much and will very likely be reworked in the future. But that kind of functionality will hopefully be added to btrfs eventually. -Jan -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfsprogs source code
On Tue, Jan 03, 2012 at 01:05:07PM -0500, Calvin Walton wrote: The best way to get the btrfs-progs source is probably via git; Chris Mason's repository for it can be found at http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-progs.git Chris, The wiki at https://btrfs.wiki.kernel.org/articles/b/t/r/Btrfs_source_repositories.html still refers to a btrfs-progs-unstable.git repository, which is not present at git.kernel.org. Should we update this wiki, or do you have plans on pushing an unstable repository again? Thanks, David -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfsprogs source code
On Wed, Jan 04, 2012 at 10:24:20AM -0800, David Brown wrote: On Tue, Jan 03, 2012 at 01:05:07PM -0500, Calvin Walton wrote: The best way to get the btrfs-progs source is probably via git; Chris Mason's repository for it can be found at http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-progs.git Chris, The wiki at https://btrfs.wiki.kernel.org/articles/b/t/r/Btrfs_source_repositories.html still refers to a btrfs-progs-unstable.git repository, which is not present at git.kernel.org. Should we update this wiki, or do you have plans on pushing an unstable repository again? That wiki is read-only, unfortunately. The up-to-date wiki is at [1], and we'll be decanting that back onto the kernel.org one when the kernel.org wiki is back in full working order. Hugo. [1] http://btrfs.ipv5.de/ -- === Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk === PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk --- echo killall cat ~/curiosity.sh ---
Re: use btrfsck to check btrfs filesystems
On Wed, Dec 14, 2011 at 03:35:20PM +0800, Miao Xie wrote: We failed to get the fsck program to check the btrfs file system. It is because btrfs uses its independent check tool, named btrfsck, to check the file system, so the common checker -- fsck -- could not find it and reported that there is no checker. This patch fixes it by using btrfsck directly. Signed-off-by: Miao Xie mi...@cn.fujitsu.com Thanks, applied. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 277: new test to verify on disk ctime update for chattr
On Thu, Dec 22, 2011 at 11:55:03AM +0800, Li Zefan wrote: We had a bug in btrfs which can be triggered by this test. Thanks, applied. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: (renamed thread) btrfs metrics
On Wed, Jan 4, 2012 at 3:48 AM, Daniel Pocock dan...@pocock.com.au wrote: I am looking at what metrics are needed to monitor btrfs in production. I actually look after the ganglia-modules-linux package, which includes some FS space metrics, but I figured that btrfs throws all that out the window. Can you suggest metrics that would be meaningful, do I look in /proc or with syscalls, is there any code I should look at for an example of how to extract them with C? Ideally, Ganglia runs without root privileges too, so please let me know if btrfs will allow me to access them. It depends on what you want to know, really. If you want "how close am I to a full filesystem?", then the output of df will give you a measure, even if it could be up to a factor of 2 out -- you can use it for predictive planning, though, as it'll be near zero when the FS runs out of space. Maybe if you look at it from the point of the sysadmin and think about what questions he might want to ask: a) how much space would I reclaim if I deleted snapshot X? b) how much space would I reclaim if I deleted all snapshots? c) how much space would I need if I start making 4 snapshots a day and keeping them for 48 hours? chiming in on the discussion - what I'd like to personally see: First, probably easiest: Display per subvol the space used that is unique (not used by other subvolumes), and shared (the opposite - all blocks that appear in other subvolumes as well). From there on, one could potentially create a matrix (proportional font art, apologies):

        | subvol1 | subvol2 | subvol3 |
--------+---------+---------+---------+
subvol1 |    200M |     20M |     50M |
--------+---------+---------+---------+
subvol2 |     20M |    350M |     22M |
--------+---------+---------+---------+
subvol3 |     50M |     22M |    634M |
--------+---------+---------+---------+

The diagonal obviously shows the unique blocks; subvol2 and subvol1 share 20M of data, etc. Missing from this plot would be how much is shared between subvol1, subvol2, and subvol3 together, but it's a start and not something that's hard to understand.
One might add a column for the total size of each subvol, which obviously may not equal the sum of the other columns in this diagram. Anyway, something like this would be high on my list of `df` numbers I'd like to see - since I think they are useful numbers. Cheers, Auke -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
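To make the proposal concrete: if each subvolume could be reduced to a sorted, de-duplicated list of referenced block addresses (an assumed input format — btrfs does not export anything like this directly today), the matrix is just pairwise sorted-list intersections. A sketch:

```c
#include <stddef.h>

typedef unsigned long long u64;

/*
 * Sketch of filling the matrix above: shared_blocks(a, b) counts block
 * addresses that appear in both sorted lists. Input format (sorted,
 * unique u64 block addresses per subvolume) is an assumption.
 */
static size_t shared_blocks(const u64 *a, size_t na, const u64 *b, size_t nb)
{
	size_t i = 0, j = 0, n = 0;

	while (i < na && j < nb) {
		if (a[i] < b[j])
			i++;
		else if (a[i] > b[j])
			j++;
		else {
			n++;	/* block referenced by both subvolumes */
			i++;
			j++;
		}
	}
	return n;
}

/* fill an nsub x nsub matrix of pairwise shared block counts */
static void shared_matrix(const u64 *const *subs, const size_t *lens,
			  size_t nsub, size_t *matrix)
{
	size_t r, c;

	for (r = 0; r < nsub; r++)
		for (c = 0; c < nsub; c++)
			matrix[r * nsub + c] =
				shared_blocks(subs[r], lens[r],
					      subs[c], lens[c]);
}
```

Multiplying each count by the block size gives M-byte figures like those in the table. Note one difference from the mail: here the diagonal degenerates to a subvolume's total block count; the "unique" diagonal described above would additionally subtract the blocks shared with any other subvolume.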
Btrfs partition lost after RAID1 mirror disk failure?
Hi, thanks for the reply. Yes, I agree, after going back over the commands, those ones you highlighted seem very suspicious. These commands were executed weeks ago amid a fair amount of confusion. But yes, I think that you are right - from memory the FS became inaccessible at about the time you mention. I would say that is the best theory as regards this problem. Assuming that this is the case, do I stand a chance of retrieving that volume and accessing that data again? Or does "destructive" imply total loss? (In which case, I'll cut my losses) On 3 January 2012 21:49, C Anthony Risinger anth...@xtfx.me wrote: On Tue, Jan 3, 2012 at 8:44 AM, Dan Garton dan.gar...@gmail.com wrote: [...] 1327 btrfs-vol -a 1328 btrfs-vol -a /nuvat 1329 btrfs-vol -a asdasd /nuvat 1330 btrfs-vol -a missing /nuvat 1331 btrfs-vol -a /dev/sdc /nuvat 1332 btrfs-vol -a /dev/sdb /nuvat 1334 btrfs-vol -a missing /nuvat [...] these look destructive to me ... adding the wrong devices and the existing devices back to the current array? IIRC you should have `-r missing`, but in general, do not use the btrfsctl utility at all -- it won't have as much visibility/exception-handling/recovery as the `btrfs` utility. at what point did your FS become inaccessible? your command history suggests it was working until shortly after these commands ... :-( -- C Anthony -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [3.2-rc7] slowdown, warning + oops creating lots of files
On Thu, Jan 05, 2012 at 08:44:45AM +1100, Dave Chinner wrote: Hi there buttery folks, I just hit this warning and oops running a parallel fs_mark create workload on a test VM using a 17TB btrfs filesystem (12 disk dm RAID0) using default mkfs and mount parameters, mounted on /mnt/scratch. The VM has 8p and 4GB RAM, and the fs_mark command line was: $ ./fs_mark -D 1 -S0 -n 10 -s 0 -L 250 \ -d /mnt/scratch/0 -d /mnt/scratch/1 \ -d /mnt/scratch/2 -d /mnt/scratch/3 \ -d /mnt/scratch/4 -d /mnt/scratch/5 \ -d /mnt/scratch/6 -d /mnt/scratch/7 The attached image should give you a better idea of the performance drop-off that was well under way when the crash occurred at about 96 million files created. I'm rerunning the test on a fresh filesystem, so I guess I'll see if this is a one-off in the next couple of hours. Looks to be reproducible. With a fresh filesystem performance was all over the place from the start, and the warning/oops occurred at about 43M files created. The failure stacks this time are: [ 1490.841957] device fsid 4b7ec51b-9747-4244-a568-fbecdb157383 devid 1 transid 4 /dev/vdc [ 1490.847408] btrfs: disk space caching is enabled [ 3027.690722] [ cut here ] [ 3027.692612] WARNING: at fs/btrfs/extent-tree.c:4771 __btrfs_free_extent+0x630/0x6d0() [ 3027.695607] Hardware name: Bochs [ 3027.696968] Modules linked in: [ 3027.697894] Pid: 3460, comm: fs_mark Not tainted 3.2.0-rc7-dgc+ #167 [ 3027.699581] Call Trace: [ 3027.700452] [8108a69f] warn_slowpath_common+0x7f/0xc0 [ 3027.701973] [8108a6fa] warn_slowpath_null+0x1a/0x20 [ 3027.703480] [815b0680] __btrfs_free_extent+0x630/0x6d0 [ 3027.704981] [815ac110] ? block_group_cache_tree_search+0x90/0xc0 [ 3027.706368] [815b42f1] run_clustered_refs+0x381/0x800 [ 3027.707897] [815b483a] btrfs_run_delayed_refs+0xca/0x220 [ 3027.709347] [815b8f1c] ? btrfs_update_root+0x9c/0xe0 [ 3027.710909] [815c3c33] commit_cowonly_roots+0x33/0x1e0 [ 3027.712370] [81ab168e] ?
_raw_spin_lock+0xe/0x20 [ 3027.713220] [815c54cf] btrfs_commit_transaction+0x3bf/0x840 [ 3027.714412] [810ac420] ? add_wait_queue+0x60/0x60 [ 3027.715460] [815c5da4] ? start_transaction+0x94/0x2b0 [ 3027.716790] [815ac80c] may_commit_transaction+0x6c/0x100 [ 3027.717843] [815b2b47] reserve_metadata_bytes.isra.71+0x5a7/0x660 [ 3027.719223] [81073c23] ? __wake_up+0x53/0x70 [ 3027.720328] [815a43ba] ? btrfs_free_path+0x2a/0x40 [ 3027.721511] [815b2f9e] btrfs_block_rsv_add+0x3e/0x70 [ 3027.722610] [81666dfb] ? security_d_instantiate+0x1b/0x30 [ 3027.723765] [815c5f65] start_transaction+0x255/0x2b0 [ 3027.725204] [815c6283] btrfs_start_transaction+0x13/0x20 [ 3027.726273] [815d2236] btrfs_create+0x46/0x220 [ 3027.727275] [8116c204] vfs_create+0xb4/0xf0 [ 3027.728344] [8116e1d7] do_last.isra.45+0x547/0x7c0 [ 3027.729400] [8116f7ab] path_openat+0xcb/0x3d0 [ 3027.730363] [81ab168e] ? _raw_spin_lock+0xe/0x20 [ 3027.731394] [8117cc1e] ? vfsmount_lock_local_unlock+0x1e/0x30 [ 3027.733077] [8116fbd2] do_filp_open+0x42/0xa0 [ 3027.733949] [8117c487] ? alloc_fd+0xf7/0x150 [ 3027.734911] [8115f8e7] do_sys_open+0xf7/0x1d0 [ 3027.735894] [810b572a] ? 
do_gettimeofday+0x1a/0x50 [ 3027.737304] [8115f9e0] sys_open+0x20/0x30 [ 3027.738099] [81ab9502] system_call_fastpath+0x16/0x1b [ 3027.739199] ---[ end trace df586861a93ef3bf ]--- [ 3027.740348] btrfs unable to find ref byte nr 19982405632 parent 0 root 2 owner 0 offset 0 [ 3027.742001] BUG: unable to handle kernel NULL pointer dereference at (null) [ 3027.743502] IP: [815e60f2] map_private_extent_buffer+0x12/0x150 [ 3027.744982] PGD 109d8e067 PUD 1050a9067 PMD 0 [ 3027.745968] Oops: [#1] SMP [ 3027.745968] CPU 7 [ 3027.745968] Modules linked in: [ 3027.745968] [ 3027.745968] Pid: 3460, comm: fs_mark Tainted: GW3.2.0-rc7-dgc+ #167 Bochs Bochs [ 3027.745968] RIP: 0010:[815e60f2] [815e60f2] map_private_extent_buffer+0x12/0x150 [ 3027.745968] RSP: 0018:8800d2ac36d8 EFLAGS: 00010296 [ 3027.745968] RAX: RBX: 0065 RCX: 8800d2ac3708 [ 3027.745968] RDX: 0004 RSI: 007a RDI: [ 3027.745968] RBP: 8800d2ac36f8 R08: 8800d2ac3710 R09: 8800d2ac3718 [ 3027.745968] R10: R11: 0001 R12: 007a [ 3027.745968] R13: R14: ffe4 R15: 1000 [ 3027.745968] FS: 7f3bf8ab5700() GS:88011fdc() knlGS: [ 3027.745968] CS: 0010 DS: ES: CR0: 8005003b [ 3027.745968] CR2: 7fe424b0e000 CR3: 000106c33000 CR4: 06e0 [ 3027.745968] DR0:
Re: [3.2-rc7] slowdown, warning + oops creating lots of files
On 05/01/12 09:11, Dave Chinner wrote: Looks to be reproducible. Does this happen with rc6? If not then it might be easy to track down, as there are only 2 modifications between rc6 and rc7: commit 08c422c27f855d27b0b3d9fa30ebd938d4ae6f1f Author: Al Viro v...@zeniv.linux.org.uk Date: Fri Dec 23 07:58:13 2011 -0500 Btrfs: call d_instantiate after all ops are setup This closes races where btrfs is calling d_instantiate too soon during inode creation. All of the callers of btrfs_add_nondir are updated to instantiate after the inode is fully setup in memory. Signed-off-by: Al Viro v...@zeniv.linux.org.uk Signed-off-by: Chris Mason chris.ma...@oracle.com commit 8d532b2afb2eacc84588db709ec280a3d1219be3 Author: Chris Mason chris.ma...@oracle.com Date: Fri Dec 23 07:53:00 2011 -0500 Btrfs: fix worker lock misuse in find_worker Dan Carpenter noticed that we were doing a double unlock on the worker lock, and sometimes picking a worker thread without the lock held. This fixes both errors. Signed-off-by: Chris Mason chris.ma...@oracle.com Reported-by: Dan Carpenter dan.carpen...@oracle.com -- Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [3.2-rc7] slowdown, warning + oops creating lots of files
On Thu, Jan 05, 2012 at 09:23:52AM +1100, Chris Samuel wrote: On 05/01/12 09:11, Dave Chinner wrote: Looks to be reproducible. Does this happen with rc6? I haven't tried. All I'm doing is running some benchmarks to get numbers for a talk I'm giving about improvements in XFS metadata scalability, so I wanted to update my last set of numbers from 2.6.39. As it was, these benchmarks also failed on btrfs with oopsen and corruptions back in the 2.6.39 time frame, e.g. same VM, same test, different crashes, similar slowdowns as reported here: http://comments.gmane.org/gmane.comp.file-systems.btrfs/11062 Given that there is now a history of this simple test uncovering problems, perhaps this is a test that should be run more regularly by btrfs developers? If not then it might be easy to track down, as there are only 2 modifications between rc6 and rc7. They don't look like they'd be responsible for fixing an extent tree corruption, and I don't really have the time to do an open-ended bisect to find where the fix for this problem arose. As it is, the 3rd attempt failed at 22m inodes, without the warning this time: [ 59.433452] device fsid 4d27dc14-562d-4722-9591-723bd2bbe94c devid 1 transid 4 /dev/vdc [ 59.437050] btrfs: disk space caching is enabled [ 753.258465] [ cut here ] [ 753.259806] kernel BUG at fs/btrfs/extent-tree.c:5797!
[  753.260014] invalid opcode: [#1] SMP
[  753.260014] CPU 7
[  753.260014] Modules linked in:
[  753.260014]
[  753.260014] Pid: 2874, comm: fs_mark Not tainted 3.2.0-rc7-dgc+ #167 Bochs Bochs
[  753.260014] RIP: 0010:[815b475b] [815b475b] run_clustered_refs+0x7eb/0x800
[  753.260014] RSP: 0018:8800430258a8 EFLAGS: 00010286
[  753.260014] RAX: ffe4 RBX: 88009c8ab1c0 RCX:
[  753.260014] RDX: 0008 RSI: 0282 RDI:
[  753.260014] RBP: 880043025988 R08: R09: 0002
[  753.260014] R10: 8801188f6000 R11: 880101b50d20 R12: 88008fc1ad40
[  753.260014] R13: 88003940a6c0 R14: 880118a49000 R15: 88010fc77e80
[  753.260014] FS: 7f416ce90700() GS:88011fdc() knlGS:
[  753.260014] CS: 0010 DS: ES: CR0: 8005003b
[  753.260014] CR2: 7f416c2f6000 CR3: 3aaea000 CR4: 06e0
[  753.260014] DR0: DR1: DR2:
[  753.260014] DR3: DR6: 0ff0 DR7: 0400
[  753.260014] Process fs_mark (pid: 2874, threadinfo 880043024000, task 8800090e6180)
[  753.260014] Stack:
[  753.260014]  8801
[  753.260014]  88010fc77f38 0e92 0002
[  753.260014]  0e03 0e68 8800430259d8
[  753.260014] Call Trace:
[  753.260014]  [815b483a] btrfs_run_delayed_refs+0xca/0x220
[  753.260014]  [815c5469] btrfs_commit_transaction+0x359/0x840
[  753.260014]  [810ac420] ? add_wait_queue+0x60/0x60
[  753.260014]  [815c5da4] ? start_transaction+0x94/0x2b0
[  753.260014]  [815ac80c] may_commit_transaction+0x6c/0x100
[  753.260014]  [815b2b47] reserve_metadata_bytes.isra.71+0x5a7/0x660
[  753.260014]  [81073c23] ? __wake_up+0x53/0x70
[  753.260014]  [815a43ba] ? btrfs_free_path+0x2a/0x40
[  753.260014]  [815b2f9e] btrfs_block_rsv_add+0x3e/0x70
[  753.260014]  [81666dfb] ? security_d_instantiate+0x1b/0x30
[  753.260014]  [815c5f65] start_transaction+0x255/0x2b0
[  753.260014]  [815c6283] btrfs_start_transaction+0x13/0x20
[  753.260014]  [815d2236] btrfs_create+0x46/0x220
[  753.260014]  [8116c204] vfs_create+0xb4/0xf0
[  753.260014]  [8116e1d7] do_last.isra.45+0x547/0x7c0
[  753.260014]  [8116f7ab] path_openat+0xcb/0x3d0
[  753.260014]  [81ab168e] ? _raw_spin_lock+0xe/0x20
[  753.260014]  [8117cc1e] ? vfsmount_lock_local_unlock+0x1e/0x30
[  753.260014]  [8116fbd2] do_filp_open+0x42/0xa0
[  753.260014]  [8117c487] ? alloc_fd+0xf7/0x150
[  753.260014]  [8115f8e7] do_sys_open+0xf7/0x1d0
[  753.260014]  [810b572a] ? do_gettimeofday+0x1a/0x50
[  753.260014]  [8115f9e0] sys_open+0x20/0x30
[  753.260014]  [81ab9502] system_call_fastpath+0x16/0x1b
[  753.260014] Code: ff e9 37 f9 ff ff be 95 00 00 00 48 c7 c7 43 6f df 81 e8 99 5f ad ff e9 36 f9 ff ff 80 fa b2 0f 84 d0 f9 ff ff 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f
[  753.260014] RIP [815b475b] run_clustered_refs+0x7eb/0x800
[  753.260014] RSP 8800430258a8
[  753.330089] ---[ end trace f3d0e286a928c349 ]---

It's hard to tell exactly what path gets to that BUG_ON(), so much code is inlined by the compiler into run_clustered_refs() that I can't tell exactly how it got to the BUG_ON() triggered in alloc_reserved_tree_block().
Re: [3.2-rc7] slowdown, warning + oops creating lots of files
On 01/04/2012 06:01 PM, Dave Chinner wrote:

[ ... quoted benchmark report and oops trimmed ... ]

> It's hard to tell exactly what path gets to that BUG_ON(), so much code is inlined by the compiler into run_clustered_refs() that I can't tell exactly how it got to the BUG_ON() triggered in alloc_reserved_tree_block().

This seems to be an oops caused by ENOSPC.
Re: [3.2-rc7] slowdown, warning + oops creating lots of files
On Wed, Jan 04, 2012 at 09:23:18PM -0500, Liu Bo wrote:
> On 01/04/2012 06:01 PM, Dave Chinner wrote:
> >
> > [ ... quoted benchmark report and oops trimmed ... ]
> >
> > It's hard to tell exactly what path gets to that BUG_ON(), so much code is inlined by the compiler into run_clustered_refs() that I can't tell exactly how it got to the BUG_ON() triggered in alloc_reserved_tree_block().
>
> This seems to be an oops caused by ENOSPC.

At the time of the oops, this is the space used on the filesystem:

$ df -h /mnt/scratch
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdc         17T   31G   17T   1% /mnt/scratch

It's less than 0.2% full, so I think ENOSPC can be ruled out here.

I have noticed one thing, however, in that there are significant numbers of reads coming from disk when the slowdowns and oops occur. When everything runs fast, there are virtually no reads occurring at all.
It looks to me like the working set of metadata is being kicked out of memory, only to be read back in again a short while later. Maybe that is a contributing factor.

BTW, there is a lot of CPU time being spent on the tree locks. perf shows this as the top 2 CPU consumers:

-   9.49%  [kernel]  [k] __write_lock_failed
   - __write_lock_failed
      - 99.80% _raw_write_lock
         - 79.35% btrfs_try_tree_write_lock
              99.99% btrfs_search_slot
         - 20.63% btrfs_tree_lock
              89.19% btrfs_search_slot
              10.54% btrfs_lock_root_node
                   btrfs_search_slot
-   9.25%  [kernel]  [k] _raw_spin_unlock_irqrestore
   - _raw_spin_unlock_irqrestore
      - 55.87% __wake_up
         + 93.89% btrfs_clear_lock_blocking_rw
         + 3.46% btrfs_tree_read_unlock_blocking
         + 2.35% btrfs_tree_unlock

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 00/10] Btrfs: backref walking rewrite
Jan Schmidt wrote:
> This patch series is a major rewrite of the backref walking code. The patch series Arne sent some weeks ago for quota groups had a very interesting function, find_all_roots. I took this from him together with the bits needed for find_all_roots to work, and replaced a major part of the code in backref.c with it.
>
> It can be pulled from
>   git://git.jan-o-sch.net/btrfs-unstable for-chris
> There's also a gitweb for that repo on
>   http://git.jan-o-sch.net/?p=btrfs-unstable

Thanks for the work! I got a compile warning:

  CC [M]  fs/btrfs/backref.o
fs/btrfs/backref.c: In function 'inode_to_path':
fs/btrfs/backref.c:1312:3: warning: format '%ld' expects type 'long int', but argument 3 has type 'int'
Re: [3.2-rc7] slowdown, warning + oops creating lots of files
On 01/04/2012 09:26 PM, Dave Chinner wrote:
> On Wed, Jan 04, 2012 at 09:23:18PM -0500, Liu Bo wrote:
> > This seems to be an oops caused by ENOSPC.
>
> [ ... quoted benchmark report and oops trimmed ... ]
>
> At the time of the oops, this is the space used on the filesystem:
>
> $ df -h /mnt/scratch
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/vdc         17T   31G   17T   1% /mnt/scratch
>
> It's less than 0.2% full, so I think ENOSPC can be ruled out here.

This bug has done something with our block reservation allocator, not the real disk space. Can you try the patch below and see what happens?
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index b1c8732..5a7f918 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3978,8 +3978,8 @@ static u64 calc_global_metadata_size(struct btrfs_fs_info *fs_info)
 		csum_size * 2;
 	num_bytes += div64_u64(data_used + meta_used, 50);
 
-	if (num_bytes * 3 > meta_used)
-		num_bytes = div64_u64(meta_used, 3);
+	if (num_bytes * 2 > meta_used)
+		num_bytes = div64_u64(meta_used, 2);
 
 	return ALIGN(num_bytes, fs_info->extent_root->leafsize << 10);
 }

> I have noticed one thing, however, in that there are significant numbers of reads coming from disk when the slowdowns and oops occur. When everything runs fast, there are virtually no reads occurring at all. It looks to me like the working set of metadata is being kicked out of memory, only to be read back in again a short while later. Maybe that is a contributing factor.
>
> BTW, there is a lot of CPU time being spent on the tree locks. perf shows this as the top 2 CPU consumers:
>
> [ ... perf output trimmed ... ]

hmm, the new extent_buffer lock scheme written by Chris is aimed at avoiding such cases; maybe he can provide some advice.

thanks,
liubo