Re: (renamed thread) btrfs metrics

2012-01-04 Thread Daniel Pocock

 I am looking at what metrics are needed to monitor btrfs in production.
  I actually look after the ganglia-modules-linux package, which includes
 some FS space metrics, but I figured that btrfs throws all that out the
 window.

 Can you suggest metrics that would be meaningful, do I look in /proc or
 with syscalls, is there any code I should look at for an example of how
 to extract them with C?  Ideally, Ganglia runs without root privileges
 too, so please let me know whether btrfs will allow me to access them.
 
It depends on what you want to know, really. If you want "how close
 am I to a full filesystem?", then the output of df will give you a
 measure, even if it could be up to a factor of 2 out -- you can use it
 for predictive planning, though, as it'll be near zero when the FS
 runs out of space.


Maybe if you look at it from the point of the sysadmin and think about
what questions he might want to ask:

a) how much space would I reclaim if I deleted snapshot X?

b) how much space would I reclaim if I deleted all snapshots?

c) how much space would I need if I start making 4 snapshots a day and
keeping them for 48 hours?

(Ganglia also sums data across the enterprise, so if such metrics are
implemented, the admin can quickly see the sum of all snapshot usage on
his grid/cluster)

and also:

d) what metrics would be useful for someone developing or testing some
solution involving btrfs?

The Ganglia framework uses rrdtool to graph metric data and present it
alongside other system data (e.g. to see disk IO rates, CPU load, cache
activity all on the same graph) so it could provide some useful insights
into any performance testing of btrfs.  Even better, using rrdtool, you
can overlay some btrfs metric on the same graph as a system metric (e.g.
IO request sizes).


If you really want to, you can get your hands into the filesystem
 structures on a read-only (and non-locked) basis using the
 BTRFS_IOC_TREE_SEARCH ioctl, and the FS structures are documented at
 [1]. However, that's generally going to be pretty ugly, and most
 likely pretty slow for many operations at the subvolume level.
 
If you want anything on a per-subvolume basis, then you'll have to
 wait for Arne to finish the work on quotas.
 

Initially, I could just start with some simple metric (even just
retrieving the btrfs UUID) as a proof-of-concept, and then add more
stuff later, for example, when Arne has the quota work in a stable form.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


checksums when converting from ext[234] to Btrfs

2012-01-04 Thread pubny
Hi,

Could someone help me with a clarification whether the btrfs-convert
tool creates checksums on blocks of the existing ext[234] filesystem?

Any experiences how the size and the filesystem utilization (used vs.
total diskspace) impacts the time of conversion?

Thanks,
Gábor





Re: [PATCH v2] xfstests: new check 278 to ensure btrfs backref integrity

2012-01-04 Thread Christoph Hellwig
On Thu, Dec 22, 2011 at 12:08:58PM +0100, Jan Schmidt wrote:
 This is a btrfs specific scratch test checking the backref walker. It
 creates a file system with compressed and uncompressed data extents, picks
 files randomly and uses filefrag to get their extents. It then asks the
 btrfs utility (inspect-internal) to do the backref resolving from fs-logical
 address (the one filefrag calls "physical") back to the inode number and
 file-logical offset, verifying the result.

I was about to apply this, but for some reason it fails for me when
running xfstest on xfs:

276  [failed, exit status 1] - output mismatch (see 276.out.bad)
--- 276.out 2012-01-04 16:14:36.0 +
+++ 276.out.bad 2012-01-04 16:32:26.0 +
@@ -1,4 +1,5 @@
QA output created by 276
-*** test backref walking
-*** done
+common.rc: Error: $TEST_DEV (/dev/vdb1) is not a MOUNTED btrfs
filesystem
+Filesystem     Type  1K-blocks  Used Available Use% Mounted on
+/dev/vdb1      xfs    39042944 32928  39010016   1% /mnt/test
 *** unmount

which is a bit confusing


Re: [PATCH v2] xfstests: new check 278 to ensure btrfs backref integrity

2012-01-04 Thread Eric Sandeen
On 1/4/12 10:39 AM, Christoph Hellwig wrote:
 On Thu, Dec 22, 2011 at 12:08:58PM +0100, Jan Schmidt wrote:
 This is a btrfs specific scratch test checking the backref walker. It
 creates a file system with compressed and uncompressed data extents, picks
 files randomly and uses filefrag to get their extents. It then asks the
 btrfs utility (inspect-internal) to do the backref resolving from fs-logical
 address (the one filefrag calls "physical") back to the inode number and
 file-logical offset, verifying the result.
 
 I was about to apply this, but for some reason it fails for me when
 running xfstest on xfs:
 
 276[failed, exit status 1] - output mismatch (see 276.out.bad)
 --- 276.out   2012-01-04 16:14:36.0 +
 +++ 276.out.bad   2012-01-04 16:32:26.0 +
 @@ -1,4 +1,5 @@
 QA output created by 276
 -*** test backref walking
 -*** done
 +common.rc: Error: $TEST_DEV (/dev/vdb1) is not a MOUNTED btrfs
 filesystem
 +Filesystem     Type  1K-blocks  Used Available Use% Mounted on
 +/dev/vdb1      xfs    39042944 32928  39010016   1% /mnt/test
  *** unmount
 
 which is a bit confusing
 

276 got merged on Dec 28 before my requests for fixup, I guess?  And it
explicitly sets FSTYP=btrfs which is why it fails.

the 278 patch v2 in this thread works ok for me.

so munging the 278 patch here into the existing 276 should be the
right approach.

-Eric


Re: [PATCH v2] xfstests: new check 278 to ensure btrfs backref integrity

2012-01-04 Thread Jan Schmidt
On 04.01.2012 18:01, Eric Sandeen wrote:
 276 got merged on Dec 28 before my requests for fixup, I guess?  And it
 explicitly sets FSTYP=btrfs which is why it fails.
 
 the 278 patch v2 in this thread works ok for me.
 
 so munging the 278 patch here into the existing 276 should be the
 right approach.

Yeah, we figured that out on IRC a few minutes ago :-)

I'm currently building v2 as an incremental patch to 276 (without the
rename to 278) and will send it as

   [PATCH] xfstests: fixup check 276

soon.

-Jan


[PATCH v2 00/10] Btrfs: backref walking rewrite

2012-01-04 Thread Jan Schmidt
This patch series is a major rewrite of the backref walking code. The patch
series Arne sent some weeks ago for quota groups had a very interesting
function, find_all_roots. I took this from him together with the bits needed
for find_all_roots to work and replaced a major part of the code in backref.c
with it.

It can be pulled from
git://git.jan-o-sch.net/btrfs-unstable for-chris
There's also a gitweb for that repo on
http://git.jan-o-sch.net/?p=btrfs-unstable

My old backref code had several problems:
- it relied on a consistent state of the trees in memory
- it ignored delayed refs
- it only featured rudimentary locking
- it could miss some references depending on the tree layout

The biggest advantage is that we're now able to do reliable backref resolving,
even on busy file systems. So we've got benefits for:
- the existing btrfs inspect-internal commands
- aforementioned qgroups (patches on the list)
- btrfs send (currently in development)
- snapshot-aware defrag
- ... possibly more to come

Splitting the needed bits out of Arne's code was quite an intrusive operation.
In case this goes into 3.3, one of us will soon make a rebased version of the
qgroup patch set. Things corrected/changed in Arne's code along the way:
- don't assume INODE_ITEMs and the corresponding EXTENT_DATA items are in the
  same leaf (use the correct EXTENT_DATA_KEY for tree searches)
- don't assume all EXTENT_DATA items with the same backref for the same inode
  are in the same leaf (__resolve_indirect_refs can now add more refs)
- added missing key and level to prelim lists for shared block refs
- delayed ref sequence locking ability without wasting sequence numbers
- waitqueue instead of busy waiting for more delayed refs

As this touches a critical part of the file system, I also did some speed
benchmarks. It turns out that dbench shows no performance decrease on my
hardware. I can do more tests if desired.

By the way: this patch series fixes xfstest 278 (to be published soon) :-)

-Jan

---
Changelog v1->v2:
- nested locking is now allowed implicitly, not only when path->nested is set
- force-pushed new version to mentioned git repo
---
Arne Jansen (6):
  Btrfs: generic data structure to build unique lists
  Btrfs: mark delayed refs as for cow
  Btrfs: always save ref_root in delayed refs
  Btrfs: add nested locking mode for paths
  Btrfs: add sequence numbers to delayed refs
  Btrfs: put back delayed refs that are too new

Jan Schmidt (4):
  Btrfs: added helper btrfs_next_item()
  Btrfs: add waitqueue instead of doing busy waiting for more delayed
refs
  Btrfs: added btrfs_find_all_roots()
  Btrfs: new backref walking code

 fs/btrfs/Makefile  |2 +-
 fs/btrfs/backref.c | 1131 +---
 fs/btrfs/backref.h |5 +
 fs/btrfs/ctree.c   |   42 +-
 fs/btrfs/ctree.h   |   24 +-
 fs/btrfs/delayed-ref.c |  153 +--
 fs/btrfs/delayed-ref.h |  104 -
 fs/btrfs/disk-io.c |3 +-
 fs/btrfs/extent-tree.c |  187 ++--
 fs/btrfs/extent_io.c   |1 +
 fs/btrfs/extent_io.h   |2 +
 fs/btrfs/file.c|   10 +-
 fs/btrfs/inode.c   |2 +-
 fs/btrfs/ioctl.c   |   13 +-
 fs/btrfs/locking.c |   53 +++-
 fs/btrfs/relocation.c  |   18 +-
 fs/btrfs/scrub.c   |7 +-
 fs/btrfs/transaction.c |9 +-
 fs/btrfs/tree-log.c|2 +-
 fs/btrfs/ulist.c   |  220 ++
 fs/btrfs/ulist.h   |   68 +++
 21 files changed, 1638 insertions(+), 418 deletions(-)
 create mode 100644 fs/btrfs/ulist.c
 create mode 100644 fs/btrfs/ulist.h

-- 
1.7.3.4



[PATCH v2 02/10] Btrfs: added helper btrfs_next_item()

2012-01-04 Thread Jan Schmidt
btrfs_next_item() makes the btrfs path point to the next item, crossing leaf
boundaries if needed.

Signed-off-by: Arne Jansen sensi...@gmx.net
Signed-off-by: Jan Schmidt list.bt...@jan-o-sch.net
---
 fs/btrfs/ctree.h |7 +++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 50634abe..3e4a07b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2482,6 +2482,13 @@ static inline int btrfs_insert_empty_item(struct btrfs_trans_handle *trans,
 }
 
 int btrfs_next_leaf(struct btrfs_root *root, struct btrfs_path *path);
+static inline int btrfs_next_item(struct btrfs_root *root, struct btrfs_path *p)
+{
+	++p->slots[0];
+	if (p->slots[0] >= btrfs_header_nritems(p->nodes[0]))
+		return btrfs_next_leaf(root, p);
+	return 0;
+}
 int btrfs_prev_leaf(struct btrfs_root *root, struct btrfs_path *path);
 int btrfs_leaf_free_space(struct btrfs_root *root, struct extent_buffer *leaf);
 void btrfs_drop_snapshot(struct btrfs_root *root,
-- 
1.7.3.4



[PATCH v2 09/10] Btrfs: added btrfs_find_all_roots()

2012-01-04 Thread Jan Schmidt
This function gets a byte number (a data extent), collects all the leaves
pointing to it and walks up the trees to find all fs roots pointing to those
leaves. It also returns the list of all leaves pointing to that extent.

It does proper locking for the involved trees, can be used on busy file
systems and honors delayed refs.

Signed-off-by: Arne Jansen sensi...@gmx.net
Signed-off-by: Jan Schmidt list.bt...@jan-o-sch.net
---
 fs/btrfs/backref.c |  783 
 fs/btrfs/backref.h |5 +
 2 files changed, 788 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 22c64ff..03c30a1 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -19,6 +19,9 @@
 #include "ctree.h"
 #include "disk-io.h"
 #include "backref.h"
+#include "ulist.h"
+#include "transaction.h"
+#include "delayed-ref.h"
 
 struct __data_ref {
struct list_head list;
@@ -32,6 +35,786 @@ struct __shared_ref {
u64 disk_byte;
 };
 
+/*
+ * this structure records all encountered refs on the way up to the root
+ */
+struct __prelim_ref {
+   struct list_head list;
+   u64 root_id;
+   struct btrfs_key key;
+   int level;
+   int count;
+   u64 parent;
+   u64 wanted_disk_byte;
+};
+
+static int __add_prelim_ref(struct list_head *head, u64 root_id,
+			    struct btrfs_key *key, int level, u64 parent,
+			    u64 wanted_disk_byte, int count)
+{
+	struct __prelim_ref *ref;
+
+	/* in case we're adding delayed refs, we're holding the refs spinlock */
+	ref = kmalloc(sizeof(*ref), GFP_ATOMIC);
+	if (!ref)
+		return -ENOMEM;
+
+	ref->root_id = root_id;
+	if (key)
+		ref->key = *key;
+	else
+		memset(&ref->key, 0, sizeof(ref->key));
+
+	ref->level = level;
+	ref->count = count;
+	ref->parent = parent;
+	ref->wanted_disk_byte = wanted_disk_byte;
+	list_add_tail(&ref->list, head);
+
+	return 0;
+}
+
+static int add_all_parents(struct btrfs_root *root, struct btrfs_path *path,
+				struct ulist *parents,
+				struct extent_buffer *eb, int level,
+				u64 wanted_objectid, u64 wanted_disk_byte)
+{
+	int ret;
+	int slot;
+	struct btrfs_file_extent_item *fi;
+	struct btrfs_key key;
+	u64 disk_byte;
+
+add_parent:
+	ret = ulist_add(parents, eb->start, 0, GFP_NOFS);
+	if (ret < 0)
+		return ret;
+
+	if (level != 0)
+		return 0;
+
+	/*
+	 * if the current leaf is full with EXTENT_DATA items, we must
+	 * check the next one if that holds a reference as well.
+	 * ref->count cannot be used to skip this check.
+	 * repeat this until we don't find any additional EXTENT_DATA items.
+	 */
+	while (1) {
+		ret = btrfs_next_leaf(root, path);
+		if (ret < 0)
+			return ret;
+		if (ret)
+			return 0;
+
+		eb = path->nodes[0];
+		for (slot = 0; slot < btrfs_header_nritems(eb); ++slot) {
+			btrfs_item_key_to_cpu(eb, &key, slot);
+			if (key.objectid != wanted_objectid ||
+			    key.type != BTRFS_EXTENT_DATA_KEY)
+				return 0;
+			fi = btrfs_item_ptr(eb, slot,
+					    struct btrfs_file_extent_item);
+			disk_byte = btrfs_file_extent_disk_bytenr(eb, fi);
+			if (disk_byte == wanted_disk_byte)
+				goto add_parent;
+		}
+	}
+
+	return 0;
+}
+
+/*
+ * resolve an indirect backref in the form (root_id, key, level)
+ * to a logical address
+ */
+static int __resolve_indirect_ref(struct btrfs_fs_info *fs_info,
+					struct __prelim_ref *ref,
+					struct ulist *parents)
+{
+	struct btrfs_path *path;
+	struct btrfs_root *root;
+	struct btrfs_key root_key;
+	struct btrfs_key key = {0};
+	struct extent_buffer *eb;
+	int ret = 0;
+	int root_level;
+	int level = ref->level;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	root_key.objectid = ref->root_id;
+	root_key.type = BTRFS_ROOT_ITEM_KEY;
+	root_key.offset = (u64)-1;
+	root = btrfs_read_fs_root_no_name(fs_info, &root_key);
+	if (IS_ERR(root)) {
+		ret = PTR_ERR(root);
+		goto out;
+	}
+
+	rcu_read_lock();
+	root_level = btrfs_header_level(root->node);
+	rcu_read_unlock();
+
+	if (root_level + 1 == level)
+		goto out;
+
+	path->lowest_level = level;
+	ret = btrfs_search_slot(NULL, root, &ref->key, 

[PATCH v2 03/10] Btrfs: mark delayed refs as for cow

2012-01-04 Thread Jan Schmidt
From: Arne Jansen sensi...@gmx.net

Add a for_cow parameter to add_delayed_*_ref and pass the appropriate value
from every call site. The for_cow parameter will later on be used to
determine if a ref will change anything with respect to qgroups.

Delayed refs coming from relocation are always counted as for_cow, as they
don't change subvol quota.

Also pass in the fs_info for later use.

btrfs_find_all_roots() will use this as an optimization, as changes that are
for_cow will not change anything with respect to which root points to a
certain leaf. Thus, we don't need to add the current sequence number to
those delayed refs.

Signed-off-by: Arne Jansen sensi...@gmx.net
Signed-off-by: Jan Schmidt list.bt...@jan-o-sch.net
---
 fs/btrfs/ctree.c   |   42 ++--
 fs/btrfs/ctree.h   |   17 
 fs/btrfs/delayed-ref.c |   50 +++-
 fs/btrfs/delayed-ref.h |   15 +--
 fs/btrfs/disk-io.c |3 +-
 fs/btrfs/extent-tree.c |  101 ---
 fs/btrfs/file.c|   10 ++--
 fs/btrfs/inode.c   |2 +-
 fs/btrfs/ioctl.c   |5 +-
 fs/btrfs/relocation.c  |   18 +
 fs/btrfs/transaction.c |4 +-
 fs/btrfs/tree-log.c|2 +-
 12 files changed, 155 insertions(+), 114 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index dede441..0639a55 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -240,7 +240,7 @@ int btrfs_copy_root(struct btrfs_trans_handle *trans,
 
 	cow = btrfs_alloc_free_block(trans, root, buf->len, 0,
 				     new_root_objectid, &disk_key, level,
-				     buf->start, 0);
+				     buf->start, 0, 1);
 	if (IS_ERR(cow))
 		return PTR_ERR(cow);
 
@@ -261,9 +261,9 @@ int btrfs_copy_root(struct btrfs_trans_handle *trans,
 
 	WARN_ON(btrfs_header_generation(buf) > trans->transid);
 	if (new_root_objectid == BTRFS_TREE_RELOC_OBJECTID)
-		ret = btrfs_inc_ref(trans, root, cow, 1);
+		ret = btrfs_inc_ref(trans, root, cow, 1, 1);
 	else
-		ret = btrfs_inc_ref(trans, root, cow, 0);
+		ret = btrfs_inc_ref(trans, root, cow, 0, 1);
 
 	if (ret)
 		return ret;
@@ -350,14 +350,14 @@ static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans,
 	if ((owner == root->root_key.objectid ||
 	     root->root_key.objectid == BTRFS_TREE_RELOC_OBJECTID) &&
 	    !(flags & BTRFS_BLOCK_FLAG_FULL_BACKREF)) {
-		ret = btrfs_inc_ref(trans, root, buf, 1);
+		ret = btrfs_inc_ref(trans, root, buf, 1, 1);
 		BUG_ON(ret);
 
 		if (root->root_key.objectid ==
 		    BTRFS_TREE_RELOC_OBJECTID) {
-			ret = btrfs_dec_ref(trans, root, buf, 0);
+			ret = btrfs_dec_ref(trans, root, buf, 0, 1);
 			BUG_ON(ret);
-			ret = btrfs_inc_ref(trans, root, cow, 1);
+			ret = btrfs_inc_ref(trans, root, cow, 1, 1);
 			BUG_ON(ret);
 		}
 		new_flags |= BTRFS_BLOCK_FLAG_FULL_BACKREF;
@@ -365,9 +365,9 @@ static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans,
 
 		if (root->root_key.objectid ==
 		    BTRFS_TREE_RELOC_OBJECTID)
-			ret = btrfs_inc_ref(trans, root, cow, 1);
+			ret = btrfs_inc_ref(trans, root, cow, 1, 1);
 		else
-			ret = btrfs_inc_ref(trans, root, cow, 0);
+			ret = btrfs_inc_ref(trans, root, cow, 0, 1);
 		BUG_ON(ret);
 	}
 	if (new_flags != 0) {
@@ -381,11 +381,11 @@ static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans,
 	if (flags & BTRFS_BLOCK_FLAG_FULL_BACKREF) {
 		if (root->root_key.objectid ==
 		    BTRFS_TREE_RELOC_OBJECTID)
-			ret = btrfs_inc_ref(trans, root, cow, 1);
+			ret = btrfs_inc_ref(trans, root, cow, 1, 1);
 		else
-			ret = btrfs_inc_ref(trans, root, cow, 0);
+			ret = btrfs_inc_ref(trans, root, cow, 0, 1);
 		BUG_ON(ret);
-		ret = btrfs_dec_ref(trans, root, buf, 1);
+		ret = btrfs_dec_ref(trans, root, buf, 1, 1);
 		BUG_ON(ret);
 	}
 	clean_tree_block(trans, root, buf);
@@ -446,7 +446,7 @@ static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans,
 
 	cow = 

[PATCH v2 04/10] Btrfs: always save ref_root in delayed refs

2012-01-04 Thread Jan Schmidt
From: Arne Jansen sensi...@gmx.net

For consistent backref walking and (later) qgroup calculation the
information to which root a delayed ref belongs is useful even for shared
refs.

Signed-off-by: Arne Jansen sensi...@gmx.net
Signed-off-by: Jan Schmidt list.bt...@jan-o-sch.net
---
 fs/btrfs/delayed-ref.c |   18 --
 fs/btrfs/delayed-ref.h |   12 
 2 files changed, 12 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 3a0f0ab..babd37b 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -495,13 +495,12 @@ static noinline int add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
 	ref->in_tree = 1;
 
 	full_ref = btrfs_delayed_node_to_tree_ref(ref);
-	if (parent) {
-		full_ref->parent = parent;
+	full_ref->parent = parent;
+	full_ref->root = ref_root;
+	if (parent)
 		ref->type = BTRFS_SHARED_BLOCK_REF_KEY;
-	} else {
-		full_ref->root = ref_root;
+	else
 		ref->type = BTRFS_TREE_BLOCK_REF_KEY;
-	}
 	full_ref->level = level;
 
 	trace_btrfs_delayed_tree_ref(ref, full_ref, action);
@@ -551,13 +550,12 @@ static noinline int add_delayed_data_ref(struct btrfs_fs_info *fs_info,
 	ref->in_tree = 1;
 
 	full_ref = btrfs_delayed_node_to_data_ref(ref);
-	if (parent) {
-		full_ref->parent = parent;
+	full_ref->parent = parent;
+	full_ref->root = ref_root;
+	if (parent)
 		ref->type = BTRFS_SHARED_DATA_REF_KEY;
-	} else {
-		full_ref->root = ref_root;
+	else
 		ref->type = BTRFS_EXTENT_DATA_REF_KEY;
-	}
 
 	full_ref->objectid = owner;
 	full_ref->offset = offset;
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 8316bff..a5fb2bc 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -98,19 +98,15 @@ struct btrfs_delayed_ref_head {
 
 struct btrfs_delayed_tree_ref {
struct btrfs_delayed_ref_node node;
-   union {
-   u64 root;
-   u64 parent;
-   };
+   u64 root;
+   u64 parent;
int level;
 };
 
 struct btrfs_delayed_data_ref {
struct btrfs_delayed_ref_node node;
-   union {
-   u64 root;
-   u64 parent;
-   };
+   u64 root;
+   u64 parent;
u64 objectid;
u64 offset;
 };
-- 
1.7.3.4



[PATCH v2 10/10] Btrfs: new backref walking code

2012-01-04 Thread Jan Schmidt
The old backref iteration code could only safely be used on commit roots.
Besides this limitation, it had bugs in finding the roots for these
references. This commit replaces large parts of it by btrfs_find_all_roots()
which a) really finds all roots and the correct roots, b) works correctly
under heavy file system load, c) considers delayed refs.

Signed-off-by: Jan Schmidt list.bt...@jan-o-sch.net
---
 fs/btrfs/backref.c |  354 +++-
 fs/btrfs/ioctl.c   |8 +-
 fs/btrfs/scrub.c   |7 +-
 3 files changed, 107 insertions(+), 262 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 03c30a1..c4c3d5d 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -23,18 +23,6 @@
 #include "transaction.h"
 #include "delayed-ref.h"
 
-struct __data_ref {
-	struct list_head list;
-	u64 inum;
-	u64 root;
-	u64 extent_data_item_offset;
-};
-
-struct __shared_ref {
-	struct list_head list;
-	u64 disk_byte;
-};
-
 /*
  * this structure records all encountered refs on the way up to the root
  */
@@ -964,8 +952,11 @@ int extent_from_logical(struct btrfs_fs_info *fs_info, u64 logical,
 	btrfs_item_key_to_cpu(path->nodes[0], found_key, path->slots[0]);
 	if (found_key->type != BTRFS_EXTENT_ITEM_KEY ||
 	    found_key->objectid > logical ||
-	    found_key->objectid + found_key->offset <= logical)
+	    found_key->objectid + found_key->offset <= logical) {
+		pr_debug("logical %llu is not within any extent\n",
+			 (unsigned long long)logical);
 		return -ENOENT;
+	}
 
 	eb = path->nodes[0];
 	item_size = btrfs_item_size_nr(eb, path->slots[0]);
@@ -974,6 +965,13 @@ int extent_from_logical(struct btrfs_fs_info *fs_info, u64 logical,
 	ei = btrfs_item_ptr(eb, path->slots[0], struct btrfs_extent_item);
 	flags = btrfs_extent_flags(eb, ei);
 
+	pr_debug("logical %llu is at position %llu within the extent (%llu "
+		 "EXTENT_ITEM %llu) flags %#llx size %u\n",
+		 (unsigned long long)logical,
+		 (unsigned long long)(logical - found_key->objectid),
+		 (unsigned long long)found_key->objectid,
+		 (unsigned long long)found_key->offset,
+		 (unsigned long long)flags, item_size);
 	if (flags & BTRFS_EXTENT_FLAG_TREE_BLOCK)
 		return BTRFS_EXTENT_FLAG_TREE_BLOCK;
 	if (flags & BTRFS_EXTENT_FLAG_DATA)
@@ -1070,128 +1068,11 @@ int tree_backref_for_extent(unsigned long *ptr, struct extent_buffer *eb,
 	return 0;
 }
 
-static int __data_list_add(struct list_head *head, u64 inum,
-				u64 extent_data_item_offset, u64 root)
-{
-	struct __data_ref *ref;
-
-	ref = kmalloc(sizeof(*ref), GFP_NOFS);
-	if (!ref)
-		return -ENOMEM;
-
-	ref->inum = inum;
-	ref->extent_data_item_offset = extent_data_item_offset;
-	ref->root = root;
-	list_add_tail(&ref->list, head);
-
-	return 0;
-}
-
-static int __data_list_add_eb(struct list_head *head, struct extent_buffer *eb,
-				struct btrfs_extent_data_ref *dref)
-{
-	return __data_list_add(head, btrfs_extent_data_ref_objectid(eb, dref),
-				btrfs_extent_data_ref_offset(eb, dref),
-				btrfs_extent_data_ref_root(eb, dref));
-}
-
-static int __shared_list_add(struct list_head *head, u64 disk_byte)
-{
-	struct __shared_ref *ref;
-
-	ref = kmalloc(sizeof(*ref), GFP_NOFS);
-	if (!ref)
-		return -ENOMEM;
-
-	ref->disk_byte = disk_byte;
-	list_add_tail(&ref->list, head);
-
-	return 0;
-}
-
-static int __iter_shared_inline_ref_inodes(struct btrfs_fs_info *fs_info,
-					   u64 logical, u64 inum,
-					   u64 extent_data_item_offset,
-					   u64 extent_offset,
-					   struct btrfs_path *path,
-					   struct list_head *data_refs,
-					   iterate_extent_inodes_t *iterate,
-					   void *ctx)
-{
-	u64 ref_root;
-	u32 item_size;
-	struct btrfs_key key;
-	struct extent_buffer *eb;
-	struct btrfs_extent_item *ei;
-	struct btrfs_extent_inline_ref *eiref;
-	struct __data_ref *ref;
-	int ret;
-	int type;
-	int last;
-	unsigned long ptr = 0;
-
-	WARN_ON(!list_empty(data_refs));
-	ret = extent_from_logical(fs_info, logical, path, &key);
-	if (ret & BTRFS_EXTENT_FLAG_DATA)
-		ret = -EIO;
-	if (ret < 0)
-		goto out;
-
-	eb = path->nodes[0];
-	ei = btrfs_item_ptr(eb, path->slots[0], struct btrfs_extent_item);
-	item_size = btrfs_item_size_nr(eb, 

[PATCH v2 06/10] Btrfs: add sequence numbers to delayed refs

2012-01-04 Thread Jan Schmidt
From: Arne Jansen sensi...@gmx.net

Sequence numbers are needed to reconstruct the backrefs of a given extent to
a certain point in time. The total set of backrefs consists of the set of
backrefs recorded on disk plus the enqueued delayed refs for it that existed
at that moment.

This patch also adds a list that records all delayed refs which are
currently in the process of being added.

When walking all refs of an extent in btrfs_find_all_roots(), we freeze the
current state of delayed refs, honor anythinh up to this point and prevent
processing newer delayed refs to assert consistency.

Signed-off-by: Arne Jansen sensi...@gmx.net
Signed-off-by: Jan Schmidt list.bt...@jan-o-sch.net
---
 fs/btrfs/delayed-ref.c |   34 +++
 fs/btrfs/delayed-ref.h |   70 
 fs/btrfs/transaction.c |4 +++
 3 files changed, 108 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index babd37b..a405db0 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -101,6 +101,11 @@ static int comp_entry(struct btrfs_delayed_ref_node *ref2,
 		return -1;
 	if (ref1->type > ref2->type)
 		return 1;
+	/* merging of sequenced refs is not allowed */
+	if (ref1->seq < ref2->seq)
+		return -1;
+	if (ref1->seq > ref2->seq)
+		return 1;
 	if (ref1->type == BTRFS_TREE_BLOCK_REF_KEY ||
 	    ref1->type == BTRFS_SHARED_BLOCK_REF_KEY) {
 		return comp_tree_refs(btrfs_delayed_node_to_tree_ref(ref2),
@@ -209,6 +214,24 @@ int btrfs_delayed_ref_lock(struct btrfs_trans_handle *trans,
 	return 0;
 }
 
+int btrfs_check_delayed_seq(struct btrfs_delayed_ref_root *delayed_refs,
+			    u64 seq)
+{
+	struct seq_list *elem;
+
+	assert_spin_locked(&delayed_refs->lock);
+	if (list_empty(&delayed_refs->seq_head))
+		return 0;
+
+	elem = list_first_entry(&delayed_refs->seq_head, struct seq_list, list);
+	if (seq >= elem->seq) {
+		pr_debug("holding back delayed_ref %llu, lowest is %llu (%p)\n",
+			 seq, elem->seq, delayed_refs);
+		return 1;
+	}
+	return 0;
+}
+
 int btrfs_find_ref_cluster(struct btrfs_trans_handle *trans,
 			   struct list_head *cluster, u64 start)
 {
@@ -438,6 +461,7 @@ static noinline int add_delayed_ref_head(struct btrfs_fs_info *fs_info,
 	ref->action  = 0;
 	ref->is_head = 1;
 	ref->in_tree = 1;
+	ref->seq = 0;
 
 	head_ref = btrfs_delayed_node_to_head(ref);
 	head_ref->must_insert_reserved = must_insert_reserved;
@@ -479,6 +503,7 @@ static noinline int add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
 	struct btrfs_delayed_ref_node *existing;
 	struct btrfs_delayed_tree_ref *full_ref;
 	struct btrfs_delayed_ref_root *delayed_refs;
+	u64 seq = 0;
 
 	if (action == BTRFS_ADD_DELAYED_EXTENT)
 		action = BTRFS_ADD_DELAYED_REF;
@@ -494,6 +519,10 @@ static noinline int add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
 	ref->is_head = 0;
 	ref->in_tree = 1;
 
+	if (need_ref_seq(for_cow, ref_root))
+		seq = inc_delayed_seq(delayed_refs);
+	ref->seq = seq;
+
 	full_ref = btrfs_delayed_node_to_tree_ref(ref);
 	full_ref->parent = parent;
 	full_ref->root = ref_root;
@@ -534,6 +563,7 @@ static noinline int add_delayed_data_ref(struct btrfs_fs_info *fs_info,
 	struct btrfs_delayed_ref_node *existing;
 	struct btrfs_delayed_data_ref *full_ref;
 	struct btrfs_delayed_ref_root *delayed_refs;
+	u64 seq = 0;
 
 	if (action == BTRFS_ADD_DELAYED_EXTENT)
 		action = BTRFS_ADD_DELAYED_REF;
@@ -549,6 +579,10 @@ static noinline int add_delayed_data_ref(struct btrfs_fs_info *fs_info,
 	ref->is_head = 0;
 	ref->in_tree = 1;
 
+	if (need_ref_seq(for_cow, ref_root))
+		seq = inc_delayed_seq(delayed_refs);
+	ref->seq = seq;
+
 	full_ref = btrfs_delayed_node_to_data_ref(ref);
 	full_ref->parent = parent;
 	full_ref->root = ref_root;
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index a5fb2bc..174416f 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -33,6 +33,9 @@ struct btrfs_delayed_ref_node {
/* the size of the extent */
u64 num_bytes;
 
+   /* seq number to keep track of insertion order */
+   u64 seq;
+
/* ref count on this data structure */
atomic_t refs;
 
@@ -136,6 +139,20 @@ struct btrfs_delayed_ref_root {
int flushing;
 
u64 run_delayed_start;
+
+   /*
+* seq number of delayed refs. We need to know if a backref was being
+* added before the currently processed ref or afterwards.
+*/
+   u64 seq;
+
+   /*
+* seq_list holds a list of all seq numbers that are 

[PATCH v2 05/10] Btrfs: add nested locking mode for paths

2012-01-04 Thread Jan Schmidt
From: Arne Jansen sensi...@gmx.net

This patch adds the possibility to read-lock an extent even if it is already
write-locked from the same thread. btrfs_find_all_roots() needs this
capability.

Signed-off-by: Arne Jansen sensi...@gmx.net
Signed-off-by: Jan Schmidt list.bt...@jan-o-sch.net
---
 fs/btrfs/extent_io.c |1 +
 fs/btrfs/extent_io.h |2 +
 fs/btrfs/locking.c   |   53 -
 3 files changed, 54 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index be1bf62..dd8d140 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3571,6 +3571,7 @@ static struct extent_buffer *__alloc_extent_buffer(struct extent_io_tree *tree,
 	atomic_set(&eb->blocking_writers, 0);
 	atomic_set(&eb->spinning_readers, 0);
 	atomic_set(&eb->spinning_writers, 0);
+	eb->lock_nested = 0;
 	init_waitqueue_head(&eb->write_lock_wq);
 	init_waitqueue_head(&eb->read_lock_wq);
 
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 7604c30..bc6a042cb 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -129,6 +129,7 @@ struct extent_buffer {
struct list_head leak_list;
struct rcu_head rcu_head;
atomic_t refs;
+   pid_t lock_owner;
 
/* count of read lock holders on the extent buffer */
atomic_t write_locks;
@@ -137,6 +138,7 @@ struct extent_buffer {
atomic_t blocking_readers;
atomic_t spinning_readers;
atomic_t spinning_writers;
+   int lock_nested;
 
/* protects write locks */
rwlock_t lock;
diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
index d77b67c..5e178d8 100644
--- a/fs/btrfs/locking.c
+++ b/fs/btrfs/locking.c
@@ -33,6 +33,14 @@ void btrfs_assert_tree_read_locked(struct extent_buffer *eb);
  */
 void btrfs_set_lock_blocking_rw(struct extent_buffer *eb, int rw)
 {
+   if (eb->lock_nested) {
+   read_lock(&eb->lock);
+   if (eb->lock_nested && current->pid == eb->lock_owner) {
+   read_unlock(&eb->lock);
+   return;
+   }
+   read_unlock(&eb->lock);
+   }
if (rw == BTRFS_WRITE_LOCK) {
if (atomic_read(&eb->blocking_writers) == 0) {
WARN_ON(atomic_read(&eb->spinning_writers) != 1);
@@ -57,6 +65,14 @@ void btrfs_set_lock_blocking_rw(struct extent_buffer *eb, int rw)
  */
 void btrfs_clear_lock_blocking_rw(struct extent_buffer *eb, int rw)
 {
+   if (eb->lock_nested) {
+   read_lock(&eb->lock);
+   if (eb->lock_nested && current->pid == eb->lock_owner) {
+   read_unlock(&eb->lock);
+   return;
+   }
+   read_unlock(&eb->lock);
+   }
if (rw == BTRFS_WRITE_LOCK_BLOCKING) {
BUG_ON(atomic_read(&eb->blocking_writers) != 1);
write_lock(&eb->lock);
@@ -81,12 +97,25 @@ void btrfs_clear_lock_blocking_rw(struct extent_buffer *eb, int rw)
 void btrfs_tree_read_lock(struct extent_buffer *eb)
 {
 again:
+   read_lock(&eb->lock);
+   if (atomic_read(&eb->blocking_writers) &&
+   current->pid == eb->lock_owner) {
+   /*
+* This extent is already write-locked by our thread. We allow
+* an additional read lock to be added because it's for the same
+* thread. btrfs_find_all_roots() depends on this as it may be
+* called on a partly (write-)locked tree.
+*/
+   BUG_ON(eb->lock_nested);
+   eb->lock_nested = 1;
+   read_unlock(&eb->lock);
+   return;
+   }
+   read_unlock(&eb->lock);
wait_event(eb->write_lock_wq, atomic_read(&eb->blocking_writers) == 0);
read_lock(&eb->lock);
if (atomic_read(&eb->blocking_writers)) {
read_unlock(&eb->lock);
-   wait_event(eb->write_lock_wq,
-  atomic_read(&eb->blocking_writers) == 0);
goto again;
}
atomic_inc(&eb->read_locks);
@@ -129,6 +158,7 @@ int btrfs_try_tree_write_lock(struct extent_buffer *eb)
}
atomic_inc(&eb->write_locks);
atomic_inc(&eb->spinning_writers);
+   eb->lock_owner = current->pid;
return 1;
 }
 
@@ -137,6 +167,15 @@ int btrfs_try_tree_write_lock(struct extent_buffer *eb)
  */
 void btrfs_tree_read_unlock(struct extent_buffer *eb)
 {
+   if (eb->lock_nested) {
+   read_lock(&eb->lock);
+   if (eb->lock_nested && current->pid == eb->lock_owner) {
+   eb->lock_nested = 0;
+   read_unlock(&eb->lock);
+   return;
+   }
+   read_unlock(&eb->lock);
+   }
btrfs_assert_tree_read_locked(eb);
WARN_ON(atomic_read(&eb->spinning_readers) == 0);
atomic_dec(&eb->spinning_readers);
@@ -149,6 +188,15 @@ void btrfs_tree_read_unlock(struct 

[PATCH v2 08/10] Btrfs: add waitqueue instead of doing busy waiting for more delayed refs

2012-01-04 Thread Jan Schmidt
Now that we may be holding back delayed refs for a limited period, we
might end up having no runnable delayed refs. Without this commit, we'd
do busy waiting in that thread until another (runnable) ref arrives.
Instead, we detect this situation and use a waitqueue, such that
we only try to run more refs after
a) another runnable ref was added, or
b) delayed refs are no longer held back
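The wait/wake pattern this commit message describes can be sketched in userspace with a condition variable standing in for the kernel waitqueue (all names here are invented; `wait_for` plays the role of `wait_event()` and `notify_all` of `wake_up()`):

```python
import threading

# Userspace sketch of the commit's wait/wake pattern; all names invented.
cond = threading.Condition()
num_entries = 0         # number of runnable delayed refs
refs_held_back = True   # a backref walk is holding refs back

def wait_for_more_refs(num_refs_seen):
    # Sleep until (a) another runnable ref was added or (b) delayed
    # refs are no longer held back -- instead of busy waiting.
    with cond:
        cond.wait_for(lambda: num_entries != num_refs_seen
                      or not refs_held_back)

def add_ref():
    global num_entries
    with cond:
        num_entries += 1     # condition (a)
        cond.notify_all()    # the wake_up() analogue

t = threading.Thread(target=add_ref)
t.start()
wait_for_more_refs(0)        # blocks until add_ref() runs
t.join()
print(num_entries)           # -> 1
```

The predicate is re-evaluated on every wakeup, so spurious wakeups are harmless, which mirrors the `wait_event()` semantics in the patch below.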

Signed-off-by: Jan Schmidt list.bt...@jan-o-sch.net
---
 fs/btrfs/delayed-ref.c |8 ++
 fs/btrfs/delayed-ref.h |7 +
 fs/btrfs/extent-tree.c |   59 +++-
 fs/btrfs/transaction.c |1 +
 4 files changed, 74 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index ee18198..66e4f29 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -664,6 +664,9 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
   num_bytes, parent, ref_root, level, action,
   for_cow);
BUG_ON(ret);
+   if (!need_ref_seq(for_cow, ref_root) &&
+   waitqueue_active(&delayed_refs->seq_wait))
+   wake_up(&delayed_refs->seq_wait);
spin_unlock(&delayed_refs->lock);
return 0;
 }
@@ -712,6 +715,9 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
   num_bytes, parent, ref_root, owner, offset,
   action, for_cow);
BUG_ON(ret);
+   if (!need_ref_seq(for_cow, ref_root) &&
+   waitqueue_active(&delayed_refs->seq_wait))
+   wake_up(&delayed_refs->seq_wait);
spin_unlock(&delayed_refs->lock);
return 0;
 }
@@ -739,6 +745,8 @@ int btrfs_add_delayed_extent_op(struct btrfs_fs_info *fs_info,
   extent_op->is_data);
BUG_ON(ret);
 
+   if (waitqueue_active(&delayed_refs->seq_wait))
+   wake_up(&delayed_refs->seq_wait);
spin_unlock(&delayed_refs->lock);
return 0;
 }
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 174416f..d8f244d 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -153,6 +153,12 @@ struct btrfs_delayed_ref_root {
 * as it might influence the outcome of the walk.
 */
struct list_head seq_head;
+
+   /*
+* when the only refs we have in the list must not be processed, we want
+* to wait for more refs to show up or for the end of backref walking.
+*/
+   wait_queue_head_t seq_wait;
 };
 
 static inline void btrfs_put_delayed_ref(struct btrfs_delayed_ref_node *ref)
@@ -216,6 +222,7 @@ btrfs_put_delayed_seq(struct btrfs_delayed_ref_root *delayed_refs,
 {
spin_lock(&delayed_refs->lock);
list_del(&elem->list);
+   wake_up(&delayed_refs->seq_wait);
spin_unlock(&delayed_refs->lock);
 }
 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index bbcca12..0a435e2 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2300,7 +2300,12 @@ static noinline int run_clustered_refs(struct btrfs_trans_handle *trans,
ref->in_tree = 0;
rb_erase(&ref->rb_node, &delayed_refs->root);
delayed_refs->num_entries--;
-
+   /*
+* we modified num_entries, but as we're currently running
+* delayed refs, skip
+* wake_up(&delayed_refs->seq_wait);
+* here.
+*/
spin_unlock(&delayed_refs->lock);
 
ret = run_one_delayed_ref(trans, root, ref, extent_op,
@@ -2317,6 +2322,23 @@ static noinline int run_clustered_refs(struct btrfs_trans_handle *trans,
return count;
 }
 
+
+static void wait_for_more_refs(struct btrfs_delayed_ref_root *delayed_refs,
+   unsigned long num_refs)
+{
+   struct list_head *first_seq = delayed_refs->seq_head.next;
+
+   spin_unlock(&delayed_refs->lock);
+   pr_debug("waiting for more refs (num %ld, first %p)\n",
+num_refs, first_seq);
+   wait_event(delayed_refs->seq_wait,
+  num_refs != delayed_refs->num_entries ||
+  delayed_refs->seq_head.next != first_seq);
+   pr_debug("done waiting for more refs (num %ld, first %p)\n",
+delayed_refs->num_entries, delayed_refs->seq_head.next);
+   spin_lock(&delayed_refs->lock);
+}
+
 /*
  * this starts processing the delayed reference count updates and
  * extent insertions we have queued up so far.  count can be
@@ -2332,8 +2354,11 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
struct btrfs_delayed_ref_node *ref;
struct list_head cluster;
int ret;
+   u64 delayed_start;
int run_all = count == (unsigned long)-1;
int run_most = 0;
+   unsigned long num_refs = 0;
+   int consider_waiting;
 
if (root == root->fs_info->extent_root)
   

[PATCH v2 01/10] Btrfs: generic data structure to build unique lists

2012-01-04 Thread Jan Schmidt
From: Arne Jansen sensi...@gmx.net

ulist is a generic data structure to hold a collection of unique u64
values. The only operations it supports are adding to the list and
enumerating it.

It is possible to store an auxiliary value along with the key. The
implementation is preliminary and can probably be sped up significantly.

It is used by btrfs_find_all_roots() and the quota code to translate
recursions into iterative loops.
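A minimal userspace sketch of the ulist semantics (illustrative only; the kernel structure stores nodes in a fixed inline array that spills to a heap allocation, while this sketch leans on an insertion-ordered dict):

```python
class Ulist:
    """Sketch of the ulist semantics: a collection of unique u64 keys
    with an optional aux value each, supporting only add and in-order
    enumeration. The kernel version uses an inline array, not a dict."""

    def __init__(self):
        self._aux = {}              # key -> aux, insertion ordered

    def add(self, val, aux=0):
        """Return 1 if val was newly added, 0 if it was already present."""
        if val in self._aux:
            return 0
        self._aux[val] = aux
        return 1

    def __iter__(self):
        return iter(self._aux.items())

u = Ulist()
assert u.add(100) == 1
assert u.add(100) == 0              # duplicates are ignored
u.add(200, aux=7)
print(list(u))                      # -> [(100, 0), (200, 7)]
```

Because duplicates are rejected at insert time, the graph-walk loop shown in the patch's own pseudo-code visits each node exactly once without any recursion.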

Signed-off-by: Arne Jansen sensi...@gmx.net
Signed-off-by: Jan Schmidt list.bt...@jan-o-sch.net
---
 fs/btrfs/Makefile |2 +-
 fs/btrfs/ulist.c  |  220 +
 fs/btrfs/ulist.h  |   68 
 3 files changed, 289 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index c0ddfd2..7079840 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -8,6 +8,6 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
   extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \
   export.o tree-log.o free-space-cache.o zlib.o lzo.o \
   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
-  reada.o backref.o
+  reada.o backref.o ulist.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
diff --git a/fs/btrfs/ulist.c b/fs/btrfs/ulist.c
new file mode 100644
index 000..12f5147
--- /dev/null
+++ b/fs/btrfs/ulist.c
@@ -0,0 +1,220 @@
+/*
+ * Copyright (C) 2011 STRATO AG
+ * written by Arne Jansen sensi...@gmx.net
+ * Distributed under the GNU GPL license version 2.
+ */
+
+#include <linux/slab.h>
+#include <linux/module.h>
+#include "ulist.h"
+
+/*
+ * ulist is a generic data structure to hold a collection of unique u64
+ * values. The only operations it supports are adding to the list and
+ * enumerating it.
+ * It is possible to store an auxiliary value along with the key.
+ *
+ * The implementation is preliminary and can probably be sped up
+ * significantly. A first step would be to store the values in an rbtree
+ * as soon as ULIST_SIZE is exceeded.
+ *
+ * A sample usage for ulists is the enumeration of directed graphs without
+ * visiting a node twice. The pseudo-code could look like this:
+ *
+ * ulist = ulist_alloc();
+ * ulist_add(ulist, root);
+ * elem = NULL;
+ *
+ * while ((elem = ulist_next(ulist, elem))) {
+ * for (all child nodes n in elem)
+ * ulist_add(ulist, n);
+ * do something useful with the node;
+ * }
+ * ulist_free(ulist);
+ *
+ * This assumes the graph nodes are addressable by u64. This stems from the
+ * usage for tree enumeration in btrfs, where the logical addresses are
+ * 64 bit.
+ *
+ * It is also useful for tree enumeration which could be done elegantly
+ * recursively, but is not possible due to kernel stack limitations. The
+ * loop would be similar to the above.
+ */
+
+/**
+ * ulist_init - freshly initialize a ulist
+ * @ulist: the ulist to initialize
+ *
+ * Note: don't use this function to init an already used ulist, use
+ * ulist_reinit instead.
+ */
+void ulist_init(struct ulist *ulist)
+{
+   ulist->nnodes = 0;
+   ulist->nodes = ulist->int_nodes;
+   ulist->nodes_alloced = ULIST_SIZE;
+}
+EXPORT_SYMBOL(ulist_init);
+
+/**
+ * ulist_fini - free up additionally allocated memory for the ulist
+ * @ulist: the ulist from which to free the additional memory
+ *
+ * This is useful in cases where the base 'struct ulist' has been statically
+ * allocated.
+ */
+void ulist_fini(struct ulist *ulist)
+{
+   /*
+* The first ULIST_SIZE elements are stored inline in struct ulist.
+* Only if more elements are allocated they need to be freed.
+*/
+   if (ulist->nodes_alloced > ULIST_SIZE)
+   kfree(ulist->nodes);
+   ulist->nodes_alloced = 0;   /* in case ulist_fini is called twice */
+}
+EXPORT_SYMBOL(ulist_fini);
+
+/**
+ * ulist_reinit - prepare a ulist for reuse
+ * @ulist: ulist to be reused
+ *
+ * Free up all additional memory allocated for the list elements and reinit
+ * the ulist.
+ */
+void ulist_reinit(struct ulist *ulist)
+{
+   ulist_fini(ulist);
+   ulist_init(ulist);
+}
+EXPORT_SYMBOL(ulist_reinit);
+
+/**
+ * ulist_alloc - dynamically allocate a ulist
+ * @gfp_mask:  allocation flags to use for base allocation
+ *
+ * The allocated ulist will be returned in an initialized state.
+ */
+struct ulist *ulist_alloc(unsigned long gfp_mask)
+{
+   struct ulist *ulist = kmalloc(sizeof(*ulist), gfp_mask);
+
+   if (!ulist)
+   return NULL;
+
+   ulist_init(ulist);
+
+   return ulist;
+}
+EXPORT_SYMBOL(ulist_alloc);
+
+/**
+ * ulist_free - free dynamically allocated ulist
+ * @ulist: ulist to free
+ *
+ * It is not necessary to call ulist_fini before.
+ */
+void ulist_free(struct ulist *ulist)
+{
+   if (!ulist)
+   return;
+   ulist_fini(ulist);
+   kfree(ulist);
+}
+EXPORT_SYMBOL(ulist_free);
+
+/**
+ * ulist_add - add an element to the ulist
+ * @ulist: 

[PATCH v2 07/10] Btrfs: put back delayed refs that are too new

2012-01-04 Thread Jan Schmidt
From: Arne Jansen sensi...@gmx.net

When processing a delayed ref, first check if there are still old refs in
the process of being added. If so, put this ref back to the tree. To avoid
looping on this ref, choose a newer one in the next loop.
btrfs_find_ref_cluster has to take care of that.
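The gating logic can be sketched as follows (a hedged illustration with invented names and data; the kernel's btrfs_check_delayed_seq() walks the seq_head list under the delayed_refs lock rather than a Python set):

```python
# Hedged sketch with invented names/data: refs are identified by their
# seq number; a ref must not run while an older (lower-seq) insertion
# is still in flight, so it is put back for a later pass.
in_flight_seqs = {3}      # seq numbers of insertions still in progress

def check_delayed_seq(seq):
    # True means: an older ref is still being added -> don't run yet
    return any(s < seq for s in in_flight_seqs)

def run_refs(refs):
    done, put_back = [], []
    for seq in refs:
        if check_delayed_seq(seq):
            put_back.append(seq)   # retried once the walk finishes
        else:
            done.append(seq)
    return done, put_back

done, put_back = run_refs([1, 2, 5, 7])
print(done, put_back)              # -> [1, 2] [5, 7]
```

Picking a newer ref on the next loop (as the patch does via the `start + 1` / `return_bigger` change to find_ref_head) avoids spinning on the same deferred entry forever.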

Signed-off-by: Arne Jansen sensi...@gmx.net
Signed-off-by: Jan Schmidt list.bt...@jan-o-sch.net
---
 fs/btrfs/delayed-ref.c |   43 +--
 fs/btrfs/extent-tree.c |   27 ++-
 2 files changed, 47 insertions(+), 23 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index a405db0..ee18198 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -155,16 +155,22 @@ static struct btrfs_delayed_ref_node *tree_insert(struct rb_root *root,
 
 /*
  * find an head entry based on bytenr. This returns the delayed ref
- * head if it was able to find one, or NULL if nothing was in that spot
+ * head if it was able to find one, or NULL if nothing was in that spot.
+ * If return_bigger is given, the next bigger entry is returned if no exact
+ * match is found.
  */
 static struct btrfs_delayed_ref_node *find_ref_head(struct rb_root *root,
  u64 bytenr,
- struct btrfs_delayed_ref_node **last)
+ struct btrfs_delayed_ref_node **last,
+ int return_bigger)
 {
-   struct rb_node *n = root->rb_node;
+   struct rb_node *n;
struct btrfs_delayed_ref_node *entry;
-   int cmp;
+   int cmp = 0;
 
+again:
+   n = root->rb_node;
+   entry = NULL;
while (n) {
entry = rb_entry(n, struct btrfs_delayed_ref_node, rb_node);
WARN_ON(!entry->in_tree);
@@ -187,6 +193,19 @@ static struct btrfs_delayed_ref_node *find_ref_head(struct rb_root *root,
else
return entry;
}
+   if (entry && return_bigger) {
+   if (cmp > 0) {
+   n = rb_next(&entry->rb_node);
+   if (!n)
+   n = rb_first(root);
+   entry = rb_entry(n, struct btrfs_delayed_ref_node,
+rb_node);
+   bytenr = entry->bytenr;
+   return_bigger = 0;
+   goto again;
+   }
+   return entry;
+   }
return NULL;
 }
 
@@ -246,20 +265,8 @@ int btrfs_find_ref_cluster(struct btrfs_trans_handle *trans,
node = rb_first(&delayed_refs->root);
} else {
ref = NULL;
-   find_ref_head(&delayed_refs->root, start, &ref);
+   find_ref_head(&delayed_refs->root, start + 1, &ref, 1);
if (ref) {
-   struct btrfs_delayed_ref_node *tmp;
-
-   node = rb_prev(&ref->rb_node);
-   while (node) {
-   tmp = rb_entry(node,
-  struct btrfs_delayed_ref_node,
-  rb_node);
-   if (tmp->bytenr < start)
-   break;
-   ref = tmp;
-   node = rb_prev(&ref->rb_node);
-   }
node = &ref->rb_node;
} else
node = rb_first(&delayed_refs->root);
@@ -748,7 +755,7 @@ btrfs_find_delayed_ref_head(struct btrfs_trans_handle *trans, u64 bytenr)
struct btrfs_delayed_ref_root *delayed_refs;
 
delayed_refs = trans->transaction->delayed_refs;
-   ref = find_ref_head(&delayed_refs->root, bytenr, NULL);
+   ref = find_ref_head(&delayed_refs->root, bytenr, NULL, 0);
if (ref)
return btrfs_delayed_node_to_head(ref);
return NULL;
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index dc8b9a8..bbcca12 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2237,6 +2237,28 @@ static noinline int run_clustered_refs(struct btrfs_trans_handle *trans,
}
 
/*
+* locked_ref is the head node, so we have to go one
+* node back for any delayed ref updates
+*/
+   ref = select_delayed_ref(locked_ref);
+
+   if (ref && ref->seq &&
+   btrfs_check_delayed_seq(delayed_refs, ref->seq)) {
+   /*
+* there are still refs with lower seq numbers in the
+* process of being added. Don't run this ref yet.
+*/
+   list_del_init(&locked_ref->cluster);
+   mutex_unlock(&locked_ref->mutex);
+   locked_ref = NULL;
+   delayed_refs->num_heads_ready++;

[PATCH] xfstests: fixup check 276

2012-01-04 Thread Jan Schmidt
This commit fixes bd8ee45c. Changes:
- added a _require_btrfs helper function
- check for filefrag with _require_command
- always use _fail in case of errors
- added some comments
- removed $fresh code
- don't set FSTYP

Signed-off-by: Jan Schmidt list.bt...@jan-o-sch.net
---
 276   |  119 +
 common.rc |   12 ++
 2 files changed, 84 insertions(+), 47 deletions(-)

diff --git a/276 b/276
index f22d089..082f943 100755
--- a/276
+++ b/276
@@ -1,5 +1,29 @@
 #! /bin/bash
-
+# FSQA Test No. 276
+#
+# Run fsstress to create a reasonably strange file system, make a
+# snapshot and run more fsstress. Then select some files from that fs,
+# run filefrag to get the extent mapping and follow the backrefs.
+# We check to end up back at the original file with the correct offset.
+#
+#---
+# Copyright (C) 2011 STRATO.  All rights reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#
+#---
+#
 # creator
 owner=list.bt...@jan-o-sch.net
 
@@ -7,18 +31,13 @@ seq=`basename $0`
 echo "QA output created by $seq"
 
 here=`pwd`
-# 1=production, 0=avoid touching the scratch dev (no mount/umount, no writes)
-fresh=1
 tmp=/tmp/$$
 status=1
-FSTYP=btrfs
 
 _cleanup()
 {
-   if [ $fresh -ne 0 ]; then
-   echo "*** unmount "
-   umount $SCRATCH_MNT 2>/dev/null
-   fi
+   echo "*** unmount"
+   umount $SCRATCH_MNT 2>/dev/null
rm -f $tmp.*
 }
trap "_cleanup; exit \$status" 0 1 2 3 15
@@ -28,21 +47,14 @@ trap _cleanup; exit \$status 0 1 2 3 15
 . ./common.filter
 
 # real QA test starts here
+_need_to_be_root
 _supported_fs btrfs
 _supported_os Linux
-
-if [ $fresh -ne 0 ]; then
-   _require_scratch
-fi
+_require_scratch
 
 _require_nobigloopfs
-
-[ -n "$BTRFS_UTIL_PROG" ] || _notrun "btrfs executable not found"
-$BTRFS_UTIL_PROG inspect-internal --help > /dev/null 2>&1
-[ $? -eq 0 ] || _notrun "btrfs executable too old"
-which filefrag > /dev/null 2>&1
-[ $? -eq 0 ] || _notrun "filefrag missing"
-
+_require_btrfs inspect-internal
+_require_command /usr/sbin/filefrag
 
 rm -f $seq.full
 
@@ -52,6 +64,10 @@ FILEFRAG_FILTER='if (/, blocksize (\d+)/) {$blocksize = $1; next} ($ext, '\
'/(?:^|,)inline(?:,|$)/ and next; print $physical * $blocksize, "#", '\
'$length * $blocksize, "#", $logical * $blocksize, "\n"'
 
+# this makes filefrag output script readable by using a perl helper.
+# output is one extent per line, with three numbers separated by '#'
+# the numbers are: physical, length, logical (all in bytes)
+# sample output: 1234#10#5678 - physical 1234, length 10, logical 5678
 _filter_extents()
 {
tee -a $seq.full | $PERL_PROG -ne $FILEFRAG_FILTER
@@ -70,6 +86,9 @@ _check_file_extents()
return 0
 }
 
+# use a logical address and walk the backrefs back to the inode.
+# compare to the expected result.
+# returns 0 on success, 1 on error (with output made)
 _btrfs_inspect_addr()
 {
mp=$1
@@ -101,6 +120,9 @@ _btrfs_inspect_addr()
return 1
 }
 
+# use an inode number and walk the backrefs back to the file name.
+# compare to the expected result.
+# returns 0 on success, 1 on error (with output made)
 _btrfs_inspect_inum()
 {
file=$1
@@ -134,14 +156,13 @@ _btrfs_inspect_check()
echo "# $cmd" >> $seq.full
inum=`$cmd`
echo "$inum" >> $seq.full
-   _btrfs_inspect_addr $SCRATCH_MNT/$snap_name $physical $logical $inum \
-   $file
+   _btrfs_inspect_addr $SCRATCH_MNT $physical $logical $inum $file
ret=$?
if [ $ret -eq 0 ]; then
_btrfs_inspect_inum $file $inum $snap_name
ret=$?
fi
-   return $?
+   return $ret
 }
 
 run_check()
@@ -157,30 +178,34 @@ workout()
procs=$3
snap_name=$4
 
-   if [ $fresh -ne 0 ]; then
-   umount $SCRATCH_DEV >/dev/null 2>&1
-   echo "*** mkfs -dsize=$fsz" >>$seq.full
-   echo "" >>$seq.full
-   _scratch_mkfs_sized $fsz >>$seq.full 2>&1 \
-   || _fail "size=$fsz mkfs failed"
-   _scratch_mount >>$seq.full 2>&1 \
-   || _fail "mount failed"
-   # -w ensures that the only ops are ones which cause 

Re: btrfs-tools in Debian squeeze-backports?

2012-01-04 Thread Jan Schmidt
On 02.01.2012 16:01, Daniel Pocock wrote:
 One thing I've already noticed in 2.6.39 (and both versions of the
 tools) is that df results are misleading.  E.g. if I run regular df (not
 btrfs fi df), I am seeing the same amount of available space for all
 filesystems.  Is there currently a way to see space used by each
 subvolume and snapshot and which kernel and tools versions might be needed?

If you're interested (and brave), you might give the subvolume quota
patches a try. Arne sent them last October, subject

   [PATCH v0 00/18] btfs: Subvolume Quota Groups

Be warned that this is a v0 patch; it has not been tested much and
will very likely be reworked in the future. But that kind of
functionality will hopefully be added to btrfs eventually.

-Jan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfsprogs source code

2012-01-04 Thread David Brown

On Tue, Jan 03, 2012 at 01:05:07PM -0500, Calvin Walton wrote:


The best way to get the btrfs-progs source is probably via git; Chris
Mason's repository for it can be found at
http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-progs.git


Chris,

The wiki at
https://btrfs.wiki.kernel.org/articles/b/t/r/Btrfs_source_repositories.html
still refers to a btrfs-progs-unstable.git repository, which is not
present at git.kernel.org.  Should we update this wiki, or do you have
plans on pushing an unstable repository again?

Thanks,
David


Re: btrfsprogs source code

2012-01-04 Thread Hugo Mills
On Wed, Jan 04, 2012 at 10:24:20AM -0800, David Brown wrote:
 On Tue, Jan 03, 2012 at 01:05:07PM -0500, Calvin Walton wrote:
 
 The best way to get the btrfs-progs source is probably via git; Chris
 Mason's repository for it can be found at
 http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-progs.git
 
 Chris,
 
 The wiki at
 https://btrfs.wiki.kernel.org/articles/b/t/r/Btrfs_source_repositories.html
 still refers to a btrfs-progs-unstable.git repository, which is not
 present at git.kernel.org.  Should we update this wiki, or do you have
 plans on pushing an unstable repository again?

   That wiki is read-only, unfortunately. The up-to-date wiki is at
[1], and we'll be decanting that back onto the kernel.org one when the
kernel.org wiki is back in full working order.

   Hugo.

[1] http://btrfs.ipv5.de/

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- echo killall cat  ~/curiosity.sh ---   




Re: use btrfsck to check btrfs filesystems

2012-01-04 Thread Christoph Hellwig
On Wed, Dec 14, 2011 at 03:35:20PM +0800, Miao Xie wrote:
 We failed to get the fsck program to check the btrfs file system; it is
 because btrfs uses its own independent check tool, named btrfsck,
 to check the file system, so the common checker -- fsck -- could not
 find it, and reported there is no checker.
 
 This patch fixes it by using btrfsck directly.
 
 Signed-off-by: Miao Xie mi...@cn.fujitsu.com

Thanks, applied.



Re: 277: new test to verify on disk ctime update for chattr

2012-01-04 Thread Christoph Hellwig
On Thu, Dec 22, 2011 at 11:55:03AM +0800, Li Zefan wrote:
 We had a bug in btrfs which can be triggered by this test.

Thanks, applied.



Re: (renamed thread) btrfs metrics

2012-01-04 Thread Kok, Auke-jan H
On Wed, Jan 4, 2012 at 3:48 AM, Daniel Pocock dan...@pocock.com.au wrote:

 I am looking at what metrics are needed to monitor btrfs in production.
  I actually look after the ganglia-modules-linux package, which includes
 some FS space metrics, but I figured that btrfs throws all that out the
 window.

 Can you suggest metrics that would be meaningful, do I look in /proc or
 with syscalls, is there any code I should look at for an example of how
 to extract them with C?  Ideally, Ganglia runs without root privileges
 too, so please let me know if btrfs will allow me to access them

    It depends on what you want to know, really. If you want "how close
 am I to a full filesystem?", then the output of df will give you a
 measure, even if it could be up to a factor of 2 out -- you can use it
 for predictive planning, though, as it'll be near zero when the FS
 runs out of space.


 Maybe if you look at it from the point of the sysadmin and think about
 what questions he might want to ask:

 a) how much space would I reclaim if I deleted snapshot X?

 b) how much space would I reclaim if I deleted all snapshots?

 c) how much space would I need if I start making 4 snapshots a day and
 keeping them for 48 hours?

chiming in on the discussion - what I'd like to personally see:

First, probably easiest: Display per subvol the space used that is
"unique" (not used by other subvolumes), and "shared" (the opposite -
all blocks that appear in other subvolumes as well).

From there on, one could potentially create a matrix: (proportional
font art, apologies):

  | subvol1  | subvol2  | subvol3  |
--+--+--+--+
 subvol1  |   200M   | 20M  | 50M  |
--+--+--+--+
 subvol2  |20M   |350M  | 22M  |
--+--+--+--+
 subvol3  |50M   | 22M  |634M  |
--+--+--+--+

The diagonal obviously shows the unique blocks, subvol2 and subvol1
share 20M data, etc. Missing from this plot would be how much is
shared between subvol1, subvol2, and subvol3 together, but it's a
start and not something that hard to understand. One might add a
column for total size of each subvol, which may obviously not be an
addition of the rest of the columns in this diagram.

Anyway, something like this would be high on my list of `df` numbers
I'd like to see - since I think they are useful numbers.
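Such a matrix is, in effect, a table of pairwise intersections over per-subvolume block sets, as this sketch shows (all names and numbers are made up; btrfs itself would have to supply the per-subvolume block ownership data, and real entries would be byte counts rather than block counts):

```python
# Sketch only: per-subvolume block sets are invented; btrfs would have
# to provide them. One block = one unit to keep the arithmetic obvious.
subvols = {
    "subvol1": {1, 2, 3, 4},
    "subvol2": {3, 4, 5},
    "subvol3": {5, 6},
}

def share_matrix(subvols):
    names = sorted(subvols)
    m = {}
    for a in names:
        for b in names:
            if a == b:
                # diagonal: blocks unique to this subvolume
                others = set().union(*(subvols[x] for x in names if x != a))
                m[a, b] = len(subvols[a] - others)
            else:
                # off-diagonal: blocks shared by the pair
                m[a, b] = len(subvols[a] & subvols[b])
    return m

m = share_matrix(subvols)
print(m["subvol1", "subvol1"],      # unique to subvol1
      m["subvol1", "subvol2"],      # shared between subvol1 and subvol2
      m["subvol2", "subvol3"])      # -> 2 2 1
```

As noted above, blocks shared by all three subvolumes at once are not captured by the pairwise entries, and the per-subvolume totals are not simply row sums.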

Cheers,

Auke


Btrfs partition lost after RAID1 mirror disk failure?

2012-01-04 Thread Dan Garton
Hi, thanks for the reply.

Yes, I agree, after going back over the commands, those ones you
highlighted seem very suspicious
These commands were executed weeks ago amid a fair amount of confusion.

But yes, I think that you are right - from memory the FS became
inaccessible at about the time you mention.
I would say that was the best theory as regards this problem.

Assuming that this is the case, do I stand a chance of retrieving that
volume and accessing that data again?
Or does "destructive" imply total loss? (In which case, I'll cut my losses)



On 3 January 2012 21:49, C Anthony Risinger anth...@xtfx.me wrote:

 On Tue, Jan 3, 2012 at 8:44 AM, Dan Garton dan.gar...@gmail.com wrote:
 
   [...]
   1327  btrfs-vol -a
   1328  btrfs-vol -a /nuvat
   1329  btrfs-vol -a asdasd /nuvat
   1330  btrfs-vol -a missing /nuvat
   1331  btrfs-vol -a /dev/sdc /nuvat
   1332  btrfs-vol -a /dev/sdb /nuvat
   1334  btrfs-vol -a missing /nuvat
   [...]

 these look destructive to me ... adding the wrong devices and the
 existing devices back to the current array?  IIRC you should have `-r
 missing`, but in general, do not use the btrfsctl utility at all -- it
 won't have as much visibility/exception-handling/recovery as the
 `btrfs` utility.

 at what point did your FS become inaccessible?  your command history
 suggests it was working until shortly after these commands ... :-(

 --

 C Anthony


Re: [3.2-rc7] slowdown, warning + oops creating lots of files

2012-01-04 Thread Dave Chinner
On Thu, Jan 05, 2012 at 08:44:45AM +1100, Dave Chinner wrote:
 Hi there buttery folks,
 
 I just hit this warning and oops running a parallel fs_mark create
 workload on a test VM using a 17TB btrfs filesystem (12 disk dm
 RAID0) using default mkfs and mount parameters, mounted on
 /mnt/scratch. The VM has 8p and 4GB RAM, and the fs_mark command
 line was:
 
 $ ./fs_mark  -D  1  -S0  -n  10 -s  0  -L  250 \
   -d /mnt/scratch/0  -d /mnt/scratch/1 \
   -d /mnt/scratch/2  -d /mnt/scratch/3 \
   -d /mnt/scratch/4  -d /mnt/scratch/5 \
   -d /mnt/scratch/6  -d /mnt/scratch/7
 
 The attached image should give you a better idea of the performance
 drop-off that was well under way when the crash occurred at about 96
 million files created.
 
 I'm rerunning the test on a fresh filesystem, so I guess I'll see if
 this a one-off in the next couple of hours

Looks to be reproducible. With a fresh filesystem, performance was all
over the place from the start, and the warning/oops occurred at
about 43M files created. The failure stacks this time are:

[ 1490.841957] device fsid 4b7ec51b-9747-4244-a568-fbecdb157383 devid 1 transid 4 /dev/vdc
[ 1490.847408] btrfs: disk space caching is enabled
[ 3027.690722] [ cut here ]
[ 3027.692612] WARNING: at fs/btrfs/extent-tree.c:4771 __btrfs_free_extent+0x630/0x6d0()
[ 3027.695607] Hardware name: Bochs
[ 3027.696968] Modules linked in:
[ 3027.697894] Pid: 3460, comm: fs_mark Not tainted 3.2.0-rc7-dgc+ #167
[ 3027.699581] Call Trace:
[ 3027.700452]  [8108a69f] warn_slowpath_common+0x7f/0xc0
[ 3027.701973]  [8108a6fa] warn_slowpath_null+0x1a/0x20
[ 3027.703480]  [815b0680] __btrfs_free_extent+0x630/0x6d0
[ 3027.704981]  [815ac110] ? block_group_cache_tree_search+0x90/0xc0
[ 3027.706368]  [815b42f1] run_clustered_refs+0x381/0x800
[ 3027.707897]  [815b483a] btrfs_run_delayed_refs+0xca/0x220
[ 3027.709347]  [815b8f1c] ? btrfs_update_root+0x9c/0xe0
[ 3027.710909]  [815c3c33] commit_cowonly_roots+0x33/0x1e0
[ 3027.712370]  [81ab168e] ? _raw_spin_lock+0xe/0x20
[ 3027.713220]  [815c54cf] btrfs_commit_transaction+0x3bf/0x840
[ 3027.714412]  [810ac420] ? add_wait_queue+0x60/0x60
[ 3027.715460]  [815c5da4] ? start_transaction+0x94/0x2b0
[ 3027.716790]  [815ac80c] may_commit_transaction+0x6c/0x100
[ 3027.717843]  [815b2b47] reserve_metadata_bytes.isra.71+0x5a7/0x660
[ 3027.719223]  [81073c23] ? __wake_up+0x53/0x70
[ 3027.720328]  [815a43ba] ? btrfs_free_path+0x2a/0x40
[ 3027.721511]  [815b2f9e] btrfs_block_rsv_add+0x3e/0x70
[ 3027.722610]  [81666dfb] ? security_d_instantiate+0x1b/0x30
[ 3027.723765]  [815c5f65] start_transaction+0x255/0x2b0
[ 3027.725204]  [815c6283] btrfs_start_transaction+0x13/0x20
[ 3027.726273]  [815d2236] btrfs_create+0x46/0x220
[ 3027.727275]  [8116c204] vfs_create+0xb4/0xf0
[ 3027.728344]  [8116e1d7] do_last.isra.45+0x547/0x7c0
[ 3027.729400]  [8116f7ab] path_openat+0xcb/0x3d0
[ 3027.730363]  [81ab168e] ? _raw_spin_lock+0xe/0x20
[ 3027.731394]  [8117cc1e] ? vfsmount_lock_local_unlock+0x1e/0x30
[ 3027.733077]  [8116fbd2] do_filp_open+0x42/0xa0
[ 3027.733949]  [8117c487] ? alloc_fd+0xf7/0x150
[ 3027.734911]  [8115f8e7] do_sys_open+0xf7/0x1d0
[ 3027.735894]  [810b572a] ? do_gettimeofday+0x1a/0x50
[ 3027.737304]  [8115f9e0] sys_open+0x20/0x30
[ 3027.738099]  [81ab9502] system_call_fastpath+0x16/0x1b
[ 3027.739199] ---[ end trace df586861a93ef3bf ]---
[ 3027.740348] btrfs unable to find ref byte nr 19982405632 parent 0 root 2 owner 0 offset 0
[ 3027.742001] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 3027.743502] IP: [815e60f2] map_private_extent_buffer+0x12/0x150
[ 3027.744982] PGD 109d8e067 PUD 1050a9067 PMD 0
[ 3027.745968] Oops:  [#1] SMP
[ 3027.745968] CPU 7
[ 3027.745968] Modules linked in:
[ 3027.745968]
[ 3027.745968] Pid: 3460, comm: fs_mark Tainted: G W 3.2.0-rc7-dgc+ #167 Bochs Bochs
[ 3027.745968] RIP: 0010:[815e60f2]  [815e60f2] map_private_extent_buffer+0x12/0x150
[ 3027.745968] RSP: 0018:8800d2ac36d8  EFLAGS: 00010296
[ 3027.745968] RAX:  RBX: 0065 RCX: 8800d2ac3708
[ 3027.745968] RDX: 0004 RSI: 007a RDI: 
[ 3027.745968] RBP: 8800d2ac36f8 R08: 8800d2ac3710 R09: 8800d2ac3718
[ 3027.745968] R10:  R11: 0001 R12: 007a
[ 3027.745968] R13:  R14: ffe4 R15: 1000
[ 3027.745968] FS:  7f3bf8ab5700() GS:88011fdc() knlGS:
[ 3027.745968] CS:  0010 DS:  ES:  CR0: 8005003b
[ 3027.745968] CR2: 7fe424b0e000 CR3: 000106c33000 CR4: 06e0
[ 3027.745968] DR0: 

Re: [3.2-rc7] slowdown, warning + oops creating lots of files

2012-01-04 Thread Chris Samuel
On 05/01/12 09:11, Dave Chinner wrote:

 Looks to be reproducable.

Does this happen with rc6 ?

If not then it might be easy to track down as there are only
2 modifications between rc6 and rc7..

commit 08c422c27f855d27b0b3d9fa30ebd938d4ae6f1f
Author: Al Viro v...@zeniv.linux.org.uk
Date:   Fri Dec 23 07:58:13 2011 -0500

Btrfs: call d_instantiate after all ops are setup

This closes races where btrfs is calling d_instantiate too soon during
inode creation.  All of the callers of btrfs_add_nondir are updated to
instantiate after the inode is fully setup in memory.

Signed-off-by: Al Viro v...@zeniv.linux.org.uk
Signed-off-by: Chris Mason chris.ma...@oracle.com

commit 8d532b2afb2eacc84588db709ec280a3d1219be3
Author: Chris Mason chris.ma...@oracle.com
Date:   Fri Dec 23 07:53:00 2011 -0500

Btrfs: fix worker lock misuse in find_worker

Dan Carpenter noticed that we were doing a double unlock on the worker
lock, and sometimes picking a worker thread without the lock held.

This fixes both errors.

Signed-off-by: Chris Mason chris.ma...@oracle.com
Reported-by: Dan Carpenter dan.carpen...@oracle.com


-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [3.2-rc7] slowdown, warning + oops creating lots of files

2012-01-04 Thread Dave Chinner
On Thu, Jan 05, 2012 at 09:23:52AM +1100, Chris Samuel wrote:
 On 05/01/12 09:11, Dave Chinner wrote:
 
  Looks to be reproducable.
 
 Does this happen with rc6 ?

I haven't tried. All I'm doing is running some benchmarks to get
numbers for a talk I'm giving about improvements in XFS metadata
scalability, so I wanted to update my last set of numbers from
2.6.39.

As it was, these benchmarks also failed on btrfs with oopsen and
corruptions back in 2.6.39 time frame.  e.g. same VM, same
test, different crashes, similar slowdowns as reported here:
http://comments.gmane.org/gmane.comp.file-systems.btrfs/11062

Given that there is now a history of this simple test uncovering
problems, perhaps this is a test that should be run more regularly
by btrfs developers?

 If not then it might be easy to track down as there are only
 2 modifications between rc6 and rc7..

They don't look like they'd be responsible for fixing an extent tree
corruption, and I don't really have the time to do an open-ended
bisect to find where the problem fix arose.

As it is, 3rd attempt failed at 22m inodes, without the warning this
time:

[   59.433452] device fsid 4d27dc14-562d-4722-9591-723bd2bbe94c devid 1 transid 4 /dev/vdc
[   59.437050] btrfs: disk space caching is enabled
[  753.258465] [ cut here ]
[  753.259806] kernel BUG at fs/btrfs/extent-tree.c:5797!
[  753.260014] invalid opcode:  [#1] SMP
[  753.260014] CPU 7
[  753.260014] Modules linked in:
[  753.260014]
[  753.260014] Pid: 2874, comm: fs_mark Not tainted 3.2.0-rc7-dgc+ #167 Bochs Bochs
[  753.260014] RIP: 0010:[815b475b]  [815b475b] run_clustered_refs+0x7eb/0x800
[  753.260014] RSP: 0018:8800430258a8  EFLAGS: 00010286
[  753.260014] RAX: ffe4 RBX: 88009c8ab1c0 RCX: 
[  753.260014] RDX: 0008 RSI: 0282 RDI: 
[  753.260014] RBP: 880043025988 R08:  R09: 0002
[  753.260014] R10: 8801188f6000 R11: 880101b50d20 R12: 88008fc1ad40
[  753.260014] R13: 88003940a6c0 R14: 880118a49000 R15: 88010fc77e80
[  753.260014] FS:  7f416ce90700() GS:88011fdc() knlGS:
[  753.260014] CS:  0010 DS:  ES:  CR0: 8005003b
[  753.260014] CR2: 7f416c2f6000 CR3: 3aaea000 CR4: 06e0
[  753.260014] DR0:  DR1:  DR2: 
[  753.260014] DR3:  DR6: 0ff0 DR7: 0400
[  753.260014] Process fs_mark (pid: 2874, threadinfo 880043024000, task 8800090e6180)
[  753.260014] Stack:
[  753.260014]    8801 

[  753.260014]  88010fc77f38 0e92  
0002
[  753.260014]  0e03 0e68  
8800430259d8
[  753.260014] Call Trace:
[  753.260014]  [815b483a] btrfs_run_delayed_refs+0xca/0x220
[  753.260014]  [815c5469] btrfs_commit_transaction+0x359/0x840
[  753.260014]  [810ac420] ? add_wait_queue+0x60/0x60
[  753.260014]  [815c5da4] ? start_transaction+0x94/0x2b0
[  753.260014]  [815ac80c] may_commit_transaction+0x6c/0x100
[  753.260014]  [815b2b47] reserve_metadata_bytes.isra.71+0x5a7/0x660
[  753.260014]  [81073c23] ? __wake_up+0x53/0x70
[  753.260014]  [815a43ba] ? btrfs_free_path+0x2a/0x40
[  753.260014]  [815b2f9e] btrfs_block_rsv_add+0x3e/0x70
[  753.260014]  [81666dfb] ? security_d_instantiate+0x1b/0x30
[  753.260014]  [815c5f65] start_transaction+0x255/0x2b0
[  753.260014]  [815c6283] btrfs_start_transaction+0x13/0x20
[  753.260014]  [815d2236] btrfs_create+0x46/0x220
[  753.260014]  [8116c204] vfs_create+0xb4/0xf0
[  753.260014]  [8116e1d7] do_last.isra.45+0x547/0x7c0
[  753.260014]  [8116f7ab] path_openat+0xcb/0x3d0
[  753.260014]  [81ab168e] ? _raw_spin_lock+0xe/0x20
[  753.260014]  [8117cc1e] ? vfsmount_lock_local_unlock+0x1e/0x30
[  753.260014]  [8116fbd2] do_filp_open+0x42/0xa0
[  753.260014]  [8117c487] ? alloc_fd+0xf7/0x150
[  753.260014]  [8115f8e7] do_sys_open+0xf7/0x1d0
[  753.260014]  [810b572a] ? do_gettimeofday+0x1a/0x50
[  753.260014]  [8115f9e0] sys_open+0x20/0x30
[  753.260014]  [81ab9502] system_call_fastpath+0x16/0x1b
[  753.260014] Code: ff e9 37 f9 ff ff be 95 00 00 00 48 c7 c7 43 6f df 81 e8 
99 5f ad ff e9 36 f9 ff ff 80 fa b2 0f 84 d0 f9 ff ff 0f 0b 0f 0b 0f 0b 0f 0b 
0f 0b 0f
[  753.260014] RIP  [815b475b] run_clustered_refs+0x7eb/0x800
[  753.260014]  RSP 8800430258a8
[  753.330089] ---[ end trace f3d0e286a928c349 ]---

It's hard to tell exactly what path gets to that BUG_ON(), so much
code is inlined by the compiler into run_clustered_refs() that I
can't tell exactly how it got to the BUG_ON() triggered in alloc_reserved_tree_block().

Re: [3.2-rc7] slowdown, warning + oops creating lots of files

2012-01-04 Thread Liu Bo
On 01/04/2012 06:01 PM, Dave Chinner wrote:
 On Thu, Jan 05, 2012 at 09:23:52AM +1100, Chris Samuel wrote:
 On 05/01/12 09:11, Dave Chinner wrote:

 Looks to be reproducable.
 Does this happen with rc6 ?
 
 I haven't tried. All I'm doing is running some benchmarks to get
 numbers for a talk I'm giving about improvements in XFS metadata
 scalability, so I wanted to update my last set of numbers from
 2.6.39.
 
 As it was, these benchmarks also failed on btrfs with oopsen and
 corruptions back in 2.6.39 time frame.  e.g. same VM, same
 test, different crashes, similar slowdowns as reported here:
 http://comments.gmane.org/gmane.comp.file-systems.btrfs/11062
 
 Given that there is now a history of this simple test uncovering
 problems, perhaps this is a test that should be run more regularly
 by btrfs developers?
 
 If not then it might be easy to track down as there are only
 2 modifications between rc6 and rc7..
 
 They don't look like they'd be responsible for fixing an extent tree
 corruption, and I don't really have the time to do an open-ended
 bisect to find where the problem fix arose.
 
 As it is, 3rd attempt failed at 22m inodes, without the warning this
 time:
 
 [   59.433452] device fsid 4d27dc14-562d-4722-9591-723bd2bbe94c devid 1 
 transid 4 /dev/vdc
 [   59.437050] btrfs: disk space caching is enabled
 [  753.258465] [ cut here ]
 [  753.259806] kernel BUG at fs/btrfs/extent-tree.c:5797!
 [  753.260014] invalid opcode:  [#1] SMP
 [  753.260014] CPU 7
 [  753.260014] Modules linked in:
 [  753.260014]
 [  753.260014] Pid: 2874, comm: fs_mark Not tainted 3.2.0-rc7-dgc+ #167 Bochs 
 Bochs
 [  753.260014] RIP: 0010:[815b475b]  [815b475b] 
 run_clustered_refs+0x7eb/0x800
 [  753.260014] RSP: 0018:8800430258a8  EFLAGS: 00010286
 [  753.260014] RAX: ffe4 RBX: 88009c8ab1c0 RCX: 
 
 [  753.260014] RDX: 0008 RSI: 0282 RDI: 
 
 [  753.260014] RBP: 880043025988 R08:  R09: 
 0002
 [  753.260014] R10: 8801188f6000 R11: 880101b50d20 R12: 
 88008fc1ad40
 [  753.260014] R13: 88003940a6c0 R14: 880118a49000 R15: 
 88010fc77e80
 [  753.260014] FS:  7f416ce90700() GS:88011fdc() 
 knlGS:
 [  753.260014] CS:  0010 DS:  ES:  CR0: 8005003b
 [  753.260014] CR2: 7f416c2f6000 CR3: 3aaea000 CR4: 
 06e0
 [  753.260014] DR0:  DR1:  DR2: 
 
 [  753.260014] DR3:  DR6: 0ff0 DR7: 
 0400
 [  753.260014] Process fs_mark (pid: 2874, threadinfo 880043024000, task 
 8800090e6180)
 [  753.260014] Stack:
 [  753.260014]    8801 
 
 [  753.260014]  88010fc77f38 0e92  
 0002
 [  753.260014]  0e03 0e68  
 8800430259d8
 [  753.260014] Call Trace:
 [  753.260014]  [815b483a] btrfs_run_delayed_refs+0xca/0x220
 [  753.260014]  [815c5469] btrfs_commit_transaction+0x359/0x840
 [  753.260014]  [810ac420] ? add_wait_queue+0x60/0x60
 [  753.260014]  [815c5da4] ? start_transaction+0x94/0x2b0
 [  753.260014]  [815ac80c] may_commit_transaction+0x6c/0x100
 [  753.260014]  [815b2b47] 
 reserve_metadata_bytes.isra.71+0x5a7/0x660
 [  753.260014]  [81073c23] ? __wake_up+0x53/0x70
 [  753.260014]  [815a43ba] ? btrfs_free_path+0x2a/0x40
 [  753.260014]  [815b2f9e] btrfs_block_rsv_add+0x3e/0x70
 [  753.260014]  [81666dfb] ? security_d_instantiate+0x1b/0x30
 [  753.260014]  [815c5f65] start_transaction+0x255/0x2b0
 [  753.260014]  [815c6283] btrfs_start_transaction+0x13/0x20
 [  753.260014]  [815d2236] btrfs_create+0x46/0x220
 [  753.260014]  [8116c204] vfs_create+0xb4/0xf0
 [  753.260014]  [8116e1d7] do_last.isra.45+0x547/0x7c0
 [  753.260014]  [8116f7ab] path_openat+0xcb/0x3d0
 [  753.260014]  [81ab168e] ? _raw_spin_lock+0xe/0x20
 [  753.260014]  [8117cc1e] ? vfsmount_lock_local_unlock+0x1e/0x30
 [  753.260014]  [8116fbd2] do_filp_open+0x42/0xa0
 [  753.260014]  [8117c487] ? alloc_fd+0xf7/0x150
 [  753.260014]  [8115f8e7] do_sys_open+0xf7/0x1d0
 [  753.260014]  [810b572a] ? do_gettimeofday+0x1a/0x50
 [  753.260014]  [8115f9e0] sys_open+0x20/0x30
 [  753.260014]  [81ab9502] system_call_fastpath+0x16/0x1b
 [  753.260014] Code: ff e9 37 f9 ff ff be 95 00 00 00 48 c7 c7 43 6f df 81 e8 
 99 5f ad ff e9 36 f9 ff ff 80 fa b2 0f 84 d0 f9 ff ff 0f 0b 0f 0b 0f 0b 0f 
 0b 0f 0b 0f
 [  753.260014] RIP  [815b475b] run_clustered_refs+0x7eb/0x800
 [  753.260014]  RSP 8800430258a8
 [  753.330089] ---[ end trace f3d0e286a928c349 ]---
 
 It's hard to tell exactly what path gets to that 

Re: [3.2-rc7] slowdown, warning + oops creating lots of files

2012-01-04 Thread Dave Chinner
On Wed, Jan 04, 2012 at 09:23:18PM -0500, Liu Bo wrote:
 On 01/04/2012 06:01 PM, Dave Chinner wrote:
  On Thu, Jan 05, 2012 at 09:23:52AM +1100, Chris Samuel wrote:
  On 05/01/12 09:11, Dave Chinner wrote:
 
  Looks to be reproducable.
  Does this happen with rc6 ?
  
  I haven't tried. All I'm doing is running some benchmarks to get
  numbers for a talk I'm giving about improvements in XFS metadata
  scalability, so I wanted to update my last set of numbers from
  2.6.39.
  
  As it was, these benchmarks also failed on btrfs with oopsen and
  corruptions back in 2.6.39 time frame.  e.g. same VM, same
  test, different crashes, similar slowdowns as reported here:
  http://comments.gmane.org/gmane.comp.file-systems.btrfs/11062
  
  Given that there is now a history of this simple test uncovering
  problems, perhaps this is a test that should be run more regularly
  by btrfs developers?
  
  If not then it might be easy to track down as there are only
  2 modifications between rc6 and rc7..
  
  They don't look like they'd be responsible for fixing an extent tree
  corruption, and I don't really have the time to do an open-ended
  bisect to find where the problem fix arose.
  
  As it is, 3rd attempt failed at 22m inodes, without the warning this
  time:

.

  It's hard to tell exactly what path gets to that BUG_ON(), so much
  code is inlined by the compiler into run_clustered_refs() that I
  can't tell exactly how it got to the BUG_ON() triggered in
  alloc_reserved_tree_block().
  
 
 This seems to be an oops led by ENOSPC.

At the time of the oops, this is the space used on the filesystem:

$ df -h /mnt/scratch
Filesystem  Size  Used Avail Use% Mounted on
/dev/vdc 17T   31G   17T   1% /mnt/scratch

It's less than 0.2% full, so I think ENOSPC can be ruled out here.

I have noticed one thing, however: there are significant
numbers of reads coming from disk when the slowdowns and oops occur.
When everything runs fast, there are virtually no reads occurring at
all.  It looks to me like the working set of metadata is being
kicked out of memory, only to be read back in again a short while
later. Maybe that is a contributing factor.

BTW, there is a lot of CPU time being spent on the tree locks. perf
shows this as the top 2 CPU consumers:

-   9.49%  [kernel]  [k] __write_lock_failed
   - __write_lock_failed
  - 99.80% _raw_write_lock
 - 79.35% btrfs_try_tree_write_lock
  99.99% btrfs_search_slot
 - 20.63% btrfs_tree_lock
  89.19% btrfs_search_slot
  10.54% btrfs_lock_root_node
 btrfs_search_slot
-   9.25%  [kernel]  [k] _raw_spin_unlock_irqrestore
   - _raw_spin_unlock_irqrestore
  - 55.87% __wake_up
 + 93.89% btrfs_clear_lock_blocking_rw
 + 3.46% btrfs_tree_read_unlock_blocking
 + 2.35% btrfs_tree_unlock

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2 00/10] Btrfs: backref walking rewrite

2012-01-04 Thread Li Zefan
Jan Schmidt wrote:
 This patch series is a major rewrite of the backref walking code. The patch
 series Arne sent some weeks ago for quota groups had a very interesting
 function, find_all_roots. I took this from him together with the bits needed
 for find_all_roots to work and replaced a major part of the code in backref.c
 with it.
 
 It can be pulled from
   git://git.jan-o-sch.net/btrfs-unstable for-chris
 There's also a gitweb for that repo on
   http://git.jan-o-sch.net/?p=btrfs-unstable
 

Thanks for the work!

I got a compile warning:

  CC [M]  fs/btrfs/backref.o
fs/btrfs/backref.c: In function 'inode_to_path':
fs/btrfs/backref.c:1312:3: warning: format '%ld' expects type 'long int', but
argument 3 has type 'int'


Re: [3.2-rc7] slowdown, warning + oops creating lots of files

2012-01-04 Thread Liu Bo
On 01/04/2012 09:26 PM, Dave Chinner wrote:
 On Wed, Jan 04, 2012 at 09:23:18PM -0500, Liu Bo wrote:
 On 01/04/2012 06:01 PM, Dave Chinner wrote:
 On Thu, Jan 05, 2012 at 09:23:52AM +1100, Chris Samuel wrote:
 On 05/01/12 09:11, Dave Chinner wrote:

 Looks to be reproducable.
 Does this happen with rc6 ?
 I haven't tried. All I'm doing is running some benchmarks to get
 numbers for a talk I'm giving about improvements in XFS metadata
 scalability, so I wanted to update my last set of numbers from
 2.6.39.

 As it was, these benchmarks also failed on btrfs with oopsen and
 corruptions back in 2.6.39 time frame.  e.g. same VM, same
 test, different crashes, similar slowdowns as reported here:
 http://comments.gmane.org/gmane.comp.file-systems.btrfs/11062

 Given that there is now a history of this simple test uncovering
 problems, perhaps this is a test that should be run more regularly
 by btrfs developers?

 If not then it might be easy to track down as there are only
 2 modifications between rc6 and rc7..
 They don't look like they'd be responsible for fixing an extent tree
 corruption, and I don't really have the time to do an open-ended
 bisect to find where the problem fix arose.

 As it is, 3rd attempt failed at 22m inodes, without the warning this
 time:
 
 .
 
 It's hard to tell exactly what path gets to that BUG_ON(), so much
 code is inlined by the compiler into run_clustered_refs() that I
 can't tell exactly how it got to the BUG_ON() triggered in
 alloc_reserved_tree_block().

 This seems to be an oops led by ENOSPC.
 
 At the time of the oops, this is the space used on the filesystem:
 
 $ df -h /mnt/scratch
 Filesystem  Size  Used Avail Use% Mounted on
 /dev/vdc 17T   31G   17T   1% /mnt/scratch
 
 It's less than 0.2% full, so I think ENOSPC can be ruled out here.
 

This bug is about our block reservation accounting, not real
disk space.

Can you try the patch below and see what happens?

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index b1c8732..5a7f918 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3978,8 +3978,8 @@ static u64 calc_global_metadata_size(struct btrfs_fs_info *fs_info)
 		csum_size * 2;
 	num_bytes += div64_u64(data_used + meta_used, 50);
 
-	if (num_bytes * 3 > meta_used)
-		num_bytes = div64_u64(meta_used, 3);
+	if (num_bytes * 2 > meta_used)
+		num_bytes = div64_u64(meta_used, 2);
 
 	return ALIGN(num_bytes, fs_info->extent_root->leafsize << 10);
 }

 I have noticed one thing, however, in that the there are significant
 numbers of reads coming from disk when the slowdowns and oops occur.
 When everything runs fast, there are virtually no reads occurring at
 all.  It looks to me that maybe the working set of metadata is being
 kicked out of memory, only to be read back in again short while
 later. Maybe that is a contributing factor.
 
 BTW, there is a lot of CPU time being spent on the tree locks. perf
 shows this as the top 2 CPU consumers:
 
 -   9.49%  [kernel]  [k] __write_lock_failed
- __write_lock_failed
   - 99.80% _raw_write_lock
  - 79.35% btrfs_try_tree_write_lock
   99.99% btrfs_search_slot
  - 20.63% btrfs_tree_lock
   89.19% btrfs_search_slot
   10.54% btrfs_lock_root_node
  btrfs_search_slot
 -   9.25%  [kernel]  [k] _raw_spin_unlock_irqrestore
- _raw_spin_unlock_irqrestore
   - 55.87% __wake_up
  + 93.89% btrfs_clear_lock_blocking_rw
  + 3.46% btrfs_tree_read_unlock_blocking
  + 2.35% btrfs_tree_unlock
 

hmm, the new extent_buffer lock scheme written by Chris is aimed at avoiding
such cases; maybe he can provide some advice.

thanks,
liubo

 Cheers,
 
 Dave.
