Re: btrfs: open_ctree failed error

2011-12-22 Thread Malwina Bartoszynska

On 2011-12-21 20:06, Chris Mason wrote:

On Wed, Dec 21, 2011 at 01:54:06PM +, Malwina Bartoszynska wrote:

Hello,
after unmounting btrfs partition, I can't mount it again.

root@xxx:~# btrfs device scan
Scanning for Btrfs filesystems
root@xxx:~# mount /dev/sdb /data/osd.0/
mount: wrong fs type, bad option, bad superblock on /dev/sdb,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail  or so

root@:~# dmesg|tail
[57192.607912] device fsid ed25c604-3e11-4459-85b5-e4090c4d22d0 devid 2 transid 14429 /dev/sda
[57204.796573] end_request: I/O error, dev fd0, sector 0
[57231.660913] device fsid ed25c604-3e11-4459-85b5-e4090c4d22d0 devid 1 transid 14429 /dev/sdb
[57231.680387] parent transid verify failed on 424308420608 wanted 6970 found 8959
[57231.680546] parent transid verify failed on 424308420608 wanted 6970 found 8959
[57231.680705] parent transid verify failed on 424308420608 wanted 6970 found 8959
[57231.680861] parent transid verify failed on 424308420608 wanted 6970 found 8959
[57231.680869] parent transid verify failed on 424308420608 wanted 6970 found 8959
[57231.680875] Failed to read block groups: -5
[57231.704165] btrfs: open_ctree failed

Can you tell us more about this filesystem?  Was there an unclean
shutdown or did you just unmount, mount again?

The confusing thing is that all of your disks seem to have the same copy
of the block, so it looks like things were written properly.

-chris

There was no shutdown before this; the filesystem was just unmounted (which
appeared to complete properly, with no errors). I then tried to mount it again.

Is there a way of fixing it?
--
Malwina Bartoszynska
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] xfstests: new check 276 to ensure btrfs backref integrity

2011-12-22 Thread Jan Schmidt
Thanks for the feedback. I've now removed the $fresh code completely as
it's not meant to be used by anyone but me :-)

_require_btrfs will become a new helper in common.rc. Will resend soon
as 278 (hoping that number holds).

-Jan


[PATCH v2] xfstests: new check 278 to ensure btrfs backref integrity

2011-12-22 Thread Jan Schmidt
This is a btrfs specific scratch test checking the backref walker. It
creates a file system with compressed and uncompressed data extents, picks
files randomly and uses filefrag to get their extents. It then asks the
btrfs utility (inspect-internal) to do the backref resolving from fs-logical
address (the one filefrag calls physical) back to the inode number and
file-logical offset, verifying the result.
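As an illustration of the unit conversion the test relies on, here is a minimal C sketch (not part of the patch). filefrag -v reports extents in units of blocks, while the test works with byte values, so each column is multiplied by the block size before the three values are joined with '#'. The function name toy_format_extent and its parameter order are hypothetical, chosen only for this example.

```c
#include <stdio.h>

/*
 * Format one extent as "physical#length#logical", all in bytes.
 * Inputs are in filesystem blocks, as printed by filefrag -v.
 * Returns the number of characters written (as snprintf does).
 */
static int toy_format_extent(char *buf, size_t len,
			     unsigned long long logical_blk,
			     unsigned long long physical_blk,
			     unsigned long long length_blk,
			     unsigned long long blocksize)
{
	return snprintf(buf, len, "%llu#%llu#%llu",
			physical_blk * blocksize,
			length_blk * blocksize,
			logical_blk * blocksize);
}
```

With a 4096-byte block size, an extent at logical block 2, physical block 100 and length 3 would be formatted as 409600#12288#8192.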

Signed-off-by: Jan Schmidt list.bt...@jan-o-sch.net
---
change log -v2:
- renamed 276->278
- added _require_btrfs helper
- check for filefrag with _require_command
- added some comments
- removed $fresh code
- don't set FSTYP
---
 278   |  255 +
 278.out   |4 +
 common.config |1 +
 common.rc |   12 +++
 group |1 +
 5 files changed, 273 insertions(+), 0 deletions(-)
 create mode 100755 278
 create mode 100644 278.out

diff --git a/278 b/278
new file mode 100755
index 0000000..f831a0e
--- /dev/null
+++ b/278
@@ -0,0 +1,255 @@
+#! /bin/bash
+# FSQA Test No. 278
+#
+# Run fsstress to create a reasonably strange file system, make a
+# snapshot and run more fsstress. Then select some files from that fs,
+# run filefrag to get the extent mapping and follow the backrefs.
+# We check to end up back at the original file with the correct offset.
+#
+#---
+# Copyright (C) 2011 STRATO.  All rights reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#
+#---
+#
+# creator
+owner=list.bt...@jan-o-sch.net
+
+seq=`basename $0`
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1
+
+_cleanup()
+{
+	echo "*** unmount"
+	umount $SCRATCH_MNT 2>/dev/null
+	rm -f $tmp.*
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common.rc
+. ./common.filter
+
+# real QA test starts here
+_need_to_be_root
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+
+_require_nobigloopfs
+_require_btrfs inspect-internal
+_require_command /usr/sbin/filefrag
+
+rm -f $seq.full
+
+FILEFRAG_FILTER='if (/, blocksize (\d+)/) {$blocksize = $1; next} ($ext, '\
+'$logical, $physical, $expected, $length, $flags) = (/^\s*(\d+)\s+(\d+)'\
+'\s+(\d+)\s+(?:(\d+)\s+)?(\d+)\s+(.*)/) or next; $flags =~ '\
+'/(?:^|,)inline(?:,|$)/ and next; print $physical * $blocksize, "#", '\
+'$length * $blocksize, "#", $logical * $blocksize, " "'
+
+# this makes filefrag output script readable by using a perl helper.
+# output is one extent per line, with three numbers separated by '#'
+# the numbers are: physical, length, logical (all in bytes)
+# sample output: 1234#10#5678 - physical 1234, length 10, logical 5678
+_filter_extents()
+{
+	tee -a $seq.full | $PERL_PROG -ne "$FILEFRAG_FILTER"
+}
+
+_check_file_extents()
+{
+	cmd="filefrag -vx $1"
+	echo "# $cmd" >> $seq.full
+	out=`$cmd | _filter_extents`
+	if [ -z "$out" ]; then
+		return 1
+	fi
+	echo "after filter: $out" >> $seq.full
+	echo $out
+	return 0
+}
+
+# use a logical address and walk the backrefs back to the inode.
+# compare to the expected result.
+# returns 0 on success, 1 on error (with output made)
+_btrfs_inspect_addr()
+{
+	mp=$1
+	addr=$2
+	expect_addr=$3
+	expect_inum=$4
+	file=$5
+	cmd="$BTRFS_UTIL_PROG inspect-internal logical-resolve -P $addr $mp"
+	echo "# $cmd" >> $seq.full
+	out=`$cmd`
+	echo "$out" >> $seq.full
+	grep_expr="inode $expect_inum offset $expect_addr root"
+	echo "$out" | grep "^$grep_expr 5$" >/dev/null
+	ret=$?
+	if [ $ret -eq 0 ]; then
+		# look for a root number that is not 5
+		echo "$out" | grep "^$grep_expr \([0-46-9][0-9]*\|5[0-9]\+\)$" \
+			>/dev/null
+		ret=$?
+	fi
+	if [ $ret -eq 0 ]; then
+		return 0
+	fi
+	echo "unexpected output from"
+	echo "	$cmd"
+	echo "expected inum: $expect_inum, expected address: $expect_addr,"\
+		"file: $file, got:"
+	echo "$out"
+	return 1
+}
+
+# use an inode number and walk the backrefs back to the file name.
+# compare to the expected result.
+# returns 0 on success, 1 on error (with output made)
+_btrfs_inspect_inum()
+{
+ 

Re: COW a file from snapshot

2011-12-22 Thread Chris Samuel
On Thu, 22 Dec 2011 07:12:13 PM Roman Kapusta wrote:

> I'm using btrfs for about two years and this is the key feature I'm
> missing all the time. Why is it not part of mainline btrfs already?

Because nobody has written the code to do it yet?

I'm sure the developers would welcome patches for this with open arms!

cheers,
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic.
For more info see: http://en.wikipedia.org/wiki/OpenPGP


signature.asc
Description: This is a digitally signed message part.


Re: COW a file from snapshot

2011-12-22 Thread Gareth Pye
Chris, I recommend reading the previously linked thread. The supplied
(and reportedly working) patch was nacked because it violates some
principle or other of file systems (although, from my limited
understanding, it only does so in the same way that btrfs snapshots do
in the first place).

On Thu, Dec 22, 2011 at 10:35 PM, Chris Samuel ch...@csamuel.org wrote:

> On Thu, 22 Dec 2011 07:12:13 PM Roman Kapusta wrote:
>
> > I'm using btrfs for about two years and this is the key feature I'm
> > missing all the time. Why is it not part of mainline btrfs already?
>
> Because nobody has written the code to do it yet?
>
> I'm sure the developers would welcome patches for this with open arms!
>
> cheers,
> Chris
> --
>  Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC
>
> This email may come with a PGP signature as a file. Do not panic.
> For more info see: http://en.wikipedia.org/wiki/OpenPGP




--
Gareth Pye
Level 2 Judge, Melbourne, Australia
Australian MTG Forum: mtgau.com
gar...@cerberos.id.au - www.rockpaperdynamite.wordpress.com
Dear God, I would like to file a bug report


Re: COW a file from snapshot

2011-12-22 Thread Sander
Chris Samuel wrote (ao):
> On Thu, 22 Dec 2011 07:12:13 PM Roman Kapusta wrote:
> > I'm using btrfs for about two years and this is the key feature I'm
> > missing all the time. Why is it not part of mainline btrfs already?
>
> Because nobody has written the code to do it yet?
>
> I'm sure the developers would welcome patches for this with open arms!

As posted in this thread by Jerome two days ago:

You would need to apply this patch to your kernel:
http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg09096.html

Is there any chance this patch gets in linux-next ?
I use this feature all the time and it never broke on me.


Sander

-- 
Humilis IT Services and Solutions
http://www.humilis.net


Re: COW a file from snapshot

2011-12-22 Thread Chris Samuel
On Thu, 22 Dec 2011 10:57:10 PM Gareth Pye wrote:

> Chris, I recommend reading the previously linked thread.

Mea culpa, I blame reading email out of order after getting out of 
hospital yesterday. :-(

We now return you to your regularly scheduled mailing list..

-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic.
For more info see: http://en.wikipedia.org/wiki/OpenPGP




Re: [PATCH 0/2] btrfs: allow cross-subvolume BTRFS_IOC_CLONE

2011-12-22 Thread Chris Samuel
Christoph,

On Sat, 2 Apr 2011 12:40:11 AM Chris Mason wrote:

> Excerpts from Christoph Hellwig's message of 2011-04-01 09:34:05 -0400:
>
> > I don't think it's a good idea to introduce any user visible
> > operations over subvolume boundaries.  Currently we don't have
> > any operations over mount boundaries, which is pretty
> > fundamental to the unix filesystem semantics.  If you want to
> > change this please come up with a clear description of the
> > semantics and post it to linux-fsdevel for discussion.  That of
> > course requires a clear description of the btrfs subvolumes,
> > which is still completely missing.
>
> The subvolume is just a directory tree that can be snapshotted, and
> has its own private inode number space.
>
> reflink across subvolumes is no different from copying a file from
> one subvolume to another at the VFS level.  The src and
> destination are different files and different inodes, they just
> happen to share data extents.

Were Chris Mason's points above enough to sway your opposition to this 
functionality/patch?

There is demand for the ability to move data between subvolumes
without needing to copy the extents themselves; it has cropped up again
on the list in recent days.

It seems a little hard (and counterintuitive) to enforce a wasteful
use of resources to copy data between different parts of the same
filesystem which happen to be on a different subvolume when it's
permitted & functional within the same subvolume of the same filesystem.

I don't dispute the comment about documentation on subvolumes, though;
there is a short discussion of them on the btrfs wiki in the sysadmins
guide, but not really a lot of detail. :-)

All the best,
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic.
For more info see: http://en.wikipedia.org/wiki/OpenPGP




[PATCH v1 02/10] Btrfs: added helper btrfs_next_item()

2011-12-22 Thread Jan Schmidt
btrfs_next_item() makes the btrfs path point to the next item, crossing leaf
boundaries if needed.

Signed-off-by: Arne Jansen sensi...@gmx.net
Signed-off-by: Jan Schmidt list.bt...@jan-o-sch.net
---
 fs/btrfs/ctree.h |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 50634abe..3e4a07b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2482,6 +2482,13 @@ static inline int btrfs_insert_empty_item(struct btrfs_trans_handle *trans,
 }
 
 int btrfs_next_leaf(struct btrfs_root *root, struct btrfs_path *path);
+static inline int btrfs_next_item(struct btrfs_root *root, struct btrfs_path *p)
+{
+	++p->slots[0];
+	if (p->slots[0] >= btrfs_header_nritems(p->nodes[0]))
+		return btrfs_next_leaf(root, p);
+	return 0;
+}
 int btrfs_prev_leaf(struct btrfs_root *root, struct btrfs_path *path);
 int btrfs_leaf_free_space(struct btrfs_root *root, struct extent_buffer *leaf);
 void btrfs_drop_snapshot(struct btrfs_root *root,
-- 
1.7.3.4
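
For readers unfamiliar with btrfs paths, the behaviour of the helper can be modelled in user space. This is only a sketch under simplifying assumptions: the toy_leaf and toy_path types and all toy_* functions are hypothetical stand-ins for the extent buffer, struct btrfs_path and btrfs_next_leaf(), and every leaf is assumed to hold at least one item.

```c
#include <stddef.h>

/* Hypothetical stand-in for a leaf: an item count and a right neighbour. */
struct toy_leaf {
	int nritems;
	struct toy_leaf *next;	/* NULL for the last leaf */
};

/* Hypothetical stand-in for a path: p->nodes[0] and p->slots[0]. */
struct toy_path {
	struct toy_leaf *leaf;
	int slot;
};

/* Mimics btrfs_next_leaf(): 0 on success, 1 if there is no further leaf. */
static int toy_next_leaf(struct toy_path *p)
{
	if (!p->leaf->next)
		return 1;
	p->leaf = p->leaf->next;
	p->slot = 0;
	return 0;
}

/* Same shape as the patch: bump the slot, spill over into the next leaf. */
static int toy_next_item(struct toy_path *p)
{
	++p->slot;
	if (p->slot >= p->leaf->nritems)
		return toy_next_leaf(p);
	return 0;
}

/* Count every item reachable from slot 0 of the given leaf. */
static int toy_count_items(struct toy_leaf *first)
{
	struct toy_path p = { first, 0 };
	int n = 0;

	if (!first || first->nritems == 0)
		return 0;
	do {
		n++;
	} while (toy_next_item(&p) == 0);
	return n;
}
```

The caller never has to check for the end of a leaf itself; a non-zero return simply means there is nothing more to iterate over (or, in the kernel, that an error occurred).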



[PATCH v1 04/10] Btrfs: always save ref_root in delayed refs

2011-12-22 Thread Jan Schmidt
From: Arne Jansen sensi...@gmx.net

For consistent backref walking and (later) qgroup calculation the
information to which root a delayed ref belongs is useful even for shared
refs.

Signed-off-by: Arne Jansen sensi...@gmx.net
Signed-off-by: Jan Schmidt list.bt...@jan-o-sch.net
---
 fs/btrfs/delayed-ref.c |   18 ++++++++----------
 fs/btrfs/delayed-ref.h |   12 ++++--------
 2 files changed, 12 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 3a0f0ab..babd37b 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -495,13 +495,12 @@ static noinline int add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
 	ref->in_tree = 1;
 
 	full_ref = btrfs_delayed_node_to_tree_ref(ref);
-	if (parent) {
-		full_ref->parent = parent;
+	full_ref->parent = parent;
+	full_ref->root = ref_root;
+	if (parent)
 		ref->type = BTRFS_SHARED_BLOCK_REF_KEY;
-	} else {
-		full_ref->root = ref_root;
+	else
 		ref->type = BTRFS_TREE_BLOCK_REF_KEY;
-	}
 	full_ref->level = level;
 
 	trace_btrfs_delayed_tree_ref(ref, full_ref, action);
@@ -551,13 +550,12 @@ static noinline int add_delayed_data_ref(struct btrfs_fs_info *fs_info,
 	ref->in_tree = 1;
 
 	full_ref = btrfs_delayed_node_to_data_ref(ref);
-	if (parent) {
-		full_ref->parent = parent;
+	full_ref->parent = parent;
+	full_ref->root = ref_root;
+	if (parent)
 		ref->type = BTRFS_SHARED_DATA_REF_KEY;
-	} else {
-		full_ref->root = ref_root;
+	else
 		ref->type = BTRFS_EXTENT_DATA_REF_KEY;
-	}
 
 	full_ref->objectid = owner;
 	full_ref->offset = offset;
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 8316bff..a5fb2bc 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -98,19 +98,15 @@ struct btrfs_delayed_ref_head {
 
 struct btrfs_delayed_tree_ref {
 	struct btrfs_delayed_ref_node node;
-	union {
-		u64 root;
-		u64 parent;
-	};
+	u64 root;
+	u64 parent;
 	int level;
 };
 
 struct btrfs_delayed_data_ref {
 	struct btrfs_delayed_ref_node node;
-	union {
-		u64 root;
-		u64 parent;
-	};
+	u64 root;
+	u64 parent;
 	u64 objectid;
 	u64 offset;
 };
-- 
1.7.3.4
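
The effect of dropping the union can be seen in a small user-space sketch. The toy_* types and functions below are hypothetical and only mirror the before/after layout of the two fields; they are not the kernel structures.

```c
#include <stdint.h>

/* Old layout: root and parent share storage, so only one value survives. */
struct toy_ref_union {
	union {
		uint64_t root;
		uint64_t parent;
	};
};

/* New layout, as in the patch: both values are kept side by side. */
struct toy_ref_split {
	uint64_t root;
	uint64_t parent;
};

/* Old logic: a shared ref (parent != 0) never records its root. */
static void toy_fill_union(struct toy_ref_union *r,
			   uint64_t root, uint64_t parent)
{
	if (parent)
		r->parent = parent;
	else
		r->root = root;
}

/* New logic: both fields are always stored. */
static void toy_fill_split(struct toy_ref_split *r,
			   uint64_t root, uint64_t parent)
{
	r->parent = parent;
	r->root = root;
}
```

For a shared ref with root 5 and parent 4096, the union variant ends up holding 4096 in the storage that root also names, so the root id is lost; the split variant keeps both, which is what the backref walker relies on.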



[PATCH v1 05/10] Btrfs: add nested locking mode for paths

2011-12-22 Thread Jan Schmidt
From: Arne Jansen sensi...@gmx.net

This patch adds the possibility to read-lock an extent even if it is already
write-locked from the same thread. btrfs_find_all_roots() needs this
capability.
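
The locking.c part of this patch is not shown in full here, so the following is only a single-threaded toy model of the behaviour described above, not the kernel implementation. The toy_eb_lock structure, its fields and toy_read_lock() are hypothetical names introduced for illustration.

```c
/* Hypothetical model of an extent buffer lock. */
struct toy_eb_lock {
	int write_owner;	/* id of the thread holding the write lock, 0 if none */
	int readers;		/* number of ordinary read holders */
	int nested_reads;	/* read locks taken by the write owner itself */
};

/*
 * Try to take a read lock for thread 'tid'.
 * Returns 1 on success and 0 where a real lock would have to block.
 */
static int toy_read_lock(struct toy_eb_lock *l, int tid, int nested)
{
	if (l->write_owner == 0) {
		l->readers++;
		return 1;
	}
	/* Only the write owner may read-lock again, and only if it asks to. */
	if (nested && l->write_owner == tid) {
		l->nested_reads++;
		return 1;
	}
	return 0;
}
```

Without the nested flag, a thread that already holds the write lock would wait on itself; with it, the same thread is let through while other threads are still kept out.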

Signed-off-by: Arne Jansen sensi...@gmx.net
Signed-off-by: Jan Schmidt list.bt...@jan-o-sch.net
---
 fs/btrfs/ctree.c |   22 
 fs/btrfs/ctree.h |1 +
 fs/btrfs/extent_io.c |1 +
 fs/btrfs/extent_io.h |2 +
 fs/btrfs/locking.c   |   51 +++--
 fs/btrfs/locking.h   |2 +-
 6 files changed, 66 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 0639a55..d0cd67e 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -186,13 +186,14 @@ struct extent_buffer *btrfs_lock_root_node(struct btrfs_root *root)
  * tree until you end up with a lock on the root.  A locked buffer
  * is returned, with a reference held.
  */
-struct extent_buffer *btrfs_read_lock_root_node(struct btrfs_root *root)
+struct extent_buffer *btrfs_read_lock_root_node(struct btrfs_root *root,
+						int nested)
 {
 	struct extent_buffer *eb;
 
 	while (1) {
 		eb = btrfs_root_node(root);
-		btrfs_tree_read_lock(eb);
+		btrfs_tree_read_lock(eb, nested);
 		if (eb == root->node)
 			break;
 		btrfs_tree_read_unlock(eb);
@@ -1637,6 +1638,7 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root
 	/* everything at write_lock_level or lower must be write locked */
 	int write_lock_level = 0;
 	u8 lowest_level = 0;
+	int nested = p->nested;
 
 	lowest_level = p->lowest_level;
 	WARN_ON(lowest_level && ins_len > 0);
@@ -1678,8 +1680,9 @@ again:
 		b = root->commit_root;
 		extent_buffer_get(b);
 		level = btrfs_header_level(b);
+		BUG_ON(p->skip_locking && nested);
 		if (!p->skip_locking)
-			btrfs_tree_read_lock(b);
+			btrfs_tree_read_lock(b, 0);
 	} else {
 		if (p->skip_locking) {
 			b = btrfs_root_node(root);
@@ -1688,7 +1691,7 @@ again:
 			/* we don't know the level of the root node
 			 * until we actually have it read locked
 			 */
-			b = btrfs_read_lock_root_node(root);
+			b = btrfs_read_lock_root_node(root, nested);
 			level = btrfs_header_level(b);
 			if (level <= write_lock_level) {
 				/* whoops, must trade for write lock */
@@ -1827,7 +1830,8 @@ cow_done:
 				err = btrfs_try_tree_read_lock(b);
 				if (!err) {
 					btrfs_set_path_blocking(p);
-					btrfs_tree_read_lock(b);
+					btrfs_tree_read_lock(b,
+							     nested);
 					btrfs_clear_path_blocking(p, b,
 								  BTRFS_READ_LOCK);
 				}
@@ -3972,7 +3976,7 @@ int btrfs_search_forward(struct btrfs_root *root, struct btrfs_key *min_key,
 
 	WARN_ON(!path->keep_locks);
 again:
-	cur = btrfs_read_lock_root_node(root);
+	cur = btrfs_read_lock_root_node(root, 0);
 	level = btrfs_header_level(cur);
 	WARN_ON(path->nodes[level]);
 	path->nodes[level] = cur;
@@ -4066,7 +4070,7 @@ find_next_key:
 		cur = read_node_slot(root, cur, slot);
 		BUG_ON(!cur);
 
-		btrfs_tree_read_lock(cur);
+		btrfs_tree_read_lock(cur, 0);
 
 		path->locks[level - 1] = BTRFS_READ_LOCK;
 		path->nodes[level - 1] = cur;
@@ -4260,7 +4264,7 @@ again:
 			ret = btrfs_try_tree_read_lock(next);
 			if (!ret) {
 				btrfs_set_path_blocking(path);
-				btrfs_tree_read_lock(next);
+				btrfs_tree_read_lock(next, 0);
 				btrfs_clear_path_blocking(path, next,
 							  BTRFS_READ_LOCK);
 			}
@@ -4297,7 +4301,7 @@ again:
 			ret = btrfs_try_tree_read_lock(next);
 			if (!ret) {
 				btrfs_set_path_blocking(path);
-				btrfs_tree_read_lock(next);
+				btrfs_tree_read_lock(next, 0);
 				btrfs_clear_path_blocking(path, next,
   

[PATCH v1 09/10] Btrfs: added btrfs_find_all_roots()

2011-12-22 Thread Jan Schmidt
This function gets a byte number (a data extent), collects all the leaves
pointing to it and walks up the trees to find all fs roots pointing to those
leaves. It also returns the list of all leaves pointing to that extent.

It does proper locking for the involved trees, can be used on busy file
systems and honors delayed refs.
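
The walk from a leaf up to every root that references it can be sketched in user space as follows. This is a deliberately simplified model, not the kernel algorithm: it ignores locking and delayed refs, uses recursion and fixed-size arrays where the patch uses ulists, and all toy_* names are hypothetical.

```c
#include <stddef.h>

#define TOY_MAX_PARENTS 4
#define TOY_MAX_ROOTS   8

/* A tree block; after a snapshot a block may have several parents. */
struct toy_block {
	unsigned long long root_id;	/* meaningful only for a root node */
	int nparents;
	struct toy_block *parents[TOY_MAX_PARENTS];
};

/* Result set: the distinct root ids found so far. */
struct toy_roots {
	int n;
	unsigned long long ids[TOY_MAX_ROOTS];
};

/* Add a root id unless it is already recorded (or the set is full). */
static void toy_add_root(struct toy_roots *out, unsigned long long id)
{
	int i;

	for (i = 0; i < out->n; i++)
		if (out->ids[i] == id)
			return;
	if (out->n < TOY_MAX_ROOTS)
		out->ids[out->n++] = id;
}

/* Follow every parent pointer upwards; blocks without parents are roots. */
static void toy_walk_up(const struct toy_block *b, struct toy_roots *out)
{
	int i;

	if (b->nparents == 0) {
		toy_add_root(out, b->root_id);
		return;
	}
	for (i = 0; i < b->nparents; i++)
		toy_walk_up(b->parents[i], out);
}
```

A leaf whose only parent is a node shared between the default subvolume (root 5) and a snapshot (say root 256) resolves to both roots, which is the case a walker that stops at the first root would miss.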

Signed-off-by: Arne Jansen sensi...@gmx.net
Signed-off-by: Jan Schmidt list.bt...@jan-o-sch.net
---
 fs/btrfs/backref.c |  784 
 fs/btrfs/backref.h |5 +
 2 files changed, 789 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 22c64ff..e01790e 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -19,6 +19,9 @@
 #include "ctree.h"
 #include "disk-io.h"
 #include "backref.h"
+#include "ulist.h"
+#include "transaction.h"
+#include "delayed-ref.h"
 
 struct __data_ref {
 	struct list_head list;
@@ -32,6 +35,787 @@ struct __shared_ref {
 	u64 disk_byte;
 };
 
+/*
+ * this structure records all encountered refs on the way up to the root
+ */
+struct __prelim_ref {
+	struct list_head list;
+	u64 root_id;
+	struct btrfs_key key;
+	int level;
+	int count;
+	u64 parent;
+	u64 wanted_disk_byte;
+};
+
+static int __add_prelim_ref(struct list_head *head, u64 root_id,
+			    struct btrfs_key *key, int level, u64 parent,
+			    u64 wanted_disk_byte, int count)
+{
+	struct __prelim_ref *ref;
+
+	/* in case we're adding delayed refs, we're holding the refs spinlock */
+	ref = kmalloc(sizeof(*ref), GFP_ATOMIC);
+	if (!ref)
+		return -ENOMEM;
+
+	ref->root_id = root_id;
+	if (key)
+		ref->key = *key;
+	else
+		memset(&ref->key, 0, sizeof(ref->key));
+
+	ref->level = level;
+	ref->count = count;
+	ref->parent = parent;
+	ref->wanted_disk_byte = wanted_disk_byte;
+	list_add_tail(&ref->list, head);
+
+	return 0;
+}
+
+static int add_all_parents(struct btrfs_root *root, struct btrfs_path *path,
+				struct ulist *parents,
+				struct extent_buffer *eb, int level,
+				u64 wanted_objectid, u64 wanted_disk_byte)
+{
+	int ret;
+	int slot;
+	struct btrfs_file_extent_item *fi;
+	struct btrfs_key key;
+	u64 disk_byte;
+
+add_parent:
+	ret = ulist_add(parents, eb->start, 0, GFP_NOFS);
+	if (ret < 0)
+		return ret;
+
+	if (level != 0)
+		return 0;
+
+	/*
+	 * if the current leaf is full with EXTENT_DATA items, we must
+	 * check the next one if that holds a reference as well.
+	 * ref->count cannot be used to skip this check.
+	 * repeat this until we don't find any additional EXTENT_DATA items.
+	 */
+	while (1) {
+		ret = btrfs_next_leaf(root, path);
+		if (ret < 0)
+			return ret;
+		if (ret)
+			return 0;
+
+		eb = path->nodes[0];
+		for (slot = 0; slot < btrfs_header_nritems(eb); ++slot) {
+			btrfs_item_key_to_cpu(eb, &key, slot);
+			if (key.objectid != wanted_objectid ||
+			    key.type != BTRFS_EXTENT_DATA_KEY)
+				return 0;
+			fi = btrfs_item_ptr(eb, slot,
+						struct btrfs_file_extent_item);
+			disk_byte = btrfs_file_extent_disk_bytenr(eb, fi);
+			if (disk_byte == wanted_disk_byte)
+				goto add_parent;
+		}
+	}
+
+	return 0;
+}
+
+/*
+ * resolve an indirect backref in the form (root_id, key, level)
+ * to a logical address
+ */
+static int __resolve_indirect_ref(struct btrfs_fs_info *fs_info,
+					struct __prelim_ref *ref,
+					struct ulist *parents)
+{
+	struct btrfs_path *path;
+	struct btrfs_root *root;
+	struct btrfs_key root_key;
+	struct btrfs_key key = {0};
+	struct extent_buffer *eb;
+	int ret = 0;
+	int root_level;
+	int level = ref->level;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	root_key.objectid = ref->root_id;
+	root_key.type = BTRFS_ROOT_ITEM_KEY;
+	root_key.offset = (u64)-1;
+	root = btrfs_read_fs_root_no_name(fs_info, &root_key);
+	if (IS_ERR(root)) {
+		ret = PTR_ERR(root);
+		goto out;
+	}
+
+	rcu_read_lock();
+	root_level = btrfs_header_level(root->node);
+	rcu_read_unlock();
+
+	if (root_level + 1 == level)
+		goto out;
+
+	path->lowest_level = level;
+	path->nested = 1;
+   ret = 

[PATCH v1 10/10] Btrfs: new backref walking code

2011-12-22 Thread Jan Schmidt
The old backref iteration code could only safely be used on commit roots.
Besides this limitation, it had bugs in finding the roots for these
references. This commit replaces large parts of it by btrfs_find_all_roots()
which a) really finds all roots and the correct roots, b) works correctly
under heavy file system load, c) considers delayed refs.

Signed-off-by: Jan Schmidt list.bt...@jan-o-sch.net
---
 fs/btrfs/backref.c |  354 +++-
 fs/btrfs/ioctl.c   |8 +-
 fs/btrfs/scrub.c   |7 +-
 3 files changed, 107 insertions(+), 262 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index e01790e..2fdb4b1 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -23,18 +23,6 @@
 #include "transaction.h"
 #include "delayed-ref.h"
 
-struct __data_ref {
-	struct list_head list;
-	u64 inum;
-	u64 root;
-	u64 extent_data_item_offset;
-};
-
-struct __shared_ref {
-	struct list_head list;
-	u64 disk_byte;
-};
-
 /*
  * this structure records all encountered refs on the way up to the root
  */
@@ -965,8 +953,11 @@ int extent_from_logical(struct btrfs_fs_info *fs_info, u64 logical,
 	btrfs_item_key_to_cpu(path->nodes[0], found_key, path->slots[0]);
 	if (found_key->type != BTRFS_EXTENT_ITEM_KEY ||
 	    found_key->objectid > logical ||
-	    found_key->objectid + found_key->offset <= logical)
+	    found_key->objectid + found_key->offset <= logical) {
+		pr_debug("logical %llu is not within any extent\n",
+			 (unsigned long long)logical);
 		return -ENOENT;
+	}
 
 	eb = path->nodes[0];
 	item_size = btrfs_item_size_nr(eb, path->slots[0]);
@@ -975,6 +966,13 @@ int extent_from_logical(struct btrfs_fs_info *fs_info, u64 logical,
 	ei = btrfs_item_ptr(eb, path->slots[0], struct btrfs_extent_item);
 	flags = btrfs_extent_flags(eb, ei);
 
+	pr_debug("logical %llu is at position %llu within the extent (%llu "
+		 "EXTENT_ITEM %llu) flags %#llx size %u\n",
+		 (unsigned long long)logical,
+		 (unsigned long long)(logical - found_key->objectid),
+		 (unsigned long long)found_key->objectid,
+		 (unsigned long long)found_key->offset,
+		 (unsigned long long)flags, item_size);
 	if (flags & BTRFS_EXTENT_FLAG_TREE_BLOCK)
 		return BTRFS_EXTENT_FLAG_TREE_BLOCK;
 	if (flags & BTRFS_EXTENT_FLAG_DATA)
@@ -1071,128 +1069,11 @@ int tree_backref_for_extent(unsigned long *ptr, struct extent_buffer *eb,
 	return 0;
 }
 
-static int __data_list_add(struct list_head *head, u64 inum,
-				u64 extent_data_item_offset, u64 root)
-{
-	struct __data_ref *ref;
-
-	ref = kmalloc(sizeof(*ref), GFP_NOFS);
-	if (!ref)
-		return -ENOMEM;
-
-	ref->inum = inum;
-	ref->extent_data_item_offset = extent_data_item_offset;
-	ref->root = root;
-	list_add_tail(&ref->list, head);
-
-	return 0;
-}
-
-static int __data_list_add_eb(struct list_head *head, struct extent_buffer *eb,
-				struct btrfs_extent_data_ref *dref)
-{
-	return __data_list_add(head, btrfs_extent_data_ref_objectid(eb, dref),
-				btrfs_extent_data_ref_offset(eb, dref),
-				btrfs_extent_data_ref_root(eb, dref));
-}
-
-static int __shared_list_add(struct list_head *head, u64 disk_byte)
-{
-	struct __shared_ref *ref;
-
-	ref = kmalloc(sizeof(*ref), GFP_NOFS);
-	if (!ref)
-		return -ENOMEM;
-
-	ref->disk_byte = disk_byte;
-	list_add_tail(&ref->list, head);
-
-	return 0;
-}
-
-static int __iter_shared_inline_ref_inodes(struct btrfs_fs_info *fs_info,
-					   u64 logical, u64 inum,
-					   u64 extent_data_item_offset,
-					   u64 extent_offset,
-					   struct btrfs_path *path,
-					   struct list_head *data_refs,
-					   iterate_extent_inodes_t *iterate,
-					   void *ctx)
-{
-	u64 ref_root;
-	u32 item_size;
-	struct btrfs_key key;
-	struct extent_buffer *eb;
-	struct btrfs_extent_item *ei;
-	struct btrfs_extent_inline_ref *eiref;
-	struct __data_ref *ref;
-	int ret;
-	int type;
-	int last;
-	unsigned long ptr = 0;
-
-	WARN_ON(!list_empty(data_refs));
-	ret = extent_from_logical(fs_info, logical, path, &key);
-	if (ret & BTRFS_EXTENT_FLAG_DATA)
-		ret = -EIO;
-	if (ret < 0)
-		goto out;
-
-	eb = path->nodes[0];
-	ei = btrfs_item_ptr(eb, path->slots[0], struct btrfs_extent_item);
-   item_size = btrfs_item_size_nr(eb, 

[PATCH v1 06/10] Btrfs: add sequence numbers to delayed refs

2011-12-22 Thread Jan Schmidt
From: Arne Jansen sensi...@gmx.net

Sequence numbers are needed to reconstruct the backrefs of a given extent to
a certain point in time. The total set of backrefs consists of the set of
backrefs recorded on disk plus the enqueued delayed refs for it that existed
at that moment.

This patch also adds a list that records all delayed refs which are
currently in the process of being added.

When walking all refs of an extent in btrfs_find_all_roots(), we freeze the
current state of delayed refs, honor anything up to this point and prevent
processing newer delayed refs to ensure consistency.
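
The hold-back rule can be illustrated with a small user-space sketch. It assumes that each running backref walk registers the sequence number at which it froze the delayed refs, and that a delayed ref may only be processed if it is older than all of them. The function name and the plain array are hypothetical; the patch itself keeps these numbers in a list and looks at its first entry.

```c
#include <stddef.h>

/*
 * Returns 1 if a delayed ref with sequence number 'seq' must be held back
 * because a registered walker froze at that point or earlier, else 0.
 */
static int toy_check_delayed_seq(const unsigned long long *walkers, size_t n,
				 unsigned long long seq)
{
	size_t i;
	unsigned long long lowest;

	if (n == 0)
		return 0;	/* nobody is walking, nothing to hold back */

	/* find the oldest frozen point among all registered walkers */
	lowest = walkers[0];
	for (i = 1; i < n; i++)
		if (walkers[i] < lowest)
			lowest = walkers[i];

	return seq >= lowest;
}
```

With walkers frozen at 12, 10 and 15, a ref with sequence number 9 may run, while refs numbered 10 or higher are held back until the oldest walker has finished.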

Signed-off-by: Arne Jansen sensi...@gmx.net
Signed-off-by: Jan Schmidt list.bt...@jan-o-sch.net
---
 fs/btrfs/delayed-ref.c |   34 +++
 fs/btrfs/delayed-ref.h |   70 
 fs/btrfs/transaction.c |4 +++
 3 files changed, 108 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index babd37b..a405db0 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -101,6 +101,11 @@ static int comp_entry(struct btrfs_delayed_ref_node *ref2,
 		return -1;
 	if (ref1->type > ref2->type)
 		return 1;
+	/* merging of sequenced refs is not allowed */
+	if (ref1->seq < ref2->seq)
+		return -1;
+	if (ref1->seq > ref2->seq)
+		return 1;
 	if (ref1->type == BTRFS_TREE_BLOCK_REF_KEY ||
 	    ref1->type == BTRFS_SHARED_BLOCK_REF_KEY) {
 		return comp_tree_refs(btrfs_delayed_node_to_tree_ref(ref2),
@@ -209,6 +214,24 @@ int btrfs_delayed_ref_lock(struct btrfs_trans_handle *trans,
 	return 0;
 }
 
+int btrfs_check_delayed_seq(struct btrfs_delayed_ref_root *delayed_refs,
+			    u64 seq)
+{
+	struct seq_list *elem;
+
+	assert_spin_locked(&delayed_refs->lock);
+	if (list_empty(&delayed_refs->seq_head))
+		return 0;
+
+	elem = list_first_entry(&delayed_refs->seq_head, struct seq_list, list);
+	if (seq >= elem->seq) {
+		pr_debug("holding back delayed_ref %llu, lowest is %llu (%p)\n",
+			 seq, elem->seq, delayed_refs);
+		return 1;
+	}
+	return 0;
+}
+
 int btrfs_find_ref_cluster(struct btrfs_trans_handle *trans,
 			   struct list_head *cluster, u64 start)
 {
@@ -438,6 +461,7 @@ static noinline int add_delayed_ref_head(struct btrfs_fs_info *fs_info,
 	ref->action  = 0;
 	ref->is_head = 1;
 	ref->in_tree = 1;
+	ref->seq = 0;
 
 	head_ref = btrfs_delayed_node_to_head(ref);
 	head_ref->must_insert_reserved = must_insert_reserved;
@@ -479,6 +503,7 @@ static noinline int add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
 	struct btrfs_delayed_ref_node *existing;
 	struct btrfs_delayed_tree_ref *full_ref;
 	struct btrfs_delayed_ref_root *delayed_refs;
+	u64 seq = 0;
 
 	if (action == BTRFS_ADD_DELAYED_EXTENT)
 		action = BTRFS_ADD_DELAYED_REF;
@@ -494,6 +519,10 @@ static noinline int add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
 	ref->is_head = 0;
 	ref->in_tree = 1;
 
+	if (need_ref_seq(for_cow, ref_root))
+		seq = inc_delayed_seq(delayed_refs);
+	ref->seq = seq;
+
 	full_ref = btrfs_delayed_node_to_tree_ref(ref);
 	full_ref->parent = parent;
 	full_ref->root = ref_root;
@@ -534,6 +563,7 @@ static noinline int add_delayed_data_ref(struct btrfs_fs_info *fs_info,
 	struct btrfs_delayed_ref_node *existing;
 	struct btrfs_delayed_data_ref *full_ref;
 	struct btrfs_delayed_ref_root *delayed_refs;
+	u64 seq = 0;
 
 	if (action == BTRFS_ADD_DELAYED_EXTENT)
 		action = BTRFS_ADD_DELAYED_REF;
@@ -549,6 +579,10 @@ static noinline int add_delayed_data_ref(struct btrfs_fs_info *fs_info,
 	ref->is_head = 0;
 	ref->in_tree = 1;
 
+	if (need_ref_seq(for_cow, ref_root))
+		seq = inc_delayed_seq(delayed_refs);
+	ref->seq = seq;
+
 	full_ref = btrfs_delayed_node_to_data_ref(ref);
 	full_ref->parent = parent;
 	full_ref->root = ref_root;
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index a5fb2bc..174416f 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -33,6 +33,9 @@ struct btrfs_delayed_ref_node {
 	/* the size of the extent */
 	u64 num_bytes;
 
+	/* seq number to keep track of insertion order */
+	u64 seq;
+
 	/* ref count on this data structure */
 	atomic_t refs;
 
@@ -136,6 +139,20 @@ struct btrfs_delayed_ref_root {
 	int flushing;
 
 	u64 run_delayed_start;
+
+	/*
+	 * seq number of delayed refs. We need to know if a backref was being
+	 * added before the currently processed ref or afterwards.
+	 */
+	u64 seq;
+
+	/*
+* seq_list holds a list of all seq numbers that are 

[PATCH v1 00/10] Btrfs: backref walking rewrite

2011-12-22 Thread Jan Schmidt
This patch series is a major rewrite of the backref walking code. The patch
series Arne sent some weeks ago for quota groups had a very interesting
function, find_all_roots. I took this from him together with the bits needed
for find_all_roots to work and replaced a major part of the code in backref.c
with it.

It can be pulled from
git://git.jan-o-sch.net/btrfs-unstable for-chris
There's also a gitweb for that repo on
http://git.jan-o-sch.net/?p=btrfs-unstable

My old backref code had several problems:
- it relied on a consistent state of the trees in memory
- it ignored delayed refs
- it only featured rudimentary locking
- it could miss some references depending on the tree layout

The biggest advantage is that we can now resolve backrefs reliably, even on
busy file systems. So we've got benefits for:
- the existing btrfs inspect-internal commands
- aforementioned qgroups (patches on the list)
- btrfs send (currently in development)
- snapshot-aware defrag
- ... possibly more to come

Splitting the needed bits out of Arne's code was quite an intrusive operation.
In case this goes into 3.3, one of us will soon make a rebased version of the
qgroup patch set. Things corrected/changed in Arne's code along the way:
- don't assume INODE_ITEMs and the corresponding EXTENT_DATA items are in the
  same leaf (use the correct EXTENT_DATA_KEY for tree searches)
- don't assume all EXTENT_DATA items with the same backref for the same inode
  are in the same leaf (__resolve_indirect_refs can now add more refs)
- added missing key and level to prelim lists for shared block refs
- delayed ref sequence locking ability without wasting sequence numbers
- waitqueue instead of busy waiting for more delayed refs
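
To illustrate the sequence number idea from the list above, here is a minimal
userspace sketch. It is not the code from the series: struct seq_tracker and
the seq_* helpers are hypothetical names, the fixed-size array stands in for
the kernel's seq_list, and the "hold back when seq is at or above the lowest
pinned number" comparison is an assumption. It only shows how a walker can pin
the current number without consuming one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_HOLDERS 8

/* Hypothetical userspace model; not the kernel data structure. */
struct seq_tracker {
	uint64_t last_seq;           /* last sequence number handed out */
	uint64_t held[MAX_HOLDERS];  /* numbers pinned by active walkers */
	size_t nheld;
};

/* Hand out the next sequence number for a newly queued ref. */
static uint64_t seq_inc(struct seq_tracker *t)
{
	return ++t->last_seq;
}

/*
 * Pin the current sequence number without consuming a new one.
 * Returns the slot index, or -1 if all slots are in use.
 */
static int seq_hold(struct seq_tracker *t)
{
	if (t->nheld == MAX_HOLDERS)
		return -1;
	t->held[t->nheld] = t->last_seq;
	return (int)t->nheld++;
}

/* Drop a pin; the last slot is moved into the freed one. */
static void seq_release(struct seq_tracker *t, size_t idx)
{
	assert(idx < t->nheld);
	t->held[idx] = t->held[--t->nheld];
}

/* A ref must wait while its number is at or above any pinned number. */
static int seq_must_wait(const struct seq_tracker *t, uint64_t seq)
{
	for (size_t i = 0; i < t->nheld; i++)
		if (seq >= t->held[i])
			return 1;
	return 0;
}
```

In this model a ref queued after a walker has pinned a number is reported as
"must wait" until the walker releases its pin.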

As this touches a critical part of the file system, I also did some speed
benchmarks. It turns out that dbench shows no performance decrease on my
hardware. I can do more tests if desired.

By the way: this patch series fixes xfstest 278 (to be published soon) :-)

-Jan

Arne Jansen (6):
  Btrfs: generic data structure to build unique lists
  Btrfs: mark delayed refs as for cow
  Btrfs: always save ref_root in delayed refs
  Btrfs: add nested locking mode for paths
  Btrfs: add sequence numbers to delayed refs
  Btrfs: put back delayed refs that are too new

Jan Schmidt (4):
  Btrfs: added helper btrfs_next_item()
  Btrfs: add waitqueue instead of doing busy waiting for more delayed
refs
  Btrfs: added btrfs_find_all_roots()
  Btrfs: new backref walking code

 fs/btrfs/Makefile  |2 +-
 fs/btrfs/backref.c | 1132 +---
 fs/btrfs/backref.h |5 +
 fs/btrfs/ctree.c   |   64 ++--
 fs/btrfs/ctree.h   |   25 +-
 fs/btrfs/delayed-ref.c |  153 +--
 fs/btrfs/delayed-ref.h |  104 -
 fs/btrfs/disk-io.c |3 +-
 fs/btrfs/extent-tree.c |  187 ++--
 fs/btrfs/extent_io.c   |1 +
 fs/btrfs/extent_io.h   |2 +
 fs/btrfs/file.c|   10 +-
 fs/btrfs/inode.c   |2 +-
 fs/btrfs/ioctl.c   |   13 +-
 fs/btrfs/locking.c |   51 ++-
 fs/btrfs/locking.h |2 +-
 fs/btrfs/relocation.c  |   18 +-
 fs/btrfs/scrub.c   |7 +-
 fs/btrfs/transaction.c |9 +-
 fs/btrfs/tree-log.c|2 +-
 fs/btrfs/ulist.c   |  220 ++
 fs/btrfs/ulist.h   |   68 +++
 22 files changed, 1651 insertions(+), 429 deletions(-)
 create mode 100644 fs/btrfs/ulist.c
 create mode 100644 fs/btrfs/ulist.h

-- 
1.7.3.4



[PATCH v1 01/10] Btrfs: generic data structure to build unique lists

2011-12-22 Thread Jan Schmidt
From: Arne Jansen sensi...@gmx.net

ulist is a generic data structure to hold a collection of unique u64
values. The only operations it supports are adding to the list and
enumerating it.

It is possible to store an auxiliary value along with the key. The
implementation is preliminary and can probably be sped up significantly.

It is used by btrfs_find_all_roots() quota to translate recursions into
iterative loops.
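
As an illustration of how a unique list turns recursion into a loop, here is a
minimal userspace sketch. It does not use the ulist API from this patch:
struct uset, uset_add() and count_reachable() are hypothetical names, and the
graph is assumed to be a small adjacency matrix. The point is that the set
doubles as the work queue, so every node is visited exactly once.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for a list of unique u64 values. */
struct uset {
	uint64_t *vals;
	size_t n, cap;
};

/* Add val unless present. Returns 1 if added, 0 if duplicate, -1 on OOM. */
static int uset_add(struct uset *s, uint64_t val)
{
	for (size_t i = 0; i < s->n; i++)
		if (s->vals[i] == val)
			return 0;
	if (s->n == s->cap) {
		size_t cap = s->cap ? s->cap * 2 : 16;
		uint64_t *p = realloc(s->vals, cap * sizeof(*p));

		if (!p)
			return -1;
		s->vals = p;
		s->cap = cap;
	}
	s->vals[s->n++] = val;
	return 1;
}

/*
 * Count the nodes reachable from root without recursion. adj is an
 * nnodes x nnodes adjacency matrix in row-major order.
 * Returns 0 if memory runs out.
 */
static size_t count_reachable(const uint8_t *adj, size_t nnodes, uint64_t root)
{
	struct uset s = { 0 };
	size_t visited = 0;

	assert(root < nnodes);
	if (uset_add(&s, root) < 0)
		return 0;
	/* s.n grows while we iterate; the loop ends once nothing new appears */
	for (size_t i = 0; i < s.n; i++) {
		uint64_t cur = s.vals[i];

		for (uint64_t child = 0; child < nnodes; child++) {
			if (adj[cur * nnodes + child] &&
			    uset_add(&s, child) < 0) {
				free(s.vals);
				return 0;
			}
		}
		visited++;
	}
	free(s.vals);
	return visited;
}
```

The linear duplicate scan mirrors the "preliminary" nature mentioned above; a
tree or hash lookup would be the obvious speedup.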

Signed-off-by: Arne Jansen sensi...@gmx.net
Signed-off-by: Jan Schmidt list.bt...@jan-o-sch.net
---
 fs/btrfs/Makefile |2 +-
 fs/btrfs/ulist.c  |  220 +
 fs/btrfs/ulist.h  |   68 
 3 files changed, 289 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index c0ddfd2..7079840 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -8,6 +8,6 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o 
root-tree.o dir-item.o \
   extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \
   export.o tree-log.o free-space-cache.o zlib.o lzo.o \
   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
-  reada.o backref.o
+  reada.o backref.o ulist.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
diff --git a/fs/btrfs/ulist.c b/fs/btrfs/ulist.c
new file mode 100644
index 000..12f5147
--- /dev/null
+++ b/fs/btrfs/ulist.c
@@ -0,0 +1,220 @@
+/*
+ * Copyright (C) 2011 STRATO AG
+ * written by Arne Jansen sensi...@gmx.net
+ * Distributed under the GNU GPL license version 2.
+ */
+
+#include <linux/slab.h>
+#include <linux/module.h>
+#include "ulist.h"
+
+/*
+ * ulist is a generic data structure to hold a collection of unique u64
+ * values. The only operations it supports are adding to the list and
+ * enumerating it.
+ * It is possible to store an auxiliary value along with the key.
+ *
+ * The implementation is preliminary and can probably be sped up
+ * significantly. A first step would be to store the values in an rbtree
+ * as soon as ULIST_SIZE is exceeded.
+ *
+ * A sample usage for ulists is the enumeration of directed graphs without
+ * visiting a node twice. The pseudo-code could look like this:
+ *
+ * ulist = ulist_alloc();
+ * ulist_add(ulist, root);
+ * elem = NULL;
+ *
+ * while ((elem = ulist_next(ulist, elem))) {
+ * 	for (all child nodes n in elem)
+ *		ulist_add(ulist, n);
+ *	do something useful with the node;
+ * }
+ * ulist_free(ulist);
+ *
+ * This assumes the graph nodes are addressable by u64. This stems from the
+ * usage for tree enumeration in btrfs, where the logical addresses are
+ * 64 bit.
+ *
+ * It is also useful for tree enumeration which could be done elegantly
+ * recursively, but is not possible due to kernel stack limitations. The
+ * loop would be similar to the above.
+ */
+
+/**
+ * ulist_init - freshly initialize a ulist
+ * @ulist: the ulist to initialize
+ *
+ * Note: don't use this function to init an already used ulist, use
+ * ulist_reinit instead.
+ */
+void ulist_init(struct ulist *ulist)
+{
+	ulist->nnodes = 0;
+	ulist->nodes = ulist->int_nodes;
+	ulist->nodes_alloced = ULIST_SIZE;
+}
+EXPORT_SYMBOL(ulist_init);
+
+/**
+ * ulist_fini - free up additionally allocated memory for the ulist
+ * @ulist: the ulist from which to free the additional memory
+ *
+ * This is useful in cases where the base 'struct ulist' has been statically
+ * allocated.
+ */
+void ulist_fini(struct ulist *ulist)
+{
+	/*
+	 * The first ULIST_SIZE elements are stored inline in struct ulist.
+	 * Only if more elements are allocated they need to be freed.
+	 */
+	if (ulist->nodes_alloced > ULIST_SIZE)
+		kfree(ulist->nodes);
+	ulist->nodes_alloced = 0;	/* in case ulist_fini is called twice */
+}
+EXPORT_SYMBOL(ulist_fini);
+
+/**
+ * ulist_reinit - prepare a ulist for reuse
+ * @ulist: ulist to be reused
+ *
+ * Free up all additional memory allocated for the list elements and reinit
+ * the ulist.
+ */
+void ulist_reinit(struct ulist *ulist)
+{
+   ulist_fini(ulist);
+   ulist_init(ulist);
+}
+EXPORT_SYMBOL(ulist_reinit);
+
+/**
+ * ulist_alloc - dynamically allocate a ulist
+ * @gfp_mask:  allocation flags to use for the base allocation
+ *
+ * The allocated ulist will be returned in an initialized state.
+ */
+struct ulist *ulist_alloc(unsigned long gfp_mask)
+{
+   struct ulist *ulist = kmalloc(sizeof(*ulist), gfp_mask);
+
+   if (!ulist)
+   return NULL;
+
+   ulist_init(ulist);
+
+   return ulist;
+}
+EXPORT_SYMBOL(ulist_alloc);
+
+/**
+ * ulist_free - free dynamically allocated ulist
+ * @ulist: ulist to free
+ *
+ * It is not necessary to call ulist_fini before.
+ */
+void ulist_free(struct ulist *ulist)
+{
+   if (!ulist)
+   return;
+   ulist_fini(ulist);
+   kfree(ulist);
+}
+EXPORT_SYMBOL(ulist_free);
+
+/**
+ * ulist_add - add an element to the ulist
+ * @ulist: 

[PATCH v1 07/10] Btrfs: put back delayed refs that are too new

2011-12-22 Thread Jan Schmidt
From: Arne Jansen sensi...@gmx.net

When processing a delayed ref, first check if there are still old refs in
the process of being added. If so, put this ref back to the tree. To avoid
looping on this ref, choose a newer one in the next loop.
btrfs_find_ref_cluster has to take care of that.
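
The "choose a newer one in the next loop" lookup can be sketched in userspace
as follows. This is only an illustration under simplifying assumptions: a
sorted array replaces the rbtree, and find_bigger_wrap() is a hypothetical
name. It returns the first entry at or above the key and wraps to the first
entry when nothing bigger exists, which is the behaviour the return_bigger
flag below is described as adding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the index of the first element >= key in an ascending array,
 * wrapping around to index 0 when key is larger than every element.
 * The array must not be empty.
 */
static size_t find_bigger_wrap(const uint64_t *sorted, size_t n, uint64_t key)
{
	size_t lo = 0, hi = n;

	assert(n > 0);
	/* binary search for the lower bound of key */
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (sorted[mid] < key)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo == n ? 0 : lo;
}
```

Searching for start + 1 rather than start skips the entry that was just put
back, so the next pass begins at a different head.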

Signed-off-by: Arne Jansen sensi...@gmx.net
Signed-off-by: Jan Schmidt list.bt...@jan-o-sch.net
---
 fs/btrfs/delayed-ref.c |   43 +--
 fs/btrfs/extent-tree.c |   27 ++-
 2 files changed, 47 insertions(+), 23 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index a405db0..ee18198 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -155,16 +155,22 @@ static struct btrfs_delayed_ref_node *tree_insert(struct 
rb_root *root,
 
 /*
  * find an head entry based on bytenr. This returns the delayed ref
- * head if it was able to find one, or NULL if nothing was in that spot
+ * head if it was able to find one, or NULL if nothing was in that spot.
+ * If return_bigger is given, the next bigger entry is returned if no exact
+ * match is found.
  */
 static struct btrfs_delayed_ref_node *find_ref_head(struct rb_root *root,
  u64 bytenr,
- struct btrfs_delayed_ref_node **last)
+ struct btrfs_delayed_ref_node **last,
+ int return_bigger)
 {
-	struct rb_node *n = root->rb_node;
+	struct rb_node *n;
 	struct btrfs_delayed_ref_node *entry;
-	int cmp;
+	int cmp = 0;
 
+again:
+	n = root->rb_node;
+	entry = NULL;
 	while (n) {
 		entry = rb_entry(n, struct btrfs_delayed_ref_node, rb_node);
 		WARN_ON(!entry->in_tree);
@@ -187,6 +193,19 @@ static struct btrfs_delayed_ref_node *find_ref_head(struct rb_root *root,
 		else
 			return entry;
 	}
+	if (entry && return_bigger) {
+		if (cmp > 0) {
+			n = rb_next(&entry->rb_node);
+			if (!n)
+				n = rb_first(root);
+			entry = rb_entry(n, struct btrfs_delayed_ref_node,
+					 rb_node);
+			bytenr = entry->bytenr;
+			return_bigger = 0;
+			goto again;
+		}
+		return entry;
+	}
 	return NULL;
 }
 
@@ -246,20 +265,8 @@ int btrfs_find_ref_cluster(struct btrfs_trans_handle *trans,
 		node = rb_first(&delayed_refs->root);
 	} else {
 		ref = NULL;
-		find_ref_head(&delayed_refs->root, start, &ref);
+		find_ref_head(&delayed_refs->root, start + 1, &ref, 1);
 		if (ref) {
-			struct btrfs_delayed_ref_node *tmp;
-
-			node = rb_prev(&ref->rb_node);
-			while (node) {
-				tmp = rb_entry(node,
-					       struct btrfs_delayed_ref_node,
-					       rb_node);
-				if (tmp->bytenr < start)
-					break;
-				ref = tmp;
-				node = rb_prev(&ref->rb_node);
-			}
 			node = &ref->rb_node;
 		} else
 			node = rb_first(&delayed_refs->root);
@@ -748,7 +755,7 @@ btrfs_find_delayed_ref_head(struct btrfs_trans_handle *trans, u64 bytenr)
 	struct btrfs_delayed_ref_root *delayed_refs;
 
 	delayed_refs = &trans->transaction->delayed_refs;
-	ref = find_ref_head(&delayed_refs->root, bytenr, NULL);
+	ref = find_ref_head(&delayed_refs->root, bytenr, NULL, 0);
 	if (ref)
 		return btrfs_delayed_node_to_head(ref);
 	return NULL;
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index dc8b9a8..bbcca12 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2237,6 +2237,28 @@ static noinline int run_clustered_refs(struct btrfs_trans_handle *trans,
 		}
 
 		/*
+		 * locked_ref is the head node, so we have to go one
+		 * node back for any delayed ref updates
+		 */
+		ref = select_delayed_ref(locked_ref);
+
+		if (ref && ref->seq &&
+		    btrfs_check_delayed_seq(delayed_refs, ref->seq)) {
+			/*
+			 * there are still refs with lower seq numbers in the
+			 * process of being added. Don't run this ref yet.
+			 */
+			list_del_init(&locked_ref->cluster);
+			mutex_unlock(&locked_ref->mutex);
+			locked_ref = NULL;
+			delayed_refs->num_heads_ready++;
[RESEND] [PATCH v2] Btrfs: runtime integrity check tool

2011-12-22 Thread Stefan Behrens
Sigh. In the previously sent v2 patch, mail 1/4 exceeded the archaic
100,000 character limit of vger.kernel.org (no complaints from checkpatch.pl
though). Therefore I have now set up a git-daemon for pulling.

Please pull from
git://btrfs.giantdisaster.de/git/btrfs integrity-check-patch-v2


Changes v1->v2:
- Merge with updated disk flush code
- Use bdevname to print the bdev's name instead of the disk's name
- Fix v1 formatting issue (labels, and a few casts still had been
  wrong)
- Merge with current scrub.c
- Cast all u64 parameters to unsigned long long for printk
- Fix issue with read errors on lower layers (caused by one
  signed / unsigned mixup)
- Fix comment in check-integrity.c
- Check that data that is referenced from a newly written superblock
  was FLUSHed before
- Change the way the hash keys are calculated
- Add code to runtime integrity check tool that verifies that all
  written meta blocks contain a logical bytenr that maps to the
  device and physical bytenr that is used to submit the bio
- Shrink Kconfig help entry to 14 lines

This patch series adds a new module to the btrfs kernel mode
code. This new module can be used to catch cases where the
btrfs kernel code issues write requests to the disk that
leave the file system in an inconsistent state. In such a
state, a power loss or kernel panic would cause the data
on disk to be lost or at least damaged.

Code is added that examines all block write requests during
runtime (including writes of the super block). Three rules
are verified and an error is printed on violation of the
rules:
1. It is not allowed to write a disk block which is
   currently referenced by the super block (either directly
   or indirectly).
2. When a super block is written, it is verified that all
   referenced (directly or indirectly) blocks fulfill the
   following requirements:
   2a. All referenced blocks have either been present when
   the file system was mounted, (i.e., they have been
   referenced by the super block) or they have been
   written since then and the write completion callback
   was called.
   2b. All referenced blocks need to have a generation
   number which is equal to the parent's number.

Before commit v3.1-83-g5ff921b from Chris Mason, the
xfstests 013 and 083 used to trigger integrity issues in the
log tree. Disk blocks that had been in use for the log tree
had been freed and reused too early, while being referenced
by the written super block.

Since this issue with the log tree is fixed, no more issues
with the on-disk file system have been found while running
all currently available xfstests for btrfs and generic.

The search term in the kernel log that can be used to filter
on the existence of detected integrity issues is
"btrfs: attempt".

The integrity check is enabled via mount options. These
mount options are only supported if the integrity check
tool is compiled by defining BTRFS_FS_CHECK_INTEGRITY.

Example #1, apply integrity checks to all metadata:
mount /dev/sdb1 /mnt -o check_int

Example #2, apply integrity checks to all metadata and
to data extents:
mount /dev/sdb1 /mnt -o check_int_data

Example #3, apply integrity checks to all metadata and dump
the tree that the super block references to kernel messages
each time after a super block was written:
mount /dev/sdb1 /mnt -o check_int,check_int_print_mask=263

If the integrity check tool is included and activated in
the mount options, plenty of kernel memory is used, and
plenty of additional CPU cycles are spent. Enabling this
functionality is not intended for normal use. In most
cases, unless you are a btrfs developer who needs to verify
the integrity of (super)-block write requests, do not
enable the config option BTRFS_FS_CHECK_INTEGRITY to
include and compile the integrity check tool.

The patches are based on Chris Mason's v3.1-182-gd85c8a6.

Stefan Behrens (4):
  Btrfs: add optional integrity check code
  Btrfs: add config option to enable btrfs integrity check
  Btrfs: Makefile changes to optionally include btrfs integrity check
  Btrfs: integrate integrity check module into btrfs

 fs/btrfs/Kconfig   |   19 +
 fs/btrfs/Makefile  |1 +
 fs/btrfs/check-integrity.c | 3068
 fs/btrfs/check-integrity.h |   36 +
 fs/btrfs/ctree.h   |8 +-
 fs/btrfs/disk-io.c |   26 +-
 fs/btrfs/extent_io.c   |5 +-
 fs/btrfs/scrub.c   |5 +-
 fs/btrfs/super.c   |   39 +-
 fs/btrfs/volumes.c |7 +-
 10 files changed, 3203 insertions(+), 11 deletions(-)
 create mode 100644 fs/btrfs/check-integrity.c
 create mode 100644 fs/btrfs/check-integrity.h

--
1.7.3.4



Re: [PATCH v1 05/10] Btrfs: add nested locking mode for paths

2011-12-22 Thread Chris Mason
On Thu, Dec 22, 2011 at 05:03:19PM +0100, Jan Schmidt wrote:
> From: Arne Jansen sensi...@gmx.net
>
> This patch adds the possibility to read-lock an extent even if it is already
> write-locked from the same thread. btrfs_find_all_roots() needs this
> capability.

I'd rather not add a nested flag to the locking code, let's just make the
nesting explicitly allowed.

You shouldn't need locks around lock->owner.  Either your process owns
the lock (and it won't change away from your pid), or you don't own it
and it won't be your pid.  Just make sure the owner field gets cleared
when you do your final unlock.

So, if you are the owner of a write lock, you can add more write locks
or a read lock as required.
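
A toy model of the owner-based nesting described above, as a sketch only. It
is not the btrfs locking code: struct nlock and the nlock_* helpers are
hypothetical, callers are identified by a plain int, and there are no atomics
or blocking, so it shows just the bookkeeping.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_OWNER 0

/* Hypothetical toy lock that models only the ownership bookkeeping. */
struct nlock {
	int owner;         /* id of the write-lock holder, NO_OWNER if none */
	int write_depth;   /* write locks held by the owner */
	int nested_reads;  /* read locks the owner took on top of its write lock */
	int readers;       /* ordinary read locks */
};

static bool nlock_try_write(struct nlock *l, int me)
{
	assert(me != NO_OWNER);
	if (l->owner == me) {          /* the owner may nest write locks */
		l->write_depth++;
		return true;
	}
	if (l->owner != NO_OWNER || l->readers > 0)
		return false;
	l->owner = me;
	l->write_depth = 1;
	return true;
}

static bool nlock_try_read(struct nlock *l, int me)
{
	if (l->owner == me) {          /* the owner may add a read lock */
		l->nested_reads++;
		return true;
	}
	if (l->owner != NO_OWNER)
		return false;
	l->readers++;
	return true;
}

static void nlock_unlock_read(struct nlock *l, int me)
{
	if (l->owner == me && l->nested_reads > 0) {
		l->nested_reads--;
		return;
	}
	assert(l->readers > 0);
	l->readers--;
}

static void nlock_unlock_write(struct nlock *l, int me)
{
	assert(l->owner == me && l->write_depth > 0);
	if (--l->write_depth == 0) {
		assert(l->nested_reads == 0);
		l->owner = NO_OWNER;   /* clear the owner on the final unlock */
	}
}
```

The owner field is only ever compared against the caller's own id, which is
why, in this model, no extra lock is needed around it.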

Could you please describe the case where btrfs_find_all_roots needs
this?

-chris


[3.2.0-rc6] WARNING: at fs/btrfs/extent-tree.c:4771 while deleting subvolume

2011-12-22 Thread Kai Krakow
Hello btrfs...

I tried to delete a subvolume which probably has some transid errors. After 
this, the subvolume is gone but I cannot reboot - it hangs. After REISUB, 
the deleted subvolume is right back there (this differs from kernel 
versions before 3.2.0-rc4 (AFAIR), where the subvolume stayed gone even 
after a hard reboot but btrfs_cleaner then caused some hiccups during 
mount). From this I suppose that btrfs is much more robust against 
unexpected reboots now, but how can I get rid of this broken subvolume?

Here's my dmesg (including sysrq+w):

[  121.411013] device fsid 311dda08-f33f-4cb9-9d59-6eac6026b1b1 devid 2 
transid 146955 /dev/sda3
[  121.411330] btrfs: use lzo compression
[  121.411333] btrfs: disk space caching is enabled
[  125.232594] zcache: created ephemeral tmem pool, id=2, client=65535
[  157.519388] Old style space inode found, converting.
[  157.525214] Old style space inode found, converting.
[  157.525227] Old style space inode found, converting.
[  157.525236] Old style space inode found, converting.
[  157.525242] Old style space inode found, converting.
[  157.525446] Old style space inode found, converting.
[  157.525634] Old style space inode found, converting.
[  157.528640] Old style space inode found, converting.
[  157.529025] Old style space inode found, converting.
[  157.534514] Old style space inode found, converting.
[  157.534916] Old style space inode found, converting.
[  157.544907] Old style space inode found, converting.
[  157.545118] Old style space inode found, converting.
[  157.545312] Old style space inode found, converting.
[  157.545489] Old style space inode found, converting.
[  157.545675] Old style space inode found, converting.
[  157.545683] Old style space inode found, converting.
[  157.550657] Old style space inode found, converting.
[  157.550677] Old style space inode found, converting.
[  157.550879] Old style space inode found, converting.
[  157.551085] Old style space inode found, converting.
[  157.551265] Old style space inode found, converting.
[  157.551272] Old style space inode found, converting.
[  157.564007] btrfs: truncated 1 orphans
[  157.854236] Old style space inode found, converting.
[  157.895119] btrfs: unlinked 6 orphans
[  157.895122] btrfs: truncated 8 orphans
[  225.682864] Old style space inode found, converting.
[  262.885468] Old style space inode found, converting.
[  262.885477] Old style space inode found, converting.
[  262.885484] Old style space inode found, converting.
[  262.885490] Old style space inode found, converting.
[  262.885498] Old style space inode found, converting.
[  262.885504] Old style space inode found, converting.
[  262.885511] Old style space inode found, converting.
[  262.885525] Old style space inode found, converting.
[  262.885531] Old style space inode found, converting.
[  262.885537] Old style space inode found, converting.
[  262.885543] Old style space inode found, converting.
[  298.668898] Old style space inode found, converting.
[  298.668906] Old style space inode found, converting.
[  302.264552] parent transid verify failed on 622147694592 wanted 130733 
found 134506
[  302.264562] parent transid verify failed on 622147694592 wanted 130733 
found 134506
[  302.264575] parent transid verify failed on 622147694592 wanted 130733 
found 134506
[  302.264579] parent transid verify failed on 622147694592 wanted 130733 
found 134506
[  302.264582] parent transid verify failed on 622147694592 wanted 130733 
found 134506
[  302.264585] [ cut here ]
[  302.264592] WARNING: at fs/btrfs/extent-tree.c:4771 
__btrfs_free_extent+0x290/0x5c7()
[  302.264595] Hardware name: To Be Filled By O.E.M.
[  302.264596] Modules linked in: af_packet snd_seq_oss snd_seq_midi_event 
snd_seq snd_pcm_oss snd_mixer_oss nls_iso8859_15 nls_cp437 vfat fat zram(C) 
loop tcp_cubic snd_usb_audio snd_hwdep snd_usbmidi_lib snd_rawmidi 
snd_seq_device gspca_sonixj gspca_main videodev v4l2_compat_ioctl32 evdev 
i2c_i801 pcspkr unix fuse xfs nfs nfs_acl auth_rpcgss lockd sunrpc reiserfs 
scsi_wait_scan hid_monterey hid_microsoft hid_logitech hid_ezkey hid_cypress 
hid_chicony hid_cherry hid_belkin hid_apple hid_a4tech usbhid usb_storage 
hid sr_mod cdrom sg pata_cmd64x [last unloaded: microcode]
[  302.264635] Pid: 6303, comm: btrfs-delayed-m Tainted: G C   
3.2.0-rc6 #5
[  302.264637] Call Trace:
[  302.264644]  [810333ea] ? warn_slowpath_common+0x78/0x8c
[  302.264647]  [8114e6f3] ? __btrfs_free_extent+0x290/0x5c7
[  302.264651]  [810b2998] ? __slab_free+0xd1/0x236
[  302.264655]  [81151a9f] ? run_clustered_refs+0x66c/0x6b8
[  302.264659]  [81151bb4] ? btrfs_run_delayed_refs+0xc9/0x173
[  302.264663]  [8115f82c] ? __btrfs_end_transaction+0x90/0x1dd
[  302.264668]  [810274b3] ? should_resched+0x5/0x24
[  302.264673]  [8119690d] ? 
btrfs_async_run_delayed_node_done+0x16c/0x1ca
[  302.264677]  

Re: Error handling: How to lose a transaction

2011-12-22 Thread Liu Bo
On 12/23/2011 01:12 PM, Jeff Mahoney wrote:
 
 On 12/21/2011 10:38 PM, Jeff Mahoney wrote:
 On 12/21/2011 10:21 PM, Liu Bo wrote:
 On 12/22/2011 10:59 AM, Jeff Mahoney wrote: Sorry I haven't 
 responded to this yet. I started digging right in and I've
 started to have some good results. It turns out there's already a
  btrfs_cleanup_transaction call that will tear down outstanding 
 transactions. It's not perfect and I've fixed a few bugs in
 there, but it saved me a bunch of effort. I just wished I noticed
 it a day before since I had it half implemented myself. :)

 Hi Jeff,
 Yes, it should be, and I wrote this cleanup_transaction where
 I should notice you earlier... Anyway, thanks for your effort.
 The error handling part has lots of corner cases, so I just
 pick up a brute way to tear down the current transaction in
 order to make the FS RO.
 Oh, and it's worked great. The brute force method is a good start
 and will address the most severe problems (and most cases) well.
 I've decided to ignore most cases of -ENOMEM for now. The biggest
 bug I ran into so far was calling mutex_lock while holding a
 spinlock. It was a quick fix.

 The method I've generally used is to mark the transaction aborted
 and pass the error up as quickly as possible, cleaning up the
 local allocations and locks as I go. The transaction gets
 completed normally, returns an error, isn't committed, and then is
 destroyed (with others, potentially) when called from in 
 btrfs_commit_transaction. Btrfs makes this super easy since we can 
 just skip all the CoW writes.
 
 
 Now, just out of curiosity, would it be ok if I printed this when we
 ran out memory in deep call paths?
 

I'm ok with this, but it depends on Chris :)

Indeed, ENOMEM in deep call paths is big trouble for us and we don't yet have
a graceful solution. For simplicity we could make the memory allocation with
the __GFP_NOFAIL flag, although that is not recommended:

 * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
 * cannot handle allocation failures.  This modifier is deprecated and no new
 * users should be added.


  FAIL WHALE!
 
 W W  W
 WW  W W
   '.  W
   .--._ \ \.--|
  /   -..__) .-'
 | _ /
 \'-.__,   .__.,'
  `''._\--'
 V
 
 
 Happy Holidays ;)
 

Happy Holidays!

thanks,
liubo

 - -Jeff
 
 Thanks!

 -Jeff


 thanks, liubo
 This afternoon I started running xfstests on a dm-linear mapped 
 partition. Halfway through a sufficiently long test, I swap out 
 the linear mapping to an error mapping. It still crashes, but 
 somewhat less spectacularly. There are still a ton of BUG_ON's I 
 need to eliminate as well as work out the usual I/O
 error-recovery issue of uninterruptible, unrecoverable writeback
 contexts and still-locked pages holding up exit. I'm pretty
 pleased with the results so far and am pretty optimistic.
 -Jeff

 
 - -- 
 Jeff Mahoney
 SUSE Labs
 
