Re: [PATCH RFC] btrfs: Simplify locking

2011-03-21 Thread Tejun Heo
Hello, Chris.

On Sun, Mar 20, 2011 at 08:10:51PM -0400, Chris Mason wrote:
 I went through a number of benchmarks with the explicit
 blocking/spinning code and back then it was still significantly faster
 than the adaptive spin.  But, it is definitely worth doing these again,
 how many dbench procs did you use?

It was dbench 50.

 The biggest benefit to explicit spinning is that mutex_lock starts with
 might_sleep(), so we skip the cond_resched().  Do you have voluntary
 preempt on?

Ah, right, I of course forgot to actually attach the .config.  I had
CONFIG_PREEMPT, not CONFIG_PREEMPT_VOLUNTARY.  I'll re-run with
VOLUNTARY and see how its behavior changes.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2 v2] Btrfs: add datacow flag in inode flag

2011-03-21 Thread liubo

For datacow control, the corresponding inode flags are needed.
This is for btrfs use.

v1->v2:
Change FS_COW_FL to another bit due to conflict with the upstream e2fsprogs

Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com
---
 include/linux/fs.h |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 63d069b..dbcb47e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -353,6 +353,8 @@ struct inodes_stat_t {
 #define FS_TOPDIR_FL		0x00020000 /* Top of directory hierarchies*/
 #define FS_EXTENT_FL		0x00080000 /* Extents */
 #define FS_DIRECTIO_FL		0x00100000 /* Use direct i/o */
+#define FS_NOCOW_FL		0x00800000 /* Do not cow file */
+#define FS_COW_FL		0x02000000 /* Cow file */
 #define FS_RESERVED_FL		0x80000000 /* reserved for ext2 lib */
 
 #define FS_FL_USER_VISIBLE 0x0003DFFF /* User visible flags */
-- 
1.6.5.2


[PATCH 2/2 v2] Btrfs: Per file/directory controls for COW and compression

2011-03-21 Thread liubo

Data compression and data cow are controlled across the entire FS by mount
options right now.  ioctls are needed to set this on a per file or per
directory basis.  This has been proposed previously, but VFS developers
wanted us to use generic ioctls rather than btrfs-specific ones.

According to Chris's comment, there should be just one true compression
method (probably LZO) stored in the super block.  However, we will wait
until that method is stable enough before adopting it into the super block,
so I list it as a long-term goal and just store it in RAM today.

After applying this patch, we can use the generic FS_IOC_SETFLAGS ioctl to
control a file's or directory's datacow and compression attributes.

NOTE:
 - The compression type is selected by these rules:
   If we mount btrfs with a compress option, i.e. zlib/lzo, that type is used.
   Otherwise, we use the default compress type (zlib today).

v1->v2:
Rebase the patch with the latest btrfs.

Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com
---
 fs/btrfs/ctree.h   |1 +
 fs/btrfs/disk-io.c |6 ++
 fs/btrfs/inode.c   |   32 
 fs/btrfs/ioctl.c   |   41 +
 4 files changed, 72 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8b4b9d1..b77d1a5 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1283,6 +1283,7 @@ struct btrfs_root {
 #define BTRFS_INODE_NODUMP		(1 << 8)
 #define BTRFS_INODE_NOATIME		(1 << 9)
 #define BTRFS_INODE_DIRSYNC		(1 << 10)
+#define BTRFS_INODE_COMPRESS		(1 << 11)
 
 /* some macros to generate set/get funcs for the struct fields.  This
  * assumes there is a lefoo_to_cpu for every type, so lets make a simple
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3e1ea3e..a894c12 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1762,6 +1762,12 @@ struct btrfs_root *open_ctree(struct super_block *sb,
 
 	btrfs_check_super_valid(fs_info, sb->s_flags & MS_RDONLY);
 
+	/*
+	 * In the long term, we'll store the compression type in the super
+	 * block, and it'll be used for per file compression control.
+	 */
+	fs_info->compress_type = BTRFS_COMPRESS_ZLIB;
+
 	ret = btrfs_parse_options(tree_root, options);
if (ret) {
err = ret;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index db67821..e687bb9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -381,7 +381,8 @@ again:
 	 */
 	if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NOCOMPRESS) &&
 	    (btrfs_test_opt(root, COMPRESS) ||
-	     (BTRFS_I(inode)->force_compress))) {
+	     (BTRFS_I(inode)->force_compress) ||
+	     (BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS))) {
 		WARN_ON(pages);
 		pages = kzalloc(sizeof(struct page *) * nr_pages, GFP_NOFS);
 
@@ -1253,7 +1254,8 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page,
 		ret = run_delalloc_nocow(inode, locked_page, start, end,
 					 page_started, 0, nr_written);
 	else if (!btrfs_test_opt(root, COMPRESS) &&
-		 !(BTRFS_I(inode)->force_compress))
+		 !(BTRFS_I(inode)->force_compress) &&
+		 !(BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS))
 		ret = cow_file_range(inode, locked_page, start, end,
 				     page_started, nr_written, 1);
 	else
else
@@ -4581,8 +4583,6 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans,
 	location->offset = 0;
 	btrfs_set_key_type(location, BTRFS_INODE_ITEM_KEY);
 
-	btrfs_inherit_iflags(inode, dir);
-
 	if ((mode & S_IFREG)) {
 		if (btrfs_test_opt(root, NODATASUM))
 			BTRFS_I(inode)->flags |= BTRFS_INODE_NODATASUM;
@@ -4590,6 +4590,8 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans,
 		BTRFS_I(inode)->flags |= BTRFS_INODE_NODATACOW;
 	}
 
+	btrfs_inherit_iflags(inode, dir);
+
 	insert_inode_hash(inode);
 	inode_tree_add(inode);
 	return inode;
@@ -6803,6 +6805,26 @@ static int btrfs_getattr(struct vfsmount *mnt,
 	return 0;
 }
 
+/*
+ * If a file is moved, it will inherit the cow and compression flags of the new
+ * directory.
+ */
+static void fixup_inode_flags(struct inode *dir, struct inode *inode)
+{
+	struct btrfs_inode *b_dir = BTRFS_I(dir);
+	struct btrfs_inode *b_inode = BTRFS_I(inode);
+
+	if (b_dir->flags & BTRFS_INODE_NODATACOW)
+		b_inode->flags |= BTRFS_INODE_NODATACOW;
+	else
+		b_inode->flags &= ~BTRFS_INODE_NODATACOW;
+
+	if (b_dir->flags & BTRFS_INODE_COMPRESS)
+		b_inode->flags |= BTRFS_INODE_COMPRESS;
+	else
+		b_inode->flags &= ~BTRFS_INODE_COMPRESS;
+}
+
 static int btrfs_rename(struct inode *old_dir, struct dentry 

Re: [RFC] a couple of i_nlink fixes in btrfs

2011-03-21 Thread Chris Mason
Excerpts from Al Viro's message of 2011-03-21 01:17:25 -0400:
 On Mon, Mar 07, 2011 at 11:58:13AM -0500, Chris Mason wrote:
 
  Thanks, these both look good but I'll test here as well.  Are you
  planning on pushing for .38?
 
 No, but .39 would be nice ;-)  Do you want that to go through btrfs tree
 or through vfs one?

I'll take them in mine, it'll be easier to put in with all these other
changes.

-chris


Re: [PATCH V4] btrfs: implement delayed inode items operation

2011-03-21 Thread Chris Mason
Excerpts from Miao Xie's message of 2011-03-21 01:05:22 -0400:
 On sun, 20 Mar 2011 20:33:34 -0400, Chris Mason wrote:
  Excerpts from Miao Xie's message of 2011-03-18 05:24:46 -0400:
  Changelog V3 - V4:
  - Fix nested lock, which is reported by Itaru Kitayama, by updating space 
  cache
inodes in time.
  
  I ran some tests on this and had trouble with my stress.sh script:
  
  http://oss.oracle.com/~mason/stress.sh
  
  I used:
  
  stress.sh -n 50 -c <path to linux kernel git tree> /mnt
  
  The git tree has all the .git files but no .o files.
  
  The problem was that within about 20 minutes, the filesystem was
  spending almost all of its time in balance_dirty_pages().  The problem
  is that data writeback isn't complete until the endio handlers have
  finished inserting metadata into the btree.
  
  The v4 patch calls btrfs_btree_balance_dirty() from all the
  btrfs_end_transaction variants, which means that the FS writeback code
  waits for balance_dirty_pages(), which won't make progress until the FS
  writeback code is done.
  
  So I changed things to call the delayed inode balance function only from
  inside btrfs_btree_balance_dirty(), which did resolve the stalls.  But
 
 Ok, but can we invoke the delayed inode balance function before
 balance_dirty_pages_ratelimited_nr(), because the delayed item insertion and
 deletion also bring us some dirty pages.

Yes, good point.

 
  I found a few times that when I did rmmod btrfs, there would be delayed
  inode objects leaked in the slab cache.  rmmod will try to destroy the
  slab cache, which will fail because we haven't freed everything.
  
  It looks like we have a race in btrfs_get_or_create_delayed_node, where
  two concurrent callers can both create delayed nodes and then race on
  adding it to the inode.
 
 Sorry for my mistake, I thought we updated the inodes when holding i_mutex 
 originally,
 so I didn't use any lock or other method to protect delayed_node of the 
 inodes.
 
 But I think we needn't use the RCU lock to protect delayed_node when we want
 to get the delayed node, because we won't change it after it is created;
 cmpxchg() and ACCESS_ONCE() can protect it well.  What do you think?
 
 PS: I worry about the inode update without holding i_mutex.

We have the tree locks to make sure we're serialized while we actually
change the tree.  The only places that go in without locking are times
updates.

 
  I also think that code is racing with the code that frees delayed nodes,
  but haven't yet triggered my debugging printks to prove either one.
 
 We free delayed nodes when we want to destroy the inode; at that time, just
 one task, which is destroying the inode, can access the delayed nodes, so I
 think ACCESS_ONCE() is enough.  What do you think?

Great, I see what you mean.  The bigger problem right now is that we may do
a lot of operations in destroy_inode(), which can block the slab
shrinkers on our metadata operations.  That same stress.sh -n 50 run is
running into OOM.

So we need to rework the part where the final free is done.  We could
keep a ref on the inode until the delayed items are complete, or we
could let the inode go and make a way to lookup the delayed node when
the inode is read.

I'll read more today.

-chris


[PATCH] Btrfs: cleanup how we setup free space clusters

2011-03-21 Thread Josef Bacik
This patch makes the free space cluster refilling code a little easier to
understand, and fixes some things with the bitmap part of it.  Currently we
either want to refill a cluster with

1) All normal extent entries (those without bitmaps)
2) A bitmap entry with enough space

The current code has ugly jump-around logic that will first try to fill up
the cluster with extent entries and then, if it can't do that, it will try to
find a bitmap to use.  So instead split this out into two functions, one that
tries to find only normal entries, and one that tries to find bitmaps.

This also fixes a suboptimal thing we would do with bitmaps.  If we used a
bitmap we would just tell the cluster that we were pointing at a bitmap, and
it would do the tree search in the block group for that entry every time we
tried to make an allocation.  Instead of doing that, now we just add the
bitmap entry to the cluster's rbtree.

I tested this with my ENOSPC tests and xfstests and it survived.

Signed-off-by: Josef Bacik jo...@redhat.com
---
 fs/btrfs/ctree.h|3 -
 fs/btrfs/free-space-cache.c |  361 +--
 2 files changed, 179 insertions(+), 185 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 6036fdb..0ee679b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -783,9 +783,6 @@ struct btrfs_free_cluster {
/* first extent starting offset */
u64 window_start;
 
-   /* if this cluster simply points at a bitmap in the block group */
-   bool points_to_bitmap;
-
struct btrfs_block_group_cache *block_group;
/*
 * when a cluster is allocated from a block group, we put the
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 7a808d7..a328af9 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -1644,30 +1644,28 @@ __btrfs_return_cluster_to_free_space(
 {
 	struct btrfs_free_space *entry;
 	struct rb_node *node;
-	bool bitmap;
 
 	spin_lock(&cluster->lock);
 	if (cluster->block_group != block_group)
 		goto out;
 
-	bitmap = cluster->points_to_bitmap;
 	cluster->block_group = NULL;
 	cluster->window_start = 0;
 	list_del_init(&cluster->block_group_list);
-	cluster->points_to_bitmap = false;
-
-	if (bitmap)
-		goto out;
 
 	node = rb_first(&cluster->root);
 	while (node) {
+		bool bitmap;
+
 		entry = rb_entry(node, struct btrfs_free_space, offset_index);
 		node = rb_next(&entry->offset_index);
 		rb_erase(&entry->offset_index, &cluster->root);
-		BUG_ON(entry->bitmap);
-		try_merge_free_space(block_group, entry, false);
+
+		bitmap = (entry->bitmap != NULL);
+		if (!bitmap)
+			try_merge_free_space(block_group, entry, false);
 		tree_insert_offset(&block_group->free_space_offset,
-				   entry->offset, &entry->offset_index, 0);
+				   entry->offset, &entry->offset_index, bitmap);
 	}
 	cluster->root = RB_ROOT;
 
@@ -1790,50 +1788,24 @@ int btrfs_return_cluster_to_free_space(
 
 static u64 btrfs_alloc_from_bitmap(struct btrfs_block_group_cache *block_group,
 				   struct btrfs_free_cluster *cluster,
+				   struct btrfs_free_space *entry,
 				   u64 bytes, u64 min_start)
 {
-	struct btrfs_free_space *entry;
 	int err;
 	u64 search_start = cluster->window_start;
 	u64 search_bytes = bytes;
 	u64 ret = 0;
 
-	spin_lock(&block_group->tree_lock);
-	spin_lock(&cluster->lock);
-
-	if (!cluster->points_to_bitmap)
-		goto out;
-
-	if (cluster->block_group != block_group)
-		goto out;
-
-	/*
-	 * search_start is the beginning of the bitmap, but at some point it may
-	 * be a good idea to point to the actual start of the free area in the
-	 * bitmap, so do the offset_to_bitmap trick anyway, and set bitmap_only
-	 * to 1 to make sure we get the bitmap entry
-	 */
-	entry = tree_search_offset(block_group,
-				   offset_to_bitmap(block_group, search_start),
-				   1, 0);
-	if (!entry || !entry->bitmap)
-		goto out;
-
 	search_start = min_start;
 	search_bytes = bytes;
 
 	err = search_bitmap(block_group, entry, &search_start, &search_bytes);
 	if (err)
-		goto out;
+		return 0;
 
 	ret = search_start;
 	bitmap_clear_bits(block_group, entry, ret, bytes);
-	if (entry->bytes == 0)
-		free_bitmap(block_group, entry);
-out:
-	spin_unlock(&cluster->lock);
-	spin_unlock(&block_group->tree_lock);
 
 	return ret;
 }
@@ -1851,10 +1823,6 @@ 

[PATCH] Btrfs: don't be as aggressive about using bitmaps V2

2011-03-21 Thread Josef Bacik
We have been creating bitmaps for small extents unconditionally forever.  This
was great when testing to make sure the bitmap stuff was working, but is
overkill normally.  So instead of always adding small chunks of free space to
bitmaps, only start doing it once we go past half of our extent threshold.  This
keeps us from creating a bitmap for just one small free extent at the front
of the block group, and makes the allocator a little faster as a result.
Thanks,

Signed-off-by: Josef Bacik jo...@redhat.com
---
V1->V2:
-fix a formatting problem
-make the small extent threshold be <= 4 sectors, not < 4 sectors

 fs/btrfs/free-space-cache.c |   19 ---
 1 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 63776ae..4ab35ea 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -1287,9 +1287,22 @@ static int insert_into_bitmap(struct btrfs_block_group_cache *block_group,
 	 * If we are below the extents threshold then we can add this as an
 	 * extent, and don't have to deal with the bitmap
 	 */
-	if (block_group->free_extents < block_group->extents_thresh &&
-	    info->bytes > block_group->sectorsize * 4)
-		return 0;
+	if (block_group->free_extents < block_group->extents_thresh) {
+		/*
+		 * If this block group has some small extents we don't want to
+		 * use up all of our free slots in the cache with them, we want
+		 * to reserve them for larger extents, however if we have plenty
+		 * of cache left then go ahead and add them, no sense in adding
+		 * the overhead of a bitmap if we don't have to.
+		 */
+		if (info->bytes <= block_group->sectorsize * 4) {
+			if (block_group->free_extents * 2 <=
+			    block_group->extents_thresh)
+				return 0;
+		} else {
+			return 0;
+		}
+	}
 
/*
 * some block groups are so tiny they can't be enveloped by a bitmap, so
-- 
1.7.2.3



cloning single-device btrfs file system onto multi-device one

2011-03-21 Thread Stephane Chazelas
Hiya,

I'm trying to move a btrfs FS that's on a hardware raid 5 (6TB
large, 4 of which are in use) to another machine with 3 3TB HDs
and preserve all the subvolumes/snapshots.

Is there a way to do that without using a software/hardware raid
on the new machine (that is, just use btrfs multi-device)?

If fewer than 3TB were occupied, I suppose I could just resize
it so that it fits on one 3TB hd, then copy device to device
onto a 3TB disk, add the 2 other ones and do a balance, but
here, I can't do that.

I suspect that if compression were enabled, the FS could fit on
3 TB, but AFAICT, compression is enabled at mount time and would
only apply to newly created files. Is there a way to compress
files already in a btrfs filesystem?

Any help would be appreciated.
Stephane



Re: [PATCH RFC] btrfs: Simplify locking

2011-03-21 Thread Tejun Heo
Hello,

Here are the results with voluntary preemption.  I've moved to a
beefier machine for testing.  It's dual Opteron 2347, so dual socket,
eight core.  The memory is limited to 1GiB to force IOs and the disk
is the same OCZ Vertex 60gig SSD.  /proc/stat is captured before and
after dbench 50.

I ran the following four setups.

DFL The current custom locking implementation.
SIMPLE  Simple mutex conversion.  The first patch in this thread.
SPINSIMPLE + mutex_tryspin().  The second patch in this thread.
SPIN2   SPIN + mutex_tryspin() in btrfs_tree_lock().  Patch below.

SPIN2 should alleviate the voluntary preemption by might_sleep() in
mutex_lock().

   USER   SYSTEM   SIRQCXTSW  THROUGHPUT
DFL49427  458210   1433  7683488 642.947
SIMPLE 52836  471398   1427  3055384 705.369
SPIN   52267  473115   1467  3005603 705.369
SPIN2  52063  470453   1446  3092091 701.826

I'm running DFL again just in case but SIMPLE or SPIN seems to be a
much better choice.

Thanks.

NOT-Signed-off-by: Tejun Heo t...@kernel.org
---
 fs/btrfs/locking.h |2 ++
 1 file changed, 2 insertions(+)

Index: work/fs/btrfs/locking.h
===
--- work.orig/fs/btrfs/locking.h
+++ work/fs/btrfs/locking.h
@@ -28,6 +28,8 @@ static inline bool btrfs_try_spin_lock(s
 
 static inline void btrfs_tree_lock(struct extent_buffer *eb)
 {
+	if (mutex_tryspin(&eb->lock))
+		return;
 	mutex_lock(&eb->lock);
 }
 


Re: [PATCH RFC] btrfs: Simplify locking

2011-03-21 Thread Tejun Heo
On Mon, Mar 21, 2011 at 05:59:55PM +0100, Tejun Heo wrote:
 I'm running DFL again just in case but SIMPLE or SPIN seems to be a
 much better choice.

Got 644.176 MB/sec, so yeah the custom locking is definitely worse
than just using mutex.

Thanks.

-- 
tejun


Re: [PATCH RFC] btrfs: Simplify locking

2011-03-21 Thread Chris Mason
Excerpts from Tejun Heo's message of 2011-03-21 12:59:55 -0400:
 Hello,
 
 Here are the results with voluntary preemption.  I've moved to a
 beefier machine for testing.  It's dual Opteron 2347, so dual socket,
 eight core.  The memory is limited to 1GiB to force IOs and the disk
 is the same OCZ Vertex 60gig SSD.  /proc/stat is captured before and
 after dbench 50.
 
 I ran the following four setups.
 
 DFL     The current custom locking implementation.
 SIMPLE  Simple mutex conversion.  The first patch in this thread.
 SPIN    SIMPLE + mutex_tryspin().  The second patch in this thread.
 SPIN2   SPIN + mutex_tryspin() in btrfs_tree_lock().  Patch below.
 
 SPIN2 should alleviate the voluntary preemption by might_sleep() in
 mutex_lock().
 
USER   SYSTEM   SIRQCXTSW  THROUGHPUT
 DFL49427  458210   1433  7683488 642.947
 SIMPLE 52836  471398   1427  3055384 705.369
 SPIN   52267  473115   1467  3005603 705.369
 SPIN2  52063  470453   1446  3092091 701.826
 
 I'm running DFL again just in case but SIMPLE or SPIN seems to be a
 much better choice.

Very interesting.  Ok, I'll definitely rerun my benchmarks as well.  I
used dbench extensively during the initial tuning, but you're forcing
the memory low in order to force IO.

This case doesn't really hammer on the locks, it hammers on the
transition from spinning to blocking.  We also want to compare
dbench entirely in ram, which will hammer on the spinning portion.

-chris

 
 Thanks.
 
 NOT-Signed-off-by: Tejun Heo t...@kernel.org
 ---
  fs/btrfs/locking.h |2 ++
  1 file changed, 2 insertions(+)
 
 Index: work/fs/btrfs/locking.h
 ===
 --- work.orig/fs/btrfs/locking.h
 +++ work/fs/btrfs/locking.h
 @@ -28,6 +28,8 @@ static inline bool btrfs_try_spin_lock(s
  
  static inline void btrfs_tree_lock(struct extent_buffer *eb)
  {
 +	if (mutex_tryspin(&eb->lock))
 +		return;
  	mutex_lock(&eb->lock);
  }
  


Re: [PATCH 2/2 v2] Btrfs: Per file/directory controls for COW and compression

2011-03-21 Thread Johann Lombardi
On Mon, Mar 21, 2011 at 04:57:13PM +0800, liubo wrote:
 @@ -4581,8 +4583,6 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans,
 	location->offset = 0;
 	btrfs_set_key_type(location, BTRFS_INODE_ITEM_KEY);
 
 -	btrfs_inherit_iflags(inode, dir);
 -
 	if ((mode & S_IFREG)) {
 		if (btrfs_test_opt(root, NODATASUM))
 			BTRFS_I(inode)->flags |= BTRFS_INODE_NODATASUM;
 @@ -4590,6 +4590,8 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans,
 		BTRFS_I(inode)->flags |= BTRFS_INODE_NODATACOW;
 	}
 
 +	btrfs_inherit_iflags(inode, dir);

The problem is that btrfs_inherit_iflags() overwrites BTRFS_I(inode)->flags
with the parent's flags, so you lose BTRFS_INODE_NODATA{SUM|COW}.

Johann


btrfs device delete not working after failed/missing device

2011-03-21 Thread Mark Bergsma
Hi,

I decided to try btrfs for a few file systems on my not-too-critical home 
server, including my root fs. Most file systems are on a RAID5 MD software 
array, but my rootfs is btrfs striped as RAID1 over 3 partitions.

I got hit by the Intel Sandy Bridge SATA chipset bug, so eventually the 3rd 
SATA drive (/dev/sdc) failed with all kinds of bus errors. My btrfs rootfs 
stayed up and working, but btrfs device delete /dev/sdc3 / did not work, and 
gave a vague Error removing device (iirc). After rebooting with the drive on 
the bad SATA bus removed, the file system didn't come up, but passing a -o 
degraded fixed that. However, I was still not able to remove the missing 
device from the rootfs. Neither btrfs device delete missing / nor btrfs 
device delete /dev/sdc3 / succeeded. btrfs fi balance / succeeded without 
errors however.

I'll paste the relevant bits of dmesg below. My btrfs rootfs is now mountable 
and working in degraded mode, and I have a (daily rsynced) backup on another 
filesystem anyway. I decided to report it anyway, as it would be good to get 
things stable and this bug fixed. ;)

The running kernel is 2.6.38.4, and btrfs utils are version v0.19 (Ubuntu 
Maverick).

dmesg:

[  197.810007] device label root devid 1 transid 65058 /dev/sda3
[  197.811844] btrfs: failed to read the system array on sda3
[  197.876797] btrfs: open_ctree failed
[  207.743237] device label root devid 1 transid 65058 /dev/sda3
[  207.745460] btrfs: failed to read the system array on sda3
[  207.793912] btrfs: open_ctree failed
[  249.18] device label backups devid 1 transid 7325 /dev/dm-4
[  250.002555] device label root devid 2 transid 65058 /dev/sdb3
[  250.003545] device label root devid 1 transid 65058 /dev/sda3
[  488.217867] device label root devid 1 transid 65058 /dev/sda3
[  488.218325] btrfs: allowing degraded mounts
[  509.983121] btrfs: relocating block group 12108955648 flags 17
[  513.096861] btrfs: found 695 extents
[  520.176191] btrfs: found 695 extents
[  520.888601] btrfs: relocating block group 11035213824 flags 17
[  537.663032] btrfs: found 4836 extents
[  550.237641] btrfs: found 4836 extents
[  551.527503] btrfs: relocating block group 9961472000 flags 17
[  556.314350] btrfs: found 1602 extents
[  564.737818] btrfs: found 1602 extents
[  565.358244] btrfs: relocating block group 9693036544 flags 20
[  586.400905] btrfs: found 5548 extents
[  586.773695] btrfs: relocating block group 8619294720 flags 17
[  593.677386] btrfs: found 3839 extents
[  599.059888] btrfs: found 3839 extents
[  600.155579] btrfs: relocating block group 7545552896 flags 17
[  612.330001] btrfs: found 3139 extents
[  621.223148] btrfs: found 3139 extents
[  622.054094] btrfs: relocating block group 6471811072 flags 17
[  634.848723] btrfs: found 5541 extents
[  649.045142] btrfs: found 5541 extents
[  649.685956] btrfs: relocating block group 5398069248 flags 17
[  663.123926] btrfs: found 12743 extents
[  683.670746] btrfs: found 12743 extents
[  684.595137] btrfs: relocating block group 4324327424 flags 17
[  717.828652] btrfs: found 13762 extents
[  740.723221] btrfs: found 13762 extents
[  742.037898] btrfs: relocating block group 29360128 flags 20
[  826.976047] btrfs: found 34862 extents
[  827.084723] btrfs: relocating block group 20971520 flags 18
[  827.126034] btrfs allocation failed flags 18, wanted 4096
[  827.127621] space_info has 0 free, is not full
[  827.127623] space_info total=12582912, used=4096, pinned=0, reserved=0, 
may_use=0, readonly=12578816
[  827.127626] block group 20971520 has 8388608 bytes, 4096 used 0 pinned 0 
reserved
[  827.127629] entry offset 20971520, bytes 4096, bitmap no
[  827.129215] entry offset 20979712, bytes 8380416, bitmap no
[  827.130778] block group has cluster?: no
[  827.130780] 2 blocks of free space at or bigger than bytes is
[  827.130782] block group 0 has 4194304 bytes, 0 used 0 pinned 0 reserved
[  827.130784] entry offset 131072, bytes 4063232, bitmap no
[  827.132340] block group has cluster?: no
[  827.132342] 1 blocks of free space at or bigger than bytes is
[  827.174753] btrfs: relocating block group 20971520 flags 18
[  827.213303] btrfs allocation failed flags 18, wanted 4096
[  827.214347] space_info has 0 free, is not full
[  827.214348] space_info total=12582912, used=4096, pinned=0, reserved=0, 
may_use=0, readonly=12578816
[  827.214350] block group 20971520 has 8388608 bytes, 4096 used 0 pinned 0 
reserved
[  827.214352] entry offset 20971520, bytes 4096, bitmap no
[  827.215367] entry offset 20979712, bytes 8380416, bitmap no
[  827.216261] block group has cluster?: no
[  827.216262] 2 blocks of free space at or bigger than bytes is
[  827.216263] block group 0 has 4194304 bytes, 0 used 0 pinned 0 reserved
[  827.216265] entry offset 131072, bytes 4063232, bitmap no
[  827.217144] block group has cluster?: no
[  827.217145] 1 blocks of free space at or bigger than bytes is

btrfs filesystem show:

Label: 'root'  uuid: 

Re: [PATCH RFC] btrfs: Simplify locking

2011-03-21 Thread Tejun Heo
Hello,

On Mon, Mar 21, 2011 at 01:24:37PM -0400, Chris Mason wrote:
 Very interesting.  Ok, I'll definitely rerun my benchmarks as well.  I
 used dbench extensively during the initial tuning, but you're forcing
 the memory low in order to force IO.
 
 This case doesn't really hammer on the locks, it hammers on the
 transition from spinning to blocking.  We also want to compare
 dbench entirely in ram, which will hammer on the spinning portion.

Here's re-run of DFL and SIMPLE with the memory restriction lifted.
Memory is 4GiB and disk remains mostly idle with all CPUs running
full.

   USER   SYSTEM   SIRQCXTSW  THROUGHPUT
DFL59898  504517377  6814245 782.295
SIMPLE 61090  493441457  1631688 827.751

So, about the same picture.

Thanks.

-- 
tejun


btrfs device returned from stat vs /proc/pid/maps

2011-03-21 Thread Mark Fasheh
Hi,

I noticed that btrfs_getattr() is filling stat-dev with an anonymous device
(for per-snapshot root?):

stat->dev = BTRFS_I(inode)->root->anon_super.s_dev;

but /proc/pid/maps uses the real block device:

dev = inode->i_sb->s_dev;


This results in some unfortunate behavior for lsof as it reports some
duplicate paths (except on different block devices). The easiest way to see
this (if your root partition is btrfs):

$ lsof | grep lsof
snip
lsof   9238root  txt   REG   0,19   139736   14478 /usr/bin/lsof
lsof   9238root  mem   REG   0,17            14478 /usr/bin/lsof (path dev=0,19)


Ultimately, this breaks existing software. In my case, zypper ps gets
really unhappy (which may partially also be due to a zypper bug, hooray!)


I'm not really quite sure how this should be handled though. Do we have
/proc/pid/maps report the subvolumes device (via some callback I suppose)?
Another alternative of course is to return the true block device in
btrfs_getattr() but that has some obvious downsides too.


Thanks and best regards,
--Mark

--
Mark Fasheh


Re: [PATCH 2/2 v2] Btrfs: Per file/directory controls for COW and compression

2011-03-21 Thread liubo
On 03/22/2011 01:43 AM, Johann Lombardi wrote:
 On Mon, Mar 21, 2011 at 04:57:13PM +0800, liubo wrote:
 @@ -4581,8 +4583,6 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans,
  	location->offset = 0;
  	btrfs_set_key_type(location, BTRFS_INODE_ITEM_KEY);
  
  -	btrfs_inherit_iflags(inode, dir);
  -
  	if ((mode & S_IFREG)) {
  		if (btrfs_test_opt(root, NODATASUM))
  			BTRFS_I(inode)->flags |= BTRFS_INODE_NODATASUM;
  @@ -4590,6 +4590,8 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans,
  		BTRFS_I(inode)->flags |= BTRFS_INODE_NODATACOW;
  	}
  
  +	btrfs_inherit_iflags(inode, dir);
 
 The problem is that btrfs_inherit_iflags() overwrites BTRFS_I(inode)->flags
 with the parent's flags, so you lose BTRFS_INODE_NODATA{SUM|COW}.
 

Thanks for pointing this, will fix it.

thanks,
liubo

 Johann
 



Re: [PATCH V4] btrfs: implement delayed inode items operation

2011-03-21 Thread Itaru Kitayama
Hi Miao,

Here is an excerpt of the V4 patch applied kernel boot log:

===
[ INFO: possible circular locking dependency detected ]
2.6.36-xie+ #117
---
vgs/1210 is trying to acquire lock:
 (&delayed_node->mutex){+.+...}, at: [8121184b] btrfs_delayed_update_inode+0x45/0x101

but task is already holding lock:
 (&mm->mmap_sem){++}, at: [810f6512] sys_mmap_pgoff+0xd6/0x12e

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #1 (&mm->mmap_sem){++}:
   [81076a3d] lock_acquire+0x11d/0x143
   [810edc79] might_fault+0x95/0xb8
   [8112b5ce] filldir+0x75/0xd0
   [811d77f8] btrfs_real_readdir+0x3d7/0x528
   [8112b75c] vfs_readdir+0x79/0xb6
   [8112b8e9] sys_getdents+0x85/0xd8
   [81002ddb] system_call_fastpath+0x16/0x1b

-> #0 (&delayed_node->mutex){+.+...}:
   [81076612] __lock_acquire+0xa98/0xda6
   [81076a3d] lock_acquire+0x11d/0x143
   [814c38b1] __mutex_lock_common+0x5a/0x444
   [814c3d50] mutex_lock_nested+0x39/0x3e
   [8121184b] btrfs_delayed_update_inode+0x45/0x101
   [811dbd4f] btrfs_update_inode+0x2e/0x129
   [811de008] btrfs_dirty_inode+0x57/0x113
   [8113c2a5] __mark_inode_dirty+0x33/0x1aa
   [81130939] touch_atime+0x107/0x12a
   [811e15b2] btrfs_file_mmap+0x3e/0x57
   [810f5f40] mmap_region+0x2bb/0x4c4
   [810f63d9] do_mmap_pgoff+0x290/0x2f3
   [810f6532] sys_mmap_pgoff+0xf6/0x12e
   [81006e9a] sys_mmap+0x22/0x24
   [81002ddb] system_call_fastpath+0x16/0x1b

other info that might help us debug this:

1 lock held by vgs/1210:
 #0:  (&mm->mmap_sem){++}, at: [810f6512] sys_mmap_pgoff+0xd6/0x12e

stack backtrace:
Pid: 1210, comm: vgs Not tainted 2.6.36-xie+ #117
Call Trace:
 [81074c15] print_circular_bug+0xaf/0xbd
 [81076612] __lock_acquire+0xa98/0xda6
 [8121184b] ? btrfs_delayed_update_inode+0x45/0x101
 [81076a3d] lock_acquire+0x11d/0x143
 [8121184b] ? btrfs_delayed_update_inode+0x45/0x101
 [8121184b] ? btrfs_delayed_update_inode+0x45/0x101
 [814c38b1] __mutex_lock_common+0x5a/0x444
 [8121184b] ? btrfs_delayed_update_inode+0x45/0x101
 [8107162f] ? debug_mutex_init+0x31/0x3c
 [814c3d50] mutex_lock_nested+0x39/0x3e
 [8121184b] btrfs_delayed_update_inode+0x45/0x101
 [814c36c6] ? __mutex_unlock_slowpath+0x129/0x13a
 [811dbd4f] btrfs_update_inode+0x2e/0x129
 [811de008] btrfs_dirty_inode+0x57/0x113
 [8113c2a5] __mark_inode_dirty+0x33/0x1aa
 [81130939] touch_atime+0x107/0x12a
 [811e15b2] btrfs_file_mmap+0x3e/0x57
 [810f5f40] mmap_region+0x2bb/0x4c4
 [81229f10] ? file_map_prot_check+0x9a/0xa3
 [810f63d9] do_mmap_pgoff+0x290/0x2f3
 [810f6512] ? sys_mmap_pgoff+0xd6/0x12e
 [810f6532] sys_mmap_pgoff+0xf6/0x12e
 [814c4b75] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [81006e9a] sys_mmap+0x22/0x24
 [81002ddb] system_call_fastpath+0x16/0x1b

As the delayed node's mutex is also taken in btrfs_real_readdir, this looks
deadlock-prone. vfs_readdir holds i_mutex, so I wonder if we can execute
btrfs_readdir_delayed_dir_index without taking the node lock.

 


Re: [PATCH V4] btrfs: implement delayed inode items operation

2011-03-21 Thread Miao Xie
On Tue, 22 Mar 2011 11:33:10 +0900, Itaru Kitayama wrote:
 Here is an excerpt of the V4 patch applied kernel boot log:
 
 [lockdep circular locking dependency report snipped; quoted in full in the previous message]
 As the delayed node's mutex is also taken in btrfs_real_readdir, this looks
 deadlock-prone. vfs_readdir holds i_mutex, so I wonder if we can execute
 btrfs_readdir_delayed_dir_index without taking the node lock.

We can't fix it that way, because the worker threads may be doing insertions
or deletions at the same time, and we would lose some directory items.

Maybe we can fix it by adding a reference count to the delayed directory
items; then readdir can work like this:
1. take the node lock
2. grab a reference on each delayed directory item and put them all on a list
3. release the node lock
4. read the directory
5. drop the references and free any item whose count reaches zero
What do you think?

Thanks
Miao

 
  


Re: [PATCH V4] btrfs: implement delayed inode items operation

2011-03-21 Thread Itaru Kitayama
On Tue, 22 Mar 2011 11:12:37 +0800
Miao Xie mi...@cn.fujitsu.com wrote:

 We can't fix it by this way, because the work threads may do insertion or 
 deletion at the same time,
 and we may lose some directory items.

Ok.
 
 Maybe we can fix it by adding a reference for the delayed directory items, we 
 can do read dir just
 like this:
 1. hold the node lock
 2. increase the directory items' reference and put all the directory items 
 into a list
 3. release the node lock
 4. read dir
 5. decrease the directory items' reference and free them if the reference is 
 zero.
 What do you think about?

Sounds doable to me.

itaru