Re: [RFC] a couple of i_nlink fixes in btrfs

2011-03-20 Thread Al Viro
On Mon, Mar 07, 2011 at 11:58:13AM -0500, Chris Mason wrote:

> Thanks, these both look good but I'll test here as well.  Are you
> planning on pushing for .38?

No, but .39 would be nice ;-)  Do you want that to go through btrfs tree
or through vfs one?


Re: [PATCH V4] btrfs: implement delayed inode items operation

2011-03-20 Thread Miao Xie
On Sun, 20 Mar 2011 20:33:34 -0400, Chris Mason wrote:
> Excerpts from Miao Xie's message of 2011-03-18 05:24:46 -0400:
>> Changelog V3 -> V4:
>> - Fix nested lock, which is reported by Itaru Kitayama, by updating space
>>   cache inodes in time.
> 
> I ran some tests on this and had trouble with my stress.sh script:
> 
> http://oss.oracle.com/~mason/stress.sh
> 
> I used:
> 
> stress.sh -n 50 -c  /mnt
> 
> The git tree has all the .git files but no .o files.
> 
> The problem was that within about 20 minutes, the filesystem was
> spending almost all of its time in balance_dirty_pages().  The problem
> is that data writeback isn't complete until the endio handlers have
> finished inserting metadata into the btree.
> 
> The v4 patch calls btrfs_btree_balance_dirty() from all the
> btrfs_end_transaction variants, which means that the FS writeback code
> waits for balance_dirty_pages(), which won't make progress until the FS
> writeback code is done.
> 
> So I changed things to call the delayed inode balance function only from
> inside btrfs_btree_balance_dirty(), which did resolve the stalls.  But

OK, but can we invoke the delayed inode balance function before
balance_dirty_pages_ratelimited_nr()? The delayed item insertion and deletion
also leave us with some dirty pages.
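
Something like this, maybe (just a rough, untested sketch on top of your
incremental patch, so treat the details loosely):

void btrfs_btree_balance_dirty(struct btrfs_root *root, unsigned long nr)
{
        u64 num_dirty;
        unsigned long thresh = 32 * 1024 * 1024;

        if (current->flags & PF_MEMALLOC)
                return;

        /* run the delayed items first, they dirty more btree pages */
        btrfs_balance_delayed_items(root);

        num_dirty = root->fs_info->dirty_metadata_bytes;
        if (num_dirty > thresh)
                balance_dirty_pages_ratelimited_nr(
                                root->fs_info->btree_inode->i_mapping, 1);
}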

> I found a few times that when I did rmmod btrfs, there would be delayed
> inode objects leaked in the slab cache.  rmmod will try to destroy the
> slab cache, which will fail because we haven't freed everything.
> 
> It looks like we have a race in btrfs_get_or_create_delayed_node, where
> two concurrent callers can both create delayed nodes and then race on
> adding it to the inode.

Sorry, my mistake. I originally thought we always updated the inodes while
holding i_mutex, so I didn't use any lock or other mechanism to protect the
delayed_node of the inodes.

But I don't think we need the RCU lock to protect delayed_node when we look up
the delayed node, because we never change it after it is created; cmpxchg() and
ACCESS_ONCE() can protect it well enough. What do you think?
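
For example (an untested sketch, assuming the delayed node is never freed while
the inode still publishes the pointer, which holds because the cached reference
is only dropped when the inode is destroyed):

static struct btrfs_delayed_node *btrfs_get_or_create_delayed_node(
                                                        struct inode *inode)
{
        struct btrfs_inode *btrfs_inode = BTRFS_I(inode);
        struct btrfs_root *root = btrfs_inode->root;
        struct btrfs_delayed_node *node;

again:
        node = ACCESS_ONCE(btrfs_inode->delayed_node);
        if (node) {
                atomic_inc(&node->refs);        /* can be accessed */
                return node;
        }

        node = kmem_cache_alloc(delayed_node_cache, GFP_NOFS);
        if (!node)
                return ERR_PTR(-ENOMEM);
        btrfs_init_delayed_node(node, root->objectid, inode->i_ino);

        node->delayed_root = btrfs_get_delayed_root(root);
        atomic_inc(&node->refs);        /* cached in the btrfs inode */
        atomic_inc(&node->refs);        /* can be accessed */

        if (cmpxchg(&btrfs_inode->delayed_node, NULL, node)) {
                /* someone else installed a node first, use theirs */
                kmem_cache_free(delayed_node_cache, node);
                goto again;
        }
        return node;
}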

PS: I am still worried about the inode being updated without holding i_mutex.

> I also think that code is racing with the code that frees delayed nodes,
> but haven't yet triggered my debugging printks to prove either one.

We free the delayed nodes only when we destroy the inode, and at that point
only one task, the one destroying the inode, can access them, so I think
ACCESS_ONCE() is enough. What do you think?

Thanks
Miao

> 
> My current incremental patch is below, please take a look:
> 
> diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
> index 727d9a8..02e9390 100644
> --- a/fs/btrfs/delayed-inode.c
> +++ b/fs/btrfs/delayed-inode.c
> @@ -32,7 +32,8 @@ int __init btrfs_delayed_inode_init(void)
>   delayed_node_cache = kmem_cache_create("delayed_node",
>   sizeof(struct btrfs_delayed_node),
>   0,
> - SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
> + SLAB_RECLAIM_ACCOUNT |
> + SLAB_MEM_SPREAD | SLAB_DESTROY_BY_RCU,
>   NULL);
>   if (!delayed_node_cache)
>   return -ENOMEM;
> @@ -90,22 +91,35 @@ static struct btrfs_delayed_node *btrfs_get_or_create_delayed_node(
>   struct btrfs_inode *btrfs_inode = BTRFS_I(inode);
>   struct btrfs_root *root = btrfs_inode->root;
>  
> - if (btrfs_inode->delayed_node) {
> - node = btrfs_inode->delayed_node;
> - atomic_inc(&node->refs);/* can be accessed */
> - return node;
> +again:
> + rcu_read_lock();
> +again_rcu:
> + node = btrfs_inode->delayed_node;
> + if (node) {
> + if (atomic_inc_not_zero(&node->refs)) {
> + rcu_read_unlock();
> + return node;
> + }
> +printk("racing on node access!\n");
> + goto again_rcu;
>   }
> + rcu_read_unlock();
>  
>   node = kmem_cache_alloc(delayed_node_cache, GFP_NOFS);
>   if (!node)
>   return ERR_PTR(-ENOMEM);
>   btrfs_init_delayed_node(node, root->objectid, inode->i_ino);
>  
> - btrfs_inode->delayed_node = node;
>   node->delayed_root = btrfs_get_delayed_root(root);
>   atomic_inc(&node->refs);/* cached in the btrfs inode */
>   atomic_inc(&node->refs);/* can be accessed */
>  
> + if (cmpxchg(&BTRFS_I(inode)->delayed_node, NULL, node)) {
> + kmem_cache_free(delayed_node_cache, node);
> +printk("racing on new node insertion!\n");
> + goto again;
> + }
> +
>   return node;
>  }
>  
> @@ -1167,7 +1181,7 @@ static void btrfs_async_run_delayed_node_done(struct btrfs_work *work)
>   nr = trans->blocks_used;
>  
>   btrfs_end_transaction_dmeta(trans, root);
> - btrfs_btree_balance_dirty(root, nr)

Re: [PATCH] Btrfs: check free space in block group before searching for a cluster

2011-03-20 Thread Li Zefan
Josef Bacik wrote:
> The free space cluster stuff is heavy duty, so there is no sense in going
> through the entire song and dance if there isn't enough space in the block
> group to begin with.  Thanks,
> 
> Signed-off-by: Josef Bacik 

Reviewed-by: Li Zefan 

> ---
>  fs/btrfs/free-space-cache.c |   10 ++
>  1 files changed, 10 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index 0282033..f631df8 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -1999,6 +1999,16 @@ int btrfs_find_space_cluster(struct btrfs_trans_handle *trans,
>   min_bytes = max(bytes, (bytes + empty_size) >> 2);
>  
>   spin_lock(&block_group->tree_lock);
> +
> + /*
> +  * If we know we don't have enough space to make a cluster don't even
> +  * bother doing all the work to try and find one.
> +  */
> + if (block_group->free_space < min_bytes) {
> + spin_unlock(&block_group->tree_lock);
> + return -ENOSPC;
> + }
> +
>   spin_lock(&cluster->lock);
>  
>   /* someone already found a cluster, hooray */


Re: [PATCH] Btrfs: don't be as aggressive about using bitmaps

2011-03-20 Thread Li Zefan
04:18, Josef Bacik wrote:
> We have been creating bitmaps for small extents unconditionally forever.  This
> was great when testing to make sure the bitmap stuff was working, but is
> overkill normally.  So instead of always adding small chunks of free space to
> bitmaps, only start doing it if we go past half of our extent threshold.  This
> will keep us from creating a bitmap for just one small free extent at the
> front of the block group, and will make the allocator a little faster as a
> result.
> Thanks,
> 

I was wondering about this strategy when reading the code, so this patch
looks good to me.

Just a small nit:

> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/free-space-cache.c |   19 ---
>  1 files changed, 16 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index 63776ae..7a808d7 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -1287,9 +1287,22 @@ static int insert_into_bitmap(struct 
> btrfs_block_group_cache *block_group,
>* If we are below the extents threshold then we can add this as an
>* extent, and don't have to deal with the bitmap
>*/
> - if (block_group->free_extents < block_group->extents_thresh &&
> - info->bytes > block_group->sectorsize * 4)
> - return 0;
> + if (block_group->free_extents < block_group->extents_thresh) {
> + /*
> +  * If this block group has some small extents we don't want to
> +  * use up all of our free slots in the cache with them, we want
> +  * to reserve them for larger extents, however if we have plenty
> +  * of cache left then go ahead and add them, no sense in adding
> +  * the overhead of a bitmap if we don't have to.
> +  */
> + if (info->bytes < block_group->sectorsize * 4) {

This also changes how we define a small extent (from 4 sectorsizes to 3).
Is that intended?

> + if ((block_group->free_extents * 2) <=

The parentheses aren't necessary, nor do they help readability, I think.

> + block_group->extents_thresh)
> + return 0;
> + } else {
> + return 0;
> + }
> + }
>  
>   /*
>* some block groups are so tiny they can't be enveloped by a bitmap, so



Re: [PATCH V4] btrfs: implement delayed inode items operation

2011-03-20 Thread Chris Mason
Excerpts from Miao Xie's message of 2011-03-18 05:24:46 -0400:
> Changelog V3 -> V4:
> - Fix nested lock, which is reported by Itaru Kitayama, by updating space
>   cache inodes in time.

I ran some tests on this and had trouble with my stress.sh script:

http://oss.oracle.com/~mason/stress.sh

I used:

stress.sh -n 50 -c  /mnt

The git tree has all the .git files but no .o files.

The problem was that within about 20 minutes, the filesystem was
spending almost all of its time in balance_dirty_pages().  The problem
is that data writeback isn't complete until the endio handlers have
finished inserting metadata into the btree.

The v4 patch calls btrfs_btree_balance_dirty() from all the
btrfs_end_transaction variants, which means that the FS writeback code
waits for balance_dirty_pages(), which won't make progress until the FS
writeback code is done.

So I changed things to call the delayed inode balance function only from
inside btrfs_btree_balance_dirty(), which did resolve the stalls.  But
I found a few times that when I did rmmod btrfs, there would be delayed
inode objects leaked in the slab cache.  rmmod will try to destroy the
slab cache, which will fail because we haven't freed everything.

It looks like we have a race in btrfs_get_or_create_delayed_node, where
two concurrent callers can both create delayed nodes and then race on
adding it to the inode.

I also think that code is racing with the code that frees delayed nodes,
but haven't yet triggered my debugging printks to prove either one.

My current incremental patch is below, please take a look:

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 727d9a8..02e9390 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -32,7 +32,8 @@ int __init btrfs_delayed_inode_init(void)
delayed_node_cache = kmem_cache_create("delayed_node",
sizeof(struct btrfs_delayed_node),
0,
-   SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
+   SLAB_RECLAIM_ACCOUNT |
+   SLAB_MEM_SPREAD | SLAB_DESTROY_BY_RCU,
NULL);
if (!delayed_node_cache)
return -ENOMEM;
@@ -90,22 +91,35 @@ static struct btrfs_delayed_node *btrfs_get_or_create_delayed_node(
struct btrfs_inode *btrfs_inode = BTRFS_I(inode);
struct btrfs_root *root = btrfs_inode->root;
 
-   if (btrfs_inode->delayed_node) {
-   node = btrfs_inode->delayed_node;
-   atomic_inc(&node->refs);/* can be accessed */
-   return node;
+again:
+   rcu_read_lock();
+again_rcu:
+   node = btrfs_inode->delayed_node;
+   if (node) {
+   if (atomic_inc_not_zero(&node->refs)) {
+   rcu_read_unlock();
+   return node;
+   }
+printk("racing on node access!\n");
+   goto again_rcu;
}
+   rcu_read_unlock();
 
node = kmem_cache_alloc(delayed_node_cache, GFP_NOFS);
if (!node)
return ERR_PTR(-ENOMEM);
btrfs_init_delayed_node(node, root->objectid, inode->i_ino);
 
-   btrfs_inode->delayed_node = node;
node->delayed_root = btrfs_get_delayed_root(root);
atomic_inc(&node->refs);/* cached in the btrfs inode */
atomic_inc(&node->refs);/* can be accessed */
 
+   if (cmpxchg(&BTRFS_I(inode)->delayed_node, NULL, node)) {
+   kmem_cache_free(delayed_node_cache, node);
+printk("racing on new node insertion!\n");
+   goto again;
+   }
+
return node;
 }
 
@@ -1167,7 +1181,7 @@ static void btrfs_async_run_delayed_node_done(struct btrfs_work *work)
nr = trans->blocks_used;
 
btrfs_end_transaction_dmeta(trans, root);
-   btrfs_btree_balance_dirty(root, nr);
+   __btrfs_btree_balance_dirty(root, nr);
 free_path:
btrfs_free_path(path);
 out:
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0f589b1..0c6c336 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2649,6 +2649,28 @@ void btrfs_btree_balance_dirty(struct btrfs_root *root, unsigned long nr)
balance_dirty_pages_ratelimited_nr(
   root->fs_info->btree_inode->i_mapping, 1);
}
+   btrfs_balance_delayed_items(root);
+   return;
+}
+
+void __btrfs_btree_balance_dirty(struct btrfs_root *root, unsigned long nr)
+{
+   /*
+* looks as though older kernels can get into trouble with
+* this code, they end up stuck in balance_dirty_pages forever
+*/
+   u64 num_dirty;
+   unsigned long thresh = 32 * 1024 * 1024;
+
+   if (current->flags & PF_MEMALLOC)
+   return;
+
+   num_dirty = root->fs_info->dirty_metadata_bytes;
+
+   if (num_dirty > thresh) {
+   balanc

Re: [PATCH RFC] btrfs: Simplify locking

2011-03-20 Thread Chris Mason
Excerpts from Tejun Heo's message of 2011-03-20 13:44:33 -0400:
> Hello, guys.
> 
> I've been looking through btrfs code and found out that the locking
> was quite interesting.  The distinction between blocking and
> non-blocking locking is valid but is essentially attacking the same
> problem that CONFIG_MUTEX_SPIN_ON_OWNER addresses in generic manner.
> 
> It seemed that CONFIG_MUTEX_SPIN_ON_OWNER should be able to do it
> better as it actually knows whether the lock owner is running or not,
> so I did a few "dbench 50" runs with the complex locking removed.
> 
> The test machine is a dual core machine with 1 gig of memory.  The
> filesystem is on OCZ vertex 64 gig SSD.  I'm attaching the kernel
> config.  Please note that CONFIG_MUTEX_SPIN_ON_OWNER is enabled only
> when lock debugging is disabled, but it will be enabled on virtually
> any production configuration.
> 
> The machine was idle during the dbench runs and CPU usages and context
> switches are calculated from /proc/stat over the dbench runs.  The
> throughput is as reported by dbench.
> 
>          user   system  sirq  cxtsw    throughput
> before   14426  129332  345   1674004  171.7
> after    14274  129036  308   1183346  172.119
> 
> So, the extra code isn't really helping anything.  It's worse in every
> way.  Are there some obscure reasons the custom locking should be kept
> around?

Hi Tejun,

I went through a number of benchmarks with the explicit
blocking/spinning code and back then it was still significantly faster
than the adaptive spin.  But, it is definitely worth doing these again,
how many dbench procs did you use?

The biggest benefit to explicit spinning is that mutex_lock() starts with
might_sleep(), so by staying in our own spinning path we skip the
cond_resched() it implies.  Do you have voluntary preempt on?
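
(Roughly what I mean, quoting the generic code from memory, so please double
check the exact definitions:)

/* kernel/mutex.c */
void __sched mutex_lock(struct mutex *lock)
{
        might_sleep();
        __mutex_fastpath_lock(&lock->count, __mutex_lock_slowpath);
        mutex_set_owner(lock);
}

/* include/linux/kernel.h */
#ifdef CONFIG_PREEMPT_VOLUNTARY
# define might_resched() _cond_resched()
#else
# define might_resched() do { } while (0)
#endif

might_sleep() ends in might_resched(), so with voluntary preempt every
mutex_lock() call is a potential reschedule point before we even try the
fastpath.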

The long term goal was always to get rid of the blocking locks, I had a
lot of code to track the top blockers and we had gotten rid of about 90%
of them.

Thanks for looking at this

-chris


Re: [PATCH RFC] btrfs: Simplify locking

2011-03-20 Thread Tejun Heo
On Sun, Mar 20, 2011 at 08:56:52PM +0100, Tejun Heo wrote:
> So, here's the patch to implement and use mutex_try_spin(), which
> applies the same owner spin logic to try locking.  The result looks
> pretty good.
> 
> I re-ran all three.  DFL is the current custom locking.  SIMPLE is
> with only the previous patch applied.  SPIN is both the previous and
> this patches applied.
> 
>         USER   SYSTEM  SIRQ  CXTSW    THROUGHPUT
> DFL     14484  129368  390   1669102  171.955
> SIMPLE  14483  128902  318   1187031  171.512
> SPIN    14311  129222  347   1198166  174.904
> 
> DFL/SIMPLE results are more or less consistent with the previous run.
> SPIN seems to consume a bit more cpu than SIMPLE but shows discernibly
> better throughput.  I'm running SPIN again just in case but the result
> seems pretty consistent.

Here's another run.

  SPIN   14377  129092  335   1189817  172.724
  
It's not as high as the last run, but given other runs I've been
seeing, I think meaningful (ie. not measurement error) throughput
advantage exists.  That said, the difference definitely seems minor.

Thanks.

-- 
tejun


Re: [PATCH RFC] btrfs: Simplify locking

2011-03-20 Thread Tejun Heo
So, here's the patch to implement and use mutex_try_spin(), which
applies the same owner spin logic to try locking.  The result looks
pretty good.

I re-ran all three.  DFL is the current custom locking.  SIMPLE is
with only the previous patch applied.  SPIN is both the previous and
this patches applied.

        USER   SYSTEM  SIRQ  CXTSW    THROUGHPUT
DFL     14484  129368  390   1669102  171.955
SIMPLE  14483  128902  318   1187031  171.512
SPIN    14311  129222  347   1198166  174.904

DFL/SIMPLE results are more or less consistent with the previous run.
SPIN seems to consume a bit more cpu than SIMPLE but shows discernibly
better throughput.  I'm running SPIN again just in case but the result
seems pretty consistent.

Thanks.

NOT-Signed-off-by: Tejun Heo 
---
 fs/btrfs/locking.h    |    2 -
 include/linux/mutex.h |    1 
 kernel/mutex.c        |   58 --
 3 files changed, 44 insertions(+), 17 deletions(-)

Index: work/fs/btrfs/locking.h
===
--- work.orig/fs/btrfs/locking.h
+++ work/fs/btrfs/locking.h
@@ -23,7 +23,7 @@
 
 static inline bool btrfs_try_spin_lock(struct extent_buffer *eb)
 {
-   return mutex_trylock(&eb->lock);
+   return mutex_tryspin(&eb->lock);
 }
 
 static inline void btrfs_tree_lock(struct extent_buffer *eb)
Index: work/include/linux/mutex.h
===
--- work.orig/include/linux/mutex.h
+++ work/include/linux/mutex.h
@@ -157,6 +157,7 @@ extern int __must_check mutex_lock_killa
  * Returns 1 if the mutex has been acquired successfully, and 0 on contention.
  */
 extern int mutex_trylock(struct mutex *lock);
+extern int mutex_tryspin(struct mutex *lock);
 extern void mutex_unlock(struct mutex *lock);
 extern int atomic_dec_and_mutex_lock(atomic_t *cnt, struct mutex *lock);
 
Index: work/kernel/mutex.c
===
--- work.orig/kernel/mutex.c
+++ work/kernel/mutex.c
@@ -126,20 +126,8 @@ void __sched mutex_unlock(struct mutex *
 
 EXPORT_SYMBOL(mutex_unlock);
 
-/*
- * Lock a mutex (possibly interruptible), slowpath:
- */
-static inline int __sched
-__mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
-   unsigned long ip)
+static inline bool mutex_spin(struct mutex *lock)
 {
-   struct task_struct *task = current;
-   struct mutex_waiter waiter;
-   unsigned long flags;
-
-   preempt_disable();
-   mutex_acquire(&lock->dep_map, subclass, 0, ip);
-
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
/*
 * Optimistic spinning.
@@ -158,7 +146,6 @@ __mutex_lock_common(struct mutex *lock,
 * We can't do this for DEBUG_MUTEXES because that relies on wait_lock
 * to serialize everything.
 */
-
for (;;) {
struct thread_info *owner;
 
@@ -181,7 +168,7 @@ __mutex_lock_common(struct mutex *lock,
lock_acquired(&lock->dep_map, ip);
mutex_set_owner(lock);
preempt_enable();
-   return 0;
+   return true;
}
 
/*
@@ -190,7 +177,7 @@ __mutex_lock_common(struct mutex *lock,
 * we're an RT task that will live-lock because we won't let
 * the owner complete.
 */
-   if (!owner && (need_resched() || rt_task(task)))
+   if (!owner && (need_resched() || rt_task(current)))
break;
 
/*
@@ -202,6 +189,26 @@ __mutex_lock_common(struct mutex *lock,
cpu_relax();
}
 #endif
+   return false;
+}
+
+/*
+ * Lock a mutex (possibly interruptible), slowpath:
+ */
+static inline int __sched
+__mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
+   unsigned long ip)
+{
+   struct task_struct *task = current;
+   struct mutex_waiter waiter;
+   unsigned long flags;
+
+   preempt_disable();
+   mutex_acquire(&lock->dep_map, subclass, 0, ip);
+
+   if (mutex_spin(lock))
+   return 0;
+
spin_lock_mutex(&lock->wait_lock, flags);
 
debug_mutex_lock_common(lock, &waiter);
@@ -473,6 +480,25 @@ int __sched mutex_trylock(struct mutex *
 }
 EXPORT_SYMBOL(mutex_trylock);
 
+static inline int __mutex_tryspin_slowpath(atomic_t *lock_count)
+{
+   struct mutex *lock = container_of(lock_count, struct mutex, count);
+
+   return __mutex_trylock_slowpath(lock_count) || mutex_spin(lock);
+}
+
+int __sched mutex_tryspin(struct mutex *lock)
+{
+   int ret;
+
+   ret = __mutex_fastpath_trylock(&lock->count, __mutex_tryspin_slowpath);
+   if (ret)
+   mutex_set_owner(lock);
+
+   return ret;
+}
+EXPORT_SYMBOL(mutex_tryspin);
+
 /**
  * atomic_dec_and_mutex_lock - return holding mutex if we dec to 0
  * @cnt: the 

[PATCH RFC] btrfs: Simplify locking

2011-03-20 Thread Tejun Heo
Hello, guys.

I've been looking through btrfs code and found out that the locking
was quite interesting.  The distinction between blocking and
non-blocking locking is valid but is essentially attacking the same
problem that CONFIG_MUTEX_SPIN_ON_OWNER addresses in generic manner.

It seemed that CONFIG_MUTEX_SPIN_ON_OWNER should be able to do it
better as it actually knows whether the lock owner is running or not,
so I did a few "dbench 50" runs with the complex locking removed.

The test machine is a dual core machine with 1 gig of memory.  The
filesystem is on OCZ vertex 64 gig SSD.  I'm attaching the kernel
config.  Please note that CONFIG_MUTEX_SPIN_ON_OWNER is enabled only
when lock debugging is disabled, but it will be enabled on virtually
any production configuration.

The machine was idle during the dbench runs and CPU usages and context
switches are calculated from /proc/stat over the dbench runs.  The
throughput is as reported by dbench.

         user   system  sirq  cxtsw    throughput
before   14426  129332  345   1674004  171.7
after    14274  129036  308   1183346  172.119

So, the extra code isn't really helping anything.  It's worse in every
way.  Are there some obscure reasons the custom locking should be kept
around?

One thing to note is that the patch makes try_tree_lock and
try_spin_lock identical.  It might be beneficial to apply owner spin
logic to try_spin_lock.  I'll test that one too and see whether it
makes any difference.

Thanks.

NOT-Signed-off-by: Tejun Heo 
---
 fs/btrfs/Makefile    |    2 
 fs/btrfs/ctree.c     |   16 +--
 fs/btrfs/extent_io.c |    3 
 fs/btrfs/extent_io.h |   12 --
 fs/btrfs/locking.c   |  233 ---
 fs/btrfs/locking.h   |   43 +++--
 6 files changed, 48 insertions(+), 261 deletions(-)

Index: work/fs/btrfs/ctree.c
===
--- work.orig/fs/btrfs/ctree.c
+++ work/fs/btrfs/ctree.c
@@ -1074,7 +1074,7 @@ static noinline int balance_level(struct
 
left = read_node_slot(root, parent, pslot - 1);
if (left) {
-   btrfs_tree_lock(left);
+   btrfs_tree_lock_nested(left, 1);
btrfs_set_lock_blocking(left);
wret = btrfs_cow_block(trans, root, left,
   parent, pslot - 1, &left);
@@ -1085,7 +1085,7 @@ static noinline int balance_level(struct
}
right = read_node_slot(root, parent, pslot + 1);
if (right) {
-   btrfs_tree_lock(right);
+   btrfs_tree_lock_nested(right, 2);
btrfs_set_lock_blocking(right);
wret = btrfs_cow_block(trans, root, right,
   parent, pslot + 1, &right);
@@ -1241,7 +1241,7 @@ static noinline int push_nodes_for_inser
if (left) {
u32 left_nr;
 
-   btrfs_tree_lock(left);
+   btrfs_tree_lock_nested(left, 1);
btrfs_set_lock_blocking(left);
 
left_nr = btrfs_header_nritems(left);
@@ -1291,7 +1291,7 @@ static noinline int push_nodes_for_inser
if (right) {
u32 right_nr;
 
-   btrfs_tree_lock(right);
+   btrfs_tree_lock_nested(right, 1);
btrfs_set_lock_blocking(right);
 
right_nr = btrfs_header_nritems(right);
@@ -2519,7 +2519,7 @@ static int push_leaf_right(struct btrfs_
if (right == NULL)
return 1;
 
-   btrfs_tree_lock(right);
+   btrfs_tree_lock_nested(right, 1);
btrfs_set_lock_blocking(right);
 
free_space = btrfs_leaf_free_space(root, right);
@@ -2772,7 +2772,7 @@ static int push_leaf_left(struct btrfs_t
if (left == NULL)
return 1;
 
-   btrfs_tree_lock(left);
+   btrfs_tree_lock_nested(left, 1);
btrfs_set_lock_blocking(left);
 
free_space = btrfs_leaf_free_space(root, left);
@@ -4420,7 +4420,7 @@ again:
ret = btrfs_try_spin_lock(next);
if (!ret) {
btrfs_set_path_blocking(path);
-   btrfs_tree_lock(next);
+   btrfs_tree_lock_nested(next, 1);
if (!force_blocking)
btrfs_clear_path_blocking(path, next);
}
@@ -4460,7 +4460,7 @@ again:
ret = btrfs_try_spin_lock(next);
if (!ret) {
btrfs_set_path_blocking(path);
-   btrfs_tree_lock(next);
+   btrfs_tree_lock_nested(next, 1);
if (!force_blocking)
btrfs_clear_path_blocking(path, next);
}
Index: work/fs/btrfs/extent_io.c
==

Re: [PATCH v2 0/2] Balance management, kernel side

2011-03-20 Thread Hugo Mills
On Sun, Mar 20, 2011 at 09:52:26AM +0100, Andreas Philipp wrote:
> 
> Since balancing takes a long time, I liked the idea of having a progress
> counter and the ability to run this long-running task in the background,
> with a further option to cancel it if necessary. So I wanted to give it a
> try. Unfortunately, the patches did not apply on top of kernel version
> 2.6.38. Is there a newer version of this patch set, or has the idea died in
> the meantime? Of course, I will test any patches.

   Odd you should mention that, I'm just redoing my patch stack
against the latest btrfs-unstable...

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
 --- IMPROVE YOUR ORGANISMS!!  -- Subject line of spam email --- 




Re: [PATCH v2 0/2] Balance management, kernel side

2011-03-20 Thread Andreas Philipp

Since balancing takes a long time, I liked the idea of having a progress
counter and the ability to run this long-running task in the background,
with a further option to cancel it if necessary. So I wanted to give it a
try. Unfortunately, the patches did not apply on top of kernel version
2.6.38. Is there a newer version of this patch set, or has the idea died in
the meantime? Of course, I will test any patches.

Thanks,
Andreas Philipp


On 12.11.2010 01:36, Hugo Mills wrote:
> These two patches give a degree of control over balance
> operations. The first makes it possible to get an idea of how much
> work remains to do, by tracking the number of block groups (chunks)
> that need to be moved/rewritten. The second patch allows a running
> balance operation to be cancelled when the current block group has
> been moved.
>
> Since the last version, I've added some more locking (assigning to
> a u64 isn't atomic on non-64-bit architectures). I've not added
> the sysfs bits, as I haven't had a chance to try out Goffredo's
> sysfs code yet. I've also not implemented liubo's suggestion of
> tracking the current block group ID (I'll take that discussion up
> with him separately -- basically it's not a good fit with the
> "polling" method required by this ioctl).
>
> Hugo Mills (2):
>   Balance progress monitoring.
>   Cancel filesystem balance.
>
> fs/btrfs/ctree.h   |   10
> fs/btrfs/disk-io.c |    2 +
> fs/btrfs/ioctl.c   |   62
> fs/btrfs/ioctl.h   |    8 ++
> fs/btrfs/volumes.c |   66 ++-
> 5 files changed, 146 insertions(+), 2 deletions(-)
>
