Re: [PATCH V5 2/2] btrfs: implement delayed inode items operation

2011-03-27 Thread Miao Xie
On sun, 27 Mar 2011 14:30:55 +0900, Itaru Kitayama wrote:
 Chris' stress test, stress.sh -n 50 -c /mnt/linux-2.6 /mnt gave me another 
 lockdep splat
 (see below). I applied your V5 patches on top of the next-rc branch.

I got it. It is because the allocation flag of the metadata's page cache, which 
is stored in
the btree inode's i_mapping, was set to be GFP_HIGHUSER_MOVABLE. So if we 
allocate pages for
btree's page cache, this lockdep warning will be triggered.

I think even without my patch, this lockdep warning can also be triggered, 
btrfs_evict_inode()
do the similar operations like what I do in the btrfs_destroy_inode(). 
  Task1 Kswap0 task
  open()
...
btrfs_search_slot()
  ...
  btrfs_cow_block()
...
alloc_page()
  wait for reclaiming
shrink_slab()
  ...
  shrink_icache_memory()
...
btrfs_evict_inode()
  ...
  btrfs_search_slot()

If the path is locked by task1, the deadlock happens.

So the btree's page cache is different with the file's page cache, it can not 
allocate pages
by GFP_HIGHUSER_MOVABLE flag.

I will make a separate patch to fix it.

 I haven't triggered it in my actual testing, but do you think we can iterate 
 a list of block 
 groups in an lockless manner using rcu?

May be we can use it, but AFAIK, the write-side of the sleepable RCU is quite 
slow. Though the
operations of the block group list are few, I think we should do some test to 
check the performance
regression.

Thanks
Miao

 
 diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
 index 2164296..f40ff4e 100644
 --- a/fs/btrfs/ctree.h
 +++ b/fs/btrfs/ctree.h
 @@ -740,6 +740,7 @@ struct btrfs_space_info {
   struct list_head block_groups[BTRFS_NR_RAID_TYPES];
   spinlock_t lock;
   struct rw_semaphore groups_sem;
 + struct srcu_struct groups_srcu;
   atomic_t caching_threads;
  };
  
 diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
 index 9e4c9f4..22d6dbb 100644
 --- a/fs/btrfs/extent-tree.c
 +++ b/fs/btrfs/extent-tree.c
 @@ -3003,6 +3003,7 @@ static int update_space_info(struct btrfs_fs_info 
 *info, u64 flags,
   for (i = 0; i  BTRFS_NR_RAID_TYPES; i++)
   INIT_LIST_HEAD(found-block_groups[i]);
   init_rwsem(found-groups_sem);
 + init_srcu_struct(found-groups_srcu);
   spin_lock_init(found-lock);
   found-flags = flags  (BTRFS_BLOCK_GROUP_DATA |
   BTRFS_BLOCK_GROUP_SYSTEM |
 @@ -4853,6 +4854,7 @@ static noinline int find_free_extent(struct 
 btrfs_trans_handle *trans,
int data)
  {
   int ret = 0;
 + int idx;
   struct btrfs_root *root = orig_root-fs_info-extent_root;
   struct btrfs_free_cluster *last_ptr = NULL;
   struct btrfs_block_group_cache *block_group = NULL;
 @@ -4929,7 +4931,7 @@ ideal_cache:
   if (block_group  block_group_bits(block_group, data) 
   (block_group-cached != BTRFS_CACHE_NO ||
search_start == ideal_cache_offset)) {
 - down_read(space_info-groups_sem);
 + idx = srcu_read_lock(space_info-groups_srcu);
   if (list_empty(block_group-list) ||
   block_group-ro) {
   /*
 @@ -4939,7 +4941,7 @@ ideal_cache:
* valid
*/
   btrfs_put_block_group(block_group);
 - up_read(space_info-groups_sem);
 + srcu_read_unlock(space_info-groups_srcu, idx);
   } else {
   index = get_block_group_index(block_group);
   goto have_block_group;
 @@ -4949,8 +4951,8 @@ ideal_cache:
   }
   }
  search:
 - down_read(space_info-groups_sem);
 - list_for_each_entry(block_group, space_info-block_groups[index],
 + idx = srcu_read_lock(space_info-groups_srcu);
 + list_for_each_entry_rcu(block_group, space_info-block_groups[index],
   list) {
   u64 offset;
   int cached;
 @@ -5197,8 +5199,8 @@ loop:
   BUG_ON(index != get_block_group_index(block_group));
   btrfs_put_block_group(block_group);
   }
 - up_read(space_info-groups_sem);
 -
 + srcu_read_unlock(space_info-groups_srcu, idx);
 + 
   if (!ins-objectid  ++index  BTRFS_NR_RAID_TYPES)
   goto search;
  
 
 
 =
 [ INFO: possible irq lock inversion dependency detected ]
 2.6.36-v5+ #2
 

Re: [PATCH V5 2/2] btrfs: implement delayed inode items operation

2011-03-27 Thread Itaru Kitayama
Hi Miao,

On Sun, 27 Mar 2011 15:00:00 +0800
Miao Xie mi...@cn.fujitsu.com wrote:

 I got it. It is because the allocation flag of the metadata's page cache, 
 which is stored in
 the btree inode's i_mapping, was set to be GFP_HIGHUSER_MOVABLE. So if we 
 allocate pages for
 btree's page cache, this lockdep warning will be triggered.
 
 I think even without my patch, this lockdep warning can also be triggered, 
 btrfs_evict_inode()
 do the similar operations like what I do in the btrfs_destroy_inode(). 
   Task1   Kswap0 task
   open()
 ...
 btrfs_search_slot()
   ...
   btrfs_cow_block()
   ...
   alloc_page()
 wait for reclaiming
   shrink_slab()
 ...
 shrink_icache_memory()
   ...
   btrfs_evict_inode()
 ...
 btrfs_search_slot()
 
 If the path is locked by task1, the deadlock happens.

Ok. balance_pgdat() calls shrink_slab() with GFP_KERNEL so it's still possible 
for the kswapd0 
to call prune_icache(), no? I still see the lockdep warning even with your 
patch that clears
__GFP_FS in open_ctree().

itaru

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V5 2/2] btrfs: implement delayed inode items operation

2011-03-27 Thread Miao Xie
On sun, 27 Mar 2011 20:09:10 +0900, Itaru Kitayama wrote:
 Hi Miao,
 
 On Sun, 27 Mar 2011 15:00:00 +0800
 Miao Xie mi...@cn.fujitsu.com wrote:
 
 I got it. It is because the allocation flag of the metadata's page cache, 
 which is stored in
 the btree inode's i_mapping, was set to be GFP_HIGHUSER_MOVABLE. So if we 
 allocate pages for
 btree's page cache, this lockdep warning will be triggered.

 I think even without my patch, this lockdep warning can also be triggered, 
 btrfs_evict_inode()
 do the similar operations like what I do in the btrfs_destroy_inode(). 
   Task1  Kswap0 task
   open()
 ...
 btrfs_search_slot()
   ...
   btrfs_cow_block()
  ...
  alloc_page()
wait for reclaiming
  shrink_slab()
...
shrink_icache_memory()
  ...
  btrfs_evict_inode()
...
btrfs_search_slot()

 If the path is locked by task1, the deadlock happens.
 
 Ok. balance_pgdat() calls shrink_slab() with GFP_KERNEL so it's still 
 possible for the kswapd0 
 to call prune_icache(), no? I still see the lockdep warning even with your 
 patch that clears
 __GFP_FS in open_ctree().

sorry for my mistake. The above explanation is wrong, it has no business with 
kswap thread.
The correct explanation is

   Task1
   open()
 ...
 btrfs_search_slot()
   ...
   btrfs_cow_block()
...
alloc_page()
  do_try_to_free_pages()
shrink_slab()
  ...
  shrink_icache_memory()
...
btrfs_evict_inode()
  ...
  btrfs_search_slot()

If the path is locked by task1, the deadlock happens.

So balance_pgdat() is impossible to trigger the lockdep.
(My clearing __GFP_FS patch's changelog is also wrong.)

I see, except btree's page cache, free space cache's page cache is also special,
can not use __GFP_FS flag.

Thanks
Miao

 itaru
 
 --
 To unsubscribe from this list: send the line unsubscribe linux-btrfs in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V5 2/2] btrfs: implement delayed inode items operation

2011-03-27 Thread Chris Mason
Excerpts from Miao Xie's message of 2011-03-27 07:44:06 -0400:
 On sun, 27 Mar 2011 20:09:10 +0900, Itaru Kitayama wrote:
  Hi Miao,
  
  On Sun, 27 Mar 2011 15:00:00 +0800
  Miao Xie mi...@cn.fujitsu.com wrote:
  
  I got it. It is because the allocation flag of the metadata's page cache, 
  which is stored in
  the btree inode's i_mapping, was set to be GFP_HIGHUSER_MOVABLE. So if we 
  allocate pages for
  btree's page cache, this lockdep warning will be triggered.
 
  I think even without my patch, this lockdep warning can also be triggered, 
  btrfs_evict_inode()
  do the similar operations like what I do in the btrfs_destroy_inode(). 
Task1Kswap0 task
open()
  ...
  btrfs_search_slot()
...
btrfs_cow_block()
  ...
  alloc_page()
wait for reclaiming
  shrink_slab()
...
shrink_icache_memory()
  ...
  btrfs_evict_inode()
...
btrfs_search_slot()
 
  If the path is locked by task1, the deadlock happens.
  
  Ok. balance_pgdat() calls shrink_slab() with GFP_KERNEL so it's still 
  possible for the kswapd0 
  to call prune_icache(), no? I still see the lockdep warning even with your 
  patch that clears
  __GFP_FS in open_ctree().
 
 sorry for my mistake. The above explanation is wrong, it has no business with 
 kswap thread.
 The correct explanation is
 
Task1
open()
  ...
  btrfs_search_slot()
...
btrfs_cow_block()
  ...
  alloc_page()
do_try_to_free_pages()
  shrink_slab()
...
shrink_icache_memory()
  ...
  btrfs_evict_inode()
...
btrfs_search_slot()
 
 If the path is locked by task1, the deadlock happens.
 
 So balance_pgdat() is impossible to trigger the lockdep.
 (My clearing __GFP_FS patch's changelog is also wrong.)
 
 I see, except btree's page cache, free space cache's page cache is also 
 special,
 can not use __GFP_FS flag.

Ok, I've got your first patch already, I'll add a hunk for the free
space cache too.  Most of the allocations we're doing are explicitly
with GFP_NOFS, so it is just supporting allocations and readahead that
should be causing trouble.

Thanks!

-chris
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V5 2/2] btrfs: implement delayed inode items operation

2011-03-26 Thread Chris Mason
Excerpts from Miao Xie's message of 2011-03-24 07:41:31 -0400:
 Changelog V4 - V5:
 - Fix the race on adding the delayed node to the inode, which is spotted by
   Chris Mason.
 - Merge Chris Mason's incremental patch into this patch.
 - Fix deadlock between readdir() and memory fault, which is reported by
   Itaru Kitayama.

This does do much better than v4, but I'm still hitting oom with
stress.sh -n 50.  I tried a bunch of variations but haven't been able to
get it quite right.  I think I need to hold off on this one and get the
2.6.39 pull request out.

I'll setup a .40 tree that has this in it and we can fix the ooms.  This
is a great base for the work, and I'd like to add more items to the
delay, especially the initial inode insertions.

-chris
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V5 2/2] btrfs: implement delayed inode items operation

2011-03-26 Thread Itaru Kitayama
Hi Miao,

Chris' stress test, stress.sh -n 50 -c /mnt/linux-2.6 /mnt gave me another 
lockdep splat
(see below). I applied your V5 patches on top of the next-rc branch.

I haven't triggered it in my actual testing, but do you think we can iterate a 
list of block 
groups in an lockless manner using rcu?

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2164296..f40ff4e 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -740,6 +740,7 @@ struct btrfs_space_info {
struct list_head block_groups[BTRFS_NR_RAID_TYPES];
spinlock_t lock;
struct rw_semaphore groups_sem;
+   struct srcu_struct groups_srcu;
atomic_t caching_threads;
 };
 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 9e4c9f4..22d6dbb 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3003,6 +3003,7 @@ static int update_space_info(struct btrfs_fs_info *info, 
u64 flags,
for (i = 0; i  BTRFS_NR_RAID_TYPES; i++)
INIT_LIST_HEAD(found-block_groups[i]);
init_rwsem(found-groups_sem);
+   init_srcu_struct(found-groups_srcu);
spin_lock_init(found-lock);
found-flags = flags  (BTRFS_BLOCK_GROUP_DATA |
BTRFS_BLOCK_GROUP_SYSTEM |
@@ -4853,6 +4854,7 @@ static noinline int find_free_extent(struct 
btrfs_trans_handle *trans,
 int data)
 {
int ret = 0;
+   int idx;
struct btrfs_root *root = orig_root-fs_info-extent_root;
struct btrfs_free_cluster *last_ptr = NULL;
struct btrfs_block_group_cache *block_group = NULL;
@@ -4929,7 +4931,7 @@ ideal_cache:
if (block_group  block_group_bits(block_group, data) 
(block_group-cached != BTRFS_CACHE_NO ||
 search_start == ideal_cache_offset)) {
-   down_read(space_info-groups_sem);
+   idx = srcu_read_lock(space_info-groups_srcu);
if (list_empty(block_group-list) ||
block_group-ro) {
/*
@@ -4939,7 +4941,7 @@ ideal_cache:
 * valid
 */
btrfs_put_block_group(block_group);
-   up_read(space_info-groups_sem);
+   srcu_read_unlock(space_info-groups_srcu, idx);
} else {
index = get_block_group_index(block_group);
goto have_block_group;
@@ -4949,8 +4951,8 @@ ideal_cache:
}
}
 search:
-   down_read(space_info-groups_sem);
-   list_for_each_entry(block_group, space_info-block_groups[index],
+   idx = srcu_read_lock(space_info-groups_srcu);
+   list_for_each_entry_rcu(block_group, space_info-block_groups[index],
list) {
u64 offset;
int cached;
@@ -5197,8 +5199,8 @@ loop:
BUG_ON(index != get_block_group_index(block_group));
btrfs_put_block_group(block_group);
}
-   up_read(space_info-groups_sem);
-
+   srcu_read_unlock(space_info-groups_srcu, idx);
+   
if (!ins-objectid  ++index  BTRFS_NR_RAID_TYPES)
goto search;
 


=
[ INFO: possible irq lock inversion dependency detected ]
2.6.36-v5+ #2
-
kswapd0/49 just changed the state of lock:
 (delayed_node-mutex){+.+.-.}, at: [812131f7] 
btrfs_remove_delayed_node+0x3e/0xd2
but this lock took another, RECLAIM_FS-READ-unsafe lock in the past:
 (found-groups_sem){.+}

and interrupts could create inverse lock ordering between them.


other info that might help us debug this:
2 locks held by kswapd0/49:
 #0:  (shrinker_rwsem){..}, at: [810e242a] shrink_slab+0x3d/0x164
 #1:  (iprune_sem){.-}, at: [811316d0] 
shrink_icache_memory+0x4d/0x213

the shortest dependencies between 2nd lock and 1st lock:
 - (found-groups_sem){.+} ops: 1334 {
HARDIRQ-ON-W at:
  [81075ec0] __lock_acquire+0x346/0xda6
  [81076a3d] lock_acquire+0x11d/0x143
  [814c6a2a] down_write+0x55/0x9b
  [811c352a] __link_block_group+0x5a/0x83
  [811ca562] 
btrfs_read_block_groups+0x2fb/0x56c
  [811d4921] open_ctree+0xf78/0x14ab
  [811bafdf] btrfs_get_sb+0x236/0x467
  [8111f25e] vfs_kern_mount+0xbd/0x1a7
  [8111f3b0] do_kern_mount+0x4d/0xed
  [8113668d] do_mount+0x74e/0x7c5
  [8113678c] sys_mount+0x88/0xc2
 

Re: [PATCH V5 2/2] btrfs: implement delayed inode items operation

2011-03-24 Thread David Sterba
Hi,

there's one thing I want to bring up. It's not related to delayed
functionality itself but to git tree base of the patch.

There's a merge conflict when your patch is applied directly onto
Linus' tree, and not when on Chris' one.

On Thu, Mar 24, 2011 at 07:41:31PM +0800, Miao Xie wrote:
...
 --- a/fs/btrfs/inode.c
 +++ b/fs/btrfs/inode.c
 @@ -6726,6 +6775,9 @@ void btrfs_destroy_inode(struct inode *inode)
   inode_tree_del(inode);
   btrfs_drop_extent_cache(inode, 0, (u64)-1, 0);
  free:
 + ret = btrfs_remove_delayed_node(inode);
 + BUG_ON(ret);
 +
   kmem_cache_free(btrfs_inode_cachep, BTRFS_I(inode));
  }
  

the call to kmem_cache_free has been replaced by

commit fa0d7e3de6d6fc5004ad9dea0dd6b286af8f03e9
Author: Nick Piggin npig...@kernel.dk
Date:   Fri Jan 7 17:49:49 2011 +1100
fs: icache RCU free inodes

relevant hunk:

--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6495,6 +6495,13 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
return inode;
 }

+static void btrfs_i_callback(struct rcu_head *head)
+{
+   struct inode *inode = container_of(head, struct inode, i_rcu);
+   INIT_LIST_HEAD(inode-i_dentry);
+   kmem_cache_free(btrfs_inode_cachep, BTRFS_I(inode));
+}
+
 void btrfs_destroy_inode(struct inode *inode)
 {
struct btrfs_ordered_extent *ordered;
@@ -6564,7 +6571,7 @@ void btrfs_destroy_inode(struct inode *inode)
inode_tree_del(inode);
btrfs_drop_extent_cache(inode, 0, (u64)-1, 0);
 free:
-   kmem_cache_free(btrfs_inode_cachep, BTRFS_I(inode));
+   call_rcu(inode-i_rcu, btrfs_i_callback);
 }


I don't think this disqualifies all the testing already done but maybe it's
time to rebase btrfs-unstable.git to .38 .

Chris?


dave
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html