Re: [PATCH V4] btrfs: implement delayed inode items operation

2011-03-23 Thread Miao Xie
Hi, Kitayama-san

On Wed, 23 Mar 2011 13:19:18 +0900, Itaru Kitayama wrote:
 On Wed, 23 Mar 2011 12:00:38 +0800
 Miao Xie mi...@cn.fujitsu.com wrote:
 
 I am testing the new version, in which I fixed the slab shrinker problem
 reported by Chris. In the new version, the delayed node is removed before
 the corresponding inode is moved into the unused_inode list (the slab
 shrinker reclaims inodes from this list). Maybe this method can also fix
 this bug. So could you tell me the reproduction steps or the mkfs and
 mount options? I will check whether the new patch fixes this bug.
 
 I used the default mkfs options for $TEST_DEV and enabled the space_cache
 and compress=lzo options when mounting the partition.

Unfortunately, I couldn't trigger this warning myself, but by analyzing the
following information,

 =
 [ INFO: possible irq lock inversion dependency detected ]
 2.6.36-xie+ #122
 -
 kswapd0/49 just changed the state of lock:
  (iprune_sem){.-}, at: [811316d0] shrink_icache_memory+0x4d/0x213
 but this lock took another, RECLAIM_FS-unsafe lock in the past:
  (&delayed_node->mutex){+.+.+.}
 
 and interrupts could create inverse lock ordering between them.
[SNIP]
 RECLAIM_FS-ON-W at:
  [81074292] mark_held_locks+0x52/0x70
  [81074354] lockdep_trace_alloc+0xa4/0xc2
  [8110f206] __kmalloc+0x7f/0x154
  [811c2c21] kzalloc+0x14/0x16
  [811c5e83] cache_block_group+0x106/0x238
  [811c7069] find_free_extent+0x4e2/0xa86
  [811c76c1] btrfs_reserve_extent+0xb4/0x142
  [811c78b6] btrfs_alloc_free_block+0x167/0x2af
  [811be610] __btrfs_cow_block+0x103/0x346
  [811bedb8] btrfs_cow_block+0x101/0x110
  [811c05d8] btrfs_search_slot+0x143/0x513
  [811cf5ab] btrfs_lookup_inode+0x2f/0x8f
  [81212405] btrfs_update_delayed_inode+0x75/0x135
  [8121340d] btrfs_run_delayed_items+0xd6/0x131
  [811d64d7] btrfs_commit_transaction+0x28b/0x668
  [811ba786] btrfs_sync_fs+0x6b/0x70
  [81140265] __sync_filesystem+0x6b/0x83
  [81140347] sync_filesystem+0x4c/0x50
  [8111fc56] generic_shutdown_super+0x27/0xd7
  [8111fd5b] kill_anon_super+0x16/0x54
  [8111effd] deactivate_locked_super+0x26/0x46
  [8111f495] deactivate_super+0x45/0x4a
  [81135962] mntput_no_expire+0xd6/0x104
  [81136a87] sys_umount+0x2c1/0x2ec
  [81002ddb] system_call_fastpath+0x16/0x1b

we found that GFP_KERNEL was passed into kzalloc(); I think this flag triggers
the above lockdep warning. The attached patch, which applies on top of the
delayed items operation patch, may fix this problem. Could you test it for me?

Thanks
Miao
From f84daee1d2060beae945a2774cda7af2ef7f3e61 Mon Sep 17 00:00:00 2001
From: Miao Xie mi...@cn.fujitsu.com
Date: Wed, 23 Mar 2011 16:01:16 +0800
Subject: [PATCH] btrfs: use GFP_NOFS instead of GFP_KERNEL

In the filesystem context we must allocate memory with GFP_NOFS; otherwise
the allocation may recurse into filesystem operations and trigger a lockdep
warning.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
 fs/btrfs/extent-tree.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f1db57d..fe50cff 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -471,7 +471,7 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,
 	if (load_cache_only)
 		return 0;
 
-	caching_ctl = kzalloc(sizeof(*caching_ctl), GFP_KERNEL);
+	caching_ctl = kzalloc(sizeof(*caching_ctl), GFP_NOFS);
 	BUG_ON(!caching_ctl);
 
 	INIT_LIST_HEAD(&caching_ctl->list);
-- 
1.7.3.1



Re: [PATCH V4] btrfs: implement delayed inode items operation

2011-03-23 Thread Miao Xie
On Wed, 23 Mar 2011 09:57:56 +0800, Miao Xie wrote:
 On Mon, 21 Mar 2011 08:08:17 -0400, Chris Mason wrote:
 I also think that code is racing with the code that frees delayed nodes,
 but haven't yet triggered my debugging printks to prove either one.

 We free delayed nodes when we want to destroy the inode. At that time only
 one task, the one destroying the inode, can access the delayed node, so I
 think ACCESS_ONCE() is enough. What do you think?

 Great, I see what you mean.  The bigger problem right now is that we may do
 a lot of operations in destroy_inode(), which can block the slab
 shrinkers on our metadata operations.  That same stress.sh -n 50 run is
 running into OOM.

 So we need to rework the part where the final free is done.  We could
 keep a ref on the inode until the delayed items are complete, or we
 could let the inode go and make a way to lookup the delayed node when
 the inode is read.
 
 I find the slab shrinkers just reclaim inodes which are in the inode_unused
 list, so if we free delayed nodes before moving the inode into the
 inode_unused list, we can avoid blocking the slab shrinker. Thus maybe we
 can fix the above problem by moving btrfs_remove_delayed_node() from
 btrfs_destroy_inode() to btrfs_drop_inode(). What do you think?

This method is not good, because we may do delayed insertions/deletions and
free the delayed node when we release the inode, yet need to access its
delayed node again soon after.

After reading the lockdep message reported by Kitayama, I suspect the reason
for the slab shrinker blocking may not be that we do a lot of operations in
destroy_inode(); the real reason may be that we pass GFP_KERNEL into
kzalloc() (in cache_block_group()), which makes the slab shrinkers hang.
(I haven't triggered OOM by running stress.sh so far, so I can't pinpoint
this bug.)

Thanks
Miao

 

 I'll read more today.

 -chris

 
 --
 To unsubscribe from this list: send the line unsubscribe linux-btrfs in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 



Re: [PATCH V4] btrfs: implement delayed inode items operation

2011-03-23 Thread Itaru Kitayama
On Wed, 23 Mar 2011 17:47:01 +0800
Miao Xie mi...@cn.fujitsu.com wrote:

 we found that GFP_KERNEL was passed into kzalloc(); I think this flag
 triggers the above lockdep warning. The attached patch, which applies on
 top of the delayed items operation patch, may fix this problem. Could you
 test it for me?

The possible irq lock inversion dependency warning seems to go away.   

itaru   


Re: [PATCH V4] btrfs: implement delayed inode items operation

2011-03-22 Thread Itaru Kitayama
Hi Miao,

On Tue, 22 Mar 2011 18:03:24 +0800
Miao Xie mi...@cn.fujitsu.com wrote:

 The V5 patch is attached, could you test it for me? I have run Chris's
 stress test and the dbench benchmark on my machine; it works well. I want
 to check whether the above bug still exists.
 
 Thanks
 Miao

Here's the boot log (the V5 patch is on top of the btrfs-unstable next-rc branch):

BUG: unable to handle kernel NULL pointer dereference at 00c8
IP: [81075c12] __lock_acquire+0x98/0xda6
PGD 7074c067 PUD 75b6d067 PMD 0 
Oops:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
last sysfs file: /sys/devices/virtual/bdi/btrfs-1/uevent
CPU 0 
Modules linked in: mptspi mptscsih mptbase scsi_transport_spi

Pid: 1, comm: init Not tainted 2.6.36-xie+ #120 440BX Desktop Reference Platform/VMware Virtual Platform
RIP: 0010:[81075c12]  [81075c12] __lock_acquire+0x98/0xda6
RSP: 0018:88007ceafb58  EFLAGS: 00010097
RAX: 0046 RBX: 88007ceb0050 RCX: 
RDX:  RSI:  RDI: 00c8
RBP: 88007ceafc18 R08: 0002 R09: 
R10:  R11: 88007ceafd48 R12: 88007ceb0050
R13: 00c8 R14: 0002 R15: 
FS:  7f1fe74e7720() GS:880004e0() knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 00c8 CR3: 70486000 CR4: 06f0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process init (pid: 1, threadinfo 88007ceae000, task 88007ceb0050)
Stack:
 0002 0001 88007ceafb78 810bdd5d
 8800  88007ceafbc8 0046
 88007ceafba8 88007ceb0050 81067dcd 0001
Call Trace:
 [810bdd5d] ? time_hardirqs_off+0x2e/0x36
 [81067dcd] ? local_clock+0x2a/0x3b
 [8121294b] ? btrfs_get_delayed_items+0x37/0xb6
 [81076a3d] lock_acquire+0x11d/0x143
 [8121294b] ? btrfs_get_delayed_items+0x37/0xb6
 [8121294b] ? btrfs_get_delayed_items+0x37/0xb6
 [814c62b1] __mutex_lock_common+0x5a/0x444
 [8121294b] ? btrfs_get_delayed_items+0x37/0xb6
 [8110ba5f] ? virt_to_head_page+0xe/0x31
 [8110dbbc] ? cache_alloc_debugcheck_after+0x176/0x1cd
 [814c6750] mutex_lock_nested+0x39/0x3e
 [8121294b] btrfs_get_delayed_items+0x37/0xb6
 [811d7e01] btrfs_real_readdir+0x170/0x52c
 [8112b739] ? vfs_readdir+0x56/0xb6
 [8112b559] ? filldir+0x0/0xd0
 [810bdfc0] ? trace_preempt_on+0x24/0x26
 [814cb4e8] ? sub_preempt_count+0xa3/0xb6
 [814c6645] ? __mutex_lock_common+0x3ee/0x444
 [8112b739] ? vfs_readdir+0x56/0xb6
 [8111e435] ? rcu_read_unlock+0x0/0x4b
 [8112b559] ? filldir+0x0/0xd0
 [8112b559] ? filldir+0x0/0xd0
 [8112b75c] vfs_readdir+0x79/0xb6
 [8112b8e9] sys_getdents+0x85/0xd8
 [81002ddb] system_call_fastpath+0x16/0x1b
Code: 58 e2 8b 01 00 be b3 0a 00 00 0f 85 c4 0c 00 00 e9 4d 0c 00 00 83 fe 07 
76 11 e8 7e 55 1f 00 48 c7 c7 55 85 78 81 e9 78 0c 00 00 49 8 
RIP  [81075c12] __lock_acquire+0x98/0xda6
 RSP 88007ceafb58
CR2: 00c8
---[ end trace 87f215b3568d553d ]---



Re: [PATCH V4] btrfs: implement delayed inode items operation

2011-03-22 Thread Miao Xie
On Mon, 21 Mar 2011 08:08:17 -0400, Chris Mason wrote:
 I also think that code is racing with the code that frees delayed nodes,
 but haven't yet triggered my debugging printks to prove either one.

 We free delayed nodes when we want to destroy the inode. At that time only
 one task, the one destroying the inode, can access the delayed node, so I
 think ACCESS_ONCE() is enough. What do you think?
 
 Great, I see what you mean.  The bigger problem right now is that we may do
 a lot of operations in destroy_inode(), which can block the slab
 shrinkers on our metadata operations.  That same stress.sh -n 50 run is
 running into OOM.
 
 So we need to rework the part where the final free is done.  We could
 keep a ref on the inode until the delayed items are complete, or we
 could let the inode go and make a way to lookup the delayed node when
 the inode is read.

I find the slab shrinkers just reclaim inodes which are in the inode_unused
list, so if we free delayed nodes before moving the inode into the inode_unused
list, we can avoid blocking the slab shrinker. Thus maybe we can fix the above
problem by moving btrfs_remove_delayed_node() from btrfs_destroy_inode()
to btrfs_drop_inode(). What do you think?

Thanks
Miao

 
 I'll read more today.
 
 -chris
 



Re: [PATCH V4] btrfs: implement delayed inode items operation

2011-03-22 Thread Itaru Kitayama
Hi Miao,

The possible circular locking dependency message doesn't show up in the updated
V5. However, I see a new possible irq lock inversion dependency message while
running xfstests.

=
[ INFO: possible irq lock inversion dependency detected ]
2.6.36-xie+ #122
-
kswapd0/49 just changed the state of lock:
 (iprune_sem){.-}, at: [811316d0] shrink_icache_memory+0x4d/0x213
but this lock took another, RECLAIM_FS-unsafe lock in the past:
 (&delayed_node->mutex){+.+.+.}

and interrupts could create inverse lock ordering between them.


other info that might help us debug this:
1 lock held by kswapd0/49:
 #0:  (shrinker_rwsem){..}, at: [810e242a] shrink_slab+0x3d/0x164

the shortest dependencies between 2nd lock and 1st lock:
 -> (&delayed_node->mutex){+.+.+.} ops: 1703807 {
HARDIRQ-ON-W at:
  [81075ec0] __lock_acquire+0x346/0xda6
  [81076a3d] lock_acquire+0x11d/0x143
  [814c62b1] __mutex_lock_common+0x5a/0x444
  [814c6750] mutex_lock_nested+0x39/0x3e
  [81211fd4] btrfs_delayed_update_inode+0x45/0x101
  [811dc5c3] btrfs_update_inode+0x2e/0x129
  [811e0cba] btrfs_truncate+0x43d/0x477
  [810dfb22] vmtruncate+0x44/0x52
  [811e0ef6] btrfs_setattr+0x202/0x253
  [8113201e] notify_change+0x1a2/0x29d
  [8111bf08] do_truncate+0x6c/0x89
  [81127a77] do_last+0x579/0x57e
  [81129502] do_filp_open+0x215/0x5ae
  [8111aec0] do_sys_open+0x60/0xfc
  [8111af8f] sys_open+0x20/0x22
  [81002ddb] system_call_fastpath+0x16/0x1b
SOFTIRQ-ON-W at:
  [81075ee1] __lock_acquire+0x367/0xda6
  [81076a3d] lock_acquire+0x11d/0x143
  [814c62b1] __mutex_lock_common+0x5a/0x444
  [814c6750] mutex_lock_nested+0x39/0x3e
  [81211fd4] btrfs_delayed_update_inode+0x45/0x101
  [811dc5c3] btrfs_update_inode+0x2e/0x129
  [811e0cba] btrfs_truncate+0x43d/0x477
  [810dfb22] vmtruncate+0x44/0x52
  [811e0ef6] btrfs_setattr+0x202/0x253
  [8113201e] notify_change+0x1a2/0x29d
  [8111bf08] do_truncate+0x6c/0x89
  [81127a77] do_last+0x579/0x57e
  [81129502] do_filp_open+0x215/0x5ae
  [8111aec0] do_sys_open+0x60/0xfc
  [8111af8f] sys_open+0x20/0x22
  [81002ddb] system_call_fastpath+0x16/0x1b
RECLAIM_FS-ON-W at:
 [81074292] mark_held_locks+0x52/0x70
 [81074354] lockdep_trace_alloc+0xa4/0xc2
 [8110f206] __kmalloc+0x7f/0x154
 [811c2c21] kzalloc+0x14/0x16
 [811c5e83] cache_block_group+0x106/0x238
 [811c7069] find_free_extent+0x4e2/0xa86
 [811c76c1] btrfs_reserve_extent+0xb4/0x142
 [811c78b6] btrfs_alloc_free_block+0x167/0x2af
 [811be610] __btrfs_cow_block+0x103/0x346
 [811bedb8] btrfs_cow_block+0x101/0x110
 [811c05d8] btrfs_search_slot+0x143/0x513
 [811cf5ab] btrfs_lookup_inode+0x2f/0x8f
 [81212405] btrfs_update_delayed_inode+0x75/0x135
 [8121340d] btrfs_run_delayed_items+0xd6/0x131
 [811d64d7] btrfs_commit_transaction+0x28b/0x668
 [811ba786] btrfs_sync_fs+0x6b/0x70
 [81140265] __sync_filesystem+0x6b/0x83
 [81140347] sync_filesystem+0x4c/0x50
 [8111fc56] generic_shutdown_super+0x27/0xd7
 [8111fd5b] kill_anon_super+0x16/0x54
 [8111effd] deactivate_locked_super+0x26/0x46
 [8111f495] deactivate_super+0x45/0x4a
 

Re: [PATCH V4] btrfs: implement delayed inode items operation

2011-03-22 Thread Miao Xie
On Wed, 23 Mar 2011 12:24:09 +0900, Itaru Kitayama wrote:
 Hi Miao,
 
 The possible circular locking dependency message doesn't show up in the
 updated V5. However, I see a new possible irq lock inversion dependency
 message while running xfstests.

I am testing the new version, in which I fixed the slab shrinker problem
reported by Chris. In the new version, the delayed node is removed before the
corresponding inode is moved into the unused_inode list (the slab shrinker
reclaims inodes from this list). Maybe this method can also fix this bug. So
could you tell me the reproduction steps or the mkfs and mount options? I will
check whether the new patch fixes this bug.

Thanks
Miao

 

Re: [PATCH V4] btrfs: implement delayed inode items operation

2011-03-22 Thread Itaru Kitayama
On Wed, 23 Mar 2011 12:00:38 +0800
Miao Xie mi...@cn.fujitsu.com wrote:

 I am testing the new version, in which I fixed the slab shrinker problem
 reported by Chris. In the new version, the delayed node is removed before
 the corresponding inode is moved into the unused_inode list (the slab
 shrinker reclaims inodes from this list). Maybe this method can also fix
 this bug. So could you tell me the reproduction steps or the mkfs and
 mount options? I will check whether the new patch fixes this bug.

I used the default mkfs options for $TEST_DEV and enabled the space_cache and
compress=lzo options when mounting the partition.

itaru


Re: [PATCH V4] btrfs: implement delayed inode items operation

2011-03-21 Thread Chris Mason
Excerpts from Miao Xie's message of 2011-03-21 01:05:22 -0400:
 On sun, 20 Mar 2011 20:33:34 -0400, Chris Mason wrote:
  Excerpts from Miao Xie's message of 2011-03-18 05:24:46 -0400:
  Changelog V3 -> V4:
  - Fix nested lock, which is reported by Itaru Kitayama, by updating space
    cache inodes in time.
  
  I ran some tests on this and had trouble with my stress.sh script:
  
  http://oss.oracle.com/~mason/stress.sh
  
  I used:
  
  stress.sh -n 50 -c <path to linux kernel git tree> /mnt
  
  The git tree has all the .git files but no .o files.
  
  The problem was that within about 20 minutes, the filesystem was
  spending almost all of its time in balance_dirty_pages().  The problem
  is that data writeback isn't complete until the endio handlers have
  finished inserting metadata into the btree.
  
  The v4 patch calls btrfs_btree_balance_dirty() from all the
  btrfs_end_transaction variants, which means that the FS writeback code
  waits for balance_dirty_pages(), which won't make progress until the FS
  writeback code is done.
  
  So I changed things to call the delayed inode balance function only from
  inside btrfs_btree_balance_dirty(), which did resolve the stalls.  But
 
 Ok, but can we invoke the delayed inode balance function before
 balance_dirty_pages_ratelimited_nr()? The delayed item insertions and
 deletions also produce dirty pages.

Yes, good point.

 
  I found a few times that when I did rmmod btrfs, there would be delayed
  inode objects leaked in the slab cache.  rmmod will try to destroy the
  slab cache, which will fail because we haven't freed everything.
  
  It looks like we have a race in btrfs_get_or_create_delayed_node, where
  two concurrent callers can both create delayed nodes and then race on
  adding it to the inode.
 
 Sorry for my mistake. I thought we originally updated the inodes while
 holding i_mutex, so I didn't use any lock or other method to protect the
 delayed_node of the inodes.
 
 But I think we needn't use the RCU lock to protect delayed_node when we
 want to get the delayed node, because we won't change it after it is
 created; cmpxchg() and ACCESS_ONCE() can protect it well. What do you think?
 
 PS: I worry about updating the inode without holding i_mutex.

We have the tree locks to make sure we're serialized while we actually
change the tree.  The only places that go in without locking are times
updates.

 
  I also think that code is racing with the code that frees delayed nodes,
  but haven't yet triggered my debugging printks to prove either one.
 
 We free delayed nodes when we want to destroy the inode. At that time only
 one task, the one destroying the inode, can access the delayed node, so I
 think ACCESS_ONCE() is enough. What do you think?

Great, I see what you mean.  The bigger problem right now is that we may do
a lot of operations in destroy_inode(), which can block the slab
shrinkers on our metadata operations.  That same stress.sh -n 50 run is
running into OOM.

So we need to rework the part where the final free is done.  We could
keep a ref on the inode until the delayed items are complete, or we
could let the inode go and make a way to lookup the delayed node when
the inode is read.

I'll read more today.

-chris


Re: [PATCH V4] btrfs: implement delayed inode items operation

2011-03-21 Thread Itaru Kitayama
Hi Miao,

Here is an excerpt of the kernel boot log with the V4 patch applied:

===
[ INFO: possible circular locking dependency detected ]
2.6.36-xie+ #117
---
vgs/1210 is trying to acquire lock:
 (&delayed_node->mutex){+.+...}, at: [8121184b] btrfs_delayed_update_inode+0x45/0x101

but task is already holding lock:
 (mm->mmap_sem){++}, at: [810f6512] sys_mmap_pgoff+0xd6/0x12e

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #1 (mm->mmap_sem){++}:
   [81076a3d] lock_acquire+0x11d/0x143
   [810edc79] might_fault+0x95/0xb8
   [8112b5ce] filldir+0x75/0xd0
   [811d77f8] btrfs_real_readdir+0x3d7/0x528
   [8112b75c] vfs_readdir+0x79/0xb6
   [8112b8e9] sys_getdents+0x85/0xd8
   [81002ddb] system_call_fastpath+0x16/0x1b

-> #0 (&delayed_node->mutex){+.+...}:
   [81076612] __lock_acquire+0xa98/0xda6
   [81076a3d] lock_acquire+0x11d/0x143
   [814c38b1] __mutex_lock_common+0x5a/0x444
   [814c3d50] mutex_lock_nested+0x39/0x3e
   [8121184b] btrfs_delayed_update_inode+0x45/0x101
   [811dbd4f] btrfs_update_inode+0x2e/0x129
   [811de008] btrfs_dirty_inode+0x57/0x113
   [8113c2a5] __mark_inode_dirty+0x33/0x1aa
   [81130939] touch_atime+0x107/0x12a
   [811e15b2] btrfs_file_mmap+0x3e/0x57
   [810f5f40] mmap_region+0x2bb/0x4c4
   [810f63d9] do_mmap_pgoff+0x290/0x2f3
   [810f6532] sys_mmap_pgoff+0xf6/0x12e
   [81006e9a] sys_mmap+0x22/0x24
   [81002ddb] system_call_fastpath+0x16/0x1b

other info that might help us debug this:

1 lock held by vgs/1210:
 #0:  (mm->mmap_sem){++}, at: [810f6512] sys_mmap_pgoff+0xd6/0x12e

stack backtrace:
Pid: 1210, comm: vgs Not tainted 2.6.36-xie+ #117
Call Trace:
 [81074c15] print_circular_bug+0xaf/0xbd
 [81076612] __lock_acquire+0xa98/0xda6
 [8121184b] ? btrfs_delayed_update_inode+0x45/0x101
 [81076a3d] lock_acquire+0x11d/0x143
 [8121184b] ? btrfs_delayed_update_inode+0x45/0x101
 [8121184b] ? btrfs_delayed_update_inode+0x45/0x101
 [814c38b1] __mutex_lock_common+0x5a/0x444
 [8121184b] ? btrfs_delayed_update_inode+0x45/0x101
 [8107162f] ? debug_mutex_init+0x31/0x3c
 [814c3d50] mutex_lock_nested+0x39/0x3e
 [8121184b] btrfs_delayed_update_inode+0x45/0x101
 [814c36c6] ? __mutex_unlock_slowpath+0x129/0x13a
 [811dbd4f] btrfs_update_inode+0x2e/0x129
 [811de008] btrfs_dirty_inode+0x57/0x113
 [8113c2a5] __mark_inode_dirty+0x33/0x1aa
 [81130939] touch_atime+0x107/0x12a
 [811e15b2] btrfs_file_mmap+0x3e/0x57
 [810f5f40] mmap_region+0x2bb/0x4c4
 [81229f10] ? file_map_prot_check+0x9a/0xa3
 [810f63d9] do_mmap_pgoff+0x290/0x2f3
 [810f6512] ? sys_mmap_pgoff+0xd6/0x12e
 [810f6532] sys_mmap_pgoff+0xf6/0x12e
 [814c4b75] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [81006e9a] sys_mmap+0x22/0x24
 [81002ddb] system_call_fastpath+0x16/0x1b

As the corresponding delayed node mutex is taken in btrfs_real_readdir, that
seems deadlock-prone. vfs_readdir holds i_mutex; I wonder if we can execute
btrfs_readdir_delayed_dir_index without taking the node lock.

 


Re: [PATCH V4] btrfs: implement delayed inode items operation

2011-03-21 Thread Miao Xie
On Tue, 22 Mar 2011 11:33:10 +0900, Itaru Kitayama wrote:
 Here is an excerpt of the kernel boot log with the V4 patch applied:
 
 ===
 [ INFO: possible circular locking dependency detected ]
 2.6.36-xie+ #117
 ---
 vgs/1210 is trying to acquire lock:
  (delayed_node-mutex){+.+...}, at: [8121184b] 
 btrfs_delayed_update_inode+0x45/0x101
 
 but task is already holding lock:
  (mm-mmap_sem){++}, at: [810f6512] sys_mmap_pgoff+0xd6/0x12e
 
 which lock already depends on the new lock.
 
 
 the existing dependency chain (in reverse order) is:
 
 - #1 (mm-mmap_sem){++}:
[81076a3d] lock_acquire+0x11d/0x143
[810edc79] might_fault+0x95/0xb8
[8112b5ce] filldir+0x75/0xd0
[811d77f8] btrfs_real_readdir+0x3d7/0x528
[8112b75c] vfs_readdir+0x79/0xb6
[8112b8e9] sys_getdents+0x85/0xd8
[81002ddb] system_call_fastpath+0x16/0x1b
 
 - #0 (delayed_node-mutex){+.+...}:
[81076612] __lock_acquire+0xa98/0xda6
[81076a3d] lock_acquire+0x11d/0x143
[814c38b1] __mutex_lock_common+0x5a/0x444
[814c3d50] mutex_lock_nested+0x39/0x3e
[8121184b] btrfs_delayed_update_inode+0x45/0x101
[811dbd4f] btrfs_update_inode+0x2e/0x129
[811de008] btrfs_dirty_inode+0x57/0x113
[8113c2a5] __mark_inode_dirty+0x33/0x1aa
[81130939] touch_atime+0x107/0x12a
[811e15b2] btrfs_file_mmap+0x3e/0x57
[810f5f40] mmap_region+0x2bb/0x4c4
[810f63d9] do_mmap_pgoff+0x290/0x2f3
[810f6532] sys_mmap_pgoff+0xf6/0x12e
[81006e9a] sys_mmap+0x22/0x24
[81002ddb] system_call_fastpath+0x16/0x1b
 
 other info that might help us debug this:
 
 1 lock held by vgs/1210:
  #0:  (mm-mmap_sem){++}, at: [810f6512] 
 sys_mmap_pgoff+0xd6/0x12e
 
 stack backtrace:
 Pid: 1210, comm: vgs Not tainted 2.6.36-xie+ #117
 Call Trace:
  [81074c15] print_circular_bug+0xaf/0xbd
  [81076612] __lock_acquire+0xa98/0xda6
  [8121184b] ? btrfs_delayed_update_inode+0x45/0x101
  [81076a3d] lock_acquire+0x11d/0x143
  [8121184b] ? btrfs_delayed_update_inode+0x45/0x101
  [8121184b] ? btrfs_delayed_update_inode+0x45/0x101
  [814c38b1] __mutex_lock_common+0x5a/0x444
  [8121184b] ? btrfs_delayed_update_inode+0x45/0x101
  [8107162f] ? debug_mutex_init+0x31/0x3c
  [814c3d50] mutex_lock_nested+0x39/0x3e
  [8121184b] btrfs_delayed_update_inode+0x45/0x101
  [814c36c6] ? __mutex_unlock_slowpath+0x129/0x13a
  [811dbd4f] btrfs_update_inode+0x2e/0x129
  [811de008] btrfs_dirty_inode+0x57/0x113
  [8113c2a5] __mark_inode_dirty+0x33/0x1aa
  [81130939] touch_atime+0x107/0x12a
  [811e15b2] btrfs_file_mmap+0x3e/0x57
  [810f5f40] mmap_region+0x2bb/0x4c4
  [81229f10] ? file_map_prot_check+0x9a/0xa3
  [810f63d9] do_mmap_pgoff+0x290/0x2f3
  [810f6512] ? sys_mmap_pgoff+0xd6/0x12e
  [810f6532] sys_mmap_pgoff+0xf6/0x12e
  [814c4b75] ? trace_hardirqs_on_thunk+0x3a/0x3f
  [81006e9a] sys_mmap+0x22/0x24
  [81002ddb] system_call_fastpath+0x16/0x1b
 
 As the corresponding delayed node mutex is taken in btrfs_real_readdir,
 that seems deadlock-prone. vfs_readdir holds i_mutex; I wonder if we can
 execute btrfs_readdir_delayed_dir_index without taking the node lock.

We can't fix it that way, because the worker threads may do insertions or
deletions at the same time, and we could lose some directory items.

Maybe we can fix it by adding a reference count to the delayed directory
items; then we can do readdir like this:
1. hold the node lock
2. increase the directory items' reference counts and put all the directory
   items into a list
3. release the node lock
4. read the directory
5. decrease the directory items' reference counts and free them when the
   count reaches zero
What do you think?

Thanks
Miao

 
  


Re: [PATCH V4] btrfs: implement delayed inode items operation

2011-03-21 Thread Itaru Kitayama
On Tue, 22 Mar 2011 11:12:37 +0800
Miao Xie mi...@cn.fujitsu.com wrote:

 We can't fix it that way, because the worker threads may do insertions or
 deletions at the same time, and we could lose some directory items.

Ok.
 
 Maybe we can fix it by adding a reference count to the delayed directory
 items; then we can do readdir like this:
 1. hold the node lock
 2. increase the directory items' reference counts and put all the directory
    items into a list
 3. release the node lock
 4. read the directory
 5. decrease the directory items' reference counts and free them when the
    count reaches zero
 What do you think?

Sounds doable to me.

itaru


Re: [PATCH V4] btrfs: implement delayed inode items operation

2011-03-20 Thread Chris Mason
Excerpts from Miao Xie's message of 2011-03-18 05:24:46 -0400:
 Changelog V3 -> V4:
 - Fix nested lock, which is reported by Itaru Kitayama, by updating space
   cache inodes in time.

I ran some tests on this and had trouble with my stress.sh script:

http://oss.oracle.com/~mason/stress.sh

I used:

stress.sh -n 50 -c <path to linux kernel git tree> /mnt

The git tree has all the .git files but no .o files.

The problem was that within about 20 minutes, the filesystem was
spending almost all of its time in balance_dirty_pages().  The problem
is that data writeback isn't complete until the endio handlers have
finished inserting metadata into the btree.

The v4 patch calls btrfs_btree_balance_dirty() from all the
btrfs_end_transaction variants, which means that the FS writeback code
waits for balance_dirty_pages(), which won't make progress until the FS
writeback code is done.

So I changed things to call the delayed inode balance function only from
inside btrfs_btree_balance_dirty(), which did resolve the stalls.  But
I found a few times that when I did rmmod btrfs, there would be delayed
inode objects leaked in the slab cache.  rmmod will try to destroy the
slab cache, which will fail because we haven't freed everything.

It looks like we have a race in btrfs_get_or_create_delayed_node, where
two concurrent callers can both create delayed nodes and then race on
adding it to the inode.

I also think that code is racing with the code that frees delayed nodes,
but haven't yet triggered my debugging printks to prove either one.

My current incremental patch is below, please take a look:

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 727d9a8..02e9390 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -32,7 +32,8 @@ int __init btrfs_delayed_inode_init(void)
 	delayed_node_cache = kmem_cache_create("delayed_node",
 					sizeof(struct btrfs_delayed_node),
 					0,
-					SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
+					SLAB_RECLAIM_ACCOUNT |
+					SLAB_MEM_SPREAD | SLAB_DESTROY_BY_RCU,
 					NULL);
 	if (!delayed_node_cache)
 		return -ENOMEM;
@@ -90,22 +91,35 @@ static struct btrfs_delayed_node *btrfs_get_or_create_delayed_node(
 	struct btrfs_inode *btrfs_inode = BTRFS_I(inode);
 	struct btrfs_root *root = btrfs_inode->root;
 
-	if (btrfs_inode->delayed_node) {
-		node = btrfs_inode->delayed_node;
-		atomic_inc(&node->refs);	/* can be accessed */
-		return node;
+again:
+	rcu_read_lock();
+again_rcu:
+	node = btrfs_inode->delayed_node;
+	if (node) {
+		if (atomic_inc_not_zero(&node->refs)) {
+			rcu_read_unlock();
+			return node;
+		}
+printk("racing on node access!\n");
+		goto again_rcu;
 	}
+	rcu_read_unlock();
 
 	node = kmem_cache_alloc(delayed_node_cache, GFP_NOFS);
 	if (!node)
 		return ERR_PTR(-ENOMEM);
 	btrfs_init_delayed_node(node, root->objectid, inode->i_ino);
 
-	btrfs_inode->delayed_node = node;
 	node->delayed_root = btrfs_get_delayed_root(root);
 	atomic_inc(&node->refs);	/* cached in the btrfs inode */
 	atomic_inc(&node->refs);	/* can be accessed */
 
+	if (cmpxchg(&BTRFS_I(inode)->delayed_node, NULL, node)) {
+		kmem_cache_free(delayed_node_cache, node);
+printk("racing on new node insertion!\n");
+		goto again;
+	}
+
 	return node;
 }
 
@@ -1167,7 +1181,7 @@ static void btrfs_async_run_delayed_node_done(struct btrfs_work *work)
 	nr = trans->blocks_used;
 
 	btrfs_end_transaction_dmeta(trans, root);
-	btrfs_btree_balance_dirty(root, nr);
+	__btrfs_btree_balance_dirty(root, nr);
 free_path:
 	btrfs_free_path(path);
 out:
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0f589b1..0c6c336 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2649,6 +2649,28 @@ void btrfs_btree_balance_dirty(struct btrfs_root *root, unsigned long nr)
 		balance_dirty_pages_ratelimited_nr(
 				root->fs_info->btree_inode->i_mapping, 1);
 	}
+	btrfs_balance_delayed_items(root);
+	return;
+}
+
+void __btrfs_btree_balance_dirty(struct btrfs_root *root, unsigned long nr)
+{
+	/*
+	 * looks as though older kernels can get into trouble with
+	 * this code, they end up stuck in balance_dirty_pages forever
+	 */
+	u64 num_dirty;
+	unsigned long thresh = 32 * 1024 * 1024;
+
+	if (current->flags & PF_MEMALLOC)
+		return;
+
+	num_dirty = root->fs_info->dirty_metadata_bytes;
+
+	if (num_dirty > thresh) {
+   

Re: [PATCH V4] btrfs: implement delayed inode items operation

2011-03-20 Thread Miao Xie
On Sun, 20 Mar 2011 20:33:34 -0400, Chris Mason wrote:
 Excerpts from Miao Xie's message of 2011-03-18 05:24:46 -0400:
 Changelog V3 -> V4:
 - Fix nested lock, which is reported by Itaru Kitayama, by updating space
   cache inodes in time.
 
 I ran some tests on this and had trouble with my stress.sh script:
 
 http://oss.oracle.com/~mason/stress.sh
 
 I used:
 
 stress.sh -n 50 -c <path to linux kernel git tree> /mnt
 
 The git tree has all the .git files but no .o files.
 
 The problem was that within about 20 minutes, the filesystem was
 spending almost all of its time in balance_dirty_pages().  The problem
 is that data writeback isn't complete until the endio handlers have
 finished inserting metadata into the btree.
 
 The v4 patch calls btrfs_btree_balance_dirty() from all the
 btrfs_end_transaction variants, which means that the FS writeback code
 waits for balance_dirty_pages(), which won't make progress until the FS
 writeback code is done.
 
 So I changed things to call the delayed inode balance function only from
 inside btrfs_btree_balance_dirty(), which did resolve the stalls.  But

OK, but can we invoke the delayed inode balance function before
balance_dirty_pages_ratelimited_nr()? The delayed item insertions and
deletions also produce dirty pages.

 I found a few times that when I did rmmod btrfs, there would be delayed
 inode objects leaked in the slab cache.  rmmod will try to destroy the
 slab cache, which will fail because we haven't freed everything.
 
 It looks like we have a race in btrfs_get_or_create_delayed_node, where
 two concurrent callers can both create delayed nodes and then race on
 adding it to the inode.

Sorry, my mistake: I originally thought we updated the inodes while holding
i_mutex, so I didn't use a lock or any other method to protect the
delayed_node of the inodes.

But I don't think we need the RCU lock to protect delayed_node when we get
the delayed node, because we never change it after it is created; cmpxchg()
and ACCESS_ONCE() can protect it well. What do you think?

PS: I am worried about updating the inode without holding i_mutex.

 I also think that code is racing with the code that frees delayed nodes,
 but haven't yet triggered my debugging printks to prove either one.

We free the delayed nodes when we destroy the inode; at that time only one
task, the one destroying the inode, can access the delayed nodes, so I think
ACCESS_ONCE() is enough. What do you think?

Thanks
Miao

 