Re: [PATCH V4] btrfs: implement delayed inode items operation
Hi, Kitayama-san

On Wed, 23 Mar 2011 13:19:18 +0900, Itaru Kitayama wrote:
> On Wed, 23 Mar 2011 12:00:38 +0800 Miao Xie <mi...@cn.fujitsu.com> wrote:
>> I am testing the new version, in which I fixed the slab shrinker problem reported by Chris. In the new version, the delayed node is removed before the relative inode is moved into the unused_inode list (the slab shrinker reclaims inodes from this list). Maybe this method can also fix this bug. So could you tell me the steps to reproduce it, or the mkfs and mount options? I will check whether the new patch fixes this bug.
> I used the default mkfs options for $TEST_DEV and enabled the space_cache and compress=lzo options when mounting the partition.

Unfortunately, I can't trigger this warning myself, but by analyzing the following information,

=========================================================
[ INFO: possible irq lock inversion dependency detected ]
2.6.36-xie+ #122
---------------------------------------------------------
kswapd0/49 just changed the state of lock:
 (iprune_sem){.-}, at: [811316d0] shrink_icache_memory+0x4d/0x213
but this lock took another, RECLAIM_FS-unsafe lock in the past:
 (delayed_node->mutex){+.+.+.}

and interrupts could create inverse lock ordering between them.
[SNIP]

RECLAIM_FS-ON-W at:
 [81074292] mark_held_locks+0x52/0x70
 [81074354] lockdep_trace_alloc+0xa4/0xc2
 [8110f206] __kmalloc+0x7f/0x154
 [811c2c21] kzalloc+0x14/0x16
 [811c5e83] cache_block_group+0x106/0x238
 [811c7069] find_free_extent+0x4e2/0xa86
 [811c76c1] btrfs_reserve_extent+0xb4/0x142
 [811c78b6] btrfs_alloc_free_block+0x167/0x2af
 [811be610] __btrfs_cow_block+0x103/0x346
 [811bedb8] btrfs_cow_block+0x101/0x110
 [811c05d8] btrfs_search_slot+0x143/0x513
 [811cf5ab] btrfs_lookup_inode+0x2f/0x8f
 [81212405] btrfs_update_delayed_inode+0x75/0x135
 [8121340d] btrfs_run_delayed_items+0xd6/0x131
 [811d64d7] btrfs_commit_transaction+0x28b/0x668
 [811ba786] btrfs_sync_fs+0x6b/0x70
 [81140265] __sync_filesystem+0x6b/0x83
 [81140347] sync_filesystem+0x4c/0x50
 [8111fc56] generic_shutdown_super+0x27/0xd7
 [8111fd5b] kill_anon_super+0x16/0x54
 [8111effd] deactivate_locked_super+0x26/0x46
 [8111f495] deactivate_super+0x45/0x4a
 [81135962] mntput_no_expire+0xd6/0x104
 [81136a87] sys_umount+0x2c1/0x2ec
 [81002ddb] system_call_fastpath+0x16/0x1b

we found that GFP_KERNEL was passed into kzalloc(); I think this flag triggers the above lockdep warning. The attached patch, which applies against the delayed items operation patch, may fix this problem. Could you test it for me?

Thanks
Miao

From f84daee1d2060beae945a2774cda7af2ef7f3e61 Mon Sep 17 00:00:00 2001
From: Miao Xie <mi...@cn.fujitsu.com>
Date: Wed, 23 Mar 2011 16:01:16 +0800
Subject: [PATCH] btrfs: use GFP_NOFS instead of GFP_KERNEL

In the filesystem context, we must allocate memory with GFP_NOFS; otherwise we may start another filesystem operation and trigger a lockdep warning.
Signed-off-by: Miao Xie <mi...@cn.fujitsu.com>
---
 fs/btrfs/extent-tree.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f1db57d..fe50cff 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -471,7 +471,7 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,
 	if (load_cache_only)
 		return 0;
 
-	caching_ctl = kzalloc(sizeof(*caching_ctl), GFP_KERNEL);
+	caching_ctl = kzalloc(sizeof(*caching_ctl), GFP_NOFS);
 	BUG_ON(!caching_ctl);
 
 	INIT_LIST_HEAD(&caching_ctl->list);
-- 
1.7.3.1
Re: [PATCH V4] btrfs: implement delayed inode items operation
On Wed, 23 Mar 2011 09:57:56 +0800, Miao Xie wrote:
> On Mon, 21 Mar 2011 08:08:17 -0400, Chris Mason wrote:
>>>> I also think that code is racing with the code that frees delayed nodes, but haven't yet triggered my debugging printks to prove either one.
>>> We free delayed nodes when we want to destroy the inode. At that time only one task, the one destroying the inode, can access the delayed nodes, so I think ACCESS_ONCE() is enough. What do you think?
>> Great, I see what you mean. The bigger problem right now is that we may do a lot of operations in destroy_inode(), which can block the slab shrinkers on our metadata operations. That same stress.sh -n 50 run is running into OOM. So we need to rework the part where the final free is done. We could keep a ref on the inode until the delayed items are complete, or we could let the inode go and make a way to look up the delayed node when the inode is read.
> I find the slab shrinkers just reclaim inodes which are in the inode_unused list, so if we free delayed nodes before moving the inode into the inode_unused list, we can avoid blocking the slab shrinker. Thus maybe we can fix the above problem by moving btrfs_remove_delayed_node() from btrfs_destroy_inode() to btrfs_drop_inode(). What do you think?

This method is not good, because we may do delayed insertion/deletion and free the delayed node when we release the inode, but we may access its delayed node again soon. After reading the lockdep message reported by Kitayama, I guess the reason for the slab shrinker blocking may not be that we do a lot of operations in destroy_inode(); the real reason may be that we pass GFP_KERNEL into kzalloc() (in cache_block_group()), which makes the slab shrinkers hang. (I haven't triggered the OOM by running stress.sh so far, so I can't locate this bug.)

Thanks
Miao

>> I'll read more today.
-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V4] btrfs: implement delayed inode items operation
On Wed, 23 Mar 2011 17:47:01 +0800 Miao Xie <mi...@cn.fujitsu.com> wrote:
> we found that GFP_KERNEL was passed into kzalloc(); I think this flag triggers the above lockdep warning. The attached patch, which applies against the delayed items operation patch, may fix this problem. Could you test it for me?

The possible irq lock inversion dependency warning seems to go away.

itaru
Re: [PATCH V4] btrfs: implement delayed inode items operation
Hi Miao,

On Tue, 22 Mar 2011 18:03:24 +0800 Miao Xie <mi...@cn.fujitsu.com> wrote:
> The V5 patch is attached, could you test it for me? I have run Chris's stress test and the dbench benchmark on my machine, and it works well. I want to check whether the above bug still exists.
> Thanks
> Miao

Here's the boot log (the V5 patch is on top of the btrfs-unstable next-rc branch):

BUG: unable to handle kernel NULL pointer dereference at 00c8
IP: [81075c12] __lock_acquire+0x98/0xda6
PGD 7074c067 PUD 75b6d067 PMD 0
Oops: [#1] PREEMPT SMP DEBUG_PAGEALLOC
last sysfs file: /sys/devices/virtual/bdi/btrfs-1/uevent
CPU 0
Modules linked in: mptspi mptscsih mptbase scsi_transport_spi
Pid: 1, comm: init Not tainted 2.6.36-xie+ #120 440BX Desktop Reference Platform/VMware Virtual Platform
RIP: 0010:[81075c12] [81075c12] __lock_acquire+0x98/0xda6
RSP: 0018:88007ceafb58 EFLAGS: 00010097
RAX: 0046 RBX: 88007ceb0050 RCX: RDX: RSI: RDI: 00c8
RBP: 88007ceafc18 R08: 0002 R09: R10: R11: 88007ceafd48 R12: 88007ceb0050 R13: 00c8 R14: 0002 R15:
FS: 7f1fe74e7720() GS:880004e0() knlGS:
CS: 0010 DS: ES: CR0: 8005003b CR2: 00c8 CR3: 70486000 CR4: 06f0
DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400
Process init (pid: 1, threadinfo 88007ceae000, task 88007ceb0050)
Stack: 0002 0001 88007ceafb78 810bdd5d 0 8800 88007ceafbc8 0046 0 88007ceafba8 88007ceb0050 81067dcd 0001
Call Trace:
 [810bdd5d] ? time_hardirqs_off+0x2e/0x36
 [81067dcd] ? local_clock+0x2a/0x3b
 [8121294b] ? btrfs_get_delayed_items+0x37/0xb6
 [81076a3d] lock_acquire+0x11d/0x143
 [8121294b] ? btrfs_get_delayed_items+0x37/0xb6
 [8121294b] ? btrfs_get_delayed_items+0x37/0xb6
 [814c62b1] __mutex_lock_common+0x5a/0x444
 [8121294b] ? btrfs_get_delayed_items+0x37/0xb6
 [8110ba5f] ? virt_to_head_page+0xe/0x31
 [8110dbbc] ? cache_alloc_debugcheck_after+0x176/0x1cd
 [814c6750] mutex_lock_nested+0x39/0x3e
 [8121294b] btrfs_get_delayed_items+0x37/0xb6
 [811d7e01] btrfs_real_readdir+0x170/0x52c
 [8112b739] ? vfs_readdir+0x56/0xb6
 [8112b559] ? filldir+0x0/0xd0
 [810bdfc0] ? trace_preempt_on+0x24/0x26
 [814cb4e8] ? sub_preempt_count+0xa3/0xb6
 [814c6645] ? __mutex_lock_common+0x3ee/0x444
 [8112b739] ? vfs_readdir+0x56/0xb6
 [8111e435] ? rcu_read_unlock+0x0/0x4b
 [8112b559] ? filldir+0x0/0xd0
 [8112b559] ? filldir+0x0/0xd0
 [8112b75c] vfs_readdir+0x79/0xb6
 [8112b8e9] sys_getdents+0x85/0xd8
 [81002ddb] system_call_fastpath+0x16/0x1b
Code: 58 e2 8b 01 00 be b3 0a 00 00 0f 85 c4 0c 00 00 e9 4d 0c 00 00 83 fe 07 76 11 e8 7e 55 1f 00 48 c7 c7 55 85 78 81 e9 78 0c 00 00 49 8
RIP [81075c12] __lock_acquire+0x98/0xda6
RSP 88007ceafb58
CR2: 00c8
---[ end trace 87f215b3568d553d ]---
Re: [PATCH V4] btrfs: implement delayed inode items operation
> On Mon, 21 Mar 2011 08:08:17 -0400, Chris Mason wrote:
>> I also think that code is racing with the code that frees delayed nodes, but haven't yet triggered my debugging printks to prove either one.
> We free delayed nodes when we want to destroy the inode. At that time only one task, the one destroying the inode, can access the delayed nodes, so I think ACCESS_ONCE() is enough. What do you think?

Great, I see what you mean. The bigger problem right now is that we may do a lot of operations in destroy_inode(), which can block the slab shrinkers on our metadata operations. That same stress.sh -n 50 run is running into OOM. So we need to rework the part where the final free is done. We could keep a ref on the inode until the delayed items are complete, or we could let the inode go and make a way to look up the delayed node when the inode is read.

> I find the slab shrinkers just reclaim inodes which are in the inode_unused list, so if we free delayed nodes before moving the inode into the inode_unused list, we can avoid blocking the slab shrinker. Thus maybe we can fix the above problem by moving btrfs_remove_delayed_node() from btrfs_destroy_inode() to btrfs_drop_inode(). What do you think?
> Thanks
> Miao

I'll read more today.

-chris
Re: [PATCH V4] btrfs: implement delayed inode items operation
Hi Miao,

The possible circular locking dependency message doesn't show up in the updated V5. However, I see a new possible irq lock inversion dependency message while running xfstests.

=========================================================
[ INFO: possible irq lock inversion dependency detected ]
2.6.36-xie+ #122
---------------------------------------------------------
kswapd0/49 just changed the state of lock:
 (iprune_sem){.-}, at: [811316d0] shrink_icache_memory+0x4d/0x213
but this lock took another, RECLAIM_FS-unsafe lock in the past:
 (delayed_node->mutex){+.+.+.}

and interrupts could create inverse lock ordering between them.

other info that might help us debug this:
1 lock held by kswapd0/49:
 #0: (shrinker_rwsem){..}, at: [810e242a] shrink_slab+0x3d/0x164

the shortest dependencies between 2nd lock and 1st lock:
-> (delayed_node->mutex){+.+.+.} ops: 1703807 {
   HARDIRQ-ON-W at:
    [81075ec0] __lock_acquire+0x346/0xda6
    [81076a3d] lock_acquire+0x11d/0x143
    [814c62b1] __mutex_lock_common+0x5a/0x444
    [814c6750] mutex_lock_nested+0x39/0x3e
    [81211fd4] btrfs_delayed_update_inode+0x45/0x101
    [811dc5c3] btrfs_update_inode+0x2e/0x129
    [811e0cba] btrfs_truncate+0x43d/0x477
    [810dfb22] vmtruncate+0x44/0x52
    [811e0ef6] btrfs_setattr+0x202/0x253
    [8113201e] notify_change+0x1a2/0x29d
    [8111bf08] do_truncate+0x6c/0x89
    [81127a77] do_last+0x579/0x57e
    [81129502] do_filp_open+0x215/0x5ae
    [8111aec0] do_sys_open+0x60/0xfc
    [8111af8f] sys_open+0x20/0x22
    [81002ddb] system_call_fastpath+0x16/0x1b
   SOFTIRQ-ON-W at:
    [81075ee1] __lock_acquire+0x367/0xda6
    [81076a3d] lock_acquire+0x11d/0x143
    [814c62b1] __mutex_lock_common+0x5a/0x444
    [814c6750] mutex_lock_nested+0x39/0x3e
    [81211fd4] btrfs_delayed_update_inode+0x45/0x101
    [811dc5c3] btrfs_update_inode+0x2e/0x129
    [811e0cba] btrfs_truncate+0x43d/0x477
    [810dfb22] vmtruncate+0x44/0x52
    [811e0ef6] btrfs_setattr+0x202/0x253
    [8113201e] notify_change+0x1a2/0x29d
    [8111bf08] do_truncate+0x6c/0x89
    [81127a77] do_last+0x579/0x57e
    [81129502] do_filp_open+0x215/0x5ae
    [8111aec0] do_sys_open+0x60/0xfc
    [8111af8f] sys_open+0x20/0x22
    [81002ddb] system_call_fastpath+0x16/0x1b
   RECLAIM_FS-ON-W at:
    [81074292] mark_held_locks+0x52/0x70
    [81074354] lockdep_trace_alloc+0xa4/0xc2
    [8110f206] __kmalloc+0x7f/0x154
    [811c2c21] kzalloc+0x14/0x16
    [811c5e83] cache_block_group+0x106/0x238
    [811c7069] find_free_extent+0x4e2/0xa86
    [811c76c1] btrfs_reserve_extent+0xb4/0x142
    [811c78b6] btrfs_alloc_free_block+0x167/0x2af
    [811be610] __btrfs_cow_block+0x103/0x346
    [811bedb8] btrfs_cow_block+0x101/0x110
    [811c05d8] btrfs_search_slot+0x143/0x513
    [811cf5ab] btrfs_lookup_inode+0x2f/0x8f
    [81212405] btrfs_update_delayed_inode+0x75/0x135
    [8121340d] btrfs_run_delayed_items+0xd6/0x131
    [811d64d7] btrfs_commit_transaction+0x28b/0x668
    [811ba786] btrfs_sync_fs+0x6b/0x70
    [81140265] __sync_filesystem+0x6b/0x83
    [81140347] sync_filesystem+0x4c/0x50
    [8111fc56] generic_shutdown_super+0x27/0xd7
    [8111fd5b] kill_anon_super+0x16/0x54
    [8111effd] deactivate_locked_super+0x26/0x46
    [8111f495] deactivate_super+0x45/0x4a
Re: [PATCH V4] btrfs: implement delayed inode items operation
On Wed, 23 Mar 2011 12:24:09 +0900, Itaru Kitayama wrote:
> Hi Miao,
>
> The possible circular locking dependency message doesn't show up in the updated V5. However, I see a new possible irq lock inversion dependency message while running xfstests.

I am testing the new version, in which I fixed the slab shrinker problem reported by Chris. In the new version, the delayed node is removed before the relative inode is moved into the unused_inode list (the slab shrinker reclaims inodes from this list). Maybe this method can also fix this bug. So could you tell me the steps to reproduce it, or the mkfs and mount options? I will check whether the new patch fixes this bug.

Thanks
Miao

> [SNIP]
Re: [PATCH V4] btrfs: implement delayed inode items operation
On Wed, 23 Mar 2011 12:00:38 +0800 Miao Xie <mi...@cn.fujitsu.com> wrote:
> I am testing the new version, in which I fixed the slab shrinker problem reported by Chris. In the new version, the delayed node is removed before the relative inode is moved into the unused_inode list (the slab shrinker reclaims inodes from this list). Maybe this method can also fix this bug. So could you tell me the steps to reproduce it, or the mkfs and mount options? I will check whether the new patch fixes this bug.

I used the default mkfs options for $TEST_DEV and enabled the space_cache and compress=lzo options when mounting the partition.

itaru
Re: [PATCH V4] btrfs: implement delayed inode items operation
Excerpts from Miao Xie's message of 2011-03-21 01:05:22 -0400:
> On Sun, 20 Mar 2011 20:33:34 -0400, Chris Mason wrote:
>> Excerpts from Miao Xie's message of 2011-03-18 05:24:46 -0400:
>>> Changelog V3 -> V4:
>>> - Fix nested lock, which is reported by Itaru Kitayama, by updating space cache inodes in time.
>> I ran some tests on this and had trouble with my stress.sh script: http://oss.oracle.com/~mason/stress.sh
>> I used: stress.sh -n 50 -c <path to linux kernel git tree> /mnt
>> The git tree has all the .git files but no .o files. The problem was that within about 20 minutes, the filesystem was spending almost all of its time in balance_dirty_pages(). The problem is that data writeback isn't complete until the endio handlers have finished inserting metadata into the btree. The v4 patch calls btrfs_btree_balance_dirty() from all the btrfs_end_transaction variants, which means that the FS writeback code waits for balance_dirty_pages(), which won't make progress until the FS writeback code is done. So I changed things to call the delayed inode balance function only from inside btrfs_btree_balance_dirty(), which did resolve the stalls. But
> Ok, but can we invoke the delayed inode balance function before balance_dirty_pages_ratelimited_nr()? The delayed item insertion and deletion also bring us some dirty pages.

Yes, good point.

>> I found a few times that when I did rmmod btrfs, there would be delayed inode objects leaked in the slab cache. rmmod will try to destroy the slab cache, which will fail because we haven't freed everything. It looks like we have a race in btrfs_get_or_create_delayed_node, where two concurrent callers can both create delayed nodes and then race on adding it to the inode.
> Sorry for my mistake. I thought we updated the inodes while holding i_mutex originally, so I didn't use any lock or other method to protect the delayed_node of the inodes. But I think we needn't use the rcu lock to protect delayed_node when we want to get the delayed node, because we won't change it after it is created; cmpxchg() and ACCESS_ONCE() can protect it well. What do you think?
>
> PS: I worry about the inode update without holding i_mutex.

We have the tree locks to make sure we're serialized while we actually change the tree. The only places that go in without locking are the times updates.

>> I also think that code is racing with the code that frees delayed nodes, but haven't yet triggered my debugging printks to prove either one.
> We free delayed nodes when we want to destroy the inode. At that time only one task, the one destroying the inode, can access the delayed nodes, so I think ACCESS_ONCE() is enough. What do you think?

Great, I see what you mean. The bigger problem right now is that we may do a lot of operations in destroy_inode(), which can block the slab shrinkers on our metadata operations. That same stress.sh -n 50 run is running into OOM. So we need to rework the part where the final free is done. We could keep a ref on the inode until the delayed items are complete, or we could let the inode go and make a way to look up the delayed node when the inode is read.

I'll read more today.

-chris
Re: [PATCH V4] btrfs: implement delayed inode items operation
Hi Miao,

Here is an excerpt of the V4 patch applied kernel boot log:

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.36-xie+ #117
-------------------------------------------------------
vgs/1210 is trying to acquire lock:
 (delayed_node->mutex){+.+...}, at: [8121184b] btrfs_delayed_update_inode+0x45/0x101
but task is already holding lock:
 (mm->mmap_sem){++}, at: [810f6512] sys_mmap_pgoff+0xd6/0x12e

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (mm->mmap_sem){++}:
 [81076a3d] lock_acquire+0x11d/0x143
 [810edc79] might_fault+0x95/0xb8
 [8112b5ce] filldir+0x75/0xd0
 [811d77f8] btrfs_real_readdir+0x3d7/0x528
 [8112b75c] vfs_readdir+0x79/0xb6
 [8112b8e9] sys_getdents+0x85/0xd8
 [81002ddb] system_call_fastpath+0x16/0x1b

-> #0 (delayed_node->mutex){+.+...}:
 [81076612] __lock_acquire+0xa98/0xda6
 [81076a3d] lock_acquire+0x11d/0x143
 [814c38b1] __mutex_lock_common+0x5a/0x444
 [814c3d50] mutex_lock_nested+0x39/0x3e
 [8121184b] btrfs_delayed_update_inode+0x45/0x101
 [811dbd4f] btrfs_update_inode+0x2e/0x129
 [811de008] btrfs_dirty_inode+0x57/0x113
 [8113c2a5] __mark_inode_dirty+0x33/0x1aa
 [81130939] touch_atime+0x107/0x12a
 [811e15b2] btrfs_file_mmap+0x3e/0x57
 [810f5f40] mmap_region+0x2bb/0x4c4
 [810f63d9] do_mmap_pgoff+0x290/0x2f3
 [810f6532] sys_mmap_pgoff+0xf6/0x12e
 [81006e9a] sys_mmap+0x22/0x24
 [81002ddb] system_call_fastpath+0x16/0x1b

other info that might help us debug this:

1 lock held by vgs/1210:
 #0: (mm->mmap_sem){++}, at: [810f6512] sys_mmap_pgoff+0xd6/0x12e

stack backtrace:
Pid: 1210, comm: vgs Not tainted 2.6.36-xie+ #117
Call Trace:
 [81074c15] print_circular_bug+0xaf/0xbd
 [81076612] __lock_acquire+0xa98/0xda6
 [8121184b] ? btrfs_delayed_update_inode+0x45/0x101
 [81076a3d] lock_acquire+0x11d/0x143
 [8121184b] ? btrfs_delayed_update_inode+0x45/0x101
 [8121184b] ? btrfs_delayed_update_inode+0x45/0x101
 [814c38b1] __mutex_lock_common+0x5a/0x444
 [8121184b] ? btrfs_delayed_update_inode+0x45/0x101
 [8107162f] ? debug_mutex_init+0x31/0x3c
 [814c3d50] mutex_lock_nested+0x39/0x3e
 [8121184b] btrfs_delayed_update_inode+0x45/0x101
 [814c36c6] ? __mutex_unlock_slowpath+0x129/0x13a
 [811dbd4f] btrfs_update_inode+0x2e/0x129
 [811de008] btrfs_dirty_inode+0x57/0x113
 [8113c2a5] __mark_inode_dirty+0x33/0x1aa
 [81130939] touch_atime+0x107/0x12a
 [811e15b2] btrfs_file_mmap+0x3e/0x57
 [810f5f40] mmap_region+0x2bb/0x4c4
 [81229f10] ? file_map_prot_check+0x9a/0xa3
 [810f63d9] do_mmap_pgoff+0x290/0x2f3
 [810f6512] ? sys_mmap_pgoff+0xd6/0x12e
 [810f6532] sys_mmap_pgoff+0xf6/0x12e
 [814c4b75] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [81006e9a] sys_mmap+0x22/0x24
 [81002ddb] system_call_fastpath+0x16/0x1b

As the corresponding delayed node mutex is taken in btrfs_real_readdir, that seems deadlockable. vfs_readdir holds i_mutex; I wonder if we can execute btrfs_readdir_delayed_dir_index without taking the node lock.
Re: [PATCH V4] btrfs: implement delayed inode items operation
On Tue, 22 Mar 2011 11:33:10 +0900, Itaru Kitayama wrote:
> Here is an excerpt of the V4 patch applied kernel boot log:
>
> [SNIP]
>
> As the corresponding delayed node mutex is taken in btrfs_real_readdir, that seems deadlockable. vfs_readdir holds i_mutex; I wonder if we can execute btrfs_readdir_delayed_dir_index without taking the node lock.

We can't fix it this way, because the worker threads may do insertion or deletion at the same time, and we may lose some directory items.

Maybe we can fix it by adding a reference count to the delayed directory items; then we can do the readdir like this:
1. hold the node lock
2. increase the directory items' reference count and put all the directory items into a list
3. release the node lock
4. read the directory
5. decrease the directory items' reference count, and free an item once its count reaches zero

What do you think?

Thanks
Miao
Re: [PATCH V4] btrfs: implement delayed inode items operation
On Tue, 22 Mar 2011 11:12:37 +0800 Miao Xie <mi...@cn.fujitsu.com> wrote:
> We can't fix it this way, because the worker threads may do insertion or deletion at the same time, and we may lose some directory items.

Ok.

> Maybe we can fix it by adding a reference count to the delayed directory items; then we can do the readdir like this:
> 1. hold the node lock
> 2. increase the directory items' reference count and put all the directory items into a list
> 3. release the node lock
> 4. read the directory
> 5. decrease the directory items' reference count, and free an item once its count reaches zero
> What do you think?

Sounds doable to me.

itaru
Re: [PATCH V4] btrfs: implement delayed inode items operation
Excerpts from Miao Xie's message of 2011-03-18 05:24:46 -0400:
> Changelog V3 -> V4:
> - Fix nested lock, which is reported by Itaru Kitayama, by updating space cache inodes in time.

I ran some tests on this and had trouble with my stress.sh script: http://oss.oracle.com/~mason/stress.sh

I used: stress.sh -n 50 -c <path to linux kernel git tree> /mnt

The git tree has all the .git files but no .o files. The problem was that within about 20 minutes, the filesystem was spending almost all of its time in balance_dirty_pages(). The problem is that data writeback isn't complete until the endio handlers have finished inserting metadata into the btree. The v4 patch calls btrfs_btree_balance_dirty() from all the btrfs_end_transaction variants, which means that the FS writeback code waits for balance_dirty_pages(), which won't make progress until the FS writeback code is done.

So I changed things to call the delayed inode balance function only from inside btrfs_btree_balance_dirty(), which did resolve the stalls. But I found a few times that when I did rmmod btrfs, there would be delayed inode objects leaked in the slab cache. rmmod will try to destroy the slab cache, which will fail because we haven't freed everything. It looks like we have a race in btrfs_get_or_create_delayed_node, where two concurrent callers can both create delayed nodes and then race on adding it to the inode.

I also think that code is racing with the code that frees delayed nodes, but haven't yet triggered my debugging printks to prove either one.
My current incremental patch is below, please take a look:

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 727d9a8..02e9390 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -32,7 +32,8 @@ int __init btrfs_delayed_inode_init(void)
 	delayed_node_cache = kmem_cache_create("delayed_node",
 					sizeof(struct btrfs_delayed_node),
 					0,
-					SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
+					SLAB_RECLAIM_ACCOUNT |
+					SLAB_MEM_SPREAD | SLAB_DESTROY_BY_RCU,
 					NULL);
 	if (!delayed_node_cache)
 		return -ENOMEM;
@@ -90,22 +91,35 @@ static struct btrfs_delayed_node *btrfs_get_or_create_delayed_node(
 	struct btrfs_inode *btrfs_inode = BTRFS_I(inode);
 	struct btrfs_root *root = btrfs_inode->root;
 
-	if (btrfs_inode->delayed_node) {
-		node = btrfs_inode->delayed_node;
-		atomic_inc(&node->refs);	/* can be accessed */
-		return node;
+again:
+	rcu_read_lock();
+again_rcu:
+	node = btrfs_inode->delayed_node;
+	if (node) {
+		if (atomic_inc_not_zero(&node->refs)) {
+			rcu_read_unlock();
+			return node;
+		}
+printk("racing on node access!\n");
+		goto again_rcu;
 	}
+	rcu_read_unlock();
 
 	node = kmem_cache_alloc(delayed_node_cache, GFP_NOFS);
 	if (!node)
 		return ERR_PTR(-ENOMEM);
 	btrfs_init_delayed_node(node, root->objectid, inode->i_ino);
 
-	btrfs_inode->delayed_node = node;
 	node->delayed_root = btrfs_get_delayed_root(root);
 	atomic_inc(&node->refs);	/* cached in the btrfs inode */
 	atomic_inc(&node->refs);	/* can be accessed */
+	if (cmpxchg(&BTRFS_I(inode)->delayed_node, NULL, node)) {
+		kmem_cache_free(delayed_node_cache, node);
+printk("racing on new node insertion!\n");
+		goto again;
+	}
+
 	return node;
 }
@@ -1167,7 +1181,7 @@ static void btrfs_async_run_delayed_node_done(struct btrfs_work *work)
 	nr = trans->blocks_used;
 	btrfs_end_transaction_dmeta(trans, root);
 
-	btrfs_btree_balance_dirty(root, nr);
+	__btrfs_btree_balance_dirty(root, nr);
 free_path:
 	btrfs_free_path(path);
 out:
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0f589b1..0c6c336 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2649,6 +2649,28 @@ void btrfs_btree_balance_dirty(struct btrfs_root *root, unsigned long nr)
 		balance_dirty_pages_ratelimited_nr(
 				root->fs_info->btree_inode->i_mapping, 1);
 	}
+	btrfs_balance_delayed_items(root);
+	return;
+}
+
+void __btrfs_btree_balance_dirty(struct btrfs_root *root, unsigned long nr)
+{
+	/*
+	 * looks as though older kernels can get into trouble with
+	 * this code, they end up stuck in balance_dirty_pages forever
+	 */
+	u64 num_dirty;
+	unsigned long thresh = 32 * 1024 * 1024;
+
+	if (current->flags & PF_MEMALLOC)
+		return;
+
+	num_dirty = root->fs_info->dirty_metadata_bytes;
+
+	if (num_dirty > thresh) {
+
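The `__btrfs_btree_balance_dirty()` hunk above boils down to a simple
gating rule: never throttle from a memory-reclaim context, and only kick
writeback once dirty metadata crosses a fixed threshold. A hedged
userspace analogue (the function name, the `in_reclaim` flag standing in
for the kernel's PF_MEMALLOC check, and the threshold constant are mine):

```c
#include <stdint.h>
#include <stdbool.h>

/* Fixed threshold mirroring the 32 MiB value in the patch. */
#define DIRTY_THRESH (32ULL * 1024 * 1024)

/* Decide whether to enter the balancing path:
 * - never from reclaim, or we would recurse into writeback;
 * - otherwise only when dirty metadata exceeds the threshold. */
static bool should_balance(uint64_t num_dirty, bool in_reclaim)
{
	if (in_reclaim)
		return false;
	return num_dirty > DIRTY_THRESH;
}
```

The point of the threshold is to let small bursts of delayed-item metadata
through without stalling, while still bounding how much dirty metadata can
accumulate before the caller is made to wait.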
Re: [PATCH V4] btrfs: implement delayed inode items operation
On Sun, 20 Mar 2011 20:33:34 -0400, Chris Mason wrote:
> Excerpts from Miao Xie's message of 2011-03-18 05:24:46 -0400:
> > Changelog V3 -> V4:
> > - Fix nested lock, which is reported by Itaru Kitayama, by updating
> >   space cache inodes in time.
>
> I ran some tests on this and had trouble with my stress.sh script:
>
>   http://oss.oracle.com/~mason/stress.sh
>
> I used: stress.sh -n 50 -c <path to linux kernel git tree> /mnt
>
> The git tree has all the .git files but no .o files. The problem was
> that within about 20 minutes, the filesystem was spending almost all of
> its time in balance_dirty_pages(). The problem is that data writeback
> isn't complete until the endio handlers have finished inserting
> metadata into the btree. The v4 patch calls btrfs_btree_balance_dirty()
> from all the btrfs_end_transaction variants, which means that the FS
> writeback code waits for balance_dirty_pages(), which won't make
> progress until the FS writeback code is done.
>
> So I changed things to call the delayed inode balance function only
> from inside btrfs_btree_balance_dirty(), which did resolve the stalls.

Ok, but can we invoke the delayed inode balance function before
balance_dirty_pages_ratelimited_nr()? The delayed item insertion and
deletion also bring us some dirty pages.

> But I found a few times that when I did rmmod btrfs, there would be
> delayed inode objects leaked in the slab cache. rmmod will try to
> destroy the slab cache, which will fail because we haven't freed
> everything. It looks like we have a race in
> btrfs_get_or_create_delayed_node, where two concurrent callers can both
> create delayed nodes and then race on adding it to the inode.

Sorry for my mistake. I thought we updated the inodes while holding
i_mutex originally, so I didn't use a lock or any other method to protect
the delayed_node field of the inodes. But I don't think we need the RCU
lock to protect delayed_node when we want to get the delayed node,
because we won't change it after it is created; cmpxchg() and
ACCESS_ONCE() can protect it well.
What do you think? PS: I worry about the inode update without holding
i_mutex.

> I also think that code is racing with the code that frees delayed
> nodes, but haven't yet triggered my debugging printks to prove either
> one.

We free delayed nodes when we want to destroy the inode. At that time,
just one task, the one destroying the inode, can access the delayed
nodes, so I think ACCESS_ONCE() is enough. What do you think?

Thanks
Miao

> My current incremental patch is below, please take a look:
[SNIP]
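The atomic_inc_not_zero() guard that Chris's incremental patch pairs with
SLAB_DESTROY_BY_RCU only takes a reference if the refcount has not already
dropped to zero, i.e. if the object is not mid-teardown. A minimal
userspace sketch of that pattern using C11 atomics (the name
`ref_get_not_zero` is illustrative, not a kernel API):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Take a reference only if the object is still live (refs != 0).
 * Returns false if the object is already being freed, in which case
 * the caller must not touch it and should retry the lookup. */
static bool ref_get_not_zero(atomic_int *refs)
{
	int old = atomic_load(refs);
	while (old != 0) {
		/* On failure the CAS reloads `old`, so we just retry. */
		if (atomic_compare_exchange_weak(refs, &old, old + 1))
			return true;
	}
	return false;
}
```

This is why the lookup side in the patch can run under rcu_read_lock()
without any per-inode lock: a reader that loses the race with the freeing
path sees the zero refcount, backs off, and retries instead of reviving a
dying node.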