Re: [PATCH v2 1/4] btrfs: Remove userspace transaction ioctls
On Mon, 5 Feb 2018, David Sterba wrote: > On Mon, Feb 05, 2018 at 05:52:52PM +0900, Wang Shilong wrote: > >These ioctl are originally introduced by Sage Weil for Ceph use? > > Not sure whether it still useful, Cc Sage just in case. > > We have checked that the ioctl is not used in ceph, the reasons why we > think it's ok to remove the ioctl were in the cover letter of v1. Yeah, +1 on removal. Acked-by: Sage Weil <s...@redhat.com> sage -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs: fix race in WAIT_SYNC ioctl
We check whether transid is already committed via last_trans_committed and then search through trans_list for pending transactions. If last_trans_committed is updated by btrfs_commit_transaction after we check it (there is no locking), we will fail to find the committed transaction and return EINVAL to the caller. This has been observed occasionally by ceph-osd (which uses this ioctl heavily). Fix by rechecking whether the provided transid = last_trans_committed after the search fails, and if so return 0. Signed-off-by: Sage Weil s...@redhat.com --- fs/btrfs/transaction.c | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index d89c6d3..98a25df 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -609,7 +609,6 @@ int btrfs_wait_for_commit(struct btrfs_root *root, u64 transid) if (transid = root-fs_info-last_trans_committed) goto out; - ret = -EINVAL; /* find specified transaction */ spin_lock(root-fs_info-trans_lock); list_for_each_entry(t, root-fs_info-trans_list, list) { @@ -625,9 +624,16 @@ int btrfs_wait_for_commit(struct btrfs_root *root, u64 transid) } } spin_unlock(root-fs_info-trans_lock); - /* The specified transaction doesn't exist */ - if (!cur_trans) + + /* +* The specified transaction doesn't exist, or we +* raced with btrfs_commit_transaction +*/ + if (!cur_trans) { + if (transid root-fs_info-last_trans_committed) + ret = -EINVAL; goto out; + } } else { /* find newest transaction that is committing | committed */ spin_lock(root-fs_info-trans_lock); -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
warn at fs/btrfs/extent-tree.c:5748 __btrfs_free_extent+0x9ce/0xa20
Hey, Is this something you guys have seen before? This is from v3.13-rc2. kernel: [49432.696440] WARNING: CPU: 3 PID: 26411 at /srv/autobuild-ceph/gitbuilder.git/build/fs/btrfs/extent-tree.c:5748 __btrfs_free_extent+0x9ce/0xa20 [btrfs]() kernel: [49432.710128] Modules linked in: arc4(F) md4(F) nls_utf8(F) cifs(F) ufs(F) qnx4(F) hfsplus(F) hfs(F) minix(F) ntfs(F) msdos(F) jfs(F) xfs(F) reiserfs(F) ext2(F) kvm_intel(F) kvm(F) ib_iser(F) rdma_cm(F) ib_cm(F) iw_cm(F) ib_sa(F) ib_mad(F) ib_core(F) ib_addr(F) iscsi_tcp(F) libiscsi_tcp(F) libiscsi(F) psmouse(F) ipmi_si(F) serio_raw(F) gpio_ich(F) joydev(F) dcdbas(F) i7core_edac(F) edac_core(F) ipmi_msghandler(F) mac_hid(F) acpi_power_meter(F) lpc_ich(F) tpm_tis(F) nfsd(F) nfs_acl(F) auth_rpcgss(F) scsi_transport_iscsi(F) nfs(F) fscache(F) lockd(F) lp(F) sunrpc(F) parport(F) hid_generic(F) usbhid(F) hid(F) btrfs(F) raid6_pq(F) mptsas(F) ixgbe(F) mptscsih(F) dca(F) mptbase(F) ptp(F) pps_core(F) scsi_transport_sas(F) xor(F) mdio(F) bnx2(F) libcrc32c(F) kernel: [49432.777445] CPU: 3 PID: 26411 Comm: ceph-osd Tainted: GF I 3.14.0-rc5-ceph-00016-gf31a96a #1 kernel: [49432.786704] Hardware name: Dell Inc. PowerEdge R410/01V648, BIOS 1.6.3 02/07/2011 kernel: [49432.794223] 1674 8800bf1cbac8 816e4840 88022726ef90 kernel: [49432.801700] 8800bf1cbb08 810524ac a8b07e50 kernel: [49432.809176] 880094e74120 b07c9000 kernel: [49432.816653] Call Trace: kernel: [49432.819119] [816e4840] dump_stack+0x46/0x58 kernel: [49432.825384] [810524ac] warn_slowpath_common+0x8c/0xc0 kernel: [49432.831413] [810524fa] warn_slowpath_null+0x1a/0x20 kernel: [49432.837284] [a010b4be] __btrfs_free_extent+0x9ce/0xa20 [btrfs] kernel: [49432.844108] [a01110b8] __btrfs_run_delayed_refs+0x428/0x11e0 [btrfs] kernel: [49432.851465] [a0109458] ? block_rsv_release_bytes+0x108/0x190 [btrfs] kernel: [49432.858823] [a0114066] btrfs_run_delayed_refs+0x76/0x2a0 [btrfs] kernel: [49432.865869] [a01251ff] __btrfs_end_transaction+0x26f/0x370 [btrfs] kernel: [49432.873044] [a0125330] btrfs_end_transaction+0x10/0x20 [btrfs] kernel: [49432.879872] [a01327de] btrfs_link+0x13e/0x1d0 [btrfs] kernel: [49432.885903] [811b7571] vfs_link+0x1b1/0x270 kernel: [49432.891060] [811b8120] SyS_linkat+0x210/0x2d0 kernel: [49432.896394] [811b81fe] SyS_link+0x1e/0x20 kernel: [49432.901380] [816f7cd6] system_call_fastpath+0x1a/0x1f The full dump is at http://tracker.ceph.com/issues/7688 http://tracker.ceph.com/attachments/download/1141/kern.log.gz Thanks- sage -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: transaction commit deadlock on current rc
On Fri, 18 Oct 2013, Josef Bacik wrote: On Thu, Oct 17, 2013 at 12:56:14PM -0700, Sage Weil wrote: Hey, I'm seeing the deadlock below under a ceph-osd workload. There may be a subtle problem with the async transaction sequence (since nobody but ceph uses that that I know of), but not obvious to me why create_pending_snapshots would get stuck on btrfs_tree_lock... Can you do sysrq+w when this happens so I can see everybody who's blocked? Thanks, Oops, forgot to attach the bug link. It's at http://tracker.ceph.com/attachments/download/1035/a http://tracker.ceph.com/issues/6451 The machine is still hung.. if there is additional info I can gather you can ping me on irc. Thanks! sage -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: transaction commit deadlock on current rc
On Fri, 18 Oct 2013, Chris Mason wrote: Quoting Sage Weil (2013-10-18 11:42:28) On Fri, 18 Oct 2013, Josef Bacik wrote: On Thu, Oct 17, 2013 at 12:56:14PM -0700, Sage Weil wrote: Hey, I'm seeing the deadlock below under a ceph-osd workload. There may be a subtle problem with the async transaction sequence (since nobody but ceph uses that that I know of), but not obvious to me why create_pending_snapshots would get stuck on btrfs_tree_lock... Can you do sysrq+w when this happens so I can see everybody who's blocked? Thanks, Oops, forgot to attach the bug link. It's at http://tracker.ceph.com/attachments/download/1035/a http://tracker.ceph.com/issues/6451 The machine is still hung.. if there is additional info I can gather you can ping me on irc. Thanks Sage and Josef, I've got this one queued up pending an ack from Sage. But it's obviously not harmful, so I'll probably send this afternoon either way. This is passing my initial tests! It'll be subjected to the full firehose later tonight; I'll let you know if anything comes up. Thanks! sage -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
transaction commit deadlock on current rc
Hey, I'm seeing the deadlock below under a ceph-osd workload. There may be a subtle problem with the async transaction sequence (since nobody but ceph uses that that I know of), but not obvious to me why create_pending_snapshots would get stuck on btrfs_tree_lock... [ 602.217383] INFO: task kworker/3:2:771 blocked for more than 120 seconds. [ 602.224234] Not tainted 3.12.0-rc2-ceph-9-g53d0281 #1 [ 602.230216] echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. [ 602.238121] kworker/3:2 D 88003677df10 0 771 2 0x [ 602.245349] Workqueue: events do_async_commit [btrfs] [ 602.250513] 8800c95c78d8 0046 0286 8800638fca08 [ 602.258192] 88003677df10 8800c95c7fd8 8800c95c7fd8 8800c95c7fd8 [ 602.265867] 880225d2df10 88003677df10 8800c95c78e8 8800638fc8e0 [ 602.273545] Call Trace: [ 602.276049] [81665849] schedule+0x29/0x70 [ 602.281087] [a0176975] btrfs_tree_lock+0x75/0x270 [btrfs] [ 602.287509] [81070310] ? __init_waitqueue_head+0x60/0x60 [ 602.293840] [a01185bb] btrfs_lock_root_node+0x3b/0x50 [btrfs] [ 602.300612] [a011da67] btrfs_search_slot+0x867/0x930 [btrfs] [ 602.307293] [a012ac62] ? run_clustered_refs+0x232/0xf30 [btrfs] [ 602.314236] [a011f238] btrfs_insert_empty_items+0x78/0xd0 [btrfs] [ 602.321393] [a01330cc] insert_with_overflow+0x3c/0x110 [btrfs] [ 602.328287] [a013325f] btrfs_insert_dir_item+0xbf/0x200 [btrfs] [ 602.335229] [a013f19c] create_pending_snapshot+0x81c/0xa00 [btrfs] [ 602.342469] [a013f423] create_pending_snapshots+0xa3/0xb0 [btrfs] [ 602.349624] [a01408fe] btrfs_commit_transaction+0x46e/0xa40 [btrfs] [ 602.356919] [81070310] ? __init_waitqueue_head+0x60/0x60 [ 602.363291] [a0140f58] do_async_commit+0x88/0xa0 [btrfs] [ 602.369665] [a0140ef9] ? do_async_commit+0x29/0xa0 [btrfs] [ 602.376166] [810672fa] process_one_work+0x1da/0x540 [ 602.382099] [8106728f] ? process_one_work+0x16f/0x540 [ 602.388205] [810684dc] worker_thread+0x11c/0x370 [ 602.393834] [810683c0] ? manage_workers.isra.20+0x2e0/0x2e0 [ 602.400462] [8106fada] kthread+0xea/0xf0 [ 602.405396] [8106f9f0] ? flush_kthread_worker+0x150/0x150 [ 602.411836] [8166fdec] ret_from_fork+0x7c/0xb0 [ 602.417300] [8106f9f0] ? flush_kthread_worker+0x150/0x150 [ 602.423787] INFO: lockdep is turned off. [ 602.427852] INFO: task btrfs-transacti:6069 blocked for more than 120 seconds. [ 602.435155] Not tainted 3.12.0-rc2-ceph-9-g53d0281 #1 [ 602.441229] echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. [ 602.449212] btrfs-transacti D 8800c96461e8 0 6069 2 0x [ 602.457660] 88022408fd08 0046 0286 8800b68a4578 [ 602.465350] 88022448df10 88022408ffd8 88022408ffd8 88022408ffd8 [ 602.473081] 880225d29fb0 88022448df10 88022408fd18 880082fd48a8 [ 602.480835] Call Trace: [ 602.483342] [81665849] schedule+0x29/0x70 [ 602.488450] [a013f74f] wait_current_trans.isra.33+0xbf/0x120 [btrfs] [ 602.495836] [81070310] ? __init_waitqueue_head+0x60/0x60 [ 602.502241] [a01416a8] start_transaction+0x348/0x540 [btrfs] [ 602.509010] [a0141907] btrfs_attach_transaction+0x17/0x20 [btrfs] [ 602.516124] [a0139c12] transaction_kthread+0x182/0x250 [btrfs] [ 602.523065] [a0139a90] ? btrfs_destroy_delayed_refs+0x370/0x370 [btrfs] [ 602.530791] [8106fada] kthread+0xea/0xf0 [ 602.535725] [8106f9f0] ? flush_kthread_worker+0x150/0x150 [ 602.542178] [8166fdec] ret_from_fork+0x7c/0xb0 [ 602.547658] [8106f9f0] ? flush_kthread_worker+0x150/0x150 [ 602.554068] INFO: lockdep is turned off. [ 602.558154] INFO: task ceph-osd:12248 blocked for more than 120 seconds. [ 602.558155] Not tainted 3.12.0-rc2-ceph-9-g53d0281 #1 [ 602.558156] echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. [ 602.558158] ceph-osdD 880082fd48a8 0 12248 12215 0x [ 602.558161] 880184441b58 0046 0282 8800b68a4578 [ 602.558162] 880077fcbf60 880184441fd8 880184441fd8 880184441fd8 [ 602.558164] 88003677df10 880077fcbf60 880184441b68 880184441ba0 [ 602.558164] Call Trace: [ 602.558166] [81665849] schedule+0x29/0x70 [ 602.558178] [a0141af7] btrfs_commit_transaction_async+0x187/0x2c0 [btrfs] [ 602.558188] [a01413f6] ? start_transaction+0x96/0x540 [btrfs] [ 602.558190] [81070310] ? __init_waitqueue_head+0x60/0x60 [ 602.558201] [a0171565] btrfs_mksubvol.isra.59+0x2a5/0x410 [btrfs] [ 602.558204] [811a3d9c] ?
Re: hang on 3.9, 3.10-rc5
Hi Chris, On Thu, 20 Jun 2013, Chris Mason wrote: Quoting Sage Weil (2013-06-20 17:56:19) On Wed, 19 Jun 2013, Sage Weil wrote: Hi Chris, On Tue, 18 Jun 2013, Chris Mason wrote: [...] Very long way of saying I think we're one release_path short. Sage, I haven't tested this at all yet, I was hoping to trigger it first. diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index c276ac9..c1954b3 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -3730,6 +3730,7 @@ next_slot: log_extents: if (fast_search) { btrfs_release_path(dst_path); + btrfs_release_path(path); ret = btrfs_log_changed_extents(trans, root, inode, dst_path); if (ret) { err = ret; This seems to be doing the trick. I'll keep testing overnight, but so far so good! ...and it's still holding up well in QA. Awesome, thanks for getting the traces for us. Looks like this one has been around since v3.7, so I'm not going to try and sneak it into the 3.10 final. I'll have it in the next merge window and for stable. I was just pulling in the current -rc for testing and realized that this isn't upstream yet. Was this missed? Sorry I didn't check earlier. Thanks! sage -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: hang on 3.9, 3.10-rc5
On Wed, 19 Jun 2013, Sage Weil wrote: Hi Chris, On Tue, 18 Jun 2013, Chris Mason wrote: [...] Very long way of saying I think we're one release_path short. Sage, I haven't tested this at all yet, I was hoping to trigger it first. diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index c276ac9..c1954b3 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -3730,6 +3730,7 @@ next_slot: log_extents: if (fast_search) { btrfs_release_path(dst_path); + btrfs_release_path(path); ret = btrfs_log_changed_extents(trans, root, inode, dst_path); if (ret) { err = ret; This seems to be doing the trick. I'll keep testing overnight, but so far so good! ...and it's still holding up well in QA. Thanks, Chris! sage -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: hang on 3.9, 3.10-rc5
On Thu, 20 Jun 2013, Chris Mason wrote: Quoting Sage Weil (2013-06-20 17:56:19) On Wed, 19 Jun 2013, Sage Weil wrote: Hi Chris, On Tue, 18 Jun 2013, Chris Mason wrote: [...] Very long way of saying I think we're one release_path short. Sage, I haven't tested this at all yet, I was hoping to trigger it first. diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index c276ac9..c1954b3 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -3730,6 +3730,7 @@ next_slot: log_extents: if (fast_search) { btrfs_release_path(dst_path); + btrfs_release_path(path); ret = btrfs_log_changed_extents(trans, root, inode, dst_path); if (ret) { err = ret; This seems to be doing the trick. I'll keep testing overnight, but so far so good! ...and it's still holding up well in QA. Awesome, thanks for getting the traces for us. Looks like this one has been around since v3.7, so I'm not going to try and sneak it into the 3.10 final. I'll have it in the next merge window and for stable. Weird, these same tests have been running on it nightly for ages and it seems like these failures just started with 3.9. Perhaps some other change made it hang when it didn't before? In any case, thanks! sage -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: hang on 3.9, 3.10-rc5
On Thu, 20 Jun 2013, Chris Mason wrote: Quoting Sage Weil (2013-06-20 21:00:21) On Thu, 20 Jun 2013, Chris Mason wrote: Awesome, thanks for getting the traces for us. Looks like this one has been around since v3.7, so I'm not going to try and sneak it into the 3.10 final. I'll have it in the next merge window and for stable. Weird, these same tests have been running on it nightly for ages and it seems like these failures just started with 3.9. Perhaps some other change made it hang when it didn't before? It's always possible, there are a ton of moving pieces here. The wait_event you were hung on was waiting for crcs to finish, and that part at least isn't new. K. There was also a shift of writes to leveldb (which does the mmap thing), so that may explain the change in behavior. Somewhat unrelated, but are you still using notreelog? Nope, just noatime. Thanks- sage -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: hang on 3.9, 3.10-rc5
Hi Chris, On Tue, 18 Jun 2013, Chris Mason wrote: [...] Very long way of saying I think we're one release_path short. Sage, I haven't tested this at all yet, I was hoping to trigger it first. diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index c276ac9..c1954b3 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -3730,6 +3730,7 @@ next_slot: log_extents: if (fast_search) { btrfs_release_path(dst_path); + btrfs_release_path(path); ret = btrfs_log_changed_extents(trans, root, inode, dst_path); if (ret) { err = ret; This seems to be doing the trick. I'll keep testing overnight, but so far so good! Thanks- sage -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: hang on 3.9, 3.10-rc5
On Wed, 12 Jun 2013, Sage Weil wrote: On Tue, 11 Jun 2013, Chris Mason wrote: Quoting Sage Weil (2013-06-11 11:43:30) I'm also seeing this hang regularly with both 3.9 and 3.10-rc5. Is this is a known problem? In this case there is no powercycling; just a regular ceph-osd workload. Everyone here is waiting for the root node, but it isn't immediately clear who has the lock. log_one_extent is the most likely suspect, but I can't see how it would be scheduling with the root lock held. Could you please sysrq-w? Sorry for the slow reply; had to wait for it to reproduce again. Attached both the original dmesg output and the blocked tasks. I'll leave the box wedged this time in case there is any other info that can help. Still seeing this hang with the latest from linus' tree. Is there another tree I should try testing against? Thanks! sage -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
hang on 3.9, 3.10-rc5
I'm also seeing this hang regularly with both 3.9 and 3.10-rc5. Is this is a known problem? In this case there is no powercycling; just a regular ceph-osd workload. Thanks- sage [ 2885.479116] INFO: task kworker/u64:1:28713 blocked for more than 120 seconds. [ 2885.486277] echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. [ 2885.494201] kworker/u64:1 D 880222b81f90 0 28713 2 0x [ 2885.501355] Workqueue: writeback bdi_writeback_workfn (flush-btrfs-12) [ 2885.507924] 88021c3d7238 0046 0286 880082af52e8 [ 2885.515456] 880222b81f90 88021c3d7fd8 88021c3d7fd8 88021c3d7fd8 [ 2885.522987] 880225d2bf20 880222b81f90 88021c3d7248 880082af51c0 [ 2885.530497] Call Trace: [ 2885.532964] [81635ed9] schedule+0x29/0x70 [ 2885.537972] [a021ec95] btrfs_tree_lock+0x75/0x270 [btrfs] [ 2885.544407] [81067b50] ? __init_waitqueue_head+0x60/0x60 [ 2885.550760] [a01c22cb] btrfs_lock_root_node+0x3b/0x50 [btrfs] [ 2885.557490] [a01c77ba] btrfs_search_slot+0x87a/0x940 [btrfs] [ 2885.564196] [a01ff76e] ? free_extent_map+0x4e/0x90 [btrfs] [ 2885.570738] [a01dd198] btrfs_lookup_file_extent+0x38/0x40 [btrfs] [ 2885.577823] [a01fd1be] __btrfs_drop_extents+0x12e/0xaf0 [btrfs] [ 2885.584737] [8107b093] ? select_task_rq_fair+0x53/0x8a0 [ 2885.590977] [81167788] ? kmem_cache_alloc+0xd8/0x160 [ 2885.596934] [a01fe63e] btrfs_drop_extents+0x6e/0xa0 [btrfs] [ 2885.603531] [a01f16c1] cow_file_range_inline+0xf1/0x1e0 [btrfs] [ 2885.610451] [a01f1ad6] __cow_file_range+0x326/0x4b0 [btrfs] [ 2885.617014] [a01e89ea] ? join_transaction.isra.32+0x10a/0x3c0 [btrfs] [ 2885.624520] [a01f27d5] cow_file_range+0x95/0xe0 [btrfs] [ 2885.630772] [a01f2b5b] run_delalloc_range+0x33b/0x360 [btrfs] [ 2885.637581] [a0208120] __extent_writepage+0x5e0/0x770 [btrfs] [ 2885.644357] [8111cf9f] ? find_get_pages_tag+0x11f/0x1c0 [ 2885.650614] [8111ceae] ? find_get_pages_tag+0x2e/0x1c0 [ 2885.656744] [a0208532] extent_write_cache_pages.isra.29.constprop.41+0x282/0x3e0 [btrfs] [ 2885.665906] [a01ec8c0] ? add_pending_csums.isra.44+0x70/0x70 [btrfs] [ 2885.673325] [a01ebde0] ? btrfs_readpage_end_io_hook+0x270/0x270 [btrfs] [ 2885.680990] [8119815c] ? __writeback_single_inode+0x6c/0x2a0 [ 2885.687792] [a020892e] extent_writepages+0x4e/0x70 [btrfs] [ 2885.694296] [a01eeec0] ? can_nocow_odirect+0x330/0x330 [btrfs] [ 2885.701787] [a01ec698] btrfs_writepages+0x28/0x30 [btrfs] [ 2885.708164] [811289e3] do_writepages+0x23/0x40 [ 2885.713679] [8119813e] __writeback_single_inode+0x4e/0x2a0 [ 2885.720347] [81199630] writeback_sb_inodes+0x280/0x3e0 [ 2885.726528] [8119982e] __writeback_inodes_wb+0x9e/0xd0 [ 2885.732766] [81199a5b] wb_writeback+0x1fb/0x310 [ 2885.738271] [81127a21] ? bdi_dirty_limit+0x31/0xc0 [ 2885.744266] [8119b3e8] wb_do_writeback+0x138/0x1e0 [ 2885.750091] [8119b50a] bdi_writeback_workfn+0x7a/0x200 [ 2885.756207] [8105f4ba] process_one_work+0x1da/0x540 [ 2885.762111] [8105f44f] ? process_one_work+0x16f/0x540 [ 2885.768138] [8106069c] worker_thread+0x11c/0x370 [ 2885.773836] [81060580] ? manage_workers.isra.20+0x2e0/0x2e0 [ 2885.780513] [8106735a] kthread+0xea/0xf0 [ 2885.785412] [81067270] ? flush_kthread_worker+0x150/0x150 [ 2885.791805] [8163ff9c] ret_from_fork+0x7c/0xb0 [ 2885.797227] [81067270] ? flush_kthread_worker+0x150/0x150 [ 2885.803694] INFO: lockdep is turned off. [ 2885.807640] INFO: task btrfs-transacti:23992 blocked for more than 120 seconds. [ 2885.815086] echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. [ 2885.823067] btrfs-transacti D 880222493f20 0 23992 2 0x [ 2885.830265] 88020cebfa28 0046 0286 880082af52e8 [ 2885.837761] 880222493f20 88020cebffd8 88020cebffd8 88020cebffd8 [ 2885.845281] 880225d29f90 880222493f20 88020cebfa38 880082af51c0 [ 2885.852840] Call Trace: [ 2885.855306] [81635ed9] schedule+0x29/0x70 [ 2885.860393] [a021ec95] btrfs_tree_lock+0x75/0x270 [btrfs] [ 2885.866772] [81067b50] ? __init_waitqueue_head+0x60/0x60 [ 2885.873123] [a01c22cb] btrfs_lock_root_node+0x3b/0x50 [btrfs] [ 2885.879881] [a01c77ba] btrfs_search_slot+0x87a/0x940 [btrfs] [ 2885.886517] [8163760b] ? _raw_spin_unlock+0x2b/0x40 [ 2885.892465] [a01c8f28] btrfs_insert_empty_items+0x78/0xd0 [btrfs] [ 2885.899574] [a0238d54] btrfs_insert_delayed_items+0x84/0x460 [btrfs] [ 2885.906936] [a0239938] __btrfs_run_delayed_items+0xb8/0x1e0 [btrfs] [ 2885.914285] [a0239a93]
Re: v3.9 bug at /fs/btrfs/free-space-cache.c:1567 after powercycle
On Wed, 5 Jun 2013, Josef Bacik wrote: On Tue, Jun 04, 2013 at 09:59:05PM -0600, Sage Weil wrote: Hi- I'm pretty reliably triggering the following bug after powercycling an active btrfs + ceph workload and then trying to remount. Is this a known issue? Yeah sorry it's fixed in 3.10, I really need to send the patch back to stable. Thanks, Josef Cool. Thanks! sage -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
v3.9 bug at /fs/btrfs/free-space-cache.c:1567 after powercycle
Hi- I'm pretty reliably triggering the following bug after powercycling an active btrfs + ceph workload and then trying to remount. Is this a known issue? sage 2013-06-04T18:54:28.532988-07:00 plana71 kernel: [ 39.311120] [ cut here ] 2013-06-04T18:54:28.533002-07:00 plana71 kernel: [ 39.315802] kernel BUG at /srv/autobuild-ceph/gitbuilder.git/build/fs/btrfs/free-space-cache.c:1567! 2013-06-04T18:54:28.533004-07:00 plana71 kernel: [ 39.325013] invalid opcode: [#1] SMP 2013-06-04T18:54:28.533012-07:00 plana71 kernel: [ 39.329278] Modules linked in: coretemp kvm_intel kvm ghash_clmulni_intel aesni_intel ablk_helper nfsd cryptd lrw aes_x86_64 nfs_acl xts auth_rpcgss gf128mul microcode psmouse exportfs nfs dcdbas lpc_ich mfd_core i7core_edac edac_core joydev serio_raw hed lp parport fscache lockd sunrpc hid_generic usbhid hid btrfs raid6_pq ixgbe dca ptp pps_core mdio mptsas mptscsih mptbase scsi_transport_sas bnx2 xor zlib_deflate crc32c_intel libcrc32c 2013-06-04T18:54:28.533014-07:00 plana71 kernel: [ 39.370984] CPU 1 2013-06-04T18:54:28.533017-07:00 plana71 kernel: [ 39.372867] Pid: 1679, comm: mount Not tainted 3.9.0-ceph-00303-g19bb6a8 #1 Dell Inc. PowerEdge R410/01V648 2013-06-04T18:54:28.533021-07:00 plana71 kernel: [ 39.382928] RIP: 0010:[a0187917] [a0187917] remove_from_bitmap+0x1b7/0x1c0 [btrfs] 2013-06-04T18:54:28.533023-07:00 plana71 kernel: [ 39.392479] RSP: 0018:880212d456e8 EFLAGS: 00010287 2013-06-04T18:54:28.533025-07:00 plana71 kernel: [ 39.397853] RAX: RBX: 88021fea1100 RCX: 0034 2013-06-04T18:54:28.533028-07:00 plana71 kernel: [ 39.405051] RDX: 00048000 RSI: 61b0a000 RDI: 78c0 2013-06-04T18:54:28.533030-07:00 plana71 kernel: [ 39.412249] RBP: 880212d45738 R08: 8802200350f0 R09: 0740 2013-06-04T18:54:28.533032-07:00 plana71 kernel: [ 39.419447] R10: 88020a4e65d0 R11: 0001 R12: 880212d45760 2013-06-04T18:54:28.533034-07:00 plana71 kernel: [ 39.426644] R13: 88020ad97400 R14: 6940 R15: 880212d45758 2013-06-04T18:54:28.533036-07:00 plana71 kernel: [ 39.433843] FS: 7f035dc1a800() GS:88022722() knlGS: 2013-06-04T18:54:28.533038-07:00 plana71 kernel: [ 39.442011] CS: 0010 DS: ES: CR0: 8005003b 2013-06-04T18:54:28.533040-07:00 plana71 kernel: [ 39.447819] CR2: 7ff46d4524d0 CR3: 00020a52c000 CR4: 07e0 2013-06-04T18:54:28.533042-07:00 plana71 kernel: [ 39.455017] DR0: DR1: DR2: 2013-06-04T18:54:28.533044-07:00 plana71 kernel: [ 39.462215] DR3: DR6: 0ff0 DR7: 0400 2013-06-04T18:54:28.533047-07:00 plana71 kernel: [ 39.469414] Process mount (pid: 1679, threadinfo 880212d44000, task 88020a4e5eb0) 2013-06-04T18:54:28.533048-07:00 plana71 kernel: [ 39.477669] Stack: 2013-06-04T18:54:28.533050-07:00 plana71 kernel: [ 39.479739] 88020ad97454 61b0c000 00048000 2013-06-04T18:54:28.533052-07:00 plana71 kernel: [ 39.487431] 8802 88020ad97400 88020ad97454 2013-06-04T18:54:28.533054-07:00 plana71 kernel: [ 39.495123] 88022012b400 61b0a000 880212d45798 a0189073 2013-06-04T18:54:28.533055-07:00 plana71 kernel: [ 39.502817] Call Trace: 2013-06-04T18:54:28.533058-07:00 plana71 kernel: [ 39.505339] [a0189073] btrfs_remove_free_space+0x53/0x290 [btrfs] 2013-06-04T18:54:28.533061-07:00 plana71 kernel: [ 39.512465] [a0135d20] btrfs_alloc_logged_file_extent+0x1c0/0x1e0 [btrfs] 2013-06-04T18:54:28.533063-07:00 plana71 kernel: [ 39.520295] [a01215fa] ? btrfs_free_path+0x2a/0x40 [btrfs] 2013-06-04T18:54:28.533065-07:00 plana71 kernel: [ 39.526815] [a0183aef] replay_one_extent+0x5ef/0x650 [btrfs] 2013-06-04T18:54:28.533068-07:00 plana71 kernel: [ 39.533510] [a017da3a] ? btrfs_tree_read_lock+0x5a/0x140 [btrfs] 2013-06-04T18:54:28.533070-07:00 plana71 kernel: [ 39.540553] [a0183e2b] replay_one_buffer+0x2db/0x390 [btrfs] 2013-06-04T18:54:28.533075-07:00 plana71 kernel: [ 39.547247] [a01629f1] ? mark_extent_buffer_accessed+0x51/0x70 [btrfs] 2013-06-04T18:54:28.533078-07:00 plana71 kernel: [ 39.554821] [a013f6d0] ? verify_parent_transid+0x160/0x160 [btrfs] 2013-06-04T18:54:28.533080-07:00 plana71 kernel: [ 39.562037] [a017ee53] walk_up_log_tree+0x1c3/0x250 [btrfs] 2013-06-04T18:54:28.533082-07:00 plana71 kernel: [ 39.568644] [a017fce1] walk_log_tree+0xb1/0x1f0 [btrfs] 2013-06-04T18:54:28.533085-07:00 plana71 kernel: [ 39.574903] [a0186032] btrfs_recover_log_trees+0x212/0x3c0 [btrfs] 2013-06-04T18:54:28.533087-07:00 plana71 kernel: [ 39.582119] [a0183b50] ?
Re: corruption of active mmapped files in btrfs snapshots
On Fri, 22 Mar 2013, Chris Mason wrote: Quoting Alexandre Oliva (2013-03-22 10:17:30) On Mar 22, 2013, Chris Mason clma...@fusionio.com wrote: Are you using compression in btrfs or just in leveldb? btrfs lzo compression. Perfect, I'll focus on that part of things. I'd like to take snapshots out of the picture for a minute. That's understandable, I guess, but I don't know that anyone has ever got the problem without snapshots. I mean, even when the master copy of the database got corrupted, snapshots of the subvol containing it were being taken every now and again, because that's the way ceph works. Hopefully Sage can comment, but the basic idea is that if you snapshot a database file the db must participate. If it doesn't, it really is the same effect as crashing the box. Something is definitely broken if we're corrupting the source files (either with or without snapshots), but avoiding incomplete writes in the snapshot files requires synchronization with the db. In this case, we quiesce write activity, call leveldb's sync(), take the snapshot, and then continue. (FWIW, this isn't the first time we've heard about leveldb corruption, but in each case we've looked into the user had the btrfs compression enabled so I suspect that's the right avenue of investigation!) sage -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: corruption of active mmapped files in btrfs snapshots
On Tue, 19 Mar 2013, Chris Mason wrote: Quoting Alexandre Oliva (2013-03-19 01:20:10) On Mar 18, 2013, Chris Mason chris.ma...@fusionio.com wrote: A few questions. Does leveldb use O_DIRECT and mmap together? No, it doesn't use O_DIRECT at all. Its I/O interface is very simplified: it just opens each new file (database chunks limited to 2MB) with O_CREAT|O_RDWR|O_TRUNC, and then uses ftruncate, mmap, msync, munmap and fdatasync. It doesn't seem to modify data once it's written; it only appends. Reading data back from it uses a completely different class interface, using separate descriptors and using pread only. (the source of a write being pages that are mmap'd from somewhere else) AFAICT the source of the memcpy()s that append to the file are malloc()ed memory. That's the most likely place for this kind of problem. Also, you mention crc errors. Are those reported by btrfs or are they application level crcs. These are CRCs leveldb computes and writes out after each db block. No btrfs CRC errors are reported in this process. Ok, so we have three moving pieces here. 1) leveldb truncating the files 2) leveldb using mmap to write 3) btrfs snapshots My guess is the truncate is creating a orphan item that is being processed inside the snapshot. Is it possible to create a smaller leveldb unit test that we might use to exercise all of this? There is a set of unit tests in the leveldb source tree that ought to do the trick: git clone https://code.google.com/p/leveldb/ sage -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fs: encode_fh: return FILEID_INVALID if invalid fid_type
Acked-by: Sage Weil s...@inktank.com On Mon, 11 Feb 2013, Namjae Jeon wrote: From: Namjae Jeon namjae.j...@samsung.com This patch is a follow up on below patch: [PATCH] exportfs: add FILEID_INVALID to indicate invalid fid_type commit: 216b6cbdcbd86b1db0754d58886b466ae31f5a63 Signed-off-by: Namjae Jeon namjae.j...@samsung.com Signed-off-by: Vivek Trivedi t.vi...@samsung.com Acked-by: Steven Whitehouse swhit...@redhat.com --- fs/btrfs/export.c |4 ++-- fs/ceph/export.c|4 ++-- fs/fuse/inode.c |2 +- fs/gfs2/export.c|4 ++-- fs/isofs/export.c |4 ++-- fs/nilfs2/namei.c |4 ++-- fs/ocfs2/export.c |4 ++-- fs/reiserfs/inode.c |4 ++-- fs/udf/namei.c |4 ++-- fs/xfs/xfs_export.c |4 ++-- mm/cleancache.c |2 +- mm/shmem.c |2 +- 12 files changed, 21 insertions(+), 21 deletions(-) diff --git a/fs/btrfs/export.c b/fs/btrfs/export.c index 614f34a..81ee29e 100644 --- a/fs/btrfs/export.c +++ b/fs/btrfs/export.c @@ -22,10 +22,10 @@ static int btrfs_encode_fh(struct inode *inode, u32 *fh, int *max_len, if (parent (len BTRFS_FID_SIZE_CONNECTABLE)) { *max_len = BTRFS_FID_SIZE_CONNECTABLE; - return 255; + return FILEID_INVALID; } else if (len BTRFS_FID_SIZE_NON_CONNECTABLE) { *max_len = BTRFS_FID_SIZE_NON_CONNECTABLE; - return 255; + return FILEID_INVALID; } len = BTRFS_FID_SIZE_NON_CONNECTABLE; diff --git a/fs/ceph/export.c b/fs/ceph/export.c index ca3ab3f..16796be 100644 --- a/fs/ceph/export.c +++ b/fs/ceph/export.c @@ -81,7 +81,7 @@ static int ceph_encode_fh(struct inode *inode, u32 *rawfh, int *max_len, if (parent_inode) { /* nfsd wants connectable */ *max_len = connected_handle_length; - type = 255; + type = FILEID_INVALID; } else { dout(encode_fh %p\n, dentry); fh-ino = ceph_ino(inode); @@ -90,7 +90,7 @@ static int ceph_encode_fh(struct inode *inode, u32 *rawfh, int *max_len, } } else { *max_len = handle_length; - type = 255; + type = FILEID_INVALID; } if (dentry) dput(dentry); diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 9876a87..973e8f0 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -679,7 +679,7 @@ static int fuse_encode_fh(struct inode *inode, u32 *fh, int *max_len, if (*max_len len) { *max_len = len; - return 255; + return FILEID_INVALID; } nodeid = get_fuse_inode(inode)-nodeid; diff --git a/fs/gfs2/export.c b/fs/gfs2/export.c index 4767774..9973df4 100644 --- a/fs/gfs2/export.c +++ b/fs/gfs2/export.c @@ -37,10 +37,10 @@ static int gfs2_encode_fh(struct inode *inode, __u32 *p, int *len, if (parent (*len GFS2_LARGE_FH_SIZE)) { *len = GFS2_LARGE_FH_SIZE; - return 255; + return FILEID_INVALID; } else if (*len GFS2_SMALL_FH_SIZE) { *len = GFS2_SMALL_FH_SIZE; - return 255; + return FILEID_INVALID; } fh[0] = cpu_to_be32(ip-i_no_formal_ino 32); diff --git a/fs/isofs/export.c b/fs/isofs/export.c index 2b4f235..12088d8 100644 --- a/fs/isofs/export.c +++ b/fs/isofs/export.c @@ -125,10 +125,10 @@ isofs_export_encode_fh(struct inode *inode, */ if (parent (len 5)) { *max_len = 5; - return 255; + return FILEID_INVALID; } else if (len 3) { *max_len = 3; - return 255; + return FILEID_INVALID; } len = 3; diff --git a/fs/nilfs2/namei.c b/fs/nilfs2/namei.c index 1d0c0b8..9de78f0 100644 --- a/fs/nilfs2/namei.c +++ b/fs/nilfs2/namei.c @@ -517,11 +517,11 @@ static int nilfs_encode_fh(struct inode *inode, __u32 *fh, int *lenp, if (parent *lenp NILFS_FID_SIZE_CONNECTABLE) { *lenp = NILFS_FID_SIZE_CONNECTABLE; - return 255; + return FILEID_INVALID; } if (*lenp NILFS_FID_SIZE_NON_CONNECTABLE) { *lenp = NILFS_FID_SIZE_NON_CONNECTABLE; - return 255; + return FILEID_INVALID; } fid-cno = root-cno; diff --git a/fs/ocfs2/export.c b/fs/ocfs2/export.c index 322216a..2965116 100644 --- a/fs/ocfs2/export.c +++ b/fs/ocfs2/export.c @@ -195,11 +195,11 @@ static int ocfs2_encode_fh(struct inode *inode, u32 *fh_in, int *max_len, if (parent (len 6)) { *max_len = 6; - type = 255; + type = FILEID_INVALID; goto bail; } else if (len 3) { *max_len = 3; - type = 255
Re: corrupted file size on inline extent conversion?
On Wed, 30 Jan 2013, Josef Bacik wrote: On Wed, Jan 30, 2013 at 11:30:49AM -0700, Mike Lowe wrote: I've been running rsync against a rbd device backed by btrfs filesystems that are about 11% full for about 45 minutes before I checked and noticed the printk message. That was the first go with the patch. Seems like I was able to get by without any problems until the btrfs filesystems got some use and filled up a little bit. Ok since you are seeing the message I'll go ahead and post the patch and get it moving along, let me know if you still see the problem. Thanks, Awesome. Mike still hasn't seen a reocurrence, so it's looking like the patch is good. Thanks so much! sage -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
corrupted file size on inline extent conversion?
A ceph user observed a incorrect i_size on btrfs. The pattern looks like this: - some writes at low file offsets - a write to 4185600 len 8704 (i_size should be 4MB) - more writes to low offsets - a write to 4181504 len 4096 (abutts the write above) - a bit of time goes by... - stat returns 4186112 (4MB - 8192) - that's a fwe bytes to the right of the top write above. There are some logs showing the full read/write activity to the file at http://tracker.newdream.net/attachments/658/object_log.txt on issue http://tracker.newdream.net/issues/3810 The kernel was 3.7.0-030700-generic (and probably also observed on 3.7.1). Is this a known bug? Thanks! sage -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: move the sb_end_intwrite until after the throttle logic
On Wed, 5 Sep 2012, Josef Bacik wrote: Sage reported the following lockdep backtrace = [ BUG: bad unlock balance detected! ] 3.6.0-rc2-ceph-00171-gc7ed62d #1 Not tainted - btrfs-cleaner/7607 is trying to release lock (sb_internal) at: [a00422ae] btrfs_commit_transaction+0xa6e/0xb20 [btrfs] but there are no more locks to release! other info that might help us debug this: 1 lock held by btrfs-cleaner/7607: #0: (fs_info-cleaner_mutex){+.+...}, at: [a003b405] cleaner_kthread+0x95/0x120 [btrfs] stack backtrace: Pid: 7607, comm: btrfs-cleaner Not tainted 3.6.0-rc2-ceph-00171-gc7ed62d #1 Call Trace: [a00422ae] ? btrfs_commit_transaction+0xa6e/0xb20 [btrfs] [810afa9e] print_unlock_inbalance_bug+0xfe/0x110 [810b289e] lock_release_non_nested+0x1ee/0x310 [81172f9b] ? kmem_cache_free+0x7b/0x160 [a004106c] ? put_transaction+0x8c/0x130 [btrfs] [a00422ae] ? btrfs_commit_transaction+0xa6e/0xb20 [btrfs] [810b2a95] lock_release+0xd5/0x220 [81173071] ? kmem_cache_free+0x151/0x160 [8117d9ed] __sb_end_write+0x7d/0x90 [a00422ae] btrfs_commit_transaction+0xa6e/0xb20 [btrfs] [81079850] ? __init_waitqueue_head+0x60/0x60 [81634c6b] ? _raw_spin_unlock+0x2b/0x40 [a0042758] __btrfs_end_transaction+0x368/0x3c0 [btrfs] [a0042808] btrfs_end_transaction_throttle+0x18/0x20 [btrfs] [a00318f0] btrfs_drop_snapshot+0x410/0x600 [btrfs] [8132babd] ? do_raw_spin_unlock+0x5d/0xb0 [a00430ef] btrfs_clean_old_snapshots+0xaf/0x150 [btrfs] [a003b405] ? cleaner_kthread+0x95/0x120 [btrfs] [a003b419] cleaner_kthread+0xa9/0x120 [btrfs] [a003b370] ? btrfs_destroy_delayed_refs.isra.102+0x220/0x220 [btrfs] [810791ee] kthread+0xae/0xc0 [810b379d] ? trace_hardirqs_on+0xd/0x10 [8163e744] kernel_thread_helper+0x4/0x10 [81635430] ? retint_restore_args+0x13/0x13 [81079140] ? flush_kthread_work+0x1a0/0x1a0 [8163e740] ? gs_change+0x13/0x13 This is because the throttle stuff can commit the transaction, which expects to be the one stopping the intwrite stuff, but we've already done it in the __btrfs_end_transaction. Moving the sb_end_intewrite after this logic makes the lockdep go away. Thanks, Signed-off-by: Josef Bacik jba...@fusionio.com This appears to have done the trick (at least so far). I'm running it on top of -rc5 and the 3 patches I posted on Aug 30th. Tested-by: Sage Weil s...@inktank.com Thanks! sage --- fs/btrfs/transaction.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index a86fc72..0163afa 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -551,8 +551,6 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans, btrfs_trans_release_metadata(trans, root); trans-block_rsv = NULL; - sb_end_intwrite(root-fs_info-sb); - if (lock !atomic_read(root-fs_info-open_ioctl_trans) should_end_transaction(trans, root)) { trans-transaction-blocked = 1; @@ -573,6 +571,8 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans, } } + sb_end_intwrite(root-fs_info-sb); + WARN_ON(cur_trans != info-running_transaction); WARN_ON(atomic_read(cur_trans-num_writers) 1); atomic_dec(cur_trans-num_writers); -- 1.7.11.4 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/3] lockdep fixups for sb_interval#2 vs async transaction commit
These three patches (in combination with btrfs-next) fix things up for me! Sage Weil (3): Btrfs: pass lockdep rwsem metadata to async commit transaction Btrfs: set journal_info in async trans commit worker Btrfs: do not take cleanup_work_sem in btrfs_run_delayed_iputs() fs/btrfs/inode.c |2 -- fs/btrfs/transaction.c | 18 ++ 2 files changed, 18 insertions(+), 2 deletions(-) -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/3] Btrfs: set journal_info in async trans commit worker
We expect current-journal_info to point to the trans handle we are committing. Signed-off-by: Sage Weil s...@inktank.com --- fs/btrfs/transaction.c |2 ++ 1 file changed, 2 insertions(+) diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index efc41a5..f910a26 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -1236,6 +1236,8 @@ static void do_async_commit(struct work_struct *work) ac-root-fs_info-sb-s_writers.lock_map[SB_FREEZE_FS-1], 0, 1, _THIS_IP_); + current-journal_info = ac-newtrans; + btrfs_commit_transaction(ac-newtrans, ac-root); kfree(ac); } -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/3] Btrfs: pass lockdep rwsem metadata to async commit transaction
The freeze rwsem is taken by sb_start_intwrite() and dropped during the commit_ or end_transaction(). In the async case, that happens in a worker thread. Tell lockdep the calling thread is releasing ownership of the rwsem and the async thread is picking it up. XFS plays the same trick in fs/xfs/xfs_aops.c. Signed-off-by: Sage Weil s...@inktank.com --- fs/btrfs/transaction.c | 16 1 file changed, 16 insertions(+) diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 17be3de..efc41a5 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -1228,6 +1228,14 @@ static void do_async_commit(struct work_struct *work) struct btrfs_async_commit *ac = container_of(work, struct btrfs_async_commit, work.work); + /* +* We've got freeze protection passed with the transaction. +* Tell lockdep about it. +*/ + rwsem_acquire_read( + ac-root-fs_info-sb-s_writers.lock_map[SB_FREEZE_FS-1], + 0, 1, _THIS_IP_); + btrfs_commit_transaction(ac-newtrans, ac-root); kfree(ac); } @@ -1257,6 +1265,14 @@ int btrfs_commit_transaction_async(struct btrfs_trans_handle *trans, atomic_inc(cur_trans-use_count); btrfs_end_transaction(trans, root); + + /* +* Tell lockdep we've released the freeze rwsem, since the +* async commit thread will be the one to unlock it. +*/ + rwsem_release(root-fs_info-sb-s_writers.lock_map[SB_FREEZE_FS-1], + 1, _THIS_IP_); + schedule_delayed_work(ac-work, 0); /* wait for transaction to start and unblock */ -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/3] Btrfs: do not take cleanup_work_sem in btrfs_run_delayed_iputs()
/0x130 [btrfs] [ 606.970510] [a003b5d5] ? cleaner_kthread+0x95/0x120 [btrfs] [ 607.005648] [a003b5e1] cleaner_kthread+0xa1/0x120 [btrfs] [ 607.040724] [a003b540] ? btrfs_destroy_delayed_refs.isra.102+0x220/0x220 [btrfs] [ 607.104740] [810791ee] kthread+0xae/0xc0 [ 607.137119] [810b379d] ? trace_hardirqs_on+0xd/0x10 [ 607.169797] [8163e744] kernel_thread_helper+0x4/0x10 [ 607.202472] [81635430] ? retint_restore_args+0x13/0x13 [ 607.235884] [81079140] ? flush_kthread_work+0x1a0/0x1a0 [ 607.268731] [8163e740] ? gs_change+0x13/0x13 Signed-off-by: Sage Weil s...@inktank.com --- fs/btrfs/inode.c |2 -- 1 file changed, 2 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 6e8f416..d3f2e6a 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -2118,7 +2118,6 @@ void btrfs_run_delayed_iputs(struct btrfs_root *root) if (empty) return; - down_read(root-fs_info-cleanup_work_sem); spin_lock(fs_info-delayed_iput_lock); list_splice_init(fs_info-delayed_iputs, list); spin_unlock(fs_info-delayed_iput_lock); @@ -2129,7 +2128,6 @@ void btrfs_run_delayed_iputs(struct btrfs_root *root) iput(delayed-inode); kfree(delayed); } - up_read(root-fs_info-cleanup_work_sem); } enum btrfs_orphan_cleanup_state { -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs: pass lockdep rwsem metadata to async commit transaction
The freeze rwsem is taken by sb_start_intwrite() and dropped during the commit_ or end_transaction(). In the async case, that happens in a worker thread. Tell lockdep the calling thread is releasing ownership of the rwsem and the async thread is picking it up. Josef and I worked out a more complicated solution that made the async commit thread join and potentially get a later transaction, but it failed my initial smoke test and Dave pointed out that XFS avoids the issue by just telling lockdep what's up. This is much simpler. XFS does the same thing in fs/xfs/xfs_aops.c. Signed-off-by: Sage Weil s...@inktank.com --- fs/btrfs/transaction.c | 16 1 files changed, 16 insertions(+), 0 deletions(-) diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 17be3de..efc41a5 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -1228,6 +1228,14 @@ static void do_async_commit(struct work_struct *work) struct btrfs_async_commit *ac = container_of(work, struct btrfs_async_commit, work.work); + /* +* We've got freeze protection passed with the transaction. +* Tell lockdep about it. +*/ + rwsem_acquire_read( + ac-root-fs_info-sb-s_writers.lock_map[SB_FREEZE_FS-1], + 0, 1, _THIS_IP_); + btrfs_commit_transaction(ac-newtrans, ac-root); kfree(ac); } @@ -1257,6 +1265,14 @@ int btrfs_commit_transaction_async(struct btrfs_trans_handle *trans, atomic_inc(cur_trans-use_count); btrfs_end_transaction(trans, root); + + /* +* Tell lockdep we've released the freeze rwsem, since the +* async commit thread will be the one to unlock it. +*/ + rwsem_release(root-fs_info-sb-s_writers.lock_map[SB_FREEZE_FS-1], + 1, _THIS_IP_); + schedule_delayed_work(ac-work, 0); /* wait for transaction to start and unblock */ -- 1.7.9 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: pass lockdep rwsem metadata to async commit transaction
Sadly, now I hit another one: [ 378.433842] = [ 378.433842] [ INFO: possible recursive locking detected ] [ 378.433845] 3.6.0-rc2-ceph-00143-g995fc06 #1 Not tainted [ 378.433845] - [ 378.433847] kworker/6:1/238 is trying to acquire lock: [ 378.433872] (sb_internal#2){.+.+..}, at: [a0042b74] start_transaction+0x124/0x430 [btrfs] [ 378.433873] [ 378.433873] but task is already holding lock: [ 378.433890] (sb_internal#2){.+.+..}, at: [a0042590] do_async_commit+0x0/0x80 [btrfs] [ 378.433891] [ 378.433891] other info that might help us debug this: [ 378.433892] Possible unsafe locking scenario: [ 378.433892] [ 378.433892]CPU0 [ 378.433893] [ 378.433895] lock(sb_internal#2); [ 378.433897] lock(sb_internal#2); [ 378.433898] [ 378.433898] *** DEADLOCK *** [ 378.433898] [ 378.433898] May be due to missing lock nesting notation [ 378.433898] [ 378.433899] 3 locks held by kworker/6:1/238: [ 378.433906] #0: (events){.+.+.+}, at: [810717d6] process_one_work+0x136/0x5f0 [ 378.433911] #1: (((ac-work)-work)){+.+...}, at: [810717d6] process_one_work+0x136/0x5f0 [ 378.433929] #2: (sb_internal#2){.+.+..}, at: [a0042590] do_async_commit+0x0/0x80 [btrfs] [ 378.433932] [ 378.433932] stack backtrace: [ 378.433935] Pid: 238, comm: kworker/6:1 Not tainted 3.6.0-rc2-ceph-00143-g995fc06 #1 [ 378.433936] Call Trace: [ 378.433941] [810b2032] __lock_acquire+0x1512/0x1b90 [ 378.433944] [810ada73] ? __bfs+0x23/0x270 [ 378.433961] [a0042b74] ? start_transaction+0x124/0x430 [btrfs] [ 378.433964] [810b2c82] lock_acquire+0xa2/0x140 [ 378.433980] [a0042b74] ? start_transaction+0x124/0x430 [btrfs] [ 378.433982] [810b3546] ? mark_held_locks+0x86/0x140 [ 378.433987] [8117dac6] __sb_start_write+0xc6/0x1b0 [ 378.434003] [a0042b74] ? start_transaction+0x124/0x430 [btrfs] [ 378.434019] [a0042b74] ? start_transaction+0x124/0x430 [btrfs] [ 378.434022] [81172e75] ? kmem_cache_alloc+0xb5/0x160 [ 378.434024] [81172f9b] ? kmem_cache_free+0x7b/0x160 [ 378.434042] [a0058b48] ? free_extent_state+0x58/0xd0 [btrfs] [ 378.434058] [a0042b74] start_transaction+0x124/0x430 [btrfs] [ 378.434076] [a005940d] ? __set_extent_bit+0x37d/0x4e0 [btrfs] [ 378.434092] [a0042ed5] btrfs_join_transaction+0x15/0x20 [btrfs] [ 378.434109] [a00496b7] cow_file_range+0x87/0x4a0 [btrfs] [ 378.434114] [81634c6b] ? _raw_spin_unlock+0x2b/0x40 [ 378.434131] [a004a80c] run_delalloc_range+0x34c/0x370 [btrfs] [ 378.434149] [a005cbb0] __extent_writepage+0x5e0/0x770 [btrfs] [ 378.434152] [810b3546] ? mark_held_locks+0x86/0x140 [ 378.434155] [8112aa5e] ? find_get_pages_tag+0x2e/0x1c0 [ 378.434174] [a005cffa] extent_write_cache_pages.isra.25.constprop.39+0x2ba/0x410 [btrfs] [ 378.434187] [a002f7cc] ? btrfs_run_delayed_refs+0xac/0x550 [btrfs] [ 378.434190] [81196117] ? igrab+0x27/0x70 [ 378.434208] [a005d389] extent_writepages+0x49/0x60 [btrfs] [ 378.434224] [a0046a90] ? btrfs_submit_direct+0x670/0x670 [btrfs] [ 378.434240] [a00444c8] btrfs_writepages+0x28/0x30 [btrfs] [ 378.434243] [81136443] do_writepages+0x23/0x40 [ 378.434247] [8112b839] __filemap_fdatawrite_range+0x59/0x60 [ 378.434249] [8112c6ac] filemap_flush+0x1c/0x20 [ 378.434266] [a0050b1e] btrfs_start_delalloc_inodes+0xbe/0x200 [btrfs] [ 378.434270] [8132babd] ? do_raw_spin_unlock+0x5d/0xb0 [ 378.434286] [a0041ebd] btrfs_commit_transaction+0x44d/0xb20 [btrfs] [ 378.434290] [81079850] ? __init_waitqueue_head+0x60/0x60 [ 378.434293] [810717d6] ? process_one_work+0x136/0x5f0 [ 378.434308] [a00425f1] do_async_commit+0x61/0x80 [btrfs] [ 378.434324] [a0042590] ? btrfs_commit_transaction+0xb20/0xb20 [btrfs] [ 378.434327] [81071840] process_one_work+0x1a0/0x5f0 [ 378.434330] [810717d6] ? process_one_work+0x136/0x5f0 [ 378.434346] [a0042590] ? btrfs_commit_transaction+0xb20/0xb20 [btrfs] [ 378.434350] [8107360d] worker_thread+0x18d/0x4c0 [ 378.434354] [81073480] ? manage_workers.isra.22+0x2c0/0x2c0 [ 378.434356] [810791ee] kthread+0xae/0xc0 [ 378.434359] [810b379d] ? trace_hardirqs_on+0xd/0x10 [ 378.434363] [8163e744] kernel_thread_helper+0x4/0x10 [ 378.434366] [81635430] ? retint_restore_args+0x13/0x13 [ 378.434368] [81079140] ? flush_kthread_work+0x1a0/0x1a0 [ 378.434371] [8163e740] ? gs_change+0x13/0x13 On Fri, 24 Aug 2012, Sage Weil wrote: The freeze rwsem is taken by sb_start_intwrite() and dropped during the commit_ or end_transaction(). In the async
Re: btrfs call trace
Hi Stefan, I haven't seen this one. The async commit stuff is mine, but there haven't been problems with it for a year or more. This is a recent kernel, I assume? Can you dump the other tasks? Something is preventing the commit from completing. [adding linux-btrfs to cc] sage On Fri, 22 Jun 2012, Stefan Priebe wrote: Hi, are these bugs known? [ 2157.104532] INFO: task kworker/2:2:6278 blocked for more than 120 seconds. [ 2157.130649] echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. [ 2157.156750] kworker/2:2 D 8180eba0 0 6278 2 0x [ 2157.182763] 880c3c0d7c90 0046 880c3e2aa1c0 00012280 [ 2157.208662] 880c3c0d7fd8 880c3c0d6010 00012280 00012280 [ 2157.234591] 880c3c0d7fd8 00012280 880e78e5a240 880c3e2aa1c0 [ 2157.260573] Call Trace: [ 2157.285872] [81620334] schedule+0x24/0x70 [ 2157.311097] [a007f4e5] wait_for_commit+0x55/0x90 [btrfs] [ 2157.311106] [81060140] ? wake_up_bit+0x40/0x40 [ 2157.311129] [a00810b5] btrfs_commit_transaction+0x655/0xab0 [btrfs] [ 2157.311134] [810736f0] ? set_next_entity+0x90/0xa0 [ 2157.311138] [81060140] ? wake_up_bit+0x40/0x40 [ 2157.311156] [a0081810] ? btrfs_end_transaction+0x20/0x20 [btrfs] [ 2157.311171] [a008182a] do_async_commit+0x1a/0x30 [btrfs] [ 2157.311175] [81058eff] process_one_work+0x11f/0x470 [ 2157.311179] [8105b128] worker_thread+0x178/0x400 [ 2157.311182] [8105afb0] ? manage_workers+0x210/0x210 [ 2157.311185] [8105fc46] kthread+0x96/0xa0 [ 2157.311190] [81622dd4] kernel_thread_helper+0x4/0x10 [ 2157.311195] [8105fbb0] ? kthread_worker_fn+0x130/0x130 [ 2157.311198] [81622dd0] ? gs_change+0xb/0xb [ 2157.311220] INFO: task ceph-osd:14426 blocked for more than 120 seconds. [ 2157.311220] echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. [ 2157.311224] ceph-osdD 0002 0 14426 1 0x [ 2157.311228] 880ae8c33c38 0082 880a39142180 00012280 [ 2157.311231] 880ae8c33fd8 880ae8c32010 00012280 00012280 [ 2157.311234] 880ae8c33fd8 00012280 880983efa1c0 880a39142180 [ 2157.311235] Call Trace: [ 2157.311240] [81620334] schedule+0x24/0x70 [ 2157.311256] [a0081d85] btrfs_commit_transaction_async+0x1d5/0x240 [btrfs] [ 2157.311271] [a0065eb6] ? block_rsv_add_bytes+0x26/0x60 [btrfs] [ 2157.311275] [81060140] ? wake_up_bit+0x40/0x40 [ 2157.311289] [a0065f25] ? block_rsv_migrate_bytes+0x35/0x50 [btrfs] [ 2157.311311] [a00b2d5e] btrfs_mksubvol+0x2be/0x350 [btrfs] [ 2157.311329] [a00b2ef9] btrfs_ioctl_snap_create_transid+0x109/0x1a0 [btrfs] [ 2157.311346] [a00b4bf4] btrfs_ioctl_snap_create_v2+0x84/0xf0 [btrfs] [ 2157.311363] [a00b756f] btrfs_ioctl+0x76f/0x12d0 [btrfs] [ 2157.311368] [8114459a] ? fsnotify+0x1ba/0x2e0 [ 2157.311372] [8111ade3] do_vfs_ioctl+0x93/0x4f0 [ 2157.311375] [8111b28a] sys_ioctl+0x4a/0x80 [ 2157.311379] [81621ba2] system_call_fastpath+0x16/0x1b Stefan -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Ceph on btrfs 3.4rc
On Tue, 24 Apr 2012, Josef Bacik wrote: On Fri, Apr 20, 2012 at 05:09:34PM +0200, Christian Brunner wrote: After running ceph on XFS for some time, I decided to try btrfs again. Performance with the current for-linux-min branch and big metadata is much better. The only problem (?) I'm still seeing is a warning that seems to occur from time to time: Actually, before you do that... we have a new tool, test_filestore_workloadgen, that generates a ceph-osd-like workload on the local file system. It's a subset of what a full OSD might do, but if we're lucky it will be sufficient to reproduce this issue. Something like test_filestore_workloadgen --osd-data /foo --osd-journal /bar will hopefully do the trick. Christian, maybe you can see if that is able to trigger this warning? You'll need to pull it from the current master branch; it wasn't in the last release. Thanks! sage [87703.784552] [ cut here ] [87703.789759] WARNING: at fs/btrfs/inode.c:2103 btrfs_orphan_commit_root+0xf6/0x100 [btrfs]() [87703.799070] Hardware name: ProLiant DL180 G6 [87703.804024] Modules linked in: btrfs zlib_deflate libcrc32c xfs exportfs sunrpc bonding ipv6 sg serio_raw pcspkr iTCO_wdt iTCO_vendor_support i7core_edac edac_core ixgbe dca mdio iomemory_vsl(PO) hpsa squashfs [last unloaded: scsi_wait_scan] [87703.828166] Pid: 929, comm: kworker/1:2 Tainted: P O 3.3.2-1.fits.1.el6.x86_64 #1 [87703.837513] Call Trace: [87703.840280] [8104df6f] warn_slowpath_common+0x7f/0xc0 [87703.847016] [8104dfca] warn_slowpath_null+0x1a/0x20 [87703.853533] [a0355686] btrfs_orphan_commit_root+0xf6/0x100 [btrfs] [87703.861541] [a0350a06] commit_fs_roots+0xc6/0x1c0 [btrfs] [87703.868674] [a0351bcb] btrfs_commit_transaction+0x5db/0xa50 [btrfs] [87703.876745] [810127a3] ? __switch_to+0x153/0x440 [87703.882966] [81070a90] ? wake_up_bit+0x40/0x40 [87703.888997] [a0352040] ? btrfs_commit_transaction+0xa50/0xa50 [btrfs] [87703.897271] [a035205f] do_async_commit+0x1f/0x30 [btrfs] [87703.904262] [81068949] process_one_work+0x129/0x450 [87703.910777] [8106b7eb] worker_thread+0x17b/0x3c0 [87703.916991] [8106b670] ? manage_workers+0x220/0x220 [87703.923504] [810703fe] kthread+0x9e/0xb0 [87703.928952] [8158c224] kernel_thread_helper+0x4/0x10 [87703.93] [81070360] ? kthread_freezable_should_stop+0x70/0x70 [87703.943323] [8158c220] ? gs_change+0x13/0x13 [87703.949149] ---[ end trace b8c31966cca731fa ]--- [91128.812399] [ cut here ] [91128.817576] WARNING: at fs/btrfs/inode.c:2103 btrfs_orphan_commit_root+0xf6/0x100 [btrfs]() [91128.826930] Hardware name: ProLiant DL180 G6 [91128.831897] Modules linked in: btrfs zlib_deflate libcrc32c xfs exportfs sunrpc bonding ipv6 sg serio_raw pcspkr iTCO_wdt iTCO_vendor_support i7core_edac edac_core ixgbe dca mdio iomemory_vsl(PO) hpsa squashfs [last unloaded: scsi_wait_scan] [91128.856086] Pid: 6806, comm: btrfs-transacti Tainted: PW O 3.3.2-1.fits.1.el6.x86_64 #1 [91128.865912] Call Trace: [91128.868670] [8104df6f] warn_slowpath_common+0x7f/0xc0 [91128.875379] [8104dfca] warn_slowpath_null+0x1a/0x20 [91128.881900] [a0355686] btrfs_orphan_commit_root+0xf6/0x100 [btrfs] [91128.889894] [a0350a06] commit_fs_roots+0xc6/0x1c0 [btrfs] [91128.897019] [a03a2b61] ? btrfs_run_delayed_items+0xf1/0x160 [btrfs] [91128.905075] [a0351bcb] btrfs_commit_transaction+0x5db/0xa50 [btrfs] [91128.913156] [a03524b2] ? start_transaction+0x92/0x310 [btrfs] [91128.920643] [81070a90] ? wake_up_bit+0x40/0x40 [91128.926667] [a034cfcb] transaction_kthread+0x26b/0x2e0 [btrfs] [91128.934254] [a034cd60] ? btrfs_destroy_marked_extents.clone.0+0x1f0/0x1f0 [btrfs] [91128.943671] [a034cd60] ? btrfs_destroy_marked_extents.clone.0+0x1f0/0x1f0 [btrfs] [91128.953079] [810703fe] kthread+0x9e/0xb0 [91128.958532] [8158c224] kernel_thread_helper+0x4/0x10 [91128.965133] [81070360] ? kthread_freezable_should_stop+0x70/0x70 [91128.972913] [8158c220] ? gs_change+0x13/0x13 [91128.978826] ---[ end trace b8c31966cca731fb ]--- I'm able to reproduce this with ceph on a single server with 4 disks (4 filesystems/osds) and a small test program based on librbd. It is simply writing random bytes on a rbd volume (see attachment). Is this something I should care about? Any hint's on solving this would be appreciated. Can you send me a config or some basic steps for me to setup ceph on my box so I can run this program and finally track down this problem? Thanks, Josef -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a
[LSF/MM TOPIC] COWing writeback pages
Hi everyone, The takeaway from the 'stable pages' discussions in the last few workshops was that pages under writeback should remain locked so that subsequent writers don't touch them while they are en route to the disk. This prevents bad checksums and DIF/DIX type failures (whereas previously we didn't really care whether old or new data reached the disk). The fear is/was that anyone subsequently modifying the page will have to wait for writeback io to complete before continuing. I seem to remember somebody (Martin?) saying that in practice, under real workloads, that doesn't actually happen, so don't worry about it. (Does anyone remember the details of what testing led to that conclusion?) Anyway, we are seeing what looks like an analogous problem with btrfs, where operations sometimes block waiting for writeback of the btree pages. Although the 'keep rewriting the same page' pattern may not be prevalent in normal file workloads, it does seem to happen with the btrfs btree. The obvious solution seems to be to COW the page if it is under writeback and we want to remodify it. Presumably that can be done just in btrfs, to address the btrfs-specific symptoms we're hitting, but I'm interested in hearing from other folks about whether it's more generally useful VM functionality for other filesystems and other workloads. Unfortunately, we haven't been able to pinpoint the exact scenarios under which this triggers under btrfs. We regularly see long stalls for metadata operations (create() and similar metadata-only operations) that block after btrfs_commit_transaction has finished the previous transaction and is doing return filemap_write_and_wait(btree_inode-i_mapping); What we're less clear about is when btrfs will modify the in-memory page in place (and thus wait) versus COWing the page... still digging into this now. It's seems like there is a btrfs-specific question about exactly what is going on and why, which isn't super-relevant for LSF/MM (except that we'll all be there). However, my suspicion is that the solution will be generally applicable to other filesystems, and that the tests that led us to believe that normal workloads aren't affected by locked writeback pages would inform which path to take in solving our specific btrfs problem. sage -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [LSF/MM TOPIC] COWing writeback pages
On Fri, 10 Feb 2012, Josef Bacik wrote: On Fri, Feb 10, 2012 at 11:25:27AM -0800, Sage Weil wrote: Hi everyone, The takeaway from the 'stable pages' discussions in the last few workshops was that pages under writeback should remain locked so that subsequent writers don't touch them while they are en route to the disk. This prevents bad checksums and DIF/DIX type failures (whereas previously we didn't really care whether old or new data reached the disk). The fear is/was that anyone subsequently modifying the page will have to wait for writeback io to complete before continuing. I seem to remember somebody (Martin?) saying that in practice, under real workloads, that doesn't actually happen, so don't worry about it. (Does anyone remember the details of what testing led to that conclusion?) Anyway, we are seeing what looks like an analogous problem with btrfs, where operations sometimes block waiting for writeback of the btree pages. Although the 'keep rewriting the same page' pattern may not be prevalent in normal file workloads, it does seem to happen with the btrfs btree. The obvious solution seems to be to COW the page if it is under writeback and we want to remodify it. Presumably that can be done just in btrfs, to address the btrfs-specific symptoms we're hitting, but I'm interested in hearing from other folks about whether it's more generally useful VM functionality for other filesystems and other workloads. Unfortunately, we haven't been able to pinpoint the exact scenarios under which this triggers under btrfs. We regularly see long stalls for metadata operations (create() and similar metadata-only operations) that block after btrfs_commit_transaction has finished the previous transaction and is doing return filemap_write_and_wait(btree_inode-i_mapping); What we're less clear about is when btrfs will modify the in-memory page in place (and thus wait) versus COWing the page... still digging into this now. Heh so I'm working on this now, specifically in the heavy create() workload, and I've just about got it nailed down. A lot of this problem is because we rely on normal pagecache for our metadata so I'm copying xfs and creating our own caching. The thing is since we have an inode hanging out with normal pagecache pages we can have multiple people trying to write out dirty pages in our inode at the same time, and since it goes through our normal write path we'll end up in this case where we're waiting on writeback for pages we won't actually end up writing out. My code will fix this, if we're talking about the same problem ;). Oh, I hadn't thought of that... that sounds like a similar but slightly different problem, since it probably wouldn't correlate with the filemap_write_and_wait(). As long as we don't have a btree update waiting on btree writeback, though, both problems should be addressed. In any case, we're definitely interested in checking out the code when it's ready to share! sage -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: WARNING: at fs/btrfs/inode.c:2222
On Fri, 27 Jan 2012, Chris Mason wrote: On Fri, Jan 27, 2012 at 09:31:46AM -0800, Sage Weil wrote: Has anyone see this one before? We've been hitting it sporatically for a while now, although recently we've been able to trigger pretty easily. The result is EINVAL from the snap create ioctl. The kernel is v3.2 + some ceph stuff. The code in question is /* if we have links, this was a truncate, lets do that */ if (inode-i_nlink) { if (!S_ISREG(inode-i_mode)) { WARN_ON(1); iput(inode); continue; } Is this something obvious, or should we spend some time chasing it down? This doesn't seem right, we've got an orphan on a directory for a truncate? The link count is probably 1, but no I haven't seen this. Okay, looking into it. Thanks! sage -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On Wed, 26 Oct 2011, Christian Brunner wrote: 2011/10/26 Sage Weil s...@newdream.net: On Wed, 26 Oct 2011, Christian Brunner wrote: Christian, have you tweaked those settings in your ceph.conf?  It would be something like 'journal dio = false'.  If not, can you verify that directio shows true when the journal is initialized from your osd log? E.g.,  2011-10-21 15:21:02.026789 7ff7e5c54720 journal _open dev/osd0.journal fd 14: 104857600 bytes, block size 4096 bytes, directio = 1 If directio = 1 for you, something else funky is causing those blkdev_fsync's... I've looked it up in the logs - directio is 1: Oct 25 17:20:16 os00 osd.000[1696]: 7f0016841740 journal _open /dev/vg01/lv_osd_journal_0 fd 15: 17179869184 bytes, block size 4096 bytes, directio = 1 Do you mind capturing an strace?  I'd like to see where that blkdev_fsync is coming from. Here is an strace. I can see a lot of sync_file_range operations. Yeah, these all look like the flusher thread, and shouldn't be hitting blkdev_fsync.  Can you confirm that with     filestore flusher = false     filestore sync flush = false you get no sync_file_range at all?  I wonder if this is also perf lying about the call chain. Yes, setting this makes the sync_file_range calls go away. Okay. That means either sync_file_range on a regular btrfs file is triggering blkdev_fsync somewhere in btrfs, there is an extremely sneaky bug that is mixing up file descriptors, or latencytop is lying. I'm guessing the latter, given the other weirdness Josef and Chris were seeing. :) Is it safe to use these settings with filestore btrfs snap = 0? Yeah. They're purely a performance thing to push as much dirty data to disk as quickly as possible to minimize the snapshot create latency. You'll notice the write throughput tends to tank when them off. sage
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On Tue, 25 Oct 2011, Christoph Hellwig wrote: On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote: - When I run ceph with btrfs snaps disabled, the situation is getting slightly better. I can run an OSD for about 3 days without problems, but then again the load increases. This time, I can see that the ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work than usual. FYI in this scenario you're exposed to the same journal replay issues that ext4 and XFS are. The btrfs workload that ceph is generating will also not be all that special, though, so this problem shouldn't be unique to ceph. What journal replay issues would ext4 and XFS be exposed to? It's the ceph-osd journal replay, not the ext4/XFS journal... the #2 item in http://marc.info/?l=ceph-develm=131942130322957w=2 sage -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On Tue, 25 Oct 2011, Josef Bacik wrote: At this point it seems like the biggest problem with latency in ceph-osd is not related to btrfs, the latency seems to all be from the fact that ceph-osd is fsyncing a block dev for whatever reason. There is one place where we sync_file_range() on the journal block device, but that should only happen if directio is disabled (it's on by default). Christian, have you tweaked those settings in your ceph.conf? It would be something like 'journal dio = false'. If not, can you verify that directio shows true when the journal is initialized from your osd log? E.g., 2011-10-21 15:21:02.026789 7ff7e5c54720 journal _open dev/osd0.journal fd 14: 104857600 bytes, block size 4096 bytes, directio = 1 If directio = 1 for you, something else funky is causing those blkdev_fsync's... sage -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On Tue, 25 Oct 2011, Christian Brunner wrote: 2011/10/25 Sage Weil s...@newdream.net: On Tue, 25 Oct 2011, Josef Bacik wrote: At this point it seems like the biggest problem with latency in ceph-osd is not related to btrfs, the latency seems to all be from the fact that ceph-osd is fsyncing a block dev for whatever reason. There is one place where we sync_file_range() on the journal block device, but that should only happen if directio is disabled (it's on by default). Christian, have you tweaked those settings in your ceph.conf? Â It would be something like 'journal dio = false'. Â If not, can you verify that directio shows true when the journal is initialized from your osd log? E.g., Â 2011-10-21 15:21:02.026789 7ff7e5c54720 journal _open dev/osd0.journal fd 14: 104857600 bytes, block size 4096 bytes, directio = 1 If directio = 1 for you, something else funky is causing those blkdev_fsync's... I've looked it up in the logs - directio is 1: Oct 25 17:20:16 os00 osd.000[1696]: 7f0016841740 journal _open /dev/vg01/lv_osd_journal_0 fd 15: 17179869184 bytes, block size 4096 bytes, directio = 1 Do you mind capturing an strace? I'd like to see where that blkdev_fsync is coming from. thanks! sage
ceph on btrfs [was Re: ceph on non-btrfs file systems]
[adding linux-btrfs to cc] Josef, Chris, any ideas on the below issues? On Mon, 24 Oct 2011, Christian Brunner wrote: Thanks for explaining this. I don't have any objections against btrfs as a osd filesystem. Even the fact that there is no btrfs-fsck doesn't scare me, since I can use the ceph replication to recover a lost btrfs-filesystem. The only problem I have is, that btrfs is not stable on our side and I wonder what you are doing to make it work. (Maybe it's related to the load pattern of using ceph as a backend store for qemu). Here is a list of the btrfs problems I'm having: - When I run ceph with the default configuration (btrfs snaps enabled) I can see a rapid increase in Disk-I/O after a few hours of uptime. Btrfs-cleaner is using more and more time in btrfs_clean_old_snapshots(). In theory, there shouldn't be any significant difference between taking a snapshot and removing it a few commits later, and the prior root refs that btrfs holds on to internally until the new commit is complete. That's clearly not quite the case, though. In any case, we're going to try to reproduce this issue in our environment. - When I run ceph with btrfs snaps disabled, the situation is getting slightly better. I can run an OSD for about 3 days without problems, but then again the load increases. This time, I can see that the ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work than usual. FYI in this scenario you're exposed to the same journal replay issues that ext4 and XFS are. The btrfs workload that ceph is generating will also not be all that special, though, so this problem shouldn't be unique to ceph. Another thing is that I'm seeing a WARNING: at fs/btrfs/inode.c:2114 from time to time. Maybe it's related to the performance issues, but seems to be able to verify this. I haven't seen this yet with the latest stuff from Josef, but others have. Josef, is there any information we can provide to help track it down? It's really sad to see, that ceph performance and stability is suffering that much from the underlying filesystems and that this hasn't changed over the last months. We don't have anyone internally working on btrfs at the moment, and are still struggling to hire experienced kernel/fs people. Josef has been very helpful with tracking these issues down, but he hass responsibilities beyond just the Ceph related issues. Progress is slow, but we are working on it! sage Kind regards, Christian 2011/10/24 Sage Weil s...@newdream.net: Although running on ext4, xfs, or whatever other non-btrfs you want mostly works, there are a few important remaining issues: 1- ext4 limits total xattrs for 4KB. Â This can cause problems in some cases, as Ceph uses xattrs extensively. Â Most of the time we don't hit this. Â We do hit the limit with radosgw pretty easily, though, and may also hit it in exceptional cases where the OSD cluster is very unhealthy. There is a large xattr patch for ext4 from the Lustre folks that has been floating around for (I think) years. Â Maybe as interest grows in running Ceph on ext4 this can move upstream. Previously we were being forgiving about large setxattr failures on ext3, but we found that was leading to corruption in certain cases (because we couldn't set our internal metadata), so the next release will assert/crash in that case (fail-stop instead of fail-maybe-eventually-corrupt). XFS does not have an xattr size limit and thus does have this problem. 2- The other problem is with OSD journal replay of non-idempotent transactions. Â On non-btrfs backends, the Ceph OSDs use a write-ahead journal. Â After restart, the OSD does not know exactly which transactions in the journal may have already been committed to disk, and may reapply a transaction again during replay. Â For most operations (write, delete, truncate) this is fine. Some operations, though, are non-idempotent. Â The simplest example is CLONE, which copies (efficiently, on btrfs) data from one object to another. Â If the source object is modified, the osd restarts, and then the clone is replayed, the target will get incorrect (newer) data. Â For example, 1- clone A - B 2- modify A Â osd crash, replay from 1 B will get new instead of old contents. (This doesn't happen on btrfs because the snapshots allow us to replay from a known consistent point in time.) For things like clone, skipping the operation of the target exists almost works, except for cases like 1- clone A - B 2- modify A ... 3- delete B Â osd crash, replay from 1 (Although in that example who cares if B had bad data; it was removed anyway.) Â The larger problem, though, is that that doesn't always work; CLONERANGE copies a range of a file from A to B, where B may already exist. In practice, the higher level interfaces don't make full use of the low-level interface, so it's possible
Re: OSD: no current directory
On Tue, 11 Oct 2011, Christian Brunner wrote: 2011/10/11 Sage Weil s...@newdream.net: On Tue, 11 Oct 2011, Christian Brunner wrote: Maybe this one is easier: One of our OSDs isn't starting, because ther is no current directory. What I have are three snap directories. total 0 -rw-r--r-- 1 root root  37 Oct  9 15:57 ceph_fsid -rw-r--r-- 1 root root   8 Oct  9 15:57 fsid -rw-r--r-- 1 root root  21 Oct  9 15:57 magic drwxr-xr-x 1 root root 7986 Oct 11 18:34 snap_506043 drwxr-xr-x 1 root root 7986 Oct 11 18:34 snap_507364 drwxr-xr-x 1 root root 7814 Oct 11 18:36 snap_507417 -rw-r--r-- 1 root root   4 Oct  9 15:57 store_version -rw-r--r-- 1 root root   2 Oct  9 15:57 whoami Is there a way to rollback the latest? That's what the OSD actually does on startup (roll back to the newest snap_).  It's probably a trivial bug that's preventing startup now... I'll take a look.  In the meantime, you can clone the latest snap_ to current and it should start! sage This seems to be a btrfs problem. It fails, when I'm trying to create the clone # btrfs subvolume snapshot snap_507417 current Create a snapshot of 'snap_507417' in './current' ERROR: cannot snapshot 'snap_507417' And I get the following kernel messages: [ 5863.263950] [ cut here ] [ 5863.269125] WARNING: at fs/btrfs/inode.c:2335 btrfs_orphan_cleanup+0xcd/0x3d0 [btrfs]() [ 5863.278142] Hardware name: ProLiant DL180 G6 [ 5863.283161] Modules linked in: btrfs zlib_deflate libcrc32c bonding ipv6 serio_raw pcspkr ghes hed iTCO_wdt iTCO_vendor_support ixgbe dca mdio i7core_edac edac_core iomemory_vsl(P) hpsa squashfs usb_storage [last unloaded: scsi_wait_scan] [ 5863.307774] Pid: 6349, comm: btrfs Tainted: PW 3.0.6-1.fits.2.el6.x86_64 #1 [ 5863.316647] Call Trace: [ 5863.319648] [8106344f] warn_slowpath_common+0x7f/0xc0 [ 5863.326536] [810634aa] warn_slowpath_null+0x1a/0x20 [ 5863.333146] [a023fb0d] btrfs_orphan_cleanup+0xcd/0x3d0 [btrfs] [ 5863.340839] [a0238381] ? join_transaction+0x201/0x250 [btrfs] [ 5863.348482] [a021fbaa] ? block_rsv_migrate_bytes+0x3a/0x50 [btrfs] [ 5863.356590] [a0261a3b] btrfs_mksubvol+0x2fb/0x380 [btrfs] [ 5863.363726] [a0261bba] btrfs_ioctl_snap_create_transid+0xfa/0x150 [btrfs] [ 5863.372445] [a0261c66] btrfs_ioctl_snap_create+0x56/0x80 [btrfs] [ 5863.380398] [a026583e] btrfs_ioctl+0x2fe/0xd50 [btrfs] [ 5863.387344] [8125ed20] ? inode_has_perm+0x30/0x40 [ 5863.393798] [81261a2c] ? file_has_perm+0xdc/0xf0 [ 5863.400114] [8117086a] do_vfs_ioctl+0x9a/0x5a0 [ 5863.406244] [81170e11] sys_ioctl+0xa1/0xb0 [ 5863.412001] [81562882] system_call_fastpath+0x16/0x1b [ 5863.418767] ---[ end trace e3234ecab14ad64c ]--- [ 5863.424084] btrfs: Error removing orphan entry, stopping orphan cleanup [ 5863.431614] btrfs: could not do orphan cleanup -22 Can I use an older snapshot as well? You're able to snapshot the others? Yeah, any of the snap_ directories will work, although keep in mind when the OSD starts up it will immediately remove current/ and re-clone the newest snap_ to current/ again. If the problem is a toxic/broken snap_ dir, you'll need to rename it out of the way to avoid hitting the problem again... sage Regards, Christian -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GIT PULL] Btrfs fixes
Hi Chris- This pull misses the clone reservation fix again... :) http://www.spinics.net/lists/linux-btrfs/msg11826.html Thanks! sage On Mon, 19 Sep 2011, Chris Mason wrote: Hi everyone, The for-linus branch of the btrfs tree on github: Head commit: a66e7cc626f42de6c745963fe0d807518fa49d39 git://github.com/chrismason/linux.git for-linus Has the following fixes. for-linus is against rc6, since some of these are regression fixes for earlier 3.1 btrfs commits. The most important of the bunch is Josef's dentry fix, which avoids enoents if we race with multiple procs hitting on the same inode. This bug is btrfs-specific, it came in with his optimization to cache the inode location during readdir. Li Zefan (3) commits (+9/-5): Btrfs: don't make a file partly checksummed through file clone (+5/-0) Btrfs: don't change inode flag of the dest clone file (+0/-1) Btrfs: fix pages truncation in btrfs_ioctl_clone() (+4/-4) Josef Bacik (1) commits (+11/-2): Btrfs: only clear the need lookup flag after the dentry is setup Jeff Liu (1) commits (+7/-2): BTRFS: Fix lseek return value for error Hidetoshi Seto (1) commits (+3/-2): btrfs: fix d_off in the first dirent Total: (6) commits (+30/-11) fs/btrfs/file.c |9 +++-- fs/btrfs/inode.c | 18 ++ fs/btrfs/ioctl.c | 14 +- 3 files changed, 30 insertions(+), 11 deletions(-) -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: fix pages truncation in btrfs_ioctl_clone()
On Sun, 18 Sep 2011, Chris Mason wrote: Excerpts from Sage Weil's message of 2011-09-16 12:39:06 -0400: On Fri, 16 Sep 2011, Li Zefan wrote: It's a bug in commit f81c9cdc567cd3160ff9e64868d9a1a7ee226480 (Btrfs: truncate pages from clone ioctl target range) We should pass the dest range to the truncate function, but not the src range. Sigh... yes. Also move the function before locking extent state. Hmm, any reason? i_mutex protects us from a racing write(2), but what about a racing mmap()? e.g. cloner: truncates dest pages writer: mmap - page_mkwrite locks extent, creates new dirty page, unlocks cloner: locks extent, clones, unlocks extent Thanks guys. The locking order is page lock - extent lock. So if we call truncate_inode_pages with the extent lock held, we're off into ABBA. Oh right. This patch makes sense now. If we want to avoid the mmap race, we'll have to look for the dirty pages with the extent lock held, drop the lock and goto back to the truncate_inode_pages call. Yeah, given that (at Li points out) we're not locking dst extents at all, I don't think it's worth worrying about now. Sorry for the noise! sage -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: fix pages truncation in btrfs_ioctl_clone()
On Fri, 16 Sep 2011, Li Zefan wrote: It's a bug in commit f81c9cdc567cd3160ff9e64868d9a1a7ee226480 (Btrfs: truncate pages from clone ioctl target range) We should pass the dest range to the truncate function, but not the src range. Sigh... yes. Also move the function before locking extent state. Hmm, any reason? i_mutex protects us from a racing write(2), but what about a racing mmap()? e.g. cloner: truncates dest pages writer: mmap - page_mkwrite locks extent, creates new dirty page, unlocks cloner: locks extent, clones, unlocks extent sage Signed-off-by: Li Zefan l...@cn.fujitsu.com --- fs/btrfs/ioctl.c |8 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 5492bb3..f6af8b0 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -2237,6 +2237,10 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd, goto out_unlock; } + /* truncate page cache pages from target inode range */ + truncate_inode_pages_range(inode-i_data, destoff, +PAGE_CACHE_ALIGN(destoff + len) - 1); + /* do any pending delalloc/csum calc on src, one way or another, and lock file content */ while (1) { @@ -2253,10 +2257,6 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd, btrfs_wait_ordered_range(src, off, len); } - /* truncate page cache pages from target inode range */ - truncate_inode_pages_range(inode-i_data, off, -ALIGN(off + len, PAGE_CACHE_SIZE) - 1); - /* clone data */ key.objectid = btrfs_ino(src); key.type = BTRFS_EXTENT_DATA_KEY; -- 1.7.3.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: WARNING: at fs/btrfs/inode.c:2193 btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]()
On Tue, 13 Sep 2011, Liu Bo wrote: On 09/11/2011 05:47 AM, Martin Mailand wrote: Hi I am hitting this Warning reproducible, the workload is a ceph osd, kernel ist 3.1.0-rc5. Have posted a patch for this: http://marc.info/?l=linux-btrfsm=131547325515336w=2 We're still seeing this with -rc6, which includes 98c9942 and 65450aa. I haven't looked at the reservation code in much detail. Is there anything I can do to help track this down? Thanks- sage thanks, liubo Best Regards, martin [ 5472.099766] [ cut here ] [ 5472.099833] WARNING: at fs/btrfs/inode.c:2193 btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]() [ 5472.099838] Hardware name: MS-96B3 [ 5472.099842] Modules linked in: radeon ttm drm_kms_helper drm i2c_algo_bit psmouse sp5100_tco edac_core lp shpchp serio_raw k8temp i2c_piix4 edac_mce_amd parport ahci pata_atiixp e1000e libahci btrfs zlib_deflate libcrc32c [ 5472.099878] Pid: 2066, comm: kworker/1:1 Not tainted 3.1.0-rc5-custom #1 [ 5472.099882] Call Trace: [ 5472.099898] [81063c1f] warn_slowpath_common+0x7f/0xc0 [ 5472.099907] [81063c7a] warn_slowpath_null+0x1a/0x20 [ 5472.099935] [a003f420] btrfs_orphan_commit_root+0xb0/0xc0 [btrfs] [ 5472.099961] [a00380aa] commit_fs_roots.clone.21+0xba/0x1a0 [btrfs] [ 5472.099971] [815db96e] ? _raw_spin_lock+0xe/0x20 [ 5472.07] [a003966f] btrfs_commit_transaction+0x3ef/0x870 [btrfs] [ 5472.100065] [81012871] ? __switch_to+0x261/0x2f0 [ 5472.100084] [81086bf0] ? wake_up_bit+0x40/0x40 [ 5472.100120] [a0039af0] ? btrfs_commit_transaction+0x870/0x870 [btrfs] [ 5472.100155] [a0039b0f] do_async_commit+0x1f/0x30 [btrfs] [ 5472.100171] [8108110d] process_one_work+0x11d/0x430 [ 5472.100187] [81081c69] worker_thread+0x169/0x360 [ 5472.100203] [81081b00] ? manage_workers.clone.21+0x240/0x240 [ 5472.100220] [81086496] kthread+0x96/0xa0 [ 5472.100236] [815e5bb4] kernel_thread_helper+0x4/0x10 [ 5472.100253] [81086400] ? flush_kthread_worker+0xb0/0xb0 [ 5472.100269] [815e5bb0] ? gs_change+0x13/0x13 [ 5472.100279] ---[ end trace a8bae5767c2c3e55 ]--- -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: WARNING: at fs/btrfs/inode.c:2193 btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]()
On Thu, 15 Sep 2011, David Sterba wrote: On Thu, Sep 15, 2011 at 03:50:29PM -0400, Josef Bacik wrote: We're still seeing this with -rc6, which includes 98c9942 and 65450aa. I haven't looked at the reservation code in much detail. Is there anything I can do to help track this down? This should be taken care of with all my enospc changes. You can pull them down from my btrfs-work tree as soon as kernel.org comes back from the dead :). should you need it earlier, here's a copy: git://repo.or.cz/linux-2.6/btrfs-unstable.git #git.kernel.org/josef/master Thanks! We'll do some testing today and see if it behaves better. :) sage -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] Btrfs: don't make a file partly checksummed through file clone
On Wed, 14 Sep 2011, Li Zefan wrote: To reproduce the bug: # mount /dev/sda7 /mnt # dd if=/dev/zero of=/mnt/src bs=4K count=1 # umount /mnt # mount -o nodatasum /dev/sda7 /mnt # dd if=/dev/zero of=/mnt/dst bs=4K count=1 # clone_range -s 4K -l 4K /mnt/src /mnt/dst # echo 3 /proc/sys/vm/drop_caches # cat /mnt/dst # dmesg ... btrfs no csum found for inode 258 start 0 btrfs csum failed ino 258 off 0 csum 2566472073 private 0 It's because part of the file is checksummed and the other part is not, and then btrfs will complain checksum is not found when we read the file. Disallow file clone if src and dst file have different checksum flag, so we ensure a file is completely checksummed or unchecksummed. Signed-off-by: Li Zefan l...@cn.fujitsu.com Looks good to me. Reviewed-by: Sage Weil s...@newdream.net --- fs/btrfs/ioctl.c |5 + 1 files changed, 5 insertions(+), 0 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 970977a..dc82bbb 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -2177,6 +2177,11 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd, if (!(src_file-f_mode FMODE_READ)) goto out_fput; + /* don't make the dst file partly checksummed */ + if ((BTRFS_I(src)-flags BTRFS_INODE_NODATASUM) != + (BTRFS_I(inode)-flags BTRFS_INODE_NODATASUM)) + goto out_fput; + ret = -EISDIR; if (S_ISDIR(src-i_mode) || S_ISDIR(inode-i_mode)) goto out_fput; -- 1.7.3.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GIT PULL] Btrfs updates
Hi Chris, Josef, Can some form of the clone ioctl transaction start reservation fix go in soon as well? That hits a BUG_ON every time. Thanks! sage On Thu, 18 Aug 2011, Chris Mason wrote: Hi everyone, The for-linus branch of the btrfs-unstable tree: git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable.git for-linus Has our current pull request for 3.1-rc. The for-linus branch includes 3.1-rc2 because it two fixes from Dan Carpenter and Jeff Mahoney that only apply to 3.1. The master branch of btrfs-unstable is based on 3.0, and includes everything except those two changes. This is a variety pack of fixes. We do have another fix pending for a race in our readdir optimizations, but Josef will work that out with Al Viro and send it in later this week (or early next). Chris Mason (1) commits (+17/-0): Btrfs: force unplugs when switching from high to regular priority bios Dan Carpenter (2) commits (+10/-4): btrfs: unlock on error in btrfs_file_llseek() (+8/-4) btrfs: memory leak in btrfs_add_inode_defrag() (+2/-0) Jeff Mahoney (1) commits (+8/-4): btrfs: btrfs_permission's RO check shouldn't apply to device nodes Josef Bacik (2) commits (+43/-2): Btrfs: set i_size properly when fallocating (+14/-0) Btrfs: detect wether a device supports discard (+29/-2) Li Zefan (1) commits (+2/-4): Btrfs: use plain page_address() in header fields setget functions Miao Xie (2) commits (+12/-6): Btrfs: fix uninitialized sync_pending (+1/-1) Btrfs: fix wrong free space information (+11/-5) Sage Weil (1) commits (+4/-0): Btrfs: truncate pages from clone ioctl target range Tsutomu Itoh (1) commits (+16/-10): Btrfs: forced readonly when btrfs_drop_snapshot() fails liubo (3) commits (+72/-14): Btrfs: check if there is enough space for balancing smarter (+35/-6) Btrfs: fix a bug of balance on full multi-disk partitions (+13/-4) Btrfs: fix an oops of log replay (+24/-4) Total: (14) commits (+183/-43) fs/btrfs/ctree.h| 10 ++--- fs/btrfs/extent-tree.c | 75 +- fs/btrfs/file.c | 28 ++-- fs/btrfs/free-space-cache.c | 16 ++--- fs/btrfs/inode.c| 12 -- fs/btrfs/ioctl.c|4 ++ fs/btrfs/tree-log.c | 28 ++-- fs/btrfs/volumes.c | 51 +++-- fs/btrfs/volumes.h |2 + 9 files changed, 183 insertions(+), 43 deletions(-) -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: reserve sufficient space for ioctl clone
On Tue, 16 Aug 2011, Miao Xie wrote: On tue, 9 Aug 2011 10:46:37 -0700, Sage Weil wrote: Fix a crash/BUG_ON in the clone ioctl due to insufficient reservation. We need to reserve space for: - adjusting the old extent (possibly splitting it) - adding the new extent - updating the inode Signed-off-by: Sage Weil s...@newdream.net --- fs/btrfs/ioctl.c |7 ++- 1 files changed, 6 insertions(+), 1 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index a3c4751..f038d4a 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -2320,7 +2320,12 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd, else new_key.offset = destoff; - trans = btrfs_start_transaction(root, 1); + /* +* 1 - adjusting old extent (we may have to split it) I don't think it is enough. If we have lots of file extents and their extent items are stored in many contiguous leaves, the drop operation may cause those leaves to be COWed. So I think we must calculate the number of leaves which will be COWed at first. Hmm, yes, but if that's the case, most of the other btrfs_drop_extents() callers are broken too. Take btrfs_cont_expand(), which does more or less the same thing that we're doing too but reserves 2 (no inode update). Josef, correct me if I'm wrong, but I'm guessing how this works is that we are reserving space for something along the lines of 2 * tree depth, or enough space to cow the leaf blocks and their parents (plus any space for splits/merges). For inserts, we'd need to be careful about how many items we insert (more new leaves), but for removals, as long as we are deleting sequential items, we need at most two leaves, for the new versions of each end of the deleted range. Is that right? Hmm, how does the reservation interace with the extent backrefs, though? Is space reserved for that? sage Thanks Miao +* 1 - add new extent +* 1 - inode update +*/ + trans = btrfs_start_transaction(root, 3); if (IS_ERR(trans)) { ret = PTR_ERR(trans); goto out; -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs: reserve sufficient space for ioctl clone
Fix a crash/BUG_ON in the clone ioctl due to insufficient reservation. We need to reserve space for: - adjusting the old extent (possibly splitting it) - adding the new extent - updating the inode Signed-off-by: Sage Weil s...@newdream.net --- fs/btrfs/ioctl.c |7 ++- 1 files changed, 6 insertions(+), 1 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index a3c4751..f038d4a 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -2320,7 +2320,12 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd, else new_key.offset = destoff; - trans = btrfs_start_transaction(root, 1); + /* +* 1 - adjusting old extent (we may have to split it) +* 1 - add new extent +* 1 - inode update +*/ + trans = btrfs_start_transaction(root, 3); if (IS_ERR(trans)) { ret = PTR_ERR(trans); goto out; -- 1.7.0 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
clone ioctl bug with inline extents
Hi all, I'm hitting a problem cloning inline extents that I haven't had much success tracking down. It's simple enough to reproduce: echo src echo 2 dst clone_range src 0 29 dst 0 cmp src dst # fails! dst is size 29 but contains 2\n\0\0\0\0... where clone_range comes from http://ceph.newdream.net/git/?p=ceph.git;a=blob;f=qa/btrfs/clone_range.c;h=0a88e16013104c27aa87e7cd0d75e4d292419a19;hb=HEAD The file size is adjusted for the target, and debug-tree shows an inline data extent of length 29, but it has the old data in it. I'm not sure why ret = btrfs_insert_empty_item(trans, root, path, new_key, size); BUG_ON(ret); [...] leaf = path-nodes[0]; slot = path-slots[0]; write_extent_buffer(leaf, buf, btrfs_item_ptr_offset(leaf, slot), size); inode_add_bytes(inode, datal); is working when cloning to a new file but not over an existing one. Hopefully this is something silly I'm missing... sage -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: clone ioctl bug with inline extents
On Tue, 9 Aug 2011, Sage Weil wrote: Hi all, I'm hitting a problem cloning inline extents that I haven't had much success tracking down. It's simple enough to reproduce: echo src echo 2 dst clone_range src 0 29 dst 0 cmp src dst # fails! dst is size 29 but contains 2\n\0\0\0\0... facepalm, ok, this was just a matter of the ioctl code not dropping the old page cache pages. I'm not sure how we didn't notice that for so long! sage -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs slowdown
Hi Christian, Are you still seeing this slowness? sage On Wed, 27 Jul 2011, Christian Brunner wrote: 2011/7/25 Chris Mason chris.ma...@oracle.com: Excerpts from Christian Brunner's message of 2011-07-25 03:54:47 -0400: Hi, we are running a ceph cluster with btrfs as it's base filesystem (kernel 3.0). At the beginning everything worked very well, but after a few days (2-3) things are getting very slow. When I look at the object store servers I see heavy disk-i/o on the btrfs filesystems (disk utilization is between 60% and 100%). I also did some tracing on the Cepp-Object-Store-Daemon, but I'm quite certain, that the majority of the disk I/O is not caused by ceph or any other userland process. When reboot the system(s) the problems go away for another 2-3 days, but after that, it starts again. I'm not sure if the problem is related to the kernel warning I've reported last week. At least there is no temporal relationship between the warning and the slowdown. Any hints on how to trace this would be welcome. The easiest way to trace this is with latencytop. Apply this patch: http://oss.oracle.com/~mason/latencytop.patch And then use latencytop -c for a few minutes while the system is slow. Send the output here and hopefully we'll be able to figure it out. I've now installed latencytop. Attached are two output files: The first is from yesterday and was created aproxematly half an hour after the boot. The second on is from today, uptime is 19h. The load on the system is already rising. Disk utilization is approximately at 50%. Thanks for your help. Christian -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs slowdown
On Thu, 28 Jul 2011, Christian Brunner wrote: When I look at the latencytop results, there is a high latency when calling btrfs_commit_transaction_async. Isn't async supposed to return immediately? It depends. That function has to block until the commit has started before returning in the case where it creates a new btrfs root (i.e., snapshot creation). Otherwise a subsequent operation (after the ioctl returns) can sneak in before the snapshot is taken. (IIRC there was also another problem with keeping internal structures consistent, tho I'm forgetting the details.) And there are a bunch of things btrfs_commit_transaction() does before setting blocked = 1 that can be slow. There is a fair bit of transaction commit optimization work that should eventually be done here that we sadly haven't had the resources to look at yet. sage -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 3.0-rcX BUG at fs/btrfs/ioctl.c:432 - bisected
On Fri, 10 Jun 2011, Chris Mason wrote: Excerpts from Jim Schutt's message of 2011-06-10 13:06:22 -0400: [ two different btrfs crashes ] I think your two crashes in btrfs were from the uninit variables and those should be fixed in rc2. When I did my bisection, my criteria for success/failure was did mkcephfs succeed?. When I apply this criteria to a recent linus kernel (e.g. 06e86849cf4019), which includes the fix you mentioned (aa0467d8d2a00e), I get still a different failure mode, which doesn't actually reference btrfs: [ 276.364178] BUG: unable to handle kernel NULL pointer dereference at 000a [ 276.365127] IP: [a05434b1] journal_start+0x3e/0x9c [jbd] Looking at the resulting code in the oops, we're here in journal_start: if (handle) { J_ASSERT(handle-h_transaction-t_journal == journal); handle comes from current-journal_info, and we're doing a deref on handle-h_transaction, which is probably 0xa. So, we're leaving crud in current-journal_info and ext3 is finding it. Perhaps its from ceph starting a transaction but leaving it running? The bug came with Josef's transaction performance fixes, but it is probably a mixture of his code with the ioctls ceph is using. Ah, yeah, that's the problem. We saw a similar problem a while back with the start/stop transaction ioctls. In this case, create_snapshot is doing trans = btrfs_start_transaction(root-fs_info-extent_root, 5); if (IS_ERR(trans)) { ret = PTR_ERR(trans); goto fail; } which sets current-journal_info. Then ret = btrfs_snap_reserve_metadata(trans, pending_snapshot); BUG_ON(ret); list_add(pending_snapshot-list, trans-transaction-pending_snapshots); if (async_transid) { *async_transid = trans-transid; ret = btrfs_commit_transaction_async(trans, root-fs_info-extent_root, 1); } else { ret = btrfs_commit_transaction(trans, root-fs_info-extent_root); } but the async snap creation ioctl takes the async path, which runs btrfs_commit_transaction in a worker thread. I'm not sure what the right thing to do is here is... can whatever is in journal_info be attached to trans instead in btrfs_commit_transaction_async()? sage [ rest of the oops below for context ] -chris [ 276.365127] PGD 1e4469067 PUD 1e1658067 PMD 0 [ 276.365127] Oops: [#1] SMP [ 276.365127] CPU 2 [ 276.365127] Modules linked in: btrfs zlib_deflate lzo_compress ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp i2c_dev i2c_core ext3 jbd scsi_transport_iscsi rds ib_ipoib rdma_ucm rdma_cm ib_ucm ib_uverbs ib_umad ib_cm iw_cm ib_addr ipv6 ib_sa dm_mirror dm_region_hash dm_log dm_multipath scsi_dh dm_mod video sbs sbshc pci_slot battery acpi_pad ac kvm sg ses sd_mod enclosure megaraid_sas ide_cd_mod cdrom ib_mthca ib_mad qla2xxx button ib_core serio_raw scsi_transport_fc scsi_tgt dcdbas ata_piix libata tpm_tis tpm i5k_amb ioatdma tpm_bios hwmon iTCO_wdt scsi_mod i5000_edac iTCO_vendor_support ehci_hcd dca edac_core uhci_hcd pcspkr rtc nfs nfs_acl auth_rpcgss fscache lockd sunrpc tg3 bnx2 e1000 [last unloaded: freq_table] [ 276.365127] [ 276.365127] Pid: 6076, comm: cosd Not tainted 3.0.0-rc2-00196-g06e8684 #26 Dell Inc. PowerEdge 1950/0DT097 [ 276.365127] RIP: 0010:[a05434b1] [a05434b1] journal_start+0x3e/0x9c [jbd] [ 276.365127] RSP: 0018:8801e2897b28 EFLAGS: 00010286 [ 276.365127] RAX: 000a RBX: 8801de8e1090 RCX: 0002 [ 276.365127] RDX: 19b2d000 RSI: 000e RDI: 000e [ 276.365127] RBP: 8801e2897b48 R08: 0003 R09: 8801e2897c38 [ 276.365127] R10: 8801e2897ed8 R11: 0001 R12: 880223ff4400 [ 276.365127] R13: 880218522d60 R14: 0ec6 R15: 88021f54d878 [ 276.365127] FS: 7f8ff0bbb710() GS:88022fc8() knlGS: [ 276.365127] CS: 0010 DS: ES: CR0: 8005003b [ 276.365127] CR2: 000a CR3: 00021744f000 CR4: 06e0 [ 276.365127] DR0: DR1: DR2: [ 276.365127] DR3: DR6: 0ff0 DR7: 0400 [ 276.365127] Process cosd (pid: 6076, threadinfo 8801e2896000, task 880218522d60) [ 276.365127] Stack: [ 276.365127] 8801e2897b68 ea000756e788 88021f54d728 8801e2897c78 [ 276.365127] 8801e2897b58 a05670ce 8801e2897b68 a055c72d [ 276.365127] 8801e2897be8 a055f044
Re: 3.0-rcX BUG at fs/btrfs/ioctl.c:432 - bisected
On Fri, 10 Jun 2011, Sage Weil wrote: On Fri, 10 Jun 2011, Chris Mason wrote: Excerpts from Jim Schutt's message of 2011-06-10 13:06:22 -0400: [ two different btrfs crashes ] I think your two crashes in btrfs were from the uninit variables and those should be fixed in rc2. When I did my bisection, my criteria for success/failure was did mkcephfs succeed?. When I apply this criteria to a recent linus kernel (e.g. 06e86849cf4019), which includes the fix you mentioned (aa0467d8d2a00e), I get still a different failure mode, which doesn't actually reference btrfs: [ 276.364178] BUG: unable to handle kernel NULL pointer dereference at 000a [ 276.365127] IP: [a05434b1] journal_start+0x3e/0x9c [jbd] Looking at the resulting code in the oops, we're here in journal_start: if (handle) { J_ASSERT(handle-h_transaction-t_journal == journal); handle comes from current-journal_info, and we're doing a deref on handle-h_transaction, which is probably 0xa. So, we're leaving crud in current-journal_info and ext3 is finding it. Perhaps its from ceph starting a transaction but leaving it running? The bug came with Josef's transaction performance fixes, but it is probably a mixture of his code with the ioctls ceph is using. Ah, yeah, that's the problem. We saw a similar problem a while back with the start/stop transaction ioctls. In this case, create_snapshot is doing trans = btrfs_start_transaction(root-fs_info-extent_root, 5); if (IS_ERR(trans)) { ret = PTR_ERR(trans); goto fail; } which sets current-journal_info. Then ret = btrfs_snap_reserve_metadata(trans, pending_snapshot); BUG_ON(ret); list_add(pending_snapshot-list, trans-transaction-pending_snapshots); if (async_transid) { *async_transid = trans-transid; ret = btrfs_commit_transaction_async(trans, root-fs_info-extent_root, 1); } else { ret = btrfs_commit_transaction(trans, root-fs_info-extent_root); } but the async snap creation ioctl takes the async path, which runs btrfs_commit_transaction in a worker thread. I'm not sure what the right thing to do is here is... can whatever is in journal_info be attached to trans instead in btrfs_commit_transaction_async()? It looks like it's not used for anything in btrfs, actually; it's just set and cleared. What's the point of that? Anyway, assuming it's useful, I think the below would fix the problem. Want to give it a shot, Jim? Thanks! sage diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index c571734..fd04ad7 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -1196,6 +1196,9 @@ int btrfs_commit_transaction_async(struct btrfs_trans_handle *trans, put_transaction(cur_trans); mutex_unlock(root-fs_info-trans_mutex); + if (current-journal_info == trans) + current-journal_info = NULL; + return 0; } -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 3.0-rcX BUG at fs/btrfs/ioctl.c:432 - bisected
On Fri, 10 Jun 2011, Josef Bacik wrote: On 06/10/2011 02:14 PM, Sage Weil wrote: On Fri, 10 Jun 2011, Sage Weil wrote: On Fri, 10 Jun 2011, Chris Mason wrote: Excerpts from Jim Schutt's message of 2011-06-10 13:06:22 -0400: [ two different btrfs crashes ] I think your two crashes in btrfs were from the uninit variables and those should be fixed in rc2. When I did my bisection, my criteria for success/failure was did mkcephfs succeed?. When I apply this criteria to a recent linus kernel (e.g. 06e86849cf4019), which includes the fix you mentioned (aa0467d8d2a00e), I get still a different failure mode, which doesn't actually reference btrfs: [ 276.364178] BUG: unable to handle kernel NULL pointer dereference at 000a [ 276.365127] IP: [a05434b1] journal_start+0x3e/0x9c [jbd] Looking at the resulting code in the oops, we're here in journal_start: if (handle) { J_ASSERT(handle-h_transaction-t_journal == journal); handle comes from current-journal_info, and we're doing a deref on handle-h_transaction, which is probably 0xa. So, we're leaving crud in current-journal_info and ext3 is finding it. Perhaps its from ceph starting a transaction but leaving it running? The bug came with Josef's transaction performance fixes, but it is probably a mixture of his code with the ioctls ceph is using. Ah, yeah, that's the problem. We saw a similar problem a while back with the start/stop transaction ioctls. In this case, create_snapshot is doing trans = btrfs_start_transaction(root-fs_info-extent_root, 5); if (IS_ERR(trans)) { ret = PTR_ERR(trans); goto fail; } which sets current-journal_info. Then ret = btrfs_snap_reserve_metadata(trans, pending_snapshot); BUG_ON(ret); list_add(pending_snapshot-list, trans-transaction-pending_snapshots); if (async_transid) { *async_transid = trans-transid; ret = btrfs_commit_transaction_async(trans, root-fs_info-extent_root, 1); } else { ret = btrfs_commit_transaction(trans, root-fs_info-extent_root); } but the async snap creation ioctl takes the async path, which runs btrfs_commit_transaction in a worker thread. I'm not sure what the right thing to do is here is... can whatever is in journal_info be attached to trans instead in btrfs_commit_transaction_async()? It looks like it's not used for anything in btrfs, actually; it's just set and cleared. What's the point of that? It is used now, check the beginning of start_transaction(). Thanks, Oh I see, okay. So clearing it in btrfs_commit_transaction_async should be fine then, right? When btrfs_commit_transaction runs in the other thread it won't care that current-journal_info is NULL. sage -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 3.0-rcX BUG at fs/btrfs/ioctl.c:432 - bisected
On Fri, 10 Jun 2011, Josef Bacik wrote: On 06/10/2011 02:35 PM, Sage Weil wrote: On Fri, 10 Jun 2011, Josef Bacik wrote: On 06/10/2011 02:14 PM, Sage Weil wrote: On Fri, 10 Jun 2011, Sage Weil wrote: On Fri, 10 Jun 2011, Chris Mason wrote: Excerpts from Jim Schutt's message of 2011-06-10 13:06:22 -0400: [ two different btrfs crashes ] I think your two crashes in btrfs were from the uninit variables and those should be fixed in rc2. When I did my bisection, my criteria for success/failure was did mkcephfs succeed?. When I apply this criteria to a recent linus kernel (e.g. 06e86849cf4019), which includes the fix you mentioned (aa0467d8d2a00e), I get still a different failure mode, which doesn't actually reference btrfs: [ 276.364178] BUG: unable to handle kernel NULL pointer dereference at 000a [ 276.365127] IP: [a05434b1] journal_start+0x3e/0x9c [jbd] Looking at the resulting code in the oops, we're here in journal_start: if (handle) { J_ASSERT(handle-h_transaction-t_journal == journal); handle comes from current-journal_info, and we're doing a deref on handle-h_transaction, which is probably 0xa. So, we're leaving crud in current-journal_info and ext3 is finding it. Perhaps its from ceph starting a transaction but leaving it running? The bug came with Josef's transaction performance fixes, but it is probably a mixture of his code with the ioctls ceph is using. Ah, yeah, that's the problem. We saw a similar problem a while back with the start/stop transaction ioctls. In this case, create_snapshot is doing trans = btrfs_start_transaction(root-fs_info-extent_root, 5); if (IS_ERR(trans)) { ret = PTR_ERR(trans); goto fail; } which sets current-journal_info. Then ret = btrfs_snap_reserve_metadata(trans, pending_snapshot); BUG_ON(ret); list_add(pending_snapshot-list, trans-transaction-pending_snapshots); if (async_transid) { *async_transid = trans-transid; ret = btrfs_commit_transaction_async(trans, root-fs_info-extent_root, 1); } else { ret = btrfs_commit_transaction(trans, root-fs_info-extent_root); } but the async snap creation ioctl takes the async path, which runs btrfs_commit_transaction in a worker thread. I'm not sure what the right thing to do is here is... can whatever is in journal_info be attached to trans instead in btrfs_commit_transaction_async()? It looks like it's not used for anything in btrfs, actually; it's just set and cleared. What's the point of that? It is used now, check the beginning of start_transaction(). Thanks, Oh I see, okay. So clearing it in btrfs_commit_transaction_async should be fine then, right? When btrfs_commit_transaction runs in the other thread it won't care that current-journal_info is NULL. Oh yeah your patch is good :), Okay cool. Here's the fix with a proper changelog and a little use-after-free paranoia. Thanks! sage From 9881c0752293769d5133c01dff3ab04c0c24c61b Mon Sep 17 00:00:00 2001 From: Sage Weil s...@newdream.net Date: Fri, 10 Jun 2011 11:41:25 -0700 Subject: [PATCH] Btrfs: clear current-journal_info on async transaction commit Normally current-jouranl_info is cleared by commit_transaction. For an async snap or subvol creation, though, it runs in a work queue. Clear it in btrfs_commit_transaction_async() to avoid leaking a non-NULL journal_info when we return to userspace. When the actual commit runs in the other thread it won't care that it's current-journal_info is already NULL. Signed-off-by: Sage Weil s...@newdream.net --- fs/btrfs/transaction.c |5 - 1 files changed, 4 insertions(+), 1 deletions(-) diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index dd71966..9d516ed 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -1118,8 +1118,11 @@ int btrfs_commit_transaction_async(struct btrfs_trans_handle *trans, wait_current_trans_commit_start_and_unblock(root, cur_trans); else wait_current_trans_commit_start(root, cur_trans); - put_transaction(cur_trans); + if (current-journal_info == trans) + current-journal_info = NULL; + + put_transaction(cur_trans); return 0; } -- 1.7.0 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 12/19] btrfs: remove unnecessary dentry_unhash in rmdir/rename_dir
Btrfs has no problems with lingering references to unlinked directory inodes. CC: Chris Mason chris.ma...@oracle.com CC: linux-btrfs@vger.kernel.org Signed-off-by: Sage Weil s...@newdream.net --- fs/btrfs/inode.c |5 - 1 files changed, 0 insertions(+), 5 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 3a33ae3..7cd8ab0 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -3062,8 +3062,6 @@ static int btrfs_rmdir(struct inode *dir, struct dentry *dentry) inode-i_ino == BTRFS_FIRST_FREE_OBJECTID) return -ENOTEMPTY; - dentry_unhash(dentry); - trans = __unlink_start_trans(dir, dentry); if (IS_ERR(trans)) return PTR_ERR(trans); @@ -6994,9 +6992,6 @@ static int btrfs_rename(struct inode *old_dir, struct dentry *old_dentry, u64 root_objectid; int ret; - if (new_inode S_ISDIR(new_dentry-d_inode-i_mode)) - dentry_unhash(new_dentry); - if (new_dir-i_ino == BTRFS_EMPTY_SUBVOL_DIR_OBJECTID) return -EPERM; -- 1.7.0 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 10/18] btrfs: remove unnecessary dentry_unhash in rmdir/rename_dir
Btrfs has no problems with lingering references to unlinked directory inodes. CC: Chris Mason chris.ma...@oracle.com CC: linux-btrfs@vger.kernel.org Signed-off-by: Sage Weil s...@newdream.net --- fs/btrfs/inode.c |5 - 1 files changed, 0 insertions(+), 5 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 3a33ae3..7cd8ab0 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -3062,8 +3062,6 @@ static int btrfs_rmdir(struct inode *dir, struct dentry *dentry) inode-i_ino == BTRFS_FIRST_FREE_OBJECTID) return -ENOTEMPTY; - dentry_unhash(dentry); - trans = __unlink_start_trans(dir, dentry); if (IS_ERR(trans)) return PTR_ERR(trans); @@ -6994,9 +6992,6 @@ static int btrfs_rename(struct inode *old_dir, struct dentry *old_dentry, u64 root_objectid; int ret; - if (new_inode S_ISDIR(new_dentry-d_inode-i_mode)) - dentry_unhash(new_dentry); - if (new_dir-i_ino == BTRFS_EMPTY_SUBVOL_DIR_OBJECTID) return -EPERM; -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs: fix sync subvol/snapshot creation
We were incorrectly taking the async path even for the sync ioctls by passing in transid unconditionally. There's ample room for further cleanup here, but this keeps the fix simple. Signed-off-by: Sage Weil s...@newdream.net --- fs/btrfs/ioctl.c | 20 +++- 1 files changed, 11 insertions(+), 9 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 41614c3..4c2d7c4 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -970,6 +970,15 @@ static noinline int btrfs_ioctl_snap_create(struct file *file, name = async_vol_args-name; fd = async_vol_args-fd; async_vol_args-name[BTRFS_SNAPSHOT_NAME_MAX] = '\0'; + + ret = btrfs_ioctl_snap_create_transid(file, name, fd, + subvol, transid); + + if (ret == 0 + copy_to_user(arg + +offsetof(struct btrfs_ioctl_async_vol_args, + transid), transid, sizeof(transid))) + ret = -EFAULT; } else { vol_args = memdup_user(arg, sizeof(*vol_args)); if (IS_ERR(vol_args)) @@ -977,16 +986,9 @@ static noinline int btrfs_ioctl_snap_create(struct file *file, name = vol_args-name; fd = vol_args-fd; vol_args-name[BTRFS_PATH_NAME_MAX] = '\0'; - } - - ret = btrfs_ioctl_snap_create_transid(file, name, fd, - subvol, transid); - if (!ret async) { - if (copy_to_user(arg + - offsetof(struct btrfs_ioctl_async_vol_args, - transid), transid, sizeof(transid))) - return -EFAULT; + ret = btrfs_ioctl_snap_create_transid(file, name, fd, + subvol, NULL); } kfree(vol_args); -- 1.6.6.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: fix sync subvol/snapshot creation
On Thu, 9 Dec 2010, Chris Mason wrote: Excerpts from Li Zefan's message of 2010-12-09 21:11:25 -0500: Sage Weil wrote: We were incorrectly taking the async path even for the sync ioctls by passing in transid unconditionally. Ha, I fixed this accidentally in my patchset. :) Thanks guys, Unless either of you object, I'll take this and the snapshot ABI change for the next rc. Sounds good to me. Just FYI, I'm still seeing some weird corruption and crashes on my machines and haven't narrowed it down yet. Should have more info soon. sage -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] btrfs: Make async snapshot ioctl more generic
Hi Li, On Mon, 29 Nov 2010, Li Zefan wrote: So we don't have to add new structures as we create more ioctls for snapshots. Now to create async snapshot, set BTRFS_SNAPSHOT_CREATE_ASYNC bit of vol_arg_v2-flags, and then call ioctl(BTRFS_IOCT_SNAP_CREATE_V2). Note: this changes the async snapshot ioctl ABI, which was merged in .37 merge window, so we have to make this change into .37. These changes all look good to me. Thanks! sage Signed-off-by: Li Zefan l...@cn.fujitsu.com --- fs/btrfs/ioctl.c | 34 +- fs/btrfs/ioctl.h | 12 2 files changed, 29 insertions(+), 17 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 463d91b..d3f1a60 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -935,23 +935,31 @@ out: static noinline int btrfs_ioctl_snap_create(struct file *file, void __user *arg, int subvol, - int async) + bool v2) { struct btrfs_ioctl_vol_args *vol_args = NULL; - struct btrfs_ioctl_async_vol_args *async_vol_args = NULL; + struct btrfs_ioctl_vol_args_v2 *vol_args_v2 = NULL; char *name; u64 fd; u64 transid = 0; + bool async = false; int ret; - if (async) { - async_vol_args = memdup_user(arg, sizeof(*async_vol_args)); - if (IS_ERR(async_vol_args)) - return PTR_ERR(async_vol_args); + if (v2) { + vol_args_v2 = memdup_user(arg, sizeof(*vol_args_v2)); + if (IS_ERR(vol_args_v2)) + return PTR_ERR(vol_args_v2); - name = async_vol_args-name; - fd = async_vol_args-fd; - async_vol_args-name[BTRFS_SNAPSHOT_NAME_MAX] = '\0'; + if (vol_args_v2-flags ~BTRFS_SNAPSHOT_CREATE_ASYNC) { + ret = -EINVAL; + goto out; + } + + name = vol_args_v2-name; + fd = vol_args_v2-fd; + vol_args_v2-name[BTRFS_SNAPSHOT_NAME_MAX] = '\0'; + if (vol_args_v2-flags BTRFS_SNAPSHOT_CREATE_ASYNC) + async = true; } else { vol_args = memdup_user(arg, sizeof(*vol_args)); if (IS_ERR(vol_args)) @@ -966,13 +974,13 @@ static noinline int btrfs_ioctl_snap_create(struct file *file, if (!ret async) { if (copy_to_user(arg + - offsetof(struct btrfs_ioctl_async_vol_args, + offsetof(struct btrfs_ioctl_vol_args_v2, transid), transid, sizeof(transid))) return -EFAULT; } - +out: kfree(vol_args); - kfree(async_vol_args); + kfree(vol_args_v2); return ret; } @@ -2235,7 +2243,7 @@ long btrfs_ioctl(struct file *file, unsigned int return btrfs_ioctl_getversion(file, argp); case BTRFS_IOC_SNAP_CREATE: return btrfs_ioctl_snap_create(file, argp, 0, 0); - case BTRFS_IOC_SNAP_CREATE_ASYNC: + case BTRFS_IOC_SNAP_CREATE_V2: return btrfs_ioctl_snap_create(file, argp, 0, 1); case BTRFS_IOC_SUBVOL_CREATE: return btrfs_ioctl_snap_create(file, argp, 1, 0); diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h index 17c99eb..bc70584 100644 --- a/fs/btrfs/ioctl.h +++ b/fs/btrfs/ioctl.h @@ -30,10 +30,14 @@ struct btrfs_ioctl_vol_args { char name[BTRFS_PATH_NAME_MAX + 1]; }; -#define BTRFS_SNAPSHOT_NAME_MAX 4079 -struct btrfs_ioctl_async_vol_args { +#define BTRFS_SNAPSHOT_CREATE_ASYNC (1ULL 0) + +#define BTRFS_SNAPSHOT_NAME_MAX 4039 +struct btrfs_ioctl_vol_args_v2 { __s64 fd; __u64 transid; + __u64 flags; + __u64 unused[4]; char name[BTRFS_SNAPSHOT_NAME_MAX + 1]; }; @@ -187,6 +191,6 @@ struct btrfs_ioctl_space_args { struct btrfs_ioctl_space_args) #define BTRFS_IOC_START_SYNC _IOR(BTRFS_IOCTL_MAGIC, 24, __u64) #define BTRFS_IOC_WAIT_SYNC _IOW(BTRFS_IOCTL_MAGIC, 22, __u64) -#define BTRFS_IOC_SNAP_CREATE_ASYNC _IOW(BTRFS_IOCTL_MAGIC, 23, \ -struct btrfs_ioctl_async_vol_args) +#define BTRFS_IOC_SNAP_CREATE_V2 _IOW(BTRFS_IOCTL_MAGIC, 23, \ +struct btrfs_ioctl_vol_args_v2) #endif -- 1.6.3 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] btrfs: Check if dest_offset is block-size aligned before cloning file
On Fri, 19 Nov 2010, Li Zefan wrote: We've done the check for src_offset and src_length, and We should also check dest_offset, otherwise we'll corrupt the destination file: (After cloning file1 to file2 with unaligned dest_offset) # cat /mnt/file2 cat: /mnt/file2: Input/output error Signed-off-by: Li Zefan l...@cn.fujitsu.com --- fs/btrfs/ioctl.c |7 +++ 1 files changed, 3 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 463d91b..81b47bd 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -1669,12 +1669,11 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd, olen = len = src-i_size - off; /* if we extend to eof, continue to block boundary */ if (off + len == src-i_size) - len = ((src-i_size + bs-1) ~(bs-1)) - - off; + len = ALIGN(src-i_size, bs) - off; /* verify the end result is block aligned */ - if ((off (bs-1)) || - ((off + len) (bs-1))) + if (!IS_ALIGNED(off, bs) || !IS_ALIGNED(off + len, bs) || + !IS_ALIGNED(destoff, bs)) goto out_unlock; /* do any pending delalloc/csum calc on src, one way or Looks good. Reviewed-by: Sage Weil s...@newdream.net -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] btrfs: Set file size correctly in file clone
On Fri, 19 Nov 2010, Li Zefan wrote: Set src_offset = 0, src_length = 20K, dest_offset = 20K. And the original filesize of the dest file 'file2' is 30K: # ls -l /mnt/file2 -rw-r--r-- 1 root root 30720 Nov 18 16:42 /mnt/file2 Now clone file1 to file2, the dest file should be 40K, but it still shows 30K: # ls -l /mnt/file2 -rw-r--r-- 1 root root 30720 Nov 18 16:42 /mnt/file2 Signed-off-by: Li Zefan l...@cn.fujitsu.com --- fs/btrfs/ioctl.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 81b47bd..6b4bfa7 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -1873,8 +1873,8 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd, * but shouldn't round up the file size */ endoff = new_key.offset + datal; - if (endoff off+olen) - endoff = off+olen; + if (endoff destoff+olen) + endoff = destoff+olen; if (endoff inode-i_size) btrfs_i_size_write(inode, endoff); Reviewed-by: Sage Weil s...@newdream.net -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
new lockdep warnings in 2.6.37-rc1
Hey, I saw a few variations of this lockdep on about a dozen different nodes running 2.6.37-rc1+ (mainline from earlier this week). They all look to be the same issue, hit via a few different call paths. I haven't seen these before... sage [10862.369802] === [10862.373287] [ INFO: possible circular locking dependency detected ] [10862.373287] 2.6.37-rc1+ #44 [10862.373287] --- [10862.373287] syslogd/2091 is trying to acquire lock: [10862.373287] ((eb-lock)-rlock){+.+...}, at: [a0048bbe] btrfs_try_spin_lock+0x26/0x80 [btrfs] [10862.373287] [10862.373287] but task is already holding lock: [10862.373287] (btrfs-extent-01){+.+...}, at: [a0048b8f] btrfs_clear_lock_blocking+0x1d/0x26 [btrfs] [10862.373287] [10862.373287] which lock already depends on the new lock. [10862.373287] [10862.373287] [10862.373287] the existing dependency chain (in reverse order) is: [10862.373287] [10862.373287] - #1 (btrfs-extent-01){+.+...}: [10862.442464][810634ad] __lock_acquire+0x82d/0x8aa [10862.442464][810635b2] lock_acquire+0x88/0xa5 [10862.442464][814cf403] _raw_spin_lock+0x3c/0x6f [10862.442464][a0048bbe] btrfs_try_spin_lock+0x26/0x80 [btrfs] [10862.442464][a0009de5] btrfs_search_slot+0x723/0x851 [btrfs] [10862.442464][a0018340] btrfs_lookup_csum+0x5c/0x135 [btrfs] [10862.442464][a00185bd] __btrfs_lookup_bio_sums+0x1a4/0x343 [btrfs] [10862.442464][a001877e] btrfs_lookup_bio_sums+0x11/0x13 [btrfs] [10862.442464][a0022764] btrfs_submit_bio_hook+0x99/0x10a [btrfs] [10862.442464][a0039440] submit_one_bio+0x6c/0x96 [btrfs] [10862.442464][a003b4a4] extent_read_full_page+0x41/0x4a [btrfs] [10862.442464][a00218e1] btrfs_readpage+0x1e/0x20 [btrfs] [10862.442464][a002bfec] btrfs_file_aio_write+0x405/0x885 [btrfs] [10862.442464][810ba061] do_sync_readv_writev+0xc4/0x10c [10862.442464][810ba6fa] do_readv_writev+0xb8/0x190 [10862.442464][810ba813] vfs_writev+0x41/0x4c [10862.442464][810bb008] sys_writev+0x4a/0x75 [10862.442464][81002a2b] system_call_fastpath+0x16/0x1b [10862.442464] [10862.442464] - #0 ((eb-lock)-rlock){+.+...}: [10862.442464][8106265c] validate_chain+0x7a9/0xdcd [10862.442464][810634ad] __lock_acquire+0x82d/0x8aa [10862.442464][810635b2] lock_acquire+0x88/0xa5 [10862.442464][814cf403] _raw_spin_lock+0x3c/0x6f [10862.442464][a0048bbe] btrfs_try_spin_lock+0x26/0x80 [btrfs] [10862.442464][a0009de5] btrfs_search_slot+0x723/0x851 [btrfs] [10862.442464][a0018340] btrfs_lookup_csum+0x5c/0x135 [btrfs] [10862.442464][a00185bd] __btrfs_lookup_bio_sums+0x1a4/0x343 [btrfs] [10862.620387][a001877e] btrfs_lookup_bio_sums+0x11/0x13 [btrfs] [10862.620387][a0022764] btrfs_submit_bio_hook+0x99/0x10a [btrfs] [10862.620387][a0039440] submit_one_bio+0x6c/0x96 [btrfs] [10862.620387][a003b4a4] extent_read_full_page+0x41/0x4a [btrfs] [10862.620387][a00218e1] btrfs_readpage+0x1e/0x20 [btrfs] [10862.620387][a002bfec] btrfs_file_aio_write+0x405/0x885 [btrfs] [10862.620387][810ba061] do_sync_readv_writev+0xc4/0x10c [10862.620387][810ba6fa] do_readv_writev+0xb8/0x190 [10862.620387][810ba813] vfs_writev+0x41/0x4c [10862.620387][810bb008] sys_writev+0x4a/0x75 [10862.620387][81002a2b] system_call_fastpath+0x16/0x1b [10862.620387] [10862.620387] other info that might help us debug this: [10862.620387] [10862.620387] 2 locks held by syslogd/2091: [10862.620387] #0: (sb-s_type-i_mutex_key#11){+.+.+.}, at: [a002bd36] btrfs_file_aio_write+0x14f/0x885 [btrfs] [10862.620387] #1: (btrfs-extent-01){+.+...}, at: [a0048b8f] btrfs_clear_lock_blocking+0x1d/0x26 [btrfs] [10862.730233] [10862.730233] stack backtrace: [10862.730233] Pid: 2091, comm: syslogd Not tainted 2.6.37-rc1+ #44 [10862.730233] Call Trace: [10862.730233] [81061876] print_circular_bug+0xb3/0xc1 [10862.730233] [8106265c] validate_chain+0x7a9/0xdcd [10862.730233] [81009948] ? native_sched_clock+0x37/0x71 [10862.730233] [810634ad] __lock_acquire+0x82d/0x8aa [10862.730233] [810635b2] lock_acquire+0x88/0xa5 [10862.730233] [a0048bbe] ? btrfs_try_spin_lock+0x26/0x80 [btrfs] [10862.730233] [814cf403] _raw_spin_lock+0x3c/0x6f [10862.730233] [a0048bbe] ? btrfs_try_spin_lock+0x26/0x80 [btrfs] [10862.730233] [a0048b8f] ? btrfs_clear_lock_blocking+0x1d/0x26 [btrfs]
Re: [PATCH 2/3] btrfs: implement 'async-snapshot' command
On Tue, 2 Nov 2010, Goffredo Baroncelli wrote: On Monday, 01 November, 2010, Sage Weil wrote: On Sat, 30 Oct 2010, Goffredo Baroncelli wrote: On Saturday, 30 October, 2010, Sage Weil wrote: This is identical to 'snapshot', but uses the new async snapshot creation ioctl, and prints out the transid the new snapshot will be committed with. Signed-off-by: Sage Weil s...@newdream.net --- btrfs.c |8 +++- btrfs_cmds.c | 32 +++- btrfs_cmds.h |3 ++- 3 files changed, 36 insertions(+), 7 deletions(-) diff --git a/btrfs.c b/btrfs.c index 46314cf..c4b9a31 100644 --- a/btrfs.c +++ b/btrfs.c @@ -44,11 +44,17 @@ static struct Command commands[] = { /* avoid short commands different for the case only */ - { do_clone, 2, + { do_create_snap, 2, subvolume snapshot, source [dest/]name\n Create a writable snapshot of the subvolume source with\n the name name in the dest directory. }, + { do_create_snap_async, 2, + subvolume async-snapshot, source [dest/]name\n + Create a writable snapshot of the subvolume source with\n + the name name in the dest directory. Do not wait for\n + the snapshot creation to commit to disk before returning. + }, Why create another command ? I think that it is better add a switch like btrfs subvolume snapshot [--async] source [dest/]name How about btrfs subvolume snapshot source [dest/]name [async] That avoids ambiguity when source is named '--async' and simplifies parsing. Updated patches follow (including man page updates). sage No please ! Sorry but this is too strange as syntax. When the source is named --async it is possible to execute the command passing ./--async. This workaround is used in any unix command (have you ever thought what happens if you execute rm * when exists a file named -rf ?). Oh right, that's better. I'm used to using -- (as in 'grep -- -foo'; rm supports the same syntax). Will resend. sage I think that --async is a modification of the standard behaviour of btrfs s s command, so it is natural to use the switch syntax (30+ years of unix say so :-) ) Goffredo -- gpg key@ keyserver.linux.it: Goffredo Baroncelli (ghigo) kreij...@inwind.it Key fingerprint = 4769 7E51 5293 D36C 814E C054 BF04 F161 3DC5 0512 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] btrfs-progs: btrfs: implement 'start-sync' and 'wait-sync' commands
On Tue, 2 Nov 2010, Goffredo Baroncelli wrote: Like the command btrfs subvol snapshot, I think that it is better to add a modifier instead of a new command. btrfs filesystem sync [--async] Sorry if I noticed this too late. But I don't see a valid reason to add another command. From a UI point of view the meaning of the command is the same, change only slight the behavior. Even tough I have to admint that sync --async sound strange. May be flush is better ? + { do_wait_sync, 2, + filesystem wait-sync, path transid\n + Wait for the transaction transid on the filesystem at path to commit. + }, If I understood correctly this command may be used to wait the execution of several commands: snapshot, subvolume removal, sync... I can suggest a more general name like: # btrfs filesystem wait-transaction path transid. (which may be shortened as btrfs filesystem wait ...). How about start-transaction and wait-transaction? Or start-commit and wait-commit? ('transaction' is a heavily overloaded term.) IMO both would be less weird than sync --async. Another suggestion: instead of checking if the transid is 0, leave the transid optional. If transid is omitted, then the function waits all the running transaction. Definitely. Finally I suggest to put a little note about the command behvior when a transaction is started after the command execution: is there a risk of DOS ? Okay. There's no DOS risk (at least no more so than with while true ; do sync ; done). It'll return immediately if transid has committed. If transid is 0, it'll find the most recent committing or committed transid and wait for that. If nothing is currently committing, it'll return immediately. sage Anyway great work. Goffredo { do_resize, 2, filesystem resize, [+/-]newsize[gkm]|max filesystem\n Resize the file system. If 'max' is passed, the filesystem\n diff --git a/btrfs_cmds.c b/btrfs_cmds.c index 8031c58..736437d 100644 --- a/btrfs_cmds.c +++ b/btrfs_cmds.c @@ -526,6 +526,55 @@ int do_fssync(int argc, char **argv) return 0; } +int do_start_sync(int argc, char **argv) +{ + int fd, res; + char*path = argv[1]; + __u64 transid; + + fd = open_file_or_dir(path); + if (fd 0) { + fprintf(stderr, ERROR: can't access to '%s'\n, path); + return 12; + } + + printf(StartSync '%s'\n, path); + res = ioctl(fd, BTRFS_IOC_START_SYNC, transid); + close(fd); + if( res 0 ){ + fprintf(stderr, ERROR: unable to fs-syncing '%s'\n, path); + return 16; + } else { + printf(transid %llu\n, (unsigned long long)transid); + } + + return 0; +} + +int do_wait_sync(int argc, char **argv) +{ + int fd, res; + char*path = argv[1]; + __u64 transid = atoll(argv[2]); + + fd = open_file_or_dir(path); + if (fd 0) { + fprintf(stderr, ERROR: can't access to '%s'\n, path); + return 12; + } + + printf(WaitSync '%s' transid %llu\n, path, (unsigned long long)transid); + res = ioctl(fd, BTRFS_IOC_WAIT_SYNC, transid); + close(fd); + if( res 0 ){ + fprintf(stderr, ERROR: unable to wait-sync on '%s' transid %llu: %s\n, path, + (unsigned long long)transid, strerror(errno)); + return 16; + } + + return 0; +} + int do_scan(int argc, char **argv) { int i, fd; diff --git a/btrfs_cmds.h b/btrfs_cmds.h index 7bde191..84c489f 100644 --- a/btrfs_cmds.h +++ b/btrfs_cmds.h @@ -19,6 +19,8 @@ int do_clone(int nargs, char **argv); int do_delete_subvolume(int nargs, char **argv); int do_create_subvol(int nargs, char **argv); int do_fssync(int nargs, char **argv); +int do_start_sync(int nargs, char **argv); +int do_wait_sync(int nargs, char **argv); int do_defrag(int argc, char **argv); int do_show_filesystem(int nargs, char **argv); int do_add_volume(int nargs, char **args); diff --git a/man/btrfs.8.in b/man/btrfs.8.in index 26ef982..e87b5fe 100644 --- a/man/btrfs.8.in +++ b/man/btrfs.8.in @@ -19,6 +19,10 @@ btrfs \- control a btrfs filesystem .PP \fBbtrfs\fP \fBfilesystem sync\fP\fI path \fP .PP +\fBbtrfs\fP \fBfilesystem start-sync\fP\fI path \fP +.PP +\fBbtrfs\fP \fBfilesystem wait-sync\fP\fI path transid\fP +.PP \fBbtrfs\fP \fBfilesystem resize\fP\fI [+/\-]size[gkm]|max filesystem\fP .PP \fBbtrfs\fP \fBdevice scan\fP\fI [device [device..]]\fP @@ -115,6 +119,16 @@ all the block devices. Force a sync for the filesystem identified by \fIpath\fR. .TP +\fBfilesystem start-sync\fR\fI path \fR +Start a sync operation for the filesystem identified by \fIpath\fR. A transaction id +is printed that can be waited on using the \fBfilesystem wait-sync\fR command. +.TP + +\fBfilesystem
Re: [PATCH 2/3] btrfs: implement 'async-snapshot' command
On Sat, 30 Oct 2010, Goffredo Baroncelli wrote: On Saturday, 30 October, 2010, Sage Weil wrote: This is identical to 'snapshot', but uses the new async snapshot creation ioctl, and prints out the transid the new snapshot will be committed with. Signed-off-by: Sage Weil s...@newdream.net --- btrfs.c |8 +++- btrfs_cmds.c | 32 +++- btrfs_cmds.h |3 ++- 3 files changed, 36 insertions(+), 7 deletions(-) diff --git a/btrfs.c b/btrfs.c index 46314cf..c4b9a31 100644 --- a/btrfs.c +++ b/btrfs.c @@ -44,11 +44,17 @@ static struct Command commands[] = { /* avoid short commands different for the case only */ - { do_clone, 2, + { do_create_snap, 2, subvolume snapshot, source [dest/]name\n Create a writable snapshot of the subvolume source with\n the name name in the dest directory. }, + { do_create_snap_async, 2, + subvolume async-snapshot, source [dest/]name\n + Create a writable snapshot of the subvolume source with\n + the name name in the dest directory. Do not wait for\n + the snapshot creation to commit to disk before returning. + }, Why create another command ? I think that it is better add a switch like btrfs subvolume snapshot [--async] source [dest/]name How about btrfs subvolume snapshot source [dest/]name [async] That avoids ambiguity when source is named '--async' and simplifies parsing. Updated patches follow (including man page updates). sage Create a writable snapshot of the subvolume source with the name name in the dest directory. If --async is passed, do not wait for the snapshot creation to commit to disk before returning. In any case, please update the man page too. { do_delete_subvolume, 1, subvolume delete, subvolume\n Delete the subvolume subvolume. diff --git a/btrfs_cmds.c b/btrfs_cmds.c index 8031c58..6da5862 100644 --- a/btrfs_cmds.c +++ b/btrfs_cmds.c @@ -307,7 +307,7 @@ int do_subvol_list(int argc, char **argv) return 0; } -int do_clone(int argc, char **argv) +static int create_snap(int argc, char **argv, int async) { char*subvol, *dst; int res, fd, fddst, len; @@ -316,7 +316,6 @@ int do_clone(int argc, char **argv) subvol = argv[1]; dst = argv[2]; - struct btrfs_ioctl_vol_args args; res = test_issubvolume(subvol); if(res0){ @@ -374,9 +373,22 @@ int do_clone(int argc, char **argv) printf(Create a snapshot of '%s' in '%s/%s'\n, subvol, dstdir, newname); - args.fd = fd; - strcpy(args.name, newname); - res = ioctl(fddst, BTRFS_IOC_SNAP_CREATE, args); + if (async) { + struct btrfs_ioctl_async_vol_args async_args; + async_args.fd = fd; + async_args.transid = 0; + strcpy(async_args.name, newname); + res = ioctl(fddst, BTRFS_IOC_SNAP_CREATE_ASYNC, async_args); + if (res == 0) + printf(transid %llu\n, + (unsigned long long)async_args.transid); + } else { + struct btrfs_ioctl_vol_args args; + + args.fd = fd; + strcpy(args.name, newname); + res = ioctl(fddst, BTRFS_IOC_SNAP_CREATE, args); + } close(fd); close(fddst); @@ -390,6 +402,16 @@ int do_clone(int argc, char **argv) } +int do_create_snap_async(int argc, char **argv) +{ + return create_snap(argc, argv, 1); +} + +int do_create_snap(int argc, char **argv) +{ + return create_snap(argc, argv, 0); +} + int do_delete_subvolume(int argc, char **argv) { int res, fd, len; diff --git a/btrfs_cmds.h b/btrfs_cmds.h index 7bde191..c44dc79 100644 --- a/btrfs_cmds.h +++ b/btrfs_cmds.h @@ -15,7 +15,8 @@ */ /* btrfs_cmds.c*/ -int do_clone(int nargs, char **argv); +int do_create_snap(int nargs, char **argv); +int do_create_snap_async(int nargs, char **argv); int do_delete_subvolume(int nargs, char **argv); int do_create_subvol(int nargs, char **argv); int do_fssync(int nargs, char **argv); -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- gpg key@ keyserver.linux.it: Goffredo Baroncelli (ghigo) kreij...@inwind.it Key fingerprint = 4769 7E51 5293 D36C 814E C054 BF04 F161 3DC5 0512 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message
[PATCH] btrfs-progs: ioctl: add may_alias to btrfs_ioctl_search_header
This fixes the errors btrfs-list.c: In function 'ino_resolve': btrfs-list.c:506: error: dereferencing pointer 'sh' does break strict-aliasing rules btrfs-list.c:504: error: dereferencing pointer 'sh' does break strict-aliasing rules btrfs-list.c:502: note: initialized from here when building with gcc version 4.4.4 (Debian 4.4.4-6). Signed-off-by: Sage Weil s...@newdream.net --- ioctl.h |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/ioctl.h b/ioctl.h index 5a03317..4a850ff 100644 --- a/ioctl.h +++ b/ioctl.h @@ -87,7 +87,7 @@ struct btrfs_ioctl_search_header { __u64 offset; __u32 type; __u32 len; -}; +} __attribute__((__may_alias__)); #define BTRFS_SEARCH_ARGS_BUFSIZE (4096 - sizeof(struct btrfs_ioctl_search_key)) /* -- 1.7.0 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] btrfs-progs: btrfs: implement async option for snapshot command
Add an 'async' option to the snapshot creating command, that will let you avoid waiting for a new snapshot to commit to disk. Signed-off-by: Sage Weil s...@newdream.net --- btrfs.c|9 ++--- btrfs_cmds.c | 31 +-- btrfs_cmds.h |2 +- man/btrfs.8.in | 15 ++- 4 files changed, 42 insertions(+), 15 deletions(-) diff --git a/btrfs.c b/btrfs.c index c871f4a..657bb6f 100644 --- a/btrfs.c +++ b/btrfs.c @@ -44,10 +44,13 @@ static struct Command commands[] = { /* avoid short commands different for the case only */ - { do_clone, 2, - subvolume snapshot, source [dest/]name\n + { do_create_snap, -2, + subvolume snapshot, source [dest/]name [async]\n Create a writable snapshot of the subvolume source with\n - the name name in the dest directory. + the name name in the dest directory. If [async] is\n + specified, we will not wait for the snapshot to be committed\n + to disk, and a transid will be printed that can be fed to\n + 'filesystem wait-sync transid'. }, { do_delete_subvolume, 1, subvolume delete, subvolume\n diff --git a/btrfs_cmds.c b/btrfs_cmds.c index 736437d..276b8fa 100644 --- a/btrfs_cmds.c +++ b/btrfs_cmds.c @@ -307,16 +307,22 @@ int do_subvol_list(int argc, char **argv) return 0; } -int do_clone(int argc, char **argv) +int do_create_snap(int argc, char **argv) { char*subvol, *dst; - int res, fd, fddst, len; + int res, fd, fddst, len, async; char*newname; char*dstdir; subvol = argv[1]; dst = argv[2]; - struct btrfs_ioctl_vol_args args; + if (argc 3) { + if (strstr(argv[3], async)) { + async = 1; + } else { + fprintf(stderr, ERROR: 'async' expected for third arg\n); + } + } res = test_issubvolume(subvol); if(res0){ @@ -374,9 +380,22 @@ int do_clone(int argc, char **argv) printf(Create a snapshot of '%s' in '%s/%s'\n, subvol, dstdir, newname); - args.fd = fd; - strcpy(args.name, newname); - res = ioctl(fddst, BTRFS_IOC_SNAP_CREATE, args); + if (async) { + struct btrfs_ioctl_async_vol_args async_args; + async_args.fd = fd; + async_args.transid = 0; + strcpy(async_args.name, newname); + res = ioctl(fddst, BTRFS_IOC_SNAP_CREATE_ASYNC, async_args); + if (res == 0) + printf(transid %llu\n, + (unsigned long long)async_args.transid); + } else { + struct btrfs_ioctl_vol_args args; + + args.fd = fd; + strcpy(args.name, newname); + res = ioctl(fddst, BTRFS_IOC_SNAP_CREATE, args); + } close(fd); close(fddst); diff --git a/btrfs_cmds.h b/btrfs_cmds.h index 84c489f..bf566ae 100644 --- a/btrfs_cmds.h +++ b/btrfs_cmds.h @@ -15,7 +15,7 @@ */ /* btrfs_cmds.c*/ -int do_clone(int nargs, char **argv); +int do_create_snap(int nargs, char **argv); int do_delete_subvolume(int nargs, char **argv); int do_create_subvol(int nargs, char **argv); int do_fssync(int nargs, char **argv); diff --git a/man/btrfs.8.in b/man/btrfs.8.in index e87b5fe..504fe5f 100644 --- a/man/btrfs.8.in +++ b/man/btrfs.8.in @@ -5,7 +5,7 @@ .SH NAME btrfs \- control a btrfs filesystem .SH SYNOPSIS -\fBbtrfs\fP \fBsubvolume snapshot\fP\fI source [dest/]name\fP +\fBbtrfs\fP \fBsubvolume snapshot\fP\fI source [dest/]name [async]\fP .PP \fBbtrfs\fP \fBsubvolume delete\fP\fI subvolume\fP .PP @@ -74,10 +74,15 @@ command. .SH COMMANDS .TP -\fBsubvolume snapshot\fR\fI source [dest/]name\fR -Create a writable snapshot of the subvolume \fIsource\fR with the name -\fIname\fR in the \fIdest\fR directory. If \fIsource\fR is not a -subvolume, \fBbtrfs\fR returns an error. +\fBsubvolume snapshot\fR\fI source [dest/]name [async]\fR +Create a writable snapshot of the subvolume \fIsource\fR with the +name \fIname\fR in the \fIdest\fR directory. If \fIsource\fR is +not a subvolume, \fBbtrfs\fR returns an error. If \fI[async]\fR is +specified, we will not wait for the snapshot creation to commit to +disk before returning, and a transaction id is printed that can be +waited on1 using the \fBfilesystem wait-sync\fR command. +.TP + .TP \fBsubvolume delete\fR\fI subvolume\fR -- 1.7.0 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] btrfs-progs: btrfs: implement 'start-sync' and 'wait-sync' commands
The 'start-sync' command initiates a sync, but does not wait for it to complete. A transaction is printed that can be fed to 'wait-sync', which will wait for it to commit. 'wait-sync' can also be used in combination with 'async-snapshot' to wait for an async snapshot creation to commit. Updates the man page too. Signed-off-by: Sage Weil s...@newdream.net --- btrfs.c|9 + btrfs_cmds.c | 49 + btrfs_cmds.h |2 ++ man/btrfs.8.in | 14 ++ 4 files changed, 74 insertions(+), 0 deletions(-) diff --git a/btrfs.c b/btrfs.c index 46314cf..c871f4a 100644 --- a/btrfs.c +++ b/btrfs.c @@ -77,6 +77,15 @@ static struct Command commands[] = { filesystem sync, path\n Force a sync on the filesystem path. }, + { do_start_sync, 1, + filesystem start-sync, path\n + Start a sync on the filesystem path, and print the resulting\n + transaction id. + }, + { do_wait_sync, 2, + filesystem wait-sync, path transid\n + Wait for the transaction transid on the filesystem at path to commit. + }, { do_resize, 2, filesystem resize, [+/-]newsize[gkm]|max filesystem\n Resize the file system. If 'max' is passed, the filesystem\n diff --git a/btrfs_cmds.c b/btrfs_cmds.c index 8031c58..736437d 100644 --- a/btrfs_cmds.c +++ b/btrfs_cmds.c @@ -526,6 +526,55 @@ int do_fssync(int argc, char **argv) return 0; } +int do_start_sync(int argc, char **argv) +{ + int fd, res; + char*path = argv[1]; + __u64 transid; + + fd = open_file_or_dir(path); + if (fd 0) { + fprintf(stderr, ERROR: can't access to '%s'\n, path); + return 12; + } + + printf(StartSync '%s'\n, path); + res = ioctl(fd, BTRFS_IOC_START_SYNC, transid); + close(fd); + if( res 0 ){ + fprintf(stderr, ERROR: unable to fs-syncing '%s'\n, path); + return 16; + } else { + printf(transid %llu\n, (unsigned long long)transid); + } + + return 0; +} + +int do_wait_sync(int argc, char **argv) +{ + int fd, res; + char*path = argv[1]; + __u64 transid = atoll(argv[2]); + + fd = open_file_or_dir(path); + if (fd 0) { + fprintf(stderr, ERROR: can't access to '%s'\n, path); + return 12; + } + + printf(WaitSync '%s' transid %llu\n, path, (unsigned long long)transid); + res = ioctl(fd, BTRFS_IOC_WAIT_SYNC, transid); + close(fd); + if( res 0 ){ + fprintf(stderr, ERROR: unable to wait-sync on '%s' transid %llu: %s\n, path, + (unsigned long long)transid, strerror(errno)); + return 16; + } + + return 0; +} + int do_scan(int argc, char **argv) { int i, fd; diff --git a/btrfs_cmds.h b/btrfs_cmds.h index 7bde191..84c489f 100644 --- a/btrfs_cmds.h +++ b/btrfs_cmds.h @@ -19,6 +19,8 @@ int do_clone(int nargs, char **argv); int do_delete_subvolume(int nargs, char **argv); int do_create_subvol(int nargs, char **argv); int do_fssync(int nargs, char **argv); +int do_start_sync(int nargs, char **argv); +int do_wait_sync(int nargs, char **argv); int do_defrag(int argc, char **argv); int do_show_filesystem(int nargs, char **argv); int do_add_volume(int nargs, char **args); diff --git a/man/btrfs.8.in b/man/btrfs.8.in index 26ef982..e87b5fe 100644 --- a/man/btrfs.8.in +++ b/man/btrfs.8.in @@ -19,6 +19,10 @@ btrfs \- control a btrfs filesystem .PP \fBbtrfs\fP \fBfilesystem sync\fP\fI path \fP .PP +\fBbtrfs\fP \fBfilesystem start-sync\fP\fI path \fP +.PP +\fBbtrfs\fP \fBfilesystem wait-sync\fP\fI path transid\fP +.PP \fBbtrfs\fP \fBfilesystem resize\fP\fI [+/\-]size[gkm]|max filesystem\fP .PP \fBbtrfs\fP \fBdevice scan\fP\fI [device [device..]]\fP @@ -115,6 +119,16 @@ all the block devices. Force a sync for the filesystem identified by \fIpath\fR. .TP +\fBfilesystem start-sync\fR\fI path \fR +Start a sync operation for the filesystem identified by \fIpath\fR. A transaction id +is printed that can be waited on using the \fBfilesystem wait-sync\fR command. +.TP + +\fBfilesystem wait-sync\fR\fI path transid\fR +Wait for a the transaction \fItransid\fR to commit to disk. If \fItransid\fR is zero, +wait for any currently committing transaction to commit. +.TP + .\ .\ Some wording are extracted by the resize2fs man page .\ -- 1.7.0 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/3] update ioctl.h from 2.6.37-rc1
Signed-off-by: Sage Weil s...@newdream.net --- ioctl.h | 39 ++- 1 files changed, 30 insertions(+), 9 deletions(-) diff --git a/ioctl.h b/ioctl.h index 776d7a9..5a03317 100644 --- a/ioctl.h +++ b/ioctl.h @@ -23,13 +23,28 @@ #define BTRFS_IOCTL_MAGIC 0x94 #define BTRFS_VOL_NAME_MAX 255 -#define BTRFS_PATH_NAME_MAX 4087 +/* this should be 4k */ +#define BTRFS_PATH_NAME_MAX 4087 struct btrfs_ioctl_vol_args { __s64 fd; char name[BTRFS_PATH_NAME_MAX + 1]; }; +#define BTRFS_SNAPSHOT_NAME_MAX 4079 +struct btrfs_ioctl_async_vol_args { + __s64 fd; + __u64 transid; + char name[BTRFS_SNAPSHOT_NAME_MAX + 1]; +}; + +#define BTRFS_INO_LOOKUP_PATH_MAX 4080 +struct btrfs_ioctl_ino_lookup_args { + __u64 treeid; + __u64 objectid; + char name[BTRFS_INO_LOOKUP_PATH_MAX]; +}; + struct btrfs_ioctl_search_key { /* which root are we searching. 0 is the tree of tree roots */ __u64 tree_id; @@ -72,7 +87,7 @@ struct btrfs_ioctl_search_header { __u64 offset; __u32 type; __u32 len; -} __attribute__((may_alias)); +}; #define BTRFS_SEARCH_ARGS_BUFSIZE (4096 - sizeof(struct btrfs_ioctl_search_key)) /* @@ -85,11 +100,10 @@ struct btrfs_ioctl_search_args { char buf[BTRFS_SEARCH_ARGS_BUFSIZE]; }; -#define BTRFS_INO_LOOKUP_PATH_MAX 4080 -struct btrfs_ioctl_ino_lookup_args { - __u64 treeid; - __u64 objectid; - char name[BTRFS_INO_LOOKUP_PATH_MAX]; +struct btrfs_ioctl_clone_range_args { + __s64 src_fd; + __u64 src_offset, src_length; + __u64 dest_offset; }; /* flags for the defrag range ioctl */ @@ -155,11 +169,14 @@ struct btrfs_ioctl_space_args { struct btrfs_ioctl_vol_args) #define BTRFS_IOC_BALANCE _IOW(BTRFS_IOCTL_MAGIC, 12, \ struct btrfs_ioctl_vol_args) -/* 13 is for CLONE_RANGE */ + +#define BTRFS_IOC_CLONE_RANGE _IOW(BTRFS_IOCTL_MAGIC, 13, \ + struct btrfs_ioctl_clone_range_args) + #define BTRFS_IOC_SUBVOL_CREATE _IOW(BTRFS_IOCTL_MAGIC, 14, \ struct btrfs_ioctl_vol_args) #define BTRFS_IOC_SNAP_DESTROY _IOW(BTRFS_IOCTL_MAGIC, 15, \ - struct btrfs_ioctl_vol_args) + struct btrfs_ioctl_vol_args) #define BTRFS_IOC_DEFRAG_RANGE _IOW(BTRFS_IOCTL_MAGIC, 16, \ struct btrfs_ioctl_defrag_range_args) #define BTRFS_IOC_TREE_SEARCH _IOWR(BTRFS_IOCTL_MAGIC, 17, \ @@ -169,4 +186,8 @@ struct btrfs_ioctl_space_args { #define BTRFS_IOC_DEFAULT_SUBVOL _IOW(BTRFS_IOCTL_MAGIC, 19, u64) #define BTRFS_IOC_SPACE_INFO _IOWR(BTRFS_IOCTL_MAGIC, 20, \ struct btrfs_ioctl_space_args) +#define BTRFS_IOC_START_SYNC _IOR(BTRFS_IOCTL_MAGIC, 24, __u64) +#define BTRFS_IOC_WAIT_SYNC _IOW(BTRFS_IOCTL_MAGIC, 22, __u64) +#define BTRFS_IOC_SNAP_CREATE_ASYNC _IOW(BTRFS_IOCTL_MAGIC, 23, \ + struct btrfs_ioctl_async_vol_args) #endif -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/3] btrfs: implement 'async-snapshot' command
This is identical to 'snapshot', but uses the new async snapshot creation ioctl, and prints out the transid the new snapshot will be committed with. Signed-off-by: Sage Weil s...@newdream.net --- btrfs.c |8 +++- btrfs_cmds.c | 32 +++- btrfs_cmds.h |3 ++- 3 files changed, 36 insertions(+), 7 deletions(-) diff --git a/btrfs.c b/btrfs.c index 46314cf..c4b9a31 100644 --- a/btrfs.c +++ b/btrfs.c @@ -44,11 +44,17 @@ static struct Command commands[] = { /* avoid short commands different for the case only */ - { do_clone, 2, + { do_create_snap, 2, subvolume snapshot, source [dest/]name\n Create a writable snapshot of the subvolume source with\n the name name in the dest directory. }, + { do_create_snap_async, 2, + subvolume async-snapshot, source [dest/]name\n + Create a writable snapshot of the subvolume source with\n + the name name in the dest directory. Do not wait for\n + the snapshot creation to commit to disk before returning. + }, { do_delete_subvolume, 1, subvolume delete, subvolume\n Delete the subvolume subvolume. diff --git a/btrfs_cmds.c b/btrfs_cmds.c index 8031c58..6da5862 100644 --- a/btrfs_cmds.c +++ b/btrfs_cmds.c @@ -307,7 +307,7 @@ int do_subvol_list(int argc, char **argv) return 0; } -int do_clone(int argc, char **argv) +static int create_snap(int argc, char **argv, int async) { char*subvol, *dst; int res, fd, fddst, len; @@ -316,7 +316,6 @@ int do_clone(int argc, char **argv) subvol = argv[1]; dst = argv[2]; - struct btrfs_ioctl_vol_args args; res = test_issubvolume(subvol); if(res0){ @@ -374,9 +373,22 @@ int do_clone(int argc, char **argv) printf(Create a snapshot of '%s' in '%s/%s'\n, subvol, dstdir, newname); - args.fd = fd; - strcpy(args.name, newname); - res = ioctl(fddst, BTRFS_IOC_SNAP_CREATE, args); + if (async) { + struct btrfs_ioctl_async_vol_args async_args; + async_args.fd = fd; + async_args.transid = 0; + strcpy(async_args.name, newname); + res = ioctl(fddst, BTRFS_IOC_SNAP_CREATE_ASYNC, async_args); + if (res == 0) + printf(transid %llu\n, + (unsigned long long)async_args.transid); + } else { + struct btrfs_ioctl_vol_args args; + + args.fd = fd; + strcpy(args.name, newname); + res = ioctl(fddst, BTRFS_IOC_SNAP_CREATE, args); + } close(fd); close(fddst); @@ -390,6 +402,16 @@ int do_clone(int argc, char **argv) } +int do_create_snap_async(int argc, char **argv) +{ + return create_snap(argc, argv, 1); +} + +int do_create_snap(int argc, char **argv) +{ + return create_snap(argc, argv, 0); +} + int do_delete_subvolume(int argc, char **argv) { int res, fd, len; diff --git a/btrfs_cmds.h b/btrfs_cmds.h index 7bde191..c44dc79 100644 --- a/btrfs_cmds.h +++ b/btrfs_cmds.h @@ -15,7 +15,8 @@ */ /* btrfs_cmds.c*/ -int do_clone(int nargs, char **argv); +int do_create_snap(int nargs, char **argv); +int do_create_snap_async(int nargs, char **argv); int do_delete_subvolume(int nargs, char **argv); int do_create_subvol(int nargs, char **argv); int do_fssync(int nargs, char **argv); -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/3] btrfs: implement 'start-sync' and 'wait-sync' commands
The 'start-sync' command initiates a sync, but does not wait for it to complete. A transaction is printed that can be fed to 'wait-sync', which will wait for it to commit. 'wait-sync' can also be used in combination with 'async-snapshot' to wait for an async snapshot creation to commit. Signed-off-by: Sage Weil s...@newdream.net --- btrfs.c |9 + btrfs_cmds.c | 49 + btrfs_cmds.h |2 ++ 3 files changed, 60 insertions(+), 0 deletions(-) diff --git a/btrfs.c b/btrfs.c index c4b9a31..d45ac1f 100644 --- a/btrfs.c +++ b/btrfs.c @@ -83,6 +83,15 @@ static struct Command commands[] = { filesystem sync, path\n Force a sync on the filesystem path. }, + { do_start_sync, 1, + filesystem start-sync, path\n + Start a sync on the filesystem path, and print the resulting\n + transaction id. + }, + { do_wait_sync, 2, + filesystem wait-sync, path transid\n + Wait for the transaction transid on the filesystem at path to commit. + }, { do_resize, 2, filesystem resize, [+/-]newsize[gkm]|max filesystem\n Resize the file system. If 'max' is passed, the filesystem\n diff --git a/btrfs_cmds.c b/btrfs_cmds.c index 6da5862..5b5bb15 100644 --- a/btrfs_cmds.c +++ b/btrfs_cmds.c @@ -548,6 +548,55 @@ int do_fssync(int argc, char **argv) return 0; } +int do_start_sync(int argc, char **argv) +{ + int fd, res; + char*path = argv[1]; + __u64 transid; + + fd = open_file_or_dir(path); + if (fd 0) { + fprintf(stderr, ERROR: can't access to '%s'\n, path); + return 12; + } + + printf(StartSync '%s'\n, path); + res = ioctl(fd, BTRFS_IOC_START_SYNC, transid); + close(fd); + if( res 0 ){ + fprintf(stderr, ERROR: unable to fs-syncing '%s'\n, path); + return 16; + } else { + printf(transid %llu\n, (unsigned long long)transid); + } + + return 0; +} + +int do_wait_sync(int argc, char **argv) +{ + int fd, res; + char*path = argv[1]; + __u64 transid = atoll(argv[2]); + + fd = open_file_or_dir(path); + if (fd 0) { + fprintf(stderr, ERROR: can't access to '%s'\n, path); + return 12; + } + + printf(WaitSync '%s' transid %llu\n, path, (unsigned long long)transid); + res = ioctl(fd, BTRFS_IOC_WAIT_SYNC, transid); + close(fd); + if( res 0 ){ + fprintf(stderr, ERROR: unable to wait-sync on '%s' transid %llu: %s\n, path, + (unsigned long long)transid, strerror(errno)); + return 16; + } + + return 0; +} + int do_scan(int argc, char **argv) { int i, fd; diff --git a/btrfs_cmds.h b/btrfs_cmds.h index c44dc79..e0e5ceb 100644 --- a/btrfs_cmds.h +++ b/btrfs_cmds.h @@ -20,6 +20,8 @@ int do_create_snap_async(int nargs, char **argv); int do_delete_subvolume(int nargs, char **argv); int do_create_subvol(int nargs, char **argv); int do_fssync(int nargs, char **argv); +int do_start_sync(int nargs, char **argv); +int do_wait_sync(int nargs, char **argv); int do_defrag(int argc, char **argv); int do_show_filesystem(int nargs, char **argv); int do_add_volume(int nargs, char **args); -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/6] Btrfs: fix deadlock in btrfs_commit_transaction
On Tue, 26 Oct 2010, liubo wrote: On 10/26/2010 03:07 AM, Sage Weil wrote: We calculate timeout (either 1 or MAX_SCHEDULE_TIMEOUT) based on whether num_writers 1 or should_grow at the top of the loop. Then, much much later, we wait for that timeout if either num_writers or should_grow is true. However, it's possible for a racing process (calling btrfs_end_transaction()) to decrement num_writers such that we wait forever instead of for 1. IMO, there still exists a deadlock with your patch. === with your patch: thread 1: thread 2: btrfs_commit_transaction() if (num_writers 1) timeout = MAX_TIMEOUT; (This bit goes away, btw.) - __btrfs_end_transaction() num_writers--; if (wq) wake_up(); - smp_mb(); prepare_wait(); if (num_writers 1) schedule_timeout(MAX); else if (should_grow) schedule_timeout(1); === What's the problem above? The wake_up() doesn't get called, and thread1 doesn't sleep. thread2 also needs a memory_barrier, for without memory_barrier, on some CPUs, if (wq) may be executed before num_writers--, like === thread 1: thread 2: btrfs_commit_transaction() if (num_writers 1) timeout = MAX_TIMEOUT; (This bit is gone) - __btrfs_end_transaction() if (wq) wake_up(); - smp_mb(); prepare_wait(); if (num_writers 1) schedule_timeout(MAX); else if (should_grow) schedule_timeout(1); -- num_writers--; === then, thread1 may wait forever. Since wake_up() itself provides a implied wmb, and a wq active check, it is better to drop if (wq) in __btrfs_end_transaction(). I see. It could also be smb_mb(); if (wq) wake_up(); but just calling wake_up() unconditionally is simpler, and fewer barriers in the wake_up case. I'm not attached to the if (wq); I just kept it because it was there already. Chris? Thanks! sage thanks, liubo Fix this by deciding how long to wait when we wait. Signed-off-by: Sage Weil s...@newdream.net --- fs/btrfs/transaction.c | 12 1 files changed, 4 insertions(+), 8 deletions(-) diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 66e4c66..bf399ea 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -992,7 +992,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans, struct btrfs_root *root) { unsigned long joined = 0; - unsigned long timeout = 1; struct btrfs_transaction *cur_trans; struct btrfs_transaction *prev_trans = NULL; DEFINE_WAIT(wait); @@ -1063,11 +1062,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans, snap_pending = 1; WARN_ON(cur_trans != trans-transaction); - if (cur_trans-num_writers 1) - timeout = MAX_SCHEDULE_TIMEOUT; - else if (should_grow) - timeout = 1; - mutex_unlock(root-fs_info-trans_mutex); if (flush_on_commit || snap_pending) { @@ -1089,8 +1083,10 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans, TASK_UNINTERRUPTIBLE); smp_mb(); - if (cur_trans-num_writers 1 || should_grow) - schedule_timeout(timeout); + if (cur_trans-num_writers 1) + schedule_timeout(MAX_SCHEDULE_TIMEOUT); + else if (should_grow) + schedule_timeout(1); mutex_lock(root-fs_info-trans_mutex); finish_wait(cur_trans-writer_wait, wait); -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: allow subvol deletion by unprivileged user with -o user_subvol_rm_allowed
On Tue, 26 Oct 2010, Goffredo Baroncelli wrote: inode = dentry-d_inode; + dest = BTRFS_I(inode)-root; + if (!capable(CAP_SYS_ADMIN)){ + /* +* Regular user. Only allow this with a special mount +* option, and when rmdir(2) would have been allowed. +* +* Note that this is _not_ check that the subvol is +* empty or doesn't contain data that we wouldn't +* otherwise be able to delete. +* +* Users who want to delete empty subvols should try +* rmdir(2). +*/ + err = -EPERM; + if (!btrfs_test_opt(root, USER_SUBVOL_RM_ALLOWED)) + goto out_dput; + + /* +* Do not allow deletion if the parent dir is the same +* as the dir to be deleted. That means the ioctl +* must be called on the dentry referencing the root +* of the subvol, not a random directory contained +* within it. +*/ + err = -EINVAL; + if (root == dest) + goto out_dput; + + /* check if subvolume may be deleted by a non-root user */ + err = btrfs_may_delete(dir, dentry, 1); + if (err) + goto out_dput; If I read correctly, an user now is capable to remove a not owned subvolume. Is this the intended behavior ? I am not arguing against, but I want to highlight this fact. Good point. This was mirroring the rmdir(2) checks, but given that we can remove a non-empty subvol, we should add + err = inode_permission(inode, MAY_WRITE | MAY_EXEC); + if (err) + goto out_dput; as well. I'll resend. Thanks! sage + } + if (inode-i_ino != BTRFS_FIRST_FREE_OBJECTID) { err = -EINVAL; goto out_dput; } - dest = BTRFS_I(inode)-root; - mutex_lock(inode-i_mutex); err = d_invalidate(dentry); if (err) diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 4da2680..8ff5a3a 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -68,7 +68,7 @@ enum { Opt_nodatacow, Opt_max_inline, Opt_alloc_start, Opt_nobarrier, Opt_ssd, Opt_nossd, Opt_ssd_spread, Opt_thread_pool, Opt_noacl, Opt_compress, Opt_compress_force, Opt_notreelog, Opt_ratio, Opt_flushoncommit, - Opt_discard, Opt_err, + Opt_discard, Opt_user_subvol_rm_allowed, Opt_err, }; static match_table_t tokens = { @@ -92,6 +92,7 @@ static match_table_t tokens = { {Opt_flushoncommit, flushoncommit}, {Opt_ratio, metadata_ratio=%d}, {Opt_discard, discard}, + {Opt_user_subvol_rm_allowed, user_subvol_rm_allowed}, {Opt_err, NULL}, }; @@ -235,6 +236,9 @@ int btrfs_parse_options(struct btrfs_root *root, char *options) case Opt_discard: btrfs_set_opt(info-mount_opt, DISCARD); break; + case Opt_user_subvol_rm_allowed: + btrfs_set_opt(info-mount_opt, USER_SUBVOL_RM_ALLOWED); + break; case Opt_err: printk(KERN_INFO btrfs: unrecognized mount option '%s'\n, p); -- 1.6.6.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- gpg key@ keyserver.linux.it: Goffredo Baroncelli (ghigo) kreij...@inwind.it Key fingerprint = 4769 7E51 5293 D36C 814E C054 BF04 F161 3DC5 0512 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] Btrfs: fix deadlock in btrfs_commit_transaction
We calculate timeout (either 1 or MAX_SCHEDULE_TIMEOUT) based on whether num_writers 1 or should_grow at the top of the loop. Then, much much later, we wait for that timeout if either num_writers or should_grow is true. However, it's possible for a racing process (calling btrfs_end_transaction()) to decrement num_writers such that we wait forever instead of for 1. Fix this by deciding how long to wait when we wait. Include a smp_mb() before checking if the waitqueue is active to ensure the num_writers is visible. Signed-off-by: Sage Weil s...@newdream.net --- v2: - add smp_mb() before waitqueue_active() check to clone another possible race fs/btrfs/transaction.c | 13 + 1 files changed, 5 insertions(+), 8 deletions(-) diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 66e4c66..b461fe3 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -392,6 +392,7 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans, WARN_ON(cur_trans-num_writers 1); cur_trans-num_writers--; + smp_mb(); if (waitqueue_active(cur_trans-writer_wait)) wake_up(cur_trans-writer_wait); put_transaction(cur_trans); @@ -992,7 +993,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans, struct btrfs_root *root) { unsigned long joined = 0; - unsigned long timeout = 1; struct btrfs_transaction *cur_trans; struct btrfs_transaction *prev_trans = NULL; DEFINE_WAIT(wait); @@ -1063,11 +1063,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans, snap_pending = 1; WARN_ON(cur_trans != trans-transaction); - if (cur_trans-num_writers 1) - timeout = MAX_SCHEDULE_TIMEOUT; - else if (should_grow) - timeout = 1; - mutex_unlock(root-fs_info-trans_mutex); if (flush_on_commit || snap_pending) { @@ -1089,8 +1084,10 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans, TASK_UNINTERRUPTIBLE); smp_mb(); - if (cur_trans-num_writers 1 || should_grow) - schedule_timeout(timeout); + if (cur_trans-num_writers 1) + schedule_timeout(MAX_SCHEDULE_TIMEOUT); + else if (should_grow) + schedule_timeout(1); mutex_lock(root-fs_info-trans_mutex); finish_wait(cur_trans-writer_wait, wait); -- 1.6.6.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] Btrfs: allow subvol deletion by unprivileged user with -o user_subvol_rm_allowed
Add a mount option user_subvol_rm_allowed that allows users to delete a (potentially non-empty!) subvol when they would otherwise we allowed to do an rmdir(2). We duplicate the may_delete() checks from the core VFS code to implement identical security checks (minus the directory size check). We additionally require that the user has write+exec permission on the subvol root inode. Signed-off-by: Sage Weil s...@newdream.net --- v2: - add write+exec check on subvol root inode fs/btrfs/ctree.h |1 + fs/btrfs/ioctl.c | 115 +++-- fs/btrfs/super.c |6 ++- 3 files changed, 116 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 4c3c20b..f1f4155 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1193,6 +1193,7 @@ struct btrfs_root { #define BTRFS_MOUNT_NOSSD (1 9) #define BTRFS_MOUNT_DISCARD(1 10) #define BTRFS_MOUNT_FORCE_COMPRESS (1 11) +#define BTRFS_MOUNT_USER_SUBVOL_RM_ALLOWED (1 12) #define btrfs_clear_opt(o, opt)((o) = ~BTRFS_MOUNT_##opt) #define btrfs_set_opt(o, opt) ((o) |= BTRFS_MOUNT_##opt) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 9cbda86..7c5e858 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -409,6 +409,76 @@ fail: return ret; } +/* copy of check_sticky in fs/namei.c() +* It's inline, so penalty for filesystems that don't use sticky bit is +* minimal. +*/ +static inline int btrfs_check_sticky(struct inode *dir, struct inode *inode) +{ + uid_t fsuid = current_fsuid(); + + if (!(dir-i_mode S_ISVTX)) + return 0; + if (inode-i_uid == fsuid) + return 0; + if (dir-i_uid == fsuid) + return 0; + return !capable(CAP_FOWNER); +} + +/* copy of may_delete in fs/namei.c() + * Check whether we can remove a link victim from directory dir, check + * whether the type of victim is right. + * 1. We can't do it if dir is read-only (done in permission()) + * 2. We should have write and exec permissions on dir + * 3. We can't remove anything from append-only dir + * 4. We can't do anything with immutable dir (done in permission()) + * 5. If the sticky bit on dir is set we should either + * a. be owner of dir, or + * b. be owner of victim, or + * c. have CAP_FOWNER capability + * 6. If the victim is append-only or immutable we can't do antyhing with + * links pointing to it. + * 7. If we were asked to remove a directory and victim isn't one - ENOTDIR. + * 8. If we were asked to remove a non-directory and victim isn't one - EISDIR. + * 9. We can't remove a root or mountpoint. + * 10. We don't allow removal of NFS sillyrenamed files; it's handled by + * nfs_async_unlink(). + */ + +static int btrfs_may_delete(struct inode *dir,struct dentry *victim,int isdir) +{ + int error; + + if (!victim-d_inode) + return -ENOENT; + + BUG_ON(victim-d_parent-d_inode != dir); + audit_inode_child(victim, dir); + + error = inode_permission(dir, MAY_WRITE | MAY_EXEC); + if (error) + return error; + if (IS_APPEND(dir)) + return -EPERM; + if (btrfs_check_sticky(dir, victim-d_inode)|| + IS_APPEND(victim-d_inode)|| + IS_IMMUTABLE(victim-d_inode) || IS_SWAPFILE(victim-d_inode)) + return -EPERM; + if (isdir) { + if (!S_ISDIR(victim-d_inode-i_mode)) + return -ENOTDIR; + if (IS_ROOT(victim)) + return -EBUSY; + } else if (S_ISDIR(victim-d_inode-i_mode)) + return -EISDIR; + if (IS_DEADDIR(dir)) + return -ENOENT; + if (victim-d_flags DCACHE_NFSFS_RENAMED) + return -EBUSY; + return 0; +} + /* copy of may_create in fs/namei.c() */ static inline int btrfs_may_create(struct inode *dir, struct dentry *child) { @@ -1288,9 +1358,6 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file, int ret; int err = 0; - if (!capable(CAP_SYS_ADMIN)) - return -EPERM; - vol_args = memdup_user(arg, sizeof(*vol_args)); if (IS_ERR(vol_args)) return PTR_ERR(vol_args); @@ -1320,13 +1387,51 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file, } inode = dentry-d_inode; + dest = BTRFS_I(inode)-root; + if (!capable(CAP_SYS_ADMIN)){ + /* +* Regular user. Only allow this with a special mount +* option, when the user has write+exec access to the +* subvol root, and when rmdir(2) would have been +* allowed. +* +* Note that this is _not_ check that the subvol is +* empty or doesn't contain data that we wouldn't +* otherwise be able to delete
[PATCH 0/6] Btrfs commit fixes, async subvol operations
Hi Chris, This is the extent of my current queue of Btrfs snapshot/subvol/commit stuff. Most of these were posted several months ago. Can be sent upstream during this merge window? Not having this functionality is becoming a bit of a roadblock for our efforts to keep the Ceph data in a consistent state. These patches are also available from git://git.kernel.org/pub/scm/linux/kernel/git/sage/btrfs.git snap_ioctls The first patch is strictly a bug fix for a deadlock in btrfs_commit_transaction(). The next few patches are a repost (with a few minor revisions) of the async snapshot/subvolume ioctls I originally posted last spring. They include: - Some async commit helper functions - Start and wait sync ioctls, for initiating and waiting for a sync - An ioctl to start a snapshot creation asynchronously, such that you don't have to wait for the full commit to disk. The interface is cleaned up a bit from the original version. There is also a patch that changes the SNAP_DESTROY ioctl to not do a commit before returning. The rationale here is there is no obvious reason (to me at least) why the snapshot removal should be on disk before returning; rm(2) and unlink(2) certainly don't do that. If the user disagrees, they can call sync(2). If you would prefer, I also have a patch that introduces a new SNAP_DESTROY_ASYNC ioctl that doesn't change any existing behavior. The last item is a change to SNAP_DESTROY to allow deletion of a snapshot when the user owns the subvol's root inode and the parent directory permissions are such that we would have allowed an rmdir(2). Goffredo Baroncelli posted a similar patch that replicates the rmdir(2) semantics completely (except for the empty directory check) by duplicating some VFS code. Whether we want weaker semantics, duplicated code, or some new EXPORT_SYMBOLS is up to you I think. Note that this is distinct from a similar patch (also from Goffredo) that allows rmdir(2) to remove an empty subvol; my goal is to allow a non-empty subvol to be deleted by a non-root user. As long as I can do that, my daemon doesn't have to run as root and I'm a happy camper. :) If anybody has any questions or issues with any of this, please let me know so I can revise the patches accordingly. Thanks! sage --- Sage Weil (6): Btrfs: fix deadlock in btrfs_commit_transaction Btrfs: async transaction commit Btrfs: add START_SYNC, WAIT_SYNC ioctls Btrfs: add SNAP_CREATE_ASYNC ioctl Btrfs: make SNAP_DESTROY async Btrfs: allow subvol deletion by owner fs/btrfs/ctree.h |1 + fs/btrfs/disk-io.c |1 + fs/btrfs/ioctl.c | 152 +++- fs/btrfs/ioctl.h |9 +++ fs/btrfs/transaction.c | 183 +-- fs/btrfs/transaction.h |4 + security/security.c|1 + 7 files changed, 326 insertions(+), 25 deletions(-) -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/6] Btrfs: fix deadlock in btrfs_commit_transaction
We calculate timeout (either 1 or MAX_SCHEDULE_TIMEOUT) based on whether num_writers 1 or should_grow at the top of the loop. Then, much much later, we wait for that timeout if either num_writers or should_grow is true. However, it's possible for a racing process (calling btrfs_end_transaction()) to decrement num_writers such that we wait forever instead of for 1. Fix this by deciding how long to wait when we wait. Signed-off-by: Sage Weil s...@newdream.net --- fs/btrfs/transaction.c | 12 1 files changed, 4 insertions(+), 8 deletions(-) diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 66e4c66..bf399ea 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -992,7 +992,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans, struct btrfs_root *root) { unsigned long joined = 0; - unsigned long timeout = 1; struct btrfs_transaction *cur_trans; struct btrfs_transaction *prev_trans = NULL; DEFINE_WAIT(wait); @@ -1063,11 +1062,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans, snap_pending = 1; WARN_ON(cur_trans != trans-transaction); - if (cur_trans-num_writers 1) - timeout = MAX_SCHEDULE_TIMEOUT; - else if (should_grow) - timeout = 1; - mutex_unlock(root-fs_info-trans_mutex); if (flush_on_commit || snap_pending) { @@ -1089,8 +1083,10 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans, TASK_UNINTERRUPTIBLE); smp_mb(); - if (cur_trans-num_writers 1 || should_grow) - schedule_timeout(timeout); + if (cur_trans-num_writers 1) + schedule_timeout(MAX_SCHEDULE_TIMEOUT); + else if (should_grow) + schedule_timeout(1); mutex_lock(root-fs_info-trans_mutex); finish_wait(cur_trans-writer_wait, wait); -- 1.6.6.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/6] Btrfs: async transaction commit
Add support for an async transaction commit that is ordered such that any subsequent operations will join the following transaction, but does not wait until the current commit is fully on disk. This avoids much of the latency associated with the btrfs_commit_transaction for callers concerned with serialization and not safety. The wait_for_unblock flag controls whether we wait for the 'middle' portion of commit_transaction to complete, which is necessary if the caller expects some of the modifications contained in the commit to be available (this is the case for subvol/snapshot creation). Signed-off-by: Sage Weil s...@newdream.net --- fs/btrfs/ctree.h |1 + fs/btrfs/disk-io.c |1 + fs/btrfs/transaction.c | 119 fs/btrfs/transaction.h |3 + 4 files changed, 124 insertions(+), 0 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index eaf286a..4c3c20b 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -863,6 +863,7 @@ struct btrfs_fs_info { struct btrfs_transaction *running_transaction; wait_queue_head_t transaction_throttle; wait_queue_head_t transaction_wait; + wait_queue_head_t transaction_blocked_wait; wait_queue_head_t async_submit_wait; struct btrfs_super_block super_copy; diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 64f1008..034abc6 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1680,6 +1680,7 @@ struct btrfs_root *open_ctree(struct super_block *sb, init_waitqueue_head(fs_info-transaction_throttle); init_waitqueue_head(fs_info-transaction_wait); + init_waitqueue_head(fs_info-transaction_blocked_wait); init_waitqueue_head(fs_info-async_submit_wait); __setup_root(4096, 4096, 4096, 4096, tree_root, diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index bf399ea..d720106 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -988,6 +988,123 @@ int btrfs_transaction_blocked(struct btrfs_fs_info *info) return ret; } +/* + * wait for the current transaction commit to start and block subsequent + * transaction joins + */ +static void wait_current_trans_commit_start(struct btrfs_root *root, + struct btrfs_transaction *trans) +{ + DEFINE_WAIT(wait); + + if (trans-in_commit) + return; + + while (1) { + prepare_to_wait(root-fs_info-transaction_blocked_wait, wait, + TASK_UNINTERRUPTIBLE); + if (trans-in_commit) { + finish_wait(root-fs_info-transaction_blocked_wait, + wait); + break; + } + mutex_unlock(root-fs_info-trans_mutex); + schedule(); + mutex_lock(root-fs_info-trans_mutex); + finish_wait(root-fs_info-transaction_blocked_wait, wait); + } +} + +/* + * wait for the current transaction to start and then become unblocked. + * caller holds ref. + */ +static void wait_current_trans_commit_start_and_unblock(struct btrfs_root *root, +struct btrfs_transaction *trans) +{ + DEFINE_WAIT(wait); + + if (trans-commit_done || (trans-in_commit !trans-blocked)) + return; + + while (1) { + prepare_to_wait(root-fs_info-transaction_wait, wait, + TASK_UNINTERRUPTIBLE); + if (trans-commit_done || + (trans-in_commit !trans-blocked)) { + finish_wait(root-fs_info-transaction_wait, + wait); + break; + } + mutex_unlock(root-fs_info-trans_mutex); + schedule(); + mutex_lock(root-fs_info-trans_mutex); + finish_wait(root-fs_info-transaction_wait, + wait); + } +} + +/* + * commit transactions asynchronously. once btrfs_commit_transaction_async + * returns, any subsequent transaction will not be allowed to join. + */ +struct btrfs_async_commit { + struct btrfs_trans_handle *newtrans; + struct btrfs_root *root; + struct delayed_work work; +}; + +static void do_async_commit(struct work_struct *work) +{ + struct btrfs_async_commit *ac = + container_of(work, struct btrfs_async_commit, work.work); + + btrfs_commit_transaction(ac-newtrans, ac-root); + kfree(ac); +} + +int btrfs_commit_transaction_async(struct btrfs_trans_handle *trans, + struct btrfs_root *root, + int wait_for_unblock) +{ + struct btrfs_async_commit *ac; + struct btrfs_transaction *cur_trans; + + ac = kmalloc(sizeof(*ac), GFP_NOFS); + BUG_ON(!ac); + + INIT_DELAYED_WORK(ac-work
[PATCH 3/6] Btrfs: add START_SYNC, WAIT_SYNC ioctls
START_SYNC will start a sync/commit, but not wait for it to complete. Any modification started after the ioctl returns is guaranteed not to be included in the commit. If a non-NULL pointer is passed, the transaction id will be returned to userspace. WAIT_SYNC will wait for any in-progress commit to complete. If a transaction id is specified, the ioctl will block and then return (success) when the specified transaction has committed. If it has already committed when we call the ioctl, it returns immediately. If the specified transaction doesn't exist, it returns EINVAL. If no transaction id is specified, WAIT_SYNC will wait for the currently committing transaction to finish it's commit to disk. If there is no currently committing transaction, it returns success. These ioctls are useful for applications which want to impose an ordering on when fs modifications reach disk, but do not want to wait for the full (slow) commit process to do so. Picky callers can take the transid returned by START_SYNC and feed it to WAIT_SYNC, and be certain to wait only as long as necessary for the transaction _they_ started to reach disk. Sloppy callers can START_SYNC and WAIT_SYNC without a transid, and provided they didn't wait too long between the calls, they will get the same result. However, if a second commit starts before they call WAIT_SYNC, they may end up waiting longer for it to commit as well. Even so, a START_SYNC+WAIT_SYNC still guarantees that any operation completed before the START_SYNC reaches disk. Signed-off-by: Sage Weil s...@newdream.net --- fs/btrfs/ioctl.c | 34 +++ fs/btrfs/ioctl.h |2 + fs/btrfs/transaction.c | 52 fs/btrfs/transaction.h |1 + 4 files changed, 89 insertions(+), 0 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 9254b3d..6306051 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -1984,6 +1984,36 @@ long btrfs_ioctl_trans_end(struct file *file) return 0; } +static noinline long btrfs_ioctl_start_sync(struct file *file, void __user *argp) +{ + struct btrfs_root *root = BTRFS_I(file-f_dentry-d_inode)-root; + struct btrfs_trans_handle *trans; + u64 transid; + + trans = btrfs_start_transaction(root, 0); + transid = trans-transid; + btrfs_commit_transaction_async(trans, root, 0); + + if (argp) + if (copy_to_user(argp, transid, sizeof(transid))) + return -EFAULT; + return 0; +} + +static noinline long btrfs_ioctl_wait_sync(struct file *file, void __user *argp) +{ + struct btrfs_root *root = BTRFS_I(file-f_dentry-d_inode)-root; + u64 transid; + + if (argp) { + if (copy_from_user(transid, argp, sizeof(transid))) + return -EFAULT; + } else { + transid = 0; /* current trans */ + } + return btrfs_wait_for_commit(root, transid); +} + long btrfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { @@ -2034,6 +2064,10 @@ long btrfs_ioctl(struct file *file, unsigned int case BTRFS_IOC_SYNC: btrfs_sync_fs(file-f_dentry-d_sb, 1); return 0; + case BTRFS_IOC_START_SYNC: + return btrfs_ioctl_start_sync(file, argp); + case BTRFS_IOC_WAIT_SYNC: + return btrfs_ioctl_wait_sync(file, argp); } return -ENOTTY; diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h index 424694a..e9a2f7e 100644 --- a/fs/btrfs/ioctl.h +++ b/fs/btrfs/ioctl.h @@ -178,4 +178,6 @@ struct btrfs_ioctl_space_args { #define BTRFS_IOC_DEFAULT_SUBVOL _IOW(BTRFS_IOCTL_MAGIC, 19, u64) #define BTRFS_IOC_SPACE_INFO _IOWR(BTRFS_IOCTL_MAGIC, 20, \ struct btrfs_ioctl_space_args) +#define BTRFS_IOC_START_SYNC _IOR(BTRFS_IOCTL_MAGIC, 21, __u64) +#define BTRFS_IOC_WAIT_SYNC _IOW(BTRFS_IOCTL_MAGIC, 22, __u64) #endif diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index d720106..932bc03 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -270,6 +270,58 @@ static noinline int wait_for_commit(struct btrfs_root *root, return 0; } +int btrfs_wait_for_commit(struct btrfs_root *root, u64 transid) +{ + struct btrfs_transaction *cur_trans = NULL, *t; + int ret; + + mutex_lock(root-fs_info-trans_mutex); + + ret = 0; + if (transid) { + if (transid = root-fs_info-last_trans_committed) + goto out_unlock; + + /* find specified transaction */ + list_for_each_entry(t, root-fs_info-trans_list, list) { + if (t-transid == transid) { + cur_trans = t; + break; + } + if (t-transid transid) + break
[PATCH 5/6] Btrfs: make SNAP_DESTROY async
There is no reason to force an immediate commit when deleting a snapshot. Users have some expectation that space from a deleted snapshot be freed immediately, but even if we do commit the reclaim is a background process. If users _do_ want the deletion to be durable, they can call 'sync'. Signed-off-by: Sage Weil s...@newdream.net --- fs/btrfs/ioctl.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 679b8a8..9cbda86 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -1365,7 +1365,7 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file, BUG_ON(ret); } - ret = btrfs_commit_transaction(trans, root); + ret = btrfs_end_transaction(trans, root); BUG_ON(ret); inode-i_flags |= S_DEAD; out_up_write: -- 1.6.6.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 6/6] Btrfs: allow subvol deletion by owner
Allow users to delete the snapshots/subvols they own. When CAP_SYS_ADMIN is not present, require that - the user has write and exec permission on the parent directory - security_inode_rmdir() doesn't object - the user has write and exec permission on the subvol directory - the user owns the subvol directory - the directory and subvol append flags are not set This is a bit more strict than the requirements for 'rm -f subvol/*', which is allowed with just 'wx' on non-owned non-sticky dirs. It is less strict than 'rmdir subvol', which has additional requirements if the parent directory is sticky. It is less strict than 'rm -rf subvol', which would require scanning the entire subvol to ensure all content is deletable. Signed-off-by: Sage Weil s...@newdream.net --- fs/btrfs/ioctl.c| 27 --- security/security.c |1 + 2 files changed, 25 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 9cbda86..90d2871 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -1288,9 +1288,6 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file, int ret; int err = 0; - if (!capable(CAP_SYS_ADMIN)) - return -EPERM; - vol_args = memdup_user(arg, sizeof(*vol_args)); if (IS_ERR(vol_args)) return PTR_ERR(vol_args); @@ -1325,6 +1322,30 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file, goto out_dput; } + /* +* Allow subvol deletion if we own the subvol and we would +* (approximately) be allowed to rmdir it. Strictly speaking, +* we could possibly delete everything with fewer permissions, +* or might require more permissions to remove all files +* contained by the subvol, but we aren't trying to mimic +* directory semantics perfectly. +*/ + if (!capable(CAP_SYS_ADMIN)) { + err = inode_permission(dir, MAY_WRITE | MAY_EXEC); + if (err) + goto out_dput; + err = security_inode_rmdir(dir, dentry); + if (err) + goto out_dput; + err = inode_permission(inode, MAY_WRITE | MAY_EXEC); + if (err) + goto out_dput; + err = -EPERM; + if (inode-i_uid != current_fsuid() || IS_APPEND(dir) || + IS_APPEND(inode)) + goto out_dput; + } + dest = BTRFS_I(inode)-root; mutex_lock(inode-i_mutex); diff --git a/security/security.c b/security/security.c index c53949f..1c980ee 100644 --- a/security/security.c +++ b/security/security.c @@ -490,6 +490,7 @@ int security_inode_rmdir(struct inode *dir, struct dentry *dentry) return 0; return security_ops-inode_rmdir(dir, dentry); } +EXPORT_SYMBOL_GPL(security_inode_rmdir); int security_inode_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev) { -- 1.6.6.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/6] Btrfs commit fixes, async subvol operations
On Mon, 25 Oct 2010, Chris Mason wrote: These all look good to me and I'm pulling them in. Great, thanks! The last item is a change to SNAP_DESTROY to allow deletion of a snapshot when the user owns the subvol's root inode and the parent directory permissions are such that we would have allowed an rmdir(2). Goffredo Baroncelli posted a similar patch that replicates the rmdir(2) semantics completely (except for the empty directory check) by duplicating some VFS code. Whether we want weaker semantics, duplicated code, or some new EXPORT_SYMBOLS is up to you I think. Note that this is distinct from a similar patch (also from Goffredo) that allows rmdir(2) to remove an empty subvol; my goal is to allow a non-empty subvol to be deleted by a non-root user. As long as I can do that, my daemon doesn't have to run as root and I'm a happy camper. :) Someone at the storage workshop mentioned that this subvol deletion trick is slightly stronger than rm -rf, to make it include the same level of permission checks would require testing all the directories in the tree for permissions. I think that was me :) For now, could you please make a mount -o user_subvol_rm_allowed option? (or something similar with a better name). Sure. Do you have a preference as far as what checks are implemented? My patch implemented a simplified approximation of may_rmdir(); Goffredo's duplicated the vfs checks. I guess I'm leaning toward the latter... Thanks! sage -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs: allow subvol deletion by unprivileged user with -o user_subvol_rm_allowed
Add a mount option user_subvol_rm_allowed that allows users to delete a (potentially non-empty!) subvol when they would otherwise we allowed to do an rmdir(2). We duplicate the may_delete() checks from the core VFS code to implement identical security checks (minus the directory size check). Signed-off-by: Sage Weil s...@newdream.net --- fs/btrfs/ctree.h |1 + fs/btrfs/ioctl.c | 109 +++-- fs/btrfs/super.c |6 ++- 3 files changed, 110 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 5ac2bca..140003e 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1196,6 +1196,7 @@ struct btrfs_root { #define BTRFS_MOUNT_NOSSD (1 9) #define BTRFS_MOUNT_DISCARD(1 10) #define BTRFS_MOUNT_FORCE_COMPRESS (1 11) +#define BTRFS_MOUNT_USER_SUBVOL_RM_ALLOWED (1 12) #define btrfs_clear_opt(o, opt)((o) = ~BTRFS_MOUNT_##opt) #define btrfs_set_opt(o, opt) ((o) |= BTRFS_MOUNT_##opt) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 906e3b3..919b23f 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -409,6 +409,76 @@ fail: return ret; } +/* copy of check_sticky in fs/namei.c() +* It's inline, so penalty for filesystems that don't use sticky bit is +* minimal. +*/ +static inline int btrfs_check_sticky(struct inode *dir, struct inode *inode) +{ + uid_t fsuid = current_fsuid(); + + if (!(dir-i_mode S_ISVTX)) + return 0; + if (inode-i_uid == fsuid) + return 0; + if (dir-i_uid == fsuid) + return 0; + return !capable(CAP_FOWNER); +} + +/* copy of may_delete in fs/namei.c() + * Check whether we can remove a link victim from directory dir, check + * whether the type of victim is right. + * 1. We can't do it if dir is read-only (done in permission()) + * 2. We should have write and exec permissions on dir + * 3. We can't remove anything from append-only dir + * 4. We can't do anything with immutable dir (done in permission()) + * 5. If the sticky bit on dir is set we should either + * a. be owner of dir, or + * b. be owner of victim, or + * c. have CAP_FOWNER capability + * 6. If the victim is append-only or immutable we can't do antyhing with + * links pointing to it. + * 7. If we were asked to remove a directory and victim isn't one - ENOTDIR. + * 8. If we were asked to remove a non-directory and victim isn't one - EISDIR. + * 9. We can't remove a root or mountpoint. + * 10. We don't allow removal of NFS sillyrenamed files; it's handled by + * nfs_async_unlink(). + */ + +static int btrfs_may_delete(struct inode *dir,struct dentry *victim,int isdir) +{ + int error; + + if (!victim-d_inode) + return -ENOENT; + + BUG_ON(victim-d_parent-d_inode != dir); + audit_inode_child(victim, dir); + + error = inode_permission(dir, MAY_WRITE | MAY_EXEC); + if (error) + return error; + if (IS_APPEND(dir)) + return -EPERM; + if (btrfs_check_sticky(dir, victim-d_inode)|| + IS_APPEND(victim-d_inode)|| + IS_IMMUTABLE(victim-d_inode) || IS_SWAPFILE(victim-d_inode)) + return -EPERM; + if (isdir) { + if (!S_ISDIR(victim-d_inode-i_mode)) + return -ENOTDIR; + if (IS_ROOT(victim)) + return -EBUSY; + } else if (S_ISDIR(victim-d_inode-i_mode)) + return -EISDIR; + if (IS_DEADDIR(dir)) + return -ENOENT; + if (victim-d_flags DCACHE_NFSFS_RENAMED) + return -EBUSY; + return 0; +} + /* copy of may_create in fs/namei.c() */ static inline int btrfs_may_create(struct inode *dir, struct dentry *child) { @@ -1288,9 +1358,6 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file, int ret; int err = 0; - if (!capable(CAP_SYS_ADMIN)) - return -EPERM; - vol_args = memdup_user(arg, sizeof(*vol_args)); if (IS_ERR(vol_args)) return PTR_ERR(vol_args); @@ -1320,13 +1387,45 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file, } inode = dentry-d_inode; + dest = BTRFS_I(inode)-root; + if (!capable(CAP_SYS_ADMIN)){ + /* +* Regular user. Only allow this with a special mount +* option, and when rmdir(2) would have been allowed. +* +* Note that this is _not_ check that the subvol is +* empty or doesn't contain data that we wouldn't +* otherwise be able to delete. +* +* Users who want to delete empty subvols should try +* rmdir(2). +*/ + err = -EPERM; + if (!btrfs_test_opt(root, USER_SUBVOL_RM_ALLOWED
Re: [PATCH 0/6] Btrfs commit fixes, async subvol operations
On Mon, 25 Oct 2010, Chris Mason wrote: Yes, lets duplicate the vfs checks. Christoph just sat bolt upright in whatever ski lift he's currently riding. :) This version is based on Goffredo's patch, but requires the user_subvol_rm_allowed mount option, and will proceed even if the subvol is nonempty. (I suspect we also want his other rmdir(2) patch at some point as well so that users can rmdir an empty subvol and/or 'rm -r'.) We should also make sure they do the subvol rm against the root of the subvol (if that check isn't already done), none of the magic to resolve the subvol based on any file inside it. I don't want people to accidentally think they are deleting a subdir and have it go higher up into the directory tree. Oh, and it shouldn't work on the root of the FS either ;) Since the ioctl is based on a name lookup, I added a check that the parent inode's root isn't the same as the target root. That implies the ioctl is called the dentry referencing the subvol, and presumably the root doesn't have one of those. Does that cover it? It isn't possible to have the subvol bound to multiple locations in the namespace, right? $ whoami sage $ btrfs sub create a Create subvolume './a' $ btrfs sub create a/b Create subvolume 'a/b' $ btrfs sub snap a asnap Create a snapshot of 'a' in './asnap' $ btrfs sub del asnap/b ERROR: 'asnap/b' is not a subvolume $ btrfs sub del a Delete subvolume '/mnt/osd27/a/a' ERROR: cannot delete '/mnt/osd27/a/a' $ btrfs sub del a/b Delete subvolume '/mnt/osd27/a/a/b' $ btrfs sub del a Delete subvolume '/mnt/osd27/a/a' $ btrfs sub del asnap Delete subvolume '/mnt/osd27/a/asnap' Sending the patch separately. Thanks! sage -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/3] clone ioctl fixes
Hi Chris, These are a few critical bug fixes for the clone ioctl. We've been hitting these on a number of different systems, so getting these into 2.6.37 would be much appreciated. Thanks! sage -- Sage Weil (1): Btrfs: fix lockdep warning on clone ioctl Yehuda Sadeh (2): Btrfs: fix delalloc checks in clone ioctl Btrfs: fix clone ioctl where range is adjacent to extent fs/btrfs/ioctl.c | 18 ++ 1 files changed, 10 insertions(+), 8 deletions(-) -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/3] Btrfs: fix delalloc checks in clone ioctl
From: Yehuda Sadeh yeh...@hq.newdream.net The lookup_first_ordered_extent() was done on the wrong inode, and the -delalloc_bytes test was wrong, as the following btrfs_wait_ordered_range() would only invoke a range write and wouldn't write the entire file data range. Also, a bad parameter was passed to btrfs_wait_ordered_range(). Signed-off-by: Yehuda Sadeh yeh...@hq.newdream.net --- fs/btrfs/ioctl.c |8 +--- 1 files changed, 5 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 9254b3d..3471b22 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -1530,13 +1530,15 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd, while (1) { struct btrfs_ordered_extent *ordered; lock_extent(BTRFS_I(src)-io_tree, off, off+len, GFP_NOFS); - ordered = btrfs_lookup_first_ordered_extent(inode, off+len); - if (BTRFS_I(src)-delalloc_bytes == 0 !ordered) + ordered = btrfs_lookup_first_ordered_extent(src, off+len); + if (!ordered + !test_range_bit(BTRFS_I(src)-io_tree, off, off+len, + EXTENT_DELALLOC, 0, NULL)) break; unlock_extent(BTRFS_I(src)-io_tree, off, off+len, GFP_NOFS); if (ordered) btrfs_put_ordered_extent(ordered); - btrfs_wait_ordered_range(src, off, off+len); + btrfs_wait_ordered_range(src, off, len); } /* clone data */ -- 1.7.0.4 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/3] Btrfs: fix clone ioctl where range is adjacent to extent
From: Yehuda Sadeh yeh...@hq.newdream.net We had an edge case issue where the requested range was just following an existing extent. Instead of skipping to the next extent, we used the previous one which lead to having zero sized extents. Signed-off-by: Yehuda Sadeh yeh...@hq.newdream.net --- fs/btrfs/ioctl.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 3471b22..f4a3dde 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -1607,7 +1607,7 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd, } btrfs_release_path(root, path); - if (key.offset + datal off || + if (key.offset + datal = off || key.offset = off+len) goto next; -- 1.7.0.4 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/3] Btrfs: fix lockdep warning on clone ioctl
I'm no lockdep expert, but this appears to make the lockdep warning go away for the i_mutex locking in the clone ioctl. Signed-off-by: Sage Weil s...@newdream.net --- fs/btrfs/ioctl.c |8 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index f4a3dde..3f27529 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -1502,11 +1502,11 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd, path-reada = 2; if (inode src) { - mutex_lock(inode-i_mutex); - mutex_lock(src-i_mutex); + mutex_lock_nested(inode-i_mutex, I_MUTEX_PARENT); + mutex_lock_nested(src-i_mutex, I_MUTEX_CHILD); } else { - mutex_lock(src-i_mutex); - mutex_lock(inode-i_mutex); + mutex_lock_nested(src-i_mutex, I_MUTEX_PARENT); + mutex_lock_nested(inode-i_mutex, I_MUTEX_CHILD); } /* determine range to clone */ -- 1.7.0.4 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Allow to exec btrfs subvolume delete by a non root user
On Mon, 18 Oct 2010, Goffredo Baroncelli wrote: Hi all like my previous patch, this one allow to remove a subvolume by an ordinary user. Instead of adding this capability to the rmdir(2) syscall, I update the BTRFS_IOC_SNAP_DESTROY ioctl, relaxing the rules to be execute. The checks are the ones performed by the rmdir(2) syscall. So a subvolume must be empty to be removed by a non-root user. I think that this increases a lot the usefulness of the snapshot/subvolume. It is possible to pull the code from the branch named rm-subvolume-not-root of the following repository: http://cassiopea.homelinux.net/git/btrfs-unstable.git Comments are welcome. This looks okay to me. I posted a similar patch a while back[1], but didn't want to duplicate the check_sticky and may_delete code and implemented a simpler set of checks instead. The full checks are probably a better route, although it would be nice if we could avoid duplicating the VFS checks in the process. Whether those helpers should be exported is someone else's call, though. (The only other may_ functions that are exported are may_umount and may_umount_tree.) sage [1] http://marc.info/?l=linux-btrfsm=128086492512628w=2 Reagrds G.Baroncelli diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 9254b3d..5bbb6bc 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -395,6 +395,76 @@ fail: return ret; } +/* copy of check_sticky in fs/namei.c() +* It's inline, so penalty for filesystems that don't use sticky bit is +* minimal. +*/ +static inline int btrfs_check_sticky(struct inode *dir, struct inode *inode) +{ + uid_t fsuid = current_fsuid(); + + if (!(dir-i_mode S_ISVTX)) + return 0; + if (inode-i_uid == fsuid) + return 0; + if (dir-i_uid == fsuid) + return 0; + return !capable(CAP_FOWNER); +} + +/* copy of may_delete in fs/namei.c() + * Check whether we can remove a link victim from directory dir, check + * whether the type of victim is right. + * 1. We can't do it if dir is read-only (done in permission()) + * 2. We should have write and exec permissions on dir + * 3. We can't remove anything from append-only dir + * 4. We can't do anything with immutable dir (done in permission()) + * 5. If the sticky bit on dir is set we should either + * a. be owner of dir, or + * b. be owner of victim, or + * c. have CAP_FOWNER capability + * 6. If the victim is append-only or immutable we can't do antyhing with + * links pointing to it. + * 7. If we were asked to remove a directory and victim isn't one - ENOTDIR. + * 8. If we were asked to remove a non-directory and victim isn't one - EISDIR. + * 9. We can't remove a root or mountpoint. + * 10. We don't allow removal of NFS sillyrenamed files; it's handled by + * nfs_async_unlink(). + */ + +static int btrfs_may_delete(struct inode *dir,struct dentry *victim,int isdir) +{ + int error; + + if (!victim-d_inode) + return -ENOENT; + + BUG_ON(victim-d_parent-d_inode != dir); + audit_inode_child(victim, dir); + + error = inode_permission(dir, MAY_WRITE | MAY_EXEC); + if (error) + return error; + if (IS_APPEND(dir)) + return -EPERM; + if (btrfs_check_sticky(dir, victim-d_inode)|| + IS_APPEND(victim-d_inode)|| + IS_IMMUTABLE(victim-d_inode) || IS_SWAPFILE(victim-d_inode)) + return -EPERM; + if (isdir) { + if (!S_ISDIR(victim-d_inode-i_mode)) + return -ENOTDIR; + if (IS_ROOT(victim)) + return -EBUSY; + } else if (S_ISDIR(victim-d_inode-i_mode)) + return -EISDIR; + if (IS_DEADDIR(dir)) + return -ENOENT; + if (victim-d_flags DCACHE_NFSFS_RENAMED) + return -EBUSY; + return 0; +} + /* copy of may_create in fs/namei.c() */ static inline int btrfs_may_create(struct inode *dir, struct dentry *child) { @@ -1227,9 +1297,6 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file, int ret; int err = 0; - if (!capable(CAP_SYS_ADMIN)) - return -EPERM; - vol_args = memdup_user(arg, sizeof(*vol_args)); if (IS_ERR(vol_args)) return PTR_ERR(vol_args); @@ -1259,6 +1326,20 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file, } inode = dentry-d_inode; + if (!capable(CAP_SYS_ADMIN)){ + /* regolar user */ + /* check if subvolume is empty */ + if (inode-i_size BTRFS_EMPTY_DIR_SIZE){ + err = -ENOTEMPTY; + goto out_dput; + } + /* check if subvolume may be deleted by a non-root user */ + if (btrfs_may_delete(dir, dentry, 1)){ + err = -EPERM; +
Re: Updating the wiki pages adding btrfs command
On Wed, 15 Sep 2010, Chris Ball wrote: Hi, Or make --sync the default behavior. This is probably what most people are expecting anyway (similar to how standard filesystem commands like rm work). Add an --aysnc option for those that only care about knowing when the subvolume is taken out of the tree. Yeah. We've also talked about making snapshot _creation_ perform an FS sync first by default, since otherwise you get a snapshot with stale files, or without files that existed (not yet on disk) at creation-time. Actually, that was fixed in 2.6.34 (0bdb1db2). Creating a snapshot syncs all dirty data and metadata to disk, so you get a fully consistent point in time snapshot. No 'sync' is necessary (and doing a 'sync' yourself would be racy any). For subvolume removal, racing doesn't seem like an issue. Why not just do # btrfs subvolume delete /path/to/subvolume1 # btrfs subvolume delete /path/to/subvolume2 # ... # sync ? sage -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Updating the wiki pages adding btrfs command
On Wed, 15 Sep 2010, Sage Weil wrote: For subvolume removal, racing doesn't seem like an issue. Why not just do # btrfs subvolume delete /path/to/subvolume1 # btrfs subvolume delete /path/to/subvolume2 # ... # sync ? Nevermind, read the whole thread this time. I'll defer to others on the best way to wait for the old space to be freed. sage -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs: allow subvol deletion by owner
Allow users to delete the snapshots/subvols they own. Instead of CAP_SYS_ADMIN, require that - the user has write and exec permission on the parent directory - security_inode_rmdir() doesn't object - the user has write and exec permission on the subvol directory - the user owns the subvol directory - the directory and subvol append flags are not set This is a bit more strict than the requirements for 'rm -f subvol/*', which is allowed with just 'wx' on non-owned non-sticky dirs. It is less strict than 'rmdir subvol', which has additional requirements if the parent directory is sticky. It is less strict than 'rm -rf subvol', which would require scanning the entire subvol to ensure all content is deletable. Signed-off-by: Sage Weil s...@newdream.net --- fs/btrfs/ioctl.c | 25 ++--- 1 files changed, 22 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 9254b3d..b31283a 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -1227,9 +1227,6 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file, int ret; int err = 0; - if (!capable(CAP_SYS_ADMIN)) - return -EPERM; - vol_args = memdup_user(arg, sizeof(*vol_args)); if (IS_ERR(vol_args)) return PTR_ERR(vol_args); @@ -1264,6 +1261,28 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file, goto out_dput; } + /* +* Allow subvol deletion if we own the subvol and we would +* (approximately) be allowed to rmdir it. Strictly speaking, +* we could possibly delete everything with fewer permissions, +* or might require more permissions to remove all files +* contained by the subvol, but we aren't trying to mimic +* directory semantics perfectly. +*/ + err = inode_permission(dir, MAY_WRITE | MAY_EXEC); + if (err) + goto out_dput; + err = security_inode_rmdir(dir, dentry); + if (err) + goto out_dput; + err = inode_permission(inode, MAY_WRITE | MAY_EXEC); + if (err) + goto out_dput; + err = -EPERM; + if (inode-i_uid != current_fsuid() || IS_APPEND(dir) || + IS_APPEND(inode)) + goto out_dput; + dest = BTRFS_I(inode)-root; mutex_lock(inode-i_mutex); -- 1.7.0 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: reflinked file size incorrect
On Sat, 12 Jun 2010, Jim Ursetto wrote: Both `cp --reflink` and `bcp` sometimes round the file size up to the next 4k boundary, with the balance consisting of null bytes. At first glance this behavior occurs for source file size 3916 bytes. I have tried this with stock btrfs from kernel 2.6.35-rc2 and 2.6.35-rc1 -- specifically, Ubuntu 2.6.35-2.3~lucid1-server and 2.6.35-1.1~lucid1-server. Any ideas? This bug is new in 2.6.35-rc1 from a22285a6 (Btrfs: Integrate metadata reservation with start_transaction). Just sent a patch fixing this up to the list. sage -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs: fix CLONE ioctl destination file size expansion to block boundary
The CLONE and CLONE_RANGE ioctls round up the range of extents being cloned to the block size when the range to clone extends to the end of file (this is always the case with CLONE). It was then using that offset when extending the destination file's i_size. Fix this by not setting i_size beyond the originally requested ending offset. This bug was introduced by a22285a6 (2.6.35-rc1). Signed-off-by: Sage Weil s...@newdream.net --- fs/btrfs/ioctl.c | 16 +--- 1 files changed, 13 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 4cdb98c..a687f28 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -1578,6 +1578,7 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd, u64 disko = 0, diskl = 0; u64 datao = 0, datal = 0; u8 comp; + u64 endoff; size = btrfs_item_size_nr(leaf, slot); read_extent_buffer(leaf, buf, @@ -1712,9 +1713,18 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd, btrfs_release_path(root, path); inode-i_mtime = inode-i_ctime = CURRENT_TIME; - if (new_key.offset + datal inode-i_size) - btrfs_i_size_write(inode, - new_key.offset + datal); + + /* +* we round up to the block size at eof when +* determining which extents to clone above, +* but shouldn't round up the file size +*/ + endoff = new_key.offset + datal; + if (endoff off+olen) + endoff = off+olen; + if (endoff inode-i_size) + btrfs_i_size_write(inode, endoff); + BTRFS_I(inode)-flags = BTRFS_I(src)-flags; ret = btrfs_update_inode(trans, root, inode); BUG_ON(ret); -- 1.7.0 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Fwd: Re: Linking two files together][RFC]
On Wed, 9 Jun 2010, Roberto Ragusa wrote: I hope that ideas about btrfs are not off-topic for this mailing list. The forwarded message below was written by me on fedora-users. The thread is about the ability to link two files in a manner similar to cat 1 2 3 rm 1 2 while avoiding any data movement on the disk. The implementation should just put the original extents together in the new file. Is there any filesystem which is capable of doing that? As btrfs is already based on extents and COW, couldn't this feature be evaluated for feasibility? I think a lot of usages will be found for it if actually implemented. Btrfs already has a CLONE_RANGE ioctl that will clone a range of (block-aligned) bytes from file A to any offset in file B. The fs just fixes up the file metadata to reference the same bytes on disk without reading or writing any actual file data. sage -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs: avoid BUG when dropping root and reference in same transaction
If btrfs_ioctl_snap_destroy() deletes a snapshot but finishes with end_transaction(), the cleaner kthread may come in and drop the root in the same transaction. If that's the case, the root's refs still == 1 in the tree when btrfs_del_root() deletes the item, because commit_fs_roots() hasn't updated it yet (that happens during the commit). This wasn't a problem before only because btrfs_ioctl_snap_destroy() would commit the transaction before dropping the dentry reference, so the dead root wouldn't get queued up until after the fs root item was updated in the btree. Since it is not an error to drop the root reference and the root in the same transaction, just drop the BUG_ON() in btrfs_del_root(). Signed-off-by: Sage Weil s...@newdream.net --- fs/btrfs/root-tree.c |3 --- 1 files changed, 0 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/root-tree.c b/fs/btrfs/root-tree.c index 67fa2d2..7cb5041 100644 --- a/fs/btrfs/root-tree.c +++ b/fs/btrfs/root-tree.c @@ -313,7 +313,6 @@ int btrfs_del_root(struct btrfs_trans_handle *trans, struct btrfs_root *root, { struct btrfs_path *path; int ret; - u32 refs; struct btrfs_root_item *ri; struct extent_buffer *leaf; @@ -327,8 +326,6 @@ int btrfs_del_root(struct btrfs_trans_handle *trans, struct btrfs_root *root, leaf = path-nodes[0]; ri = btrfs_item_ptr(leaf, path-slots[0], struct btrfs_root_item); - refs = btrfs_disk_root_refs(leaf, ri); - BUG_ON(refs != 0); ret = btrfs_del_item(trans, root, path); out: btrfs_free_path(path); -- 1.6.6.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs: fix lockdep warning on clone ioctl
I'm no lockdep expert, but this appears to make the lockdep warning go away for the i_mutex locking in the clone ioctl. Signed-off-by: Sage Weil s...@newdream.net --- fs/btrfs/ioctl.c |8 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index c8e6470..ae2b9a0 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -1530,11 +1530,11 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd, path-reada = 2; if (inode src) { - mutex_lock(inode-i_mutex); - mutex_lock(src-i_mutex); + mutex_lock_nested(inode-i_mutex, I_MUTEX_PARENT); + mutex_lock_nested(src-i_mutex, I_MUTEX_CHILD); } else { - mutex_lock(src-i_mutex); - mutex_lock(inode-i_mutex); + mutex_lock_nested(src-i_mutex, I_MUTEX_PARENT); + mutex_lock_nested(inode-i_mutex, I_MUTEX_CHILD); } /* determine range to clone */ -- 1.6.6.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/5] asynchronous commit, snapshot ponies
Hi everyone, This patchset is the latest approach I'm using for the Ceph storage daemon to keep track of which data has safely committed to disk. The basic idea is to not use the (problematic) user transaction ioctls at all. Instead, the daemon quiesces its own write requests, initiates an async snapshot, and then continues. The snapshot approach is nice because it provides rollback. If something goes wrong, we can cleanly go back to the most recent consistent commit. The performance is also very similar to what I was doing before (using the 'flushoncommit' mount option and tiggering a sync_fs to flush data). The only difference is the old snapshots stick around for a bit longer before I delete them and the references get dropped. The first patch introduces a generic btrfs_commit_transaction_async() helper, which starts btrfs_commit_transaction asynchronously and returns either when the commit starts (blocked=1) or when it has done it's dirty work (blocked=0). The second patch adds ioctls that let you start and wait for an asynchronous commit. The third introduces a SNAP_CREATE_ASYNC ioctl that creates a snap but returns before it hits disk. The fourth patch returns the commiting transid to userspace, so that it can be fed to the WAIT_SYNC ioctl. I'm not that happy with the interface, though; any suggestions for alternatives would be great. Alternatively, I could get by without knowing the exact transid and it wouldn't be the end of the world. The final patch lets you delete a snapshot/subvol reference without doing an immediate commit (btrfs_end_transaction instead of btrfs_commit_transaction). AFAICS there's no reason the commit has to happen immediately (user expectations aside). Overall I like this much better than the various user transaction proposals. It's simpler, does the job, and the primitives should be useful for other applications. Let me know what you think! I'm doing more testing this week, but so far I haven't seen any problems with these changes. Thanks- sage Sage Weil (5): Btrfs: async transaction commit Btrfs: add START_SYNC, WAIT_SYNC ioctls Btrfs: add SNAP_CREATE_ASYNC ioctl Btrfs: return transid to userspace from SNAP_CREATE_ASYNC ioctl btrfs: add SNAP_DESTROY_ASYNC ioctl fs/btrfs/ctree.h |1 + fs/btrfs/disk-io.c |1 + fs/btrfs/ioctl.c | 94 ++ fs/btrfs/ioctl.h | 10 +++- fs/btrfs/transaction.c | 171 fs/btrfs/transaction.h |4 + 6 files changed, 265 insertions(+), 16 deletions(-) -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/5] Btrfs: async transaction commit
Add support for an async transaction commit that is ordered such that any subsequent operations will join the following transaction, but does not wait until the current commit is fully on disk. This avoids much of the latency associated with the btrfs_commit_transaction for callers concerned with serialization and not safety. The wait_for_unblock flag controls whether we wait for the 'middle' portion of commit_transaction to complete, which is necessary if the caller expects some of the modifications contained in the commit to be available (this is the case for subvol/snapshot creation). Signed-off-by: Sage Weil s...@newdream.net --- fs/btrfs/ctree.h |1 + fs/btrfs/disk-io.c |1 + fs/btrfs/transaction.c | 119 fs/btrfs/transaction.h |3 + 4 files changed, 124 insertions(+), 0 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 584..1d0d5f0 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -840,6 +840,7 @@ struct btrfs_fs_info { struct btrfs_transaction *running_transaction; wait_queue_head_t transaction_throttle; wait_queue_head_t transaction_wait; + wait_queue_head_t transaction_blocked_wait; wait_queue_head_t async_submit_wait; struct btrfs_super_block super_copy; diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 11d0ad3..6f3cb33 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1701,6 +1701,7 @@ struct btrfs_root *open_ctree(struct super_block *sb, init_waitqueue_head(fs_info-transaction_throttle); init_waitqueue_head(fs_info-transaction_wait); + init_waitqueue_head(fs_info-transaction_blocked_wait); init_waitqueue_head(fs_info-async_submit_wait); __setup_root(4096, 4096, 4096, 4096, tree_root, diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index e75cc84..52b5488 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -882,6 +882,123 @@ int btrfs_transaction_in_commit(struct btrfs_fs_info *info) return ret; } +/* + * wait for the current transaction commit to start and block subsequent + * transaction joins + */ +static void wait_current_trans_commit_start(struct btrfs_root *root, + struct btrfs_transaction *trans) +{ + DEFINE_WAIT(wait); + + if (trans-in_commit) + return; + + while (1) { + prepare_to_wait(root-fs_info-transaction_blocked_wait, wait, + TASK_UNINTERRUPTIBLE); + if (trans-in_commit) { + finish_wait(root-fs_info-transaction_blocked_wait, + wait); + break; + } + mutex_unlock(root-fs_info-trans_mutex); + schedule(); + mutex_lock(root-fs_info-trans_mutex); + finish_wait(root-fs_info-transaction_blocked_wait, wait); + } +} + +/* + * wait for the current transaction to start and then become unblocked. + * caller holds ref. + */ +static void wait_current_trans_commit_start_and_unblock(struct btrfs_root *root, +struct btrfs_transaction *trans) +{ + DEFINE_WAIT(wait); + + if (trans-commit_done || (trans-in_commit !trans-blocked)) + return; + + while (1) { + prepare_to_wait(root-fs_info-transaction_wait, wait, + TASK_UNINTERRUPTIBLE); + if (trans-commit_done || + (trans-in_commit !trans-blocked)) { + finish_wait(root-fs_info-transaction_wait, + wait); + break; + } + mutex_unlock(root-fs_info-trans_mutex); + schedule(); + mutex_lock(root-fs_info-trans_mutex); + finish_wait(root-fs_info-transaction_wait, + wait); + } +} + +/* + * commit transactions asynchronously. once btrfs_commit_transaction_async + * returns, any subsequent transaction will not be allowed to join. + */ +struct btrfs_async_commit { + struct btrfs_trans_handle *newtrans; + struct btrfs_root *root; + struct delayed_work work; +}; + +static void do_async_commit(struct work_struct *work) +{ + struct btrfs_async_commit *ac = + container_of(work, struct btrfs_async_commit, work.work); + + btrfs_commit_transaction(ac-newtrans, ac-root); + kfree(ac); +} + +int btrfs_commit_transaction_async(struct btrfs_trans_handle *trans, + struct btrfs_root *root, + int wait_for_unblock) +{ + struct btrfs_async_commit *ac; + struct btrfs_transaction *cur_trans; + + ac = kmalloc(sizeof(*ac), GFP_NOFS); + BUG_ON(!ac); + + INIT_DELAYED_WORK(ac-work