Re: [PATCH v2 1/4] btrfs: Remove userspace transaction ioctls

2018-02-05 Thread Sage Weil
On Mon, 5 Feb 2018, David Sterba wrote:
> On Mon, Feb 05, 2018 at 05:52:52PM +0900, Wang Shilong wrote:
> >These ioctl are originally introduced by Sage Weil for Ceph use?
> > Not sure whether it still useful, Cc Sage just in case.
> 
> We have checked that the ioctl is not used in ceph, the reasons why we
> think it's ok to remove the ioctl were in the cover letter of v1.

Yeah, +1 on removal.

Acked-by: Sage Weil <s...@redhat.com>

sage
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: fix race in WAIT_SYNC ioctl

2014-09-26 Thread Sage Weil
We check whether transid is already committed via last_trans_committed and
then search through trans_list for pending transactions.  If
last_trans_committed is updated by btrfs_commit_transaction after we check
it (there is no locking), we will fail to find the committed transaction
and return EINVAL to the caller.  This has been observed occasionally by
ceph-osd (which uses this ioctl heavily).

Fix by rechecking whether the provided transid = last_trans_committed
after the search fails, and if so return 0.

Signed-off-by: Sage Weil s...@redhat.com
---
 fs/btrfs/transaction.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index d89c6d3..98a25df 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -609,7 +609,6 @@ int btrfs_wait_for_commit(struct btrfs_root *root, u64 
transid)
if (transid = root-fs_info-last_trans_committed)
goto out;
 
-   ret = -EINVAL;
/* find specified transaction */
spin_lock(root-fs_info-trans_lock);
list_for_each_entry(t, root-fs_info-trans_list, list) {
@@ -625,9 +624,16 @@ int btrfs_wait_for_commit(struct btrfs_root *root, u64 
transid)
}
}
spin_unlock(root-fs_info-trans_lock);
-   /* The specified transaction doesn't exist */
-   if (!cur_trans)
+
+   /*
+* The specified transaction doesn't exist, or we
+* raced with btrfs_commit_transaction
+*/
+   if (!cur_trans) {
+   if (transid  root-fs_info-last_trans_committed)
+   ret = -EINVAL;
goto out;
+   }
} else {
/* find newest transaction that is committing | committed */
spin_lock(root-fs_info-trans_lock);
-- 
1.9.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


warn at fs/btrfs/extent-tree.c:5748 __btrfs_free_extent+0x9ce/0xa20

2014-03-11 Thread Sage Weil
Hey,

Is this something you guys have seen before?  This is from v3.13-rc2.

kernel: [49432.696440] WARNING: CPU: 3 PID: 26411 at 
/srv/autobuild-ceph/gitbuilder.git/build/fs/btrfs/extent-tree.c:5748 
__btrfs_free_extent+0x9ce/0xa20 [btrfs]()
kernel: [49432.710128] Modules linked in: arc4(F) md4(F) nls_utf8(F) cifs(F) 
ufs(F) qnx4(F) hfsplus(F) hfs(F) minix(F) ntfs(F) msdos(F) jfs(F) xfs(F) 
reiserfs(F) ext2(F) kvm_intel(F) kvm(F) ib_iser(F) rdma_cm(F) ib_cm(F) iw_cm(F) 
ib_sa(F) ib_mad(F) ib_core(F) ib_addr(F) iscsi_tcp(F) libiscsi_tcp(F) 
libiscsi(F) psmouse(F) ipmi_si(F) serio_raw(F) gpio_ich(F) joydev(F) dcdbas(F) 
i7core_edac(F) edac_core(F) ipmi_msghandler(F) mac_hid(F) acpi_power_meter(F) 
lpc_ich(F) tpm_tis(F) nfsd(F) nfs_acl(F) auth_rpcgss(F) scsi_transport_iscsi(F) 
nfs(F) fscache(F) lockd(F) lp(F) sunrpc(F) parport(F) hid_generic(F) usbhid(F) 
hid(F) btrfs(F) raid6_pq(F) mptsas(F) ixgbe(F) mptscsih(F) dca(F) mptbase(F) 
ptp(F) pps_core(F) scsi_transport_sas(F) xor(F) mdio(F) bnx2(F) libcrc32c(F)
kernel: [49432.777445] CPU: 3 PID: 26411 Comm: ceph-osd Tainted: GF I  
3.14.0-rc5-ceph-00016-gf31a96a #1
kernel: [49432.786704] Hardware name: Dell Inc. PowerEdge R410/01V648, BIOS 
1.6.3 02/07/2011
kernel: [49432.794223]  1674 8800bf1cbac8 816e4840 
88022726ef90
kernel: [49432.801700]   8800bf1cbb08 810524ac 
a8b07e50
kernel: [49432.809176]  880094e74120  b07c9000 

kernel: [49432.816653] Call Trace:
kernel: [49432.819119]  [816e4840] dump_stack+0x46/0x58
kernel: [49432.825384]  [810524ac] warn_slowpath_common+0x8c/0xc0
kernel: [49432.831413]  [810524fa] warn_slowpath_null+0x1a/0x20
kernel: [49432.837284]  [a010b4be] __btrfs_free_extent+0x9ce/0xa20 
[btrfs]
kernel: [49432.844108]  [a01110b8] 
__btrfs_run_delayed_refs+0x428/0x11e0 [btrfs]
kernel: [49432.851465]  [a0109458] ? 
block_rsv_release_bytes+0x108/0x190 [btrfs]
kernel: [49432.858823]  [a0114066] btrfs_run_delayed_refs+0x76/0x2a0 
[btrfs]
kernel: [49432.865869]  [a01251ff] 
__btrfs_end_transaction+0x26f/0x370 [btrfs]
kernel: [49432.873044]  [a0125330] btrfs_end_transaction+0x10/0x20 
[btrfs]
kernel: [49432.879872]  [a01327de] btrfs_link+0x13e/0x1d0 [btrfs]
kernel: [49432.885903]  [811b7571] vfs_link+0x1b1/0x270
kernel: [49432.891060]  [811b8120] SyS_linkat+0x210/0x2d0
kernel: [49432.896394]  [811b81fe] SyS_link+0x1e/0x20
kernel: [49432.901380]  [816f7cd6] system_call_fastpath+0x1a/0x1f

The full dump is at

http://tracker.ceph.com/issues/7688
http://tracker.ceph.com/attachments/download/1141/kern.log.gz

Thanks-
sage
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: transaction commit deadlock on current rc

2013-10-18 Thread Sage Weil
On Fri, 18 Oct 2013, Josef Bacik wrote:
 On Thu, Oct 17, 2013 at 12:56:14PM -0700, Sage Weil wrote:
  Hey,
  
  I'm seeing the deadlock below under a ceph-osd workload.  There may be a 
  subtle problem with the async transaction sequence (since nobody but ceph 
  uses that that I know of), but not obvious to me why 
  create_pending_snapshots would get stuck on btrfs_tree_lock...
  
 
 Can you do sysrq+w when this happens so I can see everybody who's blocked?
 Thanks,

Oops, forgot to attach the bug link.  It's at

http://tracker.ceph.com/attachments/download/1035/a
http://tracker.ceph.com/issues/6451

The machine is still hung.. if there is additional info I can gather 
you can ping me on irc.  

Thanks!
sage
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: transaction commit deadlock on current rc

2013-10-18 Thread Sage Weil
On Fri, 18 Oct 2013, Chris Mason wrote:
 Quoting Sage Weil (2013-10-18 11:42:28)
  On Fri, 18 Oct 2013, Josef Bacik wrote:
   On Thu, Oct 17, 2013 at 12:56:14PM -0700, Sage Weil wrote:
Hey,

I'm seeing the deadlock below under a ceph-osd workload.  There may be 
a 
subtle problem with the async transaction sequence (since nobody but 
ceph 
uses that that I know of), but not obvious to me why 
create_pending_snapshots would get stuck on btrfs_tree_lock...

   
   Can you do sysrq+w when this happens so I can see everybody who's blocked?
   Thanks,
  
  Oops, forgot to attach the bug link.  It's at
  
  http://tracker.ceph.com/attachments/download/1035/a
  http://tracker.ceph.com/issues/6451
  
  The machine is still hung.. if there is additional info I can gather 
  you can ping me on irc.  
 
 Thanks Sage and Josef, I've got this one queued up pending an ack from
 Sage.  But it's obviously not harmful, so I'll probably send this
 afternoon either way.

This is passing my initial tests!  It'll be subjected to the full firehose 
later tonight; I'll let you know if anything comes up.

Thanks!
sage
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


transaction commit deadlock on current rc

2013-10-17 Thread Sage Weil
Hey,

I'm seeing the deadlock below under a ceph-osd workload.  There may be a 
subtle problem with the async transaction sequence (since nobody but ceph 
uses that that I know of), but not obvious to me why 
create_pending_snapshots would get stuck on btrfs_tree_lock...

[  602.217383] INFO: task kworker/3:2:771 blocked for more than 120 seconds.
[  602.224234]   Not tainted 3.12.0-rc2-ceph-9-g53d0281 #1
[  602.230216] echo 0  /proc/sys/kernel/hung_task_timeout_secs disables this 
message.
[  602.238121] kworker/3:2 D 88003677df10 0   771  2 0x
[  602.245349] Workqueue: events do_async_commit [btrfs]
[  602.250513]  8800c95c78d8 0046 0286 
8800638fca08
[  602.258192]  88003677df10 8800c95c7fd8 8800c95c7fd8 
8800c95c7fd8
[  602.265867]  880225d2df10 88003677df10 8800c95c78e8 
8800638fc8e0
[  602.273545] Call Trace:
[  602.276049]  [81665849] schedule+0x29/0x70
[  602.281087]  [a0176975] btrfs_tree_lock+0x75/0x270 [btrfs]
[  602.287509]  [81070310] ? __init_waitqueue_head+0x60/0x60
[  602.293840]  [a01185bb] btrfs_lock_root_node+0x3b/0x50 [btrfs]
[  602.300612]  [a011da67] btrfs_search_slot+0x867/0x930 [btrfs]
[  602.307293]  [a012ac62] ? run_clustered_refs+0x232/0xf30 [btrfs]
[  602.314236]  [a011f238] btrfs_insert_empty_items+0x78/0xd0 [btrfs]
[  602.321393]  [a01330cc] insert_with_overflow+0x3c/0x110 [btrfs]
[  602.328287]  [a013325f] btrfs_insert_dir_item+0xbf/0x200 [btrfs]
[  602.335229]  [a013f19c] create_pending_snapshot+0x81c/0xa00 [btrfs]
[  602.342469]  [a013f423] create_pending_snapshots+0xa3/0xb0 [btrfs]
[  602.349624]  [a01408fe] btrfs_commit_transaction+0x46e/0xa40 
[btrfs]
[  602.356919]  [81070310] ? __init_waitqueue_head+0x60/0x60
[  602.363291]  [a0140f58] do_async_commit+0x88/0xa0 [btrfs]
[  602.369665]  [a0140ef9] ? do_async_commit+0x29/0xa0 [btrfs]
[  602.376166]  [810672fa] process_one_work+0x1da/0x540
[  602.382099]  [8106728f] ? process_one_work+0x16f/0x540
[  602.388205]  [810684dc] worker_thread+0x11c/0x370
[  602.393834]  [810683c0] ? manage_workers.isra.20+0x2e0/0x2e0
[  602.400462]  [8106fada] kthread+0xea/0xf0
[  602.405396]  [8106f9f0] ? flush_kthread_worker+0x150/0x150
[  602.411836]  [8166fdec] ret_from_fork+0x7c/0xb0
[  602.417300]  [8106f9f0] ? flush_kthread_worker+0x150/0x150
[  602.423787] INFO: lockdep is turned off.

[  602.427852] INFO: task btrfs-transacti:6069 blocked for more than 120 
seconds.
[  602.435155]   Not tainted 3.12.0-rc2-ceph-9-g53d0281 #1
[  602.441229] echo 0  /proc/sys/kernel/hung_task_timeout_secs disables this 
message.
[  602.449212] btrfs-transacti D 8800c96461e8 0  6069  2 0x
[  602.457660]  88022408fd08 0046 0286 
8800b68a4578
[  602.465350]  88022448df10 88022408ffd8 88022408ffd8 
88022408ffd8
[  602.473081]  880225d29fb0 88022448df10 88022408fd18 
880082fd48a8
[  602.480835] Call Trace:
[  602.483342]  [81665849] schedule+0x29/0x70
[  602.488450]  [a013f74f] wait_current_trans.isra.33+0xbf/0x120 
[btrfs]
[  602.495836]  [81070310] ? __init_waitqueue_head+0x60/0x60
[  602.502241]  [a01416a8] start_transaction+0x348/0x540 [btrfs]
[  602.509010]  [a0141907] btrfs_attach_transaction+0x17/0x20 [btrfs]
[  602.516124]  [a0139c12] transaction_kthread+0x182/0x250 [btrfs]
[  602.523065]  [a0139a90] ? btrfs_destroy_delayed_refs+0x370/0x370 
[btrfs]
[  602.530791]  [8106fada] kthread+0xea/0xf0
[  602.535725]  [8106f9f0] ? flush_kthread_worker+0x150/0x150
[  602.542178]  [8166fdec] ret_from_fork+0x7c/0xb0
[  602.547658]  [8106f9f0] ? flush_kthread_worker+0x150/0x150
[  602.554068] INFO: lockdep is turned off.

[  602.558154] INFO: task ceph-osd:12248 blocked for more than 120 seconds.
[  602.558155]   Not tainted 3.12.0-rc2-ceph-9-g53d0281 #1
[  602.558156] echo 0  /proc/sys/kernel/hung_task_timeout_secs disables this 
message.
[  602.558158] ceph-osdD 880082fd48a8 0 12248  12215 0x
[  602.558161]  880184441b58 0046 0282 
8800b68a4578
[  602.558162]  880077fcbf60 880184441fd8 880184441fd8 
880184441fd8
[  602.558164]  88003677df10 880077fcbf60 880184441b68 
880184441ba0
[  602.558164] Call Trace:
[  602.558166]  [81665849] schedule+0x29/0x70
[  602.558178]  [a0141af7] btrfs_commit_transaction_async+0x187/0x2c0 
[btrfs]
[  602.558188]  [a01413f6] ? start_transaction+0x96/0x540 [btrfs]
[  602.558190]  [81070310] ? __init_waitqueue_head+0x60/0x60
[  602.558201]  [a0171565] btrfs_mksubvol.isra.59+0x2a5/0x410 [btrfs]
[  602.558204]  [811a3d9c] ? 

Re: hang on 3.9, 3.10-rc5

2013-08-09 Thread Sage Weil
Hi Chris,

On Thu, 20 Jun 2013, Chris Mason wrote:
 Quoting Sage Weil (2013-06-20 17:56:19)
  On Wed, 19 Jun 2013, Sage Weil wrote:
   Hi Chris,
   
   On Tue, 18 Jun 2013, Chris Mason wrote:
[...]
Very long way of saying I think we're one release_path short.  Sage, I
haven't tested this at all yet, I was hoping to trigger it first.

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index c276ac9..c1954b3 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3730,6 +3730,7 @@ next_slot:
 log_extents:
if (fast_search) {
btrfs_release_path(dst_path);
+   btrfs_release_path(path);
ret = btrfs_log_changed_extents(trans, root, inode, 
dst_path);
if (ret) {
err = ret;
   
   This seems to be doing the trick.  I'll keep testing overnight, but so 
   far 
   so good!
  
  ...and it's still holding up well in QA.
 
 Awesome, thanks for getting the traces for us.  Looks like this one has
 been around since v3.7, so I'm not going to try and sneak it into the
 3.10 final.  I'll have it in the next merge window and for stable.

I was just pulling in the current -rc for testing and realized that this 
isn't upstream yet.  Was this missed?  Sorry I didn't check earlier.

Thanks!
sage
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: hang on 3.9, 3.10-rc5

2013-06-20 Thread Sage Weil
On Wed, 19 Jun 2013, Sage Weil wrote:
 Hi Chris,
 
 On Tue, 18 Jun 2013, Chris Mason wrote:
  [...]
  Very long way of saying I think we're one release_path short.  Sage, I
  haven't tested this at all yet, I was hoping to trigger it first.
  
  diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
  index c276ac9..c1954b3 100644
  --- a/fs/btrfs/tree-log.c
  +++ b/fs/btrfs/tree-log.c
  @@ -3730,6 +3730,7 @@ next_slot:
   log_extents:
  if (fast_search) {
  btrfs_release_path(dst_path);
  +   btrfs_release_path(path);
  ret = btrfs_log_changed_extents(trans, root, inode, dst_path);
  if (ret) {
  err = ret;
 
 This seems to be doing the trick.  I'll keep testing overnight, but so far 
 so good!

...and it's still holding up well in QA.

Thanks, Chris!
sage
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: hang on 3.9, 3.10-rc5

2013-06-20 Thread Sage Weil
On Thu, 20 Jun 2013, Chris Mason wrote:
 Quoting Sage Weil (2013-06-20 17:56:19)
  On Wed, 19 Jun 2013, Sage Weil wrote:
   Hi Chris,
   
   On Tue, 18 Jun 2013, Chris Mason wrote:
[...]
Very long way of saying I think we're one release_path short.  Sage, I
haven't tested this at all yet, I was hoping to trigger it first.

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index c276ac9..c1954b3 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3730,6 +3730,7 @@ next_slot:
 log_extents:
if (fast_search) {
btrfs_release_path(dst_path);
+   btrfs_release_path(path);
ret = btrfs_log_changed_extents(trans, root, inode, 
dst_path);
if (ret) {
err = ret;
   
   This seems to be doing the trick.  I'll keep testing overnight, but so 
   far 
   so good!
  
  ...and it's still holding up well in QA.
 
 Awesome, thanks for getting the traces for us.  Looks like this one has
 been around since v3.7, so I'm not going to try and sneak it into the
 3.10 final.  I'll have it in the next merge window and for stable.

Weird, these same tests have been running on it nightly for ages and it 
seems like these failures just started with 3.9.  Perhaps some other 
change made it hang when it didn't before?

In any case, thanks!
sage
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: hang on 3.9, 3.10-rc5

2013-06-20 Thread Sage Weil
On Thu, 20 Jun 2013, Chris Mason wrote:
 Quoting Sage Weil (2013-06-20 21:00:21)
  On Thu, 20 Jun 2013, Chris Mason wrote:
   Awesome, thanks for getting the traces for us.  Looks like this one has
   been around since v3.7, so I'm not going to try and sneak it into the
   3.10 final.  I'll have it in the next merge window and for stable.
  
  Weird, these same tests have been running on it nightly for ages and it 
  seems like these failures just started with 3.9.  Perhaps some other 
  change made it hang when it didn't before?
  
 
 It's always possible, there are a ton of moving pieces here.  The
 wait_event you were hung on was waiting for crcs to finish, and that
 part at least isn't new.

K.  There was also a shift of writes to leveldb (which does the mmap 
thing), so that may explain the change in behavior.

 Somewhat unrelated, but are you still using notreelog?

Nope, just noatime.

Thanks-
sage
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: hang on 3.9, 3.10-rc5

2013-06-19 Thread Sage Weil
Hi Chris,

On Tue, 18 Jun 2013, Chris Mason wrote:
 [...]
 Very long way of saying I think we're one release_path short.  Sage, I
 haven't tested this at all yet, I was hoping to trigger it first.
 
 diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
 index c276ac9..c1954b3 100644
 --- a/fs/btrfs/tree-log.c
 +++ b/fs/btrfs/tree-log.c
 @@ -3730,6 +3730,7 @@ next_slot:
  log_extents:
   if (fast_search) {
   btrfs_release_path(dst_path);
 + btrfs_release_path(path);
   ret = btrfs_log_changed_extents(trans, root, inode, dst_path);
   if (ret) {
   err = ret;

This seems to be doing the trick.  I'll keep testing overnight, but so far 
so good!

Thanks-
sage

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: hang on 3.9, 3.10-rc5

2013-06-18 Thread Sage Weil
On Wed, 12 Jun 2013, Sage Weil wrote:
 On Tue, 11 Jun 2013, Chris Mason wrote:
  Quoting Sage Weil (2013-06-11 11:43:30)
   I'm also seeing this hang regularly with both 3.9 and 3.10-rc5.  Is this
   is a known problem?  In this case there is no powercycling; just a regular
   ceph-osd workload.
  
  Everyone here is waiting for the root node, but it isn't immediately
  clear who has the lock.  log_one_extent is the most likely suspect, but
  I can't see how it would be scheduling with the root lock held.
  
  Could you please sysrq-w?
 
 Sorry for the slow reply; had to wait for it to reproduce again.  Attached 
 both the original dmesg output and the blocked tasks.  I'll leave the box 
 wedged this time in case there is any other info that can help.

Still seeing this hang with the latest from linus' tree.  Is there another 
tree I should try testing against?

Thanks!
sage

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


hang on 3.9, 3.10-rc5

2013-06-11 Thread Sage Weil
I'm also seeing this hang regularly with both 3.9 and 3.10-rc5.  Is this 
is a known problem?  In this case there is no powercycling; just a regular 
ceph-osd workload.

Thanks-
sage


[ 2885.479116] INFO: task kworker/u64:1:28713 blocked for more than 120 seconds.
[ 2885.486277] echo 0  /proc/sys/kernel/hung_task_timeout_secs disables this 
message.
[ 2885.494201] kworker/u64:1   D 880222b81f90 0 28713  2 0x
[ 2885.501355] Workqueue: writeback bdi_writeback_workfn (flush-btrfs-12)
[ 2885.507924]  88021c3d7238 0046 0286 
880082af52e8
[ 2885.515456]  880222b81f90 88021c3d7fd8 88021c3d7fd8 
88021c3d7fd8
[ 2885.522987]  880225d2bf20 880222b81f90 88021c3d7248 
880082af51c0
[ 2885.530497] Call Trace:
[ 2885.532964]  [81635ed9] schedule+0x29/0x70
[ 2885.537972]  [a021ec95] btrfs_tree_lock+0x75/0x270 [btrfs]
[ 2885.544407]  [81067b50] ? __init_waitqueue_head+0x60/0x60
[ 2885.550760]  [a01c22cb] btrfs_lock_root_node+0x3b/0x50 [btrfs]
[ 2885.557490]  [a01c77ba] btrfs_search_slot+0x87a/0x940 [btrfs]
[ 2885.564196]  [a01ff76e] ? free_extent_map+0x4e/0x90 [btrfs]
[ 2885.570738]  [a01dd198] btrfs_lookup_file_extent+0x38/0x40 [btrfs]
[ 2885.577823]  [a01fd1be] __btrfs_drop_extents+0x12e/0xaf0 [btrfs]
[ 2885.584737]  [8107b093] ? select_task_rq_fair+0x53/0x8a0
[ 2885.590977]  [81167788] ? kmem_cache_alloc+0xd8/0x160
[ 2885.596934]  [a01fe63e] btrfs_drop_extents+0x6e/0xa0 [btrfs]
[ 2885.603531]  [a01f16c1] cow_file_range_inline+0xf1/0x1e0 [btrfs]
[ 2885.610451]  [a01f1ad6] __cow_file_range+0x326/0x4b0 [btrfs]
[ 2885.617014]  [a01e89ea] ? join_transaction.isra.32+0x10a/0x3c0 
[btrfs]
[ 2885.624520]  [a01f27d5] cow_file_range+0x95/0xe0 [btrfs]
[ 2885.630772]  [a01f2b5b] run_delalloc_range+0x33b/0x360 [btrfs]
[ 2885.637581]  [a0208120] __extent_writepage+0x5e0/0x770 [btrfs]
[ 2885.644357]  [8111cf9f] ? find_get_pages_tag+0x11f/0x1c0
[ 2885.650614]  [8111ceae] ? find_get_pages_tag+0x2e/0x1c0
[ 2885.656744]  [a0208532] 
extent_write_cache_pages.isra.29.constprop.41+0x282/0x3e0 [btrfs]
[ 2885.665906]  [a01ec8c0] ? add_pending_csums.isra.44+0x70/0x70 
[btrfs]
[ 2885.673325]  [a01ebde0] ? btrfs_readpage_end_io_hook+0x270/0x270 
[btrfs]
[ 2885.680990]  [8119815c] ? __writeback_single_inode+0x6c/0x2a0
[ 2885.687792]  [a020892e] extent_writepages+0x4e/0x70 [btrfs]
[ 2885.694296]  [a01eeec0] ? can_nocow_odirect+0x330/0x330 [btrfs]
[ 2885.701787]  [a01ec698] btrfs_writepages+0x28/0x30 [btrfs]
[ 2885.708164]  [811289e3] do_writepages+0x23/0x40
[ 2885.713679]  [8119813e] __writeback_single_inode+0x4e/0x2a0
[ 2885.720347]  [81199630] writeback_sb_inodes+0x280/0x3e0
[ 2885.726528]  [8119982e] __writeback_inodes_wb+0x9e/0xd0
[ 2885.732766]  [81199a5b] wb_writeback+0x1fb/0x310
[ 2885.738271]  [81127a21] ? bdi_dirty_limit+0x31/0xc0
[ 2885.744266]  [8119b3e8] wb_do_writeback+0x138/0x1e0
[ 2885.750091]  [8119b50a] bdi_writeback_workfn+0x7a/0x200
[ 2885.756207]  [8105f4ba] process_one_work+0x1da/0x540
[ 2885.762111]  [8105f44f] ? process_one_work+0x16f/0x540
[ 2885.768138]  [8106069c] worker_thread+0x11c/0x370
[ 2885.773836]  [81060580] ? manage_workers.isra.20+0x2e0/0x2e0
[ 2885.780513]  [8106735a] kthread+0xea/0xf0
[ 2885.785412]  [81067270] ? flush_kthread_worker+0x150/0x150
[ 2885.791805]  [8163ff9c] ret_from_fork+0x7c/0xb0
[ 2885.797227]  [81067270] ? flush_kthread_worker+0x150/0x150
[ 2885.803694] INFO: lockdep is turned off.
[ 2885.807640] INFO: task btrfs-transacti:23992 blocked for more than 120 
seconds.
[ 2885.815086] echo 0  /proc/sys/kernel/hung_task_timeout_secs disables this 
message.
[ 2885.823067] btrfs-transacti D 880222493f20 0 23992  2 0x
[ 2885.830265]  88020cebfa28 0046 0286 
880082af52e8
[ 2885.837761]  880222493f20 88020cebffd8 88020cebffd8 
88020cebffd8
[ 2885.845281]  880225d29f90 880222493f20 88020cebfa38 
880082af51c0
[ 2885.852840] Call Trace:
[ 2885.855306]  [81635ed9] schedule+0x29/0x70
[ 2885.860393]  [a021ec95] btrfs_tree_lock+0x75/0x270 [btrfs]
[ 2885.866772]  [81067b50] ? __init_waitqueue_head+0x60/0x60
[ 2885.873123]  [a01c22cb] btrfs_lock_root_node+0x3b/0x50 [btrfs]
[ 2885.879881]  [a01c77ba] btrfs_search_slot+0x87a/0x940 [btrfs]
[ 2885.886517]  [8163760b] ? _raw_spin_unlock+0x2b/0x40
[ 2885.892465]  [a01c8f28] btrfs_insert_empty_items+0x78/0xd0 [btrfs]
[ 2885.899574]  [a0238d54] btrfs_insert_delayed_items+0x84/0x460 
[btrfs]
[ 2885.906936]  [a0239938] __btrfs_run_delayed_items+0xb8/0x1e0 
[btrfs]
[ 2885.914285]  [a0239a93] 

Re: v3.9 bug at /fs/btrfs/free-space-cache.c:1567 after powercycle

2013-06-05 Thread Sage Weil
On Wed, 5 Jun 2013, Josef Bacik wrote:
 On Tue, Jun 04, 2013 at 09:59:05PM -0600, Sage Weil wrote:
  Hi-
  
  I'm pretty reliably triggering the following bug after powercycling an 
  active btrfs + ceph workload and then trying to remount.  Is this a known 
  issue?
  
 
 Yeah sorry it's fixed in 3.10, I really need to send the patch back to stable.
 Thanks,
 
 Josef

Cool.  Thanks!
sage
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


v3.9 bug at /fs/btrfs/free-space-cache.c:1567 after powercycle

2013-06-04 Thread Sage Weil
Hi-

I'm pretty reliably triggering the following bug after powercycling an 
active btrfs + ceph workload and then trying to remount.  Is this a known 
issue?

sage


2013-06-04T18:54:28.532988-07:00 plana71 kernel: [   39.311120] [ 
cut here ]
2013-06-04T18:54:28.533002-07:00 plana71 kernel: [   39.315802] kernel BUG at 
/srv/autobuild-ceph/gitbuilder.git/build/fs/btrfs/free-space-cache.c:1567!
2013-06-04T18:54:28.533004-07:00 plana71 kernel: [   39.325013] invalid opcode: 
 [#1] SMP 
2013-06-04T18:54:28.533012-07:00 plana71 kernel: [   39.329278] Modules linked 
in: coretemp kvm_intel kvm ghash_clmulni_intel aesni_intel ablk_helper nfsd 
cryptd lrw aes_x86_64 nfs_acl xts auth_rpcgss gf128mul microcode psmouse 
exportfs nfs dcdbas lpc_ich mfd_core i7core_edac edac_core joydev serio_raw hed 
lp parport fscache lockd sunrpc hid_generic usbhid hid btrfs raid6_pq ixgbe dca 
ptp pps_core mdio mptsas mptscsih mptbase scsi_transport_sas bnx2 xor 
zlib_deflate crc32c_intel libcrc32c
2013-06-04T18:54:28.533014-07:00 plana71 kernel: [   39.370984] CPU 1 
2013-06-04T18:54:28.533017-07:00 plana71 kernel: [   39.372867] Pid: 1679, 
comm: mount Not tainted 3.9.0-ceph-00303-g19bb6a8 #1 Dell Inc. PowerEdge 
R410/01V648
2013-06-04T18:54:28.533021-07:00 plana71 kernel: [   39.382928] RIP: 
0010:[a0187917]  [a0187917] remove_from_bitmap+0x1b7/0x1c0 
[btrfs]
2013-06-04T18:54:28.533023-07:00 plana71 kernel: [   39.392479] RSP: 
0018:880212d456e8  EFLAGS: 00010287
2013-06-04T18:54:28.533025-07:00 plana71 kernel: [   39.397853] RAX: 
 RBX: 88021fea1100 RCX: 0034
2013-06-04T18:54:28.533028-07:00 plana71 kernel: [   39.405051] RDX: 
00048000 RSI: 61b0a000 RDI: 78c0
2013-06-04T18:54:28.533030-07:00 plana71 kernel: [   39.412249] RBP: 
880212d45738 R08: 8802200350f0 R09: 0740
2013-06-04T18:54:28.533032-07:00 plana71 kernel: [   39.419447] R10: 
88020a4e65d0 R11: 0001 R12: 880212d45760
2013-06-04T18:54:28.533034-07:00 plana71 kernel: [   39.426644] R13: 
88020ad97400 R14: 6940 R15: 880212d45758
2013-06-04T18:54:28.533036-07:00 plana71 kernel: [   39.433843] FS:  
7f035dc1a800() GS:88022722() knlGS:
2013-06-04T18:54:28.533038-07:00 plana71 kernel: [   39.442011] CS:  0010 DS: 
 ES:  CR0: 8005003b
2013-06-04T18:54:28.533040-07:00 plana71 kernel: [   39.447819] CR2: 
7ff46d4524d0 CR3: 00020a52c000 CR4: 07e0
2013-06-04T18:54:28.533042-07:00 plana71 kernel: [   39.455017] DR0: 
 DR1:  DR2: 
2013-06-04T18:54:28.533044-07:00 plana71 kernel: [   39.462215] DR3: 
 DR6: 0ff0 DR7: 0400
2013-06-04T18:54:28.533047-07:00 plana71 kernel: [   39.469414] Process mount 
(pid: 1679, threadinfo 880212d44000, task 88020a4e5eb0)
2013-06-04T18:54:28.533048-07:00 plana71 kernel: [   39.477669] Stack:
2013-06-04T18:54:28.533050-07:00 plana71 kernel: [   39.479739]  
 88020ad97454 61b0c000 00048000
2013-06-04T18:54:28.533052-07:00 plana71 kernel: [   39.487431]  
8802 88020ad97400  88020ad97454
2013-06-04T18:54:28.533054-07:00 plana71 kernel: [   39.495123]  
88022012b400 61b0a000 880212d45798 a0189073
2013-06-04T18:54:28.533055-07:00 plana71 kernel: [   39.502817] Call Trace:
2013-06-04T18:54:28.533058-07:00 plana71 kernel: [   39.505339]  
[a0189073] btrfs_remove_free_space+0x53/0x290 [btrfs]
2013-06-04T18:54:28.533061-07:00 plana71 kernel: [   39.512465]  
[a0135d20] btrfs_alloc_logged_file_extent+0x1c0/0x1e0 [btrfs]
2013-06-04T18:54:28.533063-07:00 plana71 kernel: [   39.520295]  
[a01215fa] ? btrfs_free_path+0x2a/0x40 [btrfs]
2013-06-04T18:54:28.533065-07:00 plana71 kernel: [   39.526815]  
[a0183aef] replay_one_extent+0x5ef/0x650 [btrfs]
2013-06-04T18:54:28.533068-07:00 plana71 kernel: [   39.533510]  
[a017da3a] ? btrfs_tree_read_lock+0x5a/0x140 [btrfs]
2013-06-04T18:54:28.533070-07:00 plana71 kernel: [   39.540553]  
[a0183e2b] replay_one_buffer+0x2db/0x390 [btrfs]
2013-06-04T18:54:28.533075-07:00 plana71 kernel: [   39.547247]  
[a01629f1] ? mark_extent_buffer_accessed+0x51/0x70 [btrfs]
2013-06-04T18:54:28.533078-07:00 plana71 kernel: [   39.554821]  
[a013f6d0] ? verify_parent_transid+0x160/0x160 [btrfs]
2013-06-04T18:54:28.533080-07:00 plana71 kernel: [   39.562037]  
[a017ee53] walk_up_log_tree+0x1c3/0x250 [btrfs]
2013-06-04T18:54:28.533082-07:00 plana71 kernel: [   39.568644]  
[a017fce1] walk_log_tree+0xb1/0x1f0 [btrfs]
2013-06-04T18:54:28.533085-07:00 plana71 kernel: [   39.574903]  
[a0186032] btrfs_recover_log_trees+0x212/0x3c0 [btrfs]
2013-06-04T18:54:28.533087-07:00 plana71 kernel: [   39.582119]  
[a0183b50] ? 

Re: corruption of active mmapped files in btrfs snapshots

2013-03-22 Thread Sage Weil
On Fri, 22 Mar 2013, Chris Mason wrote:
 Quoting Alexandre Oliva (2013-03-22 10:17:30)
  On Mar 22, 2013, Chris Mason clma...@fusionio.com wrote:
  
   Are you using compression in btrfs or just in leveldb?
  
  btrfs lzo compression.
 
 Perfect, I'll focus on that part of things.
 
  
   I'd like to take snapshots out of the picture for a minute.
  
  That's understandable, I guess, but I don't know that anyone has ever
  got the problem without snapshots.  I mean, even when the master copy of
  the database got corrupted, snapshots of the subvol containing it were
  being taken every now and again, because that's the way ceph works.
 
 Hopefully Sage can comment, but the basic idea is that if you snapshot a
 database file the db must participate.  If it doesn't, it really is the
 same effect as crashing the box.
 
 Something is definitely broken if we're corrupting the source files
 (either with or without snapshots), but avoiding incomplete writes in
 the snapshot files requires synchronization with the db.

In this case, we quiesce write activity, call leveldb's sync(), take the 
snapshot, and then continue.

(FWIW, this isn't the first time we've heard about leveldb corruption, but 
in each case we've looked into the user had the btrfs compression 
enabled so I suspect that's the right avenue of investigation!)

sage
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: corruption of active mmapped files in btrfs snapshots

2013-03-19 Thread Sage Weil
On Tue, 19 Mar 2013, Chris Mason wrote:
 Quoting Alexandre Oliva (2013-03-19 01:20:10)
  On Mar 18, 2013, Chris Mason chris.ma...@fusionio.com wrote:
  
   A few questions.  Does leveldb use O_DIRECT and mmap together?
  
  No, it doesn't use O_DIRECT at all.  Its I/O interface is very
  simplified: it just opens each new file (database chunks limited to 2MB)
  with O_CREAT|O_RDWR|O_TRUNC, and then uses ftruncate, mmap, msync,
  munmap and fdatasync.  It doesn't seem to modify data once it's written;
  it only appends.  Reading data back from it uses a completely different
  class interface, using separate descriptors and using pread only.
  
   (the source of a write being pages that are mmap'd from somewhere
   else)
  
  AFAICT the source of the memcpy()s that append to the file are
  malloc()ed memory.
  
   That's the most likely place for this kind of problem.  Also, you
   mention crc errors.  Are those reported by btrfs or are they application
   level crcs.
  
  These are CRCs leveldb computes and writes out after each db block.  No
  btrfs CRC errors are reported in this process.
 
 Ok, so we have three moving pieces here.
 
 1) leveldb truncating the files
 2) leveldb using mmap to write
 3) btrfs snapshots
 
 My guess is the truncate is creating a orphan item that is being
 processed inside the snapshot.
 
 Is it possible to create a smaller leveldb unit test that we might use
 to exercise all of this?

There is a set of unit tests in the leveldb source tree that ought to do 
the trick:

git clone https://code.google.com/p/leveldb/

sage
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] fs: encode_fh: return FILEID_INVALID if invalid fid_type

2013-02-11 Thread Sage Weil
Acked-by: Sage Weil s...@inktank.com

On Mon, 11 Feb 2013, Namjae Jeon wrote:

 From: Namjae Jeon namjae.j...@samsung.com
 
 This patch is a follow up on below patch:
 
 [PATCH] exportfs: add FILEID_INVALID to indicate invalid fid_type
 commit: 216b6cbdcbd86b1db0754d58886b466ae31f5a63 
 
 Signed-off-by: Namjae Jeon namjae.j...@samsung.com
 Signed-off-by: Vivek Trivedi t.vi...@samsung.com
 Acked-by: Steven Whitehouse swhit...@redhat.com
 ---
  fs/btrfs/export.c   |4 ++--
  fs/ceph/export.c|4 ++--
  fs/fuse/inode.c |2 +-
  fs/gfs2/export.c|4 ++--
  fs/isofs/export.c   |4 ++--
  fs/nilfs2/namei.c   |4 ++--
  fs/ocfs2/export.c   |4 ++--
  fs/reiserfs/inode.c |4 ++--
  fs/udf/namei.c  |4 ++--
  fs/xfs/xfs_export.c |4 ++--
  mm/cleancache.c |2 +-
  mm/shmem.c  |2 +-
  12 files changed, 21 insertions(+), 21 deletions(-)
 
 diff --git a/fs/btrfs/export.c b/fs/btrfs/export.c
 index 614f34a..81ee29e 100644
 --- a/fs/btrfs/export.c
 +++ b/fs/btrfs/export.c
 @@ -22,10 +22,10 @@ static int btrfs_encode_fh(struct inode *inode, u32 *fh, 
 int *max_len,
  
   if (parent  (len  BTRFS_FID_SIZE_CONNECTABLE)) {
   *max_len = BTRFS_FID_SIZE_CONNECTABLE;
 - return 255;
 + return FILEID_INVALID;
   } else if (len  BTRFS_FID_SIZE_NON_CONNECTABLE) {
   *max_len = BTRFS_FID_SIZE_NON_CONNECTABLE;
 - return 255;
 + return FILEID_INVALID;
   }
  
   len  = BTRFS_FID_SIZE_NON_CONNECTABLE;
 diff --git a/fs/ceph/export.c b/fs/ceph/export.c
 index ca3ab3f..16796be 100644
 --- a/fs/ceph/export.c
 +++ b/fs/ceph/export.c
 @@ -81,7 +81,7 @@ static int ceph_encode_fh(struct inode *inode, u32 *rawfh, 
 int *max_len,
   if (parent_inode) {
   /* nfsd wants connectable */
   *max_len = connected_handle_length;
 - type = 255;
 + type = FILEID_INVALID;
   } else {
   dout(encode_fh %p\n, dentry);
   fh-ino = ceph_ino(inode);
 @@ -90,7 +90,7 @@ static int ceph_encode_fh(struct inode *inode, u32 *rawfh, 
 int *max_len,
   }
   } else {
   *max_len = handle_length;
 - type = 255;
 + type = FILEID_INVALID;
   }
   if (dentry)
   dput(dentry);
 diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
 index 9876a87..973e8f0 100644
 --- a/fs/fuse/inode.c
 +++ b/fs/fuse/inode.c
 @@ -679,7 +679,7 @@ static int fuse_encode_fh(struct inode *inode, u32 *fh, 
 int *max_len,
  
   if (*max_len  len) {
   *max_len = len;
 - return  255;
 + return  FILEID_INVALID;
   }
  
   nodeid = get_fuse_inode(inode)-nodeid;
 diff --git a/fs/gfs2/export.c b/fs/gfs2/export.c
 index 4767774..9973df4 100644
 --- a/fs/gfs2/export.c
 +++ b/fs/gfs2/export.c
 @@ -37,10 +37,10 @@ static int gfs2_encode_fh(struct inode *inode, __u32 *p, 
 int *len,
  
   if (parent  (*len  GFS2_LARGE_FH_SIZE)) {
   *len = GFS2_LARGE_FH_SIZE;
 - return 255;
 + return FILEID_INVALID;
   } else if (*len  GFS2_SMALL_FH_SIZE) {
   *len = GFS2_SMALL_FH_SIZE;
 - return 255;
 + return FILEID_INVALID;
   }
  
   fh[0] = cpu_to_be32(ip-i_no_formal_ino  32);
 diff --git a/fs/isofs/export.c b/fs/isofs/export.c
 index 2b4f235..12088d8 100644
 --- a/fs/isofs/export.c
 +++ b/fs/isofs/export.c
 @@ -125,10 +125,10 @@ isofs_export_encode_fh(struct inode *inode,
*/
   if (parent  (len  5)) {
   *max_len = 5;
 - return 255;
 + return FILEID_INVALID;
   } else if (len  3) {
   *max_len = 3;
 - return 255;
 + return FILEID_INVALID;
   }
  
   len = 3;
 diff --git a/fs/nilfs2/namei.c b/fs/nilfs2/namei.c
 index 1d0c0b8..9de78f0 100644
 --- a/fs/nilfs2/namei.c
 +++ b/fs/nilfs2/namei.c
 @@ -517,11 +517,11 @@ static int nilfs_encode_fh(struct inode *inode, __u32 
 *fh, int *lenp,
  
   if (parent  *lenp  NILFS_FID_SIZE_CONNECTABLE) {
   *lenp = NILFS_FID_SIZE_CONNECTABLE;
 - return 255;
 + return FILEID_INVALID;
   }
   if (*lenp  NILFS_FID_SIZE_NON_CONNECTABLE) {
   *lenp = NILFS_FID_SIZE_NON_CONNECTABLE;
 - return 255;
 + return FILEID_INVALID;
   }
  
   fid-cno = root-cno;
 diff --git a/fs/ocfs2/export.c b/fs/ocfs2/export.c
 index 322216a..2965116 100644
 --- a/fs/ocfs2/export.c
 +++ b/fs/ocfs2/export.c
 @@ -195,11 +195,11 @@ static int ocfs2_encode_fh(struct inode *inode, u32 
 *fh_in, int *max_len,
  
   if (parent  (len  6)) {
   *max_len = 6;
 - type = 255;
 + type = FILEID_INVALID;
   goto bail;
   } else if (len  3) {
   *max_len = 3;
 - type = 255

Re: corrupted file size on inline extent conversion?

2013-02-04 Thread Sage Weil
On Wed, 30 Jan 2013, Josef Bacik wrote:
 On Wed, Jan 30, 2013 at 11:30:49AM -0700, Mike Lowe wrote:
  I've been running rsync against a rbd device backed by btrfs filesystems 
  that are about 11% full for about 45 minutes before I checked and noticed 
  the printk message.  That was the first go with the patch.  Seems like I 
  was able to get by without any problems until the btrfs filesystems got 
  some use and filled up a little bit.
  
 
 Ok since you are seeing the message I'll go ahead and post the patch and 
 get it moving along, let me know if you still see the problem.  Thanks,

Awesome.  Mike still hasn't seen a reocurrence, so it's looking like the 
patch is good.

Thanks so much!
sage
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


corrupted file size on inline extent conversion?

2013-01-28 Thread Sage Weil
A ceph user observed a incorrect i_size on btrfs.  The pattern looks like 
this:

- some writes at low file offsets
- a write to 4185600 len 8704 (i_size should be 4MB)
- more writes to low offsets
- a write to 4181504 len 4096 (abutts the write above)
- a bit of time goes by...
- stat returns 4186112 (4MB - 8192)
 - that's a fwe bytes to the right of the top write above.

There are some logs showing the full read/write activity to the file at

http://tracker.newdream.net/attachments/658/object_log.txt

on issue

http://tracker.newdream.net/issues/3810

The kernel was 3.7.0-030700-generic (and probably also observed on 3.7.1).

Is this a known bug?

Thanks!
sage

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: move the sb_end_intwrite until after the throttle logic

2012-09-09 Thread Sage Weil
On Wed, 5 Sep 2012, Josef Bacik wrote:
 Sage reported the following lockdep backtrace
 
 =
 [ BUG: bad unlock balance detected! ]
 3.6.0-rc2-ceph-00171-gc7ed62d #1 Not tainted
 -
 btrfs-cleaner/7607 is trying to release lock (sb_internal) at:
 [a00422ae] btrfs_commit_transaction+0xa6e/0xb20 [btrfs]
 but there are no more locks to release!
 
 other info that might help us debug this:
 1 lock held by btrfs-cleaner/7607:
  #0:  (fs_info-cleaner_mutex){+.+...}, at: [a003b405] 
 cleaner_kthread+0x95/0x120 [btrfs]
 
 stack backtrace:
 Pid: 7607, comm: btrfs-cleaner Not tainted 3.6.0-rc2-ceph-00171-gc7ed62d #1
 Call Trace:
  [a00422ae] ? btrfs_commit_transaction+0xa6e/0xb20 [btrfs]
  [810afa9e] print_unlock_inbalance_bug+0xfe/0x110
  [810b289e] lock_release_non_nested+0x1ee/0x310
  [81172f9b] ? kmem_cache_free+0x7b/0x160
  [a004106c] ? put_transaction+0x8c/0x130 [btrfs]
  [a00422ae] ? btrfs_commit_transaction+0xa6e/0xb20 [btrfs]
  [810b2a95] lock_release+0xd5/0x220
  [81173071] ? kmem_cache_free+0x151/0x160
  [8117d9ed] __sb_end_write+0x7d/0x90
  [a00422ae] btrfs_commit_transaction+0xa6e/0xb20 [btrfs]
  [81079850] ? __init_waitqueue_head+0x60/0x60
  [81634c6b] ? _raw_spin_unlock+0x2b/0x40
  [a0042758] __btrfs_end_transaction+0x368/0x3c0 [btrfs]
  [a0042808] btrfs_end_transaction_throttle+0x18/0x20 [btrfs]
  [a00318f0] btrfs_drop_snapshot+0x410/0x600 [btrfs]
  [8132babd] ? do_raw_spin_unlock+0x5d/0xb0
  [a00430ef] btrfs_clean_old_snapshots+0xaf/0x150 [btrfs]
  [a003b405] ? cleaner_kthread+0x95/0x120 [btrfs]
  [a003b419] cleaner_kthread+0xa9/0x120 [btrfs]
  [a003b370] ? btrfs_destroy_delayed_refs.isra.102+0x220/0x220 
 [btrfs]
  [810791ee] kthread+0xae/0xc0
  [810b379d] ? trace_hardirqs_on+0xd/0x10
  [8163e744] kernel_thread_helper+0x4/0x10
  [81635430] ? retint_restore_args+0x13/0x13
  [81079140] ? flush_kthread_work+0x1a0/0x1a0
  [8163e740] ? gs_change+0x13/0x13
 
 This is because the throttle stuff can commit the transaction, which expects 
 to
 be the one stopping the intwrite stuff, but we've already done it in the
 __btrfs_end_transaction.  Moving the sb_end_intewrite after this logic makes 
 the
 lockdep go away.  Thanks,
 
 Signed-off-by: Josef Bacik jba...@fusionio.com

This appears to have done the trick (at least so far).  I'm running it on 
top of -rc5 and the 3 patches I posted on Aug 30th.

Tested-by: Sage Weil s...@inktank.com

Thanks!
sage


 ---
  fs/btrfs/transaction.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)
 
 diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
 index a86fc72..0163afa 100644
 --- a/fs/btrfs/transaction.c
 +++ b/fs/btrfs/transaction.c
 @@ -551,8 +551,6 @@ static int __btrfs_end_transaction(struct 
 btrfs_trans_handle *trans,
   btrfs_trans_release_metadata(trans, root);
   trans-block_rsv = NULL;
  
 - sb_end_intwrite(root-fs_info-sb);
 -
   if (lock  !atomic_read(root-fs_info-open_ioctl_trans) 
   should_end_transaction(trans, root)) {
   trans-transaction-blocked = 1;
 @@ -573,6 +571,8 @@ static int __btrfs_end_transaction(struct 
 btrfs_trans_handle *trans,
   }
   }
  
 + sb_end_intwrite(root-fs_info-sb);
 +
   WARN_ON(cur_trans != info-running_transaction);
   WARN_ON(atomic_read(cur_trans-num_writers)  1);
   atomic_dec(cur_trans-num_writers);
 -- 
 1.7.11.4
 
 
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/3] lockdep fixups for sb_interval#2 vs async transaction commit

2012-08-30 Thread Sage Weil
These three patches (in combination with btrfs-next) fix things up for me!

Sage Weil (3):
  Btrfs: pass lockdep rwsem metadata to async commit transaction
  Btrfs: set journal_info in async trans commit worker
  Btrfs: do not take cleanup_work_sem in btrfs_run_delayed_iputs()

 fs/btrfs/inode.c   |2 --
 fs/btrfs/transaction.c |   18 ++
 2 files changed, 18 insertions(+), 2 deletions(-)

-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/3] Btrfs: set journal_info in async trans commit worker

2012-08-30 Thread Sage Weil
We expect current-journal_info to point to the trans handle we are
committing.

Signed-off-by: Sage Weil s...@inktank.com
---
 fs/btrfs/transaction.c |2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index efc41a5..f910a26 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1236,6 +1236,8 @@ static void do_async_commit(struct work_struct *work)
ac-root-fs_info-sb-s_writers.lock_map[SB_FREEZE_FS-1],
0, 1, _THIS_IP_);
 
+   current-journal_info = ac-newtrans;
+
btrfs_commit_transaction(ac-newtrans, ac-root);
kfree(ac);
 }
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/3] Btrfs: pass lockdep rwsem metadata to async commit transaction

2012-08-30 Thread Sage Weil
The freeze rwsem is taken by sb_start_intwrite() and dropped during the
commit_ or end_transaction().  In the async case, that happens in a worker
thread.  Tell lockdep the calling thread is releasing ownership of the
rwsem and the async thread is picking it up.

XFS plays the same trick in fs/xfs/xfs_aops.c.

Signed-off-by: Sage Weil s...@inktank.com
---
 fs/btrfs/transaction.c |   16 
 1 file changed, 16 insertions(+)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 17be3de..efc41a5 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1228,6 +1228,14 @@ static void do_async_commit(struct work_struct *work)
struct btrfs_async_commit *ac =
container_of(work, struct btrfs_async_commit, work.work);
 
+   /*
+* We've got freeze protection passed with the transaction.
+* Tell lockdep about it.
+*/
+   rwsem_acquire_read(
+   ac-root-fs_info-sb-s_writers.lock_map[SB_FREEZE_FS-1],
+   0, 1, _THIS_IP_);
+
btrfs_commit_transaction(ac-newtrans, ac-root);
kfree(ac);
 }
@@ -1257,6 +1265,14 @@ int btrfs_commit_transaction_async(struct 
btrfs_trans_handle *trans,
atomic_inc(cur_trans-use_count);
 
btrfs_end_transaction(trans, root);
+
+   /*
+* Tell lockdep we've released the freeze rwsem, since the
+* async commit thread will be the one to unlock it.
+*/
+   rwsem_release(root-fs_info-sb-s_writers.lock_map[SB_FREEZE_FS-1],
+ 1, _THIS_IP_);
+
schedule_delayed_work(ac-work, 0);
 
/* wait for transaction to start and unblock */
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/3] Btrfs: do not take cleanup_work_sem in btrfs_run_delayed_iputs()

2012-08-30 Thread Sage Weil
/0x130 [btrfs]
[  606.970510]  [a003b5d5] ? cleaner_kthread+0x95/0x120 [btrfs]
[  607.005648]  [a003b5e1] cleaner_kthread+0xa1/0x120 [btrfs]
[  607.040724]  [a003b540] ? 
btrfs_destroy_delayed_refs.isra.102+0x220/0x220 [btrfs]
[  607.104740]  [810791ee] kthread+0xae/0xc0
[  607.137119]  [810b379d] ? trace_hardirqs_on+0xd/0x10
[  607.169797]  [8163e744] kernel_thread_helper+0x4/0x10
[  607.202472]  [81635430] ? retint_restore_args+0x13/0x13
[  607.235884]  [81079140] ? flush_kthread_work+0x1a0/0x1a0
[  607.268731]  [8163e740] ? gs_change+0x13/0x13

Signed-off-by: Sage Weil s...@inktank.com
---
 fs/btrfs/inode.c |2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 6e8f416..d3f2e6a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2118,7 +2118,6 @@ void btrfs_run_delayed_iputs(struct btrfs_root *root)
if (empty)
return;
 
-   down_read(root-fs_info-cleanup_work_sem);
spin_lock(fs_info-delayed_iput_lock);
list_splice_init(fs_info-delayed_iputs, list);
spin_unlock(fs_info-delayed_iput_lock);
@@ -2129,7 +2128,6 @@ void btrfs_run_delayed_iputs(struct btrfs_root *root)
iput(delayed-inode);
kfree(delayed);
}
-   up_read(root-fs_info-cleanup_work_sem);
 }
 
 enum btrfs_orphan_cleanup_state {
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: pass lockdep rwsem metadata to async commit transaction

2012-08-24 Thread Sage Weil
The freeze rwsem is taken by sb_start_intwrite() and dropped during the
commit_ or end_transaction().  In the async case, that happens in a worker
thread.  Tell lockdep the calling thread is releasing ownership of the
rwsem and the async thread is picking it up.

Josef and I worked out a more complicated solution that made the async 
commit thread join and potentially get a later transaction, but it failed 
my initial smoke test and Dave pointed out that XFS avoids the issue by 
just telling lockdep what's up.  This is much simpler.  XFS does the same
thing in fs/xfs/xfs_aops.c.

Signed-off-by: Sage Weil s...@inktank.com
---
 fs/btrfs/transaction.c |   16 
 1 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 17be3de..efc41a5 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1228,6 +1228,14 @@ static void do_async_commit(struct work_struct *work)
struct btrfs_async_commit *ac =
container_of(work, struct btrfs_async_commit, work.work);
 
+   /*
+* We've got freeze protection passed with the transaction.
+* Tell lockdep about it.
+*/
+   rwsem_acquire_read(
+   ac-root-fs_info-sb-s_writers.lock_map[SB_FREEZE_FS-1],
+   0, 1, _THIS_IP_);
+
btrfs_commit_transaction(ac-newtrans, ac-root);
kfree(ac);
 }
@@ -1257,6 +1265,14 @@ int btrfs_commit_transaction_async(struct 
btrfs_trans_handle *trans,
atomic_inc(cur_trans-use_count);
 
btrfs_end_transaction(trans, root);
+
+   /*
+* Tell lockdep we've released the freeze rwsem, since the
+* async commit thread will be the one to unlock it.
+*/
+   rwsem_release(root-fs_info-sb-s_writers.lock_map[SB_FREEZE_FS-1],
+ 1, _THIS_IP_);
+
schedule_delayed_work(ac-work, 0);
 
/* wait for transaction to start and unblock */
-- 
1.7.9

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: pass lockdep rwsem metadata to async commit transaction

2012-08-24 Thread Sage Weil
Sadly, now I hit another one:

[  378.433842] =
[  378.433842] [ INFO: possible recursive locking detected ]
[  378.433845] 3.6.0-rc2-ceph-00143-g995fc06 #1 Not tainted
[  378.433845] -
[  378.433847] kworker/6:1/238 is trying to acquire lock:
[  378.433872]  (sb_internal#2){.+.+..}, at: [a0042b74] 
start_transaction+0x124/0x430 [btrfs]
[  378.433873] 
[  378.433873] but task is already holding lock:
[  378.433890]  (sb_internal#2){.+.+..}, at: [a0042590] 
do_async_commit+0x0/0x80 [btrfs]
[  378.433891] 
[  378.433891] other info that might help us debug this:
[  378.433892]  Possible unsafe locking scenario:
[  378.433892] 
[  378.433892]CPU0
[  378.433893]
[  378.433895]   lock(sb_internal#2);
[  378.433897]   lock(sb_internal#2);
[  378.433898] 
[  378.433898]  *** DEADLOCK ***
[  378.433898] 
[  378.433898]  May be due to missing lock nesting notation
[  378.433898] 
[  378.433899] 3 locks held by kworker/6:1/238:
[  378.433906]  #0:  (events){.+.+.+}, at: [810717d6] 
process_one_work+0x136/0x5f0
[  378.433911]  #1:  (((ac-work)-work)){+.+...}, at: [810717d6] 
process_one_work+0x136/0x5f0
[  378.433929]  #2:  (sb_internal#2){.+.+..}, at: [a0042590] 
do_async_commit+0x0/0x80 [btrfs]
[  378.433932] 
[  378.433932] stack backtrace:
[  378.433935] Pid: 238, comm: kworker/6:1 Not tainted 
3.6.0-rc2-ceph-00143-g995fc06 #1
[  378.433936] Call Trace:
[  378.433941]  [810b2032] __lock_acquire+0x1512/0x1b90
[  378.433944]  [810ada73] ? __bfs+0x23/0x270
[  378.433961]  [a0042b74] ? start_transaction+0x124/0x430 [btrfs]
[  378.433964]  [810b2c82] lock_acquire+0xa2/0x140
[  378.433980]  [a0042b74] ? start_transaction+0x124/0x430 [btrfs]
[  378.433982]  [810b3546] ? mark_held_locks+0x86/0x140
[  378.433987]  [8117dac6] __sb_start_write+0xc6/0x1b0
[  378.434003]  [a0042b74] ? start_transaction+0x124/0x430 [btrfs]
[  378.434019]  [a0042b74] ? start_transaction+0x124/0x430 [btrfs]
[  378.434022]  [81172e75] ? kmem_cache_alloc+0xb5/0x160
[  378.434024]  [81172f9b] ? kmem_cache_free+0x7b/0x160
[  378.434042]  [a0058b48] ? free_extent_state+0x58/0xd0 [btrfs]
[  378.434058]  [a0042b74] start_transaction+0x124/0x430 [btrfs]
[  378.434076]  [a005940d] ? __set_extent_bit+0x37d/0x4e0 [btrfs]
[  378.434092]  [a0042ed5] btrfs_join_transaction+0x15/0x20 [btrfs]
[  378.434109]  [a00496b7] cow_file_range+0x87/0x4a0 [btrfs]
[  378.434114]  [81634c6b] ? _raw_spin_unlock+0x2b/0x40
[  378.434131]  [a004a80c] run_delalloc_range+0x34c/0x370 [btrfs]
[  378.434149]  [a005cbb0] __extent_writepage+0x5e0/0x770 [btrfs]
[  378.434152]  [810b3546] ? mark_held_locks+0x86/0x140
[  378.434155]  [8112aa5e] ? find_get_pages_tag+0x2e/0x1c0
[  378.434174]  [a005cffa] 
extent_write_cache_pages.isra.25.constprop.39+0x2ba/0x410 [btrfs]
[  378.434187]  [a002f7cc] ? btrfs_run_delayed_refs+0xac/0x550 [btrfs]
[  378.434190]  [81196117] ? igrab+0x27/0x70
[  378.434208]  [a005d389] extent_writepages+0x49/0x60 [btrfs]
[  378.434224]  [a0046a90] ? btrfs_submit_direct+0x670/0x670 [btrfs]
[  378.434240]  [a00444c8] btrfs_writepages+0x28/0x30 [btrfs]
[  378.434243]  [81136443] do_writepages+0x23/0x40
[  378.434247]  [8112b839] __filemap_fdatawrite_range+0x59/0x60
[  378.434249]  [8112c6ac] filemap_flush+0x1c/0x20
[  378.434266]  [a0050b1e] btrfs_start_delalloc_inodes+0xbe/0x200 
[btrfs]
[  378.434270]  [8132babd] ? do_raw_spin_unlock+0x5d/0xb0
[  378.434286]  [a0041ebd] btrfs_commit_transaction+0x44d/0xb20 
[btrfs]
[  378.434290]  [81079850] ? __init_waitqueue_head+0x60/0x60
[  378.434293]  [810717d6] ? process_one_work+0x136/0x5f0
[  378.434308]  [a00425f1] do_async_commit+0x61/0x80 [btrfs]
[  378.434324]  [a0042590] ? btrfs_commit_transaction+0xb20/0xb20 
[btrfs]
[  378.434327]  [81071840] process_one_work+0x1a0/0x5f0
[  378.434330]  [810717d6] ? process_one_work+0x136/0x5f0
[  378.434346]  [a0042590] ? btrfs_commit_transaction+0xb20/0xb20 
[btrfs]
[  378.434350]  [8107360d] worker_thread+0x18d/0x4c0
[  378.434354]  [81073480] ? manage_workers.isra.22+0x2c0/0x2c0
[  378.434356]  [810791ee] kthread+0xae/0xc0
[  378.434359]  [810b379d] ? trace_hardirqs_on+0xd/0x10
[  378.434363]  [8163e744] kernel_thread_helper+0x4/0x10
[  378.434366]  [81635430] ? retint_restore_args+0x13/0x13
[  378.434368]  [81079140] ? flush_kthread_work+0x1a0/0x1a0
[  378.434371]  [8163e740] ? gs_change+0x13/0x13


On Fri, 24 Aug 2012, Sage Weil wrote:

 The freeze rwsem is taken by sb_start_intwrite() and dropped during the
 commit_ or end_transaction().  In the async

Re: btrfs call trace

2012-06-25 Thread Sage Weil
Hi Stefan,

I haven't seen this one.  The async commit stuff is mine, but there 
haven't been problems with it for a year or more.  This is a recent 
kernel, I assume? 

Can you dump the other tasks?  Something is preventing the commit from 
completing.

[adding linux-btrfs to cc]

sage


On Fri, 22 Jun 2012, Stefan Priebe wrote:

 Hi,
 
 are these bugs known?
 
 [ 2157.104532] INFO: task kworker/2:2:6278 blocked for more than 120 seconds.
 [ 2157.130649] echo 0  /proc/sys/kernel/hung_task_timeout_secs disables
 this message.
 [ 2157.156750] kworker/2:2 D 8180eba0 0  6278  2
 0x
 [ 2157.182763]  880c3c0d7c90 0046 880c3e2aa1c0
 00012280
 [ 2157.208662]  880c3c0d7fd8 880c3c0d6010 00012280
 00012280
 [ 2157.234591]  880c3c0d7fd8 00012280 880e78e5a240
 880c3e2aa1c0
 [ 2157.260573] Call Trace:
 [ 2157.285872]  [81620334] schedule+0x24/0x70
 [ 2157.311097]  [a007f4e5] wait_for_commit+0x55/0x90 [btrfs]
 [ 2157.311106]  [81060140] ? wake_up_bit+0x40/0x40
 [ 2157.311129]  [a00810b5] btrfs_commit_transaction+0x655/0xab0
 [btrfs]
 [ 2157.311134]  [810736f0] ? set_next_entity+0x90/0xa0
 [ 2157.311138]  [81060140] ? wake_up_bit+0x40/0x40
 [ 2157.311156]  [a0081810] ? btrfs_end_transaction+0x20/0x20 [btrfs]
 [ 2157.311171]  [a008182a] do_async_commit+0x1a/0x30 [btrfs]
 [ 2157.311175]  [81058eff] process_one_work+0x11f/0x470
 [ 2157.311179]  [8105b128] worker_thread+0x178/0x400
 [ 2157.311182]  [8105afb0] ? manage_workers+0x210/0x210
 [ 2157.311185]  [8105fc46] kthread+0x96/0xa0
 [ 2157.311190]  [81622dd4] kernel_thread_helper+0x4/0x10
 [ 2157.311195]  [8105fbb0] ? kthread_worker_fn+0x130/0x130
 [ 2157.311198]  [81622dd0] ? gs_change+0xb/0xb
 [ 2157.311220] INFO: task ceph-osd:14426 blocked for more than 120 seconds.
 [ 2157.311220] echo 0  /proc/sys/kernel/hung_task_timeout_secs disables
 this message.
 [ 2157.311224] ceph-osdD 0002 0 14426  1
 0x
 [ 2157.311228]  880ae8c33c38 0082 880a39142180
 00012280
 [ 2157.311231]  880ae8c33fd8 880ae8c32010 00012280
 00012280
 [ 2157.311234]  880ae8c33fd8 00012280 880983efa1c0
 880a39142180
 [ 2157.311235] Call Trace:
 [ 2157.311240]  [81620334] schedule+0x24/0x70
 [ 2157.311256]  [a0081d85]
 btrfs_commit_transaction_async+0x1d5/0x240 [btrfs]
 [ 2157.311271]  [a0065eb6] ? block_rsv_add_bytes+0x26/0x60 [btrfs]
 [ 2157.311275]  [81060140] ? wake_up_bit+0x40/0x40
 [ 2157.311289]  [a0065f25] ? block_rsv_migrate_bytes+0x35/0x50
 [btrfs]
 [ 2157.311311]  [a00b2d5e] btrfs_mksubvol+0x2be/0x350 [btrfs]
 [ 2157.311329]  [a00b2ef9]
 btrfs_ioctl_snap_create_transid+0x109/0x1a0 [btrfs]
 [ 2157.311346]  [a00b4bf4] btrfs_ioctl_snap_create_v2+0x84/0xf0
 [btrfs]
 [ 2157.311363]  [a00b756f] btrfs_ioctl+0x76f/0x12d0 [btrfs]
 [ 2157.311368]  [8114459a] ? fsnotify+0x1ba/0x2e0
 [ 2157.311372]  [8111ade3] do_vfs_ioctl+0x93/0x4f0
 [ 2157.311375]  [8111b28a] sys_ioctl+0x4a/0x80
 [ 2157.311379]  [81621ba2] system_call_fastpath+0x16/0x1b
 
 Stefan
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ceph on btrfs 3.4rc

2012-04-24 Thread Sage Weil
On Tue, 24 Apr 2012, Josef Bacik wrote:
 On Fri, Apr 20, 2012 at 05:09:34PM +0200, Christian Brunner wrote:
  After running ceph on XFS for some time, I decided to try btrfs again.
  Performance with the current for-linux-min branch and big metadata
  is much better. The only problem (?) I'm still seeing is a warning
  that seems to occur from time to time:

Actually, before you do that... we have a new tool, 
test_filestore_workloadgen, that generates a ceph-osd-like workload on the 
local file system.  It's a subset of what a full OSD might do, but if 
we're lucky it will be sufficient to reproduce this issue.  Something like

 test_filestore_workloadgen --osd-data /foo --osd-journal /bar

will hopefully do the trick.

Christian, maybe you can see if that is able to trigger this warning?  
You'll need to pull it from the current master branch; it wasn't in the 
last release.

Thanks!
sage


  
  [87703.784552] [ cut here ]
  [87703.789759] WARNING: at fs/btrfs/inode.c:2103
  btrfs_orphan_commit_root+0xf6/0x100 [btrfs]()
  [87703.799070] Hardware name: ProLiant DL180 G6
  [87703.804024] Modules linked in: btrfs zlib_deflate libcrc32c xfs
  exportfs sunrpc bonding ipv6 sg serio_raw pcspkr iTCO_wdt
  iTCO_vendor_support i7core_edac edac_core ixgbe dca mdio
  iomemory_vsl(PO) hpsa squashfs [last unloaded: scsi_wait_scan]
  [87703.828166] Pid: 929, comm: kworker/1:2 Tainted: P   O
  3.3.2-1.fits.1.el6.x86_64 #1
  [87703.837513] Call Trace:
  [87703.840280]  [8104df6f] warn_slowpath_common+0x7f/0xc0
  [87703.847016]  [8104dfca] warn_slowpath_null+0x1a/0x20
  [87703.853533]  [a0355686] btrfs_orphan_commit_root+0xf6/0x100 
  [btrfs]
  [87703.861541]  [a0350a06] commit_fs_roots+0xc6/0x1c0 [btrfs]
  [87703.868674]  [a0351bcb]
  btrfs_commit_transaction+0x5db/0xa50 [btrfs]
  [87703.876745]  [810127a3] ? __switch_to+0x153/0x440
  [87703.882966]  [81070a90] ? wake_up_bit+0x40/0x40
  [87703.888997]  [a0352040] ?
  btrfs_commit_transaction+0xa50/0xa50 [btrfs]
  [87703.897271]  [a035205f] do_async_commit+0x1f/0x30 [btrfs]
  [87703.904262]  [81068949] process_one_work+0x129/0x450
  [87703.910777]  [8106b7eb] worker_thread+0x17b/0x3c0
  [87703.916991]  [8106b670] ? manage_workers+0x220/0x220
  [87703.923504]  [810703fe] kthread+0x9e/0xb0
  [87703.928952]  [8158c224] kernel_thread_helper+0x4/0x10
  [87703.93]  [81070360] ? 
  kthread_freezable_should_stop+0x70/0x70
  [87703.943323]  [8158c220] ? gs_change+0x13/0x13
  [87703.949149] ---[ end trace b8c31966cca731fa ]---
  [91128.812399] [ cut here ]
  [91128.817576] WARNING: at fs/btrfs/inode.c:2103
  btrfs_orphan_commit_root+0xf6/0x100 [btrfs]()
  [91128.826930] Hardware name: ProLiant DL180 G6
  [91128.831897] Modules linked in: btrfs zlib_deflate libcrc32c xfs
  exportfs sunrpc bonding ipv6 sg serio_raw pcspkr iTCO_wdt
  iTCO_vendor_support i7core_edac edac_core ixgbe dca mdio
  iomemory_vsl(PO) hpsa squashfs [last unloaded: scsi_wait_scan]
  [91128.856086] Pid: 6806, comm: btrfs-transacti Tainted: PW  O
  3.3.2-1.fits.1.el6.x86_64 #1
  [91128.865912] Call Trace:
  [91128.868670]  [8104df6f] warn_slowpath_common+0x7f/0xc0
  [91128.875379]  [8104dfca] warn_slowpath_null+0x1a/0x20
  [91128.881900]  [a0355686] btrfs_orphan_commit_root+0xf6/0x100 
  [btrfs]
  [91128.889894]  [a0350a06] commit_fs_roots+0xc6/0x1c0 [btrfs]
  [91128.897019]  [a03a2b61] ?
  btrfs_run_delayed_items+0xf1/0x160 [btrfs]
  [91128.905075]  [a0351bcb]
  btrfs_commit_transaction+0x5db/0xa50 [btrfs]
  [91128.913156]  [a03524b2] ? start_transaction+0x92/0x310 [btrfs]
  [91128.920643]  [81070a90] ? wake_up_bit+0x40/0x40
  [91128.926667]  [a034cfcb] transaction_kthread+0x26b/0x2e0 [btrfs]
  [91128.934254]  [a034cd60] ?
  btrfs_destroy_marked_extents.clone.0+0x1f0/0x1f0 [btrfs]
  [91128.943671]  [a034cd60] ?
  btrfs_destroy_marked_extents.clone.0+0x1f0/0x1f0 [btrfs]
  [91128.953079]  [810703fe] kthread+0x9e/0xb0
  [91128.958532]  [8158c224] kernel_thread_helper+0x4/0x10
  [91128.965133]  [81070360] ? 
  kthread_freezable_should_stop+0x70/0x70
  [91128.972913]  [8158c220] ? gs_change+0x13/0x13
  [91128.978826] ---[ end trace b8c31966cca731fb ]---
  
  I'm able to reproduce this with ceph on a single server with 4 disks
  (4 filesystems/osds) and a small test program based on librbd. It is
  simply writing random bytes on a rbd volume (see attachment).
  
  Is this something I should care about? Any hint's on solving this
  would be appreciated.
  
 
 Can you send me a config or some basic steps for me to setup ceph on my box 
 so I
 can run this program and finally track down this problem?  Thanks,
 
 Josef
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a 

[LSF/MM TOPIC] COWing writeback pages

2012-02-10 Thread Sage Weil
Hi everyone,

The takeaway from the 'stable pages' discussions in the last few workshops 
was that pages under writeback should remain locked so that subsequent 
writers don't touch them while they are en route to the disk.  This 
prevents bad checksums and DIF/DIX type failures (whereas previously we 
didn't really care whether old or new data reached the disk).

The fear is/was that anyone subsequently modifying the page will have to 
wait for writeback io to complete before continuing.  I seem to remember 
somebody (Martin?) saying that in practice, under real workloads, that 
doesn't actually happen, so don't worry about it.  (Does anyone remember 
the details of what testing led to that conclusion?)

Anyway, we are seeing what looks like an analogous problem with btrfs, 
where operations sometimes block waiting for writeback of the btree pages.  
Although the 'keep rewriting the same page' pattern may not be prevalent 
in normal file workloads, it does seem to happen with the btrfs btree.

The obvious solution seems to be to COW the page if it is under writeback 
and we want to remodify it.  Presumably that can be done just in btrfs, to 
address the btrfs-specific symptoms we're hitting, but I'm interested in 
hearing from other folks about whether it's more generally useful VM 
functionality for other filesystems and other workloads.

Unfortunately, we haven't been able to pinpoint the exact scenarios under 
which this triggers under btrfs.  We regularly see long stalls for 
metadata operations (create() and similar metadata-only operations) that 
block after btrfs_commit_transaction has finished the previous 
transaction and is doing

return filemap_write_and_wait(btree_inode-i_mapping);

What we're less clear about is when btrfs will modify the in-memory page 
in place (and thus wait) versus COWing the page... still digging into this 
now.

It's seems like there is a btrfs-specific question about exactly what is 
going on and why, which isn't super-relevant for LSF/MM (except that we'll 
all be there).  However, my suspicion is that the solution will be 
generally applicable to other filesystems, and that the tests that led us 
to believe that normal workloads aren't affected by locked writeback 
pages would inform which path to take in solving our specific btrfs 
problem.

sage

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [LSF/MM TOPIC] COWing writeback pages

2012-02-10 Thread Sage Weil
On Fri, 10 Feb 2012, Josef Bacik wrote:
 On Fri, Feb 10, 2012 at 11:25:27AM -0800, Sage Weil wrote:
  Hi everyone,
  
  The takeaway from the 'stable pages' discussions in the last few workshops 
  was that pages under writeback should remain locked so that subsequent 
  writers don't touch them while they are en route to the disk.  This 
  prevents bad checksums and DIF/DIX type failures (whereas previously we 
  didn't really care whether old or new data reached the disk).
  
  The fear is/was that anyone subsequently modifying the page will have to 
  wait for writeback io to complete before continuing.  I seem to remember 
  somebody (Martin?) saying that in practice, under real workloads, that 
  doesn't actually happen, so don't worry about it.  (Does anyone remember 
  the details of what testing led to that conclusion?)
  
  Anyway, we are seeing what looks like an analogous problem with btrfs, 
  where operations sometimes block waiting for writeback of the btree pages.  
  Although the 'keep rewriting the same page' pattern may not be prevalent 
  in normal file workloads, it does seem to happen with the btrfs btree.
  
  The obvious solution seems to be to COW the page if it is under writeback 
  and we want to remodify it.  Presumably that can be done just in btrfs, to 
  address the btrfs-specific symptoms we're hitting, but I'm interested in 
  hearing from other folks about whether it's more generally useful VM 
  functionality for other filesystems and other workloads.
  
  Unfortunately, we haven't been able to pinpoint the exact scenarios under 
  which this triggers under btrfs.  We regularly see long stalls for 
  metadata operations (create() and similar metadata-only operations) that 
  block after btrfs_commit_transaction has finished the previous 
  transaction and is doing
  
  return filemap_write_and_wait(btree_inode-i_mapping);
  
  What we're less clear about is when btrfs will modify the in-memory page 
  in place (and thus wait) versus COWing the page... still digging into this 
  now.
  
 
 Heh so I'm working on this now, specifically in the heavy create() workload, 
 and
 I've just about got it nailed down.  A lot of this problem is because we rely 
 on
 normal pagecache for our metadata so I'm copying xfs and creating our own
 caching.
 
 The thing is since we have an inode hanging out with normal pagecache pages we
 can have multiple people trying to write out dirty pages in our inode at the
 same time, and since it goes through our normal write path we'll end up in 
 this
 case where we're waiting on writeback for pages we won't actually end up 
 writing
 out.  My code will fix this, if we're talking about the same problem ;).

Oh, I hadn't thought of that... that sounds like a similar but slightly 
different problem, since it probably wouldn't correlate with the 
filemap_write_and_wait().  As long as we don't have a btree update waiting 
on btree writeback, though, both problems should be addressed.

In any case, we're definitely interested in checking out the code when 
it's ready to share!

sage
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: WARNING: at fs/btrfs/inode.c:2222

2012-01-27 Thread Sage Weil
On Fri, 27 Jan 2012, Chris Mason wrote:
 On Fri, Jan 27, 2012 at 09:31:46AM -0800, Sage Weil wrote:
  Has anyone see this one before?  We've been hitting it sporatically for a 
  while now, although recently we've been able to trigger pretty easily.  
  The result is EINVAL from the snap create ioctl.
  
  The kernel is v3.2 + some ceph stuff.
  
  The code in question is
  
  /* if we have links, this was a truncate, lets do that */
  if (inode-i_nlink) {
  if (!S_ISREG(inode-i_mode)) {
  WARN_ON(1);
  iput(inode);
  continue;
  }
  
  Is this something obvious, or should we spend some time chasing it down?
 
 This doesn't seem right, we've got an orphan on a directory for a
 truncate?  The link count is probably 1, but no I haven't seen this.

Okay, looking into it.  Thanks!

sage
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-26 Thread Sage Weil
On Wed, 26 Oct 2011, Christian Brunner wrote:
 2011/10/26 Sage Weil s...@newdream.net:
  On Wed, 26 Oct 2011, Christian Brunner wrote:
Christian, have you tweaked those settings in your ceph.conf?  It 
would be
something like 'journal dio = false'.  If not, can you verify that
directio shows true when the journal is initialized from your osd log?
E.g.,
   
 2011-10-21 15:21:02.026789 7ff7e5c54720 journal _open 
dev/osd0.journal fd 14: 104857600 bytes, block size 4096 bytes, 
directio = 1
   
If directio = 1 for you, something else funky is causing those
blkdev_fsync's...
  
   I've looked it up in the logs - directio is 1:
  
   Oct 25 17:20:16 os00 osd.000[1696]: 7f0016841740 journal _open
   /dev/vg01/lv_osd_journal_0 fd 15: 17179869184 bytes, block size 4096
   bytes, directio = 1
  
   Do you mind capturing an strace?  I'd like to see where that blkdev_fsync
   is coming from.
 
  Here is an strace. I can see a lot of sync_file_range operations.
 
  Yeah, these all look like the flusher thread, and shouldn't be hitting
  blkdev_fsync.  Can you confirm that with
 
         filestore flusher = false
         filestore sync flush = false
 
  you get no sync_file_range at all?  I wonder if this is also perf lying
  about the call chain.
 
 Yes, setting this makes the sync_file_range calls go away.

Okay.  That means either sync_file_range on a regular btrfs file is 
triggering blkdev_fsync somewhere in btrfs, there is an extremely sneaky 
bug that is mixing up file descriptors, or latencytop is lying.  I'm 
guessing the latter, given the other weirdness Josef and Chris were 
seeing.  :)

 Is it safe to use these settings with filestore btrfs snap = 0?

Yeah.  They're purely a performance thing to push as much dirty data to 
disk as quickly as possible to minimize the snapshot create latency.  
You'll notice the write throughput tends to tank when them off.

sage

Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-25 Thread Sage Weil
On Tue, 25 Oct 2011, Christoph Hellwig wrote:
 On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
   - When I run ceph with btrfs snaps disabled, the situation is getting
   slightly better. I can run an OSD for about 3 days without problems,
   but then again the load increases. This time, I can see that the
   ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
   than usual.
  
  FYI in this scenario you're exposed to the same journal replay issues that 
  ext4 and XFS are.  The btrfs workload that ceph is generating will also 
  not be all that special, though, so this problem shouldn't be unique to 
  ceph.
 
 What journal replay issues would ext4 and XFS be exposed to?

It's the ceph-osd journal replay, not the ext4/XFS journal... the #2 
item in

http://marc.info/?l=ceph-develm=131942130322957w=2

sage
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-25 Thread Sage Weil
On Tue, 25 Oct 2011, Josef Bacik wrote:
 At this point it seems like the biggest problem with latency in ceph-osd 
 is not related to btrfs, the latency seems to all be from the fact that 
 ceph-osd is fsyncing a block dev for whatever reason. 

There is one place where we sync_file_range() on the journal block device, 
but that should only happen if directio is disabled (it's on by default).  

Christian, have you tweaked those settings in your ceph.conf?  It would be 
something like 'journal dio = false'.  If not, can you verify that 
directio shows true when the journal is initialized from your osd log?  
E.g.,

 2011-10-21 15:21:02.026789 7ff7e5c54720 journal _open dev/osd0.journal fd 14: 
104857600 bytes, block size 4096 bytes, directio = 1

If directio = 1 for you, something else funky is causing those 
blkdev_fsync's...

sage
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-25 Thread Sage Weil
On Tue, 25 Oct 2011, Christian Brunner wrote:
 2011/10/25 Sage Weil s...@newdream.net:
  On Tue, 25 Oct 2011, Josef Bacik wrote:
  At this point it seems like the biggest problem with latency in ceph-osd
  is not related to btrfs, the latency seems to all be from the fact that
  ceph-osd is fsyncing a block dev for whatever reason.
 
  There is one place where we sync_file_range() on the journal block device,
  but that should only happen if directio is disabled (it's on by default).
 
  Christian, have you tweaked those settings in your ceph.conf?  It would be
  something like 'journal dio = false'.  If not, can you verify that
  directio shows true when the journal is initialized from your osd log?
  E.g.,
 
   2011-10-21 15:21:02.026789 7ff7e5c54720 journal _open dev/osd0.journal fd 
  14: 104857600 bytes, block size 4096 bytes, directio = 1
 
  If directio = 1 for you, something else funky is causing those
  blkdev_fsync's...
 
 I've looked it up in the logs - directio is 1:
 
 Oct 25 17:20:16 os00 osd.000[1696]: 7f0016841740 journal _open
 /dev/vg01/lv_osd_journal_0 fd 15: 17179869184 bytes, block size 4096
 bytes, directio = 1

Do you mind capturing an strace?  I'd like to see where that blkdev_fsync 
is coming from.

thanks!
sage

ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-24 Thread Sage Weil
[adding linux-btrfs to cc]

Josef, Chris, any ideas on the below issues?

On Mon, 24 Oct 2011, Christian Brunner wrote:
 Thanks for explaining this. I don't have any objections against btrfs
 as a osd filesystem. Even the fact that there is no btrfs-fsck doesn't
 scare me, since I can use the ceph replication to recover a lost
 btrfs-filesystem. The only problem I have is, that btrfs is not stable
 on our side and I wonder what you are doing to make it work. (Maybe
 it's related to the load pattern of using ceph as a backend store for
 qemu).
 
 Here is a list of the btrfs problems I'm having:
 
 - When I run ceph with the default configuration (btrfs snaps enabled)
 I can see a rapid increase in Disk-I/O after a few hours of uptime.
 Btrfs-cleaner is using more and more time in
 btrfs_clean_old_snapshots().

In theory, there shouldn't be any significant difference between taking a 
snapshot and removing it a few commits later, and the prior root refs that 
btrfs holds on to internally until the new commit is complete.  That's 
clearly not quite the case, though.

In any case, we're going to try to reproduce this issue in our 
environment.

 - When I run ceph with btrfs snaps disabled, the situation is getting
 slightly better. I can run an OSD for about 3 days without problems,
 but then again the load increases. This time, I can see that the
 ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
 than usual.

FYI in this scenario you're exposed to the same journal replay issues that 
ext4 and XFS are.  The btrfs workload that ceph is generating will also 
not be all that special, though, so this problem shouldn't be unique to 
ceph.

 Another thing is that I'm seeing a WARNING: at fs/btrfs/inode.c:2114
 from time to time. Maybe it's related to the performance issues, but
 seems to be able to verify this.

I haven't seen this yet with the latest stuff from Josef, but others have.  
Josef, is there any information we can provide to help track it down?

 It's really sad to see, that ceph performance and stability is
 suffering that much from the underlying filesystems and that this
 hasn't changed over the last months.

We don't have anyone internally working on btrfs at the moment, and are 
still struggling to hire experienced kernel/fs people.  Josef has been 
very helpful with tracking these issues down, but he hass responsibilities 
beyond just the Ceph related issues.  Progress is slow, but we are 
working on it!

sage


 
 Kind regards,
 Christian
 
 2011/10/24 Sage Weil s...@newdream.net:
  Although running on ext4, xfs, or whatever other non-btrfs you want mostly
  works, there are a few important remaining issues:
 
  1- ext4 limits total xattrs for 4KB.  This can cause problems in some
  cases, as Ceph uses xattrs extensively.  Most of the time we don't hit
  this.  We do hit the limit with radosgw pretty easily, though, and may
  also hit it in exceptional cases where the OSD cluster is very unhealthy.
 
  There is a large xattr patch for ext4 from the Lustre folks that has been
  floating around for (I think) years.  Maybe as interest grows in running
  Ceph on ext4 this can move upstream.
 
  Previously we were being forgiving about large setxattr failures on ext3,
  but we found that was leading to corruption in certain cases (because we
  couldn't set our internal metadata), so the next release will assert/crash
  in that case (fail-stop instead of fail-maybe-eventually-corrupt).
 
  XFS does not have an xattr size limit and thus does have this problem.
 
  2- The other problem is with OSD journal replay of non-idempotent
  transactions.  On non-btrfs backends, the Ceph OSDs use a write-ahead
  journal.  After restart, the OSD does not know exactly which transactions
  in the journal may have already been committed to disk, and may reapply a
  transaction again during replay.  For most operations (write, delete,
  truncate) this is fine.
 
  Some operations, though, are non-idempotent.  The simplest example is
  CLONE, which copies (efficiently, on btrfs) data from one object to
  another.  If the source object is modified, the osd restarts, and then
  the clone is replayed, the target will get incorrect (newer) data.  For
  example,
 
  1- clone A - B
  2- modify A
    osd crash, replay from 1
 
  B will get new instead of old contents.
 
  (This doesn't happen on btrfs because the snapshots allow us to replay
  from a known consistent point in time.)
 
  For things like clone, skipping the operation of the target exists almost
  works, except for cases like
 
  1- clone A - B
  2- modify A
  ...
  3- delete B
    osd crash, replay from 1
 
  (Although in that example who cares if B had bad data; it was removed
  anyway.)  The larger problem, though, is that that doesn't always work;
  CLONERANGE copies a range of a file from A to B, where B may already
  exist.
 
  In practice, the higher level interfaces don't make full use of the
  low-level interface, so it's possible

Re: OSD: no current directory

2011-10-11 Thread Sage Weil
On Tue, 11 Oct 2011, Christian Brunner wrote:
 2011/10/11 Sage Weil s...@newdream.net:
  On Tue, 11 Oct 2011, Christian Brunner wrote:
  Maybe this one is easier:
 
  One of our OSDs isn't starting, because ther is no current
  directory. What I have are three snap directories.
 
  total 0
  -rw-r--r-- 1 root root   37 Oct  9 15:57 ceph_fsid
  -rw-r--r-- 1 root root    8 Oct  9 15:57 fsid
  -rw-r--r-- 1 root root   21 Oct  9 15:57 magic
  drwxr-xr-x 1 root root 7986 Oct 11 18:34 snap_506043
  drwxr-xr-x 1 root root 7986 Oct 11 18:34 snap_507364
  drwxr-xr-x 1 root root 7814 Oct 11 18:36 snap_507417
  -rw-r--r-- 1 root root    4 Oct  9 15:57 store_version
  -rw-r--r-- 1 root root    2 Oct  9 15:57 whoami
 
  Is there a way to rollback the latest?
 
  That's what the OSD actually does on startup (roll back to the newest
  snap_).  It's probably a trivial bug that's preventing startup now... I'll
  take a look.  In the meantime, you can clone the latest snap_ to current
  and it should start!
 
  sage
 
 This seems to be a btrfs problem. It fails, when I'm trying to create the 
 clone
 
 # btrfs subvolume snapshot snap_507417 current
 Create a snapshot of 'snap_507417' in './current'
 ERROR: cannot snapshot 'snap_507417'
 
 And I get the following kernel messages:
 
 [ 5863.263950] [ cut here ]
 [ 5863.269125] WARNING: at fs/btrfs/inode.c:2335
 btrfs_orphan_cleanup+0xcd/0x3d0 [btrfs]()
 [ 5863.278142] Hardware name: ProLiant DL180 G6
 [ 5863.283161] Modules linked in: btrfs zlib_deflate libcrc32c bonding
 ipv6 serio_raw pcspkr ghes hed iTCO_wdt iTCO_vendor_support ixgbe dca
 mdio i7core_edac edac_core iomemory_vsl(P) hpsa squashfs usb_storage
 [last unloaded: scsi_wait_scan]
 [ 5863.307774] Pid: 6349, comm: btrfs Tainted: PW
 3.0.6-1.fits.2.el6.x86_64 #1
 [ 5863.316647] Call Trace:
 [ 5863.319648]  [8106344f] warn_slowpath_common+0x7f/0xc0
 [ 5863.326536]  [810634aa] warn_slowpath_null+0x1a/0x20
 [ 5863.333146]  [a023fb0d] btrfs_orphan_cleanup+0xcd/0x3d0 [btrfs]
 [ 5863.340839]  [a0238381] ? join_transaction+0x201/0x250 [btrfs]
 [ 5863.348482]  [a021fbaa] ? block_rsv_migrate_bytes+0x3a/0x50 
 [btrfs]
 [ 5863.356590]  [a0261a3b] btrfs_mksubvol+0x2fb/0x380 [btrfs]
 [ 5863.363726]  [a0261bba]
 btrfs_ioctl_snap_create_transid+0xfa/0x150 [btrfs]
 [ 5863.372445]  [a0261c66] btrfs_ioctl_snap_create+0x56/0x80 [btrfs]
 [ 5863.380398]  [a026583e] btrfs_ioctl+0x2fe/0xd50 [btrfs]
 [ 5863.387344]  [8125ed20] ? inode_has_perm+0x30/0x40
 [ 5863.393798]  [81261a2c] ? file_has_perm+0xdc/0xf0
 [ 5863.400114]  [8117086a] do_vfs_ioctl+0x9a/0x5a0
 [ 5863.406244]  [81170e11] sys_ioctl+0xa1/0xb0
 [ 5863.412001]  [81562882] system_call_fastpath+0x16/0x1b
 [ 5863.418767] ---[ end trace e3234ecab14ad64c ]---
 [ 5863.424084] btrfs: Error removing orphan entry, stopping orphan cleanup
 [ 5863.431614] btrfs: could not do orphan cleanup -22
 
 Can I use an older snapshot as well?

You're able to snapshot the others?

Yeah, any of the snap_ directories will work, although keep in mind when 
the OSD starts up it will immediately remove current/ and re-clone the 
newest snap_ to current/ again.  If the problem is a toxic/broken snap_ 
dir, you'll need to rename it out of the way to avoid hitting the problem 
again...

sage

 
 Regards,
 Christian
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 

Re: [GIT PULL] Btrfs fixes

2011-09-20 Thread Sage Weil
Hi Chris-

This pull misses the clone reservation fix again... :)

http://www.spinics.net/lists/linux-btrfs/msg11826.html

Thanks!
sage



On Mon, 19 Sep 2011, Chris Mason wrote:

 Hi everyone,
 
 The for-linus branch of the btrfs tree on github:
 
 Head commit: a66e7cc626f42de6c745963fe0d807518fa49d39
 git://github.com/chrismason/linux.git for-linus
 
 Has the following fixes.  for-linus is against rc6, since some of these
 are regression fixes for earlier 3.1 btrfs commits.  The most important
 of the bunch is Josef's dentry fix, which avoids enoents if we race with
 multiple procs hitting on the same inode.  This bug is btrfs-specific,
 it came in with his optimization to cache the inode location during
 readdir.
 
 Li Zefan (3) commits (+9/-5):
 Btrfs: don't make a file partly checksummed through file clone (+5/-0)
 Btrfs: don't change inode flag of the dest clone file (+0/-1)
 Btrfs: fix pages truncation in btrfs_ioctl_clone() (+4/-4)
 
 Josef Bacik (1) commits (+11/-2):
 Btrfs: only clear the need lookup flag after the dentry is setup
 
 Jeff Liu (1) commits (+7/-2):
 BTRFS: Fix lseek return value for error
 
 Hidetoshi Seto (1) commits (+3/-2):
 btrfs: fix d_off in the first dirent
 
 Total: (6) commits (+30/-11)
 
  fs/btrfs/file.c  |9 +++--
  fs/btrfs/inode.c |   18 ++
  fs/btrfs/ioctl.c |   14 +-
  3 files changed, 30 insertions(+), 11 deletions(-)
 --
 To unsubscribe from this list: send the line unsubscribe linux-btrfs in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: fix pages truncation in btrfs_ioctl_clone()

2011-09-18 Thread Sage Weil
On Sun, 18 Sep 2011, Chris Mason wrote:
 Excerpts from Sage Weil's message of 2011-09-16 12:39:06 -0400:
  On Fri, 16 Sep 2011, Li Zefan wrote:
   It's a bug in commit f81c9cdc567cd3160ff9e64868d9a1a7ee226480
   (Btrfs: truncate pages from clone ioctl target range)
   
   We should pass the dest range to the truncate function, but not the
   src range.
  
  Sigh... yes.
  
   Also move the function before locking extent state.
  
  Hmm, any reason?  i_mutex protects us from a racing write(2), but what 
  about a racing mmap()?  e.g.
  
  cloner: truncates dest pages
  writer: mmap - page_mkwrite locks extent, creates new dirty page, unlocks
  cloner: locks extent, clones, unlocks extent
 
 Thanks guys.  The locking order is page lock - extent lock.  So if we
 call truncate_inode_pages with the extent lock held, we're off into
 ABBA.

Oh right.  This patch makes sense now.
 
 If we want to avoid the mmap race, we'll have to look for the dirty
 pages with the extent lock held, drop the lock and goto back to the
 truncate_inode_pages call.

Yeah, given that (at Li points out) we're not locking dst extents at all, 
I don't think it's worth worrying about now.  Sorry for the noise!

sage

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: fix pages truncation in btrfs_ioctl_clone()

2011-09-16 Thread Sage Weil
On Fri, 16 Sep 2011, Li Zefan wrote:
 It's a bug in commit f81c9cdc567cd3160ff9e64868d9a1a7ee226480
 (Btrfs: truncate pages from clone ioctl target range)
 
 We should pass the dest range to the truncate function, but not the
 src range.

Sigh... yes.

 Also move the function before locking extent state.

Hmm, any reason?  i_mutex protects us from a racing write(2), but what 
about a racing mmap()?  e.g.

cloner: truncates dest pages
writer: mmap - page_mkwrite locks extent, creates new dirty page, unlocks
cloner: locks extent, clones, unlocks extent

sage


 
 Signed-off-by: Li Zefan l...@cn.fujitsu.com
 ---
  fs/btrfs/ioctl.c |8 
  1 files changed, 4 insertions(+), 4 deletions(-)
 
 diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
 index 5492bb3..f6af8b0 100644
 --- a/fs/btrfs/ioctl.c
 +++ b/fs/btrfs/ioctl.c
 @@ -2237,6 +2237,10 @@ static noinline long btrfs_ioctl_clone(struct file 
 *file, unsigned long srcfd,
   goto out_unlock;
   }
  
 + /* truncate page cache pages from target inode range */
 + truncate_inode_pages_range(inode-i_data, destoff,
 +PAGE_CACHE_ALIGN(destoff + len) - 1);
 +
   /* do any pending delalloc/csum calc on src, one way or
  another, and lock file content */
   while (1) {
 @@ -2253,10 +2257,6 @@ static noinline long btrfs_ioctl_clone(struct file 
 *file, unsigned long srcfd,
   btrfs_wait_ordered_range(src, off, len);
   }
  
 - /* truncate page cache pages from target inode range */
 - truncate_inode_pages_range(inode-i_data, off,
 -ALIGN(off + len, PAGE_CACHE_SIZE) - 1);
 -
   /* clone data */
   key.objectid = btrfs_ino(src);
   key.type = BTRFS_EXTENT_DATA_KEY;
 -- 
 1.7.3.1
 
 
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: WARNING: at fs/btrfs/inode.c:2193 btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]()

2011-09-15 Thread Sage Weil
On Tue, 13 Sep 2011, Liu Bo wrote:
 On 09/11/2011 05:47 AM, Martin Mailand wrote:
  Hi
  I am hitting this Warning reproducible, the workload is a ceph osd,
  kernel ist 3.1.0-rc5.
  
 
 Have posted a patch for this:
 
 http://marc.info/?l=linux-btrfsm=131547325515336w=2

We're still seeing this with -rc6, which includes 98c9942 and 65450aa.

I haven't looked at the reservation code in much detail.  Is there 
anything I can do to help track this down?

Thanks-
sage


 
 thanks,
 liubo
 
  Best Regards,
   martin
  
  [ 5472.099766] [ cut here ]
  [ 5472.099833] WARNING: at fs/btrfs/inode.c:2193
  btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]()
  [ 5472.099838] Hardware name: MS-96B3
  [ 5472.099842] Modules linked in: radeon ttm drm_kms_helper drm
  i2c_algo_bit psmouse sp5100_tco edac_core lp shpchp serio_raw k8temp
  i2c_piix4 edac_mce_amd parport ahci pata_atiixp e1000e libahci btrfs
  zlib_deflate libcrc32c
  [ 5472.099878] Pid: 2066, comm: kworker/1:1 Not tainted 3.1.0-rc5-custom #1
  [ 5472.099882] Call Trace:
  [ 5472.099898]  [81063c1f] warn_slowpath_common+0x7f/0xc0
  [ 5472.099907]  [81063c7a] warn_slowpath_null+0x1a/0x20
  [ 5472.099935]  [a003f420] btrfs_orphan_commit_root+0xb0/0xc0
  [btrfs]
  [ 5472.099961]  [a00380aa] commit_fs_roots.clone.21+0xba/0x1a0
  [btrfs]
  [ 5472.099971]  [815db96e] ? _raw_spin_lock+0xe/0x20
  [ 5472.07]  [a003966f]
  btrfs_commit_transaction+0x3ef/0x870 [btrfs]
  [ 5472.100065]  [81012871] ? __switch_to+0x261/0x2f0
  [ 5472.100084]  [81086bf0] ? wake_up_bit+0x40/0x40
  [ 5472.100120]  [a0039af0] ?
  btrfs_commit_transaction+0x870/0x870 [btrfs]
  [ 5472.100155]  [a0039b0f] do_async_commit+0x1f/0x30 [btrfs]
  [ 5472.100171]  [8108110d] process_one_work+0x11d/0x430
  [ 5472.100187]  [81081c69] worker_thread+0x169/0x360
  [ 5472.100203]  [81081b00] ? manage_workers.clone.21+0x240/0x240
  [ 5472.100220]  [81086496] kthread+0x96/0xa0
  [ 5472.100236]  [815e5bb4] kernel_thread_helper+0x4/0x10
  [ 5472.100253]  [81086400] ? flush_kthread_worker+0xb0/0xb0
  [ 5472.100269]  [815e5bb0] ? gs_change+0x13/0x13
  [ 5472.100279] ---[ end trace a8bae5767c2c3e55 ]---
  -- 
  To unsubscribe from this list: send the line unsubscribe linux-btrfs in
  the body of a message to majord...@vger.kernel.org
  More majordomo info at  http://vger.kernel.org/majordomo-info.html
  
 
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: WARNING: at fs/btrfs/inode.c:2193 btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]()

2011-09-15 Thread Sage Weil
On Thu, 15 Sep 2011, David Sterba wrote:
 On Thu, Sep 15, 2011 at 03:50:29PM -0400, Josef Bacik wrote:
   We're still seeing this with -rc6, which includes 98c9942 and 65450aa.
   
   I haven't looked at the reservation code in much detail.  Is there 
   anything I can do to help track this down?
   
  
  This should be taken care of with all my enospc changes.  You can pull them 
  down
  from my btrfs-work tree as soon as kernel.org comes back from the dead :).
 
 should you need it earlier, here's a copy:
 
 git://repo.or.cz/linux-2.6/btrfs-unstable.git #git.kernel.org/josef/master

Thanks!  We'll do some testing today and see if it behaves better.  :)

sage
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] Btrfs: don't make a file partly checksummed through file clone

2011-09-14 Thread Sage Weil
On Wed, 14 Sep 2011, Li Zefan wrote:

 To reproduce the bug:
 
   # mount /dev/sda7 /mnt
   # dd if=/dev/zero of=/mnt/src bs=4K count=1
   # umount /mnt
 
   # mount -o nodatasum /dev/sda7 /mnt
   # dd if=/dev/zero of=/mnt/dst bs=4K count=1
   # clone_range -s 4K -l 4K /mnt/src /mnt/dst
 
   # echo 3  /proc/sys/vm/drop_caches
   # cat /mnt/dst
   # dmesg
   ...
   btrfs no csum found for inode 258 start 0
   btrfs csum failed ino 258 off 0 csum 2566472073 private 0
 
 It's because part of the file is checksummed and the other part is not,
 and then btrfs will complain checksum is not found when we read the file.
 
 Disallow file clone if src and dst file have different checksum flag,
 so we ensure a file is completely checksummed or unchecksummed.
 
 Signed-off-by: Li Zefan l...@cn.fujitsu.com

Looks good to me.

Reviewed-by: Sage Weil s...@newdream.net

 ---
  fs/btrfs/ioctl.c |5 +
  1 files changed, 5 insertions(+), 0 deletions(-)
 
 diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
 index 970977a..dc82bbb 100644
 --- a/fs/btrfs/ioctl.c
 +++ b/fs/btrfs/ioctl.c
 @@ -2177,6 +2177,11 @@ static noinline long btrfs_ioctl_clone(struct file 
 *file, unsigned long srcfd,
   if (!(src_file-f_mode  FMODE_READ))
   goto out_fput;
  
 + /* don't make the dst file partly checksummed */
 + if ((BTRFS_I(src)-flags  BTRFS_INODE_NODATASUM) !=
 + (BTRFS_I(inode)-flags  BTRFS_INODE_NODATASUM))
 + goto out_fput;
 +
   ret = -EISDIR;
   if (S_ISDIR(src-i_mode) || S_ISDIR(inode-i_mode))
   goto out_fput;
 -- 1.7.3.1 
 
 
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] Btrfs updates

2011-08-18 Thread Sage Weil
Hi Chris, Josef,

Can some form of the clone ioctl transaction start reservation fix go in 
soon as well?  That hits a BUG_ON every time.

Thanks!
sage


On Thu, 18 Aug 2011, Chris Mason wrote:

 Hi everyone,
 
 The for-linus branch of the btrfs-unstable tree:
 
 git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable.git 
 for-linus
 
 Has our current pull request for 3.1-rc.  The for-linus branch includes
 3.1-rc2 because it two fixes from Dan Carpenter and Jeff Mahoney that
 only apply to 3.1.
 
 The master branch of btrfs-unstable is based on 3.0, and includes
 everything except those two changes.
 
 This is a variety pack of fixes.  We do have another fix pending for a
 race in our readdir optimizations, but Josef will work that out with Al
 Viro and send it in later this week (or early next).
 
 Chris Mason (1) commits (+17/-0):
 Btrfs: force unplugs when switching from high to regular priority bios
 
 Dan Carpenter (2) commits (+10/-4):
 btrfs: unlock on error in btrfs_file_llseek() (+8/-4)
 btrfs: memory leak in btrfs_add_inode_defrag() (+2/-0)
 
 Jeff Mahoney (1) commits (+8/-4):
 btrfs: btrfs_permission's RO check shouldn't apply to device nodes
 
 Josef Bacik (2) commits (+43/-2):
 Btrfs: set i_size properly when fallocating (+14/-0)
 Btrfs: detect wether a device supports discard (+29/-2)
 
 Li Zefan (1) commits (+2/-4):
 Btrfs: use plain page_address() in header fields setget functions
 
 Miao Xie (2) commits (+12/-6):
 Btrfs: fix uninitialized sync_pending (+1/-1)
 Btrfs: fix wrong free space information (+11/-5)
 
 Sage Weil (1) commits (+4/-0):
 Btrfs: truncate pages from clone ioctl target range
 
 Tsutomu Itoh (1) commits (+16/-10):
 Btrfs: forced readonly when btrfs_drop_snapshot() fails
 
 liubo (3) commits (+72/-14):
 Btrfs: check if there is enough space for balancing smarter (+35/-6)
 Btrfs: fix a bug of balance on full multi-disk partitions (+13/-4)
 Btrfs: fix an oops of log replay (+24/-4)
 
 Total: (14) commits (+183/-43)
 
  fs/btrfs/ctree.h|   10 ++---
  fs/btrfs/extent-tree.c  |   75 +-
  fs/btrfs/file.c |   28 ++--
  fs/btrfs/free-space-cache.c |   16 ++---
  fs/btrfs/inode.c|   12 --
  fs/btrfs/ioctl.c|4 ++
  fs/btrfs/tree-log.c |   28 ++--
  fs/btrfs/volumes.c  |   51 +++--
  fs/btrfs/volumes.h  |2 +
  9 files changed, 183 insertions(+), 43 deletions(-)
 --
 To unsubscribe from this list: send the line unsubscribe linux-btrfs in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: reserve sufficient space for ioctl clone

2011-08-15 Thread Sage Weil
On Tue, 16 Aug 2011, Miao Xie wrote:
 On tue, 9 Aug 2011 10:46:37 -0700, Sage Weil wrote:
  Fix a crash/BUG_ON in the clone ioctl due to insufficient reservation. We
  need to reserve space for:
  
   - adjusting the old extent (possibly splitting it)
   - adding the new extent
   - updating the inode
  
  Signed-off-by: Sage Weil s...@newdream.net
  ---
   fs/btrfs/ioctl.c |7 ++-
   1 files changed, 6 insertions(+), 1 deletions(-)
  
  diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
  index a3c4751..f038d4a 100644
  --- a/fs/btrfs/ioctl.c
  +++ b/fs/btrfs/ioctl.c
  @@ -2320,7 +2320,12 @@ static noinline long btrfs_ioctl_clone(struct file 
  *file, unsigned long srcfd,
  else
  new_key.offset = destoff;
   
  -   trans = btrfs_start_transaction(root, 1);
  +   /*
  +* 1 - adjusting old extent (we may have to split it)
 
 I don't think it is enough. If we have lots of file extents and their extent 
 items are stored
 in many contiguous leaves, the drop operation may cause those leaves to be 
 COWed. So I think we
 must calculate the number of leaves which will be COWed at first.

Hmm, yes, but if that's the case, most of the other btrfs_drop_extents() 
callers are broken too.  Take btrfs_cont_expand(), which does more or less 
the same thing that we're doing too but reserves 2 (no inode update).

Josef, correct me if I'm wrong, but I'm guessing how this works is that we 
are reserving space for something along the lines of 2 * tree depth, or 
enough space to cow the leaf blocks and their parents (plus any space for 
splits/merges).  For inserts, we'd need to be careful about how many items 
we insert (more new leaves), but for removals, as long as we are deleting 
sequential items, we need at most two leaves, for the new versions of each 
end of the deleted range.  Is that right?

Hmm, how does the reservation interace with the extent backrefs, though?  
Is space reserved for that?

sage



 
 Thanks
 Miao
 
  +* 1 - add new extent
  +* 1 - inode update
  +*/
  +   trans = btrfs_start_transaction(root, 3);
  if (IS_ERR(trans)) {
  ret = PTR_ERR(trans);
  goto out;
 
 --
 To unsubscribe from this list: send the line unsubscribe linux-btrfs in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: reserve sufficient space for ioctl clone

2011-08-09 Thread Sage Weil
Fix a crash/BUG_ON in the clone ioctl due to insufficient reservation. We
need to reserve space for:

 - adjusting the old extent (possibly splitting it)
 - adding the new extent
 - updating the inode

Signed-off-by: Sage Weil s...@newdream.net
---
 fs/btrfs/ioctl.c |7 ++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index a3c4751..f038d4a 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2320,7 +2320,12 @@ static noinline long btrfs_ioctl_clone(struct file 
*file, unsigned long srcfd,
else
new_key.offset = destoff;
 
-   trans = btrfs_start_transaction(root, 1);
+   /*
+* 1 - adjusting old extent (we may have to split it)
+* 1 - add new extent
+* 1 - inode update
+*/
+   trans = btrfs_start_transaction(root, 3);
if (IS_ERR(trans)) {
ret = PTR_ERR(trans);
goto out;
-- 
1.7.0

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


clone ioctl bug with inline extents

2011-08-09 Thread Sage Weil
Hi all,

I'm hitting a problem cloning inline extents that I haven't had much 
success tracking down.  It's simple enough to reproduce:

 echo   src
 echo 2  dst
 clone_range src 0 29 dst 0
 cmp src dst   # fails! dst is size 29 but contains 2\n\0\0\0\0...

where clone_range comes from 

 
http://ceph.newdream.net/git/?p=ceph.git;a=blob;f=qa/btrfs/clone_range.c;h=0a88e16013104c27aa87e7cd0d75e4d292419a19;hb=HEAD

The file size is adjusted for the target, and debug-tree shows an inline 
data extent of length 29, but it has the old data in it.  I'm not sure why

ret = btrfs_insert_empty_item(trans, root, path,
  new_key, size);
BUG_ON(ret);

[...]

leaf = path-nodes[0];
slot = path-slots[0];
write_extent_buffer(leaf, buf,
btrfs_item_ptr_offset(leaf, slot),
size);
inode_add_bytes(inode, datal);

is working when cloning to a new file but not over an existing one.  
Hopefully this is something silly I'm missing...

sage
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: clone ioctl bug with inline extents

2011-08-09 Thread Sage Weil
On Tue, 9 Aug 2011, Sage Weil wrote:
 Hi all,
 
 I'm hitting a problem cloning inline extents that I haven't had much 
 success tracking down.  It's simple enough to reproduce:
 
  echo   src
  echo 2  dst
  clone_range src 0 29 dst 0
  cmp src dst   # fails! dst is size 29 but contains 2\n\0\0\0\0...

facepalm, ok, this was just a matter of the ioctl code not dropping the 
old page cache pages.  I'm not sure how we didn't notice that for so long!

sage
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs slowdown

2011-08-08 Thread Sage Weil
Hi Christian,

Are you still seeing this slowness?

sage


On Wed, 27 Jul 2011, Christian Brunner wrote:
 2011/7/25 Chris Mason chris.ma...@oracle.com:
  Excerpts from Christian Brunner's message of 2011-07-25 03:54:47 -0400:
  Hi,
 
  we are running a ceph cluster with btrfs as it's base filesystem
  (kernel 3.0). At the beginning everything worked very well, but after
  a few days (2-3) things are getting very slow.
 
  When I look at the object store servers I see heavy disk-i/o on the
  btrfs filesystems (disk utilization is between 60% and 100%). I also
  did some tracing on the Cepp-Object-Store-Daemon, but I'm quite
  certain, that the majority of the disk I/O is not caused by ceph or
  any other userland process.
 
  When reboot the system(s) the problems go away for another 2-3 days,
  but after that, it starts again. I'm not sure if the problem is
  related to the kernel warning I've reported last week. At least there
  is no temporal relationship between the warning and the slowdown.
 
  Any hints on how to trace this would be welcome.
 
  The easiest way to trace this is with latencytop.
 
  Apply this patch:
 
  http://oss.oracle.com/~mason/latencytop.patch
 
  And then use latencytop -c for a few minutes while the system is slow.
  Send the output here and hopefully we'll be able to figure it out.
 
 I've now installed latencytop. Attached are two output files: The
 first is from yesterday and was created aproxematly half an hour after
 the boot. The second on is from today, uptime is 19h. The load on the
 system is already rising. Disk utilization is approximately at 50%.
 
 Thanks for your help.
 
 Christian
 
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs slowdown

2011-07-28 Thread Sage Weil
On Thu, 28 Jul 2011, Christian Brunner wrote:
 When I look at the latencytop results, there is a high latency when
 calling btrfs_commit_transaction_async. Isn't async supposed to
 return immediately?

It depends.  That function has to block until the commit has started 
before returning in the case where it creates a new btrfs root (i.e., 
snapshot creation).  Otherwise a subsequent operation (after the ioctl 
returns) can sneak in before the snapshot is taken.  (IIRC there was also 
another problem with keeping internal structures consistent, tho I'm 
forgetting the details.)  And there are a bunch of things 
btrfs_commit_transaction() does before setting blocked = 1 that can be 
slow.  There is a fair bit of transaction commit optimization work that 
should eventually be done here that we sadly haven't had the resources to 
look at yet.

sage
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 3.0-rcX BUG at fs/btrfs/ioctl.c:432 - bisected

2011-06-10 Thread Sage Weil
On Fri, 10 Jun 2011, Chris Mason wrote:
 Excerpts from Jim Schutt's message of 2011-06-10 13:06:22 -0400:
 
 [ two different btrfs crashes ]
 
 I think your two crashes in btrfs were from the uninit variables and
 those should be fixed in rc2.
 
  When I did my bisection, my criteria for success/failure was
  did mkcephfs succeed?.  When I apply this criteria to a recent
  linus kernel (e.g. 06e86849cf4019), which includes the fix you
  mentioned (aa0467d8d2a00e), I get still a different failure mode,
  which doesn't actually reference btrfs:
  
  [  276.364178] BUG: unable to handle kernel NULL pointer dereference at 
  000a
  [  276.365127] IP: [a05434b1] journal_start+0x3e/0x9c [jbd]
 
 Looking at the resulting code in the oops, we're here in journal_start:
 
 if (handle) {
 J_ASSERT(handle-h_transaction-t_journal == journal);
 
 handle comes from current-journal_info, and we're doing a deref on
 handle-h_transaction, which is probably 0xa.
 
 So, we're leaving crud in current-journal_info and ext3 is finding it.
 
 Perhaps its from ceph starting a transaction but leaving it running?
 The bug came with Josef's transaction performance fixes, but it is
 probably a mixture of his code with the ioctls ceph is using.

Ah, yeah, that's the problem.  We saw a similar problem a while back with 
the start/stop transaction ioctls.  In this case, create_snapshot is doing

trans = btrfs_start_transaction(root-fs_info-extent_root, 5);
if (IS_ERR(trans)) {
ret = PTR_ERR(trans);
goto fail;
}

which sets current-journal_info.  Then

ret = btrfs_snap_reserve_metadata(trans, pending_snapshot);
BUG_ON(ret);

list_add(pending_snapshot-list,
 trans-transaction-pending_snapshots);
if (async_transid) {
*async_transid = trans-transid;
ret = btrfs_commit_transaction_async(trans,
 root-fs_info-extent_root, 1);
} else {
ret = btrfs_commit_transaction(trans,
   root-fs_info-extent_root);
}

but the async snap creation ioctl takes the async path, which runs 
btrfs_commit_transaction in a worker thread.

I'm not sure what the right thing to do is here is... can whatever is in 
journal_info be attached to trans instead in 
btrfs_commit_transaction_async()?

sage






 
 [ rest of the oops below for context ]
 
 -chris
 
  [  276.365127] PGD 1e4469067 PUD 1e1658067 PMD 0
  [  276.365127] Oops:  [#1] SMP
  [  276.365127] CPU 2
  [  276.365127] Modules linked in: btrfs zlib_deflate lzo_compress 
  ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state 
  nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge 
  stp i2c_dev i2c_core ext3 jbd scsi_transport_iscsi rds ib_ipoib rdma_ucm 
  rdma_cm ib_ucm ib_uverbs ib_umad ib_cm iw_cm ib_addr ipv6 ib_sa dm_mirror 
  dm_region_hash dm_log dm_multipath scsi_dh dm_mod video sbs sbshc pci_slot 
  battery acpi_pad ac kvm sg ses sd_mod enclosure megaraid_sas ide_cd_mod 
  cdrom ib_mthca ib_mad qla2xxx button ib_core serio_raw scsi_transport_fc 
  scsi_tgt dcdbas ata_piix libata tpm_tis tpm i5k_amb ioatdma tpm_bios hwmon 
  iTCO_wdt scsi_mod i5000_edac iTCO_vendor_support ehci_hcd dca edac_core 
  uhci_hcd pcspkr rtc nfs nfs_acl auth_rpcgss fscache lockd sunrpc tg3 bnx2 
  e1000 [last unloaded: freq_table]
  [  276.365127]
  [  276.365127] Pid: 6076, comm: cosd Not tainted 3.0.0-rc2-00196-g06e8684 
  #26 Dell Inc. PowerEdge 1950/0DT097
  [  276.365127] RIP: 0010:[a05434b1]  [a05434b1] 
  journal_start+0x3e/0x9c [jbd]
  [  276.365127] RSP: 0018:8801e2897b28  EFLAGS: 00010286
  [  276.365127] RAX: 000a RBX: 8801de8e1090 RCX: 
  0002
  [  276.365127] RDX: 19b2d000 RSI: 000e RDI: 
  000e
  [  276.365127] RBP: 8801e2897b48 R08: 0003 R09: 
  8801e2897c38
  [  276.365127] R10: 8801e2897ed8 R11: 0001 R12: 
  880223ff4400
  [  276.365127] R13: 880218522d60 R14: 0ec6 R15: 
  88021f54d878
  [  276.365127] FS:  7f8ff0bbb710() GS:88022fc8() 
  knlGS:
  [  276.365127] CS:  0010 DS:  ES:  CR0: 8005003b
  [  276.365127] CR2: 000a CR3: 00021744f000 CR4: 
  06e0
  [  276.365127] DR0:  DR1:  DR2: 
  
  [  276.365127] DR3:  DR6: 0ff0 DR7: 
  0400
  [  276.365127] Process cosd (pid: 6076, threadinfo 8801e2896000, task 
  880218522d60)
  [  276.365127] Stack:
  [  276.365127]  8801e2897b68 ea000756e788 88021f54d728 
  8801e2897c78
  [  276.365127]  8801e2897b58 a05670ce 8801e2897b68 
  a055c72d
  [  276.365127]  8801e2897be8 a055f044 

Re: 3.0-rcX BUG at fs/btrfs/ioctl.c:432 - bisected

2011-06-10 Thread Sage Weil
On Fri, 10 Jun 2011, Sage Weil wrote:
 On Fri, 10 Jun 2011, Chris Mason wrote:
  Excerpts from Jim Schutt's message of 2011-06-10 13:06:22 -0400:
  
  [ two different btrfs crashes ]
  
  I think your two crashes in btrfs were from the uninit variables and
  those should be fixed in rc2.
  
   When I did my bisection, my criteria for success/failure was
   did mkcephfs succeed?.  When I apply this criteria to a recent
   linus kernel (e.g. 06e86849cf4019), which includes the fix you
   mentioned (aa0467d8d2a00e), I get still a different failure mode,
   which doesn't actually reference btrfs:
   
   [  276.364178] BUG: unable to handle kernel NULL pointer dereference at 
   000a
   [  276.365127] IP: [a05434b1] journal_start+0x3e/0x9c [jbd]
  
  Looking at the resulting code in the oops, we're here in journal_start:
  
  if (handle) {
  J_ASSERT(handle-h_transaction-t_journal == journal);
  
  handle comes from current-journal_info, and we're doing a deref on
  handle-h_transaction, which is probably 0xa.
  
  So, we're leaving crud in current-journal_info and ext3 is finding it.
  
  Perhaps its from ceph starting a transaction but leaving it running?
  The bug came with Josef's transaction performance fixes, but it is
  probably a mixture of his code with the ioctls ceph is using.
 
 Ah, yeah, that's the problem.  We saw a similar problem a while back with 
 the start/stop transaction ioctls.  In this case, create_snapshot is doing
 
   trans = btrfs_start_transaction(root-fs_info-extent_root, 5);
   if (IS_ERR(trans)) {
   ret = PTR_ERR(trans);
   goto fail;
   }
 
 which sets current-journal_info.  Then
 
   ret = btrfs_snap_reserve_metadata(trans, pending_snapshot);
   BUG_ON(ret);
 
   list_add(pending_snapshot-list,
trans-transaction-pending_snapshots);
   if (async_transid) {
   *async_transid = trans-transid;
   ret = btrfs_commit_transaction_async(trans,
root-fs_info-extent_root, 1);
   } else {
   ret = btrfs_commit_transaction(trans,
  root-fs_info-extent_root);
   }
 
 but the async snap creation ioctl takes the async path, which runs 
 btrfs_commit_transaction in a worker thread.
 
 I'm not sure what the right thing to do is here is... can whatever is in 
 journal_info be attached to trans instead in 
 btrfs_commit_transaction_async()?

It looks like it's not used for anything in btrfs, actually; it's just set 
and cleared.  What's the point of that?

Anyway, assuming it's useful, I think the below would fix the problem.  
Want to give it a shot, Jim?

Thanks!
sage


diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index c571734..fd04ad7 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1196,6 +1196,9 @@ int btrfs_commit_transaction_async(struct 
btrfs_trans_handle *trans,
put_transaction(cur_trans);
mutex_unlock(root-fs_info-trans_mutex);
 
+   if (current-journal_info == trans)
+   current-journal_info = NULL;
+
return 0;
 }
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 3.0-rcX BUG at fs/btrfs/ioctl.c:432 - bisected

2011-06-10 Thread Sage Weil
On Fri, 10 Jun 2011, Josef Bacik wrote:
 On 06/10/2011 02:14 PM, Sage Weil wrote:
  On Fri, 10 Jun 2011, Sage Weil wrote:
  On Fri, 10 Jun 2011, Chris Mason wrote:
  Excerpts from Jim Schutt's message of 2011-06-10 13:06:22 -0400:
 
  [ two different btrfs crashes ]
 
  I think your two crashes in btrfs were from the uninit variables and
  those should be fixed in rc2.
 
  When I did my bisection, my criteria for success/failure was
  did mkcephfs succeed?.  When I apply this criteria to a recent
  linus kernel (e.g. 06e86849cf4019), which includes the fix you
  mentioned (aa0467d8d2a00e), I get still a different failure mode,
  which doesn't actually reference btrfs:
 
  [  276.364178] BUG: unable to handle kernel NULL pointer dereference at 
  000a
  [  276.365127] IP: [a05434b1] journal_start+0x3e/0x9c [jbd]
 
  Looking at the resulting code in the oops, we're here in journal_start:
 
  if (handle) {
  J_ASSERT(handle-h_transaction-t_journal == journal);
 
  handle comes from current-journal_info, and we're doing a deref on
  handle-h_transaction, which is probably 0xa.
 
  So, we're leaving crud in current-journal_info and ext3 is finding it.
 
  Perhaps its from ceph starting a transaction but leaving it running?
  The bug came with Josef's transaction performance fixes, but it is
  probably a mixture of his code with the ioctls ceph is using.
 
  Ah, yeah, that's the problem.  We saw a similar problem a while back with 
  the start/stop transaction ioctls.  In this case, create_snapshot is doing
 
 trans = btrfs_start_transaction(root-fs_info-extent_root, 5);
 if (IS_ERR(trans)) {
 ret = PTR_ERR(trans);
 goto fail;
 }
 
  which sets current-journal_info.  Then
 
 ret = btrfs_snap_reserve_metadata(trans, pending_snapshot);
 BUG_ON(ret);
 
 list_add(pending_snapshot-list,
  trans-transaction-pending_snapshots);
 if (async_transid) {
 *async_transid = trans-transid;
 ret = btrfs_commit_transaction_async(trans,
  root-fs_info-extent_root, 1);
 } else {
 ret = btrfs_commit_transaction(trans,
root-fs_info-extent_root);
 }
 
  but the async snap creation ioctl takes the async path, which runs 
  btrfs_commit_transaction in a worker thread.
 
  I'm not sure what the right thing to do is here is... can whatever is in 
  journal_info be attached to trans instead in 
  btrfs_commit_transaction_async()?
  
  It looks like it's not used for anything in btrfs, actually; it's just set 
  and cleared.  What's the point of that?
  
 
 It is used now, check the beginning of start_transaction().  Thanks,

Oh I see, okay.

So clearing it in btrfs_commit_transaction_async should be fine then, 
right?  When btrfs_commit_transaction runs in the other thread it won't 
care that current-journal_info is NULL.

sage
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 3.0-rcX BUG at fs/btrfs/ioctl.c:432 - bisected

2011-06-10 Thread Sage Weil
On Fri, 10 Jun 2011, Josef Bacik wrote:
 On 06/10/2011 02:35 PM, Sage Weil wrote:
  On Fri, 10 Jun 2011, Josef Bacik wrote:
  On 06/10/2011 02:14 PM, Sage Weil wrote:
  On Fri, 10 Jun 2011, Sage Weil wrote:
  On Fri, 10 Jun 2011, Chris Mason wrote:
  Excerpts from Jim Schutt's message of 2011-06-10 13:06:22 -0400:
 
  [ two different btrfs crashes ]
 
  I think your two crashes in btrfs were from the uninit variables and
  those should be fixed in rc2.
 
  When I did my bisection, my criteria for success/failure was
  did mkcephfs succeed?.  When I apply this criteria to a recent
  linus kernel (e.g. 06e86849cf4019), which includes the fix you
  mentioned (aa0467d8d2a00e), I get still a different failure mode,
  which doesn't actually reference btrfs:
 
  [  276.364178] BUG: unable to handle kernel NULL pointer dereference 
  at 000a
  [  276.365127] IP: [a05434b1] journal_start+0x3e/0x9c [jbd]
 
  Looking at the resulting code in the oops, we're here in journal_start:
 
  if (handle) {
  J_ASSERT(handle-h_transaction-t_journal == journal);
 
  handle comes from current-journal_info, and we're doing a deref on
  handle-h_transaction, which is probably 0xa.
 
  So, we're leaving crud in current-journal_info and ext3 is finding it.
 
  Perhaps its from ceph starting a transaction but leaving it running?
  The bug came with Josef's transaction performance fixes, but it is
  probably a mixture of his code with the ioctls ceph is using.
 
  Ah, yeah, that's the problem.  We saw a similar problem a while back 
  with 
  the start/stop transaction ioctls.  In this case, create_snapshot is 
  doing
 
   trans = btrfs_start_transaction(root-fs_info-extent_root, 5);
   if (IS_ERR(trans)) {
   ret = PTR_ERR(trans);
   goto fail;
   }
 
  which sets current-journal_info.  Then
 
   ret = btrfs_snap_reserve_metadata(trans, pending_snapshot);
   BUG_ON(ret);
 
   list_add(pending_snapshot-list,
trans-transaction-pending_snapshots);
   if (async_transid) {
   *async_transid = trans-transid;
   ret = btrfs_commit_transaction_async(trans,
root-fs_info-extent_root, 1);
   } else {
   ret = btrfs_commit_transaction(trans,
  root-fs_info-extent_root);
   }
 
  but the async snap creation ioctl takes the async path, which runs 
  btrfs_commit_transaction in a worker thread.
 
  I'm not sure what the right thing to do is here is... can whatever is in 
  journal_info be attached to trans instead in 
  btrfs_commit_transaction_async()?
 
  It looks like it's not used for anything in btrfs, actually; it's just 
  set 
  and cleared.  What's the point of that?
 
 
  It is used now, check the beginning of start_transaction().  Thanks,
  
  Oh I see, okay.
  
  So clearing it in btrfs_commit_transaction_async should be fine then, 
  right?  When btrfs_commit_transaction runs in the other thread it won't 
  care that current-journal_info is NULL.
  
 
 Oh yeah your patch is good :),

Okay cool.  Here's the fix with a proper changelog and a little 
use-after-free paranoia.

Thanks!
sage


From 9881c0752293769d5133c01dff3ab04c0c24c61b Mon Sep 17 00:00:00 2001
From: Sage Weil s...@newdream.net
Date: Fri, 10 Jun 2011 11:41:25 -0700
Subject: [PATCH] Btrfs: clear current-journal_info on async transaction commit

Normally current-jouranl_info is cleared by commit_transaction.  For an
async snap or subvol creation, though, it runs in a work queue.  Clear
it in btrfs_commit_transaction_async() to avoid leaking a non-NULL
journal_info when we return to userspace.  When the actual commit runs in
the other thread it won't care that it's current-journal_info is already
NULL.

Signed-off-by: Sage Weil s...@newdream.net
---
 fs/btrfs/transaction.c |5 -
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index dd71966..9d516ed 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1118,8 +1118,11 @@ int btrfs_commit_transaction_async(struct 
btrfs_trans_handle *trans,
wait_current_trans_commit_start_and_unblock(root, cur_trans);
else
wait_current_trans_commit_start(root, cur_trans);
-   put_transaction(cur_trans);
 
+   if (current-journal_info == trans)
+   current-journal_info = NULL;
+
+   put_transaction(cur_trans);
return 0;
 }
 
-- 
1.7.0


--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 12/19] btrfs: remove unnecessary dentry_unhash in rmdir/rename_dir

2011-05-24 Thread Sage Weil
Btrfs has no problems with lingering references to unlinked directory
inodes.

CC: Chris Mason chris.ma...@oracle.com
CC: linux-btrfs@vger.kernel.org
Signed-off-by: Sage Weil s...@newdream.net
---
 fs/btrfs/inode.c |5 -
 1 files changed, 0 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3a33ae3..7cd8ab0 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3062,8 +3062,6 @@ static int btrfs_rmdir(struct inode *dir, struct dentry 
*dentry)
inode-i_ino == BTRFS_FIRST_FREE_OBJECTID)
return -ENOTEMPTY;
 
-   dentry_unhash(dentry);
-
trans = __unlink_start_trans(dir, dentry);
if (IS_ERR(trans))
return PTR_ERR(trans);
@@ -6994,9 +6992,6 @@ static int btrfs_rename(struct inode *old_dir, struct 
dentry *old_dentry,
u64 root_objectid;
int ret;
 
-   if (new_inode  S_ISDIR(new_dentry-d_inode-i_mode))
-   dentry_unhash(new_dentry);
-
if (new_dir-i_ino == BTRFS_EMPTY_SUBVOL_DIR_OBJECTID)
return -EPERM;
 
-- 
1.7.0

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 10/18] btrfs: remove unnecessary dentry_unhash in rmdir/rename_dir

2011-05-09 Thread Sage Weil
Btrfs has no problems with lingering references to unlinked directory
inodes.

CC: Chris Mason chris.ma...@oracle.com
CC: linux-btrfs@vger.kernel.org
Signed-off-by: Sage Weil s...@newdream.net
---
 fs/btrfs/inode.c |5 -
 1 files changed, 0 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3a33ae3..7cd8ab0 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3062,8 +3062,6 @@ static int btrfs_rmdir(struct inode *dir, struct dentry 
*dentry)
inode-i_ino == BTRFS_FIRST_FREE_OBJECTID)
return -ENOTEMPTY;
 
-   dentry_unhash(dentry);
-
trans = __unlink_start_trans(dir, dentry);
if (IS_ERR(trans))
return PTR_ERR(trans);
@@ -6994,9 +6992,6 @@ static int btrfs_rename(struct inode *old_dir, struct 
dentry *old_dentry,
u64 root_objectid;
int ret;
 
-   if (new_inode  S_ISDIR(new_dentry-d_inode-i_mode))
-   dentry_unhash(new_dentry);
-
if (new_dir-i_ino == BTRFS_EMPTY_SUBVOL_DIR_OBJECTID)
return -EPERM;
 
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: fix sync subvol/snapshot creation

2010-12-09 Thread Sage Weil
We were incorrectly taking the async path even for the sync ioctls by
passing in transid unconditionally.

There's ample room for further cleanup here, but this keeps the fix simple.

Signed-off-by: Sage Weil s...@newdream.net
---
 fs/btrfs/ioctl.c |   20 +++-
 1 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 41614c3..4c2d7c4 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -970,6 +970,15 @@ static noinline int btrfs_ioctl_snap_create(struct file 
*file,
name = async_vol_args-name;
fd = async_vol_args-fd;
async_vol_args-name[BTRFS_SNAPSHOT_NAME_MAX] = '\0';
+
+   ret = btrfs_ioctl_snap_create_transid(file, name, fd,
+ subvol, transid);
+
+   if (ret == 0 
+   copy_to_user(arg +
+offsetof(struct btrfs_ioctl_async_vol_args,
+ transid), transid, sizeof(transid)))
+   ret = -EFAULT;
} else {
vol_args = memdup_user(arg, sizeof(*vol_args));
if (IS_ERR(vol_args))
@@ -977,16 +986,9 @@ static noinline int btrfs_ioctl_snap_create(struct file 
*file,
name = vol_args-name;
fd = vol_args-fd;
vol_args-name[BTRFS_PATH_NAME_MAX] = '\0';
-   }
-
-   ret = btrfs_ioctl_snap_create_transid(file, name, fd,
- subvol, transid);
 
-   if (!ret  async) {
-   if (copy_to_user(arg +
-   offsetof(struct btrfs_ioctl_async_vol_args,
-   transid), transid, sizeof(transid)))
-   return -EFAULT;
+   ret = btrfs_ioctl_snap_create_transid(file, name, fd,
+ subvol, NULL);
}
 
kfree(vol_args);
-- 
1.6.6.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: fix sync subvol/snapshot creation

2010-12-09 Thread Sage Weil
On Thu, 9 Dec 2010, Chris Mason wrote:
 Excerpts from Li Zefan's message of 2010-12-09 21:11:25 -0500:
  Sage Weil wrote:
   We were incorrectly taking the async path even for the sync ioctls by
   passing in transid unconditionally.
   
  
  Ha, I fixed this accidentally in my patchset. :)

 Thanks guys,
 
 Unless either of you object, I'll take this and the snapshot ABI change
 for the next rc.

Sounds good to me.

Just FYI, I'm still seeing some weird corruption and crashes on my 
machines and haven't narrowed it down yet.  Should have more info soon.

sage
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] btrfs: Make async snapshot ioctl more generic

2010-11-29 Thread Sage Weil
Hi Li,

On Mon, 29 Nov 2010, Li Zefan wrote:
 So we don't have to add new structures as we create more ioctls
 for snapshots.
 
 Now to create async snapshot, set BTRFS_SNAPSHOT_CREATE_ASYNC bit of
 vol_arg_v2-flags, and then call ioctl(BTRFS_IOCT_SNAP_CREATE_V2).
 
 Note: this changes the async snapshot ioctl ABI, which was merged
 in .37 merge window, so we have to make this change into .37.

These changes all look good to me.

Thanks!
sage



 
 Signed-off-by: Li Zefan l...@cn.fujitsu.com
 ---
  fs/btrfs/ioctl.c |   34 +-
  fs/btrfs/ioctl.h |   12 
  2 files changed, 29 insertions(+), 17 deletions(-)
 
 diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
 index 463d91b..d3f1a60 100644
 --- a/fs/btrfs/ioctl.c
 +++ b/fs/btrfs/ioctl.c
 @@ -935,23 +935,31 @@ out:
  
  static noinline int btrfs_ioctl_snap_create(struct file *file,
   void __user *arg, int subvol,
 - int async)
 + bool v2)
  {
   struct btrfs_ioctl_vol_args *vol_args = NULL;
 - struct btrfs_ioctl_async_vol_args *async_vol_args = NULL;
 + struct btrfs_ioctl_vol_args_v2 *vol_args_v2 = NULL;
   char *name;
   u64 fd;
   u64 transid = 0;
 + bool async = false;
   int ret;
  
 - if (async) {
 - async_vol_args = memdup_user(arg, sizeof(*async_vol_args));
 - if (IS_ERR(async_vol_args))
 - return PTR_ERR(async_vol_args);
 + if (v2) {
 + vol_args_v2 = memdup_user(arg, sizeof(*vol_args_v2));
 + if (IS_ERR(vol_args_v2))
 + return PTR_ERR(vol_args_v2);
  
 - name = async_vol_args-name;
 - fd = async_vol_args-fd;
 - async_vol_args-name[BTRFS_SNAPSHOT_NAME_MAX] = '\0';
 + if (vol_args_v2-flags  ~BTRFS_SNAPSHOT_CREATE_ASYNC) {
 + ret = -EINVAL;
 + goto out;
 + }
 +
 + name = vol_args_v2-name;
 + fd = vol_args_v2-fd;
 + vol_args_v2-name[BTRFS_SNAPSHOT_NAME_MAX] = '\0';
 + if (vol_args_v2-flags  BTRFS_SNAPSHOT_CREATE_ASYNC)
 + async = true;
   } else {
   vol_args = memdup_user(arg, sizeof(*vol_args));
   if (IS_ERR(vol_args))
 @@ -966,13 +974,13 @@ static noinline int btrfs_ioctl_snap_create(struct file 
 *file,
  
   if (!ret  async) {
   if (copy_to_user(arg +
 - offsetof(struct btrfs_ioctl_async_vol_args,
 + offsetof(struct btrfs_ioctl_vol_args_v2,
   transid), transid, sizeof(transid)))
   return -EFAULT;
   }
 -
 +out:
   kfree(vol_args);
 - kfree(async_vol_args);
 + kfree(vol_args_v2);
  
   return ret;
  }
 @@ -2235,7 +2243,7 @@ long btrfs_ioctl(struct file *file, unsigned int
   return btrfs_ioctl_getversion(file, argp);
   case BTRFS_IOC_SNAP_CREATE:
   return btrfs_ioctl_snap_create(file, argp, 0, 0);
 - case BTRFS_IOC_SNAP_CREATE_ASYNC:
 + case BTRFS_IOC_SNAP_CREATE_V2:
   return btrfs_ioctl_snap_create(file, argp, 0, 1);
   case BTRFS_IOC_SUBVOL_CREATE:
   return btrfs_ioctl_snap_create(file, argp, 1, 0);
 diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
 index 17c99eb..bc70584 100644
 --- a/fs/btrfs/ioctl.h
 +++ b/fs/btrfs/ioctl.h
 @@ -30,10 +30,14 @@ struct btrfs_ioctl_vol_args {
   char name[BTRFS_PATH_NAME_MAX + 1];
  };
  
 -#define BTRFS_SNAPSHOT_NAME_MAX 4079
 -struct btrfs_ioctl_async_vol_args {
 +#define BTRFS_SNAPSHOT_CREATE_ASYNC  (1ULL  0)
 +
 +#define BTRFS_SNAPSHOT_NAME_MAX 4039
 +struct btrfs_ioctl_vol_args_v2 {
   __s64 fd;
   __u64 transid;
 + __u64 flags;
 + __u64 unused[4];
   char name[BTRFS_SNAPSHOT_NAME_MAX + 1];
  };
  
 @@ -187,6 +191,6 @@ struct btrfs_ioctl_space_args {
   struct btrfs_ioctl_space_args)
  #define BTRFS_IOC_START_SYNC _IOR(BTRFS_IOCTL_MAGIC, 24, __u64)
  #define BTRFS_IOC_WAIT_SYNC  _IOW(BTRFS_IOCTL_MAGIC, 22, __u64)
 -#define BTRFS_IOC_SNAP_CREATE_ASYNC _IOW(BTRFS_IOCTL_MAGIC, 23, \
 -struct btrfs_ioctl_async_vol_args)
 +#define BTRFS_IOC_SNAP_CREATE_V2 _IOW(BTRFS_IOCTL_MAGIC, 23, \
 +struct btrfs_ioctl_vol_args_v2)
  #endif
 -- 
 1.6.3
 
 --
 To unsubscribe from this list: send the line unsubscribe linux-btrfs in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] btrfs: Check if dest_offset is block-size aligned before cloning file

2010-11-22 Thread Sage Weil
On Fri, 19 Nov 2010, Li Zefan wrote:
 We've done the check for src_offset and src_length, and We should
 also check dest_offset, otherwise we'll corrupt the destination
 file:
 
   (After cloning file1 to file2 with unaligned dest_offset)
   # cat /mnt/file2
   cat: /mnt/file2: Input/output error
 
 Signed-off-by: Li Zefan l...@cn.fujitsu.com
 ---
  fs/btrfs/ioctl.c |7 +++
  1 files changed, 3 insertions(+), 4 deletions(-)
 
 diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
 index 463d91b..81b47bd 100644
 --- a/fs/btrfs/ioctl.c
 +++ b/fs/btrfs/ioctl.c
 @@ -1669,12 +1669,11 @@ static noinline long btrfs_ioctl_clone(struct file 
 *file, unsigned long srcfd,
   olen = len = src-i_size - off;
   /* if we extend to eof, continue to block boundary */
   if (off + len == src-i_size)
 - len = ((src-i_size + bs-1)  ~(bs-1))
 - - off;
 + len = ALIGN(src-i_size, bs) - off;
  
   /* verify the end result is block aligned */
 - if ((off  (bs-1)) ||
 - ((off + len)  (bs-1)))
 + if (!IS_ALIGNED(off, bs) || !IS_ALIGNED(off + len, bs) ||
 + !IS_ALIGNED(destoff, bs))
   goto out_unlock;
  
   /* do any pending delalloc/csum calc on src, one way or

Looks good.

Reviewed-by: Sage Weil s...@newdream.net

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] btrfs: Set file size correctly in file clone

2010-11-22 Thread Sage Weil
On Fri, 19 Nov 2010, Li Zefan wrote:
 Set src_offset = 0, src_length = 20K, dest_offset = 20K. And the
 original filesize of the dest file 'file2' is 30K:
 
   # ls -l /mnt/file2
   -rw-r--r-- 1 root root 30720 Nov 18 16:42 /mnt/file2
 
 Now clone file1 to file2, the dest file should be 40K, but it
 still shows 30K:
 
   # ls -l /mnt/file2
   -rw-r--r-- 1 root root 30720 Nov 18 16:42 /mnt/file2
 
 Signed-off-by: Li Zefan l...@cn.fujitsu.com
 ---
  fs/btrfs/ioctl.c |4 ++--
  1 files changed, 2 insertions(+), 2 deletions(-)
 
 diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
 index 81b47bd..6b4bfa7 100644
 --- a/fs/btrfs/ioctl.c
 +++ b/fs/btrfs/ioctl.c
 @@ -1873,8 +1873,8 @@ static noinline long btrfs_ioctl_clone(struct file 
 *file, unsigned long srcfd,
* but shouldn't round up the file size
*/
   endoff = new_key.offset + datal;
 - if (endoff  off+olen)
 - endoff = off+olen;
 + if (endoff  destoff+olen)
 + endoff = destoff+olen;
   if (endoff  inode-i_size)
   btrfs_i_size_write(inode, endoff);

Reviewed-by: Sage Weil s...@newdream.net
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


new lockdep warnings in 2.6.37-rc1

2010-11-11 Thread Sage Weil
Hey,

I saw a few variations of this lockdep on about a dozen different nodes 
running 2.6.37-rc1+ (mainline from earlier this week).  They all look to 
be the same issue, hit via a few different call paths.  I haven't seen 
these before...

sage


[10862.369802] ===
[10862.373287] [ INFO: possible circular locking dependency detected ]
[10862.373287] 2.6.37-rc1+ #44
[10862.373287] ---
[10862.373287] syslogd/2091 is trying to acquire lock:
[10862.373287]  ((eb-lock)-rlock){+.+...}, at: [a0048bbe] 
btrfs_try_spin_lock+0x26/0x80 [btrfs]
[10862.373287] 
[10862.373287] but task is already holding lock:
[10862.373287]  (btrfs-extent-01){+.+...}, at: [a0048b8f] 
btrfs_clear_lock_blocking+0x1d/0x26 [btrfs]
[10862.373287] 
[10862.373287] which lock already depends on the new lock.
[10862.373287] 
[10862.373287] 
[10862.373287] the existing dependency chain (in reverse order) is:
[10862.373287] 
[10862.373287] - #1 (btrfs-extent-01){+.+...}:
[10862.442464][810634ad] __lock_acquire+0x82d/0x8aa
[10862.442464][810635b2] lock_acquire+0x88/0xa5
[10862.442464][814cf403] _raw_spin_lock+0x3c/0x6f
[10862.442464][a0048bbe] btrfs_try_spin_lock+0x26/0x80 [btrfs]
[10862.442464][a0009de5] btrfs_search_slot+0x723/0x851 [btrfs]
[10862.442464][a0018340] btrfs_lookup_csum+0x5c/0x135 [btrfs]
[10862.442464][a00185bd] __btrfs_lookup_bio_sums+0x1a4/0x343 
[btrfs]
[10862.442464][a001877e] btrfs_lookup_bio_sums+0x11/0x13 
[btrfs]
[10862.442464][a0022764] btrfs_submit_bio_hook+0x99/0x10a 
[btrfs]
[10862.442464][a0039440] submit_one_bio+0x6c/0x96 [btrfs]
[10862.442464][a003b4a4] extent_read_full_page+0x41/0x4a 
[btrfs]
[10862.442464][a00218e1] btrfs_readpage+0x1e/0x20 [btrfs]
[10862.442464][a002bfec] btrfs_file_aio_write+0x405/0x885 
[btrfs]
[10862.442464][810ba061] do_sync_readv_writev+0xc4/0x10c
[10862.442464][810ba6fa] do_readv_writev+0xb8/0x190
[10862.442464][810ba813] vfs_writev+0x41/0x4c
[10862.442464][810bb008] sys_writev+0x4a/0x75
[10862.442464][81002a2b] system_call_fastpath+0x16/0x1b
[10862.442464] 
[10862.442464] - #0 ((eb-lock)-rlock){+.+...}:
[10862.442464][8106265c] validate_chain+0x7a9/0xdcd
[10862.442464][810634ad] __lock_acquire+0x82d/0x8aa
[10862.442464][810635b2] lock_acquire+0x88/0xa5
[10862.442464][814cf403] _raw_spin_lock+0x3c/0x6f
[10862.442464][a0048bbe] btrfs_try_spin_lock+0x26/0x80 [btrfs]
[10862.442464][a0009de5] btrfs_search_slot+0x723/0x851 [btrfs]
[10862.442464][a0018340] btrfs_lookup_csum+0x5c/0x135 [btrfs]
[10862.442464][a00185bd] __btrfs_lookup_bio_sums+0x1a4/0x343 
[btrfs]
[10862.620387][a001877e] btrfs_lookup_bio_sums+0x11/0x13 
[btrfs]
[10862.620387][a0022764] btrfs_submit_bio_hook+0x99/0x10a 
[btrfs]
[10862.620387][a0039440] submit_one_bio+0x6c/0x96 [btrfs]
[10862.620387][a003b4a4] extent_read_full_page+0x41/0x4a 
[btrfs]
[10862.620387][a00218e1] btrfs_readpage+0x1e/0x20 [btrfs]
[10862.620387][a002bfec] btrfs_file_aio_write+0x405/0x885 
[btrfs]
[10862.620387][810ba061] do_sync_readv_writev+0xc4/0x10c
[10862.620387][810ba6fa] do_readv_writev+0xb8/0x190
[10862.620387][810ba813] vfs_writev+0x41/0x4c
[10862.620387][810bb008] sys_writev+0x4a/0x75
[10862.620387][81002a2b] system_call_fastpath+0x16/0x1b
[10862.620387] 
[10862.620387] other info that might help us debug this:
[10862.620387] 
[10862.620387] 2 locks held by syslogd/2091:
[10862.620387]  #0:  (sb-s_type-i_mutex_key#11){+.+.+.}, at: 
[a002bd36] btrfs_file_aio_write+0x14f/0x885 [btrfs]
[10862.620387]  #1:  (btrfs-extent-01){+.+...}, at: [a0048b8f] 
btrfs_clear_lock_blocking+0x1d/0x26 [btrfs]
[10862.730233] 
[10862.730233] stack backtrace:
[10862.730233] Pid: 2091, comm: syslogd Not tainted 2.6.37-rc1+ #44
[10862.730233] Call Trace:
[10862.730233]  [81061876] print_circular_bug+0xb3/0xc1
[10862.730233]  [8106265c] validate_chain+0x7a9/0xdcd
[10862.730233]  [81009948] ? native_sched_clock+0x37/0x71
[10862.730233]  [810634ad] __lock_acquire+0x82d/0x8aa
[10862.730233]  [810635b2] lock_acquire+0x88/0xa5
[10862.730233]  [a0048bbe] ? btrfs_try_spin_lock+0x26/0x80 [btrfs]
[10862.730233]  [814cf403] _raw_spin_lock+0x3c/0x6f
[10862.730233]  [a0048bbe] ? btrfs_try_spin_lock+0x26/0x80 [btrfs]
[10862.730233]  [a0048b8f] ? btrfs_clear_lock_blocking+0x1d/0x26 
[btrfs]

Re: [PATCH 2/3] btrfs: implement 'async-snapshot' command

2010-11-02 Thread Sage Weil
On Tue, 2 Nov 2010, Goffredo Baroncelli wrote:
 On Monday, 01 November, 2010, Sage Weil wrote:
  On Sat, 30 Oct 2010, Goffredo Baroncelli wrote:
   On Saturday, 30 October, 2010, Sage Weil wrote:
This is identical to 'snapshot', but uses the new async snapshot 
 creation
ioctl, and prints out the transid the new snapshot will be committed
with.

Signed-off-by: Sage Weil s...@newdream.net
---
 btrfs.c  |8 +++-
 btrfs_cmds.c |   32 +++-
 btrfs_cmds.h |3 ++-
 3 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/btrfs.c b/btrfs.c
index 46314cf..c4b9a31 100644
--- a/btrfs.c
+++ b/btrfs.c
@@ -44,11 +44,17 @@ static struct Command commands[] = {
/*
avoid short commands different for the case only
*/
-   { do_clone, 2,
+   { do_create_snap, 2,
  subvolume snapshot, source [dest/]name\n
Create a writable snapshot of the subvolume source 
with\n
the name name in the dest directory.
},
+   { do_create_snap_async, 2,
+ subvolume async-snapshot, source [dest/]name\n
+   Create a writable snapshot of the subvolume source 
with\n
+   the name name in the dest directory.  Do not wait 
for\n
+   the snapshot creation to commit to disk before 
returning.
+   },
   
   Why create another command ? I think that it is better add a switch like
   
   btrfs subvolume snapshot [--async] source [dest/]name
  
  How about
  
   btrfs subvolume snapshot source [dest/]name [async]
  
  That avoids ambiguity when source is named '--async' and simplifies 
  parsing.  Updated patches follow (including man page updates).
  
  sage
  
 
 No please ! Sorry but this is too strange as syntax.
 
 When the source is named --async it is possible to execute the command 
 passing ./--async. This workaround is used in any unix command (have you 
 ever thought what happens if you execute rm * when exists a file named 
 -rf  
 ?).

Oh right, that's better.  I'm used to using -- (as in 'grep -- -foo'; rm 
supports the same syntax).  Will resend.

sage


 I think that --async is a modification of the standard behaviour of btrfs 
 s 
 s command, so it is natural to use the switch syntax (30+ years of unix say 
 so :-) )
 
 Goffredo
 
 -- 
 gpg key@ keyserver.linux.it: Goffredo Baroncelli (ghigo) kreij...@inwind.it
 Key fingerprint = 4769 7E51 5293 D36C 814E  C054 BF04 F161 3DC5 0512
 
 
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] btrfs-progs: btrfs: implement 'start-sync' and 'wait-sync' commands

2010-11-02 Thread Sage Weil
On Tue, 2 Nov 2010, Goffredo Baroncelli wrote:
 Like the command btrfs subvol snapshot, I think that it is better to add a 
 modifier instead of a new command.
 
  btrfs filesystem sync [--async]
 
 Sorry if I noticed this too late. But I don't see a valid reason to add 
 another command. From a UI point of view the meaning of the command is the 
 same, change only slight the behavior.
 
 Even tough I have to admint that sync --async sound strange. May be flush 
 is 
 better ?
  +   { do_wait_sync, 2,
  + filesystem wait-sync, path transid\n
  +   Wait for the transaction transid on the filesystem at 
 path to commit.
  +   },
 
 If I understood correctly this command may be used to wait the execution of 
 several commands: snapshot, subvolume removal, sync...
 I can suggest a more general name like:
 
 # btrfs filesystem wait-transaction path transid.
 
 (which may be shortened as btrfs filesystem wait ...). 

How about start-transaction and wait-transaction?  Or start-commit and 
wait-commit?  ('transaction' is a heavily overloaded term.)  IMO both 
would be less weird than sync --async.

 Another suggestion: instead of checking if the transid is 0, leave the 
 transid optional. If transid is omitted, then the function waits all the 
 running transaction.

Definitely.

 Finally I suggest to put a little note about the command behvior when a 
 transaction is started after the command execution: is there a risk of DOS ?

Okay.  There's no DOS risk (at least no more so than with while true ; do 
sync ; done).  It'll return immediately if transid has committed.  If 
transid is 0, it'll find the most recent committing or committed transid 
and wait for that.  If nothing is currently committing, it'll return 
immediately.

sage



 
 Anyway great work.
 
 Goffredo
 
 
  { do_resize, 2,
filesystem resize, [+/-]newsize[gkm]|max filesystem\n
  Resize the file system. If 'max' is passed, the filesystem\n
  diff --git a/btrfs_cmds.c b/btrfs_cmds.c
  index 8031c58..736437d 100644
  --- a/btrfs_cmds.c
  +++ b/btrfs_cmds.c
  @@ -526,6 +526,55 @@ int do_fssync(int argc, char **argv)
  return 0;
   }
   
  +int do_start_sync(int argc, char **argv)
  +{
  +   int fd, res;
  +   char*path = argv[1];
  +   __u64 transid;
  +
  +   fd = open_file_or_dir(path);
  +   if (fd  0) {
  +   fprintf(stderr, ERROR: can't access to '%s'\n, path);
  +   return 12;
  +   }
  +
  +   printf(StartSync '%s'\n, path);
  +   res = ioctl(fd, BTRFS_IOC_START_SYNC, transid);
  +   close(fd);
  +   if( res  0 ){
  +   fprintf(stderr, ERROR: unable to fs-syncing '%s'\n, path);
  +   return 16;
  +   } else {
  +   printf(transid %llu\n, (unsigned long long)transid);
  +   }
  +
  +   return 0;
  +}
  +
  +int do_wait_sync(int argc, char **argv)
  +{
  +   int fd, res;
  +   char*path = argv[1];
  +   __u64 transid = atoll(argv[2]);
  +
  +   fd = open_file_or_dir(path);
  +   if (fd  0) {
  +   fprintf(stderr, ERROR: can't access to '%s'\n, path);
  +   return 12;
  +   }
  +
  +   printf(WaitSync '%s' transid %llu\n, path, (unsigned long 
 long)transid);
  +   res = ioctl(fd, BTRFS_IOC_WAIT_SYNC, transid);
  +   close(fd);
  +   if( res  0 ){
  +   fprintf(stderr, ERROR: unable to wait-sync on '%s' transid 
 %llu: %s\n, path,
  +   (unsigned long long)transid, strerror(errno));
  +   return 16;
  +   }
  +
  +   return 0;
  +}
  +
   int do_scan(int argc, char **argv)
   {
  int i, fd;
  diff --git a/btrfs_cmds.h b/btrfs_cmds.h
  index 7bde191..84c489f 100644
  --- a/btrfs_cmds.h
  +++ b/btrfs_cmds.h
  @@ -19,6 +19,8 @@ int do_clone(int nargs, char **argv);
   int do_delete_subvolume(int nargs, char **argv);
   int do_create_subvol(int nargs, char **argv);
   int do_fssync(int nargs, char **argv);
  +int do_start_sync(int nargs, char **argv);
  +int do_wait_sync(int nargs, char **argv);
   int do_defrag(int argc, char **argv);
   int do_show_filesystem(int nargs, char **argv);
   int do_add_volume(int nargs, char **args);
  diff --git a/man/btrfs.8.in b/man/btrfs.8.in
  index 26ef982..e87b5fe 100644
  --- a/man/btrfs.8.in
  +++ b/man/btrfs.8.in
  @@ -19,6 +19,10 @@ btrfs \- control a btrfs filesystem
   .PP
   \fBbtrfs\fP \fBfilesystem sync\fP\fI path \fP
   .PP
  +\fBbtrfs\fP \fBfilesystem start-sync\fP\fI path \fP
  +.PP
  +\fBbtrfs\fP \fBfilesystem wait-sync\fP\fI path transid\fP
  +.PP
   \fBbtrfs\fP \fBfilesystem resize\fP\fI [+/\-]size[gkm]|max 
 filesystem\fP
   .PP
   \fBbtrfs\fP \fBdevice scan\fP\fI [device [device..]]\fP
  @@ -115,6 +119,16 @@ all the block devices.
   Force a sync for the filesystem identified by \fIpath\fR.
   .TP
   
  +\fBfilesystem start-sync\fR\fI path \fR
  +Start a sync operation for the filesystem identified by \fIpath\fR.  A 
 transaction id
  +is printed that can be waited on using the \fBfilesystem wait-sync\fR 
 command.
  +.TP
  +
  +\fBfilesystem 

Re: [PATCH 2/3] btrfs: implement 'async-snapshot' command

2010-11-01 Thread Sage Weil
On Sat, 30 Oct 2010, Goffredo Baroncelli wrote:
 On Saturday, 30 October, 2010, Sage Weil wrote:
  This is identical to 'snapshot', but uses the new async snapshot creation
  ioctl, and prints out the transid the new snapshot will be committed
  with.
  
  Signed-off-by: Sage Weil s...@newdream.net
  ---
   btrfs.c  |8 +++-
   btrfs_cmds.c |   32 +++-
   btrfs_cmds.h |3 ++-
   3 files changed, 36 insertions(+), 7 deletions(-)
  
  diff --git a/btrfs.c b/btrfs.c
  index 46314cf..c4b9a31 100644
  --- a/btrfs.c
  +++ b/btrfs.c
  @@ -44,11 +44,17 @@ static struct Command commands[] = {
  /*
  avoid short commands different for the case only
  */
  -   { do_clone, 2,
  +   { do_create_snap, 2,
subvolume snapshot, source [dest/]name\n
  Create a writable snapshot of the subvolume source with\n
  the name name in the dest directory.
  },
  +   { do_create_snap_async, 2,
  + subvolume async-snapshot, source [dest/]name\n
  +   Create a writable snapshot of the subvolume source with\n
  +   the name name in the dest directory.  Do not wait for\n
  +   the snapshot creation to commit to disk before returning.
  +   },
 
 Why create another command ? I think that it is better add a switch like
 
 btrfs subvolume snapshot [--async] source [dest/]name

How about

 btrfs subvolume snapshot source [dest/]name [async]

That avoids ambiguity when source is named '--async' and simplifies 
parsing.  Updated patches follow (including man page updates).

sage




 
   Create a writable snapshot of the subvolume source with
   the name name in the dest directory.  If --async is
   passed, do not wait for the snapshot creation to commit to
   disk before returning.
 
 In any case, please update the man page too.
 
 
 
  { do_delete_subvolume, 1,
subvolume delete, subvolume\n
  Delete the subvolume subvolume.
  diff --git a/btrfs_cmds.c b/btrfs_cmds.c
  index 8031c58..6da5862 100644
  --- a/btrfs_cmds.c
  +++ b/btrfs_cmds.c
  @@ -307,7 +307,7 @@ int do_subvol_list(int argc, char **argv)
  return 0;
   }
   
  -int do_clone(int argc, char **argv)
  +static int create_snap(int argc, char **argv, int async)
   {
  char*subvol, *dst;
  int res, fd, fddst, len;
  @@ -316,7 +316,6 @@ int do_clone(int argc, char **argv)
   
  subvol = argv[1];
  dst = argv[2];
  -   struct btrfs_ioctl_vol_args args;
   
  res = test_issubvolume(subvol);
  if(res0){
  @@ -374,9 +373,22 @@ int do_clone(int argc, char **argv)
   
  printf(Create a snapshot of '%s' in '%s/%s'\n,
 subvol, dstdir, newname);
  -   args.fd = fd;
  -   strcpy(args.name, newname);
  -   res = ioctl(fddst, BTRFS_IOC_SNAP_CREATE, args);
  +   if (async) {
  +   struct btrfs_ioctl_async_vol_args   async_args;
  +   async_args.fd = fd;
  +   async_args.transid = 0;
  +   strcpy(async_args.name, newname);
  +   res = ioctl(fddst, BTRFS_IOC_SNAP_CREATE_ASYNC, async_args);
  +   if (res == 0)
  +   printf(transid %llu\n,
  +  (unsigned long long)async_args.transid);
  +   } else {
  +   struct btrfs_ioctl_vol_args args;
  +
  +   args.fd = fd;
  +   strcpy(args.name, newname);
  +   res = ioctl(fddst, BTRFS_IOC_SNAP_CREATE, args);
  +   }
   
  close(fd);
  close(fddst);
  @@ -390,6 +402,16 @@ int do_clone(int argc, char **argv)
   
   }
   
  +int do_create_snap_async(int argc, char **argv)
  +{
  +   return create_snap(argc, argv, 1);
  +}
  +
  +int do_create_snap(int argc, char **argv)
  +{
  +   return create_snap(argc, argv, 0);
  +}
  +
   int do_delete_subvolume(int argc, char **argv)
   {
  int res, fd, len;
  diff --git a/btrfs_cmds.h b/btrfs_cmds.h
  index 7bde191..c44dc79 100644
  --- a/btrfs_cmds.h
  +++ b/btrfs_cmds.h
  @@ -15,7 +15,8 @@
*/
   
   /* btrfs_cmds.c*/
  -int do_clone(int nargs, char **argv);
  +int do_create_snap(int nargs, char **argv);
  +int do_create_snap_async(int nargs, char **argv);
   int do_delete_subvolume(int nargs, char **argv);
   int do_create_subvol(int nargs, char **argv);
   int do_fssync(int nargs, char **argv);
  -- 
  1.7.1
  
  --
  To unsubscribe from this list: send the line unsubscribe linux-btrfs in
  the body of a message to majord...@vger.kernel.org
  More majordomo info at  http://vger.kernel.org/majordomo-info.html
  
 
 
 -- 
 gpg key@ keyserver.linux.it: Goffredo Baroncelli (ghigo) kreij...@inwind.it
 Key fingerprint = 4769 7E51 5293 D36C 814E  C054 BF04 F161 3DC5 0512
 --
 To unsubscribe from this list: send the line unsubscribe linux-btrfs in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message

[PATCH] btrfs-progs: ioctl: add may_alias to btrfs_ioctl_search_header

2010-11-01 Thread Sage Weil
This fixes the errors

btrfs-list.c: In function 'ino_resolve':
btrfs-list.c:506: error: dereferencing pointer 'sh' does break strict-aliasing 
rules
btrfs-list.c:504: error: dereferencing pointer 'sh' does break strict-aliasing 
rules
btrfs-list.c:502: note: initialized from here

when building with gcc version 4.4.4 (Debian 4.4.4-6).

Signed-off-by: Sage Weil s...@newdream.net
---
 ioctl.h |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/ioctl.h b/ioctl.h
index 5a03317..4a850ff 100644
--- a/ioctl.h
+++ b/ioctl.h
@@ -87,7 +87,7 @@ struct btrfs_ioctl_search_header {
__u64 offset;
__u32 type;
__u32 len;
-};
+} __attribute__((__may_alias__));
 
 #define BTRFS_SEARCH_ARGS_BUFSIZE (4096 - sizeof(struct 
btrfs_ioctl_search_key))
 /*
-- 
1.7.0

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] btrfs-progs: btrfs: implement async option for snapshot command

2010-11-01 Thread Sage Weil
Add an 'async' option to the snapshot creating command, that will let you
avoid waiting for a new snapshot to commit to disk.

Signed-off-by: Sage Weil s...@newdream.net
---
 btrfs.c|9 ++---
 btrfs_cmds.c   |   31 +--
 btrfs_cmds.h   |2 +-
 man/btrfs.8.in |   15 ++-
 4 files changed, 42 insertions(+), 15 deletions(-)

diff --git a/btrfs.c b/btrfs.c
index c871f4a..657bb6f 100644
--- a/btrfs.c
+++ b/btrfs.c
@@ -44,10 +44,13 @@ static struct Command commands[] = {
/*
avoid short commands different for the case only
*/
-   { do_clone, 2,
- subvolume snapshot, source [dest/]name\n
+   { do_create_snap, -2,
+ subvolume snapshot, source [dest/]name [async]\n
Create a writable snapshot of the subvolume source with\n
-   the name name in the dest directory.
+   the name name in the dest directory.  If [async] is\n
+   specified, we will not wait for the snapshot to be committed\n
+   to disk, and a transid will be printed that can be fed to\n
+   'filesystem wait-sync transid'.
},
{ do_delete_subvolume, 1,
  subvolume delete, subvolume\n
diff --git a/btrfs_cmds.c b/btrfs_cmds.c
index 736437d..276b8fa 100644
--- a/btrfs_cmds.c
+++ b/btrfs_cmds.c
@@ -307,16 +307,22 @@ int do_subvol_list(int argc, char **argv)
return 0;
 }
 
-int do_clone(int argc, char **argv)
+int do_create_snap(int argc, char **argv)
 {
char*subvol, *dst;
-   int res, fd, fddst, len;
+   int res, fd, fddst, len, async;
char*newname;
char*dstdir;
 
subvol = argv[1];
dst = argv[2];
-   struct btrfs_ioctl_vol_args args;
+   if (argc  3) {
+   if (strstr(argv[3], async)) {
+   async = 1;
+   } else {
+   fprintf(stderr, ERROR: 'async' expected for third 
arg\n);
+   }
+   }
 
res = test_issubvolume(subvol);
if(res0){
@@ -374,9 +380,22 @@ int do_clone(int argc, char **argv)
 
printf(Create a snapshot of '%s' in '%s/%s'\n,
   subvol, dstdir, newname);
-   args.fd = fd;
-   strcpy(args.name, newname);
-   res = ioctl(fddst, BTRFS_IOC_SNAP_CREATE, args);
+   if (async) {
+   struct btrfs_ioctl_async_vol_args   async_args;
+   async_args.fd = fd;
+   async_args.transid = 0;
+   strcpy(async_args.name, newname);
+   res = ioctl(fddst, BTRFS_IOC_SNAP_CREATE_ASYNC, async_args);
+   if (res == 0)
+   printf(transid %llu\n,
+  (unsigned long long)async_args.transid);
+   } else {
+   struct btrfs_ioctl_vol_args args;
+
+   args.fd = fd;
+   strcpy(args.name, newname);
+   res = ioctl(fddst, BTRFS_IOC_SNAP_CREATE, args);
+   }
 
close(fd);
close(fddst);
diff --git a/btrfs_cmds.h b/btrfs_cmds.h
index 84c489f..bf566ae 100644
--- a/btrfs_cmds.h
+++ b/btrfs_cmds.h
@@ -15,7 +15,7 @@
  */
 
 /* btrfs_cmds.c*/
-int do_clone(int nargs, char **argv);
+int do_create_snap(int nargs, char **argv);
 int do_delete_subvolume(int nargs, char **argv);
 int do_create_subvol(int nargs, char **argv);
 int do_fssync(int nargs, char **argv);
diff --git a/man/btrfs.8.in b/man/btrfs.8.in
index e87b5fe..504fe5f 100644
--- a/man/btrfs.8.in
+++ b/man/btrfs.8.in
@@ -5,7 +5,7 @@
 .SH NAME
 btrfs \- control a btrfs filesystem
 .SH SYNOPSIS
-\fBbtrfs\fP \fBsubvolume snapshot\fP\fI source [dest/]name\fP
+\fBbtrfs\fP \fBsubvolume snapshot\fP\fI source [dest/]name [async]\fP
 .PP
 \fBbtrfs\fP \fBsubvolume delete\fP\fI subvolume\fP
 .PP
@@ -74,10 +74,15 @@ command.
 .SH COMMANDS
 .TP
 
-\fBsubvolume snapshot\fR\fI source [dest/]name\fR
-Create a writable snapshot of the subvolume \fIsource\fR with the name
-\fIname\fR in the \fIdest\fR directory. If \fIsource\fR is not a
-subvolume, \fBbtrfs\fR returns an error.
+\fBsubvolume snapshot\fR\fI source [dest/]name [async]\fR 
+Create a writable snapshot of the subvolume \fIsource\fR with the
+name \fIname\fR in the \fIdest\fR directory. If \fIsource\fR is
+not a subvolume, \fBbtrfs\fR returns an error.  If \fI[async]\fR is
+specified, we will not wait for the snapshot creation to commit to
+disk before returning, and a transaction id is printed that can be
+waited on1 using the \fBfilesystem wait-sync\fR command.
+.TP
+
 .TP
 
 \fBsubvolume delete\fR\fI subvolume\fR
-- 
1.7.0

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] btrfs-progs: btrfs: implement 'start-sync' and 'wait-sync' commands

2010-11-01 Thread Sage Weil
The 'start-sync' command initiates a sync, but does not wait for it to
complete.  A transaction is printed that can be fed to 'wait-sync', which
will wait for it to commit.

'wait-sync' can also be used in combination with 'async-snapshot' to wait
for an async snapshot creation to commit.

Updates the man page too.

Signed-off-by: Sage Weil s...@newdream.net
---
 btrfs.c|9 +
 btrfs_cmds.c   |   49 +
 btrfs_cmds.h   |2 ++
 man/btrfs.8.in |   14 ++
 4 files changed, 74 insertions(+), 0 deletions(-)

diff --git a/btrfs.c b/btrfs.c
index 46314cf..c871f4a 100644
--- a/btrfs.c
+++ b/btrfs.c
@@ -77,6 +77,15 @@ static struct Command commands[] = {
  filesystem sync, path\n
Force a sync on the filesystem path.
},
+   { do_start_sync, 1,
+ filesystem start-sync, path\n
+   Start a sync on the filesystem path, and print the 
resulting\n
+   transaction id.
+   },
+   { do_wait_sync, 2,
+ filesystem wait-sync, path transid\n
+   Wait for the transaction transid on the filesystem at path 
to commit.
+   },
{ do_resize, 2,
  filesystem resize, [+/-]newsize[gkm]|max filesystem\n
Resize the file system. If 'max' is passed, the filesystem\n
diff --git a/btrfs_cmds.c b/btrfs_cmds.c
index 8031c58..736437d 100644
--- a/btrfs_cmds.c
+++ b/btrfs_cmds.c
@@ -526,6 +526,55 @@ int do_fssync(int argc, char **argv)
return 0;
 }
 
+int do_start_sync(int argc, char **argv)
+{
+   int fd, res;
+   char*path = argv[1];
+   __u64 transid;
+
+   fd = open_file_or_dir(path);
+   if (fd  0) {
+   fprintf(stderr, ERROR: can't access to '%s'\n, path);
+   return 12;
+   }
+
+   printf(StartSync '%s'\n, path);
+   res = ioctl(fd, BTRFS_IOC_START_SYNC, transid);
+   close(fd);
+   if( res  0 ){
+   fprintf(stderr, ERROR: unable to fs-syncing '%s'\n, path);
+   return 16;
+   } else {
+   printf(transid %llu\n, (unsigned long long)transid);
+   }
+
+   return 0;
+}
+
+int do_wait_sync(int argc, char **argv)
+{
+   int fd, res;
+   char*path = argv[1];
+   __u64 transid = atoll(argv[2]);
+
+   fd = open_file_or_dir(path);
+   if (fd  0) {
+   fprintf(stderr, ERROR: can't access to '%s'\n, path);
+   return 12;
+   }
+
+   printf(WaitSync '%s' transid %llu\n, path, (unsigned long 
long)transid);
+   res = ioctl(fd, BTRFS_IOC_WAIT_SYNC, transid);
+   close(fd);
+   if( res  0 ){
+   fprintf(stderr, ERROR: unable to wait-sync on '%s' transid 
%llu: %s\n, path,
+   (unsigned long long)transid, strerror(errno));
+   return 16;
+   }
+
+   return 0;
+}
+
 int do_scan(int argc, char **argv)
 {
int i, fd;
diff --git a/btrfs_cmds.h b/btrfs_cmds.h
index 7bde191..84c489f 100644
--- a/btrfs_cmds.h
+++ b/btrfs_cmds.h
@@ -19,6 +19,8 @@ int do_clone(int nargs, char **argv);
 int do_delete_subvolume(int nargs, char **argv);
 int do_create_subvol(int nargs, char **argv);
 int do_fssync(int nargs, char **argv);
+int do_start_sync(int nargs, char **argv);
+int do_wait_sync(int nargs, char **argv);
 int do_defrag(int argc, char **argv);
 int do_show_filesystem(int nargs, char **argv);
 int do_add_volume(int nargs, char **args);
diff --git a/man/btrfs.8.in b/man/btrfs.8.in
index 26ef982..e87b5fe 100644
--- a/man/btrfs.8.in
+++ b/man/btrfs.8.in
@@ -19,6 +19,10 @@ btrfs \- control a btrfs filesystem
 .PP
 \fBbtrfs\fP \fBfilesystem sync\fP\fI path \fP
 .PP
+\fBbtrfs\fP \fBfilesystem start-sync\fP\fI path \fP
+.PP
+\fBbtrfs\fP \fBfilesystem wait-sync\fP\fI path transid\fP
+.PP
 \fBbtrfs\fP \fBfilesystem resize\fP\fI [+/\-]size[gkm]|max filesystem\fP
 .PP
 \fBbtrfs\fP \fBdevice scan\fP\fI [device [device..]]\fP
@@ -115,6 +119,16 @@ all the block devices.
 Force a sync for the filesystem identified by \fIpath\fR.
 .TP
 
+\fBfilesystem start-sync\fR\fI path \fR
+Start a sync operation for the filesystem identified by \fIpath\fR.  A 
transaction id
+is printed that can be waited on using the \fBfilesystem wait-sync\fR command.
+.TP
+
+\fBfilesystem wait-sync\fR\fI path transid\fR
+Wait for a the transaction \fItransid\fR to commit to disk.  If 
\fItransid\fR is zero,
+wait for any currently committing transaction to commit.
+.TP
+
 .\
 .\ Some wording are extracted by the resize2fs man page
 .\
-- 
1.7.0

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/3] update ioctl.h from 2.6.37-rc1

2010-10-30 Thread Sage Weil
Signed-off-by: Sage Weil s...@newdream.net
---
 ioctl.h |   39 ++-
 1 files changed, 30 insertions(+), 9 deletions(-)

diff --git a/ioctl.h b/ioctl.h
index 776d7a9..5a03317 100644
--- a/ioctl.h
+++ b/ioctl.h
@@ -23,13 +23,28 @@
 
 #define BTRFS_IOCTL_MAGIC 0x94
 #define BTRFS_VOL_NAME_MAX 255
-#define BTRFS_PATH_NAME_MAX 4087
 
+/* this should be 4k */
+#define BTRFS_PATH_NAME_MAX 4087
 struct btrfs_ioctl_vol_args {
__s64 fd;
char name[BTRFS_PATH_NAME_MAX + 1];
 };
 
+#define BTRFS_SNAPSHOT_NAME_MAX 4079
+struct btrfs_ioctl_async_vol_args {
+   __s64 fd;
+   __u64 transid;
+   char name[BTRFS_SNAPSHOT_NAME_MAX + 1];
+};
+
+#define BTRFS_INO_LOOKUP_PATH_MAX 4080
+struct btrfs_ioctl_ino_lookup_args {
+   __u64 treeid;
+   __u64 objectid;
+   char name[BTRFS_INO_LOOKUP_PATH_MAX];
+};
+
 struct btrfs_ioctl_search_key {
/* which root are we searching.  0 is the tree of tree roots */
__u64 tree_id;
@@ -72,7 +87,7 @@ struct btrfs_ioctl_search_header {
__u64 offset;
__u32 type;
__u32 len;
-} __attribute__((may_alias));
+};
 
 #define BTRFS_SEARCH_ARGS_BUFSIZE (4096 - sizeof(struct 
btrfs_ioctl_search_key))
 /*
@@ -85,11 +100,10 @@ struct btrfs_ioctl_search_args {
char buf[BTRFS_SEARCH_ARGS_BUFSIZE];
 };
 
-#define BTRFS_INO_LOOKUP_PATH_MAX 4080
-struct btrfs_ioctl_ino_lookup_args {
-   __u64 treeid;
-   __u64 objectid;
-   char name[BTRFS_INO_LOOKUP_PATH_MAX];
+struct btrfs_ioctl_clone_range_args {
+  __s64 src_fd;
+  __u64 src_offset, src_length;
+  __u64 dest_offset;
 };
 
 /* flags for the defrag range ioctl */
@@ -155,11 +169,14 @@ struct btrfs_ioctl_space_args {
   struct btrfs_ioctl_vol_args)
 #define BTRFS_IOC_BALANCE _IOW(BTRFS_IOCTL_MAGIC, 12, \
   struct btrfs_ioctl_vol_args)
-/* 13 is for CLONE_RANGE */
+
+#define BTRFS_IOC_CLONE_RANGE _IOW(BTRFS_IOCTL_MAGIC, 13, \
+ struct btrfs_ioctl_clone_range_args)
+
 #define BTRFS_IOC_SUBVOL_CREATE _IOW(BTRFS_IOCTL_MAGIC, 14, \
   struct btrfs_ioctl_vol_args)
 #define BTRFS_IOC_SNAP_DESTROY _IOW(BTRFS_IOCTL_MAGIC, 15, \
-  struct btrfs_ioctl_vol_args)
+   struct btrfs_ioctl_vol_args)
 #define BTRFS_IOC_DEFRAG_RANGE _IOW(BTRFS_IOCTL_MAGIC, 16, \
struct btrfs_ioctl_defrag_range_args)
 #define BTRFS_IOC_TREE_SEARCH _IOWR(BTRFS_IOCTL_MAGIC, 17, \
@@ -169,4 +186,8 @@ struct btrfs_ioctl_space_args {
 #define BTRFS_IOC_DEFAULT_SUBVOL _IOW(BTRFS_IOCTL_MAGIC, 19, u64)
 #define BTRFS_IOC_SPACE_INFO _IOWR(BTRFS_IOCTL_MAGIC, 20, \
struct btrfs_ioctl_space_args)
+#define BTRFS_IOC_START_SYNC _IOR(BTRFS_IOCTL_MAGIC, 24, __u64)
+#define BTRFS_IOC_WAIT_SYNC  _IOW(BTRFS_IOCTL_MAGIC, 22, __u64)
+#define BTRFS_IOC_SNAP_CREATE_ASYNC _IOW(BTRFS_IOCTL_MAGIC, 23, \
+  struct btrfs_ioctl_async_vol_args)
 #endif
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/3] btrfs: implement 'async-snapshot' command

2010-10-30 Thread Sage Weil
This is identical to 'snapshot', but uses the new async snapshot creation
ioctl, and prints out the transid the new snapshot will be committed
with.

Signed-off-by: Sage Weil s...@newdream.net
---
 btrfs.c  |8 +++-
 btrfs_cmds.c |   32 +++-
 btrfs_cmds.h |3 ++-
 3 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/btrfs.c b/btrfs.c
index 46314cf..c4b9a31 100644
--- a/btrfs.c
+++ b/btrfs.c
@@ -44,11 +44,17 @@ static struct Command commands[] = {
/*
avoid short commands different for the case only
*/
-   { do_clone, 2,
+   { do_create_snap, 2,
  subvolume snapshot, source [dest/]name\n
Create a writable snapshot of the subvolume source with\n
the name name in the dest directory.
},
+   { do_create_snap_async, 2,
+ subvolume async-snapshot, source [dest/]name\n
+   Create a writable snapshot of the subvolume source with\n
+   the name name in the dest directory.  Do not wait for\n
+   the snapshot creation to commit to disk before returning.
+   },
{ do_delete_subvolume, 1,
  subvolume delete, subvolume\n
Delete the subvolume subvolume.
diff --git a/btrfs_cmds.c b/btrfs_cmds.c
index 8031c58..6da5862 100644
--- a/btrfs_cmds.c
+++ b/btrfs_cmds.c
@@ -307,7 +307,7 @@ int do_subvol_list(int argc, char **argv)
return 0;
 }
 
-int do_clone(int argc, char **argv)
+static int create_snap(int argc, char **argv, int async)
 {
char*subvol, *dst;
int res, fd, fddst, len;
@@ -316,7 +316,6 @@ int do_clone(int argc, char **argv)
 
subvol = argv[1];
dst = argv[2];
-   struct btrfs_ioctl_vol_args args;
 
res = test_issubvolume(subvol);
if(res0){
@@ -374,9 +373,22 @@ int do_clone(int argc, char **argv)
 
printf(Create a snapshot of '%s' in '%s/%s'\n,
   subvol, dstdir, newname);
-   args.fd = fd;
-   strcpy(args.name, newname);
-   res = ioctl(fddst, BTRFS_IOC_SNAP_CREATE, args);
+   if (async) {
+   struct btrfs_ioctl_async_vol_args   async_args;
+   async_args.fd = fd;
+   async_args.transid = 0;
+   strcpy(async_args.name, newname);
+   res = ioctl(fddst, BTRFS_IOC_SNAP_CREATE_ASYNC, async_args);
+   if (res == 0)
+   printf(transid %llu\n,
+  (unsigned long long)async_args.transid);
+   } else {
+   struct btrfs_ioctl_vol_args args;
+
+   args.fd = fd;
+   strcpy(args.name, newname);
+   res = ioctl(fddst, BTRFS_IOC_SNAP_CREATE, args);
+   }
 
close(fd);
close(fddst);
@@ -390,6 +402,16 @@ int do_clone(int argc, char **argv)
 
 }
 
+int do_create_snap_async(int argc, char **argv)
+{
+   return create_snap(argc, argv, 1);
+}
+
+int do_create_snap(int argc, char **argv)
+{
+   return create_snap(argc, argv, 0);
+}
+
 int do_delete_subvolume(int argc, char **argv)
 {
int res, fd, len;
diff --git a/btrfs_cmds.h b/btrfs_cmds.h
index 7bde191..c44dc79 100644
--- a/btrfs_cmds.h
+++ b/btrfs_cmds.h
@@ -15,7 +15,8 @@
  */
 
 /* btrfs_cmds.c*/
-int do_clone(int nargs, char **argv);
+int do_create_snap(int nargs, char **argv);
+int do_create_snap_async(int nargs, char **argv);
 int do_delete_subvolume(int nargs, char **argv);
 int do_create_subvol(int nargs, char **argv);
 int do_fssync(int nargs, char **argv);
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/3] btrfs: implement 'start-sync' and 'wait-sync' commands

2010-10-30 Thread Sage Weil
The 'start-sync' command initiates a sync, but does not wait for it to
complete.  A transaction is printed that can be fed to 'wait-sync', which
will wait for it to commit.

'wait-sync' can also be used in combination with 'async-snapshot' to wait
for an async snapshot creation to commit.

Signed-off-by: Sage Weil s...@newdream.net
---
 btrfs.c  |9 +
 btrfs_cmds.c |   49 +
 btrfs_cmds.h |2 ++
 3 files changed, 60 insertions(+), 0 deletions(-)

diff --git a/btrfs.c b/btrfs.c
index c4b9a31..d45ac1f 100644
--- a/btrfs.c
+++ b/btrfs.c
@@ -83,6 +83,15 @@ static struct Command commands[] = {
  filesystem sync, path\n
Force a sync on the filesystem path.
},
+   { do_start_sync, 1,
+ filesystem start-sync, path\n
+   Start a sync on the filesystem path, and print the 
resulting\n
+   transaction id.
+   },
+   { do_wait_sync, 2,
+ filesystem wait-sync, path transid\n
+   Wait for the transaction transid on the filesystem at path 
to commit.
+   },
{ do_resize, 2,
  filesystem resize, [+/-]newsize[gkm]|max filesystem\n
Resize the file system. If 'max' is passed, the filesystem\n
diff --git a/btrfs_cmds.c b/btrfs_cmds.c
index 6da5862..5b5bb15 100644
--- a/btrfs_cmds.c
+++ b/btrfs_cmds.c
@@ -548,6 +548,55 @@ int do_fssync(int argc, char **argv)
return 0;
 }
 
+int do_start_sync(int argc, char **argv)
+{
+   int fd, res;
+   char*path = argv[1];
+   __u64 transid;
+
+   fd = open_file_or_dir(path);
+   if (fd  0) {
+   fprintf(stderr, ERROR: can't access to '%s'\n, path);
+   return 12;
+   }
+
+   printf(StartSync '%s'\n, path);
+   res = ioctl(fd, BTRFS_IOC_START_SYNC, transid);
+   close(fd);
+   if( res  0 ){
+   fprintf(stderr, ERROR: unable to fs-syncing '%s'\n, path);
+   return 16;
+   } else {
+   printf(transid %llu\n, (unsigned long long)transid);
+   }
+
+   return 0;
+}
+
+int do_wait_sync(int argc, char **argv)
+{
+   int fd, res;
+   char*path = argv[1];
+   __u64 transid = atoll(argv[2]);
+
+   fd = open_file_or_dir(path);
+   if (fd  0) {
+   fprintf(stderr, ERROR: can't access to '%s'\n, path);
+   return 12;
+   }
+
+   printf(WaitSync '%s' transid %llu\n, path, (unsigned long 
long)transid);
+   res = ioctl(fd, BTRFS_IOC_WAIT_SYNC, transid);
+   close(fd);
+   if( res  0 ){
+   fprintf(stderr, ERROR: unable to wait-sync on '%s' transid 
%llu: %s\n, path,
+   (unsigned long long)transid, strerror(errno));
+   return 16;
+   }
+
+   return 0;
+}
+
 int do_scan(int argc, char **argv)
 {
int i, fd;
diff --git a/btrfs_cmds.h b/btrfs_cmds.h
index c44dc79..e0e5ceb 100644
--- a/btrfs_cmds.h
+++ b/btrfs_cmds.h
@@ -20,6 +20,8 @@ int do_create_snap_async(int nargs, char **argv);
 int do_delete_subvolume(int nargs, char **argv);
 int do_create_subvol(int nargs, char **argv);
 int do_fssync(int nargs, char **argv);
+int do_start_sync(int nargs, char **argv);
+int do_wait_sync(int nargs, char **argv);
 int do_defrag(int argc, char **argv);
 int do_show_filesystem(int nargs, char **argv);
 int do_add_volume(int nargs, char **args);
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/6] Btrfs: fix deadlock in btrfs_commit_transaction

2010-10-26 Thread Sage Weil
On Tue, 26 Oct 2010, liubo wrote:
 On 10/26/2010 03:07 AM, Sage Weil wrote:
  We calculate timeout (either 1 or MAX_SCHEDULE_TIMEOUT) based on whether
  num_writers  1 or should_grow at the top of the loop.  Then, much much
  later, we wait for that timeout if either num_writers or should_grow is
  true.  However, it's possible for a racing process (calling
  btrfs_end_transaction()) to decrement num_writers such that we wait
  forever instead of for 1.
  
 
 IMO, there still exists a deadlock with your patch.
 ===
 with your patch:
 
 thread 1:  thread 2:   
 
 btrfs_commit_transaction()
   if (num_writers  1)
 timeout = MAX_TIMEOUT;
(This bit goes away, btw.)
   -
__btrfs_end_transaction()
  num_writers--;
if (wq)  
wake_up();
   -
   smp_mb();
   prepare_wait();
   if (num_writers  1)
 schedule_timeout(MAX);
   else if (should_grow)
 schedule_timeout(1);
 ===

What's the problem above?  The wake_up() doesn't get called, and thread1 
doesn't sleep.

 thread2 also needs a memory_barrier, for without memory_barrier, 
 on some CPUs, if (wq) may be executed before num_writers--, like
 ===
 thread 1:  thread 2:   
 
 btrfs_commit_transaction()
   if (num_writers  1)
 timeout = MAX_TIMEOUT;
(This bit is gone)

   -
__btrfs_end_transaction()
  if (wq)  
wake_up();
   -
   smp_mb();
   prepare_wait();
   if (num_writers  1)
 schedule_timeout(MAX);
   else if (should_grow)
 schedule_timeout(1); 
   --
 num_writers--;
 ===
 then, thread1 may wait forever.
 
 Since wake_up() itself provides a implied wmb, and a wq active check, 
 it is better to drop if (wq) in __btrfs_end_transaction().

I see.  It could also be

smb_mb();
if (wq)
wake_up();

but just calling wake_up() unconditionally is simpler, and fewer barriers 
in the wake_up case.  I'm not attached to the if (wq); I just kept it 
because it was there already.

Chris?

Thanks!
sage





 
 
 thanks,
 liubo
 
  Fix this by deciding how long to wait when we wait.
  
  Signed-off-by: Sage Weil s...@newdream.net
  ---
   fs/btrfs/transaction.c |   12 
   1 files changed, 4 insertions(+), 8 deletions(-)
  
  diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
  index 66e4c66..bf399ea 100644
  --- a/fs/btrfs/transaction.c
  +++ b/fs/btrfs/transaction.c
  @@ -992,7 +992,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
  *trans,
   struct btrfs_root *root)
   {
  unsigned long joined = 0;
  -   unsigned long timeout = 1;
  struct btrfs_transaction *cur_trans;
  struct btrfs_transaction *prev_trans = NULL;
  DEFINE_WAIT(wait);
  @@ -1063,11 +1062,6 @@ int btrfs_commit_transaction(struct 
  btrfs_trans_handle *trans,
  snap_pending = 1;
   
  WARN_ON(cur_trans != trans-transaction);
  -   if (cur_trans-num_writers  1)
  -   timeout = MAX_SCHEDULE_TIMEOUT;
  -   else if (should_grow)
  -   timeout = 1;
  -
  mutex_unlock(root-fs_info-trans_mutex);
   
  if (flush_on_commit || snap_pending) {
  @@ -1089,8 +1083,10 @@ int btrfs_commit_transaction(struct 
  btrfs_trans_handle *trans,
  TASK_UNINTERRUPTIBLE);
   
  smp_mb();
  -   if (cur_trans-num_writers  1 || should_grow)
  -   schedule_timeout(timeout);
  +   if (cur_trans-num_writers  1)
  +   schedule_timeout(MAX_SCHEDULE_TIMEOUT);
  +   else if (should_grow)
  +   schedule_timeout(1);
   
  mutex_lock(root-fs_info-trans_mutex);
  finish_wait(cur_trans-writer_wait, wait);
 
 
 
 
 --
 To unsubscribe from this list: send the line unsubscribe linux-btrfs in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: allow subvol deletion by unprivileged user with -o user_subvol_rm_allowed

2010-10-26 Thread Sage Weil
On Tue, 26 Oct 2010, Goffredo Baroncelli wrote:
  inode = dentry-d_inode;
  +   dest = BTRFS_I(inode)-root;
  +   if (!capable(CAP_SYS_ADMIN)){
  +   /*
  +* Regular user.  Only allow this with a special mount
  +* option, and when rmdir(2) would have been allowed.
  +*
  +* Note that this is _not_ check that the subvol is
  +* empty or doesn't contain data that we wouldn't
  +* otherwise be able to delete.
  +*
  +* Users who want to delete empty subvols should try
  +* rmdir(2).
  +*/
  +   err = -EPERM;
  +   if (!btrfs_test_opt(root, USER_SUBVOL_RM_ALLOWED))
  +   goto out_dput;
  +
  +   /*
  +* Do not allow deletion if the parent dir is the same
  +* as the dir to be deleted.  That means the ioctl
  +* must be called on the dentry referencing the root
  +* of the subvol, not a random directory contained
  +* within it.
  +*/
  +   err = -EINVAL;
  +   if (root == dest)
  +   goto out_dput;
  +
  +   /* check if subvolume may be deleted by a non-root user */  
  +   err = btrfs_may_delete(dir, dentry, 1);
  +   if (err)
  +   goto out_dput;
 
 If I read correctly, an user now is capable to remove a not owned 
 subvolume. 
 Is this the intended behavior ?
 I am not arguing against, but I want to highlight this fact.

Good point.  This was mirroring the rmdir(2) checks, but given that we can 
remove a non-empty subvol, we should add

+   err = inode_permission(inode, MAY_WRITE | MAY_EXEC);
+   if (err)
+   goto out_dput;

as well.  I'll resend.

Thanks!
sage


 
 
  +   }
  +
  if (inode-i_ino != BTRFS_FIRST_FREE_OBJECTID) {
  err = -EINVAL;
  goto out_dput;
  }
   
  -   dest = BTRFS_I(inode)-root;
  -
  mutex_lock(inode-i_mutex);
  err = d_invalidate(dentry);
  if (err)
  diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
  index 4da2680..8ff5a3a 100644
  --- a/fs/btrfs/super.c
  +++ b/fs/btrfs/super.c
  @@ -68,7 +68,7 @@ enum {
  Opt_nodatacow, Opt_max_inline, Opt_alloc_start, Opt_nobarrier, 
 Opt_ssd,
  Opt_nossd, Opt_ssd_spread, Opt_thread_pool, Opt_noacl, Opt_compress,
  Opt_compress_force, Opt_notreelog, Opt_ratio, Opt_flushoncommit,
  -   Opt_discard, Opt_err,
  +   Opt_discard, Opt_user_subvol_rm_allowed, Opt_err,
   };
   
   static match_table_t tokens = {
  @@ -92,6 +92,7 @@ static match_table_t tokens = {
  {Opt_flushoncommit, flushoncommit},
  {Opt_ratio, metadata_ratio=%d},
  {Opt_discard, discard},
  +   {Opt_user_subvol_rm_allowed, user_subvol_rm_allowed},
  {Opt_err, NULL},
   };
   
  @@ -235,6 +236,9 @@ int btrfs_parse_options(struct btrfs_root *root, char 
 *options)
  case Opt_discard:
  btrfs_set_opt(info-mount_opt, DISCARD);
  break;
  +   case Opt_user_subvol_rm_allowed:
  +   btrfs_set_opt(info-mount_opt, 
 USER_SUBVOL_RM_ALLOWED);
  +   break;
  case Opt_err:
  printk(KERN_INFO btrfs: unrecognized mount option 
 '%s'\n, p);
  -- 
  1.6.6.1
  
  --
  To unsubscribe from this list: send the line unsubscribe linux-btrfs in
  the body of a message to majord...@vger.kernel.org
  More majordomo info at  http://vger.kernel.org/majordomo-info.html
  
 
 
 -- 
 gpg key@ keyserver.linux.it: Goffredo Baroncelli (ghigo) kreij...@inwind.it
 Key fingerprint = 4769 7E51 5293 D36C 814E  C054 BF04 F161 3DC5 0512
 
 
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] Btrfs: fix deadlock in btrfs_commit_transaction

2010-10-26 Thread Sage Weil
We calculate timeout (either 1 or MAX_SCHEDULE_TIMEOUT) based on whether
num_writers  1 or should_grow at the top of the loop.  Then, much much
later, we wait for that timeout if either num_writers or should_grow is
true.  However, it's possible for a racing process (calling
btrfs_end_transaction()) to decrement num_writers such that we wait
forever instead of for 1.

Fix this by deciding how long to wait when we wait.  Include a smp_mb()
before checking if the waitqueue is active to ensure the num_writers
is visible.

Signed-off-by: Sage Weil s...@newdream.net
---
v2:
 - add smp_mb() before waitqueue_active() check to clone another possible race

 fs/btrfs/transaction.c |   13 +
 1 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 66e4c66..b461fe3 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -392,6 +392,7 @@ static int __btrfs_end_transaction(struct 
btrfs_trans_handle *trans,
WARN_ON(cur_trans-num_writers  1);
cur_trans-num_writers--;
 
+   smp_mb();
if (waitqueue_active(cur_trans-writer_wait))
wake_up(cur_trans-writer_wait);
put_transaction(cur_trans);
@@ -992,7 +993,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans,
 struct btrfs_root *root)
 {
unsigned long joined = 0;
-   unsigned long timeout = 1;
struct btrfs_transaction *cur_trans;
struct btrfs_transaction *prev_trans = NULL;
DEFINE_WAIT(wait);
@@ -1063,11 +1063,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans,
snap_pending = 1;
 
WARN_ON(cur_trans != trans-transaction);
-   if (cur_trans-num_writers  1)
-   timeout = MAX_SCHEDULE_TIMEOUT;
-   else if (should_grow)
-   timeout = 1;
-
mutex_unlock(root-fs_info-trans_mutex);
 
if (flush_on_commit || snap_pending) {
@@ -1089,8 +1084,10 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans,
TASK_UNINTERRUPTIBLE);
 
smp_mb();
-   if (cur_trans-num_writers  1 || should_grow)
-   schedule_timeout(timeout);
+   if (cur_trans-num_writers  1)
+   schedule_timeout(MAX_SCHEDULE_TIMEOUT);
+   else if (should_grow)
+   schedule_timeout(1);
 
mutex_lock(root-fs_info-trans_mutex);
finish_wait(cur_trans-writer_wait, wait);
-- 
1.6.6.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] Btrfs: allow subvol deletion by unprivileged user with -o user_subvol_rm_allowed

2010-10-26 Thread Sage Weil
Add a mount option user_subvol_rm_allowed that allows users to delete a
(potentially non-empty!) subvol when they would otherwise we allowed to do
an rmdir(2).  We duplicate the may_delete() checks from the core VFS code
to implement identical security checks (minus the directory size check).
We additionally require that the user has write+exec permission on the
subvol root inode.

Signed-off-by: Sage Weil s...@newdream.net
---
v2:
 - add write+exec check on subvol root inode

 fs/btrfs/ctree.h |1 +
 fs/btrfs/ioctl.c |  115 +++--
 fs/btrfs/super.c |6 ++-
 3 files changed, 116 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4c3c20b..f1f4155 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1193,6 +1193,7 @@ struct btrfs_root {
 #define BTRFS_MOUNT_NOSSD  (1  9)
 #define BTRFS_MOUNT_DISCARD(1  10)
 #define BTRFS_MOUNT_FORCE_COMPRESS  (1  11)
+#define BTRFS_MOUNT_USER_SUBVOL_RM_ALLOWED (1  12)
 
 #define btrfs_clear_opt(o, opt)((o) = ~BTRFS_MOUNT_##opt)
 #define btrfs_set_opt(o, opt)  ((o) |= BTRFS_MOUNT_##opt)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 9cbda86..7c5e858 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -409,6 +409,76 @@ fail:
return ret;
 }
 
+/*  copy of check_sticky in fs/namei.c() 
+* It's inline, so penalty for filesystems that don't use sticky bit is
+* minimal.
+*/
+static inline int btrfs_check_sticky(struct inode *dir, struct inode *inode)
+{
+   uid_t fsuid = current_fsuid();
+
+   if (!(dir-i_mode  S_ISVTX))
+   return 0;
+   if (inode-i_uid == fsuid)
+   return 0;
+   if (dir-i_uid == fsuid)
+   return 0;
+   return !capable(CAP_FOWNER);
+}
+
+/*  copy of may_delete in fs/namei.c() 
+ * Check whether we can remove a link victim from directory dir, check
+ *  whether the type of victim is right.
+ *  1. We can't do it if dir is read-only (done in permission())
+ *  2. We should have write and exec permissions on dir
+ *  3. We can't remove anything from append-only dir
+ *  4. We can't do anything with immutable dir (done in permission())
+ *  5. If the sticky bit on dir is set we should either
+ * a. be owner of dir, or
+ * b. be owner of victim, or
+ * c. have CAP_FOWNER capability
+ *  6. If the victim is append-only or immutable we can't do antyhing with
+ * links pointing to it.
+ *  7. If we were asked to remove a directory and victim isn't one - ENOTDIR.
+ *  8. If we were asked to remove a non-directory and victim isn't one - 
EISDIR.
+ *  9. We can't remove a root or mountpoint.
+ * 10. We don't allow removal of NFS sillyrenamed files; it's handled by
+ * nfs_async_unlink().
+ */
+
+static int btrfs_may_delete(struct inode *dir,struct dentry *victim,int isdir)
+{
+   int error;
+
+   if (!victim-d_inode)
+   return -ENOENT;
+
+   BUG_ON(victim-d_parent-d_inode != dir);
+   audit_inode_child(victim, dir);
+
+   error = inode_permission(dir, MAY_WRITE | MAY_EXEC);
+   if (error)
+   return error;
+   if (IS_APPEND(dir))
+   return -EPERM;
+   if (btrfs_check_sticky(dir, victim-d_inode)||
+   IS_APPEND(victim-d_inode)||
+   IS_IMMUTABLE(victim-d_inode) || IS_SWAPFILE(victim-d_inode))
+   return -EPERM;
+   if (isdir) {
+   if (!S_ISDIR(victim-d_inode-i_mode))
+   return -ENOTDIR;
+   if (IS_ROOT(victim))
+   return -EBUSY;
+   } else if (S_ISDIR(victim-d_inode-i_mode))
+   return -EISDIR;
+   if (IS_DEADDIR(dir))
+   return -ENOENT;
+   if (victim-d_flags  DCACHE_NFSFS_RENAMED)
+   return -EBUSY;
+   return 0;
+}
+
 /* copy of may_create in fs/namei.c() */
 static inline int btrfs_may_create(struct inode *dir, struct dentry *child)
 {
@@ -1288,9 +1358,6 @@ static noinline int btrfs_ioctl_snap_destroy(struct file 
*file,
int ret;
int err = 0;
 
-   if (!capable(CAP_SYS_ADMIN))
-   return -EPERM;
-
vol_args = memdup_user(arg, sizeof(*vol_args));
if (IS_ERR(vol_args))
return PTR_ERR(vol_args);
@@ -1320,13 +1387,51 @@ static noinline int btrfs_ioctl_snap_destroy(struct 
file *file,
}
 
inode = dentry-d_inode;
+   dest = BTRFS_I(inode)-root;
+   if (!capable(CAP_SYS_ADMIN)){
+   /*
+* Regular user.  Only allow this with a special mount
+* option, when the user has write+exec access to the
+* subvol root, and when rmdir(2) would have been
+* allowed.
+*
+* Note that this is _not_ check that the subvol is
+* empty or doesn't contain data that we wouldn't
+* otherwise be able to delete

[PATCH 0/6] Btrfs commit fixes, async subvol operations

2010-10-25 Thread Sage Weil
Hi Chris,

This is the extent of my current queue of Btrfs snapshot/subvol/commit 
stuff. Most of these were posted several months ago.  Can be sent 
upstream during this merge window?  Not having this functionality is 
becoming a bit of a roadblock for our efforts to keep the Ceph data in a 
consistent state.

These patches are also available from

git://git.kernel.org/pub/scm/linux/kernel/git/sage/btrfs.git snap_ioctls

The first patch is strictly a bug fix for a deadlock in 
btrfs_commit_transaction().

The next few patches are a repost (with a few minor revisions) of the 
async snapshot/subvolume ioctls I originally posted last spring.  They 
include:

 - Some async commit helper functions
 - Start and wait sync ioctls, for initiating and waiting for a sync
 - An ioctl to start a snapshot creation asynchronously, such that you 
   don't have to wait for the full commit to disk.  The interface is cleaned
   up a bit from the original version.

There is also a patch that changes the SNAP_DESTROY ioctl to not do a 
commit before returning.  The rationale here is there is no obvious 
reason (to me at least) why the snapshot removal should be on disk 
before returning; rm(2) and unlink(2) certainly don't do that.  If the 
user disagrees, they can call sync(2).  If you would prefer, I also have 
a patch that introduces a new SNAP_DESTROY_ASYNC ioctl that doesn't 
change any existing behavior.

The last item is a change to SNAP_DESTROY to allow deletion of a 
snapshot when the user owns the subvol's root inode and the parent 
directory permissions are such that we would have allowed an rmdir(2).  
Goffredo Baroncelli posted a similar patch that replicates the rmdir(2) 
semantics completely (except for the empty directory check) by 
duplicating some VFS code.  Whether we want weaker semantics, duplicated 
code, or some new EXPORT_SYMBOLS is up to you I think.  Note that this 
is distinct from a similar patch (also from Goffredo) that allows 
rmdir(2) to remove an empty subvol; my goal is to allow a non-empty 
subvol to be deleted by a non-root user.  As long as I can do that, my 
daemon doesn't have to run as root and I'm a happy camper.  :)

If anybody has any questions or issues with any of this, please let me 
know so I can revise the patches accordingly.

Thanks!
sage

---


Sage Weil (6):
  Btrfs: fix deadlock in btrfs_commit_transaction
  Btrfs: async transaction commit
  Btrfs: add START_SYNC, WAIT_SYNC ioctls
  Btrfs: add SNAP_CREATE_ASYNC ioctl
  Btrfs: make SNAP_DESTROY async
  Btrfs: allow subvol deletion by owner

 fs/btrfs/ctree.h   |1 +
 fs/btrfs/disk-io.c |1 +
 fs/btrfs/ioctl.c   |  152 +++-
 fs/btrfs/ioctl.h   |9 +++
 fs/btrfs/transaction.c |  183 +--
 fs/btrfs/transaction.h |4 +
 security/security.c|1 +
 7 files changed, 326 insertions(+), 25 deletions(-)

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/6] Btrfs: fix deadlock in btrfs_commit_transaction

2010-10-25 Thread Sage Weil
We calculate timeout (either 1 or MAX_SCHEDULE_TIMEOUT) based on whether
num_writers  1 or should_grow at the top of the loop.  Then, much much
later, we wait for that timeout if either num_writers or should_grow is
true.  However, it's possible for a racing process (calling
btrfs_end_transaction()) to decrement num_writers such that we wait
forever instead of for 1.

Fix this by deciding how long to wait when we wait.

Signed-off-by: Sage Weil s...@newdream.net
---
 fs/btrfs/transaction.c |   12 
 1 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 66e4c66..bf399ea 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -992,7 +992,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans,
 struct btrfs_root *root)
 {
unsigned long joined = 0;
-   unsigned long timeout = 1;
struct btrfs_transaction *cur_trans;
struct btrfs_transaction *prev_trans = NULL;
DEFINE_WAIT(wait);
@@ -1063,11 +1062,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans,
snap_pending = 1;
 
WARN_ON(cur_trans != trans-transaction);
-   if (cur_trans-num_writers  1)
-   timeout = MAX_SCHEDULE_TIMEOUT;
-   else if (should_grow)
-   timeout = 1;
-
mutex_unlock(root-fs_info-trans_mutex);
 
if (flush_on_commit || snap_pending) {
@@ -1089,8 +1083,10 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans,
TASK_UNINTERRUPTIBLE);
 
smp_mb();
-   if (cur_trans-num_writers  1 || should_grow)
-   schedule_timeout(timeout);
+   if (cur_trans-num_writers  1)
+   schedule_timeout(MAX_SCHEDULE_TIMEOUT);
+   else if (should_grow)
+   schedule_timeout(1);
 
mutex_lock(root-fs_info-trans_mutex);
finish_wait(cur_trans-writer_wait, wait);
-- 
1.6.6.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/6] Btrfs: async transaction commit

2010-10-25 Thread Sage Weil
Add support for an async transaction commit that is ordered such that any
subsequent operations will join the following transaction, but does not
wait until the current commit is fully on disk.  This avoids much of the
latency associated with the btrfs_commit_transaction for callers concerned
with serialization and not safety.

The wait_for_unblock flag controls whether we wait for the 'middle' portion
of commit_transaction to complete, which is necessary if the caller expects
some of the modifications contained in the commit to be available (this is
the case for subvol/snapshot creation).

Signed-off-by: Sage Weil s...@newdream.net
---
 fs/btrfs/ctree.h   |1 +
 fs/btrfs/disk-io.c |1 +
 fs/btrfs/transaction.c |  119 
 fs/btrfs/transaction.h |3 +
 4 files changed, 124 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index eaf286a..4c3c20b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -863,6 +863,7 @@ struct btrfs_fs_info {
struct btrfs_transaction *running_transaction;
wait_queue_head_t transaction_throttle;
wait_queue_head_t transaction_wait;
+   wait_queue_head_t transaction_blocked_wait;
wait_queue_head_t async_submit_wait;
 
struct btrfs_super_block super_copy;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 64f1008..034abc6 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1680,6 +1680,7 @@ struct btrfs_root *open_ctree(struct super_block *sb,
 
init_waitqueue_head(fs_info-transaction_throttle);
init_waitqueue_head(fs_info-transaction_wait);
+   init_waitqueue_head(fs_info-transaction_blocked_wait);
init_waitqueue_head(fs_info-async_submit_wait);
 
__setup_root(4096, 4096, 4096, 4096, tree_root,
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index bf399ea..d720106 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -988,6 +988,123 @@ int btrfs_transaction_blocked(struct btrfs_fs_info *info)
return ret;
 }
 
+/*
+ * wait for the current transaction commit to start and block subsequent
+ * transaction joins
+ */
+static void wait_current_trans_commit_start(struct btrfs_root *root,
+   struct btrfs_transaction *trans)
+{
+   DEFINE_WAIT(wait);
+
+   if (trans-in_commit)
+   return;
+
+   while (1) {
+   prepare_to_wait(root-fs_info-transaction_blocked_wait, wait,
+   TASK_UNINTERRUPTIBLE);
+   if (trans-in_commit) {
+   finish_wait(root-fs_info-transaction_blocked_wait,
+   wait);
+   break;
+   }
+   mutex_unlock(root-fs_info-trans_mutex);
+   schedule();
+   mutex_lock(root-fs_info-trans_mutex);
+   finish_wait(root-fs_info-transaction_blocked_wait, wait);
+   }
+}
+
+/*
+ * wait for the current transaction to start and then become unblocked.
+ * caller holds ref.
+ */
+static void wait_current_trans_commit_start_and_unblock(struct btrfs_root 
*root,
+struct btrfs_transaction *trans)
+{
+   DEFINE_WAIT(wait);
+
+   if (trans-commit_done || (trans-in_commit  !trans-blocked))
+   return;
+
+   while (1) {
+   prepare_to_wait(root-fs_info-transaction_wait, wait,
+   TASK_UNINTERRUPTIBLE);
+   if (trans-commit_done ||
+   (trans-in_commit  !trans-blocked)) {
+   finish_wait(root-fs_info-transaction_wait,
+   wait);
+   break;
+   }
+   mutex_unlock(root-fs_info-trans_mutex);
+   schedule();
+   mutex_lock(root-fs_info-trans_mutex);
+   finish_wait(root-fs_info-transaction_wait,
+   wait);
+   }
+}
+
+/*
+ * commit transactions asynchronously. once btrfs_commit_transaction_async
+ * returns, any subsequent transaction will not be allowed to join.
+ */
+struct btrfs_async_commit {
+   struct btrfs_trans_handle *newtrans;
+   struct btrfs_root *root;
+   struct delayed_work work;
+};
+
+static void do_async_commit(struct work_struct *work)
+{
+   struct btrfs_async_commit *ac =
+   container_of(work, struct btrfs_async_commit, work.work);
+   
+   btrfs_commit_transaction(ac-newtrans, ac-root);
+   kfree(ac);
+}
+
+int btrfs_commit_transaction_async(struct btrfs_trans_handle *trans,
+  struct btrfs_root *root,
+  int wait_for_unblock)
+{
+   struct btrfs_async_commit *ac;
+   struct btrfs_transaction *cur_trans;
+
+   ac = kmalloc(sizeof(*ac), GFP_NOFS);
+   BUG_ON(!ac);
+
+   INIT_DELAYED_WORK(ac-work

[PATCH 3/6] Btrfs: add START_SYNC, WAIT_SYNC ioctls

2010-10-25 Thread Sage Weil
START_SYNC will start a sync/commit, but not wait for it to
complete.  Any modification started after the ioctl returns is
guaranteed not to be included in the commit.  If a non-NULL
pointer is passed, the transaction id will be returned to
userspace.

WAIT_SYNC will wait for any in-progress commit to complete.  If a
transaction id is specified, the ioctl will block and then
return (success) when the specified transaction has committed.
If it has already committed when we call the ioctl, it returns
immediately.  If the specified transaction doesn't exist, it
returns EINVAL.

If no transaction id is specified, WAIT_SYNC will wait for the
currently committing transaction to finish it's commit to disk.
If there is no currently committing transaction, it returns
success.

These ioctls are useful for applications which want to impose an
ordering on when fs modifications reach disk, but do not want to
wait for the full (slow) commit process to do so.

Picky callers can take the transid returned by START_SYNC and
feed it to WAIT_SYNC, and be certain to wait only as long as
necessary for the transaction _they_ started to reach disk.

Sloppy callers can START_SYNC and WAIT_SYNC without a transid,
and provided they didn't wait too long between the calls, they
will get the same result.  However, if a second commit starts
before they call WAIT_SYNC, they may end up waiting longer for
it to commit as well.  Even so, a START_SYNC+WAIT_SYNC still
guarantees that any operation completed before the START_SYNC
reaches disk.

Signed-off-by: Sage Weil s...@newdream.net
---
 fs/btrfs/ioctl.c   |   34 +++
 fs/btrfs/ioctl.h   |2 +
 fs/btrfs/transaction.c |   52 
 fs/btrfs/transaction.h |1 +
 4 files changed, 89 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 9254b3d..6306051 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1984,6 +1984,36 @@ long btrfs_ioctl_trans_end(struct file *file)
return 0;
 }
 
+static noinline long btrfs_ioctl_start_sync(struct file *file, void __user 
*argp)
+{
+   struct btrfs_root *root = BTRFS_I(file-f_dentry-d_inode)-root;
+   struct btrfs_trans_handle *trans;
+   u64 transid;
+
+   trans = btrfs_start_transaction(root, 0);
+   transid = trans-transid;
+   btrfs_commit_transaction_async(trans, root, 0);
+
+   if (argp)
+   if (copy_to_user(argp, transid, sizeof(transid)))
+   return -EFAULT;
+   return 0;
+}
+
+static noinline long btrfs_ioctl_wait_sync(struct file *file, void __user 
*argp)
+{
+   struct btrfs_root *root = BTRFS_I(file-f_dentry-d_inode)-root;
+   u64 transid;
+
+   if (argp) {
+   if (copy_from_user(transid, argp, sizeof(transid)))
+   return -EFAULT;
+   } else {
+   transid = 0;  /* current trans */
+   }
+   return btrfs_wait_for_commit(root, transid);
+}
+
 long btrfs_ioctl(struct file *file, unsigned int
cmd, unsigned long arg)
 {
@@ -2034,6 +2064,10 @@ long btrfs_ioctl(struct file *file, unsigned int
case BTRFS_IOC_SYNC:
btrfs_sync_fs(file-f_dentry-d_sb, 1);
return 0;
+   case BTRFS_IOC_START_SYNC:
+   return btrfs_ioctl_start_sync(file, argp);
+   case BTRFS_IOC_WAIT_SYNC:
+   return btrfs_ioctl_wait_sync(file, argp);
}
 
return -ENOTTY;
diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
index 424694a..e9a2f7e 100644
--- a/fs/btrfs/ioctl.h
+++ b/fs/btrfs/ioctl.h
@@ -178,4 +178,6 @@ struct btrfs_ioctl_space_args {
 #define BTRFS_IOC_DEFAULT_SUBVOL _IOW(BTRFS_IOCTL_MAGIC, 19, u64)
 #define BTRFS_IOC_SPACE_INFO _IOWR(BTRFS_IOCTL_MAGIC, 20, \
struct btrfs_ioctl_space_args)
+#define BTRFS_IOC_START_SYNC _IOR(BTRFS_IOCTL_MAGIC, 21, __u64)
+#define BTRFS_IOC_WAIT_SYNC  _IOW(BTRFS_IOCTL_MAGIC, 22, __u64)
 #endif
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index d720106..932bc03 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -270,6 +270,58 @@ static noinline int wait_for_commit(struct btrfs_root 
*root,
return 0;
 }
 
+int btrfs_wait_for_commit(struct btrfs_root *root, u64 transid)
+{
+   struct btrfs_transaction *cur_trans = NULL, *t;
+   int ret;
+
+   mutex_lock(root-fs_info-trans_mutex);
+
+   ret = 0;
+   if (transid) {
+   if (transid = root-fs_info-last_trans_committed)
+   goto out_unlock;
+
+   /* find specified transaction */
+   list_for_each_entry(t, root-fs_info-trans_list, list) {
+   if (t-transid == transid) {
+   cur_trans = t;
+   break;
+   }
+   if (t-transid  transid)
+   break

[PATCH 5/6] Btrfs: make SNAP_DESTROY async

2010-10-25 Thread Sage Weil
There is no reason to force an immediate commit when deleting a snapshot.
Users have some expectation that space from a deleted snapshot be freed
immediately, but even if we do commit the reclaim is a background process.

If users _do_ want the deletion to be durable, they can call 'sync'.

Signed-off-by: Sage Weil s...@newdream.net
---
 fs/btrfs/ioctl.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 679b8a8..9cbda86 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1365,7 +1365,7 @@ static noinline int btrfs_ioctl_snap_destroy(struct file 
*file,
BUG_ON(ret);
}
 
-   ret = btrfs_commit_transaction(trans, root);
+   ret = btrfs_end_transaction(trans, root);
BUG_ON(ret);
inode-i_flags |= S_DEAD;
 out_up_write:
-- 
1.6.6.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 6/6] Btrfs: allow subvol deletion by owner

2010-10-25 Thread Sage Weil
Allow users to delete the snapshots/subvols they own.  When CAP_SYS_ADMIN
is not present, require that

 - the user has write and exec permission on the parent directory
 - security_inode_rmdir() doesn't object
 - the user has write and exec permission on the subvol directory
 - the user owns the subvol directory
 - the directory and subvol append flags are not set

This is a bit more strict than the requirements for 'rm -f subvol/*',
which is allowed with just 'wx' on non-owned non-sticky dirs.  It is less
strict than 'rmdir subvol', which has additional requirements if the parent
directory is sticky.  It is less strict than 'rm -rf subvol', which would
require scanning the entire subvol to ensure all content is deletable.

Signed-off-by: Sage Weil s...@newdream.net
---
 fs/btrfs/ioctl.c|   27 ---
 security/security.c |1 +
 2 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 9cbda86..90d2871 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1288,9 +1288,6 @@ static noinline int btrfs_ioctl_snap_destroy(struct file 
*file,
int ret;
int err = 0;
 
-   if (!capable(CAP_SYS_ADMIN))
-   return -EPERM;
-
vol_args = memdup_user(arg, sizeof(*vol_args));
if (IS_ERR(vol_args))
return PTR_ERR(vol_args);
@@ -1325,6 +1322,30 @@ static noinline int btrfs_ioctl_snap_destroy(struct file 
*file,
goto out_dput;
}
 
+   /*
+* Allow subvol deletion if we own the subvol and we would
+* (approximately) be allowed to rmdir it.  Strictly speaking,
+* we could possibly delete everything with fewer permissions,
+* or might require more permissions to remove all files
+* contained by the subvol, but we aren't trying to mimic
+* directory semantics perfectly.
+*/
+   if (!capable(CAP_SYS_ADMIN)) {
+   err = inode_permission(dir, MAY_WRITE | MAY_EXEC);
+   if (err)
+   goto out_dput;
+   err = security_inode_rmdir(dir, dentry);
+   if (err)
+   goto out_dput;
+   err = inode_permission(inode, MAY_WRITE | MAY_EXEC);
+   if (err)
+   goto out_dput;
+   err = -EPERM;
+   if (inode-i_uid != current_fsuid() || IS_APPEND(dir) ||
+   IS_APPEND(inode))
+   goto out_dput;
+   }
+
dest = BTRFS_I(inode)-root;
 
mutex_lock(inode-i_mutex);
diff --git a/security/security.c b/security/security.c
index c53949f..1c980ee 100644
--- a/security/security.c
+++ b/security/security.c
@@ -490,6 +490,7 @@ int security_inode_rmdir(struct inode *dir, struct dentry 
*dentry)
return 0;
return security_ops-inode_rmdir(dir, dentry);
 }
+EXPORT_SYMBOL_GPL(security_inode_rmdir);
 
 int security_inode_mknod(struct inode *dir, struct dentry *dentry, int mode, 
dev_t dev)
 {
-- 
1.6.6.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/6] Btrfs commit fixes, async subvol operations

2010-10-25 Thread Sage Weil
On Mon, 25 Oct 2010, Chris Mason wrote:
 These all look good to me and I'm pulling them in.

Great, thanks!

  The last item is a change to SNAP_DESTROY to allow deletion of a 
  snapshot when the user owns the subvol's root inode and the parent 
  directory permissions are such that we would have allowed an rmdir(2).  
  Goffredo Baroncelli posted a similar patch that replicates the rmdir(2) 
  semantics completely (except for the empty directory check) by 
  duplicating some VFS code.  Whether we want weaker semantics, duplicated 
  code, or some new EXPORT_SYMBOLS is up to you I think.  Note that this 
  is distinct from a similar patch (also from Goffredo) that allows 
  rmdir(2) to remove an empty subvol; my goal is to allow a non-empty 
  subvol to be deleted by a non-root user.  As long as I can do that, my 
  daemon doesn't have to run as root and I'm a happy camper.  :)
 
 Someone at the storage workshop mentioned that this subvol deletion
 trick is slightly stronger than rm -rf, to make it include the same
 level of permission checks would require testing all the directories in
 the tree for permissions.

I think that was me :)
 
 For now, could you please make a mount -o user_subvol_rm_allowed option?
 (or something similar with a better name).

Sure.  

Do you have a preference as far as what checks are implemented?  My patch 
implemented a simplified approximation of may_rmdir(); Goffredo's 
duplicated the vfs checks.  I guess I'm leaning toward the latter...

Thanks!
sage
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: allow subvol deletion by unprivileged user with -o user_subvol_rm_allowed

2010-10-25 Thread Sage Weil
Add a mount option user_subvol_rm_allowed that allows users to delete a
(potentially non-empty!) subvol when they would otherwise we allowed to do
an rmdir(2).  We duplicate the may_delete() checks from the core VFS code
to implement identical security checks (minus the directory size check).

Signed-off-by: Sage Weil s...@newdream.net
---
 fs/btrfs/ctree.h |1 +
 fs/btrfs/ioctl.c |  109 +++--
 fs/btrfs/super.c |6 ++-
 3 files changed, 110 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 5ac2bca..140003e 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1196,6 +1196,7 @@ struct btrfs_root {
 #define BTRFS_MOUNT_NOSSD  (1  9)
 #define BTRFS_MOUNT_DISCARD(1  10)
 #define BTRFS_MOUNT_FORCE_COMPRESS  (1  11)
+#define BTRFS_MOUNT_USER_SUBVOL_RM_ALLOWED (1  12)
 
 #define btrfs_clear_opt(o, opt)((o) = ~BTRFS_MOUNT_##opt)
 #define btrfs_set_opt(o, opt)  ((o) |= BTRFS_MOUNT_##opt)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 906e3b3..919b23f 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -409,6 +409,76 @@ fail:
return ret;
 }
 
+/*  copy of check_sticky in fs/namei.c() 
+* It's inline, so penalty for filesystems that don't use sticky bit is
+* minimal.
+*/
+static inline int btrfs_check_sticky(struct inode *dir, struct inode *inode)
+{
+   uid_t fsuid = current_fsuid();
+
+   if (!(dir-i_mode  S_ISVTX))
+   return 0;
+   if (inode-i_uid == fsuid)
+   return 0;
+   if (dir-i_uid == fsuid)
+   return 0;
+   return !capable(CAP_FOWNER);
+}
+
+/*  copy of may_delete in fs/namei.c() 
+ * Check whether we can remove a link victim from directory dir, check
+ *  whether the type of victim is right.
+ *  1. We can't do it if dir is read-only (done in permission())
+ *  2. We should have write and exec permissions on dir
+ *  3. We can't remove anything from append-only dir
+ *  4. We can't do anything with immutable dir (done in permission())
+ *  5. If the sticky bit on dir is set we should either
+ * a. be owner of dir, or
+ * b. be owner of victim, or
+ * c. have CAP_FOWNER capability
+ *  6. If the victim is append-only or immutable we can't do antyhing with
+ * links pointing to it.
+ *  7. If we were asked to remove a directory and victim isn't one - ENOTDIR.
+ *  8. If we were asked to remove a non-directory and victim isn't one - 
EISDIR.
+ *  9. We can't remove a root or mountpoint.
+ * 10. We don't allow removal of NFS sillyrenamed files; it's handled by
+ * nfs_async_unlink().
+ */
+
+static int btrfs_may_delete(struct inode *dir,struct dentry *victim,int isdir)
+{
+   int error;
+
+   if (!victim-d_inode)
+   return -ENOENT;
+
+   BUG_ON(victim-d_parent-d_inode != dir);
+   audit_inode_child(victim, dir);
+
+   error = inode_permission(dir, MAY_WRITE | MAY_EXEC);
+   if (error)
+   return error;
+   if (IS_APPEND(dir))
+   return -EPERM;
+   if (btrfs_check_sticky(dir, victim-d_inode)||
+   IS_APPEND(victim-d_inode)||
+   IS_IMMUTABLE(victim-d_inode) || IS_SWAPFILE(victim-d_inode))
+   return -EPERM;
+   if (isdir) {
+   if (!S_ISDIR(victim-d_inode-i_mode))
+   return -ENOTDIR;
+   if (IS_ROOT(victim))
+   return -EBUSY;
+   } else if (S_ISDIR(victim-d_inode-i_mode))
+   return -EISDIR;
+   if (IS_DEADDIR(dir))
+   return -ENOENT;
+   if (victim-d_flags  DCACHE_NFSFS_RENAMED)
+   return -EBUSY;
+   return 0;
+}
+
 /* copy of may_create in fs/namei.c() */
 static inline int btrfs_may_create(struct inode *dir, struct dentry *child)
 {
@@ -1288,9 +1358,6 @@ static noinline int btrfs_ioctl_snap_destroy(struct file 
*file,
int ret;
int err = 0;
 
-   if (!capable(CAP_SYS_ADMIN))
-   return -EPERM;
-
vol_args = memdup_user(arg, sizeof(*vol_args));
if (IS_ERR(vol_args))
return PTR_ERR(vol_args);
@@ -1320,13 +1387,45 @@ static noinline int btrfs_ioctl_snap_destroy(struct 
file *file,
}
 
inode = dentry-d_inode;
+   dest = BTRFS_I(inode)-root;
+   if (!capable(CAP_SYS_ADMIN)){
+   /*
+* Regular user.  Only allow this with a special mount
+* option, and when rmdir(2) would have been allowed.
+*
+* Note that this is _not_ check that the subvol is
+* empty or doesn't contain data that we wouldn't
+* otherwise be able to delete.
+*
+* Users who want to delete empty subvols should try
+* rmdir(2).
+*/
+   err = -EPERM;
+   if (!btrfs_test_opt(root, USER_SUBVOL_RM_ALLOWED

Re: [PATCH 0/6] Btrfs commit fixes, async subvol operations

2010-10-25 Thread Sage Weil
On Mon, 25 Oct 2010, Chris Mason wrote:
 Yes, lets duplicate the vfs checks.  Christoph just sat bolt upright in
 whatever ski lift he's currently riding.

:)

This version is based on Goffredo's patch, but requires the 
user_subvol_rm_allowed mount option, and will proceed even if the subvol 
is nonempty.  (I suspect we also want his other rmdir(2) patch at some 
point as well so that users can rmdir an empty subvol and/or 'rm -r'.)

 We should also make sure they do the subvol rm against the root of the
 subvol (if that check isn't already done), none of the magic to resolve
 the subvol based on any file inside it.  I don't want people to
 accidentally think they are deleting a subdir and have it go higher up
 into the directory tree.
 
 Oh, and it shouldn't work on the root of the FS either ;)

Since the ioctl is based on a name lookup, I added a check that the parent 
inode's root isn't the same as the target root.  That implies the ioctl is 
called the dentry referencing the subvol, and presumably the root doesn't 
have one of those.  Does that cover it?  It isn't possible to have the 
subvol bound to multiple locations in the namespace, right?

$ whoami
sage
$ btrfs sub create a
Create subvolume './a'
$ btrfs sub create a/b
Create subvolume 'a/b'
$ btrfs sub snap a asnap
Create a snapshot of 'a' in './asnap'
$ btrfs sub del asnap/b 
ERROR: 'asnap/b' is not a subvolume
$ btrfs sub del a
Delete subvolume '/mnt/osd27/a/a'
ERROR: cannot delete '/mnt/osd27/a/a'
$ btrfs sub del a/b
Delete subvolume '/mnt/osd27/a/a/b'
$ btrfs sub del a  
Delete subvolume '/mnt/osd27/a/a'
$ btrfs sub del asnap
Delete subvolume '/mnt/osd27/a/asnap'

Sending the patch separately.

Thanks!
sage
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/3] clone ioctl fixes

2010-10-19 Thread Sage Weil
Hi Chris,

These are a few critical bug fixes for the clone ioctl.  We've been 
hitting these on a number of different systems, so getting these into 
2.6.37 would be much appreciated.

Thanks!
sage

--

Sage Weil (1):
  Btrfs: fix lockdep warning on clone ioctl

Yehuda Sadeh (2):
  Btrfs: fix delalloc checks in clone ioctl
  Btrfs: fix clone ioctl where range is adjacent to extent

 fs/btrfs/ioctl.c |   18 ++
 1 files changed, 10 insertions(+), 8 deletions(-)

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/3] Btrfs: fix delalloc checks in clone ioctl

2010-10-19 Thread Sage Weil
From: Yehuda Sadeh yeh...@hq.newdream.net

The lookup_first_ordered_extent() was done on the wrong inode, and the 
-delalloc_bytes test was wrong, as the following 
btrfs_wait_ordered_range() would only invoke a range write and wouldn't 
write the entire file data range. Also, a bad parameter was passed to 
btrfs_wait_ordered_range().

Signed-off-by: Yehuda Sadeh yeh...@hq.newdream.net
---
 fs/btrfs/ioctl.c |8 +---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 9254b3d..3471b22 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1530,13 +1530,15 @@ static noinline long btrfs_ioctl_clone(struct file 
*file, unsigned long srcfd,
while (1) {
struct btrfs_ordered_extent *ordered;
lock_extent(BTRFS_I(src)-io_tree, off, off+len, GFP_NOFS);
-   ordered = btrfs_lookup_first_ordered_extent(inode, off+len);
-   if (BTRFS_I(src)-delalloc_bytes == 0  !ordered)
+   ordered = btrfs_lookup_first_ordered_extent(src, off+len);
+   if (!ordered 
+   !test_range_bit(BTRFS_I(src)-io_tree, off, off+len,
+  EXTENT_DELALLOC, 0, NULL))
break;
unlock_extent(BTRFS_I(src)-io_tree, off, off+len, GFP_NOFS);
if (ordered)
btrfs_put_ordered_extent(ordered);
-   btrfs_wait_ordered_range(src, off, off+len);
+   btrfs_wait_ordered_range(src, off, len);
}
 
/* clone data */
-- 
1.7.0.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/3] Btrfs: fix clone ioctl where range is adjacent to extent

2010-10-19 Thread Sage Weil
From: Yehuda Sadeh yeh...@hq.newdream.net

We had an edge case issue where the requested range was just
following an existing extent. Instead of skipping to the next
extent, we used the previous one which lead to having zero
sized extents.

Signed-off-by: Yehuda Sadeh yeh...@hq.newdream.net
---
 fs/btrfs/ioctl.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 3471b22..f4a3dde 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1607,7 +1607,7 @@ static noinline long btrfs_ioctl_clone(struct file *file, 
unsigned long srcfd,
}
btrfs_release_path(root, path);
 
-   if (key.offset + datal  off ||
+   if (key.offset + datal = off ||
key.offset = off+len)
goto next;
 
-- 
1.7.0.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/3] Btrfs: fix lockdep warning on clone ioctl

2010-10-19 Thread Sage Weil
I'm no lockdep expert, but this appears to make the lockdep warning go
away for the i_mutex locking in the clone ioctl.

Signed-off-by: Sage Weil s...@newdream.net
---
 fs/btrfs/ioctl.c |8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index f4a3dde..3f27529 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1502,11 +1502,11 @@ static noinline long btrfs_ioctl_clone(struct file 
*file, unsigned long srcfd,
path-reada = 2;
 
if (inode  src) {
-   mutex_lock(inode-i_mutex);
-   mutex_lock(src-i_mutex);
+   mutex_lock_nested(inode-i_mutex, I_MUTEX_PARENT);
+   mutex_lock_nested(src-i_mutex, I_MUTEX_CHILD);
} else {
-   mutex_lock(src-i_mutex);
-   mutex_lock(inode-i_mutex);
+   mutex_lock_nested(src-i_mutex, I_MUTEX_PARENT);
+   mutex_lock_nested(inode-i_mutex, I_MUTEX_CHILD);
}
 
/* determine range to clone */
-- 
1.7.0.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Allow to exec btrfs subvolume delete by a non root user

2010-10-19 Thread Sage Weil
On Mon, 18 Oct 2010, Goffredo Baroncelli wrote:

 Hi all
 
 like my previous patch, this one allow to remove a subvolume by an ordinary
  user. Instead of adding this capability to the rmdir(2) syscall, I update the
 BTRFS_IOC_SNAP_DESTROY ioctl, relaxing the rules to be execute.
 The checks are the ones performed by the rmdir(2) syscall. So a 
 subvolume must be empty to be removed by a non-root user. I think that this
  increases a lot the usefulness of the snapshot/subvolume.
 
 It is possible to pull the code from the branch named rm-subvolume-not-root 
 of the following repository:
 
   http://cassiopea.homelinux.net/git/btrfs-unstable.git  
 
 Comments are welcome.

This looks okay to me.  I posted a similar patch a while back[1], but 
didn't want to duplicate the check_sticky and may_delete code and 
implemented a simpler set of checks instead.  The full checks are probably 
a better route, although it would be nice if we could avoid duplicating 
the VFS checks in the process.  Whether those helpers should be exported 
is someone else's call, though.  (The only other may_ functions that are 
exported are may_umount and may_umount_tree.)

sage

[1] http://marc.info/?l=linux-btrfsm=128086492512628w=2


 
 Reagrds
 G.Baroncelli
 
 diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
 index 9254b3d..5bbb6bc 100644
 --- a/fs/btrfs/ioctl.c
 +++ b/fs/btrfs/ioctl.c
 @@ -395,6 +395,76 @@ fail:
   return ret;
  }
  
 +/*  copy of check_sticky in fs/namei.c() 
 +* It's inline, so penalty for filesystems that don't use sticky bit is
 +* minimal.
 +*/
 +static inline int btrfs_check_sticky(struct inode *dir, struct inode *inode)
 +{
 + uid_t fsuid = current_fsuid();
 +
 + if (!(dir-i_mode  S_ISVTX))
 + return 0;
 + if (inode-i_uid == fsuid)
 + return 0;
 + if (dir-i_uid == fsuid)
 + return 0;
 + return !capable(CAP_FOWNER);
 +}
 +
 +/*  copy of may_delete in fs/namei.c() 
 + *   Check whether we can remove a link victim from directory dir, check
 + *  whether the type of victim is right.
 + *  1. We can't do it if dir is read-only (done in permission())
 + *  2. We should have write and exec permissions on dir
 + *  3. We can't remove anything from append-only dir
 + *  4. We can't do anything with immutable dir (done in permission())
 + *  5. If the sticky bit on dir is set we should either
 + *   a. be owner of dir, or
 + *   b. be owner of victim, or
 + *   c. have CAP_FOWNER capability
 + *  6. If the victim is append-only or immutable we can't do antyhing with
 + * links pointing to it.
 + *  7. If we were asked to remove a directory and victim isn't one - ENOTDIR.
 + *  8. If we were asked to remove a non-directory and victim isn't one - 
 EISDIR.
 + *  9. We can't remove a root or mountpoint.
 + * 10. We don't allow removal of NFS sillyrenamed files; it's handled by
 + * nfs_async_unlink().
 + */
 +
 +static int btrfs_may_delete(struct inode *dir,struct dentry *victim,int 
 isdir)
 +{
 + int error;
 +
 + if (!victim-d_inode)
 + return -ENOENT;
 +
 + BUG_ON(victim-d_parent-d_inode != dir);
 + audit_inode_child(victim, dir);
 +
 + error = inode_permission(dir, MAY_WRITE | MAY_EXEC);
 + if (error)
 + return error;
 + if (IS_APPEND(dir))
 + return -EPERM;
 + if (btrfs_check_sticky(dir, victim-d_inode)||
 + IS_APPEND(victim-d_inode)||
 + IS_IMMUTABLE(victim-d_inode) || IS_SWAPFILE(victim-d_inode))
 + return -EPERM;
 + if (isdir) {
 + if (!S_ISDIR(victim-d_inode-i_mode))
 + return -ENOTDIR;
 + if (IS_ROOT(victim))
 + return -EBUSY;
 + } else if (S_ISDIR(victim-d_inode-i_mode))
 + return -EISDIR;
 + if (IS_DEADDIR(dir))
 + return -ENOENT;
 + if (victim-d_flags  DCACHE_NFSFS_RENAMED)
 + return -EBUSY;
 + return 0;
 +}
 +
  /* copy of may_create in fs/namei.c() */
  static inline int btrfs_may_create(struct inode *dir, struct dentry *child)
  {
 @@ -1227,9 +1297,6 @@ static noinline int btrfs_ioctl_snap_destroy(struct 
 file *file,
   int ret;
   int err = 0;
  
 - if (!capable(CAP_SYS_ADMIN))
 - return -EPERM;
 -
   vol_args = memdup_user(arg, sizeof(*vol_args));
   if (IS_ERR(vol_args))
   return PTR_ERR(vol_args);
 @@ -1259,6 +1326,20 @@ static noinline int btrfs_ioctl_snap_destroy(struct 
 file *file,
   }
  
   inode = dentry-d_inode;
 + if (!capable(CAP_SYS_ADMIN)){
 + /* regolar user */
 + /* check if subvolume is empty */
 + if (inode-i_size  BTRFS_EMPTY_DIR_SIZE){
 + err = -ENOTEMPTY;
 + goto out_dput;
 + }
 + /* check if subvolume may be deleted by a non-root user */  
 + if (btrfs_may_delete(dir, dentry, 1)){
 + err = -EPERM;
 +

Re: Updating the wiki pages adding btrfs command

2010-09-15 Thread Sage Weil
On Wed, 15 Sep 2010, Chris Ball wrote:
 Hi,
 
 Or make --sync the default behavior.  This is probably what most
 people are expecting anyway (similar to how standard filesystem
 commands like rm work).  Add an --aysnc option for those that
 only care about knowing when the subvolume is taken out of the
 tree.
 
 Yeah.  We've also talked about making snapshot _creation_ perform an
 FS sync first by default, since otherwise you get a snapshot with stale
 files, or without files that existed (not yet on disk) at creation-time.

Actually, that was fixed in 2.6.34 (0bdb1db2).  Creating a snapshot syncs 
all dirty data and metadata to disk, so you get a fully consistent point 
in time snapshot.  No 'sync' is necessary (and doing a 'sync' yourself 
would be racy any).

For subvolume removal, racing doesn't seem like an issue.  Why not just 
do

# btrfs subvolume delete /path/to/subvolume1
# btrfs subvolume delete /path/to/subvolume2
# ...
# sync

?
sage
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Updating the wiki pages adding btrfs command

2010-09-15 Thread Sage Weil
On Wed, 15 Sep 2010, Sage Weil wrote:
 For subvolume removal, racing doesn't seem like an issue.  Why not just 
 do
 
 # btrfs subvolume delete /path/to/subvolume1
 # btrfs subvolume delete /path/to/subvolume2
 # ...
 # sync
 
 ?

Nevermind, read the whole thread this time.  I'll defer to others on the 
best way to wait for the old space to be freed.

sage
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: allow subvol deletion by owner

2010-08-03 Thread Sage Weil
Allow users to delete the snapshots/subvols they own.  Instead of
CAP_SYS_ADMIN, require that

 - the user has write and exec permission on the parent directory
 - security_inode_rmdir() doesn't object
 - the user has write and exec permission on the subvol directory
 - the user owns the subvol directory
 - the directory and subvol append flags are not set

This is a bit more strict than the requirements for 'rm -f subvol/*',
which is allowed with just 'wx' on non-owned non-sticky dirs.  It is less
strict than 'rmdir subvol', which has additional requirements if the parent
directory is sticky.  It is less strict than 'rm -rf subvol', which would
require scanning the entire subvol to ensure all content is deletable.

Signed-off-by: Sage Weil s...@newdream.net
---
 fs/btrfs/ioctl.c |   25 ++---
 1 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 9254b3d..b31283a 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1227,9 +1227,6 @@ static noinline int btrfs_ioctl_snap_destroy(struct file 
*file,
int ret;
int err = 0;
 
-   if (!capable(CAP_SYS_ADMIN))
-   return -EPERM;
-
vol_args = memdup_user(arg, sizeof(*vol_args));
if (IS_ERR(vol_args))
return PTR_ERR(vol_args);
@@ -1264,6 +1261,28 @@ static noinline int btrfs_ioctl_snap_destroy(struct file 
*file,
goto out_dput;
}
 
+   /*
+* Allow subvol deletion if we own the subvol and we would
+* (approximately) be allowed to rmdir it.  Strictly speaking,
+* we could possibly delete everything with fewer permissions,
+* or might require more permissions to remove all files
+* contained by the subvol, but we aren't trying to mimic
+* directory semantics perfectly.
+*/
+   err = inode_permission(dir, MAY_WRITE | MAY_EXEC);
+   if (err)
+   goto out_dput;
+   err = security_inode_rmdir(dir, dentry);
+   if (err)
+   goto out_dput;
+   err = inode_permission(inode, MAY_WRITE | MAY_EXEC);
+   if (err)
+   goto out_dput;
+   err = -EPERM;
+   if (inode-i_uid != current_fsuid() || IS_APPEND(dir) ||
+   IS_APPEND(inode))
+   goto out_dput;
+
dest = BTRFS_I(inode)-root;
 
mutex_lock(inode-i_mutex);
-- 
1.7.0

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: reflinked file size incorrect

2010-06-12 Thread Sage Weil
On Sat, 12 Jun 2010, Jim Ursetto wrote:
 Both `cp --reflink` and `bcp` sometimes round the file size up to the next
 4k boundary, with the balance consisting of null bytes.  At first glance
 this behavior occurs for source file size  3916 bytes.  I have tried this
 with stock btrfs from kernel 2.6.35-rc2 and 2.6.35-rc1 -- specifically,
 Ubuntu 2.6.35-2.3~lucid1-server and 2.6.35-1.1~lucid1-server.
 Any ideas?

This bug is new in 2.6.35-rc1 from a22285a6 (Btrfs: Integrate metadata 
reservation with start_transaction).  Just sent a patch fixing this up to 
the list.

sage
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: fix CLONE ioctl destination file size expansion to block boundary

2010-06-12 Thread Sage Weil
The CLONE and CLONE_RANGE ioctls round up the range of extents being
cloned to the block size when the range to clone extends to the end of file
(this is always the case with CLONE).  It was then using that offset when
extending the destination file's i_size.  Fix this by not setting i_size
beyond the originally requested ending offset.

This bug was introduced by a22285a6 (2.6.35-rc1).

Signed-off-by: Sage Weil s...@newdream.net
---
 fs/btrfs/ioctl.c |   16 +---
 1 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 4cdb98c..a687f28 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1578,6 +1578,7 @@ static noinline long btrfs_ioctl_clone(struct file *file, 
unsigned long srcfd,
u64 disko = 0, diskl = 0;
u64 datao = 0, datal = 0;
u8 comp;
+   u64 endoff;
 
size = btrfs_item_size_nr(leaf, slot);
read_extent_buffer(leaf, buf,
@@ -1712,9 +1713,18 @@ static noinline long btrfs_ioctl_clone(struct file 
*file, unsigned long srcfd,
btrfs_release_path(root, path);
 
inode-i_mtime = inode-i_ctime = CURRENT_TIME;
-   if (new_key.offset + datal  inode-i_size)
-   btrfs_i_size_write(inode,
-  new_key.offset + datal);
+
+   /*
+* we round up to the block size at eof when
+* determining which extents to clone above,
+* but shouldn't round up the file size
+*/
+   endoff = new_key.offset + datal;
+   if (endoff  off+olen)
+   endoff = off+olen;
+   if (endoff  inode-i_size)
+   btrfs_i_size_write(inode, endoff);
+
BTRFS_I(inode)-flags = BTRFS_I(src)-flags;
ret = btrfs_update_inode(trans, root, inode);
BUG_ON(ret);
-- 
1.7.0

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Fwd: Re: Linking two files together][RFC]

2010-06-09 Thread Sage Weil
On Wed, 9 Jun 2010, Roberto Ragusa wrote:
 I hope that ideas about btrfs are not off-topic for this mailing list.
 
 The forwarded message below was written by me on fedora-users.
 The thread is about the ability to link two files in a manner
 similar to cat 1 2 3  rm 1 2 while avoiding any data
 movement on the disk.
 The implementation should just put the original extents together in
 the new file. Is there any filesystem which is capable of doing that?
 As btrfs is already based on extents and COW, couldn't this feature be
 evaluated for feasibility? I think a lot of usages will be found
 for it if actually implemented.

Btrfs already has a CLONE_RANGE ioctl that will clone a range of 
(block-aligned) bytes from file A to any offset in file B.  The fs just 
fixes up the file metadata to reference the same bytes on disk without 
reading or writing any actual file data.

sage
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: avoid BUG when dropping root and reference in same transaction

2010-05-17 Thread Sage Weil
If btrfs_ioctl_snap_destroy() deletes a snapshot but finishes
with end_transaction(), the cleaner kthread may come in and
drop the root in the same transaction.  If that's the case, the
root's refs still == 1 in the tree when btrfs_del_root() deletes
the item, because commit_fs_roots() hasn't updated it yet (that
happens during the commit).

This wasn't a problem before only because
btrfs_ioctl_snap_destroy() would commit the transaction before dropping
the dentry reference, so the dead root wouldn't get queued up until
after the fs root item was updated in the btree.

Since it is not an error to drop the root reference and the root in the
same transaction, just drop the BUG_ON() in btrfs_del_root().

Signed-off-by: Sage Weil s...@newdream.net
---
 fs/btrfs/root-tree.c |3 ---
 1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/root-tree.c b/fs/btrfs/root-tree.c
index 67fa2d2..7cb5041 100644
--- a/fs/btrfs/root-tree.c
+++ b/fs/btrfs/root-tree.c
@@ -313,7 +313,6 @@ int btrfs_del_root(struct btrfs_trans_handle *trans, struct 
btrfs_root *root,
 {
struct btrfs_path *path;
int ret;
-   u32 refs;
struct btrfs_root_item *ri;
struct extent_buffer *leaf;
 
@@ -327,8 +326,6 @@ int btrfs_del_root(struct btrfs_trans_handle *trans, struct 
btrfs_root *root,
leaf = path-nodes[0];
ri = btrfs_item_ptr(leaf, path-slots[0], struct btrfs_root_item);
 
-   refs = btrfs_disk_root_refs(leaf, ri);
-   BUG_ON(refs != 0);
ret = btrfs_del_item(trans, root, path);
 out:
btrfs_free_path(path);
-- 
1.6.6.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: fix lockdep warning on clone ioctl

2010-04-07 Thread Sage Weil
I'm no lockdep expert, but this appears to make the lockdep warning go
away for the i_mutex locking in the clone ioctl.

Signed-off-by: Sage Weil s...@newdream.net
---
 fs/btrfs/ioctl.c |8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index c8e6470..ae2b9a0 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1530,11 +1530,11 @@ static noinline long btrfs_ioctl_clone(struct file 
*file, unsigned long srcfd,
path-reada = 2;
 
if (inode  src) {
-   mutex_lock(inode-i_mutex);
-   mutex_lock(src-i_mutex);
+   mutex_lock_nested(inode-i_mutex, I_MUTEX_PARENT);
+   mutex_lock_nested(src-i_mutex, I_MUTEX_CHILD);
} else {
-   mutex_lock(src-i_mutex);
-   mutex_lock(inode-i_mutex);
+   mutex_lock_nested(src-i_mutex, I_MUTEX_PARENT);
+   mutex_lock_nested(inode-i_mutex, I_MUTEX_CHILD);
}
 
/* determine range to clone */
-- 
1.6.6.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/5] asynchronous commit, snapshot ponies

2010-03-22 Thread Sage Weil
Hi everyone,

This patchset is the latest approach I'm using for the Ceph storage daemon to
keep track of which data has safely committed to disk.  The basic idea is to
not use the (problematic) user transaction ioctls at all.  Instead, the daemon
quiesces its own write requests, initiates an async snapshot, and then
continues.

The snapshot approach is nice because it provides rollback.  If something goes
wrong, we can cleanly go back to the most recent consistent commit.  The
performance is also very similar to what I was doing before (using the
'flushoncommit' mount option and tiggering a sync_fs to flush data).  The only
difference is the old snapshots stick around for a bit longer before I delete
them and the references get dropped.

The first patch introduces a generic btrfs_commit_transaction_async() helper,
which starts btrfs_commit_transaction asynchronously and returns either
when the commit starts (blocked=1) or when it has done it's dirty work
(blocked=0).  The second patch adds ioctls that let you start and wait for
an asynchronous commit.  The third introduces a SNAP_CREATE_ASYNC ioctl that
creates a snap but returns before it hits disk.

The fourth patch returns the commiting transid to userspace, so that it can be
fed to the WAIT_SYNC ioctl.  I'm not that happy with the interface, though; any
suggestions for alternatives would be great.  Alternatively, I could get by
without knowing the exact transid and it wouldn't be the end of the world.

The final patch lets you delete a snapshot/subvol reference without doing an
immediate commit (btrfs_end_transaction instead of btrfs_commit_transaction).
AFAICS there's no reason the commit has to happen immediately (user expectations
aside).

Overall I like this much better than the various user transaction proposals.
It's simpler, does the job, and the primitives should be useful for other
applications.  Let me know what you think!  I'm doing more testing this week,
but so far I haven't seen any problems with these changes.

Thanks-
sage

Sage Weil (5):
  Btrfs: async transaction commit
  Btrfs: add START_SYNC, WAIT_SYNC ioctls
  Btrfs: add SNAP_CREATE_ASYNC ioctl
  Btrfs: return transid to userspace from SNAP_CREATE_ASYNC ioctl
  btrfs: add SNAP_DESTROY_ASYNC ioctl

 fs/btrfs/ctree.h   |1 +
 fs/btrfs/disk-io.c |1 +
 fs/btrfs/ioctl.c   |   94 ++
 fs/btrfs/ioctl.h   |   10 +++-
 fs/btrfs/transaction.c |  171 
 fs/btrfs/transaction.h |4 +
 6 files changed, 265 insertions(+), 16 deletions(-)

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/5] Btrfs: async transaction commit

2010-03-22 Thread Sage Weil
Add support for an async transaction commit that is ordered such that any
subsequent operations will join the following transaction, but does not
wait until the current commit is fully on disk.  This avoids much of the
latency associated with the btrfs_commit_transaction for callers concerned
with serialization and not safety.

The wait_for_unblock flag controls whether we wait for the 'middle' portion
of commit_transaction to complete, which is necessary if the caller expects
some of the modifications contained in the commit to be available (this is
the case for subvol/snapshot creation).

Signed-off-by: Sage Weil s...@newdream.net
---
 fs/btrfs/ctree.h   |1 +
 fs/btrfs/disk-io.c |1 +
 fs/btrfs/transaction.c |  119 
 fs/btrfs/transaction.h |3 +
 4 files changed, 124 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 584..1d0d5f0 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -840,6 +840,7 @@ struct btrfs_fs_info {
struct btrfs_transaction *running_transaction;
wait_queue_head_t transaction_throttle;
wait_queue_head_t transaction_wait;
+   wait_queue_head_t transaction_blocked_wait;
wait_queue_head_t async_submit_wait;
 
struct btrfs_super_block super_copy;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 11d0ad3..6f3cb33 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1701,6 +1701,7 @@ struct btrfs_root *open_ctree(struct super_block *sb,
 
init_waitqueue_head(fs_info-transaction_throttle);
init_waitqueue_head(fs_info-transaction_wait);
+   init_waitqueue_head(fs_info-transaction_blocked_wait);
init_waitqueue_head(fs_info-async_submit_wait);
 
__setup_root(4096, 4096, 4096, 4096, tree_root,
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index e75cc84..52b5488 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -882,6 +882,123 @@ int btrfs_transaction_in_commit(struct btrfs_fs_info 
*info)
return ret;
 }
 
+/*
+ * wait for the current transaction commit to start and block subsequent
+ * transaction joins
+ */
+static void wait_current_trans_commit_start(struct btrfs_root *root,
+   struct btrfs_transaction *trans)
+{
+   DEFINE_WAIT(wait);
+
+   if (trans-in_commit)
+   return;
+
+   while (1) {
+   prepare_to_wait(root-fs_info-transaction_blocked_wait, wait,
+   TASK_UNINTERRUPTIBLE);
+   if (trans-in_commit) {
+   finish_wait(root-fs_info-transaction_blocked_wait,
+   wait);
+   break;
+   }
+   mutex_unlock(root-fs_info-trans_mutex);
+   schedule();
+   mutex_lock(root-fs_info-trans_mutex);
+   finish_wait(root-fs_info-transaction_blocked_wait, wait);
+   }
+}
+
+/*
+ * wait for the current transaction to start and then become unblocked.
+ * caller holds ref.
+ */
+static void wait_current_trans_commit_start_and_unblock(struct btrfs_root 
*root,
+struct btrfs_transaction *trans)
+{
+   DEFINE_WAIT(wait);
+
+   if (trans-commit_done || (trans-in_commit  !trans-blocked))
+   return;
+
+   while (1) {
+   prepare_to_wait(root-fs_info-transaction_wait, wait,
+   TASK_UNINTERRUPTIBLE);
+   if (trans-commit_done ||
+   (trans-in_commit  !trans-blocked)) {
+   finish_wait(root-fs_info-transaction_wait,
+   wait);
+   break;
+   }
+   mutex_unlock(root-fs_info-trans_mutex);
+   schedule();
+   mutex_lock(root-fs_info-trans_mutex);
+   finish_wait(root-fs_info-transaction_wait,
+   wait);
+   }
+}
+
+/*
+ * commit transactions asynchronously. once btrfs_commit_transaction_async
+ * returns, any subsequent transaction will not be allowed to join.
+ */
+struct btrfs_async_commit {
+   struct btrfs_trans_handle *newtrans;
+   struct btrfs_root *root;
+   struct delayed_work work;
+};
+
+static void do_async_commit(struct work_struct *work)
+{
+   struct btrfs_async_commit *ac =
+   container_of(work, struct btrfs_async_commit, work.work);
+   
+   btrfs_commit_transaction(ac-newtrans, ac-root);
+   kfree(ac);
+}
+
+int btrfs_commit_transaction_async(struct btrfs_trans_handle *trans,
+  struct btrfs_root *root,
+  int wait_for_unblock)
+{
+   struct btrfs_async_commit *ac;
+   struct btrfs_transaction *cur_trans;
+
+   ac = kmalloc(sizeof(*ac), GFP_NOFS);
+   BUG_ON(!ac);
+
+   INIT_DELAYED_WORK(ac-work

  1   2   >