Re: Atomic replacement of subvolumes is not possible

2010-06-30 Thread Chris Mason
On Sun, Jun 27, 2010 at 07:44:12PM -0500, C Anthony Risinger wrote:
 On Sat, Jun 26, 2010 at 12:25 PM, Daniel Baumann dan...@debian.org wrote:
  Hi,
 
  this is basically a forward from
  http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=587253
 
  rename(2) allows for the atomic replacement of files.  Being able to
  atomically replace subvolume snapshots would be equally invaluable,
  since it would permit lock-free replacement of subvolumes.
 
   % btrfs subvolume snapshot src dest
 
  creates dest as a snapshot of src. However, if I want to do the
  converse,
 
   % btrfs subvolume snapshot dest src
 
  then dest is snapshotted as src/dest, i.e. not replacing the
  original subvolume, but going inside the original subvolume.
 
  Use case 1:
   I have a subvolume of data under active use, which I want to
   periodically update.  I'd like to do this by atomically
   replacing its contents.  I can replace the content right now
   by deleting the old subvolume and then snapshotting the new
   one in its place, but it's racy.  It really needs to be
   replaced in a single operation, or else there's a small window
   where there is no data, and I'd need to resort to some external
   locking to protect myself.
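
   A minimal sketch of the racy sequence just described (subvolume
   names are illustrative):

    % btrfs subvolume snapshot new-data staging   # build the replacement
    % btrfs subvolume delete data                 # the window opens: no data
    % btrfs subvolume snapshot staging data       # the window closes
    % btrfs subvolume delete staging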

I'm not sure I understand use case #1.  The problem is that you'll have
files open in the subvolume and you can't just pull the rug out from
under them.  Could you tell me a little more about what you're trying to
do?

 
  Use case 2:
   In schroot, we create btrfs subvolume snapshots to get copy-on-
   write chroots.  This works just fine.  We also provide direct
   access to the source subvolume, but since it could be
   snapshotted in an inconsistent state while being updated, we
   want to do the following:
 
   · snapshot source subvolume
   · update snapshot
   · replace source volume with updated snapshot
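
   a minimal sketch of those three steps (names are illustrative; the
   delete/re-snapshot pair is the non-atomic replacement this report
   is about):

    % btrfs subvolume snapshot source source-new   # snapshot source
    % schroot -c source-new -- apt-get -y upgrade  # update the snapshot
    % btrfs subvolume delete source                # replace source with
    % btrfs subvolume snapshot source-new source   #  the updated snapshot
    % btrfs subvolume delete source-new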
 
  Please keep roger in the cc for any replies, thanks.
 
 i am also looking for functionality similar to this, except i would
 like to be able to replace the DEFAULT subvolume with an empty or
 existing subvolume, and put the original default subvolume INSIDE the
 new root (or drop it completely), as outlined by this post and the
 thread it's in:
 
 http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg05278.html
 
 is there any feedback on these actions?  no one seems to even respond :-(
 
 it would seem we need ways to swap subvolumes around, _including_ the
 default, provided the on-disk format supports such operations.

Moving 'default' generally involves a reboot for the same reasons.  We
have to worry about open files and their view of the filesystem.  mv on
a directory won't affect file handles that are open, and renaming
subvolumes needs to follow a similar model.
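
A quick shell illustration of that model, on any POSIX filesystem: an
open file handle keeps working across a rename of its parent directory.

 % mkdir d && echo hello > d/f
 % exec 3< d/f        # open a handle inside the directory
 % mv d d.renamed     # rename the directory
 % cat <&3            # the handle still reads the file
 hello
 % exec 3<&-          # close the handle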

-chris




Re: Atomic replacement of subvolumes is not possible

2010-06-30 Thread C Anthony Risinger
On Wed, Jun 30, 2010 at 8:31 AM, Chris Mason chris.ma...@oracle.com wrote:
 On Sun, Jun 27, 2010 at 07:44:12PM -0500, C Anthony Risinger wrote:
 On Sat, Jun 26, 2010 at 12:25 PM, Daniel Baumann dan...@debian.org wrote:
  Hi,
 
  this is basically a forward from
  http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=587253
 
  rename(2) allows for the atomic replacement of files.  Being able to
  atomically replace subvolume snapshots would be equally invaluable,
  since it would permit lock-free replacement of subvolumes.
 
   % btrfs subvolume snapshot src dest
 
  creates dest as a snapshot of src. However, if I want to do the
  converse,
 
   % btrfs subvolume snapshot dest src
 
  then dest is snapshotted as src/dest, i.e. not replacing the
  original subvolume, but going inside the original subvolume.
 
  Use case 1:
   I have a subvolume of data under active use, which I want to
   periodically update.  I'd like to do this by atomically
   replacing its contents.  I can replace the content right now
   by deleting the old subvolume and then snapshotting the new
   one in its place, but it's racy.  It really needs to be
   replaced in a single operation, or else there's a small window
   where there is no data, and I'd need to resort to some external
   locking to protect myself.

 I'm not sure I understand use case #1.  The problem is that you'll have
 files open in the subvolume and you can't just pull the rug out from
 under them.  Could you tell me a little more about what you're trying to
 do?

 
  Use case 2:
   In schroot, we create btrfs subvolume snapshots to get copy-on-
   write chroots.  This works just fine.  We also provide direct
   access to the source subvolume, but since it could be
   snapshotted in an inconsistent state while being updated, we
   want to do the following:
 
   · snapshot source subvolume
   · update snapshot
   · replace source volume with updated snapshot
 
  Please keep roger in the cc for any replies, thanks.

 i am also looking for functionality similar to this, except i would
 like to be able to replace the DEFAULT subvolume with an empty or
 existing subvolume, and put the original default subvolume INSIDE the
 new root (or drop it completely), as outlined by this post and the
 thread it's in:

 http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg05278.html

 is there any feedback on these actions?  no one seems to even respond :-(

 it would seem we need ways to swap subvolumes around, _including_ the
 default, provided the on-disk format supports such operations.

 Moving 'default' generally involves a reboot for the same reasons.  We
 have to worry about open files and their view of the filesystem.  mv on
 a directory won't affect file handles that are open, and renaming
 subvolumes needs to follow a similar model.

could we fail if the user tries to replace a subvolume while it's
being used?  what if the root device is _not_ the default (.)
subvolume, can it be swapped then?

in my use case, i am running in initramfs, so the root device has not
even been mounted or pivoted to; it should be safe to do whatever i
want to the filesystem.  i want to move the user's installation to a
dedicated subvolume.
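
a sketch of that initramfs flow (device and subvolume names are
illustrative; what's missing is an atomic way to make __active the
new default afterwards):

 mount /dev/sda2 /mnt                          # top-level (default) subvol
 btrfs subvolume snapshot /mnt /mnt/__active   # capture the installation
 umount /mnt
 mount -o subvol=__active /dev/sda2 /new_root  # boot from the new subvol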

what about this:  would it be possible to have TWO subvolumes by
default?  the regular one (current directory, "."):

mount -o subvol=. btrfs_dev /mnt

would behave as it does now.  BUT... there would then be a special,
permanent (like "." is right now) subvol, say "parent directory"
(".."):

mount -o subvol=.. btrfs_dev /mnt

TWO dots would mount the parent of ".", where i could then swap out
the real default (".").

would that work?

C Anthony


Re: Balancing leaves when walking from top to down (was Btrfs:...)

2010-06-30 Thread Edward Shishkin

Chris Mason wrote:
[...]

1. the balancing condition n <= number_of_keys <= 2n+1
  is not satisfied (indeed, when number_of_keys is 1,
  the lower bound n <= 1 fails for any n > 1);

That doesn't matter.  It is not necessary (or desirable) to follow the
classical B-tree algorithms to the letter to get sufficiently good
properties. 


sure, but we need something instead of the classic balancing condition.
We can not just drop it: it is your lifejacket when performing
tree operations. How will you prove utilization bounds without any
balancing conditions?



 B-trees are quite robust to changing the details, as long
as you follow the generalised principles which make them work.



There are lots of different perspectives here,


I wouldn't be so optimistic here (see below)


 but it is important to
step back and look at what we're using the btree for.  We're making an
index to filesystem metadata.  The index is there so that we can find
that metadata with a reasonable number of IOs and in a reasonable amount
of time.

Because btrfs is COW, and because of the reference counted COW used to
maintain snapshots, the cost of updating the btree is different from
traditional btrees.  We do try to avoid balancing more and
intentionally allow the btree leaves to be less compact simply because
it allows us to avoid the higher cost of those operations.
  


I can understand reducing the lower bound, say, from 1/2 to 1/3 for
performance reasons, but again, a bound of 0.00 is unacceptable. So,
please, make sure you have a sane, proven bound for every such compromise.


The btree nodes are fairly strict.  They contain fixed records and are
balanced in a simple fashion.  When we are inserting a new key into a
node, if the node is completely full we split it.  When we are deleting
a key, if the node is less than half full we balance it.


This works only if an insertion/deletion on the leaf level won't result
in inserting/deleting more than one key on the upper levels.


  Calculating
full on the btree nodes is a factor of the number of keys in them.

The leaves have fixed length keys that describe items of variable size.
On deletion merging is done when we're less than 1/3rd full and on
insertion we split when the item we're trying to shove in doesn't fit.
Calculating full on the btree leaves is a factor of the total size of
the items stored.
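
As a worked example of that 1/3 rule, assuming a 4 KiB leaf (a size not
stated in this mail), in LaTeX notation:

 \text{merge on delete when } \mathit{used} < \tfrac{1}{3}\cdot 4096 \approx 1365 \text{ bytes of items}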

So, with all of that said, the nodes point to leaves and the leaves
contain inodes, directory entries and sometimes file data.

The most important part of the index is the higher levels of the tree.
The index of most filesystems points to but does not contain
inodes and file data, and so the higher levels of the btrfs btree are
more like a traditional FS index.

There's a knob here for the max inline item size and if someone wants to
fill their whole FS with 2K files, the default settings are not at all
suitable.  Filling with 3K files actually works better because we end
up with inode and file data in the same block much of the time.
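
A hedged reading of that example, assuming 4 KiB leaves and an inode
item of roughly 160 bytes (approximate on-disk sizes, not taken from
this mail), in LaTeX notation:

 2\,\mathrm{KiB}: 2048 + 160 + \text{headers} \approx 2250 \Rightarrow \text{the } \sim\!1800 \text{ bytes left can't hold another 2 KiB data item}
 3\,\mathrm{KiB}: 3072 + 160 + \text{headers} \approx 3300 \Rightarrow \text{inode and data share one leaf at } \sim\!80\% \text{ utilization}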
  


Chris, thanks for the description.

I think that such balancing is totally incorrect. With your
strategy, insertions can spawn shallow leaves, and deletions will
result in shallow leaves (because of merging with shallow leaves).
This will lead to an unbounded utilization slump.

In particular, your patch is just a workaround, which doesn't fix the
core problem. Knobs for special cases are nonsense. Are you kidding? Ext4,
xfs, reiserfs don't use any knobs, so why should we accept this
regression against the general progress? Everything should be
packed fine without any knobs...

Your strategy slightly resembles balancing in the Reiserfs file system,
so I have two comments here.

1. The algorithms used in Reiserfs are precise (or minimal). That
means there is no superfluous balancing there. It is impossible to
save on balancing operations and only slightly decrease the utilization
bound: you will lose everything (this is unacceptable). Only classic
Bayer B-trees allow reducing the 1/2 bound, by replacing the usual
balancing condition q <= N <= 2q with a more liberal one (say,
q <= N <= 3q).
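
(Written out, the worst-case utilization implied by such a condition:

 q \le N \le 2q \;\Rightarrow\; u_{\min} = q/(2q) = 1/2
 q \le N \le 3q \;\Rightarrow\; u_{\min} = q/(3q) = 1/3

and with no lower bound on N at all, u_{\min} \to 0, which is the 0.00
case objected to above.)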

2. If you want to build on a COW-friendly Reiserfs, then its algorithms
should be properly modified for the case of top-down balancing and
reviewed (ideally by the authors of the original algorithms).
I recommend considering the following approach:


Balancing the leaf level


A. Inserting a key of length I(*)


01   Suppose we traversed the tree and have arrived at position
02   pos in leaf A.
03
04   At first we try to make space by shifting all possible items at the
05   left and right of pos to the neighbors B and C respectively:
06
07        B   A   C
08
09   Let us denote by L and R the total length of the items in leaf A at
10   the left and right of pos, respectively.
11
12   If after shifting there is enough space in A, then we perform the
13   insertion into A. After the insertion the pairs (B,A) and (A,C) are
14   incompressible for obvious reasons.
15
16   Else if there is 

Re: Balancing leaves when walking from top to down (was Btrfs:...)

2010-06-30 Thread Chris Mason
On Wed, Jun 30, 2010 at 10:05:21PM +0200, Edward Shishkin wrote:
 Chris Mason wrote:
 [...]
 1. the balancing condition n <= number_of_keys <= 2n+1
   is not satisfied (indeed, when number_of_keys is 1,
   the lower bound n <= 1 fails for any n > 1);
 That doesn't matter.  It is not necessary (or desirable) to follow the
 classical B-tree algorithms to the letter to get sufficiently good
 properties.
 
 sure, but we need something instead of the classic balancing condition.
 We can not just drop it: it is your lifejacket when performing
 tree operations. How will you prove utilization bounds without any
 balancing conditions?

Again, the leaves and nodes both have balancing conditions.  I'm not
sure I follow.

 
 
  B-trees are quite robust to changing the details, as long
 as you follow the generalised principles which make them work.
 
 There are lots of different perspectives here,
 
 I wouldn't be so optimistic here (see below)
 
  but it is important to
 step back and look at what we're using the btree for.  We're making an
 index to filesystem metadata.  The index is there so that we can find
 that metadata with a reasonable number of IOs and in a reasonable amount
 of time.
 
 Because btrfs is COW, and because of the reference counted COW used to
 maintain snapshots, the cost of updating the btree is different from
 traditional btrees.  We do try to avoid balancing more and
 intentionally allow the btree leaves to be less compact simply because
 it allows us to avoid the higher cost of those operations.
 
 I can understand reducing the lower bound, say, from 1/2 to 1/3 for
 performance reasons, but again, a bound of 0.00 is unacceptable. So,
 please, make sure you have a sane, proven bound for every such compromise.
 
 The btree nodes are fairly strict.  They contain fixed records and are
 balanced in a simple fashion.  When we are inserting a new key into a
 node, if the node is completely full we split it.  When we are deleting
 a key, if the node is less than half full we balance it.
 
 This works only if an insertion/deletion on the leaf level won't result
 in inserting/deleting more than one key on the upper levels.

The top-down balancing requires that we control the number of upper
level inserts.  This is already done today.

 
   Calculating
 full on the btree nodes is a factor of the number of keys in them.
 
 The leaves have fixed length keys that describe items of variable size.
 On deletion merging is done when we're less than 1/3rd full and on
 insertion we split when the item we're trying to shove in doesn't fit.
 Calculating full on the btree leaves is a factor of the total size of
 the items stored.
 
 So, with all of that said, the nodes point to leaves and the leaves
 contain inodes, directory entries and sometimes file data.
 
 The most important part of the index is the higher levels of the tree.
 The index of most filesystems points to but does not contain
 inodes and file data, and so the higher levels of the btrfs btree are
 more like a traditional FS index.
 
 There's a knob here for the max inline item size and if someone wants to
 fill their whole FS with 2K files, the default settings are not at all
 suitable.  Filling with 3K files actually works better because we end
 up with inode and file data in the same block much of the time.
 
 Chris, thanks for the description.
 
 I think that such balancing is totally incorrect. With your
 strategy, insertions can spawn shallow leaves, and deletions will
 result in shallow leaves (because of merging with shallow leaves).
 This will lead to an unbounded utilization slump.

Ok, insertions can spawn shallow leaves containing file data and
deletions can result in shallow leaves containing file data, just trying
to make sure I'm reading this correctly.

Shallow leaves with file data happen in every filesystem, but the other
filesystems call them data extents.  I think we're talking past each other a
little here ;)

 
 In particular, your patch is just a workaround, which doesn't fix the
 core problem. Knobs for special cases are nonsense. Are you kidding? Ext4,
 xfs, reiserfs don't use any knobs, so why should we accept this
 regression against the general progress? Everything should be
 packed fine without any knobs...

Ext4 and xfs don't pack file data into the btree.  reiserfs does but
splits the data items.  I'm still trying to nail down if you're worried
about something other than packed file data.

I'm saying there's no difference between a btrfs leaf with a single item
and a block pointer in xfs pointing to a single block.  The btrfs leaf
is full and the single block in xfs is full.

Both are backed by higher levels in the btree that are properly compact.

Btrfs metadata mirroring makes the cost of that leaf much higher, which
is something we might want to address by lowering the size of file data
that we choose to inline.


 
 Your strategy slightly resembles balancing in the Reiserfs file system,
 so I have two comments here.
 
 1. Algorithms used in 

[PATCH] btrfs: handle errors for FS_IOC_SETFLAGS

2010-06-30 Thread Sean Bartell
Makes btrfs_ioctl_setflags return -ENOSPC and other errors when
necessary.

Signed-off-by: Sean Bartell wingedtachik...@gmail.com
---
I ran chattr -R on a full FS and btrfs crashed. This overlaps with the patch
series being worked on by Jeff Mahoney.

 fs/btrfs/ioctl.c |   17 ++++++++++++-----
 1 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 4dbaf89..8db62c2 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -200,19 +200,26 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
 
 
trans = btrfs_join_transaction(root, 1);
-   BUG_ON(!trans);
+   if (IS_ERR(trans)) {
+   ret = PTR_ERR(trans);
+   goto out_drop_write;
+   }
 
ret = btrfs_update_inode(trans, root, inode);
-   BUG_ON(ret);
+   if (ret)
+   goto out_endtrans;
 
btrfs_update_iflags(inode);
inode->i_ctime = CURRENT_TIME;
-   btrfs_end_transaction(trans, root);
 
+   ret = 0;
+out_endtrans:
+   btrfs_end_transaction(trans, root);
+out_drop_write:
mnt_drop_write(file->f_path.mnt);
- out_unlock:
+out_unlock:
mutex_unlock(&inode->i_mutex);
-   return 0;
+   return ret;
 }
 
 static int btrfs_ioctl_getversion(struct file *file, int __user *arg)
-- 
1.7.1
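
A hypothetical reproducer for the crash mentioned above (image size and
paths are illustrative):

 % dd if=/dev/zero of=btrfs.img bs=1M count=256
 % mkfs.btrfs btrfs.img
 % mount -o loop btrfs.img /mnt
 % dd if=/dev/zero of=/mnt/fill bs=1M         # runs until ENOSPC
 % chattr -R +c /mnt                          # setflags ioctl with no space left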



Re: Is there a more aggressive fixer than btrfsck?

2010-06-30 Thread Rodrigo E . De León Plicet
On Wed, Jun 30, 2010 at 11:47 AM, Florian Weimer f...@deneb.enyo.de wrote:
 ZFS doesn't need a fsck because you have to throw away the file system
 and restore from backup for certain types of corruption:

 | What can I do if ZFS file system panics on every boot?
 [...]
 | This will remove all knowledge of pools from your system. You will
 | have to re-create your pool and restore from backup.

 http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#HWhatcanIdoifZFSfilesystempanicsoneveryboot


They *do* make it clear when something like that could happen.

From the same URL:

ZFS is designed to survive arbitrary hardware failures through the
use of redundancy (mirroring or RAID-Z). Unfortunately, certain
failures in *non-replicated* configurations can cause ZFS to panic
when trying to load the pool. This is a bug, and will be fixed in the
near future (along with several other nifty features, such as
background scrubbing).

A non-replicated configuration boils down to no mirroring or parity
checking (basically RAID-0 or similar); such a thing implies:

- No redundancy.
- No fault tolerance.

So, yeah, I guess if you go for a non-replicated configuration,
there will be risks, whether you use ZFS, btrfs,
MD+LVM+$ANY_TRADITIONAL_FS, etc.


Re: [PATCH] btrfs-convert: add reiserfs support

2010-06-30 Thread Jérôme Poulin
Hello, I just want to thank you for the ReiserFS conversion patch.

 Now that the semester's over, I've finally gotten around to finishing this.
 Ironically, I no longer have a ReiserFS partition I want to convert :P.

On my side, I had my root filesystem, which has been running for quite a
while; I decided to convert it in an LVM snapshot to see if everything
is fine. It seems to be, so I'll be trying to convert the real filesystem
soon.

 This depends on my previous four patches for btrfs-convert.

I applied the four patches, the correction to patch 2, and finally the
reiserfs conversion patch; compiled with make convert and ran the
converter, which took 10 minutes for 25GB.
All data seems to be accessible. I will try to boot it,
then do the real conversion (I will first make a backup :) and let you
know!
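
A sketch of that LVM-snapshot test, with illustrative volume names:

 % lvcreate -s -L 5G -n root_test /dev/vg0/root   # snapshot the live root LV
 % btrfs-convert /dev/vg0/root_test               # convert the snapshot
 % mount /dev/vg0/root_test /mnt                  # inspect the result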