Re: What to do about subvolumes?

2010-12-09 Thread J. Bruce Fields
On Wed, Dec 08, 2010 at 09:41:33PM -0700, Andreas Dilger wrote:
 On 2010-12-08, at 16:07, Neil Brown wrote:
  On Mon, 6 Dec 2010 11:48:45 -0500 J. Bruce Fields bfie...@redhat.com
  wrote:
  
  On Fri, Dec 03, 2010 at 04:01:44PM -0700, Andreas Dilger wrote:
  Any chance we can add a ->get_fsid(sb, inode) method to
  export_operations (or something similar), that allows the
  filesystem to generate an FSID based on the volume and
  inode that is being exported?
  
  No objection from here.
  
  My standard objection here is that you cannot guarantee that the
  fsid is 100% guaranteed to be unique across all filesystems in
  the system (including filesystems mounted from dm snapshots of
  filesystems that are currently mounted).  NFSd needs this uniqueness.
 
 Sure, but you also cannot guarantee that the devno is constant across 
 reboots, yet NFS continues to use this much-less-constant value...
 
  This is only really an objection if user-space cannot over-ride
  the fsid provided by the filesystem.
 
 Agreed.  It definitely makes sense to allow this, for whatever strange 
 circumstances might arise.  However, defaulting to using the filesystem UUID 
 definitely makes the most sense, and looking at the nfs-utils mountd code, it 
 seems that this is already standard behaviour for local block devices 
 (excluding btrfs filesystems).
 
  I'd be very happy to see an interface to user-space whereby
  user-space can get a reasonably unique fsid for a given
  filesystem.
 
 Hmm, maybe I'm missing something, but why does userspace need to be able to 
 get this value?  I would think that nfsd gets it from the filesystem directly 
 in the kernel, but if a uuid= option is present in the exports file that is 
 preferentially used over the value from the filesystem.

Well, the kernel can't distinguish the case of an explicit uuid=
option in /etc/exports from one that was (as is the normal default)
generated automatically by mountd.  Maybe not a big deal.

The uuid seems like a useful thing to have access to from userspace
anyway, for userspace nfs servers if for no other reason:

 That said, I think Aneesh's open_by_handle patchset also made the UUID 
 visible in /proc/pid/mountinfo, after the filesystems stored it in
 sb->s_uuid at mount time.  That _should_ make it visible for non-block 
 mountpoints as well, assuming they fill in s_uuid.
 
  Whether this is an export_operations method or some field in the
  'struct super' which gets copied out doesn't matter to me.
 
 Since Aneesh has already developed patches, is there any objection to using 
 those (last sent to linux-fsdevel on 2010-10-29):
 
 [PATCH -V22 12/14] vfs: Export file system uuid via /proc/pid/mountinfo
 [PATCH -V22 13/14] ext3: Copy fs UUID to superblock.
 [PATCH -V22 14/14] ext4: Copy fs UUID to superblock

I can't see anything wrong with that.

--b.


Re: What to do about subvolumes?

2010-12-08 Thread Andreas Dilger
On 2010-12-07, at 10:02, Trond Myklebust wrote:

 On Tue, 2010-12-07 at 17:51 +0100, Christoph Hellwig wrote:
 It's just as stable as a real dev_t in the times of hotplug and udev.
 As long as you don't touch anything including not upgrading the kernel
 it'll remain stable, otherwise it will break.  That's why modern
 nfs-utils default to using the uuid-based filehandle schemes instead of
 the dev_t based ones.  At least that's what I was told - I really hope it's
 using the real UUIDs from the filesystem and not the horrible fsid hack
 that was once added - for some filesystems like XFS that field does not
 actually have any relation to the UUID historically.  And while we
 could have changed that it's too late now that nfs was hacked into
 abusing that field.
 
 IIRC, NFS uses the full true uuid for NFSv3 and NFSv4 filehandles, but
 they won't fit into the NFSv2 32-byte filehandles, so there is an
 '8-byte fsid' and '4-byte fsid + inode number' workaround for that...
 
 See the mk_fsid() helper in fs/nfsd/nfsfh.h

It looks like mk_fsid() is only actually using the UUID if it is specified in 
the /etc/exports file (AFAICS, this depends on ex_uuid being set from a 
uuid=... option).

There was a patch in the open_by_handle() patch series that added an s_uuid 
field to the superblock, that could be used if no uuid= option is specified in 
the /etc/exports file.

Cheers, Andreas







Re: What to do about subvolumes?

2010-12-08 Thread J. Bruce Fields
On Wed, Dec 08, 2010 at 10:16:29AM -0700, Andreas Dilger wrote:
 On 2010-12-07, at 10:02, Trond Myklebust wrote:
 
  On Tue, 2010-12-07 at 17:51 +0100, Christoph Hellwig wrote:
  It's just as stable as a real dev_t in the times of hotplug and udev.
  As long as you don't touch anything including not upgrading the kernel
  it'll remain stable, otherwise it will break.  That's why modern
  nfs-utils default to using the uuid-based filehandle schemes instead of
  the dev_t based ones.  At least that's what I was told - I really hope it's
  using the real UUIDs from the filesystem and not the horrible fsid hack
  that was once added - for some filesystems like XFS that field does not
  actually have any relation to the UUID historically.  And while we
  could have changed that it's too late now that nfs was hacked into
  abusing that field.
  
  IIRC, NFS uses the full true uuid for NFSv3 and NFSv4 filehandles, but
  they won't fit into the NFSv2 32-byte filehandles, so there is an
  '8-byte fsid' and '4-byte fsid + inode number' workaround for that...
  
  See the mk_fsid() helper in fs/nfsd/nfsfh.h
 
 It looks like mk_fsid() is only actually using the UUID if it is specified in 
 the /etc/exports file (AFAICS, this depends on ex_uuid being set from a 
 uuid=... option).

No, if you look at the nfs-utils source you'll find mountd sets a uuid
by default (in utils/mountd/cache.c:uuid_by_path()).
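
For reference, the default mountd derivation boils down to asking
libblkid for the UUID of the block device backing the export.  A rough
sketch of that idea (illustrative only, not the actual uuid_by_path()
code; the helper name is made up):

    #include <blkid/blkid.h>

    /* Return the filesystem UUID blkid recorded for a block device;
     * caller must free() the result.  uuid_by_path() additionally has
     * to map the exported path to its backing device first. */
    char *fs_uuid_for_device(const char *devname)
    {
            blkid_cache cache;

            if (blkid_get_cache(&cache, NULL) < 0)
                    return NULL;
            return blkid_get_tag_value(cache, "UUID", devname);
    }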

 There was a patch in the open_by_handle() patch series that added an s_uuid 
 field to the superblock, that could be used if no uuid= option is specified 
 in the /etc/exports file.

Agreed that doing this in the kernel would probably be simpler.

--b.


Re: What to do about subvolumes?

2010-12-08 Thread Andreas Dilger
On 2010-12-06, at 09:48, J. Bruce Fields wrote:
On Fri, Dec 03, 2010 at 04:01:44PM -0700, Andreas Dilger wrote:
 Any chance we can add a ->get_fsid(sb, inode) method to
 export_operations (or something similar), that allows the filesystem to
 generate an FSID based on the volume and inode that is being exported?
 
 No objection from here.
 
 (Though I don't understand the inode argument--aren't subvolumes
 usually expected to have separate superblocks?)

I thought that if two directories from the same filesystem are both being 
exported at the same time that they would need to have different FSID values, 
hence the inode parameter to allow generating an FSID that is a function of 
both the filesystem (sb) and the directory being exported (inode)?

Cheers, Andreas







Re: What to do about subvolumes?

2010-12-08 Thread Neil Brown
On Mon, 6 Dec 2010 11:48:45 -0500 J. Bruce Fields bfie...@redhat.com
wrote:

 On Fri, Dec 03, 2010 at 04:01:44PM -0700, Andreas Dilger wrote:
  On 2010-12-03, at 15:45, J. Bruce Fields wrote:
   We're using statfs64.fs_fsid for this; I believe that's both stable
   across reboots and distinguishes between subvolumes, so that's OK.
   
   (That said, since fs_fsid doesn't work for other filesystems, we depend
   on an explicit check for a filesystem type of btrfs, which is
   awful--btrfs won't always be the only filesystem that wants to do this
   kind of thing, etc.)
  
 Sigh, I've wanted to be able to specify the NFS FSID directly from within the 
 kernel for Lustre for many years already.  Glad to see that this is moving 
  forward.
  
  Any chance we can add a ->get_fsid(sb, inode) method to export_operations
  (or something similar), that allows the filesystem to generate an FSID based 
  on the volume and inode that is being exported?
 
 No objection from here.

My standard objection here is that you cannot guarantee that the fsid is 100%
guaranteed to be unique across all filesystems in the system (including
filesystems mounted from dm snapshots of filesystems that are currently
mounted).  NFSd needs this uniqueness.

This is only really an objection if user-space cannot over-ride the fsid
provided by the filesystem.

I'd be very happy to see an interface to user-space whereby user-space can
get a reasonably unique fsid for a given filesystem.  Whether this is an
export_operations method or some field in the 'struct super' which gets
copied out doesn't matter to me.

NeilBrown


 
 (Though I don't understand the inode argument--aren't subvolumes
 usually expected to have separate superblocks?)
 
 --b.



Re: What to do about subvolumes?

2010-12-08 Thread Andreas Dilger
On 2010-12-08, at 16:07, Neil Brown wrote:
 On Mon, 6 Dec 2010 11:48:45 -0500 J. Bruce Fields bfie...@redhat.com
 wrote:
 
 On Fri, Dec 03, 2010 at 04:01:44PM -0700, Andreas Dilger wrote:
  Any chance we can add a ->get_fsid(sb, inode) method to
  export_operations (or something similar), that allows the
 filesystem to generate an FSID based on the volume and
 inode that is being exported?
 
 No objection from here.
 
 My standard objection here is that you cannot guarantee that the
  fsid is 100% guaranteed to be unique across all filesystems in
 the system (including filesystems mounted from dm snapshots of
 filesystems that are currently mounted).  NFSd needs this uniqueness.

Sure, but you also cannot guarantee that the devno is constant across reboots, 
yet NFS continues to use this much-less-constant value...

 This is only really an objection if user-space cannot over-ride
 the fsid provided by the filesystem.

Agreed.  It definitely makes sense to allow this, for whatever strange 
circumstances might arise.  However, defaulting to using the filesystem UUID 
definitely makes the most sense, and looking at the nfs-utils mountd code, it 
seems that this is already standard behaviour for local block devices 
(excluding btrfs filesystems).

 I'd be very happy to see an interface to user-space whereby
 user-space can get a reasonably unique fsid for a given
 filesystem.

Hmm, maybe I'm missing something, but why does userspace need to be able to get 
this value?  I would think that nfsd gets it from the filesystem directly in 
the kernel, but if a uuid= option is present in the exports file that is 
preferentially used over the value from the filesystem.

That said, I think Aneesh's open_by_handle patchset also made the UUID visible 
in /proc/pid/mountinfo, after the filesystems stored it in
sb->s_uuid at mount time.  That _should_ make it visible for non-block 
mountpoints as well, assuming they fill in s_uuid.

 Whether this is an export_operations method or some field in the
 'struct super' which gets copied out doesn't matter to me.

Since Aneesh has already developed patches, is there any objection to using 
those (last sent to linux-fsdevel on 2010-10-29):

[PATCH -V22 12/14] vfs: Export file system uuid via /proc/pid/mountinfo
[PATCH -V22 13/14] ext3: Copy fs UUID to superblock.
[PATCH -V22 14/14] ext4: Copy fs UUID to superblock
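
For reference, the ext3/ext4 side of those patches is essentially a
one-line copy at mount time.  A sketch of the idea for ext4, assuming
the usual names for the on-disk superblock (es) and the VFS superblock
(sb), and the s_uuid field those patches add to struct super_block:

    /* in ext4_fill_super(), once the on-disk superblock has been read:
     * propagate the filesystem UUID into the generic VFS superblock so
     * nfsd and /proc/pid/mountinfo can see it */
    memcpy(sb->s_uuid, es->s_uuid, sizeof(es->s_uuid));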

Cheers, Andreas







Re: What to do about subvolumes?

2010-12-07 Thread Christoph Hellwig
 === What do subvolumes look like? ===
 
 All the user sees are directories.  They act like any other directory acts, 
 with
 a few exceptions
 
 1) You cannot hardlink between subvolumes.  This is because subvolumes have
  their own inode numbers and such, think of them as separate mounts in this 
 case,
 you cannot hardlink between two mounts because the link needs to point to the
 same on disk inode, which is impossible between two different filesystems.  
 The
 same is true for subvolumes, they have their own trees with their own inodes 
 and
 inode numbers, so it's impossible to hardlink between them.

which means they act like a different mount point.

 1a) In case it wasn't clear from above, each subvolume has their own inode
 numbers, so you can have the same inode numbers used between two different
 subvolumes, since they are two different trees.

which means they act like not just a different mount point, but they
also act like being a separate superblock.

 2) Obviously you can't just rm -rf subvolumes.  Because they are roots there's
 extra metadata to keep track of them, so you have to use one of our ioctls to
 delete subvolumes/snapshots.

Again this means they act like a mount point.

 1) Users need to be able to create their own subvolumes.  The permission
 semantics will be absolutely the same as creating directories, so I don't 
 think
 this is too tricky.  We want this because you can only take snapshots of
 subvolumes, and so it is important that users be able to create their own
 discrete snapshottable targets.

Not that I'm entirely against this, but instead of just stating they
must, can you also state the detailed reason?  Allowing users to create
their own subvolumes is a mostly equivalent problem to allowing user mounts,
so handling those two under one umbrella makes a lot of sense.

 This is where I expect to see the most discussion.  Here is what I want to do
 
 1) Scrap the 256 inode number thing.  Instead we'll just put a flag in the 
 inode
 to say Hey, I'm a subvolume and then we can do all of the appropriate magic
 that way.  This unfortunately will be an incompatible format change, but the
  sooner we get this addressed the easier it will be in the long run.  Obviously
 when I say format change I mean via the incompat bits we have, so old fs's 
 won't
 be broken and such.

From reading later posts in this thread, readdir already seems to take
care of this in some way.  But is there a chance of collisions between
real inode numbers and the ones faked up for the subvolume roots?

 2) Do something like NFS's referral mounts when we cd into a subvolume.  Now 
 we
 just do dentry trickery, but that doesn't make the boundary between subvolumes
 clear, so it will confuse people (and samba) when they walk into a subvolume 
 and
 all of a sudden the inode numbers are the same as in the directory behind 
 them.
 With doing the referral mount thing, each subvolume appears to be its own 
 mount
 and that way things like NFS and samba will work properly.
 
 I feel like I'm forgetting something here, hopefully somebody will point it 
 out.

The current code requires the automount trigger points to be links,
which is something that Chris didn't like at all.  But that issue is
solved by building upon David Howells' series to replace that
follow_link magic with a new d_automount dentry operation.  I'd suggest
building the new code on top of that.

And most importantly:

 3) allocate a different anon dev_t for each subvolume.


One thing that really confuses me is that the actual root of the
subvolume appears directly in the parent namespace.  Given that you have
your subvolume identifiers that doesn't even seem necessary.

To me the following scheme seems more useful:

 - all subvolumes/snapshots only show up in a virtual below-root
   directory, similar to how the existing default one doesn't
   sit on the top.
 - the entries inside a namespace that are to be automounted have
   an entry in the filesystem that just marks them as an auto-mount
   point that redirects to the actual subvolume.
 - we still allow mounting subvolumes (and only those) directly
   from get_sb by specifying the subvolume name.

This is especially important for snapshots, as just having them hang
off the filesystem that is to be snapshotted is extremely confusing.
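
For illustration, assuming the subvol= mount option that btrfs's get_sb
already parses is the interface meant in the last point, mounting a
snapshot directly would look something like:

    # mount the snapshot "snap/2010-12-07" rather than the default root
    mount -t btrfs -o subvol=snap/2010-12-07 /dev/sda1 /mnt/restore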


Re: What to do about subvolumes?

2010-12-07 Thread Christoph Hellwig
On Sat, Dec 04, 2010 at 09:27:56AM +1100, Dave Chinner wrote:
 A property of NFS filehandles is that they must be stable across
 server reboots. Is this anon dev_t used as part of the NFS
 filehandle and if so how can you guarantee that it is stable?

It's just as stable as a real dev_t in the times of hotplug and udev.
As long as you don't touch anything including not upgrading the kernel
it'll remain stable, otherwise it will break.  That's why modern
nfs-utils default to using the uuid-based filehandle schemes instead of
the dev_t based ones.  At least that's what I was told - I really hope it's
using the real UUIDs from the filesystem and not the horrible fsid hack
that was once added - for some filesystems like XFS that field does not
actually have any relation to the UUID historically.  And while we could
have changed that it's too late now that nfs was hacked into abusing
that field.



Re: What to do about subvolumes?

2010-12-07 Thread hch
On Fri, Dec 03, 2010 at 05:45:26PM -0500, J. Bruce Fields wrote:
 We're using statfs64.fs_fsid for this; I believe that's both stable
 across reboots and distinguishes between subvolumes, so that's OK.

It's a field that doesn't have any useful specification and basically
contains random garbage that a filesystem puts into it.  Using it is a
very bad idea.



Re: What to do about subvolumes?

2010-12-07 Thread Trond Myklebust
On Tue, 2010-12-07 at 17:51 +0100, Christoph Hellwig wrote:
 On Sat, Dec 04, 2010 at 09:27:56AM +1100, Dave Chinner wrote:
  A property of NFS filehandles is that they must be stable across
  server reboots. Is this anon dev_t used as part of the NFS
  filehandle and if so how can you guarantee that it is stable?
 
 It's just as stable as a real dev_t in the times of hotplug and udev.
 As long as you don't touch anything including not upgrading the kernel
 it'll remain stable, otherwise it will break.  That's why modern
 nfs-utils default to using the uuid-based filehandle schemes instead of
 the dev_t based ones.  At least that's what I was told - I really hope it's
 using the real UUIDs from the filesystem and not the horrible fsid hack
 that was once added - for some filesystems like XFS that field does not
 actually have any relation to the UUID historically.  And while we could
 have changed that it's too late now that nfs was hacked into abusing
 that field.

IIRC, NFS uses the full true uuid for NFSv3 and NFSv4 filehandles, but
they won't fit into the NFSv2 32-byte filehandles, so there is an
'8-byte fsid' and '4-byte fsid + inode number' workaround for that...

See the mk_fsid() helper in fs/nfsd/nfsfh.h
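
For a feel of the space constraint: the NFSv2 file handle is a fixed 32
bytes, so a full 16-byte UUID plus inode and generation numbers cannot
always fit.  Purely illustrative (this is not the actual mk_fsid()
layout), the small-fsid variants amount to packings along these lines:

    #include <linux/types.h>

    /* hypothetical packing of the '4-byte fsid + inode number' variant
     * into the fixed 32-byte NFSv2 handle */
    struct nfsv2_fh_example {
            __u32 fsid;     /* 4-byte fsid identifying the export */
            __u32 ino;      /* inode number within that export */
            __u32 gen;      /* inode generation */
            __u8  pad[20];  /* remainder of the fixed-size handle */
    };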

Cheers
  Trond
-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
trond.mykleb...@netapp.com
www.netapp.com



Re: What to do about subvolumes?

2010-12-07 Thread J. Bruce Fields
On Tue, Dec 07, 2010 at 05:52:13PM +0100, hch wrote:
 On Fri, Dec 03, 2010 at 05:45:26PM -0500, J. Bruce Fields wrote:
  We're using statfs64.fs_fsid for this; I believe that's both stable
  across reboots and distinguishes between subvolumes, so that's OK.
 
 It's a field that doesn't have any useful specification and basically
  contains random garbage that a filesystem puts into it.  Using it is a
 very bad idea.

I meant the above statement to apply only to btrfs; and nfs-utils is
using fs_fsid only in the case where the filesystem type is btrfs.  So
I believe the current code does work.

But I agree that constructing filehandles differently based on a
strcmp() of the filesystem type is not a sustainable design, to say the
least.

--b.


Re: What to do about subvolumes?

2010-12-03 Thread Josef Bacik
On Thu, Dec 02, 2010 at 11:25:01PM -0500, Chris Ball wrote:
 Hi Josef,
 
 1) Scrap the 256 inode number thing.  Instead we'll just put a
 flag in the inode to say Hey, I'm a subvolume and then we can
 do all of the appropriate magic that way.  This unfortunately
 will be an incompatible format change, but the sooner we get this
 addressed the easier it will be in the long run.  Obviously when I
 say format change I mean via the incompat bits we have, so old
 fs's won't be broken and such.
 
 Sorry if I've missed this elsewhere in the thread -- will we still
 have an efficient operation for enumerating subvolumes and snapshots,
 and how will that work?  We're going to want tools like plymouth and
 grub to be able to list all snapshots without running a large scan.


Yeah the idea is we want to fix the problems with the design without breaking
anything that currently works.  So all the changes I want to make are going to
be invisible for the user.  Thanks,

Josef 


Re: What to do about subvolumes?

2010-12-03 Thread Josef Bacik
On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
 Hello,
 
 Various people have complained about how BTRFS deals with subvolumes recently,
 specifically the fact that they all have the same inode number, and there's no
 discrete separation from one subvolume to another.  Christoph asked that I lay
 out a basic design document of how we want subvolumes to work so we can hash
 everything out now, fix what is broken, and then move forward with a design 
 that
 everybody is more or less happy with.  I apologize in advance for how freaking
 long this email is going to be.  I assume that most people are generally
 familiar with how BTRFS works, so I'm not going to bother explaining in great
 detail some stuff.
 
 === What are subvolumes? ===
 
 They are just another tree.  In BTRFS we have various b-trees to describe the
 filesystem.  A few of them are filesystem wide, such as the extent tree, chunk
 tree, root tree etc.  The trees that hold the actual filesystem data, that is
 inodes and such, are kept in their own b-tree.  This is how subvolumes and
 snapshots appear on disk, they are simply new b-trees with all of the file 
 data
 contained within them.
 
 === What do subvolumes look like? ===
 
 All the user sees are directories.  They act like any other directory acts, 
 with
 a few exceptions
 
 1) You cannot hardlink between subvolumes.  This is because subvolumes have
 their own inode numbers and such, think of them as separate mounts in this 
 case,
 you cannot hardlink between two mounts because the link needs to point to the
 same on disk inode, which is impossible between two different filesystems.  
 The
 same is true for subvolumes, they have their own trees with their own inodes 
 and
 inode numbers, so it's impossible to hardlink between them.
 
 1a) In case it wasn't clear from above, each subvolume has their own inode
 numbers, so you can have the same inode numbers used between two different
 subvolumes, since they are two different trees.
 
 2) Obviously you can't just rm -rf subvolumes.  Because they are roots there's
 extra metadata to keep track of them, so you have to use one of our ioctls to
 delete subvolumes/snapshots.
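
For concreteness, a sketch of what "use one of our ioctls" amounts to
from userspace, assuming the BTRFS_IOC_SNAP_DESTROY ioctl and the
btrfs_ioctl_vol_args struct (the header location is an assumption;
btrfs-progs carries its own copy of ioctl.h):

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include "ioctl.h"  /* btrfs_ioctl_vol_args, BTRFS_IOC_SNAP_DESTROY */

    /* delete the subvolume "name" that lives inside parent_dir */
    int delete_subvol(const char *parent_dir, const char *name)
    {
            struct btrfs_ioctl_vol_args args;
            int fd, ret;

            fd = open(parent_dir, O_RDONLY);
            if (fd < 0)
                    return -1;
            memset(&args, 0, sizeof(args));
            strncpy(args.name, name, sizeof(args.name) - 1);
            ret = ioctl(fd, BTRFS_IOC_SNAP_DESTROY, &args);
            close(fd);
            return ret;
    }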
 
 But permissions and everything else are the same.
 
 There is one tricky thing.  When you create a subvolume, the directory inode
 that is created in the parent subvolume has the inode number of 256.  So if 
 you
 have a bunch of subvolumes in the same parent subvolume, you are going to 
 have a
 bunch of directories with the inode number of 256.  This is so when users cd
 into a subvolume we can know it's a subvolume and do all the normal voodoo to
 start looking in the subvolumes tree instead of the parent subvolumes tree.
 
 This is where things go a bit sideways.  We had serious problems with NFS, but
 thankfully NFS gives us a bunch of hooks to get around these problems.
 CIFS/Samba do not, so we will have problems there, not to mention any other
 userspace application that looks at inode numbers.
 
 === How do we want subvolumes to work from a user perspective? ===
 
 1) Users need to be able to create their own subvolumes.  The permission
 semantics will be absolutely the same as creating directories, so I don't 
 think
 this is too tricky.  We want this because you can only take snapshots of
 subvolumes, and so it is important that users be able to create their own
 discrete snapshottable targets.
 
 2) Users need to be able to snapshot their subvolumes.  This is basically the
 same as #1, but it bears repeating.
 
 3) Subvolumes shouldn't need to be specifically mounted.  This is also
 important, we don't want users to have to go around mounting their subvolumes 
 up
 manually one-by-one.  Today users just cd into subvolumes and it works, just
 like cd'ing into a directory.
 
 === Quotas ===
 
 This is a huge topic in and of itself, but Christoph mentioned wanting to have
 an idea of what we wanted to do with it, so I'm putting it here.  There are
 really 2 things here
 
 1) Limiting the size of subvolumes.  This is really easy for us, just create a
 subvolume and at creation time set a maximum size it can grow to and not let 
 it
 go farther than that.  Nice, simple and straightforward.
 
 2) Normal quotas, via the quota tools.  This just comes down to how do we want
 to charge users, do we want to do it per subvolume, or per filesystem.  My 
 vote
 is per filesystem.  Obviously this will make it tricky with snapshots, but I
 think if we're just charging the diff's between the original volume and the
 snapshot to the user then that will be the easiest for people to understand,
 rather than making a snapshot all of a sudden count the user's currently used
 quota * 2.
 
 === What do we do? ===
 
 This is where I expect to see the most discussion.  Here is what I want to do
 
 1) Scrap the 256 inode number thing.  Instead we'll just put a flag in the 
 inode
 to say Hey, I'm a subvolume and then we can do all of the appropriate magic
 that way. 

Re: What to do about subvolumes?

2010-12-03 Thread J. Bruce Fields
On Fri, Dec 03, 2010 at 04:45:27PM -0500, Josef Bacik wrote:
 So now that I've actually looked at everything, it looks like the semantics 
 are
 all right for subvolumes
 
 1) readdir - we return the root id in d_ino, which is unique across the fs

Though Michael Vrable pointed out an apparent collision with normal
inode numbers on the parent filesystem?

 2) stat - we return 256 for all subvolumes, because that is their inode number
 3) dev_t - we setup an anon super for all volumes, so they all get their own
 dev_t, which is set properly for all of their children, see below
 
 [r...@test1244 btrfs-test]# stat .
   File: `.'
   Size: 20  Blocks: 8  IO Block: 4096   directory
 Device: 15h/21d Inode: 256 Links: 1
 Access: (0555/dr-xr-xr-x)  Uid: (0/root)   Gid: (0/root)
 Access: 2010-12-03 15:35:41.931679393 -0500
 Modify: 2010-12-03 15:35:20.405679493 -0500
 Change: 2010-12-03 15:35:20.405679493 -0500
 
 [r...@test1244 btrfs-test]# stat foo
   File: `foo'
   Size: 12  Blocks: 0  IO Block: 4096   directory
 Device: 19h/25d Inode: 256 Links: 1
 Access: (0700/drwx------)  Uid: (0/root)   Gid: (0/root)
 Access: 2010-12-03 15:35:17.501679393 -0500
 Modify: 2010-12-03 15:35:59.150680051 -0500
 Change: 2010-12-03 15:35:59.150680051 -0500
 
 [r...@test1244 btrfs-test]# stat foo/foobar 
   File: `foo/foobar'
   Size: 0   Blocks: 0  IO Block: 4096   regular empty file
 Device: 19h/25d Inode: 257 Links: 1
 Access: (0644/-rw-r--r--)  Uid: (0/root)   Gid: (0/root)
 Access: 2010-12-03 15:35:59.150680051 -0500
 Modify: 2010-12-03 15:35:59.150680051 -0500
 Change: 2010-12-03 15:35:59.150680051 -0500
 
 So as far as the user is concerned, everything should come out right.  
 Obviously
 we had to do the NFS trickery still because as far as VFS is concerned the
 subvolumes are all on the same mount.  So the question is this (and really 
 this
 is directed at Christoph and Bruce and anybody else who may care), is this 
 good
 enough, or do we want to have a separate vfsmount for each subvolume?  Thanks,

For nfsd's purposes, we need to be able find out about filesystems in
two different ways:

1. Lookup by filehandle: we need to be able to identify which
subvolume we're dealing with from a filehandle.
2. Lookup by path: we need to notice when we cross into a
subvolume.

Looks like #1 already works.  Not #2: the current nfsd code just checks
for mountpoints.  We could modify nfsd to also check whether dev_t
changed each time it did a lookup.  I suppose it would work, though it's
annoying to have to do it just for the case of btrfs.
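
A sketch of the kind of per-lookup check that would mean (illustrative,
not actual nfsd code; it assumes the current three-argument
vfs_getattr() signature):

    #include <linux/fs.h>
    #include <linux/mount.h>
    #include <linux/stat.h>

    /* Did this lookup step cross into a subvolume?  Compare the dev_t
     * that stat() would report for parent and child; btrfs gives each
     * subvolume its own anon dev_t even though vfsmount and superblock
     * are shared. */
    static bool crossed_subvol(struct vfsmount *mnt,
                               struct dentry *parent, struct dentry *child)
    {
            struct kstat pst, cst;

            if (vfs_getattr(mnt, parent, &pst) ||
                vfs_getattr(mnt, child, &cst))
                    return false;   /* be conservative on error */
            return pst.dev != cst.dev;
    }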

As far as I can tell, crossing into a subvolume is like crossing a
mountpoint in every way except for the lack of a separate vfsmount.  I'd
worry that the inconsistency will end up requiring more special cases
down the road, but I don't have any in mind.

--b.


Re: What to do about subvolumes?

2010-12-03 Thread Dave Chinner
On Fri, Dec 03, 2010 at 04:45:27PM -0500, Josef Bacik wrote:
 On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
  Hello,
  
  Various people have complained about how BTRFS deals with subvolumes 
  recently,
  specifically the fact that they all have the same inode number, and there's 
  no
  discrete separation from one subvolume to another.  Christoph asked that I 
  lay
  out a basic design document of how we want subvolumes to work so we can hash
  everything out now, fix what is broken, and then move forward with a design 
  that
  everybody is more or less happy with.  I apologize in advance for how 
  freaking
  long this email is going to be.  I assume that most people are generally
  familiar with how BTRFS works, so I'm not going to bother explaining in 
  great
  detail some stuff.
  

  are things that cannot be fixed now.  Some of these changes will require
  incompat format changes, but it's either we fix it now, or later on down the
  road when BTRFS starts getting used in production really find out how many
  things our current scheme breaks and then have to do the changes then.  
  Thanks,
  
 
 So now that I've actually looked at everything, it looks like the semantics 
 are
 all right for subvolumes
 
 1) readdir - we return the root id in d_ino, which is unique across the fs
 2) stat - we return 256 for all subvolumes, because that is their inode number
 3) dev_t - we setup an anon super for all volumes, so they all get their own
 dev_t, which is set properly for all of their children, see below

A property of NFS filehandles is that they must be stable across
server reboots. Is this anon dev_t used as part of the NFS
filehandle and if so how can you guarantee that it is stable?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: What to do about subvolumes?

2010-12-03 Thread Chris Mason
Excerpts from Dave Chinner's message of 2010-12-03 17:27:56 -0500:
 On Fri, Dec 03, 2010 at 04:45:27PM -0500, Josef Bacik wrote:
  On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
   Hello,
   
   Various people have complained about how BTRFS deals with subvolumes 
   recently,
   specifically the fact that they all have the same inode number, and 
   there's no
   discrete separation from one subvolume to another.  Christoph asked that 
   I lay
   out a basic design document of how we want subvolumes to work so we can 
   hash
   everything out now, fix what is broken, and then move forward with a 
   design that
   everybody is more or less happy with.  I apologize in advance for how 
   freaking
   long this email is going to be.  I assume that most people are generally
   familiar with how BTRFS works, so I'm not going to bother explaining in 
   great
   detail some stuff.
   
 
   are things that cannot be fixed now.  Some of these changes will require
   incompat format changes, but it's either we fix it now, or later on down 
   the
   road when BTRFS starts getting used in production really find out how many
   things our current scheme breaks and then have to do the changes then.  
   Thanks,
   
  
  So now that I've actually looked at everything, it looks like the semantics 
  are
  all right for subvolumes
  
  1) readdir - we return the root id in d_ino, which is unique across the fs
  2) stat - we return 256 for all subvolumes, because that is their inode 
  number
  3) dev_t - we setup an anon super for all volumes, so they all get their own
  dev_t, which is set properly for all of their children, see below
 
 A property of NFS filehandles is that they must be stable across
 server reboots. Is this anon dev_t used as part of the NFS
 filehandle and if so how can you guarantee that it is stable?

It isn't today, that's something we'll have to address.

-chris


Re: What to do about subvolumes?

2010-12-03 Thread J. Bruce Fields
On Fri, Dec 03, 2010 at 05:29:24PM -0500, Chris Mason wrote:
 Excerpts from Dave Chinner's message of 2010-12-03 17:27:56 -0500:
  On Fri, Dec 03, 2010 at 04:45:27PM -0500, Josef Bacik wrote:
   On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
Hello,

Various people have complained about how BTRFS deals with subvolumes 
recently,
specifically the fact that they all have the same inode number, and 
there's no
discrete separation from one subvolume to another.  Christoph asked 
that I lay
out a basic design document of how we want subvolumes to work so we can 
hash
everything out now, fix what is broken, and then move forward with a 
design that
everybody is more or less happy with.  I apologize in advance for how 
freaking
long this email is going to be.  I assume that most people are generally
familiar with how BTRFS works, so I'm not going to bother explaining in 
great
detail some stuff.

  
are things that cannot be fixed now.  Some of these changes will require
incompat format changes, but it's either we fix it now, or later on 
down the
road when BTRFS starts getting used in production really find out how 
many
things our current scheme breaks and then have to do the changes then.  
Thanks,

   
   So now that I've actually looked at everything, it looks like the 
   semantics are
   all right for subvolumes
   
   1) readdir - we return the root id in d_ino, which is unique across the fs
   2) stat - we return 256 for all subvolumes, because that is their inode 
   number
   3) dev_t - we setup an anon super for all volumes, so they all get their 
   own
   dev_t, which is set properly for all of their children, see below
  
  A property of NFS filehandles is that they must be stable across
  server reboots. Is this anon dev_t used as part of the NFS
  filehandle and if so how can you guarantee that it is stable?
 
 It isn't today, that's something we'll have to address.

We're using statfs64.fs_fsid for this; I believe that's both stable
across reboots and distinguishes between subvolumes, so that's OK.

(That said, since fs_fsid doesn't work for other filesystems, we depend
on an explicit check for a filesystem type of btrfs, which is
awful--btrfs won't always be the only filesystem that wants to do this
kind of thing, etc.)
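
For reference, the userspace side of that is just statfs64() and its
f_fsid field.  A sketch of the kind of btrfs-only special case being
described (the helper and its fstype argument are made up; the strcmp()
caveat above applies in full):

    #include <string.h>
    #include <sys/statfs.h>

    /* fill out[2] with an export identifier for "path": use f_fsid
     * only when the filesystem type is btrfs, as nfs-utils currently
     * does */
    int fsid_for_export(const char *path, const char *fstype,
                        unsigned int out[2])
    {
            struct statfs64 st;

            if (strcmp(fstype, "btrfs") != 0 || statfs64(path, &st) < 0)
                    return -1;      /* fall back to another scheme */
            memcpy(out, &st.f_fsid, sizeof(st.f_fsid));
            return 0;
    }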

--b.


Re: What to do about subvolumes?

2010-12-03 Thread Andreas Dilger
On 2010-12-03, at 15:45, J. Bruce Fields wrote:
 We're using statfs64.fs_fsid for this; I believe that's both stable
 across reboots and distinguishes between subvolumes, so that's OK.
 
 (That said, since fs_fsid doesn't work for other filesystems, we depend
 on an explicit check for a filesystem type of btrfs, which is
 awful--btrfs won't always be the only filesystem that wants to do this
 kind of thing, etc.)

Sigh, I've wanted to be able to specify the NFS FSID directly from within the 
kernel for Lustre for many years already.  Glad to see that this is moving forward.

Any chance we can add a ->get_fsid(sb, inode) method to export_operations
(or something similar), that allows the filesystem to generate an FSID based on 
the volume and inode that is being exported?
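
Purely illustrative (no such hook exists today; this is just the
proposal spelled out, with a guessed return convention):

    /* hypothetical addition to struct export_operations in
     * include/linux/exportfs.h */
    struct export_operations {
            /* ... existing methods: encode_fh, fh_to_dentry, ... */

            /* fill *fsid with a stable identifier for the filesystem
             * (or subvolume) containing @inode on @sb */
            int (*get_fsid)(struct super_block *sb, struct inode *inode,
                            u64 *fsid);
    };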

Cheers, Andreas







Re: What to do about subvolumes?

2010-12-02 Thread Arne Jansen
Josef Bacik wrote:
 
 This is a huge topic in and of itself, but Christoph mentioned wanting to have
 an idea of what we wanted to do with it, so I'm putting it here.  There are
 really 2 things here
 
 1) Limiting the size of subvolumes.  This is really easy for us, just create a
 subvolume and at creation time set a maximum size it can grow to and not let 
 it
 go farther than that.  Nice, simple and straightforward.
 

I'd love to be able to limit the size of a subvolume. Here the size comprises
all blocks this subvolume refers to.
But at least as important to me is a mode where one can build groups of sub-
volumes and snapshots and define a quota for the complete group. Again, the
size here comprises all blocks any of the subvolumes/snapshots refer to. If
a block is referred to more than once, it counts only once.
A subvolume/snapshot can be configured to be part of multiple groups.

With this I can do interesting things:
 a) The user pays only for the space he occupies, not for read-only snapshots
 b) The user pays for his space and for all the snapshots
 c) The user pays for his space and snapshots, but not for snapshots generated
for internal backup purposes
 d) Hierarchical quotas. I can limit /home and set an additional quota on each
homedir

Thanks,
Arne


Re: What to do about subvolumes?

2010-12-02 Thread Arne Jansen
Josef Bacik wrote:
 
 1) Scrap the 256 inode number thing.  Instead we'll just put a flag in the 
 inode
 to say Hey, I'm a subvolume and then we can do all of the appropriate magic
 that way.  This unfortunately will be an incompatible format change, but the
 sooner we get this addressed the easier it will be in the long run.  Obviously
 when I say format change I mean via the incompat bits we have, so old fs's 
 won't
 be broken and such.
 
 2) Do something like NFS's referral mounts when we cd into a subvolume.  Now 
 we
 just do dentry trickery, but that doesn't make the boundary between subvolumes
 clear, so it will confuse people (and samba) when they walk into a subvolume 
 and
 all of a sudden the inode numbers are the same as in the directory behind 
 them.
 With doing the referral mount thing, each subvolume appears to be its own 
 mount
 and that way things like NFS and samba will work properly.
 

What about the alternative and allocating inode numbers globally? The only
problem would be with snapshots as they share the inum with the source, but
one could just remap inode numbers in snapshots by sparing some bits at the
top of this 64 bit field.

Having one mount per subvolume/snapshots is the cleaner solution, but
quickly leads to situations where you have _lots_ of mounts, especially when
you export them via NFS and mount it somewhere else. I've seen a machine
which had to handle > 100,000 mounts from a zfs server. This definitely
brings its own problems, so I'd love to see a full fs exported as a single
mount. This will also keep output from tools like iostat (for nfs mounts)
and df readable.

Thanks,
Arne


Re: What to do about subvolumes?

2010-12-02 Thread Chris Mason
Excerpts from Arne Jansen's message of 2010-12-02 04:49:39 -0500:
 Josef Bacik wrote:
  
  1) Scrap the 256 inode number thing.  Instead we'll just put a flag in the 
  inode
  to say Hey, I'm a subvolume and then we can do all of the appropriate 
  magic
  that way.  This unfortunately will be an incompatible format change, but the
  sooner we get this addressed the easier it will be in the long run.  
  Obviously
  when I say format change I mean via the incompat bits we have, so old fs's 
  won't
  be broken and such.
  
  2) Do something like NFS's referral mounts when we cd into a subvolume.  
  Now we
  just do dentry trickery, but that doesn't make the boundary between 
  subvolumes
  clear, so it will confuse people (and samba) when they walk into a 
  subvolume and
  all of a sudden the inode numbers are the same as in the directory behind 
  them.
  With doing the referral mount thing, each subvolume appears to be its own 
  mount
  and that way things like NFS and samba will work properly.
  
 
 What about the alternative and allocating inode numbers globally? The only
 problem would be with snapshots as they share the inum with the source, but
 one could just remap inode numbers in snapshots by sparing some bits at the
 top of this 64 bit field.

The global inode number is possible, it's just another btree that must
be maintained on disk in order to map which inodes are free and which
ones aren't.  It also needs to have a reference count on each inode,
since each snapshot effectively increases the reference count on
every file and directory it contains.

The cost of maintaining that reference count is very very high.

-chris

 
 Having one mount per subvolume/snapshots is the cleaner solution, but
 quickly leads to situations where you have _lots_ of mounts, especially when
 you export them via NFS and mount it somewhere else. I've seen a machine
 which had to handle > 100,000 mounts from a zfs server. This definitely
 brings its own problems, so I'd love to see a full fs exported as a single
 mount. This will also keep output from tools like iostat (for nfs mounts)
 and df readable.
 
 Thanks,
 Arne


Re: What to do about subvolumes?

2010-12-02 Thread David Pottage

On 02/12/10 16:11, Chris Mason wrote:

Excerpts from Arne Jansen's message of 2010-12-02 04:49:39 -0500:
   

Josef Bacik wrote:
 

1) Scrap the 256 inode number thing.  Instead we'll just put a flag in the inode
to say Hey, I'm a subvolume and then we can do all of the appropriate magic
that way.  This unfortunately will be an incompatible format change, but the
sooner we get this addressed the easier it will be in the long run.  Obviously
when I say format change I mean via the incompat bits we have, so old fs's won't
be broken and such.

2) Do something like NFS's referral mounts when we cd into a subvolume.  Now we
just do dentry trickery, but that doesn't make the boundary between subvolumes
clear, so it will confuse people (and samba) when they walk into a subvolume and
all of a sudden the inode numbers are the same as in the directory behind them.
With doing the referral mount thing, each subvolume appears to be its own mount
and that way things like NFS and samba will work properly.

   

What about the alternative and allocating inode numbers globally? The only
problem would be with snapshots as they share the inum with the source, but
one could just remap inode numbers in snapshots by sparing some bits at the
top of this 64 bit field.
 

The global inode number is possible, it's just another btree that must
be maintained on disk in order to map which inodes are free and which
ones aren't.  It also needs to have a reference count on each inode,
since each snapshot effectively increases the reference count on
every file and directory it contains.

The cost of maintaining that reference count is very very high.
   


A couple of years ago I was suffering from the problem of different 
files having the same inode number on Netapp servers. On a Netapp device 
if you snapshot a volume then the files in the snapshot have the same 
inode number as the original, even if the original changes. (Netapp 
snapshots are read only).


This means that if you attempt to see what has changed since your last 
snapshot using a command line such as:


diff src/file.c .snapshots/hourly.12/src.file.c

Then the diff tool will tell you that the files are the same even if 
they are different, because it is assuming that files with the same 
inode number will have identical contents.


Therefore I think it is a bad idea if potentially different files on 
btrfs can have the same inode number. It will break all sorts of tools.


Instead of maintaining a big complicated reference count of used inode 
numbers, could btrfs use bit masks to create the userland-visible 
inode number from the subvolume id and the real internal inode number. 
Something like:


userland_inode = ( volume_id << 48 ) | internal_inode;

Please forgive me if this is impossible, or if that C snippet is 
syntactically incorrect. I am not a filesystem or kernel developer, and 
I have not coded in C for many years.
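
A compilable sketch of that idea (the 48/16 split follows the snippet
above; all names here are made up):

    #include <stdint.h>
    #include <stdio.h>

    #define INO_BITS 48
    #define INO_MASK ((UINT64_C(1) << INO_BITS) - 1)

    /* combine a subvolume id and a 48-bit internal inode number into
     * one userland-visible inode number */
    static uint64_t make_ino(uint64_t volume_id, uint64_t internal_ino)
    {
            return (volume_id << INO_BITS) | (internal_ino & INO_MASK);
    }

    int main(void)
    {
            uint64_t ino = make_ino(3, 257);

            /* split it back apart to recover both halves */
            printf("userland ino %llu -> volume %llu, internal %llu\n",
                   (unsigned long long)ino,
                   (unsigned long long)(ino >> INO_BITS),
                   (unsigned long long)(ino & INO_MASK));
            return 0;
    }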


--
David Pottage



Re: What to do about subvolumes?

2010-12-02 Thread Phillip Susi

On 12/02/2010 04:49 AM, Arne Jansen wrote:

What about the alternative and allocating inode numbers globally? The only
problem would be with snapshots as they share the inum with the source, but
one could just remap inode numbers in snapshots by sparing some bits at the
top of this 64 bit field.


I was wondering this as well.  Why give each subvol its own inode number 
space?  To avoid breaking assumptions of various programs, if they each 
have their own inode space, they must each have a unique st_dev.  How 
are inode numbers currently allocated, and why wouldn't it be simple to 
just have a single pool of inode numbers for all subvols?  It seems 
obvious to me that snapshots start out inheriting the inode numbers of 
the original subvol, but must be given a new st_dev.



Re: What to do about subvolumes?

2010-12-01 Thread Mike Hommey
On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
 1) Users need to be able to create their own subvolumes.  The permission
 semantics will be absolutely the same as creating directories, so I don't 
 think
 this is too tricky.  We want this because you can only take snapshots of
 subvolumes, and so it is important that users be able to create their own
 discrete snapshottable targets.
 
 2) Users need to be able to snapshot their subvolumes.  This is basically the
 same as #1, but it bears repeating.
 
 3) Subvolumes shouldn't need to be specifically mounted.  This is also
 important, we don't want users to have to go around mounting their subvolumes 
 up
 manually one-by-one.  Today users just cd into subvolumes and it works, just
 like cd'ing into a directory.

It would be helpful to be able to create subvolumes off existing
directories, instead of creating a subvolume and having to copy all the
data around.

Mike


Re: What to do about subvolumes?

2010-12-01 Thread C Anthony Risinger
On Wed, Dec 1, 2010 at 8:21 AM, Josef Bacik jo...@redhat.com wrote:

 === How do we want subvolumes to work from a user perspective? ===

 1) Users need to be able to create their own subvolumes.  The permission
 semantics will be absolutely the same as creating directories, so I don't 
 think
 this is too tricky.  We want this because you can only take snapshots of
 subvolumes, and so it is important that users be able to create their own
 discrete snapshottable targets.

 2) Users need to be able to snapshot their subvolumes.  This is basically the
 same as #1, but it bears repeating.

could it be possible to convert a directory into a volume?  or at
least base a snapshot off it?

C Anthony


Re: What to do about subvolumes?

2010-12-01 Thread Chris Mason
Excerpts from Josef Bacik's message of 2010-12-01 09:21:36 -0500:
 Hello,
 
 Various people have complained about how BTRFS deals with subvolumes recently,
 specifically the fact that they all have the same inode number, and there's no
 discrete separation from one subvolume to another.  Christoph asked that I lay
 out a basic design document of how we want subvolumes to work so we can hash
 everything out now, fix what is broken, and then move forward with a design 
 that
 everybody is more or less happy with.  I apologize in advance for how freaking
 long this email is going to be.  I assume that most people are generally
 familiar with how BTRFS works, so I'm not going to bother explaining in great
 detail some stuff.

Thanks for writing this up.

 === What do we do? ===
 
 This is where I expect to see the most discussion.  Here is what I want to do
 
 1) Scrap the 256 inode number thing.  Instead we'll just put a flag in the 
 inode
 to say Hey, I'm a subvolume and then we can do all of the appropriate magic
 that way.  This unfortunately will be an incompatible format change, but the
 sooner we get this addressed the easier it will be in the long run.  Obviously
 when I say format change I mean via the incompat bits we have, so old fs's 
 won't
 be broken and such.

If they don't have inode number 256, what inode number do they have?
I'm assuming you mean the subvolume is given an inode number in the
parent directory just like any other dir,  but this doesn't get rid of
the duplicate inode problem.  I think it ends up making it less clear,
but I'm open to suggestions ;)

We could give each subvol a different dev_t, which is something Christoph
had asked about as well.

-chris


Re: What to do about subvolumes?

2010-12-01 Thread Chris Mason
Excerpts from C Anthony Risinger's message of 2010-12-01 09:51:55 -0500:
 On Wed, Dec 1, 2010 at 8:21 AM, Josef Bacik jo...@redhat.com wrote:
 
  === How do we want subvolumes to work from a user perspective? ===
 
  1) Users need to be able to create their own subvolumes.  The permission
  semantics will be absolutely the same as creating directories, so I don't 
  think
  this is too tricky.  We want this because you can only take snapshots of
  subvolumes, and so it is important that users be able to create their own
  discrete snapshottable targets.
 
  2) Users need to be able to snapshot their subvolumes.  This is basically 
  the
  same as #1, but it bears repeating.
 
 could it be possible to convert a directory into a volume?  or at
 least base a snapshot off it?

I'm afraid this turns into the same complexity as creating a new volume
and copying all the files/dirs in by hand.

-chris


Re: What to do about subvolumes?

2010-12-01 Thread C Anthony Risinger
On Wed, Dec 1, 2010 at 10:01 AM, Chris Mason chris.ma...@oracle.com wrote:
 Excerpts from C Anthony Risinger's message of 2010-12-01 09:51:55 -0500:
 On Wed, Dec 1, 2010 at 8:21 AM, Josef Bacik jo...@redhat.com wrote:
 
  === How do we want subvolumes to work from a user perspective? ===
 
  1) Users need to be able to create their own subvolumes.  The permission
  semantics will be absolutely the same as creating directories, so I don't 
  think
  this is too tricky.  We want this because you can only take snapshots of
  subvolumes, and so it is important that users be able to create their own
  discrete snapshottable targets.
 
  2) Users need to be able to snapshot their subvolumes.  This is basically 
  the
  same as #1, but it bears repeating.

 could it be possible to convert a directory into a volume?  or at
 least base a snapshot off it?

 I'm afraid this turns into the same complexity as creating a new volume
 and copying all the files/dirs in by hand.

ok; if i create an empty volume, and use cp --reflink, it would have
the desired effect though, right?
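
For concreteness, the sequence being floated here, assuming btrfs-progs'
"btrfs subvolume create" command and coreutils' --reflink support (only
metadata is copied; the data extents are shared):

    # create an empty subvolume next to the existing directory
    btrfs subvolume create /mnt/newvol
    # populate it with COW clones of the directory's contents
    cp -a --reflink=always /mnt/olddir/. /mnt/newvol/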

C Anthony


Re: What to do about subvolumes?

2010-12-01 Thread Mike Hommey
On Wed, Dec 01, 2010 at 11:01:37AM -0500, Chris Mason wrote:
 Excerpts from C Anthony Risinger's message of 2010-12-01 09:51:55 -0500:
  On Wed, Dec 1, 2010 at 8:21 AM, Josef Bacik jo...@redhat.com wrote:
  
   === How do we want subvolumes to work from a user perspective? ===
  
   1) Users need to be able to create their own subvolumes.  The permission
   semantics will be absolutely the same as creating directories, so I don't 
   think
   this is too tricky.  We want this because you can only take snapshots of
   subvolumes, and so it is important that users be able to create their own
   discrete snapshottable targets.
  
   2) Users need to be able to snapshot their subvolumes.  This is 
   basically the
   same as #1, but it bears repeating.
  
  could it be possible to convert a directory into a volume?  or at
  least base a snapshot off it?
 
 I'm afraid this turns into the same complexity as creating a new volume
 and copying all the files/dirs in by hand.

Except you wouldn't have to copy data, only metadata.

Mike


Re: What to do about subvolumes?

2010-12-01 Thread Hugo Mills
On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
 === Quotas ===
 
 This is a huge topic in and of itself, but Christoph mentioned wanting to have
 an idea of what we wanted to do with it, so I'm putting it here.  There are
 really 2 things here
 
 1) Limiting the size of subvolumes.  This is really easy for us: just create
 a subvolume and at creation time set a maximum size it can grow to, and not
 let it go farther than that.  Nice, simple and straightforward.
 
 2) Normal quotas, via the quota tools.  This just comes down to how we want
 to charge users: do we want to do it per subvolume, or per filesystem?  My
 vote is per filesystem.  Obviously this will make it tricky with snapshots,
 but I think if we're just charging the diffs between the original volume and
 the snapshot to the user then that will be the easiest for people to
 understand, rather than making a snapshot all of a sudden count the user's
 currently used quota * 2.

   This is going to be tricky to get the semantics right, I suspect.

   Say you've created a subvolume, A, containing 10G of Useful Stuff
(say, a base image for VMs). This counts 10G against your quota. Now,
I come along and snapshot that subvolume (as a writable subvolume) --
call it B. This is essentially free for me, because I've got a COW
copy of your subvolume (and the original counts against your quota).

   If I now modify a file in subvolume B, the full modified section
goes onto my quota. This is all well and good. But what happens if you
delete your subvolume, A? Suddenly, I get lumbered with 10G of extra
files.  Worse, what happens if someone else had made a snapshot of A,
too? Who gets the 10G added to their quota, me or them? What if I'd
filled up my quota? Would that stop you from deleting your copy,
because my copy can't be charged against my quota? Would I just end up
unexpectedly 10G over quota?

   This is a whole gigantic can of worms, as far as I can see, and I
don't think it's going to be possible to implement quotas, even on a
filesystem level, until there's some good and functional model for
dealing with all the implications of COW copies. :(
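
   (To make the dilemma concrete, here is a toy model in C -- purely
illustrative, not btrfs code: a shared extent keeps a reference count
and a single charged owner, and *some* rule has to pick who inherits
the bill when that owner leaves:)

#include <stdio.h>

/* Toy model: a shared extent with a refcount and one charged owner.
 * Illustrative only -- nothing here reflects real btrfs structures
 * or any agreed-upon quota design. */
struct extent {
        long size_gb;
        int refs;      /* how many subvolumes reference it */
        int charged;   /* user currently paying for it */
};

/* One possible rule: when the paying user drops their reference,
 * the cost moves wholesale to some surviving referencer. */
static void drop_ref(struct extent *e, int user, int survivor)
{
        e->refs--;
        if (e->charged == user && e->refs > 0) {
                e->charged = survivor;
                printf("user %d is suddenly charged %ldG\n",
                       survivor, e->size_gb);
        }
}

int main(void)
{
        struct extent base = { 10, 1, 1 };  /* Porthos's 10G */
        base.refs++;                        /* Athos snapshots it */
        base.refs++;                        /* Aramis snapshots it too */
        drop_ref(&base, 1, 2);  /* Porthos deletes: why user 2, not 3? */
        return 0;
}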

   Hugo.

-- 
=== Hugo Mills: h...@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- I believe that it's closely correlated with ---   
   the aeroswine coefficient.




Re: What to do about subvolumes?

2010-12-01 Thread C Anthony Risinger
On Wed, Dec 1, 2010 at 10:38 AM, Hugo Mills hugo-l...@carfax.org.uk wrote:
 On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
 === Quotas ===

  This is a huge topic in and of itself, but Christoph mentioned wanting to
  have an idea of what we wanted to do with it, so I'm putting it here.  There
  are really 2 things here
 
  1) Limiting the size of subvolumes.  This is really easy for us: just create
  a subvolume and at creation time set a maximum size it can grow to, and not
  let it go farther than that.  Nice, simple and straightforward.
 
  2) Normal quotas, via the quota tools.  This just comes down to how we want
  to charge users: do we want to do it per subvolume, or per filesystem?  My
  vote is per filesystem.  Obviously this will make it tricky with snapshots,
  but I think if we're just charging the diffs between the original volume and
  the snapshot to the user then that will be the easiest for people to
  understand, rather than making a snapshot all of a sudden count the user's
  currently used quota * 2.

   This is going to be tricky to get the semantics right, I suspect.

   Say you've created a subvolume, A, containing 10G of Useful Stuff
 (say, a base image for VMs). This counts 10G against your quota. Now,
 I come along and snapshot that subvolume (as a writable subvolume) --
 call it B. This is essentially free for me, because I've got a COW
 copy of your subvolume (and the original counts against your quota).

   If I now modify a file in subvolume B, the full modified section
 goes onto my quota. This is all well and good. But what happens if you
 delete your subvolume, A? Suddenly, I get lumbered with 10G of extra
 files.  Worse, what happens if someone else had made a snapshot of A,
 too? Who gets the 10G added to their quota, me or them? What if I'd
 filled up my quota? Would that stop you from deleting your copy,
 because my copy can't be charged against my quota? Would I just end up
 unexpectedly 10G over quota?

   This is a whole gigantic can of worms, as far as I can see, and I
 don't think it's going to be possible to implement quotas, even on a
 filesystem level, until there's some good and functional model for
 dealing with all the implications of COW copies. :(

i'd expect that as a separate user, you should both be whacked 10G.
imo, the whole benefit of transparent COW is to the administrator's
advantage, thus i would even think the _uncompressed_ volume size
would go against quota (which could possibly be artificially inflated
to account for the space saving of compression).  users just need a
nice, steadily predictable number to monitor.

though maybe these users could be grouped, such that the COW'ed
portions of the files they share are balanced across each user's quota,
but this would have to be a sort of opt-in thing, else you get wild
fluctuations because of other users' actions.  additionally, some
users could be marked as system, where COW'ing their subvol results
in 0 quota -- you only pay for what you change -- but if the system
subvol gets removed, then you pay for it all.  in this way you would
have to keep reusing system subvols to get any advantage as a regular
user.

i don't know the existing systems though, so i don't know what it
would take to do such balancing.

C Anthony


Re: What to do about subvolumes?

2010-12-01 Thread Mike Hommey
On Wed, Dec 01, 2010 at 04:38:00PM +, Hugo Mills wrote:
 On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
  === Quotas ===
  
  This is a huge topic in and of itself, but Christoph mentioned wanting to
  have an idea of what we wanted to do with it, so I'm putting it here.  There
  are really 2 things here
  
  1) Limiting the size of subvolumes.  This is really easy for us: just create
  a subvolume and at creation time set a maximum size it can grow to, and not
  let it go farther than that.  Nice, simple and straightforward.
  
  2) Normal quotas, via the quota tools.  This just comes down to how we want
  to charge users: do we want to do it per subvolume, or per filesystem?  My
  vote is per filesystem.  Obviously this will make it tricky with snapshots,
  but I think if we're just charging the diffs between the original volume and
  the snapshot to the user then that will be the easiest for people to
  understand, rather than making a snapshot all of a sudden count the user's
  currently used quota * 2.
 
This is going to be tricky to get the semantics right, I suspect.
 
Say you've created a subvolume, A, containing 10G of Useful Stuff
 (say, a base image for VMs). This counts 10G against your quota. Now,
 I come along and snapshot that subvolume (as a writable subvolume) --
 call it B. This is essentially free for me, because I've got a COW
 copy of your subvolume (and the original counts against your quota).
 
If I now modify a file in subvolume B, the full modified section
 goes onto my quota. This is all well and good. But what happens if you
 delete your subvolume, A? Suddenly, I get lumbered with 10G of extra
 files.  Worse, what happens if someone else had made a snapshot of A,
 too? Who gets the 10G added to their quota, me or them? What if I'd
 filled up my quota? Would that stop you from deleting your copy,
 because my copy can't be charged against my quota? Would I just end up
 unexpectedly 10G over quota?
 
This is a whole gigantic can of worms, as far as I can see, and I
 don't think it's going to be possible to implement quotas, even on a
 filesystem level, until there's some good and functional model for
 dealing with all the implications of COW copies. :(

In your case, it would sound fair that everyone is simply charged 10G.
What Josef is referring to would probably only apply to volumes and
snapshots owned by the same user: if I have a subvolume of 10G, and a
snapshot of it where I only changed 1G, the charged quota would be 11G,
not 20G.

Mike


Re: What to do about subvolumes?

2010-12-01 Thread Josef Bacik
On Wed, Dec 01, 2010 at 04:38:00PM +, Hugo Mills wrote:
 On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
  === Quotas ===
  
  This is a huge topic in and of itself, but Christoph mentioned wanting to
  have an idea of what we wanted to do with it, so I'm putting it here.  There
  are really 2 things here
  
  1) Limiting the size of subvolumes.  This is really easy for us: just create
  a subvolume and at creation time set a maximum size it can grow to, and not
  let it go farther than that.  Nice, simple and straightforward.
  
  2) Normal quotas, via the quota tools.  This just comes down to how we want
  to charge users: do we want to do it per subvolume, or per filesystem?  My
  vote is per filesystem.  Obviously this will make it tricky with snapshots,
  but I think if we're just charging the diffs between the original volume and
  the snapshot to the user then that will be the easiest for people to
  understand, rather than making a snapshot all of a sudden count the user's
  currently used quota * 2.
 
This is going to be tricky to get the semantics right, I suspect.
 
Say you've created a subvolume, A, containing 10G of Useful Stuff
 (say, a base image for VMs). This counts 10G against your quota. Now,
 I come along and snapshot that subvolume (as a writable subvolume) --
 call it B. This is essentially free for me, because I've got a COW
 copy of your subvolume (and the original counts against your quota).
 
If I now modify a file in subvolume B, the full modified section
 goes onto my quota. This is all well and good. But what happens if you
 delete your subvolume, A? Suddenly, I get lumbered with 10G of extra
 files.  Worse, what happens if someone else had made a snapshot of A,
 too? Who gets the 10G added to their quota, me or them? What if I'd
 filled up my quota? Would that stop you from deleting your copy,
 because my copy can't be charged against my quota? Would I just end up
 unexpectedly 10G over quota?
 

If you delete your subvolume A (i.e. use the btrfs tool to delete it), you
will only be stuck with what you changed in snapshot B.  So if you only
changed 5 gigs' worth of information, and you deleted the original subvolume,
you would have 5 gigs charged to your quota.  The idea is that you are only
charged for the blocks you have on disk.  Thanks,

Josef


Re: What to do about subvolumes?

2010-12-01 Thread Josef Bacik
On Wed, Dec 01, 2010 at 07:33:39PM +0100, Goffredo Baroncelli wrote:
 On Wednesday, 01 December, 2010, Josef Bacik wrote:
  Hello,
  
 
 Hi Josef
 
  
  === What are subvolumes? ===
  
 They are just another tree.  In BTRFS we have various b-trees to describe
 the filesystem.  A few of them are filesystem wide, such as the extent tree,
 chunk tree, root tree etc.  The trees that hold the actual filesystem data,
 that is inodes and such, are kept in their own b-tree.  This is how
 subvolumes and snapshots appear on disk: they are simply new b-trees with
 all of the file data contained within them.
  
  === What do subvolumes look like? ===
  
 [...]
  
 2) Obviously you can't just rm -rf subvolumes.  Because they are roots,
 there's extra metadata to keep track of them, so you have to use one of our
 ioctls to delete subvolumes/snapshots.
 
 Sorry, but I can't understand this sentence.  It is clear that a directory
 and a subvolume have totally different on-disk formats.  But why would it
 not be possible to remove a subvolume via the normal rmdir(2) syscall?  I
 posted a patch some months ago: when rmdir is invoked on a subvolume, the
 same action as the ioctl BTRFS_IOC_SNAP_DESTROY is performed.
 
 See https://patchwork.kernel.org/patch/260301/
  

Oh hey, that's cool.  That would be reasonable, I think.  I was just saying
that currently we can't remove subvolumes/snapshots via rm, not that it
wasn't possible at all.  So I think what you did would be a good thing to have.
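
For reference, deletion currently goes through the btrfs ioctl
interface.  A minimal userspace sketch, assuming the ioctl definitions
shipped with btrfs-progs (its ioctl.h):

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include "ioctl.h"  /* btrfs-progs: BTRFS_IOC_SNAP_DESTROY, vol_args */

/* Delete the subvolume "name" living in the directory open at dirfd.
 * Sketch only; real code should check lengths and report errors. */
static int subvol_destroy(int dirfd, const char *name)
{
        struct btrfs_ioctl_vol_args args;

        memset(&args, 0, sizeof(args));
        strncpy(args.name, name, sizeof(args.name) - 1);
        return ioctl(dirfd, BTRFS_IOC_SNAP_DESTROY, &args);
}

int main(void)
{
        int fd = open(".", O_RDONLY); /* parent dir of the subvolume */
        int ret = subvol_destroy(fd, "sub-a");
        close(fd);
        return ret ? 1 : 0;
}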

 [...]
  
 There is one tricky thing.  When you create a subvolume, the directory inode
 that is created in the parent subvolume has the inode number of 256.  So if
 you have a bunch of subvolumes in the same parent subvolume, you are going
 to have a bunch of directories with the inode number of 256.  This is so
 when users cd into a subvolume we can know it's a subvolume and do all the
 normal voodoo to start looking in the subvolume's tree instead of the parent
 subvolume's tree.
 
 This is where things go a bit sideways.  We had serious problems with NFS,
 but thankfully NFS gives us a bunch of hooks to get around these problems.
 CIFS/Samba do not, so we will have problems there, not to mention any other
 userspace application that looks at inode numbers.
 
 How is this / how should this be different from a mounted filesystem?
 For example:
 
 # cd /tmp
 # btrfs subvolume create sub-a
 # btrfs subvolume create sub-b
 # mkdir mount-a; mkdir mount-b
 # mount /dev/sda6 mount-a # an ext4 fs
 # mount /dev/sdb2 mount-b # an ext3 fs
 # stat -c "%8i %n" sub-a sub-b mount-a mount-b
      256 sub-a
      256 sub-b
        2 mount-a
        2 mount-b
 
 In this case the inode numbers returned are equal for both the mounted
 filesystems and the subvolumes.  However, the fsid is different.
 
 # stat -fc "%8i %n" sub-a sub-b mount-a mount-b .
 cdc937c1a203df74 sub-a
 cdc937c1a203df77 sub-b
 b27d147f003561c8 mount-a
 d49e1a3d2333d2e1 mount-b
 cdc937c1a203df75 .
 
 Moreover, I suggest looking at the difference between the inode numbers
 returned by readdir(3) and stat(3).


Yeah, you are right, the inode numbering can probably be the same; we just
need to make them logically different mounts so things like NFS and Samba
still work right.

 [...]
  I feel like I'm forgetting something here, hopefully somebody will point it 
 out.
  
 
 Another point that I would like to discuss is how to manage pivoting
 between subvolumes.  One of the most beautiful features of btrfs is the
 snapshot capability: it is possible to make a snapshot of the root of the
 filesystem and to mount it on a subsequent reboot.
 But it is very complicated to manage the pivoting of a snapshot of a root
 filesystem, because I cannot delete the old root, due to the fact that the
 new root is placed inside the old root.
 
 A possible solution is not to put the root of the filesystem (where /usr,
 /etc and so on are placed) in the root of the btrfs filesystem; instead it
 should be accepted from the beginning that the root of a filesystem should
 be placed in a subvolume, which in turn is placed in the root of a btrfs
 filesystem...
 
 I am open to other opinions.
 

Agreed.  One of the things that Chris and I have discussed is the possibility
of just having dangling roots, since really the directories are just an easy
way to get to the subvolumes.  This would let you delete the original volume
and use the snapshot from then on out.  Something to do in the future for
sure.  Thanks,

Josef


Re: What to do about subvolumes?

2010-12-01 Thread C Anthony Risinger
On Wed, Dec 1, 2010 at 12:36 PM, Josef Bacik jo...@redhat.com wrote:
 On Wed, Dec 01, 2010 at 07:33:39PM +0100, Goffredo Baroncelli wrote:

 Another point that I would like to discuss is how to manage pivoting
 between subvolumes.  One of the most beautiful features of btrfs is the
 snapshot capability: it is possible to make a snapshot of the root of the
 filesystem and to mount it on a subsequent reboot.
 But it is very complicated to manage the pivoting of a snapshot of a root
 filesystem, because I cannot delete the old root, due to the fact that the
 new root is placed inside the old root.

 A possible solution is not to put the root of the filesystem (where /usr,
 /etc and so on are placed) in the root of the btrfs filesystem; instead it
 should be accepted from the beginning that the root of a filesystem should
 be placed in a subvolume, which in turn is placed in the root of a btrfs
 filesystem...

 I am open to other opinions.


 Agreed.  One of the things that Chris and I have discussed is the
 possibility of just having dangling roots, since really the directories are
 just an easy way to get to the subvolumes.  This would let you delete the
 original volume and use the snapshot from then on out.  Something to do in
 the future for sure.

i would really like to see a solution to this particular issue.  i may
be missing something, but dangling subvol roots don't seem to
address the management of the root volume itself.

for example... most people will install their whole system into the
real root (id=5), but this renders the system unmanageable, because
there is no way to ever empty it without manually issuing an `rm -rf`.

i'm having a really hard time controlling this with the initramfs hook
i provide for archlinux users.  the hook requires a specific structure
underneath what the user perceives as /, but i can only accomplish
this for new installs -- for existing installs i can set up the proper
subroot structure and snapshot their current root... but i cannot
remove the stagnant files in the real root (id=5) that will never,
ever be accessed again.

... or does dangling roots address this?

C Anthony


Re: What to do about subvolumes?

2010-12-01 Thread C Anthony Risinger
On Wed, Dec 1, 2010 at 12:48 PM, C Anthony Risinger anth...@extof.me wrote:
 On Wed, Dec 1, 2010 at 12:36 PM, Josef Bacik jo...@redhat.com wrote:
 On Wed, Dec 01, 2010 at 07:33:39PM +0100, Goffredo Baroncelli wrote:

 Another point that I would like to discuss is how to manage pivoting
 between subvolumes.  One of the most beautiful features of btrfs is the
 snapshot capability: it is possible to make a snapshot of the root of the
 filesystem and to mount it on a subsequent reboot.
 But it is very complicated to manage the pivoting of a snapshot of a root
 filesystem, because I cannot delete the old root, due to the fact that the
 new root is placed inside the old root.

 A possible solution is not to put the root of the filesystem (where /usr,
 /etc and so on are placed) in the root of the btrfs filesystem; instead it
 should be accepted from the beginning that the root of a filesystem should
 be placed in a subvolume, which in turn is placed in the root of a btrfs
 filesystem...

 I am open to other opinions.


 Agreed.  One of the things that Chris and I have discussed is the
 possibility of just having dangling roots, since really the directories are
 just an easy way to get to the subvolumes.  This would let you delete the
 original volume and use the snapshot from then on out.  Something to do in
 the future for sure.

 i would really like to see a solution to this particular issue.  i may
 be missing something, but dangling subvol roots don't seem to
 address the management of the root volume itself.

 for example... most people will install their whole system into the
 real root (id=5), but this renders the system unmanageable, because
 there is no way to ever empty it without manually issuing an `rm -rf`.

 i'm having a really hard time controlling this with the initramfs hook
 i provide for archlinux users.  the hook requires a specific structure
 underneath what the user perceives as /, but i can only accomplish
 this for new installs -- for existing installs i can set up the proper
 subroot structure and snapshot their current root... but i cannot
 remove the stagnant files in the real root (id=5) that will never,
 ever be accessed again.

 ... or does dangling roots address this?

i forgot to mention, but a quick 'n dirty solution would be to simply
not let users do this by accident.  mkfs.btrfs could create a
new subvol, then mark it as the default... this way the user has to
manually mount with id=0, or re-mark 0 as the default.

effectively, users would unknowingly be installing into a
subvolume, rather than the top-level root (apologies if my terminology
is incorrect).
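
(roughly the manual version of that today, assuming the filesystem is
mounted at /mnt -- mkfs.btrfs would just do the first two steps for
you:)

# btrfs subvolume create /mnt/rootvol
# btrfs subvolume list /mnt
(note rootvol's ID -- say it is 256)
# btrfs subvolume set-default 256 /mnt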

C Anthony


Re: What to do about subvolumes?

2010-12-01 Thread Goffredo Baroncelli
On Wednesday, 01 December, 2010, you (C Anthony Risinger) wrote:
[...]
 i forgot to mention, but a quick 'n dirty solution would be to simply
 not let users do this by accident.  mkfs.btrfs could create a
 new subvol, then mark it as the default... this way the user has to
 manually mount with id=0, or re-mark 0 as the default.
 
 effectively, users would unknowingly be installing into a
 subvolume, rather than the top-level root (apologies if my terminology
 is incorrect).

I fully agree: it fulfills the KISS principle :-)

 C Anthony
 


-- 
gpg key@ keyserver.linux.it: Goffredo Baroncelli (ghigo) kreij...@inwind.it
Key fingerprint = 4769 7E51 5293 D36C 814E  C054 BF04 F161 3DC5 0512


Re: What to do about subvolumes?

2010-12-01 Thread Hugo Mills
On Wed, Dec 01, 2010 at 12:38:30PM -0500, Josef Bacik wrote:
 On Wed, Dec 01, 2010 at 04:38:00PM +, Hugo Mills wrote:
  On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
   === Quotas ===
   
   This is a huge topic in and of itself, but Christoph mentioned wanting to
   have an idea of what we wanted to do with it, so I'm putting it here.
   There are really 2 things here
   
   1) Limiting the size of subvolumes.  This is really easy for us: just
   create a subvolume and at creation time set a maximum size it can grow to,
   and not let it go farther than that.  Nice, simple and straightforward.
   
   2) Normal quotas, via the quota tools.  This just comes down to how we
   want to charge users: do we want to do it per subvolume, or per
   filesystem?  My vote is per filesystem.  Obviously this will make it
   tricky with snapshots, but I think if we're just charging the diffs
   between the original volume and the snapshot to the user then that will be
   the easiest for people to understand, rather than making a snapshot all of
   a sudden count the user's currently used quota * 2.
  
 This is going to be tricky to get the semantics right, I suspect.
  
 Say you've created a subvolume, A, containing 10G of Useful Stuff
  (say, a base image for VMs). This counts 10G against your quota. Now,
  I come along and snapshot that subvolume (as a writable subvolume) --
  call it B. This is essentially free for me, because I've got a COW
  copy of your subvolume (and the original counts against your quota).
  
 If I now modify a file in subvolume B, the full modified section
  goes onto my quota. This is all well and good. But what happens if you
  delete your subvolume, A? Suddenly, I get lumbered with 10G of extra
  files.  Worse, what happens if someone else had made a snapshot of A,
  too? Who gets the 10G added to their quota, me or them? What if I'd
  filled up my quota? Would that stop you from deleting your copy,
  because my copy can't be charged against my quota? Would I just end up
  unexpectedly 10G over quota?
  
 
 If you delete your subvolume A (i.e. use the btrfs tool to delete it), you
 will only be stuck with what you changed in snapshot B.  So if you only
 changed 5 gigs' worth of information, and you deleted the original
 subvolume, you would have 5 gigs charged to your quota.

   This doesn't work, though, if the owners of the original and
new subvolume are different:

Case 1:

 * Porthos creates 10G data.
 * Athos makes a snapshot of Porthos's data.
 * A sysadmin (Richelieu) changes the ownership on Athos's snapshot of
   Porthos's data to Athos.
 * Porthos deletes his copy of the data.

Case 2:

 * Porthos creates 10G of data.
 * Athos makes a snapshot of Porthos's data.
 * Porthos deletes his copy of the data.
 * A sysadmin (Richelieu) changes the ownership on Athos's snapshot of
   Porthos's data to Athos.

Case 3:

 * Porthos creates 10G data.
 * Athos makes a snapshot of Porthos's data.
 * Aramis makes a snapshot of Porthos's data.
 * A sysadmin (Richelieu) changes the ownership on Athos's snapshot of
   Porthos's data to Athos.
 * Porthos deletes his copy of the data.

Case 4:

 * Porthos creates 10G data.
 * Athos makes a snapshot of Porthos's data.
 * Aramis makes a snapshot of Athos's data.
 * Porthos deletes his copy of the data.
   [Consider also Richelieu changing ownerships of Athos's and Aramis's
   data at alternative points in this sequence]

   In each of these, who gets charged (and how much) for their copy of
the data?

  The idea is you are only charged for what blocks
 you have on the disk.  Thanks,

   My point was that it's perfectly possible to have blocks on the
disk that are effectively owned by two people, and that the person to
charge for those blocks is, to me, far from clear. You either end up
charging twice for a single set of blocks on the disk, or you end up
in a situation where one person's actions can cause another person's
quota to fill up. Neither of these is particularly obvious behaviour.

   Hugo.

-- 
=== Hugo Mills: h...@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- I believe that it's closely correlated with ---   
   the aeroswine coefficient.




Re: What to do about subvolumes?

2010-12-01 Thread J. Bruce Fields
On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
 Hello,
 
 Various people have complained about how BTRFS deals with subvolumes
 recently, specifically the fact that they all have the same inode number,
 and there's no discrete separation from one subvolume to another.  Christoph
 asked that I lay out a basic design document of how we want subvolumes to
 work so we can hash everything out now, fix what is broken, and then move
 forward with a design that everybody is more or less happy with.  I
 apologize in advance for how freaking long this email is going to be.  I
 assume that most people are generally familiar with how BTRFS works, so I'm
 not going to bother explaining some stuff in great detail.
 
 === What are subvolumes? ===
 
 They are just another tree.  In BTRFS we have various b-trees to describe
 the filesystem.  A few of them are filesystem wide, such as the extent tree,
 chunk tree, root tree etc.  The trees that hold the actual filesystem data,
 that is inodes and such, are kept in their own b-tree.  This is how
 subvolumes and snapshots appear on disk: they are simply new b-trees with
 all of the file data contained within them.
 
 === What do subvolumes look like? ===
 
 All the user sees are directories.  They act like any other directory acts,
 with a few exceptions
 
 1) You cannot hardlink between subvolumes.  This is because subvolumes have
 their own inode numbers and such; think of them as separate mounts in this
 case.  You cannot hardlink between two mounts because the link needs to
 point to the same on-disk inode, which is impossible between two different
 filesystems.  The same is true for subvolumes: they have their own trees
 with their own inodes and inode numbers, so it's impossible to hardlink
 between them.

OK, so I'm unclear: would it be possible for nfsd to export subvolumes
independently?

For that to work, we need to be able to take an inode that we just
looked up by filehandle, and see which subvolume it belongs in.  So if
two subvolumes can point to the same inode, it doesn't work, but if
st_dev is different between them, e.g., that'd be fine.  Sounds like
you're seeing the latter is possible, good!
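
(A quick way to see that boundary from userspace -- a sketch assuming
sub-a and sub-b are sibling subvolumes in the current directory and
sub-a/file already exists; btrfs refuses cross-subvolume links with
EXDEV, the same error link(2) gives across real mounts:)

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Try to hardlink a file from one subvolume into another. */
int main(void)
{
        if (link("sub-a/file", "sub-b/link") == 0) {
                printf("link succeeded (same subvolume?)\n");
                return 0;
        }
        /* Across subvolumes btrfs fails this with EXDEV, just as
         * link(2) does across two distinct filesystems. */
        printf("link failed: %s\n", strerror(errno));
        return errno == EXDEV ? 0 : 1;
}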

 
 1a) In case it wasn't clear from above, each subvolume has their own inode
 numbers, so you can have the same inode numbers used between two different
 subvolumes, since they are two different trees.
 
 2) Obviously you can't just rm -rf subvolumes.  Because they are roots there's
 extra metadata to keep track of them, so you have to use one of our ioctls to
 delete subvolumes/snapshots.
 
 But permissions and everything else they are the same.
 
 There is one tricky thing.  When you create a subvolume, the directory inode
 that is created in the parent subvolume has the inode number of 256.

Is that the right way to say this?  Doing a quick test, the inode
numbers that a readdir of the parent directory returns *are* distinct.
It's just the inode number that you get when you stat that is different.

Which is all fine and normal, *if* you treat this as a real mountpoint
with its own vfsmount, st_dev, etc.

 === How do we want subvolumes to work from a user perspective? ===
 
 1) Users need to be able to create their own subvolumes.  The permission
 semantics will be absolutely the same as creating directories, so I don't 
 think
 this is too tricky.  We want this because you can only take snapshots of
 subvolumes, and so it is important that users be able to create their own
 discrete snapshottable targets.
 
 2) Users need to be able to snapshot their subvolumes.  This is basically the
 same as #1, but it bears repeating.
 
 3) Subvolumes shouldn't need to be specifically mounted.  This is also
 important, we don't want users to have to go around mounting their subvolumes 
 up
 manually one-by-one.  Today users just cd into subvolumes and it works, just
 like cd'ing into a directory.

And the separate nfsd exports is another thing I'd really love to see
work: currently you can export a subtree of a filesystem if you want,
but it's trivial to escape the subtree by guessing filehandles.  So this
gives us an easy way for administrators to create secure separate
exports without having to manage entirely separate volumes.

If subvolumes got real mountpoints and so on, this would be easy.
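
For illustration, roughly what that could look like in /etc/exports
once each subvolume is a first-class export boundary (the paths, host
pattern, and fsid values here are made up):

/srv/btrfs        *.example.com(rw,fsid=100)
/srv/btrfs/sub-a  *.example.com(rw,fsid=101)
/srv/btrfs/sub-b  *.example.com(ro,fsid=102)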

--b.


Re: What to do about subvolumes?

2010-12-01 Thread Josef Bacik
On Wed, Dec 01, 2010 at 02:44:04PM -0500, J. Bruce Fields wrote:
 On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
  Hello,
  
  Various people have complained about how BTRFS deals with subvolumes
  recently, specifically the fact that they all have the same inode number,
  and there's no discrete separation from one subvolume to another.
  Christoph asked that I lay out a basic design document of how we want
  subvolumes to work so we can hash everything out now, fix what is broken,
  and then move forward with a design that everybody is more or less happy
  with.  I apologize in advance for how freaking long this email is going to
  be.  I assume that most people are generally familiar with how BTRFS works,
  so I'm not going to bother explaining some stuff in great detail.
  
  === What are subvolumes? ===
  
  They are just another tree.  In BTRFS we have various b-trees to describe
  the filesystem.  A few of them are filesystem wide, such as the extent
  tree, chunk tree, root tree etc.  The trees that hold the actual filesystem
  data, that is inodes and such, are kept in their own b-tree.  This is how
  subvolumes and snapshots appear on disk: they are simply new b-trees with
  all of the file data contained within them.
  
  === What do subvolumes look like? ===
  
  All the user sees are directories.  They act like any other directory
  acts, with a few exceptions
  
  1) You cannot hardlink between subvolumes.  This is because subvolumes
  have their own inode numbers and such; think of them as separate mounts in
  this case.  You cannot hardlink between two mounts because the link needs
  to point to the same on-disk inode, which is impossible between two
  different filesystems.  The same is true for subvolumes: they have their
  own trees with their own inodes and inode numbers, so it's impossible to
  hardlink between them.
 
 OK, so I'm unclear: would it be possible for nfsd to export subvolumes
 independently?
 

Yeah.

 For that to work, we need to be able to take an inode that we just
 looked up by filehandle, and see which subvolume it belongs in.  So if
 two subvolumes can point to the same inode, it doesn't work, but if
 st_dev is different between them, e.g., that'd be fine.  Sounds like
 you're seeing the latter is possible, good!
 

So you can't have the same inode (the same on-disk object) in two subvolumes,
since they are different trees.  But you can have the same inode *numbers*
between two subvolumes, again because they are different trees.

  
  1a) In case it wasn't clear from above, each subvolume has its own inode
  numbers, so you can have the same inode numbers used between two different
  subvolumes, since they are two different trees.
  
  2) Obviously you can't just rm -rf subvolumes.  Because they are roots,
  there's extra metadata to keep track of them, so you have to use one of our
  ioctls to delete subvolumes/snapshots.
  
  But permissions and everything else they are the same.
  
  There is one tricky thing.  When you create a subvolume, the directory
  inode that is created in the parent subvolume has the inode number of 256.
 
 Is that the right way to say this?  Doing a quick test, the inode
 numbers that a readdir of the parent directory returns *are* distinct.
 It's just the inode number that you get when you stat that is different.
 
 Which is all fine and normal, *if* you treat this as a real mountpoint
 with its own vfsmount, st_dev, etc.
 

Oh well crud, I was hoping that I could leave the inode numbers as 256 for
everything, but I forgot about readdir.  So the inode item in the parent would
have to have a unique inode number that would get spit out in readdir, but then
if we stat'ed the directory we'd get 256 for the inode number.  Oh well,
incompat flag it is then.

  === How do we want subvolumes to work from a user perspective? ===
  
  1) Users need to be able to create their own subvolumes.  The permission
  semantics will be absolutely the same as creating directories, so I don't
  think this is too tricky.  We want this because you can only take
  snapshots of subvolumes, and so it is important that users be able to
  create their own discrete snapshottable targets.
  
  2) Users need to be able to snapshot their subvolumes.  This is basically
  the same as #1, but it bears repeating.
  
  3) Subvolumes shouldn't need to be specifically mounted.  This is also
  important; we don't want users to have to go around mounting their
  subvolumes up manually one-by-one.  Today users just cd into subvolumes
  and it works, just like cd'ing into a directory.
 
 And the separate nfsd exports is another thing I'd really love to see
 work: currently you can export a subtree of a filesystem if you want,
 but it's trivial to escape the subtree by guessing filehandles.  So this
 gives us an easy way for administrators to create secure separate
 exports without having to manage entirely separate volumes.
 
 If subvolumes got real mountpoints and so on, this would be easy.

Re: What to do about subvolumes?

2010-12-01 Thread J. Bruce Fields
On Wed, Dec 01, 2010 at 02:54:33PM -0500, Josef Bacik wrote:
 Oh well crud, I was hoping that I could leave the inode numbers as 256 for
 everything, but I forgot about readdir.  So the inode item in the parent
 would have to have a unique inode number that would get spit out in readdir,
 but then if we stat'ed the directory we'd get 256 for the inode number.  Oh
 well, incompat flag it is then.

I think you're already fine:

# mkdir TMP
# dd if=/dev/zero of=TMP-image bs=1M count=512
# mkfs.btrfs TMP-image
# mount -oloop TMP-image TMP/
# cd TMP
# btrfs subvolume create sub-a
# btrfs subvolume create sub-b
# ../readdir-inos .
. 256 256
.. 256 4130609
sub-a 256 256
sub-b 257 256

Where readdir-inos is my silly test program below, and the first number is from
readdir, the second from stat.

?

--b.

#include <stdio.h>
#include <err.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <dirent.h>

/* demonstrate that for mountpoints, readdir returns the ino of the
 * mounted-on directory, while stat returns the ino of the mounted
 * directory. */

int main(int argc, char *argv[])
{
        struct dirent *de;
        int ret;
        DIR *d;

        if (argc != 2)
                errx(1, "usage: %s <directory>", argv[0]);
        ret = chdir(argv[1]);
        if (ret)
                errx(1, "chdir %s", argv[1]);
        d = opendir(".");
        if (!d)
                errx(1, "opendir .");
        while ((de = readdir(d)) != NULL) {
                struct stat st;

                /* d_ino comes from the directory entry; st_ino from
                 * stat(2), which follows into whatever is mounted on
                 * the name. */
                ret = stat(de->d_name, &st);
                if (ret)
                        errx(1, "stat %s", de->d_name);
                printf("%s %llu %llu\n", de->d_name,
                       (unsigned long long)de->d_ino,
                       (unsigned long long)st.st_ino);
        }
        return 0;
}



Re: What to do about subvolumes?

2010-12-01 Thread Jeff Layton
On Wed, 1 Dec 2010 09:21:36 -0500
Josef Bacik jo...@redhat.com wrote:

 There is one tricky thing.  When you create a subvolume, the directory inode
 that is created in the parent subvolume has the inode number of 256.  So if
 you have a bunch of subvolumes in the same parent subvolume, you are going
 to have a bunch of directories with the inode number of 256.  This is so
 when users cd into a subvolume we can know it's a subvolume and do all the
 normal voodoo to start looking in the subvolume's tree instead of the parent
 subvolume's tree.
 
 This is where things go a bit sideways.  We had serious problems with NFS,
 but thankfully NFS gives us a bunch of hooks to get around these problems.
 CIFS/Samba do not, so we will have problems there, not to mention any other
 userspace application that looks at inode numbers.

A more common use case than CIFS or Samba is going to be things like
backup programs.  They commonly look at inode numbers in order to
identify hardlinks, and may be horribly confused when there are files
that have a link count > 1 and inode number collisions with other files.

That probably qualifies as an enterprise-ready show-stopper...
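
For illustration, the usual heuristic such tools apply looks something
like this (a sketch, not taken from any particular backup program):

#include <stdio.h>
#include <sys/stat.h>

/* Backup-tool-style hardlink check: same (st_dev, st_ino) with
 * st_nlink > 1 means "same file".  If two subvolumes share st_dev but
 * reuse inode numbers, this falsely merges distinct files. */
static int looks_hardlinked(const char *a, const char *b)
{
        struct stat sa, sb;

        if (stat(a, &sa) || stat(b, &sb))
                return 0;
        return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino &&
               sa.st_nlink > 1;
}

int main(int argc, char *argv[])
{
        if (argc != 3)
                return 1;
        printf("%s\n", looks_hardlinked(argv[1], argv[2]) ?
               "hardlinked" : "distinct");
        return 0;
}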

 === What do we do? ===
 
 This is where I expect to see the most discussion.  Here is what I want to do
 
 1) Scrap the 256 inode number thing.  Instead we'll just put a flag in the
 inode to say "Hey, I'm a subvolume" and then we can do all of the
 appropriate magic that way.  This unfortunately will be an incompatible
 format change, but the sooner we get this addressed the easier it will be in
 the long run.  Obviously when I say format change I mean via the incompat
 bits we have, so old fs's won't be broken and such.
 
 2) Do something like NFS's referral mounts when we cd into a subvolume.  Now
 we just do dentry trickery, but that doesn't make the boundary between
 subvolumes clear, so it will confuse people (and Samba) when they walk into
 a subvolume and all of a sudden the inode numbers are the same as in the
 directory behind them.  With the referral mount approach, each subvolume
 appears to be its own mount, and that way things like NFS and Samba will
 work properly.
 

Sounds like you're on the right track.

The key concept is really that an inode number should be unique within
the scope of the st_dev. The simplest solution for you here is simply to
give each subvol its own st_dev and mount it up via a shrinkable mount
automagically when someone walks into the directory. In addition to the
examples of this in NFS, CIFS does this for DFS referrals.

Today, this is mostly done by hijacking the follow_link operation, but
David Howells proposed some patches a while back to do this via a more
formalized interface. It may be reasonable to target this work on top
of that, depending on the state of those changes...

-- 
Jeff Layton jlay...@redhat.com


Re: What to do about subvolumes?

2010-12-01 Thread Josef Bacik
On Wed, Dec 01, 2010 at 03:00:08PM -0500, J. Bruce Fields wrote:
 On Wed, Dec 01, 2010 at 02:54:33PM -0500, Josef Bacik wrote:
  Oh well crud, I was hoping that I could leave the inode numbers as 256 for
  everything, but I forgot about readdir.  So the inode item in the parent
  would have to have a unique inode number that would get spit out in
  readdir, but then if we stat'ed the directory we'd get 256 for the inode
  number.  Oh well, incompat flag it is then.
 
 I think you're already fine:
 
   # mkdir TMP
   # dd if=/dev/zero of=TMP-image bs=1M count=512
   # mkfs.btrfs TMP-image
   # mount -oloop TMP-image TMP/
   # btrfs subvolume create sub-a
   # btrfs subvolume create sub-b
   ../readdir-inos .
   . 256 256
   .. 256 4130609
   sub-a 256 256
   sub-b 257 256
 
 Where readdir-inos is my silly test program below, and the first number is 
 from
 readdir, the second from stat.


Heh, as soon as I typed my email I went and actually looked at the code; it
looks like for readdir we fill in the root id, which will be unique, so hot
damn, we are good and I don't have to use a stupid incompat flag.  Thanks for
checking that :),

Josef


Re: What to do about subvolumes?

2010-12-01 Thread J. Bruce Fields
On Wed, Dec 01, 2010 at 03:09:52PM -0500, Josef Bacik wrote:
 On Wed, Dec 01, 2010 at 03:00:08PM -0500, J. Bruce Fields wrote:
  On Wed, Dec 01, 2010 at 02:54:33PM -0500, Josef Bacik wrote:
   Oh well crud, I was hoping that I could leave the inode numbers as 256
   for everything, but I forgot about readdir.  So the inode item in the
   parent would have to have a unique inode number that would get spit out
   in readdir, but then if we stat'ed the directory we'd get 256 for the
   inode number.  Oh well, incompat flag it is then.
  
  I think you're already fine:
  
  # mkdir TMP
  # dd if=/dev/zero of=TMP-image bs=1M count=512
  # mkfs.btrfs TMP-image
  # mount -oloop TMP-image TMP/
  # btrfs subvolume create sub-a
  # btrfs subvolume create sub-b
  ../readdir-inos .
  . 256 256
  .. 256 4130609
  sub-a 256 256
  sub-b 257 256
  
  Where readdir-inos is my silly test program below, and the first number is
  from readdir, the second from stat.
 
 
 Heh, as soon as I typed my email I went and actually looked at the code; it
 looks like for readdir we fill in the root id, which will be unique, so hot
 damn, we are good and I don't have to use a stupid incompat flag.  Thanks
 for checking that :),

My only complaint was just about how you said this:

	"When you create a subvolume, the directory inode that is
	created in the parent subvolume has the inode number of 256"

If you revise that you might want to clarify.  (Maybe "Every subvolume
has a root directory inode with inode number 256"?)

The way you've stated it sounds like you're talking about the
readdir-returned number, which would normally come from the inode that
has been covered up by the mount, and which really is an inode in the
parent filesystem.

--b.


Re: What to do about subvolumes?

2010-12-01 Thread Freddie Cash
On Wed, Dec 1, 2010 at 11:35 AM, Hugo Mills hugo-l...@carfax.org.uk wrote:
 On Wed, Dec 01, 2010 at 12:38:30PM -0500, Josef Bacik wrote:
  If you delete your subvolume A (i.e. use the btrfs tool to delete it), you
  will only be stuck with what you changed in snapshot B.  So if you only
  changed 5 gigs' worth of information, and you deleted the original
  subvolume, you would have 5 gigs charged to your quota.

   This doesn't work, though, if the owners of the original and
 new subvolume are different:

 Case 1:

  * Porthos creates 10G data.
  * Athos makes a snapshot of Porthos's data.
  * A sysadmin (Richelieu) changes the ownership on Athos's snapshot of
   Porthos's data to Athos.
  * Porthos deletes his copy of the data.

 Case 2:

  * Porthos creates 10G of data.
  * Athos makes a snapshot of Porthos's data.
  * Porthos deletes his copy of the data.
  * A sysadmin (Richelieu) changes the ownership on Athos's snapshot of
   Porthos's data to Athos.

 Case 3:

  * Porthos creates 10G data.
  * Athos makes a snapshot of Porthos's data.
  * Aramis makes a snapshot of Porthos's data.
  * A sysadmin (Richelieu) changes the ownership on Athos's snapshot of
   Porthos's data to Athos.
  * Porthos deletes his copy of the data.

 Case 4:

  * Porthos creates 10G data.
  * Athos makes a snapshot of Porthos's data.
  * Aramis makes a snapshot of Athos's data.
  * Porthos deletes his copy of the data.
   [Consider also Richelieu changing ownerships of Athos's and Aramis's
   data at alternative points in this sequence]

   In each of these, who gets charged (and how much) for their copy of
 the data?

  The idea is you are only charged for what blocks
 you have on the disk.  Thanks,

   My point was that it's perfectly possible to have blocks on the
 disk that are effectively owned by two people, and that the person to
 charge for those blocks is, to me, far from clear. You either end up
 charging twice for a single set of blocks on the disk, or you end up
 in a situation where one person's actions can cause another person's
 quota to fill up. Neither of these is particularly obvious behaviour.

As a sysadmin and as a user, quotas shouldn't be about physical
blocks of storage used but should be about logical storage used.

IOW, if the filesystem is compressed, using 1 GB of physical space to
store 10 GB of data, my quota used should be 10 GB.

Similar for deduplication.  The quota is based on the storage *before*
the file is deduped.  Not after.

Similar for snapshots.  If UserA has 10 GB of quota used and I snapshot
their filesystem, then my quota used would be 10 GB as well.  As
data in my snapshot changes, my quota used is updated to reflect
that (change 1 GB of data compared to snapshot, use 1 GB of quota).

You have to (or at least should) keep two sets of stats for storage usage:
  - logical amount used (real file size, before compression, before
de-dupe, before snapshots, etc)
  - physical amount used (what's actually written to disk)

User-level quotas are based on the logical storage used.
Admin-level quotas (if you want to implement them) would be based on
physical storage used.

Thus, the output of things like df, du, and ls would show the logical
storage used and file sizes.  And you would have an additional
option to those apps (--real or something) to show the actual
storage used and file sizes as stored on disk.

Trying to make quotas and disk usage utilities work based on what's
physically on disk is just backwards, imo.  And prone to a lot of
confusion.
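
As a toy illustration of the two counters described above (nothing
btrfs-specific, just the bookkeeping rule):

#include <stdio.h>

/* Toy dual accounting: every write is charged at its logical size,
 * while the physical counter reflects compression/dedup savings.
 * Illustrative only; no real filesystem keeps quota this simply. */
struct usage {
        long long logical;   /* bytes, before compression/dedup */
        long long physical;  /* bytes actually written to disk */
};

static void charge_write(struct usage *u, long long bytes,
                         double stored_fraction)
{
        u->logical += bytes;  /* user-visible quota */
        u->physical += (long long)(bytes * stored_fraction); /* admin view */
}

int main(void)
{
        struct usage u = { 0, 0 };

        charge_write(&u, 10LL << 30, 0.1); /* 10 GB compressing 10:1 */
        printf("logical %lld GB, physical %lld GB\n",
               u.logical >> 30, u.physical >> 30);
        return 0;
}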

-- 
Freddie Cash
fjwc...@gmail.com


Re: What to do about subvolumes?

2010-12-01 Thread Jeff Layton
On Wed, 1 Dec 2010 21:46:03 +0100
Goffredo Baroncelli kreij...@libero.it wrote:

 On Wednesday, 01 December, 2010, Jeff Layton wrote:
 A more common use case than CIFS or Samba is going to be things like
 backup programs.  They commonly look at inode numbers in order to
 identify hardlinks, and may be horribly confused when there are files
 that have a link count > 1 and inode number collisions with other files.
 
 That probably qualifies as an enterprise-ready show-stopper...
 
 I hope that a backup program uses the pair (inode, fsid) to identify whether
 two files are hardlinked... otherwise a backup of two mounted filesystems
 can be quite dangerous...
 
 
 From the statfs(2) man page:
 [..]
 The f_fsid field
 [...]
 The general idea is that f_fsid contains some random stuff such that the
 pair (f_fsid,ino) uniquely determines a file.  Some operating systems use (a
 variation on) the device number, or the device number combined with the
 file-system type.  Several OSes restrict giving out the f_fsid field to the
 superuser only (and zero it for unprivileged users), because this field is
 used in the filehandle of the file system when NFS-exported, and giving it
 out is a security concern.
 
 
 And the btrfs_statfs function returns a different fsid for every subvolume.
 

Ahh, interesting. I've never read that blurb on f_fsid...

Unfortunately, it looks like not all filesystems fill that field out.
NFS and CIFS leave it conspicuously blank. Those are probably bugs...

OTOH, the GLibc docs say this:

dev_t st_dev
Identifies the device containing the file. The st_ino and st_dev,
taken together, uniquely identify the file. The st_dev value is not
necessarily consistent across reboots or system crashes, however. 

...and it's always been my understanding that a st_dev/st_ino
combination should be unique.

Is there some definitive POSIX statement on why one should prefer to
use f_fsid over st_dev in this situation?
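
For what it's worth, keying files by (f_fsid, st_ino) as the man page
suggests looks something like this on Linux (a sketch; note that
picking apart f_fsid via __val is glibc-specific):

#include <stdio.h>
#include <sys/stat.h>
#include <sys/vfs.h>

/* Identify a file by (f_fsid, st_ino) instead of (st_dev, st_ino). */
struct file_key {
        fsid_t fsid;
        ino_t ino;
};

static int file_key_get(const char *path, struct file_key *k)
{
        struct stat st;
        struct statfs sfs;

        if (stat(path, &st) || statfs(path, &sfs))
                return -1;
        k->fsid = sfs.f_fsid;
        k->ino = st.st_ino;
        return 0;
}

int main(int argc, char *argv[])
{
        struct file_key k;

        if (argc != 2 || file_key_get(argv[1], &k))
                return 1;
        printf("fsid %x:%x ino %llu\n", k.fsid.__val[0], k.fsid.__val[1],
               (unsigned long long)k.ino);
        return 0;
}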

-- 
Jeff Layton jlay...@redhat.com


Re: What to do about subvolumes?

2010-12-01 Thread Hugo Mills
On Wed, Dec 01, 2010 at 12:24:28PM -0800, Freddie Cash wrote:
 On Wed, Dec 1, 2010 at 11:35 AM, Hugo Mills hugo-l...@carfax.org.uk wrote:
   The idea is you are only charged for what blocks
  you have on the disk.  Thanks,
 
    My point was that it's perfectly possible to have blocks on the
  disk that are effectively owned by two people, and that the person to
  charge for those blocks is, to me, far from clear. You either end up
  charging twice for a single set of blocks on the disk, or you end up
  in a situation where one person's actions can cause another person's
  quota to fill up. Neither of these is particularly obvious behaviour.
 
 As a sysadmin and as a user, quotas shouldn't be about physical
 blocks of storage used but should be about logical storage used.
 
 IOW, if the filesystem is compressed, using 1 GB of physical space to
 store 10 GB of data, my quota used should be 10 GB.
 
 Similar for deduplication.  The quota is based on the storage *before*
 the file is deduped.  Not after.
 
 Similar for snapshots.  If UserA has 10 GB of quota used, I snapshot
 their filesystem, then my quota used would be 10 GB as well.  As
 data in my snapshot changes, my quota used is updated to reflect
 that (change 1 GB of data compared to snapshot, use 1 GB of quota).

   So if I've got 10G of data, and I snapshot it, I've just used
another 10G of quota?

 You have to (or at least should) keep two sets of stats for storage usage:
   - logical amount used (real file size, before compression, before
 de-dupe, before snapshots, etc)
   - physical amount used (what's actually written to disk)
 
 User-level quotas are based on the logical storage used.
 Admin-level quotas (if you want to implement them) would be based on
 physical storage used.
 
 Thus, the output of things like df, du, ls would show the logical
 storage used and file sizes.  And you would either have an additional
 option to those apps (--real or something) to show the actual
 storage used and file sizes as stored on disk.
 
 Trying to make quotas and disk usage utilities to work based on what's
 physically on disk is just backwards, imo.  And prone to a lot of
 confusion.

   Trying to make quotas work based on what's physically on the disk
appears to have serious issues with the semantics of "using up space",
so I agree with you on this point (and, indeed, it was the point I was
trying to make).

   However, doing it that way also effectively penalises users and
prevents (or severely discourages) them from using the advanced
functions of the filesystem. There's no benefit (in disk usage terms)
to the user in using a snapshot -- they might as well use plain cp.

   Hugo.

-- 
=== Hugo Mills: h...@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- I believe that it's closely correlated with ---   
   the aeroswine coefficient.




Re: What to do about subvolumes?

2010-12-01 Thread Freddie Cash
On Wed, Dec 1, 2010 at 1:28 PM, Hugo Mills hugo-l...@carfax.org.uk wrote:
 On Wed, Dec 01, 2010 at 12:24:28PM -0800, Freddie Cash wrote:
 On Wed, Dec 1, 2010 at 11:35 AM, Hugo Mills hugo-l...@carfax.org.uk wrote:
   The idea is you are only charged for what blocks
  you have on the disk.  Thanks,
 
    My point was that it's perfectly possible to have blocks on the
  disk that are effectively owned by two people, and that the person to
  charge for those blocks is, to me, far from clear. You either end up
  charging twice for a single set of blocks on the disk, or you end up
  in a situation where one person's actions can cause another person's
  quota to fill up. Neither of these is particularly obvious behaviour.

 As a sysadmin and as a user, quotas shouldn't be about physical
 blocks of storage used but should be about logical storage used.

 IOW, if the filesystem is compressed, using 1 GB of physical space to
 store 10 GB of data, my quota used should be 10 GB.

 Similar for deduplication.  The quota is based on the storage *before*
 the file is deduped.  Not after.

 Similar for snapshots.  If UserA has 10 GB of quota used, I snapshot
 their filesystem, then my quota used would be 10 GB as well.  As
 data in my snapshot changes, my quota used is updated to reflect
 that (change 1 GB of data compared to snapshot, use 1 GB of quota).

   So if I've got 10G of data, and I snapshot it, I've just used
 another 10G of quota?

Sorry, forgot the per user bit above.

If UserA has 10 GB of data, then UserB snapshots it, UserB's quota
usage is 10 GB.

If UserA has 10 GB of data and snapshots it, then only 10 GB of quota
usage is used, as there is 0 difference between the snapshot and the
filesystem.  As UserA modifies data, their quota usage increases by
the amount that is modified (ie 10 GB data, snapshot, modify 1 GB data
== 11 GB quota usage).

If you combine the two scenarios, you end up with:
  - UserA has 10 GB of data == 10 GB quota usage
  - UserB snapshots UserA's filesystem (clone), so UserB has 10 GB
quota usage (even though 0 blocks have changed on disk)
  - UserA snapshots UserA's filesystem == no change to quota usage (no
blocks on disk have changed)
  - UserA modifies 1 GB of data in the filesystem == 1 GB new quota
usage (11 GB total) (1 GB of blocks owned by UserA have changed, plus
the 10 GB in the snapshot)
  - UserB still only has 10 GB quota usage, since their snapshot
hasn't changed (0 blocks changed)

If UserA deletes their filesystem and all their snapshots, freeing up
11 GB of quota usage on their account, UserB's quota will still be 10
GB, and the blocks on the disk aren't actually removed (still
referenced by UserB's snapshot).

Basically, within a user's account, only the data unique to a snapshot
should count toward the quota.

Across accounts, the original (root) snapshot would count completely
to the new user's quota, and then only data unique to subsequent
snapshots would count.
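
To make the accounting concrete, here's a toy model of the scheme I'm
describing (just a sketch in C, not btrfs code; all names and numbers
are made up for illustration):

#include <stdio.h>

#define NBLOCKS 32
#define NUSERS  2

/*
 * refs[u][b] is nonzero if any subvolume or snapshot belonging to
 * user u references block b.  Sharing between users does not reduce
 * anyone's charge: a block referenced by both users counts once
 * against each user's quota.
 */
static int refs[NUSERS][NBLOCKS];

static long quota_used(int user)
{
	long used = 0;
	int b;

	for (b = 0; b < NBLOCKS; b++)
		if (refs[user][b])
			used++;
	return used;
}

int main(void)
{
	int b;

	/* UserA (0) writes 10 blocks of data. */
	for (b = 0; b < 10; b++)
		refs[0][b] = 1;

	/* UserB (1) snapshots UserA's subvolume: all 10 blocks are now
	 * also referenced by UserB, so UserB is charged the full 10. */
	for (b = 0; b < 10; b++)
		refs[1][b] = 1;

	/* UserA snapshots their own subvolume: the blocks were already
	 * referenced by UserA, so UserA's charge does not change. */

	/* UserA rewrites 1 block: CoW allocates a new block, while the
	 * old one stays pinned by the snapshots, so UserA pays for 11. */
	refs[0][10] = 1;

	printf("UserA charged for %ld blocks\n", quota_used(0)); /* 11 */
	printf("UserB charged for %ld blocks\n", quota_used(1)); /* 10 */
	return 0;
}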

I hope that makes it more clear.  :)  All the different layers and
whatnot get confusing.  :)

-- 
Freddie Cash
fjwc...@gmail.com


Re: What to do about subvolumes?

2010-12-01 Thread Michael Vrable

On Wed, Dec 01, 2010 at 03:09:52PM -0500, Josef Bacik wrote:

On Wed, Dec 01, 2010 at 03:00:08PM -0500, J. Bruce Fields wrote:

I think you're already fine:

# mkdir TMP
# dd if=/dev/zero of=TMP-image bs=1M count=512
# mkfs.btrfs TMP-image
# mount -oloop TMP-image TMP/
# cd TMP/
# btrfs subvolume create sub-a
# btrfs subvolume create sub-b
# ../readdir-inos .
. 256 256
.. 256 4130609
sub-a 256 256
sub-b 257 256

Where readdir-inos is my silly test program below, and the first 
number is from readdir, the second from stat.
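
The listing didn't make it into this message; a minimal sketch of such
a program (reconstructed from the description above, not the original
code) would be:

/*
 * readdir-inos (sketch): for each entry in the directory given as
 * argv[1], print the name, the inode number readdir() reports
 * (d_ino), and the inode number lstat() reports (st_ino).
 */
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
	char path[PATH_MAX];
	struct dirent *de;
	struct stat st;
	DIR *dir;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <dir>\n", argv[0]);
		exit(1);
	}
	dir = opendir(argv[1]);
	if (!dir) {
		perror("opendir");
		exit(1);
	}
	while ((de = readdir(dir)) != NULL) {
		snprintf(path, sizeof(path), "%s/%s", argv[1], de->d_name);
		if (lstat(path, &st) < 0) {
			perror(path);
			continue;
		}
		printf("%s %llu %llu\n", de->d_name,
		       (unsigned long long)de->d_ino,
		       (unsigned long long)st.st_ino);
	}
	closedir(dir);
	return 0;
}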




Heh, as soon as I typed my email I went and actually looked at the 
code; it looks like for readdir we fill in the root id, which will be 
unique, so hotdamn we are good and I don't have to use a stupid 
incompat flag.  Thanks for checking that :)


Except, aren't the inode numbers within a filesystem and the subvolume 
tree IDs allocated out of separate namespaces?  I don't think there's 
anything preventing a file/directory from having an inode number that 
clashes with one of the snapshots.


In fact, this already happens in the example above: "." (inode 256 in 
the root subvolume) clashes with sub-a (subvolume ID 256).


(Though I still don't understand the semantics well enough to say 
whether we need all the inode numbers returned by readdir to be 
distinct.)


--Michael Vrable


Re: What to do about subvolumes?

2010-12-01 Thread Mike Fedyk
On Wed, Dec 1, 2010 at 3:32 PM, Freddie Cash fjwc...@gmail.com wrote:
 [snip]
 If you combine the two scenarios, you end up with:
  - UserA has 10 GB of data == 10 GB quota usage
  - UserB snapshots UserA's filesystem (clone), so UserB has 10 GB
 quota usage (even though 0 blocks have changed on disk)

Please define where the owner of a subvolume/snapshot is stored.

To my knowledge, when you make a snapshot you get the same set of
files with the same set of owners and groups.  Whichever user takes
the snapshot, that doesn't change unless chown or chgrp is used.

Also, a non-root user (or a process without CAP_whatever) should not
be able to snapshot a subvolume whose root directory they do not own.
Without that restriction you end up with the same security and quota
issues that hard links have when you don't have separate filesystems.

You could have separate subvolumes for / and /home/foo; user foo could
snapshot / to /home/foo/exploit_later_001 and then just wait for an
exploit to turn up for one of the (now frozen, never-updated) binaries
or libs in /home/foo/exploit_later_001, and use it to own the box.

Yes, snapshot creation should be more restricted than hard links, for
good reason.
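
As a sketch of the check I have in mind (hypothetical userspace policy
code, not the actual btrfs ioctl path): allow the snapshot only if the
caller is root or owns the root directory of the source subvolume.

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * may_snapshot (sketch): return nonzero if the calling user should be
 * allowed to snapshot the subvolume rooted at subvol_root, under the
 * policy described above.
 */
static int may_snapshot(const char *subvol_root)
{
	struct stat st;

	if (stat(subvol_root, &st) < 0)
		return 0;               /* can't even stat it: deny */
	if (geteuid() == 0)
		return 1;               /* root: always allowed */
	return st.st_uid == geteuid();  /* must own the root directory */
}

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <subvolume>\n", argv[0]);
		return 1;
	}
	printf("snapshot of %s would be %s\n", argv[1],
	       may_snapshot(argv[1]) ? "allowed" : "denied");
	return 0;
}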

I have other questions, but the answer to this fundamental,
game-changing question may resolve many of the issues mentioned above.
