Re: [UNIONFS] 00/29 Unionfs and related patches pre-merge review (v2)

2008-02-02 Thread Al Viro
On Sat, Feb 02, 2008 at 01:45:15PM -0500, Erez Zadok wrote:
> > You are thinking about a non-interesting case.  _Files_ are not much
> > of a problem.  Directory tree is.  The real problems with all unionfs and
> > stacking implementations I've seen so far, all the way back to Heidemann et al.,
> > start when the topology of the underlying layer changes.
> 
> OK, so if I understand you, your concerns center around the fact that lower
> directories can be moved around (i.e., topology changes), what happens then
> to operations that go through the stackable f/s, and what users should
> expect to see.

Correct.
 
> > If you have clear
> > semantics for unionfs behaviour in the presence of such things, by all means,
> > publish it - as far as I know *nobody* has done that; not even on the
> > "what should we see when..." level, never mind the implementation.
> 
> Since stacking and NFS have some similarities, I first checked w/ the NFS
> people to see what their semantics are in a similar scenario: an NFS client
> could be validating a directory, then issue, say, a ->create; but in
> between, the server could have moved the directory that was validated.  In
> NFS, the ->create operation succeeds, and creates the file in the new
> location of the directory which was validated.
>
> Unionfs's behavior is similar: the newly created file will be successfully
> created in the moved directory.  The only exception is that if a lower
> branch is marked readonly by unionfs, a copyup will take place.

Er...  For unionfs it's a much more monumental problem.  Look - you have
a mapping between the directory trees of layers; everything else is
defined in terms of that mapping ("this file shadows that file", etc.).

If you allow a mix of old and new mappings, you can easily run into
situations where at some moment X1 covers Y1, X2 covers Y2, X2 is a
descendant of X1, and Y1 is a descendant of Y2 (e.g., upper /a covers lower
/c/d while upper /a/b covers lower /c, after a cross-directory rename in the
lower layer).  You *really* don't want to go there - if nothing else,
defining the behaviour of copyup in the face of that insanity will be very
painful.

What's more, and what makes NFS a bad model, the NFS client doesn't lock
directories on the NFS server.  unionfs *does* lock directories in the
underlying layers, which means that you have an entirely new set of
constraints - you can deadlock easily.  As a matter of fact, your
current code suffers from that problem - it violates the locking rules
in operations on covered layers.
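
To make the deadlock concrete, here is a minimal sketch (hypothetical helper,
2.6.24-era locking APIs; this is not code from the patch set) of how a stacked
f/s that orders lower i_mutexes by a stale upper-layer mapping can cross with
the VFS's own lock_rename() ordering:

    /* assumes: <linux/fs.h>, <linux/dcache.h>, <linux/mutex.h> */

    /*
     * The VFS locks two directories for rename via lock_rename():
     * s_vfs_rename_mutex first, then the ancestor before the descendant.
     * A stacked f/s that decides "who is the ancestor" from a stale
     * upper-layer mapping can take the same two mutexes in the opposite
     * order of a concurrent direct rename(2) in the lower layer -- the
     * classic ABBA deadlock.
     */
    static void lock_lower_pair_unsafely(struct dentry *d1, struct dentry *d2)
    {
            /* BAD: order chosen by the (possibly stale) upper mapping */
            mutex_lock(&d1->d_inode->i_mutex);
            mutex_lock_nested(&d2->d_inode->i_mutex, I_MUTEX_CHILD);
            /*
             * A concurrent lock_rename() in the lower layer that considers
             * d2 the ancestor takes these locks in the reverse order, and
             * neither task makes progress.
             */
    }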
 
> This hasn't been a problem for unionfs users to date.  The main reason is
> that when unionfs users modify lower files, they often do so while there's
> little to no activity going through the union itself.  And while it doesn't
> prevent directories from being moved around, this common usage mode does
> reduce the frequency with which topology changes can be an issue for unionfs
> users.

Ugh...  "Currently unionfs users tend to use it ways that do not trigger
fs corruption" is nice, but doesn't really address the "current unionfs
implementation contains races leadint to fs corruption"...

> > around, changing parents, etc.  Cross-directory rename() certainly rates
> > very high on the list of "WTF had they been smoking in UCB?" misfeatures,
> > but it's there and it has to be dealt with.
> 
> Well, it was UCB, UCLA, and Sun.  I don't think back in the early 90s they
> were too concerned about topology changes; f/s stacking was a new idea and
> they wanted to explore what can be done with it conceptually, not produce
> commercial-grade software (still, remind me to tell you the first-hand story
> I've learned about how full-blown stacking _almost_ made it into Solaris
> 2.0 :-)
 
> The only known reference to try and address this coherency problem was
> Heidemann's SOSP'95 paper titled "Performance of cache coherence in
> stackable filing."  The paper advocated changing the whole VFS and the
> caches (page cache + dnlc) to create a "unified cache manager" that was
> aware of complex layering topologies (including fan-out and fan-in).  It was
> even able to handle compression layers, where file data offsets change b/t
> the layers (a nasty problem).  Code for this unified cache manager was never
> released AFAIK.  I think Heidemann's approach was elegant, but I don't think
> it was practical as it required radical VFS/VM surgery.  Ironically, MS
> Windows has a single I/O cache manager that all storage and filesystem
> modules talk to directly (they're not allowed to pass IRPs directly b/t
> layers): so Windows can handle such coherency better than most Unix systems
> can today.
 
Different problem.  IIRC, that paper implicitly assumed that the mapping
between vnodes in different layers would _not_ be subject to massive
surgeries from the operations involved.

> However, for this "view" idea to work, I'll need a way to lock-out or hide
> the lower directories from the namespace, so no one can access them in r-w
> mode any longer (if at all).  Moreover, even if such a method was available,
> one would have to decide what to do with any open files 

Re: [UNIONFS] 00/29 Unionfs and related patches pre-merge review (v2)

2008-02-02 Thread Erez Zadok
In message <[EMAIL PROTECTED]>, Al Viro writes:
> On Sat, Jan 26, 2008 at 12:08:30AM -0500, Erez Zadok wrote:

[concerns about lower directories moving around...]

> You are thinking about a non-interesting case.  _Files_ are not much
> of a problem.  Directory tree is.  The real problems with all unionfs and
> stacking implementations I've seen so far, all the way back to Heidemann et al.,
> start when the topology of the underlying layer changes.

OK, so if I understand you, your concerns center around the fact that lower
directories can be moved around (i.e., topology changes), what happens then
to operations that go through the stackable f/s, and what users should
expect to see.

> If you have clear
> semantics for unionfs behaviour in the presence of such things, by all means,
> publish it - as far as I know *nobody* has done that; not even on the
> "what should we see when..." level, never mind the implementation.

Since stacking and NFS have some similarities, I first checked w/ the NFS
people to see what their semantics are in a similar scenario: an NFS client
could be validating a directory, then issue, say, a ->create; but in
between, the server could have moved the directory that was validated.  In
NFS, the ->create operation succeeds, and creates the file in the new
location of the directory which was validated.

Unionfs's behavior is similar: the newly created file will be successfully
created in the moved directory.  The only exception is that if a lower
branch is marked readonly by unionfs, a copyup will take place.

This hasn't been a problem for unionfs users to date.  The main reason is
that when unionfs users modify lower files, they often do so while there's
little to no activity going through the union itself.  And while it doesn't
prevent directories from being moved around, this common usage mode does
reduce the frequency with which topology changes can be an issue for unionfs
users.

I'll submit a patch to document this behavior.

> > Perhaps this general topic is a good one to discuss at more length at LSF?
> > Suggestions are welcome.
> 
> It would; I honestly do not know if the problem is solvable with the
> (lack of) constraints you apparently want.  Again, the real PITA begins
> when you start dealing with pieces of underlying trees getting moved
> around, changing parents, etc.  Cross-directory rename() certainly rates
> very high on the list of "WTF had they been smoking in UCB?" misfeatures,
> but it's there and it has to be dealt with.

Well, it was UCB, UCLA, and Sun.  I don't think back in the early 90s they
were too concerned about topology changes; f/s stacking was a new idea and
they wanted to explore what can be done with it conceptually, not produce
commercial-grade software (still, remind me to tell you the first-hand story
I've learned about how full-blown stacking _almost_ made it into Solaris
2.0 :-)

The only known reference to try and address this coherency problem was
Heidemann's SOSP'95 paper titled "Performance of cache coherence in
stackable filing."  The paper advocated changing the whole VFS and the
caches (page cache + dnlc) to create a "unified cache manager" that was
aware of complex layering topologies (including fan-out and fan-in).  It was
even able to handle compression layers, where file data offsets change b/t
the layers (a nasty problem).  Code for this unified cache manager was never
released AFAIK.  I think Heidemann's approach was elegant, but I don't think
it was practical as it required radical VFS/VM surgery.  Ironically, MS
Windows has a single I/O cache manager that all storage and filesystem
modules talk to directly (they're not allowed to pass IRPs directly b/t
layers): so Windows can handle such coherency better than most Unix systems
can today.

I've always thought of a different way to allow users to write to lower
branches -- through the union.  This is similar to what an old AT&T
unioning-like file system named "3DFS" did.  3DFS introduced a new directory
called "..." so if you cd to /mntpt/... then you got to the next level down
the stack (as if you popped the top one and now you see how the union looks
like without the top layer).  And if you "cd /mntpt/.../..." then you see
the view without the top two layers, etc.

So my idea is similar: to introduce virtual directory views that restrict
access to a single lower branch within the union.  So if someone does a "cd
/mnt/unionfs/.1" then they get access to branch 1; "cd /mnt/unionfs/.2" gets
access to branch 2; etc.  While this technique will waste a few names, it's
probably worth the savings in terms of cache-coherency pains (plus, the
actual virtual directory names can be configurable at mount time to allow
users to choose a non-conflicting dir name).  With this idea, users will
actually be accessing a one-branch union, but all ops and locks will have to go
through the union: no one would be able to modify lower files directly.
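
As a rough illustration only (nothing below exists in the patch set, and
every helper name is made up), the reserved-name dispatch could live in
->lookup, using the 2.6.24-era signature:

    /* assumes: <linux/fs.h>, <linux/namei.h>, <linux/ctype.h>, <linux/kernel.h> */

    static struct dentry *unionfs_lookup(struct inode *dir,
                                         struct dentry *dentry,
                                         struct nameidata *nd)
    {
            const char *name = (const char *)dentry->d_name.name;
            char *end;
            long bindex;

            /* reserved view names ".1", ".2", ... (configurable at mount) */
            if (name[0] == '.' && isdigit(name[1])) {
                    bindex = simple_strtol(name + 1, &end, 10);
                    if (*end == '\0' &&
                        bindex >= 1 && bindex <= nbranches(dir->i_sb))
                            /* pin this subtree to one lower branch */
                            return interpose_view(dir, dentry, bindex - 1);
            }
            return do_normal_lookup(dir, dentry, nd);  /* usual union path */
    }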

However, for this "view" idea to work, I'll need a way to lock-out or hide
the lower directories from the namespace, so no one can access them in r-w
mode any longer (if at all).  Moreover, even if such a method was available,
one would have to decide what to do with any open files

Re: [UNIONFS] 00/29 Unionfs and related patches pre-merge review (v2)

2008-01-26 Thread Al Viro
On Sat, Jan 26, 2008 at 12:08:30AM -0500, Erez Zadok wrote:

> > * lock_parent(): who said that you won't get the dentry moved
> > before managing to grab i_mutex on the parent?  While we are at it,
> > who said that you won't get the dentry moved between fetching d_parent
> > and doing dget()?  In that case the parent could've been _freed_ before
> > you get to dget().
> 
> OK, so it looks like I should use dget_parent() in my lock_parent(), as I've
> done elsewhere.  I'll also take a look at all instances in which I get
> dentry->d_parent and see if a d_lock is needed there.
 
dget_parent() doesn't deal with the problem of a rename() done directly
in that layer while you were waiting for i_mutex.
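
The usual way out is to re-check the parent after taking the lock, and retry
if it changed; a minimal sketch (not the unionfs code; 2.6.24-era APIs):

    /* assumes: <linux/fs.h>, <linux/dcache.h>, <linux/mutex.h> */

    static struct dentry *lock_parent_stable(struct dentry *dentry)
    {
            struct dentry *parent;

    again:
            parent = dget_parent(dentry);  /* stable ref, unlike raw d_parent */
            mutex_lock(&parent->d_inode->i_mutex);
            if (dentry->d_parent != parent) {
                    /* raced with rename(2): drop everything and retry */
                    mutex_unlock(&parent->d_inode->i_mutex);
                    dput(parent);
                    goto again;
            }
            /* holding i_mutex now pins the parenthood; caller unlocks+dputs */
            return parent;
    }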

> > +   lock_rename(lower_old_dir_dentry, lower_new_dir_dentry);
> > +   err = vfs_rename(lower_old_dir_dentry->d_inode, lower_old_dentry,
> > +lower_new_dir_dentry->d_inode, lower_new_dentry);
> > +   unlock_rename(lower_old_dir_dentry, lower_new_dir_dentry);
> > 
> > Uh-huh...  To start with, what guarantees that your lower_old_dentry
> > is still a child of your lower_old_dir_dentry?
> 
> We dget/dget_parent the old/new dentry and parents a few lines above
> (actually, it looked like I forgot to dget(lower_new_dentry) -- fixed).

And?  Having a reference to a dentry does not prevent it from being moved
elsewhere by a direct rename(2) in that layer.  It will exist, that
much is guaranteed by grabbing a reference.  However, there are no
guarantees whatsoever that by the time you get i_mutex on what had
once been its parent, it will still remain the parent of our dentry.
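
One way to close that hole, sketched against the quoted fragment (the exact
checks and error value are illustrative, not taken from the patch set):

    /* assumes: <linux/fs.h>, <linux/namei.h> */
    struct dentry *trap;
    int err;

    trap = lock_rename(lower_old_dir_dentry, lower_new_dir_dentry);
    /* lock_rename() returns the ancestor when one parent contains the other */
    err = -EINVAL;
    if (trap == lower_old_dentry || trap == lower_new_dentry)
            goto out_unlock;
    /* did a direct rename(2) reparent either dentry while we slept? */
    if (lower_old_dentry->d_parent != lower_old_dir_dentry ||
        lower_new_dentry->d_parent != lower_new_dir_dentry)
            goto out_unlock;
    err = vfs_rename(lower_old_dir_dentry->d_inode, lower_old_dentry,
                     lower_new_dir_dentry->d_inode, lower_new_dentry);
    out_unlock:
    unlock_rename(lower_old_dir_dentry, lower_new_dir_dentry);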

> BTW, my sense of the relationship b/t upper and lower objects and their
> validity in a stackable f/s is that it's similar to the relationship b/t
> the NFS client and server -- the client can't be sure that a file on the
> server doesn't change b/t ->revalidate and ->op (hence nfs's reliance on dir
> mtime checks).

You are thinking about a non-interesting case.  _Files_ are not much
of a problem.  Directory tree is.  The real problems with all unionfs and
stacking implementations I've seen so far, all the way back to Heidemann et al.,
start when the topology of the underlying layer changes.  If you have clear
semantics for unionfs behaviour in the presence of such things, by all means,
publish it - as far as I know *nobody* has done that; not even on the
"what should we see when..." level, never mind the implementation.
 
> Perhaps this general topic is a good one to discuss at more length at LSF?
> Suggestions are welcome.

It would; I honestly do not know if the problem is solvable with the
(lack of) constraints you apparently want.  Again, the real PITA begins
when you start dealing with pieces of underlying trees getting moved
around, changing parents, etc.  Cross-directory rename() certainly rates
very high on the list of "WTF had they been smoking in UCB?" misfeatures,
but it's there and it has to be dealt with.

BTW, and that's a completely unrelated story, I'd rather see whiteouts
done directly by the filesystems involved - it would simplify life in a big
way.  How about adding a dir->i_op->whiteout(dir, dentry) and seeing if
your variant could be turned into such a method to be used by really
piss-poor filesystems?  All UFS-related ones (including ext*) can trivially
support whiteouts without any PITA; adding them to tmpfs is also not a big
deal and anything that caches inode type in directory entries should be
easy to extend in the same way as ext*/ufs...
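
To make the shape of the proposal concrete (nothing below is an existing
kernel API; it is just the suggested method plus a caller-side wrapper):

    /* proposed addition to struct inode_operations: */
    int (*whiteout) (struct inode *dir, struct dentry *dentry);

    /*
     * Sketch of a caller-side wrapper: unionfs would fall back to its
     * portable .wh.XXX files on filesystems that return -EOPNOTSUPP.
     */
    static int vfs_whiteout(struct inode *dir, struct dentry *dentry)
    {
            if (!dir->i_op || !dir->i_op->whiteout)
                    return -EOPNOTSUPP;
            return dir->i_op->whiteout(dir, dentry);
    }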


Re: [UNIONFS] 00/29 Unionfs and related patches pre-merge review (v2)

2008-01-25 Thread Erez Zadok
In message <[EMAIL PROTECTED]>, Al Viro writes:
> After grep for locking-related things:
> 
>   * lock_parent(): who said that you won't get the dentry moved
> before managing to grab i_mutex on the parent?  While we are at it,
> who said that you won't get the dentry moved between fetching d_parent
> and doing dget()?  In that case the parent could've been _freed_ before
> you get to dget().

OK, so it looks like I should use dget_parent() in my lock_parent(), as I've
done elsewhere.  I'll also take a look at all instances in which I get
dentry->d_parent and see if a d_lock is needed there.

>   * in create_parents():
> +   struct inode *inode = lower_dentry->d_inode;
> +   /*
> +* If we get here, it means that we created a new
> +* dentry+inode, but copying permissions failed.
> +* Therefore, we should delete this inode and dput
> +* the dentry so as not to leave cruft behind.
> +*/
> +   if (lower_dentry->d_op && lower_dentry->d_op->d_iput)
> +   lower_dentry->d_op->d_iput(lower_dentry,
> +  inode);
> +   else
> +   iput(inode);
> +   lower_dentry->d_inode = NULL;
> +   dput(lower_dentry);
> +   lower_dentry = ERR_PTR(err);
> +   goto out;
> Really?  So what happens if it had become positive after your test and
> somebody had looked it up in the lower layer and just now happens to be
> in the middle of operations on it?  Will be thucking frilled by that...

Good catch.  That ->d_iput call was an old fix to a bug that has since been
fixed more cleanly and generically in our copyup_permission routine and our
unionfs_d_iput.  I've removed the above ->d_iput "if" and tested to verify
that it's indeed unnecessary.

>   * __unionfs_rename():
> +   lock_rename(lower_old_dir_dentry, lower_new_dir_dentry);
> +   err = vfs_rename(lower_old_dir_dentry->d_inode, lower_old_dentry,
> +lower_new_dir_dentry->d_inode, lower_new_dentry);
> +   unlock_rename(lower_old_dir_dentry, lower_new_dir_dentry);
> 
> Uh-huh...  To start with, what guarantees that your lower_old_dentry
> is still a child of your lower_old_dir_dentry?

We dget/dget_parent the old/new dentry and parents a few lines above
(actually, it looked like I forgot to dget(lower_new_dentry) -- fixed).
This is a generic stackable f/s issue: ecryptfs does the same stuff before
calling vfs_rename() on the lower objects.

> What's more, you are
> not checking the result of lock_rename(), i.e. asking for serious trouble.

OK.  I'm now checking the return from lock_rename for the ancestor/rename
rules.  I'm CC'ing Mike Halcrow so he can do the same for ecryptfs.

>   * revalidation stuff: err...  how the devil can it work for
> directories, when there's nothing to prevent changes in underlying
> layers between ->d_revalidate() and the operation itself?  For the upper
> layer (unionfs itself) everything's more or less fine, but the rest
> of that...

In a stacked f/s, we keep references to the lower dentries/inodes, so they
can't disappear on us (that happens in our interpose function, called from
our ->lookup).  On entry to every f/s method in unionfs, we first perform a
lightweight revalidation of our dentry against the lower ones: we check if
m/ctime changed (users modifying lower files) or if the generation# b/t our
super and our dentries has changed (branch-management took place); if
needed, we then perform a full revalidation of all lower objects (while
holding a lock on the branch configuration).  If we have to do a full reval
upon entry to our ->op, and the reval fails, then we return an appropriate
error; o/w we proceed.  (In certain cases, the VFS re-issues a lookup if the
f/s says that its dentry is invalid.)
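
A minimal sketch of that lightweight check (the helper names are
hypothetical; timespec_compare() is the 2.6.24-era primitive for comparing
inode times):

    /* assumes: <linux/fs.h>, <linux/time.h> */

    static int unionfs_needs_full_reval(struct dentry *dentry)
    {
            struct inode *upper = dentry->d_inode;
            struct inode *lower = unionfs_lower_inode(upper);  /* hypothetical */

            /* lower object touched behind our back? */
            if (timespec_compare(&lower->i_mtime, &upper->i_mtime) > 0 ||
                timespec_compare(&lower->i_ctime, &upper->i_ctime) > 0)
                    return 1;

            /* branch management bumped the superblock's generation#? */
            return unionfs_dentry_gen(dentry) != unionfs_sb_gen(dentry->d_sb);
    }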

Without changes to the VFS, I don't see how else I can ensure cache
coherency cleanly, while allowing users to modify lower files; this feature
is very useful to some unionfs users, who depend on it (so even if I could
"lock out" the lower directories from being modified, there will be users
who'd still want to be able to modify lower files).

BTW, my sense of the relationship b/t upper and lower objects and their
validity in a stackable f/s is that it's similar to the relationship b/t
the NFS client and server -- the client can't be sure that a file on the
server doesn't change b/t ->revalidate and ->op (hence nfs's reliance on dir
mtime checks).

Perhaps this general topic is a good one to discuss at more length at LSF?
Suggestions are welcome.

Thanks,
Erez.

Re: [UNIONFS] 00/29 Unionfs and related patches pre-merge review (v2)

2008-01-16 Thread Erez Zadok
In message <[EMAIL PROTECTED]>, Al Viro writes:
> After grep for locking-related things:
[...]

Thanks.  I'll start looking at these issues asap.

Erez.


Re: [UNIONFS] 00/29 Unionfs and related patches pre-merge review (v2)

2008-01-16 Thread Al Viro
After grep for locking-related things:

* lock_parent(): who said that you won't get the dentry moved
before managing to grab i_mutex on the parent?  While we are at it,
who said that you won't get the dentry moved between fetching d_parent
and doing dget()?  In that case the parent could've been _freed_ before
you get to dget().

* in create_parents():
+   struct inode *inode = lower_dentry->d_inode;
+   /*
+* If we get here, it means that we created a new
+* dentry+inode, but copying permissions failed.
+* Therefore, we should delete this inode and dput
+* the dentry so as not to leave cruft behind.
+*/
+   if (lower_dentry->d_op && lower_dentry->d_op->d_iput)
+   lower_dentry->d_op->d_iput(lower_dentry,
+  inode);
+   else
+   iput(inode);
+   lower_dentry->d_inode = NULL;
+   dput(lower_dentry);
+   lower_dentry = ERR_PTR(err);
+   goto out;
Really?  So what happens if it had become positive after your test and
somebody had looked it up in the lower layer and just now happens to be
in the middle of operations on it?  Will be thucking frilled by that...

* __unionfs_rename():
+   lock_rename(lower_old_dir_dentry, lower_new_dir_dentry);
+   err = vfs_rename(lower_old_dir_dentry->d_inode, lower_old_dentry,
+lower_new_dir_dentry->d_inode, lower_new_dentry);
+   unlock_rename(lower_old_dir_dentry, lower_new_dir_dentry);

Uh-huh...  To start with, what guarantees that your lower_old_dentry
is still a child of your lower_old_dir_dentry?  What's more, you are
not checking the result of lock_rename(), i.e. asking for serious trouble.

* revalidation stuff: err...  how the devil can it work for
directories, when there's nothing to prevent changes in underlying
layers between ->d_revalidate() and the operation itself?  For the upper
layer (unionfs itself) everything's more or less fine, but the rest
of that...


Re: [UNIONFS] 00/29 Unionfs and related patches pre-merge review (v2)

2008-01-16 Thread Erez Zadok
In message <[EMAIL PROTECTED]>, Michael Halcrow writes:
> On Thu, Jan 10, 2008 at 10:57:46AM -0500, Erez Zadok wrote:
[...]
> Would the inclusion of Unionfs in mainline really slow down or damage
> the union mount effort? If not, then I think the pragmatic approach
> would be to make it available in mainline for all of the users who are
> already successfully running it today. We can then focus future
> efforts on the VFS-level modifications that address the remaining
> issues, limiting Unionfs in the future to only those problems that are
> best solved in a stacked filesystem layer.

Mike, this is indeed the pragmatic approach I've advocated: as the VFS
comes up with more unioning-related functionality, I could easily make use of
it in unionfs, thus shrinking the code base in unionfs (while keeping the
user API unchanged).  In the end, what'll be left over is probably a smaller
standalone file system that offers the kind of features that aren't likely
to show up at the VFS level (e.g., a persistent cache of unified dir
contents, persistent inode numbers, whiteouts that work with any "obscure"
filesystem, and such).

> Mike

Cheers,
Erez.


Re: [UNIONFS] 00/29 Unionfs and related patches pre-merge review (v2)

2008-01-16 Thread Michael Halcrow
On Thu, Jan 10, 2008 at 10:57:46AM -0500, Erez Zadok wrote:
> In message <[EMAIL PROTECTED]>, Christoph Hellwig
> writes:
> > On Thu, Jan 10, 2008 at 09:59:19AM -0500, Erez Zadok wrote:
> > > 
> > > Dear Linus, Al, Christoph, and Andrew,
> > > 
> > > As per your request, I'm posting for review the unionfs code
> > > (and related code) that's in my korg tree against mainline
> > > (v2.6.24-rc7-71-gfd0b45d).  This is in preparation for merge in
> > > 2.6.25.
> > 
> > Huh?  There's still absolutely no fix to the underlying problems
> > of the whole idea.  I think we made it pretty clear that unionfs
> > is not the way to go, and that we'll get the union mount patches
> > clear once the per-mountpoint r/o and unprivileged mount patches
> > series are in and stable.
> 
> I'll reiterate what I've said before: unionfs is used today by many
> users, it works, and is stable.  After years of working with
> unionfs, we've settled on a set of features that users actually use.
> This functionality can be in mainline today.

There are well-known distributions out there that are tacking Unionfs
onto their kernels; for instance, it is popular in the bootable
read-only media scene. When enough vendors are adding the code on
their own, it is generally a good indicator that it should just be
upstream, especially if it can be built as a non-invasive experimental
module with a caveat about its shortcomings in the config menu and
documentation.

> Unioning at the VFS level will take a long time to reach the same level of
> maturity and support the same set of features.  Based on my years of
> practical experience with it, unioning directories seems like a simple idea,
> but in practice it's quite hard no matter the approach taken to implement
> it.
> 
> Existing users of unioning aren't likely to switch to Union Mounts
> unless it supports the same set of features.  How long will it
> realistically take to get whiteout support in every lower file
> system that's used by Unionfs users?

Well, depending on the amount of code that actually needs to get
pushed below the VFS layer itself, I think you would be surprised how
quickly something like this can be done. But I do agree that a
non-invasive stacked module is a reasonable intermediate step in the
meantime, so long as the users understand the potential shortcomings
and are able to substantially benefit from its present inclusion in
mainline.

> How will Union Mounts support persistent inode numbers at the VFS
> level?  Those are just a few of the questions.
> 
> I think a better approach would be to start with Unionfs (a
> standalone file system that doesn't touch the rest of the kernel).
> And as Linux gradually starts supporting more and more features that
> help unioning/stacking in general, to change Unionfs to use those
> features (e.g., native whiteout support).  Eventually there could be
> basic unioning support at the VFS level, and concurrently a
> file-system which offers the extra features (e.g., persistency).
> This can be done w/o affecting user-visible APIs.

Would the inclusion of Unionfs in mainline really slow down or damage
the union mount effort? If not, then I think the pragmatic approach
would be to make it available in mainline for all of the users who are
already successfully running it today. We can then focus future
efforts on the VFS-level modifications that address the remaining
issues, limiting Unionfs in the future to only those problems that are
best solved in a stacked filesystem layer.

Mike


Re: [UNIONFS] 00/29 Unionfs and related patches pre-merge review (v2)

2008-01-14 Thread Jesse Hathaway
Erez Zadok  cs.sunysb.edu> writes:

> > Huh?  There's still absolutely no fix to the underlying problems of
> > the whole idea.   I think we made it pretty clear that unionfs is not
> > the way to go, and that we'll get the union mount patches clear once
> > the per-mountpoint r/o and unprivileged mount patches series are in
> > and stable.
> 
> I'll reiterate what I've said before: unionfs is used today by many users,
> it works, and is stable.  After years of working with unionfs, we've settled
> on a set of features that users actually use.  This functionality can be in
> mainline today.

I think it would be of great benefit to many Linux users to have Unionfs
in mainline.  It is a highly valuable and reliable filesystem which
allows one to develop interesting live distributions, read-only root
images, and other types of layered systems.  Unionfs works well now, and
given that it is a standalone filesystem I think it would really be a
waste to exclude it from mainline only to insist that one should wait
for union mounts, which may never reach feature parity with Unionfs,
especially given that the Linux kernel has a long-running history of
filesystems with overlapping abilities.

 - Jesse Hathaway

> Unioning at the VFS level will take a long time to reach the same level of
> maturity and support the same set of features.  Based on my years of
> practical experience with it, unioning directories seems like a simple idea,
> but in practice it's quite hard no matter the approach taken to implement
> it.
> 
> Existing users of unioning aren't likely to switch to Union Mounts unless it
> supports the same set of features.  How long will it realistically take to
> get whiteout support in every lower file system that's used by Unionfs
> users?  How will Union Mounts support persistent inode numbers at the VFS
> level?  Those are just a few of the questions.
> 
> I think a better approach would be to start with Unionfs (a standalone file
> system that doesn't touch the rest of the kernel).  And as Linux gradually
> starts supporting more and more features that help unioning/stacking in
> general, to change Unionfs to use those features (e.g., native whiteout
> support).  Eventually there could be basic unioning support at the VFS
> level, and concurrently a file-system which offers the extra features (e.g.,
> persistency).  This can be done w/o affecting user-visible APIs.
> 
> Cheers,
> Erez.



Re: [UNIONFS] 00/29 Unionfs and related patches pre-merge review (v2)

2008-01-10 Thread Erez Zadok
In message <[EMAIL PROTECTED]>, Christoph Hellwig writes:
> On Thu, Jan 10, 2008 at 09:59:19AM -0500, Erez Zadok wrote:
> > 
> > Dear Linus, Al, Christoph, and Andrew,
> > 
> > As per your request, I'm posting for review the unionfs code (and related
> > code) that's in my korg tree against mainline (v2.6.24-rc7-71-gfd0b45d).
> > This is in preparation for merge in 2.6.25.
> 
> Huh?  There's still absolutely no fix to the underlying problems of
> the whole idea.   I think we made it pretty clear that unionfs is not
> the way to go, and that we'll get the union mount patches clear once
> the per-mountpoint r/o and unprivileged mount patches series are in
> and stable.

I'll reiterate what I've said before: unionfs is used today by many users,
it works, and is stable.  After years of working with unionfs, we've settled
on a set of features that users actually use.  This functionality can be in
mainline today.

Unioning at the VFS level will take a long time to reach the same level of
maturity and support the same set of features.  Based on my years of
practical experience with it, unioning directories seems like a simple idea,
but in practice it's quite hard no matter the approach taken to implement
it.

Existing users of unioning aren't likely to switch to Union Mounts unless it
supports the same set of features.  How long will it realistically take to
get whiteout support in every lower file system that's used by Unionfs
users?  How will Union Mounts support persistent inode numbers at the VFS
level?  Those are just a few of the questions.

I think a better approach would be to start with Unionfs (a standalone file
system that doesn't touch the rest of the kernel).  And as Linux gradually
starts supporting more and more features that help unioning/stacking in
general, to change Unionfs to use those features (e.g., native whiteout
support).  Eventually there could be basic unioning support at the VFS
level, and concurrently a file-system which offers the extra features (e.g.,
persistency).  This can be done w/o affecting user-visible APIs.

Cheers,
Erez.


Re: [UNIONFS] 00/29 Unionfs and related patches pre-merge review (v2)

2008-01-10 Thread Christoph Hellwig
On Thu, Jan 10, 2008 at 09:59:19AM -0500, Erez Zadok wrote:
> 
> Dear Linus, Al, Christoph, and Andrew,
> 
> As per your request, I'm posting for review the unionfs code (and related
> code) that's in my korg tree against mainline (v2.6.24-rc7-71-gfd0b45d).
> This is in preparation for merge in 2.6.25.

Huh?  There's still absolutely no fix to the underlying problems of
the whole idea.   I think we made it pretty clear that unionfs is not
the way to go, and that we'll get the union mount patches clear once
the per-mountpoint r/o and unprivileged mount patches series are in
and stable.




[UNIONFS] 00/29 Unionfs and related patches pre-merge review (v2)

2008-01-10 Thread Erez Zadok

Dear Linus, Al, Christoph, and Andrew,

As per your request, I'm posting for review the unionfs code (and related
code) that's in my korg tree against mainline (v2.6.24-rc7-71-gfd0b45d).
This is in preparation for merge in 2.6.25.  The code here is nearly
identical to what's in -mm (the -mm code has a couple of additional things
that depend on mm-specific patches that aren't in mainline yet).

I've addressed *every* public/private comment I received on the first set of
patches I posted.  A summary of the changes from the first set of patches (v1),
based on review feedback:

- several races/deadlocks fixed (thanks to lockdep)
- dropped the drop_pagecache_sb patch
- merged several logical patches together
- ensure that git-bisect works
- enhance user documentation
- assorted code cleanups (esp. removal of redundant/unnecessary code)

The rest of this message is nearly identical to the introductory text I used
in the first set of patchsets I posted for review, and is included here for
convenience.

BTW, I will be at LSF'08 to discuss/present stackable file system
VFS-support issues (together with Mike Halcrow, if he can make the trip).
So I hope to see some of you there for these important discussions.

Cheers,
Erez.

--

I really tried to keep this message short by offering pointers to more
info, but there's still a bunch of info here.

Andrew, you've asked me to list the main issues that came about in
discussions regarding unionfs, and how they were addressed.  So I've
reviewed my notes from OLS'06, LSF'07, and OLS'07, as well as assorted
postings in mailing lists, and I came up with this prioritized list (in
descending priority order):

1. cache coherency
2. nameidata handling
3. namespace pollution
4. use of ioctls for branch management

(1) Cache coherency: by far, the biggest concern has been around cache
coherency: what happens if someone modifies a lower object
(file/dir/etc.).  I met with Mike Halcrow in October 2007 and we
discussed stacking in general; Mike also emphasized that cache-coherency
was one of his most pressing concerns in ecryptfs.

At OLS'06, several suggestions were made, including fancy tricks to hide the
lower namespace or "lock" it so users have readonly access.  None of these
solutions would have been able to easily handle the problem of an existing
open file descriptor on a lower file, and they might have required
significant VFS changes.  Moreover, unionfs users actually want to modify
lower branches directly, and then be able to see their changes reflected in
the union immediately.  So we explored a number of ideas.  We feel that the
VFS is complex enough as it is, so we tried our best to handle
cache-coherency inside unionfs.  The solution we have implemented is to
compare the mtime/ctime of
upper/lower objects during revalidation (esp. of dentries); and if the lower
times are newer, we reconstruct the union object (drop the older objects,
and re-lookup them).  This time-based cache-coherency works well and is
similar to the NFS model.  Because Unionfs users tend to have a burst of
activity on lower branches, our current cache-coherency also defers the
revalidation actions until absolutely needed, so this idea tends to also be
more efficient for the common usage patterns.  More details about how we
handle cache-coherency are available in our
Documentation/filesystems/unionfs/concepts.txt file.

That said, we're now developing some VFS patches that would allow lower file
systems to more directly inform the upper objects about such (mtime)
changes.  We're exploring a couple of different options but our key goals
are to (a) minimize VFS changes and (b) avoid any changes to lower file
systems.

(2) nameidata handling.  Another important question raised (esp. by NFS
people) was how we handle struct nameidata.  The VFS passes nameidata
structs to file systems, and some file systems use that.  We used to
either pass NULL or the upper nd to the lower f/s.  That caused NULL
de-refs inside nfsv4, among other problems.  We now create our own
nameidata structure, fill it up as needed (esp. for intent data), and
pass it down.  We do this every time we call any VFS function that takes
a nameidata (e.g., vfs_create).  This seems to work well.
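
For illustration, a sketch of that approach with 2.6.24-era structures (the
helper name and the exact intent bits are made up; the real code fills in
whatever the lower f/s needs):

    /* assumes: <linux/fs.h>, <linux/namei.h>, <linux/string.h>, <linux/fcntl.h> */

    static int unionfs_create_lower(struct dentry *lower_dir_dentry,
                                    struct dentry *lower_dentry, int mode)
    {
            struct nameidata lower_nd;

            /* private nd: only the intent data the lower f/s will look at */
            memset(&lower_nd, 0, sizeof(lower_nd));
            lower_nd.flags = LOOKUP_CREATE;
            lower_nd.intent.open.flags = O_CREAT | O_WRONLY;
            lower_nd.intent.open.create_mode = mode;

            return vfs_create(lower_dir_dentry->d_inode, lower_dentry,
                              mode, &lower_nd);
    }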

There's been some discussion on lkml about splitting struct nameidata in
two, one of which would handle just the intent information.  I'd like to see
that happen, maybe even help, because right now we pass a whole large-ish
struct nameidata for just a couple of intent bits of information that the
lower f/s needs.

(3) namespace pollution.  Unioning readonly and readwrite directories
requires the ability to mask, or white-out, files that are being deleted
from a readonly directory.  Unionfs does this in a portable way, by
creating .wh.XXX files to indicate that file XXX has been whited-out.
This works well on many file systems, but