Re: [RFC 01/11] add generic versions of debugfs file operations

2008-02-23 Thread Al Viro
On Tue, Feb 19, 2008 at 05:04:36AM +0100, Arnd Bergmann wrote:
 + char buf[3];
 + u32 *val = file-private_data;
 +
 + if (*val)
 + buf[0] = 'Y';
 + else
 + buf[0] = 'N';
 + buf[1] = '\n';
 + buf[2] = 0x00;
 + return simple_read_from_buffer(user_buf, count, ppos, buf, 2);

Ewww - caps, \n...  BTW, \0 is pointless here - simple_read_from_buffer() will
not access it with these arguments)...

 +{
 + char buf[32];
 + int buf_size;
 + u32 *val = file-private_data;
 +
 + buf_size = min(count, (sizeof(buf)-1));
 + if (copy_from_user(buf, user_buf, buf_size))
 + return -EFAULT;
 +
 + switch (buf[0]) {
 + case 'y':
 + case 'Y':
 + case '1':
 + *val = 1;
 + break;
 + case 'n':
 + case 'N':
 + case '0':
 + *val = 0;
 + break;

Please, check the length; sloppy input grammar is a bad idea.  Hell, at the
very least you want -EINVAL if input is not recognized...
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC 02/11] introduce simple_fs_type

2008-02-23 Thread Al Viro
On Tue, Feb 19, 2008 at 05:04:37AM +0100, Arnd Bergmann wrote:
 There is a number of pseudo file systems in the kernel
 that are basically copies of debugfs, all implementing the
 same boilerplate code, just with different bugs.
 
 This adds yet another copy to the kernel in the libfs directory,
 with generalized helpers that can be used by any of them.
 
 The most interesting function here is the new struct dentry *
 simple_register_filesystem(struct simple_fs_type *type), which
 returns the root directory of a new file system that can then
 be passed to simple_create_file() and similar functions as a
 parent.

Don't mix split the file with add new stuff, please.  And frankly,
I'm not convinced that embedding file_system_type into another struct
is a good idea.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC 01/11] add generic versions of debugfs file operations

2008-02-23 Thread Al Viro
On Tue, Feb 19, 2008 at 05:04:36AM +0100, Arnd Bergmann wrote:
 The file operations in debugfs are rather generic and can
 be used by other file systems, so it can be interesting to
 include them in libfs, with more generic names, and exported
 to modules.
 
 This patch adds a new copy of these operations to libfs,
 so that the debugfs version can later be cut down.

BTW, where does CONFIG_LIBFS come from?
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC 03/11] slim down debugfs

2008-02-23 Thread Al Viro
On Tue, Feb 19, 2008 at 05:04:38AM +0100, Arnd Bergmann wrote:
 With most of debugfs now copied to generic code in libfs,
 we can remove the original copy and replace it with thin
 wrappers around libfs.
 
 Signed-off-by: Arnd Bergmann [EMAIL PROTECTED]
 Index: linux-2.6/fs/Kconfig
 ===
 --- linux-2.6.orig/fs/Kconfig
 +++ linux-2.6/fs/Kconfig
 @@ -1001,6 +1001,14 @@ config CONFIGFS_FS
 Both sysfs and configfs can and should exist together on the
 same system. One is not a replacement for the other.
  
 +config LIBFS
 + tristate
 + default m
 + help
 +   libfs is a helper library used by many of the simpler file
 +   systems. Parts of libfs can be modular when all of its users
 +   are modules as well, and the users should select this symbol.

NAK.  For one thing, you need dependencies or selects and none of the
patches later in the series introduces them.  For another, neither
dependencies nor selects work well if you have non-modular users of
that sucker.  Seriously, try to add those and do allmodconfig.  Then
look what a mess it produces.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 00/10] mount ownership and unprivileged mount syscall (v8)

2008-02-23 Thread Al Viro
On Mon, Feb 18, 2008 at 12:47:59PM +0100, Miklos Szeredi wrote:
 So what should I do?
 
 Would Al be wanting to merge this into his VFS tree?  (Can't find it
 on git.kernel.org yet, BTW.)

FWIW, it's on hera right now, should propagate to git.kernel.org in a few.

Branches I'd pushed there: vfs-fixes.b0 and ro-bind.b0.  The latter is
on top of the former.  There will be more, but that at least takes care
of the most urgent stuff.  Again, apologies for things being too damn
slow ;-/

As for the unprivileged mounts...
a) why do we lose them on clone() in new namespace?  Bloody
inconvenient, to put it mildly.
b) why do we prohibit all kinds of remount?
c) just what is limited by that sysctl?  AFAICS, rbind is allowed
if mountpoint is on user vfsmount and it seems to create vfsmounts without
eating into that limit just fine...  What's the point of limiting the
amount of vfsmounts marked user when you do not limit the number of vfsmount
one can allocate?
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 00/10] mount ownership and unprivileged mount syscall (v8)

2008-02-23 Thread Al Viro
On Sat, Feb 23, 2008 at 06:33:13PM +0100, Miklos Szeredi wrote:
  c) just what is limited by that sysctl?  AFAICS, rbind is allowed
  if mountpoint is on user vfsmount and it seems to create vfsmounts without
  eating into that limit just fine...  What's the point of limiting the
  amount of vfsmounts marked user when you do not limit the number of vfsmount
  one can allocate?
 
 The limit is there, so that unprivileged users cannot create insane
 number of mounts.  It's just a safety thing, analogous to
 /proc/sys/fs/file-max.

Can't they?  Looks like one can create any number of vfsmounts without
getting more than one marked MNT_USER...

What are you trying to limit - vfsmounts or superblocks?  The former is
not limited in your patchset at all, AFAICS - you can
do while true; do
mount --bind /bin ~/my_directory;
mount --bind /sbin ~/my_directory;
done
indefinitely and all the bleeding stack of vfsmounts will be !MNT_USER.
Or any number of similar schemes, really, without overmounting if you
wish to avoid it.

If you are trying to limit the number of superblocks (i.e. active instances
of filesystems), then I'd say that vfsmounts make piss-poor proxies for
those and it would be better to count the objects you really want to count...
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: git tree with VFS stuff

2008-02-20 Thread Al Viro
On Thu, Feb 21, 2008 at 01:13:48AM +1100, Stephen Rothwell wrote:
 Hi Miklos,
 
 On Tue, 19 Feb 2008 14:32:28 +0100 Miklos Szeredi [EMAIL PROTECTED] wrote:
 
  I've created a git tree with the following mounts related stuff:
  
- read-only bind mounts
- /proc/pid/mountinfo
- unprivileged mounts
  
  git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfsstuff.git master
  
  I guess, giving these a spin in linux-next wouldn't hurt?
 
 I don't think this is what we want to use linux-next for.  Linux-next is
 really a place for stuff that will pretty clearly go into the next kernel
 release i.e. 2.6.26 right now.  If you want to experiment on things for
 beyond that timeframe, then a snapshot of linux-next may be a good base.
 I will take them when they reach the appropriate subsystem tree and are
 ready for integration.

FWIW, I must apologize for delay with getting the damn tree open on
kernel.org ;-/   The last couple of weeks had been Not Fun(tm) in a
lot of respects.

Hopefully I'll finish putting the damn thing into publishable shape by
tomorrow.  As for the stuff mentioned above...  ro-bind series - definitely
yes, mountinfo - IMO needs a sane discussion of what and how should be shown
wrt propagation state, unprivileged mounts - in the need to finish reviewing
pile.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to show propagation state for mounts

2008-02-20 Thread Al Viro
On Wed, Feb 20, 2008 at 04:39:15PM +0100, Miklos Szeredi wrote:
  mountinfo - IMO needs a sane discussion of what and how should be shown
  wrt propagation state
 
 Here's my take on the matter.
 
 The propagation tree can be either be represented
 
  1) from root to leaf listing members of peer groups and their
  slaves explicitly,
 
  2) or from leaf to root by identifying each peer group and then for
  each mount showing the id of its own group and the id of the group's
  master.
 
 2) can have two variants:
 
  2a) id of peer group is constant in time
 
  2b) id of peer group may change
 
 The current patch does 2b).  Having a fixed id for each peer group
 would mean introducing a new object to anchor the peer group into,
 which would add complexity to the whole thing.
 
 All of these are implementable, just need to decide which one we want.

Eh...  Much more interesting question: since the propagation tree spans
multiple namespaces in a lot of normal uses, how do we deal with
reconstructing propagation through the parts that are not present in
our namespace?  Moreover, what should and what should not be kept private
to namespace?  Full exposure of mount trees is definitely over the top
(it shows potentially sensitive information), so we probably want less
than that.

FWIW, my gut feeling is that for each peer group that intersects with our
namespace we ought to expose in some form
* all vfsmounts belonging to that intesection
* the nearest dominating peer group (== master (of master ...) of)
that also has a non-empty intersection with our namespace

It's less about the form of representation (after all, we generate poll
events when contents of that sucker changes, so one *can* get a consistent
snapshot of the entire thing) and more about having it self-contained
when we have namespaces in the play.

IOW, the data in there should give answers to questions that make sense.
Do events get propagated from this vfsmount I have to that vfsmount I have?
is a meaningful one; ditto for are events here propagated to somewhere I
don't see? or are events getting propagated here from somewhere I don't
see?.

Dumping pieces of raw graph, with IDs of nodes we can't see and without
any way to connect those pieces, OTOH, doesn't make much sense.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to show propagation state for mounts

2008-02-20 Thread Al Viro
On Wed, Feb 20, 2008 at 11:29:13AM -0800, Ram Pai wrote:

 I wonder, what is wrong in reporting mounts in other namespaces that
 either receive and send propagation to mounts in our namespace?

A plenty.  E.g. if foo trusts control over /var/blah to bar, it's not
obvious that foo has any business knowing if bar gets it from somebody
else in turn.  And I'm not sure that bar has any business knowing that
foo has the damn thing attached in five places instead of just one,
let alone _where_ it has been attached.

If you get down to it, the thing is about delegating control over part
of namespace to somebody, without letting them control, see, etc. the
rest of it.  So I'd rather be very conservative about extra information
we allow to piggyback on that.  I don't know... perhaps with stable peer
group IDs it would be OK to show peer group ID by (our) vfsmount + peer
group ID of master + peer group ID of nearest dominating group that has
intersection with our namespace.  Then we don't leak information (AFAICS),
get full propagation information between our vfsmounts and cooperating
tasks in different namespaces can figure the things out as much as possible
without leaking 3rd-party information to either.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Add vfsmount to vfs helper functions.

2008-02-17 Thread Al Viro
On Sun, Feb 17, 2008 at 06:00:30PM +0900, Tetsuo Handa wrote:
 Hello.
 
 This is (c) Add new hooks. approach I proposed at
 http://www.mail-archive.com/linux-fsdevel@vger.kernel.org/msg11536.html .
 
 Although this is an incomplete patch,
 I want to know whether you can tolerate this approach or not.
 
 If you cannot tolerate this approach, may be we need to consider
 implementing (d) Calculate pathname while doing name resolution approach.
 

No printable comments, except for that:

(e) why don't you guys move the Linus' Serious Mistake to _callers_ of
vfs_mknod() and its ilk?

Which obviously solves all problems with having vfsmount.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [UNIONFS] 00/29 Unionfs and related patches pre-merge review (v2)

2008-02-02 Thread Al Viro
On Sat, Feb 02, 2008 at 01:45:15PM -0500, Erez Zadok wrote:
  You are thinking about non-interesting case.  _Files_ are not much
  of a problem.  Directory tree is.  The real problems with all unionfs and
  stacking implementations I've seen so far, all way back to Heidemann et.al.
  start when topology of the underlying layer changes.
 
 OK, so if I understand you, your concerns center around the fact that lower
 directories can be moved around (i.e., topology changes), what happens then
 for operations that go through the stackable f/s, and what should users
 expect to see.

Correct.
 
  If you have clear
  semantics for unionfs behaviour in presence of such things, by all means,
  publish it - as far as I know *nobody* had done that; not even on the
  what should we see when... level, nevermind the implementation.
 
 Since stacking and NFS have some similarities, I first checked w/ the NFS
 people to see what are their semantics in a similar scenario: an NFS client
 could be validating a directory, then issue, say, a -create; but in
 between, the server could have moved the directory that was validated.  In
 NFS, the -create operation succeeds, and creates the file in the new
 location of the directory which was validated.

 Unionfs's behavior is similar: the newly created file will be successfully
 created in the moved directory.  The only exception is that if a lower
 branch is marked readonly by unionfs, a copyup will take place.

Er...  For unionfs it's a much more monumental problem.  Look - you have
a mapping between the directory trees of layers; everything else is
defined in terms of that mapping (this files shadows that file, etc.).

If you allow a mix of old and new mappings, you can easily run into the
situations when at some moment X1 covers Y1, X2 covers Y2, X2 is a descendent
of X1 and Y1 is a descendent of Y2.  You *really* don't want to go there -
if nothing else, defining behaviour of copyup in face of that insanity
will be very painful.

What's more, and what makes NFS a bad model, NFS client doesn't lock
directories of NFS server.  unionfs *does* locking of directories in
underlying layers, which means that you have an entirely new set of
constraints - you can deadlock easily.  As the matter of fact, your
current code suffers from that problem - it violates the locking rules
in operations on covered layers.
 
 This had not been a problem for unionfs users to date.  The main reason is
 that when unionfs users modify lower files, they often do so while there's
 little to no activity going through the union itself.  And while it doesn't
 prevent directories from being moved around, this common usage mode does
 reduce the frequency in which topology changes can be an issue for unionfs
 users.

Ugh...  Currently unionfs users tend to use it ways that do not trigger
fs corruption is nice, but doesn't really address the current unionfs
implementation contains races leadint to fs corruption...

  around, changing parents, etc.  Cross-directory rename() certainly rates
  very high on the list of WTF had they been smoking in UCB? misfeatures,
  but it's there and it has to be dealt with.
 
 Well, it was UCB, UCLA, and Sun.  I don't think back in the early 90s they
 were too concerned about topology changes; f/s stacking was a new idea and
 they wanted to explore what can be done with it conceptually, not produce
 commercial-grade software (still, remind me to tell you the first-hand story
 I've learned about of how full-blown stacking _almost_ made it into Solaris
 2.0 :-)
 
 The only known reference to try and address this coherency problem was
 Heidemann's SOSP'95 paper titled Performance of cache coherence in
 stackable filing.  The paper advocated changing the whole VFS and the
 caches (page cache + dnlc) to create a unified cache manager that was
 aware of complex layering topologies (including fan-out and fan-in).  It was
 even able to handle compression layers, where file data offsets change b/t
 the layers (a nasty problem).  Code for this unified cache manager was never
 released AFAIK.  I think Heidemann's approach was elegant, but I don't think
 it was practical as it required radical VFS/VM surgery.  Ironically, MS
 Windows has a single I/O cache manager that all storage and filesystem
 modules talk to directly (they're not allowed to pass IRPs directly b/t
 layers): so Windows can handle such coherency better than most Unix systems
 can today.
 
Different problem.  IIRC, that paper implicitly assumed that mapping between
vnodes in different layers would _not_ be subject to massive surgeries from
operations involved.

 However, for this view idea to work, I'll need a way to lock-out or hide
 the lower directories from the namespace, so no one can access them in r-w
 mode any longer (if at all).  Moreover, even if such a method was available,
 one would have to decide what to do with any open files on the lower branch
 (killing such processes or closing the descriptors seems 

Re: [UNIONFS] 00/29 Unionfs and related patches pre-merge review (v2)

2008-01-26 Thread Al Viro
On Sat, Jan 26, 2008 at 12:08:30AM -0500, Erez Zadok wrote:

  * lock_parent(): who said that you won't get dentry moved
  before managing to grab i_mutex on parent?  While we are at it,
  who said that you won't get dentry moved between fetching d_parent
  and doing dget()?  In that case parent could've been _freed_ before
  you get to dget().
 
 OK, so looks like I should use dget_parent() in my lock_parent(), as I've
 done elsewhere.  I'll also take a look at all instances in which I get
 dentry-d_parent and see if a d_lock is needed there.
 
dget_parent() doesn't deal with the problem of rename() done directly
in that layer while you'd been waiting for i_mutex.

  +   lock_rename(lower_old_dir_dentry, lower_new_dir_dentry);
  +   err = vfs_rename(lower_old_dir_dentry-d_inode, lower_old_dentry,
  +lower_new_dir_dentry-d_inode, lower_new_dentry);
  +   unlock_rename(lower_old_dir_dentry, lower_new_dir_dentry);
  
  Uh-huh...  To start with, what guarantees that your lower_old_dentry
  is still a child of your lower_old_dir_dentry?
 
 We dget/dget_parent the old/new dentry and parents a few lines above
 (actually, it looked like I forgot to dget(lower_new_dentry) -- fixed).

And?  Having a reference to dentry does not prevent it being moved
elsewhere by direct rename(2) in that layer.  It will exist, that
much is guaranteed by grabbing a reference.  However, there is no
warranties whatsoever that by the time you get i_mutex on what had
once been its parent, it will still remain the parent of our dentry.

 BTW, my sense of the relationship b/t upper and lower objects and their
 validity in a stackable f/s, is that it's similar to the relationship b/t
 the NFS client and server -- the client can't be sure that a file on the
 server doesn't change b/t -revalidate and -op (hence nfs's reliance on dir
 mtime checks).

You are thinking about non-interesting case.  _Files_ are not much
of a problem.  Directory tree is.  The real problems with all unionfs and
stacking implementations I've seen so far, all way back to Heidemann et.al.
start when topology of the underlying layer changes.  If you have clear
semantics for unionfs behaviour in presence of such things, by all means,
publish it - as far as I know *nobody* had done that; not even on the
what should we see when... level, nevermind the implementation.
 
 Perhaps this general topic is a good one to discuss at more length at LSF?
 Suggestions are welcome.

It would; I honestly do not know if the problem is solvable with the
(lack of) constraints you apparently want.  Again, the real PITA begins
when you start dealing with pieces of underlying trees getting moved
around, changing parents, etc.  Cross-directory rename() certainly rates
very high on the list of WTF had they been smoking in UCB? misfeatures,
but it's there and it has to be dealt with.

BTW, and that's a completely unrelated story, I'd rather see whiteouts
done directly by filesystems involved - it would simplify the life big
way.  How about adding a dir-i_op-whiteout(dir, dentry) and seeing if
your variant could be turned into such a method to be used by really
piss-poor filesystems?  All UFS-related ones (including ext*) can trivially
support whiteouts without any PITA; adding them to tmpfs is also not a big
deal and anything that caches inode type in directory entries should be
easy to extend in the same way as ext*/ufs...
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] VFS: extend /proc/mounts

2008-01-16 Thread Al Viro
On Wed, Jan 16, 2008 at 11:12:31PM +0100, Miklos Szeredi wrote:
 The alternative (and completely safe) solution is to add another file
 to proc.  Me no likey.

Since we need saner layout, I would strongly suggest exactly that.

 major:minor -- is the major minor number of the device hosting the filesystem

Bad description.  Value of st_dev for files on that filesystem, please -
there might be no such thing as the device hosting the filesystem _and_
the value here may bloody well be unrelated to device actually holding
all data (for things like ext2meta, etc.).

 1) The mount is a shared mount.
 2) Its peer mount of mount with id 20
 3) It is also a slave mount of the master-mount with the id  19
 4) The filesystem on device with major/minor number 98:0 and subdirectory
   mnt/1/abc makes the root directory of this mount.
 5) And finally the mount with id 16 is its parent.

I'd suggest doing a new file that would *not* try to imitate /etc/mtab.
Another thing is, how much of propagation information do we want to
be exposed and what do we intend to do with it?  Note that entire
propagation tree is out of question - it spans many namespaces and
contains potentially sensitive information.  So we won't see all nodes.

What do we want to *do* with the information about propagation?
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] VFS: extend /proc/mounts

2008-01-16 Thread Al Viro
On Wed, Jan 16, 2008 at 04:09:30PM -0800, Andrew Morton wrote:
 On Thu, 17 Jan 2008 00:58:06 +0100 (CET) Jan Engelhardt [EMAIL PROTECTED] 
 wrote:
 
  
  On Jan 17 2008 00:43, Karel Zak wrote:
   
   Seems like a plain bad idea to me.  There will be any number of home-made
   /proc/mounts parsers and we don't know what they do.
  
   So, let's use /proc/mounts_v2  ;-)
  
  Was not it like don't use /proc for new things?
 
 Well yeah.  If we're going to do a brand new mechanism to expose
 per-mount data then we should hunker down and get it right.

Which automatically means no sysfs.  We are NOT converting vfsmounts
to kobject-based lifetime rules.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [UNIONFS] 00/29 Unionfs and related patches pre-merge review (v2)

2008-01-16 Thread Al Viro
After grep for locking-related things:

* lock_parent(): who said that you won't get dentry moved
before managing to grab i_mutex on parent?  While we are at it,
who said that you won't get dentry moved between fetching d_parent
and doing dget()?  In that case parent could've been _freed_ before
you get to dget().

* in create_parents():
+   struct inode *inode = lower_dentry-d_inode;
+   /*
+* If we get here, it means that we created a new
+* dentry+inode, but copying permissions failed.
+* Therefore, we should delete this inode and dput
+* the dentry so as not to leave cruft behind.
+*/
+   if (lower_dentry-d_op  lower_dentry-d_op-d_iput)
+   lower_dentry-d_op-d_iput(lower_dentry,
+  inode);
+   else
+   iput(inode);
+   lower_dentry-d_inode = NULL;
+   dput(lower_dentry);
+   lower_dentry = ERR_PTR(err);
+   goto out;
Really?  So what happens if it had become positive after your test and
somebody had looked it up in lower layer and just now happens to be
in the middle of operations on it?  Will be thucking frilled by that...

* __unionfs_rename():
+   lock_rename(lower_old_dir_dentry, lower_new_dir_dentry);
+   err = vfs_rename(lower_old_dir_dentry-d_inode, lower_old_dentry,
+lower_new_dir_dentry-d_inode, lower_new_dentry);
+   unlock_rename(lower_old_dir_dentry, lower_new_dir_dentry);

Uh-huh...  To start with, what guarantees that your lower_old_dentry
is still a child of your lower_old_dir_dentry?  What's more, you are
not checking the result of lock_rename(), i.e. asking for serious trouble.

* revalidation stuff: err...  how the devil can it work for
directories, when there's nothing to prevent changes in underlying
layers between -d_revalidate() and operation itself?  For the upper
layer (unionfs itself) everything's more or less fine, but the rest
of that...
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 1/2] [RFC] Simple tamper-proof device filesystem.

2007-12-16 Thread Al Viro
On Sun, Dec 16, 2007 at 05:52:08PM +0100, Indan Zupancic wrote:

 What prevents them from mounting tmpfs on top of /dev, bypassing your fs?

Or binding /dev/null over nodes they want to get rid of...

 Also, if they have root there are plenty of ways to prevent an administrator
 from logging in, e.g. using iptables or changing the password.

Indeed.

BTW, tmpfs with root marked append-only and populated in normal ways on boot
would get a comparable effect without spending so much efforts.  Still won't
really help if attacker gains root, but then neither will your variant.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [2.6.24-rc3-mm1] loop cleanup in fs/namespace.c - repost

2007-11-26 Thread Al Viro
On Wed, Nov 21, 2007 at 03:24:33PM -0800, Andrew Morton wrote:
 -repeat:
 - if (atomic_dec_and_lock(mnt-mnt_count, vfsmount_lock)) {
 + while (atomic_dec_and_lock(mnt-mnt_count, vfsmount_lock)) {
   if (likely(!mnt-mnt_pinned)) {
   spin_unlock(vfsmount_lock);
   __mntput(mnt);
 - return;
 + break;
   }
   atomic_add(mnt-mnt_pinned + 1, mnt-mnt_count);
   mnt-mnt_pinned = 0;
   spin_unlock(vfsmount_lock);
   acct_auto_close_mnt(mnt);
   security_sb_umount_close(mnt);
 - goto repeat;
   }
  }
  
 This patch has no changelog which I can use.

BTW, I have a problem with that one.  The change is misleading; FWIW,
I'm very tempted to turn that into tail recursion, with

if (!atomic_dec_and_lock(mnt-mnt_count, vfsmount_lock))
return;

in the very beginning (i.e. invert the test and pull the body to top level).
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/5] Shadow directories

2007-10-18 Thread Al Viro
On Fri, Oct 19, 2007 at 06:07:45AM +0930, David Newall wrote:
 considerations of this whole scheme. Linux, like most Unix systems, 
 has never allowed hard links to directories for a number of reasons;
 
 The claim is wrong.  UNIX systems have traditionally allowed the 
 superuser to create hard links to directories.  See link(2) for 2.10BSD 
 http://www.freebsd.org/cgi/man.cgi?query=linksektion=2manpath=2.10+BSD. 
 Having got that wrong throws doubt on the argument; perhaps a path can 
 simultaneously be a file and a directory.

Learn to read.  Linux has never allowed that.  Most of the Unix systems
do not allow that.  Original _did_ allow that, but at the cost of very
easily triggered fs corruption (and it didn't have things like rename(2) -
it _did_ have userland implementation, of course, in suid-root mv(1),
but that sucker had been extremely racy and could be easily used to
screw filesystem to hell and back; adding rename(2) to the set of primitives
combined with multiple links to directories leads to very nasty issues on
_any_ system).
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/5] Shadow directories

2007-10-18 Thread Al Viro
On Fri, Oct 19, 2007 at 12:27:16PM +0930, David Newall wrote:

 Learn to read.  Linux has never allowed that.  Most of the Unix systems
 do not allow that.
 
 I did read the claim and it is ambiguous, in that it can reasonably be 
 read to mean that most UNIX systems never allowed such links, which is 
 wrong.  All UNIX systems allowed it until relatively recently.

FVOrelatively recently exceeding a decade and half.  In any case,
it's _trivial_ to get fs corruption on any system with such links -
play with rename() races a bit and you'll get it.  And yes, it does
include 4.4BSD and quite a chunk of even later history.

Anyway, you are quite welcome to propose a sane locking scheme capable
of dealing with that mess.

As for the posted patch, AFAICS it's FUBAR in handling of .. in such
directories.  Moreover, how are you going to keep that shadow tree
in sync with the main one if somebody starts doing renames in the
latter?  Or mount --move, or...
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC] block_device_operations prototype changes

2007-08-27 Thread Al Viro
It's time to sanitize prototypes of bdev -open(), -release()
and -ioctl().  This stuff had sat in need to fix for a long time
and there is a bunch of bugs hard to fix without dealing with it.

1) -open() gets inode * and file *.  Almost all instances use only
inode-i_bdev
file-f_mode
file-f_flags - O_ACCMODE|O_NDELAY|O_EXCL
and nothing else.  Most actually use only inode-i_bdev-bd_disk and
those are easy; the trouble begins when we use f_mode and/or f_flags
in non-trivial ways.  One good example is cdrom.c - its behaviour
depends on whether we have O_NDELAY (doesn't lock the door, among
other things) and on whether we are opening for write.  Unfortunately,
it's not just something we care about at open() time - we need different
cleanups at close(), and that doesn't work at all.

At the very least, we need to know whether we want to unlock the door.
Suppose somebody opens it with O_NDELAY (ioctl-only) when it's already
opened for data (or mounted, for that matter).  Now we close it; how
do we know if we need to unlock?  Even if we did get struct file * (and
we do not, but that's a separate story), we could not rely on file-f_flags.
Why?  Because O_NDELAY can be flipped at will by fcntl() after the file
got opened.  IOW, we can't rely on it.

Most of that crap can be solved by saving the relevant information from
f_flags into f_mode (e.g. fs/block_dev.c::do_open()), where they'd remain
safe.  However, that's not quite enough - struct file we pass to -open()
is often a fake one and it certainly won't live until -release().

One thing that _can_ be solved that way, though, and it alone makes the
scheme worthwhile: floppy.c abuses can be finally killed.  Just saving
-f_flags  2 in -f_mode is enough to eliminate the need for floppy_open()
to play games with permission(9) and modify *file in any way.  Anything
close to current fdutils either passes O_RDWR |... or 3 | ... to open()
when it wants to do priveleged ioctls and then we get open_namei check
that we can write to file.  So we can check that bit saved in -f_mode
for can we do that kind of ioctl; no need to mess with anything.

That's a *big* win, since this is eliminates the only place that uses
more than the fields mentioned above and the only place that tries to
modify struct file and check for modifications in later method calls.

So having done that, we get to the situation where -open() only uses
inode-i_bdev and file-f_mode (and only reads these).  I.e. we will be
able to go for much saner
int (*open)(struct block_device *bdev, mode_t mode)

We even can kill that fake struct file.  At that point we have sanitized
-open(), no regressions and a chance to solve the problems with mode-dependent
cleanups on close().  However, to use that chance we'll need to do something
with -release().

2) -release() gets struct inode * and struct file * too.  It only uses
inode-i_bdev-bd_disk and, sometimes, file-f_mode.  Tries to use
file-f_mode, that is.  Even all way back in 2.2 when it was -release()
of file_operations, we used to get NULL on operations like umount, etc.,
so users of file-f_mode/file-f_flags had to check for file != NULL
first.  And pray that all such users will want some reasonable default
behaviour.

For one thing, that's not true anymore.  E.g. pktcdvd.c does blkdev_get()
with O_NDELAY and blkdev_put() as matching close.  IOW, blkdev_put() can't
expect that matching open had no O_NDELAY.

For another, blkdev_put() is used by blkdev_close() since 2.3, so we _always_
get NULL now.  IOW, the cleanup-related logics doesn't work properly, period.

The obvious first step is to switch to
int (*release)(struct gendisk *disk, mode_t mode)
(there are good reasons why open wants bdev and release does not).  At least
that way we can start passing the right value without fake struct file, etc.

Now, the only thing we need is a way to pass that mode_t to blkdev_put().
IOW, blkdev_put() needs an extra argument.  And that's where the things
get really interesting.  blkdev_close() should just pass file-f_mode to
it; that's easy.  However, we have more callers and need case-by-case
analysis for those.

We don't have too many of those callers, fortunately, and almost all of them
are easy - we know what had matching blkdev_open() got and that's it.

Exceptions are interesting.
* we have reiserfs journal that could've been opened r/o or r/w.
BFD - we can store the mode_t just fine along with the reference to bdev.
* a bug (AFAICT) in md.c - we open raid components r/w and if it
fails, it fails.  Good for our purposes, but... how about raid0 on read-only
devices?  In any case, we have a ready place to store mode_t, so it's not a
problem for getting the right value to blkdev_put().  FWIW, I think that
allowing fallback to r/o (and making the entire array r/o, of course) would
be a good thing to have.  Any comments from drivers/md folks?
* open_bdev_excl()/close_bdev_excl().  Needs an extra 

Re: Adding a security parameter to VFS functions

2007-08-16 Thread Al Viro
On Thu, Aug 16, 2007 at 03:57:24PM -0700, Linus Torvalds wrote:
 I personally consider this an affront to everythign that is decent.
 
 Why the *hell* would mkdir() be so magical as to need something like that?
 
 Make it something sane, like a struct nameidata instead, and make it at 
 least try to look like the path creation that is done by open().  Or 
 create a struct file * or something.

No.  I agree that mkdir/mknod/etc. should all take the same thing.
*However*, it should not be nameidata or file.  Why?  Because it
should not contain vfsmount.  Object creation should not *care*
where fs is mounted.  At all.  The same goes for open(), BTW - letting
it see the reference to vfsmount via nameidata is bloody wrong and
I really couldn't care less what kinds of perversions apparmour crowd
might like - pathnames simply do not belong there.

Said that, I agree that we need uniform prototypes for all these guys
and I can see arguments both for and against having creds passed by
reference stored in whatever struct we are passing (for: copy-on-write
refcounted cred objects would almost never have to be modified or
copied and normally we would just pick the current creds reference
from task_struct and assing it to a single field in whatever we pass
instead of nameidata; against: might be conceptually simpler to have
all that data directly in that struct and a helper filling it in).

Which way we do that and which struct we end up passing is a very good
question, but I want to make it very clear: having object creation/removal/
renaming methods[1] depend on which vfsmount we are dealing with is *wrong*.
IOW, I strongly object against the use of nameidata here.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] coda: kill file_count abuse

2007-07-19 Thread Al Viro
On Fri, Jul 20, 2007 at 10:45:34AM +1000, David Chinner wrote:
 On Thu, Jul 19, 2007 at 06:16:00PM -0400, Jan Harkes wrote:
  On Thu, Jul 19, 2007 at 11:45:08PM +0200, Christoph Hellwig wrote:
   -release is the proper way to detect the last close of a file,
   file_count should never be used in filesystems.
  
  Has been tried, the problem with that once -release is called it is too
  late to pass the the error back to close(2).
 
 I think you'll find the problem is that fput() throws away the error
 from -release, not that it's too late

Just where would that return value go?

BTW, the reason why checks for struct file refcount blow is not far
from that:

task A: write()
task B (sharing descriptor table with A): close()
task C (with another reference to struct file in question): close()
task A: return from write()

Now, the final fput() here happens in write().  In particular, no
call of close(2) sees refcount equal to 1.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] coda: kill file_count abuse

2007-07-19 Thread Al Viro
On Fri, Jul 20, 2007 at 12:36:01PM +1000, David Chinner wrote:
 To the context that dropped the last reference. It can't be
 reported to anything else

Oh, for fsck sake...

Send a datagram with SCM_RIGHTS in it.  Have all other references
to these files closed.  Now have close(2) kill the socket that
might eventually be used to receive the datagram.  Now, all these
files are closed.  Tell me, which error value would you report
to that caller of close() and how would it guess which files had
those been?

Consider munmap().  It can release the last reference to any number
of files (mmap them in different places, then munmap the entire
area).  Which error would you report?

Consider exit(), for crying out loud.  You have a file opened.
You fork a child.  You close the file in parent.  Child goes
away.  Who do you report that stuff?
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] coda: file count cannot be used to discover last close

2007-07-19 Thread Al Viro
On Fri, Jul 20, 2007 at 12:10:00AM -0400, Jan Harkes wrote:
 I will try to find a clean way to block the close syscall until fput
 drops the last reference. However I realize that such an implementation
 would not be acceptable for other file systems, and there are some
 interesting unresolved details such as 'which close is going to be the
 last close'.

Simply impossible.

fd = open(...);
fd2 = dup(fd);

close(fd2);  -- deadlock
close(fd);

Or
fd = open(...);
p = mmap(..., fd, ...);
close(fd);  -- deadlock
...
munmap(p, ...);

The problem is that you are mixing two different operations:
* closing descriptor: descriptor doesn't refer to this opened file
anymore.  Can be done by close(), can be done by dup2(), can be done by
fcntl() analog of dup2(), can be done by execve(), can be done by exit().
And can be done by death-by-signal.
* closing opened file: no references (including descriptors) exist
anymore, no IO of any kind will be done on that IO channel.  Can happen
when descriptor gets closed, can happen when mapping gets destroyed (munmap,
etc.), can happen when SCM_RIGHTS datagram gets destroyed (including the
garbage collection), can happen when any syscall on that file finishes
after another thread has closed descriptor passed to that syscall, etc.

You _can't_ expect the errors from the latter to be reliably
passed to some process; for one thing, many files can get closed by
single system call (e.g. closing an AF_UNIX socket with pending SCM_RIGHTS
datagrams will flush those datagrams, dropping references to files we
tried to pass via them).

Why does CODA need special warranties in that area, anyway?
Related question: does fsync() force the writeback?
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] isofs: mounting to regular file may succeed

2007-07-17 Thread Al Viro
On Tue, Jul 17, 2007 at 12:04:07AM -0700, Andrew Morton wrote:

 I don't think any (all?) other filesystems perform checks like this.
 Is this something which can/should be performed at the VFS level?

I don't see why we _should_ check that; VFS checks that we don't
turn directory into non-directory and vice versa, same as for binding.
There is no reason why fs driver could not give you a single-node filesystem
with root being e.g. a block device node.  You could mount it on any
non-directory and get a block device seen there, etc.

IOW, if filesystem driver considers fs image it's working from
as invalid, it's up to filesystem driver.  _What_ is considered invalid
depends on fs type; what the driver cares to report as dubious is, again,
up to driver.  If isofs wants to warn about an ISO image that definitely
violates ISO9660, it's isofs decision.  VFS will work with it as long
as fs driver does...
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: *at syscalls for xattrs?

2007-07-16 Thread Al Viro
On Mon, Jul 16, 2007 at 09:56:10AM +0200, Jan Engelhardt wrote:
 Just one question: what the bleeding hell for?  Not that the rest of
 ..at() family made any damn sense as an interface...
 
 fd1 = open(dir1, O_DIRECTORY):
 fd2 = open(dir2, O_DIRECTORY);
 system(mount -t tmpfs none dir1);
 system(mount -t tmpfs none dir2);
 openat(fd1, file1, O_RDWR | O_CREAT);
 openat(fd2, file2, O_RDWR | O_CREAT);
 
 If you have a better way to accomplish this, let me know. :)

To accomplish what, exactly?  Access to overmounted directory?
So bind it elsewhere and use that.  I still don't see the point -
neither of the interface nor of your example...
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: *at syscalls for xattrs?

2007-07-15 Thread Al Viro
On Sun, Jul 15, 2007 at 09:46:27PM +0200, Jan Engelhardt wrote:
 Hi,
 
 
 recently, the family of *at() syscalls and functions (openat, fstatat, 
 etc.) have been added to Linux and Glibc, respectively.
 In short: I am missing xattr at functions :)

No.  They are not fscking forks.  They are almost as revolting, but
not quite on the same level.

 BTW, why is fstatat called fstatat and not statat? (Same goes for 
 futimesat.) It does not take a file descriptor for the file argument. 
 Otherwise we'd also need fopenat/funlinkat, etc. Any reasons?

Ulrich having an odd taste?
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: *at syscalls for xattrs?

2007-07-15 Thread Al Viro
On Sun, Jul 15, 2007 at 02:13:21PM -0700, Nicholas Miell wrote:
 
 I suspect he was asking for 
 
 int getxattrat(int fd, const char *path, const char *name, void *value, 
   size_t size, int flags)
 int setxattrat(int fd, const char *path, const char *name, void *value,
   size_t size, int xattrflags, int atflags)
 
 rather than the ability to access xattrs as files.

Just one question: what the bleeding hell for?  Not that the rest of
..at() family made any damn sense as an interface...

   BTW, why is fstatat called fstatat and not statat? (Same goes for 
   futimesat.) It does not take a file descriptor for the file argument. 
   Otherwise we'd also need fopenat/funlinkat, etc. Any reasons?
  
  Ulrich having an odd taste?
 
 Solaris compatibility.

Sun having no taste whatsoever
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/PATCH 2/5] revoke: core code

2007-07-11 Thread Al Viro
On Wed, Jul 11, 2007 at 12:01:06PM +0300, Pekka J Enberg wrote:
 From: Pekka Enberg [EMAIL PROTECTED]
 
 The revokeat(2) and frevoke(2) system calls invalidate open file descriptors
 and shared mappings of an inode.  After an successful revocation, operations
 on file descriptors fail with the EBADF or ENXIO error code for regular and
 device files, respectively.  Attempting to read from or write to a revoked
 mapping causes SIGBUS.
 
 The actual operation is done in two passes:
 
  1. Revoke all file descriptors that point to the given inode. We do
 this under tasklist_lock so that after this pass, we don't need
 to worry about racing with close(2) or dup(2).

How does that deal with the following:

task A gets its descriptor table cleansed
task B sends a descriptor to task A via SCM_RIGHTS datagram
task B gets its descriptor table cleansed
task A receives the datagram and gets the descriptor installed
task A does any kind of IO on that descriptor
-f_mapping gets replaced in the most inconvenient time.

Come to think of that, what do you do if I create a socketpair,
stuff the descriptor into SCM_RIGHTS datagram and send it to
myself?  Then receive it a day after you've called revoke() and
voila - I've got an almost undamaged struct file back...  At
the very least, it allows me to fchmod().  Or fchdir() if that
had been a directory...

BTW, read() or write() in progress might get rather unhappy if your
live replacement of -f_mapping races with them...
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/PATCH 2/5] revoke: core code

2007-07-11 Thread Al Viro
On Wed, Jul 11, 2007 at 10:37:33AM +0100, Al Viro wrote:
 On Wed, Jul 11, 2007 at 12:01:06PM +0300, Pekka J Enberg wrote:
  From: Pekka Enberg [EMAIL PROTECTED]
  
  The revokeat(2) and frevoke(2) system calls invalidate open file descriptors
  and shared mappings of an inode.  After an successful revocation, operations
  on file descriptors fail with the EBADF or ENXIO error code for regular and
  device files, respectively.  Attempting to read from or write to a revoked
  mapping causes SIGBUS.
  
  The actual operation is done in two passes:
  
   1. Revoke all file descriptors that point to the given inode. We do
  this under tasklist_lock so that after this pass, we don't need
  to worry about racing with close(2) or dup(2).
 
 How does that deal with the following:
 
 task A gets its descriptor table cleansed
 task B sends a descriptor to task A via SCM_RIGHTS datagram
 task B gets its descriptor table cleansed
 task A receives the datagram and gets the descriptor installed
 task A does any kind of IO on that descriptor
 -f_mapping gets replaced in the most inconvenient time.
 
 Come to think of that, what do you do if I create a socketpair,
 stuff the descriptor into SCM_RIGHTS datagram and send it to
 myself?  Then receive it a day after you've called revoke() and
 voila - I've got an almost undamaged struct file back...  At
 the very least, it allows me to fchmod().  Or fchdir() if that
 had been a directory...
 
 BTW, read() or write() in progress might get rather unhappy if your
 live replacement of -f_mapping races with them...

Better: I have the only opened descriptor for foo.  I send it to myself
as described above.  I close it.  revoke() is called, finds no opened
instances of foo in any descriptor tables and cheerfully does nothing.
I call recvmsg() and I have completely undamaged opened file back.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/PATCH 2/5] revoke: core code

2007-07-11 Thread Al Viro
On Wed, Jul 11, 2007 at 12:50:48PM +0300, Pekka J Enberg wrote:
 Hi Al,
 
 On Wed, 11 Jul 2007, Al Viro wrote:
  Better: I have the only opened descriptor for foo.  I send it to myself
  as described above.  I close it.  revoke() is called, finds no opened
  instances of foo in any descriptor tables and cheerfully does nothing.
  I call recvmsg() and I have completely undamaged opened file back.
 
 Uhm, nice. So, revoke() needs a proper inode - struct files mapping 
 somewhere. Can we add a list of files to struct inode? Are there other 
 cases where a file can point to an inode but the file is not attached to 
 any file descriptor?

Umm...  Any number, really - it might be in the middle of syscall
while another task sharing descriptor table has closed the descriptor.
Then there's quota, then there's process accounting, then there's
execve() in progress, then there's knfsd working with that struct
file, etc.

The fundamental issue here is that even if you do find struct file,
you can't blindly rip its -f_mapping since it can be in the middle
of -read(), -write(), pageout, etc.  And even if you do manage
that, you still have the ability to do fchmod() later.

I don't see how the ability to find all instances in SCM_RIGHTS
datagrams (for example) will help you with the race I've described
first.  Original state: task B has the only reference to file.
revoke() is called, passes task A.  B sends datagram to A and closes
file.  A receives datagram.  Now the only reference is in A's table
and you've already passed that.

So you can't avoid processes keeping pointers to struct file.
If you could find all struct file over given inode (which, I suspect,
will lead to interesting locking), you could call something on
that struct file, but you'd have zero exclusion with processes
calling methods on it.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/PATCH 2/5] revoke: core code

2007-07-11 Thread Al Viro
On Wed, Jul 11, 2007 at 01:01:07PM +0300, Pekka J Enberg wrote:
 On Wed, 11 Jul 2007, Al Viro wrote:
  BTW, read() or write() in progress might get rather unhappy if your
  live replacement of -f_mapping races with them...
 
 For writes, we (1) never start any new operations after we've cleaned up 
 the file descriptor tables so (2) after we're done with do_fsync() we 
 never touch -f_mapping again.

Er, no.  do_fsync() won't hit the sys_write() that is yet to enter
-write().  And you can't get rid of new callers _anyway_ (see
previous mail).
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Adding subroot information to /proc/mounts, or obtaining that through other means

2007-06-20 Thread Al Viro
On Wed, Jun 20, 2007 at 01:57:33PM -0700, H. Peter Anvin wrote:
 ... or, alternatively, add a subfield to the first field (which would
 entail escaping whatever separator we choose):
 
 /dev/md6 /export ext3 rw,data=ordered 0 0
 /dev/md6:/users/foo /home/foo ext3 rw,data=ordered 0 0
 /dev/md6:/users/bar /home/bar ext3 rw,data=ordered 0 0

Hell, no.  The first field is in principle impossible to parse unless
you know the fs type.

How about making a new file with sane format?  From the very
beginning.  E.g. mountpoint + ID + relative path + type + options,
where ID uniquely identifies superblock (e.g. numeric st_dev)
and backing device (if any) is sitting among the options...
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [AppArmor 01/41] Pass struct vfsmount to the inode_create LSM hook

2007-05-24 Thread Al Viro
On Thu, May 24, 2007 at 08:10:00PM +0200, Andreas Gruenbacher wrote:

 Read it like this: we don't have a good idea how to support multiple 
 namespaces so far. Currently, we interpret all pathnames relative to the 
 namespace a process is in. Confined processes don't have the privilege to 
 create or manipulate namespaces, which makes this safe. We may find a better 
 future solution.

You also don't have a solution for multiple chroot jails, since they
often have the same fs mounted in many places on given box *and* since
the pathnames from the confined processes' POVs have fsck-all to do
with each other.

It's really not kinder than multiple namespaces as far as your approach
is concerned.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] file as directory

2007-05-23 Thread Al Viro
On Wed, May 23, 2007 at 08:36:04AM +0200, Miklos Szeredi wrote:
  Interesting...  How do you deal with mount propagation and things like
  mount --move?
 
 Moving (or doing other mount operations on) an ancestor shouldn't be a
 problem.  Moving this mount itself is not allowed, and neither is
 doing bind or pivot_root.  Maybe bind could be allowed...

Eh...  Arbitrary limitations are fun, aren't they?
 
 When doing recursive bind on ancestor, these mounts are skipped.

What about clone copying your namespace?  What about MNT_SLAVE stuff being
set up prior to that lookup?  More interesting question: should independent
lookups of that sucker on different paths end up with the same superblock
(and vfsmount for each) or should we get fully independent mount on each?
The latter would be interesting wrt cache coherency...

  As for unlink...  How do you deal with having that thing
  mounted, mounting something _under_ it (so that vfsmount would be kept
  busy) and then unlinking that sucker?
 
 Yeah, that's a good point.  Current patch doesn't deal with that.
 Simplest solution could be to disallow submounting these.  Don't think
 it makes much sense anyway.

Arbitrary limitations... (and that's where revalidate horrors come in, BTW).
BTW^2: what if fs mounted that way will happen to have such node itself?

I'm not saying that it's unfeasible or won't lead to interesting things,
but it really needs semantics done right...
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] file as directory

2007-05-23 Thread Al Viro
On Wed, May 23, 2007 at 09:19:17AM +0200, Miklos Szeredi wrote:
  Eh...  Arbitrary limitations are fun, aren't they?
 
 But these mounts _are_ special.  There is really no point in moving or
 pivoting them.

pivoting - probably true, moving... why not?
 
  What about MNT_SLAVE stuff being set up prior to that lookup?
 
 These mounts are not propagated.  Or at least I hope so.  Propagation
 stuff is a bit too complicated for my poor little brain.

Er...  These mounts might not be propagated, but what about a bind
over another instance of such file in master tree?
 
 I think they should be the same superblock, same dentry.  What would
 be the advantage of doing otherwise?

Then you are going to have interesting time with locking in final mntput().
BTW, what about having several links to the same file?  You have i_mutex
on the inode, so serialization of those is not a problem, but...
 
 I think doing this recursively should be allowed.  Releasing last ref
 cleans up the mess should work in that case.

Releasing the last reference will lead to cascade of umounts in that
case...  IOW, need to be careful with locking.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] file as directory

2007-05-23 Thread Al Viro
On Tue, May 22, 2007 at 08:48:49PM +0200, Miklos Szeredi wrote:
   */
 -static int __follow_mount(struct path *path)
 +static int __follow_mount(struct path *path, bool enter)
  {
   int res = 0;
   while (d_mountpoint(path-dentry)) {
 - struct vfsmount *mounted = lookup_mnt(path-mnt, path-dentry);
 + struct vfsmount *mounted =
 + lookup_mnt(path-mnt, path-dentry, enter);
 +
   if (!mounted)
   break;
   dput(path-dentry);
 @@ -689,27 +697,37 @@ static int __follow_mount(struct path *p
   return res;
  }
  
 -static void follow_mount(struct vfsmount **mnt, struct dentry **dentry)
 +/*
 + * Follows mounts on the given nameidata.
 + *
 + * Only follows directory on file mounts if LOOKUP_ENTER is set.
 + */
 +void follow_mount(struct nameidata *nd)

BTW, I'd split that (and matching updates in callers) into separate
patch.

  {
 - while (d_mountpoint(*dentry)) {
 - struct vfsmount *mounted = lookup_mnt(*mnt, *dentry);
 + while (d_mountpoint(nd-dentry)) {
 + bool enter = nd-flags  LOOKUP_ENTER;

int, surely?

 + * This is called if the object has no -lookup() defined, yet the
 + * path contains a slash after the object name.
 + *
 + * If the filesystem defines an -enter() method, this will be called,
 + * and the filesystem shall fill the supplied struct path or return an
 + * error.
 + *
 + * The returned path will be bind mounted on top of the object with
 + * the MNT_DIRONFILE flag, and the nameidata will descend into the
 + * mount.
 + */
 +static int enter_file(struct inode *inode, struct nameidata *nd)
 +{
 + int err;
 + struct path newpath;
 +
 + printk(KERN_DEBUG %s/%d enter %s/\n, current-comm, current-pid,
 +nd-dentry-d_name.name);
 + if (!inode-i_op-enter)
 + return -ENOTDIR;
 +
 + newpath.mnt = NULL;
 + newpath.dentry = NULL;
 + err = inode-i_op-enter(nd, newpath);
 + if (!err) {
 + err = mount_dironfile(nd, newpath);
 + pathput(newpath);
 + }
 + return err;

Ouch.  What guarantees that two lookups won't race right here?  You are
not holding any locks at that point, AFAICS...

BTW, why newpath?  What's wrong with simply returning a new vfsmount
with right -mnt_root/-mnt_sb (instead of creating it inside
mount_dironfile())?  ERR_PTR() for error, struct vfsmount * for success...

 @@ -301,8 +310,8 @@ static struct vfsmount *clone_mnt(struct
   mnt-mnt_mountpoint = mnt-mnt_root;
   mnt-mnt_parent = mnt;
  
 - /* don't copy the MNT_USER flag */
 - mnt-mnt_flags = ~MNT_USER;
 + /* don't copy some flags */
 + mnt-mnt_flags = ~(MNT_USER | MNT_DIRONFILE);
   if (flag  CL_SETUSER)
   __set_mnt_user(mnt, owner);

Hmm?  So you do copy them and strip your MNT_DIRONFILE from copies?

 + * This is tricky, because for namespace modification we must take the
 + * namespace semaphore.  But mntput() is called from various places,
 + * sometimes with namespace_sem held.  Fortunately in those places the
 + * mount cannot yet have MNT_DIRONFILE, or at least that's what I
 + * hope...
 + *
 + * The umounting is done in two stages, first the mount is removed
 + * from the hashes.  This is done atomically wrt other mount lookups,
 + * so it's not possible to acquire a new ref to this dead mount that
 + * way.
 + *
 + * Then after having locked namespace_sem and relocked vfsmount_lock,
 + * the mount is properly detached.
 + */
 +static void umount_dironfile(struct vfsmount *mnt)
 + __releases(vfsmount_lock)
 +{
 + struct nameidata nd;

You've got to be kidding.  nameidata is *big*.  If anything, we want
to make detach_mnt() take struct path * instead, but even that is
lousy due to recursion.

I really don't like what's going on here.  The thing is, current code
is based on assumption that presence in the mount tree = holding a
reference.  We _might_ deal with that (there was an old plan to change
refcounting logics for vfsmounts), but that sort of games with locks
spells trouble.  What happens, for example, if namespace gets cloned
before you grab namespace_sem?

There's another problem, BTW - a lot of stuff does stat + open + fstat +
compare kind of sequence.  You'll end up mounting/umounting between stat
and open, which opens you to race with somebody else.  Get a different
st_dev, eat a nice unreproducible error from application...
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] file as directory

2007-05-23 Thread Al Viro
On Wed, May 23, 2007 at 11:03:08AM +0200, Miklos Szeredi wrote:
 I still don't get it where the superblock comes in.  The locking is
 interesting in there, yes.  And I haven't completely convinced
 myself it's right, let alone something that won't easily be screwed up
 in the future.  So there's definitely room for thought there.
 
 But how does it matter if two different paths have the same sb or a
 different sb mounted over them?
 
Because then you get a slew of fun issues with dropping the final reference
to vfsmount vs. lookup on another place.  What hold do you have on that
superblock and when do you switch from oh, called -enter() on the same
inode again, return vfsmount over the same superblock to need to
initialize that damn superblock, all mounts are gone?

 The same dentry is mounted over each one.  The contents of the
 directory should only depend on the contents of the underlying inode.
 The path leading up to it is completely irrelevant.

So what kind of exclusion do you have for -enter()?  None?
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] file as directory

2007-05-23 Thread Al Viro
On Wed, May 23, 2007 at 12:09:19PM +0200, Miklos Szeredi wrote:
 Right.  After locking vfsmount_lock, mount_dironfile() should recheck
 if there was a race and bail out.

Owww...  Not pretty, that...
 
 I don't think the filesystem ought to try _creating_ a vfsmount.  I
 imagine, that the fs has already a kernel-internal mounted for this
 kind of stuff, and it just supplies a dentry from that.  The vfsmount
 isn't actually important, but it should be readily available, and it's
 easier to clone from a vfsmount/dentry pair.

I don't get it.  What's the point of that exercise, then?  When do you
create that kernel-internal mount?

 As I said, the superblock should be persistent, so we'll get a stable
 st_dev for multiple mounts.

OK, but then I guess I don't understand the intended use.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] file as directory

2007-05-23 Thread Al Viro
 When the real superblock is created.  It could even be the _same_
 super block as the real one.  There'd be just the problem of anchoring
 the dir-on-file dentries somewhere...
 
 Or with fuse the dir-on-file mount can just come from any mounted
 filesystem, again possibly the same one as the parent.  I do actually
 test with this.  The userspace filesystem supplies a file descriptor,
 from which the struct path is extracted and returned from -enter().

Then I do not understand what this mechanism could be used for, other
than an odd way to twist POSIX behaviour and see how much of the userland
would survive that.  Certainly not useful for your look into tarball
as a tree, unless you seriously want to scan the entire damn fs for
tarballs at mount time and set up a superblock for each.  And for per-file
extended attributes/forks/whatever-you-call-that-abomination it also
obviously doesn't help, since you lose them for directories.

IOW, what uses do you have in mind?  Complete scenario, please...
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] file as directory

2007-05-23 Thread Al Viro
On Wed, May 23, 2007 at 12:39:25PM +0100, Al Viro wrote:
 Then I do not understand what this mechanism could be used for, other
 than an odd way to twist POSIX behaviour and see how much of the userland
 would survive that.  Certainly not useful for your look into tarball
 as a tree, unless you seriously want to scan the entire damn fs for
 tarballs at mount time and set up a superblock for each.  And for per-file
 extended attributes/forks/whatever-you-call-that-abomination it also
 obviously doesn't help, since you lose them for directories.
 
 IOW, what uses do you have in mind?  Complete scenario, please...

Ah... After rereading the thread you've mentioned in the very beginning,
I think I understand what you are driving at.  However, in that case
* I really don't see why bother with returning vfsmount at all.
dentry alone is enough to create a new vfsmount, all in fs/namei.c.
* the lifetime rules look fscking scary.  You call that -enter()
on nearly every damn lookup.  OK, so you'll recreate equivalent vfsmount,
but...  That's a lot of allocations/freeing.  Can we do some caching and
deal with it on memory pressure?
* invalidation on unlink is still an open problem.
* locking in final mntput() doesn't look nice; we probably need
a new refcounting scheme for vfsmounts to make that work.  I have a variant
that might work here (and make life much easier for expiry logics in
automount/shared trees, which is what it had been initially proposed for),
but it still doesn't kill the need to deal with invalidation.  And yes,
NFS still needs it (and so do all network filesystems, really).  The question
of caching is related to that.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] file as directory

2007-05-23 Thread Al Viro
On Wed, May 23, 2007 at 08:34:42AM -0400, Trond Myklebust wrote:
 On Wed, 2007-05-23 at 08:36 +0100, Al Viro wrote:
  On Wed, May 23, 2007 at 09:19:17AM +0200, Miklos Szeredi wrote:
Eh...  Arbitrary limitations are fun, aren't they?
   
   But these mounts _are_ special.  There is really no point in moving or
   pivoting them.
  
  pivoting - probably true, moving... why not?
 
 Moving would be an implementation artefact that doesn't really
 correspond to any useful operation on the filesyst
 
 AFAIK, most filesystems that have implemented subfiles (excepting
 Reiser4 of course) do not allow you to rename or move the subfile
 directory or its contents from one parent file to another.

If that's about xattr and nothing else, colour me thoroughly uninterested.
If it might have other interesting uses, OTOH...
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] file as directory

2007-05-23 Thread Al Viro
On Wed, May 23, 2007 at 03:01:38PM +0200, Miklos Szeredi wrote:
 Someone might think of a way to make those work with directories.
 Invisible directory entries, anyone?  /me ducks

Not unless you manage to get working union-mount [*NOT* unionfs]
 
  * invalidation on unlink is still an open problem.
  * locking in final mntput() doesn't look nice; we probably need
  a new refcounting scheme for vfsmounts to make that work.  I have a variant
  that might work here (and make life much easier for expiry logics in
  automount/shared trees, which is what it had been initially proposed for),
 
 Which variant?  We had that detached subtrees thing, is that it?

Umm...  It is related to detached subtrees, but I'm not sure if it is what
you are thinking about.

Short version of the story: new counter (mnt_busy) that would be defined
in the following way: the number of external references (not due to the
vfsmount tree structure or from namespace to root) + the number of
children that have non-zero -mnt_busy.  And a per-vfsmount flag (goner).

The rules for handling -mnt_busy:
* duplicating external reference: increment m-mnt_busy
* getting from m to child: increment child-mnt_busy, if it went
from 0 to non-zero - increment m-mnt_busy as well (that's done under
vfsmount_lock, so we can safely check for zero here).
* getting from m to parent: increment parent-mnt_busy.
* dropping external reference: decrement m-mnt_busy; if it's still
non-zero, we are done.  If it's zero, we are in for some work (and had
acquired vfsmount_lock by atomic_dec_and_lock()).  Here's what we do:
* go through ancestors, decrementing -mnt_busy, until we
  hit the root or get to one with -mnt_busy staying
  non-zero.
* find the most remote ancestor that has zero -mnt_busy
  and is marked as goner (might be m itself).
* if no such beast exists, we are done.
* otherwise, detach the subtree rooted in that ancestor
  from its parent (if any) and unhash its root (if hashed).
  Now there is no external references to any vfsmount in that
  subtree.
* now we can kill all vfsmounts in that subtree.
* detaching m from parent: nothing; we trade a busy child of parent
for new external reference to parent.
* lazy umount: in addition to detaching everything from parents
and dropping resulting external references to parents, mark everything
in the subtree as goners.
* normal umount: check -mnt_busy *and* lack of children, detach,
mark as goner, drop resulting external reference to parent.
* fun new stuff - umount of intact subtree: detach the subtree from
parent, do *not* dissolve it, mark everything in subtree as goners.  If
something we mark as goner is not busy, we can kill it and all its descendents.
The subtree will be shrinking as its pieces lose external references.
* check for expirability: we hold an external reference to m and
m-mnt_busy is 1.  No need to look into children, etc.
* your vfsmounts: simply mark them goners from the very beginning.
 
  but it still doesn't kill the need to deal with invalidation.  And
  yes, NFS still needs it (and so do all network filesystems, really).
  The question of caching is related to that.
 
 So what's so special about invalidation?  Why not just treat
 dir-on-file mounts the same as any other ref on the dentry?

Because of the case of having something mounted in that subtree.  The
current code doesn't even try to evict such stuff.  NFS *does*, but
it's not in position to do that decently (not NFS fault, it's just that
we don't have the data needed for it).

Note that one problem we used to have back then is gone - namely, per-namespace
semaphores.  It's a global semaphore now, so we *can* do cross-namespace
rogering of mount trees without that kind of locking horrors.

What we really need is go through dentry subtree, try to evict everything
we can, for anything that has stuff mounted on it go through all such
vfsmounts and kick them and all their descendents out.  That's what should
happen on invalidation.  From generic code, so that NFS wouldn't have to
bother.

And _that_ is what we could call from -unlink() on your inode - would take
care of submounts.

Note that I'm not all that happy about this scheme; we might make it work,
but I still want to see a good use scenario for that kind of stuff.
Invalidation logics is a separate story - it's simply needed for existing
stuff; that area sucks *badly*, regardless of adding these hybrid objects.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] file as directory

2007-05-23 Thread Al Viro
On Wed, May 23, 2007 at 03:23:54PM +0200, Ph. Marek wrote:
 How about some special node in eg. /proc (or a new filesystem)?
 Eg.
/fileAsDir/etc/passwd/owner ...
 would work for all *files*. For directories we do not know whether we're 
 still 
 climbing the hierarchy or would like to see meta-data.

So we need to make *anything* done anywhere in the namespace to modify
the dentry tree on that fs.  Could you spell fuck, NO?
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] file as directory

2007-05-23 Thread Al Viro
On Wed, May 23, 2007 at 04:32:37PM +0200, Miklos Szeredi wrote:
  Umm...  It is related to detached subtrees, but I'm not sure if it is what
  you are thinking about.
 
 I was thinking of a similar one by Mike Waychison.  It had the problem
 of requiring a spinlock for mntget/mntput.  It was also different in
 that it did not gradually dissolve detached trees, but kept them as
 whole blobs until the last ref went away.
 
Here the spinlock is needed only when mnt_busy goes to 0, so presumably
it won't be a serious problem on more or less common setups; however,
it certainly would need serious profiling.

 How will this work with copy_tree() and namespace duplication, which
 currently walk the tree with only namespace_sem held?

Easy - grab namespace_sem, grab vfsmount_lock, walk the subtree and bump
mnt_busy on everything (by 1 + number of non-busy children).  Then drop
vfsmount_lock and do as usual, dropping references in tree being copied
as you go.  Nothing will get attached or detached due to namespace_sem,
nothing will get evicted by anybody other than you since you've got all
that stuff pinned down.  End of story...
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] file as directory

2007-05-23 Thread Al Viro
On Wed, May 23, 2007 at 05:25:49PM +0200, Miklos Szeredi wrote:
   How will this work with copy_tree() and namespace duplication, which
   currently walk the tree with only namespace_sem held?
  
  Easy - grab namespace_sem, grab vfsmount_lock, walk the subtree and bump
  mnt_busy on everything (by 1 + number of non-busy children).  Then drop
  vfsmount_lock and do as usual, dropping references in tree being copied
  as you go.  Nothing will get attached or detached due to namespace_sem,
  nothing will get evicted by anybody other than you since you've got all
  that stuff pinned down.  End of story...
 
 Right.
 
 Do you have some code?
 
 Should I try to code something up?

I hope to get some breathing space next week, then I'll get back to
VFS work.  I'd rather do that one myself, since it'll be a long series
of equivalent transformations - debugging such rewrite of refcounting
done as a single patch is going to be hell.  And yes, refcounting rewrite
is near the top of the list (another thing is wading through several
threads from hell and reviewing unionfs ;-/)
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [AppArmor 01/41] Pass struct vfsmount to the inode_create LSM hook

2007-04-12 Thread Al Viro
On Thu, Apr 12, 2007 at 02:08:10AM -0700, [EMAIL PROTECTED] wrote:
 This is needed for computing pathnames in the AppArmor LSM.

Which is an argument against said LSM in current form.


 - error = security_inode_create(dir, dentry, mode);
 + error = security_inode_create(dir, dentry, nd ? nd-mnt : NULL, mode);

is a clear sign that interface is wrong.  Leaving aside the general
idiocy of we prohibit to do something with file if mounted here,
but if there's another mountpoint, well, we just miss, an API of that
kind is just plain wrong.  Either you can live without seeing vfsmount
in that method (in which case you shouldn't pass it at all), or you
have a hole.

NACK.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [AppArmor 40/41] AppArmor: all the rest

2007-04-12 Thread Al Viro
On Thu, Apr 12, 2007 at 02:08:49AM -0700, [EMAIL PROTECTED] wrote:
 + } else if (profile1  profile2) {
 + /* profile1 cannot be NULL here. */
 + spin_lock_irqsave(profile1-lock, profile1-int_flags);
 + if (profile2)
 + spin_lock(profile2-lock);
 +
 + } else {
 + /* profile2 cannot be NULL here. */
 + spin_lock_irqsave(profile2-lock, profile2-int_flags);
 + spin_lock(profile1-lock);
 + }

Ahem...

profile2 is locked individually.  profile1  profile2.  profile1 is not
locked.  We try to lock both.  profile1 is locked OK, flags (with interrupts
disabled) are stored into it.  We spin trying to lock profile2.  Eventually,
whoever had held profile2 unlocks it, restoring the flags from profile2.
We happily grab the spinlock and move on.  When we unlock the pair, we
restore flags from profile1.  I.e. we are left with interrupts disabled.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [AppArmor 33/41] Add d_namespace_path() to obtain namespace relative pathnames

2007-04-12 Thread Al Viro
 +char *d_namespace_path(struct dentry *dentry, struct vfsmount *vfsmnt,
 +char *buf, int buflen)
 +{
 + char *res;
 + struct vfsmount *rootmnt, *nsrootmnt;
 + struct dentry *root;
 +
 + read_lock(current-fs-lock);
 + rootmnt = mntget(current-fs-rootmnt);
 + read_unlock(current-fs-lock);
 + spin_lock(vfsmount_lock);
 + nsrootmnt = mntget(rootmnt-mnt_ns-root);

... and when somebody does umount -l on your chroot jail, you get
NULL -mnt_ns.  Oops...
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Move across mount,sb

2007-03-14 Thread Al Viro
On Thu, Mar 15, 2007 at 02:20:25AM +0100, Jan Engelhardt wrote:
 Why is EXDEV returned when _vfs mountpoints_ are crossed? Should not it 
 be more like the following?
 
   error = -EXDEV;
   if (oldnd.mnt-mnt_sb != newnd.mnt-mnt_sb)
   goto exit2;
 

No.  This is absolutely deliberate - mountpoint creates a boundary and
such boundaries are very useful for restricting modifications of filesystem.
IOW, it's not a bug and it applies to other operations as well (link(), for
example).
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ext2: conditional removal of NFSD code

2007-01-06 Thread Al Viro
On Sun, Jan 07, 2007 at 12:44:56AM +0300, Alexey Dobriyan wrote:
  #if defined(CONFIG_EXPORTFS) || defined(CONFIG_EXPORTFS_MODULE)
  #define set_export_ops(sb, ops) sb-s_export_op = ops
  #else
  #define set_export_ops(sb, ops) 0
  #endif
 
  That way you can get rid of the function pointer from the struct
  superblock too.
 
 Exactly! I've just started with filesystems I use.
 
 it should be wrapped in do {} while 0, of course.

What the hell for?  Repeat after me:
* do {} while(0) is always inferior to ((void)0)
* do { expr; } while(0) is always inferior to ((void)(expr))
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Is a NULL check missing in nfs_lookup?

2007-01-05 Thread Al Viro
On Fri, Jan 05, 2007 at 12:11:17PM -0500, Shaya Potter wrote:
 ok, I guess what I don't understand is when are
 
 dentry-d_sb-s_root and nd-mnt-mnt_root not equivalent.
 
 I guess it would be when crossing a mountpoint on the server, so I
 look at nfs_follow_mountpoint() and basically see that you create a new
 mountpoint and a new nd for that.  And since superblock objects are only
 per real file system not per mount point, they will be different.
 
 I guess the question is, why shouldn't a dentry object know what
 vfsmount it belongs to is?  Can it belong to multiple vfsmount objects?

Yes.  dentry belongs to superblock.  vfsmount refers to a subtree of some
superblock's dentry tree.  There can be any number of vfsmounts refering
to subtrees of the same fs.  Some might refer to the entire tree, some -
to its arbitrary subtrees (possibly as small as a single file).  There
can be any number of vfsmounts refering to any given subtree.

Think what happens when you create a binding.  Or when you clone an entire
namespace.  (Pieces of) the same filesystem might be mounted in a lot
of places.  vfsmount describes a part of unified tree delimited by
mountpoints; if the same fs (or its parts) are seen under several
mountpoints, you get vfsmounts that refer to the same fs instance, i.e.
the same superblock and dentry tree.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Finding hardlinks

2006-12-20 Thread Al Viro
On Wed, Dec 20, 2006 at 05:50:11PM +0100, Miklos Szeredi wrote:
 I don't see any problems with changing struct kstat.  There would be
 reservations against changing inode.i_ino though.
 
 So filesystems that have 64bit inodes will need a specialized
 getattr() method instead of generic_fillattr().

And they are already free to do so.  And no, struct kstat doesn't need
to be changed - it has u64 ino already.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ensure i_ino uniqueness in filesystems without permanent inode numbers (via idr)

2006-11-16 Thread Al Viro
On Wed, Nov 15, 2006 at 11:42:38AM -0500, Jeff Layton wrote:
 +{
 + int rv;
 +
 + rv = idr_pre_get(inode-i_sb-s_inode_ids, GFP_KERNEL);
 + if (! rv)
 + return -ENOMEM;
 +
 + lock_super(inode-i_sb);

[EMAIL PROTECTED]

Please, use something saner.  Use of lock_super() for anything generic
is wrong; using it for something that'd better be fast is even dumber...

 @@ -1025,6 +1055,7 @@ void generic_delete_inode(struct inode *
   spin_lock(inode_lock);
   hlist_del_init(inode-i_hash);
   spin_unlock(inode_lock);
 + iunique_unregister(inode);

Unconditional?  Hitting every fs out there?  With that kind of locking?

   wake_up_inode(inode);
   BUG_ON(inode-i_state != I_CLEAR);
   destroy_inode(inode);
 @@ -1057,6 +1088,7 @@ static void generic_forget_inode(struct 
   inode-i_state |= I_FREEING;
   inodes_stat.nr_inodes--;
   spin_unlock(inode_lock);
 + iunique_unregister(inode);

Ditto.

   if (inode-i_data.nrpages)
   truncate_inode_pages(inode-i_data, 0);
   clear_inode(inode);
 diff --git a/fs/pipe.c b/fs/pipe.c
 index b1626f2..d74ae65 100644
 --- a/fs/pipe.c
 +++ b/fs/pipe.c
 @@ -845,6 +845,9 @@ static struct inode * get_pipe_inode(voi
   if (!inode)
   goto fail_inode;
  
 + if (iunique_register(inode, 0))
 + goto fail_iput;
 +

Humm...  I wonder what the overhead of that is going to be.  There
easily can be shitloads of pipes on the box, with all sorts of
lifetimes.  And we'd better be fast on that codepath...
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Kernel bug: Bad page state: related to generic symlink code and mmap

2005-08-19 Thread Al Viro
On Fri, Aug 19, 2005 at 12:14:48PM +0100, Anton Altaparmakov wrote:
 Hi,
 
 There is a bug somewhere in 2.6.11.4 and I can't figure out where it is.
 I assume it is present in older and newer kernels, too as the related
 code hasn't changed much AFAICS and googling for Bad page state
 returns rather a lot of hits relating to both older (up to 2.5.70!) and
 newer kernels...
 
 Note: PLEASE do not stop reading because you read ncpfs below as I am
 pretty sure it is not ncpfs related!  And looking at google a lot of
 people have reported such similar problems since 2.5.70 or so and they
 were all told to go away as they have bad ram.  That is impossible
 because this happens on well over 600 workstations and several servers
 100% reproducible.  Many different types of hardware, different makes,
 difference age, all running smp kernels even if single cpu.  You can't
 tell me they all have bad ram.  Windows works fine and Linux works fine
 except for that one specific problem which is 100% reproducible...
 
 The bug only appears, but it appears 100% reproducibly when a cross
 volume symlink on ncpfs is accessed using nautilus under gnome.  I.e.
 double click on a cross volume symlink on ncpfs in nautilus and the
 machine locks up solid.

Ugh...  Could you at least tell what does nautilus attempt to do at that
point?  Something that wouldn't show up with simple ls -l symlink or
cat symlink /dev/null, judging by the above, but what?
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Kernel bug: Bad page state: related to generic symlink code and mmap

2005-08-19 Thread Al Viro
On Fri, Aug 19, 2005 at 09:07:35AM -0700, Linus Torvalds wrote:
 Hmm.. NFS _does_ use the page cache for symlinks, but uses it slightly 
 differently: instead of relying on the page cache entry being the same 
 when freeing the page, it just caches the page it looked up in the page 
 cache (ie nfs_follow_link() does look up the page cache, but then hides 
 the page pointer inside the page data itself (uglee), and thus does not 
 depend on the mapping staying the same (nfs_put_link() just takes the page 
 from the symlink data).

For NFS that was done exactly to deal with cache invalidation.  IIRC, I've
convinced myself that it wasn't going to happen on ncpfs and happily
abstained from duplicating the NFS variant.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Kernel bug: Bad page state: related to generic symlink code and mmap

2005-08-19 Thread Al Viro
On Fri, Aug 19, 2005 at 09:43:17AM -0700, Linus Torvalds wrote:
 Actually, looking at the ncpfs patch, I'd rather not apply that patch 
 as-is. It looks like it will totally disable symlink caching, which would 
 be kind of sad. Somebody willing to do the same thing NFS does?
 
 NFS hides away the struct page * pointer at the end of the page data, so
 that when we pass the pointer to the virtual address around, we can 
 trivially look up the struct page.
 
 An alternative is to make the symlink address-space use a gfp_mask of 
 GFP_KERNEL (no highmem), at which point ncpfs_put_link() just becomes 
 something like
 
   void ncpfs_put_link(struct dentry *dentry, struct nameidata *nd)
   {
   char *addr = nd_get_link(nd);
 
   if (!IS_ERR(addr))
   page_cache_release(virt_to_page(addr));
   }
 
 which is pretty ugly, but even simpler than the NFS trick.
 
 Anybody?

I'm taking NFS helpers to libfs.c and switching ncpfs to them.  IMO that's
better than copying the damn thing and other network filesystems might have
the same needs eventually...
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Kernel bug: Bad page state: related to generic symlink code and mmap

2005-08-19 Thread Al Viro
On Fri, Aug 19, 2005 at 05:53:32PM +0100, Al Viro wrote:
 I'm taking NFS helpers to libfs.c and switching ncpfs to them.  IMO that's
 better than copying the damn thing and other network filesystems might have
 the same needs eventually...

[something like this - completely untested]

* stray_page_get_link(inode, filler) - returns ERR_PTR(error) or pointer
to symlink body.  Said symlink body sits in a page at offset equal to
offsetof(page, struct stray_page_link).  filler() is expected to put it
at such offset. Page is cached.

* stray_page_put_link() - -put_link() suitable for links obtained from
stray_page_get_link().  Unlike the usual pagecache-based variants, this
sucker does _not_ rely on page staying cached.

* nfs and ncpfs switched to the helpers above.

Signed-off-by: Al Viro [EMAIL PROTECTED]

diff -urN RC14-rc6-git10-base/fs/libfs.c current/fs/libfs.c
--- RC13-rc6-git10-base/fs/libfs.c  2005-08-10 10:37:52.0 -0400
+++ current/fs/libfs.c  2005-08-19 13:29:39.0 -0400
@@ -7,6 +7,7 @@
 #include linux/pagemap.h
 #include linux/mount.h
 #include linux/vfs.h
+#include linux/namei.h
 #include asm/uaccess.h
 
 int simple_getattr(struct vfsmount *mnt, struct dentry *dentry,
@@ -616,6 +617,42 @@
return ret;
 }
 
+char *stray_page_get_link(struct inode *inode, int (*filler)(struct inode *, 
struct page *))
+{
+   struct stray_page_symlink *p;
+   struct page *page;
+
+   page = read_cache_page(inode-i_data, 0, (filler_t *)filler, inode);
+   if (IS_ERR(page))
+   goto read_failed;
+   if (!PageUptodate(page))
+   goto getlink_read_error;
+   p = kmap(page);
+   p-page = page;
+   return p-body;
+
+getlink_read_error:
+   page_cache_release(page);
+   page = ERR_PTR(-EIO);
+read_failed:
+   return (char *)page;/* error */
+}
+
+void stray_page_put_link(struct dentry *dentry, struct nameidata *nd)
+{
+   char *s = nd_get_link(nd);
+   if (!IS_ERR(s)) {
+   struct stray_page_symlink *p;
+   struct page *page;
+
+   p = container_of(s, struct stray_page_symlink, body[0]);
+   page = p-page;
+
+   kunmap(page);
+   page_cache_release(page);
+   }
+}
+
 EXPORT_SYMBOL(dcache_dir_close);
 EXPORT_SYMBOL(dcache_dir_lseek);
 EXPORT_SYMBOL(dcache_dir_open);
@@ -648,3 +685,5 @@
 EXPORT_SYMBOL_GPL(simple_attr_close);
 EXPORT_SYMBOL_GPL(simple_attr_read);
 EXPORT_SYMBOL_GPL(simple_attr_write);
+EXPORT_SYMBOL(stray_page_get_link);
+EXPORT_SYMBOL(stray_page_put_link);
diff -urN RC13-rc6-git10-base/fs/ncpfs/inode.c current/fs/ncpfs/inode.c
--- RC13-rc6-git10-base/fs/ncpfs/inode.c2005-06-17 15:48:29.0 
-0400
+++ current/fs/ncpfs/inode.c2005-08-19 13:12:01.0 -0400
@@ -104,7 +104,7 @@
 
 extern struct dentry_operations ncp_root_dentry_operations;
 #if defined(CONFIG_NCPFS_EXTRAS) || defined(CONFIG_NCPFS_NFS_NS)
-extern struct address_space_operations ncp_symlink_aops;
+extern int ncp_follow_link(struct dentry *dentry, struct nameidata *nd);
 extern int ncp_symlink(struct inode*, struct dentry*, const char*);
 #endif
 
@@ -233,8 +233,8 @@
 #if defined(CONFIG_NCPFS_EXTRAS) || defined(CONFIG_NCPFS_NFS_NS)
 static struct inode_operations ncp_symlink_inode_operations = {
.readlink   = generic_readlink,
-   .follow_link= page_follow_link_light,
-   .put_link   = page_put_link,
+   .follow_link= ncp_follow_link,
+   .put_link   = stray_page_put_link,
.setattr= ncp_notify_change,
 };
 #endif
@@ -272,7 +272,6 @@
 #if defined(CONFIG_NCPFS_EXTRAS) || defined(CONFIG_NCPFS_NFS_NS)
} else if (S_ISLNK(inode-i_mode)) {
inode-i_op = ncp_symlink_inode_operations;
-   inode-i_data.a_ops = ncp_symlink_aops;
 #endif
} else {
make_bad_inode(inode);
diff -urN RC13-rc6-git10-base/fs/ncpfs/symlink.c current/fs/ncpfs/symlink.c
--- RC13-rc6-git10-base/fs/ncpfs/symlink.c  2005-06-17 15:48:29.0 
-0400
+++ current/fs/ncpfs/symlink.c  2005-08-19 13:26:26.0 -0400
@@ -30,6 +30,7 @@
 #include linux/time.h
 #include linux/mm.h
 #include linux/stat.h
+#include linux/namei.h
 #include ncplib_kernel.h
 
 
@@ -41,9 +42,8 @@
 
 /* - read a symbolic link -- */
 
-static int ncp_symlink_readpage(struct file *file, struct page *page)
+static int symlink_filler(struct inode *inode, struct page *page)
 {
-   struct inode *inode = page-mapping-host;
int error, length, len;
char *link, *rawlink;
char *buf = kmap(page);
@@ -77,6 +77,7 @@
}
 
len = NCP_MAX_SYMLINK_SIZE;
+   buf += offsetof(struct stray_page_symlink, body);
error = ncp_vol2io(NCP_SERVER(inode), buf, len, link, length, 0);
kfree(rawlink);
if (error)
@@ -96,13 +97,12 @@
return error

Re: Kernel bug: Bad page state: related to generic symlink code and mmap

2005-08-19 Thread Al Viro
On Fri, Aug 19, 2005 at 07:00:38PM +0100, Christoph Hellwig wrote:
 On Fri, Aug 19, 2005 at 07:02:18PM +0100, Al Viro wrote:
  On Fri, Aug 19, 2005 at 05:53:32PM +0100, Al Viro wrote:
   I'm taking NFS helpers to libfs.c and switching ncpfs to them.  IMO that's
   better than copying the damn thing and other network filesystems might 
   have
   the same needs eventually...
  
  [something like this - completely untested]
  
  * stray_page_get_link(inode, filler) - returns ERR_PTR(error) or pointer
  to symlink body.  Said symlink body sits in a page at offset equal to
  offsetof(page, struct stray_page_link).  filler() is expected to put it
  at such offset. Page is cached.
  
  * stray_page_put_link() - -put_link() suitable for links obtained from
  stray_page_get_link().  Unlike the usual pagecache-based variants, this
  sucker does _not_ rely on page staying cached.
  
  * nfs and ncpfs switched to the helpers above.
 
 Can you add some kerneldoc comments to describe them?  Especially as
 the name is not very descriptive.

Hey, if anybody has suggestions on names - they are very welcome ;-)

FWIW, I'd rather take page_symlink(), page_symlink_inode_operations,
page_put_link(), page_follow_link_light(), page_readlink(), page_getlink(),
generic_readlink() and vfs_readlink() to the same place where these guys
would live.  They all belong together and none of them has any business
in fs/namei.c.  Options: fs/libfs.c or separate library since fs/libfs.c
is getting crowded.  Linus, do you have any objections to that or suggestions
on filename here?
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Kernel bug: Bad page state: related to generic symlink code and mmap

2005-08-19 Thread Al Viro
On Fri, Aug 19, 2005 at 03:04:52PM -0700, Linus Torvalds wrote:
 
 
 On Fri, 19 Aug 2005, Anton Altaparmakov wrote:
  
  Yes, sure.  I have applied your patch to our 2.6.11.4 tree (with the one 
  liner change I emailed you just now) and have kicked off a compile.
 
 Actually, hold on. The original patch had another problem: it returned an
 uninitialized page pointer when page_getlink() failed.
 
 This one should have that fixed, and has converted a few other 
 filesystems. Most of them trivially, but I took the opportunity to just 
 simplify NFS while I was at it, since it now has no reason to need to save 
 off the struct page * any more.
 
 It's still not tested, but at least I've looked at it a bit more ;)

That looks OK except for
* jffs2 is b0rken (see patch in another mail)
* afs, autofs4, befs, devfs, freevxfs, jffs2, jfs, ncpfs, procfs,
smbfs, sysvfs, ufs, xfs - prototype change for -follow_link()
* befs, smbfs, xfs - same for -put_link()
* ncpfs fix is actually missing here

Prototype changes are covered by patch below (incremental on top of your +
jffs2 fix upthread).  No ncpfs changes - these will go separately, assuming
you haven't done them yet; just a plain janitor stuff.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Kernel bug: Bad page state: related to generic symlink code and mmap

2005-08-19 Thread Al Viro
On Sat, Aug 20, 2005 at 12:15:42AM +0100, Al Viro wrote:
 
 That looks OK except for
   * jffs2 is b0rken (see patch in another mail)
   * afs, autofs4, befs, devfs, freevxfs, jffs2, jfs, ncpfs, procfs,
 smbfs, sysvfs, ufs, xfs - prototype change for -follow_link()
   * befs, smbfs, xfs - same for -put_link()
   * ncpfs fix is actually missing here
 
 Prototype changes are covered by patch below (incremental on top of your +
 jffs2 fix upthread).  No ncpfs changes - these will go separately, assuming
 you haven't done them yet; just a plain janitor stuff.


Gaack...  And here's the patch itself - sorry.

diff -urN RC13-rc6-git10-base/fs/afs/mntpt.c current/fs/afs/mntpt.c
--- RC13-rc6-git10-base/fs/afs/mntpt.c  2005-06-17 15:48:29.0 -0400
+++ current/fs/afs/mntpt.c  2005-08-19 19:02:48.0 -0400
@@ -30,7 +30,7 @@
   struct dentry *dentry,
   struct nameidata *nd);
 static int afs_mntpt_open(struct inode *inode, struct file *file);
-static int afs_mntpt_follow_link(struct dentry *dentry, struct nameidata *nd);
+static void *afs_mntpt_follow_link(struct dentry *dentry, struct nameidata 
*nd);
 
 struct file_operations afs_mntpt_file_operations = {
.open   = afs_mntpt_open,
@@ -233,7 +233,7 @@
 /*
  * follow a link from a mountpoint directory, thus causing it to be mounted
  */
-static int afs_mntpt_follow_link(struct dentry *dentry, struct nameidata *nd)
+static void *afs_mntpt_follow_link(struct dentry *dentry, struct nameidata *nd)
 {
struct vfsmount *newmnt;
struct dentry *old_dentry;
@@ -249,7 +249,7 @@
newmnt = afs_mntpt_do_automount(dentry);
if (IS_ERR(newmnt)) {
path_release(nd);
-   return PTR_ERR(newmnt);
+   return (void *)newmnt;
}
 
old_dentry = nd-dentry;
@@ -267,7 +267,7 @@
}
 
kleave( = %d, err);
-   return err;
+   return ERR_PTR(err);
 } /* end afs_mntpt_follow_link() */
 
 /*/
diff -urN RC13-rc6-git10-base/fs/autofs4/symlink.c current/fs/autofs4/symlink.c
--- RC13-rc6-git10-base/fs/autofs4/symlink.c2005-06-17 15:48:29.0 
-0400
+++ current/fs/autofs4/symlink.c2005-08-19 19:03:11.0 -0400
@@ -12,11 +12,11 @@
 
 #include autofs_i.h
 
-static int autofs4_follow_link(struct dentry *dentry, struct nameidata *nd)
+static void *autofs4_follow_link(struct dentry *dentry, struct nameidata *nd)
 {
struct autofs_info *ino = autofs4_dentry_ino(dentry);
nd_set_link(nd, (char *)ino-u.symlink);
-   return 0;
+   return NULL;
 }
 
 struct inode_operations autofs4_symlink_inode_operations = {
diff -urN RC13-rc6-git10-base/fs/befs/linuxvfs.c current/fs/befs/linuxvfs.c
--- RC13-rc6-git10-base/fs/befs/linuxvfs.c  2005-06-17 15:48:29.0 
-0400
+++ current/fs/befs/linuxvfs.c  2005-08-19 19:03:52.0 -0400
@@ -41,8 +41,8 @@
 static void befs_destroy_inode(struct inode *inode);
 static int befs_init_inodecache(void);
 static void befs_destroy_inodecache(void);
-static int befs_follow_link(struct dentry *, struct nameidata *);
-static void befs_put_link(struct dentry *, struct nameidata *);
+static void *befs_follow_link(struct dentry *, struct nameidata *);
+static void befs_put_link(struct dentry *, struct nameidata *, void *);
 static int befs_utf2nls(struct super_block *sb, const char *in, int in_len,
char **out, int *out_len);
 static int befs_nls2utf(struct super_block *sb, const char *in, int in_len,
@@ -487,10 +487,10 @@
}
 
nd_set_link(nd, link);
-   return 0;
+   return NULL;
 }
 
-static void befs_put_link(struct dentry *dentry, struct nameidata *nd)
+static void befs_put_link(struct dentry *dentry, struct nameidata *nd, void *p)
 {
befs_inode_info *befs_ino = BEFS_I(dentry-d_inode);
if (befs_ino-i_flags  BEFS_LONG_SYMLINK) {
diff -urN RC13-rc6-git10-base/fs/devfs/base.c current/fs/devfs/base.c
--- RC13-rc6-git10-base/fs/devfs/base.c 2005-06-17 15:48:29.0 -0400
+++ current/fs/devfs/base.c 2005-08-19 19:04:17.0 -0400
@@ -2491,11 +2491,11 @@
return 0;
 }  /*  End Function devfs_mknod  */
 
-static int devfs_follow_link(struct dentry *dentry, struct nameidata *nd)
+static void *devfs_follow_link(struct dentry *dentry, struct nameidata *nd)
 {
struct devfs_entry *p = get_devfs_entry_from_vfs_inode(dentry-d_inode);
nd_set_link(nd, p ? p-u.symlink.linkname : ERR_PTR(-ENODEV));
-   return 0;
+   return NULL;
 }  /*  End Function devfs_follow_link  */
 
 static struct inode_operations devfs_iops = {
diff -urN RC13-rc6-git10-base/fs/freevxfs/vxfs_immed.c 
current/fs/freevxfs/vxfs_immed.c
--- RC13-rc6-git10-base/fs/freevxfs/vxfs_immed.c2005-06-17 
15:48:29.0

Re: Kernel bug: Bad page state: related to generic symlink code and mmap

2005-08-19 Thread Al Viro
On Fri, Aug 19, 2005 at 06:08:12PM -0700, Linus Torvalds wrote:
 
 
 On Sat, 20 Aug 2005, Al Viro wrote:
  
  That looks OK except for
  * ncpfs fix is actually missing here
 
 Well, the thing is, with the change to page_follow_link() and 
 page_put_link(), ncpfs should now work fine - it doesn't need any fixing 
 any more.

Ah - right, it's using normal methods...
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][2.6 patch] Allow creation of new namespaces during mount system call

2005-04-20 Thread Al Viro
On Wed, Apr 20, 2005 at 10:45:58AM +0100, Jamie Lokier wrote:
 For FUSE, what's needed is that a user can mount something, and the
 mounted fs is visible only to that user, but it's visible to _all_ of
 the user's processes.

So get that namespace as soon as you log in.
 
 We think namespaces are a nice way to do that: making a user-owned
 filesystem only visible to a user.  But the mechanism of CLONE_NEWNS
 does not work, because it presumes namespace divisions are only
 propagated over parent-child divisions, like environment variables.
 
 What we really want is a mount point that propagates across all the
 processes owned by one user, but is not there for other users.

This is almost certainly bogus.  Same user can easily want several
different environments set on the same box.

- Have a namespace per user.  The user's namespace will be entered
  by the login program somehow.

Trivial right now - just have libpam do that.
 
- All logins to the same user acquire the same per-user namespace.
  This isn't possible at the moment; it would be a kernel extension
  + administrative change to login.

No.  Identical setup at login time - sure.  Enforced _single_ namespace
is just plain wrong.  Moreover, all user's processes is the wrong answer
to practically any question (well, aside of what processes do you kill
when you get rid of luser's account).
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][2.6 patch] Allow creation of new namespaces during mount system call

2005-04-20 Thread Al Viro
On Wed, Apr 20, 2005 at 01:03:40PM +0100, Jamie Lokier wrote:
 It shouldn't be literally per-user - it should be possible for a user
 to have several environment _when_ they want that.  chroot-jail style
 virtual server environments require that too.
 
 But that shouldn't be the only option - because it would be horrible
 to use.  If I login on multiple terminals, I normally want to mount
 filesystems in /home/jamie/mnt on one terminal, and use them on another.

And when you log in on several terminals you usually want same $PATH.
You don't do that by sharing VM between shell processes, do you?  Sure,
that would work with sufficient kernel-side hacks for joining thread
group and making e.g. bash multithreaded.  Nobody does it though - it
doesn't buy you anything really useful.
 
 How can libpam join the user's existing namespace?
 
 Having a separate usermount-namespace for each login of the same user
 would not be nice to use.

I don't see why.  _IF_ you can change the set of mounts after you log in,
there's no more need to do any kernel tricks for that stuff than you would
need for environment, etc.  If you can't - well, the last point where you
can get something set up is login with no changes afterwards.  In that case
everything is just as trivial...
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][2.6 patch] Allow creation of new namespaces during mount system call

2005-04-20 Thread Al Viro
On Wed, Apr 20, 2005 at 09:51:26AM -0700, Ram wrote:
 Reading through the thread I assume the requirement is:
 
 1) A User being able to create his own VFS-mount environment 
 2) being able to use the same VFS-mount environment from 
 multiple login sessions.
 3) Being able to switch some processes to some other
   VFS-mount environment.

Excuse me, but could somebody give coherent rationale for such requirements?
_Especially_ for joining existing group by completely unrelated process -
something we don't do for any other component of process.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][2.6 patch] Allow creation of new namespaces during mount system call

2005-04-20 Thread Al Viro
On Wed, Apr 20, 2005 at 09:43:47PM +0100, Jamie Lokier wrote:
 Al Viro is right to point out that environment variables are not
 shared.  But he forgets that _files_ are shared.

So they are.  ~/.profile, for instance ;-)
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][2.6 patch] Allow creation of new namespaces during mount system call

2005-04-20 Thread Al Viro
On Wed, Apr 20, 2005 at 07:53:04PM +0200, Miklos Szeredi wrote:
 The user expects to have the see the same files in all sessions,
 whether those be local logins, remote logins, ftp/scp/etc sessions.

Umm...  You know, I would be *very* unhappy if I found that some process
on parcelfarce was able to see the contents of ~/.ssh/* from the laptop
I'm using right now.  Even more so if that would apply to random ftp
sewer I happen to use at the moment...
 
 If I'm remotely logged into server X from Y, and want to use scp to
 copy some file from X to Y or vica versa, I will want my private
 mounts to be visible from the scp.

Do you?  Really?  OK, so I've got ~/bin/ and ~/bin/arch/ in my path on
my boxen.  The latter has ~/bin/{i386,alpha,sparc,amd64,hppa,ppc} bound
on it - depending on the host I'm using.  Tell me, why would I want that
private mount to be visible when I log in from one box to another?  To
make sure that wrong binaries would be picked?
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][2.6 patch] Allow creation of new namespaces during mount system call

2005-04-19 Thread Al Viro
On Tue, Apr 19, 2005 at 05:13:32PM -0500, Eric Van Hensbergen wrote:
 The motivation behind this patch is to make private namespaces more
 accessible by allowing their creation at mount/bind time.
 
 Based on some of the FUSE permissions discussions, I wanted to check
 into modifying the mount system calls -- adding a flag which created a
 new namespace for the resulting mount.  I quickly discovered that what
 I typically wanted (for the case of running a mount command) was to
 actually create a new namespace for the parent thread (typically the
 shell), inherit that namespace, and then perform the mount.
 
 Its not clear to me that both options are needed, cloning the parent's
 namespace seems to be what you want most of the time.
 
 In order to minimize code impact I split the copy_namespace function,
 perhaps the right long term solution is to change it's interface to
 accommodate the changes.  Things look a bit more invasive as I moved
 the copy_namespace function above do_mount.  The patch follows:

*UGH*

So what happens to those who happen to share task-fs with the parent?
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][2.6 patch] Allow creation of new namespaces during mount system call

2005-04-19 Thread Al Viro
On Tue, Apr 19, 2005 at 06:53:29PM -0500, Eric Van Hensbergen wrote:
 On 4/19/05, Al Viro [EMAIL PROTECTED] wrote:
  On Tue, Apr 19, 2005 at 05:13:32PM -0500, Eric Van Hensbergen wrote:
   The motivation behind this patch is to make private namespaces more
   accessible by allowing their creation at mount/bind time.
  
  
  *UGH*
  
  So what happens to those who happen to share task-fs with the parent?
  
 
 Okay, I'll admit to being a bit too hasty with pushing out that patch
 - I was being particularly myopic looking for a solution only for a
 command-line mount.  Are you generally opposed to new namespace
 creation at mount time or just my slimy hack?  A shared task-fs seems
 like something which could be easily checked against and disallowed.

a) ability to create a private namespace without forking anything - sure,
that would be useful.  However, that's not something I would push into
mount(2) (already overloaded to hell and back).

There used to be a kinda-sorta agreement on a new syscall:
unshare(bitmap)
with arguments like those of clone(2).  That's not just for namespaces -
e.g. you might legitimately want to unshare VM in a thread and leave the
rest alone.  Or unshare -fs (i.e. uncouple cwd from the rest of group).

Most of the code is already there - do_fork() has to do such stuff anyway.
So how about adding sys_unshare(flags) that would do that job?  Flags would
correspond to those of clone(2), except that all these guys would be
what do we unshare instead of what do we leave shared.


b) I _really_ don't like the idea of messing with the parent.  Make it
a shell builtin if you want to affect shell behaviour; the same reason
why cd is a builtin and not an external command.


c) I would be really, really careful with implications of let user
do whatever he wants - that certainly should include bindings and
that can create heaps of fun for suid stuff.  More comments when
I get around to digging through FUSE thread...
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: NFS4 mount problem

2005-04-18 Thread Al Viro
On Mon, Apr 18, 2005 at 10:07:14AM -0700, Bryan Henderson wrote:
 On Fri, Apr 15, 2005 at 01:22:59PM -0700, David S. Miller wrote:
  
  Make a -compat_read_super() just like we have a -compat_ioctl()
  method for files, if you want to suggest a solution like what
  you describe.
 
 I don't think we should encourage filesystem writers to do such stupid
 things as ncfps/smbfs do.  In fact I'm totally unhappy thay nfs4 went
 down that road.
 
 Which road is that?

Architecture-dependent blob passed to mount(2) (aka nfs4_mount_data).
If you want it to be a blob, at least have a decency to use encoding
that would not depend on alignment rules and word size.  Hell, you
could use XDR - it's not that nfs would need something new to handle
it.  Or, better yet, use a normal string.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: NFS4 mount problem

2005-04-18 Thread Al Viro
On Mon, Apr 18, 2005 at 06:33:09PM +0100, David Howells wrote:
 Al Viro [EMAIL PROTECTED] wrote:
 
  
  Architecture-dependent blob passed to mount(2) (aka nfs4_mount_data).
  If you want it to be a blob, at least have a decency to use encoding
  that would not depend on alignment rules and word size.  Hell, you
  could use XDR - it's not that nfs would need something new to handle
  it.  Or, better yet, use a normal string.
 
 Mount doesn't appear to permit a big enough blob though. It has a hard limit
 of PAGE_SIZE.

Excuse me?  Would the use of fixed offsets, field sizes and endianness
make the blob bigger?  And as for the length of string representation
going past 4Kb...  that could be easily dealt with in sys_mount() if it
really becomes a problem.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] shared subtrees

2005-01-16 Thread Al Viro
On Sun, Jan 16, 2005 at 11:02:13AM -0500, J. Bruce Fields wrote:
 On Thu, Jan 13, 2005 at 10:18:51PM +, Al Viro wrote:
  6. mount --move
  prohibited if what we are moving is in some p-node, otherwise we move
  as usual to intended mountpoint and create copies for everything that
  gets propagation from there (as we would do for rbind).
 
 Why this prohibition?

How do you propagate that?  We can weaken that to in a p-node that
owns something or contains more than one vfsmount, but it's not
worth the trouble, AFAICS.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] shared subtrees

2005-01-16 Thread Al Viro
On Sun, Jan 16, 2005 at 01:42:09PM -0500, J. Bruce Fields wrote:
 On Sun, Jan 16, 2005 at 06:06:56PM +, Al Viro wrote:
  On Sun, Jan 16, 2005 at 11:02:13AM -0500, J. Bruce Fields wrote:
   On Thu, Jan 13, 2005 at 10:18:51PM +, Al Viro wrote:
6. mount --move
prohibited if what we are moving is in some p-node, otherwise we move
as usual to intended mountpoint and create copies for everything that
gets propagation from there (as we would do for rbind).
   
   Why this prohibition?
  
  How do you propagate that?  We can weaken that to in a p-node that
  owns something or contains more than one vfsmount, but it's not
  worth the trouble, AFAICS.
 
 I guess I'm not seeing what there is to propagate.  If the vfsmount we
 are moving is mounted under a vfsmount that's in a p-node, then there'd
 be something to propagate, but since the --move doesn't change the
 structure of mounts underneath the moved mountpoint, I wouldn't expect
 any changes to be propagated from it to other mountpoints.
 
 I must be missing something fundamental

No - I have been missing a typo.  Make that if mountpoint of what we
are moving
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] shared subtrees

2005-01-15 Thread Al Viro
On Sat, Jan 15, 2005 at 07:46:59PM -0500, J. Bruce Fields wrote:
 On Thu, Jan 13, 2005 at 10:18:51PM +, Al Viro wrote:
  2. mount
  
  We have a new vfsmount A and want to attach it to mountpoint somewhere in
  vfsmount B.  If B does not belong to any p-node, everything is as usual; A
  doesn't become a member or slave of any p-node and is simply attached to B.
  
  If B belongs to a p-node p, consider all vfsmounts B1,...,Bn that get events
  propagated from B and all p-nodes p1,...,pk that contain them.
  * A gets cloned into n copies and these copies (A1,...,An) are attached
  to corresponding points in B1,...,Bn.
  * k new p-nodes (q1,...,qk) are created
  * Ai is contained in qj = Bi is contained in qj
 
 Minor typo: looks like that second qj should be pj.

ACK
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html