Re: [RFC 01/11] add generic versions of debugfs file operations
On Tue, Feb 19, 2008 at 05:04:36AM +0100, Arnd Bergmann wrote: + char buf[3]; + u32 *val = file-private_data; + + if (*val) + buf[0] = 'Y'; + else + buf[0] = 'N'; + buf[1] = '\n'; + buf[2] = 0x00; + return simple_read_from_buffer(user_buf, count, ppos, buf, 2); Ewww - caps, \n... BTW, \0 is pointless here - simple_read_from_buffer() will not access it with these arguments)... +{ + char buf[32]; + int buf_size; + u32 *val = file-private_data; + + buf_size = min(count, (sizeof(buf)-1)); + if (copy_from_user(buf, user_buf, buf_size)) + return -EFAULT; + + switch (buf[0]) { + case 'y': + case 'Y': + case '1': + *val = 1; + break; + case 'n': + case 'N': + case '0': + *val = 0; + break; Please, check the length; sloppy input grammar is a bad idea. Hell, at the very least you want -EINVAL if input is not recognized... - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 02/11] introduce simple_fs_type
On Tue, Feb 19, 2008 at 05:04:37AM +0100, Arnd Bergmann wrote: There is a number of pseudo file systems in the kernel that are basically copies of debugfs, all implementing the same boilerplate code, just with different bugs. This adds yet another copy to the kernel in the libfs directory, with generalized helpers that can be used by any of them. The most interesting function here is the new struct dentry * simple_register_filesystem(struct simple_fs_type *type), which returns the root directory of a new file system that can then be passed to simple_create_file() and similar functions as a parent. Don't mix split the file with add new stuff, please. And frankly, I'm not convinced that embedding file_system_type into another struct is a good idea. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 01/11] add generic versions of debugfs file operations
On Tue, Feb 19, 2008 at 05:04:36AM +0100, Arnd Bergmann wrote: The file operations in debugfs are rather generic and can be used by other file systems, so it can be interesting to include them in libfs, with more generic names, and exported to modules. This patch adds a new copy of these operations to libfs, so that the debugfs version can later be cut down. BTW, where does CONFIG_LIBFS come from? - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 03/11] slim down debugfs
On Tue, Feb 19, 2008 at 05:04:38AM +0100, Arnd Bergmann wrote: With most of debugfs now copied to generic code in libfs, we can remove the original copy and replace it with thin wrappers around libfs. Signed-off-by: Arnd Bergmann [EMAIL PROTECTED] Index: linux-2.6/fs/Kconfig === --- linux-2.6.orig/fs/Kconfig +++ linux-2.6/fs/Kconfig @@ -1001,6 +1001,14 @@ config CONFIGFS_FS Both sysfs and configfs can and should exist together on the same system. One is not a replacement for the other. +config LIBFS + tristate + default m + help + libfs is a helper library used by many of the simpler file + systems. Parts of libfs can be modular when all of its users + are modules as well, and the users should select this symbol. NAK. For one thing, you need dependencies or selects and none of the patches later in the series introduces them. For another, neither dependencies nor selects work well if you have non-modular users of that sucker. Seriously, try to add those and do allmodconfig. Then look what a mess it produces. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 00/10] mount ownership and unprivileged mount syscall (v8)
On Mon, Feb 18, 2008 at 12:47:59PM +0100, Miklos Szeredi wrote: So what should I do? Would Al be wanting to merge this into his VFS tree? (Can't find it on git.kernel.org yet, BTW.) FWIW, it's on hera right now, should propagate to git.kernel.org in a few. Branches I'd pushed there: vfs-fixes.b0 and ro-bind.b0. The latter is on top of the former. There will be more, but that at least takes care of the most urgent stuff. Again, apologies for things being too damn slow ;-/ As for the unprivileged mounts... a) why do we lose them on clone() in new namespace? Bloody inconvenient, to put it mildly. b) why do we prohibit all kinds of remount? c) just what is limited by that sysctl? AFAICS, rbind is allowed if mountpoint is on user vfsmount and it seems to create vfsmounts without eating into that limit just fine... What's the point of limiting the amount of vfsmounts marked user when you do not limit the number of vfsmount one can allocate? - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 00/10] mount ownership and unprivileged mount syscall (v8)
On Sat, Feb 23, 2008 at 06:33:13PM +0100, Miklos Szeredi wrote: c) just what is limited by that sysctl? AFAICS, rbind is allowed if mountpoint is on user vfsmount and it seems to create vfsmounts without eating into that limit just fine... What's the point of limiting the amount of vfsmounts marked user when you do not limit the number of vfsmount one can allocate? The limit is there, so that unprivileged users cannot create insane number of mounts. It's just a safety thing, analogous to /proc/sys/fs/file-max. Can't they? Looks like one can create any number of vfsmounts without getting more than one marked MNT_USER... What are you trying to limit - vfsmounts or superblocks? The former is not limited in your patchset at all, AFAICS - you can do while true; do mount --bind /bin ~/my_directory; mount --bind /sbin ~/my_directory; done indefinitely and all the bleeding stack of vfsmounts will be !MNT_USER. Or any number of similar schemes, really, without overmounting if you wish to avoid it. If you are trying to limit the number of superblocks (i.e. active instances of filesystems), then I'd say that vfsmounts make piss-poor proxies for those and it would be better to count the objects you really want to count... - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: git tree with VFS stuff
On Thu, Feb 21, 2008 at 01:13:48AM +1100, Stephen Rothwell wrote: Hi Miklos, On Tue, 19 Feb 2008 14:32:28 +0100 Miklos Szeredi [EMAIL PROTECTED] wrote: I've created a git tree with the following mounts related stuff: - read-only bind mounts - /proc/pid/mountinfo - unprivileged mounts git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfsstuff.git master I guess, giving these a spin in linux-next wouldn't hurt? I don't think this is what we want to use linux-next for. Linux-next is really a place for stuff that will pretty clearly go into the next kernel release i.e. 2.6.26 right now. If you want to experiment on things for beyond that timeframe, then a snapshot of linux-next may be a good base. I will take them when they reach the appropriate subsystem tree and are ready for integration. FWIW, I must apologize for delay with getting the damn tree open on kernel.org ;-/ The last couple of weeks had been Not Fun(tm) in a lot of respects. Hopefully I'll finish putting the damn thing into publishable shape by tomorrow. As for the stuff mentioned above... ro-bind series - definitely yes, mountinfo - IMO needs a sane discussion of what and how should be shown wrt propagation state, unprivileged mounts - in the need to finish reviewing pile. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: how to show propagation state for mounts
On Wed, Feb 20, 2008 at 04:39:15PM +0100, Miklos Szeredi wrote: mountinfo - IMO needs a sane discussion of what and how should be shown wrt propagation state Here's my take on the matter. The propagation tree can be either be represented 1) from root to leaf listing members of peer groups and their slaves explicitly, 2) or from leaf to root by identifying each peer group and then for each mount showing the id of its own group and the id of the group's master. 2) can have two variants: 2a) id of peer group is constant in time 2b) id of peer group may change The current patch does 2b). Having a fixed id for each peer group would mean introducing a new object to anchor the peer group into, which would add complexity to the whole thing. All of these are implementable, just need to decide which one we want. Eh... Much more interesting question: since the propagation tree spans multiple namespaces in a lot of normal uses, how do we deal with reconstructing propagation through the parts that are not present in our namespace? Moreover, what should and what should not be kept private to namespace? Full exposure of mount trees is definitely over the top (it shows potentially sensitive information), so we probably want less than that. FWIW, my gut feeling is that for each peer group that intersects with our namespace we ought to expose in some form * all vfsmounts belonging to that intesection * the nearest dominating peer group (== master (of master ...) of) that also has a non-empty intersection with our namespace It's less about the form of representation (after all, we generate poll events when contents of that sucker changes, so one *can* get a consistent snapshot of the entire thing) and more about having it self-contained when we have namespaces in the play. IOW, the data in there should give answers to questions that make sense. Do events get propagated from this vfsmount I have to that vfsmount I have? is a meaningful one; ditto for are events here propagated to somewhere I don't see? or are events getting propagated here from somewhere I don't see?. Dumping pieces of raw graph, with IDs of nodes we can't see and without any way to connect those pieces, OTOH, doesn't make much sense. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: how to show propagation state for mounts
On Wed, Feb 20, 2008 at 11:29:13AM -0800, Ram Pai wrote: I wonder, what is wrong in reporting mounts in other namespaces that either receive and send propagation to mounts in our namespace? A plenty. E.g. if foo trusts control over /var/blah to bar, it's not obvious that foo has any business knowing if bar gets it from somebody else in turn. And I'm not sure that bar has any business knowing that foo has the damn thing attached in five places instead of just one, let alone _where_ it has been attached. If you get down to it, the thing is about delegating control over part of namespace to somebody, without letting them control, see, etc. the rest of it. So I'd rather be very conservative about extra information we allow to piggyback on that. I don't know... perhaps with stable peer group IDs it would be OK to show peer group ID by (our) vfsmount + peer group ID of master + peer group ID of nearest dominating group that has intersection with our namespace. Then we don't leak information (AFAICS), get full propagation information between our vfsmounts and cooperating tasks in different namespaces can figure the things out as much as possible without leaking 3rd-party information to either. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Add vfsmount to vfs helper functions.
On Sun, Feb 17, 2008 at 06:00:30PM +0900, Tetsuo Handa wrote: Hello. This is (c) Add new hooks. approach I proposed at http://www.mail-archive.com/linux-fsdevel@vger.kernel.org/msg11536.html . Although this is an incomplete patch, I want to know whether you can tolerate this approach or not. If you cannot tolerate this approach, may be we need to consider implementing (d) Calculate pathname while doing name resolution approach. No printable comments, except for that: (e) why don't you guys move the Linus' Serious Mistake to _callers_ of vfs_mknod() and its ilk? Which obviously solves all problems with having vfsmount. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [UNIONFS] 00/29 Unionfs and related patches pre-merge review (v2)
On Sat, Feb 02, 2008 at 01:45:15PM -0500, Erez Zadok wrote: You are thinking about non-interesting case. _Files_ are not much of a problem. Directory tree is. The real problems with all unionfs and stacking implementations I've seen so far, all way back to Heidemann et.al. start when topology of the underlying layer changes. OK, so if I understand you, your concerns center around the fact that lower directories can be moved around (i.e., topology changes), what happens then for operations that go through the stackable f/s, and what should users expect to see. Correct. If you have clear semantics for unionfs behaviour in presence of such things, by all means, publish it - as far as I know *nobody* had done that; not even on the what should we see when... level, nevermind the implementation. Since stacking and NFS have some similarities, I first checked w/ the NFS people to see what are their semantics in a similar scenario: an NFS client could be validating a directory, then issue, say, a -create; but in between, the server could have moved the directory that was validated. In NFS, the -create operation succeeds, and creates the file in the new location of the directory which was validated. Unionfs's behavior is similar: the newly created file will be successfully created in the moved directory. The only exception is that if a lower branch is marked readonly by unionfs, a copyup will take place. Er... For unionfs it's a much more monumental problem. Look - you have a mapping between the directory trees of layers; everything else is defined in terms of that mapping (this files shadows that file, etc.). If you allow a mix of old and new mappings, you can easily run into the situations when at some moment X1 covers Y1, X2 covers Y2, X2 is a descendent of X1 and Y1 is a descendent of Y2. You *really* don't want to go there - if nothing else, defining behaviour of copyup in face of that insanity will be very painful. What's more, and what makes NFS a bad model, NFS client doesn't lock directories of NFS server. unionfs *does* locking of directories in underlying layers, which means that you have an entirely new set of constraints - you can deadlock easily. As the matter of fact, your current code suffers from that problem - it violates the locking rules in operations on covered layers. This had not been a problem for unionfs users to date. The main reason is that when unionfs users modify lower files, they often do so while there's little to no activity going through the union itself. And while it doesn't prevent directories from being moved around, this common usage mode does reduce the frequency in which topology changes can be an issue for unionfs users. Ugh... Currently unionfs users tend to use it ways that do not trigger fs corruption is nice, but doesn't really address the current unionfs implementation contains races leadint to fs corruption... around, changing parents, etc. Cross-directory rename() certainly rates very high on the list of WTF had they been smoking in UCB? misfeatures, but it's there and it has to be dealt with. Well, it was UCB, UCLA, and Sun. I don't think back in the early 90s they were too concerned about topology changes; f/s stacking was a new idea and they wanted to explore what can be done with it conceptually, not produce commercial-grade software (still, remind me to tell you the first-hand story I've learned about of how full-blown stacking _almost_ made it into Solaris 2.0 :-) The only known reference to try and address this coherency problem was Heidemann's SOSP'95 paper titled Performance of cache coherence in stackable filing. The paper advocated changing the whole VFS and the caches (page cache + dnlc) to create a unified cache manager that was aware of complex layering topologies (including fan-out and fan-in). It was even able to handle compression layers, where file data offsets change b/t the layers (a nasty problem). Code for this unified cache manager was never released AFAIK. I think Heidemann's approach was elegant, but I don't think it was practical as it required radical VFS/VM surgery. Ironically, MS Windows has a single I/O cache manager that all storage and filesystem modules talk to directly (they're not allowed to pass IRPs directly b/t layers): so Windows can handle such coherency better than most Unix systems can today. Different problem. IIRC, that paper implicitly assumed that mapping between vnodes in different layers would _not_ be subject to massive surgeries from operations involved. However, for this view idea to work, I'll need a way to lock-out or hide the lower directories from the namespace, so no one can access them in r-w mode any longer (if at all). Moreover, even if such a method was available, one would have to decide what to do with any open files on the lower branch (killing such processes or closing the descriptors seems
Re: [UNIONFS] 00/29 Unionfs and related patches pre-merge review (v2)
On Sat, Jan 26, 2008 at 12:08:30AM -0500, Erez Zadok wrote: * lock_parent(): who said that you won't get dentry moved before managing to grab i_mutex on parent? While we are at it, who said that you won't get dentry moved between fetching d_parent and doing dget()? In that case parent could've been _freed_ before you get to dget(). OK, so looks like I should use dget_parent() in my lock_parent(), as I've done elsewhere. I'll also take a look at all instances in which I get dentry-d_parent and see if a d_lock is needed there. dget_parent() doesn't deal with the problem of rename() done directly in that layer while you'd been waiting for i_mutex. + lock_rename(lower_old_dir_dentry, lower_new_dir_dentry); + err = vfs_rename(lower_old_dir_dentry-d_inode, lower_old_dentry, +lower_new_dir_dentry-d_inode, lower_new_dentry); + unlock_rename(lower_old_dir_dentry, lower_new_dir_dentry); Uh-huh... To start with, what guarantees that your lower_old_dentry is still a child of your lower_old_dir_dentry? We dget/dget_parent the old/new dentry and parents a few lines above (actually, it looked like I forgot to dget(lower_new_dentry) -- fixed). And? Having a reference to dentry does not prevent it being moved elsewhere by direct rename(2) in that layer. It will exist, that much is guaranteed by grabbing a reference. However, there is no warranties whatsoever that by the time you get i_mutex on what had once been its parent, it will still remain the parent of our dentry. BTW, my sense of the relationship b/t upper and lower objects and their validity in a stackable f/s, is that it's similar to the relationship b/t the NFS client and server -- the client can't be sure that a file on the server doesn't change b/t -revalidate and -op (hence nfs's reliance on dir mtime checks). You are thinking about non-interesting case. _Files_ are not much of a problem. Directory tree is. The real problems with all unionfs and stacking implementations I've seen so far, all way back to Heidemann et.al. start when topology of the underlying layer changes. If you have clear semantics for unionfs behaviour in presence of such things, by all means, publish it - as far as I know *nobody* had done that; not even on the what should we see when... level, nevermind the implementation. Perhaps this general topic is a good one to discuss at more length at LSF? Suggestions are welcome. It would; I honestly do not know if the problem is solvable with the (lack of) constraints you apparently want. Again, the real PITA begins when you start dealing with pieces of underlying trees getting moved around, changing parents, etc. Cross-directory rename() certainly rates very high on the list of WTF had they been smoking in UCB? misfeatures, but it's there and it has to be dealt with. BTW, and that's a completely unrelated story, I'd rather see whiteouts done directly by filesystems involved - it would simplify the life big way. How about adding a dir-i_op-whiteout(dir, dentry) and seeing if your variant could be turned into such a method to be used by really piss-poor filesystems? All UFS-related ones (including ext*) can trivially support whiteouts without any PITA; adding them to tmpfs is also not a big deal and anything that caches inode type in directory entries should be easy to extend in the same way as ext*/ufs... - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] VFS: extend /proc/mounts
On Wed, Jan 16, 2008 at 11:12:31PM +0100, Miklos Szeredi wrote: The alternative (and completely safe) solution is to add another file to proc. Me no likey. Since we need saner layout, I would strongly suggest exactly that. major:minor -- is the major minor number of the device hosting the filesystem Bad description. Value of st_dev for files on that filesystem, please - there might be no such thing as the device hosting the filesystem _and_ the value here may bloody well be unrelated to device actually holding all data (for things like ext2meta, etc.). 1) The mount is a shared mount. 2) Its peer mount of mount with id 20 3) It is also a slave mount of the master-mount with the id 19 4) The filesystem on device with major/minor number 98:0 and subdirectory mnt/1/abc makes the root directory of this mount. 5) And finally the mount with id 16 is its parent. I'd suggest doing a new file that would *not* try to imitate /etc/mtab. Another thing is, how much of propagation information do we want to be exposed and what do we intend to do with it? Note that entire propagation tree is out of question - it spans many namespaces and contains potentially sensitive information. So we won't see all nodes. What do we want to *do* with the information about propagation? - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] VFS: extend /proc/mounts
On Wed, Jan 16, 2008 at 04:09:30PM -0800, Andrew Morton wrote: On Thu, 17 Jan 2008 00:58:06 +0100 (CET) Jan Engelhardt [EMAIL PROTECTED] wrote: On Jan 17 2008 00:43, Karel Zak wrote: Seems like a plain bad idea to me. There will be any number of home-made /proc/mounts parsers and we don't know what they do. So, let's use /proc/mounts_v2 ;-) Was not it like don't use /proc for new things? Well yeah. If we're going to do a brand new mechanism to expose per-mount data then we should hunker down and get it right. Which automatically means no sysfs. We are NOT converting vfsmounts to kobject-based lifetime rules. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [UNIONFS] 00/29 Unionfs and related patches pre-merge review (v2)
After grep for locking-related things: * lock_parent(): who said that you won't get dentry moved before managing to grab i_mutex on parent? While we are at it, who said that you won't get dentry moved between fetching d_parent and doing dget()? In that case parent could've been _freed_ before you get to dget(). * in create_parents(): + struct inode *inode = lower_dentry-d_inode; + /* +* If we get here, it means that we created a new +* dentry+inode, but copying permissions failed. +* Therefore, we should delete this inode and dput +* the dentry so as not to leave cruft behind. +*/ + if (lower_dentry-d_op lower_dentry-d_op-d_iput) + lower_dentry-d_op-d_iput(lower_dentry, + inode); + else + iput(inode); + lower_dentry-d_inode = NULL; + dput(lower_dentry); + lower_dentry = ERR_PTR(err); + goto out; Really? So what happens if it had become positive after your test and somebody had looked it up in lower layer and just now happens to be in the middle of operations on it? Will be thucking frilled by that... * __unionfs_rename(): + lock_rename(lower_old_dir_dentry, lower_new_dir_dentry); + err = vfs_rename(lower_old_dir_dentry-d_inode, lower_old_dentry, +lower_new_dir_dentry-d_inode, lower_new_dentry); + unlock_rename(lower_old_dir_dentry, lower_new_dir_dentry); Uh-huh... To start with, what guarantees that your lower_old_dentry is still a child of your lower_old_dir_dentry? What's more, you are not checking the result of lock_rename(), i.e. asking for serious trouble. * revalidation stuff: err... how the devil can it work for directories, when there's nothing to prevent changes in underlying layers between -d_revalidate() and operation itself? For the upper layer (unionfs itself) everything's more or less fine, but the rest of that... - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 1/2] [RFC] Simple tamper-proof device filesystem.
On Sun, Dec 16, 2007 at 05:52:08PM +0100, Indan Zupancic wrote: What prevents them from mounting tmpfs on top of /dev, bypassing your fs? Or binding /dev/null over nodes they want to get rid of... Also, if they have root there are plenty of ways to prevent an administrator from logging in, e.g. using iptables or changing the password. Indeed. BTW, tmpfs with root marked append-only and populated in normal ways on boot would get a comparable effect without spending so much efforts. Still won't really help if attacker gains root, but then neither will your variant. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] [2.6.24-rc3-mm1] loop cleanup in fs/namespace.c - repost
On Wed, Nov 21, 2007 at 03:24:33PM -0800, Andrew Morton wrote: -repeat: - if (atomic_dec_and_lock(mnt-mnt_count, vfsmount_lock)) { + while (atomic_dec_and_lock(mnt-mnt_count, vfsmount_lock)) { if (likely(!mnt-mnt_pinned)) { spin_unlock(vfsmount_lock); __mntput(mnt); - return; + break; } atomic_add(mnt-mnt_pinned + 1, mnt-mnt_count); mnt-mnt_pinned = 0; spin_unlock(vfsmount_lock); acct_auto_close_mnt(mnt); security_sb_umount_close(mnt); - goto repeat; } } This patch has no changelog which I can use. BTW, I have a problem with that one. The change is misleading; FWIW, I'm very tempted to turn that into tail recursion, with if (!atomic_dec_and_lock(mnt-mnt_count, vfsmount_lock)) return; in the very beginning (i.e. invert the test and pull the body to top level). - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/5] Shadow directories
On Fri, Oct 19, 2007 at 06:07:45AM +0930, David Newall wrote: considerations of this whole scheme. Linux, like most Unix systems, has never allowed hard links to directories for a number of reasons; The claim is wrong. UNIX systems have traditionally allowed the superuser to create hard links to directories. See link(2) for 2.10BSD http://www.freebsd.org/cgi/man.cgi?query=linksektion=2manpath=2.10+BSD. Having got that wrong throws doubt on the argument; perhaps a path can simultaneously be a file and a directory. Learn to read. Linux has never allowed that. Most of the Unix systems do not allow that. Original _did_ allow that, but at the cost of very easily triggered fs corruption (and it didn't have things like rename(2) - it _did_ have userland implementation, of course, in suid-root mv(1), but that sucker had been extremely racy and could be easily used to screw filesystem to hell and back; adding rename(2) to the set of primitives combined with multiple links to directories leads to very nasty issues on _any_ system). - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/5] Shadow directories
On Fri, Oct 19, 2007 at 12:27:16PM +0930, David Newall wrote: Learn to read. Linux has never allowed that. Most of the Unix systems do not allow that. I did read the claim and it is ambiguous, in that it can reasonably be read to mean that most UNIX systems never allowed such links, which is wrong. All UNIX systems allowed it until relatively recently. FVOrelatively recently exceeding a decade and half. In any case, it's _trivial_ to get fs corruption on any system with such links - play with rename() races a bit and you'll get it. And yes, it does include 4.4BSD and quite a chunk of even later history. Anyway, you are quite welcome to propose a sane locking scheme capable of dealing with that mess. As for the posted patch, AFAICS it's FUBAR in handling of .. in such directories. Moreover, how are you going to keep that shadow tree in sync with the main one if somebody starts doing renames in the latter? Or mount --move, or... - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC] block_device_operations prototype changes
It's time to sanitize prototypes of bdev -open(), -release() and -ioctl(). This stuff had sat in need to fix for a long time and there is a bunch of bugs hard to fix without dealing with it. 1) -open() gets inode * and file *. Almost all instances use only inode-i_bdev file-f_mode file-f_flags - O_ACCMODE|O_NDELAY|O_EXCL and nothing else. Most actually use only inode-i_bdev-bd_disk and those are easy; the trouble begins when we use f_mode and/or f_flags in non-trivial ways. One good example is cdrom.c - its behaviour depends on whether we have O_NDELAY (doesn't lock the door, among other things) and on whether we are opening for write. Unfortunately, it's not just something we care about at open() time - we need different cleanups at close(), and that doesn't work at all. At the very least, we need to know whether we want to unlock the door. Suppose somebody opens it with O_NDELAY (ioctl-only) when it's already opened for data (or mounted, for that matter). Now we close it; how do we know if we need to unlock? Even if we did get struct file * (and we do not, but that's a separate story), we could not rely on file-f_flags. Why? Because O_NDELAY can be flipped at will by fcntl() after the file got opened. IOW, we can't rely on it. Most of that crap can be solved by saving the relevant information from f_flags into f_mode (e.g. fs/block_dev.c::do_open()), where they'd remain safe. However, that's not quite enough - struct file we pass to -open() is often a fake one and it certainly won't live until -release(). One thing that _can_ be solved that way, though, and it alone makes the scheme worthwhile: floppy.c abuses can be finally killed. Just saving -f_flags 2 in -f_mode is enough to eliminate the need for floppy_open() to play games with permission(9) and modify *file in any way. Anything close to current fdutils either passes O_RDWR |... or 3 | ... to open() when it wants to do priveleged ioctls and then we get open_namei check that we can write to file. So we can check that bit saved in -f_mode for can we do that kind of ioctl; no need to mess with anything. That's a *big* win, since this is eliminates the only place that uses more than the fields mentioned above and the only place that tries to modify struct file and check for modifications in later method calls. So having done that, we get to the situation where -open() only uses inode-i_bdev and file-f_mode (and only reads these). I.e. we will be able to go for much saner int (*open)(struct block_device *bdev, mode_t mode) We even can kill that fake struct file. At that point we have sanitized -open(), no regressions and a chance to solve the problems with mode-dependent cleanups on close(). However, to use that chance we'll need to do something with -release(). 2) -release() gets struct inode * and struct file * too. It only uses inode-i_bdev-bd_disk and, sometimes, file-f_mode. Tries to use file-f_mode, that is. Even all way back in 2.2 when it was -release() of file_operations, we used to get NULL on operations like umount, etc., so users of file-f_mode/file-f_flags had to check for file != NULL first. And pray that all such users will want some reasonable default behaviour. For one thing, that's not true anymore. E.g. pktcdvd.c does blkdev_get() with O_NDELAY and blkdev_put() as matching close. IOW, blkdev_put() can't expect that matching open had no O_NDELAY. For another, blkdev_put() is used by blkdev_close() since 2.3, so we _always_ get NULL now. IOW, the cleanup-related logics doesn't work properly, period. The obvious first step is to switch to int (*release)(struct gendisk *disk, mode_t mode) (there are good reasons why open wants bdev and release does not). At least that way we can start passing the right value without fake struct file, etc. Now, the only thing we need is a way to pass that mode_t to blkdev_put(). IOW, blkdev_put() needs an extra argument. And that's where the things get really interesting. blkdev_close() should just pass file-f_mode to it; that's easy. However, we have more callers and need case-by-case analysis for those. We don't have too many of those callers, fortunately, and almost all of them are easy - we know what had matching blkdev_open() got and that's it. Exceptions are interesting. * we have reiserfs journal that could've been opened r/o or r/w. BFD - we can store the mode_t just fine along with the reference to bdev. * a bug (AFAICT) in md.c - we open raid components r/w and if it fails, it fails. Good for our purposes, but... how about raid0 on read-only devices? In any case, we have a ready place to store mode_t, so it's not a problem for getting the right value to blkdev_put(). FWIW, I think that allowing fallback to r/o (and making the entire array r/o, of course) would be a good thing to have. Any comments from drivers/md folks? * open_bdev_excl()/close_bdev_excl(). Needs an extra
Re: Adding a security parameter to VFS functions
On Thu, Aug 16, 2007 at 03:57:24PM -0700, Linus Torvalds wrote: I personally consider this an affront to everythign that is decent. Why the *hell* would mkdir() be so magical as to need something like that? Make it something sane, like a struct nameidata instead, and make it at least try to look like the path creation that is done by open(). Or create a struct file * or something. No. I agree that mkdir/mknod/etc. should all take the same thing. *However*, it should not be nameidata or file. Why? Because it should not contain vfsmount. Object creation should not *care* where fs is mounted. At all. The same goes for open(), BTW - letting it see the reference to vfsmount via nameidata is bloody wrong and I really couldn't care less what kinds of perversions apparmour crowd might like - pathnames simply do not belong there. Said that, I agree that we need uniform prototypes for all these guys and I can see arguments both for and against having creds passed by reference stored in whatever struct we are passing (for: copy-on-write refcounted cred objects would almost never have to be modified or copied and normally we would just pick the current creds reference from task_struct and assing it to a single field in whatever we pass instead of nameidata; against: might be conceptually simpler to have all that data directly in that struct and a helper filling it in). Which way we do that and which struct we end up passing is a very good question, but I want to make it very clear: having object creation/removal/ renaming methods[1] depend on which vfsmount we are dealing with is *wrong*. IOW, I strongly object against the use of nameidata here. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] coda: kill file_count abuse
On Fri, Jul 20, 2007 at 10:45:34AM +1000, David Chinner wrote: On Thu, Jul 19, 2007 at 06:16:00PM -0400, Jan Harkes wrote: On Thu, Jul 19, 2007 at 11:45:08PM +0200, Christoph Hellwig wrote: -release is the proper way to detect the last close of a file, file_count should never be used in filesystems. Has been tried, the problem with that once -release is called it is too late to pass the the error back to close(2). I think you'll find the problem is that fput() throws away the error from -release, not that it's too late Just where would that return value go? BTW, the reason why checks for struct file refcount blow is not far from that: task A: write() task B (sharing descriptor table with A): close() task C (with another reference to struct file in question): close() task A: return from write() Now, the final fput() here happens in write(). In particular, no call of close(2) sees refcount equal to 1. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] coda: kill file_count abuse
On Fri, Jul 20, 2007 at 12:36:01PM +1000, David Chinner wrote: To the context that dropped the last reference. It can't be reported to anything else Oh, for fsck sake... Send a datagram with SCM_RIGHTS in it. Have all other references to these files closed. Now have close(2) kill the socket that might eventually be used to receive the datagram. Now, all these files are closed. Tell me, which error value would you report to that caller of close() and how would it guess which files had those been? Consider munmap(). It can release the last reference to any number of files (mmap them in different places, then munmap the entire area). Which error would you report? Consider exit(), for crying out loud. You have a file opened. You fork a child. You close the file in parent. Child goes away. Who do you report that stuff? - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] coda: file count cannot be used to discover last close
On Fri, Jul 20, 2007 at 12:10:00AM -0400, Jan Harkes wrote: I will try to find a clean way to block the close syscall until fput drops the last reference. However I realize that such an implementation would not be acceptable for other file systems, and there are some interesting unresolved details such as 'which close is going to be the last close'. Simply impossible. fd = open(...); fd2 = dup(fd); close(fd2); -- deadlock close(fd); Or fd = open(...); p = mmap(..., fd, ...); close(fd); -- deadlock ... munmap(p, ...); The problem is that you are mixing two different operations: * closing descriptor: descriptor doesn't refer to this opened file anymore. Can be done by close(), can be done by dup2(), can be done by fcntl() analog of dup2(), can be done by execve(), can be done by exit(). And can be done by death-by-signal. * closing opened file: no references (including descriptors) exist anymore, no IO of any kind will be done on that IO channel. Can happen when descriptor gets closed, can happen when mapping gets destroyed (munmap, etc.), can happen when SCM_RIGHTS datagram gets destroyed (including the garbage collection), can happen when any syscall on that file finishes after another thread has closed descriptor passed to that syscall, etc. You _can't_ expect the errors from the latter to be reliably passed to some process; for one thing, many files can get closed by single system call (e.g. closing an AF_UNIX socket with pending SCM_RIGHTS datagrams will flush those datagrams, dropping references to files we tried to pass via them). Why does CODA need special warranties in that area, anyway? Related question: does fsync() force the writeback? - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] isofs: mounting to regular file may succeed
On Tue, Jul 17, 2007 at 12:04:07AM -0700, Andrew Morton wrote: I don't think any (all?) other filesystems perform checks like this. Is this something which can/should be performed at the VFS level? I don't see why we _should_ check that; VFS checks that we don't turn directory into non-directory and vice versa, same as for binding. There is no reason why fs driver could not give you a single-node filesystem with root being e.g. a block device node. You could mount it on any non-directory and get a block device seen there, etc. IOW, if filesystem driver considers fs image it's working from as invalid, it's up to filesystem driver. _What_ is considered invalid depends on fs type; what the driver cares to report as dubious is, again, up to driver. If isofs wants to warn about an ISO image that definitely violates ISO9660, it's isofs decision. VFS will work with it as long as fs driver does... - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: *at syscalls for xattrs?
On Mon, Jul 16, 2007 at 09:56:10AM +0200, Jan Engelhardt wrote: Just one question: what the bleeding hell for? Not that the rest of ..at() family made any damn sense as an interface... fd1 = open(dir1, O_DIRECTORY): fd2 = open(dir2, O_DIRECTORY); system(mount -t tmpfs none dir1); system(mount -t tmpfs none dir2); openat(fd1, file1, O_RDWR | O_CREAT); openat(fd2, file2, O_RDWR | O_CREAT); If you have a better way to accomplish this, let me know. :) To accomplish what, exactly? Access to overmounted directory? So bind it elsewhere and use that. I still don't see the point - neither of the interface nor of your example... - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: *at syscalls for xattrs?
On Sun, Jul 15, 2007 at 09:46:27PM +0200, Jan Engelhardt wrote: Hi, recently, the family of *at() syscalls and functions (openat, fstatat, etc.) have been added to Linux and Glibc, respectively. In short: I am missing xattr at functions :) No. They are not fscking forks. They are almost as revolting, but not quite on the same level. BTW, why is fstatat called fstatat and not statat? (Same goes for futimesat.) It does not take a file descriptor for the file argument. Otherwise we'd also need fopenat/funlinkat, etc. Any reasons? Ulrich having an odd taste? - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: *at syscalls for xattrs?
On Sun, Jul 15, 2007 at 02:13:21PM -0700, Nicholas Miell wrote: I suspect he was asking for int getxattrat(int fd, const char *path, const char *name, void *value, size_t size, int flags) int setxattrat(int fd, const char *path, const char *name, void *value, size_t size, int xattrflags, int atflags) rather than the ability to access xattrs as files. Just one question: what the bleeding hell for? Not that the rest of ..at() family made any damn sense as an interface... BTW, why is fstatat called fstatat and not statat? (Same goes for futimesat.) It does not take a file descriptor for the file argument. Otherwise we'd also need fopenat/funlinkat, etc. Any reasons? Ulrich having an odd taste? Solaris compatibility. Sun having no taste whatsoever - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/PATCH 2/5] revoke: core code
On Wed, Jul 11, 2007 at 12:01:06PM +0300, Pekka J Enberg wrote: From: Pekka Enberg [EMAIL PROTECTED] The revokeat(2) and frevoke(2) system calls invalidate open file descriptors and shared mappings of an inode. After an successful revocation, operations on file descriptors fail with the EBADF or ENXIO error code for regular and device files, respectively. Attempting to read from or write to a revoked mapping causes SIGBUS. The actual operation is done in two passes: 1. Revoke all file descriptors that point to the given inode. We do this under tasklist_lock so that after this pass, we don't need to worry about racing with close(2) or dup(2). How does that deal with the following: task A gets its descriptor table cleansed task B sends a descriptor to task A via SCM_RIGHTS datagram task B gets its descriptor table cleansed task A receives the datagram and gets the descriptor installed task A does any kind of IO on that descriptor -f_mapping gets replaced in the most inconvenient time. Come to think of that, what do you do if I create a socketpair, stuff the descriptor into SCM_RIGHTS datagram and send it to myself? Then receive it a day after you've called revoke() and voila - I've got an almost undamaged struct file back... At the very least, it allows me to fchmod(). Or fchdir() if that had been a directory... BTW, read() or write() in progress might get rather unhappy if your live replacement of -f_mapping races with them... - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/PATCH 2/5] revoke: core code
On Wed, Jul 11, 2007 at 10:37:33AM +0100, Al Viro wrote: On Wed, Jul 11, 2007 at 12:01:06PM +0300, Pekka J Enberg wrote: From: Pekka Enberg [EMAIL PROTECTED] The revokeat(2) and frevoke(2) system calls invalidate open file descriptors and shared mappings of an inode. After an successful revocation, operations on file descriptors fail with the EBADF or ENXIO error code for regular and device files, respectively. Attempting to read from or write to a revoked mapping causes SIGBUS. The actual operation is done in two passes: 1. Revoke all file descriptors that point to the given inode. We do this under tasklist_lock so that after this pass, we don't need to worry about racing with close(2) or dup(2). How does that deal with the following: task A gets its descriptor table cleansed task B sends a descriptor to task A via SCM_RIGHTS datagram task B gets its descriptor table cleansed task A receives the datagram and gets the descriptor installed task A does any kind of IO on that descriptor -f_mapping gets replaced in the most inconvenient time. Come to think of that, what do you do if I create a socketpair, stuff the descriptor into SCM_RIGHTS datagram and send it to myself? Then receive it a day after you've called revoke() and voila - I've got an almost undamaged struct file back... At the very least, it allows me to fchmod(). Or fchdir() if that had been a directory... BTW, read() or write() in progress might get rather unhappy if your live replacement of -f_mapping races with them... Better: I have the only opened descriptor for foo. I send it to myself as described above. I close it. revoke() is called, finds no opened instances of foo in any descriptor tables and cheerfully does nothing. I call recvmsg() and I have completely undamaged opened file back. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/PATCH 2/5] revoke: core code
On Wed, Jul 11, 2007 at 12:50:48PM +0300, Pekka J Enberg wrote: Hi Al, On Wed, 11 Jul 2007, Al Viro wrote: Better: I have the only opened descriptor for foo. I send it to myself as described above. I close it. revoke() is called, finds no opened instances of foo in any descriptor tables and cheerfully does nothing. I call recvmsg() and I have completely undamaged opened file back. Uhm, nice. So, revoke() needs a proper inode - struct files mapping somewhere. Can we add a list of files to struct inode? Are there other cases where a file can point to an inode but the file is not attached to any file descriptor? Umm... Any number, really - it might be in the middle of syscall while another task sharing descriptor table has closed the descriptor. Then there's quota, then there's process accounting, then there's execve() in progress, then there's knfsd working with that struct file, etc. The fundamental issue here is that even if you do find struct file, you can't blindly rip its -f_mapping since it can be in the middle of -read(), -write(), pageout, etc. And even if you do manage that, you still have the ability to do fchmod() later. I don't see how the ability to find all instances in SCM_RIGHTS datagrams (for example) will help you with the race I've described first. Original state: task B has the only reference to file. revoke() is called, passes task A. B sends datagram to A and closes file. A receives datagram. Now the only reference is in A's table and you've already passed that. So you can't avoid processes keeping pointers to struct file. If you could find all struct file over given inode (which, I suspect, will lead to interesting locking), you could call something on that struct file, but you'd have zero exclusion with processes calling methods on it. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/PATCH 2/5] revoke: core code
On Wed, Jul 11, 2007 at 01:01:07PM +0300, Pekka J Enberg wrote: On Wed, 11 Jul 2007, Al Viro wrote: BTW, read() or write() in progress might get rather unhappy if your live replacement of -f_mapping races with them... For writes, we (1) never start any new operations after we've cleaned up the file descriptor tables so (2) after we're done with do_fsync() we never touch -f_mapping again. Er, no. do_fsync() won't hit the sys_write() that is yet to enter -write(). And you can't get rid of new callers _anyway_ (see previous mail). - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Adding subroot information to /proc/mounts, or obtaining that through other means
On Wed, Jun 20, 2007 at 01:57:33PM -0700, H. Peter Anvin wrote: ... or, alternatively, add a subfield to the first field (which would entail escaping whatever separator we choose): /dev/md6 /export ext3 rw,data=ordered 0 0 /dev/md6:/users/foo /home/foo ext3 rw,data=ordered 0 0 /dev/md6:/users/bar /home/bar ext3 rw,data=ordered 0 0 Hell, no. The first field is in principle impossible to parse unless you know the fs type. How about making a new file with sane format? From the very beginning. E.g. mountpoint + ID + relative path + type + options, where ID uniquely identifies superblock (e.g. numeric st_dev) and backing device (if any) is sitting among the options... - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [AppArmor 01/41] Pass struct vfsmount to the inode_create LSM hook
On Thu, May 24, 2007 at 08:10:00PM +0200, Andreas Gruenbacher wrote: Read it like this: we don't have a good idea how to support multiple namespaces so far. Currently, we interpret all pathnames relative to the namespace a process is in. Confined processes don't have the privilege to create or manipulate namespaces, which makes this safe. We may find a better future solution. You also don't have a solution for multiple chroot jails, since they often have the same fs mounted in many places on given box *and* since the pathnames from the confined processes' POVs have fsck-all to do with each other. It's really not kinder than multiple namespaces as far as your approach is concerned. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] file as directory
On Wed, May 23, 2007 at 08:36:04AM +0200, Miklos Szeredi wrote: Interesting... How do you deal with mount propagation and things like mount --move? Moving (or doing other mount operations on) an ancestor shouldn't be a problem. Moving this mount itself is not allowed, and neither is doing bind or pivot_root. Maybe bind could be allowed... Eh... Arbitrary limitations are fun, aren't they? When doing recursive bind on ancestor, these mounts are skipped. What about clone copying your namespace? What about MNT_SLAVE stuff being set up prior to that lookup? More interesting question: should independent lookups of that sucker on different paths end up with the same superblock (and vfsmount for each) or should we get fully independent mount on each? The latter would be interesting wrt cache coherency... As for unlink... How do you deal with having that thing mounted, mounting something _under_ it (so that vfsmount would be kept busy) and then unlinking that sucker? Yeah, that's a good point. Current patch doesn't deal with that. Simplest solution could be to disallow submounting these. Don't think it makes much sense anyway. Arbitrary limitations... (and that's where revalidate horrors come in, BTW). BTW^2: what if fs mounted that way will happen to have such node itself? I'm not saying that it's unfeasible or won't lead to interesting things, but it really needs semantics done right... - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] file as directory
On Wed, May 23, 2007 at 09:19:17AM +0200, Miklos Szeredi wrote: Eh... Arbitrary limitations are fun, aren't they? But these mounts _are_ special. There is really no point in moving or pivoting them. pivoting - probably true, moving... why not? What about MNT_SLAVE stuff being set up prior to that lookup? These mounts are not propagated. Or at least I hope so. Propagation stuff is a bit too complicated for my poor little brain. Er... These mounts might not be propagated, but what about a bind over another instance of such file in master tree? I think they should be the same superblock, same dentry. What would be the advantage of doing otherwise? Then you are going to have interesting time with locking in final mntput(). BTW, what about having several links to the same file? You have i_mutex on the inode, so serialization of those is not a problem, but... I think doing this recursively should be allowed. Releasing last ref cleans up the mess should work in that case. Releasing the last reference will lead to cascade of umounts in that case... IOW, need to be careful with locking. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] file as directory
On Tue, May 22, 2007 at 08:48:49PM +0200, Miklos Szeredi wrote: */ -static int __follow_mount(struct path *path) +static int __follow_mount(struct path *path, bool enter) { int res = 0; while (d_mountpoint(path-dentry)) { - struct vfsmount *mounted = lookup_mnt(path-mnt, path-dentry); + struct vfsmount *mounted = + lookup_mnt(path-mnt, path-dentry, enter); + if (!mounted) break; dput(path-dentry); @@ -689,27 +697,37 @@ static int __follow_mount(struct path *p return res; } -static void follow_mount(struct vfsmount **mnt, struct dentry **dentry) +/* + * Follows mounts on the given nameidata. + * + * Only follows directory on file mounts if LOOKUP_ENTER is set. + */ +void follow_mount(struct nameidata *nd) BTW, I'd split that (and matching updates in callers) into separate patch. { - while (d_mountpoint(*dentry)) { - struct vfsmount *mounted = lookup_mnt(*mnt, *dentry); + while (d_mountpoint(nd-dentry)) { + bool enter = nd-flags LOOKUP_ENTER; int, surely? + * This is called if the object has no -lookup() defined, yet the + * path contains a slash after the object name. + * + * If the filesystem defines an -enter() method, this will be called, + * and the filesystem shall fill the supplied struct path or return an + * error. + * + * The returned path will be bind mounted on top of the object with + * the MNT_DIRONFILE flag, and the nameidata will descend into the + * mount. + */ +static int enter_file(struct inode *inode, struct nameidata *nd) +{ + int err; + struct path newpath; + + printk(KERN_DEBUG %s/%d enter %s/\n, current-comm, current-pid, +nd-dentry-d_name.name); + if (!inode-i_op-enter) + return -ENOTDIR; + + newpath.mnt = NULL; + newpath.dentry = NULL; + err = inode-i_op-enter(nd, newpath); + if (!err) { + err = mount_dironfile(nd, newpath); + pathput(newpath); + } + return err; Ouch. What guarantees that two lookups won't race right here? You are not holding any locks at that point, AFAICS... BTW, why newpath? What's wrong with simply returning a new vfsmount with right -mnt_root/-mnt_sb (instead of creating it inside mount_dironfile())? ERR_PTR() for error, struct vfsmount * for success... @@ -301,8 +310,8 @@ static struct vfsmount *clone_mnt(struct mnt-mnt_mountpoint = mnt-mnt_root; mnt-mnt_parent = mnt; - /* don't copy the MNT_USER flag */ - mnt-mnt_flags = ~MNT_USER; + /* don't copy some flags */ + mnt-mnt_flags = ~(MNT_USER | MNT_DIRONFILE); if (flag CL_SETUSER) __set_mnt_user(mnt, owner); Hmm? So you do copy them and strip your MNT_DIRONFILE from copies? + * This is tricky, because for namespace modification we must take the + * namespace semaphore. But mntput() is called from various places, + * sometimes with namespace_sem held. Fortunately in those places the + * mount cannot yet have MNT_DIRONFILE, or at least that's what I + * hope... + * + * The umounting is done in two stages, first the mount is removed + * from the hashes. This is done atomically wrt other mount lookups, + * so it's not possible to acquire a new ref to this dead mount that + * way. + * + * Then after having locked namespace_sem and relocked vfsmount_lock, + * the mount is properly detached. + */ +static void umount_dironfile(struct vfsmount *mnt) + __releases(vfsmount_lock) +{ + struct nameidata nd; You've got to be kidding. nameidata is *big*. If anything, we want to make detach_mnt() take struct path * instead, but even that is lousy due to recursion. I really don't like what's going on here. The thing is, current code is based on assumption that presence in the mount tree = holding a reference. We _might_ deal with that (there was an old plan to change refcounting logics for vfsmounts), but that sort of games with locks spells trouble. What happens, for example, if namespace gets cloned before you grab namespace_sem? There's another problem, BTW - a lot of stuff does stat + open + fstat + compare kind of sequence. You'll end up mounting/umounting between stat and open, which opens you to race with somebody else. Get a different st_dev, eat a nice unreproducible error from application... - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] file as directory
On Wed, May 23, 2007 at 11:03:08AM +0200, Miklos Szeredi wrote: I still don't get it where the superblock comes in. The locking is interesting in there, yes. And I haven't completely convinced myself it's right, let alone something that won't easily be screwed up in the future. So there's definitely room for thought there. But how does it matter if two different paths have the same sb or a different sb mounted over them? Because then you get a slew of fun issues with dropping the final reference to vfsmount vs. lookup on another place. What hold do you have on that superblock and when do you switch from oh, called -enter() on the same inode again, return vfsmount over the same superblock to need to initialize that damn superblock, all mounts are gone? The same dentry is mounted over each one. The contents of the directory should only depend on the contents of the underlying inode. The path leading up to it is completely irrelevant. So what kind of exclusion do you have for -enter()? None? - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] file as directory
On Wed, May 23, 2007 at 12:09:19PM +0200, Miklos Szeredi wrote: Right. After locking vfsmount_lock, mount_dironfile() should recheck if there was a race and bail out. Owww... Not pretty, that... I don't think the filesystem ought to try _creating_ a vfsmount. I imagine, that the fs has already a kernel-internal mounted for this kind of stuff, and it just supplies a dentry from that. The vfsmount isn't actually important, but it should be readily available, and it's easier to clone from a vfsmount/dentry pair. I don't get it. What's the point of that exercise, then? When do you create that kernel-internal mount? As I said, the superblock should be persistent, so we'll get a stable st_dev for multiple mounts. OK, but then I guess I don't understand the intended use. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] file as directory
When the real superblock is created. It could even be the _same_ super block as the real one. There'd be just the problem of anchoring the dir-on-file dentries somewhere... Or with fuse the dir-on-file mount can just come from any mounted filesystem, again possibly the same one as the parent. I do actually test with this. The userspace filesystem supplies a file descriptor, from which the struct path is extracted and returned from -enter(). Then I do not understand what this mechanism could be used for, other than an odd way to twist POSIX behaviour and see how much of the userland would survive that. Certainly not useful for your look into tarball as a tree, unless you seriously want to scan the entire damn fs for tarballs at mount time and set up a superblock for each. And for per-file extended attributes/forks/whatever-you-call-that-abomination it also obviously doesn't help, since you lose them for directories. IOW, what uses do you have in mind? Complete scenario, please... - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] file as directory
On Wed, May 23, 2007 at 12:39:25PM +0100, Al Viro wrote: Then I do not understand what this mechanism could be used for, other than an odd way to twist POSIX behaviour and see how much of the userland would survive that. Certainly not useful for your look into tarball as a tree, unless you seriously want to scan the entire damn fs for tarballs at mount time and set up a superblock for each. And for per-file extended attributes/forks/whatever-you-call-that-abomination it also obviously doesn't help, since you lose them for directories. IOW, what uses do you have in mind? Complete scenario, please... Ah... After rereading the thread you've mentioned in the very beginning, I think I understand what you are driving at. However, in that case * I really don't see why bother with returning vfsmount at all. dentry alone is enough to create a new vfsmount, all in fs/namei.c. * the lifetime rules look fscking scary. You call that -enter() on nearly every damn lookup. OK, so you'll recreate equivalent vfsmount, but... That's a lot of allocations/freeing. Can we do some caching and deal with it on memory pressure? * invalidation on unlink is still an open problem. * locking in final mntput() doesn't look nice; we probably need a new refcounting scheme for vfsmounts to make that work. I have a variant that might work here (and make life much easier for expiry logics in automount/shared trees, which is what it had been initially proposed for), but it still doesn't kill the need to deal with invalidation. And yes, NFS still needs it (and so do all network filesystems, really). The question of caching is related to that. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] file as directory
On Wed, May 23, 2007 at 08:34:42AM -0400, Trond Myklebust wrote: On Wed, 2007-05-23 at 08:36 +0100, Al Viro wrote: On Wed, May 23, 2007 at 09:19:17AM +0200, Miklos Szeredi wrote: Eh... Arbitrary limitations are fun, aren't they? But these mounts _are_ special. There is really no point in moving or pivoting them. pivoting - probably true, moving... why not? Moving would be an implementation artefact that doesn't really correspond to any useful operation on the filesyst AFAIK, most filesystems that have implemented subfiles (excepting Reiser4 of course) do not allow you to rename or move the subfile directory or its contents from one parent file to another. If that's about xattr and nothing else, colour me thoroughly uninterested. If it might have other interesting uses, OTOH... - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] file as directory
On Wed, May 23, 2007 at 03:01:38PM +0200, Miklos Szeredi wrote: Someone might think of a way to make those work with directories. Invisible directory entries, anyone? /me ducks Not unless you manage to get working union-mount [*NOT* unionfs] * invalidation on unlink is still an open problem. * locking in final mntput() doesn't look nice; we probably need a new refcounting scheme for vfsmounts to make that work. I have a variant that might work here (and make life much easier for expiry logics in automount/shared trees, which is what it had been initially proposed for), Which variant? We had that detached subtrees thing, is that it? Umm... It is related to detached subtrees, but I'm not sure if it is what you are thinking about. Short version of the story: new counter (mnt_busy) that would be defined in the following way: the number of external references (not due to the vfsmount tree structure or from namespace to root) + the number of children that have non-zero -mnt_busy. And a per-vfsmount flag (goner). The rules for handling -mnt_busy: * duplicating external reference: increment m-mnt_busy * getting from m to child: increment child-mnt_busy, if it went from 0 to non-zero - increment m-mnt_busy as well (that's done under vfsmount_lock, so we can safely check for zero here). * getting from m to parent: increment parent-mnt_busy. * dropping external reference: decrement m-mnt_busy; if it's still non-zero, we are done. If it's zero, we are in for some work (and had acquired vfsmount_lock by atomic_dec_and_lock()). Here's what we do: * go through ancestors, decrementing -mnt_busy, until we hit the root or get to one with -mnt_busy staying non-zero. * find the most remote ancestor that has zero -mnt_busy and is marked as goner (might be m itself). * if no such beast exists, we are done. * otherwise, detach the subtree rooted in that ancestor from its parent (if any) and unhash its root (if hashed). Now there is no external references to any vfsmount in that subtree. * now we can kill all vfsmounts in that subtree. * detaching m from parent: nothing; we trade a busy child of parent for new external reference to parent. * lazy umount: in addition to detaching everything from parents and dropping resulting external references to parents, mark everything in the subtree as goners. * normal umount: check -mnt_busy *and* lack of children, detach, mark as goner, drop resulting external reference to parent. * fun new stuff - umount of intact subtree: detach the subtree from parent, do *not* dissolve it, mark everything in subtree as goners. If something we mark as goner is not busy, we can kill it and all its descendents. The subtree will be shrinking as its pieces lose external references. * check for expirability: we hold an external reference to m and m-mnt_busy is 1. No need to look into children, etc. * your vfsmounts: simply mark them goners from the very beginning. but it still doesn't kill the need to deal with invalidation. And yes, NFS still needs it (and so do all network filesystems, really). The question of caching is related to that. So what's so special about invalidation? Why not just treat dir-on-file mounts the same as any other ref on the dentry? Because of the case of having something mounted in that subtree. The current code doesn't even try to evict such stuff. NFS *does*, but it's not in position to do that decently (not NFS fault, it's just that we don't have the data needed for it). Note that one problem we used to have back then is gone - namely, per-namespace semaphores. It's a global semaphore now, so we *can* do cross-namespace rogering of mount trees without that kind of locking horrors. What we really need is go through dentry subtree, try to evict everything we can, for anything that has stuff mounted on it go through all such vfsmounts and kick them and all their descendents out. That's what should happen on invalidation. From generic code, so that NFS wouldn't have to bother. And _that_ is what we could call from -unlink() on your inode - would take care of submounts. Note that I'm not all that happy about this scheme; we might make it work, but I still want to see a good use scenario for that kind of stuff. Invalidation logics is a separate story - it's simply needed for existing stuff; that area sucks *badly*, regardless of adding these hybrid objects. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] file as directory
On Wed, May 23, 2007 at 03:23:54PM +0200, Ph. Marek wrote: How about some special node in eg. /proc (or a new filesystem)? Eg. /fileAsDir/etc/passwd/owner ... would work for all *files*. For directories we do not know whether we're still climbing the hierarchy or would like to see meta-data. So we need to make *anything* done anywhere in the namespace to modify the dentry tree on that fs. Could you spell fuck, NO? - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] file as directory
On Wed, May 23, 2007 at 04:32:37PM +0200, Miklos Szeredi wrote: Umm... It is related to detached subtrees, but I'm not sure if it is what you are thinking about. I was thinking of a similar one by Mike Waychison. It had the problem of requiring a spinlock for mntget/mntput. It was also different in that it did not gradually dissolve detached trees, but kept them as whole blobs until the last ref went away. Here the spinlock is needed only when mnt_busy goes to 0, so presumably it won't be a serious problem on more or less common setups; however, it certainly would need serious profiling. How will this work with copy_tree() and namespace duplication, which currently walk the tree with only namespace_sem held? Easy - grab namespace_sem, grab vfsmount_lock, walk the subtree and bump mnt_busy on everything (by 1 + number of non-busy children). Then drop vfsmount_lock and do as usual, dropping references in tree being copied as you go. Nothing will get attached or detached due to namespace_sem, nothing will get evicted by anybody other than you since you've got all that stuff pinned down. End of story... - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] file as directory
On Wed, May 23, 2007 at 05:25:49PM +0200, Miklos Szeredi wrote: How will this work with copy_tree() and namespace duplication, which currently walk the tree with only namespace_sem held? Easy - grab namespace_sem, grab vfsmount_lock, walk the subtree and bump mnt_busy on everything (by 1 + number of non-busy children). Then drop vfsmount_lock and do as usual, dropping references in tree being copied as you go. Nothing will get attached or detached due to namespace_sem, nothing will get evicted by anybody other than you since you've got all that stuff pinned down. End of story... Right. Do you have some code? Should I try to code something up? I hope to get some breathing space next week, then I'll get back to VFS work. I'd rather do that one myself, since it'll be a long series of equivalent transformations - debugging such rewrite of refcounting done as a single patch is going to be hell. And yes, refcounting rewrite is near the top of the list (another thing is wading through several threads from hell and reviewing unionfs ;-/) - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [AppArmor 01/41] Pass struct vfsmount to the inode_create LSM hook
On Thu, Apr 12, 2007 at 02:08:10AM -0700, [EMAIL PROTECTED] wrote: This is needed for computing pathnames in the AppArmor LSM. Which is an argument against said LSM in current form. - error = security_inode_create(dir, dentry, mode); + error = security_inode_create(dir, dentry, nd ? nd-mnt : NULL, mode); is a clear sign that interface is wrong. Leaving aside the general idiocy of we prohibit to do something with file if mounted here, but if there's another mountpoint, well, we just miss, an API of that kind is just plain wrong. Either you can live without seeing vfsmount in that method (in which case you shouldn't pass it at all), or you have a hole. NACK. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [AppArmor 40/41] AppArmor: all the rest
On Thu, Apr 12, 2007 at 02:08:49AM -0700, [EMAIL PROTECTED] wrote: + } else if (profile1 profile2) { + /* profile1 cannot be NULL here. */ + spin_lock_irqsave(profile1-lock, profile1-int_flags); + if (profile2) + spin_lock(profile2-lock); + + } else { + /* profile2 cannot be NULL here. */ + spin_lock_irqsave(profile2-lock, profile2-int_flags); + spin_lock(profile1-lock); + } Ahem... profile2 is locked individually. profile1 profile2. profile1 is not locked. We try to lock both. profile1 is locked OK, flags (with interrupts disabled) are stored into it. We spin trying to lock profile2. Eventually, whoever had held profile2 unlocks it, restoring the flags from profile2. We happily grab the spinlock and move on. When we unlock the pair, we restore flags from profile1. I.e. we are left with interrupts disabled. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [AppArmor 33/41] Add d_namespace_path() to obtain namespace relative pathnames
+char *d_namespace_path(struct dentry *dentry, struct vfsmount *vfsmnt, +char *buf, int buflen) +{ + char *res; + struct vfsmount *rootmnt, *nsrootmnt; + struct dentry *root; + + read_lock(current-fs-lock); + rootmnt = mntget(current-fs-rootmnt); + read_unlock(current-fs-lock); + spin_lock(vfsmount_lock); + nsrootmnt = mntget(rootmnt-mnt_ns-root); ... and when somebody does umount -l on your chroot jail, you get NULL -mnt_ns. Oops... - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Move across mount,sb
On Thu, Mar 15, 2007 at 02:20:25AM +0100, Jan Engelhardt wrote: Why is EXDEV returned when _vfs mountpoints_ are crossed? Should not it be more like the following? error = -EXDEV; if (oldnd.mnt-mnt_sb != newnd.mnt-mnt_sb) goto exit2; No. This is absolutely deliberate - mountpoint creates a boundary and such boundaries are very useful for restricting modifications of filesystem. IOW, it's not a bug and it applies to other operations as well (link(), for example). - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ext2: conditional removal of NFSD code
On Sun, Jan 07, 2007 at 12:44:56AM +0300, Alexey Dobriyan wrote: #if defined(CONFIG_EXPORTFS) || defined(CONFIG_EXPORTFS_MODULE) #define set_export_ops(sb, ops) sb-s_export_op = ops #else #define set_export_ops(sb, ops) 0 #endif That way you can get rid of the function pointer from the struct superblock too. Exactly! I've just started with filesystems I use. it should be wrapped in do {} while 0, of course. What the hell for? Repeat after me: * do {} while(0) is always inferior to ((void)0) * do { expr; } while(0) is always inferior to ((void)(expr)) - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Is a NULL check missing in nfs_lookup?
On Fri, Jan 05, 2007 at 12:11:17PM -0500, Shaya Potter wrote: ok, I guess what I don't understand is when are dentry-d_sb-s_root and nd-mnt-mnt_root not equivalent. I guess it would be when crossing a mountpoint on the server, so I look at nfs_follow_mountpoint() and basically see that you create a new mountpoint and a new nd for that. And since superblock objects are only per real file system not per mount point, they will be different. I guess the question is, why shouldn't a dentry object know what vfsmount it belongs to is? Can it belong to multiple vfsmount objects? Yes. dentry belongs to superblock. vfsmount refers to a subtree of some superblock's dentry tree. There can be any number of vfsmounts refering to subtrees of the same fs. Some might refer to the entire tree, some - to its arbitrary subtrees (possibly as small as a single file). There can be any number of vfsmounts refering to any given subtree. Think what happens when you create a binding. Or when you clone an entire namespace. (Pieces of) the same filesystem might be mounted in a lot of places. vfsmount describes a part of unified tree delimited by mountpoints; if the same fs (or its parts) are seen under several mountpoints, you get vfsmounts that refer to the same fs instance, i.e. the same superblock and dentry tree. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Finding hardlinks
On Wed, Dec 20, 2006 at 05:50:11PM +0100, Miklos Szeredi wrote: I don't see any problems with changing struct kstat. There would be reservations against changing inode.i_ino though. So filesystems that have 64bit inodes will need a specialized getattr() method instead of generic_fillattr(). And they are already free to do so. And no, struct kstat doesn't need to be changed - it has u64 ino already. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] ensure i_ino uniqueness in filesystems without permanent inode numbers (via idr)
On Wed, Nov 15, 2006 at 11:42:38AM -0500, Jeff Layton wrote: +{ + int rv; + + rv = idr_pre_get(inode-i_sb-s_inode_ids, GFP_KERNEL); + if (! rv) + return -ENOMEM; + + lock_super(inode-i_sb); [EMAIL PROTECTED] Please, use something saner. Use of lock_super() for anything generic is wrong; using it for something that'd better be fast is even dumber... @@ -1025,6 +1055,7 @@ void generic_delete_inode(struct inode * spin_lock(inode_lock); hlist_del_init(inode-i_hash); spin_unlock(inode_lock); + iunique_unregister(inode); Unconditional? Hitting every fs out there? With that kind of locking? wake_up_inode(inode); BUG_ON(inode-i_state != I_CLEAR); destroy_inode(inode); @@ -1057,6 +1088,7 @@ static void generic_forget_inode(struct inode-i_state |= I_FREEING; inodes_stat.nr_inodes--; spin_unlock(inode_lock); + iunique_unregister(inode); Ditto. if (inode-i_data.nrpages) truncate_inode_pages(inode-i_data, 0); clear_inode(inode); diff --git a/fs/pipe.c b/fs/pipe.c index b1626f2..d74ae65 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -845,6 +845,9 @@ static struct inode * get_pipe_inode(voi if (!inode) goto fail_inode; + if (iunique_register(inode, 0)) + goto fail_iput; + Humm... I wonder what the overhead of that is going to be. There easily can be shitloads of pipes on the box, with all sorts of lifetimes. And we'd better be fast on that codepath... - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Kernel bug: Bad page state: related to generic symlink code and mmap
On Fri, Aug 19, 2005 at 12:14:48PM +0100, Anton Altaparmakov wrote: Hi, There is a bug somewhere in 2.6.11.4 and I can't figure out where it is. I assume it is present in older and newer kernels, too as the related code hasn't changed much AFAICS and googling for Bad page state returns rather a lot of hits relating to both older (up to 2.5.70!) and newer kernels... Note: PLEASE do not stop reading because you read ncpfs below as I am pretty sure it is not ncpfs related! And looking at google a lot of people have reported such similar problems since 2.5.70 or so and they were all told to go away as they have bad ram. That is impossible because this happens on well over 600 workstations and several servers 100% reproducible. Many different types of hardware, different makes, difference age, all running smp kernels even if single cpu. You can't tell me they all have bad ram. Windows works fine and Linux works fine except for that one specific problem which is 100% reproducible... The bug only appears, but it appears 100% reproducibly when a cross volume symlink on ncpfs is accessed using nautilus under gnome. I.e. double click on a cross volume symlink on ncpfs in nautilus and the machine locks up solid. Ugh... Could you at least tell what does nautilus attempt to do at that point? Something that wouldn't show up with simple ls -l symlink or cat symlink /dev/null, judging by the above, but what? - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Kernel bug: Bad page state: related to generic symlink code and mmap
On Fri, Aug 19, 2005 at 09:07:35AM -0700, Linus Torvalds wrote: Hmm.. NFS _does_ use the page cache for symlinks, but uses it slightly differently: instead of relying on the page cache entry being the same when freeing the page, it just caches the page it looked up in the page cache (ie nfs_follow_link() does look up the page cache, but then hides the page pointer inside the page data itself (uglee), and thus does not depend on the mapping staying the same (nfs_put_link() just takes the page from the symlink data). For NFS that was done exactly to deal with cache invalidation. IIRC, I've convinced myself that it wasn't going to happen on ncpfs and happily abstained from duplicating the NFS variant. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Kernel bug: Bad page state: related to generic symlink code and mmap
On Fri, Aug 19, 2005 at 09:43:17AM -0700, Linus Torvalds wrote: Actually, looking at the ncpfs patch, I'd rather not apply that patch as-is. It looks like it will totally disable symlink caching, which would be kind of sad. Somebody willing to do the same thing NFS does? NFS hides away the struct page * pointer at the end of the page data, so that when we pass the pointer to the virtual address around, we can trivially look up the struct page. An alternative is to make the symlink address-space use a gfp_mask of GFP_KERNEL (no highmem), at which point ncpfs_put_link() just becomes something like void ncpfs_put_link(struct dentry *dentry, struct nameidata *nd) { char *addr = nd_get_link(nd); if (!IS_ERR(addr)) page_cache_release(virt_to_page(addr)); } which is pretty ugly, but even simpler than the NFS trick. Anybody? I'm taking NFS helpers to libfs.c and switching ncpfs to them. IMO that's better than copying the damn thing and other network filesystems might have the same needs eventually... - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Kernel bug: Bad page state: related to generic symlink code and mmap
On Fri, Aug 19, 2005 at 05:53:32PM +0100, Al Viro wrote: I'm taking NFS helpers to libfs.c and switching ncpfs to them. IMO that's better than copying the damn thing and other network filesystems might have the same needs eventually... [something like this - completely untested] * stray_page_get_link(inode, filler) - returns ERR_PTR(error) or pointer to symlink body. Said symlink body sits in a page at offset equal to offsetof(page, struct stray_page_link). filler() is expected to put it at such offset. Page is cached. * stray_page_put_link() - -put_link() suitable for links obtained from stray_page_get_link(). Unlike the usual pagecache-based variants, this sucker does _not_ rely on page staying cached. * nfs and ncpfs switched to the helpers above. Signed-off-by: Al Viro [EMAIL PROTECTED] diff -urN RC14-rc6-git10-base/fs/libfs.c current/fs/libfs.c --- RC13-rc6-git10-base/fs/libfs.c 2005-08-10 10:37:52.0 -0400 +++ current/fs/libfs.c 2005-08-19 13:29:39.0 -0400 @@ -7,6 +7,7 @@ #include linux/pagemap.h #include linux/mount.h #include linux/vfs.h +#include linux/namei.h #include asm/uaccess.h int simple_getattr(struct vfsmount *mnt, struct dentry *dentry, @@ -616,6 +617,42 @@ return ret; } +char *stray_page_get_link(struct inode *inode, int (*filler)(struct inode *, struct page *)) +{ + struct stray_page_symlink *p; + struct page *page; + + page = read_cache_page(inode-i_data, 0, (filler_t *)filler, inode); + if (IS_ERR(page)) + goto read_failed; + if (!PageUptodate(page)) + goto getlink_read_error; + p = kmap(page); + p-page = page; + return p-body; + +getlink_read_error: + page_cache_release(page); + page = ERR_PTR(-EIO); +read_failed: + return (char *)page;/* error */ +} + +void stray_page_put_link(struct dentry *dentry, struct nameidata *nd) +{ + char *s = nd_get_link(nd); + if (!IS_ERR(s)) { + struct stray_page_symlink *p; + struct page *page; + + p = container_of(s, struct stray_page_symlink, body[0]); + page = p-page; + + kunmap(page); + page_cache_release(page); + } +} + EXPORT_SYMBOL(dcache_dir_close); EXPORT_SYMBOL(dcache_dir_lseek); EXPORT_SYMBOL(dcache_dir_open); @@ -648,3 +685,5 @@ EXPORT_SYMBOL_GPL(simple_attr_close); EXPORT_SYMBOL_GPL(simple_attr_read); EXPORT_SYMBOL_GPL(simple_attr_write); +EXPORT_SYMBOL(stray_page_get_link); +EXPORT_SYMBOL(stray_page_put_link); diff -urN RC13-rc6-git10-base/fs/ncpfs/inode.c current/fs/ncpfs/inode.c --- RC13-rc6-git10-base/fs/ncpfs/inode.c2005-06-17 15:48:29.0 -0400 +++ current/fs/ncpfs/inode.c2005-08-19 13:12:01.0 -0400 @@ -104,7 +104,7 @@ extern struct dentry_operations ncp_root_dentry_operations; #if defined(CONFIG_NCPFS_EXTRAS) || defined(CONFIG_NCPFS_NFS_NS) -extern struct address_space_operations ncp_symlink_aops; +extern int ncp_follow_link(struct dentry *dentry, struct nameidata *nd); extern int ncp_symlink(struct inode*, struct dentry*, const char*); #endif @@ -233,8 +233,8 @@ #if defined(CONFIG_NCPFS_EXTRAS) || defined(CONFIG_NCPFS_NFS_NS) static struct inode_operations ncp_symlink_inode_operations = { .readlink = generic_readlink, - .follow_link= page_follow_link_light, - .put_link = page_put_link, + .follow_link= ncp_follow_link, + .put_link = stray_page_put_link, .setattr= ncp_notify_change, }; #endif @@ -272,7 +272,6 @@ #if defined(CONFIG_NCPFS_EXTRAS) || defined(CONFIG_NCPFS_NFS_NS) } else if (S_ISLNK(inode-i_mode)) { inode-i_op = ncp_symlink_inode_operations; - inode-i_data.a_ops = ncp_symlink_aops; #endif } else { make_bad_inode(inode); diff -urN RC13-rc6-git10-base/fs/ncpfs/symlink.c current/fs/ncpfs/symlink.c --- RC13-rc6-git10-base/fs/ncpfs/symlink.c 2005-06-17 15:48:29.0 -0400 +++ current/fs/ncpfs/symlink.c 2005-08-19 13:26:26.0 -0400 @@ -30,6 +30,7 @@ #include linux/time.h #include linux/mm.h #include linux/stat.h +#include linux/namei.h #include ncplib_kernel.h @@ -41,9 +42,8 @@ /* - read a symbolic link -- */ -static int ncp_symlink_readpage(struct file *file, struct page *page) +static int symlink_filler(struct inode *inode, struct page *page) { - struct inode *inode = page-mapping-host; int error, length, len; char *link, *rawlink; char *buf = kmap(page); @@ -77,6 +77,7 @@ } len = NCP_MAX_SYMLINK_SIZE; + buf += offsetof(struct stray_page_symlink, body); error = ncp_vol2io(NCP_SERVER(inode), buf, len, link, length, 0); kfree(rawlink); if (error) @@ -96,13 +97,12 @@ return error
Re: Kernel bug: Bad page state: related to generic symlink code and mmap
On Fri, Aug 19, 2005 at 07:00:38PM +0100, Christoph Hellwig wrote: On Fri, Aug 19, 2005 at 07:02:18PM +0100, Al Viro wrote: On Fri, Aug 19, 2005 at 05:53:32PM +0100, Al Viro wrote: I'm taking NFS helpers to libfs.c and switching ncpfs to them. IMO that's better than copying the damn thing and other network filesystems might have the same needs eventually... [something like this - completely untested] * stray_page_get_link(inode, filler) - returns ERR_PTR(error) or pointer to symlink body. Said symlink body sits in a page at offset equal to offsetof(page, struct stray_page_link). filler() is expected to put it at such offset. Page is cached. * stray_page_put_link() - -put_link() suitable for links obtained from stray_page_get_link(). Unlike the usual pagecache-based variants, this sucker does _not_ rely on page staying cached. * nfs and ncpfs switched to the helpers above. Can you add some kerneldoc comments to describe them? Especially as the name is not very descriptive. Hey, if anybody has suggestions on names - they are very welcome ;-) FWIW, I'd rather take page_symlink(), page_symlink_inode_operations, page_put_link(), page_follow_link_light(), page_readlink(), page_getlink(), generic_readlink() and vfs_readlink() to the same place where these guys would live. They all belong together and none of them has any business in fs/namei.c. Options: fs/libfs.c or separate library since fs/libfs.c is getting crowded. Linus, do you have any objections to that or suggestions on filename here? - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Kernel bug: Bad page state: related to generic symlink code and mmap
On Fri, Aug 19, 2005 at 03:04:52PM -0700, Linus Torvalds wrote: On Fri, 19 Aug 2005, Anton Altaparmakov wrote: Yes, sure. I have applied your patch to our 2.6.11.4 tree (with the one liner change I emailed you just now) and have kicked off a compile. Actually, hold on. The original patch had another problem: it returned an uninitialized page pointer when page_getlink() failed. This one should have that fixed, and has converted a few other filesystems. Most of them trivially, but I took the opportunity to just simplify NFS while I was at it, since it now has no reason to need to save off the struct page * any more. It's still not tested, but at least I've looked at it a bit more ;) That looks OK except for * jffs2 is b0rken (see patch in another mail) * afs, autofs4, befs, devfs, freevxfs, jffs2, jfs, ncpfs, procfs, smbfs, sysvfs, ufs, xfs - prototype change for -follow_link() * befs, smbfs, xfs - same for -put_link() * ncpfs fix is actually missing here Prototype changes are covered by patch below (incremental on top of your + jffs2 fix upthread). No ncpfs changes - these will go separately, assuming you haven't done them yet; just a plain janitor stuff. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Kernel bug: Bad page state: related to generic symlink code and mmap
On Sat, Aug 20, 2005 at 12:15:42AM +0100, Al Viro wrote: That looks OK except for * jffs2 is b0rken (see patch in another mail) * afs, autofs4, befs, devfs, freevxfs, jffs2, jfs, ncpfs, procfs, smbfs, sysvfs, ufs, xfs - prototype change for -follow_link() * befs, smbfs, xfs - same for -put_link() * ncpfs fix is actually missing here Prototype changes are covered by patch below (incremental on top of your + jffs2 fix upthread). No ncpfs changes - these will go separately, assuming you haven't done them yet; just a plain janitor stuff. Gaack... And here's the patch itself - sorry. diff -urN RC13-rc6-git10-base/fs/afs/mntpt.c current/fs/afs/mntpt.c --- RC13-rc6-git10-base/fs/afs/mntpt.c 2005-06-17 15:48:29.0 -0400 +++ current/fs/afs/mntpt.c 2005-08-19 19:02:48.0 -0400 @@ -30,7 +30,7 @@ struct dentry *dentry, struct nameidata *nd); static int afs_mntpt_open(struct inode *inode, struct file *file); -static int afs_mntpt_follow_link(struct dentry *dentry, struct nameidata *nd); +static void *afs_mntpt_follow_link(struct dentry *dentry, struct nameidata *nd); struct file_operations afs_mntpt_file_operations = { .open = afs_mntpt_open, @@ -233,7 +233,7 @@ /* * follow a link from a mountpoint directory, thus causing it to be mounted */ -static int afs_mntpt_follow_link(struct dentry *dentry, struct nameidata *nd) +static void *afs_mntpt_follow_link(struct dentry *dentry, struct nameidata *nd) { struct vfsmount *newmnt; struct dentry *old_dentry; @@ -249,7 +249,7 @@ newmnt = afs_mntpt_do_automount(dentry); if (IS_ERR(newmnt)) { path_release(nd); - return PTR_ERR(newmnt); + return (void *)newmnt; } old_dentry = nd-dentry; @@ -267,7 +267,7 @@ } kleave( = %d, err); - return err; + return ERR_PTR(err); } /* end afs_mntpt_follow_link() */ /*/ diff -urN RC13-rc6-git10-base/fs/autofs4/symlink.c current/fs/autofs4/symlink.c --- RC13-rc6-git10-base/fs/autofs4/symlink.c2005-06-17 15:48:29.0 -0400 +++ current/fs/autofs4/symlink.c2005-08-19 19:03:11.0 -0400 @@ -12,11 +12,11 @@ #include autofs_i.h -static int autofs4_follow_link(struct dentry *dentry, struct nameidata *nd) +static void *autofs4_follow_link(struct dentry *dentry, struct nameidata *nd) { struct autofs_info *ino = autofs4_dentry_ino(dentry); nd_set_link(nd, (char *)ino-u.symlink); - return 0; + return NULL; } struct inode_operations autofs4_symlink_inode_operations = { diff -urN RC13-rc6-git10-base/fs/befs/linuxvfs.c current/fs/befs/linuxvfs.c --- RC13-rc6-git10-base/fs/befs/linuxvfs.c 2005-06-17 15:48:29.0 -0400 +++ current/fs/befs/linuxvfs.c 2005-08-19 19:03:52.0 -0400 @@ -41,8 +41,8 @@ static void befs_destroy_inode(struct inode *inode); static int befs_init_inodecache(void); static void befs_destroy_inodecache(void); -static int befs_follow_link(struct dentry *, struct nameidata *); -static void befs_put_link(struct dentry *, struct nameidata *); +static void *befs_follow_link(struct dentry *, struct nameidata *); +static void befs_put_link(struct dentry *, struct nameidata *, void *); static int befs_utf2nls(struct super_block *sb, const char *in, int in_len, char **out, int *out_len); static int befs_nls2utf(struct super_block *sb, const char *in, int in_len, @@ -487,10 +487,10 @@ } nd_set_link(nd, link); - return 0; + return NULL; } -static void befs_put_link(struct dentry *dentry, struct nameidata *nd) +static void befs_put_link(struct dentry *dentry, struct nameidata *nd, void *p) { befs_inode_info *befs_ino = BEFS_I(dentry-d_inode); if (befs_ino-i_flags BEFS_LONG_SYMLINK) { diff -urN RC13-rc6-git10-base/fs/devfs/base.c current/fs/devfs/base.c --- RC13-rc6-git10-base/fs/devfs/base.c 2005-06-17 15:48:29.0 -0400 +++ current/fs/devfs/base.c 2005-08-19 19:04:17.0 -0400 @@ -2491,11 +2491,11 @@ return 0; } /* End Function devfs_mknod */ -static int devfs_follow_link(struct dentry *dentry, struct nameidata *nd) +static void *devfs_follow_link(struct dentry *dentry, struct nameidata *nd) { struct devfs_entry *p = get_devfs_entry_from_vfs_inode(dentry-d_inode); nd_set_link(nd, p ? p-u.symlink.linkname : ERR_PTR(-ENODEV)); - return 0; + return NULL; } /* End Function devfs_follow_link */ static struct inode_operations devfs_iops = { diff -urN RC13-rc6-git10-base/fs/freevxfs/vxfs_immed.c current/fs/freevxfs/vxfs_immed.c --- RC13-rc6-git10-base/fs/freevxfs/vxfs_immed.c2005-06-17 15:48:29.0
Re: Kernel bug: Bad page state: related to generic symlink code and mmap
On Fri, Aug 19, 2005 at 06:08:12PM -0700, Linus Torvalds wrote: On Sat, 20 Aug 2005, Al Viro wrote: That looks OK except for * ncpfs fix is actually missing here Well, the thing is, with the change to page_follow_link() and page_put_link(), ncpfs should now work fine - it doesn't need any fixing any more. Ah - right, it's using normal methods... - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][2.6 patch] Allow creation of new namespaces during mount system call
On Wed, Apr 20, 2005 at 10:45:58AM +0100, Jamie Lokier wrote: For FUSE, what's needed is that a user can mount something, and the mounted fs is visible only to that user, but it's visible to _all_ of the user's processes. So get that namespace as soon as you log in. We think namespaces are a nice way to do that: making a user-owned filesystem only visible to a user. But the mechanism of CLONE_NEWNS does not work, because it presumes namespace divisions are only propagated over parent-child divisions, like environment variables. What we really want is a mount point that propagates across all the processes owned by one user, but is not there for other users. This is almost certainly bogus. Same user can easily want several different environments set on the same box. - Have a namespace per user. The user's namespace will be entered by the login program somehow. Trivial right now - just have libpam do that. - All logins to the same user acquire the same per-user namespace. This isn't possible at the moment; it would be a kernel extension + administrative change to login. No. Identical setup at login time - sure. Enforced _single_ namespace is just plain wrong. Moreover, all user's processes is the wrong answer to practically any question (well, aside of what processes do you kill when you get rid of luser's account). - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][2.6 patch] Allow creation of new namespaces during mount system call
On Wed, Apr 20, 2005 at 01:03:40PM +0100, Jamie Lokier wrote: It shouldn't be literally per-user - it should be possible for a user to have several environment _when_ they want that. chroot-jail style virtual server environments require that too. But that shouldn't be the only option - because it would be horrible to use. If I login on multiple terminals, I normally want to mount filesystems in /home/jamie/mnt on one terminal, and use them on another. And when you log in on several terminals you usually want same $PATH. You don't do that by sharing VM between shell processes, do you? Sure, that would work with sufficient kernel-side hacks for joining thread group and making e.g. bash multithreaded. Nobody does it though - it doesn't buy you anything really useful. How can libpam join the user's existing namespace? Having a separate usermount-namespace for each login of the same user would not be nice to use. I don't see why. _IF_ you can change the set of mounts after you log in, there's no more need to do any kernel tricks for that stuff than you would need for environment, etc. If you can't - well, the last point where you can get something set up is login with no changes afterwards. In that case everything is just as trivial... - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][2.6 patch] Allow creation of new namespaces during mount system call
On Wed, Apr 20, 2005 at 09:51:26AM -0700, Ram wrote: Reading through the thread I assume the requirement is: 1) A User being able to create his own VFS-mount environment 2) being able to use the same VFS-mount environment from multiple login sessions. 3) Being able to switch some processes to some other VFS-mount environment. Excuse me, but could somebody give coherent rationale for such requirements? _Especially_ for joining existing group by completely unrelated process - something we don't do for any other component of process. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][2.6 patch] Allow creation of new namespaces during mount system call
On Wed, Apr 20, 2005 at 09:43:47PM +0100, Jamie Lokier wrote: Al Viro is right to point out that environment variables are not shared. But he forgets that _files_ are shared. So they are. ~/.profile, for instance ;-) - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][2.6 patch] Allow creation of new namespaces during mount system call
On Wed, Apr 20, 2005 at 07:53:04PM +0200, Miklos Szeredi wrote: The user expects to have the see the same files in all sessions, whether those be local logins, remote logins, ftp/scp/etc sessions. Umm... You know, I would be *very* unhappy if I found that some process on parcelfarce was able to see the contents of ~/.ssh/* from the laptop I'm using right now. Even more so if that would apply to random ftp sewer I happen to use at the moment... If I'm remotely logged into server X from Y, and want to use scp to copy some file from X to Y or vica versa, I will want my private mounts to be visible from the scp. Do you? Really? OK, so I've got ~/bin/ and ~/bin/arch/ in my path on my boxen. The latter has ~/bin/{i386,alpha,sparc,amd64,hppa,ppc} bound on it - depending on the host I'm using. Tell me, why would I want that private mount to be visible when I log in from one box to another? To make sure that wrong binaries would be picked? - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][2.6 patch] Allow creation of new namespaces during mount system call
On Tue, Apr 19, 2005 at 05:13:32PM -0500, Eric Van Hensbergen wrote: The motivation behind this patch is to make private namespaces more accessible by allowing their creation at mount/bind time. Based on some of the FUSE permissions discussions, I wanted to check into modifying the mount system calls -- adding a flag which created a new namespace for the resulting mount. I quickly discovered that what I typically wanted (for the case of running a mount command) was to actually create a new namespace for the parent thread (typically the shell), inherit that namespace, and then perform the mount. Its not clear to me that both options are needed, cloning the parent's namespace seems to be what you want most of the time. In order to minimize code impact I split the copy_namespace function, perhaps the right long term solution is to change it's interface to accommodate the changes. Things look a bit more invasive as I moved the copy_namespace function above do_mount. The patch follows: *UGH* So what happens to those who happen to share task-fs with the parent? - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][2.6 patch] Allow creation of new namespaces during mount system call
On Tue, Apr 19, 2005 at 06:53:29PM -0500, Eric Van Hensbergen wrote: On 4/19/05, Al Viro [EMAIL PROTECTED] wrote: On Tue, Apr 19, 2005 at 05:13:32PM -0500, Eric Van Hensbergen wrote: The motivation behind this patch is to make private namespaces more accessible by allowing their creation at mount/bind time. *UGH* So what happens to those who happen to share task-fs with the parent? Okay, I'll admit to being a bit too hasty with pushing out that patch - I was being particularly myopic looking for a solution only for a command-line mount. Are you generally opposed to new namespace creation at mount time or just my slimy hack? A shared task-fs seems like something which could be easily checked against and disallowed. a) ability to create a private namespace without forking anything - sure, that would be useful. However, that's not something I would push into mount(2) (already overloaded to hell and back). There used to be a kinda-sorta agreement on a new syscall: unshare(bitmap) with arguments like those of clone(2). That's not just for namespaces - e.g. you might legitimately want to unshare VM in a thread and leave the rest alone. Or unshare -fs (i.e. uncouple cwd from the rest of group). Most of the code is already there - do_fork() has to do such stuff anyway. So how about adding sys_unshare(flags) that would do that job? Flags would correspond to those of clone(2), except that all these guys would be what do we unshare instead of what do we leave shared. b) I _really_ don't like the idea of messing with the parent. Make it a shell builtin if you want to affect shell behaviour; the same reason why cd is a builtin and not an external command. c) I would be really, really careful with implications of let user do whatever he wants - that certainly should include bindings and that can create heaps of fun for suid stuff. More comments when I get around to digging through FUSE thread... - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: NFS4 mount problem
On Mon, Apr 18, 2005 at 10:07:14AM -0700, Bryan Henderson wrote: On Fri, Apr 15, 2005 at 01:22:59PM -0700, David S. Miller wrote: Make a -compat_read_super() just like we have a -compat_ioctl() method for files, if you want to suggest a solution like what you describe. I don't think we should encourage filesystem writers to do such stupid things as ncfps/smbfs do. In fact I'm totally unhappy thay nfs4 went down that road. Which road is that? Architecture-dependent blob passed to mount(2) (aka nfs4_mount_data). If you want it to be a blob, at least have a decency to use encoding that would not depend on alignment rules and word size. Hell, you could use XDR - it's not that nfs would need something new to handle it. Or, better yet, use a normal string. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: NFS4 mount problem
On Mon, Apr 18, 2005 at 06:33:09PM +0100, David Howells wrote: Al Viro [EMAIL PROTECTED] wrote: Architecture-dependent blob passed to mount(2) (aka nfs4_mount_data). If you want it to be a blob, at least have a decency to use encoding that would not depend on alignment rules and word size. Hell, you could use XDR - it's not that nfs would need something new to handle it. Or, better yet, use a normal string. Mount doesn't appear to permit a big enough blob though. It has a hard limit of PAGE_SIZE. Excuse me? Would the use of fixed offsets, field sizes and endianness make the blob bigger? And as for the length of string representation going past 4Kb... that could be easily dealt with in sys_mount() if it really becomes a problem. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] shared subtrees
On Sun, Jan 16, 2005 at 11:02:13AM -0500, J. Bruce Fields wrote: On Thu, Jan 13, 2005 at 10:18:51PM +, Al Viro wrote: 6. mount --move prohibited if what we are moving is in some p-node, otherwise we move as usual to intended mountpoint and create copies for everything that gets propagation from there (as we would do for rbind). Why this prohibition? How do you propagate that? We can weaken that to in a p-node that owns something or contains more than one vfsmount, but it's not worth the trouble, AFAICS. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] shared subtrees
On Sun, Jan 16, 2005 at 01:42:09PM -0500, J. Bruce Fields wrote: On Sun, Jan 16, 2005 at 06:06:56PM +, Al Viro wrote: On Sun, Jan 16, 2005 at 11:02:13AM -0500, J. Bruce Fields wrote: On Thu, Jan 13, 2005 at 10:18:51PM +, Al Viro wrote: 6. mount --move prohibited if what we are moving is in some p-node, otherwise we move as usual to intended mountpoint and create copies for everything that gets propagation from there (as we would do for rbind). Why this prohibition? How do you propagate that? We can weaken that to in a p-node that owns something or contains more than one vfsmount, but it's not worth the trouble, AFAICS. I guess I'm not seeing what there is to propagate. If the vfsmount we are moving is mounted under a vfsmount that's in a p-node, then there'd be something to propagate, but since the --move doesn't change the structure of mounts underneath the moved mountpoint, I wouldn't expect any changes to be propagated from it to other mountpoints. I must be missing something fundamental No - I have been missing a typo. Make that if mountpoint of what we are moving - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] shared subtrees
On Sat, Jan 15, 2005 at 07:46:59PM -0500, J. Bruce Fields wrote: On Thu, Jan 13, 2005 at 10:18:51PM +, Al Viro wrote: 2. mount We have a new vfsmount A and want to attach it to mountpoint somewhere in vfsmount B. If B does not belong to any p-node, everything is as usual; A doesn't become a member or slave of any p-node and is simply attached to B. If B belongs to a p-node p, consider all vfsmounts B1,...,Bn that get events propagated from B and all p-nodes p1,...,pk that contain them. * A gets cloned into n copies and these copies (A1,...,An) are attached to corresponding points in B1,...,Bn. * k new p-nodes (q1,...,qk) are created * Ai is contained in qj = Bi is contained in qj Minor typo: looks like that second qj should be pj. ACK - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html