Re: how to show propagation state for mounts
On Wed, 2008-02-20 at 09:31 -0700, Matthew Wilcox wrote:
On Wed, Feb 20, 2008 at 04:04:22PM +, Al Viro wrote:

It's less about the form of representation (after all, we generate poll events when contents of that sucker changes, so one *can* get a consistent snapshot of the entire thing) and more about having it self-contained when we have namespaces in the play. IOW, the data in there should give answers to questions that make sense. "Do events get propagated from this vfsmount I have to that vfsmount I have?" is a meaningful one; ditto for "are events here propagated to somewhere I don't see?" or "are events getting propagated here from somewhere I don't see?".

Why do those last two questions deserve an answer? How will a person's or application's behaviour be affected by whether a change will propagate to something they don't know about and can't see?

Well, I do not want to be surprised to see a mount suddenly show up in my namespace because of some action by some other user in some other namespace. It's going to happen anyway if the namespace is forked off a namespace that had shared mounts in it. However, I would prefer to know in advance the spots (mounts) where such surprises can happen. Also, I would prefer to know how my actions will affect mounts in other namespaces.

RP

- To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: how to show propagation state for mounts
On Wed, 2008-02-20 at 17:27 +0100, Miklos Szeredi wrote:
On Wed, Feb 20, 2008 at 04:39:15PM +0100, Miklos Szeredi wrote:

mountinfo - IMO needs a sane discussion of what and how should be shown wrt propagation state

Here's my take on the matter. The propagation tree can be represented either

1) from root to leaf, listing members of peer groups and their slaves explicitly,
2) or from leaf to root, by identifying each peer group and then for each mount showing the id of its own group and the id of the group's master.

2) can have two variants:

2a) id of peer group is constant in time
2b) id of peer group may change

The current patch does 2b). Having a fixed id for each peer group would mean introducing a new object to anchor the peer group into, which would add complexity to the whole thing. All of these are implementable, just need to decide which one we want.

Eh... Much more interesting question: since the propagation tree spans multiple namespaces in a lot of normal uses, how do we deal with reconstructing propagation through the parts that are not present in our namespace? Moreover, what should and what should not be kept private to namespace? Full exposure of mount trees is definitely over the top (it shows potentially sensitive information), so we probably want less than that. FWIW, my gut feeling is that for each peer group that intersects with our namespace we ought to expose in some form

* all vfsmounts belonging to that intersection
* the nearest dominating peer group (== master (of master ...) of) that also has a non-empty intersection with our namespace

It's less about the form of representation (after all, we generate poll events when contents of that sucker changes, so one *can* get a consistent snapshot of the entire thing) and more about having it self-contained when we have namespaces in the play. IOW, the data in there should give answers to questions that make sense. "Do events get propagated from this vfsmount I have to that vfsmount I have?" is a meaningful one; ditto for "are events here propagated to somewhere I don't see?" or "are events getting propagated here from somewhere I don't see?".

Well, assuming you see only one namespace. When I'm experimenting with namespaces and propagations, I see both (each in a separate xterm) and I do want to know how propagation between them happens. Your suggestion doesn't deal with that problem. Otherwise, yes it makes sense to have a consistent view of the tree shown for each namespace. Perhaps the solution is to restrict viewing the whole tree to privileged processes.

I wonder, what is wrong with reporting mounts in other namespaces that either receive propagation from or send propagation to mounts in our namespace? If we take that approach, we will report **only** the mounts in other namespaces which have a counterpart in our namespace. After all, the filesystems backing the mounts here and there are the same (otherwise they wouldn't have propagated). And any mounts outside our namespace that have no propagation relation to any mount in our namespace will remain hidden.

RP

Miklos
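The leaf-to-root representation 2) and the "nearest dominating peer group" rule quoted above can be sketched in a few lines of user-space C. This is only an illustration, not code from any patch in this thread: struct peer_group, its master and visible members, and nearest_visible_master() are all hypothetical names, and a peer group is assumed to be identified by its index in an array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a peer group (not a kernel structure). */
struct peer_group {
	int master;	/* index of the master group, or -1 if there is none */
	int visible;	/* nonzero if the group has a mount in our namespace */
};

/*
 * Walk from group g towards the root of the propagation tree and return
 * the first master (of master ...) that intersects our namespace, or -1
 * if every dominating group lies entirely outside it.
 */
static int nearest_visible_master(const struct peer_group *groups,
				  size_t n, int g)
{
	int cur;
	size_t steps = 0;

	assert(groups != NULL && g >= 0 && (size_t)g < n);
	cur = groups[g].master;
	/* The step counter guards against a malformed (cyclic) table. */
	while (cur >= 0 && (size_t)cur < n && steps++ < n) {
		if (groups[cur].visible)
			return cur;
		cur = groups[cur].master;
	}
	return -1;
}
```

With such a helper, a group whose direct master lives only in another namespace would still be reported as a slave of the closest ancestor group that the reader can actually see.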
[RFC PATCH] vfs: optimization to /proc/pid/mountinfo patch
1) reports deleted inode in dentry_path() consistent with that in __d_path()
2) modified __d_path() to use prepend(), reducing the size of __d_path()
3) moved all the functionality that reports mount information in /proc under CONFIG_PROC_FS.

Could not verify if the code would work with CONFIG_PROC_FS=n, since it was impossible to disable CONFIG_PROC_FS. Looking for ideas on how to disable CONFIG_PROC_FS.

Signed-off-by: Ram Pai [EMAIL PROTECTED]
---
 fs/dcache.c              | 59 +++
 fs/namespace.c           |  2 +
 fs/seq_file.c            |  2 +
 include/linux/dcache.h   |  3 ++
 include/linux/seq_file.h |  3 ++
 5 files changed, 34 insertions(+), 35 deletions(-)

Index: linux-2.6.23/fs/dcache.c
===================================================================
--- linux-2.6.23.orig/fs/dcache.c
+++ linux-2.6.23/fs/dcache.c
@@ -1747,6 +1747,17 @@ shouldnt_be_hashed:
 	goto shouldnt_be_hashed;
 }
 
+static int prepend(char **buffer, int *buflen, const char *str,
+			int namelen)
+{
+	*buflen -= namelen;
+	if (*buflen < 0)
+		return 1;
+	*buffer -= namelen;
+	memcpy(*buffer, str, namelen);
+	return 0;
+}
+
 /**
  * d_path - return the path of a dentry
  * @dentry: dentry to report
@@ -1768,17 +1779,11 @@ static char *__d_path(struct dentry *den
 {
 	char * end = buffer+buflen;
 	char * retval;
-	int namelen;
 
-	*--end = '\0';
-	buflen--;
-	if (!IS_ROOT(dentry) && d_unhashed(dentry)) {
-		buflen -= 10;
-		end -= 10;
-		if (buflen < 0)
+	prepend(&end, &buflen, "\0", 1);
+	if (!IS_ROOT(dentry) && d_unhashed(dentry) &&
+		prepend(&end, &buflen, " (deleted)", 10))
 			goto Elong;
-		memcpy(end, " (deleted)", 10);
-	}
 
 	if (buflen < 1)
 		goto Elong;
@@ -1805,13 +1810,10 @@ static char *__d_path(struct dentry *den
 		}
 		parent = dentry->d_parent;
 		prefetch(parent);
-		namelen = dentry->d_name.len;
-		buflen -= namelen + 1;
-		if (buflen < 0)
+		if (prepend(&end, &buflen, dentry->d_name.name,
+				dentry->d_name.len) ||
+		    prepend(&end, &buflen, "/", 1))
 			goto Elong;
-		end -= namelen;
-		memcpy(end, dentry->d_name.name, namelen);
-		*--end = '/';
 		retval = end;
 		dentry = parent;
 	}
@@ -1819,12 +1821,9 @@ static char *__d_path(struct dentry *den
 	return retval;
 
 global_root:
-	namelen = dentry->d_name.len;
-	buflen -= namelen;
-	if (buflen < 0)
-		goto Elong;
-	retval -= namelen-1;	/* hit the slash */
-	memcpy(retval, dentry->d_name.name, namelen);
+	retval += 1;	/* hit the slash */
+	if (prepend(&retval, &buflen, dentry->d_name.name, dentry->d_name.len))
+		goto Elong;
 	return retval;
 Elong:
 	return ERR_PTR(-ENAMETOOLONG);
@@ -1890,17 +1889,8 @@ char *dynamic_dname(struct dentry *dentr
 	return memcpy(buffer, temp, sz);
 }
 
-static int prepend(char **buffer, int *buflen, const char *str,
-			int namelen)
-{
-	*buflen -= namelen;
-	if (*buflen < 0)
-		return 1;
-	*buffer -= namelen;
-	memcpy(*buffer, str, namelen);
-	return 0;
-}
 
+#ifdef CONFIG_PROC_FS
 /*
  * Write full pathname from the root of the filesystem into the buffer.
  */
@@ -1910,11 +1900,9 @@ char *dentry_path(struct dentry *dentry,
 	char *retval;
 
 	spin_lock(&dcache_lock);
-	prepend(&end, &buflen, "\0", 1);
-	if (!IS_ROOT(dentry) && d_unhashed(dentry)) {
-		if (prepend(&end, &buflen, "//deleted", 9))
+	if (!IS_ROOT(dentry) && d_unhashed(dentry) &&
+		prepend(&end, &buflen, " (deleted)", 10))
 			goto Elong;
-	}
 	if (buflen < 1)
 		goto Elong;
 	/* Get '/' right */
@@ -1943,6 +1931,7 @@ Elong:
 	spin_unlock(&dcache_lock);
 	return ERR_PTR(-ENAMETOOLONG);
 }
+#endif /* CONFIG_PROC_FS */
 
 /*
  * NOTE! The user-level library version returns a
Index: linux-2.6.23/fs/namespace.c
===================================================================
--- linux-2.6.23.orig/fs/namespace.c
+++ linux-2.6.23/fs/namespace.c
@@ -609,6 +609,7 @@ void mnt_unpin(struct vfsmount *mnt)
 
 EXPORT_SYMBOL(mnt_unpin);
 
+#ifdef CONFIG_PROC_FS
 /* iterator */
 static void *m_start(struct seq_file *m, loff_t *pos)
 {
@@ -795,6 +796,7 @@ const struct seq_operations mountstats_o
 	.stop	= m_stop,
 	.show	= show_vfsstat,
 };
+#endif /* CONFIG_PROC_FS */
 
 /**
  * may_umount_tree - check if a mount tree is busy
Index: linux-2.6.23/fs/seq_file.c
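For readers who want to try the prepend() idea from the patch above outside the kernel, here is a minimal user-space sketch. prepend() follows the logic shown in the patch; build_path() is a hypothetical stand-in for the dentry walk (it takes the component names leaf first, the order in which a walk up the d_parent chain would meet them) and is not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Same logic as the patch's prepend(): copy str in front of *buffer and
 * shrink the remaining space. Returns 1 when the text does not fit.
 */
static int prepend(char **buffer, int *buflen, const char *str, int namelen)
{
	*buflen -= namelen;
	if (*buflen < 0)
		return 1;
	*buffer -= namelen;
	memcpy(*buffer, str, namelen);
	return 0;
}

/*
 * Hypothetical helper: join path components (leaf first) into buf,
 * working backwards from the end of the buffer. Returns the start of the
 * path inside buf, or NULL if buf is too small.
 */
static char *build_path(const char *const *leaf_first, size_t n,
			char *buf, int buflen)
{
	char *end;
	size_t i;

	assert(buf != NULL && buflen >= 0);
	end = buf + buflen;
	if (prepend(&end, &buflen, "", 1))	/* terminating NUL goes in first */
		return NULL;
	if (n == 0)				/* the root itself is just "/" */
		return prepend(&end, &buflen, "/", 1) ? NULL : end;
	for (i = 0; i < n; i++) {
		const char *name = leaf_first[i];

		if (prepend(&end, &buflen, name, (int)strlen(name)) ||
		    prepend(&end, &buflen, "/", 1))
			return NULL;
	}
	return end;
}
```

Because the string grows towards the front of the buffer, no final length needs to be known in advance and nothing has to be reversed afterwards; the returned pointer simply lands somewhere inside buf.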
Re: [patch] vfs: create /proc/pid/mountinfo
On Thu, 2008-01-31 at 10:17 +0100, Miklos Szeredi wrote:
From: Ram Pai [EMAIL PROTECTED]

...snipped...

IDR ids are 'int' but they are always positive (AFAICT), but yeah, maybe this is confusing.

The new exported-to-everyone dentry_path() probably could do with a bit more documentation - it's the sort of thing which people keep on wanting and using.

OK.

How does dentry_path() differ from d_path() and why do we need both and can we get some sharing/consolidation happening here?

d_path displays the path from the rootfs, whereas dentry_path displays the path from the root of that filesystem.

Tried that but not easy, without removing some of the microoptimizations in d_path(), which I'm not sure are really important, but...

This patch was initially developed with Al Viro. He preferred to keep the two functions separate. BTW: this patch owes credit to Al Viro for his initial set of ideas.

Why do d_path() and dentry_path() have differing conventions for displaying a deleted file and can we fix that?

I think Ram chose a different convention in dentry_path() in order to make sure there was no space in the resulting path. But spaces would be escaped anyway, so this isn't really important. So yes, this could be fixed.

My patch was generated about a year or so back, probably using the 2.6.18 code base, which had the //deleted convention. That got copied into my patch. Since then the original code has changed to use the (deleted) convention. Yes, this patch has to be changed to be consistent with the existing code.

This patch adds a lot of code which is, I guess, unused if CONFIG_PROC_FS=n. Fixable?

Yes, good observation. I will send a patch with this optimization and the above-mentioned change.

RP

Possibly yes. A good chunk of namespace.c could be surrounded by an #ifdef, which would save even more than was added by this particular patch.

Thanks, Miklos
Re: [RFC][PATCH] VFS: create /proc/pid/mountinfo
Miklos,

You have removed the code that checked if the peer or master mount was in the same namespace before reporting their corresponding mount-ids. One downside of that approach is that the user will see a mount_id in the output with no corresponding line to explain the details of that mount_id. And could reporting the mount-id of a mount in some other namespace subtly mean an information leak?

One other comment I had received offline from Steve French was that the patch did not consider the following case:

Have you thought about whether this could handle the case in which cifs mounts with a relative path, e.g. currently mount -t cifs //server/share /mnt can not be distinguished from mount -t cifs //server/share/subdirectory /mnt when you run the mount command (i.e. the cifs prefixpath, in this case /subdirectory, is not displayed)

Thanks for driving this patch further, and sorry; I have not been active on this work for a while,
RP

On Sat, 2008-01-19 at 12:05 +0100, Miklos Szeredi wrote:

Seems, most people would be happier with a new file, instead of extending /proc/mounts. This patch is the first attempt at doing that, as well as fixing the issues found in the previous submission.

Thanks, Miklos

---
From: Ram Pai [EMAIL PROTECTED]

/proc/mounts in its current state fails to disambiguate bind mounts, especially when the bind mount is subrooted. Also it does not capture the propagation state of the mounts (shared-subtree). The following patch addresses the problem.

The patch adds '/proc/pid/mountinfo' which contains a superset of the fields in '/proc/pid/mounts'. The following additional fields are added:

mntid -- is a unique identifier of the mount
parent -- the id of the parent mount
major:minor -- value of st_dev for files on that filesystem
dir -- the subdir in the filesystem which forms the root of this mount
propagation-type in the form of propagation_flag[:mntid][,...]

note:
'shared' flag is followed by the mntid of its peer mount
'slave' flag is followed by the mntid of its master mount
'private' flag stands by itself
'unbindable' flag stands by itself

Also mount options are split into two fields, the first containing the per mount flags, the second the per super block options.

Here is a sample cat /proc/mounts after executing the following commands:

mount --bind /mnt /mnt
mount --make-shared /mnt
mount --bind /mnt/1 /var
mount --make-slave /var
mount --make-shared /var
mount --bind /var/abc /tmp
mount --make-unbindable /proc

2 2 0:1 rootfs rootfs / / rw rw private
16 2 98:0 ext2 /dev/root / / rw rw private
17 16 0:3 proc /proc / /proc rw rw unbindable
18 16 0:10 devpts devpts /dev/pts / rw rw private
19 16 98:0 ext2 /dev/root /mnt /mnt rw rw shared:19
20 16 98:0 ext2 /dev/root /mnt/1 /var rw rw shared:21,slave:19
21 16 98:0 ext2 /dev/root /mnt/1/abc /tmp rw rw shared:20,slave:19

For example, the last line indicates that:

1) The mount is a shared mount.
2) Its peer mount is the mount with id 20.
3) It is also a slave mount of the master-mount with the id 19.
4) The filesystem on the device with major/minor number 98:0 and subdirectory mnt/1/abc makes the root directory of this mount.
5) And finally the mount with id 16 is its parent.
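The propagation-type field described above, propagation_flag[:mntid][,...], is simple enough to decode in user space. The following is a rough sketch only, assuming decimal mount IDs as in the sample; struct propagation and parse_propagation() are hypothetical names that appear in no patch in this thread.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical decoded form of one propagation-type field. */
struct propagation {
	int shared;	/* 'shared' flag seen */
	int slave;	/* 'slave' flag seen */
	int is_private;	/* 'private' flag seen */
	int unbindable;	/* 'unbindable' flag seen */
	int peer_id;	/* mntid after "shared:", or -1 */
	int master_id;	/* mntid after "slave:", or -1 */
};

/* Decode e.g. "shared:20,slave:19". Returns 0 on success, -1 on bad input. */
static int parse_propagation(const char *field, struct propagation *out)
{
	assert(field != NULL && out != NULL);
	memset(out, 0, sizeof(*out));
	out->peer_id = out->master_id = -1;
	if (*field == '\0')
		return -1;

	while (*field) {
		size_t len = strcspn(field, ",");	/* length of this flag */

		if (len > 7 && strncmp(field, "shared:", 7) == 0) {
			out->shared = 1;
			out->peer_id = atoi(field + 7);
		} else if (len > 6 && strncmp(field, "slave:", 6) == 0) {
			out->slave = 1;
			out->master_id = atoi(field + 6);
		} else if (len == 7 && strncmp(field, "private", 7) == 0) {
			out->is_private = 1;
		} else if (len == 10 && strncmp(field, "unbindable", 10) == 0) {
			out->unbindable = 1;
		} else {
			return -1;			/* unknown flag */
		}
		field += len;
		if (*field == ',')
			field++;
	}
	return 0;
}
```

Applied to the last sample line, the field shared:20,slave:19 would decode to a shared mount with peer 20 that is also a slave of master 19, matching points 1) to 3) of the walkthrough.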
[EMAIL PROTECTED]:
 - new file, rearrange fields
 - for mount ID's use IDA (from the IDR library) instead of a 32bit counter, which could overflow
 - print canonical ID's (smallest one within the peer group) for peers and master, this is more useful, than a random ID within the same namespace
 - fix a couple of small bugs
 - remove inlines
 - style fixes

Signed-off-by: Ram Pai [EMAIL PROTECTED]
Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---
Index: linux/fs/dcache.c
===================================================================
--- linux.orig/fs/dcache.c	2008-01-18 19:21:38.0 +0100
+++ linux/fs/dcache.c	2008-01-18 19:22:27.0 +0100
@@ -1890,6 +1890,60 @@ char *dynamic_dname(struct dentry *dentr
 	return memcpy(buffer, temp, sz);
 }
 
+static int prepend(char **buffer, int *buflen, const char *str,
+			int namelen)
+{
+	*buflen -= namelen;
+	if (*buflen < 0)
+		return 1;
+	*buffer -= namelen;
+	memcpy(*buffer, str, namelen);
+	return 0;
+}
+
+/*
+ * Write full pathname from the root of the filesystem into the buffer.
+ */
+char *dentry_path(struct dentry *dentry, char *buf, int buflen)
+{
+	char *end = buf + buflen;
+	char *retval;
+
+	spin_lock(&dcache_lock);
+	prepend(&end, &buflen, "\0", 1);
+	if (!IS_ROOT(dentry) && d_unhashed(dentry)) {
+		if (prepend(&end, &buflen, "//deleted", 9))
+			goto Elong;
+	}
+	if (buflen < 1)
+		goto
Re: [RFC][PATCH] VFS: create /proc/pid/mountinfo
On Mon, 2008-01-21 at 22:25 +0100, Miklos Szeredi wrote:

You have removed the code that checked if the peer or master mount was in the same namespace before reporting their corresponding mount-ids. One downside of that approach is the user will see an mount_id in the output with no corresponding line to explain the details of the mount_id.

Before the change, the peer and master ID's were basically randomly chosen from the peers, which means, it wasn't possible to always determine, that two mounts were peers, or that they were slaves to the same peer group. After the change, this is possible, since the peer ID will be the same for all mounts which are peers. This means, that even though the peer ID might be in a different namespace, it is possible to determine all peers within the same namespace by comparing their peer ID's.

I agree with your reasoning on the random id; showing a single id avoids clutter. But my point is, why not show an id for the master or peer residing in the same namespace? Showing an id with no corresponding entry for that id can be puzzling. If no master-mount exists in the same namespace, then print -1, meaning masked. There is always at least one peer-mount in a given namespace, so no issue there.

And reporting the mount-id of a mount is some other namespace could subtly mean information-leak?

I don't think the mount ID itself can be sensitive, it really doesn't contain any information, other than being an identifier.

One other comment I had received offline from Steve French was that the patch did not consider the following case: Have you thought about whether this could handle the case in which cifs mounts with a relative path e.g. currently mount -t cifs //server/share /mnt can not be distinguished from mount -t cifs //server/share/subdirectory /mnt when you run the mount command (ie the cifs prefixpath in this case /subdirectory is not displayed)

Why cifs not displaying '//server/share/subdirectory' as the source of the mount?

Don't know. I have not tried it myself.

RP

Miklos
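The canonical-ID scheme discussed in this exchange (print the smallest mount ID within the peer group, so that all peers show the same value) can be sketched as follows. The group tag used here is purely an assumption for illustration; how the kernel actually links peers is not shown, and struct mount_rec and canonical_peer_id() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flat record of a mount (not a kernel structure). */
struct mount_rec {
	int mnt_id;	/* mount ID as it would be printed */
	int group;	/* assumed tag: mounts with equal tags are peers */
};

/*
 * Return the smallest mnt_id among the mounts tagged with 'group', or -1
 * if no mount in the array carries that tag.
 */
static int canonical_peer_id(const struct mount_rec *mounts, size_t n,
			     int group)
{
	int best = -1;
	size_t i;

	assert(mounts != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (mounts[i].group != group)
			continue;
		if (best < 0 || mounts[i].mnt_id < best)
			best = mounts[i].mnt_id;
	}
	return best;
}
```

Two mounts are then peers exactly when they report the same canonical ID, which is the comparison Miklos describes; the price, as Ram points out, is that the printed ID may belong to a mount that has no line of its own in the reader's namespace.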
[RFC2 PATCH 1/1] VFS: Augment /proc/mount with subroot and shared-subtree
/proc/mounts in its current state fails to disambiguate bind mounts, especially when the bind mount is subrooted. Also it does not capture the propagation state of the mounts (shared-subtree). The following patch addresses the problem.

The following additional fields are added to /proc/mounts:

propagation-type in the form of propagation_flag[:mntid][,...]
note:
'shared' flag is followed by the mntid of its peer mount
'slave' flag is followed by the mntid of its master mount
'private' flag stands by itself
'unbindable' flag stands by itself

mntid -- is a unique identifier of the mount
major:minor -- is the major minor number of the device hosting the filesystem
dir -- the subdir in the filesystem which forms the root of this mount
parent -- the id of the parent mount

Here is a sample cat /proc/mounts after executing the following commands:

mount --bind /mnt /mnt
mount --make-shared /mnt
mount --bind /mnt/1 /var
mount --make-slave /var
mount --make-shared /var
mount --bind /var/abc /tmp
mount --make-unbindable /proc

rootfs / rootfs rw 0 0 private 2 0:1 / 2
/dev/root / ext2 rw 0 0 private 16 98:0 / 2
/proc /proc proc rw 0 0 unbindable 17 0:3 / 16
devpts /dev/pts devpts rw 0 0 private 18 0:10 / 16
/dev/root /mnt ext2 rw 0 0 shared:19 19 98:0 /mnt 16
/dev/root /var ext2 rw 0 0 shared:21,slave:19 20 98:0 /mnt/1 16
/dev/root /tmp ext2 rw 0 0 shared:20,slave:19 21 98:0 /mnt/1/abc 16

For example, the last line indicates that:

1) The mount is a shared mount.
2) Its peer mount is the mount with id 20.
3) It is also a slave mount of the master-mount with the id 19.
4) The filesystem on the device with major/minor number 98:0 and subdirectory mnt/1/abc makes the root directory of this mount.
5) And finally the mount with id 16 is its parent.

Testing: symlinked /etc/mtab to /proc/mounts and did some mount and df commands. They worked normally.
Signed-off-by: Ram Pai [EMAIL PROTECTED]
---
 fs/dcache.c              | 53 +++
 fs/namespace.c           | 35 +++-
 fs/pnode.c               | 22 +
 fs/pnode.h               |  2 +
 fs/seq_file.c            | 79 ++-
 include/linux/dcache.h   |  2 +
 include/linux/mount.h    |  1
 include/linux/seq_file.h |  1
 8 files changed, 172 insertions(+), 23 deletions(-)

Index: linux-2.6.21.5/fs/dcache.c
===================================================================
--- linux-2.6.21.5.orig/fs/dcache.c
+++ linux-2.6.21.5/fs/dcache.c
@@ -1835,6 +1835,59 @@ char * d_path(struct dentry *dentry, str
 	return res;
 }
 
+static inline int prepend(char **buffer, int *buflen, const char *str,
+		int namelen)
+{
+	if ((*buflen -= namelen) < 0)
+		return 1;
+	*buffer -= namelen;
+	memcpy(*buffer, str, namelen);
+	return 0;
+}
+
+/*
+ * write full pathname into buffer and return start of pathname.
+ * If @vfsmnt is not specified return the path relative to the
+ * its filesystem's root.
+ */
+char * dentry_path(struct dentry *dentry, char *buf, int buflen)
+{
+	char * end = buf+buflen;
+	char * retval;
+
+	spin_lock(&dcache_lock);
+	prepend(&end, &buflen, "\0", 1);
+	if (!IS_ROOT(dentry) && d_unhashed(dentry)) {
+		if (prepend(&end, &buflen, "//deleted", 10))
+			goto Elong;
+	}
+	/* Get '/' right */
+	retval = end-1;
+	*retval = '/';
+
+	for (;;) {
+		struct dentry * parent;
+		if (IS_ROOT(dentry))
+			break;
+
+		parent = dentry->d_parent;
+		prefetch(parent);
+
+		if (prepend(&end, &buflen, dentry->d_name.name,
+				dentry->d_name.len) ||
+		    prepend(&end, &buflen, "/", 1))
+			goto Elong;
+
+		retval = end;
+		dentry = parent;
+	}
+	spin_unlock(&dcache_lock);
+	return retval;
+Elong:
+	spin_unlock(&dcache_lock);
+	return ERR_PTR(-ENAMETOOLONG);
+}
+
 /*
  * NOTE! The user-level library version returns a
  * character pointer. The kernel system call just
Index: linux-2.6.21.5/fs/namespace.c
===================================================================
--- linux-2.6.21.5.orig/fs/namespace.c
+++ linux-2.6.21.5/fs/namespace.c
@@ -33,6 +33,8 @@ __cacheline_aligned_in_smp DEFINE_SPINLOCK(vfsmount_lock);
 
 static int event;
 
+static atomic_t mnt_counter;
+
 static struct list_head *mount_hashtable __read_mostly;
 static int hash_mask __read_mostly, hash_bits __read_mostly;
 
@@ -51,6 +53,7 @@ static inline unsigned long hash(struct
 	return tmp & hash_mask;
 }
 
+
 struct vfsmount *alloc_vfsmnt(const char *name)
 {
 	struct vfsmount *mnt = kmem_cache_zalloc(mnt_cache, GFP_KERNEL);
@@ -64,6 +67,7 @@ struct vfsmount *alloc_vfsmnt(const char
Re: [RFC PATCH 1/1] VFS: Augment /proc/mount with subroot and shared-subtree
On Wed, 2007-07-11 at 11:24 +0100, Christoph Hellwig wrote:
On Sat, Jun 30, 2007 at 08:56:02AM -0400, H. Peter Anvin wrote:

Is that conjecture, or do you have evidence to that effect? Most users of this file are using it via the glibc interfaces, and there probably aren't all that many users of it in the first place.

I have written parsers for personal projects that might not have been happy to deal with additional fields myself for example..

I modified the patch to add the fields towards the end of each line, i.e. after the 'freq, passno' fields, and symlinked /etc/mtab to /proc/mounts. mount, df and friends were all perfectly happy. I imagine your script may also be happy with the additional fields **towards the end**. I would like to avoid one more mount interface if we can help it.

RP
[RFC PATCH 1/1] VFS: Augment /proc/mount with subroot and shared-subtree
Please check if the following modified patch meets the requirements. It augments /proc/mount with additional information to

(1) disambiguate bind mounts with subroot information.
(2) display shared-subtree information using which one can determine the propagation trees.

The following additional fields are appended to each record in /proc/mounts:

mntid=id - The unique id associated with that mount.
fsid=id:dir - The filesystem's id and the directory in that filesystem that makes the root directory of this mount.
parent=id - The id of the mount's parent, on which it is mounted.

Also the flags are augmented with new information to indicate the mount's propagation type.

Here is a sample 'cat /proc/mounts' after executing the following commands:

mount --bind /mnt /mnt
mount --make-shared /mnt
mount --bind /mnt/1 /var
mount --make-slave /var
mount --make-shared /mnt
mount --make-unbindable /proc

rootfs / rootfs rw PRIVATE mntid=c1708c30 fsid=1:/ parent=c1708c30 0 0
/dev/root / ext2 rw PRIVATE mntid=c1208c08 fsid=6200:/ parent=c1708c30 0 0
/proc /proc proc rw UNBINDABLE mntid=c1108c90 fsid=3:/ parent=c1208c08 0 0
devpts /dev/pts devpts rw PRIVATE mntid=c1108c18 fsid=a:/ parent=c1208c08 0 0
/dev/root /mnt ext2 rw SHARED:peer=c1e08cb0 mntid=c1e08cb0 fsid=6200:/mnt parent=c1208c08 0 0
/dev/root /var ext2 rw SHARED:peer=c1f08c28 SLAVE:master=c1e08cb0 mntid=c1f08c28 fsid=6200:/mnt/1 parent=c1208c08 0 0

For example, the last line indicates that:

The mount is a shared mount.
Its peer mount is itself (note peer=c1f08c28 is the same mntid as itself).
It is also a slave mount of the mount with the id c1e08cb0.
The filesystem with fsid=6200 and subdirectory mnt/1 makes the root directory of this mount.
And finally the mount with id c1208c08 is its parent.
Signed-off-by: Ram Pai [EMAIL PROTECTED]
---
 fs/dcache.c              | 53 +++
 fs/namespace.c           | 25 ++
 fs/pnode.c               | 22 +
 fs/pnode.h               |  2 +
 fs/seq_file.c            | 79 ++-
 include/linux/dcache.h   |  2 +
 include/linux/seq_file.h |  1
 7 files changed, 162 insertions(+), 22 deletions(-)

Index: linux-2.6.21.5/fs/dcache.c
===================================================================
--- linux-2.6.21.5.orig/fs/dcache.c
+++ linux-2.6.21.5/fs/dcache.c
@@ -1835,6 +1835,59 @@ char * d_path(struct dentry *dentry, str
 	return res;
 }
 
+static inline int prepend(char **buffer, int *buflen, const char *str,
+		int namelen)
+{
+	if ((*buflen -= namelen) < 0)
+		return 1;
+	*buffer -= namelen;
+	memcpy(*buffer, str, namelen);
+	return 0;
+}
+
+/*
+ * write full pathname into buffer and return start of pathname.
+ * If @vfsmnt is not specified return the path relative to the
+ * its filesystem's root.
+ */
+char * dentry_path(struct dentry *dentry, char *buf, int buflen)
+{
+	char * end = buf+buflen;
+	char * retval;
+
+	spin_lock(&dcache_lock);
+	prepend(&end, &buflen, "\0", 1);
+	if (!IS_ROOT(dentry) && d_unhashed(dentry)) {
+		if (prepend(&end, &buflen, "//deleted", 10))
+			goto Elong;
+	}
+	/* Get '/' right */
+	retval = end-1;
+	*retval = '/';
+
+	for (;;) {
+		struct dentry * parent;
+		if (IS_ROOT(dentry))
+			break;
+
+		parent = dentry->d_parent;
+		prefetch(parent);
+
+		if (prepend(&end, &buflen, dentry->d_name.name,
+				dentry->d_name.len) ||
+		    prepend(&end, &buflen, "/", 1))
+			goto Elong;
+
+		retval = end;
+		dentry = parent;
+	}
+	spin_unlock(&dcache_lock);
+	return retval;
+Elong:
+	spin_unlock(&dcache_lock);
+	return ERR_PTR(-ENAMETOOLONG);
+}
+
 /*
  * NOTE! The user-level library version returns a
  * character pointer. The kernel system call just
Index: linux-2.6.21.5/fs/namespace.c
===================================================================
--- linux-2.6.21.5.orig/fs/namespace.c
+++ linux-2.6.21.5/fs/namespace.c
@@ -386,8 +386,31 @@ static int show_vfsmnt(struct seq_file *
 		if (mnt->mnt_flags & fs_infop->flag)
 			seq_puts(m, fs_infop->str);
 	}
-	if (mnt->mnt_sb->s_op->show_options)
+	seq_putc(m, ' ');
+	if (mnt->mnt_sb->s_op->show_options) {
 		err = mnt->mnt_sb->s_op->show_options(m, mnt);
+		seq_putc(m, ' ');
+	}
+	if (IS_MNT_SHARED(mnt)) {
+		seq_printf(m, "%s:peer=%x ", "SHARED",
+			new_encode_dev((int)get_peer_same_ns(mnt)));
+		if (IS_MNT_SLAVE(mnt)) {
+			seq_printf(m, "%s:master=%x ", "SLAVE
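The key=value fields proposed in this message (mntid=, fsid=, parent=) can be picked out of a record without knowing their position in the line. Below is a small sketch under that assumption; get_field() is a hypothetical helper, not something the patch provides, and it only matches keys at the start of a space-separated token.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the value of "key=value" from a space-separated record into out.
 * Returns 0 on success, -1 if the key is absent or out is too small.
 */
static int get_field(const char *line, const char *key,
		     char *out, size_t outlen)
{
	size_t klen;

	assert(line != NULL && key != NULL && out != NULL && outlen > 0);
	klen = strlen(key);
	while (*line) {
		size_t tlen;

		line += strspn(line, " ");	/* skip separators */
		tlen = strcspn(line, " ");	/* length of this token */
		if (tlen > klen && line[klen] == '=' &&
		    strncmp(line, key, klen) == 0) {
			size_t vlen = tlen - klen - 1;

			if (vlen >= outlen)
				return -1;
			memcpy(out, line + klen + 1, vlen);
			out[vlen] = '\0';
			return 0;
		}
		line += tlen;
	}
	return -1;
}
```

Looking fields up by name rather than by column is what would let existing parsers ignore additions, although, as the follow-ups in this thread show, that alone did not settle whether /proc/mounts itself should be extended.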
Re: Adding subroot information to /proc/mounts, or obtaining that through other means
On Thu, 2007-06-21 at 10:31 -0700, H. Peter Anvin wrote:
Ram Pai wrote:

Peter, I am not working on it currently. But i am interested in getting it done. I have the seed set of patches which had Al Viro's ideas incorporated. Infact those patches were sent on lkml 2 months back. Shall we start with those patches?

Okay, so what I see in your patches are:

path-from-root: mount point of the mount from /
path-from-root-of-its-sb: path from its own root dentry.
propagation-flag: SHARED, SLAVE, UNBINDABLE, PRIVATE
peer-mount-id: the mount-id of its peer mount (if this mount is shared)
master-mount-id: the mount-id of its master mount (if this mount is slave)

Other than cosmetic, I don't see anything terribly wrong with this, although getting a flag when the directory is overmounted would be nice. I guess I suggest a single comma-separated field with flags and optional :argument:

private
shared:peer
slave:master
unbindable
overmounted

So we could end up with something like:

rootfs / rootfs rw 0 0 0:1 / 1 private,overmounted

... where 1 is the mnt_id (sequence number).

[Please see my other comments in this thread... basically I believe we should just add fields to /proc/mounts.]

I had two patches. The first patch added a new interface called /proc/mounts_new with the following format:

FSID mntpt root-dentry fstype fs-options

where
FSID is a filesystem unique id
mntpt is the path to the mountpoint
root-dentry is the path to the dentry with respect to the root dentry of the same filesystem
fstype is the filesystem type
fs-options are the mount options used

The second patch made a /proc/propagation interface which had almost the same fields, but also added fields to show the propagation type of the mount as well as pointers to its peers and master, depending on the type of the mount.

I think the consensus seems to be to have a new interface /proc/make-a-name which extends the interface provided by /proc/mounts but also provides the propagation state of the mounts and disambiguates bind mounts. Which makes sense. Why not have something like this?

mnt-id FSID backing-dev mntpt root-dentry fstype comma-separated-fs-options

where one of the fields in the comma-separated-fs-options indicates the propagation type of the mount.

BTW: what is the need for the overmounted flag? Do you mean two vfsmounts mounted on the same dentry on the ***same vfsmount***?

RP

-hpa
Re: Adding subroot information to /proc/mounts, or obtaining that through other means
On Fri, 2007-06-22 at 00:06 -0700, H. Peter Anvin wrote:
Ram Pai wrote:

the second patch made a /proc/propagation interface which had almost the same fields, but also added fields to show the propagation type of the mount as well as pointers to its peers and master depending on the type of the mount. I think the consensus seems to have a new interface /proc/make-a-name which extends the interface provided by /proc/mounts but provides the propagation state of the mounts too as well as disambiguate bind mounts. Which makes sense.

Why? It seems a lot cleaner to have all the information in the same place. It is highly unfriendly to userspace to have to gather information in a lot of places, plus it adds race conditions. It would be another matter if the format that we have now couldn't be extended, but we need those fields (well, except the two zeros, but who cares) *anyway*, so we might as well stick to the existing file, and reduce the total amount of code and clutter.

OK. So you think /proc/mounts can be extended easily without breaking any userspace commands? Well, let's see:

1. To disambiguate bind mounts, we have to add a field that displays the path to the mount's root dentry from the filesystem's root dentry. Agree?
2. For filesystems that do not have a backing store, it becomes hard to disambiguate bind mounts in (1). So we need to add a filesystem-id field.
3. If we need to show the propagation status of the mount, we need a propagation flag added to the output.
4. To be able to construct the propagation tree, we need a way to refer to the other mounts, since some mounts are peers and some other mounts are masters. Which means we need a mount-id field. Agree?

If you agree to the above 4 new fields, it becomes challenging to extend /proc/mounts to incorporate these new fields without breaking any existing applications.

BTW: what is the need for overmounted flag? Do you mean two vfsmounts mounted on the same dentry on the ***same vfsmount*** ?

Maybe I'm not following the uses of your flags well enough to figure out if that information can already been deduced.

With the addition of the 4 fields mentioned above, I think one should easily be able to decipher which mnt-id is mounted on which mnt-id. No? Maybe not. Well, we will have to extend the mountpoint field to indicate the mnt-id in which the mountpoint resides.

RP

-hpa
Re: Adding subroot information to /proc/mounts, or obtaining that through other means
On Wed, 2007-06-20 at 14:20 -0700, H. Peter Anvin wrote:
> Al Viro wrote:
>> On Wed, Jun 20, 2007 at 01:57:33PM -0700, H. Peter Anvin wrote:
>>> ... or, alternatively, add a subfield to the first field (which would entail escaping whatever separator we choose):
>>>
>>> /dev/md6 /export ext3 rw,data=ordered 0 0
>>> /dev/md6:/users/foo /home/foo ext3 rw,data=ordered 0 0
>>> /dev/md6:/users/bar /home/bar ext3 rw,data=ordered 0 0
>>
>> Hell, no. The first field is in principle impossible to parse unless you know the fs type.
>>
>> How about making a new file with sane format? From the very beginning. E.g. mountpoint + ID + relative path + type + options, where ID uniquely identifies superblock (e.g. numeric st_dev) and backing device (if any) is sitting among the options...
>
> Okay, I see there has been some discussion on this earlier, based on a proposal by Ram Pai, so it pretty much comes down to redesigning this right. I see some issues with his proposal (device numbers exported to userspace in text form should be separated into major:minor form, for one thing.)
>
> I know the util-linux-ng people have also had issues with /proc/mounts that they would like resolved in order to finally nuke /etc/mtab.
>
> Is Ram still working on this? I'd like to help make this happen so we can be done with it.

Peter, I am not working on it currently, but I am interested in getting it done. I have the seed set of patches which had Al Viro's ideas incorporated. In fact those patches were sent on lkml 2 months back. Shall we start with those patches?

RP

> -hpa
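hpa's remark about exporting device numbers as major:minor can be illustrated with the 0x6200-style ids used in the patches posted to this list. What follows is a minimal Python sketch, assuming the id was produced by the kernel's new_encode_dev() (minor low byte in bits 0-7, major in bits 8-19, remaining minor bits above that). The helper names split_dev and format_dev are made up for illustration and are not part of any proposed interface.

```python
from typing import Tuple


def split_dev(encoded: int) -> Tuple[int, int]:
    """Split a new_encode_dev()-style device id into (major, minor)."""
    major = (encoded & 0xFFF00) >> 8
    minor = (encoded & 0xFF) | ((encoded >> 12) & 0xFFF00)
    return major, minor


def format_dev(encoded: int) -> str:
    """Render the id in the major:minor text form suggested above."""
    return "%d:%d" % split_dev(encoded)


if __name__ == "__main__":
    for dev in (0x6200, 0x6210, 0x6220):
        print(hex(dev), "->", format_dev(dev))
```

A text interface that prints major:minor directly would spare every userspace parser from repeating this bit-twiddling.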
Re: Adding subroot information to /proc/mounts, or obtaining that through other means
On Thu, 2007-06-21 at 09:29 -0700, H. Peter Anvin wrote:
> Ram Pai wrote:
>> Peter, I am not working on it currently, but I am interested in getting it done. I have the seed set of patches which had Al Viro's ideas incorporated. In fact those patches were sent on lkml 2 months back. Shall we start with those patches?
>
> Are these the unprivileged mount syscall patches?

No, but those patches were sent in the same thread. Karel had provided suggestions which I am yet to incorporate. Give me today. I will send out the patches incorporating the comments later in the evening. Ok?

RP

> Otherwise I don't see any patches in my personal LKML cache (apparently my subscription to fsdevel was dropped at some point, so I don't have a stash of it.)
>
> -hpa
Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag
On Wed, 2007-04-18 at 11:19 +0200, Miklos Szeredi wrote:
>>> Allowing this and other flags to NOT be propagated just makes it possible to have a set of shared mounts with asymmetric properties, which may actually be desirable.
>>
>> The shared mount feature was designed to ensure that the mount remained identical at all the locations.
>
> OK, so remount not propagating mount flags is a bug then?

As I said earlier, are there any flags currently that, if not propagated, can lead to conflicts with the shared subtree semantics? I am not aware of any. If you did notice a case, then according to me it's a bug.

But the newly proposed 'allow unprivileged mounts' flag, if not propagated among peers (and slaves) of a shared mount, can lead to conflicts with shared subtree semantics, since a mount in one shared mount, when propagated to its peer, fails to mount and hence leads to un-identical peers.

>> Now designing features to make it un-identical but still naming it shared, will break its original purpose. Slave mounts were designed to make it asymmetric.
>
> What if I want to modify flags in a master mount, but not the slave mount? Would I be screwed? For example: mount is read-only in both master and slave. I want to mark it read-write in master but not in slave. What do I do?

Making mounts read-only or read-write -- will that affect mount propagation in such a way that future mounts in any one of the peers will not be able to propagate that mount to its peers or slaves? I don't think it will. Hence it's ok to selectively mark some mounts read-only and some mounts read-write.

However, with the introduction of unprivileged mount semantics, there can be cases where a user has privileges to mount at one location but not at a different location. If these two locations happen to share a peer relationship, then I see a case of interference of read-write flag semantics with shared subtree semantics. And hence we will end up propagating the read-write flag too, or have to craft different semantics that stay consistent.

>> Whatever feature that is desired to be exploited; can that be exploited with the current set of semantics that we have? Is there a real need to make the mounts asymmetric but at the same time name them as shared? Maybe I don't understand what the desired application is?
>
> I do think this question of propagating mount flags is totally independent of user mounts.
>
> As it stands, currently remount doesn't propagate mount flags, and I don't see any compelling reasons why it should.
>
> The patchset introduces a new mount flag allowusermnt, but I don't see any compelling reason to propagate this flag _either_.
>
> Please say so if you do have such a reason. As I've explained, having this flag set differently in parts of a propagation tree does not interfere with or break propagation in any way.

As I said earlier, I see a case where two mounts that are peers of each other can become un-identical if we don't propagate the allowusermnt flag.

As a practical example: /tmp and /mnt are peers of each other. /tmp has its allowusermnt flag set, which has not been propagated to /mnt.

Now a normal user mounts an ext2 file system under /tmp at /tmp/1.

Unfortunately the mount won't appear under /mnt/1, and this breaks the shared-subtree semantics which promises: whatever is mounted under /tmp will also be visible under /mnt.

And if you allow the mount to appear under /mnt/1, you will break the unprivileged mount semantics which promises: a normal user will not be able to mount at a location that does not allow user mounts.

RP

> Miklos
Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag
On Wed, 2007-04-18 at 21:14 +0200, Miklos Szeredi wrote:
>> As I said earlier, I see a case where two mounts that are peers of each other can become un-identical if we don't propagate the allowusermnt flag.
>>
>> As a practical example: /tmp and /mnt are peers of each other. /tmp has its allowusermnt flag set, which has not been propagated to /mnt.
>>
>> Now a normal user mounts an ext2 file system under /tmp at /tmp/1.
>>
>> Unfortunately the mount won't appear under /mnt/1
>
> Argh, that is not true. That's what I've been trying to explain to you all along.

I now realize you did, but I failed to catch it. Sorry :-(

> The propagation will be done _regardless_ of the flag. The flag is only checked for the parent of the _requested_ mount. If it is allowed there, the mount, including any propagations, is allowed. If it's denied, then obviously it's denied everywhere.
>
>> And if you allow the mount to appear under /mnt/1, you will break the unprivileged mount semantics which promises: a normal user will not be able to mount at a location that does not allow user mounts.
>
> No, it does not promise that. The flag just promises that the user cannot _request_ a mount on the parent mount.

Ok. If the ability for a normal user to mount something *indirectly* under a mount that has its 'allowusermnt' flag unset is acceptable under the definition of 'allowusermnt', I guess my only choice is to accept it. :-)

RP

> Miklos
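The rule Miklos describes (check the flag once, on the parent of the requested mount, then propagate to the peers unconditionally) can be modelled in a few lines of Python. This is a toy sketch and not kernel code: the Mount class, the make_peers helper and the user_mount function are hypothetical names, and a peer group is modelled as a plain list.

```python
class Mount:
    """Toy stand-in for a vfsmount; not the kernel structure."""

    def __init__(self, path, allowusermnt=False):
        self.path = path
        self.allowusermnt = allowusermnt
        self.peers = [self]   # peer group, including this mount
        self.children = []    # paths of mounts placed under this mount


def make_peers(*mounts):
    """Put all given mounts into one peer group."""
    group = list(mounts)
    for m in mounts:
        m.peers = group


def user_mount(parent, name):
    """Unprivileged request to mount `name` under `parent`."""
    # The flag is checked once, on the parent of the requested mount.
    if not parent.allowusermnt:
        raise PermissionError("user mounts not allowed on " + parent.path)
    # Propagation to the peers is then done regardless of their flags.
    for peer in parent.peers:
        peer.children.append(peer.path + "/" + name)


# The /tmp and /mnt example from the thread.
tmp = Mount("/tmp", allowusermnt=True)
mnt = Mount("/mnt", allowusermnt=False)
make_peers(tmp, mnt)
user_mount(tmp, "1")
```

In this model the peers stay identical (/tmp/1 and /mnt/1 both appear), while a direct request on /mnt is still refused, which is the "indirect" mount Ram ends up accepting.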
Re: How to query mount propagation state?
On Mon, 2007-04-16 at 14:16 -0500, Serge E. Hallyn wrote:
>> This patch introduces a new proc interface that exposes all the propagation trees within the namespace.
>>
>> It walks through each of the mounts in the namespace, and prints the following information.
>>
>> mount-id: a unique mount identifier
>> dev-id: the unique id of the device containing the filesystem
>> path-from-root: mount point of the mount from /
>> path-from-root-of-its-sb: path from its own root dentry.
>> propagation-flag: SHARED, SLAVE, UNBINDABLE, PRIVATE
>> peer-mount-id: the mount-id of its peer mount (if this mount is shared)
>> master-mount-id: the mount-id of its master mount (if this mount is slave)
>>
>> Using the above information one could easily write a script that can draw all the propagation trees in the namespace.
>>
>> Example: Here is a sample output of cat /proc/$$/mounts_propagation
>>
>> 0xa917800 0x1 / / PRIVATE
>> 0xa917200 0x6200 / / PRIVATE
>> 0xa917180 0x3 /proc / PRIVATE
>> 0xa917f80 0xa /dev/pts / PRIVATE
>> 0xa917100 0x6210 /mnt / SHARED peer:0xa917100
>> 0xa917f00 0x6210 /tmp /1 SLAVE master:0xa917100
>> 0xa917900 0x6220 /mnt/2 / SHARED peer:0xa917900
>>
>> line 5 indicates that the mount with id 0xa917100 is mounted at /mnt, is shared, and is the only mount in its peer group.
>>
>> line 6 indicates that the mount with id 0xa917f00 is mounted at /tmp, its root is the dentry 1 present under its root directory. This mount is a slave mount and its master is the mount with id 0xa917100.
>>
>> line 7 indicates that the mount with id 0xa917900 is mounted at /mnt/2, its root is the dentry / of its filesystem. This mount is shared and it is the only mount in its peer group.
>>
>> One could write a script which runs through these lines and draws 4 individual satellite mounts and two propagation trees; the first propagation tree has a shared mount and a slave mount, and the second propagation tree has just one shared mount.
>>
>> Signed-off-by: Ram Pai [EMAIL PROTECTED]
>> ---
>>  fs/namespace.c | 42 ++
>>  fs/pnode.c     |  6 --
>>  fs/pnode.h     |  6 ++
>>  fs/proc/base.c | 22 +-
>>  4 files changed, 69 insertions(+), 7 deletions(-)
>>
>> Index: linux-2.6.17.10/fs/namespace.c
>> ===
>> --- linux-2.6.17.10.orig/fs/namespace.c
>> +++ linux-2.6.17.10/fs/namespace.c
>> @@ -410,6 +410,41 @@ static int show_vfsmnt_new(struct seq_fi
>>  	return show_options(m, v);
>>  }
>> +static int show_vfsmnt_propagation(struct seq_file *m, void *v)
>> +{
>> +	struct vfsmount *mnt = v;
>> +	seq_printf(m, "0x%x", (int)mnt);
>> +	seq_putc(m, ' ');
>> +	seq_printf(m, "0x%x", new_encode_dev(mnt->mnt_sb->s_dev));
>> +	seq_putc(m, ' ');
>> +	seq_path(m, mnt, mnt->mnt_root, " \t\n\\");
>> +	seq_putc(m, ' ');
>> +	seq_dentry(m, mnt->mnt_root, " \t\n\\");
>> +	seq_putc(m, ' ');
>> +
>> +	if (IS_MNT_SHARED(mnt)) {
>> +		seq_printf(m, "%s ", "SHARED");
>> +		if (IS_MNT_SLAVE(mnt)) {
>> +			seq_printf(m, "%s ", "SLAVE");
>> +		}
>> +	} else if (IS_MNT_SLAVE(mnt)) {
>> +		seq_printf(m, "%s ", "SLAVE");
>> +	} else if (IS_MNT_UNBINDABLE(mnt)) {
>> +		seq_printf(m, "%s ", "UNBINDABLE");
>> +	} else {
>> +		seq_printf(m, "%s ", "PRIVATE");
>> +	}
>> +
>> +	if (IS_MNT_SHARED(mnt)) {
>> +		seq_printf(m, "peer:0x%x ", (int)next_peer(mnt));
>
> Ok, so if the sequence of events was
>
> mount --make-shared /mnt
> (some user logs in and gets a cloned namespace, so his /mnt becomes the next peer of /mnt)
> mount --bind /mnt /tmp
> (some other user logs in and gets cloned namespace...)
>
> or some such sequence of events, we could lose all information about /mnt and /tmp being peers, right? Should a new next_peer_in_same_namespace(mnt) be used rather than next_peer()?

You are right. It should print next_peer(mnt) only if CAP_SYS_ADMIN, else print next_peer_in_same_namespace(mnt).

> Somewhat similarly,
>
>> +	}
>> +	if (IS_MNT_SLAVE(mnt)) {
>> +		seq_printf(m, "master:0x%x ", (int)mnt->mnt_master);
>
> Should we for privacy reasons not print out the address if mnt->mnt_master is in a different namespace (perhaps if !CAP_SYS_ADMIN)?

Right. It should print mnt->mnt_master if (CAP_SYS_ADMIN), otherwise print master_in_same_namespace(mnt).

RP

> Otherwise I like this.
>
> thanks,
> -serge
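The patch description says one could write a script that runs through these lines and draws the propagation trees. A minimal Python sketch of the grouping step follows, assuming the seven-line sample output quoted in this message and the proposed column layout (mount-id, dev-id, mountpoint, root, flags, optional peer:/master: pointers). This is the format proposed in the patch, not an existing kernel interface, and the names parse and propagation_trees are made up for illustration.

```python
SAMPLE = """\
0xa917800 0x1 / / PRIVATE
0xa917200 0x6200 / / PRIVATE
0xa917180 0x3 /proc / PRIVATE
0xa917f80 0xa /dev/pts / PRIVATE
0xa917100 0x6210 /mnt / SHARED peer:0xa917100
0xa917f00 0x6210 /tmp /1 SLAVE master:0xa917100
0xa917900 0x6220 /mnt/2 / SHARED peer:0xa917900
"""


def parse(text):
    """Return {mount_id: record} for each line of the proposed format."""
    mounts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        rec = {"id": fields[0], "dev": fields[1], "mountpoint": fields[2],
               "root": fields[3], "flags": [], "peer": None, "master": None}
        for tok in fields[4:]:
            if tok.startswith("peer:"):
                rec["peer"] = tok[len("peer:"):]
            elif tok.startswith("master:"):
                rec["master"] = tok[len("master:"):]
            else:
                rec["flags"].append(tok)
        mounts[rec["id"]] = rec
    return mounts


def propagation_trees(mounts):
    """Group mounts linked by peer:/master: pointers (union-find).

    Returns (trees, private): trees is a sorted list of sorted mountpoint
    lists, private is the list of mountpoints of purely private mounts.
    """
    parent = {mid: mid for mid in mounts}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for mid, rec in mounts.items():
        for other in (rec["peer"], rec["master"]):
            if other in parent:   # ids not visible in this namespace are skipped
                parent[find(mid)] = find(other)

    groups = {}
    for mid in mounts:
        groups.setdefault(find(mid), []).append(mid)

    trees, private = [], []
    for members in groups.values():
        if all(mounts[m]["flags"] == ["PRIVATE"] for m in members):
            private.extend(mounts[m]["mountpoint"] for m in members)
        else:
            trees.append(sorted(mounts[m]["mountpoint"] for m in members))
    return sorted(trees), private


if __name__ == "__main__":
    trees, private = propagation_trees(parse(SAMPLE))
    print("private mounts:", private)
    print("propagation trees:", trees)
```

Pointers to mounts outside the listing are ignored here, which is one way a script could cope with the in-namespace-only output discussed in the review comments.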
Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag
On Tue, 2007-04-17 at 19:44 +0200, Miklos Szeredi wrote:
>> I'm a bit lost about what is currently done and who advocates for what.
>>
>> It seems to me the MNT_ALLOWUSERMNT (or whatever :) flag should be propagated. In the /share rbind+chroot example, I assume the admin would start by doing
>>
>> mount --bind /share /share
>> mount --make-slave /share
>> mount --bind -o allow_user_mounts /share (or whatever)
>> mount --make-shared /share
>>
>> then on login, pam does
>>
>> chroot /share/$USER
>>
>> or some sort of
>>
>> mount --bind /share /home/$USER/root
>> chroot /home/$USER/root
>>
>> or whatever. In any case, the user cannot make user mounts except under /share, and any cloned namespaces will still allow user mounts.
>
> I don't quite understand your method. This is how I think of it:
>
> mount --make-rshared /
> mkdir -p /mnt/ns/$USER
> mount --rbind / /mnt/ns/$USER
> mount --make-rslave /mnt/ns/$USER
> mount --set-flags --recursive -oallowusermnt /mnt/ns/$USER
> chroot /mnt/ns/$USER su - $USER
>
> I did actually try something equivalent (without the fancy mount commands though), and it worked fine. The only problem is the proliferation of mounts in /proc/mounts. There was a recently posted patch in AppArmor, that at least hides unreachable mounts from /proc/mounts, so the user wouldn't see all those. But it could still be pretty confusing to the sysadmin.

Unbindable mounts were designed to overcome the proliferation problem.

Your steps should be something like this:

mount --make-rshared /
mkdir -p /mnt/ns
mount --bind /mnt/ns /mnt/ns
mount --make-unbindable /mnt/ns
mkdir -p /mnt/ns/$USER
mount --rbind / /mnt/ns/$USER
mount --make-rslave /mnt/ns/$USER
mount --set-flags --recursive -oallowusermnt /mnt/ns/$USER
chroot /mnt/ns/$USER su - $USER

Try this and your proliferation problem will disappear. :-)

> So in that sense doing it the complicated way, by first cloning the namespace, and then copying and sharing mounts individually which need to be shared could relieve this somewhat.

The unbindable mount will just provide you permanent relief.

> Another point: user mounts under /proc and /sys shouldn't be allowed. There are files there (at least in /proc) that are seemingly writable by the user, but they are still not writable in the sense that normal files are.
>
> Anyway, there are lots of userspace policy issues, but those don't impact the kernel part.
>
> As for the original question of propagating the allowusermnt flag, I think it doesn't matter, as long as it's consistent and documented.
>
> Propagating some mount flags and not propagating others is inconsistent and confusing, so I wouldn't want that. Currently remount doesn't propagate mount flags, that may be a bug,

For consistency reasons, one can propagate all the flags. But propagating only those flags that interfere with shared-subtree semantics should suffice.

Wait... Dave's read-only bind mounts in fact need the ability to selectively make some mounts read-only. In such cases propagating the read-only flag will just step on Dave's feature. Won't it?

RP

> Miklos
Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag
On Tue, 2007-04-17 at 21:43 +0200, Miklos Szeredi wrote:
>>>> I'm a bit lost about what is currently done and who advocates for what.
>>>>
>>>> It seems to me the MNT_ALLOWUSERMNT (or whatever :) flag should be propagated. In the /share rbind+chroot example, I assume the admin would start by doing
>>>>
>>>> mount --bind /share /share
>>>> mount --make-slave /share
>>>> mount --bind -o allow_user_mounts /share (or whatever)
>>>> mount --make-shared /share
>>>>
>>>> then on login, pam does
>>>>
>>>> chroot /share/$USER
>>>>
>>>> or some sort of
>>>>
>>>> mount --bind /share /home/$USER/root
>>>> chroot /home/$USER/root
>>>>
>>>> or whatever. In any case, the user cannot make user mounts except under /share, and any cloned namespaces will still allow user mounts.
>>>
>>> I don't quite understand your method. This is how I think of it:
>>>
>>> mount --make-rshared /
>>> mkdir -p /mnt/ns/$USER
>>> mount --rbind / /mnt/ns/$USER
>>> mount --make-rslave /mnt/ns/$USER
>>> mount --set-flags --recursive -oallowusermnt /mnt/ns/$USER
>>> chroot /mnt/ns/$USER su - $USER
>>>
>>> I did actually try something equivalent (without the fancy mount commands though), and it worked fine. The only problem is the proliferation of mounts in /proc/mounts. There was a recently posted patch in AppArmor, that at least hides unreachable mounts from /proc/mounts, so the user wouldn't see all those. But it could still be pretty confusing to the sysadmin.
>>
>> Unbindable mounts were designed to overcome the proliferation problem.
>>
>> Your steps should be something like this:
>>
>> mount --make-rshared /
>> mkdir -p /mnt/ns
>> mount --bind /mnt/ns /mnt/ns
>> mount --make-unbindable /mnt/ns
>> mkdir -p /mnt/ns/$USER
>> mount --rbind / /mnt/ns/$USER
>> mount --make-rslave /mnt/ns/$USER
>> mount --set-flags --recursive -oallowusermnt /mnt/ns/$USER
>> chroot /mnt/ns/$USER su - $USER
>>
>> Try this and your proliferation problem will disappear. :-)
>
> Right, this is needed.
>
> My problem wasn't actually this (which would only have hit if I tried with more than one user), just that the number of mounts in /proc/mounts grows linearly with the number of users.
>
> That can't be helped in such an easy way unfortunately.
>
>>> Propagating some mount flags and not propagating others is inconsistent and confusing, so I wouldn't want that. Currently remount doesn't propagate mount flags, that may be a bug,
>>
>> For consistency reasons, one can propagate all the flags. But propagating only those flags that interfere with shared-subtree semantics should suffice.
>
> I still don't believe not propagating allowusermnt interferes with mount propagation. In my posted patches the mount (including propagations) is allowed based on the allowusermnt flag on the parent of the requested mount. The flag is _not_ checked during propagation.
>
> Allowing this and other flags to NOT be propagated just makes it possible to have a set of shared mounts with asymmetric properties, which may actually be desirable.

The shared mount feature was designed to ensure that the mount remained identical at all the locations. Now designing features to make it un-identical but still naming it shared will break its original purpose. Slave mounts were designed to make it asymmetric.

Whatever feature is desired to be exploited, can it be exploited with the current set of semantics that we have? Is there a real need to make the mounts asymmetric but at the same time name them as shared? Maybe I don't understand what the desired application is?

RP

> Miklos
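The difference between the two recipes can be seen with a deliberately crude count of mounts. This is a simplified Python model, not a description of kernel behaviour: it assumes each `mount --rbind / /mnt/ns/$USER` copies every bindable mount currently in the tree, and it ignores the additional copies created by propagation between shared mounts. The function names and the starting count of 5 mounts are made up for illustration.

```python
def mounts_without_unbindable(base, users):
    """Each rbind of / also copies the trees made for earlier users."""
    total = base
    for _ in range(users):
        total += total        # the whole current tree is duplicated
    return total


def mounts_with_unbindable(base, users):
    """/mnt/ns is unbindable, so each rbind copies only the base mounts."""
    total = base + 1          # the extra bind mount on /mnt/ns itself
    for _ in range(users):
        total += base         # earlier users' trees are pruned from the copy
    return total


if __name__ == "__main__":
    for n in range(1, 5):
        print(n, mounts_without_unbindable(5, n), mounts_with_unbindable(5, n))
```

Under these assumptions the first recipe doubles the mount count with every user, while the unbindable variant still grows, but only linearly, which is the residual growth Miklos points out.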
Re: [patch 0/8] unprivileged mount syscall
On Fri, 2007-04-13 at 13:58 +0200, Miklos Szeredi wrote:
>> On Wed, 2007-04-11 at 12:44 +0200, Miklos Szeredi wrote:
>>>> 1. clone the master namespace.
>>>>
>>>> 2. in the new namespace
>>>>
>>>>    move the tree under /share/$me to /
>>>>    for each ($user, $what, $how) {
>>>>        move /share/$user/$what to /$what
>>>>        if ($how == slave) {
>>>>            make the mount tree under /$what as slave
>>>>        }
>>>>    }
>>>>
>>>> 3. in the new namespace make the tree under /share as private and unmount /share
>>>
>>> Thanks. I get the basic idea now: the namespace itself need not be shared between the sessions, it is enough if share propagation is set up between the different namespaces of a user.
>>>
>>> I don't yet see either in your or Viro's description how the trees under /share/$USER are initialized. I guess they are recursively bound from /, and are made slaves.
>>
>> Yes. I suppose, when a userid is created, one of the steps would be
>>
>> mount --rbind / /share/$USER
>> mount --make-rslave /share/$USER
>> mount --make-rshared /share/$USER
>
> Thinking a bit more about this, I'm quite sure most users wouldn't even want private namespaces. It would be enough to chroot /share/$USER and be done with it.
>
> Private namespaces are only good for keeping a bunch of mounts referenced by a group of processes. But my guess is that the natural behavior for users is to see a persistent set of mounts.
>
> If for example they mount something on a remote machine, then log out from the ssh session and later log back in, they would want to see their previous mount still there.

They will continue to see their previous mount tree. Even if all the namespaces belonging to the different sessions of the user get dismantled when all the sessions exit, a mirror of those mount trees continues to exist under /share/$USER in the original namespace. So I don't think we have an issue.

NOTE: when I say 'original namespace' I mean the admin namespace; the first namespace that gets created when the machine boots.

RP

> Miklos
Re: [patch 0/8] unprivileged mount syscall
On Fri, 2007-04-13 at 16:05 +0200, Miklos Szeredi wrote:
>>> Thinking a bit more about this, I'm quite sure most users wouldn't even want private namespaces. It would be enough to chroot /share/$USER and be done with it.
>>>
>>> Private namespaces are only good for keeping a bunch of mounts referenced by a group of processes. But my guess is that the natural behavior for users is to see a persistent set of mounts.
>>>
>>> If for example they mount something on a remote machine, then log out from the ssh session and later log back in, they would want to see their previous mount still there.
>>>
>>> Miklos
>>
>> Agreed on desired behavior, but not on chroot sufficing. It actually sounds like you want exactly what was outlined in the OLS paper.
>>
>> Users still need to be in a different mounts namespace from the admin user so long as we consider the deluser and backup problems
>
> I don't think it matters, because /share/$USER duplicates a part or the whole of the user's namespace.
>
> So backup would have to be taught about /share anyway, and deluser operates on /home/$USER and not on /share/*, so there shouldn't be any problem.
>
> There's actually very little difference between rbind+chroot, and CLONE_NEWNS. In a private namespace:
>
> 1) when no more processes reference the namespace, the tree will be disbanded
>
> 2) the mount tree won't be accessible from outside the namespace
>
> Wanting a persistent namespace contradicts 1).
>
> Wanting a per-user (as opposed to per-session) namespace contradicts 2). The namespace _has_ to be accessible from outside, so that a new session can access/copy it.

As I mentioned in the previous mail, disbanding all the namespaces of a user will not disband his mount tree, because a mirror of the mount tree still continues to exist in /share/$USER in the admin namespace.

And a new user session can always use this copy to create a namespace that looks identical to the one that existed earlier.

> So both requirements point to the rbind/chroot solution.

Aren't there ways to escape chroot jails? Serge had pointed me to a URL which showed chroots can be escaped. And if that is true, then having all users' private mount trees in the same namespace can be a security issue?

RP

> Miklos
Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag
> Serge E. Hallyn [EMAIL PROTECTED] writes:
>> Quoting Miklos Szeredi ([EMAIL PROTECTED]):
>>> From: Miklos Szeredi [EMAIL PROTECTED]
>>>
>>> If CLONE_NEWNS and CLONE_NEWNS_USERMNT are given to clone(2) or unshare(2), then allow user mounts within the new namespace.
>>>
>>> This is not flexible enough, because user mounts can't be enabled for the initial namespace.
>>>
>>> The remaining clone bits are also getting dangerously few...
>>>
>>> Alternatives are:
>>>
>>> - prctl() flag
>>> - setting through the containers filesystem
>>
>> Sorry, I know I had mentioned it, but this is definitely my least favorite approach.
>>
>> Curious whether there are any other suggestions/opinions from the containers list?
>
> Given the existence of shared subtrees, allowing/denying this at the mount namespace level is silly and wrong.
>
> If we need more than just the filesystem permission checks, can we make it a mount flag, settable with mount and remount, that allows non-privileged users the ability to create mount points under it in directories they have full read/write access to?

Also, for bind-mount and remount operations the flag has to be propagated down its propagation tree. Otherwise an unprivileged mount in a shared mount won't get reflected in its peers and slaves, leading to un-identical shared subtrees.

RP

> I don't like the use of clone flags for this purpose, but in this case the shared subtrees are a much more fundamental reason for not doing this at the namespace level.
>
> Eric
>
> ___
> Containers mailing list
> [EMAIL PROTECTED]
> https://lists.linux-foundation.org/mailman/listinfo/containers
Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag
On Mon, 2007-04-16 at 11:32 +0200, Miklos Szeredi wrote:
>>> Given the existence of shared subtrees, allowing/denying this at the mount namespace level is silly and wrong.
>>>
>>> If we need more than just the filesystem permission checks, can we make it a mount flag, settable with mount and remount, that allows non-privileged users the ability to create mount points under it in directories they have full read/write access to?
>>
>> Also, for bind-mount and remount operations the flag has to be propagated down its propagation tree. Otherwise an unprivileged mount in a shared mount won't get reflected in its peers and slaves, leading to un-identical shared subtrees.
>
> That's an interesting question. Do we want shared mounts to be totally identical, including mnt_flags? It doesn't look as if do_remount() guarantees that currently.

Depends on the semantics of each of the flags. Some flags, like the read/write flag, would not interfere with the propagation semantics AFAICT. But this one certainly seems to interfere.

RP

> Miklos
Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag
On Mon, 2007-04-16 at 11:56 +0200, Miklos Szeredi wrote:
>>>> Also, for bind-mount and remount operations the flag has to be propagated down its propagation tree. Otherwise an unprivileged mount in a shared mount won't get reflected in its peers and slaves, leading to un-identical shared subtrees.
>>>
>>> That's an interesting question. Do we want shared mounts to be totally identical, including mnt_flags? It doesn't look as if do_remount() guarantees that currently.
>>
>> Depends on the semantics of each of the flags. Some flags, like the read/write flag, would not interfere with the propagation semantics AFAICT. But this one certainly seems to interfere.
>
> That depends. Current patches check the "unprivileged submounts allowed under this mount" flag only on the requested mount and not on the propagated mounts. Do you see a problem with this?

I don't see a problem if the flag is propagated to all peers and slave mounts.

If not, I see a problem. What if the propagated mount has its flag set to not allow unprivileged mounts, whereas the requested mount has it allowed?

RP

> Miklos
Re: How to query mount propagation state?
On Mon, 2007-04-16 at 12:34 +0200, Miklos Szeredi wrote:
> Currently one of the difficulties with mount propagations is that there's no way to know the current state of the propagation tree.
>
> Has anyone thought about how this info could be queried from userspace?

I am attaching two patches that I had done way back in Oct 2006 with Al Viro. I had sent these patches to Al Viro. But I forgot to follow them up, and I guess so did Al Viro.

The first patch disambiguates multiple mount-instances of the same filesystem (or part of the same filesystem), by introducing a new interface /proc/mounts_new.

The second patch introduces a new proc interface that exposes all the propagation trees within a namespace. It does not show propagated mounts residing in a different namespace (for privacy reasons). Maybe one could modify the patch a little to allow it if the user has root privileges.

RP

PS: Sorry these are attachments instead of inline patches. I am scared of inlining in evolution. If needed I can send inline patches through mutt.

> Thanks,
> Miklos

This patch disambiguates multiple mount-instances of the same filesystem (or part of the same filesystem), by introducing a new interface /proc/mounts_new. The interface has the following format.

FSID mntpt root-dentry fstype fs-options

NOTE: root-dentry is the path to the dentry w.r.t. the root dentry of the same filesystem.

For example, let's say we attempt the following commands:

mount --bind /var /mnt
mount --bind /mnt/tmp /tmp1

'cat /proc/mounts' shows the following:

/dev/root /mnt ext2 rw 0 0
/dev/root /tmp1 ext2 rw 0 0

NOTE: The above mount entries do not indicate that /tmp1 contains the same directory tree as /var/tmp.

But 'cat /proc/mounts_new' shows us the following:

0x6200 /mnt /var ext2 rw 0 0
0x6200 /tmp1 /var/tmp ext2 rw 0 0

The above entries clearly indicate that the /var/tmp directory of the ext2 filesystem with fsid=0x6200 is the directory tree that resides under /tmp1.

Signed-off-by: Ram Pai [EMAIL PROTECTED]
---
 fs/dcache.c              | 53
 fs/namespace.c           | 35 ++---
 fs/proc/base.c           | 32 +--
 fs/proc/proc_misc.c      |  1
 fs/seq_file.c            | 77 ++-
 include/linux/dcache.h   |  1
 include/linux/seq_file.h |  1
 7 files changed, 172 insertions(+), 28 deletions(-)

Index: linux-2.6.17.10/fs/proc/base.c
===
--- linux-2.6.17.10.orig/fs/proc/base.c
+++ linux-2.6.17.10/fs/proc/base.c
@@ -104,6 +104,7 @@ enum pid_directory_inos {
 	PROC_TGID_MAPS,
 	PROC_TGID_NUMA_MAPS,
 	PROC_TGID_MOUNTS,
+	PROC_TGID_MOUNTS_NEW,
 	PROC_TGID_MOUNTSTATS,
 	PROC_TGID_WCHAN,
 #ifdef CONFIG_MMU
@@ -145,6 +146,7 @@ enum pid_directory_inos {
 	PROC_TID_MAPS,
 	PROC_TID_NUMA_MAPS,
 	PROC_TID_MOUNTS,
+	PROC_TID_MOUNTS_NEW,
 	PROC_TID_MOUNTSTATS,
 	PROC_TID_WCHAN,
 #ifdef CONFIG_MMU
@@ -203,6 +205,7 @@ static struct pid_entry tgid_base_stuff[
 	E(PROC_TGID_ROOT, "root", S_IFLNK|S_IRWXUGO),
 	E(PROC_TGID_EXE, "exe", S_IFLNK|S_IRWXUGO),
 	E(PROC_TGID_MOUNTS, "mounts", S_IFREG|S_IRUGO),
+	E(PROC_TGID_MOUNTS_NEW, "mounts_new", S_IFREG|S_IRUGO),
 	E(PROC_TGID_MOUNTSTATS, "mountstats", S_IFREG|S_IRUSR),
 #ifdef CONFIG_MMU
 	E(PROC_TGID_SMAPS, "smaps", S_IFREG|S_IRUGO),
@@ -246,6 +249,7 @@ static struct pid_entry tid_base_stuff[]
 	E(PROC_TID_ROOT, "root", S_IFLNK|S_IRWXUGO),
 	E(PROC_TID_EXE, "exe", S_IFLNK|S_IRWXUGO),
 	E(PROC_TID_MOUNTS, "mounts", S_IFREG|S_IRUGO),
+	E(PROC_TID_MOUNTS_NEW, "mounts_new", S_IFREG|S_IRUGO),
 #ifdef CONFIG_MMU
 	E(PROC_TID_SMAPS, "smaps", S_IFREG|S_IRUGO),
 #endif
@@ -692,13 +696,13 @@ static struct file_operations proc_smaps
 };
 #endif
 
-extern struct seq_operations mounts_op;
 struct proc_mounts {
 	struct seq_file m;
 	int event;
 };
 
-static int mounts_open(struct inode *inode, struct file *file)
+static int __mounts_open(struct inode *inode, struct file *file,
+			 struct seq_operations *mounts_op)
 {
 	struct task_struct *task = proc_task(inode);
 	struct namespace *namespace;
@@ -716,7 +720,7 @@ static int mounts_open(struct inode *ino
 		p = kmalloc(sizeof(struct proc_mounts), GFP_KERNEL);
 		if (p) {
 			file->private_data = &p->m;
-			ret = seq_open(file, &mounts_op);
+			ret = seq_open(file, mounts_op);
 			if (!ret) {
 				p->m.private = namespace;
 				p->event = namespace->event;
@@ -729,6 +733,16 @@ static int mounts_open(struct inode *ino
 	return ret;
 }
+extern struct seq_operations mounts_op, mounts_new_op;
+static int mounts_open(struct inode *inode, struct file *file)
+{
+	return (__mounts_open(inode, file, &mounts_op));
+}
+static int mounts_new_open(struct inode *inode, struct file *file)
+{
+	return __mounts_open(inode, file, &mounts_new_op
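The /proc/mounts_new lines in the patch description carry enough information to tell which directory of which filesystem is visible at each mountpoint. A small Python sketch follows, assuming the seven-column layout of the example output (FSID, mountpoint, root dentry, fstype, options, and the two trailing zeros). The names parse_mounts_new and resolve are hypothetical helpers, and the format is the one proposed in this message, not an existing kernel file.

```python
import posixpath

SAMPLE = """\
0x6200 /mnt /var ext2 rw 0 0
0x6200 /tmp1 /var/tmp ext2 rw 0 0
"""


def parse_mounts_new(text):
    """Parse lines of the proposed 'FSID mntpt root-dentry fstype options' format."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        fsid, mntpt, root, fstype, opts = fields[:5]
        entries.append({"fsid": fsid, "mntpt": mntpt, "root": root,
                        "fstype": fstype, "opts": opts.split(",")})
    return entries


def resolve(entries, path):
    """Map a path in the namespace to (fsid, path inside that filesystem)."""
    best = None
    for e in entries:
        mp = e["mntpt"]
        if path == mp or path.startswith(mp.rstrip("/") + "/"):
            # Prefer the longest matching mountpoint.
            if best is None or len(mp) > len(best["mntpt"]):
                best = e
    if best is None:
        return None
    rel = posixpath.relpath(path, best["mntpt"])
    inside = best["root"] if rel == "." else posixpath.join(best["root"], rel)
    return best["fsid"], inside


if __name__ == "__main__":
    entries = parse_mounts_new(SAMPLE)
    print(resolve(entries, "/tmp1/x"))
    print(resolve(entries, "/mnt/tmp/x"))
```

Two namespace paths that resolve to the same (fsid, path) pair refer to the same directory tree, which is exactly what the plain /proc/mounts output in the example cannot show.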
Re: [patch 0/8] unprivileged mount syscall
On Wed, 2007-04-11 at 12:44 +0200, Miklos Szeredi wrote:
> > 1. clone the master namespace.
> >
> > 2. in the new namespace
> >      move the tree under /share/$me to /
> >      for each ($user, $what, $how) {
> >          move /share/$user/$what to /$what
> >          if ($how == slave) {
> >              make the mount tree under /$what as slave
> >          }
> >      }
> >
> > 3. in the new namespace make the tree under /share as private and unmount /share
>
> Thanks. I get the basic idea now: the namespace itself need not be shared between the sessions, it is enough if share propagation is set up between the different namespaces of a user.
>
> I don't yet see either in your or Viro's description how the trees under /share/$USER are initialized. I guess they are recursively bound from /, and are made slaves.

Yes. I suppose, when a userid is created, one of the steps would be

    mount --rbind / /share/$USER
    mount --make-rslave /share/$USER
    mount --make-rshared /share/$USER

RP

> Miklos
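For readers who want to see the three mount(8) invocations above expressed through mount(2), here is a minimal sketch. It assumes a glibc <sys/mount.h> that defines MS_SHARED and MS_SLAVE; struct mount_step, share_setup_steps() and apply_steps() are names made up for this illustration and are not part of any patch in this thread.

```c
#include <stddef.h>
#include <sys/mount.h>

/* Hypothetical helper type: one mount(2) call per step. */
struct mount_step {
	const char *source;	/* NULL for pure propagation changes */
	const char *target;
	unsigned long flags;
};

/*
 * Fill steps[0..2] with the mount(2) equivalent of
 *   mount --rbind / target
 *   mount --make-rslave target
 *   mount --make-rshared target
 * and return the number of steps written.
 */
static size_t share_setup_steps(const char *target, struct mount_step steps[3])
{
	steps[0] = (struct mount_step){ "/", target, MS_BIND | MS_REC };
	steps[1] = (struct mount_step){ NULL, target, MS_SLAVE | MS_REC };
	steps[2] = (struct mount_step){ NULL, target, MS_SHARED | MS_REC };
	return 3;
}

/* Apply the steps in order; stop at the first failure (errno is set). */
static int apply_steps(const struct mount_step *steps, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (mount(steps[i].source, steps[i].target, NULL,
			  steps[i].flags, NULL) != 0)
			return -1;
	return 0;
}
```

Actually applying the steps needs CAP_SYS_ADMIN. Keeping the flag table separate from the syscall loop is only a convenience of the sketch: it lets the ordering (bind first, then slave, then shared) be inspected without touching the mount tree.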
Re: [patch 0/8] unprivileged mount syscall
On Mon, 2007-04-09 at 22:10 +0200, Miklos Szeredi wrote:
> > > The one in pam-0.99.6.3-29.1 in opensuse-10.2 is totally broken. Are you interested in the details? I can reproduce it, but forgot to note down the details of the brokenness.
> >
> > I don't know how far removed that is from the one being used by redhat, but assuming it's the same, then redhat-lspp@redhat.com will be very interested.
>
> OK.
>
> > > - user namespace setup: what if user has multiple sessions?
> > >
> > >   1) namespaces are shared? That's tricky because the session needs to be a child of a namespace server, not of login. I'm not sure PAM can handle this
> > >
> > >   2) or mounts are copied on login? That's not possible currently, as there's no way to send a mount between namespaces. Also it's tricky to make sure that new mounts are also shared
> >
> > See toward the end of the 'shared subtrees' OLS paper from last year for a suggestion on how to let users effectively 'log in to' an existing private mounts ns.
>
> This?
>
>   1. create a new namespace
>   2. bind /share/$USER to /share
>   3. for each pair ($who, $what) such that /share/$USER/$who/$what exists, look in /share/$who/allowed for "peer $what $USER" or "slave $what $USER". If the former is found, rbind /share/$who/$what on /share/$USER/$who/$what; if the latter is found, do the same and follow with marking subtree under /share/$USER/$who/$what as slave.
>   4. rbind /share/$USER to /share
>   5. mark subtree under /share as private.
>   6. umount -l /share
>
> Well, someone please explain using short words, because I don't understand at all.

I am trying to reconstruct Viro's thoughts. I think the steps outlined above, though not accurate, are still insightful.

The idea is: there is one master namespace, which has under /share a replica of the mount tree of the namespaces belonging to all users. For example, if there are two users A and B, then in the master namespace under /share you will find /share/A and /share/B, each reflecting the mount tree for the namespaces belonging to user-A and user-B respectively.

Note: /share is a shared mount-tree, which means it can propagate mount events.

Every time the user logs on to the machine, a new namespace is created which is a clone of the master namespace. In this new namespace, /share/$user is made the root of the namespace. Also, if other users have made part of their namespace available to this user, then those mounts are also brought under this namespace. And finally the entire tree under /share is unmounted.

Note: though multiple namespaces can exist simultaneously for the same user, the user is provided the illusion of a per-process namespace since all the namespaces look identical.

I am trying to rewrite the steps outlined above, which may or may not reflect Viro's thoughts, but certainly reflect my reconstruction of Viro's thoughts.

1. clone the master namespace.

2. in the new namespace
     move the tree under /share/$me to /
     for each ($user, $what, $how) {
         move /share/$user/$what to /$what
         if ($how == slave) {
             make the mount tree under /$what as slave
         }
     }

3. in the new namespace make the tree under /share as private and unmount /share

RP

> Thanks,
> Miklos
Re: [patch 0/8] unprivileged mount syscall
On Mon, 2007-04-09 at 12:07 -0500, Serge E. Hallyn wrote:
> Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > - need to set up mount propagation from global namespace to private ones, mount(8) does not yet have options to configure propagation
>
> Hmm, I guess I get lost using my own little systems, and just assumed that shared subtree functionality was making its way up into mount(8). Ram, have you been working on that?

It is in FC6. I don't know the status of upstream util-linux. I did submit the patch many times to Adrian Bunk (the then util-linux maintainer) and got no response. I have not pushed the patches to the new maintainer (Karel Zak?) though.

RP
shared subtree query
Ok. I have shared subtree patches getting ready for review. I have totally revamped the code from what I had sent last time, incorporating all the valuable comments Miklos had made. Of course I am yet to finish a document that Andrew Morton had requested.

The patch snapshot is at:
http://www.sudhaa.com/~ram/readahead/sharedsubtree
and the latest set of working patches is at:
http://www.sudhaa.com/~ram/readahead/sharedsubtree/shared.0831.1

Before I formally send the patches for review, I have bumped into a small issue, and I am not sure about the behavior. Al Viro's RFC at http://lwn.net/Articles/119232/ says

    5. umount
       unmount everything that gets propagation from victim

It's hard to interpret what victim means. There can be two interpretations of this:

1) the mount that got unmounted
2) the mount whose child got unmounted

I think it's natural to assume (2), but (1) also makes sense sometimes. Can somebody shed some light on this? Al Viro: please?

Thanks for your help,
RP
Re: Mirror a file system on the fly
On Thu, 2005-08-18 at 12:40, Dave Schwartz wrote:
> Hi list,
>
> Not too sure if this is the right forum to ask this question but since my requirement is around linux filesystems, I shall take this liberty to post my question.
>
> My requirement is to develop a kernel/user space module to add an extension to the shell program environment such that this shell forks a mirror look-alike filesystem of the underlying OS to the programs run in that particular shell.

You seem to be talking about namespaces, if I get you right. There is a flag CLONE_NEWNS to the system call 'clone' which does what you are talking about.

RP

> Was trying to look thru the FAQ and a few list archives to look for ideas around my requirement. The archives were overwhelming.
>
> Any ideas/pointers will be a great help,
>
> Gracias,
> decebel
Re: Mirror a file system on the fly
On Thu, 2005-08-18 at 13:27, Dave Schwartz wrote:
> Hi Ram,
>
> Thanks for the inputs. I was going over the man pages describing the clone system call and its option of CLONE_NEWNS. Could understand the description only in parts. The man page suggests that this flag when set, the cloned child is started in a new name space, initialized with a copy of the parent.
>
> Now does that mean, a program like a shell when cloned with CLONE_NEWNS set, will have a copy of file hierarchy of the underlying parent process?

Yes, the child process will see an exact copy of all the mounts of the various filesystems that the parent has. However, if you mount/unmount any filesystems in the child, the same will not be mounted/unmounted in the parent, and vice versa. Each has its individual view of the filesystem hierarchy.

Try the following program, which clones off a child process with a mirror namespace and gives you a bash prompt. Try mounting and unmounting in this bash prompt and see if the same is visible in a totally different window.

	#define _GNU_SOURCE
	#include <sched.h>
	#include <signal.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/wait.h>

	static char somemem[64 * 1024];	/* stack for the child */

	/* Runs in the child, inside its own copy of the mount namespace. */
	static int myfunc(void *arg)
	{
		(void)arg;
		return system("bash");
	}

	int main(void)
	{
		/* The stack grows down on most architectures: pass its top. */
		pid_t pid = clone(myfunc, somemem + sizeof(somemem),
				  CLONE_NEWNS | SIGCHLD, NULL);

		if (pid == -1)
			perror("clone");	/* CLONE_NEWNS needs CAP_SYS_ADMIN */
		else
			wait(NULL);
		printf("exit\n");
		return 0;
	}

Hope this helps,
RP

> Gracias,
> decebel
Re: [PATCH 1/7] shared subtree
On Thu, 2005-07-28 at 02:57, Miklos Szeredi wrote:
> > > This is an example, where having struct pnode just complicates things. If there was no struct pnode, this function would be just one line: setting the shared flag.
> >
> > So your comment is mostly about getting rid of pnode and distributing the pnode functionality in the vfsmount structure.
>
> Yes, sorry if I didn't make it clear.
>
> > I know you are thinking of just having the necessary propagation list in the vfsmount structure itself. Yes true with that implementation the complication is reduced in this part of the code, but really complicates the propagation traversal routines.
>
> On the contrary, I think it will simplify the traversal routines.
>
> Here's an iterator function I coded up. Not tested at all (may not even compile):

Your suggested code has bugs. But I understand what you are aiming at. Maybe you are right. I will try out an implementation using your idea. Hmm.. lots of code change, and testing.

> struct vfsmount {
> 	/* ... */
>
> 	struct list_head mnt_share;      /* circular list of shared mounts */
> 	struct list_head mnt_slave_list; /* list of slave mounts */
> 	struct list_head mnt_slave;      /* slave list entry */
> 	struct vfsmount *master;         /* slave is on master->mnt_slave_list */
> };
>
> static inline struct vfsmount *next_shared(struct vfsmount *p)
> {
> 	return list_entry(p->mnt_share.next, struct vfsmount, mnt_share);
> }
>
> static inline struct vfsmount *first_slave(struct vfsmount *p)
> {
> 	return list_entry(p->mnt_slave_list.next, struct vfsmount, mnt_slave);
> }
>
> static inline struct vfsmount *next_slave(struct vfsmount *p)
> {
> 	return list_entry(p->mnt_slave.next, struct vfsmount, mnt_slave);
> }
>
> static struct vfsmount *propagation_next(struct vfsmount *p,
> 					 struct vfsmount *base)
> {
> 	/* first iterate over the slaves */
> 	if (!list_empty(&p->mnt_slave_list))
> 		return first_slave(p);

I think this code should be

	if (!list_empty(&p->mnt_slave))
		return next_slave(p);

Right? I think I get the idea.

RP

> 	while (1) {
> 		struct vfsmount *q;
>
> 		/* more vfsmounts belong to the pnode? */
> 		if (!list_empty(&p->mnt_share)) {
> 			p = next_shared(p);
> 			if (list_empty(&p->mnt_slave) && p != base)
> 				return p;
> 		}
> 		if (p == base)
> 			break;
>
> 		BUG_ON(list_empty(&p->mnt_slave));
>
> 		/* more slaves? */
> 		q = next_slave(p);
> 		if (p->master != q)
> 			return q;
>
> 		/* back at master */
> 		p = q;
> 	}
>
> 	return NULL;
> }
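The traversal being argued about can be pictured with a much simpler user-space model. The following is only a sketch, not Miklos's iterator and not kernel code: struct mnt, its fields and walk_group() are hypothetical names, the walk is recursive rather than an iterator, and it assumes that each slave peer group hangs off its master through exactly one of its members.

```c
#include <stddef.h>

/* Hypothetical user-space stand-in for the propagation fields of vfsmount. */
struct mnt {
	const char *name;
	struct mnt *next_peer;		/* circular ring; points to itself if alone */
	struct mnt *first_slave;	/* first mount that is a slave of this one */
	struct mnt *next_slave;		/* next slave of the same master, or NULL */
};

/*
 * Walk the peer ring that 'm' belongs to. Each peer other than 'origin'
 * is appended to out[], and the peer group of each of its slaves is
 * walked recursively before moving on to the next peer.
 * Call with m == origin and n == 0; returns the number of entries stored.
 */
static size_t walk_group(struct mnt *m, const struct mnt *origin,
			 const struct mnt **out, size_t n, size_t cap)
{
	struct mnt *p = m;

	do {
		if (p != origin && n < cap)
			out[n++] = p;
		for (struct mnt *s = p->first_slave; s; s = s->next_slave)
			n = walk_group(s, origin, out, n, cap);
		p = p->next_peer;
	} while (p != m);
	return n;
}
```

The point of the model is only the visiting order: a mount's own slaves come before its remaining peers, which is the same "slaves first, then the rest of the peer group" order that the quoted propagation_next() is trying to produce without recursion.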
mount behavior question.
Summary of the question: Should the topmost mount be visible, or should the most recent mount be visible?

Consider the following command sequence:

(1) cd /mnt
(2) mount --bind /usr /mnt
(3) mount --bind /bin /mnt
(4) mount --bind /var .

After step 1, the pwd of the process is pointing to the root mount and directory mnt. Let's call the root mount 'A'.

After step 2, a new mount is laid on top of 'A' at the mountpoint mnt. Let's call this mount 'B'.

After step 3, a new mount is laid on top of 'B' at the mountpoint mnt, which corresponds to the root dentry of 'B'. Let's call this new overlaid mount 'C'. At this point the visible content of /mnt is the content of C.

However, at step 4, a new mount is laid on top of 'A' at the same mountpoint mnt as that of 'B'. Let's call the new mount 'D'. At this point, the visible content of /mnt is that of D and not that of C.

But shouldn't it be C? Why is it that the contents of 'D' are made visible? Is there any particular reason for this behavior? Note: 'D' is mounted on the bottommost mount, and hence should be obscured by the top level mounts.

To make it simpler, imagine you are viewing a 3-storied transparent building from the top. If you place an apple on the 1st floor and nothing is placed on any other floor, the apple will be visible from the top. Now if you place an orange on the 2nd floor, the apple should get obscured by the orange and the orange should start being visible. And later, if you place a mango on the 3rd floor, the mango should obscure both the apple and the orange. But at this point, if you place another apple on top of the first apple on the 1st floor, it cannot be visible, because the orange and the mango block its line of sight. And hence the mango should still continue to be visible, right? If the apple starts becoming visible from the top, won't it defy the law of visibility? :)

Back to the mount example: currently the behavior is that the most recent mount is visible and not the topmost mount.

Not many will run into this question currently, because the sequence of steps has to be orchestrated well to get into this scenario. But with shared subtrees it is pretty easy to mount something at a lower level mount because of propagations. And in this case the behavior becomes totally confusing if the rule is 'expose the most-recent-mount and not the topmost-mount'.

Here is a scenario with shared subtree. Sorry it is complex.

    mount --bind /mnt /mnt
    mount --make-shared /mnt
    mkdir -p /mnt/p
    mount --bind /usr /mnt/1
    mount --bind /mnt /mnt/2

At this stage the mounts at /mnt/2 and /mnt belong to the same pnode, which means mounts under them propagate to each other.

    mount --bind /var /mnt/1

The contents of /var will be visible under /mnt/1 and not under /mnt/2.

But if

    mount --bind /var /mnt/2

is executed, the contents of /var are visible under /mnt/1 as well as /mnt/2. Isn't this freaky?

On analysis it turns out the culprit is the current rule which says 'expose the most-recent-mount and not the topmost mount'.

RP
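The two candidate rules can be compared on the A/B/C/D sequence with a toy model. This is a sketch under stated assumptions, not the kernel's lookup code: struct mnt, enum rule, lookup_child(), visible_at() and visible_in_example() are hypothetical names, a mount is identified only by its parent mount and a mountpoint string, and FIRST_MOUNTED stands in for the 'topmost mount' rule by always picking the oldest mount attached at a given (parent, mountpoint) pair.

```c
#include <stddef.h>
#include <string.h>

enum rule { MOST_RECENT, FIRST_MOUNTED };

/* Hypothetical model: a mount is attached to (parent mount, mountpoint). */
struct mnt {
	const char *name;
	const struct mnt *parent;	/* NULL for the root mount */
	const char *mountpoint;		/* dentry in the parent; "/" is its root */
};

/* table[] is kept in mount order, oldest first. */
static const struct mnt *lookup_child(const struct mnt *table, size_t n,
				      const struct mnt *parent,
				      const char *mountpoint, enum rule rule)
{
	const struct mnt *found = NULL;

	for (size_t i = 0; i < n; i++) {
		if (table[i].parent != parent ||
		    strcmp(table[i].mountpoint, mountpoint) != 0)
			continue;
		found = &table[i];
		if (rule == FIRST_MOUNTED)
			break;		/* oldest match wins */
	}
	return found;			/* under MOST_RECENT the newest match wins */
}

/* Follow the stack of mounts upward and return the one a lookup ends on. */
static const struct mnt *visible_at(const struct mnt *table, size_t n,
				    const struct mnt *parent,
				    const char *mountpoint, enum rule rule)
{
	const struct mnt *top = NULL;
	const struct mnt *m = lookup_child(table, n, parent, mountpoint, rule);

	while (m) {
		top = m;
		m = lookup_child(table, n, m, "/", rule);
	}
	return top;
}

/* The A/B/C/D sequence from the mail; returns the name visible at /mnt. */
static const char *visible_in_example(enum rule rule)
{
	struct mnt t[4];

	t[0] = (struct mnt){ "A", NULL, "" };		/* root mount */
	t[1] = (struct mnt){ "B", &t[0], "mnt" };	/* step 2 */
	t[2] = (struct mnt){ "C", &t[1], "/" };		/* step 3 */
	t[3] = (struct mnt){ "D", &t[0], "mnt" };	/* step 4 */
	return visible_at(t, 4, &t[0], "mnt", rule)->name;
}
```

In this model visible_in_example(MOST_RECENT) yields "D", which is the behavior questioned above, while visible_in_example(FIRST_MOUNTED) yields "C", because the lookup first lands on B and then climbs to the mount stacked on B's root.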
Re: mount behavior question.
On Thu, 2005-07-28 at 04:56, Miklos Szeredi wrote:
> > Here is a scenario with shared subtree. Sorry it is complex.
> >
> >   mount --bind /mnt /mnt
> >   mount --make-shared /mnt
> >   mkdir -p /mnt/p
> >   mount --bind /usr /mnt/1
> >   mount --bind /mnt /mnt/2
> >
> > At this stage the mount at /mnt/2 and /mnt belong to the same pnode which means mounts under them propagate to each other.
> >
> >   mount --bind /var /mnt/1
> >
> > the contents of /var will be visible under /mnt/1 and not under /mnt/2
> >
> > But if
> >
> >   mount --bind /var /mnt/2
> >
> > is executed, the contents of /var is visible under /mnt/1 as well as /mnt/2. Isn't this freaky?
>
> I don't understand.
>
> 'mount --bind /var /mnt/1' should propagate to /mnt/2/1, not /mnt/2.

Yes, it should propagate to /mnt/2/1; that's what I meant when I said under /mnt/2, but yes, I was not clear. Hope I have a clearer explanation below.

> No?
>
> 'mount --bind /var/ /mnt/2' should propagate to /mnt.
>
> What am I missing?

step 1: mount --bind /mnt /mnt
	a new mount 'A' is created at /mnt.

step 2: mount --make-shared /mnt
	mounts under 'A' are made shared. But in this case there are no other mounts, so only 'A' will be made shared.

step 3: mkdir -p /mnt/1 /mnt/2
	nothing special here.

step 4: mount --bind /usr /mnt/1
	a new mount 'B' is created at /mnt/1, which is 'shared'.

step 5: mount --bind /mnt /mnt/2
	a new mount 'C' is created at /mnt/2 and propagation is set between 'A' and 'C'. Note: 'C' is made shared.

Let's say, at this point I try

    mount --bind /var /mnt/1

This is going to mount 'D' on top of mount 'B'. However, there is no other mount to which 'B' propagates. So that is it: the contents of /var are only visible at /mnt/1 and it propagates nowhere else.

But let's say we tried

    mount --bind /var /mnt/2/1

/mnt/2/1 belongs to mount 'C'. And mounts under 'C' propagate to 'A' too. So in this case a new mount 'E' is created at /mnt/2/1, i.e. on top of 'C' at dentry '1', and due to propagation a new mount 'F' is created at /mnt/1, i.e. on top of mount 'A' at dentry '1'.

But note: /mnt/1 already has a mount 'B' on top of it. The new mount 'F', as per the 'most-recent mount rule', obscures 'B' even though the mount is on top of 'A'. As a result the contents of /var are now visible both at /mnt/2/1 and /mnt/1.

Ok, the net effect is: a mount at /mnt/1 is visible only under /mnt/1, but a mount at /mnt/2/1 is visible at /mnt/2/1 and /mnt/1. This makes it confusing.

If the 'top-most mount rule' is applied, 'F', though mounted on 'A', will not be visible because it will get obscured by 'B', and the confusion is avoided.

So the point I am driving at is: is there any special reason for having the 'most-recent mount visible' rule instead of the 'top-most mount visible' rule?

RP

> Miklos
Re: mount behavior question.
On Thu, 2005-07-28 at 12:30, Miklos Szeredi wrote:
> > no. there is no asymmetry as such. the propagations are working the way they are meant to. But the confusion arises because of the mount lookup semantics. The reason Avantika (who is doing shared subtree testing) had this exact confusion is because of the 'most-recent-mount visible' rule. I don't think this rule is documented anywhere. And the natural response to such a behavior is confusion.
>
> I really fail to see what you are getting at. You agree that:
>
> 1) mount doesn't propagate from /mnt/1 to /mnt/2/1.
>
> 2) mount propagates from /mnt/2/1 to /mnt/1.

Yes, I agree.

> Then you are surprised that you don't see the same thing if you mount on /mnt/1 as if on /mnt/2/1.

I am not surprised when mounts on /mnt/1 do not propagate to /mnt/2/1. This is expected, and I am perfectly happy, because the mount is attempted on 'B' and 'B' has nobody to propagate to.

When a mount on /mnt/2/1 (i.e. on 'C' at dentry 1) is attempted, I expect to see a new mount 'E' at that dentry. That is happening and I am happy with it. I also expect that the mount propagates to /mnt/1 too (i.e. on 'A' at dentry '1'), because 'C' and 'A' have propagation set up.

But what I also expect to see is: the new mount 'F' at /mnt/1 (mount 'A' at dentry 1) be obscured by the already existing mount on /mnt/1, i.e. mount 'B'. And the reason I want the new mount at /mnt/1 (i.e. 'F') obscured is that the new mount is not done on 'B' but is done on 'A'.

The most-recent-mount rule makes 'B' obscured instead of 'F', and I am expecting the top-mount-visible rule to be applicable here, which makes 'B' still visible and 'F' obscured.

Ah... it's so hard without a whiteboard :( I wish there was some way to explain it by drawing some objects on the whiteboard. I guess I have got all the letters and the words right. Any small mistake can distort everything.

If somebody is wondering why there is no 'D': that is because it was used for something else in the earlier example and hence is not used here.

RP

> I think your proposed solution would be _more_ confusing not less, since then I'd not see the expected propagation from /mnt/2/1 to /mnt/1. I'd call that a bug.
>
> Miklos
Re: mount behavior question.
On Thu, 2005-07-28 at 13:35, Bryan Henderson wrote:
> It wouldn't surprise me if someone is depending on mount over ".". But I'd be surprised if someone is doing it to a directory that's already been mounted over (such that the stacking behavior is relevant). That seems really eccentric.

Bryan, what would you expect the behavior to be when somebody mounts on a directory that is already mounted over? Do you expect the new mount to obscure the already existing mount, or do you expect the already existing mount to obscure the new mount? The issue in the current thread is pretty much revolving around this.

RP

> --
> Bryan Henderson                     IBM Almaden Research Center
> San Jose CA                         Filesystems
Re: mount behavior question.
On Thu, 2005-07-28 at 15:27, Bryan Henderson wrote:
> > Bryan, what would you expect the behavior to be when somebody mounts on a directory what is already mounted over?
>
> Well, I've tried to beg the question. I said I don't think it's meaningful to mount over a directory; that one actually mounts at a name. And that Linux's peculiar mount over '.' (which is in fact mounting over a directory and not at a name) is weird enough that there is no natural expectation of it except that it should fail.
>
> But if I had to try to merge mount over '.' into as consistent a model as possible with one of the two behaviors we've been discussing, I'd say that "." stands for the name by which you looked up that directory in the first place (so in this case, it's equivalent to mount ... /mnt). And that means I would expect the new mount to obscure the already existing mount.

Ok. Maybe I am having some odd expectations here. To me it still feels natural to tuck the mount under the earlier mount, since you are not mounting on something which is on the top, but you are mounting on top of something which is under (obscured).

RP

> --
> Bryan Henderson                     IBM Almaden Research Center
> San Jose CA                         Filesystems
Re: [PATCH 1/7] shared subtree
On Wed, 2005-07-27 at 12:54, Miklos Szeredi wrote:
> > +static int do_make_shared(struct vfsmount *mnt)
> > +{
> > +	int err=0;
> > +	struct vfspnode *old_pnode = NULL;
> > +	/*
> > +	 * if the mount is already a slave mount,
> > +	 * allocate a new pnode and make it
> > +	 * a slave pnode of the original pnode.
> > +	 */
> > +	if (IS_MNT_SLAVE(mnt)) {
> > +		old_pnode = mnt->mnt_pnode;
> > +		pnode_del_slave_mnt(mnt);
> > +	}
> > +	if(!IS_MNT_SHARED(mnt)) {
> > +		mnt->mnt_pnode = pnode_alloc();
> > +		if(!mnt->mnt_pnode) {
> > +			pnode_add_slave_mnt(old_pnode, mnt);
> > +			err = -ENOMEM;
> > +			goto out;
> > +		}
> > +		pnode_add_member_mnt(mnt->mnt_pnode, mnt);
> > +	}
> > +	if(old_pnode)
> > +		pnode_add_slave_pnode(old_pnode, mnt->mnt_pnode);
> > +	set_mnt_shared(mnt);
> > +out:
> > +	return err;
> > +}
>
> This is an example, where having struct pnode just complicates things. If there was no struct pnode, this function would be just one line: setting the shared flag.

So your comment is mostly about getting rid of pnode and distributing the pnode functionality in the vfsmount structure.

I know you are thinking of just having the necessary propagation list in the vfsmount structure itself. Yes, true, with that implementation the complication is reduced in this part of the code, but it really complicates the propagation traversal routines.

In order to find out the slaves of a given mount:
    with your proposal: I have to walk through all the peer mounts of this mount and check for any slaves there.
    in my implementation: I have to just find which pnode it belongs to, and all the slaves are easily available there.

In order to find out all the shared mounts that are slaves of this mount:
    with your proposal: Not sure how to do it. Maybe you have to have another field in each of the vfsmounts that will point to the shared mounts that are slaves of this mount?
    in my implementation: I have to just find the pnode it belongs to, and all the slave pnodes are easily available there.

There are complexity tradeoffs in both implementations. But I personally felt having a pnode structure keeps the pnode operations separated out cleanly. It helps to easily visualize the propagation tree.

And also one more thing influenced my thought process. The statement in Al Viro's RFC:

---
How do we set them up?

	* we can mark a subtree sharable. Every vfsmount in the subtree that is not already in some p-node gets a single-element p-node of its own.
	* we can mark a subtree slave. That removes all vfsmounts in the subtree from their p-nodes and makes them owned by said p-nodes. p-nodes that became empty will disappear and everything they used to own will be repossessed by their owners (if any).
	* we can mark a subtree private. Same as above, but followed by taking all vfsmounts in our subtree and making them *not* owned by anybody.
---

The above statements imply some implementation detail. Not sure if you will buy this point :)

> > +static kmem_cache_t * pnode_cachep;
> > +
> > +/* spinlock for pnode related operations */
> > + __cacheline_aligned_in_smp DEFINE_SPINLOCK(vfspnode_lock);
> > +
> > +enum pnode_vfs_type {
> > +	PNODE_MEMBER_VFS = 0x01,
> > +	PNODE_SLAVE_VFS = 0x02
> > +};
> > +
> > +void __init pnode_init(unsigned long mempages)
> > +{
> > +	pnode_cachep = kmem_cache_create("pnode_cache",
> > +			sizeof(struct vfspnode), 0,
> > +			SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL);
> > +}
> > +
> > +struct vfspnode * pnode_alloc(void)
> > +{
> > +	struct vfspnode *pnode = kmem_cache_alloc(pnode_cachep, GFP_KERNEL);
> > +	INIT_LIST_HEAD(&pnode->pnode_vfs);
> > +	INIT_LIST_HEAD(&pnode->pnode_slavevfs);
> > +	INIT_LIST_HEAD(&pnode->pnode_slavepnode);
> > +	INIT_LIST_HEAD(&pnode->pnode_peer_slave);
> > +	pnode->pnode_master = NULL;
> > +	pnode->pnode_flags = 0;
> > +	atomic_set(&pnode->pnode_count,0);
> > +	return pnode;
> > +}
> > +
> > +void inline pnode_free(struct vfspnode *pnode)
> > +{
> > +	kmem_cache_free(pnode_cachep, pnode);
> > +}
> > +
> > +/*
> > + * __put_pnode() should be called with vfspnode_lock held
> > + */
> > +void __put_pnode(struct vfspnode *pnode)
> > +{
> > +	struct vfspnode *tmp_pnode;
> > +	do {
> > +		tmp_pnode = pnode->pnode_master;
> > +		list_del_init(&pnode->pnode_peer_slave);
> > +		BUG_ON(!list_empty(&pnode->pnode_vfs));
> > +		BUG_ON(!list_empty(&pnode->pnode_slavevfs));
> > +		BUG_ON(!list_empty(&pnode->pnode_slavepnode));
> > +		pnode_free(pnode);
> > +		pnode = tmp_pnode;
> > +		if (!pnode || !atomic_dec_and_test(&pnode->pnode_count))
> > +			break;
> > +	} while(pnode);
> > +}
> > +
>
> All these are really unnecessary IMO.
>
> > +/*
> > + * merge 'pnode' into 'peer_pnode' and get rid of
Re: [PATCH 3/7] shared subtree
On Wed, 2005-07-27 at 12:13, Miklos Szeredi wrote:
> > @@ -54,7 +55,7 @@ static inline unsigned long hash(struct
> >
> >  struct vfsmount *alloc_vfsmnt(const char *name)
> >  {
> > -	struct vfsmount *mnt = kmem_cache_alloc(mnt_cache, GFP_KERNEL);
> > +	struct vfsmount *mnt = kmem_cache_alloc(mnt_cache, GFP_KERNEL);
> >  	if (mnt) {
> >  		memset(mnt, 0, sizeof(struct vfsmount));
> >  		atomic_set(&mnt->mnt_count,1);
>
> Please make whitespace changes a separate patch.

I tried to remove trailing whitespace in the current code wherever I found it. OK, will make them a separate patch.

> > @@ -128,11 +162,71 @@ static void attach_mnt(struct vfsmount *
> >  {
> >  	mnt->mnt_parent = mntget(nd->mnt);
> >  	mnt->mnt_mountpoint = dget(nd->dentry);
> > -	list_add(&mnt->mnt_hash, mount_hashtable+hash(nd->mnt, nd->dentry));
> > +	mnt->mnt_namespace = nd->mnt->mnt_namespace;
> > +	list_add_tail(&mnt->mnt_hash,
> > +		mount_hashtable+hash(nd->mnt, nd->dentry));
> >  	list_add_tail(&mnt->mnt_child, &nd->mnt->mnt_mounts);
> >  	nd->dentry->d_mounted++;
> >  }
>
> Why list_add_tail()? This changes user visible behavior, and seems unnecessary.

Yes. I was about to send out a mail questioning the existing behavior. I will start a separate thread questioning the current behavior. My plan was to discuss the current behavior before making this change. I thought I had reverted this change, but it slipped in.

> > +static void attach_prepare_mnt(struct vfsmount *mnt, struct nameidata *nd)
> > +{
> > +	mnt->mnt_parent = mntget(nd->mnt);
> > +	mnt->mnt_mountpoint = dget(nd->dentry);
> > +	nd->dentry->d_mounted++;
> > +}
> > +
> > +
>
> You shouldn't add unnecessary newlines. There are a lot of these, please audit all your patches.

ok. sure.

> > +void do_attach_commit_mnt(struct vfsmount *mnt)
> > +{
> > +	struct vfsmount *parent = mnt->mnt_parent;
> > +	BUG_ON(parent==mnt);
>
> 	BUG_ON(parent == mnt);
>
> > +	if(list_empty(&mnt->mnt_hash))
>
> 	if (list_empty(&mnt->mnt_hash))
>
> > +		list_add_tail(&mnt->mnt_hash,
> > +			mount_hashtable+hash(parent, mnt->mnt_mountpoint));
> > +	if(list_empty(&mnt->mnt_child))
> > +		list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
> > +	mnt->mnt_namespace = parent->mnt_namespace;
> > +	list_add_tail(&mnt->mnt_list, &mnt->mnt_namespace->list);
> > +}
>
> Etc. Maybe you should run Lindent on your changes, but be careful not to change existing code, even if Lindent would do that!

sure :)

> > @@ -191,7 +270,7 @@ static void *m_start(struct seq_file *m,
> >  	struct list_head *p;
> >  	loff_t l = *pos;
> >
> > -	down_read(&n->sem);
> > +	down_read(&namespace_sem);
> >  	list_for_each(p, &n->list)
> >  		if (!l--)
> >  			return list_entry(p, struct vfsmount, mnt_list);
>
> This should be a separate patch. You can just take the one from the detached trees patch-series.

ok. In fact these changes were motivated by that patch.

> > +/*
> > + * abort the operations done in attach_recursive_mnt(). run through the mount
> > + * tree, till vfsmount 'last' and undo the changes. Ensure that all the mounts
> > + * in the tree are all back in the mnt_list headed at 'source_mnt'.
> > + * NOTE: This function is closely tied to the logic in
> > + * 'attach_recursive_mnt()'
> > + */
> > +static void abort_attach_recursive_mnt(struct vfsmount *source_mnt, struct
> > +	vfsmount *last, struct list_head *head) { struct vfsmount *p =
> > +	source_mnt, *m; struct vfspnode *src_pnode;
>
> If you want to do proper error handling, instead of doing rollback, it seems better to first do anything that can fail (allocations), then do the actual attaching, which cannot fail. It isn't nice to have transient states on failure.

Yes. It does exactly what you said. In the prepare stage it does not touch any of the existing vfstree or the pnode tree. All it does is build a new vfstree and pnode tree and make the necessary changes to them. If everything is successful, it glues the new tree to the existing tree (which is the commit phase), and if the prepare stage fails allocating memory or for any other reason, it goes and destroys the new trees (in the abort phase).

Of course, in the prepare stage it does increase the reference count of the vfsmounts to which the new tree will be attached. This is to ensure that the vfsmounts have not disappeared by the time we reach the commit phase. I think we are talking about the same thing, and the code behaves exactly as you said.

> > +	/*
> > +	 * This operation is equivalent of mount --bind dir dir
> > +	 * create a new mount at the dentry, and unmount all child mounts
> > +	 * mounted on top of dentries below 'dentry', and mount them
> > +	 * under the new mount.
> > +	 */
> > +struct vfsmount *do_make_mounted(struct vfsmount *mnt, struct dentry *dentry)
>
> Why is this needed? I thought we agreed, that this can be removed.

Yes, we agreed on returning EINVAL when a directory is attempted to be made shared/private/slave/unclonable. But this is a different case. lets say /mnt is a
[no subject]
, [EMAIL PROTECTED], Janak Desai [EMAIL PROTECTED], linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH 0/7] shared subtree

Hi Andrew/Al Viro,

Enclosing a final set of well-tested patches that implement Al Viro's shared subtree proposal.

These patches provide the ability to mark a mount tree as shared/private/slave/unclone, along with the ability to operate on these trees with bind/rbind/move/pivot_root/namespace-clone etc.

I believe this powerful feature can help build features like per-user namespaces. A couple of projects may benefit from shared subtrees:
1) automounter, for the ability to automount across namespaces.
2) SELinux, for implementing polyinstantiated trees.
3) MVFS, for providing a versioning file system.
4) FUSE, for per-user namespaces?

Thanks to Avantika for developing 100+ test cases that test various combinations of private/shared/slave/unclonable trees. All these tests have passed. I feel pretty confident about the stability of the code.

The patches have been broken into 7 units for ease of review. I realize that patch 3, 'rbind.patch', is a bit heavier than the other patches. The reason is that most of the shared-subtree functionality manifests itself during the bind/rbind operation.

A couple of work items remain to be done:
1. modify the mount command to support this feature, e.g. mount --make-shared /tmp
2. a tool that can help visualize the propagation tree, maybe with support in /proc?
3. some documentation on how to use all this functionality.

Please consider the patches for inclusion in your tree. The footprint of this code is pretty small in the normal code path where shared-subtree functionality is not used.

Any suggestions/comments to improve the code are welcome.

Thanks,
RP
[no subject]
, [EMAIL PROTECTED], Janak Desai [EMAIL PROTECTED], linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH 7/7] shared subtree
Content-Type: text/x-patch; name=automount.patch
Content-Disposition: inline; filename=automount.patch

Adds support for mount/umount propagation for autofs-initiated operations.

RP

Signed by Ram Pai ([EMAIL PROTECTED])

 fs/namespace.c        |  176 +++---
 fs/pnode.c            |   12 +--
 include/linux/pnode.h |    3
 3 files changed, 76 insertions(+), 115 deletions(-)

Index: 2.6.12.work2/fs/namespace.c
===================================================================
--- 2.6.12.work2.orig/fs/namespace.c
+++ 2.6.12.work2/fs/namespace.c
@@ -202,6 +202,9 @@ struct vfsmount *do_attach_prepare_mnt(s
 		if(!(child_mnt = clone_mnt(template_mnt,
 				template_mnt->mnt_root)))
 			return NULL;
+		spin_lock(&vfsmount_lock);
+		list_del_init(&child_mnt->mnt_fslink);
+		spin_unlock(&vfsmount_lock);
 	} else
 		child_mnt = template_mnt;
 
@@ -355,35 +358,14 @@ struct seq_operations mounts_op = {
  */
 int may_umount_tree(struct vfsmount *mnt)
 {
-	struct list_head *next;
-	struct vfsmount *this_parent = mnt;
-	int actual_refs;
-	int minimum_refs;
+	int actual_refs=0;
+	int minimum_refs=0;
+	struct vfsmount *p;
 
 	spin_lock(&vfsmount_lock);
-	actual_refs = atomic_read(&mnt->mnt_count);
-	minimum_refs = 2;
-repeat:
-	next = this_parent->mnt_mounts.next;
-resume:
-	while (next != &this_parent->mnt_mounts) {
-		struct vfsmount *p = list_entry(next, struct vfsmount, mnt_child);
-
-		next = next->next;
-
+	for (p = mnt; p; p = next_mnt(p, mnt)) {
 		actual_refs += atomic_read(&p->mnt_count);
 		minimum_refs += 2;
-
-		if (!list_empty(&p->mnt_mounts)) {
-			this_parent = p;
-			goto repeat;
-		}
-	}
-
-	if (this_parent != mnt) {
-		next = this_parent->mnt_child.next;
-		this_parent = this_parent->mnt_parent;
-		goto resume;
 	}
 	spin_unlock(&vfsmount_lock);
 
@@ -395,18 +377,18 @@ resume:
 
 EXPORT_SYMBOL(may_umount_tree);
 
-int mount_busy(struct vfsmount *mnt)
+int mount_busy(struct vfsmount *mnt, int refcnt)
 {
 	struct vfspnode *parent_pnode;
 
 	if (mnt == mnt->mnt_parent || !IS_MNT_SHARED(mnt->mnt_parent))
-		return do_refcount_check(mnt, 2);
+		return do_refcount_check(mnt, refcnt);
 
 	parent_pnode = mnt->mnt_parent->mnt_pnode;
 	BUG_ON(!parent_pnode);
 	return pnode_mount_busy(parent_pnode,
 			mnt->mnt_mountpoint,
-			mnt->mnt_root, mnt);
+			mnt->mnt_root, mnt, refcnt);
 }
 
 /**
@@ -424,9 +406,12 @@ int mount_busy(struct vfsmount *mnt)
  */
 int may_umount(struct vfsmount *mnt)
 {
-	if (mount_busy(mnt))
-		return -EBUSY;
-	return 0;
+	int ret=0;
+	spin_lock(&vfsmount_lock);
+	if (mount_busy(mnt, 2))
+		ret = -EBUSY;
+	spin_unlock(&vfsmount_lock);
+	return ret;
 }
 
 EXPORT_SYMBOL(may_umount);
@@ -445,7 +430,26 @@ void do_detach_mount(struct vfsmount *mn
 	spin_lock(&vfsmount_lock);
 }
 
-void __umount_tree(struct vfsmount *mnt, int propogate)
+void umount_mnt(struct vfsmount *mnt, int propogate)
+{
+	if (propogate && mnt->mnt_parent != mnt &&
+			IS_MNT_SHARED(mnt->mnt_parent)) {
+		struct vfspnode *parent_pnode
+			= mnt->mnt_parent->mnt_pnode;
+		BUG_ON(!parent_pnode);
+		pnode_umount(parent_pnode,
+			mnt->mnt_mountpoint,
+			mnt->mnt_root);
+	} else {
+		if (IS_MNT_SHARED(mnt) || IS_MNT_SLAVE(mnt)) {
+			BUG_ON(!mnt->mnt_pnode);
+			pnode_disassociate_mnt(mnt);
+		}
+		do_detach_mount(mnt);
+	}
+}
+
+static void __umount_tree(struct vfsmount *mnt, int propogate)
 {
 	struct vfsmount *p;
 	LIST_HEAD(kill);
@@ -459,21 +463,7 @@ void __umount_tree(struct vfsmount *mnt,
 		mnt = list_entry(kill.next, struct vfsmount, mnt_list);
 		list_del_init(&mnt->mnt_list);
 		list_del_init(&mnt->mnt_fslink);
-		if (propogate && mnt->mnt_parent != mnt &&
-			IS_MNT_SHARED(mnt->mnt_parent)) {
-			struct vfspnode *parent_pnode
-				= mnt->mnt_parent->mnt_pnode;
-			BUG_ON(!parent_pnode);
-			pnode_umount(parent_pnode,
-				mnt->mnt_mountpoint,
-				mnt->mnt_root);
-		} else
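The may_umount_tree() hunk in this patch replaces a goto-driven walk with a single loop over next_mnt(). A minimal user-space sketch of that idea follows. It assumes a simplified `struct mnt` with explicit parent, first-child and next-sibling pointers instead of the kernel's list_head fields, and the helpers `add_child()` and `tree_is_idle()` are hypothetical. The final comparison in `tree_is_idle()` is also an assumption, since the hunk does not show how the two sums are compared.

```c
#include <stddef.h>

/* Hypothetical simplified mount: parent, first child, next sibling, refcount. */
struct mnt {
	struct mnt *parent;
	struct mnt *first_child;
	struct mnt *next_sibling;
	int count;
};

/* Return the node after p in a pre-order walk of the subtree at root, or NULL. */
static struct mnt *next_mnt(struct mnt *p, struct mnt *root)
{
	if (p->first_child)
		return p->first_child;
	/* No children: climb until some ancestor below root has a sibling. */
	while (p != root) {
		if (p->next_sibling)
			return p->next_sibling;
		p = p->parent;
	}
	return NULL;
}

/* Attach child as the last child of parent. */
static void add_child(struct mnt *parent, struct mnt *child)
{
	struct mnt **link = &parent->first_child;
	while (*link)
		link = &(*link)->next_sibling;
	*link = child;
	child->parent = parent;
}

/*
 * Same shape as the rewritten loop: sum the actual reference counts and
 * compare against a baseline of 2 per mount. Returns 1 if no mount in the
 * tree holds extra references, else 0.
 */
static int tree_is_idle(struct mnt *root)
{
	int actual = 0, minimum = 0;
	for (struct mnt *p = root; p; p = next_mnt(p, root)) {
		actual += p->count;
		minimum += 2;
	}
	return actual <= minimum;
}
```

Because the walk stops climbing at `root`, it never escapes into siblings of the subtree being examined, which is the property the removed `this_parent != mnt` check provided.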
[no subject]
, [EMAIL PROTECTED], Janak Desai [EMAIL PROTECTED], linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH 4/7] shared subtree
Content-Type: text/x-patch; name=move.patch
Content-Disposition: inline; filename=move.patch

Adds ability to move a shared/private/slave/unclone tree to any other
shared/private/slave/unclone tree. Also incorporates the same behavior for
pivot_root()

RP

Signed by Ram Pai ([EMAIL PROTECTED])

 fs/namespace.c        |  196 +++---
 include/linux/mount.h |    2
 2 files changed, 173 insertions(+), 25 deletions(-)

Index: 2.6.12.work2/fs/namespace.c
===================================================================
--- 2.6.12.work2.orig/fs/namespace.c
+++ 2.6.12.work2/fs/namespace.c
@@ -772,9 +772,12 @@ static void abort_attach_recursive_mnt(s
 	list_del_init(head);
 }
 
+
 /*
  * @source_mnt : mount tree to be attached
  * @nd: place the mount tree @source_mnt is attached
+ * @move : use the move semantics if set, else use normal attach semantics
+ *         as explained below
  *
  * NOTE: in the table below explains the semantics when a source vfsmount
  * of a given type is attached to a destination vfsmount of a give type.
@@ -801,12 +804,41 @@ static void abort_attach_recursive_mnt(s
  * |            |              |            |            |            |
  * ********************************************************************
  *
- * (++) the mount will be propogated to all the vfsmounts in the pnode tree
+ * (++) the mount is propogated to all the vfsmounts in the pnode tree
  * 	  of the destination vfsmount, and all the non-slave new mounts in
  * 	  destination vfsmount will be added the source vfsmount's pnode.
- * (+) the mount will be propogated to the destination vfsmount
+ * (+) the mount is propogated to the destination vfsmount
  * 	  and the new mount will be added to the source vfsmount's pnode.
  *
+ * --------------------------------------------------------------------
+ * |                       MOVE MOUNT OPERATION                       |
+ * |******************************************************************|
+ * | dest -->   | shared       | private    | slave      | unclonable |
+ * | source     |              |            |            |            |
+ * |   |        |              |            |            |            |
+ * |   v        |              |            |            |            |
+ * |******************************************************************|
+ * |            |              |            |            |            |
+ * | shared     | shared (++)  | shared (+) | shared (+) | shared (+) |
+ * |            |              |            |            |            |
+ * |            |              |            |            |            |
+ * | private    | shared (+)   | private    | private    | private    |
+ * |            |              |            |            |            |
+ * |            |              |            |            |            |
+ * | slave      | shared (+++) | slave      | slave      | slave      |
+ * |            |              |            |            |            |
+ * |            |              |            |            |            |
+ * | unclonable | invalid      | unclonable | unclonable | unclonable |
+ * |            |              |            |            |            |
+ * |            |              |            |            |            |
+ * ********************************************************************
+ *
+ * (+++) the mount is propogated to all the vfsmounts in the pnode tree
+ *	  of the destination vfsmount, and all the new mounts is
+ *	  added to a new pnode , which is a slave pnode of the
+ *	  source vfsmount's pnode.
+ *
+ *
  * if the source mount is a tree, the operations explained above is
  * applied to each vfsmount in the tree.
  *
@@ -815,7 +847,7 @@ static void abort_attach_recursive_mnt(s
  *
  */
 static int attach_recursive_mnt(struct vfsmount *source_mnt,
-		struct nameidata *nd)
+		struct nameidata *nd, int move)
 {
 	struct vfsmount *mntpt_mnt, *last, *m, *p;
 	struct vfspnode *src_pnode, *dest_pnode, *tmp_pnode;
@@ -849,8 +881,8 @@ static int attach_recursive_mnt(struct v
 	list_add_tail(&mnt_list_head, &source_mnt->mnt_list);
 
 	for (m = source_mnt; m; m = next_mnt(m, source_mnt)) {
-
-		BUG_ON(IS_MNT_UNCLONE(m));
+		int unclone = IS_MNT_UNCLONE(m);
+		int slave = IS_MNT_SLAVE(m);
 
 		while (p && p != m->mnt_parent)
 			p = p->mnt_parent;
@@ -866,7 +898,7 @@ static int attach_recursive_mnt(struct v
 		dest_pnode = IS_MNT_SHARED(mntpt_mnt) ?
 				mntpt_mnt->mnt_pnode : NULL;
-		src_pnode = (IS_MNT_SHARED(m))?
+		src_pnode
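The MOVE MOUNT OPERATION table in this patch reads naturally as a two-dimensional lookup keyed by source type and destination type. The sketch below encodes it that way. It is only an illustration: the enums, `struct move_result`, `move_table` and `move_semantics()` are hypothetical names, and `PROP_PLUS`, `PROP_PLUS2`, `PROP_PLUS3` are labels for the table's (+), (++) and (+++) footnotes rather than anything the patch defines.

```c
/* Mount types named in the patch's table, plus a marker for invalid moves. */
enum mnt_type { MNT_SHARED, MNT_PRIVATE, MNT_SLAVE, MNT_UNCLONABLE, MNT_INVALID };

/* Hypothetical labels for the (+), (++) and (+++) footnotes. */
enum propagation { PROP_NONE, PROP_PLUS, PROP_PLUS2, PROP_PLUS3 };

struct move_result {
	enum mnt_type type;	/* resulting type of the moved mount */
	enum propagation prop;	/* which footnote applies, if any */
};

/* Rows are the source type, columns the destination type, as in the table. */
static const struct move_result move_table[4][4] = {
	/* source shared */
	{ { MNT_SHARED, PROP_PLUS2 }, { MNT_SHARED, PROP_PLUS },
	  { MNT_SHARED, PROP_PLUS },  { MNT_SHARED, PROP_PLUS } },
	/* source private */
	{ { MNT_SHARED, PROP_PLUS },  { MNT_PRIVATE, PROP_NONE },
	  { MNT_PRIVATE, PROP_NONE }, { MNT_PRIVATE, PROP_NONE } },
	/* source slave */
	{ { MNT_SHARED, PROP_PLUS3 }, { MNT_SLAVE, PROP_NONE },
	  { MNT_SLAVE, PROP_NONE },   { MNT_SLAVE, PROP_NONE } },
	/* source unclonable */
	{ { MNT_INVALID, PROP_NONE },    { MNT_UNCLONABLE, PROP_NONE },
	  { MNT_UNCLONABLE, PROP_NONE }, { MNT_UNCLONABLE, PROP_NONE } },
};

/* Look up the outcome of moving a source mount under a destination mount. */
static struct move_result move_semantics(enum mnt_type source, enum mnt_type dest)
{
	struct move_result invalid = { MNT_INVALID, PROP_NONE };

	/* Anything outside the four real types has no table entry. */
	if (source >= MNT_INVALID || dest >= MNT_INVALID)
		return invalid;
	return move_table[source][dest];
}
```

For example, `move_semantics(MNT_SLAVE, MNT_SHARED)` yields a shared result with the (+++) footnote, and moving an unclonable mount under a shared destination is reported as invalid, matching the corresponding cells of the table.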
[no subject]
, [EMAIL PROTECTED], Janak Desai [EMAIL PROTECTED], linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH 3/7] shared subtree Content-Type: text/x-patch; name=rbind.patch Content-Disposition: inline; filename=rbind.patch Adds the ability to bind/rbind a shared/private/slave subtree and set up propogation wherever needed. RP Signed by Ram Pai ([EMAIL PROTECTED]) fs/namespace.c| 660 -- fs/pnode.c| 235 include/linux/dcache.h|2 include/linux/fs.h|5 include/linux/namespace.h |1 5 files changed, 826 insertions(+), 77 deletions(-) Index: 2.6.12.work2/fs/namespace.c === --- 2.6.12.work2.orig/fs/namespace.c +++ 2.6.12.work2/fs/namespace.c @@ -42,7 +42,8 @@ static inline int sysfs_init(void) static struct list_head *mount_hashtable; static int hash_mask, hash_bits; -static kmem_cache_t *mnt_cache; +static kmem_cache_t *mnt_cache; +static struct rw_semaphore namespace_sem; static inline unsigned long hash(struct vfsmount *mnt, struct dentry *dentry) { @@ -54,7 +55,7 @@ static inline unsigned long hash(struct struct vfsmount *alloc_vfsmnt(const char *name) { - struct vfsmount *mnt = kmem_cache_alloc(mnt_cache, GFP_KERNEL); + struct vfsmount *mnt = kmem_cache_alloc(mnt_cache, GFP_KERNEL); if (mnt) { memset(mnt, 0, sizeof(struct vfsmount)); atomic_set(mnt-mnt_count,1); @@ -86,7 +87,8 @@ void free_vfsmnt(struct vfsmount *mnt) * Now, lookup_mnt increments the ref count before returning * the vfsmount struct. 
*/ -struct vfsmount *lookup_mnt(struct vfsmount *mnt, struct dentry *dentry) +struct vfsmount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry, + struct dentry *root) { struct list_head * head = mount_hashtable + hash(mnt, dentry); struct list_head * tmp = head; @@ -99,7 +101,8 @@ struct vfsmount *lookup_mnt(struct vfsmo if (tmp == head) break; p = list_entry(tmp, struct vfsmount, mnt_hash); - if (p-mnt_parent == mnt p-mnt_mountpoint == dentry) { + if (p-mnt_parent == mnt p-mnt_mountpoint == dentry + (root == NULL || p-mnt_root == root)) { found = mntget(p); break; } @@ -108,6 +111,37 @@ struct vfsmount *lookup_mnt(struct vfsmo return found; } +struct vfsmount *lookup_mnt(struct vfsmount *mnt, struct dentry *dentry) +{ + return __lookup_mnt(mnt, dentry, NULL); +} + +static struct vfsmount * +clone_mnt(struct vfsmount *old, struct dentry *root) +{ + struct super_block *sb = old-mnt_sb; + struct vfsmount *mnt = alloc_vfsmnt(old-mnt_devname); + + if (mnt) { + mnt-mnt_flags = old-mnt_flags; + atomic_inc(sb-s_active); + mnt-mnt_sb = sb; + mnt-mnt_root = dget(root); + mnt-mnt_mountpoint = mnt-mnt_root; + mnt-mnt_parent = mnt; + mnt-mnt_namespace = old-mnt_namespace; + mnt-mnt_pnode = get_pnode(old-mnt_pnode); + + /* stick the duplicate mount on the same expiry list +* as the original if that was on one */ + spin_lock(vfsmount_lock); + if (!list_empty(old-mnt_fslink)) + list_add(mnt-mnt_fslink, old-mnt_fslink); + spin_unlock(vfsmount_lock); + } + return mnt; +} + static inline int check_mnt(struct vfsmount *mnt) { return mnt-mnt_namespace == current-namespace; @@ -128,11 +162,71 @@ static void attach_mnt(struct vfsmount * { mnt-mnt_parent = mntget(nd-mnt); mnt-mnt_mountpoint = dget(nd-dentry); - list_add(mnt-mnt_hash, mount_hashtable+hash(nd-mnt, nd-dentry)); + mnt-mnt_namespace = nd-mnt-mnt_namespace; + list_add_tail(mnt-mnt_hash, + mount_hashtable+hash(nd-mnt, nd-dentry)); list_add_tail(mnt-mnt_child, nd-mnt-mnt_mounts); nd-dentry-d_mounted++; } +static void 
attach_prepare_mnt(struct vfsmount *mnt, struct nameidata *nd) +{ + mnt-mnt_parent = mntget(nd-mnt); + mnt-mnt_mountpoint = dget(nd-dentry); + nd-dentry-d_mounted++; +} + + +void do_attach_commit_mnt(struct vfsmount *mnt) +{ + struct vfsmount *parent = mnt-mnt_parent; + BUG_ON(parent==mnt); + if(list_empty(mnt-mnt_hash)) + list_add_tail(mnt-mnt_hash, + mount_hashtable+hash(parent, mnt-mnt_mountpoint)); + if(list_empty(mnt-mnt_child)) + list_add_tail(mnt-mnt_child, parent-mnt_mounts); + mnt-mnt_namespace = parent-mnt_namespace; + list_add_tail(mnt-mnt_list, mnt-mnt_namespace-list); +} + +struct vfsmount *do_attach_prepare_mnt(struct
[no subject]
, [EMAIL PROTECTED], Janak Desai [EMAIL PROTECTED], linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH 6/7] shared subtree
Content-Type: text/x-patch; name=namespace.patch
Content-Disposition: inline; filename=namespace.patch

Adds ability to clone a namespace that has shared/private/slave/unclone
subtrees in it.

RP

Signed by Ram Pai ([EMAIL PROTECTED])

 fs/namespace.c |    9 +
 1 files changed, 9 insertions(+)

Index: 2.6.12-rc6.work1/fs/namespace.c
===================================================================
--- 2.6.12-rc6.work1.orig/fs/namespace.c
+++ 2.6.12-rc6.work1/fs/namespace.c
@@ -1894,6 +1894,13 @@ int copy_namespace(int flags, struct tas
 	q = new_ns->root;
 	while (p) {
 		q->mnt_namespace = new_ns;
+
+		if (IS_MNT_SHARED(q))
+			pnode_add_member_mnt(q->mnt_pnode, q);
+		else if (IS_MNT_SLAVE(q))
+			pnode_add_slave_mnt(q->mnt_pnode, q);
+		put_pnode(q->mnt_pnode);
+
 		if (fs) {
 			if (p == fs->rootmnt) {
 				rootmnt = p;
@@ -2271,6 +2278,8 @@ void __put_namespace(struct namespace *n
 
 	spin_lock(&vfsmount_lock);
 	list_for_each_entry(mnt, &namespace->list, mnt_list) {
+		if (mnt->mnt_pnode)
+			pnode_disassociate_mnt(mnt);
 		mnt->mnt_namespace = NULL;
 	}
 
[no subject]
, [EMAIL PROTECTED], Janak Desai [EMAIL PROTECTED], linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH 1/7] shared subtree
Content-Type: text/x-patch; name=shared_private_slave.patch
Content-Disposition: inline; filename=shared_private_slave.patch

This patch adds the shared/private/slave support for VFS trees.

Signed by Ram Pai ([EMAIL PROTECTED])

 fs/Makefile           |    2
 fs/dcache.c           |    2
 fs/namespace.c        |   93 ++
 fs/pnode.c            |  441 ++
 include/linux/fs.h    |    5
 include/linux/mount.h |   44
 include/linux/pnode.h |   90 ++
 7 files changed, 673 insertions(+), 4 deletions(-)

Index: 2.6.12.work2/fs/namespace.c
===================================================================
--- 2.6.12.work2.orig/fs/namespace.c
+++ 2.6.12.work2/fs/namespace.c
@@ -22,6 +22,7 @@
 #include <linux/namei.h>
 #include <linux/security.h>
 #include <linux/mount.h>
+#include <linux/pnode.h>
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
 
@@ -62,6 +63,7 @@ struct vfsmount *alloc_vfsmnt(const char
 		INIT_LIST_HEAD(&mnt->mnt_mounts);
 		INIT_LIST_HEAD(&mnt->mnt_list);
 		INIT_LIST_HEAD(&mnt->mnt_fslink);
+		INIT_LIST_HEAD(&mnt->mnt_pnode_mntlist);
 		if (name) {
 			int size = strlen(name)+1;
 			char *newname = kmalloc(size, GFP_KERNEL);
@@ -615,6 +617,95 @@ out_unlock:
 	return err;
 }
 
+static int do_make_shared(struct vfsmount *mnt)
+{
+	int err=0;
+	struct vfspnode *old_pnode = NULL;
+	/*
+	 * if the mount is already a slave mount,
+	 * allocate a new pnode and make it
+	 * a slave pnode of the original pnode.
+	 */
+	if (IS_MNT_SLAVE(mnt)) {
+		old_pnode = mnt->mnt_pnode;
+		pnode_del_slave_mnt(mnt);
+	}
+	if(!IS_MNT_SHARED(mnt)) {
+		mnt->mnt_pnode = pnode_alloc();
+		if(!mnt->mnt_pnode) {
+			pnode_add_slave_mnt(old_pnode, mnt);
+			err = -ENOMEM;
+			goto out;
+		}
+		pnode_add_member_mnt(mnt->mnt_pnode, mnt);
+	}
+	if(old_pnode)
+		pnode_add_slave_pnode(old_pnode, mnt->mnt_pnode);
+	set_mnt_shared(mnt);
+out:
+	return err;
+}
+
+static int do_make_slave(struct vfsmount *mnt)
+{
+	int err=0;
+
+	if (IS_MNT_SLAVE(mnt))
+		goto out;
+	/*
+	 * only shared mounts can
+	 * be made slave
+	 */
+	if (!IS_MNT_SHARED(mnt)) {
+		err = -EINVAL;
+		goto out;
+	}
+	pnode_member_to_slave(mnt);
+out:
+	return err;
+}
+
+static int do_make_private(struct vfsmount *mnt)
+{
+	if(mnt->mnt_pnode)
+		pnode_disassociate_mnt(mnt);
+	set_mnt_private(mnt);
+	return 0;
+}
+
+/*
+ * recursively change the type of the mountpoint.
+ */
+static int do_change_type(struct nameidata *nd, int flag)
+{
+	struct vfsmount *m, *mnt = nd->mnt;
+	int err=0;
+
+	if (!(flag & MS_SHARED) && !(flag & MS_PRIVATE) &&
+			!(flag & MS_SLAVE))
+		return -EINVAL;
+
+	if (nd->dentry != nd->mnt->mnt_root)
+		return -EINVAL;
+
+	spin_lock(&vfsmount_lock);
+	for (m = mnt; m; m = next_mnt(m, mnt)) {
+		switch (flag) {
+		case MS_SHARED:
+			err = do_make_shared(m);
+			break;
+		case MS_SLAVE:
+			err = do_make_slave(m);
+			break;
+		case MS_PRIVATE:
+			err = do_make_private(m);
+			break;
+		}
+	}
+	spin_unlock(&vfsmount_lock);
+	return err;
+}
+
 /*
  * do loopback mount.
  */
@@ -1049,6 +1140,8 @@ long do_mount(char * dev_name, char * di
 				    data_page);
 	else if (flags & MS_BIND)
 		retval = do_loopback(&nd, dev_name, flags & MS_REC);
+	else if (flags & MS_SHARED || flags & MS_PRIVATE || flags & MS_SLAVE)
+		retval = do_change_type(&nd, flags);
 	else if (flags & MS_MOVE)
 		retval = do_move_mount(&nd, dev_name);
 	else
Index: 2.6.12.work2/fs/pnode.c
===================================================================
--- /dev/null
+++ 2.6.12.work2/fs/pnode.c
@@ -0,0 +1,441 @@
+/*
+ *  linux/fs/pnode.c
+ *
+ * (C) Copyright IBM Corporation 2005.
+ *	Released under GPL v2.
+ *	Author : Ram Pai ([EMAIL PROTECTED])
+ *
+ */
+
+#include <linux/config.h>
+#include <linux/syscalls.h>
+#include <linux/slab.h>
+#include <linux/sched.h>
+#include <linux/smp_lock.h>
+#include <linux/init.h>
+#include <linux/quotaops.h>
+#include <linux/acct.h>
+#include <linux/module.h>
+#include <linux/seq_file.h>
+#include <linux/namespace.h>
+#include <linux/namei.h>
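The type transitions implemented by do_make_shared(), do_make_slave() and do_make_private() in this patch can be summarised as a small state machine. The sketch below is a user-space approximation under stated assumptions: `struct mnt`, its `has_master` field, and the `make_*()` helpers are hypothetical, and the pnode bookkeeping is reduced to a single flag that records whether the mount still receives propagation from a master.

```c
#include <errno.h>

enum mnt_type { MNT_PRIVATE, MNT_SHARED, MNT_SLAVE };

/* Hypothetical simplified mount: its type, and whether it still has a master. */
struct mnt {
	enum mnt_type type;
	int has_master;	/* nonzero if propagation still arrives from a master */
};

/* Any mount can become shared; a former slave keeps its master. */
static int make_shared(struct mnt *m)
{
	if (m->type == MNT_SLAVE)
		m->has_master = 1;
	m->type = MNT_SHARED;
	return 0;
}

/* Only a shared mount can become a slave; an existing slave is left alone. */
static int make_slave(struct mnt *m)
{
	if (m->type == MNT_SLAVE)
		return 0;
	if (m->type != MNT_SHARED)
		return -EINVAL;
	m->type = MNT_SLAVE;
	m->has_master = 1;
	return 0;
}

/* Making a mount private drops every propagation relationship. */
static int make_private(struct mnt *m)
{
	m->type = MNT_PRIVATE;
	m->has_master = 0;
	return 0;
}
```

The one asymmetric rule is the -EINVAL from `make_slave()` on a private mount, which mirrors the "only shared mounts can be made slave" check in the patch.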
Re: supposed to be shared subtree patches.
On Mon, 2005-07-25 at 15:44, Ram Pai wrote:
> , [EMAIL PROTECTED], Janak Desai [EMAIL PROTECTED], linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
> Subject: [PATCH 0/7] shared subtree
>
> Hi Andrew/Al Viro,
>
> Enclosing a final set of well tested patches that implement

My apologies. I screwed up sending the patches through quilt.

Anyway, I have received the following comments from Andrew Morton, which I will incorporate before sending out saner-looking patches.

Sorry again,
RP

Andrew's comments follow:

> Frankly, I don't even know what these patches _do_, and haven't spent
> the time to try to find out.
>
> If these patches are merged, how do we expect end-users to find out how
> to use the new capabilities?
>
> A few paragraphs in the patch #1 changelog would help. A high-level
> description of the new capability which explains what it does and why
> it would be a useful thing for Linux.
>
> And maybe some deeper information in a Documentation/ file.
>
> Right now, there might well be a lot of people who could use these new
> features, but they don't even know that these patches provide them!
> It's all a bit of a mystery, really.
[RFC-2 PATCH 6/8] shared subtree
Adds ability to clone a namespace that has shared/private/slave/unclone
subtrees in it.

RP

Signed by Ram Pai ([EMAIL PROTECTED])

 fs/namespace.c |    9 +
 1 files changed, 9 insertions(+)

Index: 2.6.12.work1/fs/namespace.c
===================================================================
--- 2.6.12.work1.orig/fs/namespace.c
+++ 2.6.12.work1/fs/namespace.c
@@ -1763,6 +1763,13 @@ int copy_namespace(int flags, struct tas
 	q = new_ns->root;
 	while (p) {
 		q->mnt_namespace = new_ns;
+
+		if (IS_MNT_SHARED(q))
+			pnode_add_member_mnt(q->mnt_pnode, q);
+		else if (IS_MNT_SLAVE(q))
+			pnode_add_slave_mnt(q->mnt_pnode, q);
+		put_pnode(q->mnt_pnode);
+
 		if (fs) {
 			if (p == fs->rootmnt) {
 				rootmnt = p;
@@ -2129,6 +2136,8 @@ void __put_namespace(struct namespace *n
 
 	spin_lock(&vfsmount_lock);
 	list_for_each_entry(mnt, &namespace->list, mnt_list) {
+		if (mnt->mnt_pnode)
+			pnode_disassociate_mnt(mnt);
 		mnt->mnt_namespace = NULL;
 	}
 
[RFC-2 PATCH 3/8] shared subtree
Adds the ability to bind/rbind a shared/private/slave subtree and set up propogation wherever needed. RP Signed by Ram Pai ([EMAIL PROTECTED]) fs/namespace.c| 559 -- fs/pnode.c| 416 +- include/linux/dcache.h|2 include/linux/fs.h|4 include/linux/namespace.h |1 include/linux/pnode.h |5 6 files changed, 906 insertions(+), 81 deletions(-) Index: 2.6.12.work1/fs/namespace.c === --- 2.6.12.work1.orig/fs/namespace.c +++ 2.6.12.work1/fs/namespace.c @@ -42,7 +42,8 @@ static inline int sysfs_init(void) static struct list_head *mount_hashtable; static int hash_mask, hash_bits; -static kmem_cache_t *mnt_cache; +static kmem_cache_t *mnt_cache; +static struct rw_semaphore namespace_sem; static inline unsigned long hash(struct vfsmount *mnt, struct dentry *dentry) { @@ -54,7 +55,7 @@ static inline unsigned long hash(struct struct vfsmount *alloc_vfsmnt(const char *name) { - struct vfsmount *mnt = kmem_cache_alloc(mnt_cache, GFP_KERNEL); + struct vfsmount *mnt = kmem_cache_alloc(mnt_cache, GFP_KERNEL); if (mnt) { memset(mnt, 0, sizeof(struct vfsmount)); atomic_set(mnt-mnt_count,1); @@ -86,7 +87,8 @@ void free_vfsmnt(struct vfsmount *mnt) * Now, lookup_mnt increments the ref count before returning * the vfsmount struct. 
*/ -struct vfsmount *lookup_mnt(struct vfsmount *mnt, struct dentry *dentry) +struct vfsmount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry, + struct dentry *root) { struct list_head * head = mount_hashtable + hash(mnt, dentry); struct list_head * tmp = head; @@ -99,7 +101,8 @@ struct vfsmount *lookup_mnt(struct vfsmo if (tmp == head) break; p = list_entry(tmp, struct vfsmount, mnt_hash); - if (p-mnt_parent == mnt p-mnt_mountpoint == dentry) { + if (p-mnt_parent == mnt p-mnt_mountpoint == dentry +(root == NULL || p-mnt_root == root)) { found = mntget(p); break; } @@ -108,6 +111,37 @@ struct vfsmount *lookup_mnt(struct vfsmo return found; } +struct vfsmount *lookup_mnt(struct vfsmount *mnt, struct dentry *dentry) +{ + return __lookup_mnt(mnt, dentry, NULL); +} + +static struct vfsmount * +clone_mnt(struct vfsmount *old, struct dentry *root) +{ + struct super_block *sb = old-mnt_sb; + struct vfsmount *mnt = alloc_vfsmnt(old-mnt_devname); + + if (mnt) { + mnt-mnt_flags = old-mnt_flags; + atomic_inc(sb-s_active); + mnt-mnt_sb = sb; + mnt-mnt_root = dget(root); + mnt-mnt_mountpoint = mnt-mnt_root; + mnt-mnt_parent = mnt; + mnt-mnt_namespace = old-mnt_namespace; + mnt-mnt_pnode = get_pnode(old-mnt_pnode); + + /* stick the duplicate mount on the same expiry list + * as the original if that was on one */ + spin_lock(vfsmount_lock); + if (!list_empty(old-mnt_fslink)) + list_add(mnt-mnt_fslink, old-mnt_fslink); + spin_unlock(vfsmount_lock); + } + return mnt; +} + static inline int check_mnt(struct vfsmount *mnt) { return mnt-mnt_namespace == current-namespace; @@ -128,11 +162,70 @@ static void attach_mnt(struct vfsmount * { mnt-mnt_parent = mntget(nd-mnt); mnt-mnt_mountpoint = dget(nd-dentry); + mnt-mnt_namespace = nd-mnt-mnt_namespace; list_add(mnt-mnt_hash, mount_hashtable+hash(nd-mnt, nd-dentry)); list_add_tail(mnt-mnt_child, nd-mnt-mnt_mounts); nd-dentry-d_mounted++; } +static struct vfsmount *do_attach_mnt(struct vfsmount *mnt, + struct dentry *dentry, + 
struct vfsmount *child_mnt) +{ + struct nameidata nd; + LIST_HEAD(head); + + nd.mnt = mnt; + nd.dentry = dentry; + attach_mnt(child_mnt, nd); + list_add_tail(head, child_mnt-mnt_list); + list_splice(head, child_mnt-mnt_namespace-list.prev); + return child_mnt; +} + +static void attach_prepare_mnt(struct vfsmount *mnt, struct nameidata *nd) +{ + mnt-mnt_parent = mntget(nd-mnt); + mnt-mnt_mountpoint = dget(nd-dentry); + nd-dentry-d_mounted++; +} + +void do_attach_real_mnt(struct vfsmount *mnt) +{ + struct vfsmount *parent = mnt-mnt_parent; + BUG_ON(parent==mnt); + if(list_empty(mnt-mnt_hash)) + list_add(mnt-mnt_hash, + mount_hashtable+hash(parent, mnt-mnt_mountpoint)); + if(list_empty(mnt-mnt_child)) + list_add_tail(mnt-mnt_child, parent-mnt_mounts); + mnt-mnt_namespace = parent-mnt_namespace; + list_add_tail(mnt-mnt_list, mnt-mnt_namespace-list); +} + +struct vfsmount *do_attach_prepare_mnt(struct vfsmount *mnt, + struct dentry *dentry, + struct vfsmount *template_mnt, + int clone_flag) +{ + struct vfsmount *child_mnt; + struct nameidata nd; + + if (clone_flag) { + if(!(child_mnt = clone_mnt(template_mnt, +template_mnt-mnt_root))) + return NULL; + } else + child_mnt = template_mnt; + + nd.mnt = mnt; + nd.dentry = dentry; + + attach_prepare_mnt(child_mnt, nd); + + return child_mnt; +} + static struct vfsmount *next_mnt(struct vfsmount *p, struct vfsmount *root) { struct list_head *next = p-mnt_mounts.next
[RFC-2 PATCH 4/8] shared subtree
Adds ability to move a shared/private/slave/unclone tree to any other shared/private/slave/unclone tree. Also incorporates the same behavior for pivot_root() RP Signed by Ram Pai ([EMAIL PROTECTED]) fs/namespace.c | 150 +++-- 1 files changed, 125 insertions(+), 25 deletions(-) Index: 2.6.12.work1/fs/namespace.c === --- 2.6.12.work1.orig/fs/namespace.c +++ 2.6.12.work1/fs/namespace.c @@ -664,9 +664,12 @@ static struct vfsmount *copy_tree(struct return NULL; } + /* * @source_mnt : mount tree to be attached * @nd : place the mount tree @source_mnt is attached + * @move : use the move semantics if set, else use normal attach semantics + *as explained below * * NOTE: in the table below explains the semantics when a source vfsmount * of a given type is attached to a destination vfsmount of a give type. @@ -699,16 +702,44 @@ static struct vfsmount *copy_tree(struct * (+) the mount will be propogated to the destination vfsmount * and the new mount will be added to the source vfsmount's pnode. * + * + * - + * |MOVE MOUNT OPERATION | + * |***| + * | dest -- | shared | private | slave |unclonable | + * | source | | | | | + * | | | | | | | + * | v | | | | | + * |***| + * | | | | | | + * | shared | shared (++) | shared (+)|shared (+)| shared (+)| + * | | | | | | + * | | | | | | + * | private | shared (+) | private | private | private | + * | | | | | | + * | | | | | | + * | slave | shared (+++) | slave | slave| slave | + * | | | | | | + * | | | | | | + * | unclonable| unclonable | unclonable |unclonable| unclonable| + * | | | | | | + * | | | | | | + * + * + * (+++) the mount will be propogated to all the vfsmounts in the pnode tree + * of the destination vfsmount, and all the new mounts will be + * added to a new pnode , which will be a slave pnode of the + * source vfsmount's pnode. + * * if the source mount is a tree, the operations explained above is - * applied to each - * vfsmount in the tree. + * applied to each vfsmount in the tree. 
* * Should be called without spinlocks held, because this function can sleep * in allocations. * */ static int attach_recursive_mnt(struct vfsmount *source_mnt, - struct nameidata *nd) + struct nameidata *nd, int move) { struct vfsmount *mntpt_mnt, *m, *p; struct vfspnode *src_pnode, *t_p, *dest_pnode, *tmp_pnode; @@ -718,7 +749,9 @@ static int attach_recursive_mnt(struct v mntpt_mnt = nd-mnt; dest_pnode = IS_MNT_SHARED(mntpt_mnt) ? mntpt_mnt-mnt_pnode : NULL; - src_pnode = IS_MNT_SHARED(source_mnt) ? source_mnt-mnt_pnode : NULL; + src_pnode = IS_MNT_SHARED(source_mnt) || + (move IS_MNT_SLAVE(source_mnt)) ? + source_mnt-mnt_pnode : NULL; if (!dest_pnode !src_pnode) { LIST_HEAD(head); @@ -739,6 +772,7 @@ static int attach_recursive_mnt(struct v p = NULL; for (m = source_mnt; m; m = next_mnt(m, source_mnt)) { int unclone = IS_MNT_UNCLONE(m); + int slave = IS_MNT_SLAVE(m); list_del_init(m-mnt_list); @@ -756,7 +790,7 @@ static int attach_recursive_mnt(struct v p=m; dest_pnode = IS_MNT_SHARED(mntpt_mnt) ? mntpt_mnt-mnt_pnode : NULL; - src_pnode = (IS_MNT_SHARED(m))? + src_pnode = (IS_MNT_SHARED(m) || (move slave))? 
m-mnt_pnode : NULL; m-mnt_pnode = NULL; @@ -772,19 +806,35 @@ static int attach_recursive_mnt(struct v if ((ret = pnode_prepare_mount(dest_pnode, tmp_pnode, mntpt_dentry, m, mntpt_mnt))) return ret; + if (move dest_pnode slave) +SET_PNODE_SLAVE(tmp_pnode); } else { if (m == m-mnt_parent) do_attach_prepare_mnt(mntpt_mnt, mntpt_dentry, m, 0); - pnode_add_member_mnt(tmp_pnode, m); - if (unclone) { -set_mnt_unclone(m); -m-mnt_pnode = tmp_pnode; -SET_PNODE_DELETE(tmp_pnode); - } else if (!src_pnode) { -set_mnt_private(m); -m-mnt_pnode = tmp_pnode; -SET_PNODE_DELETE(tmp_pnode); + if (move slave) +pnode_add_slave_mnt(tmp_pnode, m); + else { +pnode_add_member_mnt(tmp_pnode, m); +if (unclone) { + BUG_ON(!move); + set_mnt_unclone(m); + m-mnt_pnode = tmp_pnode; + SET_PNODE_DELETE(tmp_pnode); +} else if (!src_pnode) { + set_mnt_private(m); + m-mnt_pnode = tmp_pnode; + SET_PNODE_DELETE(tmp_pnode