Re: how to show propagation state for mounts

2008-02-20 Thread Ram Pai
On Wed, 2008-02-20 at 09:31 -0700, Matthew Wilcox wrote:
 On Wed, Feb 20, 2008 at 04:04:22PM +, Al Viro wrote:
  It's less about the form of representation (after all, we generate poll
  events when contents of that sucker changes, so one *can* get a consistent
  snapshot of the entire thing) and more about having it self-contained
  when we have namespaces in the play.
  
  IOW, the data in there should give answers to questions that make sense.
  Do events get propagated from this vfsmount I have to that vfsmount I 
  have?
  is a meaningful one; ditto for are events here propagated to somewhere I
  don't see? or are events getting propagated here from somewhere I don't
  see?.
 
 Why do those last two questions deserve an answer?  How will a person's
 or application's behaviour be affected by whether a change will
 propagate to something they don't know about and can't see?

Well, I do not want to be surprised to see a mount suddenly show up in
my namespace because of some action by some other user in some other
namespace. It's going to happen anyway if the namespace is forked off
a namespace that had shared mounts in it. However, I would rather
know in advance the spots (mounts) where such surprises can
happen. Also, I would prefer to know how my actions will affect mounts in
other namespaces.

RP


 

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to show propagation state for mounts

2008-02-20 Thread Ram Pai
On Wed, 2008-02-20 at 17:27 +0100, Miklos Szeredi wrote:
  On Wed, Feb 20, 2008 at 04:39:15PM +0100, Miklos Szeredi wrote:
mountinfo - IMO needs a sane discussion of what and how should be shown
wrt propagation state
   
   Here's my take on the matter.
   
   The propagation tree can either be represented
   
1) from root to leaf listing members of peer groups and their
slaves explicitly,
   
2) or from leaf to root by identifying each peer group and then for
each mount showing the id of its own group and the id of the group's
master.
   
   2) can have two variants:
   
2a) id of peer group is constant in time
   
2b) id of peer group may change
   
   The current patch does 2b).  Having a fixed id for each peer group
   would mean introducing a new object to anchor the peer group into,
   which would add complexity to the whole thing.
   
   All of these are implementable, just need to decide which one we want.
  
  Eh...  Much more interesting question: since the propagation tree spans
  multiple namespaces in a lot of normal uses, how do we deal with
  reconstructing propagation through the parts that are not present in
  our namespace?  Moreover, what should and what should not be kept private
  to namespace?  Full exposure of mount trees is definitely over the top
  (it shows potentially sensitive information), so we probably want less
  than that.
  
  FWIW, my gut feeling is that for each peer group that intersects with our
  namespace we ought to expose in some form
  * all vfsmounts belonging to that intersection
  * the nearest dominating peer group (== master (of master ...) of)
  that also has a non-empty intersection with our namespace
  
  It's less about the form of representation (after all, we generate poll
  events when contents of that sucker changes, so one *can* get a consistent
  snapshot of the entire thing) and more about having it self-contained
  when we have namespaces in the play.
  
  IOW, the data in there should give answers to questions that make sense.
  Do events get propagated from this vfsmount I have to that vfsmount I 
  have?
  is a meaningful one; ditto for are events here propagated to somewhere I
  don't see? or are events getting propagated here from somewhere I don't
  see?.
 
 Well, assuming you see only one namespace.  When I'm experimenting
 with namespaces and propagations, I see both (each in a separate
 xterm) and I do want to know how propagation between them happens.
 
 Your suggestion doesn't deal with that problem.
 
 Otherwise, yes it makes sense to have a consistent view of the tree
 shown for each namespace.  Perhaps the solution is to restrict viewing
 the whole tree to privileged processes.

I wonder, what is wrong with reporting mounts in other namespaces that
either receive propagation from or send propagation to mounts in our
namespace?

If we take that approach, we will report **only** the mounts in other
namespaces which have a counterpart in our namespace. After all, the
filesystems backing the mounts here and there are the same (otherwise
they wouldn't have propagated).

And any mounts outside our namespace that have no propagation
relation to any mount in our namespace will remain hidden.

RP


 
 Miklos



[RFC PATCH] vfs: optimization to /proc/pid/mountinfo patch

2008-02-04 Thread Ram Pai
1) reports deleted inode in dentry_path() consistent with that in __d_path()
2) modified __d_path() to use prepend(), reducing the size of __d_path()
3) moved all the functionality that reports mount information in /proc under
CONFIG_PROC_FS.

Could not verify if the code would work with CONFIG_PROC_FS=n, since it was
impossible to disable CONFIG_PROC_FS. Looking for ideas on how to disable
CONFIG_PROC_FS.



Signed-off-by: Ram Pai [EMAIL PROTECTED]
---
 fs/dcache.c  |   59 +++
 fs/namespace.c   |2 +
 fs/seq_file.c|2 +
 include/linux/dcache.h   |3 ++
 include/linux/seq_file.h |3 ++
 5 files changed, 34 insertions(+), 35 deletions(-)

Index: linux-2.6.23/fs/dcache.c
===
--- linux-2.6.23.orig/fs/dcache.c
+++ linux-2.6.23/fs/dcache.c
@@ -1747,6 +1747,17 @@ shouldnt_be_hashed:
 	goto shouldnt_be_hashed;
 }
 
+static int prepend(char **buffer, int *buflen, const char *str,
+			int namelen)
+{
+	*buflen -= namelen;
+	if (*buflen < 0)
+		return 1;
+	*buffer -= namelen;
+	memcpy(*buffer, str, namelen);
+	return 0;
+}
+
 /**
  * d_path - return the path of a dentry
  * @dentry: dentry to report
@@ -1768,17 +1779,11 @@ static char *__d_path(struct dentry *den
 {
 	char * end = buffer+buflen;
 	char * retval;
-	int namelen;
 
-	*--end = '\0';
-	buflen--;
-	if (!IS_ROOT(dentry) && d_unhashed(dentry)) {
-		buflen -= 10;
-		end -= 10;
-		if (buflen < 0)
+	prepend(&end, &buflen, "\0", 1);
+	if (!IS_ROOT(dentry) && d_unhashed(dentry) &&
+		prepend(&end, &buflen, " (deleted)", 10))
 			goto Elong;
-		memcpy(end, " (deleted)", 10);
-	}
 
 	if (buflen < 1)
 		goto Elong;
@@ -1805,13 +1810,10 @@ static char *__d_path(struct dentry *den
 		}
 		parent = dentry->d_parent;
 		prefetch(parent);
-		namelen = dentry->d_name.len;
-		buflen -= namelen + 1;
-		if (buflen < 0)
+		if (prepend(&end, &buflen, dentry->d_name.name,
+				dentry->d_name.len) ||
+		    prepend(&end, &buflen, "/", 1))
 			goto Elong;
-		end -= namelen;
-		memcpy(end, dentry->d_name.name, namelen);
-		*--end = '/';
 		retval = end;
 		dentry = parent;
 	}
@@ -1819,12 +1821,9 @@ static char *__d_path(struct dentry *den
 	return retval;
 
 global_root:
-	namelen = dentry->d_name.len;
-	buflen -= namelen;
-	if (buflen < 0)
-		goto Elong;
-	retval -= namelen-1;	/* hit the slash */
-	memcpy(retval, dentry->d_name.name, namelen);
+	retval += 1;	/* hit the slash */
+	if (prepend(&retval, &buflen, dentry->d_name.name, dentry->d_name.len))
+		goto Elong;
 	return retval;
 Elong:
 	return ERR_PTR(-ENAMETOOLONG);
@@ -1890,17 +1889,8 @@ char *dynamic_dname(struct dentry *dentr
 	return memcpy(buffer, temp, sz);
 }
 
-static int prepend(char **buffer, int *buflen, const char *str,
-			int namelen)
-{
-	*buflen -= namelen;
-	if (*buflen < 0)
-		return 1;
-	*buffer -= namelen;
-	memcpy(*buffer, str, namelen);
-	return 0;
-}
 
+#ifdef CONFIG_PROC_FS
 /*
  * Write full pathname from the root of the filesystem into the buffer.
  */
@@ -1910,11 +1900,9 @@ char *dentry_path(struct dentry *dentry,
 	char *retval;
 
 	spin_lock(&dcache_lock);
-	prepend(&end, &buflen, "\0", 1);
-	if (!IS_ROOT(dentry) && d_unhashed(dentry)) {
-		if (prepend(&end, &buflen, "//deleted", 9))
+	if (!IS_ROOT(dentry) && d_unhashed(dentry) &&
+		prepend(&end, &buflen, " (deleted)", 10))
 			goto Elong;
-	}
 	if (buflen < 1)
 		goto Elong;
 	/* Get '/' right */
@@ -1943,6 +1931,7 @@ Elong:
 	spin_unlock(&dcache_lock);
 	return ERR_PTR(-ENAMETOOLONG);
 }
+#endif /* CONFIG_PROC_FS */
 
 /*
  * NOTE! The user-level library version returns a
Index: linux-2.6.23/fs/namespace.c
===
--- linux-2.6.23.orig/fs/namespace.c
+++ linux-2.6.23/fs/namespace.c
@@ -609,6 +609,7 @@ void mnt_unpin(struct vfsmount *mnt)
 
 EXPORT_SYMBOL(mnt_unpin);
 
+#ifdef CONFIG_PROC_FS
 /* iterator */
 static void *m_start(struct seq_file *m, loff_t *pos)
 {
@@ -795,6 +796,7 @@ const struct seq_operations mountstats_o
.stop   = m_stop,
.show   = show_vfsstat,
 };
+#endif  /* CONFIG_PROC_FS */
 
 /**
  * may_umount_tree - check if a mount tree is busy
Index: linux-2.6.23/fs/seq_file.c

Re: [patch] vfs: create /proc/pid/mountinfo

2008-01-31 Thread Ram Pai
On Thu, 2008-01-31 at 10:17 +0100, Miklos Szeredi wrote:
   From: Ram Pai [EMAIL PROTECTED]

...snipped...

 IDR ids are 'int' but they are always positive (AFAICT), but yeah,
 maybe this is confusing.
 
  The new exported-to-everyone dentry_path() probably could do with a bit
  more documentation - it's the sort of thing which people keep on wanting
  and using.
 
 OK.
 
  How does dentry_path() differ from d_path() and why do we need both and can
  we get some sharing/consolidation happening here?

d_path displays the path from the rootfs, whereas dentry_path displays
the path from the root of that filesystem.
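
To make the difference concrete, here is a minimal userspace sketch (not
the kernel code) of the right-to-left path construction that both
functions rely on. struct node, node_path() and the idea of a root being
its own parent are hypothetical stand-ins for dentries and IS_ROOT();
only prepend() mirrors the helper in the patch. Stopping the walk at the
filesystem's own root gives the dentry_path-style result.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a dentry: a name plus a parent pointer.
 * The root of a filesystem is its own parent. */
struct node {
	const char *name;
	struct node *parent;
};

/* Same contract as prepend() in the patch: move *buffer back by namelen
 * bytes and copy str there; return 1 if the buffer is exhausted. */
static int prepend(char **buffer, int *buflen, const char *str, int namelen)
{
	*buflen -= namelen;
	if (*buflen < 0)
		return 1;
	*buffer -= namelen;
	memcpy(*buffer, str, namelen);
	return 0;
}

/* Build the path of n relative to its filesystem root, filling buf from
 * the end. Returns a pointer into buf, or NULL if buf is too small. */
static char *node_path(struct node *n, char *buf, int buflen)
{
	char *end = buf + buflen;
	char *retval;

	if (prepend(&end, &buflen, "\0", 1))
		return NULL;
	if (buflen < 1)
		return NULL;
	/* A bare root is reported as "/". */
	retval = end - 1;
	*retval = '/';

	while (n->parent != n) {
		if (prepend(&end, &buflen, n->name, (int)strlen(n->name)) ||
		    prepend(&end, &buflen, "/", 1))
			return NULL;
		retval = end;
		n = n->parent;
	}
	return retval;
}
```

The buffer is filled from its tail, so the caller must use the returned
pointer rather than buf itself. A d_path-style walk would additionally
cross mount boundaries instead of stopping at the filesystem root.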

 
 Tried that but not easy, without removing some of the
 microoptimizations in d_path(), which I'm not sure are really
 important, but...

This patch was initially developed with Al Viro. He preferred to keep the
two functions separate. BTW: this patch owes credit to Al Viro for his
initial set of ideas.

 
  Why do d_path() and dentry_path() have differing conventions for displaying
  a deleted file and can we fix that?
 
 I think Ram chose a different convention in dentry_path() in order to
 make sure, there was no space in the resulting path.  But spaces would
 be escaped anyway, so this isn't really important.  So yes, this could
 be fixed.


My patch was generated about a year or so back, probably using the
2.6.18 code base, which had the //deleted convention. That got copied
into my patch. But since then I see that the original code has changed to
use the " (deleted)" convention.

Yes, this patch has to be changed to be consistent with the existing
code.


 
  This patch adds a lot of code which is, I guess, unused if
  CONFIG_PROC_FS=n.  Fixable?

yes. good observation. I will send a patch with this optimization and
the above mentioned change. 

RP

 
 Possibly yes.  A good chunk of namespace.c could be surrounded by an
 #ifdef, which would save even more, than was added by this particular
 patch.


 Thanks,
 Miklos



Re: [RFC][PATCH] VFS: create /proc/pid/mountinfo

2008-01-21 Thread Ram Pai

Miklos,

You have removed the code that checked if the peer or
master mount was in the same namespace before reporting their
corresponding mount-ids. One downside of that approach is that the
user will see a mount_id in the output with no corresponding
line to explain the details of that mount_id.

And reporting the mount-id of a mount in some other namespace
could subtly mean an information leak?


One other comment I had received offline from Steve French was
that the patch did not consider the following case:

Have you thought about whether this could handle the case in which 
cifs mounts with 
a relative path e.g. currently
mount -t cifs //server/share /mnt

can not be distinguished from
mount -t cifs //server/share/subdirectory /mnt

when you run the mount command (ie the cifs prefixpath in this case 
/subdirectory is not displayed)


thanks for driving this patch further and sorry; have not been active on this 
work for a while,
RP


On Sat, 2008-01-19 at 12:05 +0100, Miklos Szeredi wrote:
 Seems, most people would be happier with a new file, instead of
 extending /proc/mounts.
 
 This patch is the first attempt at doing that, as well as fixing the
 issues found in the previous submission.
 
 Thanks,
 Miklos
 
 ---
 From: Ram Pai [EMAIL PROTECTED]
 
 /proc/mounts in its current state fails to disambiguate bind mounts, especially
 when the bind mount is subrooted. Also it does not capture the propagation state
 of the mounts (shared-subtree). The following patch addresses the problem.
 
 The patch adds '/proc/pid/mountinfo' which contains a superset of
 the fields in '/proc/pid/mounts'. The following additional fields
 are added:
 
 mntid -- is a unique identifier of the mount
 parent -- the id of the parent mount
 major:minor -- value of st_dev for files on that filesystem
 dir -- the subdir in the filesystem which forms the root of this mount
 propagation-type in the form of propagation_flag[:mntid][,...]
   note: 'shared' flag is followed by the mntid of its peer mount
 'slave' flag is followed by the mntid of its master mount
 'private' flag stands by itself
 'unbindable' flag stands by itself
 
 Also mount options are split into two fields, the first containing the
 per mount flags, the second the per super block options.
 
 Here is a sample cat /proc/mounts after executing the following commands:
 
 mount --bind /mnt /mnt
 mount --make-shared /mnt
 mount --bind /mnt/1 /var
 mount --make-slave /var
 mount --make-shared /var
 mount --bind /var/abc /tmp
 mount --make-unbindable /proc
 
 2 2 0:1 rootfs rootfs / / rw rw private
 16 2 98:0 ext2 /dev/root / / rw rw private
 17 16 0:3 proc /proc / /proc rw rw unbindable
 18 16 0:10 devpts devpts /dev/pts / rw rw private
 19 16 98:0 ext2 /dev/root /mnt /mnt rw rw shared:19
 20 16 98:0 ext2 /dev/root /mnt/1 /var rw rw shared:21,slave:19
 21 16 98:0 ext2 /dev/root /mnt/1/abc /tmp rw rw shared:20,slave:19
 
 For example, the last line indicates that:
 
 1) The mount is a shared mount.
 2) Its peer mount is the mount with id 20
 3) It is also a slave mount of the master-mount with the id 19
 4) The filesystem on device with major/minor number 98:0 and subdirectory
   mnt/1/abc makes the root directory of this mount.
 5) And finally the mount with id 16 is its parent.
 
 
 [EMAIL PROTECTED]:
 
 - new file, rearrange fields
 - for mount ID's use IDA (from the IDR library) instead of a 32bit
   counter, which could overflow
 - print canonical ID's (smallest one within the peer group) for peers
   and master, this is more useful, than a random ID within the same namespace
 - fix a couple of small bugs
 - remove inlines
 - style fixes
 
 Signed-off-by: Ram Pai [EMAIL PROTECTED]
 Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
 ---
 
 Index: linux/fs/dcache.c
 ===
 --- linux.orig/fs/dcache.c2008-01-18 19:21:38.0 +0100
 +++ linux/fs/dcache.c 2008-01-18 19:22:27.0 +0100
 @@ -1890,6 +1890,60 @@ char *dynamic_dname(struct dentry *dentr
  	return memcpy(buffer, temp, sz);
  }
 
 +static int prepend(char **buffer, int *buflen, const char *str,
 +		   int namelen)
 +{
 +	*buflen -= namelen;
 +	if (*buflen < 0)
 +		return 1;
 +	*buffer -= namelen;
 +	memcpy(*buffer, str, namelen);
 +	return 0;
 +}
 +
 +/*
 + * Write full pathname from the root of the filesystem into the buffer.
 + */
 +char *dentry_path(struct dentry *dentry, char *buf, int buflen)
 +{
 +	char *end = buf + buflen;
 +	char *retval;
 +
 +	spin_lock(&dcache_lock);
 +	prepend(&end, &buflen, "\0", 1);
 +	if (!IS_ROOT(dentry) && d_unhashed(dentry)) {
 +		if (prepend(&end, &buflen, "//deleted", 9))
 +			goto Elong;
 +	}
 +	if (buflen < 1)
 +		goto

Re: [RFC][PATCH] VFS: create /proc/pid/mountinfo

2008-01-21 Thread Ram Pai
On Mon, 2008-01-21 at 22:25 +0100, Miklos Szeredi wrote:
  You have removed the code that checked if the peer or
  master mount was in the same namespace before reporting their
  corresponding mount-ids. One downside of that approach is that the
  user will see a mount_id in the output with no corresponding
  line to explain the details of that mount_id.
 
 Before the change, the peer and master ID's were basically randomly
 chosen from the peers, which means, it wasn't possible to always
 determine, that two mounts were peers, or that they were slaves to the
 same peer group.
 
 After the change, this is possible, since the peer ID will be the same
 for all mounts which are peers.  This means, that even though the peer
 ID might be in a different namespace, it is possible to determine all
 peers within the same namespace by comparing their peer ID's.
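
The comparison described above can be sketched in a few lines of C. This
is only an illustration under stated assumptions: struct mnt_rec and both
helpers are hypothetical names, a peer_id of 0 is assumed to mean "not
shared", and the canonical id is taken to be the smallest mount id in the
peer group, as the patch notes describe.

```c
#include <stddef.h>

/* Hypothetical parsed record: the mount id plus the peer-group id printed
 * after "shared:" (0 is assumed here to mean the mount is not shared). */
struct mnt_rec {
	int id;
	int peer_id;
};

/* Two mounts are peers iff both are shared and report the same peer id. */
static int are_peers(const struct mnt_rec *a, const struct mnt_rec *b)
{
	return a->peer_id != 0 && a->peer_id == b->peer_id;
}

/* Canonical id of a peer group: the smallest mount id among its members.
 * ids[] lists the member mount ids; returns -1 for an empty group. */
static int canonical_peer_id(const int *ids, size_t n)
{
	int min;
	size_t i;

	if (n == 0)
		return -1;
	min = ids[0];
	for (i = 1; i < n; i++)
		if (ids[i] < min)
			min = ids[i];
	return min;
}
```

Because every member reports the same value, the check works even when
the mount that owns the canonical id lives in another namespace.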


 I agree with your reasoning on the random id; showing a single
 id avoids clutter. But my point is, why not show an
 id for the master or peer residing in the same namespace?
 Showing an id with no corresponding entry for that id can be
 puzzling.

 
 If no master-mount exists in the same namespace then print -1,
 meaning masked.

 There is always at least one peer-mount in a given namespace, so no
 issue there.
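
A rough sketch of that rule, with made-up types: struct mnt_view, its ns
field and visible_master_id() are hypothetical and do not correspond to
anything in the patch. Given the members of the master's peer group, it
reports one that lives in the caller's namespace, or -1 (masked) if none
does.

```c
#include <stddef.h>

/* Hypothetical view of a mount: its id and the namespace it lives in. */
struct mnt_view {
	int id;
	int ns;
};

/* Return the id of the first member of the master's peer group that is
 * in namespace ns, or -1 ("masked") if the group has no member there. */
static int visible_master_id(const struct mnt_view *group, size_t n, int ns)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (group[i].ns == ns)
			return group[i].id;
	return -1;
}
```

With this rule every id that appears in the output has a matching line in
the same listing, at the cost of different namespaces possibly printing
different ids for the same master group.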

 

  
  And reporting the mount-id of a mount in some other namespace
  could subtly mean an information leak?
 
 I don't think the mount ID itself can be sensitive, it really doesn't
 contain any information, other than being an identifier.
 
  One other comment I had received offline from Steve French was
  that the patch did not consider the following case:
  
  Have you thought about whether this could handle the case in which 
  cifs mounts with 
  a relative path e.g. currently
  mount -t cifs //server/share /mnt
  
  can not be distinguished from
  mount -t cifs //server/share/subdirectory /mnt
  
  when you run the mount command (ie the cifs prefixpath in this case 
  /subdirectory is not displayed)
 
 Why is cifs not displaying '//server/share/subdirectory' as the source of
 the mount?

Don't know. I have not tried it myself.

RP
 
 Miklos



[RFC2 PATCH 1/1] VFS: Augment /proc/mount with subroot and shared-subtree

2007-07-16 Thread Ram Pai
/proc/mounts in its current state fails to disambiguate bind mounts, especially
when the bind mount is subrooted. Also it does not capture the propagation state of
the mounts (shared-subtree). The following patch addresses the problem.

The following additional fields are added to /proc/mounts.

propagation-type in the form of propagation_flag[:mntid][,...]
note: 'shared' flag is followed by the mntid of its peer mount
  'slave' flag is followed by the mntid of its master mount
  'private' flag stands by itself
  'unbindable' flag stands by itself
   
mntid -- is a unique identifier of the mount
major:minor -- is the major minor number of the device hosting the filesystem
dir -- the subdir in the filesystem which forms the root of this mount
parent -- the id of the parent mount


Here is a sample cat /proc/mounts after executing the following commands:

mount --bind /mnt /mnt
mount --make-shared /mnt
mount --bind /mnt/1 /var
mount --make-slave /var
mount --make-shared /var
mount --bind /var/abc /tmp
mount --make-unbindable /proc

rootfs / rootfs rw 0 0 private 2 0:1 / 2 
/dev/root / ext2 rw  0 0 private 16 98:0 / 2 
/proc /proc proc rw 0 0 unbindable 17 0:3 / 16 
devpts /dev/pts devpts rw 0 0 private 18 0:10 / 16 
/dev/root /mnt ext2 rw  0 0 shared:19 19 98:0 /mnt 16 
/dev/root /var ext2 rw  0 0 shared:21,slave:19 20 98:0 /mnt/1 16 
/dev/root /tmp ext2 rw  0 0 shared:20,slave:19 21 98:0 /mnt/1/abc 16 

For example, the last line indicates that :

1) The mount is a shared mount.
2) Its peer mount is the mount with id 20
3) It is also a slave mount of the master-mount with the id 19
4) The filesystem on device with major/minor number 98:0 and subdirectory 
mnt/1/abc makes the root directory of this mount.
5) And finally the mount with id 16 is its parent.
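
For anyone wanting to consume the propagation field from userspace, here
is a small sketch of a parser for the propagation_flag[:mntid][,...]
syntax described above. struct propagation and parse_propagation() are
hypothetical names chosen for this example; the sketch assumes the field
has already been split out of the line and that ids are decimal, as in
the sample output.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical decoded form of the propagation field. */
struct propagation {
	int shared, slave, is_private, unbindable;
	int peer_id;	/* valid only if shared */
	int master_id;	/* valid only if slave */
};

/* Parse "flag[:mntid][,...]". Returns 0 on success, -1 on an unknown
 * or malformed flag. */
static int parse_propagation(const char *field, struct propagation *p)
{
	memset(p, 0, sizeof(*p));
	while (*field) {
		size_t len = strcspn(field, ",");
		int id;

		if (len == 7 && strncmp(field, "private", 7) == 0) {
			p->is_private = 1;
		} else if (len == 10 && strncmp(field, "unbindable", 10) == 0) {
			p->unbindable = 1;
		} else if (sscanf(field, "shared:%d", &id) == 1) {
			p->shared = 1;
			p->peer_id = id;
		} else if (sscanf(field, "slave:%d", &id) == 1) {
			p->slave = 1;
			p->master_id = id;
		} else {
			return -1;
		}
		/* Step over this flag and the comma that follows it, if any. */
		field += len;
		if (*field == ',')
			field++;
	}
	return 0;
}
```

For the last sample line, passing "shared:20,slave:19" yields a shared
mount with peer 20 that is also a slave of master 19.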


Testing: symlinked /etc/mtab to /proc/mounts and did some mount and df 
commands. They worked normally.



Signed-off-by: Ram Pai [EMAIL PROTECTED]

---
 fs/dcache.c  |   53 +++
 fs/namespace.c   |   35 +++-
 fs/pnode.c   |   22 +
 fs/pnode.h   |2 +
 fs/seq_file.c|   79 ++-
 include/linux/dcache.h   |2 +
 include/linux/mount.h|1 
 include/linux/seq_file.h |1 
 8 files changed, 172 insertions(+), 23 deletions(-)

Index: linux-2.6.21.5/fs/dcache.c
===
--- linux-2.6.21.5.orig/fs/dcache.c
+++ linux-2.6.21.5/fs/dcache.c
@@ -1835,6 +1835,59 @@ char * d_path(struct dentry *dentry, str
 	return res;
 }
 
+static inline int prepend(char **buffer, int *buflen, const char *str,
+			  int namelen)
+{
+	if ((*buflen -= namelen) < 0)
+		return 1;
+	*buffer -= namelen;
+	memcpy(*buffer, str, namelen);
+	return 0;
+}
+
+/*
+ * write full pathname into buffer and return start of pathname.
+ * If @vfsmnt is not specified return the path relative to the
+ * its filesystem's root.
+ */
+char * dentry_path(struct dentry *dentry, char *buf, int buflen)
+{
+	char * end = buf+buflen;
+	char * retval;
+
+	spin_lock(&dcache_lock);
+	prepend(&end, &buflen, "\0", 1);
+	if (!IS_ROOT(dentry) && d_unhashed(dentry)) {
+		if (prepend(&end, &buflen, "//deleted", 10))
+			goto Elong;
+	}
+	/* Get '/' right */
+	retval = end-1;
+	*retval = '/';
+
+	for (;;) {
+		struct dentry * parent;
+		if (IS_ROOT(dentry))
+			break;
+
+		parent = dentry->d_parent;
+		prefetch(parent);
+
+		if (prepend(&end, &buflen, dentry->d_name.name,
+				dentry->d_name.len) ||
+		    prepend(&end, &buflen, "/", 1))
+			goto Elong;
+
+		retval = end;
+		dentry = parent;
+	}
+	spin_unlock(&dcache_lock);
+	return retval;
+Elong:
+	spin_unlock(&dcache_lock);
+	return ERR_PTR(-ENAMETOOLONG);
+}
+
 /*
  * NOTE! The user-level library version returns a
  * character pointer. The kernel system call just
Index: linux-2.6.21.5/fs/namespace.c
===
--- linux-2.6.21.5.orig/fs/namespace.c
+++ linux-2.6.21.5/fs/namespace.c
@@ -33,6 +33,8 @@
 __cacheline_aligned_in_smp DEFINE_SPINLOCK(vfsmount_lock);
 
 static int event;
+static atomic_t mnt_counter;
+
 
 static struct list_head *mount_hashtable __read_mostly;
 static int hash_mask __read_mostly, hash_bits __read_mostly;
@@ -51,6 +53,7 @@ static inline unsigned long hash(struct 
 	return tmp & hash_mask;
 }
 
+
 struct vfsmount *alloc_vfsmnt(const char *name)
 {
struct vfsmount *mnt = kmem_cache_zalloc(mnt_cache, GFP_KERNEL);
@@ -64,6 +67,7 @@ struct vfsmount *alloc_vfsmnt(const char

Re: [RFC PATCH 1/1] VFS: Augment /proc/mount with subroot and shared-subtree

2007-07-11 Thread Ram Pai
On Wed, 2007-07-11 at 11:24 +0100, Christoph Hellwig wrote:
 On Sat, Jun 30, 2007 at 08:56:02AM -0400, H. Peter Anvin wrote:
  Is that conjecture, or do you have evidence to that effect?  Most users 
  of this file are using it via the glibc interfaces, and there probably 
  aren't all that many users of it in the first place.
 
 I have written parsers for personal projects that might not have been
 happy to deal with additional fields myself for example..

I modified the patch to add fields towards the end of each line,
i.e. after the 'freq, passno' fields, and symlinked /etc/mtab
to /proc/mounts.
mount, df and friends were all perfectly happy.

I imagine your script may also be happy with the additional fields
**towards the end**. I would like to avoid one more mount interface if
we can help it.

RP





[RFC PATCH 1/1] VFS: Augment /proc/mount with subroot and shared-subtree

2007-06-25 Thread Ram Pai
Please check if the following modified patch meets the requirements.

It augments /proc/mount with additional information to
(1) disambiguate bind mounts with subroot information.
(2) display shared-subtree information using which one can
determine the propagation trees.


The following additional fields are appended to each record
in /proc/mounts

mntid=id    -  The unique id associated with that mount.
fsid=id:dir -  The filesystem's id and directory in that filesystem that
               makes the root directory of this mount.
parent=id   -  The id of the mount's parent; on which it is mounted.

also flags are augmented with new information to indicate the mount's 
propagation type.

Here is a sample 'cat /proc/mounts' after executing the following
commands:
mount --bind /mnt /mnt
mount --make-shared /mnt
mount --bind /mnt/1 /var
mount --make-slave /var
mount --make-shared /var
mount --make-unbindable /proc

rootfs / rootfs rw PRIVATE mntid=c1708c30 fsid=1:/ parent=c1708c30 0 0
/dev/root / ext2 rw  PRIVATE mntid=c1208c08 fsid=6200:/ parent=c1708c30 0 0
/proc /proc proc rw UNBINDABLE mntid=c1108c90 fsid=3:/ parent=c1208c08 0 0
devpts /dev/pts devpts rw PRIVATE mntid=c1108c18 fsid=a:/ parent=c1208c08 0 0
/dev/root /mnt ext2 rw  SHARED:peer=c1e08cb0 mntid=c1e08cb0 fsid=6200:/mnt parent=c1208c08 0 0
/dev/root /var ext2 rw  SHARED:peer=c1f08c28 SLAVE:master=c1e08cb0 mntid=c1f08c28 fsid=6200:/mnt/1 parent=c1208c08 0 0


For example, the last line indicates that:
The mount is a shared mount.
Its peer mount is itself (note peer=c1f08c28 is the same mntid as itself).
It is also a slave mount of the mount with the id c1e08cb0.
The filesystem with fsid=6200 and subdirectory mnt/1 makes the root directory
of this mount.
And finally the mount with id c1208c08 is its parent.
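
As a rough illustration of how a userspace tool could pick these key=value
tokens out of a record, here is a small sketch. find_hex_field() is a
hypothetical helper written for this example; it assumes the ids are
printed in hex, as in the sample above, and that a key starts either a
whitespace-separated token or follows the ':' of a propagation flag.

```c
#include <stdio.h>
#include <string.h>

/* Look for key (e.g. "mntid=") at the start of a token in line and parse
 * the hex number that follows it into *out.
 * Returns 0 on success, -1 if the key is absent or has no hex value. */
static int find_hex_field(const char *line, const char *key,
			  unsigned long *out)
{
	size_t klen = strlen(key);
	const char *p = line;

	while ((p = strstr(p, key)) != NULL) {
		/* Accept only matches that begin a token or follow "FLAG:". */
		if (p == line || p[-1] == ' ' || p[-1] == ':') {
			if (sscanf(p + klen, "%lx", out) == 1)
				return 0;
			return -1;
		}
		p += klen;
	}
	return -1;
}
```

On the last sample record this returns the same value for "mntid=" and
"peer=", which is how a reader can tell the mount is its own peer.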


Signed-off-by: Ram Pai [EMAIL PROTECTED]

---
 fs/dcache.c  |   53 +++
 fs/namespace.c   |   25 ++
 fs/pnode.c   |   22 +
 fs/pnode.h   |2 +
 fs/seq_file.c|   79 ++-
 include/linux/dcache.h   |2 +
 include/linux/seq_file.h |1 
 7 files changed, 162 insertions(+), 22 deletions(-)

Index: linux-2.6.21.5/fs/dcache.c
===
--- linux-2.6.21.5.orig/fs/dcache.c
+++ linux-2.6.21.5/fs/dcache.c
@@ -1835,6 +1835,59 @@ char * d_path(struct dentry *dentry, str
 	return res;
 }
 
+static inline int prepend(char **buffer, int *buflen, const char *str,
+			  int namelen)
+{
+	if ((*buflen -= namelen) < 0)
+		return 1;
+	*buffer -= namelen;
+	memcpy(*buffer, str, namelen);
+	return 0;
+}
+
+/*
+ * write full pathname into buffer and return start of pathname.
+ * If @vfsmnt is not specified return the path relative to the
+ * its filesystem's root.
+ */
+char * dentry_path(struct dentry *dentry, char *buf, int buflen)
+{
+	char * end = buf+buflen;
+	char * retval;
+
+	spin_lock(&dcache_lock);
+	prepend(&end, &buflen, "\0", 1);
+	if (!IS_ROOT(dentry) && d_unhashed(dentry)) {
+		if (prepend(&end, &buflen, "//deleted", 10))
+			goto Elong;
+	}
+	/* Get '/' right */
+	retval = end-1;
+	*retval = '/';
+
+	for (;;) {
+		struct dentry * parent;
+		if (IS_ROOT(dentry))
+			break;
+
+		parent = dentry->d_parent;
+		prefetch(parent);
+
+		if (prepend(&end, &buflen, dentry->d_name.name,
+				dentry->d_name.len) ||
+		    prepend(&end, &buflen, "/", 1))
+			goto Elong;
+
+		retval = end;
+		dentry = parent;
+	}
+	spin_unlock(&dcache_lock);
+	return retval;
+Elong:
+	spin_unlock(&dcache_lock);
+	return ERR_PTR(-ENAMETOOLONG);
+}
+
 /*
  * NOTE! The user-level library version returns a
  * character pointer. The kernel system call just
Index: linux-2.6.21.5/fs/namespace.c
===
--- linux-2.6.21.5.orig/fs/namespace.c
+++ linux-2.6.21.5/fs/namespace.c
@@ -386,8 +386,31 @@ static int show_vfsmnt(struct seq_file *
 		if (mnt->mnt_flags & fs_infop->flag)
 			seq_puts(m, fs_infop->str);
 	}
-	if (mnt->mnt_sb->s_op->show_options)
+	seq_putc(m, ' ');
+	if (mnt->mnt_sb->s_op->show_options) {
 		err = mnt->mnt_sb->s_op->show_options(m, mnt);
+		seq_putc(m, ' ');
+	}
+	if (IS_MNT_SHARED(mnt)) {
+		seq_printf(m, "%s:peer=%x ", SHARED,
+			new_encode_dev((int)get_peer_same_ns(mnt)));
+		if (IS_MNT_SLAVE(mnt)) {
+			seq_printf(m, "%s:master=%x ", SLAVE

Re: Adding subroot information to /proc/mounts, or obtaining that through other means

2007-06-22 Thread Ram Pai
On Thu, 2007-06-21 at 10:31 -0700, H. Peter Anvin wrote:
 Ram Pai wrote:
  
  Peter, I am not working on it currently. But I am interested in getting
  it done. I have the seed set of patches which had Al Viro's ideas
  incorporated. In fact those patches were sent on lkml 2 months back.
  Shall we start with those patches?
  
 
 Okay, so what I see in your patches are:
 
   path-from-root: mount point of the mount from /
   path-from-root-of-its-sb: path from its own root dentry.
   propagation-flag: SHARED, SLAVE, UNBINDABLE, PRIVATE
   peer-mount-id: the mount-id of its peer mount (if this mount is shared)
   master-mount-id: the mount-id of its master mount (if this mount is
 slave)
 
 Other than cosmetic, I don't see anything terribly wrong with this,
 although getting a flag when the directory is overmounted would be nice.
 
 I guess I suggest a single comma-separated field with flags and optional
 :argument:
 
   private
   shared:peer
   slave:master
   unbindable
   overmounted
 
 So we could end up with something like:
 
 rootfs / rootfs rw 0 0 0:1 / 1 private,overmounted
 
 ... where 1 is the mnt_id (sequence number).
 
 [Please see my other comments in this thread... basically I believe we
 should just add fields to /proc/mounts.]

I had two patches. The first patch added a new interface
called /proc/mounts_new  and had the following format.

FSID  mntpt  root-dentry  fstype fs-options

where FSID is a filesystem unique id
mntpt is the path to the mountpoint
root-dentry is the path to the dentry with respect to the root dentry of
the same filesystem.
fstype  is the filesystem type
fs-options  the mount options used.


the second patch made a /proc/propagation interface which had almost the
same fields, but also added fields to show the propagation type of the
mount as well as pointers to its peers and master depending on the type
of the mount. 

I think the consensus seems to be to have a new interface /proc/make-a-name
which extends the interface provided by /proc/mounts but also provides the
propagation state of the mounts and disambiguates bind mounts.
Which makes sense.

Why not have something like this?

mnt-id FSID backing-dev mntpt root-dentry fstype
comma-separated-fs-options

and one of the fields in the comma-separated-fs-options indicates the
propagation type of the mount.


BTW: what is the need for overmounted flag?  Do you mean two vfsmounts
mounted on the same dentry on the ***same vfsmount*** ?


RP










 
   -hpa



Re: Adding subroot information to /proc/mounts, or obtaining that through other means

2007-06-22 Thread Ram Pai
On Fri, 2007-06-22 at 00:06 -0700, H. Peter Anvin wrote:
 Ram Pai wrote:
  
  the second patch made a /proc/propagation interface which had almost the
  same fields, but also added fields to show the propagation type of the
  mount as well as pointers to its peers and master depending on the type
  of the mount. 
  
  I think the consensus seems to be to have a new interface /proc/make-a-name
  which extends the interface provided by /proc/mounts but also provides the
  propagation state of the mounts and disambiguates bind mounts.
  Which makes sense.
  
 
 Why?  It seems a lot cleaner to have all the information in the same
 place.  It is highly unfriendly to userspace to have to gather
 information in a lot of places, plus it adds race conditions.
 
 It would be another matter if the format that we have now couldn't be
 extended, but we need those fields (well, except the two zeros, but who
 cares) *anyway*, so we might as well stick to the existing file, and
 reduce the total amount of code and clutter.

Ok. So you think /proc/mounts can be extended easily without breaking
any userspace commands?

Well, let's see..
1. To disambiguate bind mounts, we have to add a field that displays the
   path to the mount's root dentry from the filesystem's root
   dentry. Agree?

2. For filesystems that do not have a backing store, it becomes hard
   to disambiguate bind mounts in (1). So we need to add a
   filesystem-id field.

3. If we need to add the propagation status of the mount, we need a
   propagation flag added to the output.

4. To be able to construct the propagation tree, we need a way to refer
   to the other mounts, since some mounts are peers and some other
   mounts are masters. Which means we need a mount-id field.
   Agree?

If you agree to the above 4 new fields, it becomes challenging to
extend /proc/mounts to incorporate these new fields without
breaking any existing applications.
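
To make the compatibility concern concrete, here is a minimal sketch in
Python (not from the thread) of a strict parser that expects exactly six
fields per /proc/mounts line. The "extended" line layout and its values
are invented for illustration, built only from the four fields listed
above; real tools differ in how tolerant they are of extra fields.

```python
# Hypothetical illustration: a strict six-field /proc/mounts parser and
# an invented "extended" line carrying the four extra fields.

LEGACY_FIELDS = ("device", "mntpt", "fstype", "options", "freq", "passno")

def parse_legacy(line):
    """Parse one classic /proc/mounts line; reject anything else."""
    parts = line.split()
    if len(parts) != len(LEGACY_FIELDS):
        raise ValueError("expected 6 fields, got %d" % len(parts))
    return dict(zip(LEGACY_FIELDS, parts))

# Classic line, as quoted elsewhere in this thread.
old_line = "/dev/md6 /export ext3 rw,data=ordered 0 0"

# Invented layout: mnt-id, fsid, the classic fields, then the root path
# and a propagation flag. None of this is a real kernel format.
new_line = ("17 0x6200 /dev/md6 /home/foo ext3 rw,data=ordered 0 0 "
            "/users/foo shared")

legacy_ok = parse_legacy(old_line)

try:
    parse_legacy(new_line)
    extended_ok = True
except ValueError:
    extended_ok = False
```

A parser written this way accepts the classic line and rejects the
extended one, which is the kind of breakage a separate file would avoid.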


  
  BTW: what is the need for overmounted flag?  Do you mean two vfsmounts
  mounted on the same dentry on the ***same vfsmount*** ?
  
 
 Maybe I'm not following the uses of your flags well enough to figure out
  if that information can already been deduced.

With the addition of the above 4 fields, I think one should
easily be able to decipher which mnt-id is mounted on which mnt-id, no?
Maybe not. Well, we will have to extend the mountpoint field to indicate
the mnt-id in which the mountpoint resides.

RP

 
   -hpa



Re: Adding subroot information to /proc/mounts, or obtaining that through other means

2007-06-21 Thread Ram Pai
On Wed, 2007-06-20 at 14:20 -0700, H. Peter Anvin wrote:
 Al Viro wrote:
  On Wed, Jun 20, 2007 at 01:57:33PM -0700, H. Peter Anvin wrote:
  ... or, alternatively, add a subfield to the first field (which would
  entail escaping whatever separator we choose):
 
  /dev/md6 /export ext3 rw,data=ordered 0 0
  /dev/md6:/users/foo /home/foo ext3 rw,data=ordered 0 0
  /dev/md6:/users/bar /home/bar ext3 rw,data=ordered 0 0
  
  Hell, no.  The first field is in principle impossible to parse unless
  you know the fs type.
  
  How about making a new file with sane format?  From the very
  beginning.  E.g. mountpoint + ID + relative path + type + options,
  where ID uniquely identifies superblock (e.g. numeric st_dev)
  and backing device (if any) is sitting among the options...
 
 Okay, I see there has been some discussion on this earlier, based on a
 proposal by Ram Pai, so it pretty much comes down to redesigning this
 right.  I see some issues with his proposal (device numbers exported to
 userspace in text form should be separated into major:minor form, for
 one thing.)  I know the util-linux-ng people have also had issues with
 /proc/mounts that they would like resolved in order to finally nuke
 /etc/mtab.
 
 Is Ram still working on this?  I'd like to help make this happen so we
 can be done with it.

Peter, I am not working on it currently. But I am interested in getting
it done. I have the seed set of patches which had Al Viro's ideas
incorporated. In fact those patches were sent to lkml 2 months back.
Shall we start with those patches?

RP


 
   -hpa
 
 



Re: Adding subroot information to /proc/mounts, or obtaining that through other means

2007-06-21 Thread Ram Pai
On Thu, 2007-06-21 at 09:29 -0700, H. Peter Anvin wrote:
 Ram Pai wrote:
  
  Peter, I am not working on it currently. But i am interested in getting
  it done. I have the seed set of patches which had Al Viro's ideas
  incorporated. Infact those patches were sent on lkml 2 months back.
  Shall we start with those patches?
  
 
 Are these the unprivileged mount syscall patches?

No, but those patches were sent in the same thread. Karel had provided
suggestions which I am yet to incorporate.

Give me today. I will send out the patches incorporating the comments
later in the evening.

ok?
RP

 
 Otherwise I don't see any patches in my personal LKML cache (apparently
 my subscription to fsdevel was dropped at some point, so I don't have a
 stash of it.)


 
   -hpa



Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag

2007-04-18 Thread Ram Pai
On Wed, 2007-04-18 at 11:19 +0200, Miklos Szeredi wrote:
   Allowing this and other flags to NOT be propagated just makes it
   possible to have a set of shared mounts with asymmetric properties,
   which may actually be desirable.
  
  The shared mount feature was designed to ensure that the mount remained
  identical at all the locations.
 
 OK, so remount not propagating mount flags is a bug then?

As I said earlier, are there any flags currently that, if not propagated,
can lead to conflicts with the shared subtree semantics? I am not aware
of any.  If you did notice such a case, then as far as I am concerned
it's a bug.

But the newly proposed 'allow unprivileged mounts' flag, if not
propagated among the peers (and slaves) of a shared mount, can lead to
conflicts with shared subtree semantics: a mount made in one shared
mount could fail to mount when propagated to its peer, and hence lead
to non-identical peers.



 
  Now designing features to make it un-identical but still naming it
  shared, will break its original purpose.  Slave mounts were designed
  to make it asymmetric.
 
 What if I want to modify flags in a master mount, but not the slave
 mount?  Would I be screwed?  For example: mount is read-only in both
 master and slave.  I want to mark it read-write in master but not in
 slave.  What do I do?

Making mounts read-only or read-write -- will that affect mount
propagation in such a way that future mounts in any one of the
peers will not be able to propagate to its peers or slaves?

I don't think it will. Hence it's ok to selectively mark some mounts
read-only and some mounts read-write.

However, with the introduction of unprivileged mount semantics, there
can be cases where a user has privileges to mount at one location but
not at a different location. If these two locations happen to share
a peer relationship, then I see a case of the read-write flag
semantics interfering with shared subtree semantics. Hence we will end
up either propagating the read-write flag too, or having to craft
different semantics that stay consistent.



 
  Whatever feature that is desired to be exploited; can that be exploited
  with the current set of semantics that we have? Is there a real need to
  make the mounts asymmetric but at the same time name them as shared?
  Maybe I dont understand what the desired application is? 
 
 I do think this question of propagating mount flags is totally
 independent of user mounts.
 
 As it stands, currently remount doesn't propagate mount flags, and I
 don't see any compelling reasons why it should.
 
 The patchset introduces a new mount flag allowusermnt, but I don't
 see any compelling reason to propagate this flag _either_.
 
 Please say so if you do have such a reason.  As I've explained, having
 this flag set differently in parts of a propagation tree does not
 interfere with or break propagation in any way.

As I said earlier, I see a case where two mounts that are peers of each
other can become non-identical if we don't propagate allowusermnt.

As a practical example:

/tmp and /mnt are peers of each other.
/tmp has its allowusermnt flag set, which has not been propagated
to /mnt.

Now a normal user mounts an ext2 filesystem under /tmp at /tmp/1.

Unfortunately the mount won't appear under /mnt/1,

and this breaks the shared-subtree semantics, which promise: whatever is
mounted under /tmp will also be visible under /mnt.


And if you do allow the mount to appear under /mnt/1, you will
break the unprivileged-mount semantics, which promise: a normal user will
not be able to mount at a location that does not allow user mounts.



RP


 
 Miklos
 



Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag

2007-04-18 Thread Ram Pai
On Wed, 2007-04-18 at 21:14 +0200, Miklos Szeredi wrote:
  As I said earlier, I see a case where two mounts that are peers of each
  other can become un-identical if we dont propagate the allowusermnt.
  
  As a practical example.
  
  /tmp and /mnt are peers of each other.
  /tmp has its allowusermnt flag set, which has not been propagated
  to /mnt.
  
  now a normal-user mounts an ext2 file system under /tmp at /tmp/1
  
  unfortunately the mount wont appear under /mnt/1 
 
 Argh, that is not true.  That's what I've been trying to explain to
 you all along.

I now realize you did, but I failed to catch it. sorry :-(

 
 The propagation will be done _regardless_ of the flag.  The flag is
 only checked for the parent of the _requested_ mount.  If it is
 allowed there, the mount, including any propagations are allowed.  If
 it's denied, then obviously it's denied everywhere.
 
  and in case if you allow the mount to appear under /mnt/1, you will
  break unpriviledge mounts semantics which promises: a normal user will
  not be able to mount at a location that does not allow user-mounts.
 
 No, it does not promise that.  The flag just promises, that the user
 cannot _request_ a mount on the parent mount.

ok. if the ability for a normal user to mount something *indirectly*
under a mount that has its 'allowusermnt flag' unset, 
is acceptable under the definition of 'allowusermnt', i guess my only
choice is to accept it. :-)
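
The rule Miklos describes can be sketched as a toy model in Python. This
is not kernel code: the Mount class and helper names are invented here,
and the only behaviour taken from the thread is that the allowusermnt
flag is checked on the parent of the requested mount and is not checked
again during propagation.

```python
# Toy model (not kernel code) of the rule described above: the
# allowusermnt flag is checked only on the mount the user asked for,
# and propagation to peers then happens regardless of their flags.

class Mount:
    def __init__(self, path, allowusermnt=False):
        self.path = path
        self.allowusermnt = allowusermnt
        self.peers = [self]      # peer group, including this mount
        self.children = []       # paths of mounts attached below

def make_peers(*mounts):
    """Put all given mounts into one peer group."""
    group = list(mounts)
    for m in mounts:
        m.peers = group

def user_mount(parent, name):
    """Unprivileged request to mount `name` under `parent`."""
    if not parent.allowusermnt:
        raise PermissionError("user mounts not allowed on " + parent.path)
    # The flag is not consulted again while propagating.
    for peer in parent.peers:
        peer.children.append(peer.path + "/" + name)

tmp = Mount("/tmp", allowusermnt=True)
mnt = Mount("/mnt", allowusermnt=False)
make_peers(tmp, mnt)

user_mount(tmp, "1")             # allowed: requested on /tmp

try:
    user_mount(mnt, "2")         # denied: requested on /mnt
    denied = False
except PermissionError:
    denied = True
```

In this model the request on /tmp shows up under both /tmp/1 and /mnt/1,
while the same user cannot request a mount directly on /mnt, which is
the indirect case conceded above.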

RP

 
 Miklos



Re: How to query mount propagation state?

2007-04-17 Thread Ram Pai
On Mon, 2007-04-16 at 14:16 -0500, Serge E. Hallyn wrote:
  This patch introduces a new proc interface that exposes all the
  propagation trees within the namespace.
  
  It walks through each of the mounts in the namespace, and prints
  the following information.
  
  mount-id: a unique mount identifier
  dev-id: the unique id of the device containing the filesystem
  path-from-root: mount point of the mount from /
  path-from-root-of-its-sb: path from its own root dentry.
  propagation-flag: SHARED, SLAVE, UNBINDABLE, PRIVATE
  peer-mount-id: the mount-id of its peer mount (if this mount is
                 shared)
  master-mount-id: the mount-id of its master mount (if this mount is
                   slave)
  
  Using the above information one could easily write a script that can
  draw all the propagation trees in the namespace.
  
  
  Example:
  Here is a sample output of cat /proc/$$/mounts_propagation
  
  0xa917800 0x1 / / PRIVATE
  0xa917200 0x6200 / / PRIVATE
  0xa917180 0x3 /proc / PRIVATE
  0xa917f80 0xa /dev/pts / PRIVATE
  0xa917100 0x6210 /mnt / SHARED peer:0xa917100
  0xa917f00 0x6210 /tmp /1 SLAVE master:0xa917100
  0xa917900 0x6220 /mnt/2 / SHARED peer:0xa917900
  
  line 5 indicates that the mount with id 0xa917100 is mounted at /mnt,
  is shared, and is the only mount in its peer group.
  
  line 6 indicates that the mount with id 0xa917f00 is mounted at /tmp;
  its root is the dentry 1 present under its root directory. This mount
  is a slave mount and its master is the mount with id 0xa917100.
  
  line 7 indicates that the mount with id 0xa917900 is mounted at
  /mnt/2; its root is the dentry / of its filesystem. This mount is
  shared and it is the only mount in its peer group.
  
  one could write a script which runs through these lines and draws 4
  individual satellite mounts and two propagation trees. The first
  propagation tree has a shared mount and a slave mount, and the second
  propagation tree has just one shared mount.
  

  Signed-off-by: Ram Pai [EMAIL PROTECTED]
  ---
   fs/namespace.c |   42 ++
   fs/pnode.c |6 --
   fs/pnode.h |6 ++
   fs/proc/base.c |   22 +-
   4 files changed, 69 insertions(+), 7 deletions(-)
  
  Index: linux-2.6.17.10/fs/namespace.c
  ===
  --- linux-2.6.17.10.orig/fs/namespace.c
  +++ linux-2.6.17.10/fs/namespace.c
  @@ -410,6 +410,41 @@ static int show_vfsmnt_new(struct seq_fi
return show_options(m, v);
   }
  
  +static int show_vfsmnt_propagation(struct seq_file *m, void *v)
  +{
  +	struct vfsmount *mnt = v;
  +	seq_printf(m, "0x%x", (int)mnt);
  +	seq_putc(m, ' ');
  +	seq_printf(m, "0x%x", new_encode_dev(mnt->mnt_sb->s_dev));
  +	seq_putc(m, ' ');
  +	seq_path(m, mnt, mnt->mnt_root, " \t\n\\");
  +	seq_putc(m, ' ');
  +	seq_dentry(m, mnt->mnt_root, " \t\n\\");
  +	seq_putc(m, ' ');
  +
  +	if (IS_MNT_SHARED(mnt)) {
  +		seq_printf(m, "%s ", "SHARED");
  +		if (IS_MNT_SLAVE(mnt)) {
  +			seq_printf(m, "%s ", "SLAVE");
  +		}
  +	} else if (IS_MNT_SLAVE(mnt)) {
  +		seq_printf(m, "%s ", "SLAVE");
  +	} else if (IS_MNT_UNBINDABLE(mnt)) {
  +		seq_printf(m, "%s ", "UNBINDABLE");
  +	} else {
  +		seq_printf(m, "%s ", "PRIVATE");
  +	}
  +
  +	if (IS_MNT_SHARED(mnt)) {
  +		seq_printf(m, "peer:0x%x ", (int)next_peer(mnt));
 
 Ok, so if the sequence of events was
 
 mount --make-shared /mnt
 (some user logs in and gets a cloned namespace, so his /mnt
 becomes the next peer of /mnt)
 mount --bind /mnt /tmp
 (some other user logs in and gets cloned namespace...)
 
 or some such sequence of events, we could lose all information
 about /mnt and /tmp being peers, right?  Should a new
 next_peer_in_same_namespace(mnt) be used rather than next_peer()?

you are right. it should print next_peer(mnt) only if CAP_SYS_ADMIN,
else print next_peer_in_same_namespace(mnt).
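
A rough sketch of what such a walk could look like, as a toy Python
model of a peer ring rather than kernel code. The namespace labels, the
Mount class and the insertion order of the ring are assumptions made for
this example; only the idea of skipping peers that live in another
namespace comes from the exchange above.

```python
# Toy model of a peer group as a ring, in the spirit of next_peer().
# The namespace labels and ring order below are illustrative only.

class Mount:
    def __init__(self, name, ns):
        self.name = name
        self.ns = ns
        self.next = self         # a lone mount is its own next peer

def add_peer(existing, new):
    """Insert `new` right after `existing` in the peer ring."""
    new.next = existing.next
    existing.next = new

def next_peer_in_same_namespace(mnt):
    """Walk the ring and return the first peer sharing mnt's namespace.

    Returns mnt itself when no other peer lives in that namespace.
    """
    p = mnt.next
    while p is not mnt and p.ns != mnt.ns:
        p = p.next
    return p

# The sequence of events from the question above.
mnt = Mount("/mnt", ns="admin")          # mount --make-shared /mnt
c1 = Mount("/mnt", ns="user1")           # first cloned namespace
tmp = Mount("/tmp", ns="admin")          # mount --bind /mnt /tmp
c2m = Mount("/mnt", ns="user2")          # second clone copies /mnt
c2t = Mount("/tmp", ns="user2")          # ... and /tmp
add_peer(mnt, c1)
add_peer(mnt, tmp)
add_peer(mnt, c2m)
add_peer(tmp, c2t)
# Ring is now: mnt -> c2m -> tmp -> c2t -> c1 -> mnt
```

Here the plain next peer of /mnt belongs to another namespace, while the
filtered walk still reports that /mnt and /tmp are peers of each other.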

 
 Somewhat similarly,
 
  +	}
  +	if (IS_MNT_SLAVE(mnt)) {
  +		seq_printf(m, "master:0x%x ", (int)mnt->mnt_master);
 
 Should we for privacy reasons not print out the address if
 mnt->mnt_master is in a different namespace (perhaps if
 !CAP_SYS_ADMIN)?

right. it should print mnt->mnt_master if (CAP_SYS_ADMIN), otherwise
print master_in_same_namespace(mnt).
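
The script mentioned in the patch description could look roughly like
the following Python sketch. It assumes the field layout shown in the
sample output quoted above (mount-id, dev-id, mount point, root,
propagation flags, then optional peer:/master: entries); the function
names are made up for this example, and an id that is not listed
(for instance one hidden for privacy) is simply ignored.

```python
# Sketch: group the sample mounts_propagation lines into propagation
# trees. Field layout follows the patch description; the helper names
# are made up for this example.

SAMPLE = """\
0xa917800 0x1 / / PRIVATE
0xa917200 0x6200 / / PRIVATE
0xa917180 0x3 /proc / PRIVATE
0xa917f80 0xa /dev/pts / PRIVATE
0xa917100 0x6210 /mnt / SHARED peer:0xa917100
0xa917f00 0x6210 /tmp /1 SLAVE master:0xa917100
0xa917900 0x6220 /mnt/2 / SHARED peer:0xa917900
"""

def parse(text):
    mounts = {}
    for line in text.splitlines():
        mnt_id, dev, mntpt, root, *rest = line.split()
        entry = {"dev": dev, "mntpt": mntpt, "root": root,
                 "flags": [], "peer": None, "master": None}
        for tok in rest:
            key, sep, val = tok.partition(":")
            if sep:                       # peer:<id> or master:<id>
                entry[key] = val
            else:                         # SHARED, SLAVE, ...
                entry["flags"].append(tok)
        mounts[mnt_id] = entry
    return mounts

def propagation_trees(mounts):
    """Union mounts linked by peer or master ids; skip unlinked ones."""
    parent = {m: m for m in mounts}

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    for mnt_id, e in mounts.items():
        for other in (e["peer"], e["master"]):
            if other in parent:           # ignore ids we cannot see
                parent[find(mnt_id)] = find(other)

    trees = {}
    for mnt_id, e in mounts.items():
        if e["peer"] or e["master"]:
            trees.setdefault(find(mnt_id), []).append(e["mntpt"])
    return sorted(sorted(t) for t in trees.values())

mounts = parse(SAMPLE)
satellites = [e["mntpt"] for e in mounts.values()
              if not e["peer"] and not e["master"]]
trees = propagation_trees(mounts)
```

On the sample this yields four satellite mounts and two trees, one
holding /mnt with its slave /tmp and one holding only /mnt/2.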

RP

 
 Otherwise I like this.
 
 thanks,
 -serge 



Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag

2007-04-17 Thread Ram Pai
On Tue, 2007-04-17 at 19:44 +0200, Miklos Szeredi wrote:
  I'm a bit lost about what is currently done and who advocates for what.
  
  It seems to me the MNT_ALLOWUSERMNT (or whatever :) flag should be
  propagated.  In the /share rbind+chroot example, I assume the admin
  would start by doing
  
  mount --bind /share /share
  mount --make-slave /share
  mount --bind -o allow_user_mounts /share (or whatever)
  mount --make-shared /share
  
  then on login, pam does
  
  chroot /share/$USER
  
  or some sort of
  
  mount --bind /share /home/$USER/root
  chroot /home/$USER/root
  
  or whatever.  In any case, the user cannot make user mounts except under
  /share, and any cloned namespaces will still allow user mounts.
 
 I don't quite understand your method.  This is how I think of it:
 
 mount --make-rshared /
 mkdir -p /mnt/ns/$USER
 mount --rbind / /mnt/ns/$USER
 mount --make-rslave /mnt/ns/$USER
 mount --set-flags --recursive -oallowusermnt /mnt/ns/$USER
 chroot /mnt/ns/$USER
 su - $USER
 
 I did actually try something equivalent (without the fancy mount
 commands though), and it worked fine.  The only problem is the
 proliferation of mounts in /proc/mounts.  There was a recently posted
 patch in AppArmor, that at least hides unreachable mounts from
 /proc/mounts, so the user wouldn't see all those.  But it could still
 be pretty confusing to the sysadmin.

unbindable mounts were designed to overcome the proliferation problem.

Your steps should be something like this:

mount --make-rshared /
mkdir -p /mnt/ns
mount --bind /mnt/ns /mnt/ns
mount --make-unbindable /mnt/ns
mkdir -p /mnt/ns/$USER
mount --rbind / /mnt/ns/$USER
mount --make-rslave /mnt/ns/$USER
mount --set-flags --recursive -oallowusermnt /mnt/ns/$USER
chroot /mnt/ns/$USER
su - $USER

try this and your proliferation problem will disappear. :-)
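
Why the unbindable step helps can be seen in a toy count, sketched here
in Python. The model is an assumption-laden simplification: it treats
"mount --rbind /" as copying every existing mount except those at or
below an unbindable mount, and it ignores propagation entirely. The
starting mount list and user names are invented.

```python
# Toy model (ignores propagation and real kernel details) of why an
# unbindable /mnt/ns stops "mount --rbind /" from copying earlier
# per-user copies again.

def is_under(path, top):
    """True if path is top itself or lies below it."""
    return path == top or path.startswith(top.rstrip("/") + "/")

def rbind_root(mounts, unbindable, target):
    """Return the mount list after 'mount --rbind / target'.

    mounts: list of mount point paths. Mounts at or below an
    unbindable mount are skipped.
    """
    copies = []
    for m in mounts:                       # snapshot before attaching
        if any(is_under(m, u) for u in unbindable):
            continue
        copies.append(target if m == "/" else target + m)
    return mounts + copies

def simulate(users, use_unbindable):
    mounts = ["/", "/proc", "/dev/pts"]
    unbindable = []
    if use_unbindable:
        mounts.append("/mnt/ns")           # mount --bind /mnt/ns /mnt/ns
        unbindable.append("/mnt/ns")       # mount --make-unbindable /mnt/ns
    for user in users:
        mounts = rbind_root(mounts, unbindable, "/mnt/ns/" + user)
    return len(mounts)

users = ["alice", "bob", "carol"]
without = simulate(users, use_unbindable=False)
with_unbindable = simulate(users, use_unbindable=True)
```

In this model the mount count doubles with every user when /mnt/ns is an
ordinary directory, and grows by a fixed amount per user once /mnt/ns is
unbindable.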

 
 So in that sense doing it the complicated way, by first cloning the
 namespace, and then copying and sharing mounts individually which need
 to be shared could relieve this somewhat.

the unbindable mount will just provide you permanent relief.

 
 Another point: user mounts under /proc and /sys shouldn't be allowed.
 There are files there (at least in /proc) that are seemingly writable
 by the user, but they are still not writable in the sense, that
 normal files are.
 
 Anyway, there are lots of userspace policy issues, but those don't
 impact the kernel part.
 
 As for the original question of propagating the allowusermnt flag, I
 think it doesn't matter, as long as it's consistent and documented.
 
 Propagating some mount flags and not propagating others is
 inconsistent and confusing, so I wouldn't want that.  Currently
 remount doesn't propagate mount flags, that may be a bug, 

For consistency's sake, one could propagate all the flags. But
propagating only those flags that interfere with shared-subtree
semantics should suffice.

Wait... Dave's read-only bind mounts in fact need the ability to
selectively make some mounts read-only. In such cases propagating
the read-only flag will just step on Dave's feature. Won't it?

RP



 
 Miklos



Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag

2007-04-17 Thread Ram Pai
On Tue, 2007-04-17 at 21:43 +0200, Miklos Szeredi wrote:
I'm a bit lost about what is currently done and who advocates for what.

It seems to me the MNT_ALLOWUSERMNT (or whatever :) flag should be
propagated.  In the /share rbind+chroot example, I assume the admin
would start by doing

mount --bind /share /share
mount --make-slave /share
mount --bind -o allow_user_mounts /share (or whatever)
mount --make-shared /share

then on login, pam does

chroot /share/$USER

or some sort of

mount --bind /share /home/$USER/root
chroot /home/$USER/root

or whatever.  In any case, the user cannot make user mounts except under
/share, and any cloned namespaces will still allow user mounts.
   
   I don't quite understand your method.  This is how I think of it:
   
   mount --make-rshared /
   mkdir -p /mnt/ns/$USER
   mount --rbind / /mnt/ns/$USER
   mount --make-rslave /mnt/ns/$USER
   mount --set-flags --recursive -oallowusermnt /mnt/ns/$USER
   chroot /mnt/ns/$USER
   su - $USER
   
   I did actually try something equivalent (without the fancy mount
   commands though), and it worked fine.  The only problem is the
   proliferation of mounts in /proc/mounts.  There was a recently posted
   patch in AppArmor, that at least hides unreachable mounts from
   /proc/mounts, so the user wouldn't see all those.  But it could still
   be pretty confusing to the sysadmin.
  
  unbindable mounts were designed to overcome the proliferation problem.
  
  Your steps should be something like this:
  
  mount --make-rshared /
  mkdir -p /mnt/ns
  mount --bind /mnt/ns /mnt/ns
  mount --make-unbindable /mnt/ns
  mkdir -p /mnt/ns/$USER
  mount --rbind / /mnt/ns/$USER
  mount --make-rslave /mnt/ns/$USER
  mount --set-flags --recursive -oallowusermnt /mnt/ns/$USER
  chroot /mnt/ns/$USER
  su - $USER
  
  try this and your proliferation problem will disappear. :-)
 
 Right, this is needed.
 
 My problem wasn't actually this (which would only have hit, if I tried
 with more than one user), just that the number of mounts in
 /proc/mounts grows linearly with the number of users.
 
 That can't be helped in such an easy way unfortunately.
 
   Propagating some mount flags and not propagating others is
   inconsistent and confusing, so I wouldn't want that.  Currently
   remount doesn't propagate mount flags, that may be a bug, 
  
  For consistency reason, one can propagate all the flags. But
  propagating only those flags that interfere with shared-subtree
  semantics should suffice.
 
 I still don't believe not propagating allowusermnt interferes with
 mount propagation.  In my posted patches the mount (including
 propagations) is allowed based on the allowusermnt flag on the
 parent of the requested mount.  The flag is _not_ checked during
 propagation.
 
 Allowing this and other flags to NOT be propagated just makes it
 possible to have a set of shared mounts with asymmetric properties,
 which may actually be desirable.

The shared mount feature was designed to ensure that the mount remains
identical at all its locations. Designing features that make it
non-identical while still naming it shared will break its
original purpose.  Slave mounts were designed to make it asymmetric.

Whatever feature is desired here, can it be achieved
with the current set of semantics that we have? Is there a real need to
make the mounts asymmetric but at the same time name them as shared?
Maybe I don't understand what the desired application is?

RP

 
 Miklos



Re: [patch 0/8] unprivileged mount syscall

2007-04-16 Thread Ram Pai
On Fri, 2007-04-13 at 13:58 +0200, Miklos Szeredi wrote:
  On Wed, 2007-04-11 at 12:44 +0200, Miklos Szeredi wrote:
1. clone the master namespace.

2. in the new namespace

move the tree under /share/$me to /
for each ($user, $what, $how) {
move /share/$user/$what to /$what
if ($how == slave) {
 make the mount tree under /$what as slave
}
}

3. in the new namespace make the tree under 
   /share as private and unmount /share
   
   Thanks.  I get the basic idea now: the namespace itself need not be
   shared between the sessions, it is enough if share propagation is
   set up between the different namespaces of a user.
   
   I don't yet see either in your or Viro's description how the trees
   under /share/$USER are initialized.  I guess they are recursively
   bound from /, and are made slaves.
  
  yes. I suppose, when a userid is created one of the steps would be
  
  mount --rbind / /share/$USER
  mount --make-rslave /share/$USER
  mount --make-rshared /share/$USER
 
 Thinking a bit more about this, I'm quite sure most users wouldn't
 even want private namespaces.  It would be enough to
 
   chroot /share/$USER
 
 and be done with it.
 
 Private namespaces are only good for keeping a bunch of mounts
 referenced by a group of processes.  But my guess is, that the natural
 behavior for users is to see a persistent set of mounts.
 
 If for example they mount something on a remote machine, then log out
 from the ssh session and later log back in, they would want to see
 their previous mount still there.

They will continue to see their previous mount tree.
Even if all the namespaces belonging to the different sessions of the
user get dismantled when all the sessions exit, a mirror of those
mount trees continues to exist under /share/$USER in the original
namespace.  So I don't think we have an issue.

NOTE: when I say 'original namespace' I mean the admin namespace; the
first namespace that gets created when the machine boots.

RP


 
 Miklos



Re: [patch 0/8] unprivileged mount syscall

2007-04-16 Thread Ram Pai
On Fri, 2007-04-13 at 16:05 +0200, Miklos Szeredi wrote:
   Thinking a bit more about this, I'm quite sure most users wouldn't
   even want private namespaces.  It would be enough to
   
 chroot /share/$USER
   
   and be done with it.
   
   Private namespaces are only good for keeping a bunch of mounts
   referenced by a group of processes.  But my guess is, that the natural
   behavior for users is to see a persistent set of mounts.
   
   If for example they mount something on a remote machine, then log out
   from the ssh session and later log back in, they would want to see
   their previous mount still there.
   
   Miklos
  
  Agreed on desired behavior, but not on chroot sufficing.  It actually
  sounds like you want exactly what was outlined in the OLS paper.
  
  Users still need to be in a different mounts namespace from the admin
  user so long as we consider the deluser and backup problems
 
 I don't think it matters, because /share/$USER duplicates a part or
 the whole of the user's namespace.
 
 So backup would have to be taught about /share anyway, and deluser
 operates on /home/$USER and not on /share/*, so there shouldn't be any
 problem.
 
 There's actually very little difference between rbind+chroot, and
 CLONE_NEWNS.  In a private namespace:
 
   1) when no more processes reference the namespace, the tree will be
 disbanded
 
   2) the mount tree won't be accessible from outside the namespace
 
 Wanting a persistent namespace contradicts 1).
 
 Wanting a per-user (as opposed to per-session) namespace contradicts
 2).  The namespace _has_ to be accessible from outside, so that a new
 session can access/copy it.

As I mentioned in the previous mail, disbanding all the namespaces of a
user will not disband his mount tree, because a mirror of the mount tree
still continues to exist in /share/$USER in the admin namespace.

And a new user session can always use this copy to create a namespace
that looks identical to the one that existed earlier.


 
 So both requirements point to the rbind/chroot solution.

Aren't there ways to escape chroot jails? Serge had pointed me to a URL
which showed chroots can be escaped. And if that is true, then having all
users' private mount trees in the same namespace can be a security issue?

RP

 
 Miklos



Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag

2007-04-16 Thread Ram Pai

 
 Serge E. Hallyn [EMAIL PROTECTED] writes:
 
  Quoting Miklos Szeredi ([EMAIL PROTECTED]):
  From: Miklos Szeredi [EMAIL PROTECTED]
  
  If CLONE_NEWNS and CLONE_NEWNS_USERMNT are given to clone(2) or
  unshare(2), then allow user mounts within the new namespace.
  
  This is not flexible enough, because user mounts can't be enabled
 for
  the initial namespace.
  
  The remaining clone bits also getting dangerously few...
  
  Alternatives are:
  
- prctl() flag
- setting through the containers filesystem
 
  Sorry, I know I had mentioned it, but this is definately my least
  favorite approach.
 
  Curious whether are any other suggestions/opinions from the
 containers
  list?
 
 Given the existence of shared subtrees allowing/denying this at the
 mount
 namespace level is silly and wrong.
 
 If we need more than just the filesystem permission checks can we
 make it a mount flag settable with mount and remount that allows
 non-privileged users the ability to create mount points under it
 in directories they have full read/write access to.

Also, for bind-mount and remount operations the flag has to be propagated
down its propagation tree.  Otherwise an unprivileged mount in a shared
mount won't get reflected in its peers and slaves, leading to non-identical
shared subtrees.

RP


 
 I don't like the use of clone flags for this purpose but in this
 case the shared subtress are a much more fundamental reasons for not
 doing this at the namespace level.
 
 Eric
 ___
 Containers mailing list
 [EMAIL PROTECTED]
 https://lists.linux-foundation.org/mailman/listinfo/containers 



Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag

2007-04-16 Thread Ram Pai
On Mon, 2007-04-16 at 11:32 +0200, Miklos Szeredi wrote:
   Given the existence of shared subtrees allowing/denying this at the
   mount
   namespace level is silly and wrong.
   
   If we need more than just the filesystem permission checks can we
   make it a mount flag settable with mount and remount that allows
   non-privileged users the ability to create mount points under it
   in directories they have full read/write access to.
  
  Also for bind-mount and remount operations the flag has to be propagated
  down its propagation tree.  Otherwise a unpriviledged mount in a shared
  mount wont get reflected in its peers and slaves, leading to unidentical
  shared-subtrees.
 
 That's an interesting question.  Do we want shared mounts to be
 totally identical, including mnt_flags?  It doesn't look as if
 do_remount() guarantees that currently.

Depends on the semantics of each of the flags. Some flags, like the
read/write flag, would not interfere with the propagation semantics
AFAICT.  But this one certainly seems to interfere.

RP

 Miklos



Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag

2007-04-16 Thread Ram Pai
On Mon, 2007-04-16 at 11:56 +0200, Miklos Szeredi wrote:
Also for bind-mount and remount operations the flag has to be propagated
down its propagation tree.  Otherwise a unpriviledged mount in a shared
mount wont get reflected in its peers and slaves, leading to unidentical
shared-subtrees.
   
   That's an interesting question.  Do we want shared mounts to be
   totally identical, including mnt_flags?  It doesn't look as if
   do_remount() guarantees that currently.
  
  Depends on the semantics of each of the flags. Some flags like of the
  read/write flag, would not interfere with the propagation semantics
  AFAICT.  But this one certainly seems to interfere.
 
 That depends.  Current patches check the unprivileged submounts
 allowed under this mount flag only on the requested mount and not on
 the propagated mounts.  Do you see a problem with this?

I don't see a problem if the flag is propagated to all peer and slave
mounts.

If not, I see a problem. What if the propagated mount has its flag set
to disallow unprivileged mounts, whereas the requested mount has it
allowed?

RP



 
 Miklos



Re: How to query mount propagation state?

2007-04-16 Thread Ram Pai
On Mon, 2007-04-16 at 12:34 +0200, Miklos Szeredi wrote:
 Currently one of the difficulties with mount propagations is that
 there's no way to know the current state of the propagation tree.
 
 Has anyone thought about how this info could be queried from
 userspace?

I am attaching two patches that I had done way back in Oct 2006 
with Al Viro. I had sent these patches to Al Viro. But I forgot to
follow them up, I guess so did Al Viro.

The first patch disambiguates multiple mount-instances of the same
filesystem (or part of the same filesystem), by introducing a new
interface /proc/mounts_new. 

The second patch introduces a new proc interface that exposes all the
propagation trees within a namespace.  It does not show propagated
mounts residing in a different namespace (for privacy reasons). Maybe
one could modify the patch a little to allow it if the user has
root privileges.

RP

PS: Sorry these are attachments instead of inline patches. I am scared
of inlining in evolution. If needed I can send inline patches through
mutt.

 
 Thanks,
 Miklos
This patch disambiguates multiple mount-instances of the same
filesystem (or part of the same filesystem), by introducing a new
interface /proc/mounts_new. The interface has the following format.


FSID  mntpt  root-dentry  fstype fs-options


NOTE: root-dentry is the path to the dentry relative to the root dentry of
the same filesystem.

For example, let's say we run the following commands:
mount --bind /var /mnt
mount --bind /mnt/tmp /tmp1

'cat /proc/mounts' shows the following:
/dev/root /mnt ext2 rw 0 0
/dev/root /tmp1 ext2 rw 0 0

NOTE: The above mount entries do not indicate that /tmp1 contains the same
directory tree as /var/tmp.

But 'cat /proc/mounts_new' shows us the following:
0x6200 /mnt /var ext2 rw 0 0
0x6200 /tmp1 /var/tmp ext2 rw 0 0

The above entries clearly indicate that the /var/tmp directory of the ext2
filesystem with fsid=0x6200 is the directory tree that resides under /tmp1.
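
As an illustration of how a userspace tool might consume this format, here is a small sketch that parses one line of the proposed /proc/mounts_new output and compares two entries. The struct layout, field sizes and function names are assumptions made for this example; only the column order comes from the description above.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one /proc/mounts_new line; field names are assumptions. */
struct mounts_new_entry {
	unsigned long fsid;	/* filesystem id, e.g. 0x6200 */
	char mntpt[256];	/* where the mount is attached */
	char root[256];		/* root dentry path within its filesystem */
	char fstype[32];
	char opts[128];
};

/* Parse "FSID mntpt root-dentry fstype fs-options ..."; 0 on success, -1 otherwise. */
static int parse_mounts_new_line(const char *line, struct mounts_new_entry *e)
{
	int n = sscanf(line, "%lx %255s %255s %31s %127s",
		       &e->fsid, e->mntpt, e->root, e->fstype, e->opts);

	return n == 5 ? 0 : -1;
}

/* Two entries expose the same directory tree when fsid and root-dentry match. */
static int same_tree(const struct mounts_new_entry *a,
		     const struct mounts_new_entry *b)
{
	return a->fsid == b->fsid && strcmp(a->root, b->root) == 0;
}
```

Feeding it the two example lines above yields root-dentry values of /var and /var/tmp on the same fsid, which is exactly the distinction that plain /proc/mounts cannot express.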

Signed-off-by: Ram Pai [EMAIL PROTECTED]

---
 fs/dcache.c              |   53
 fs/namespace.c           |   35
 fs/proc/base.c           |   32
 fs/proc/proc_misc.c      |    1
 fs/seq_file.c            |   77
 include/linux/dcache.h   |    1
 include/linux/seq_file.h |    1
 7 files changed, 172 insertions(+), 28 deletions(-)

Index: linux-2.6.17.10/fs/proc/base.c
===
--- linux-2.6.17.10.orig/fs/proc/base.c
+++ linux-2.6.17.10/fs/proc/base.c
@@ -104,6 +104,7 @@ enum pid_directory_inos {
 	PROC_TGID_MAPS,
 	PROC_TGID_NUMA_MAPS,
 	PROC_TGID_MOUNTS,
+	PROC_TGID_MOUNTS_NEW,
 	PROC_TGID_MOUNTSTATS,
 	PROC_TGID_WCHAN,
 #ifdef CONFIG_MMU
@@ -145,6 +146,7 @@ enum pid_directory_inos {
 	PROC_TID_MAPS,
 	PROC_TID_NUMA_MAPS,
 	PROC_TID_MOUNTS,
+	PROC_TID_MOUNTS_NEW,
 	PROC_TID_MOUNTSTATS,
 	PROC_TID_WCHAN,
 #ifdef CONFIG_MMU
@@ -203,6 +205,7 @@ static struct pid_entry tgid_base_stuff[
 	E(PROC_TGID_ROOT,       "root",       S_IFLNK|S_IRWXUGO),
 	E(PROC_TGID_EXE,        "exe",        S_IFLNK|S_IRWXUGO),
 	E(PROC_TGID_MOUNTS,     "mounts",     S_IFREG|S_IRUGO),
+	E(PROC_TGID_MOUNTS_NEW, "mounts_new", S_IFREG|S_IRUGO),
 	E(PROC_TGID_MOUNTSTATS, "mountstats", S_IFREG|S_IRUSR),
 #ifdef CONFIG_MMU
 	E(PROC_TGID_SMAPS,      "smaps",      S_IFREG|S_IRUGO),
@@ -246,6 +249,7 @@ static struct pid_entry tid_base_stuff[]
 	E(PROC_TID_ROOT,       "root",       S_IFLNK|S_IRWXUGO),
 	E(PROC_TID_EXE,        "exe",        S_IFLNK|S_IRWXUGO),
 	E(PROC_TID_MOUNTS,     "mounts",     S_IFREG|S_IRUGO),
+	E(PROC_TID_MOUNTS_NEW, "mounts_new", S_IFREG|S_IRUGO),
 #ifdef CONFIG_MMU
 	E(PROC_TID_SMAPS,      "smaps",      S_IFREG|S_IRUGO),
 #endif
@@ -692,13 +696,13 @@ static struct file_operations proc_smaps
 };
 #endif
 
-extern struct seq_operations mounts_op;
 struct proc_mounts {
 	struct seq_file m;
 	int event;
 };
 
-static int mounts_open(struct inode *inode, struct file *file)
+static int __mounts_open(struct inode *inode, struct file *file,
+			struct seq_operations *mounts_op)
 {
 	struct task_struct *task = proc_task(inode);
 	struct namespace *namespace;
@@ -716,7 +720,7 @@ static int mounts_open(struct inode *ino
 		p = kmalloc(sizeof(struct proc_mounts), GFP_KERNEL);
 		if (p) {
 			file->private_data = &p->m;
-			ret = seq_open(file, &mounts_op);
+			ret = seq_open(file, mounts_op);
 			if (!ret) {
 				p->m.private = namespace;
 				p->event = namespace->event;
@@ -729,6 +733,16 @@ static int mounts_open(struct inode *ino
 	return ret;
 }
 
+extern struct seq_operations mounts_op, mounts_new_op;
+static int mounts_open(struct inode *inode, struct file *file)
+{
+	return (__mounts_open(inode, file, &mounts_op));
+}
+static int mounts_new_open(struct inode *inode, struct file *file)
+{
+	return __mounts_open(inode, file, &mounts_new_op

Re: [patch 0/8] unprivileged mount syscall

2007-04-11 Thread Ram Pai
On Wed, 2007-04-11 at 12:44 +0200, Miklos Szeredi wrote:
  1. clone the master namespace.
  
  2. in the new namespace
  
  move the tree under /share/$me to /
  for each ($user, $what, $how) {
  move /share/$user/$what to /$what
  if ($how == slave) {
   make the mount tree under /$what as slave
  }
  }
  
  3. in the new namespace make the tree under 
 /share as private and unmount /share
 
 Thanks.  I get the basic idea now: the namespace itself need not be
 shared between the sessions, it is enough if share propagation is
 set up between the different namespaces of a user.
 
 I don't yet see either in your or Viro's description how the trees
 under /share/$USER are initialized.  I guess they are recursively
 bound from /, and are made slaves.

yes. I suppose, when a userid is created one of the steps would be

mount --rbind / /share/$USER
mount --make-rslave /share/$USER
mount --make-rshared /share/$USER
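
For reference, the three commands above map onto mount(2) calls roughly as in the sketch below. It takes the mount function as a parameter so that the sequence can be dry-run with a recording stub instead of the real syscall; setup_user_share(), record_mount() and the recording array are hypothetical names for this example, while the MS_* flags are the ones provided by <sys/mount.h>.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mount.h>

/* Same signature as mount(2), so the real wrapper or a stub can be passed in. */
typedef int (*mount_fn)(const char *src, const char *target, const char *fstype,
			unsigned long flags, const void *data);

/* Dry-run state: the flags of each call, in order (hypothetical helper). */
static unsigned long recorded_flags[8];
static int nr_recorded;

/* Stub that records the flags instead of touching the mount table. */
static int record_mount(const char *src, const char *target, const char *fstype,
			unsigned long flags, const void *data)
{
	(void)src; (void)target; (void)fstype; (void)data;
	if (nr_recorded >= 8)
		return -1;
	recorded_flags[nr_recorded++] = flags;
	return 0;
}

/* Replicate "/" under target, make the copy a slave, then make it shared. */
static int setup_user_share(const char *target, mount_fn do_mount)
{
	/* mount --rbind / target */
	if (do_mount("/", target, NULL, MS_BIND | MS_REC, NULL) != 0)
		return -1;
	/* mount --make-rslave target */
	if (do_mount(NULL, target, NULL, MS_SLAVE | MS_REC, NULL) != 0)
		return -1;
	/* mount --make-rshared target */
	if (do_mount(NULL, target, NULL, MS_SHARED | MS_REC, NULL) != 0)
		return -1;
	return 0;
}
```

Calling setup_user_share("/share/alice", mount) as root would perform the real operations; passing record_mount only logs the three flag combinations. The order matters: the slave step is what makes the tree receive propagation from /, and the shared step then lets it propagate onward to the per-session namespaces.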

RP







 Miklos



Re: [patch 0/8] unprivileged mount syscall

2007-04-10 Thread Ram Pai
On Mon, 2007-04-09 at 22:10 +0200, Miklos Szeredi wrote:
   The one in pam-0.99.6.3-29.1 in opensuse-10.2 is totally broken.  Are
   you interested in the details?  I can reproduce it, but forgot to note
   down the details of the brokenness.
  
  I don't know how far removed that is from the one being used by redhat,
  but assuming it's the same, then redhat-lspp@redhat.com will be
  very interested.
 
 OK.
 
- user namespace setup: what if user has multiple sessions?
   
  1) namespaces are shared?  That's tricky because the session needs to
  be a child of a namespace server, not of login.  I'm not sure PAM
  can handle this
   
  2) or mounts are copied on login?  That's not possible currently,
  as there's no way to send a mount between namespaces.  Also it's
  tricky to make sure that new mounts are also shared
  
  See toward the end of the 'shared subtrees' OLS paper from last year for
  a suggestion on how to let users effectively 'log in to' an existing
  private mounts ns.
 
 This?
 
   1. create a new namespace
   2. bind /share/$USER to /share
   3. for each pair ($who, $what) such that
  /share/$USER/$who/$what exists, look
  in /share/$who/allowed for peer $what
  $USER or slave $what $USER. If the
  former is found, rbind /share/$who/$what
  on /share/$USER/$who/$what; if the
  latter is found, do the same and
  follow with marking subtree under
  /share/$USER/$who/$what as slave.
   4. rbind /share/$USER to /share
   5. mark subtree under /share as private.
   6. umount -l /share
 
 Well, someone please explain using short words, because I don't
 understand at all.

I am trying to reconstruct Viro's thoughts.  I think the steps outlined
above, though not accurate, are still insightful.

The idea is -- there is one master namespace, which has
under /share, a replica of the mount tree of namespaces belonging to all
users. 

for example if there are two users A and B, then in the master namespace
under /share you will find /share/A and /share/B, each reflecting the
mount tree for the namespaces belonging to user-A and user-B
respectively. 

Note: /share is a shared mount-tree, which means it can propagate mount
events.

Every time the user logs on to the machine, a new namespace is created which
is a clone of the master namespace. In this new namespace,
/share/$user is made the root of the namespace. Also, if other
users have made part of their namespace available to this user,
then those mounts are also brought under this namespace. And finally the
entire tree under /share is unmounted.

Note, though multiple namespaces can exist simultaneously for the same
user, the user is provided the illusion of per-process-namespace since
all the namespaces look identical.  

I am trying to rewrite the steps outlined above, which may or may not
reflect Viro's thoughts, but certainly reflect my reconstruction of
viro's thoughts.

1. clone the master namespace.

2. in the new namespace

move the tree under /share/$me to /
for each ($user, $what, $how) {
move /share/$user/$what to /$what
if ($how == slave) {
 make the mount tree under /$what as slave
}
}

3. in the new namespace make the tree under 
   /share as private and unmount /share
  
 
   
RP


 
 Thanks,
 Miklos



Re: [patch 0/8] unprivileged mount syscall

2007-04-09 Thread Ram Pai
On Mon, 2007-04-09 at 12:07 -0500, Serge E. Hallyn wrote:
 Quoting Miklos Szeredi ([EMAIL PROTECTED]):

   - need to set up mount propagation from global namespace to private
 ones, mount(8) does not yet have options to configure propagation
 
 Hmm, I guess I get lost using my own little systems, and just assumed
 that shared subtree functionality was making its way up into mount(8).
 Ram, have you been working on that?

It is in FC6. I don't know the status of upstream util-linux. I did
submit the patch many times to Adrian Bunk (the then util-linux
maintainer) and got no response. I have not pushed the patches to the
new maintainer (Karel Zak?) though.

RP



shared subtree query

2005-08-31 Thread Ram Pai
Ok. I have shared subtree patches getting ready for review. I have
totally revamped the code from what I had sent last time, incorporating
all valuable comments Miklos had made. Of course, I have yet to finish the
document that Andrew Morton had requested.

The patch snapshot at:
http://www.sudhaa.com/~ram/readahead/sharedsubtree
and the latest set of working patches are at:
http://www.sudhaa.com/~ram/readahead/sharedsubtree/shared.0831.1

Before I formally send the patches for a review, I have bumped into a
small issue, and I am not sure about the behavior.

Al Viro's RFC at http://lwn.net/Articles/119232/ says

5. umount
  unmount everything that gets propagation from victim

It's hard to interpret what victim means. There can be two
interpretations of this.

1) the mount that got unmounted
2) the mount whose child got unmounted

I think it's natural to assume (2), but (1) also makes sense sometimes.
Can somebody shed some light on this?  Al Viro: please?

Thanks for your help,
RP






Re: Mirror a file system on the fly

2005-08-18 Thread Ram Pai
On Thu, 2005-08-18 at 12:40, Dave Schwartz wrote:
 Hi list,
  
 Not too sure if this is the right forum to ask this question but since
 my requirement is around linux filesystems, I shall take this liberty
 to post my question.
 
 My requirement is to develop a kernel/user space module to add an
 extension to the shell program environment such that this shell forks
 a mirror look-alike filesystem of the underlying OS to the programs
 run in that particular shell.

u seem to be talking about namespaces, if I get you right.

there is a flag CLONE_NEWNS to the system call 'clone' which does what
u r talking about.

RP




 
  
 Was trying to look thru the FAQ and a few list archives to look for
 ideas around my requirement. The archives were overwhelming.
  
 
 Any ideas/pointers will be a great help,
 Gracias,
 decebel


Re: Mirror a file system on the fly

2005-08-18 Thread Ram Pai
On Thu, 2005-08-18 at 13:27, Dave Schwartz wrote:
 Hi Ram,
 Thanks for the inputs. I was going over the man pages describing the
 clone system call and its option of CLONE_NEWNS. Could understand the
 description only in parts.
 
 The man page suggests that this flag when set, the cloned child is
 started in a new name space, initialized with a copy of the parent.
 Now does that mean, a program like a shell when cloned with
 CLONE_NEWNS set, will have a copy of file hierarchy of the underlying
 parent process?

Yes, the child process will see an exact copy of all the mounts of
various filesystems that the parent has. However, if you mount/unmount
any filesystem in the child, the same will not be mounted/unmounted in
the parent, and vice versa.  Each has its individual view of the
filesystem hierarchy.

Try the following program that clones off a child process with a mirror
namespace and gives you a bash prompt. Try mounting and unmounting
in this bash prompt and see if the same is visible in a totally
different window.


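#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <sched.h>
#include <sys/wait.h>

/* stack for the child; clone() takes the top, as stacks usually grow down */
static char somemem[64 * 1024];

static int myfunc(void *arg)
{
	(void)arg;
	return system("bash");
}

int
main(int argc, char *argv[])
{
	/* CLONE_NEWNS needs CAP_SYS_ADMIN, so run this as root */
	if (clone(myfunc, somemem + sizeof(somemem),
		  CLONE_NEWNS|SIGCHLD, NULL) != -1) {
		wait(NULL);
	} else {
		perror("clone failed");
	}
	printf("exit\n");
	return 0;
}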


Hope this helps,
RP




 
 Gracias,
 decebel
 
   
 
 On 8/19/05, Ram Pai [EMAIL PROTECTED] wrote:
  On Thu, 2005-08-18 at 12:40, Dave Schwartz wrote:
   Hi list,
  
   Not too sure if this is the right forum to ask this question but since
   my requirement is around linux filesystems, I shall take this liberty
   to post my question.
  
   My requirement is to develop a kernel/user space module to add an
   extension to the shell program environment such that this shell forks
   a mirror look-alike filesystem of the underlying OS to the programs
   run in that particular shell.
  
  u seem to be talking about namespaces, if I get you right.
  
  there is a flag CLONE_NEWNS to the system call 'clone' which does what
  u r talking about.
  
  RP
  
  
  
  
  
  
   Was trying to look thru the FAQ and a few list archives to look for
   ideas around my requirement. The archives were overwhelming.
  
  
   Any ideas/pointers will be a great help,
   Gracias,
   decebel


Re: [PATCH 1/7] shared subtree

2005-07-29 Thread Ram Pai
On Thu, 2005-07-28 at 02:57, Miklos Szeredi wrote:
   This is an example, where having struct pnode just complicates things.
   If there was no struct pnode, this function would be just one line:
   setting the shared flag.
  So your comment is mostly about getting rid of pnode and distributing
  the pnode functionality in the vfsmount structure.
 
 Yes, sorry if I didn't make it clear.
 
  I know you are thinking of just having the necessary propagation list in
  the vfsmount structure itself.  Yes, true, with that implementation the
  complication is reduced in this part of the code, but it really complicates
  the propagation traversal routines.
 
 On the contrary, I think it will simplify the traversal routines.
 
 Here's an iterator function I coded up.  Not tested at all (may not
 even compile):

Your suggested code has bugs. But I understand what you are aiming at.

Maybe you are right. I will try out an implementation using your idea.

Hmm.. lots of code changes, and testing.

 
 struct vfsmount {
   /* ... */

   struct list_head mnt_share;      /* circular list of shared mounts */
   struct list_head mnt_slave_list; /* list of slave mounts */
   struct list_head mnt_slave;      /* slave list entry */
   struct vfsmount *master;         /* slave is on master->mnt_slave_list */
 };

 static inline struct vfsmount *next_shared(struct vfsmount *p)
 {
   return list_entry(p->mnt_share.next, struct vfsmount, mnt_share);
 }

 static inline struct vfsmount *first_slave(struct vfsmount *p)
 {
   return list_entry(p->mnt_slave_list.next, struct vfsmount, mnt_slave);
 }

 static inline struct vfsmount *next_slave(struct vfsmount *p)
 {
   return list_entry(p->mnt_slave.next, struct vfsmount, mnt_slave);
 }

 static struct vfsmount *propagation_next(struct vfsmount *p,
                                          struct vfsmount *base)
 {
   /* first iterate over the slaves */
   if (!list_empty(&p->mnt_slave_list))
           return first_slave(p);

I think this code should be
	if (!list_empty(&p->mnt_slave))
		return next_slave(p);

Right? I think I get the idea. 



RP

 
   while (1) {
           struct vfsmount *q;

           /* more vfsmounts belong to the pnode? */
           if (!list_empty(&p->mnt_share)) {
                   p = next_shared(p);
                   if (list_empty(&p->mnt_slave) && p != base)
                           return p;
           }
           if (p == base)
                   break;

           BUG_ON(list_empty(&p->mnt_slave));

           /* more slaves? */
           q = next_slave(p);
           if (p->master != q)
                   return q;

           /* back at master */
           p = q;
   }

   return NULL;
 }
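
To pin down what such an iterator is expected to visit, here is a small user-space model. It is not the kernel code and not the iterator proposed above: it replaces the linked lists with an array, and struct mnt_model, its fields and propagation_targets() are names invented for this sketch. Each mount carries a peer-group id (private mounts get a unique id) and the id of the group it is a slave of, or -1.

```c
#include <stddef.h>

#define MAX_MNT 16

/* Hypothetical flat model of the propagation tree. */
struct mnt_model {
	int group;		/* peer group id; equal ids mean the mounts are peers */
	int master_group;	/* group this mount is a slave of, or -1 */
};

/*
 * Collect the indices of every mount that receives propagation from
 * mounts[origin]: its peers, their slaves, the slaves' peers, and so on.
 * Propagation never flows from a slave back to its master.
 * Returns the number of indices written to out (at most n - 1).
 */
static size_t propagation_targets(const struct mnt_model *mounts, size_t n,
				  size_t origin, size_t *out)
{
	int seen[MAX_MNT] = { 0 };
	size_t queue[MAX_MNT];
	size_t head = 0, tail = 0, count = 0, i;

	if (n > MAX_MNT || origin >= n)
		return 0;

	seen[origin] = 1;
	queue[tail++] = origin;
	while (head < tail) {
		size_t cur = queue[head++];

		for (i = 0; i < n; i++) {
			int is_peer = mounts[i].group == mounts[cur].group;
			int is_slave = mounts[i].master_group == mounts[cur].group;

			if (seen[i] || !(is_peer || is_slave))
				continue;
			seen[i] = 1;
			queue[tail++] = i;
			out[count++] = i;
		}
	}
	return count;
}
```

A list-based propagation_next() should return the same set of mounts, one per call, without needing the seen[] array; the model is only meant as a reference to test such an iterator against.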
 


mount behavior question.

2005-07-28 Thread Ram Pai
Summary of the question:
Should the topmost mount be visible, or should the most recent
mount be visible?

consider the following command sequence

(1) cd /mnt
(2) mount --bind /usr /mnt
(3) mount --bind /bin /mnt
(4) mount --bind /var .

After step 1, the pwd of the process points to the root mount and
directory mnt. Let's call the root mount 'A'.

After step 2, a new mount is laid on top of 'A' at the mountpoint mnt.
Let's call this mount 'B'.

After step 3, a new mount is laid on top of 'B' at the mountpoint mnt,
which corresponds to the root dentry of 'B'. Let's call this new overlaid
mount 'C'. At this point the visible content of /mnt is
the content of C.

However, at step 4, a new mount is laid on top of 'A' at the same
mountpoint mnt as that of 'B'. Let's call the new mount 'D'.

At this point, the visible content of /mnt is that of D and not that
of C.

But shouldn't it be C?

Why are the contents of 'D' made visible? Is there any particular
reason for this behavior? Note: 'D' is mounted on the bottommost mount,
and hence should be obscured by the top-level mounts.
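
The two candidate rules can be compared with a toy model of the lookup. This is only a sketch of the behavior described in this mail, not the kernel's lookup code; struct stacked_mnt, enum rule and visible_mount() are names made up for the example, and array order is assumed to be mount order.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record of one mount in the stacking example. */
struct stacked_mnt {
	char name;		/* label used in the mail: 'A', 'B', ... */
	int parent;		/* index of the mount it sits on, -1 for the root mount */
	const char *mountpoint;	/* dentry within the parent, "/" for the parent's root */
};

enum rule { MOST_RECENT, TOPMOST };

/*
 * Starting at mount 'start' and dentry 'path', keep stepping onto whatever
 * is mounted there and return the label of the mount that ends up visible.
 * MOST_RECENT picks the newest mount on a given (parent, dentry) pair;
 * TOPMOST picks the oldest, so a later mount on an already covered dentry
 * is tucked underneath the existing stack.
 */
static char visible_mount(const struct stacked_mnt *m, size_t n, int start,
			  const char *path, enum rule rule)
{
	int cur = start;
	const char *where = path;

	for (;;) {
		int pick = -1;
		size_t i;

		for (i = 0; i < n; i++) {
			if (m[i].parent != cur ||
			    strcmp(m[i].mountpoint, where) != 0)
				continue;
			/* later index means mounted more recently */
			if (pick < 0 || rule == MOST_RECENT)
				pick = (int)i;
		}
		if (pick < 0)
			return m[cur].name;
		cur = pick;
		where = "/";	/* continue from the root of the mount just entered */
	}
}
```

With A as the root mount, B on (A, /mnt), C on (B, /) and D on (A, /mnt), looking up /mnt from A gives D under MOST_RECENT and C under TOPMOST, which is the difference being asked about.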

To make it simpler, imagine you are viewing a 3 storied transparent
building from the top. If you place an apple in 1st floor and nothing is
placed on any other floors, the apple will be visible from the top.
Now if you place a 'orange' in the 2nd floor the apple should get
obscured by the orange and the 'orange' should start being visible. And
later if you place a 'mango' on 3rd floor, the mango should obscure both
the apple and orange.  but at this point if you place another apple on
top of the first apple in the 1st floor, it cannot be visible, because
the 'orange' and the 'mango' block its line of sight. And hence the
'mango' should still continue to be visible. right? If the apple starts
becoming visible from the top, won't it defy law of visibility? :)


Back to the mount example:
Currently the behavior is the most recent mount is visible and not the
topmost mount.


Not many will run into this question currently, because the sequence of
steps has to be orchestrated well to get into this scenario. But with
shared subtrees it is pretty easy to mount something at a lower-level
mount because of propagation. And in this case the behavior becomes
totally confusing if the rule is 'expose the most-recent mount and not
the topmost mount'.

Here is a scenario with shared subtree. Sorry it is complex.


mount --bind /mnt /mnt
mount --make-shared /mnt
mkdir -p /mnt/1 /mnt/2
mount --bind /usr  /mnt/1
mount --bind /mnt  /mnt/2

At this stage the mounts at /mnt/2 and /mnt belong to the same pnode,
which means mounts under them propagate to each other.

mount --bind /var /mnt/1

The contents of /var will be visible under /mnt/1 and not under /mnt/2.
But if mount --bind /var /mnt/2 is executed, the contents of /var are
visible under /mnt/1 as well as /mnt/2. Isn't this freaky?

On analysis it turns out the culprit is the current rule, which says
'expose the most-recent mount and not the topmost mount'.

RP



Re: mount behavior question.

2005-07-28 Thread Ram Pai
On Thu, 2005-07-28 at 04:56, Miklos Szeredi wrote:
  Here is a scenario with shared subtree. Sorry it is complex.
  
  
  mount --bind /mnt /mnt
  mount --make-shared /mnt
  mkdir -p /mnt/p
  mount --bind /usr  /mnt/1
  mount --bind /mnt  /mnt/2
  
  At this stage the mount at /mnt/2 and /mnt belong to the same pnode
  which means mounts under them propogate to each other.
  
  mount --bind /var /mnt/1
  
  the contents of /var will be visible under /mnt/1 and not under /mnt/2
  But if mount --bind /var /mnt/2 is executed, the contents of /var is
  visible under /mnt/1 as well as /mnt/2 . Isn't this freaky?
 
 I don't understand.
 
 'mount --bind /var /mnt/1' should propagate to /mnt/2/1, not /mnt/2. 

Yes, it should propagate to /mnt/2/1; that's what I meant when I said
under /mnt/2, but yes, I was not clear. I hope I have a clearer
explanation below.

  No?
 
 'mount --bind /var/ /mnt/2' should propagate to /mnt.  What am I
 missing?


step 1: mount --bind /mnt /mnt
a new mount 'A' is created at /mnt

step 2: mount --make-shared /mnt
   mounts under 'A' are made shared. But in this case
   there are no other mounts. So only 'A' will be made shared.
  
 
step 3: mkdir -p /mnt/1 /mnt/2
nothing special here

step 4: mount --bind /usr  /mnt/1
a new mount 'B' is created at /mnt/1, which is
'shared'.


step 5: mount --bind /mnt  /mnt/2

a new mount 'C' is created at /mnt/2
and propagation is set up between 'A' and 'C'.
Note: 'C' is made shared.



lets say, at this point I try 
mount --bind /var /mnt/1

This is going to mount 'D' on top of mount 'B'.  However,
there is no other mount to which 'B' propagates. So that is
it: the contents of /var are only visible at /mnt/1 and the mount
propagates nowhere else.

But let's say we tried mount --bind /var /mnt/2/1.
/mnt/2/1 belongs to mount 'C', and mounts under 'C' propagate to 'A'
too. So in this case a new mount 'E' is created at /mnt/2/1,
i.e. on top of 'C' at dentry '1', and due to propagation a new mount
'F' is created at /mnt/1, i.e. on top of mount 'A' at dentry '1'.
But note: /mnt/1 already has a mount 'B' on top of it.  The new mount
'F', as per the 'most-recent mount rule', obscures 'B' even though the
mount is on top of 'A'. As a result the contents of /var are now
visible both at /mnt/2/1 and /mnt/1.


OK, the net effect is: a mount at /mnt/1 is visible only under /mnt/1,
but a mount at /mnt/2/1 is visible at both /mnt/2/1 and /mnt/1.
This makes it confusing. If the 'top-most mount rule' is applied,
'F', though mounted on 'A', will not be visible because it will get
obscured by 'B', and the confusion is avoided.

So the point I am driving at is, is there any special reason 
for having 'most-recent mount visible rule' instead of 'top-most mount
visible rule'?
RP


 Miklos



Re: mount behavior question.

2005-07-28 Thread Ram Pai
On Thu, 2005-07-28 at 12:30, Miklos Szeredi wrote:
  No, there is no asymmetry as such. The propagations are working the way
  they are meant to. But the confusion arises because of the mount lookup
  semantics.  The reason Avantika (who is doing shared subtree testing)
  had this exact confusion is the 'most-recent-mount visible'
  rule. I don't think this rule is documented anywhere. And the natural
  response to such a behavior is confusion.
 
 I really fail to see what you are getting at.
 
 You agree that:
 
   1) mount doesn't propagate from /mnt/1 to /mnt/2/1.
 
   2) mount propagates from /mnt/2/1 to /mnt/1.

Yes I agree.

 
 Then you are surprised that you don't see the same thing if you mount
 on /mnt/1 as if on /mnt/2/1.

I am not surprised when mounts on /mnt/1 do not propagate to /mnt/2/1.
This is expected, and I am perfectly happy, because the mount is
attempted on 'B' and 'B' has nobody to propagate to.

When a mount on /mnt/2/1 (i.e. on 'C' at dentry '1') is attempted, I expect
to see a new mount 'E' at that dentry. That is happening and
I am happy with it.
I also expect that the mount propagates to /mnt/1 too (i.e. on 'A' at
dentry '1'), because 'C' and 'A' have propagation set up.

But what I also expect to see is the new mount 'F' at /mnt/1 (mount 'A'
at dentry '1') being obscured by the already existing mount on /mnt/1,
i.e. mount 'B'.

And the reason I want the new mount at /mnt/1 (i.e. 'F') obscured is that
the new mount is not done on 'B' but is done on 'A'.

The most-recent mount rule makes 'B' obscured instead of 'F',
and I am expecting the topmost-mount visible rule to be applicable
here, which keeps 'B' visible and 'F' obscured.

Ah...its so hard without a whiteboard :( I wish there was some way to
explain it drawing some objects on the whiteboard.

I guess, I have got all the letters and the words right. Any small
mistake can distort everything. If somebody is wondering why there is
no 'D' that is because it was used for something else in the earlier
example and hence not used here.

RP


 
 I think your proposed solution would be _more_ confusing not less,
 since then I'd not see the expected propagation from /mnt/2/1 to
 /mnt/1.  I'd call that a bug.
 
 Miklos



Re: mount behavior question.

2005-07-28 Thread Ram Pai
On Thu, 2005-07-28 at 13:35, Bryan Henderson wrote:
 
 It wouldn't surprise me if someone is depending on mount over ..  But 
 I'd be surprised if someone is doing it to a directory that's already been 
 mounted over (such that the stacking behavior is relevant).  That seems 
 really eccentric.

Bryan, what would you expect the behavior to be when somebody mounts on
a directory that is already mounted over?

Do you expect the new mount to obscure the already existing mount or
do you expect the already existing mount to obscure the new mount?

The issue in the current thread is pretty much revolving around this.
RP

 
 --
 Bryan Henderson IBM Almaden Research Center
 San Jose CA Filesystems
 



Re: mount behavior question.

2005-07-28 Thread Ram Pai
On Thu, 2005-07-28 at 15:27, Bryan Henderson wrote:
 Bryan, what would you expect the behavior to be when somebody mounts on
 a directory what is already mounted over? 
 
 Well, I've tried to beg the question.  I said I don't think it's 
 meaningful to mount over a directory; that one actually mounts at a name. 
 And that Linux's peculiar mount over '.' (which is in fact mounting over 
 a directory and not at a name) is weird enough that there is no natural 
 expectation of it except that it should fail.
 
 But if I had to try to merge mount over '.' into as consistent a model 
 as possible with one of the two behaviors we've been discussing, I'd say 
 that . stands for the name by which you looked up that directory in the 
 first place (so in this case, it's equivalent to mount ... /mnt).  And 
 that means I would expect the new mount to obscure the already existing 
 mount.

OK, maybe I am having some odd expectations here.
To me it still feels natural to tuck the mount under the earlier mount,
since you are not mounting on something which is on the top, but you
are mounting on top of something which is underneath (obscured).

RP


 
 --
 Bryan Henderson IBM Almaden Research Center
 San Jose CA Filesystems



Re: [PATCH 1/7] shared subtree

2005-07-27 Thread Ram Pai
On Wed, 2005-07-27 at 12:54, Miklos Szeredi wrote:
  +static int do_make_shared(struct vfsmount *mnt)
  +{
  +   int err=0;
  +   struct vfspnode *old_pnode = NULL;
  +   /*
  +    * if the mount is already a slave mount,
  +    * allocate a new pnode and make it
  +    * a slave pnode of the original pnode.
  +    */
  +   if (IS_MNT_SLAVE(mnt)) {
  +           old_pnode = mnt->mnt_pnode;
  +           pnode_del_slave_mnt(mnt);
  +   }
  +   if(!IS_MNT_SHARED(mnt)) {
  +           mnt->mnt_pnode = pnode_alloc();
  +           if(!mnt->mnt_pnode) {
  +                   pnode_add_slave_mnt(old_pnode, mnt);
  +                   err = -ENOMEM;
  +                   goto out;
  +           }
  +           pnode_add_member_mnt(mnt->mnt_pnode, mnt);
  +   }
  +   if(old_pnode)
  +           pnode_add_slave_pnode(old_pnode, mnt->mnt_pnode);
  +   set_mnt_shared(mnt);
  +out:
  +   return err;
  +}
 
 This is an example, where having struct pnode just complicates things.
 If there was no struct pnode, this function would be just one line:
 setting the shared flag.
So your comment is mostly about getting rid of pnode and distributing
the pnode functionality in the vfsmount structure.

I know you are thinking of just having the necessary propagation list in
the vfsmount structure itself.  Yes, true, with that implementation the
complication is reduced in this part of the code, but it really complicates
the propagation traversal routines.

In order to find out the slaves of a given mount:
with your proposal: I have to walk through all the peer mounts of this
mount and check for any slaves there.
in my implementation: I just have to find which pnode it belongs to, and
all the slaves are easily available there.

In order to find out all the shared mounts that are slaves of this
mount:

with your proposal: Not sure how to do it. Maybe you have to have another
field in each of the vfsmounts that will point to
the shared mounts that are slaves of this mount?

in my implementation: I just have to find the pnode it belongs to,
and all the slave pnodes are easily available there.


There are complexity tradeoffs in both implementations. But I
personally felt having a pnode structure keeps the pnode operations
separated out cleanly. It helps to easily visualize the propagation
tree.
statement in Al Viro's RFC: 
---
How do we set them up? 

* we can mark a subtree sharable. Every vfsmount in the subtree 
that is not already in some p-node gets a single-element p-node of its 
own. 
* we can mark a subtree slave. That removes all vfsmounts in 
the subtree from their p-nodes and makes them owned by said p-nodes. 
p-nodes that became empty will disappear and everything they used to 
own will be repossessed by their owners (if any). 
* we can mark a subtree private. Same as above, but followed 
by taking all vfsmounts in our subtree and making them *not* owned 
by anybody. 

The above statements imply some implementation detail. Not sure if you
will buy this point :)
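
One way to read the three operations quoted from the RFC is as transitions
on a small per-mount state. The sketch below is a flattened, hypothetical
model (struct mount_state, make_shared, make_slave, make_private are names
invented here); it ignores subtrees and the "empty p-node disappears" rule,
and it borrows the -EINVAL for non-shared mounts from do_make_slave() in
patch 1/7.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flattened model of one vfsmount's propagation state. */
struct mount_state {
	int pnode_id;	/* p-node this mount is a member of, 0 if none */
	int master_id;	/* p-node that owns this mount, 0 if none */
};

static int next_pnode_id = 1;

/* sharable: a mount not already in a p-node gets a p-node of its own. */
static int make_shared(struct mount_state *m)
{
	if (m->pnode_id == 0)
		m->pnode_id = next_pnode_id++;
	assert(m->pnode_id != 0);
	return 0;
}

/* slave: leave the p-node as a member, but become owned by that p-node. */
static int make_slave(struct mount_state *m)
{
	if (m->pnode_id == 0)
		return m->master_id ? 0 : -EINVAL;	/* only shared mounts qualify */
	m->master_id = m->pnode_id;
	m->pnode_id = 0;
	return 0;
}

/* private: as above, and additionally not owned by anybody. */
static int make_private(struct mount_state *m)
{
	m->pnode_id = 0;
	m->master_id = 0;
	return 0;
}
```

In this model a private mount cannot be made a slave directly; it has to be
shared first, so that there is a p-node to be owned by.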


 
  +static kmem_cache_t * pnode_cachep;
  +
  +/* spinlock for pnode related operations */
  + __cacheline_aligned_in_smp DEFINE_SPINLOCK(vfspnode_lock);
  +
  +enum pnode_vfs_type {
  +   PNODE_MEMBER_VFS = 0x01,
  +   PNODE_SLAVE_VFS = 0x02
  +};
  +
  +void __init pnode_init(unsigned long mempages)
  +{
  +   pnode_cachep = kmem_cache_create("pnode_cache",
  +   sizeof(struct vfspnode), 0,
  +   SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL);
  +}
  +
  +struct vfspnode * pnode_alloc(void)
  +{
  +   struct vfspnode *pnode =  kmem_cache_alloc(pnode_cachep, GFP_KERNEL);
  +   INIT_LIST_HEAD(&pnode->pnode_vfs);
  +   INIT_LIST_HEAD(&pnode->pnode_slavevfs);
  +   INIT_LIST_HEAD(&pnode->pnode_slavepnode);
  +   INIT_LIST_HEAD(&pnode->pnode_peer_slave);
  +   pnode->pnode_master = NULL;
  +   pnode->pnode_flags = 0;
  +   atomic_set(&pnode->pnode_count,0);
  +   return pnode;
  +}
  +
  +void inline pnode_free(struct vfspnode *pnode)
  +{
  +   kmem_cache_free(pnode_cachep, pnode);
  +}
  +
  +/*
  + * __put_pnode() should be called with vfspnode_lock held
  + */
  +void __put_pnode(struct vfspnode *pnode)
  +{
  +   struct vfspnode *tmp_pnode;
  +   do {
  +   tmp_pnode = pnode->pnode_master;
  +   list_del_init(&pnode->pnode_peer_slave);
  +   BUG_ON(!list_empty(&pnode->pnode_vfs));
  +   BUG_ON(!list_empty(&pnode->pnode_slavevfs));
  +   BUG_ON(!list_empty(&pnode->pnode_slavepnode));
  +   pnode_free(pnode);
  +   pnode = tmp_pnode;
  +   if (!pnode || !atomic_dec_and_test(&pnode->pnode_count))
  +   break;
  +   } while(pnode);
  +}
  +
 
 All these are really unnecessary IMO.
 
  +/*
  + * merge 'pnode' into 'peer_pnode' and get rid of 

Re: [PATCH 3/7] shared subtree

2005-07-27 Thread Ram Pai
On Wed, 2005-07-27 at 12:13, Miklos Szeredi wrote:
  @@ -54,7 +55,7 @@ static inline unsigned long hash(struct 
   
   struct vfsmount *alloc_vfsmnt(const char *name)
   {
  -   struct vfsmount *mnt = kmem_cache_alloc(mnt_cache, GFP_KERNEL); 
  +   struct vfsmount *mnt = kmem_cache_alloc(mnt_cache, GFP_KERNEL);
  if (mnt) {
  memset(mnt, 0, sizeof(struct vfsmount));
  atomic_set(&mnt->mnt_count,1);
 
 Please make whitespace changes a separate patch.

 I tried to remove trailing whitespace in the current code
wherever I found it. OK, I will make that a separate patch.


 
  @@ -128,11 +162,71 @@ static void attach_mnt(struct vfsmount *
   {
  mnt->mnt_parent = mntget(nd->mnt);
  mnt->mnt_mountpoint = dget(nd->dentry);
  -   list_add(&mnt->mnt_hash, mount_hashtable+hash(nd->mnt, nd->dentry));
  +   mnt->mnt_namespace = nd->mnt->mnt_namespace;
  +   list_add_tail(&mnt->mnt_hash,
  +   mount_hashtable+hash(nd->mnt, nd->dentry));
  list_add_tail(&mnt->mnt_child, &nd->mnt->mnt_mounts);
  nd->dentry->d_mounted++;
   }
 
 Why list_add_tail()?  This changes user visible behavior, and seems
 unnecessary.

Yes. I was about to send out a mail questioning the existing behavior. I
will start a separate thread about the current behavior. My plan
was to discuss the current behavior before making this change. I thought
I had reverted this change, but it slipped in. 
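
To illustrate why the choice between list_add() and list_add_tail() is
user-visible: the hash-bucket walk in lookup_mnt() starts at the head and
takes the first match, so the insertion end decides which of two mounts on
the same mountpoint is found. The sketch below reimplements a tiny circular
list in userspace; the helper names mirror the kernel ones, but struct mnt
and first_id() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h)
{
	h->next = h->prev = h;
}

/* Link 'n' between two adjacent entries. */
static void list_insert(struct list_head *n, struct list_head *prev,
			struct list_head *next)
{
	assert(prev->next == next);	/* list must be consistent */
	next->prev = n;
	n->next = next;
	n->prev = prev;
	prev->next = n;
}

/* Insert right after the head: newest entry is seen first. */
static void list_add(struct list_head *n, struct list_head *head)
{
	list_insert(n, head, head->next);
}

/* Insert right before the head: oldest entry is seen first. */
static void list_add_tail(struct list_head *n, struct list_head *head)
{
	list_insert(n, head->prev, head);
}

/* Hypothetical stand-in for a vfsmount hashed into a bucket. */
struct mnt {
	struct list_head hash;
	int id;
};

#define mnt_of(p) ((struct mnt *)((char *)(p) - offsetof(struct mnt, hash)))

/* Id of the first entry a head-first walk would find, or -1 if empty. */
static int first_id(struct list_head *bucket)
{
	if (bucket->next == bucket)
		return -1;
	return mnt_of(bucket->next)->id;
}
```

Adding mount 1 and then mount 2 with list_add() makes the walk find mount 2
(the most recent one); doing the same with list_add_tail() makes it find
mount 1.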

 
  +static void attach_prepare_mnt(struct vfsmount *mnt, struct nameidata *nd)
  +{
  +   mnt->mnt_parent = mntget(nd->mnt);
  +   mnt->mnt_mountpoint = dget(nd->dentry);
  +   nd->dentry->d_mounted++;
  +}
  +
  +
 
 You shouldn't add unnecessary newlines.  There are a lot of these,
 please audit all your patches.

ok. sure.

 
  +void do_attach_commit_mnt(struct vfsmount *mnt)
  +{
  +   struct vfsmount *parent = mnt->mnt_parent;
  +   BUG_ON(parent==mnt);
 
   BUG_ON(parent == mnt);
 
  +   if(list_empty(&mnt->mnt_hash))
 
   if (list_empty(&mnt->mnt_hash))
 
  +   list_add_tail(&mnt->mnt_hash,
  +   mount_hashtable+hash(parent, mnt->mnt_mountpoint));
  +   if(list_empty(&mnt->mnt_child))
  +   list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
  +   mnt->mnt_namespace = parent->mnt_namespace;
  +   list_add_tail(&mnt->mnt_list, &mnt->mnt_namespace->list);
  +}
 
 Etc.  Maybe you should run Lindent on your changes, but be careful not
 to change existing code, even if Lindent would do that!

sure :)

 
  @@ -191,7 +270,7 @@ static void *m_start(struct seq_file *m,
  struct list_head *p;
  loff_t l = *pos;
   
  -   down_read(&n->sem);
  +   down_read(&namespace_sem);
  list_for_each(p, &n->list)
  if (!l--)
  return list_entry(p, struct vfsmount, mnt_list);
 
 This should be a separate patch.  You can just take the one from the
 detached trees patch-series.

ok. in fact these changes were motivated by that patch.

 
  +/*
  + * abort the operations done in attach_recursive_mnt(). run through the mount
  + * tree, till vfsmount 'last' and undo the changes.  Ensure that all the mounts
  + * in the tree are all back in the mnt_list headed at 'source_mnt'.
  + * NOTE: This function is closely tied to the logic in
  + * 'attach_recursive_mnt()'
  + */
  +static void abort_attach_recursive_mnt(struct vfsmount *source_mnt,
  +   struct vfsmount *last, struct list_head *head)
  +{
  +   struct vfsmount *p = source_mnt, *m;
  +   struct vfspnode *src_pnode;
 
 If you want to do proper error handling, instead of doing rollback, it
 seems better to first do anything that can fail (allocations), then do
 the actual attaching, which cannot fail.  It isn't nice to have
 transient states on failure.

Yes, it does exactly what you said. In the prepare stage it does not
touch any of the existing vfstree or the pnode tree.

All it does is build a new vfstree and pnode tree and make the necessary
changes to them. If everything is successful, it glues the new tree
to the existing tree (the commit phase); if the prepare
stage fails to allocate memory, or fails for any other reason, it destroys
the new trees (the abort phase).

Of course, in the prepare stage it does increase the reference count of
the vfsmounts to which the new tree will be attached. This ensures
that the vfsmounts have not disappeared by the time we reach the commit
phase.  I think we are talking about the same thing, and the code behaves
exactly as you said.
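
The prepare/commit/abort split described above can be sketched in userspace
as follows. This is only an illustration of the pattern under simplifying
assumptions: struct node, struct tree, the fail_after test hook, and the
attach_* helpers are hypothetical, and the reference counting on the target
vfsmounts is left out.

```c
#include <assert.h>
#include <stdlib.h>

struct node {
	struct node *next;
	int id;
};

struct tree {
	struct node *head;	/* the "existing tree" that must stay consistent */
	size_t count;
};

/* Test hook: number of allocations that succeed before one fails; -1 = never fail. */
static int fail_after = -1;

static struct node *node_alloc(int id)
{
	struct node *n;

	if (fail_after == 0)
		return NULL;
	if (fail_after > 0)
		fail_after--;
	n = malloc(sizeof(*n));
	if (n) {
		n->id = id;
		n->next = NULL;
	}
	return n;
}

/* Abort phase: destroy only the private list that nobody else can see. */
static void attach_abort(struct node *pending)
{
	while (pending) {
		struct node *next = pending->next;

		free(pending);
		pending = next;
	}
}

/* Prepare phase: everything that can fail happens here; the tree is untouched. */
static struct node *attach_prepare(int first_id, size_t n)
{
	struct node *pending = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		struct node *nd = node_alloc(first_id + (int)i);

		if (!nd) {
			attach_abort(pending);
			return NULL;
		}
		nd->next = pending;
		pending = nd;
	}
	return pending;
}

/* Commit phase: pointer splicing only, so it cannot fail. */
static void attach_commit(struct tree *t, struct node *pending)
{
	assert(t != NULL);
	while (pending) {
		struct node *next = pending->next;

		pending->next = t->head;
		t->head = pending;
		t->count++;
		pending = next;
	}
}

/* Returns 0 on success, -1 if the prepare phase ran out of memory. */
static int attach_nodes(struct tree *t, int first_id, size_t n)
{
	struct node *pending = attach_prepare(first_id, n);

	if (!pending && n > 0)
		return -1;
	attach_commit(t, pending);
	return 0;
}
```

When an allocation fails partway through, attach_nodes() returns an error
and the tree still looks exactly as it did before the call, so no transient
state is ever visible.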


 
  + /*
  + * This operation is equivalent of mount --bind dir dir
  + * create a new mount at the dentry, and unmount all child mounts
  + * mounted on top of dentries below 'dentry', and mount them
  + * under the new mount.
  +  */
  +struct vfsmount *do_make_mounted(struct vfsmount *mnt, struct dentry *dentry)
 
 Why is this needed?  I thought we agreed, that this can be removed.

Yes, we agreed on returning EINVAL when an attempt is made to make a directory
shared/private/slave/unclonable.   But this is a different case.

lets say  /mnt is a 

[no subject]

2005-07-25 Thread Ram Pai
, [EMAIL PROTECTED], Janak Desai [EMAIL PROTECTED], 
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH 0/7] shared subtree

Hi Andrew/Al Viro,

Enclosing a final set of well tested patches that implement
Al Viro's shared subtree proposal.

These patches provide the ability to mark a mount tree as
shared/private/slave/unclone, along with the ability to play with these
trees with operations like bind/rbind/move/pivot_root/namespace-clone
etc.

I believe this powerful feature can help build features like
per-user namespaces.  A couple of projects may benefit from
shared subtrees:
1) automounter for the ability to automount across namespaces.
2) SELinux for implementing polyinstantiated trees.
3) MVFS for providing a versioning file system.
4) FUSE for per-user namespaces?

Thanks to Avantika for developing 100+ test cases that test
various combinations of private/shared/slave/unclonable trees. All
these tests have passed. I feel pretty confident about the stability of
the code.

The patches have been broken into 7 units, for ease of review.  I
realize that patch-3 'rbind.patch' is a bit heavier than all the other
patches. The reason is that most of the shared-subtree functionality
manifests itself during bind/rbind operations.

A couple of work items remain:
1. modify the mount command to support this feature
e.g.:  mount --make-shared /tmp
2. a tool that can help visualize the propagation tree, maybe
support in /proc?
3. some documentation on how to use all this functionality.

Please consider the patches for inclusion in your tree.

The footprint of this code is pretty small in the normal code path
where shared-subtree functionality is not used.

Any suggestions/comments to improve the code are welcome.

Thanks,
RP


[no subject]

2005-07-25 Thread Ram Pai
, [EMAIL PROTECTED], Janak Desai [EMAIL PROTECTED], 
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH 7/7] shared subtree
Content-Type: text/x-patch; name=automount.patch
Content-Disposition: inline; filename=automount.patch

Adds support for mount/umount propagation for autofs-initiated operations.
RP

Signed by Ram Pai ([EMAIL PROTECTED])

 fs/namespace.c|  176 +++---
 fs/pnode.c|   12 +--
 include/linux/pnode.h |3 
 3 files changed, 76 insertions(+), 115 deletions(-)

Index: 2.6.12.work2/fs/namespace.c
===
--- 2.6.12.work2.orig/fs/namespace.c
+++ 2.6.12.work2/fs/namespace.c
@@ -202,6 +202,9 @@ struct vfsmount *do_attach_prepare_mnt(s
if(!(child_mnt = clone_mnt(template_mnt,
template_mnt->mnt_root)))
return NULL;
+   spin_lock(&vfsmount_lock);
+   list_del_init(&child_mnt->mnt_fslink);
+   spin_unlock(&vfsmount_lock);
} else
child_mnt = template_mnt;
 
@@ -355,35 +358,14 @@ struct seq_operations mounts_op = {
  */
 int may_umount_tree(struct vfsmount *mnt)
 {
-   struct list_head *next;
-   struct vfsmount *this_parent = mnt;
-   int actual_refs;
-   int minimum_refs;
+   int actual_refs=0;
+   int minimum_refs=0;
+   struct vfsmount *p;
 
spin_lock(&vfsmount_lock);
-   actual_refs = atomic_read(&mnt->mnt_count);
-   minimum_refs = 2;
-repeat:
-   next = this_parent->mnt_mounts.next;
-resume:
-   while (next != &this_parent->mnt_mounts) {
-   struct vfsmount *p = list_entry(next, struct vfsmount, mnt_child);
-
-   next = next->next;
-
+   for (p = mnt; p; p = next_mnt(p, mnt)) {
actual_refs += atomic_read(&p->mnt_count);
minimum_refs += 2;
-
-   if (!list_empty(&p->mnt_mounts)) {
-   this_parent = p;
-   goto repeat;
-   }
-   }
-
-   if (this_parent != mnt) {
-   next = this_parent->mnt_child.next;
-   this_parent = this_parent->mnt_parent;
-   goto resume;
}
spin_unlock(&vfsmount_lock);
 
@@ -395,18 +377,18 @@ resume:
 
 EXPORT_SYMBOL(may_umount_tree);
 
-int mount_busy(struct vfsmount *mnt)
+int mount_busy(struct vfsmount *mnt, int refcnt)
 {
struct vfspnode *parent_pnode;
 
if (mnt == mnt->mnt_parent || !IS_MNT_SHARED(mnt->mnt_parent))
-   return do_refcount_check(mnt, 2);
+   return do_refcount_check(mnt, refcnt);
 
parent_pnode = mnt->mnt_parent->mnt_pnode;
BUG_ON(!parent_pnode);
return pnode_mount_busy(parent_pnode,
mnt->mnt_mountpoint,
-   mnt->mnt_root, mnt);
+   mnt->mnt_root, mnt, refcnt);
 }
 
 /**
@@ -424,9 +406,12 @@ int mount_busy(struct vfsmount *mnt)
  */
 int may_umount(struct vfsmount *mnt)
 {
-   if (mount_busy(mnt))
-   return -EBUSY;
-   return 0;
+   int ret=0;
+   spin_lock(&vfsmount_lock);
+   if (mount_busy(mnt, 2))
+   ret = -EBUSY;
+   spin_unlock(&vfsmount_lock);
+   return ret;
 }
 
 EXPORT_SYMBOL(may_umount);
@@ -445,7 +430,26 @@ void do_detach_mount(struct vfsmount *mn
spin_lock(&vfsmount_lock);
 }
 
-void __umount_tree(struct vfsmount *mnt, int propogate)
+void umount_mnt(struct vfsmount *mnt, int propogate)
+{
+   if (propogate && mnt->mnt_parent != mnt &&
+   IS_MNT_SHARED(mnt->mnt_parent)) {
+   struct vfspnode *parent_pnode
+   = mnt->mnt_parent->mnt_pnode;
+   BUG_ON(!parent_pnode);
+   pnode_umount(parent_pnode,
+   mnt->mnt_mountpoint,
+   mnt->mnt_root);
+   } else {
+   if (IS_MNT_SHARED(mnt) || IS_MNT_SLAVE(mnt)) {
+   BUG_ON(!mnt->mnt_pnode);
+   pnode_disassociate_mnt(mnt);
+   }
+   do_detach_mount(mnt);
+   }
+}
+
+static void __umount_tree(struct vfsmount *mnt, int propogate)
 {
struct vfsmount *p;
LIST_HEAD(kill);
@@ -459,21 +463,7 @@ void __umount_tree(struct vfsmount *mnt,
mnt = list_entry(kill.next, struct vfsmount, mnt_list);
list_del_init(&mnt->mnt_list);
list_del_init(&mnt->mnt_fslink);
-   if (propogate && mnt->mnt_parent != mnt &&
-   IS_MNT_SHARED(mnt->mnt_parent)) {
-   struct vfspnode *parent_pnode
-   = mnt->mnt_parent->mnt_pnode;
-   BUG_ON(!parent_pnode);
-   pnode_umount(parent_pnode,
-   mnt->mnt_mountpoint,
-   mnt->mnt_root);
-   } else

[no subject]

2005-07-25 Thread Ram Pai
, [EMAIL PROTECTED], Janak Desai [EMAIL PROTECTED], 
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH 4/7] shared subtree
Content-Type: text/x-patch; name=move.patch
Content-Disposition: inline; filename=move.patch

Adds ability to move a shared/private/slave/unclone tree to any other
shared/private/slave/unclone tree. Also incorporates the same behavior
for pivot_root()

RP


Signed by Ram Pai ([EMAIL PROTECTED])

 fs/namespace.c|  196 +++---
 include/linux/mount.h |2 
 2 files changed, 173 insertions(+), 25 deletions(-)

Index: 2.6.12.work2/fs/namespace.c
===
--- 2.6.12.work2.orig/fs/namespace.c
+++ 2.6.12.work2/fs/namespace.c
@@ -772,9 +772,12 @@ static void abort_attach_recursive_mnt(s
list_del_init(head);
 }
 
+
  /*
  *  @source_mnt : mount tree to be attached
  *  @nd: place the mount tree @source_mnt is attached
+ *  @move  : use the move semantics if set, else use normal attach semantics
+ *as explained below
  *
  *  NOTE: in the table below explains the semantics when a source vfsmount
  *  of a given type is attached to a destination vfsmount of a give type.
@@ -801,12 +804,41 @@ static void abort_attach_recursive_mnt(s
  *  |  |   ||  |   |
  *   
  *
- * (++)  the mount will be propogated to all the vfsmounts in the pnode tree
+ * (++)  the mount is propogated to all the vfsmounts in the pnode tree
  *   of the destination vfsmount, and all the non-slave new mounts in
  *   destination vfsmount will be added the source vfsmount's pnode.
- * (+)  the mount will be propogated to the destination vfsmount
+ * (+)  the mount is propogated to the destination vfsmount
  *   and the new mount will be added to the source vfsmount's pnode.
  *
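+ *  ---------------------------------------------------------------------
+ *  |                       MOVE MOUNT OPERATION                        |
+ *  |*******************************************************************|
+ *  |  dest -->  | shared        | private    | slave      | unclonable |
+ *  | source     |               |            |            |            |
+ *  |   |        |               |            |            |            |
+ *  |   v        |               |            |            |            |
+ *  |*******************************************************************|
+ *  |            |               |            |            |            |
+ *  |  shared    | shared (++)   | shared (+) | shared (+) | shared (+) |
+ *  |            |               |            |            |            |
+ *  |            |               |            |            |            |
+ *  |  private   | shared (+)    | private    | private    | private    |
+ *  |            |               |            |            |            |
+ *  |            |               |            |            |            |
+ *  |  slave     | shared (+++)  | slave      | slave      | slave      |
+ *  |            |               |            |            |            |
+ *  |            |               |            |            |            |
+ *  | unclonable | invalid       | unclonable | unclonable | unclonable |
+ *  |            |               |            |            |            |
+ *  |            |               |            |            |            |
+ *  ---------------------------------------------------------------------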
+ *
+ * (+++)  the mount is propogated to all the vfsmounts in the pnode tree
+ *   of the destination vfsmount, and all the new mounts is
+ *   added to a new pnode , which is a slave pnode of the
+ *   source vfsmount's pnode.
+ *
+ *
  * if the source mount is a tree, the operations explained above is
  * applied to each vfsmount in the tree.
  *
@@ -815,7 +847,7 @@ static void abort_attach_recursive_mnt(s
  *
   */
 static int attach_recursive_mnt(struct vfsmount *source_mnt,
-   struct nameidata *nd)
+   struct nameidata *nd, int move)
 {
struct vfsmount *mntpt_mnt, *last, *m, *p;
struct vfspnode *src_pnode, *dest_pnode, *tmp_pnode;
@@ -849,8 +881,8 @@ static int attach_recursive_mnt(struct v
list_add_tail(&mnt_list_head, &source_mnt->mnt_list);
 
for (m = source_mnt; m; m = next_mnt(m, source_mnt)) {
-
-   BUG_ON(IS_MNT_UNCLONE(m));
+   int unclone = IS_MNT_UNCLONE(m);
+   int slave = IS_MNT_SLAVE(m);
 
while (p && p != m->mnt_parent)
p = p->mnt_parent;
@@ -866,7 +898,7 @@ static int attach_recursive_mnt(struct v
 
dest_pnode = IS_MNT_SHARED(mntpt_mnt) ?
mntpt_mnt->mnt_pnode : NULL;
-   src_pnode = (IS_MNT_SHARED(m))?
+   src_pnode

[no subject]

2005-07-25 Thread Ram Pai
, [EMAIL PROTECTED], Janak Desai [EMAIL PROTECTED], 
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH 3/7] shared subtree
Content-Type: text/x-patch; name=rbind.patch
Content-Disposition: inline; filename=rbind.patch

Adds the ability to bind/rbind a shared/private/slave subtree and set up
propagation wherever needed.

RP

Signed by Ram Pai ([EMAIL PROTECTED])

 fs/namespace.c|  660 --
 fs/pnode.c|  235 
 include/linux/dcache.h|2 
 include/linux/fs.h|5 
 include/linux/namespace.h |1 
 5 files changed, 826 insertions(+), 77 deletions(-)

Index: 2.6.12.work2/fs/namespace.c
===
--- 2.6.12.work2.orig/fs/namespace.c
+++ 2.6.12.work2/fs/namespace.c
@@ -42,7 +42,8 @@ static inline int sysfs_init(void)
 
 static struct list_head *mount_hashtable;
 static int hash_mask, hash_bits;
-static kmem_cache_t *mnt_cache; 
+static kmem_cache_t *mnt_cache;
+static struct rw_semaphore namespace_sem;
 
 static inline unsigned long hash(struct vfsmount *mnt, struct dentry *dentry)
 {
@@ -54,7 +55,7 @@ static inline unsigned long hash(struct 
 
 struct vfsmount *alloc_vfsmnt(const char *name)
 {
-   struct vfsmount *mnt = kmem_cache_alloc(mnt_cache, GFP_KERNEL); 
+   struct vfsmount *mnt = kmem_cache_alloc(mnt_cache, GFP_KERNEL);
if (mnt) {
memset(mnt, 0, sizeof(struct vfsmount));
atomic_set(&mnt->mnt_count,1);
@@ -86,7 +87,8 @@ void free_vfsmnt(struct vfsmount *mnt)
  * Now, lookup_mnt increments the ref count before returning
  * the vfsmount struct.
  */
-struct vfsmount *lookup_mnt(struct vfsmount *mnt, struct dentry *dentry)
+struct vfsmount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry,
+   struct dentry *root)
 {
struct list_head * head = mount_hashtable + hash(mnt, dentry);
struct list_head * tmp = head;
@@ -99,7 +101,8 @@ struct vfsmount *lookup_mnt(struct vfsmo
if (tmp == head)
break;
p = list_entry(tmp, struct vfsmount, mnt_hash);
-   if (p->mnt_parent == mnt && p->mnt_mountpoint == dentry) {
+   if (p->mnt_parent == mnt && p->mnt_mountpoint == dentry &&
+   (root == NULL || p->mnt_root == root)) {
found = mntget(p);
break;
}
@@ -108,6 +111,37 @@ struct vfsmount *lookup_mnt(struct vfsmo
return found;
 }
 
+struct vfsmount *lookup_mnt(struct vfsmount *mnt, struct dentry *dentry)
+{
+   return __lookup_mnt(mnt, dentry, NULL);
+}
+
+static struct vfsmount *
+clone_mnt(struct vfsmount *old, struct dentry *root)
+{
+   struct super_block *sb = old->mnt_sb;
+   struct vfsmount *mnt = alloc_vfsmnt(old->mnt_devname);
+
+   if (mnt) {
+   mnt->mnt_flags = old->mnt_flags;
+   atomic_inc(&sb->s_active);
+   mnt->mnt_sb = sb;
+   mnt->mnt_root = dget(root);
+   mnt->mnt_mountpoint = mnt->mnt_root;
+   mnt->mnt_parent = mnt;
+   mnt->mnt_namespace = old->mnt_namespace;
+   mnt->mnt_pnode = get_pnode(old->mnt_pnode);
+
+   /* stick the duplicate mount on the same expiry list
+* as the original if that was on one */
+   spin_lock(&vfsmount_lock);
+   if (!list_empty(&old->mnt_fslink))
+   list_add(&mnt->mnt_fslink, &old->mnt_fslink);
+   spin_unlock(&vfsmount_lock);
+   }
+   return mnt;
+}
+
 static inline int check_mnt(struct vfsmount *mnt)
 {
return mnt->mnt_namespace == current->namespace;
@@ -128,11 +162,71 @@ static void attach_mnt(struct vfsmount *
 {
mnt->mnt_parent = mntget(nd->mnt);
mnt->mnt_mountpoint = dget(nd->dentry);
-   list_add(&mnt->mnt_hash, mount_hashtable+hash(nd->mnt, nd->dentry));
+   mnt->mnt_namespace = nd->mnt->mnt_namespace;
+   list_add_tail(&mnt->mnt_hash,
+   mount_hashtable+hash(nd->mnt, nd->dentry));
list_add_tail(&mnt->mnt_child, &nd->mnt->mnt_mounts);
nd->dentry->d_mounted++;
 }
 
+static void attach_prepare_mnt(struct vfsmount *mnt, struct nameidata *nd)
+{
+   mnt->mnt_parent = mntget(nd->mnt);
+   mnt->mnt_mountpoint = dget(nd->dentry);
+   nd->dentry->d_mounted++;
+}
+
+
+void do_attach_commit_mnt(struct vfsmount *mnt)
+{
+   struct vfsmount *parent = mnt->mnt_parent;
+   BUG_ON(parent==mnt);
+   if(list_empty(&mnt->mnt_hash))
+   list_add_tail(&mnt->mnt_hash,
+   mount_hashtable+hash(parent, mnt->mnt_mountpoint));
+   if(list_empty(&mnt->mnt_child))
+   list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
+   mnt->mnt_namespace = parent->mnt_namespace;
+   list_add_tail(&mnt->mnt_list, &mnt->mnt_namespace->list);
+}
+
+struct vfsmount *do_attach_prepare_mnt(struct

[no subject]

2005-07-25 Thread Ram Pai
, [EMAIL PROTECTED], Janak Desai [EMAIL PROTECTED], 
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH 6/7] shared subtree
Content-Type: text/x-patch; name=namespace.patch
Content-Disposition: inline; filename=namespace.patch

Adds ability to clone a namespace that has shared/private/slave/unclone
subtrees in it.

RP


Signed by Ram Pai ([EMAIL PROTECTED])

 fs/namespace.c |9 +
 1 files changed, 9 insertions(+)

Index: 2.6.12-rc6.work1/fs/namespace.c
===
--- 2.6.12-rc6.work1.orig/fs/namespace.c
+++ 2.6.12-rc6.work1/fs/namespace.c
@@ -1894,6 +1894,13 @@ int copy_namespace(int flags, struct tas
q = new_ns->root;
while (p) {
q->mnt_namespace = new_ns;
+
+   if (IS_MNT_SHARED(q))
+   pnode_add_member_mnt(q->mnt_pnode, q);
+   else if (IS_MNT_SLAVE(q))
+   pnode_add_slave_mnt(q->mnt_pnode, q);
+   put_pnode(q->mnt_pnode);
+
if (fs) {
if (p == fs->rootmnt) {
rootmnt = p;
@@ -2271,6 +2278,8 @@ void __put_namespace(struct namespace *n
spin_lock(&vfsmount_lock);
 
list_for_each_entry(mnt, &namespace->list, mnt_list) {
+   if (mnt->mnt_pnode)
+   pnode_disassociate_mnt(mnt);
mnt->mnt_namespace = NULL;
}
 


[no subject]

2005-07-25 Thread Ram Pai
, [EMAIL PROTECTED], Janak Desai [EMAIL PROTECTED], 
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH 1/7] shared subtree
Content-Type: text/x-patch; name=shared_private_slave.patch
Content-Disposition: inline; filename=shared_private_slave.patch

This patch adds the shared/private/slave support for VFS trees.

Signed by Ram Pai ([EMAIL PROTECTED])

 fs/Makefile   |2 
 fs/dcache.c   |2 
 fs/namespace.c|   93 ++
 fs/pnode.c|  441 ++
 include/linux/fs.h|5 
 include/linux/mount.h |   44 
 include/linux/pnode.h |   90 ++
 7 files changed, 673 insertions(+), 4 deletions(-)

Index: 2.6.12.work2/fs/namespace.c
===
--- 2.6.12.work2.orig/fs/namespace.c
+++ 2.6.12.work2/fs/namespace.c
@@ -22,6 +22,7 @@
 #include <linux/namei.h>
 #include <linux/security.h>
 #include <linux/mount.h>
+#include <linux/pnode.h>
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
 
@@ -62,6 +63,7 @@ struct vfsmount *alloc_vfsmnt(const char
INIT_LIST_HEAD(&mnt->mnt_mounts);
INIT_LIST_HEAD(&mnt->mnt_list);
INIT_LIST_HEAD(&mnt->mnt_fslink);
+   INIT_LIST_HEAD(&mnt->mnt_pnode_mntlist);
if (name) {
int size = strlen(name)+1;
char *newname = kmalloc(size, GFP_KERNEL);
@@ -615,6 +617,95 @@ out_unlock:
return err;
 }
 
+static int do_make_shared(struct vfsmount *mnt)
+{
+   int err=0;
+   struct vfspnode *old_pnode = NULL;
+   /*
+* if the mount is already a slave mount,
+* allocate a new pnode and make it
+* a slave pnode of the original pnode.
+*/
+   if (IS_MNT_SLAVE(mnt)) {
+   old_pnode = mnt->mnt_pnode;
+   pnode_del_slave_mnt(mnt);
+   }
+   if(!IS_MNT_SHARED(mnt)) {
+   mnt->mnt_pnode = pnode_alloc();
+   if(!mnt->mnt_pnode) {
+   pnode_add_slave_mnt(old_pnode, mnt);
+   err = -ENOMEM;
+   goto out;
+   }
+   pnode_add_member_mnt(mnt->mnt_pnode, mnt);
+   }
+   if(old_pnode)
+   pnode_add_slave_pnode(old_pnode, mnt->mnt_pnode);
+   set_mnt_shared(mnt);
+out:
+   return err;
+}
+
+static int do_make_slave(struct vfsmount *mnt)
+{
+   int err=0;
+
+   if (IS_MNT_SLAVE(mnt))
+   goto out;
+   /*
+* only shared mounts can
+* be made slave
+*/
+   if (!IS_MNT_SHARED(mnt)) {
+   err = -EINVAL;
+   goto out;
+   }
+   pnode_member_to_slave(mnt);
+out:
+   return err;
+}
+
+static int do_make_private(struct vfsmount *mnt)
+{
+   if(mnt->mnt_pnode)
+   pnode_disassociate_mnt(mnt);
+   set_mnt_private(mnt);
+   return 0;
+}
+
+/*
+ * recursively change the type of the mountpoint.
+ */
+static int do_change_type(struct nameidata *nd, int flag)
+{
+   struct vfsmount *m, *mnt = nd->mnt;
+   int err=0;
+
+   if (!(flag & MS_SHARED) && !(flag & MS_PRIVATE)
+   && !(flag & MS_SLAVE))
+   return -EINVAL;
+
+   if (nd->dentry != nd->mnt->mnt_root)
+   return -EINVAL;
+
+   spin_lock(&vfsmount_lock);
+   for (m = mnt; m; m = next_mnt(m, mnt)) {
+   switch (flag) {
+   case MS_SHARED:
+   err = do_make_shared(m);
+   break;
+   case MS_SLAVE:
+   err = do_make_slave(m);
+   break;
+   case MS_PRIVATE:
+   err = do_make_private(m);
+   break;
+   }
+   }
+   spin_unlock(&vfsmount_lock);
+   return err;
+}
+
 /*
  * do loopback mount.
  */
@@ -1049,6 +1140,8 @@ long do_mount(char * dev_name, char * di
data_page);
else if (flags & MS_BIND)
retval = do_loopback(&nd, dev_name, flags & MS_REC);
+   else if (flags & MS_SHARED || flags & MS_PRIVATE || flags & MS_SLAVE)
+   retval = do_change_type(&nd, flags);
else if (flags & MS_MOVE)
retval = do_move_mount(&nd, dev_name);
else
Index: 2.6.12.work2/fs/pnode.c
===
--- /dev/null
+++ 2.6.12.work2/fs/pnode.c
@@ -0,0 +1,441 @@
+/*
+ *  linux/fs/pnode.c
+ *
+ * (C) Copyright IBM Corporation 2005.
+ * Released under GPL v2.
+ * Author : Ram Pai ([EMAIL PROTECTED])
+ *
+ */
+
+#include <linux/config.h>
+#include <linux/syscalls.h>
+#include <linux/slab.h>
+#include <linux/sched.h>
+#include <linux/smp_lock.h>
+#include <linux/init.h>
+#include <linux/quotaops.h>
+#include <linux/acct.h>
+#include <linux/module.h>
+#include <linux/seq_file.h>
+#include <linux/namespace.h>
+#include <linux/namei.h>
Re: supposed to be shared subtree patches.

2005-07-25 Thread Ram Pai
On Mon, 2005-07-25 at 15:44, Ram Pai wrote:
 , [EMAIL PROTECTED], Janak Desai [EMAIL PROTECTED], 
 linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
 Subject: [PATCH 0/7] shared subtree
 
 Hi Andrew/Al Viro,
 
   Enclosing a final set of well tested patches that implement

My apologies. I screwed up sending the patches through quilt.

Anyway, I have received the following comments from Andrew Morton, which
I will incorporate before sending out saner-looking patches.
Sorry again,
RP

Andrew's comments follow:


Frankly, I don't even know what these patches _do_, and haven't spent
the time to try to find out.

If these patches are merged, how do we expect end-users to find out how
to use the new capabilities?

A few paragraphs in the patch #1 changelog would help.  A high-level
description of the new capability which explains what it does and why it
would be a useful thing for Linux.

And maybe some deeper information in a Documentation/ file.

Right now, there might well be a lot of people who could use these new
features, but they don't even know that these patches provide them! 
It's all a bit of a mystery, really.
-




[RFC-2 PATCH 6/8] shared subtree

2005-07-18 Thread Ram Pai
Adds ability to clone a namespace that has shared/private/slave/unclone
subtrees in it.

RP


Signed by Ram Pai ([EMAIL PROTECTED])

 fs/namespace.c |9 +
 1 files changed, 9 insertions(+)

Index: 2.6.12.work1/fs/namespace.c
===
--- 2.6.12.work1.orig/fs/namespace.c
+++ 2.6.12.work1/fs/namespace.c
@@ -1763,6 +1763,13 @@ int copy_namespace(int flags, struct tas
 	q = new_ns->root;
 	while (p) {
 		q->mnt_namespace = new_ns;
+
+		if (IS_MNT_SHARED(q))
+			pnode_add_member_mnt(q->mnt_pnode, q);
+		else if (IS_MNT_SLAVE(q))
+			pnode_add_slave_mnt(q->mnt_pnode, q);
+		put_pnode(q->mnt_pnode);
+
 		if (fs) {
 			if (p == fs->rootmnt) {
 				rootmnt = p;
@@ -2129,6 +2136,8 @@ void __put_namespace(struct namespace *n
 	spin_lock(&vfsmount_lock);
 
 	list_for_each_entry(mnt, &namespace->list, mnt_list) {
+		if (mnt->mnt_pnode)
+			pnode_disassociate_mnt(mnt);
 		mnt->mnt_namespace = NULL;
 	}
 


[RFC-2 PATCH 3/8] shared subtree

2005-07-18 Thread Ram Pai
Adds the ability to bind/rbind a shared/private/slave subtree and set up
propagation wherever needed.

RP

Signed by Ram Pai ([EMAIL PROTECTED])

 fs/namespace.c|  559 --
 fs/pnode.c|  416 +-
 include/linux/dcache.h|2 
 include/linux/fs.h|4 
 include/linux/namespace.h |1 
 include/linux/pnode.h |5 
 6 files changed, 906 insertions(+), 81 deletions(-)

Index: 2.6.12.work1/fs/namespace.c
===
--- 2.6.12.work1.orig/fs/namespace.c
+++ 2.6.12.work1/fs/namespace.c
@@ -42,7 +42,8 @@ static inline int sysfs_init(void)
 
 static struct list_head *mount_hashtable;
 static int hash_mask, hash_bits;
-static kmem_cache_t *mnt_cache; 
+static kmem_cache_t *mnt_cache;
+static struct rw_semaphore namespace_sem;
 
 static inline unsigned long hash(struct vfsmount *mnt, struct dentry *dentry)
 {
@@ -54,7 +55,7 @@ static inline unsigned long hash(struct 
 
 struct vfsmount *alloc_vfsmnt(const char *name)
 {
-	struct vfsmount *mnt = kmem_cache_alloc(mnt_cache, GFP_KERNEL); 
+	struct vfsmount *mnt = kmem_cache_alloc(mnt_cache, GFP_KERNEL);
 	if (mnt) {
 		memset(mnt, 0, sizeof(struct vfsmount));
 		atomic_set(&mnt->mnt_count,1);
@@ -86,7 +87,8 @@ void free_vfsmnt(struct vfsmount *mnt)
  * Now, lookup_mnt increments the ref count before returning
  * the vfsmount struct.
  */
-struct vfsmount *lookup_mnt(struct vfsmount *mnt, struct dentry *dentry)
+struct vfsmount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry,
+		struct dentry *root)
 {
 	struct list_head * head = mount_hashtable + hash(mnt, dentry);
 	struct list_head * tmp = head;
@@ -99,7 +101,8 @@ struct vfsmount *lookup_mnt(struct vfsmo
 		if (tmp == head)
 			break;
 		p = list_entry(tmp, struct vfsmount, mnt_hash);
-		if (p->mnt_parent == mnt && p->mnt_mountpoint == dentry) {
+		if (p->mnt_parent == mnt && p->mnt_mountpoint == dentry &&
+				(root == NULL || p->mnt_root == root)) {
 			found = mntget(p);
 			break;
 		}
@@ -108,6 +111,37 @@ struct vfsmount *lookup_mnt(struct vfsmo
 	return found;
 }
 
+struct vfsmount *lookup_mnt(struct vfsmount *mnt, struct dentry *dentry)
+{
+	return __lookup_mnt(mnt, dentry, NULL);
+}
+
+static struct vfsmount *
+clone_mnt(struct vfsmount *old, struct dentry *root)
+{
+	struct super_block *sb = old->mnt_sb;
+	struct vfsmount *mnt = alloc_vfsmnt(old->mnt_devname);
+
+	if (mnt) {
+		mnt->mnt_flags = old->mnt_flags;
+		atomic_inc(&sb->s_active);
+		mnt->mnt_sb = sb;
+		mnt->mnt_root = dget(root);
+		mnt->mnt_mountpoint = mnt->mnt_root;
+		mnt->mnt_parent = mnt;
+		mnt->mnt_namespace = old->mnt_namespace;
+		mnt->mnt_pnode = get_pnode(old->mnt_pnode);
+
+		/* stick the duplicate mount on the same expiry list
+		 * as the original if that was on one */
+		spin_lock(&vfsmount_lock);
+		if (!list_empty(&old->mnt_fslink))
+			list_add(&mnt->mnt_fslink, &old->mnt_fslink);
+		spin_unlock(&vfsmount_lock);
+	}
+	return mnt;
+}
+
 static inline int check_mnt(struct vfsmount *mnt)
 {
 	return mnt->mnt_namespace == current->namespace;
@@ -128,11 +162,70 @@ static void attach_mnt(struct vfsmount *
 {
 	mnt->mnt_parent = mntget(nd->mnt);
 	mnt->mnt_mountpoint = dget(nd->dentry);
+	mnt->mnt_namespace = nd->mnt->mnt_namespace;
 	list_add(&mnt->mnt_hash, mount_hashtable+hash(nd->mnt, nd->dentry));
 	list_add_tail(&mnt->mnt_child, &nd->mnt->mnt_mounts);
 	nd->dentry->d_mounted++;
 }
 
+static struct vfsmount *do_attach_mnt(struct vfsmount *mnt,
+		struct dentry *dentry,
+		struct vfsmount *child_mnt)
+{
+	struct nameidata nd;
+	LIST_HEAD(head);
+
+	nd.mnt = mnt;
+	nd.dentry = dentry;
+	attach_mnt(child_mnt, &nd);
+	list_add_tail(&head, &child_mnt->mnt_list);
+	list_splice(&head, child_mnt->mnt_namespace->list.prev);
+	return child_mnt;
+}
+
+static void attach_prepare_mnt(struct vfsmount *mnt, struct nameidata *nd)
+{
+	mnt-mnt_parent = mntget(nd-mnt);
+	mnt-mnt_mountpoint = dget(nd-dentry);
+	nd-dentry-d_mounted++;
+}
+
+void do_attach_real_mnt(struct vfsmount *mnt)
+{
+	struct vfsmount *parent = mnt->mnt_parent;
+	BUG_ON(parent==mnt);
+	if(list_empty(&mnt->mnt_hash))
+		list_add(&mnt->mnt_hash,
+			mount_hashtable+hash(parent, mnt->mnt_mountpoint));
+	if(list_empty(&mnt->mnt_child))
+		list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
+	mnt->mnt_namespace = parent->mnt_namespace;
+	list_add_tail(&mnt->mnt_list, &mnt->mnt_namespace->list);
+}
+
+struct vfsmount *do_attach_prepare_mnt(struct vfsmount *mnt,
+		struct dentry *dentry,
+		struct vfsmount *template_mnt,
+		int clone_flag)
+{
+	struct vfsmount *child_mnt;
+	struct nameidata nd;
+
+	if (clone_flag) {
+		if(!(child_mnt = clone_mnt(template_mnt,
+				template_mnt->mnt_root)))
+			return NULL;
+	} else
+		child_mnt = template_mnt;
+
+	nd.mnt = mnt;
+	nd.dentry = dentry;
+
+	attach_prepare_mnt(child_mnt, &nd);
+
+	return child_mnt;
+}
+
 static struct vfsmount *next_mnt(struct vfsmount *p, struct vfsmount *root)
 {
 	struct list_head *next = p->mnt_mounts.next

[RFC-2 PATCH 4/8] shared subtree

2005-07-18 Thread Ram Pai
Adds ability to move a shared/private/slave/unclone tree to any other
shared/private/slave/unclone tree. Also incorporates the same behavior
for pivot_root()
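
For readers following along, the resulting mount type in the MOVE MOUNT
OPERATION table from the patch comment can be sketched as a small lookup.
This is only an illustration under stated assumptions: the enum, the table
and the function name below are hypothetical and do not appear in the patch,
which operates on vfsmounts and pnodes through IS_MNT_SHARED(),
IS_MNT_SLAVE() and IS_MNT_UNCLONE(). It models only the resulting type of the
moved mount, not the propagation marked (+), (++) and (+++) in the table.

```c
#include <assert.h>

/* Hypothetical names, for illustration only; not part of the patch. */
enum mnt_type {
	MNT_T_SHARED,
	MNT_T_PRIVATE,
	MNT_T_SLAVE,
	MNT_T_UNCLONABLE,
	MNT_T_MAX
};

/*
 * Rows are the source mount type, columns the destination mount type,
 * in the order shared, private, slave, unclonable (as in the table).
 */
static const enum mnt_type move_table[MNT_T_MAX][MNT_T_MAX] = {
	[MNT_T_SHARED] = {
		MNT_T_SHARED, MNT_T_SHARED, MNT_T_SHARED, MNT_T_SHARED
	},
	[MNT_T_PRIVATE] = {
		MNT_T_SHARED, MNT_T_PRIVATE, MNT_T_PRIVATE, MNT_T_PRIVATE
	},
	[MNT_T_SLAVE] = {
		MNT_T_SHARED, MNT_T_SLAVE, MNT_T_SLAVE, MNT_T_SLAVE
	},
	[MNT_T_UNCLONABLE] = {
		MNT_T_UNCLONABLE, MNT_T_UNCLONABLE,
		MNT_T_UNCLONABLE, MNT_T_UNCLONABLE
	},
};

/* Return the type the moved mount ends up with, per the table. */
static enum mnt_type move_result_type(enum mnt_type src, enum mnt_type dest)
{
	assert(src < MNT_T_MAX && dest < MNT_T_MAX);
	return move_table[src][dest];
}
```

Read this way, the table says that a shared source stays shared and an
unclonable source stays unclonable wherever it lands, while a private or
slave source keeps its type unless the destination is shared.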

RP


Signed by Ram Pai ([EMAIL PROTECTED])

 fs/namespace.c |  150 +++--
 1 files changed, 125 insertions(+), 25 deletions(-)

Index: 2.6.12.work1/fs/namespace.c
===
--- 2.6.12.work1.orig/fs/namespace.c
+++ 2.6.12.work1/fs/namespace.c
@@ -664,9 +664,12 @@ static struct vfsmount *copy_tree(struct
 	return NULL;
 }
 
+
  /*
  *  @source_mnt : mount tree to be attached
  *  @nd		: place the mount tree @source_mnt is attached
+ *  @move	: use the move semantics if set, else use normal attach semantics
+ *		  as explained below
  *
  *  NOTE: in the table below explains the semantics when a source vfsmount
  *  of a given type is attached to a destination vfsmount of a give type.
@@ -699,16 +702,44 @@ static struct vfsmount *copy_tree(struct
  * (+)  the mount will be propogated to the destination vfsmount
  *	  and the new mount will be added to the source vfsmount's pnode.
  *
+ *
+ *  ---------------------------------------------------------------------
+ *  |                       MOVE MOUNT OPERATION                        |
+ *  |*******************************************************************|
+ *  |  dest --> |   shared     |   private   |    slave    | unclonable |
+ *  | source    |              |             |             |            |
+ *  |   |       |              |             |             |            |
+ *  |   v       |              |             |             |            |
+ *  |*******************************************************************|
+ *  |           |              |             |             |            |
+ *  |  shared   | shared (++)  | shared (+)  | shared (+)  | shared (+) |
+ *  |           |              |             |             |            |
+ *  |           |              |             |             |            |
+ *  | private   | shared (+)   | private     | private     | private    |
+ *  |           |              |             |             |            |
+ *  |           |              |             |             |            |
+ *  | slave     | shared (+++) | slave       | slave       | slave      |
+ *  |           |              |             |             |            |
+ *  |           |              |             |             |            |
+ *  | unclonable| unclonable   | unclonable  | unclonable  | unclonable |
+ *  |           |              |             |             |            |
+ *  |           |              |             |             |            |
+ *  ---------------------------------------------------------------------
+ *
+ * (+++)  the mount will be propagated to all the vfsmounts in the pnode tree
+ *	  of the destination vfsmount, and all the new mounts will be
+ *	  added to a new pnode, which will be a slave pnode of the
+ *	  source vfsmount's pnode.
+ *
  * if the source mount is a tree, the operations explained above is
- * applied to each
- * vfsmount in the tree.
+ * applied to each vfsmount in the tree.
  *
  * Should be called without spinlocks held, because this function can sleep
  * in allocations.
  *
   */
 static int attach_recursive_mnt(struct vfsmount *source_mnt,
-		struct nameidata *nd)
+		struct nameidata *nd, int move)
 {
 	struct vfsmount *mntpt_mnt, *m, *p;
 	struct vfspnode *src_pnode, *t_p, *dest_pnode, *tmp_pnode;
@@ -718,7 +749,9 @@ static int attach_recursive_mnt(struct v
 
 	mntpt_mnt = nd->mnt;
 	dest_pnode = IS_MNT_SHARED(mntpt_mnt) ? mntpt_mnt->mnt_pnode : NULL;
-	src_pnode = IS_MNT_SHARED(source_mnt) ? source_mnt->mnt_pnode : NULL;
+	src_pnode = IS_MNT_SHARED(source_mnt) ||
+		(move && IS_MNT_SLAVE(source_mnt)) ?
+		source_mnt->mnt_pnode : NULL;
 
 	if (!dest_pnode && !src_pnode) {
 		LIST_HEAD(head);
@@ -739,6 +772,7 @@ static int attach_recursive_mnt(struct v
 	p = NULL;
 	for (m = source_mnt; m; m = next_mnt(m, source_mnt)) {
 		int unclone = IS_MNT_UNCLONE(m);
+		int slave = IS_MNT_SLAVE(m);
 
 		list_del_init(&m->mnt_list);
 
@@ -756,7 +790,7 @@ static int attach_recursive_mnt(struct v
 		p=m;
 		dest_pnode = IS_MNT_SHARED(mntpt_mnt) ?
 			mntpt_mnt->mnt_pnode : NULL;
-		src_pnode = (IS_MNT_SHARED(m))?
+		src_pnode = (IS_MNT_SHARED(m) || (move && slave))?
 				m->mnt_pnode : NULL;
 
 		m->mnt_pnode = NULL;
@@ -772,19 +806,35 @@ static int attach_recursive_mnt(struct v
 			if ((ret = pnode_prepare_mount(dest_pnode, tmp_pnode,
 					mntpt_dentry, m, mntpt_mnt)))
 				return ret;
+			if (move && dest_pnode && slave)
+				SET_PNODE_SLAVE(tmp_pnode);
 		} else {
 			if (m == m->mnt_parent)
 				do_attach_prepare_mnt(mntpt_mnt,
 					mntpt_dentry, m, 0);
-			pnode_add_member_mnt(tmp_pnode, m);
-			if (unclone) {
-				set_mnt_unclone(m);
-				m->mnt_pnode = tmp_pnode;
-				SET_PNODE_DELETE(tmp_pnode);
-			} else if (!src_pnode) {
-				set_mnt_private(m);
-				m->mnt_pnode = tmp_pnode;
-				SET_PNODE_DELETE(tmp_pnode);
+			if (move && slave)
+				pnode_add_slave_mnt(tmp_pnode, m);
+			else {
+				pnode_add_member_mnt(tmp_pnode, m);
+				if (unclone) {
+					BUG_ON(!move);
+					set_mnt_unclone(m);
+					m->mnt_pnode = tmp_pnode;
+					SET_PNODE_DELETE(tmp_pnode);
+				} else if (!src_pnode) {
+					set_mnt_private(m);
+					m->mnt_pnode = tmp_pnode;
+					SET_PNODE_DELETE(tmp_pnode