Re: [AppArmor 39/41] AppArmor: Profile loading and manipulation, pathname matching

2007-04-16 Thread Andi Kleen
 It's nice to check for consistency though, so we're adding that. Profile 
 loading is a trusted operation, at least so far, and so security wise we 
 don't actually have to care --- if loading an invalid profile can bring down 
 the system, then that's no worse than an arbitrary module that crashes the 
 machine. Not sure if there will ever be user loadable profiles; at least at 
 that point we had to care.

A security system that allows to crash the kernel is a little weird 
though. It would be better to check. Not that a recursion check
is particularly expensive.

-Andi
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-04-16 Thread Timothy Shimmin

Hi Andreas,

--On 12 April 2007 5:05:50 AM -0600 Andreas Dilger [EMAIL PROTECTED] wrote:


I'm interested in getting input for implementing an ioctl to efficiently
map file extents  holes (FIEMAP) instead of looping over FIBMAP a billion
times.

...


I had come up with a plan independently and was also steered toward
XFS_IOC_GETBMAP* ioctls which are in fact very similar to my original
plan, though I think the XFS structs used there are a bit bloated.


They certainly seem to be (combining entries and header).



struct fibmap_extent {
__u64 fe_start; /* starting offset in bytes */
__u64 fe_len;   /* length in bytes */
}

struct fibmap {
struct fibmap_extent fm_start;  /* offset, length of desired mapping */
__u32 fm_extent_count;  /* number of extents in array */
__u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */
__u64 unused;
struct fibmap_extent fm_extents[0];
}

# define FIEMAP_LEN_MASK0xff
# define FIEMAP_LEN_HOLE0x01
# define FIEMAP_LEN_UNWRITTEN   0x02

All offsets are in bytes to allow cases where filesystems are not going
block-aligned/sized allocations (e.g. tail packing).  The fm_extents array
returned contains the packed list of allocation extents for the file,
including entries for holes (which have fe_start == 0, and a flag).

The -fm_extents[] array includes all of the holes in addition to
allocated extents because this avoids the need to return both the logical
and physical address for every extent and does not make processing any
harder.


Well, that's what stood out for me. I was wondering where the fe_block field
had gone - the physical address.
So is your fe_start; /* starting offset */ actually the disk location
(not a logical file offset)
_except_ in the header (fibmap) where it is the desired logical offset.
Okay, looking at your example use below that's what it looks like.
And when you refer to fm_start below, you mean fm_start.fe_start?
Sorry, I realise this is just an approximation but this part confused me.
So you get rid of all the logical file offsets in the extents because we
report holes explicitly (and we know everything is contiguous if you
include the holes).

--Tim



Caller works something like:

char buf[4096];
struct fibmap *fm = (struct fibmap *)buf;
int count = (sizeof(buf) - sizeof(*fm)) / sizeof(fm_extent);

fm-fm_extent.fe_start = 0; /* start of file */
fm-fm_extent.fe_len = -1;   /* end of file */
fm-fm_extent_count = count; /* max extents in fm_extents[] array */
fm-fm_flags = 0;/* maybe no DMAPI, etc like XFS */

fd = open(path, O_RDONLY);
printf(logical\t\tphysical\t\tbytes\n);

/* The last entry will have less extents than the maximum */
while (fm-fm_extent_count == count) {
rc = ioctl(fd, FIEMAP, fm);
if (rc)
break;

/* kernel filled in fm_extents[] array, set fm_extent_count
 * to be actual number of extents returned, leaves fm_start
 * alone (unlike XFS_IOC_GETBMAP). */

for (i = 0; i  fm-fm_extent_count; i++) {
__u64 len = fm-fm_extents[i].fe_len  FIEMAP_LEN_MASK;
__u64 fm_next = fm-fm_start + len;
int hole = fm-fm_extents[i].fe_len  FIEMAP_LEN_HOLE;
int unwr = fm-fm_extents[i].fe_len  
FIEMAP_LEN_UNWRITTEN;

printf(%llu-%llu\t%llu-%llu\t%llu\t%s%s\n,
fm-fm_start, fm_next - 1,
hole ? 0 : fm-fm_extents[i].fe_start,
hole ? 0 : fm-fm_extents[i].fe_start +
   fm-fm_extents[i].fe_len - 1,
len, hole ? (hole)  : ,
unwr ? (unwritten)  : );

/* get ready for printing next extent, or next ioctl */
fm-fm_start = fm_next;
}
}



-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 0/8] unprivileged mount syscall

2007-04-16 Thread Ram Pai
On Fri, 2007-04-13 at 13:58 +0200, Miklos Szeredi wrote:
  On Wed, 2007-04-11 at 12:44 +0200, Miklos Szeredi wrote:
1. clone the master namespace.

2. in the new namespace

move the tree under /share/$me to /
for each ($user, $what, $how) {
move /share/$user/$what to /$what
if ($how == slave) {
 make the mount tree under /$what as slave
}
}

3. in the new namespace make the tree under 
   /share as private and unmount /share
   
   Thanks.  I get the basic idea now: the namespace itself need not be
   shared between the sessions, it is enough if share propagation is
   set up between the different namespaces of a user.
   
   I don't yet see either in your or Viro's description how the trees
   under /share/$USER are initialized.  I guess they are recursively
   bound from /, and are made slaves.
  
  yes. I suppose, when a userid is created one of the steps would be
  
  mount --rbind / /share/$USER
  mount --make-rslave /share/$USER
  mount --make-rshared /share/$USER
 
 Thinking a bit more about this, I'm quite sure most users wouldn't
 even want private namespaces.  It would be enough to
 
   chroot /share/$USER
 
 and be done with it.
 
 Private namespaces are only good for keeping a bunch of mounts
 referenced by a group of processes.  But my guess is, that the natural
 behavior for users is to see a persistent set of mounts.
 
 If for example they mount something on a remote machine, then log out
 from the ssh session and later log back in, they would want to see
 their previous mount still there.

They will continue see their previous mount tree. 
Even if all the namespaces belonging to the different sessions of the
user get dismantled when all the sessions exit, the a mirror of those 
mount trees continue to exist under /share/$USER in the original
namespace.  So I don't think we have a issue.

NOTE: when I say 'original namespace' I mean the admin namespace; the
first namespace that gets created when the machine boots.

RP


 
 Miklos

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 0/8] unprivileged mount syscall

2007-04-16 Thread Ram Pai
On Fri, 2007-04-13 at 16:05 +0200, Miklos Szeredi wrote:
   Thinking a bit more about this, I'm quite sure most users wouldn't
   even want private namespaces.  It would be enough to
   
 chroot /share/$USER
   
   and be done with it.
   
   Private namespaces are only good for keeping a bunch of mounts
   referenced by a group of processes.  But my guess is, that the natural
   behavior for users is to see a persistent set of mounts.
   
   If for example they mount something on a remote machine, then log out
   from the ssh session and later log back in, they would want to see
   their previous mount still there.
   
   Miklos
  
  Agreed on desired behavior, but not on chroot sufficing.  It actually
  sounds like you want exactly what was outlined in the OLS paper.
  
  Users still need to be in a different mounts namespace from the admin
  user so long as we consider the deluser and backup problems
 
 I don't think it matters, because /share/$USER duplicates a part or
 the whole of the user's namespace.
 
 So backup would have to be taught about /share anyway, and deluser
 operates on /home/$USER and not on /share/*, so there shouldn't be any
 problem.
 
 There's actually very little difference between rbind+chroot, and
 CLONE_NEWNS.  In a private namespace:
 
   1) when no more processes reference the namespace, the tree will be
 disbanded
 
   2) the mount tree won't be accessible from outside the namespace
 
 Wanting a persistent namespace contradicts 1).
 
 Wanting a per-user (as opposed to per-session) namespace contradicts
 2).  The namespace _has_ to be accessible from outside, so that a new
 session can access/copy it.

As i mentioned in the previous mail, disbanding all the namespaces of a
user will not disband his mount tree, because a mirror of the mount tree
still continues to exist in /share/$USER in the admin namespace.

And a new user session can always use this copy to create a namespace
that  looks identical to that which existed earlier.


 
 So both requirements point to the rbind/chroot solution.

Arn't there ways to escape chroot jails? Serge had pointed me to a URL
which showed chroots can be escaped. And if that is true than having all
user's private mount tree in the same namespace can be a security issue?

RP

 
 Miklos

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag

2007-04-16 Thread Ram Pai

 
 Serge E. Hallyn [EMAIL PROTECTED] writes:
 
  Quoting Miklos Szeredi ([EMAIL PROTECTED]):
  From: Miklos Szeredi [EMAIL PROTECTED]
  
  If CLONE_NEWNS and CLONE_NEWNS_USERMNT are given to clone(2) or
  unshare(2), then allow user mounts within the new namespace.
  
  This is not flexible enough, because user mounts can't be enabled
 for
  the initial namespace.
  
  The remaining clone bits also getting dangerously few...
  
  Alternatives are:
  
- prctl() flag
- setting through the containers filesystem
 
  Sorry, I know I had mentioned it, but this is definately my least
  favorite approach.
 
  Curious whether are any other suggestions/opinions from the
 containers
  list?
 
 Given the existence of shared subtrees allowing/denying this at the
 mount
 namespace level is silly and wrong.
 
 If we need more than just the filesystem permission checks can we
 make it a mount flag settable with mount and remount that allows
 non-privileged users the ability to create mount points under it
 in directories they have full read/write access to.

Also for bind-mount and remount operations the flag has to be propagated
down its propagation tree.  Otherwise a unpriviledged mount in a shared
mount wont get reflected in its peers and slaves, leading to unidentical
shared-subtrees.

RP


 
 I don't like the use of clone flags for this purpose but in this
 case the shared subtress are a much more fundamental reasons for not
 doing this at the namespace level.
 
 Eric
 ___
 Containers mailing list
 [EMAIL PROTECTED]
 https://lists.linux-foundation.org/mailman/listinfo/containers 

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 0/8] unprivileged mount syscall

2007-04-16 Thread Miklos Szeredi
 Arn't there ways to escape chroot jails? Serge had pointed me to a URL
 which showed chroots can be escaped. And if that is true than having all
 user's private mount tree in the same namespace can be a security issue?

No.  In fact chrooting the user into /share/$USER will actually
_grant_ a privilege to the user, instead of taking it away.  It allows
the user to modify it's root namespace, which it wouldn't be able to
in the initial namespace.

So even if the user could escape from the chroot (which I doubt), s/he
would not be able to do any harm, since unprivileged mounting would be
restricted to /share.  Also /share/$USER should only have read/search
permission for $USER or no permissions at all, which would mean, that
other users' namespaces would be safe from tampering as well.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag

2007-04-16 Thread Ram Pai
On Mon, 2007-04-16 at 11:32 +0200, Miklos Szeredi wrote:
   Given the existence of shared subtrees allowing/denying this at the
   mount
   namespace level is silly and wrong.
   
   If we need more than just the filesystem permission checks can we
   make it a mount flag settable with mount and remount that allows
   non-privileged users the ability to create mount points under it
   in directories they have full read/write access to.
  
  Also for bind-mount and remount operations the flag has to be propagated
  down its propagation tree.  Otherwise a unpriviledged mount in a shared
  mount wont get reflected in its peers and slaves, leading to unidentical
  shared-subtrees.
 
 That's an interesting question.  Do we want shared mounts to be
 totally identical, including mnt_flags?  It doesn't look as if
 do_remount() guarantees that currently.

Depends on the semantics of each of the flags. Some flags like of the
read/write flag, would not interfere with the propagation semantics
AFAICT.  But this one certainly seems to interfere.

RP

 Miklos

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[patch 01/10] add user mounts to the kernel

2007-04-16 Thread Miklos Szeredi
From: Miklos Szeredi [EMAIL PROTECTED]

Add ownership information to mounts.

A new mount flag, MS_SETUSER is used to make a mount owned by a user.
If this flag is specified, then the owner will be set to the current
real user id and the mount will be marked with the MNT_USER flag.  On
remount don't preserve previous owner, and treat MS_SETUSER as for a
new mount.  The MS_SETUSER flag is ignored on mount move.

The MNT_USER flag is not copied on any kind of mount cloning:
namespace creation, binding or propagation.  For bind mounts the
cloned mount(s) are set to MNT_USER depending on the MS_SETUSER mount
flag.  In all the other cases MNT_USER is always cleared.

For MNT_USER mounts a user=UID option is added to /proc/PID/mounts.
This is compatible with how mount ownership is stored in /etc/mtab.

It is expected, that in the future mount(8) will use MS_SETUSER to
store mount ownership within the kernel.  This would help in
situations, where /etc/mtab is difficult or impossible to work with,
e.g. when using mount propagation.

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---

Index: linux/fs/namespace.c
===
--- linux.orig/fs/namespace.c   2007-04-13 12:24:15.0 +0200
+++ linux/fs/namespace.c2007-04-13 13:03:44.0 +0200
@@ -227,6 +227,13 @@ static struct vfsmount *skip_mnt_tree(st
return p;
 }
 
+static void set_mnt_user(struct vfsmount *mnt)
+{
+   BUG_ON(mnt-mnt_flags  MNT_USER);
+   mnt-mnt_uid = current-uid;
+   mnt-mnt_flags |= MNT_USER;
+}
+
 static struct vfsmount *clone_mnt(struct vfsmount *old, struct dentry *root,
int flag)
 {
@@ -241,6 +248,11 @@ static struct vfsmount *clone_mnt(struct
mnt-mnt_mountpoint = mnt-mnt_root;
mnt-mnt_parent = mnt;
 
+   /* don't copy the MNT_USER flag */
+   mnt-mnt_flags = ~MNT_USER;
+   if (flag  CL_SETUSER)
+   set_mnt_user(mnt);
+
if (flag  CL_SLAVE) {
list_add(mnt-mnt_slave, old-mnt_slave_list);
mnt-mnt_master = old;
@@ -403,6 +415,8 @@ static int show_vfsmnt(struct seq_file *
if (mnt-mnt_flags  fs_infop-flag)
seq_puts(m, fs_infop-str);
}
+   if (mnt-mnt_flags  MNT_USER)
+   seq_printf(m, ,user=%i, mnt-mnt_uid);
if (mnt-mnt_sb-s_op-show_options)
err = mnt-mnt_sb-s_op-show_options(m, mnt);
seq_puts(m,  0 0\n);
@@ -920,8 +934,9 @@ static int do_change_type(struct nameida
 /*
  * do loopback mount.
  */
-static int do_loopback(struct nameidata *nd, char *old_name, int recurse)
+static int do_loopback(struct nameidata *nd, char *old_name, int flags)
 {
+   int clone_flags;
struct nameidata old_nd;
struct vfsmount *mnt = NULL;
int err = mount_is_safe(nd);
@@ -941,11 +956,12 @@ static int do_loopback(struct nameidata 
if (!check_mnt(nd-mnt) || !check_mnt(old_nd.mnt))
goto out;
 
+   clone_flags = (flags  MS_SETUSER) ? CL_SETUSER : 0;
err = -ENOMEM;
-   if (recurse)
-   mnt = copy_tree(old_nd.mnt, old_nd.dentry, 0);
+   if (flags  MS_REC)
+   mnt = copy_tree(old_nd.mnt, old_nd.dentry, clone_flags);
else
-   mnt = clone_mnt(old_nd.mnt, old_nd.dentry, 0);
+   mnt = clone_mnt(old_nd.mnt, old_nd.dentry, clone_flags);
 
if (!mnt)
goto out;
@@ -987,8 +1003,11 @@ static int do_remount(struct nameidata *
 
down_write(sb-s_umount);
err = do_remount_sb(sb, flags, data, 0);
-   if (!err)
+   if (!err) {
nd-mnt-mnt_flags = mnt_flags;
+   if (flags  MS_SETUSER)
+   set_mnt_user(nd-mnt);
+   }
up_write(sb-s_umount);
if (!err)
security_sb_post_remount(nd-mnt, flags, data);
@@ -1093,10 +1112,13 @@ static int do_new_mount(struct nameidata
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
 
-   mnt = do_kern_mount(type, flags, name, data);
+   mnt = do_kern_mount(type, flags  ~MS_SETUSER, name, data);
if (IS_ERR(mnt))
return PTR_ERR(mnt);
 
+   if (flags  MS_SETUSER)
+   set_mnt_user(mnt);
+
return do_add_mount(mnt, nd, mnt_flags, NULL);
 }
 
@@ -1127,7 +1149,8 @@ int do_add_mount(struct vfsmount *newmnt
if (S_ISLNK(newmnt-mnt_root-d_inode-i_mode))
goto unlock;
 
-   newmnt-mnt_flags = mnt_flags;
+   /* MNT_USER was set earlier */
+   newmnt-mnt_flags |= mnt_flags;
if ((err = graft_tree(newmnt, nd)))
goto unlock;
 
@@ -1447,7 +1470,7 @@ long do_mount(char *dev_name, char *dir_
retval = do_remount(nd, flags  ~MS_REMOUNT, mnt_flags,
data_page);
else if 

[patch 02/10] allow unprivileged umount

2007-04-16 Thread Miklos Szeredi
From: Miklos Szeredi [EMAIL PROTECTED]

The owner doesn't need sysadmin capabilities to call umount().

Similar behavior as umount(8) on mounts having user=UID option in
/etc/mtab.  The difference is that umount also checks /etc/fstab,
presumably to exclude another mount on the same mountpoint.

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---

Index: linux/fs/namespace.c
===
--- linux.orig/fs/namespace.c   2007-04-11 20:07:51.0 +0200
+++ linux/fs/namespace.c2007-04-11 20:08:05.0 +0200
@@ -659,6 +659,25 @@ static int do_umount(struct vfsmount *mn
 }
 
 /*
+ * umount is permitted for
+ *  - sysadmin
+ *  - mount owner, if not forced umount
+ */
+static bool permit_umount(struct vfsmount *mnt, int flags)
+{
+   if (capable(CAP_SYS_ADMIN))
+   return true;
+
+   if (!(mnt-mnt_flags  MNT_USER))
+   return false;
+
+   if (flags  MNT_FORCE)
+   return false;
+
+   return mnt-mnt_uid == current-uid;
+}
+
+/*
  * Now umount can handle mount points as well as block devices.
  * This is important for filesystems which use unnamed block devices.
  *
@@ -681,7 +700,7 @@ asmlinkage long sys_umount(char __user *
goto dput_and_out;
 
retval = -EPERM;
-   if (!capable(CAP_SYS_ADMIN))
+   if (!permit_umount(nd.mnt, flags))
goto dput_and_out;
 
retval = do_umount(nd.mnt, flags);

--
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[patch 03/10] account user mounts

2007-04-16 Thread Miklos Szeredi
From: Miklos Szeredi [EMAIL PROTECTED]

Add sysctl variables for accounting and limiting the number of user
mounts.

The maximum number of user mounts is set to 1024 by default.  This
won't in itself enable user mounts, setting the permit user submount
mount flag will also be needed.

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---

Index: linux/include/linux/sysctl.h
===
--- linux.orig/include/linux/sysctl.h   2007-04-11 18:27:46.0 +0200
+++ linux/include/linux/sysctl.h2007-04-11 20:08:16.0 +0200
@@ -818,6 +818,8 @@ enum
FS_AIO_NR=18,   /* current system-wide number of aio requests */
FS_AIO_MAX_NR=19,   /* system-wide maximum number of aio requests */
FS_INOTIFY=20,  /* inotify submenu */
+   FS_NR_USER_MOUNTS=21,   /* int:current number of user mounts */
+   FS_MAX_USER_MOUNTS=22,  /* int:maximum number of user mounts */
FS_OCFS2=988,   /* ocfs2 */
 };
 
Index: linux/kernel/sysctl.c
===
--- linux.orig/kernel/sysctl.c  2007-04-11 18:27:46.0 +0200
+++ linux/kernel/sysctl.c   2007-04-11 20:08:16.0 +0200
@@ -1063,6 +1063,22 @@ static ctl_table fs_table[] = {
 #endif 
 #endif
{
+   .ctl_name   = FS_NR_USER_MOUNTS,
+   .procname   = nr_user_mounts,
+   .data   = nr_user_mounts,
+   .maxlen = sizeof(int),
+   .mode   = 0444,
+   .proc_handler   = proc_dointvec,
+   },
+   {
+   .ctl_name   = FS_MAX_USER_MOUNTS,
+   .procname   = max_user_mounts,
+   .data   = max_user_mounts,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   },
+   {
.ctl_name   = KERN_SETUID_DUMPABLE,
.procname   = suid_dumpable,
.data   = suid_dumpable,
Index: linux/Documentation/filesystems/proc.txt
===
--- linux.orig/Documentation/filesystems/proc.txt   2007-04-11 
18:27:44.0 +0200
+++ linux/Documentation/filesystems/proc.txt2007-04-12 13:32:14.0 
+0200
@@ -923,6 +923,15 @@ reaches aio-max-nr then io_setup will fa
 raising aio-max-nr does not result in the pre-allocation or re-sizing
 of any kernel data structures.
 
+nr_user_mounts and max_user_mounts
+--
+
+These represent the number of user mounts and the maximum number of
+user mounts respectively.  User mounts may be created by
+unprivileged users.  User mounts may also be created with sysadmin
+privileges on behalf of a user, in which case nr_user_mounts may
+exceed max_user_mounts.
+
 2.2 /proc/sys/fs/binfmt_misc - Miscellaneous binary formats
 ---
 
Index: linux/fs/namespace.c
===
--- linux.orig/fs/namespace.c   2007-04-11 20:08:05.0 +0200
+++ linux/fs/namespace.c2007-04-12 13:29:05.0 +0200
@@ -39,6 +39,9 @@ static int hash_mask __read_mostly, hash
 static struct kmem_cache *mnt_cache __read_mostly;
 static struct rw_semaphore namespace_sem;
 
+int nr_user_mounts;
+int max_user_mounts = 1024;
+
 /* /sys/fs */
 decl_subsys(fs, NULL, NULL);
 EXPORT_SYMBOL_GPL(fs_subsys);
@@ -227,11 +230,30 @@ static struct vfsmount *skip_mnt_tree(st
return p;
 }
 
+static void dec_nr_user_mounts(void)
+{
+   spin_lock(vfsmount_lock);
+   nr_user_mounts--;
+   spin_unlock(vfsmount_lock);
+}
+
 static void set_mnt_user(struct vfsmount *mnt)
 {
BUG_ON(mnt-mnt_flags  MNT_USER);
mnt-mnt_uid = current-uid;
mnt-mnt_flags |= MNT_USER;
+   spin_lock(vfsmount_lock);
+   nr_user_mounts++;
+   spin_unlock(vfsmount_lock);
+}
+
+static void clear_mnt_user(struct vfsmount *mnt)
+{
+   if (mnt-mnt_flags  MNT_USER) {
+   mnt-mnt_uid = 0;
+   mnt-mnt_flags = ~MNT_USER;
+   dec_nr_user_mounts();
+   }
 }
 
 static struct vfsmount *clone_mnt(struct vfsmount *old, struct dentry *root,
@@ -283,6 +305,7 @@ static inline void __mntput(struct vfsmo
 {
struct super_block *sb = mnt-mnt_sb;
dput(mnt-mnt_root);
+   clear_mnt_user(mnt);
free_vfsmnt(mnt);
deactivate_super(sb);
 }
@@ -1023,6 +1046,7 @@ static int do_remount(struct nameidata *
down_write(sb-s_umount);
err = do_remount_sb(sb, flags, data, 0);
if (!err) {
+   clear_mnt_user(nd-mnt);
nd-mnt-mnt_flags = mnt_flags;
if (flags  MS_SETUSER)
set_mnt_user(nd-mnt);
Index: linux/include/linux/fs.h

[patch 08/10] put declaration of put_filesystem() in fs.h

2007-04-16 Thread Miklos Szeredi
From: Miklos Szeredi [EMAIL PROTECTED]

Declarations go into headers.

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---

Index: linux/fs/super.c
===
--- linux.orig/fs/super.c   2007-04-13 12:26:11.0 +0200
+++ linux/fs/super.c2007-04-13 13:25:40.0 +0200
@@ -40,10 +40,6 @@
 #include asm/uaccess.h
 
 
-void get_filesystem(struct file_system_type *fs);
-void put_filesystem(struct file_system_type *fs);
-struct file_system_type *get_fs_type(const char *name);
-
 LIST_HEAD(super_blocks);
 DEFINE_SPINLOCK(sb_lock);
 
Index: linux/include/linux/fs.h
===
--- linux.orig/include/linux/fs.h   2007-04-13 13:23:22.0 +0200
+++ linux/include/linux/fs.h2007-04-13 13:25:40.0 +0200
@@ -1922,6 +1922,8 @@ extern int vfs_fstat(unsigned int, struc
 
 extern int vfs_ioctl(struct file *, unsigned int, unsigned int, unsigned long);
 
+extern void get_filesystem(struct file_system_type *fs);
+extern void put_filesystem(struct file_system_type *fs);
 extern struct file_system_type *get_fs_type(const char *name);
 extern struct super_block *get_super(struct block_device *);
 extern struct super_block *user_get_super(dev_t);

--
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[patch 06/10] propagate error values from clone_mnt

2007-04-16 Thread Miklos Szeredi
From: Miklos Szeredi [EMAIL PROTECTED]

Allow clone_mnt() to return errors other than ENOMEM.  This will be
used for returning a different error value when the number of user
mounts goes over the limit.

Fix copy_tree() to return EPERM for unbindable mounts.

Don't propagate further from dup_mnt_ns() as that copy_tree() can only
fail with -ENOMEM.

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---

Index: linux/fs/namespace.c
===
--- linux.orig/fs/namespace.c   2007-04-13 13:22:17.0 +0200
+++ linux/fs/namespace.c2007-04-13 13:25:35.0 +0200
@@ -261,42 +261,42 @@ static struct vfsmount *clone_mnt(struct
 {
struct super_block *sb = old-mnt_sb;
struct vfsmount *mnt = alloc_vfsmnt(old-mnt_devname);
+   if (!mnt)
+   return ERR_PTR(-ENOMEM);
 
-   if (mnt) {
-   mnt-mnt_flags = old-mnt_flags;
-   atomic_inc(sb-s_active);
-   mnt-mnt_sb = sb;
-   mnt-mnt_root = dget(root);
-   mnt-mnt_mountpoint = mnt-mnt_root;
-   mnt-mnt_parent = mnt;
-
-   /* don't copy the MNT_USER flag */
-   mnt-mnt_flags = ~MNT_USER;
-   if (flag  CL_SETUSER)
-   set_mnt_user(mnt);
-
-   if (flag  CL_SLAVE) {
-   list_add(mnt-mnt_slave, old-mnt_slave_list);
-   mnt-mnt_master = old;
-   CLEAR_MNT_SHARED(mnt);
-   } else {
-   if ((flag  CL_PROPAGATION) || IS_MNT_SHARED(old))
-   list_add(mnt-mnt_share, old-mnt_share);
-   if (IS_MNT_SLAVE(old))
-   list_add(mnt-mnt_slave, old-mnt_slave);
-   mnt-mnt_master = old-mnt_master;
-   }
-   if (flag  CL_MAKE_SHARED)
-   set_mnt_shared(mnt);
+   mnt-mnt_flags = old-mnt_flags;
+   atomic_inc(sb-s_active);
+   mnt-mnt_sb = sb;
+   mnt-mnt_root = dget(root);
+   mnt-mnt_mountpoint = mnt-mnt_root;
+   mnt-mnt_parent = mnt;
+
+   /* don't copy the MNT_USER flag */
+   mnt-mnt_flags = ~MNT_USER;
+   if (flag  CL_SETUSER)
+   set_mnt_user(mnt);
 
-   /* stick the duplicate mount on the same expiry list
-* as the original if that was on one */
-   if (flag  CL_EXPIRE) {
-   spin_lock(vfsmount_lock);
-   if (!list_empty(old-mnt_expire))
-   list_add(mnt-mnt_expire, old-mnt_expire);
-   spin_unlock(vfsmount_lock);
-   }
+   if (flag  CL_SLAVE) {
+   list_add(mnt-mnt_slave, old-mnt_slave_list);
+   mnt-mnt_master = old;
+   CLEAR_MNT_SHARED(mnt);
+   } else {
+   if ((flag  CL_PROPAGATION) || IS_MNT_SHARED(old))
+   list_add(mnt-mnt_share, old-mnt_share);
+   if (IS_MNT_SLAVE(old))
+   list_add(mnt-mnt_slave, old-mnt_slave);
+   mnt-mnt_master = old-mnt_master;
+   }
+   if (flag  CL_MAKE_SHARED)
+   set_mnt_shared(mnt);
+
+   /* stick the duplicate mount on the same expiry list
+* as the original if that was on one */
+   if (flag  CL_EXPIRE) {
+   spin_lock(vfsmount_lock);
+   if (!list_empty(old-mnt_expire))
+   list_add(mnt-mnt_expire, old-mnt_expire);
+   spin_unlock(vfsmount_lock);
}
return mnt;
 }
@@ -782,11 +782,11 @@ struct vfsmount *copy_tree(struct vfsmou
struct nameidata nd;
 
if (!(flag  CL_COPY_ALL)  IS_MNT_UNBINDABLE(mnt))
-   return NULL;
+   return ERR_PTR(-EPERM);
 
res = q = clone_mnt(mnt, dentry, flag);
-   if (!q)
-   goto Enomem;
+   if (IS_ERR(q))
+   goto error;
q-mnt_mountpoint = mnt-mnt_mountpoint;
 
p = mnt;
@@ -807,8 +807,8 @@ struct vfsmount *copy_tree(struct vfsmou
nd.mnt = q;
nd.dentry = p-mnt_mountpoint;
q = clone_mnt(p, p-mnt_root, flag);
-   if (!q)
-   goto Enomem;
+   if (IS_ERR(q))
+   goto error;
spin_lock(vfsmount_lock);
list_add_tail(q-mnt_list, res-mnt_list);
attach_mnt(q, nd);
@@ -816,7 +816,7 @@ struct vfsmount *copy_tree(struct vfsmou
}
}
return res;
-Enomem:
+ error:
if (res) {
LIST_HEAD(umount_list);
spin_lock(vfsmount_lock);
@@ -824,7 +824,7 @@ Enomem:
spin_unlock(vfsmount_lock);
release_mounts(umount_list);
}
-   return 

[patch 05/10] Add permit user submounts flag to vfsmount

2007-04-16 Thread Miklos Szeredi
From: Miklos Szeredi [EMAIL PROTECTED]

If MNT_USERMNT flag is not set in the target vfsmount, then
unprivileged mounts will be denied.

By default this flag is cleared, and can be set on new mounts, on
remounts or with the MS_SETFLAGS option.

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---

Index: linux/fs/namespace.c
===
--- linux.orig/fs/namespace.c   2007-04-13 13:20:12.0 +0200
+++ linux/fs/namespace.c2007-04-13 13:35:40.0 +0200
@@ -411,6 +411,7 @@ static int show_vfsmnt(struct seq_file *
{ MNT_NOATIME, ,noatime },
{ MNT_NODIRATIME, ,nodiratime },
{ MNT_RELATIME, ,relatime },
+   { MNT_USERMNT, ,usermnt },
{ 0, NULL }
};
struct proc_fs_info *fs_infop;
@@ -1505,9 +1506,11 @@ long do_mount(char *dev_name, char *dir_
mnt_flags |= MNT_NODIRATIME;
if (flags  MS_RELATIME)
mnt_flags |= MNT_RELATIME;
+   if (flags  MS_USERMNT)
+   mnt_flags |= MNT_USERMNT;
 
flags = ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE |
-  MS_NOATIME | MS_NODIRATIME | MS_RELATIME);
+  MS_NOATIME | MS_NODIRATIME | MS_RELATIME | MS_USERMNT);
 
/* ... and get the mountpoint */
retval = path_lookup(dir_name, LOOKUP_FOLLOW, nd);
Index: linux/include/linux/mount.h
===
--- linux.orig/include/linux/mount.h2007-04-13 13:17:08.0 +0200
+++ linux/include/linux/mount.h 2007-04-13 13:22:17.0 +0200
@@ -28,6 +28,7 @@ struct mnt_namespace;
 #define MNT_NOATIME0x08
 #define MNT_NODIRATIME 0x10
 #define MNT_RELATIME   0x20
+#define MNT_USERMNT0x40
 
 #define MNT_SHRINKABLE 0x100
 #define MNT_USER   0x200
Index: linux/include/linux/fs.h
===
--- linux.orig/include/linux/fs.h   2007-04-13 13:23:05.0 +0200
+++ linux/include/linux/fs.h2007-04-13 13:35:34.0 +0200
@@ -130,6 +130,7 @@ extern int dir_notify_enable;
 #define MS_SETFLAGS(123) /* set specified mount flags */
 #define MS_CLEARFLAGS  (124) /* clear specified mount flags */
 /* MS_SETFLAGS | MS_CLEARFLAGS: change mount flags to specified */
+#define MS_USERMNT (125) /* permit unpriv. submounts under this mount */
 #define MS_ACTIVE  (130)
 #define MS_NOUSER  (131)
 

--
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-04-16 Thread David Chinner
On Thu, Apr 12, 2007 at 05:05:50AM -0600, Andreas Dilger wrote:
 I'm interested in getting input for implementing an ioctl to efficiently
 map file extents  holes (FIEMAP) instead of looping over FIBMAP a billion
 times.  We already have customers with single files in the 10TB range and
 we additionally need to get the mapping over the network so it needs to
 be efficient in terms of how data is passed, and how easily it can be
 extracted from the filesystem.
 
 I had come up with a plan independently and was also steered toward
 XFS_IOC_GETBMAP* ioctls which are in fact very similar to my original
 plan, though I think the XFS structs used there are a bit bloated.

Yeah, they were designed with having a long term stable ABI
that limited expandability. Hence the future fields that never
got used ;)

 There was also recent discussion about SEEK_HOLE and SEEK_DATA as
 implemented by Sun, but even if we could skip the holes we still might
 need to do millions of FIBMAPs to see how large files are allocated
 on disk.  Conversely, having filesystems implement an efficient FIBMAP
 ioctl (or -fiemap() method) could in turn be leveraged for SEEK_HOLE
 and SEEK_DATA instead of doing looping over -bmap() inside the kernel
 as I saw one patch.

Yup.

 struct fibmap_extent {
   __u64 fe_start; /* starting offset in bytes */
   __u64 fe_len;   /* length in bytes */
 }
 
 struct fibmap {
   struct fibmap_extent fm_start;  /* offset, length of desired mapping */
   __u32 fm_extent_count;  /* number of extents in array */
   __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */
   __u64 unused;
   struct fibmap_extent fm_extents[0];
 }
 
 #define FIEMAP_LEN_MASK   0xff
 #define FIEMAP_LEN_HOLE   0x01
 #define FIEMAP_LEN_UNWRITTEN  0x02

I'm not sure I like stealing bits from the length to use a flags -
I'd prefer an explicit field per fibmap_extent for this.

Given that xfs_bmap uses extra information from the filesystem
(geometry) to display extra (and frequently used) information
about the alignment of extents. ie:

chook 681% xfs_bmap -vv fred
fred:
 EXT: FILE-OFFSET  BLOCK-RANGE  AG AG-OFFSET  TOTAL FLAGS
   0: [0..151]:288444888..288445039  8 (1696536..1696687)   152 00010
 FLAG Values:
01 Unwritten preallocated extent
001000 Doesn't begin on stripe unit
000100 Doesn't end   on stripe unit
10 Doesn't begin on stripe width
01 Doesn't end   on stripe width

This information could be easily passed up in the flags fields if the
filesystem has geometry information (there go 4 more flags ;). 

Also - what are the explicit sync semantics of this ioctl? The
XFS ioctl causes a fsync of the file first to convert delalloc
extents to real extents before returning the bmap. Is this functionality
going to be the same? If not, then we need a DELALLOC flag to indicate
extents that haven't been allocated yet. This might be handy to
have, anyway

 All offsets are in bytes to allow cases where filesystems are not going
 block-aligned/sized allocations (e.g. tail packing).

So it'll be ok for a few years yet ;)

  The fm_extents array
 returned contains the packed list of allocation extents for the file,
 including entries for holes (which have fe_start == 0, and a flag).

Internalling in XFS, we pass these around as:

#define DELAYSTARTBLOCK ((xfs_fsblock_t)-1LL)
#define HOLESTARTBLOCK  ((xfs_fsblock_t)-2LL)

And the offset passed out through XFS_IOC_GETBMAP[X] is a block
number of -1 for the start of a hole. Hence we don't need a
flag for this. We can expose delalloc extents like this as well
without needing flags...

 The -fm_extents[] array includes all of the holes in addition to
 allocated extents because this avoids the need to return both the logical
 and physical address for every extent and does not make processing any
 harder.

Doesn't really make it any easier to map to disk, either.

 One feature that XFS_IOC_GETBMAPX has that may be desirable is the
 ability to return unwritten extent information.

You got that with the unwritten flag above.

 required expanding the per-extent struct from 32 to 48 bytes per extent,

not sure I follow your maths here?

 but I'd rather limit a single extent to e.g. 2^56 bytes (oh, what hardship)
 and keep 8 bytes or so for input/output flags per extent (would need to
 ^ bits?
 be masked before use).
 
 
 Caller works something like:
 
   char buf[4096];
   struct fibmap *fm = (struct fibmap *)buf;
   int count = (sizeof(buf) - sizeof(*fm)) / sizeof(fm_extent);
   
   fm-fm_extent.fe_start = 0; /* start of file */
   fm-fm_extent.fe_len = -1;  /* end of file */
   fm-fm_extent_count = count; /* max extents in fm_extents[] array */
   fm-fm_flags = 0;   /* maybe no DMAPI, etc like XFS */
 
   fd = 

d_revalidate not being called enough on mountpoints?

2007-04-16 Thread David Howells

Hi Al,

I think there might be a problem in the VFS with d_revalidate() not being
called enough on mountpoints.  As far as I can tell from the printks in my AFS
stuff, it's only called on the mounted-on dentry, and not the vfsmount-root
dentry.  However, with NFS at least (not so much AFS), can you trust that the
mount-root dentry still maps to the same inode and event if it does that that
inode is still up to date?

I discovered it because I was relying on d_revalidate() to spot that the
server had broken the callback on a directory that had been changed.  However,
the root directory of each volume isn't being d_revalidated.

In the kernel output, I see:

(1) Permission check on the root dir of /afs: volume ID 2001, vnode ID 1:

[0ls] == afs_permission({2001:1},1,)

(2) Revalidation of the mountpoint dir in /afs (vnode ID 6):

[0ls] == afs_d_revalidate({v={2001:6} n=.cambridge.redhat.com fl=20},)
[0ls] not promised

(3) At this point, the stats of the mounted-on dir are rechecked.

[0ls] new promise [fl=20]

(4) Permission check on the root dir of /afs/.cambridge.redhat.com (volume ID
2003, vnode ID 1):

[0ls] == afs_permission({2003:1},1,)

(5) Revalidation of the mountpoint dir in /afs/.cambridge.redhat.com (vnode ID
2):

[0ls] == afs_d_revalidate({v={2003:2} n=afsdoc fl=20},)
[0ls] not promised

(6) Again, the stats of the mounted-on dir are rechecked.

[0ls] new promise [fl=20]

I was expecting to see the root dentry of each vfsmount be revalidated, but
that doesn't occur.

David
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: d_revalidate not being called enough on mountpoints?

2007-04-16 Thread Trond Myklebust
On Mon, 2007-04-16 at 13:50 +0100, David Howells wrote:
 Hi Al,
 
 I think there might be a problem in the VFS with d_revalidate() not being
 called enough on mountpoints.  As far as I can tell from the printks in my AFS
 stuff, it's only called on the mounted-on dentry, and not the vfsmount-root
 dentry.  However, with NFS at least (not so much AFS), can you trust that the
 mount-root dentry still maps to the same inode and event if it does that that
 inode is still up to date?

The NFS mountpoint is _not_ defined in terms of the remote path. Once
you have mounted a given directory, the expectation is that that
particular directory stays mounted. You cannot allow it to morph into
another object just because someone messes with the path on the server. 

This is very much analogous to the expectation that a bind mount should
not morph into a different object just because someone has renamed
something in the directory you were binding from.

 I discovered it because I was relying on d_revalidate() to spot that the
 server had broken the callback on a directory that had been changed.  However,
 the root directory of each volume isn't being d_revalidated.

That sounds like an abuse. You are not revalidating the path itself
(which is the purpose of d_revalidate). Instead you are revalidating the
directory metadata, which is a very different thing.

Cheers
  Trond

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 0/8] unprivileged mount syscall

2007-04-16 Thread Eric W. Biederman
Miklos Szeredi [EMAIL PROTECTED] writes:

 Arn't there ways to escape chroot jails? Serge had pointed me to a URL
 which showed chroots can be escaped. And if that is true than having all
 user's private mount tree in the same namespace can be a security issue?

 No.  In fact chrooting the user into /share/$USER will actually
 _grant_ a privilege to the user, instead of taking it away.  It allows
 the user to modify it's root namespace, which it wouldn't be able to
 in the initial namespace.

 So even if the user could escape from the chroot (which I doubt), s/he
 would not be able to do any harm, since unprivileged mounting would be
 restricted to /share.  Also /share/$USER should only have read/search
 permission for $USER or no permissions at all, which would mean, that
 other users' namespaces would be safe from tampering as well.

A couple of points.
- chroot can be escaped, it is just a chdir for the root directory it
  is not a security feature.  The only security is that you have to be root to 
call chdir.
  A carefully done namespace setup won't have that issue.

- While it may not violate security as far as what a user is allowed to modify 
it may
  violate security as far as what a user is allowed to see.

There are interesting per login cases as well such as allowing a user to 
replicate
their mount tree from another machine when they log in.  When /home is on a 
network
filesystem this can be very practical and can allow propagation of mounts across
machines not just across a single login session.

Eric




-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag

2007-04-16 Thread Eric W. Biederman
Miklos Szeredi [EMAIL PROTECTED] writes:

 That depends.  Current patches check the unprivileged submounts
 allowed under this mount flag only on the requested mount and not on
 the propagated mounts.  Do you see a problem with this?

I think privileges of this sort should propagate.  If I read what you
just said correctly if I have a private mount namespace I won't be able
to mount anything unless when it was setup the unprivileged submount
command was explicitly set.

Eric
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 0/8] unprivileged mount syscall

2007-04-16 Thread Miklos Szeredi
  Arn't there ways to escape chroot jails? Serge had pointed me to a URL
  which showed chroots can be escaped. And if that is true than having all
  user's private mount tree in the same namespace can be a security issue?
 
  No.  In fact chrooting the user into /share/$USER will actually
  _grant_ a privilege to the user, instead of taking it away.  It allows
  the user to modify it's root namespace, which it wouldn't be able to
  in the initial namespace.
 
  So even if the user could escape from the chroot (which I doubt), s/he
  would not be able to do any harm, since unprivileged mounting would be
  restricted to /share.  Also /share/$USER should only have read/search
  permission for $USER or no permissions at all, which would mean, that
  other users' namespaces would be safe from tampering as well.
 
 A couple of points.
 - chroot can be escaped, it is just a chdir for the root directory
 it is not a security feature.  The only security is that you have to
 be root to call chdir.  A carefully done namespace setup won't have
 that issue.
 
 - While it may not violate security as far as what a user is allowed
 to modify it may violate security as far as what a user is allowed
 to see.

I think that's just up to the permissions in the global namespace.  In
this example if you 'chmod 0 /share' there won't be anything for the
user to see.

 There are interesting per login cases as well such as allowing a
 user to replicate their mount tree from another machine when they
 log in.  When /home is on a network filesystem this can be very
 practical and can allow propagation of mounts across machines not
 just across a single login session.

Yeah, sounds interesting, but I think it's better to get the basics
working first, and then we can start to think about the extras.

Btw, there's nothing that prevents cloning the namespace _after_
chrooting into the per-user tree.  That would still be simpler than
doing it the other way round: first creating per-session namespaces
and then setting up mount propagation between them.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[nameidata 1/2] Don't pass NULL nameidata to vfs_create

2007-04-16 Thread Andreas Gruenbacher
On Thursday 12 April 2007 12:06, Christoph Hellwig wrote:
 Once again very strong NACK.  Every conditional passing of vfsmounts get my
 veto.  As mentioned last time if you really want this send a patch series
 first that passed the vfsmount consistantly.

I don't consider it fair to NACK this patch just because some other part of 
the kernel uses vfs_create() in the wrong way -- those are really independent 
issues. The bug is not that hard to fix though; here is a proposed patch on 
top of the other AppArmor patches.

The offenders are nfsd and the mqueue filesystem. Instantiate a struct 
nameidata there.

The .dentry and .mnt fields can easily be filled in in nfsd. The mqueue 
filesystem is somewhat special: files that mq_open creates are on an internal 
mount and the mqueue filesystem is not generally visible (it may or may not 
be mounted). When passing a regular vfsmount to vfs_create the files would 
appear disconnected while they actually are kernel-internal objects. Use a 
NULL vfsmount there to avoid that -- passing the kernel-internal vfsmount 
there wouldn't help, anyway.

Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED]

---
 fs/namei.c|7 ---
 fs/nfsd/vfs.c |   23 +++
 ipc/mqueue.c  |6 +-
 3 files changed, 28 insertions(+), 8 deletions(-)

--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1503,7 +1503,7 @@ int vfs_create(struct inode *dir, struct
return -EACCES; /* shouldn't it be ENOSYS? */
mode = S_IALLUGO;
mode |= S_IFREG;
-   error = security_inode_create(dir, dentry, nd ? nd-mnt : NULL, mode);
+   error = security_inode_create(dir, dentry, nd-mnt, mode);
if (error)
return error;
DQUOT_INIT(dir);
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1108,6 +1108,18 @@ nfsd_commit(struct svc_rqst *rqstp, stru
 }
 #endif /* CONFIG_NFSD_V3 */
 
+static inline int
+nfsd_do_create(struct inode *dir, struct dentry *child, struct vfsmount *mnt,
+  int mode)
+{
+   struct nameidata nd = {
+   .dentry = child,
+   .mnt = mnt,
+   };
+
+   return vfs_create(dir, child, mode, nd);
+}
+
 /*
  * Create a file (regular, directory, device, fifo); UNIX sockets 
  * not yet implemented.
@@ -1192,7 +1204,8 @@ nfsd_create(struct svc_rqst *rqstp, stru
err = 0;
switch (type) {
case S_IFREG:
-   host_err = vfs_create(dirp, dchild, iap-ia_mode, NULL);
+   host_err = nfsd_do_create(dirp, dchild, exp-ex_mnt,
+ iap-ia_mode);
break;
case S_IFDIR:
host_err = vfs_mkdir(dirp, dchild, exp-ex_mnt, iap-ia_mode);
@@ -1253,6 +1266,7 @@ nfsd_create_v3(struct svc_rqst *rqstp, s
int *truncp, int *created)
 {
struct dentry   *dentry, *dchild = NULL;
+   struct svc_export *exp;
struct inode*dirp;
__be32  err;
int host_err;
@@ -1271,6 +1285,7 @@ nfsd_create_v3(struct svc_rqst *rqstp, s
goto out;
 
dentry = fhp-fh_dentry;
+   exp = fhp-fh_export;
dirp = dentry-d_inode;
 
/* Get all the sanity checks out of the way before
@@ -1288,7 +1303,7 @@ nfsd_create_v3(struct svc_rqst *rqstp, s
if (IS_ERR(dchild))
goto out_nfserr;
 
-   err = fh_compose(resfhp, fhp-fh_export, dchild, fhp);
+   err = fh_compose(resfhp, exp, dchild, fhp);
if (err)
goto out;
 
@@ -1334,13 +1349,13 @@ nfsd_create_v3(struct svc_rqst *rqstp, s
goto out;
}
 
-   host_err = vfs_create(dirp, dchild, iap-ia_mode, NULL);
+   host_err = nfsd_do_create(dirp, dchild, exp-ex_mnt, iap-ia_mode);
if (host_err  0)
goto out_nfserr;
if (created)
*created = 1;
 
-   if (EX_ISSYNC(fhp-fh_export)) {
+   if (EX_ISSYNC(exp)) {
err = nfserrno(nfsd_sync_dir(dentry));
/* setattr will sync the child (or not) */
}
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -604,6 +604,10 @@ static int mq_attr_ok(struct mq_attr *at
 static struct file *do_create(struct dentry *dir, struct dentry *dentry,
int oflag, mode_t mode, struct mq_attr __user *u_attr)
 {
+   struct nameidata nd = {
+   .dentry = dentry,
+   /* Not a regular create, so set .mnt to NULL. */
+   };
struct mq_attr attr;
int ret;
 
@@ -619,7 +623,7 @@ static struct file *do_create(struct den
}
 
mode = ~current-fs-umask;
-   ret = vfs_create(dir-d_inode, dentry, mode, NULL);
+   ret = vfs_create(dir-d_inode, dentry, mode, nd);
dentry-d_fsdata = NULL;
if (ret)
goto out;
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [nameidata 1/2] Don't pass NULL nameidata to vfs_create

2007-04-16 Thread Matthew Wilcox
On Mon, Apr 16, 2007 at 06:11:30PM +0200, Andreas Gruenbacher wrote:
 +static inline int
 +nfsd_do_create(struct inode *dir, struct dentry *child, struct vfsmount *mnt,
 +int mode)
 +{
 + struct nameidata nd = {
 + .dentry = child,
 + .mnt = mnt,
 + };
 +
 + return vfs_create(dir, child, mode, nd);
 +}
 +

Wouldn't it normally result in fewer instructions (on most architectures
... maybe not x86) to keep the same argument order as vfs_create?  ie:

static inline int nfsd_do_create(struct inode *dir, struct dentry *child,
 int mode, struct vfsmount *mnt)

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[nameidata 2/2] Pass no useless nameidata to the create, lookup, and permission IOPs

2007-04-16 Thread Andreas Gruenbacher
Here is a patch with request for comment.

The create, lookup, and permission inode operations are all passed a full
nameidata.  This is unfortunate because in nfsd and the mqueue filesystem we
must instantiate a struct nameidata, but cannot provide all of the same
information of a regular lookup.  The unused fields take up space on the
stack, but more importantly, it is not obvious which fields have meaningful
values and which don't, and so things might easily break.

This patch introduces struct nameidata2 with only the fields that make sense
independent of an actual lookup, and uses that struct in those places where a
full nameidata is not needed.

The entire patch is a little big so I'm only including the key parts here. The
complete version is here:

 
http://forgeftp.novell.com/apparmor/LKML_Submission-April_07/split-up-nameidata.diff

Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED]

--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -14,21 +14,39 @@ struct open_intent {
 
 enum { MAX_NESTED_LINKS = 8 };
 
+/**
+ * Fields shared between nameidata and nameidata2 -- nameidata2 could
+ * be embedded in nameidata, but then the vfs code would become
+ * cluttered with dereferences.
+ */
+#define __NAMEIDATA2   \
+   struct dentry   *dentry;\
+   struct vfsmount *mnt;   \
+   unsigned intflags;  \
+   \
+   union { \
+   struct open_intent open;\
+   } intent;
+
 struct nameidata {
-   struct dentry   *dentry;
-   struct vfsmount *mnt;
+   __NAMEIDATA2
struct qstr last;
-   unsigned intflags;
int last_type;
unsigneddepth;
char *saved_names[MAX_NESTED_LINKS + 1];
+};
 
-   /* Intent data */
-   union {
-   struct open_intent open;
-   } intent;
+struct nameidata2 {
+   __NAMEIDATA2
 };
 
+#undef __NAMEIDATA2
+
+static inline struct nameidata2 *ND2(struct nameidata *nd)
+{
+   return container_of(nd-dentry, struct nameidata2, dentry);
+}
+
 struct path {
struct vfsmount *mnt;
struct dentry *dentry;
@@ -76,10 +94,9 @@ extern void path_release_on_umount(struc
 
 extern int __user_path_lookup_open(const char __user *, unsigned lookup_flags, 
struct nameidata *nd, int open_flags);
 extern int path_lookup_open(int dfd, const char *name, unsigned lookup_flags, 
struct nameidata *, int open_flags);
-extern struct file *lookup_instantiate_filp(struct nameidata *nd, struct 
dentry *dentry,
-   int (*open)(struct inode *, struct file *));
+extern struct file *lookup_instantiate_filp(struct nameidata2 *nd, struct 
dentry *dentry, int (*open)(struct inode *, struct file *));
 extern struct file *nameidata_to_filp(struct nameidata *nd, int flags);
-extern void release_open_intent(struct nameidata *);
+extern void release_open_intent(struct nameidata2 *);
 
 extern struct dentry * lookup_one_len(const char *, struct dentry *, int);
 
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -225,7 +225,7 @@ int generic_permission(struct inode *ino
return -EACCES;
 }
 
-int permission(struct inode *inode, int mask, struct nameidata *nd)
+int permission(struct inode *inode, int mask, struct nameidata2 *nd)
 {
umode_t mode = inode-i_mode;
int retval, submask;
@@ -278,7 +278,7 @@ int permission(struct inode *inode, int 
  * for filesystem access without changing the normal uids which
  * are used for other things.
  */
-int vfs_permission(struct nameidata *nd, int mask)
+int vfs_permission(struct nameidata2 *nd, int mask)
 {
return permission(nd-dentry-d_inode, mask, nd);
 }
@@ -366,7 +366,7 @@ void path_release_on_umount(struct namei
  * release_open_intent - free up open intent resources
  * @nd: pointer to nameidata
  */
-void release_open_intent(struct nameidata *nd)
+void release_open_intent(struct nameidata2 *nd)
 {
if (nd-intent.open.file-f_path.dentry == NULL)
put_filp(nd-intent.open.file);
@@ -377,7 +377,7 @@ void release_open_intent(struct nameidat
 static inline struct dentry *
 do_revalidate(struct dentry *dentry, struct nameidata *nd)
 {
-   int status = dentry-d_op-d_revalidate(dentry, nd);
+   int status = dentry-d_op-d_revalidate(dentry, ND2(nd));
if (unlikely(status = 0)) {
/*
 * The dentry failed validation.
@@ -455,7 +455,7 @@ static int exec_permission_lite(struct i
 
return -EACCES;
 ok:
-   return security_inode_permission(inode, MAY_EXEC, nd);
+   return security_inode_permission(inode, MAY_EXEC, ND2(nd));
 }
 
 /*
@@ -491,7 +491,7 @@ static struct dentry * real_lookup(struc
struct dentry * dentry = d_alloc(parent, name);
result = ERR_PTR(-ENOMEM);
if (dentry) {
-   result = dir-i_op-lookup(dir, dentry, 

Re: [nameidata 2/2] Pass no useless nameidata to the create, lookup, and permission IOPs

2007-04-16 Thread Christoph Hellwig
On Mon, Apr 16, 2007 at 06:29:20PM +0200, Andreas Gruenbacher wrote:
  enum { MAX_NESTED_LINKS = 8 };
  
 +/**
 + * Fields shared between nameidata and nameidata2 -- nameidata2 could
 + * be embedded in nameidata, but then the vfs code would become
 + * cluttered with dereferences.

you could use an anonymous struct embedeed in both.

 + */
 +#define __NAMEIDATA2 \
 + struct dentry   *dentry;\
 + struct vfsmount *mnt;   \
 + unsigned intflags;  \
 + \
 + union { \
 + struct open_intent open;\
 + } intent;

Or better just pass argument directly.  We really
should only pass down the the dentry and the
intent to the filesystem.  The filesystem has
not business looking at mnt in the operations,
and the relevant bits of flags (mostly whether it's
O_EXCLUSIVE) should be stored in the intent, because
that's the only way it should be used.

Doing it that way also allows to fix some braindead calling conventions
like this one:

 -int permission(struct inode *inode, int mask, struct nameidata *nd)
 +int permission(struct inode *inode, int mask, struct nameidata2 *nd)
  {
   umode_t mode = inode-i_mode;
   int retval, submask;
 @@ -278,7 +278,7 @@ int permission(struct inode *inode, int 
   * for filesystem access without changing the normal uids which
   * are used for other things.
   */

  static inline struct dentry *
  do_revalidate(struct dentry *dentry, struct nameidata *nd)

Or this one.

 - result = dir-i_op-lookup(dir, dentry, nd);
 + result = dir-i_op-lookup(dir, dentry, ND2(nd));

or this one.

etc..
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [nameidata 1/2] Don't pass NULL nameidata to vfs_create

2007-04-16 Thread Andreas Gruenbacher
On Monday 16 April 2007 18:21, Christoph Hellwig wrote:
 But anyway, creating fake nameidata structures is not really helpful.
 If there is a nameidata passed people expect it to be complete, and
 if you pass them to an LSM people will e.g. try to look into lookup
 intents.

I don't actually agree with that: when nfsd creates a file, it still is a file 
create no matter where it originates from, and so it does make sense to 
provide the appropriate intent information too. Struct nameidata contains 
other crap only needed during an actual lookup too --- that's a mess, and the 
nameidata2 patch shows one possibility how to deal with it. (It's the 
cleanest approach I could think of right now without cluttering the vfs code 
with insane amounts of dereferences.)

Thanks,
Andreas
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [nameidata 2/2] Pass no useless nameidata to the create, lookup, and permission IOPs

2007-04-16 Thread Randy Dunlap
On Mon, 16 Apr 2007 18:29:20 +0200 Andreas Gruenbacher wrote:

 Here is a patch with request for comment.
 
 --- a/include/linux/namei.h
 +++ b/include/linux/namei.h
 @@ -14,21 +14,39 @@ struct open_intent {
  
  enum { MAX_NESTED_LINKS = 8 };
  
 +/**

Please don't use the kernel-doc begin-marker /** when the comment block
isn't in kernel-doc format.

 + * Fields shared between nameidata and nameidata2 -- nameidata2 could
 + * be embedded in nameidata, but then the vfs code would become
 + * cluttered with dereferences.
 + */
 +#define __NAMEIDATA2 \
 + struct dentry   *dentry;\
 + struct vfsmount *mnt;   \
 + unsigned intflags;  \
 + \
 + union { \
 + struct open_intent open;\
 + } intent;
 +

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [nameidata 2/2] Pass no useless nameidata to the create, lookup, and permission IOPs

2007-04-16 Thread Andreas Gruenbacher
On Monday 16 April 2007 18:42, Randy Dunlap wrote:
 Please don't use the kernel-doc begin-marker /** when the comment block
 isn't in kernel-doc format.

Sigh, kernel-doc should improve things, not get in the way ...
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [nameidata 1/2] Don't pass NULL nameidata to vfs_create

2007-04-16 Thread Christoph Hellwig
On Mon, Apr 16, 2007 at 06:40:41PM +0200, Andreas Gruenbacher wrote:
 On Monday 16 April 2007 18:21, Christoph Hellwig wrote:
  But anyway, creating fake nameidata structures is not really helpful.
  If there is a nameidata passed people expect it to be complete, and
  if you pass them to an LSM people will e.g. try to look into lookup
  intents.
 
 I don't actually agree with that: when nfsd creates a file, it still is a 
 file 
 create no matter where it originates from, and so it does make sense to 
 provide the appropriate intent information too. Struct nameidata contains 
 other crap only needed during an actual lookup too --- that's a mess, and the 

You should provide intent information, yes - which your patch didn't :)

And yes, I didn't like doing the ugly intent in nameidata hack, it's
creating loads of problems for various parties, e.g. the stackable
filesystem folks.  Now the basic intent in nameidata mistaken has been
made even worse by passing back a struct file in it conditionally and
doing lots of work in -lookup that shouldn't be there.  (Which btw,
I expect to cause quite a few problems for apparmor or other lsms,
but I guess so far no one has tried them on NFSv4)

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [nameidata 2/2] Pass no useless nameidata to the create, lookup, and permission IOPs

2007-04-16 Thread Randy Dunlap

Andreas Gruenbacher wrote:

On Monday 16 April 2007 18:42, Randy Dunlap wrote:

Please don't use the kernel-doc begin-marker /** when the comment block
isn't in kernel-doc format.


Sigh, kernel-doc should improve things, not get in the way ...


Well, I see it as sort of a language that is embedded in C source code.

But what are you suggesting?

--
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag

2007-04-16 Thread Ram Pai
On Mon, 2007-04-16 at 11:56 +0200, Miklos Szeredi wrote:
Also for bind-mount and remount operations the flag has to be propagated
down its propagation tree.  Otherwise a unpriviledged mount in a shared
mount wont get reflected in its peers and slaves, leading to unidentical
shared-subtrees.
   
   That's an interesting question.  Do we want shared mounts to be
   totally identical, including mnt_flags?  It doesn't look as if
   do_remount() guarantees that currently.
  
  Depends on the semantics of each of the flags. Some flags like of the
  read/write flag, would not interfere with the propagation semantics
  AFAICT.  But this one certainly seems to interfere.
 
 That depends.  Current patches check the unprivileged submounts
 allowed under this mount flag only on the requested mount and not on
 the propagated mounts.  Do you see a problem with this?

Don't see a problem if the flag is propagated to all peers and slave
mounts. 

If not, I see a problem. What if the propagated mount has its flag set
to not do un-priviledged mounts, whereas the requested mount has it
allowed?

RP



 
 Miklos

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How to query mount propagation state?

2007-04-16 Thread Ram Pai
On Mon, 2007-04-16 at 12:34 +0200, Miklos Szeredi wrote:
 Currently one of the difficulties with mount propagations is that
 there's no way to know the current state of the propagation tree.
 
 Has anyone thought about how this info could be queried from
 userspace?

I am attaching two patches that I had done way back in Oct 2006 
with Al Viro. I had sent these patches to Al Viro. But I forgot to
follow them up, I guess so did Al Viro.

The first patch disambiguates multiple mount-instances of the same
filesystem (or part of the same filesystem), by introducing a new
interface /proc/mounts_new. 

The second patch introduces a new proc interface that exposes all the
propagation trees within a namespace.  It does not show propagated
mounts residing in a different namespace (for privacy reasons). Maybe
one could modify the patch a little, to allow it; if the user has
root priviledges. 

RP

PS: Sorry these are attachments instead of inline patches. I am scared
of inlining in evolution. If needed I can send inline patches through
mutt.

 
 Thanks,
 Miklos
This patch disambiguates multiple mount-instances of the same
filesystem (or part of the same filesystem), by introducing a new
interface /proc/mounts_new. The interface has the following format.


FSID  mntpt  root-dentry  fstype fs-options


NOTE: root-dentry is the path to the dentry w.r.t to the root dentry of the
same filesystem.

for example: lets say we attempt the following commands
mount --bind /var /mnt
mount --bind /mnt/tmp /tmp1

'cat /proc/mounts' shows the following:
/dev/root /mnt ext2 rw 0 0
/dev/root /tmp1 ext2 rw 0 0

NOTE: The above mount entries, do not indicate that /tmp1 contains the same
directory tree as /var/tmp.

But 'cat /proc/mounts_new' shows us the following:
0x6200 /mnt /var ext2 rw 0 0
0x6200 /tmp1 /var/tmp ext2 rw 0 0

The above entries clearly indicates that /var/tmp directory of the ext2
filesystem with fsid=0x6200 is the directory tree that resides under /tmp1

Signed-off-by: Ram Pai [EMAIL PROTECTED]

---
 fs/dcache.c  |   53 
 fs/namespace.c   |   35 ++---
 fs/proc/base.c   |   32 +--
 fs/proc/proc_misc.c  |1 
 fs/seq_file.c|   77 ++-
 include/linux/dcache.h   |1 
 include/linux/seq_file.h |1 
 7 files changed, 172 insertions(+), 28 deletions(-)

Index: linux-2.6.17.10/fs/proc/base.c
===
--- linux-2.6.17.10.orig/fs/proc/base.c
+++ linux-2.6.17.10/fs/proc/base.c
@@ -104,6 +104,7 @@ enum pid_directory_inos {
 	PROC_TGID_MAPS,
 	PROC_TGID_NUMA_MAPS,
 	PROC_TGID_MOUNTS,
+	PROC_TGID_MOUNTS_NEW,
 	PROC_TGID_MOUNTSTATS,
 	PROC_TGID_WCHAN,
 #ifdef CONFIG_MMU
@@ -145,6 +146,7 @@ enum pid_directory_inos {
 	PROC_TID_MAPS,
 	PROC_TID_NUMA_MAPS,
 	PROC_TID_MOUNTS,
+	PROC_TID_MOUNTS_NEW,
 	PROC_TID_MOUNTSTATS,
 	PROC_TID_WCHAN,
 #ifdef CONFIG_MMU
@@ -203,6 +205,7 @@ static struct pid_entry tgid_base_stuff[
 	E(PROC_TGID_ROOT,  root,S_IFLNK|S_IRWXUGO),
 	E(PROC_TGID_EXE,   exe, S_IFLNK|S_IRWXUGO),
 	E(PROC_TGID_MOUNTS,mounts,  S_IFREG|S_IRUGO),
+	E(PROC_TGID_MOUNTS_NEW,mounts_new,  S_IFREG|S_IRUGO),
 	E(PROC_TGID_MOUNTSTATS, mountstats, S_IFREG|S_IRUSR),
 #ifdef CONFIG_MMU
 	E(PROC_TGID_SMAPS, smaps,   S_IFREG|S_IRUGO),
@@ -246,6 +249,7 @@ static struct pid_entry tid_base_stuff[]
 	E(PROC_TID_ROOT,   root,S_IFLNK|S_IRWXUGO),
 	E(PROC_TID_EXE,exe, S_IFLNK|S_IRWXUGO),
 	E(PROC_TID_MOUNTS, mounts,  S_IFREG|S_IRUGO),
+	E(PROC_TID_MOUNTS_NEW, mounts_new,  S_IFREG|S_IRUGO),
 #ifdef CONFIG_MMU
 	E(PROC_TID_SMAPS,  smaps,   S_IFREG|S_IRUGO),
 #endif
@@ -692,13 +696,13 @@ static struct file_operations proc_smaps
 };
 #endif
 
-extern struct seq_operations mounts_op;
 struct proc_mounts {
 	struct seq_file m;
 	int event;
 };
 
-static int mounts_open(struct inode *inode, struct file *file)
+static int __mounts_open(struct inode *inode, struct file *file,
+			struct seq_operations *mounts_op)
 {
 	struct task_struct *task = proc_task(inode);
 	struct namespace *namespace;
@@ -716,7 +720,7 @@ static int mounts_open(struct inode *ino
 		p = kmalloc(sizeof(struct proc_mounts), GFP_KERNEL);
 		if (p) {
 			file-private_data = p-m;
-			ret = seq_open(file, mounts_op);
+			ret = seq_open(file, mounts_op);
 			if (!ret) {
 p-m.private = namespace;
 p-event = namespace-event;
@@ -729,6 +733,16 @@ static int mounts_open(struct inode *ino
 	return ret;
 }
 
+extern struct seq_operations mounts_op, mounts_new_op;
+static int mounts_open(struct inode *inode, struct file *file)
+{
+	return (__mounts_open(inode, file, mounts_op));
+}
+static int mounts_new_open(struct inode *inode, struct file *file)
+{
+	return __mounts_open(inode, file, mounts_new_op);
+}

Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag

2007-04-16 Thread Miklos Szeredi
 Also for bind-mount and remount operations the flag has to be 
 propagated
 down its propagation tree.  Otherwise a unpriviledged mount in a 
 shared
 mount wont get reflected in its peers and slaves, leading to 
 unidentical
 shared-subtrees.

That's an interesting question.  Do we want shared mounts to be
totally identical, including mnt_flags?  It doesn't look as if
do_remount() guarantees that currently.
   
   Depends on the semantics of each of the flags. Some flags like of the
   read/write flag, would not interfere with the propagation semantics
   AFAICT.  But this one certainly seems to interfere.
  
  That depends.  Current patches check the unprivileged submounts
  allowed under this mount flag only on the requested mount and not on
  the propagated mounts.  Do you see a problem with this?
 
 Don't see a problem if the flag is propagated to all peers and slave
 mounts. 
 
 If not, I see a problem. What if the propagated mount has its flag set
 to not do un-priviledged mounts, whereas the requested mount has it
 allowed?

Then the mount is allowed.

It is up to the sysadmin/distro to design set up the propagations in a
way that this is not a problem.

I think it would be much less clear conceptually, if unprivileged
mounting would have to check propagations as well.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag

2007-04-16 Thread Eric W. Biederman
Miklos Szeredi [EMAIL PROTECTED] writes:

  That depends.  Current patches check the unprivileged submounts
  allowed under this mount flag only on the requested mount and not on
  the propagated mounts.  Do you see a problem with this?
 
 I think privileges of this sort should propagate.  If I read what you
 just said correctly if I have a private mount namespace I won't be able
 to mount anything unless when it was setup the unprivileged submount
 command was explicitly set.

 By design yes.  Why is that a problem?

It certainly doesn't match my intuition.

Why are directory permissions not sufficient to allow/deny non-priveleged 
mounts?
I don't understand that contention yet.

I should probably go back and look and see how plan9 handles mount/unmount
permissions.  Plan9 gets away with a lot more because it doesn't have
a suid bit and mount namespaces were always present, so they don't have
backwards compatibility problems.

My best guess at the moment is that plan9 treated mount/unmount as
completely unprivileged and used the mount namespaces to limit the
scope of what would be affected by a mount/unmount operation.  I think
that may be reasonable in linux as well but it will require the
presence of a mount namespace to limit the affects of what a user can
do.

So short of a more thorough audit I believe the final semantics should
be: 
- mount/unmount for non-priveleged processes should only be limited
  by the mount namespace and directory permissions.
- CLONE_NEWNS should not be a privileged operation. 

What prevents us from allowing these things?

- Unprivileged CLONE_NEWNS and unprivileged mounts needs resource
  accounting so we don't have a denial of service attack.

- Unprivileged mounts must be limited to directories that we have
  permission to modify in a way that we could get the same effect
  as the mount or unmount operation in terms of what files are visible
  otherwise we can mess up SUID executables.

- Anything else?

There are user space issues such as a reasonable pam module and how
to do backups.  However those are user space issues.

What am I missing that requires us to add MNT_USER and MNT_USERMNT?

Eric
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 05/10] Add permit user submounts flag to vfsmount

2007-04-16 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
 From: Miklos Szeredi [EMAIL PROTECTED]
 
 If MNT_USERMNT flag is not set in the target vfsmount, then

MNT_USER and MNT_USERMNT?  I claim no way will people keep those
straight.  How about MNT_ALLOWUSER and MNT_USER?

-serge

 unprivileged mounts will be denied.
 
 By default this flag is cleared, and can be set on new mounts, on
 remounts or with the MS_SETFLAGS option.
 
 Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
 ---
 
 Index: linux/fs/namespace.c
 ===
 --- linux.orig/fs/namespace.c 2007-04-13 13:20:12.0 +0200
 +++ linux/fs/namespace.c  2007-04-13 13:35:40.0 +0200
 @@ -411,6 +411,7 @@ static int show_vfsmnt(struct seq_file *
   { MNT_NOATIME, ,noatime },
   { MNT_NODIRATIME, ,nodiratime },
   { MNT_RELATIME, ,relatime },
 + { MNT_USERMNT, ,usermnt },
   { 0, NULL }
   };
   struct proc_fs_info *fs_infop;
 @@ -1505,9 +1506,11 @@ long do_mount(char *dev_name, char *dir_
   mnt_flags |= MNT_NODIRATIME;
   if (flags  MS_RELATIME)
   mnt_flags |= MNT_RELATIME;
 + if (flags  MS_USERMNT)
 + mnt_flags |= MNT_USERMNT;
 
   flags = ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE |
 -MS_NOATIME | MS_NODIRATIME | MS_RELATIME);
 +MS_NOATIME | MS_NODIRATIME | MS_RELATIME | MS_USERMNT);
 
   /* ... and get the mountpoint */
   retval = path_lookup(dir_name, LOOKUP_FOLLOW, nd);
 Index: linux/include/linux/mount.h
 ===
 --- linux.orig/include/linux/mount.h  2007-04-13 13:17:08.0 +0200
 +++ linux/include/linux/mount.h   2007-04-13 13:22:17.0 +0200
 @@ -28,6 +28,7 @@ struct mnt_namespace;
  #define MNT_NOATIME  0x08
  #define MNT_NODIRATIME   0x10
  #define MNT_RELATIME 0x20
 +#define MNT_USERMNT  0x40
 
  #define MNT_SHRINKABLE   0x100
  #define MNT_USER 0x200
 Index: linux/include/linux/fs.h
 ===
 --- linux.orig/include/linux/fs.h 2007-04-13 13:23:05.0 +0200
 +++ linux/include/linux/fs.h  2007-04-13 13:35:34.0 +0200
 @@ -130,6 +130,7 @@ extern int dir_notify_enable;
  #define MS_SETFLAGS  (123) /* set specified mount flags */
  #define MS_CLEARFLAGS(124) /* clear specified mount flags */
  /* MS_SETFLAGS | MS_CLEARFLAGS: change mount flags to specified */
 +#define MS_USERMNT   (125) /* permit unpriv. submounts under this mount */
  #define MS_ACTIVE(130)
  #define MS_NOUSER(131)
 
 
 --
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 02/10] allow unprivileged umount

2007-04-16 Thread Eric W. Biederman
Miklos Szeredi [EMAIL PROTECTED] writes:

 From: Miklos Szeredi [EMAIL PROTECTED]

 The owner doesn't need sysadmin capabilities to call umount().

 Similar behavior as umount(8) on mounts having user=UID option in
 /etc/mtab.  The difference is that umount also checks /etc/fstab,
 presumably to exclude another mount on the same mountpoint.


bool in the kernel?

int would be much more recognizable as this is not C++

Or do you have place to convert the rest of the kernel that is using
int to return a true/false value to bool?

 +static bool permit_umount(struct vfsmount *mnt, int flags)
 +{
 + if (capable(CAP_SYS_ADMIN))
 + return true;
 +
 + if (!(mnt-mnt_flags  MNT_USER))
 + return false;
 +
 + if (flags  MNT_FORCE)
 + return false;
 +
 + return mnt-mnt_uid == current-uid;
 +}

Eric
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag

2007-04-16 Thread Serge E. Hallyn
Quoting Eric W. Biederman ([EMAIL PROTECTED]):
 Miklos Szeredi [EMAIL PROTECTED] writes:
 
   That depends.  Current patches check the unprivileged submounts
   allowed under this mount flag only on the requested mount and not on
   the propagated mounts.  Do you see a problem with this?
  
  I think privileges of this sort should propagate.  If I read what you
  just said correctly if I have a private mount namespace I won't be able
  to mount anything unless when it was setup the unprivileged submount
  command was explicitly set.
 
  By design yes.  Why is that a problem?
 
 It certainly doesn't match my intuition.
 
 Why are directory permissions not sufficient to allow/deny non-priveleged 
 mounts?
 I don't understand that contention yet.

The same scenarios laid out previously in this thread.  I.e.

1. user hallyn does mount --bind / /home/hallyn/root
2. (...)
3. admin does deluser hallyn

and deluser starts wiping out root

Or,

1. user hallyn does mount --bind / /home/hallyn/root
2. backup daemon starts backing up /home/hallyn/root/home/hallyn/root/home...

So we started down the path of forcing users to clone a new namespace
before doing user mounts, which is what the clone flag was about.  Using
per-mount flags also suffices as you had pointed out, which is being
done here.  But directory permissions are inadequate.

(Unless you want to tackle each problem legacy tool one at a time to
remove problems - i.e. deluser should umount everything under
/home/hallyn before deleting, backup should be spawned from it's own
namespace cloned right after boot or just back up on one filesystem,
etc.)

-serge

 I should probably go back and look and see how plan9 handles mount/unmount
 permissions.  Plan9 gets away with a lot more because it doesn't have
 a suid bit and mount namespaces were always present, so they don't have
 backwards compatibility problems.
 
 My best guess at the moment is that plan9 treated mount/unmount as
 completely unprivileged and used the mount namespaces to limit the
 scope of what would be affected by a mount/unmount operation.  I think
 that may be reasonable in linux as well but it will require the
 presence of a mount namespace to limit the affects of what a user can
 do.
 
 So short of a more thorough audit I believe the final semantics should
 be: 
 - mount/unmount for non-priveleged processes should only be limited
   by the mount namespace and directory permissions.
 - CLONE_NEWNS should not be a privileged operation. 
 
 What prevents us from allowing these things?
 
 - Unprivileged CLONE_NEWNS and unprivileged mounts needs resource
   accounting so we don't have a denial of service attack.
 
 - Unprivileged mounts must be limited to directories that we have
   permission to modify in a way that we could get the same effect
   as the mount or unmount operation in terms of what files are visible
   otherwise we can mess up SUID executables.
 
 - Anything else?
 
 There are user space issues such as a reasonable pam module and how
 to do backups.  However those are user space issues.
 
 What am I missing that requires us to add MNT_USER and MNT_USERMNT?
 
 Eric
 -
 To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 02/10] allow unprivileged umount

2007-04-16 Thread Greg KH
On Mon, Apr 16, 2007 at 01:39:19PM -0600, Eric W. Biederman wrote:
 Miklos Szeredi [EMAIL PROTECTED] writes:
 
  From: Miklos Szeredi [EMAIL PROTECTED]
 
  The owner doesn't need sysadmin capabilities to call umount().
 
  Similar behavior as umount(8) on mounts having user=UID option in
  /etc/mtab.  The difference is that umount also checks /etc/fstab,
  presumably to exclude another mount on the same mountpoint.
 
 
 bool in the kernel?

It's there already...

 int would be much more recognizable as this is not C++
 
 Or do you have place to convert the rest of the kernel that is using
 int to return a true/false value to bool?

It's already underway:
$ git log | grep bool | wc -l
123

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [AppArmor 39/41] AppArmor: Profile loading and manipulation, pathname matching

2007-04-16 Thread John Johansen
On Mon, Apr 16, 2007 at 08:27:08AM +0200, Andi Kleen wrote:
  It's nice to check for consistency though, so we're adding that. Profile 
  loading is a trusted operation, at least so far, and so security wise we 
  don't actually have to care --- if loading an invalid profile can bring 
  down 
  the system, then that's no worse than an arbitrary module that crashes the 
  machine. Not sure if there will ever be user loadable profiles; at least at 
  that point we had to care.
 
 A security system that allows to crash the kernel is a little weird 
 though. It would be better to check. Not that a recursion check
 is particularly expensive.
 
Indeed.  It will be fixed in the next rev.

thanks
john



pgp4HkMa7wCNY.pgp
Description: PGP signature


Re: How to query mount propagation state?

2007-04-16 Thread Karel Zak
On Mon, Apr 16, 2007 at 10:39:46AM -0700, Ram Pai wrote:

 This patch disambiguates multiple mount-instances of the same
 filesystem (or part of the same filesystem), by introducing a new
 interface /proc/mounts_new. The interface has the following format.
^^
 ... odd name. What will be the name for a next generation?
 /proc/mounts_new_new? :-)

 'cat /proc/mounts' shows the following:
 /dev/root /mnt ext2 rw 0 0
 /dev/root /tmp1 ext2 rw 0 0
 
 NOTE: The above mount entries, do not indicate that /tmp1 contains the same
 directory tree as /var/tmp.
 
 But 'cat /proc/mounts_new' shows us the following:
 0x6200 /mnt /var ext2 rw 0 0
 0x6200 /tmp1 /var/tmp ext2 rw 0 0

 Can't you purely and simply add the fsid= option to /proc/mounts?

 /dev/root /mnt ext2 rw,fsid=0x6200 0 0 
 /dev/root /mnt ext2 rw,fsid=0x6200 0 0

 I think you can do it without a negative impact to userspace.

 This patch introduces a new proc interface that exposes all the propagation 
 trees within the namespace.

 Good idea.

 It walks through each off the mounts in the namespace, and prints the 
 following information.
 
 mount-id: a unique mount identifier
 dev-id : the unique device used to identify the device containing the 
 filesystem
  
 Why not major:minor?

 path-from-root: mount point of the mount from /
 path-from-root-of-its-sb: path from its own root dentry.
 propagation-flag: SHARED, SLAVE, UNBINDABLE, PRIVATE
 peer-mount-id: the mount-id of its peer mount (if this mount is shared)
 master-mount-id: the mount-id of its master mount (if this mount is slave)

 Example:
 Here is a sample output of cat /proc/$$/mounts_propagation
 
 0xa917800 0x1 / / PRIVATE
 0xa917200 0x6200 / / PRIVATE
 0xa917180 0x3 /proc / PRIVATE
 0xa917f80 0xa /dev/pts / PRIVATE
 0xa917100 0x6210 /mnt / SHARED peer:0xa917100
 0xa917f00 0x6210 /tmp /1 SLAVE master:0xa917100
 0xa917900 0x6220 /mnt/2 / SHARED peer:0xa917900

 Same thing (although the mounts_propagation makes more sense than
 mount_new from my point of view). 
 
 cat /proc/mounts (or /proc/$$/mounts)

 /dev/root /mnt ext2 rw,mid=0xa917100,did=0x6210,prop=SHARED,peer=0xa917100


 my $0.02...

Karel

-- 
 Karel Zak  [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


AppArmor FAQ

2007-04-16 Thread John Johansen
Here we present our direct responses to the most frequent questions from
the AppArmor from the 2006 post.

Use of Pathnames For Access Control
---

Some people in the security field believe that pathnames are an
inappropriate security mechanism.  This depends on what you are
primarily trying to protect, and the rest follows from that.

Label-based security (exemplified by SELinux, and its predecessors in
MLS systems) attaches security policy to the data. As the data flows
through the system, the label sticks to the data, and so security
policy with respect to this data stays intact. This is a good approach
for ensuring secrecy, the kind of problem that intelligence agencies have.

Pathname-based security (exemplified in AppArmor, and its predecessor
Janus http://www.cs.berkeley.edu/~daw/janus/ and other systems like
Systrace http://www.citi.umich.edu/u/provos/systrace/ ) attach security
policy to the name of the data.

Controlling access to filenames is important because applications
primarily use those names to access the files behind them, and they
depend on getting to the right files. For example, login(1) expects
/etc/passwd to resolve to a valid list of user accounts.  In the
traditional UNIX model, files do have names but not labels, and
applications only operate in terms of those names.  Pathname-based
security puts more emphasis on the integrity of the system, making
secrecy the secondary goal that follows.

Caveat: Both label-based security and pathname-based security can
provide both secrecy and integrity protection, the above discussion is
only about which model makes it easier to provide which kind of security.

We acknowledge that not all objects on a UNIX system are paths, and we
agree that there is value in also protecting non-path resources.
Contrary to popular belief, AppArmor is *not* Pathnames R Us, but
rather Use native abstractions to mediate stuff:  when you mediate
something, you should use the native syntax that users normally use to
access the object. This follows the UNIX philosophy of least surprise
so that users can understand the specification. Pathnames are the
natural notation for users to understand what file access rights are
being granted in the policy, and so AppArmor uses shell syntax for fully
qualified pathnames, including shell syntax wildcards.

Similarly, AppArmor grants access to POSIX.1e capabilities by name, the
name of the capability. In future work where AppArmor will add network
access control, the notation will resemble IPTables firewall rules. This
is an important part of what makes AppArmor usable: always using the
native abstraction for mediating access.

We also acknowledge that pathname based access control requires a way to
perform pathname matching in the kernel, and this comes at a cost higher
than comparing object labels -- which assumes that all objects in the
system already have the appropriate labels.

However, those concerned with performance should note that AppArmor
overhead is already quite low (single-digit percent slowdown). Security
is rarely performance-neutral, and AppArmor, and SELinux, are no
exception. However, that overhead is small, and can be selectively
avoided by not applying AppArmor to performance-sensitive programs.

It is also easy to overlook the fact that putting all those labels in
place is a pretty expensive operation as well, particularly considering
large file systems. So by providing string matching in the kernel,
AppArmor trades run-time performance to grant reduced administrative work.

It has been suggested that AppArmor's pathname-based syntax could be
compiled into SELinux policy, and this is in fact what the SEEdit
project http://seedit.sourceforge.net/ does. However, any change in
policy requires a complete re-labeling of the file system, and the
policy cannot apply to files that do not yet exist. AppArmor's in-kernel
string matching allows for policy specifying access to files that might
come to exist in the future.

Use Of d_path() For Computing Pathnames
---

We have been criticized for the use of d_path(), for various reasons:

 - heuristic discovery of the vfsmount of a dentry,
 - inability to reliably identify deleted files,
 - inability to detect unreachable paths,
 - ambiguity of paths for chroot processes,
 - file lookup and the access check are not atomic.

Most of these issues are fixable (and fixed in the meantime), while the
non-atomicity is not really an issue.

Because struct vfsmount was not available to LSM hooks for computing
pathnames from (dentry, vfsmount) pairs, the version of AppArmor posted
last year used heuristics for rediscovering the vfsmounts associated
with dentries -- and possibly the wrong ones.  We are now passing the
vfsmount objects through to all LSM hooks that compute pathnames, and so
this heuristic is gone, and now we always use the appropriate vfsmount.

The d_path patch already in the -mm tree allows 

Re: [AppArmor 38/41] AppArmor: Module and LSM hooks

2007-04-16 Thread John Johansen
On Thu, Apr 12, 2007 at 11:21:01AM +0100, Alan Cox wrote:
  +
  +   /**
  +* parent can ptrace child when
  +* - parent is unconfined
  +* - parent is in complain mode
  +* - parent and child are confined by the same profile
  +*/
 
 Your profiles are name based. That means the same profile in a different
 namespace does different things. It would be a very odd case where it
 mattered but surely the parent ptrace child rule should also require that
 the parent and child are in the same namespace when using apparmor name
 based security.
 
you are right we should be requiring parent and child are in the same
namespace.  This has been fixed.

  +static int apparmor_capget(struct task_struct *task,
  +   kernel_cap_t *effective,
  +   kernel_cap_t *inheritable,
  +   kernel_cap_t *permitted)
  +{
  +   return cap_capget(task, effective, inheritable, permitted);
  +}
 
 Pointless function should go away.
 
yes we had a few of those thanks for pointing it out.

  +static int apparmor_sysctl(struct ctl_table *table, int op)
  +{
  +   int error = 0;
  +
  +   if ((op  002)  !capable(CAP_SYS_ADMIN))
  +   error = aa_reject_syscall(current, GFP_KERNEL,
  + sysctl (write));
  +
  +   return error;
 
 The usual file permission security override is DAC not ADMIN. What is the
 logic of this choice.
 
This was a very course grain check that was done to restrict access to
sysctl's that could be potentially used to elevated priledge.  The check
is inconsistent with AppArmor's model and we should be modelling
sysctl accesses as pathname access, and then we could be using standard
mediation.

thanks for the review
john



pgpu5UrkM9BxV.pgp
Description: PGP signature


Re: [AppArmor 31/41] Fix __d_path() for lazy unmounts and make it unambiguous; exclude unreachable mount points from /proc/mounts

2007-04-16 Thread Alan Cox
  That is a fairly significant and sudden change to the existing
  kernel/user interface.
 
 Well, this is not meant for 2.6.21. I hope it is possible to change it in 
 early 2.6.22; otherwise if we can't fix mistakes from the past we are pretty 
 doomed.

I don't believe the existing behaviour _IS_ a mistake.

  This is untrue. The process can get there (via fd passing with another
  task)
 
 Process can access file descriptors which are unreachable via path name just 
 fine indeed, but those fds still don't have a valid path in the context of 
 that process.

Which while problematic to your name based security is just fine to
everything else. 

 We are only talking about mount points unreachable by a particular process; 
 this does not mean that the mount point isn't reachable by other processes. 
 Human operators can choose the context from which they are looking 
 at /proc/mounts. If they are looking form the real root, the will see all 
 mounts that any process can reach (in that namespace).

Ok, providing the real root sees them all it isn't so bad, but to
assume you can filter based upon what the task can see is dodgy as an
assumption.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [AppArmor 39/41] AppArmor: Profile loading and manipulation, pathname matching

2007-04-16 Thread Alan Cox
 don't actually have to care --- if loading an invalid profile can bring down 
 the system, then that's no worse than an arbitrary module that crashes the 
 machine. Not sure if there will ever be user loadable profiles; at least at 
 that point we had to care.

CAP_SYS_RAWIO is needed to do arbitary patching/loading in the capability
model so if you are using lesser capabilities it is a (minor) capability
rise but not a big problem, just ugly and wanting a fix

   + /*
   +  * Replacement needs to allocate a new aa_task_context for each
   +  * task confined by old_profile.  To do this the profile locks
   +  * are only held when the actual switch is done per task.  While
   +  * looping to allocate a new aa_task_context the old_task list
   +  * may get shorter if tasks exit/change their profile but will
   +  * not get longer as new task will not use old_profile detecting
   +  * that is stale.
   +  */
   + do {
   + new_cxt = aa_alloc_task_context(GFP_KERNEL | __GFP_NOFAIL);
  
  NOFAIL is usually a bad sign. It should be only used if there is no
  alternative.
 
 At this point there is no secure alternative to allocating a task context --- 
 except killing the task, maybe.

Can you count the number needed, preallocate them and then when you know
for sure either succeed or fail the operation as a whole ?
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [AppArmor 39/41] AppArmor: Profile loading and manipulation, pathname matching

2007-04-16 Thread John Johansen
On Mon, Apr 16, 2007 at 11:00:01PM +0100, Alan Cox wrote:
  don't actually have to care --- if loading an invalid profile can bring 
  down 
  the system, then that's no worse than an arbitrary module that crashes the 
  machine. Not sure if there will ever be user loadable profiles; at least at 
  that point we had to care.
 
 CAP_SYS_RAWIO is needed to do arbitary patching/loading in the capability
 model so if you are using lesser capabilities it is a (minor) capability
 rise but not a big problem, just ugly and wanting a fix
 
+   /*
+* Replacement needs to allocate a new aa_task_context for each
+* task confined by old_profile.  To do this the profile locks
+* are only held when the actual switch is done per task.  While
+* looping to allocate a new aa_task_context the old_task list
+* may get shorter if tasks exit/change their profile but will
+* not get longer as new task will not use old_profile detecting
+* that is stale.
+*/
+   do {
+   new_cxt = aa_alloc_task_context(GFP_KERNEL | 
__GFP_NOFAIL);
   
   NOFAIL is usually a bad sign. It should be only used if there is no
   alternative.
  
  At this point there is no secure alternative to allocating a task context 
  --- 
  except killing the task, maybe.
 
 Can you count the number needed, preallocate them and then when you know
 for sure either succeed or fail the operation as a whole ?

No, to be accurate the count would have to be made with the profile lock
held, which would then need to be released so as not to use GFP_ATOMIC
for the allocations.

An iterative approach could be taken where we do something like
repeat:
  lock profile
 count
 if preallocated  count
unlock profile
if (! allocate count - preallocated)
   Fail
goto repeat
  do replacement


pgp3UGFGlqBX8.pgp
Description: PGP signature


Re: 2.6.21-rc6 new aops patchset

2007-04-16 Thread Mark Fasheh
On Thu, Apr 12, 2007 at 06:48:52AM +0200, Nick Piggin wrote:
 http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
 
 2.6.21-rc6-new-aops*
 
 New aops patchset against 2.6.21-rc6.

Ok, here's an ocfs2 patch against -mm / ocfs2.git ALL. It makes use of the
context back pointer to store information related to the write which the vfs
normally doesn't know about. Most importantly this is an array of zero'd
pages which might have to be written out for an allocating write. Of note is
that I also stick the journal handle on there. Ocfs2 could use
current-journal_info for that, but I think it's much cleaner to just pass
that around as a file system specific context.

I tested this on a couple nodes and things seem to be running smoothly.

A couple of notes:

* The ocfs2 write context is probably a bit big. I'm much more concerned
  with readability though as Ocfs2 has much more baggage to carry around
  than other file systems.

* A ton of code was deleted :) The patch adds a bunch too, but that's mostly
  getting the old stuff to flow with -write_begin. Some assumptions about
  the size of the write that were made with my previous implemenation were
  no longer true (this is good).

* I could probably clean this up some more, but I'd be fine if the patch
  went upstream as-is. Diff seems to have mangled this patch file enough
  that it's probably much easier to read once applied.

* This doesn't use -perform_write (yet), so stuff is still being copied one
  page at a time. I _think_ things are pretty reasonably set up to allow
  larger writes though...
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
[EMAIL PROTECTED]


From: Mark Fasheh [EMAIL PROTECTED]

ocfs2: Convert to new aops

Fix up ocfs2 to use -write_begin and -write_end. This lets us dump a large
amount of code which was implementing our own write path while preserving
the nice locking rules that were gained by moving away from -prepare_write.

Signed-off-by: Mark Fasheh [EMAIL PROTECTED]

---

 fs/ocfs2/aops.c |  777 +++
 fs/ocfs2/aops.h |   52 
 fs/ocfs2/file.c |  246 +
 3 files changed, 452 insertions(+), 623 deletions(-)

62668000c621fd18da6adf52e219f933e337e4bd
diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index b74971e..55100cf 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -677,6 +677,8 @@ int ocfs2_map_page_blocks(struct page *p
 bh = bh-b_this_page, block_start += bsize) {
block_end = block_start + bsize;
 
+   clear_buffer_new(bh);
+
/*
 * Ignore blocks outside of our i/o range -
 * they may belong to unallocated clusters.
@@ -691,9 +693,8 @@ int ocfs2_map_page_blocks(struct page *p
 * For an allocating write with cluster size = page
 * size, we always write the entire page.
 */
-
-   if (buffer_new(bh))
-   clear_buffer_new(bh);
+   if (new)
+   set_buffer_new(bh);
 
if (!buffer_mapped(bh)) {
map_bh(bh, inode-i_sb, *p_blkno);
@@ -754,217 +755,187 @@ next_bh:
return ret;
 }
 
+#if (PAGE_CACHE_SIZE = OCFS2_MAX_CLUSTERSIZE)
+#define OCFS2_MAX_CTXT_PAGES   1
+#else
+#define OCFS2_MAX_CTXT_PAGES   (OCFS2_MAX_CLUSTERSIZE / PAGE_CACHE_SIZE)
+#endif
+
+#define OCFS2_MAX_CLUSTERS_PER_PAGE(PAGE_CACHE_SIZE / 
OCFS2_MIN_CLUSTERSIZE)
+
 /*
- * This will copy user data from the buffer page in the splice
- * context.
- *
- * For now, we ignore SPLICE_F_MOVE as that would require some extra
- * communication out all the way to ocfs2_write().
+ * Describe the state of a single cluster to be written to.
  */
-int ocfs2_map_and_write_splice_data(struct inode *inode,
- struct ocfs2_write_ctxt *wc, u64 *p_blkno,
- unsigned int *ret_from, unsigned int *ret_to)
-{
-   int ret;
-   unsigned int to, from, cluster_start, cluster_end;
-   char *src, *dst;
-   struct ocfs2_splice_write_priv *sp = wc-w_private;
-   struct pipe_buffer *buf = sp-s_buf;
-   unsigned long bytes, src_from;
-   struct ocfs2_super *osb = OCFS2_SB(inode-i_sb);
+struct ocfs2_write_cluster_desc {
+   u32 c_cpos;
+   u32 c_phys;
+   /*
+* Give this a unique field because c_phys eventually gets
+* filled.
+*/
+   unsignedc_new;   
+};
 
-   ocfs2_figure_cluster_boundaries(osb, wc-w_cpos, cluster_start,
-   cluster_end);
+struct ocfs2_write_ctxt {
+   /* Logical cluster position / len of write */
+   u32 w_cpos;
+   u32 w_clen;
 
-   from = sp-s_offset;
-   src_from = sp-s_buf_offset;
-   bytes = wc-w_count;
+   struct ocfs2_write_cluster_desc 

Re: AppArmor FAQ

2007-04-16 Thread James Morris
On Mon, 16 Apr 2007, John Johansen wrote:

 Label-based security (exemplified by SELinux, and its predecessors in
 MLS systems) attaches security policy to the data. As the data flows
 through the system, the label sticks to the data, and so security
 policy with respect to this data stays intact. This is a good approach
 for ensuring secrecy, the kind of problem that intelligence agencies have.

Labels are also a good approach for ensuring integrity, which is one of 
the most fundamental aspects of the security model implemented by SELinux.  

Some may infer otherwise from your document.

 Pathname-based security (exemplified in AppArmor, and its predecessor
 Janus http://www.cs.berkeley.edu/~daw/janus/ and other systems like
 Systrace http://www.citi.umich.edu/u/provos/systrace/ ) attach security
 policy to the name of the data.
 
 Controlling access to filenames is important because applications
 primarily use those names to access the files behind them, and they
 depend on getting to the right files. For example, login(1) expects
 /etc/passwd to resolve to a valid list of user accounts.

And it should, but alas may instead find otherwise due to namespace 
manipulation, object aliasing (e.g. symlinks), application error, 
configuration error, corrupted files, corrupted filesystems, misbehavior 
due to malware infection or various forms user error.

A pathname tells you nothing reliable about the security properties of the 
object its pointing to.  It is simply a mechanism for locating and 
referring to an object.

 In the traditional UNIX model, files do have names but not labels, and 
 applications only operate in terms of those names.

Just to be clear (as the above conflates two distinct notions):  
applications under SELinux still use pathnames for locating and referring 
to files.

SELinux security is enforced within the kernel, and an application which 
does not have permission to access an object will simply receive an error 
using the standard Unix mechanisms already used for DAC.  For example, a 
write(2) might fail with an EACCES error code.

The pathname used by an application to access an object has _nothing_ to 
do with the security attributes of the object.

Traditional Unix security in fact does not primarily depend on pathnames, 
but on DAC ownership and permission attributes stored in the file's inode.

DAC is of course a form of labeled security.

Imagine if you were re-inventing Unix and decided to implement pathname 
security for DAC instead of inode labeling.  What you would have is a more 
generalized version of apparmor, with the DAC attributes of pathnames for 
the entire filesystem stored in a text database with an in-kernel regex 
engine performing path reconstruction and pattern matching on every file 
access.  Sound like a good idea?  I hope not.

How about an analogy: think of kernel objects which are protected by 
locks.  Do you lock the path to the object or do you lock the object 
itself?


 Pathname-based security puts more emphasis on the integrity of the 
 system, making secrecy the secondary goal that follows.

This assertion is being made without any supporting evidence or rationale.  

If you're comparing pathname vs. label security, then it is clear that 
direct object labeling allows the security attributes of the system to be 
specified completely and unambiguously, whereas integrity enforced via 
pathnames alone requires several constraints to be applied to the goals of 
the policy.  So, it seems to me that the opposite of what you say is more 
correct, although it is a fairly oblique argument to start with.

More significant to note is that Type Enforcement was designed 
specifically to address integrity requirements, in response to the 
limitations of the early MLS models which were focused on confidentiality.

See:

A Practical Alternative to Hierarchical Integrity Policies
Boebert  Kain, Proceedings of the Eighth National Computer Security 
Conference, 1985.

Meeting Critical Security Objectives with Security-Enhanced Linux
http://www.nsa.gov/selinux/papers/ottawa01/index.html

Or pretty much any paper on the design of SELinux or Flask.

Integrity control is a foundational aspect of TE, Flask and SELinux.  

I've never understood why AppArmor presentations tend to so bizarrely 
suggest the opposite.

 Caveat: Both label-based security and pathname-based security can
 provide both secrecy and integrity protection, the above discussion is
 only about which model makes it easier to provide which kind of security.

I don't see how you've established anything in this regard.

 We acknowledge that not all objects on a UNIX system are paths, and we
 agree that there is value in also protecting non-path resources.
 Contrary to popular belief, AppArmor is *not* Pathnames R Us, but
 rather Use native abstractions to mediate stuff:  when you mediate
 something, you should use the native syntax that users normally use to
 access the object. This follows the UNIX 

Re: 2.6.21-rc6 new aops patchset

2007-04-16 Thread Nick Piggin
On Mon, Apr 16, 2007 at 05:19:32PM -0700, Mark Fasheh wrote:
 On Thu, Apr 12, 2007 at 06:48:52AM +0200, Nick Piggin wrote:
  http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
  
  2.6.21-rc6-new-aops*
  
  New aops patchset against 2.6.21-rc6.
 
 Ok, here's an ocfs2 patch against -mm / ocfs2.git ALL. It makes use of the
 context back pointer to store information related to the write which the vfs
 normally doesn't know about. Most importantly this is an array of zero'd
 pages which might have to be written out for an allocating write. Of note is
 that I also stick the journal handle on there. Ocfs2 could use
 current-journal_info for that, but I think it's much cleaner to just pass
 that around as a file system specific context.
 
 I tested this on a couple nodes and things seem to be running smoothly.

Thanks Mark, I'll let you know if I have any problems with it. I guess
the plan will be to rebase the new-aops patchset on -mm at some point
soon, but we'll have to work out where in -mm Andrew wants it and other
details like when to merge it... I'll try to bring Andrew into the public
discussion for that.


 A couple of notes:
 
 * The ocfs2 write context is probably a bit big. I'm much more concerned
   with readability though as Ocfs2 has much more baggage to carry around
   than other file systems.

I guess your high performance write path could be -perform_write in
future anyway, which could keep that on the stack.

I'll possibly omit the perform_write stuff in the first -mm merge, so
that we can get the basics reviewed and working, and exercise the
write_begin/write_end paths well first.

The bulk interface will come, and I think it is in much better shape to
be implemented after various things like iov iterator, the offset,length
based API rather than page based. Whether we introduce it by moving the
iov_iter and some of the more generic checks further down the stack first,
or introduce it as -perform_write aop first probably doesn't matter so
much: the conversion between one and the other is trivial compared to
implementing it in the first place, so nobody should worry that we haven't
sorted this out yet.

 
 * A ton of code was deleted :) The patch adds a bunch too, but that's mostly
   getting the old stuff to flow with -write_begin. Some assumptions about
   the size of the write that were made with my previous implemenation were
   no longer true (this is good).
 
 * I could probably clean this up some more, but I'd be fine if the patch
   went upstream as-is. Diff seems to have mangled this patch file enough
   that it's probably much easier to read once applied.

OK, thanks again. I'll ping you again in the next couple of days to
discuss merge plans.

Nick

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html