Re: [AppArmor 39/41] AppArmor: Profile loading and manipulation, pathname matching
It's nice to check for consistency though, so we're adding that. Profile loading is a trusted operation, at least so far, and so security wise we don't actually have to care --- if loading an invalid profile can bring down the system, then that's no worse than an arbitrary module that crashes the machine. Not sure if there will ever be user loadable profiles; at least at that point we had to care. A security system that allows to crash the kernel is a little weird though. It would be better to check. Not that a recursion check is particularly expensive. -Andi - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
Hi Andreas, --On 12 April 2007 5:05:50 AM -0600 Andreas Dilger [EMAIL PROTECTED] wrote: I'm interested in getting input for implementing an ioctl to efficiently map file extents holes (FIEMAP) instead of looping over FIBMAP a billion times. ... I had come up with a plan independently and was also steered toward XFS_IOC_GETBMAP* ioctls which are in fact very similar to my original plan, though I think the XFS structs used there are a bit bloated. They certainly seem to be (combining entries and header). struct fibmap_extent { __u64 fe_start; /* starting offset in bytes */ __u64 fe_len; /* length in bytes */ } struct fibmap { struct fibmap_extent fm_start; /* offset, length of desired mapping */ __u32 fm_extent_count; /* number of extents in array */ __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */ __u64 unused; struct fibmap_extent fm_extents[0]; } # define FIEMAP_LEN_MASK0xff # define FIEMAP_LEN_HOLE0x01 # define FIEMAP_LEN_UNWRITTEN 0x02 All offsets are in bytes to allow cases where filesystems are not going block-aligned/sized allocations (e.g. tail packing). The fm_extents array returned contains the packed list of allocation extents for the file, including entries for holes (which have fe_start == 0, and a flag). The -fm_extents[] array includes all of the holes in addition to allocated extents because this avoids the need to return both the logical and physical address for every extent and does not make processing any harder. Well, that's what stood out for me. I was wondering where the fe_block field had gone - the physical address. So is your fe_start; /* starting offset */ actually the disk location (not a logical file offset) _except_ in the header (fibmap) where it is the desired logical offset. Okay, looking at your example use below that's what it looks like. And when you refer to fm_start below, you mean fm_start.fe_start? Sorry, I realise this is just an approximation but this part confused me. So you get rid of all the logical file offsets in the extents because we report holes explicitly (and we know everything is contiguous if you include the holes). --Tim Caller works something like: char buf[4096]; struct fibmap *fm = (struct fibmap *)buf; int count = (sizeof(buf) - sizeof(*fm)) / sizeof(fm_extent); fm-fm_extent.fe_start = 0; /* start of file */ fm-fm_extent.fe_len = -1; /* end of file */ fm-fm_extent_count = count; /* max extents in fm_extents[] array */ fm-fm_flags = 0;/* maybe no DMAPI, etc like XFS */ fd = open(path, O_RDONLY); printf(logical\t\tphysical\t\tbytes\n); /* The last entry will have less extents than the maximum */ while (fm-fm_extent_count == count) { rc = ioctl(fd, FIEMAP, fm); if (rc) break; /* kernel filled in fm_extents[] array, set fm_extent_count * to be actual number of extents returned, leaves fm_start * alone (unlike XFS_IOC_GETBMAP). */ for (i = 0; i fm-fm_extent_count; i++) { __u64 len = fm-fm_extents[i].fe_len FIEMAP_LEN_MASK; __u64 fm_next = fm-fm_start + len; int hole = fm-fm_extents[i].fe_len FIEMAP_LEN_HOLE; int unwr = fm-fm_extents[i].fe_len FIEMAP_LEN_UNWRITTEN; printf(%llu-%llu\t%llu-%llu\t%llu\t%s%s\n, fm-fm_start, fm_next - 1, hole ? 0 : fm-fm_extents[i].fe_start, hole ? 0 : fm-fm_extents[i].fe_start + fm-fm_extents[i].fe_len - 1, len, hole ? (hole) : , unwr ? (unwritten) : ); /* get ready for printing next extent, or next ioctl */ fm-fm_start = fm_next; } } - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 0/8] unprivileged mount syscall
On Fri, 2007-04-13 at 13:58 +0200, Miklos Szeredi wrote: On Wed, 2007-04-11 at 12:44 +0200, Miklos Szeredi wrote: 1. clone the master namespace. 2. in the new namespace move the tree under /share/$me to / for each ($user, $what, $how) { move /share/$user/$what to /$what if ($how == slave) { make the mount tree under /$what as slave } } 3. in the new namespace make the tree under /share as private and unmount /share Thanks. I get the basic idea now: the namespace itself need not be shared between the sessions, it is enough if share propagation is set up between the different namespaces of a user. I don't yet see either in your or Viro's description how the trees under /share/$USER are initialized. I guess they are recursively bound from /, and are made slaves. yes. I suppose, when a userid is created one of the steps would be mount --rbind / /share/$USER mount --make-rslave /share/$USER mount --make-rshared /share/$USER Thinking a bit more about this, I'm quite sure most users wouldn't even want private namespaces. It would be enough to chroot /share/$USER and be done with it. Private namespaces are only good for keeping a bunch of mounts referenced by a group of processes. But my guess is, that the natural behavior for users is to see a persistent set of mounts. If for example they mount something on a remote machine, then log out from the ssh session and later log back in, they would want to see their previous mount still there. They will continue see their previous mount tree. Even if all the namespaces belonging to the different sessions of the user get dismantled when all the sessions exit, the a mirror of those mount trees continue to exist under /share/$USER in the original namespace. So I don't think we have a issue. NOTE: when I say 'original namespace' I mean the admin namespace; the first namespace that gets created when the machine boots. RP Miklos - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 0/8] unprivileged mount syscall
On Fri, 2007-04-13 at 16:05 +0200, Miklos Szeredi wrote: Thinking a bit more about this, I'm quite sure most users wouldn't even want private namespaces. It would be enough to chroot /share/$USER and be done with it. Private namespaces are only good for keeping a bunch of mounts referenced by a group of processes. But my guess is, that the natural behavior for users is to see a persistent set of mounts. If for example they mount something on a remote machine, then log out from the ssh session and later log back in, they would want to see their previous mount still there. Miklos Agreed on desired behavior, but not on chroot sufficing. It actually sounds like you want exactly what was outlined in the OLS paper. Users still need to be in a different mounts namespace from the admin user so long as we consider the deluser and backup problems I don't think it matters, because /share/$USER duplicates a part or the whole of the user's namespace. So backup would have to be taught about /share anyway, and deluser operates on /home/$USER and not on /share/*, so there shouldn't be any problem. There's actually very little difference between rbind+chroot, and CLONE_NEWNS. In a private namespace: 1) when no more processes reference the namespace, the tree will be disbanded 2) the mount tree won't be accessible from outside the namespace Wanting a persistent namespace contradicts 1). Wanting a per-user (as opposed to per-session) namespace contradicts 2). The namespace _has_ to be accessible from outside, so that a new session can access/copy it. As i mentioned in the previous mail, disbanding all the namespaces of a user will not disband his mount tree, because a mirror of the mount tree still continues to exist in /share/$USER in the admin namespace. And a new user session can always use this copy to create a namespace that looks identical to that which existed earlier. So both requirements point to the rbind/chroot solution. Arn't there ways to escape chroot jails? Serge had pointed me to a URL which showed chroots can be escaped. And if that is true than having all user's private mount tree in the same namespace can be a security issue? RP Miklos - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag
Serge E. Hallyn [EMAIL PROTECTED] writes: Quoting Miklos Szeredi ([EMAIL PROTECTED]): From: Miklos Szeredi [EMAIL PROTECTED] If CLONE_NEWNS and CLONE_NEWNS_USERMNT are given to clone(2) or unshare(2), then allow user mounts within the new namespace. This is not flexible enough, because user mounts can't be enabled for the initial namespace. The remaining clone bits also getting dangerously few... Alternatives are: - prctl() flag - setting through the containers filesystem Sorry, I know I had mentioned it, but this is definately my least favorite approach. Curious whether are any other suggestions/opinions from the containers list? Given the existence of shared subtrees allowing/denying this at the mount namespace level is silly and wrong. If we need more than just the filesystem permission checks can we make it a mount flag settable with mount and remount that allows non-privileged users the ability to create mount points under it in directories they have full read/write access to. Also for bind-mount and remount operations the flag has to be propagated down its propagation tree. Otherwise a unpriviledged mount in a shared mount wont get reflected in its peers and slaves, leading to unidentical shared-subtrees. RP I don't like the use of clone flags for this purpose but in this case the shared subtress are a much more fundamental reasons for not doing this at the namespace level. Eric ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 0/8] unprivileged mount syscall
Arn't there ways to escape chroot jails? Serge had pointed me to a URL which showed chroots can be escaped. And if that is true than having all user's private mount tree in the same namespace can be a security issue? No. In fact chrooting the user into /share/$USER will actually _grant_ a privilege to the user, instead of taking it away. It allows the user to modify it's root namespace, which it wouldn't be able to in the initial namespace. So even if the user could escape from the chroot (which I doubt), s/he would not be able to do any harm, since unprivileged mounting would be restricted to /share. Also /share/$USER should only have read/search permission for $USER or no permissions at all, which would mean, that other users' namespaces would be safe from tampering as well. Miklos - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag
On Mon, 2007-04-16 at 11:32 +0200, Miklos Szeredi wrote: Given the existence of shared subtrees allowing/denying this at the mount namespace level is silly and wrong. If we need more than just the filesystem permission checks can we make it a mount flag settable with mount and remount that allows non-privileged users the ability to create mount points under it in directories they have full read/write access to. Also for bind-mount and remount operations the flag has to be propagated down its propagation tree. Otherwise a unpriviledged mount in a shared mount wont get reflected in its peers and slaves, leading to unidentical shared-subtrees. That's an interesting question. Do we want shared mounts to be totally identical, including mnt_flags? It doesn't look as if do_remount() guarantees that currently. Depends on the semantics of each of the flags. Some flags like of the read/write flag, would not interfere with the propagation semantics AFAICT. But this one certainly seems to interfere. RP Miklos - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 01/10] add user mounts to the kernel
From: Miklos Szeredi [EMAIL PROTECTED] Add ownership information to mounts. A new mount flag, MS_SETUSER is used to make a mount owned by a user. If this flag is specified, then the owner will be set to the current real user id and the mount will be marked with the MNT_USER flag. On remount don't preserve previous owner, and treat MS_SETUSER as for a new mount. The MS_SETUSER flag is ignored on mount move. The MNT_USER flag is not copied on any kind of mount cloning: namespace creation, binding or propagation. For bind mounts the cloned mount(s) are set to MNT_USER depending on the MS_SETUSER mount flag. In all the other cases MNT_USER is always cleared. For MNT_USER mounts a user=UID option is added to /proc/PID/mounts. This is compatible with how mount ownership is stored in /etc/mtab. It is expected, that in the future mount(8) will use MS_SETUSER to store mount ownership within the kernel. This would help in situations, where /etc/mtab is difficult or impossible to work with, e.g. when using mount propagation. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] --- Index: linux/fs/namespace.c === --- linux.orig/fs/namespace.c 2007-04-13 12:24:15.0 +0200 +++ linux/fs/namespace.c2007-04-13 13:03:44.0 +0200 @@ -227,6 +227,13 @@ static struct vfsmount *skip_mnt_tree(st return p; } +static void set_mnt_user(struct vfsmount *mnt) +{ + BUG_ON(mnt-mnt_flags MNT_USER); + mnt-mnt_uid = current-uid; + mnt-mnt_flags |= MNT_USER; +} + static struct vfsmount *clone_mnt(struct vfsmount *old, struct dentry *root, int flag) { @@ -241,6 +248,11 @@ static struct vfsmount *clone_mnt(struct mnt-mnt_mountpoint = mnt-mnt_root; mnt-mnt_parent = mnt; + /* don't copy the MNT_USER flag */ + mnt-mnt_flags = ~MNT_USER; + if (flag CL_SETUSER) + set_mnt_user(mnt); + if (flag CL_SLAVE) { list_add(mnt-mnt_slave, old-mnt_slave_list); mnt-mnt_master = old; @@ -403,6 +415,8 @@ static int show_vfsmnt(struct seq_file * if (mnt-mnt_flags fs_infop-flag) seq_puts(m, fs_infop-str); } + if (mnt-mnt_flags MNT_USER) + seq_printf(m, ,user=%i, mnt-mnt_uid); if (mnt-mnt_sb-s_op-show_options) err = mnt-mnt_sb-s_op-show_options(m, mnt); seq_puts(m, 0 0\n); @@ -920,8 +934,9 @@ static int do_change_type(struct nameida /* * do loopback mount. */ -static int do_loopback(struct nameidata *nd, char *old_name, int recurse) +static int do_loopback(struct nameidata *nd, char *old_name, int flags) { + int clone_flags; struct nameidata old_nd; struct vfsmount *mnt = NULL; int err = mount_is_safe(nd); @@ -941,11 +956,12 @@ static int do_loopback(struct nameidata if (!check_mnt(nd-mnt) || !check_mnt(old_nd.mnt)) goto out; + clone_flags = (flags MS_SETUSER) ? CL_SETUSER : 0; err = -ENOMEM; - if (recurse) - mnt = copy_tree(old_nd.mnt, old_nd.dentry, 0); + if (flags MS_REC) + mnt = copy_tree(old_nd.mnt, old_nd.dentry, clone_flags); else - mnt = clone_mnt(old_nd.mnt, old_nd.dentry, 0); + mnt = clone_mnt(old_nd.mnt, old_nd.dentry, clone_flags); if (!mnt) goto out; @@ -987,8 +1003,11 @@ static int do_remount(struct nameidata * down_write(sb-s_umount); err = do_remount_sb(sb, flags, data, 0); - if (!err) + if (!err) { nd-mnt-mnt_flags = mnt_flags; + if (flags MS_SETUSER) + set_mnt_user(nd-mnt); + } up_write(sb-s_umount); if (!err) security_sb_post_remount(nd-mnt, flags, data); @@ -1093,10 +1112,13 @@ static int do_new_mount(struct nameidata if (!capable(CAP_SYS_ADMIN)) return -EPERM; - mnt = do_kern_mount(type, flags, name, data); + mnt = do_kern_mount(type, flags ~MS_SETUSER, name, data); if (IS_ERR(mnt)) return PTR_ERR(mnt); + if (flags MS_SETUSER) + set_mnt_user(mnt); + return do_add_mount(mnt, nd, mnt_flags, NULL); } @@ -1127,7 +1149,8 @@ int do_add_mount(struct vfsmount *newmnt if (S_ISLNK(newmnt-mnt_root-d_inode-i_mode)) goto unlock; - newmnt-mnt_flags = mnt_flags; + /* MNT_USER was set earlier */ + newmnt-mnt_flags |= mnt_flags; if ((err = graft_tree(newmnt, nd))) goto unlock; @@ -1447,7 +1470,7 @@ long do_mount(char *dev_name, char *dir_ retval = do_remount(nd, flags ~MS_REMOUNT, mnt_flags, data_page); else if
[patch 02/10] allow unprivileged umount
From: Miklos Szeredi [EMAIL PROTECTED] The owner doesn't need sysadmin capabilities to call umount(). Similar behavior as umount(8) on mounts having user=UID option in /etc/mtab. The difference is that umount also checks /etc/fstab, presumably to exclude another mount on the same mountpoint. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] --- Index: linux/fs/namespace.c === --- linux.orig/fs/namespace.c 2007-04-11 20:07:51.0 +0200 +++ linux/fs/namespace.c2007-04-11 20:08:05.0 +0200 @@ -659,6 +659,25 @@ static int do_umount(struct vfsmount *mn } /* + * umount is permitted for + * - sysadmin + * - mount owner, if not forced umount + */ +static bool permit_umount(struct vfsmount *mnt, int flags) +{ + if (capable(CAP_SYS_ADMIN)) + return true; + + if (!(mnt-mnt_flags MNT_USER)) + return false; + + if (flags MNT_FORCE) + return false; + + return mnt-mnt_uid == current-uid; +} + +/* * Now umount can handle mount points as well as block devices. * This is important for filesystems which use unnamed block devices. * @@ -681,7 +700,7 @@ asmlinkage long sys_umount(char __user * goto dput_and_out; retval = -EPERM; - if (!capable(CAP_SYS_ADMIN)) + if (!permit_umount(nd.mnt, flags)) goto dput_and_out; retval = do_umount(nd.mnt, flags); -- - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 03/10] account user mounts
From: Miklos Szeredi [EMAIL PROTECTED] Add sysctl variables for accounting and limiting the number of user mounts. The maximum number of user mounts is set to 1024 by default. This won't in itself enable user mounts, setting the permit user submount mount flag will also be needed. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] --- Index: linux/include/linux/sysctl.h === --- linux.orig/include/linux/sysctl.h 2007-04-11 18:27:46.0 +0200 +++ linux/include/linux/sysctl.h2007-04-11 20:08:16.0 +0200 @@ -818,6 +818,8 @@ enum FS_AIO_NR=18, /* current system-wide number of aio requests */ FS_AIO_MAX_NR=19, /* system-wide maximum number of aio requests */ FS_INOTIFY=20, /* inotify submenu */ + FS_NR_USER_MOUNTS=21, /* int:current number of user mounts */ + FS_MAX_USER_MOUNTS=22, /* int:maximum number of user mounts */ FS_OCFS2=988, /* ocfs2 */ }; Index: linux/kernel/sysctl.c === --- linux.orig/kernel/sysctl.c 2007-04-11 18:27:46.0 +0200 +++ linux/kernel/sysctl.c 2007-04-11 20:08:16.0 +0200 @@ -1063,6 +1063,22 @@ static ctl_table fs_table[] = { #endif #endif { + .ctl_name = FS_NR_USER_MOUNTS, + .procname = nr_user_mounts, + .data = nr_user_mounts, + .maxlen = sizeof(int), + .mode = 0444, + .proc_handler = proc_dointvec, + }, + { + .ctl_name = FS_MAX_USER_MOUNTS, + .procname = max_user_mounts, + .data = max_user_mounts, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec, + }, + { .ctl_name = KERN_SETUID_DUMPABLE, .procname = suid_dumpable, .data = suid_dumpable, Index: linux/Documentation/filesystems/proc.txt === --- linux.orig/Documentation/filesystems/proc.txt 2007-04-11 18:27:44.0 +0200 +++ linux/Documentation/filesystems/proc.txt2007-04-12 13:32:14.0 +0200 @@ -923,6 +923,15 @@ reaches aio-max-nr then io_setup will fa raising aio-max-nr does not result in the pre-allocation or re-sizing of any kernel data structures. +nr_user_mounts and max_user_mounts +-- + +These represent the number of user mounts and the maximum number of +user mounts respectively. User mounts may be created by +unprivileged users. User mounts may also be created with sysadmin +privileges on behalf of a user, in which case nr_user_mounts may +exceed max_user_mounts. + 2.2 /proc/sys/fs/binfmt_misc - Miscellaneous binary formats --- Index: linux/fs/namespace.c === --- linux.orig/fs/namespace.c 2007-04-11 20:08:05.0 +0200 +++ linux/fs/namespace.c2007-04-12 13:29:05.0 +0200 @@ -39,6 +39,9 @@ static int hash_mask __read_mostly, hash static struct kmem_cache *mnt_cache __read_mostly; static struct rw_semaphore namespace_sem; +int nr_user_mounts; +int max_user_mounts = 1024; + /* /sys/fs */ decl_subsys(fs, NULL, NULL); EXPORT_SYMBOL_GPL(fs_subsys); @@ -227,11 +230,30 @@ static struct vfsmount *skip_mnt_tree(st return p; } +static void dec_nr_user_mounts(void) +{ + spin_lock(vfsmount_lock); + nr_user_mounts--; + spin_unlock(vfsmount_lock); +} + static void set_mnt_user(struct vfsmount *mnt) { BUG_ON(mnt-mnt_flags MNT_USER); mnt-mnt_uid = current-uid; mnt-mnt_flags |= MNT_USER; + spin_lock(vfsmount_lock); + nr_user_mounts++; + spin_unlock(vfsmount_lock); +} + +static void clear_mnt_user(struct vfsmount *mnt) +{ + if (mnt-mnt_flags MNT_USER) { + mnt-mnt_uid = 0; + mnt-mnt_flags = ~MNT_USER; + dec_nr_user_mounts(); + } } static struct vfsmount *clone_mnt(struct vfsmount *old, struct dentry *root, @@ -283,6 +305,7 @@ static inline void __mntput(struct vfsmo { struct super_block *sb = mnt-mnt_sb; dput(mnt-mnt_root); + clear_mnt_user(mnt); free_vfsmnt(mnt); deactivate_super(sb); } @@ -1023,6 +1046,7 @@ static int do_remount(struct nameidata * down_write(sb-s_umount); err = do_remount_sb(sb, flags, data, 0); if (!err) { + clear_mnt_user(nd-mnt); nd-mnt-mnt_flags = mnt_flags; if (flags MS_SETUSER) set_mnt_user(nd-mnt); Index: linux/include/linux/fs.h
[patch 08/10] put declaration of put_filesystem() in fs.h
From: Miklos Szeredi [EMAIL PROTECTED] Declarations go into headers. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] --- Index: linux/fs/super.c === --- linux.orig/fs/super.c 2007-04-13 12:26:11.0 +0200 +++ linux/fs/super.c2007-04-13 13:25:40.0 +0200 @@ -40,10 +40,6 @@ #include asm/uaccess.h -void get_filesystem(struct file_system_type *fs); -void put_filesystem(struct file_system_type *fs); -struct file_system_type *get_fs_type(const char *name); - LIST_HEAD(super_blocks); DEFINE_SPINLOCK(sb_lock); Index: linux/include/linux/fs.h === --- linux.orig/include/linux/fs.h 2007-04-13 13:23:22.0 +0200 +++ linux/include/linux/fs.h2007-04-13 13:25:40.0 +0200 @@ -1922,6 +1922,8 @@ extern int vfs_fstat(unsigned int, struc extern int vfs_ioctl(struct file *, unsigned int, unsigned int, unsigned long); +extern void get_filesystem(struct file_system_type *fs); +extern void put_filesystem(struct file_system_type *fs); extern struct file_system_type *get_fs_type(const char *name); extern struct super_block *get_super(struct block_device *); extern struct super_block *user_get_super(dev_t); -- - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 06/10] propagate error values from clone_mnt
From: Miklos Szeredi [EMAIL PROTECTED] Allow clone_mnt() to return errors other than ENOMEM. This will be used for returning a different error value when the number of user mounts goes over the limit. Fix copy_tree() to return EPERM for unbindable mounts. Don't propagate further from dup_mnt_ns() as that copy_tree() can only fail with -ENOMEM. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] --- Index: linux/fs/namespace.c === --- linux.orig/fs/namespace.c 2007-04-13 13:22:17.0 +0200 +++ linux/fs/namespace.c2007-04-13 13:25:35.0 +0200 @@ -261,42 +261,42 @@ static struct vfsmount *clone_mnt(struct { struct super_block *sb = old-mnt_sb; struct vfsmount *mnt = alloc_vfsmnt(old-mnt_devname); + if (!mnt) + return ERR_PTR(-ENOMEM); - if (mnt) { - mnt-mnt_flags = old-mnt_flags; - atomic_inc(sb-s_active); - mnt-mnt_sb = sb; - mnt-mnt_root = dget(root); - mnt-mnt_mountpoint = mnt-mnt_root; - mnt-mnt_parent = mnt; - - /* don't copy the MNT_USER flag */ - mnt-mnt_flags = ~MNT_USER; - if (flag CL_SETUSER) - set_mnt_user(mnt); - - if (flag CL_SLAVE) { - list_add(mnt-mnt_slave, old-mnt_slave_list); - mnt-mnt_master = old; - CLEAR_MNT_SHARED(mnt); - } else { - if ((flag CL_PROPAGATION) || IS_MNT_SHARED(old)) - list_add(mnt-mnt_share, old-mnt_share); - if (IS_MNT_SLAVE(old)) - list_add(mnt-mnt_slave, old-mnt_slave); - mnt-mnt_master = old-mnt_master; - } - if (flag CL_MAKE_SHARED) - set_mnt_shared(mnt); + mnt-mnt_flags = old-mnt_flags; + atomic_inc(sb-s_active); + mnt-mnt_sb = sb; + mnt-mnt_root = dget(root); + mnt-mnt_mountpoint = mnt-mnt_root; + mnt-mnt_parent = mnt; + + /* don't copy the MNT_USER flag */ + mnt-mnt_flags = ~MNT_USER; + if (flag CL_SETUSER) + set_mnt_user(mnt); - /* stick the duplicate mount on the same expiry list -* as the original if that was on one */ - if (flag CL_EXPIRE) { - spin_lock(vfsmount_lock); - if (!list_empty(old-mnt_expire)) - list_add(mnt-mnt_expire, old-mnt_expire); - spin_unlock(vfsmount_lock); - } + if (flag CL_SLAVE) { + list_add(mnt-mnt_slave, old-mnt_slave_list); + mnt-mnt_master = old; + CLEAR_MNT_SHARED(mnt); + } else { + if ((flag CL_PROPAGATION) || IS_MNT_SHARED(old)) + list_add(mnt-mnt_share, old-mnt_share); + if (IS_MNT_SLAVE(old)) + list_add(mnt-mnt_slave, old-mnt_slave); + mnt-mnt_master = old-mnt_master; + } + if (flag CL_MAKE_SHARED) + set_mnt_shared(mnt); + + /* stick the duplicate mount on the same expiry list +* as the original if that was on one */ + if (flag CL_EXPIRE) { + spin_lock(vfsmount_lock); + if (!list_empty(old-mnt_expire)) + list_add(mnt-mnt_expire, old-mnt_expire); + spin_unlock(vfsmount_lock); } return mnt; } @@ -782,11 +782,11 @@ struct vfsmount *copy_tree(struct vfsmou struct nameidata nd; if (!(flag CL_COPY_ALL) IS_MNT_UNBINDABLE(mnt)) - return NULL; + return ERR_PTR(-EPERM); res = q = clone_mnt(mnt, dentry, flag); - if (!q) - goto Enomem; + if (IS_ERR(q)) + goto error; q-mnt_mountpoint = mnt-mnt_mountpoint; p = mnt; @@ -807,8 +807,8 @@ struct vfsmount *copy_tree(struct vfsmou nd.mnt = q; nd.dentry = p-mnt_mountpoint; q = clone_mnt(p, p-mnt_root, flag); - if (!q) - goto Enomem; + if (IS_ERR(q)) + goto error; spin_lock(vfsmount_lock); list_add_tail(q-mnt_list, res-mnt_list); attach_mnt(q, nd); @@ -816,7 +816,7 @@ struct vfsmount *copy_tree(struct vfsmou } } return res; -Enomem: + error: if (res) { LIST_HEAD(umount_list); spin_lock(vfsmount_lock); @@ -824,7 +824,7 @@ Enomem: spin_unlock(vfsmount_lock); release_mounts(umount_list); } - return
[patch 05/10] Add permit user submounts flag to vfsmount
From: Miklos Szeredi [EMAIL PROTECTED] If MNT_USERMNT flag is not set in the target vfsmount, then unprivileged mounts will be denied. By default this flag is cleared, and can be set on new mounts, on remounts or with the MS_SETFLAGS option. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] --- Index: linux/fs/namespace.c === --- linux.orig/fs/namespace.c 2007-04-13 13:20:12.0 +0200 +++ linux/fs/namespace.c2007-04-13 13:35:40.0 +0200 @@ -411,6 +411,7 @@ static int show_vfsmnt(struct seq_file * { MNT_NOATIME, ,noatime }, { MNT_NODIRATIME, ,nodiratime }, { MNT_RELATIME, ,relatime }, + { MNT_USERMNT, ,usermnt }, { 0, NULL } }; struct proc_fs_info *fs_infop; @@ -1505,9 +1506,11 @@ long do_mount(char *dev_name, char *dir_ mnt_flags |= MNT_NODIRATIME; if (flags MS_RELATIME) mnt_flags |= MNT_RELATIME; + if (flags MS_USERMNT) + mnt_flags |= MNT_USERMNT; flags = ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE | - MS_NOATIME | MS_NODIRATIME | MS_RELATIME); + MS_NOATIME | MS_NODIRATIME | MS_RELATIME | MS_USERMNT); /* ... and get the mountpoint */ retval = path_lookup(dir_name, LOOKUP_FOLLOW, nd); Index: linux/include/linux/mount.h === --- linux.orig/include/linux/mount.h2007-04-13 13:17:08.0 +0200 +++ linux/include/linux/mount.h 2007-04-13 13:22:17.0 +0200 @@ -28,6 +28,7 @@ struct mnt_namespace; #define MNT_NOATIME0x08 #define MNT_NODIRATIME 0x10 #define MNT_RELATIME 0x20 +#define MNT_USERMNT0x40 #define MNT_SHRINKABLE 0x100 #define MNT_USER 0x200 Index: linux/include/linux/fs.h === --- linux.orig/include/linux/fs.h 2007-04-13 13:23:05.0 +0200 +++ linux/include/linux/fs.h2007-04-13 13:35:34.0 +0200 @@ -130,6 +130,7 @@ extern int dir_notify_enable; #define MS_SETFLAGS(123) /* set specified mount flags */ #define MS_CLEARFLAGS (124) /* clear specified mount flags */ /* MS_SETFLAGS | MS_CLEARFLAGS: change mount flags to specified */ +#define MS_USERMNT (125) /* permit unpriv. submounts under this mount */ #define MS_ACTIVE (130) #define MS_NOUSER (131) -- - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On Thu, Apr 12, 2007 at 05:05:50AM -0600, Andreas Dilger wrote: I'm interested in getting input for implementing an ioctl to efficiently map file extents holes (FIEMAP) instead of looping over FIBMAP a billion times. We already have customers with single files in the 10TB range and we additionally need to get the mapping over the network so it needs to be efficient in terms of how data is passed, and how easily it can be extracted from the filesystem. I had come up with a plan independently and was also steered toward XFS_IOC_GETBMAP* ioctls which are in fact very similar to my original plan, though I think the XFS structs used there are a bit bloated. Yeah, they were designed with having a long term stable ABI that limited expandability. Hence the future fields that never got used ;) There was also recent discussion about SEEK_HOLE and SEEK_DATA as implemented by Sun, but even if we could skip the holes we still might need to do millions of FIBMAPs to see how large files are allocated on disk. Conversely, having filesystems implement an efficient FIBMAP ioctl (or -fiemap() method) could in turn be leveraged for SEEK_HOLE and SEEK_DATA instead of doing looping over -bmap() inside the kernel as I saw one patch. Yup. struct fibmap_extent { __u64 fe_start; /* starting offset in bytes */ __u64 fe_len; /* length in bytes */ } struct fibmap { struct fibmap_extent fm_start; /* offset, length of desired mapping */ __u32 fm_extent_count; /* number of extents in array */ __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */ __u64 unused; struct fibmap_extent fm_extents[0]; } #define FIEMAP_LEN_MASK 0xff #define FIEMAP_LEN_HOLE 0x01 #define FIEMAP_LEN_UNWRITTEN 0x02 I'm not sure I like stealing bits from the length to use a flags - I'd prefer an explicit field per fibmap_extent for this. Given that xfs_bmap uses extra information from the filesystem (geometry) to display extra (and frequently used) information about the alignment of extents. ie: chook 681% xfs_bmap -vv fred fred: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS 0: [0..151]:288444888..288445039 8 (1696536..1696687) 152 00010 FLAG Values: 01 Unwritten preallocated extent 001000 Doesn't begin on stripe unit 000100 Doesn't end on stripe unit 10 Doesn't begin on stripe width 01 Doesn't end on stripe width This information could be easily passed up in the flags fields if the filesystem has geometry information (there go 4 more flags ;). Also - what are the explicit sync semantics of this ioctl? The XFS ioctl causes a fsync of the file first to convert delalloc extents to real extents before returning the bmap. Is this functionality going to be the same? If not, then we need a DELALLOC flag to indicate extents that haven't been allocated yet. This might be handy to have, anyway All offsets are in bytes to allow cases where filesystems are not going block-aligned/sized allocations (e.g. tail packing). So it'll be ok for a few years yet ;) The fm_extents array returned contains the packed list of allocation extents for the file, including entries for holes (which have fe_start == 0, and a flag). Internalling in XFS, we pass these around as: #define DELAYSTARTBLOCK ((xfs_fsblock_t)-1LL) #define HOLESTARTBLOCK ((xfs_fsblock_t)-2LL) And the offset passed out through XFS_IOC_GETBMAP[X] is a block number of -1 for the start of a hole. Hence we don't need a flag for this. We can expose delalloc extents like this as well without needing flags... The -fm_extents[] array includes all of the holes in addition to allocated extents because this avoids the need to return both the logical and physical address for every extent and does not make processing any harder. Doesn't really make it any easier to map to disk, either. One feature that XFS_IOC_GETBMAPX has that may be desirable is the ability to return unwritten extent information. You got that with the unwritten flag above. required expanding the per-extent struct from 32 to 48 bytes per extent, not sure I follow your maths here? but I'd rather limit a single extent to e.g. 2^56 bytes (oh, what hardship) and keep 8 bytes or so for input/output flags per extent (would need to ^ bits? be masked before use). Caller works something like: char buf[4096]; struct fibmap *fm = (struct fibmap *)buf; int count = (sizeof(buf) - sizeof(*fm)) / sizeof(fm_extent); fm-fm_extent.fe_start = 0; /* start of file */ fm-fm_extent.fe_len = -1; /* end of file */ fm-fm_extent_count = count; /* max extents in fm_extents[] array */ fm-fm_flags = 0; /* maybe no DMAPI, etc like XFS */ fd =
d_revalidate not being called enough on mountpoints?
Hi Al, I think there might be a problem in the VFS with d_revalidate() not being called enough on mountpoints. As far as I can tell from the printks in my AFS stuff, it's only called on the mounted-on dentry, and not the vfsmount-root dentry. However, with NFS at least (not so much AFS), can you trust that the mount-root dentry still maps to the same inode and event if it does that that inode is still up to date? I discovered it because I was relying on d_revalidate() to spot that the server had broken the callback on a directory that had been changed. However, the root directory of each volume isn't being d_revalidated. In the kernel output, I see: (1) Permission check on the root dir of /afs: volume ID 2001, vnode ID 1: [0ls] == afs_permission({2001:1},1,) (2) Revalidation of the mountpoint dir in /afs (vnode ID 6): [0ls] == afs_d_revalidate({v={2001:6} n=.cambridge.redhat.com fl=20},) [0ls] not promised (3) At this point, the stats of the mounted-on dir are rechecked. [0ls] new promise [fl=20] (4) Permission check on the root dir of /afs/.cambridge.redhat.com (volume ID 2003, vnode ID 1): [0ls] == afs_permission({2003:1},1,) (5) Revalidation of the mountpoint dir in /afs/.cambridge.redhat.com (vnode ID 2): [0ls] == afs_d_revalidate({v={2003:2} n=afsdoc fl=20},) [0ls] not promised (6) Again, the stats of the mounted-on dir are rechecked. [0ls] new promise [fl=20] I was expecting to see the root dentry of each vfsmount be revalidated, but that doesn't occur. David - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: d_revalidate not being called enough on mountpoints?
On Mon, 2007-04-16 at 13:50 +0100, David Howells wrote: Hi Al, I think there might be a problem in the VFS with d_revalidate() not being called enough on mountpoints. As far as I can tell from the printks in my AFS stuff, it's only called on the mounted-on dentry, and not the vfsmount-root dentry. However, with NFS at least (not so much AFS), can you trust that the mount-root dentry still maps to the same inode and event if it does that that inode is still up to date? The NFS mountpoint is _not_ defined in terms of the remote path. Once you have mounted a given directory, the expectation is that that particular directory stays mounted. You cannot allow it to morph into another object just because someone messes with the path on the server. This is very much analogous to the expectation that a bind mount should not morph into a different object just because someone has renamed something in the directory you were binding from. I discovered it because I was relying on d_revalidate() to spot that the server had broken the callback on a directory that had been changed. However, the root directory of each volume isn't being d_revalidated. That sounds like an abuse. You are not revalidating the path itself (which is the purpose of d_revalidate). Instead you are revalidating the directory metadata, which is a very different thing. Cheers Trond - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 0/8] unprivileged mount syscall
Miklos Szeredi [EMAIL PROTECTED] writes: Arn't there ways to escape chroot jails? Serge had pointed me to a URL which showed chroots can be escaped. And if that is true than having all user's private mount tree in the same namespace can be a security issue? No. In fact chrooting the user into /share/$USER will actually _grant_ a privilege to the user, instead of taking it away. It allows the user to modify it's root namespace, which it wouldn't be able to in the initial namespace. So even if the user could escape from the chroot (which I doubt), s/he would not be able to do any harm, since unprivileged mounting would be restricted to /share. Also /share/$USER should only have read/search permission for $USER or no permissions at all, which would mean, that other users' namespaces would be safe from tampering as well. A couple of points. - chroot can be escaped, it is just a chdir for the root directory it is not a security feature. The only security is that you have to be root to call chdir. A carefully done namespace setup won't have that issue. - While it may not violate security as far as what a user is allowed to modify it may violate security as far as what a user is allowed to see. There are interesting per login cases as well such as allowing a user to replicate their mount tree from another machine when they log in. When /home is on a network filesystem this can be very practical and can allow propagation of mounts across machines not just across a single login session. Eric - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag
Miklos Szeredi [EMAIL PROTECTED] writes: That depends. Current patches check the unprivileged submounts allowed under this mount flag only on the requested mount and not on the propagated mounts. Do you see a problem with this? I think privileges of this sort should propagate. If I read what you just said correctly if I have a private mount namespace I won't be able to mount anything unless when it was setup the unprivileged submount command was explicitly set. Eric - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 0/8] unprivileged mount syscall
Arn't there ways to escape chroot jails? Serge had pointed me to a URL which showed chroots can be escaped. And if that is true than having all user's private mount tree in the same namespace can be a security issue? No. In fact chrooting the user into /share/$USER will actually _grant_ a privilege to the user, instead of taking it away. It allows the user to modify it's root namespace, which it wouldn't be able to in the initial namespace. So even if the user could escape from the chroot (which I doubt), s/he would not be able to do any harm, since unprivileged mounting would be restricted to /share. Also /share/$USER should only have read/search permission for $USER or no permissions at all, which would mean, that other users' namespaces would be safe from tampering as well. A couple of points. - chroot can be escaped, it is just a chdir for the root directory it is not a security feature. The only security is that you have to be root to call chdir. A carefully done namespace setup won't have that issue. - While it may not violate security as far as what a user is allowed to modify it may violate security as far as what a user is allowed to see. I think that's just up to the permissions in the global namespace. In this example if you 'chmod 0 /share' there won't be anything for the user to see. There are interesting per login cases as well such as allowing a user to replicate their mount tree from another machine when they log in. When /home is on a network filesystem this can be very practical and can allow propagation of mounts across machines not just across a single login session. Yeah, sounds interesting, but I think it's better to get the basics working first, and then we can start to think about the extras. Btw, there's nothing that prevents cloning the namespace _after_ chrooting into the per-user tree. That would still be simpler than doing it the other way round: first creating per-session namespaces and then setting up mount propagation between them. Miklos - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[nameidata 1/2] Don't pass NULL nameidata to vfs_create
On Thursday 12 April 2007 12:06, Christoph Hellwig wrote: Once again very strong NACK. Every conditional passing of vfsmounts get my veto. As mentioned last time if you really want this send a patch series first that passed the vfsmount consistantly. I don't consider it fair to NACK this patch just because some other part of the kernel uses vfs_create() in the wrong way -- those are really independent issues. The bug is not that hard to fix though; here is a proposed patch on top of the other AppArmor patches. The offenders are nfsd and the mqueue filesystem. Instantiate a struct nameidata there. The .dentry and .mnt fields can easily be filled in in nfsd. The mqueue filesystem is somewhat special: files that mq_open creates are on an internal mount and the mqueue filesystem is not generally visible (it may or may not be mounted). When passing a regular vfsmount to vfs_create the files would appear disconnected while they actually are kernel-internal objects. Use a NULL vfsmount there to avoid that -- passing the kernel-internal vfsmount there wouldn't help, anyway. Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED] --- fs/namei.c|7 --- fs/nfsd/vfs.c | 23 +++ ipc/mqueue.c |6 +- 3 files changed, 28 insertions(+), 8 deletions(-) --- a/fs/namei.c +++ b/fs/namei.c @@ -1503,7 +1503,7 @@ int vfs_create(struct inode *dir, struct return -EACCES; /* shouldn't it be ENOSYS? */ mode = S_IALLUGO; mode |= S_IFREG; - error = security_inode_create(dir, dentry, nd ? nd-mnt : NULL, mode); + error = security_inode_create(dir, dentry, nd-mnt, mode); if (error) return error; DQUOT_INIT(dir); --- a/fs/nfsd/vfs.c +++ b/fs/nfsd/vfs.c @@ -1108,6 +1108,18 @@ nfsd_commit(struct svc_rqst *rqstp, stru } #endif /* CONFIG_NFSD_V3 */ +static inline int +nfsd_do_create(struct inode *dir, struct dentry *child, struct vfsmount *mnt, + int mode) +{ + struct nameidata nd = { + .dentry = child, + .mnt = mnt, + }; + + return vfs_create(dir, child, mode, nd); +} + /* * Create a file (regular, directory, device, fifo); UNIX sockets * not yet implemented. @@ -1192,7 +1204,8 @@ nfsd_create(struct svc_rqst *rqstp, stru err = 0; switch (type) { case S_IFREG: - host_err = vfs_create(dirp, dchild, iap-ia_mode, NULL); + host_err = nfsd_do_create(dirp, dchild, exp-ex_mnt, + iap-ia_mode); break; case S_IFDIR: host_err = vfs_mkdir(dirp, dchild, exp-ex_mnt, iap-ia_mode); @@ -1253,6 +1266,7 @@ nfsd_create_v3(struct svc_rqst *rqstp, s int *truncp, int *created) { struct dentry *dentry, *dchild = NULL; + struct svc_export *exp; struct inode*dirp; __be32 err; int host_err; @@ -1271,6 +1285,7 @@ nfsd_create_v3(struct svc_rqst *rqstp, s goto out; dentry = fhp-fh_dentry; + exp = fhp-fh_export; dirp = dentry-d_inode; /* Get all the sanity checks out of the way before @@ -1288,7 +1303,7 @@ nfsd_create_v3(struct svc_rqst *rqstp, s if (IS_ERR(dchild)) goto out_nfserr; - err = fh_compose(resfhp, fhp-fh_export, dchild, fhp); + err = fh_compose(resfhp, exp, dchild, fhp); if (err) goto out; @@ -1334,13 +1349,13 @@ nfsd_create_v3(struct svc_rqst *rqstp, s goto out; } - host_err = vfs_create(dirp, dchild, iap-ia_mode, NULL); + host_err = nfsd_do_create(dirp, dchild, exp-ex_mnt, iap-ia_mode); if (host_err 0) goto out_nfserr; if (created) *created = 1; - if (EX_ISSYNC(fhp-fh_export)) { + if (EX_ISSYNC(exp)) { err = nfserrno(nfsd_sync_dir(dentry)); /* setattr will sync the child (or not) */ } --- a/ipc/mqueue.c +++ b/ipc/mqueue.c @@ -604,6 +604,10 @@ static int mq_attr_ok(struct mq_attr *at static struct file *do_create(struct dentry *dir, struct dentry *dentry, int oflag, mode_t mode, struct mq_attr __user *u_attr) { + struct nameidata nd = { + .dentry = dentry, + /* Not a regular create, so set .mnt to NULL. */ + }; struct mq_attr attr; int ret; @@ -619,7 +623,7 @@ static struct file *do_create(struct den } mode = ~current-fs-umask; - ret = vfs_create(dir-d_inode, dentry, mode, NULL); + ret = vfs_create(dir-d_inode, dentry, mode, nd); dentry-d_fsdata = NULL; if (ret) goto out; - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [nameidata 1/2] Don't pass NULL nameidata to vfs_create
On Mon, Apr 16, 2007 at 06:11:30PM +0200, Andreas Gruenbacher wrote: +static inline int +nfsd_do_create(struct inode *dir, struct dentry *child, struct vfsmount *mnt, +int mode) +{ + struct nameidata nd = { + .dentry = child, + .mnt = mnt, + }; + + return vfs_create(dir, child, mode, nd); +} + Wouldn't it normally result in fewer instructions (on most architectures ... maybe not x86) to keep the same argument order as vfs_create? ie: static inline int nfsd_do_create(struct inode *dir, struct dentry *child, int mode, struct vfsmount *mnt) - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[nameidata 2/2] Pass no useless nameidata to the create, lookup, and permission IOPs
Here is a patch with request for comment. The create, lookup, and permission inode operations are all passed a full nameidata. This is unfortunate because in nfsd and the mqueue filesystem we must instantiate a struct nameidata, but cannot provide all of the same information of a regular lookup. The unused fields take up space on the stack, but more importantly, it is not obvious which fields have meaningful values and which don't, and so things might easily break. This patch introduces struct nameidata2 with only the fields that make sense independent of an actual lookup, and uses that struct in those places where a full nameidata is not needed. The entire patch is a little big so I'm only including the key parts here. The complete version is here: http://forgeftp.novell.com/apparmor/LKML_Submission-April_07/split-up-nameidata.diff Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED] --- a/include/linux/namei.h +++ b/include/linux/namei.h @@ -14,21 +14,39 @@ struct open_intent { enum { MAX_NESTED_LINKS = 8 }; +/** + * Fields shared between nameidata and nameidata2 -- nameidata2 could + * be embedded in nameidata, but then the vfs code would become + * cluttered with dereferences. + */ +#define __NAMEIDATA2 \ + struct dentry *dentry;\ + struct vfsmount *mnt; \ + unsigned intflags; \ + \ + union { \ + struct open_intent open;\ + } intent; + struct nameidata { - struct dentry *dentry; - struct vfsmount *mnt; + __NAMEIDATA2 struct qstr last; - unsigned intflags; int last_type; unsigneddepth; char *saved_names[MAX_NESTED_LINKS + 1]; +}; - /* Intent data */ - union { - struct open_intent open; - } intent; +struct nameidata2 { + __NAMEIDATA2 }; +#undef __NAMEIDATA2 + +static inline struct nameidata2 *ND2(struct nameidata *nd) +{ + return container_of(nd-dentry, struct nameidata2, dentry); +} + struct path { struct vfsmount *mnt; struct dentry *dentry; @@ -76,10 +94,9 @@ extern void path_release_on_umount(struc extern int __user_path_lookup_open(const char __user *, unsigned lookup_flags, struct nameidata *nd, int open_flags); extern int path_lookup_open(int dfd, const char *name, unsigned lookup_flags, struct nameidata *, int open_flags); -extern struct file *lookup_instantiate_filp(struct nameidata *nd, struct dentry *dentry, - int (*open)(struct inode *, struct file *)); +extern struct file *lookup_instantiate_filp(struct nameidata2 *nd, struct dentry *dentry, int (*open)(struct inode *, struct file *)); extern struct file *nameidata_to_filp(struct nameidata *nd, int flags); -extern void release_open_intent(struct nameidata *); +extern void release_open_intent(struct nameidata2 *); extern struct dentry * lookup_one_len(const char *, struct dentry *, int); --- a/fs/namei.c +++ b/fs/namei.c @@ -225,7 +225,7 @@ int generic_permission(struct inode *ino return -EACCES; } -int permission(struct inode *inode, int mask, struct nameidata *nd) +int permission(struct inode *inode, int mask, struct nameidata2 *nd) { umode_t mode = inode-i_mode; int retval, submask; @@ -278,7 +278,7 @@ int permission(struct inode *inode, int * for filesystem access without changing the normal uids which * are used for other things. */ -int vfs_permission(struct nameidata *nd, int mask) +int vfs_permission(struct nameidata2 *nd, int mask) { return permission(nd-dentry-d_inode, mask, nd); } @@ -366,7 +366,7 @@ void path_release_on_umount(struct namei * release_open_intent - free up open intent resources * @nd: pointer to nameidata */ -void release_open_intent(struct nameidata *nd) +void release_open_intent(struct nameidata2 *nd) { if (nd-intent.open.file-f_path.dentry == NULL) put_filp(nd-intent.open.file); @@ -377,7 +377,7 @@ void release_open_intent(struct nameidat static inline struct dentry * do_revalidate(struct dentry *dentry, struct nameidata *nd) { - int status = dentry-d_op-d_revalidate(dentry, nd); + int status = dentry-d_op-d_revalidate(dentry, ND2(nd)); if (unlikely(status = 0)) { /* * The dentry failed validation. @@ -455,7 +455,7 @@ static int exec_permission_lite(struct i return -EACCES; ok: - return security_inode_permission(inode, MAY_EXEC, nd); + return security_inode_permission(inode, MAY_EXEC, ND2(nd)); } /* @@ -491,7 +491,7 @@ static struct dentry * real_lookup(struc struct dentry * dentry = d_alloc(parent, name); result = ERR_PTR(-ENOMEM); if (dentry) { - result = dir-i_op-lookup(dir, dentry,
Re: [nameidata 2/2] Pass no useless nameidata to the create, lookup, and permission IOPs
On Mon, Apr 16, 2007 at 06:29:20PM +0200, Andreas Gruenbacher wrote: enum { MAX_NESTED_LINKS = 8 }; +/** + * Fields shared between nameidata and nameidata2 -- nameidata2 could + * be embedded in nameidata, but then the vfs code would become + * cluttered with dereferences. you could use an anonymous struct embedeed in both. + */ +#define __NAMEIDATA2 \ + struct dentry *dentry;\ + struct vfsmount *mnt; \ + unsigned intflags; \ + \ + union { \ + struct open_intent open;\ + } intent; Or better just pass argument directly. We really should only pass down the the dentry and the intent to the filesystem. The filesystem has not business looking at mnt in the operations, and the relevant bits of flags (mostly whether it's O_EXCLUSIVE) should be stored in the intent, because that's the only way it should be used. Doing it that way also allows to fix some braindead calling conventions like this one: -int permission(struct inode *inode, int mask, struct nameidata *nd) +int permission(struct inode *inode, int mask, struct nameidata2 *nd) { umode_t mode = inode-i_mode; int retval, submask; @@ -278,7 +278,7 @@ int permission(struct inode *inode, int * for filesystem access without changing the normal uids which * are used for other things. */ static inline struct dentry * do_revalidate(struct dentry *dentry, struct nameidata *nd) Or this one. - result = dir-i_op-lookup(dir, dentry, nd); + result = dir-i_op-lookup(dir, dentry, ND2(nd)); or this one. etc.. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [nameidata 1/2] Don't pass NULL nameidata to vfs_create
On Monday 16 April 2007 18:21, Christoph Hellwig wrote: But anyway, creating fake nameidata structures is not really helpful. If there is a nameidata passed people expect it to be complete, and if you pass them to an LSM people will e.g. try to look into lookup intents. I don't actually agree with that: when nfsd creates a file, it still is a file create no matter where it originates from, and so it does make sense to provide the appropriate intent information too. Struct nameidata contains other crap only needed during an actual lookup too --- that's a mess, and the nameidata2 patch shows one possibility how to deal with it. (It's the cleanest approach I could think of right now without cluttering the vfs code with insane amounts of dereferences.) Thanks, Andreas - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [nameidata 2/2] Pass no useless nameidata to the create, lookup, and permission IOPs
On Mon, 16 Apr 2007 18:29:20 +0200 Andreas Gruenbacher wrote: Here is a patch with request for comment. --- a/include/linux/namei.h +++ b/include/linux/namei.h @@ -14,21 +14,39 @@ struct open_intent { enum { MAX_NESTED_LINKS = 8 }; +/** Please don't use the kernel-doc begin-marker /** when the comment block isn't in kernel-doc format. + * Fields shared between nameidata and nameidata2 -- nameidata2 could + * be embedded in nameidata, but then the vfs code would become + * cluttered with dereferences. + */ +#define __NAMEIDATA2 \ + struct dentry *dentry;\ + struct vfsmount *mnt; \ + unsigned intflags; \ + \ + union { \ + struct open_intent open;\ + } intent; + --- ~Randy *** Remember to use Documentation/SubmitChecklist when testing your code *** - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [nameidata 2/2] Pass no useless nameidata to the create, lookup, and permission IOPs
On Monday 16 April 2007 18:42, Randy Dunlap wrote: Please don't use the kernel-doc begin-marker /** when the comment block isn't in kernel-doc format. Sigh, kernel-doc should improve things, not get in the way ... - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [nameidata 1/2] Don't pass NULL nameidata to vfs_create
On Mon, Apr 16, 2007 at 06:40:41PM +0200, Andreas Gruenbacher wrote: On Monday 16 April 2007 18:21, Christoph Hellwig wrote: But anyway, creating fake nameidata structures is not really helpful. If there is a nameidata passed people expect it to be complete, and if you pass them to an LSM people will e.g. try to look into lookup intents. I don't actually agree with that: when nfsd creates a file, it still is a file create no matter where it originates from, and so it does make sense to provide the appropriate intent information too. Struct nameidata contains other crap only needed during an actual lookup too --- that's a mess, and the You should provide intent information, yes - which your patch didn't :) And yes, I didn't like doing the ugly intent in nameidata hack, it's creating loads of problems for various parties, e.g. the stackable filesystem folks. Now the basic intent in nameidata mistaken has been made even worse by passing back a struct file in it conditionally and doing lots of work in -lookup that shouldn't be there. (Which btw, I expect to cause quite a few problems for apparmor or other lsms, but I guess so far no one has tried them on NFSv4) - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [nameidata 2/2] Pass no useless nameidata to the create, lookup, and permission IOPs
Andreas Gruenbacher wrote: On Monday 16 April 2007 18:42, Randy Dunlap wrote: Please don't use the kernel-doc begin-marker /** when the comment block isn't in kernel-doc format. Sigh, kernel-doc should improve things, not get in the way ... Well, I see it as sort of a language that is embedded in C source code. But what are you suggesting? -- ~Randy *** Remember to use Documentation/SubmitChecklist when testing your code *** - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag
On Mon, 2007-04-16 at 11:56 +0200, Miklos Szeredi wrote: Also for bind-mount and remount operations the flag has to be propagated down its propagation tree. Otherwise a unpriviledged mount in a shared mount wont get reflected in its peers and slaves, leading to unidentical shared-subtrees. That's an interesting question. Do we want shared mounts to be totally identical, including mnt_flags? It doesn't look as if do_remount() guarantees that currently. Depends on the semantics of each of the flags. Some flags like of the read/write flag, would not interfere with the propagation semantics AFAICT. But this one certainly seems to interfere. That depends. Current patches check the unprivileged submounts allowed under this mount flag only on the requested mount and not on the propagated mounts. Do you see a problem with this? Don't see a problem if the flag is propagated to all peers and slave mounts. If not, I see a problem. What if the propagated mount has its flag set to not do un-priviledged mounts, whereas the requested mount has it allowed? RP Miklos - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: How to query mount propagation state?
On Mon, 2007-04-16 at 12:34 +0200, Miklos Szeredi wrote: Currently one of the difficulties with mount propagations is that there's no way to know the current state of the propagation tree. Has anyone thought about how this info could be queried from userspace? I am attaching two patches that I had done way back in Oct 2006 with Al Viro. I had sent these patches to Al Viro. But I forgot to follow them up, I guess so did Al Viro. The first patch disambiguates multiple mount-instances of the same filesystem (or part of the same filesystem), by introducing a new interface /proc/mounts_new. The second patch introduces a new proc interface that exposes all the propagation trees within a namespace. It does not show propagated mounts residing in a different namespace (for privacy reasons). Maybe one could modify the patch a little, to allow it; if the user has root priviledges. RP PS: Sorry these are attachments instead of inline patches. I am scared of inlining in evolution. If needed I can send inline patches through mutt. Thanks, Miklos This patch disambiguates multiple mount-instances of the same filesystem (or part of the same filesystem), by introducing a new interface /proc/mounts_new. The interface has the following format. FSID mntpt root-dentry fstype fs-options NOTE: root-dentry is the path to the dentry w.r.t to the root dentry of the same filesystem. for example: lets say we attempt the following commands mount --bind /var /mnt mount --bind /mnt/tmp /tmp1 'cat /proc/mounts' shows the following: /dev/root /mnt ext2 rw 0 0 /dev/root /tmp1 ext2 rw 0 0 NOTE: The above mount entries, do not indicate that /tmp1 contains the same directory tree as /var/tmp. But 'cat /proc/mounts_new' shows us the following: 0x6200 /mnt /var ext2 rw 0 0 0x6200 /tmp1 /var/tmp ext2 rw 0 0 The above entries clearly indicates that /var/tmp directory of the ext2 filesystem with fsid=0x6200 is the directory tree that resides under /tmp1 Signed-off-by: Ram Pai [EMAIL PROTECTED] --- fs/dcache.c | 53 fs/namespace.c | 35 ++--- fs/proc/base.c | 32 +-- fs/proc/proc_misc.c |1 fs/seq_file.c| 77 ++- include/linux/dcache.h |1 include/linux/seq_file.h |1 7 files changed, 172 insertions(+), 28 deletions(-) Index: linux-2.6.17.10/fs/proc/base.c === --- linux-2.6.17.10.orig/fs/proc/base.c +++ linux-2.6.17.10/fs/proc/base.c @@ -104,6 +104,7 @@ enum pid_directory_inos { PROC_TGID_MAPS, PROC_TGID_NUMA_MAPS, PROC_TGID_MOUNTS, + PROC_TGID_MOUNTS_NEW, PROC_TGID_MOUNTSTATS, PROC_TGID_WCHAN, #ifdef CONFIG_MMU @@ -145,6 +146,7 @@ enum pid_directory_inos { PROC_TID_MAPS, PROC_TID_NUMA_MAPS, PROC_TID_MOUNTS, + PROC_TID_MOUNTS_NEW, PROC_TID_MOUNTSTATS, PROC_TID_WCHAN, #ifdef CONFIG_MMU @@ -203,6 +205,7 @@ static struct pid_entry tgid_base_stuff[ E(PROC_TGID_ROOT, root,S_IFLNK|S_IRWXUGO), E(PROC_TGID_EXE, exe, S_IFLNK|S_IRWXUGO), E(PROC_TGID_MOUNTS,mounts, S_IFREG|S_IRUGO), + E(PROC_TGID_MOUNTS_NEW,mounts_new, S_IFREG|S_IRUGO), E(PROC_TGID_MOUNTSTATS, mountstats, S_IFREG|S_IRUSR), #ifdef CONFIG_MMU E(PROC_TGID_SMAPS, smaps, S_IFREG|S_IRUGO), @@ -246,6 +249,7 @@ static struct pid_entry tid_base_stuff[] E(PROC_TID_ROOT, root,S_IFLNK|S_IRWXUGO), E(PROC_TID_EXE,exe, S_IFLNK|S_IRWXUGO), E(PROC_TID_MOUNTS, mounts, S_IFREG|S_IRUGO), + E(PROC_TID_MOUNTS_NEW, mounts_new, S_IFREG|S_IRUGO), #ifdef CONFIG_MMU E(PROC_TID_SMAPS, smaps, S_IFREG|S_IRUGO), #endif @@ -692,13 +696,13 @@ static struct file_operations proc_smaps }; #endif -extern struct seq_operations mounts_op; struct proc_mounts { struct seq_file m; int event; }; -static int mounts_open(struct inode *inode, struct file *file) +static int __mounts_open(struct inode *inode, struct file *file, + struct seq_operations *mounts_op) { struct task_struct *task = proc_task(inode); struct namespace *namespace; @@ -716,7 +720,7 @@ static int mounts_open(struct inode *ino p = kmalloc(sizeof(struct proc_mounts), GFP_KERNEL); if (p) { file-private_data = p-m; - ret = seq_open(file, mounts_op); + ret = seq_open(file, mounts_op); if (!ret) { p-m.private = namespace; p-event = namespace-event; @@ -729,6 +733,16 @@ static int mounts_open(struct inode *ino return ret; } +extern struct seq_operations mounts_op, mounts_new_op; +static int mounts_open(struct inode *inode, struct file *file) +{ + return (__mounts_open(inode, file, mounts_op)); +} +static int mounts_new_open(struct inode *inode, struct file *file) +{ + return __mounts_open(inode, file, mounts_new_op); +}
Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag
Also for bind-mount and remount operations the flag has to be propagated down its propagation tree. Otherwise a unpriviledged mount in a shared mount wont get reflected in its peers and slaves, leading to unidentical shared-subtrees. That's an interesting question. Do we want shared mounts to be totally identical, including mnt_flags? It doesn't look as if do_remount() guarantees that currently. Depends on the semantics of each of the flags. Some flags like of the read/write flag, would not interfere with the propagation semantics AFAICT. But this one certainly seems to interfere. That depends. Current patches check the unprivileged submounts allowed under this mount flag only on the requested mount and not on the propagated mounts. Do you see a problem with this? Don't see a problem if the flag is propagated to all peers and slave mounts. If not, I see a problem. What if the propagated mount has its flag set to not do un-priviledged mounts, whereas the requested mount has it allowed? Then the mount is allowed. It is up to the sysadmin/distro to design set up the propagations in a way that this is not a problem. I think it would be much less clear conceptually, if unprivileged mounting would have to check propagations as well. Miklos - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag
Miklos Szeredi [EMAIL PROTECTED] writes: That depends. Current patches check the unprivileged submounts allowed under this mount flag only on the requested mount and not on the propagated mounts. Do you see a problem with this? I think privileges of this sort should propagate. If I read what you just said correctly if I have a private mount namespace I won't be able to mount anything unless when it was setup the unprivileged submount command was explicitly set. By design yes. Why is that a problem? It certainly doesn't match my intuition. Why are directory permissions not sufficient to allow/deny non-priveleged mounts? I don't understand that contention yet. I should probably go back and look and see how plan9 handles mount/unmount permissions. Plan9 gets away with a lot more because it doesn't have a suid bit and mount namespaces were always present, so they don't have backwards compatibility problems. My best guess at the moment is that plan9 treated mount/unmount as completely unprivileged and used the mount namespaces to limit the scope of what would be affected by a mount/unmount operation. I think that may be reasonable in linux as well but it will require the presence of a mount namespace to limit the affects of what a user can do. So short of a more thorough audit I believe the final semantics should be: - mount/unmount for non-priveleged processes should only be limited by the mount namespace and directory permissions. - CLONE_NEWNS should not be a privileged operation. What prevents us from allowing these things? - Unprivileged CLONE_NEWNS and unprivileged mounts needs resource accounting so we don't have a denial of service attack. - Unprivileged mounts must be limited to directories that we have permission to modify in a way that we could get the same effect as the mount or unmount operation in terms of what files are visible otherwise we can mess up SUID executables. - Anything else? There are user space issues such as a reasonable pam module and how to do backups. However those are user space issues. What am I missing that requires us to add MNT_USER and MNT_USERMNT? Eric - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 05/10] Add permit user submounts flag to vfsmount
Quoting Miklos Szeredi ([EMAIL PROTECTED]): From: Miklos Szeredi [EMAIL PROTECTED] If MNT_USERMNT flag is not set in the target vfsmount, then MNT_USER and MNT_USERMNT? I claim no way will people keep those straight. How about MNT_ALLOWUSER and MNT_USER? -serge unprivileged mounts will be denied. By default this flag is cleared, and can be set on new mounts, on remounts or with the MS_SETFLAGS option. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] --- Index: linux/fs/namespace.c === --- linux.orig/fs/namespace.c 2007-04-13 13:20:12.0 +0200 +++ linux/fs/namespace.c 2007-04-13 13:35:40.0 +0200 @@ -411,6 +411,7 @@ static int show_vfsmnt(struct seq_file * { MNT_NOATIME, ,noatime }, { MNT_NODIRATIME, ,nodiratime }, { MNT_RELATIME, ,relatime }, + { MNT_USERMNT, ,usermnt }, { 0, NULL } }; struct proc_fs_info *fs_infop; @@ -1505,9 +1506,11 @@ long do_mount(char *dev_name, char *dir_ mnt_flags |= MNT_NODIRATIME; if (flags MS_RELATIME) mnt_flags |= MNT_RELATIME; + if (flags MS_USERMNT) + mnt_flags |= MNT_USERMNT; flags = ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE | -MS_NOATIME | MS_NODIRATIME | MS_RELATIME); +MS_NOATIME | MS_NODIRATIME | MS_RELATIME | MS_USERMNT); /* ... and get the mountpoint */ retval = path_lookup(dir_name, LOOKUP_FOLLOW, nd); Index: linux/include/linux/mount.h === --- linux.orig/include/linux/mount.h 2007-04-13 13:17:08.0 +0200 +++ linux/include/linux/mount.h 2007-04-13 13:22:17.0 +0200 @@ -28,6 +28,7 @@ struct mnt_namespace; #define MNT_NOATIME 0x08 #define MNT_NODIRATIME 0x10 #define MNT_RELATIME 0x20 +#define MNT_USERMNT 0x40 #define MNT_SHRINKABLE 0x100 #define MNT_USER 0x200 Index: linux/include/linux/fs.h === --- linux.orig/include/linux/fs.h 2007-04-13 13:23:05.0 +0200 +++ linux/include/linux/fs.h 2007-04-13 13:35:34.0 +0200 @@ -130,6 +130,7 @@ extern int dir_notify_enable; #define MS_SETFLAGS (123) /* set specified mount flags */ #define MS_CLEARFLAGS(124) /* clear specified mount flags */ /* MS_SETFLAGS | MS_CLEARFLAGS: change mount flags to specified */ +#define MS_USERMNT (125) /* permit unpriv. submounts under this mount */ #define MS_ACTIVE(130) #define MS_NOUSER(131) -- - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 02/10] allow unprivileged umount
Miklos Szeredi [EMAIL PROTECTED] writes: From: Miklos Szeredi [EMAIL PROTECTED] The owner doesn't need sysadmin capabilities to call umount(). Similar behavior as umount(8) on mounts having user=UID option in /etc/mtab. The difference is that umount also checks /etc/fstab, presumably to exclude another mount on the same mountpoint. bool in the kernel? int would be much more recognizable as this is not C++ Or do you have place to convert the rest of the kernel that is using int to return a true/false value to bool? +static bool permit_umount(struct vfsmount *mnt, int flags) +{ + if (capable(CAP_SYS_ADMIN)) + return true; + + if (!(mnt-mnt_flags MNT_USER)) + return false; + + if (flags MNT_FORCE) + return false; + + return mnt-mnt_uid == current-uid; +} Eric - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag
Quoting Eric W. Biederman ([EMAIL PROTECTED]): Miklos Szeredi [EMAIL PROTECTED] writes: That depends. Current patches check the unprivileged submounts allowed under this mount flag only on the requested mount and not on the propagated mounts. Do you see a problem with this? I think privileges of this sort should propagate. If I read what you just said correctly if I have a private mount namespace I won't be able to mount anything unless when it was setup the unprivileged submount command was explicitly set. By design yes. Why is that a problem? It certainly doesn't match my intuition. Why are directory permissions not sufficient to allow/deny non-priveleged mounts? I don't understand that contention yet. The same scenarios laid out previously in this thread. I.e. 1. user hallyn does mount --bind / /home/hallyn/root 2. (...) 3. admin does deluser hallyn and deluser starts wiping out root Or, 1. user hallyn does mount --bind / /home/hallyn/root 2. backup daemon starts backing up /home/hallyn/root/home/hallyn/root/home... So we started down the path of forcing users to clone a new namespace before doing user mounts, which is what the clone flag was about. Using per-mount flags also suffices as you had pointed out, which is being done here. But directory permissions are inadequate. (Unless you want to tackle each problem legacy tool one at a time to remove problems - i.e. deluser should umount everything under /home/hallyn before deleting, backup should be spawned from it's own namespace cloned right after boot or just back up on one filesystem, etc.) -serge I should probably go back and look and see how plan9 handles mount/unmount permissions. Plan9 gets away with a lot more because it doesn't have a suid bit and mount namespaces were always present, so they don't have backwards compatibility problems. My best guess at the moment is that plan9 treated mount/unmount as completely unprivileged and used the mount namespaces to limit the scope of what would be affected by a mount/unmount operation. I think that may be reasonable in linux as well but it will require the presence of a mount namespace to limit the affects of what a user can do. So short of a more thorough audit I believe the final semantics should be: - mount/unmount for non-priveleged processes should only be limited by the mount namespace and directory permissions. - CLONE_NEWNS should not be a privileged operation. What prevents us from allowing these things? - Unprivileged CLONE_NEWNS and unprivileged mounts needs resource accounting so we don't have a denial of service attack. - Unprivileged mounts must be limited to directories that we have permission to modify in a way that we could get the same effect as the mount or unmount operation in terms of what files are visible otherwise we can mess up SUID executables. - Anything else? There are user space issues such as a reasonable pam module and how to do backups. However those are user space issues. What am I missing that requires us to add MNT_USER and MNT_USERMNT? Eric - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 02/10] allow unprivileged umount
On Mon, Apr 16, 2007 at 01:39:19PM -0600, Eric W. Biederman wrote: Miklos Szeredi [EMAIL PROTECTED] writes: From: Miklos Szeredi [EMAIL PROTECTED] The owner doesn't need sysadmin capabilities to call umount(). Similar behavior as umount(8) on mounts having user=UID option in /etc/mtab. The difference is that umount also checks /etc/fstab, presumably to exclude another mount on the same mountpoint. bool in the kernel? It's there already... int would be much more recognizable as this is not C++ Or do you have place to convert the rest of the kernel that is using int to return a true/false value to bool? It's already underway: $ git log | grep bool | wc -l 123 - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [AppArmor 39/41] AppArmor: Profile loading and manipulation, pathname matching
On Mon, Apr 16, 2007 at 08:27:08AM +0200, Andi Kleen wrote: It's nice to check for consistency though, so we're adding that. Profile loading is a trusted operation, at least so far, and so security wise we don't actually have to care --- if loading an invalid profile can bring down the system, then that's no worse than an arbitrary module that crashes the machine. Not sure if there will ever be user loadable profiles; at least at that point we had to care. A security system that allows to crash the kernel is a little weird though. It would be better to check. Not that a recursion check is particularly expensive. Indeed. It will be fixed in the next rev. thanks john pgp4HkMa7wCNY.pgp Description: PGP signature
Re: How to query mount propagation state?
On Mon, Apr 16, 2007 at 10:39:46AM -0700, Ram Pai wrote: This patch disambiguates multiple mount-instances of the same filesystem (or part of the same filesystem), by introducing a new interface /proc/mounts_new. The interface has the following format. ^^ ... odd name. What will be the name for a next generation? /proc/mounts_new_new? :-) 'cat /proc/mounts' shows the following: /dev/root /mnt ext2 rw 0 0 /dev/root /tmp1 ext2 rw 0 0 NOTE: The above mount entries, do not indicate that /tmp1 contains the same directory tree as /var/tmp. But 'cat /proc/mounts_new' shows us the following: 0x6200 /mnt /var ext2 rw 0 0 0x6200 /tmp1 /var/tmp ext2 rw 0 0 Can't you purely and simply add the fsid= option to /proc/mounts? /dev/root /mnt ext2 rw,fsid=0x6200 0 0 /dev/root /mnt ext2 rw,fsid=0x6200 0 0 I think you can do it without a negative impact to userspace. This patch introduces a new proc interface that exposes all the propagation trees within the namespace. Good idea. It walks through each off the mounts in the namespace, and prints the following information. mount-id: a unique mount identifier dev-id : the unique device used to identify the device containing the filesystem Why not major:minor? path-from-root: mount point of the mount from / path-from-root-of-its-sb: path from its own root dentry. propagation-flag: SHARED, SLAVE, UNBINDABLE, PRIVATE peer-mount-id: the mount-id of its peer mount (if this mount is shared) master-mount-id: the mount-id of its master mount (if this mount is slave) Example: Here is a sample output of cat /proc/$$/mounts_propagation 0xa917800 0x1 / / PRIVATE 0xa917200 0x6200 / / PRIVATE 0xa917180 0x3 /proc / PRIVATE 0xa917f80 0xa /dev/pts / PRIVATE 0xa917100 0x6210 /mnt / SHARED peer:0xa917100 0xa917f00 0x6210 /tmp /1 SLAVE master:0xa917100 0xa917900 0x6220 /mnt/2 / SHARED peer:0xa917900 Same thing (although the mounts_propagation makes more sense than mount_new from my point of view). cat /proc/mounts (or /proc/$$/mounts) /dev/root /mnt ext2 rw,mid=0xa917100,did=0x6210,prop=SHARED,peer=0xa917100 my $0.02... Karel -- Karel Zak [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
AppArmor FAQ
Here we present our direct responses to the most frequent questions from the AppArmor from the 2006 post. Use of Pathnames For Access Control --- Some people in the security field believe that pathnames are an inappropriate security mechanism. This depends on what you are primarily trying to protect, and the rest follows from that. Label-based security (exemplified by SELinux, and its predecessors in MLS systems) attaches security policy to the data. As the data flows through the system, the label sticks to the data, and so security policy with respect to this data stays intact. This is a good approach for ensuring secrecy, the kind of problem that intelligence agencies have. Pathname-based security (exemplified in AppArmor, and its predecessor Janus http://www.cs.berkeley.edu/~daw/janus/ and other systems like Systrace http://www.citi.umich.edu/u/provos/systrace/ ) attach security policy to the name of the data. Controlling access to filenames is important because applications primarily use those names to access the files behind them, and they depend on getting to the right files. For example, login(1) expects /etc/passwd to resolve to a valid list of user accounts. In the traditional UNIX model, files do have names but not labels, and applications only operate in terms of those names. Pathname-based security puts more emphasis on the integrity of the system, making secrecy the secondary goal that follows. Caveat: Both label-based security and pathname-based security can provide both secrecy and integrity protection, the above discussion is only about which model makes it easier to provide which kind of security. We acknowledge that not all objects on a UNIX system are paths, and we agree that there is value in also protecting non-path resources. Contrary to popular belief, AppArmor is *not* Pathnames R Us, but rather Use native abstractions to mediate stuff: when you mediate something, you should use the native syntax that users normally use to access the object. This follows the UNIX philosophy of least surprise so that users can understand the specification. Pathnames are the natural notation for users to understand what file access rights are being granted in the policy, and so AppArmor uses shell syntax for fully qualified pathnames, including shell syntax wildcards. Similarly, AppArmor grants access to POSIX.1e capabilities by name, the name of the capability. In future work where AppArmor will add network access control, the notation will resemble IPTables firewall rules. This is an important part of what makes AppArmor usable: always using the native abstraction for mediating access. We also acknowledge that pathname based access control requires a way to perform pathname matching in the kernel, and this comes at a cost higher than comparing object labels -- which assumes that all objects in the system already have the appropriate labels. However, those concerned with performance should note that AppArmor overhead is already quite low (single-digit percent slowdown). Security is rarely performance-neutral, and AppArmor, and SELinux, are no exception. However, that overhead is small, and can be selectively avoided by not applying AppArmor to performance-sensitive programs. It is also easy to overlook the fact that putting all those labels in place is a pretty expensive operation as well, particularly considering large file systems. So by providing string matching in the kernel, AppArmor trades run-time performance to grant reduced administrative work. It has been suggested that AppArmor's pathname-based syntax could be compiled into SELinux policy, and this is in fact what the SEEdit project http://seedit.sourceforge.net/ does. However, any change in policy requires a complete re-labeling of the file system, and the policy cannot apply to files that do not yet exist. AppArmor's in-kernel string matching allows for policy specifying access to files that might come to exist in the future. Use Of d_path() For Computing Pathnames --- We have been criticized for the use of d_path(), for various reasons: - heuristic discovery of the vfsmount of a dentry, - inability to reliably identify deleted files, - inability to detect unreachable paths, - ambiguity of paths for chroot processes, - file lookup and the access check are not atomic. Most of these issues are fixable (and fixed in the meantime), while the non-atomicity is not really an issue. Because struct vfsmount was not available to LSM hooks for computing pathnames from (dentry, vfsmount) pairs, the version of AppArmor posted last year used heuristics for rediscovering the vfsmounts associated with dentries -- and possibly the wrong ones. We are now passing the vfsmount objects through to all LSM hooks that compute pathnames, and so this heuristic is gone, and now we always use the appropriate vfsmount. The d_path patch already in the -mm tree allows
Re: [AppArmor 38/41] AppArmor: Module and LSM hooks
On Thu, Apr 12, 2007 at 11:21:01AM +0100, Alan Cox wrote: + + /** +* parent can ptrace child when +* - parent is unconfined +* - parent is in complain mode +* - parent and child are confined by the same profile +*/ Your profiles are name based. That means the same profile in a different namespace does different things. It would be a very odd case where it mattered but surely the parent ptrace child rule should also require that the parent and child are in the same namespace when using apparmor name based security. you are right we should be requiring parent and child are in the same namespace. This has been fixed. +static int apparmor_capget(struct task_struct *task, + kernel_cap_t *effective, + kernel_cap_t *inheritable, + kernel_cap_t *permitted) +{ + return cap_capget(task, effective, inheritable, permitted); +} Pointless function should go away. yes we had a few of those thanks for pointing it out. +static int apparmor_sysctl(struct ctl_table *table, int op) +{ + int error = 0; + + if ((op 002) !capable(CAP_SYS_ADMIN)) + error = aa_reject_syscall(current, GFP_KERNEL, + sysctl (write)); + + return error; The usual file permission security override is DAC not ADMIN. What is the logic of this choice. This was a very course grain check that was done to restrict access to sysctl's that could be potentially used to elevated priledge. The check is inconsistent with AppArmor's model and we should be modelling sysctl accesses as pathname access, and then we could be using standard mediation. thanks for the review john pgpu5UrkM9BxV.pgp Description: PGP signature
Re: [AppArmor 31/41] Fix __d_path() for lazy unmounts and make it unambiguous; exclude unreachable mount points from /proc/mounts
That is a fairly significant and sudden change to the existing kernel/user interface. Well, this is not meant for 2.6.21. I hope it is possible to change it in early 2.6.22; otherwise if we can't fix mistakes from the past we are pretty doomed. I don't believe the existing behaviour _IS_ a mistake. This is untrue. The process can get there (via fd passing with another task) Process can access file descriptors which are unreachable via path name just fine indeed, but those fds still don't have a valid path in the context of that process. Which while problematic to your name based security is just fine to everything else. We are only talking about mount points unreachable by a particular process; this does not mean that the mount point isn't reachable by other processes. Human operators can choose the context from which they are looking at /proc/mounts. If they are looking form the real root, the will see all mounts that any process can reach (in that namespace). Ok, providing the real root sees them all it isn't so bad, but to assume you can filter based upon what the task can see is dodgy as an assumption. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [AppArmor 39/41] AppArmor: Profile loading and manipulation, pathname matching
don't actually have to care --- if loading an invalid profile can bring down the system, then that's no worse than an arbitrary module that crashes the machine. Not sure if there will ever be user loadable profiles; at least at that point we had to care. CAP_SYS_RAWIO is needed to do arbitary patching/loading in the capability model so if you are using lesser capabilities it is a (minor) capability rise but not a big problem, just ugly and wanting a fix + /* + * Replacement needs to allocate a new aa_task_context for each + * task confined by old_profile. To do this the profile locks + * are only held when the actual switch is done per task. While + * looping to allocate a new aa_task_context the old_task list + * may get shorter if tasks exit/change their profile but will + * not get longer as new task will not use old_profile detecting + * that is stale. + */ + do { + new_cxt = aa_alloc_task_context(GFP_KERNEL | __GFP_NOFAIL); NOFAIL is usually a bad sign. It should be only used if there is no alternative. At this point there is no secure alternative to allocating a task context --- except killing the task, maybe. Can you count the number needed, preallocate them and then when you know for sure either succeed or fail the operation as a whole ? - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [AppArmor 39/41] AppArmor: Profile loading and manipulation, pathname matching
On Mon, Apr 16, 2007 at 11:00:01PM +0100, Alan Cox wrote: don't actually have to care --- if loading an invalid profile can bring down the system, then that's no worse than an arbitrary module that crashes the machine. Not sure if there will ever be user loadable profiles; at least at that point we had to care. CAP_SYS_RAWIO is needed to do arbitary patching/loading in the capability model so if you are using lesser capabilities it is a (minor) capability rise but not a big problem, just ugly and wanting a fix + /* +* Replacement needs to allocate a new aa_task_context for each +* task confined by old_profile. To do this the profile locks +* are only held when the actual switch is done per task. While +* looping to allocate a new aa_task_context the old_task list +* may get shorter if tasks exit/change their profile but will +* not get longer as new task will not use old_profile detecting +* that is stale. +*/ + do { + new_cxt = aa_alloc_task_context(GFP_KERNEL | __GFP_NOFAIL); NOFAIL is usually a bad sign. It should be only used if there is no alternative. At this point there is no secure alternative to allocating a task context --- except killing the task, maybe. Can you count the number needed, preallocate them and then when you know for sure either succeed or fail the operation as a whole ? No, to be accurate the count would have to be made with the profile lock held, which would then need to be released so as not to use GFP_ATOMIC for the allocations. An iterative approach could be taken where we do something like repeat: lock profile count if preallocated count unlock profile if (! allocate count - preallocated) Fail goto repeat do replacement pgp3UGFGlqBX8.pgp Description: PGP signature
Re: 2.6.21-rc6 new aops patchset
On Thu, Apr 12, 2007 at 06:48:52AM +0200, Nick Piggin wrote: http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/ 2.6.21-rc6-new-aops* New aops patchset against 2.6.21-rc6. Ok, here's an ocfs2 patch against -mm / ocfs2.git ALL. It makes use of the context back pointer to store information related to the write which the vfs normally doesn't know about. Most importantly this is an array of zero'd pages which might have to be written out for an allocating write. Of note is that I also stick the journal handle on there. Ocfs2 could use current-journal_info for that, but I think it's much cleaner to just pass that around as a file system specific context. I tested this on a couple nodes and things seem to be running smoothly. A couple of notes: * The ocfs2 write context is probably a bit big. I'm much more concerned with readability though as Ocfs2 has much more baggage to carry around than other file systems. * A ton of code was deleted :) The patch adds a bunch too, but that's mostly getting the old stuff to flow with -write_begin. Some assumptions about the size of the write that were made with my previous implemenation were no longer true (this is good). * I could probably clean this up some more, but I'd be fine if the patch went upstream as-is. Diff seems to have mangled this patch file enough that it's probably much easier to read once applied. * This doesn't use -perform_write (yet), so stuff is still being copied one page at a time. I _think_ things are pretty reasonably set up to allow larger writes though... --Mark -- Mark Fasheh Senior Software Developer, Oracle [EMAIL PROTECTED] From: Mark Fasheh [EMAIL PROTECTED] ocfs2: Convert to new aops Fix up ocfs2 to use -write_begin and -write_end. This lets us dump a large amount of code which was implementing our own write path while preserving the nice locking rules that were gained by moving away from -prepare_write. Signed-off-by: Mark Fasheh [EMAIL PROTECTED] --- fs/ocfs2/aops.c | 777 +++ fs/ocfs2/aops.h | 52 fs/ocfs2/file.c | 246 + 3 files changed, 452 insertions(+), 623 deletions(-) 62668000c621fd18da6adf52e219f933e337e4bd diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c index b74971e..55100cf 100644 --- a/fs/ocfs2/aops.c +++ b/fs/ocfs2/aops.c @@ -677,6 +677,8 @@ int ocfs2_map_page_blocks(struct page *p bh = bh-b_this_page, block_start += bsize) { block_end = block_start + bsize; + clear_buffer_new(bh); + /* * Ignore blocks outside of our i/o range - * they may belong to unallocated clusters. @@ -691,9 +693,8 @@ int ocfs2_map_page_blocks(struct page *p * For an allocating write with cluster size = page * size, we always write the entire page. */ - - if (buffer_new(bh)) - clear_buffer_new(bh); + if (new) + set_buffer_new(bh); if (!buffer_mapped(bh)) { map_bh(bh, inode-i_sb, *p_blkno); @@ -754,217 +755,187 @@ next_bh: return ret; } +#if (PAGE_CACHE_SIZE = OCFS2_MAX_CLUSTERSIZE) +#define OCFS2_MAX_CTXT_PAGES 1 +#else +#define OCFS2_MAX_CTXT_PAGES (OCFS2_MAX_CLUSTERSIZE / PAGE_CACHE_SIZE) +#endif + +#define OCFS2_MAX_CLUSTERS_PER_PAGE(PAGE_CACHE_SIZE / OCFS2_MIN_CLUSTERSIZE) + /* - * This will copy user data from the buffer page in the splice - * context. - * - * For now, we ignore SPLICE_F_MOVE as that would require some extra - * communication out all the way to ocfs2_write(). + * Describe the state of a single cluster to be written to. */ -int ocfs2_map_and_write_splice_data(struct inode *inode, - struct ocfs2_write_ctxt *wc, u64 *p_blkno, - unsigned int *ret_from, unsigned int *ret_to) -{ - int ret; - unsigned int to, from, cluster_start, cluster_end; - char *src, *dst; - struct ocfs2_splice_write_priv *sp = wc-w_private; - struct pipe_buffer *buf = sp-s_buf; - unsigned long bytes, src_from; - struct ocfs2_super *osb = OCFS2_SB(inode-i_sb); +struct ocfs2_write_cluster_desc { + u32 c_cpos; + u32 c_phys; + /* +* Give this a unique field because c_phys eventually gets +* filled. +*/ + unsignedc_new; +}; - ocfs2_figure_cluster_boundaries(osb, wc-w_cpos, cluster_start, - cluster_end); +struct ocfs2_write_ctxt { + /* Logical cluster position / len of write */ + u32 w_cpos; + u32 w_clen; - from = sp-s_offset; - src_from = sp-s_buf_offset; - bytes = wc-w_count; + struct ocfs2_write_cluster_desc
Re: AppArmor FAQ
On Mon, 16 Apr 2007, John Johansen wrote: Label-based security (exemplified by SELinux, and its predecessors in MLS systems) attaches security policy to the data. As the data flows through the system, the label sticks to the data, and so security policy with respect to this data stays intact. This is a good approach for ensuring secrecy, the kind of problem that intelligence agencies have. Labels are also a good approach for ensuring integrity, which is one of the most fundamental aspects of the security model implemented by SELinux. Some may infer otherwise from your document. Pathname-based security (exemplified in AppArmor, and its predecessor Janus http://www.cs.berkeley.edu/~daw/janus/ and other systems like Systrace http://www.citi.umich.edu/u/provos/systrace/ ) attach security policy to the name of the data. Controlling access to filenames is important because applications primarily use those names to access the files behind them, and they depend on getting to the right files. For example, login(1) expects /etc/passwd to resolve to a valid list of user accounts. And it should, but alas may instead find otherwise due to namespace manipulation, object aliasing (e.g. symlinks), application error, configuration error, corrupted files, corrupted filesystems, misbehavior due to malware infection or various forms user error. A pathname tells you nothing reliable about the security properties of the object its pointing to. It is simply a mechanism for locating and referring to an object. In the traditional UNIX model, files do have names but not labels, and applications only operate in terms of those names. Just to be clear (as the above conflates two distinct notions): applications under SELinux still use pathnames for locating and referring to files. SELinux security is enforced within the kernel, and an application which does not have permission to access an object will simply receive an error using the standard Unix mechanisms already used for DAC. For example, a write(2) might fail with an EACCES error code. The pathname used by an application to access an object has _nothing_ to do with the security attributes of the object. Traditional Unix security in fact does not primarily depend on pathnames, but on DAC ownership and permission attributes stored in the file's inode. DAC is of course a form of labeled security. Imagine if you were re-inventing Unix and decided to implement pathname security for DAC instead of inode labeling. What you would have is a more generalized version of apparmor, with the DAC attributes of pathnames for the entire filesystem stored in a text database with an in-kernel regex engine performing path reconstruction and pattern matching on every file access. Sound like a good idea? I hope not. How about an analogy: think of kernel objects which are protected by locks. Do you lock the path to the object or do you lock the object itself? Pathname-based security puts more emphasis on the integrity of the system, making secrecy the secondary goal that follows. This assertion is being made without any supporting evidence or rationale. If you're comparing pathname vs. label security, then it is clear that direct object labeling allows the security attributes of the system to be specified completely and unambiguously, whereas integrity enforced via pathnames alone requires several constraints to be applied to the goals of the policy. So, it seems to me that the opposite of what you say is more correct, although it is a fairly oblique argument to start with. More significant to note is that Type Enforcement was designed specifically to address integrity requirements, in response to the limitations of the early MLS models which were focused on confidentiality. See: A Practical Alternative to Hierarchical Integrity Policies Boebert Kain, Proceedings of the Eighth National Computer Security Conference, 1985. Meeting Critical Security Objectives with Security-Enhanced Linux http://www.nsa.gov/selinux/papers/ottawa01/index.html Or pretty much any paper on the design of SELinux or Flask. Integrity control is a foundational aspect of TE, Flask and SELinux. I've never understood why AppArmor presentations tend to so bizarrely suggest the opposite. Caveat: Both label-based security and pathname-based security can provide both secrecy and integrity protection, the above discussion is only about which model makes it easier to provide which kind of security. I don't see how you've established anything in this regard. We acknowledge that not all objects on a UNIX system are paths, and we agree that there is value in also protecting non-path resources. Contrary to popular belief, AppArmor is *not* Pathnames R Us, but rather Use native abstractions to mediate stuff: when you mediate something, you should use the native syntax that users normally use to access the object. This follows the UNIX
Re: 2.6.21-rc6 new aops patchset
On Mon, Apr 16, 2007 at 05:19:32PM -0700, Mark Fasheh wrote: On Thu, Apr 12, 2007 at 06:48:52AM +0200, Nick Piggin wrote: http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/ 2.6.21-rc6-new-aops* New aops patchset against 2.6.21-rc6. Ok, here's an ocfs2 patch against -mm / ocfs2.git ALL. It makes use of the context back pointer to store information related to the write which the vfs normally doesn't know about. Most importantly this is an array of zero'd pages which might have to be written out for an allocating write. Of note is that I also stick the journal handle on there. Ocfs2 could use current-journal_info for that, but I think it's much cleaner to just pass that around as a file system specific context. I tested this on a couple nodes and things seem to be running smoothly. Thanks Mark, I'll let you know if I have any problems with it. I guess the plan will be to rebase the new-aops patchset on -mm at some point soon, but we'll have to work out where in -mm Andrew wants it and other details like when to merge it... I'll try to bring Andrew into the public discussion for that. A couple of notes: * The ocfs2 write context is probably a bit big. I'm much more concerned with readability though as Ocfs2 has much more baggage to carry around than other file systems. I guess your high performance write path could be -perform_write in future anyway, which could keep that on the stack. I'll possibly omit the perform_write stuff in the first -mm merge, so that we can get the basics reviewed and working, and exercise the write_begin/write_end paths well first. The bulk interface will come, and I think it is in much better shape to be implemented after various things like iov iterator, the offset,length based API rather than page based. Whether we introduce it by moving the iov_iter and some of the more generic checks further down the stack first, or introduce it as -perform_write aop first probably doesn't matter so much: the conversion between one and the other is trivial compared to implementing it in the first place, so nobody should worry that we haven't sorted this out yet. * A ton of code was deleted :) The patch adds a bunch too, but that's mostly getting the old stuff to flow with -write_begin. Some assumptions about the size of the write that were made with my previous implemenation were no longer true (this is good). * I could probably clean this up some more, but I'd be fine if the patch went upstream as-is. Diff seems to have mangled this patch file enough that it's probably much easier to read once applied. OK, thanks again. I'll ping you again in the next couple of days to discuss merge plans. Nick - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html