Re: [patch 0/8] unprivileged mount syscall
Arn't there ways to escape chroot jails? Serge had pointed me to a URL which showed chroots can be escaped. And if that is true than having all user's private mount tree in the same namespace can be a security issue? No. In fact chrooting the user into /share/$USER will actually _grant_ a privilege to the user, instead of taking it away. It allows the user to modify it's root namespace, which it wouldn't be able to in the initial namespace. So even if the user could escape from the chroot (which I doubt), s/he would not be able to do any harm, since unprivileged mounting would be restricted to /share. Also /share/$USER should only have read/search permission for $USER or no permissions at all, which would mean, that other users' namespaces would be safe from tampering as well. Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag
Given the existence of shared subtrees allowing/denying this at the mount namespace level is silly and wrong. If we need more than just the filesystem permission checks can we make it a mount flag settable with mount and remount that allows non-privileged users the ability to create mount points under it in directories they have full read/write access to. Also for bind-mount and remount operations the flag has to be propagated down its propagation tree. Otherwise a unpriviledged mount in a shared mount wont get reflected in its peers and slaves, leading to unidentical shared-subtrees. That's an interesting question. Do we want shared mounts to be totally identical, including mnt_flags? It doesn't look as if do_remount() guarantees that currently. Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag
Also for bind-mount and remount operations the flag has to be propagated down its propagation tree. Otherwise a unpriviledged mount in a shared mount wont get reflected in its peers and slaves, leading to unidentical shared-subtrees. That's an interesting question. Do we want shared mounts to be totally identical, including mnt_flags? It doesn't look as if do_remount() guarantees that currently. Depends on the semantics of each of the flags. Some flags like of the read/write flag, would not interfere with the propagation semantics AFAICT. But this one certainly seems to interfere. That depends. Current patches check the unprivileged submounts allowed under this mount flag only on the requested mount and not on the propagated mounts. Do you see a problem with this? Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 01/10] add user mounts to the kernel
From: Miklos Szeredi [EMAIL PROTECTED] Add ownership information to mounts. A new mount flag, MS_SETUSER is used to make a mount owned by a user. If this flag is specified, then the owner will be set to the current real user id and the mount will be marked with the MNT_USER flag. On remount don't preserve previous owner, and treat MS_SETUSER as for a new mount. The MS_SETUSER flag is ignored on mount move. The MNT_USER flag is not copied on any kind of mount cloning: namespace creation, binding or propagation. For bind mounts the cloned mount(s) are set to MNT_USER depending on the MS_SETUSER mount flag. In all the other cases MNT_USER is always cleared. For MNT_USER mounts a user=UID option is added to /proc/PID/mounts. This is compatible with how mount ownership is stored in /etc/mtab. It is expected, that in the future mount(8) will use MS_SETUSER to store mount ownership within the kernel. This would help in situations, where /etc/mtab is difficult or impossible to work with, e.g. when using mount propagation. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] --- Index: linux/fs/namespace.c === --- linux.orig/fs/namespace.c 2007-04-13 12:24:15.0 +0200 +++ linux/fs/namespace.c2007-04-13 13:03:44.0 +0200 @@ -227,6 +227,13 @@ static struct vfsmount *skip_mnt_tree(st return p; } +static void set_mnt_user(struct vfsmount *mnt) +{ + BUG_ON(mnt-mnt_flags MNT_USER); + mnt-mnt_uid = current-uid; + mnt-mnt_flags |= MNT_USER; +} + static struct vfsmount *clone_mnt(struct vfsmount *old, struct dentry *root, int flag) { @@ -241,6 +248,11 @@ static struct vfsmount *clone_mnt(struct mnt-mnt_mountpoint = mnt-mnt_root; mnt-mnt_parent = mnt; + /* don't copy the MNT_USER flag */ + mnt-mnt_flags = ~MNT_USER; + if (flag CL_SETUSER) + set_mnt_user(mnt); + if (flag CL_SLAVE) { list_add(mnt-mnt_slave, old-mnt_slave_list); mnt-mnt_master = old; @@ -403,6 +415,8 @@ static int show_vfsmnt(struct seq_file * if (mnt-mnt_flags fs_infop-flag) seq_puts(m, fs_infop-str); } + if (mnt-mnt_flags MNT_USER) + seq_printf(m, ,user=%i, mnt-mnt_uid); if (mnt-mnt_sb-s_op-show_options) err = mnt-mnt_sb-s_op-show_options(m, mnt); seq_puts(m, 0 0\n); @@ -920,8 +934,9 @@ static int do_change_type(struct nameida /* * do loopback mount. */ -static int do_loopback(struct nameidata *nd, char *old_name, int recurse) +static int do_loopback(struct nameidata *nd, char *old_name, int flags) { + int clone_flags; struct nameidata old_nd; struct vfsmount *mnt = NULL; int err = mount_is_safe(nd); @@ -941,11 +956,12 @@ static int do_loopback(struct nameidata if (!check_mnt(nd-mnt) || !check_mnt(old_nd.mnt)) goto out; + clone_flags = (flags MS_SETUSER) ? CL_SETUSER : 0; err = -ENOMEM; - if (recurse) - mnt = copy_tree(old_nd.mnt, old_nd.dentry, 0); + if (flags MS_REC) + mnt = copy_tree(old_nd.mnt, old_nd.dentry, clone_flags); else - mnt = clone_mnt(old_nd.mnt, old_nd.dentry, 0); + mnt = clone_mnt(old_nd.mnt, old_nd.dentry, clone_flags); if (!mnt) goto out; @@ -987,8 +1003,11 @@ static int do_remount(struct nameidata * down_write(sb-s_umount); err = do_remount_sb(sb, flags, data, 0); - if (!err) + if (!err) { nd-mnt-mnt_flags = mnt_flags; + if (flags MS_SETUSER) + set_mnt_user(nd-mnt); + } up_write(sb-s_umount); if (!err) security_sb_post_remount(nd-mnt, flags, data); @@ -1093,10 +1112,13 @@ static int do_new_mount(struct nameidata if (!capable(CAP_SYS_ADMIN)) return -EPERM; - mnt = do_kern_mount(type, flags, name, data); + mnt = do_kern_mount(type, flags ~MS_SETUSER, name, data); if (IS_ERR(mnt)) return PTR_ERR(mnt); + if (flags MS_SETUSER) + set_mnt_user(mnt); + return do_add_mount(mnt, nd, mnt_flags, NULL); } @@ -1127,7 +1149,8 @@ int do_add_mount(struct vfsmount *newmnt if (S_ISLNK(newmnt-mnt_root-d_inode-i_mode)) goto unlock; - newmnt-mnt_flags = mnt_flags; + /* MNT_USER was set earlier */ + newmnt-mnt_flags |= mnt_flags; if ((err = graft_tree(newmnt, nd))) goto unlock; @@ -1447,7 +1470,7 @@ long do_mount(char *dev_name, char *dir_ retval = do_remount(nd, flags ~MS_REMOUNT, mnt_flags, data_page); else
[patch 02/10] allow unprivileged umount
From: Miklos Szeredi [EMAIL PROTECTED] The owner doesn't need sysadmin capabilities to call umount(). Similar behavior as umount(8) on mounts having user=UID option in /etc/mtab. The difference is that umount also checks /etc/fstab, presumably to exclude another mount on the same mountpoint. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] --- Index: linux/fs/namespace.c === --- linux.orig/fs/namespace.c 2007-04-11 20:07:51.0 +0200 +++ linux/fs/namespace.c2007-04-11 20:08:05.0 +0200 @@ -659,6 +659,25 @@ static int do_umount(struct vfsmount *mn } /* + * umount is permitted for + * - sysadmin + * - mount owner, if not forced umount + */ +static bool permit_umount(struct vfsmount *mnt, int flags) +{ + if (capable(CAP_SYS_ADMIN)) + return true; + + if (!(mnt-mnt_flags MNT_USER)) + return false; + + if (flags MNT_FORCE) + return false; + + return mnt-mnt_uid == current-uid; +} + +/* * Now umount can handle mount points as well as block devices. * This is important for filesystems which use unnamed block devices. * @@ -681,7 +700,7 @@ asmlinkage long sys_umount(char __user * goto dput_and_out; retval = -EPERM; - if (!capable(CAP_SYS_ADMIN)) + if (!permit_umount(nd.mnt, flags)) goto dput_and_out; retval = do_umount(nd.mnt, flags); -- - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 07/10] allow unprivileged bind mounts
From: Miklos Szeredi [EMAIL PROTECTED] Allow bind mounts to unprivileged users if the following conditions are met: - user submounts are permitted on the mountpoint's mount - mountpoint is not a symlink or special file - mountpoint is not a sticky directory or is owned by the current user - mountpoint is writable by user - the number of user mounts is below the maximum Unprivileged mounts imply MS_SETUSER, and will also have the nosuid and nodev mount flags set. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] --- Index: linux/fs/namespace.c === --- linux.orig/fs/namespace.c 2007-04-13 13:35:53.0 +0200 +++ linux/fs/namespace.c2007-04-13 14:17:39.0 +0200 @@ -237,11 +237,30 @@ static void dec_nr_user_mounts(void) spin_unlock(vfsmount_lock); } -static void set_mnt_user(struct vfsmount *mnt) +static int reserve_user_mount(void) +{ + int err = 0; + spin_lock(vfsmount_lock); + if (nr_user_mounts = max_user_mounts !capable(CAP_SYS_ADMIN)) + err = -EPERM; + else + nr_user_mounts++; + spin_unlock(vfsmount_lock); + return err; +} + +static void __set_mnt_user(struct vfsmount *mnt) { BUG_ON(mnt-mnt_flags MNT_USER); mnt-mnt_uid = current-uid; mnt-mnt_flags |= MNT_USER; + if (!capable(CAP_SYS_ADMIN)) + mnt-mnt_flags |= MNT_NOSUID | MNT_NODEV; +} + +static void set_mnt_user(struct vfsmount *mnt) +{ + __set_mnt_user(mnt); spin_lock(vfsmount_lock); nr_user_mounts++; spin_unlock(vfsmount_lock); @@ -260,9 +279,16 @@ static struct vfsmount *clone_mnt(struct int flag) { struct super_block *sb = old-mnt_sb; - struct vfsmount *mnt = alloc_vfsmnt(old-mnt_devname); + struct vfsmount *mnt; + + if (flag CL_SETUSER) { + int err = reserve_user_mount(); + if (err) + return ERR_PTR(err); + } + mnt = alloc_vfsmnt(old-mnt_devname); if (!mnt) - return ERR_PTR(-ENOMEM); + goto alloc_failed; mnt-mnt_flags = old-mnt_flags; atomic_inc(sb-s_active); @@ -274,7 +300,7 @@ static struct vfsmount *clone_mnt(struct /* don't copy the MNT_USER flag */ mnt-mnt_flags = ~MNT_USER; if (flag CL_SETUSER) - set_mnt_user(mnt); + __set_mnt_user(mnt); if (flag CL_SLAVE) { list_add(mnt-mnt_slave, old-mnt_slave_list); @@ -299,6 +325,11 @@ static struct vfsmount *clone_mnt(struct spin_unlock(vfsmount_lock); } return mnt; + + alloc_failed: + if (flag CL_SETUSER) + dec_nr_user_mounts(); + return ERR_PTR(-ENOMEM); } static inline void __mntput(struct vfsmount *mnt) @@ -746,22 +777,35 @@ asmlinkage long sys_oldumount(char __use #endif -static int mount_is_safe(struct nameidata *nd) +/* + * Conditions for unprivileged mounts are: + * - user submounts are permitted under this mount + * - mountpoint is not a symlink or special file + * - mountpoint is absolutely writable by user + * o if it's a sticky directory, it must be owned by the user + * o it must not be an append-only file/directory + */ +static int mount_is_safe(struct nameidata *nd, int *flags) { + struct inode *inode = nd-dentry-d_inode; + if (capable(CAP_SYS_ADMIN)) return 0; - return -EPERM; -#ifdef notyet - if (S_ISLNK(nd-dentry-d_inode-i_mode)) + + if (!(nd-mnt-mnt_flags MNT_USERMNT)) return -EPERM; - if (nd-dentry-d_inode-i_mode S_ISVTX) { - if (current-uid != nd-dentry-d_inode-i_uid) - return -EPERM; - } - if (vfs_permission(nd, MAY_WRITE)) + + if (!S_ISDIR(inode-i_mode) !S_ISREG(inode-i_mode)) + return -EPERM; + + if ((inode-i_mode S_ISVTX) current-fsuid != inode-i_uid) return -EPERM; + + if (vfs_permission(nd, MAY_WRITE) || IS_APPEND(inode)) + return -EPERM; + + *flags |= MS_SETUSER; return 0; -#endif } static int lives_below_in_same_fs(struct dentry *d, struct dentry *dentry) @@ -991,7 +1035,7 @@ static int do_loopback(struct nameidata int clone_flags; struct nameidata old_nd; struct vfsmount *mnt = NULL; - int err = mount_is_safe(nd); + int err = mount_is_safe(nd, flags); if (err) return err; if (!old_name || !*old_name) -- - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 04/10] allow per-mount flags to be set/cleared individually
With MS_REMOUNT, it is difficult to change mount flags individually, without touching other mount flags or options, having to parse /proc/mounts to get the old flag values and options. It is also difficult to change a mount flag recursively, having to walk the mount tree in userspace. This patch solves this problem by generalizing do_change_type() so that not only the propagation property can be changed, but mnt_flags can be set/cleared individually. From: Miklos Szeredi [EMAIL PROTECTED] Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] --- Index: linux/fs/namespace.c === --- linux.orig/fs/namespace.c 2007-04-13 13:04:04.0 +0200 +++ linux/fs/namespace.c2007-04-13 13:04:29.0 +0200 @@ -955,19 +956,28 @@ out_unlock: /* * recursively change the type of the mountpoint. */ -static int do_change_type(struct nameidata *nd, int flag) +static int do_change_mnt(struct nameidata *nd, int flag, int mnt_flags) { struct vfsmount *m, *mnt = nd-mnt; int recurse = flag MS_REC; - int type = flag ~MS_REC; + int type = flag (MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE); if (nd-dentry != nd-mnt-mnt_root) return -EINVAL; down_write(namespace_sem); spin_lock(vfsmount_lock); - for (m = mnt; m; m = (recurse ? next_mnt(m, mnt) : NULL)) - change_mnt_propagation(m, type); + for (m = mnt; m; m = (recurse ? next_mnt(m, mnt) : NULL)) { + if (type) + change_mnt_propagation(m, type); + + if (flag (MS_SETFLAGS | MS_CLEARFLAGS)) + m-mnt_flags = mnt_flags; + else if (flag MS_SETFLAGS) + m-mnt_flags |= mnt_flags; + else if (flag MS_CLEARFLAGS) + m-mnt_flags = ~mnt_flags; + } spin_unlock(vfsmount_lock); up_write(namespace_sem); return 0; @@ -1514,8 +1526,9 @@ long do_mount(char *dev_name, char *dir_ data_page); else if (flags MS_BIND) retval = do_loopback(nd, dev_name, flags); - else if (flags (MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE)) - retval = do_change_type(nd, flags); + else if (flags (MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE | + MS_SETFLAGS | MS_CLEARFLAGS)) + retval = do_change_mnt(nd, flags, mnt_flags); else if (flags MS_MOVE) retval = do_move_mount(nd, dev_name); else Index: linux/include/linux/fs.h === --- linux.orig/include/linux/fs.h 2007-04-13 13:04:04.0 +0200 +++ linux/include/linux/fs.h2007-04-13 13:04:29.0 +0200 @@ -127,6 +127,9 @@ extern int dir_notify_enable; #define MS_SHARED (120) /* change to shared */ #define MS_RELATIME(121) /* Update atime relative to mtime/ctime. */ #define MS_SETUSER (122) /* set mnt_uid to current user */ +#define MS_SETFLAGS(123) /* set specified mount flags */ +#define MS_CLEARFLAGS (124) /* clear specified mount flags */ +/* MS_SETFLAGS | MS_CLEARFLAGS: change mount flags to specified */ #define MS_ACTIVE (130) #define MS_NOUSER (131) -- - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 00/10] mount ownership and unprivileged mount syscall (v3)
This patchset adds support for keeping mount ownership information in the kernel, and allow unprivileged mount(2) and umount(2) in certain cases. This can be useful for the following reasons: - mount(8) can store ownership (user=XY option) in the kernel instead, or in addition to storing it in /etc/mtab. For example if private namespaces are used with mount propagations /etc/mtab becomes unworkable, but using /proc/mounts works fine - fuse won't need a special suid-root mount/umount utility. Plain umount(8) can easily be made to work with unprivileged fuse mounts - users can use bind mounts without having to pre-configure them in /etc/fstab The following security measures are taken for unprivileged mounts: - only allow submounting under mounts which have a special mount flag set - only allow mounting on files/directories writable by the user - limit the number of user mounts - force nosuid,nodev mount options Changes from the previous submissions: - add mount flags to set/clear mnt_flags individually - add usermnt mount flag. If it is set, then allow unprivileged submounts under this mount - make max number of user mounts default to 1024, since now the usermnt flag will prevent user mounts by default -- - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 03/10] account user mounts
From: Miklos Szeredi [EMAIL PROTECTED] Add sysctl variables for accounting and limiting the number of user mounts. The maximum number of user mounts is set to 1024 by default. This won't in itself enable user mounts, setting the permit user submount mount flag will also be needed. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] --- Index: linux/include/linux/sysctl.h === --- linux.orig/include/linux/sysctl.h 2007-04-11 18:27:46.0 +0200 +++ linux/include/linux/sysctl.h2007-04-11 20:08:16.0 +0200 @@ -818,6 +818,8 @@ enum FS_AIO_NR=18, /* current system-wide number of aio requests */ FS_AIO_MAX_NR=19, /* system-wide maximum number of aio requests */ FS_INOTIFY=20, /* inotify submenu */ + FS_NR_USER_MOUNTS=21, /* int:current number of user mounts */ + FS_MAX_USER_MOUNTS=22, /* int:maximum number of user mounts */ FS_OCFS2=988, /* ocfs2 */ }; Index: linux/kernel/sysctl.c === --- linux.orig/kernel/sysctl.c 2007-04-11 18:27:46.0 +0200 +++ linux/kernel/sysctl.c 2007-04-11 20:08:16.0 +0200 @@ -1063,6 +1063,22 @@ static ctl_table fs_table[] = { #endif #endif { + .ctl_name = FS_NR_USER_MOUNTS, + .procname = nr_user_mounts, + .data = nr_user_mounts, + .maxlen = sizeof(int), + .mode = 0444, + .proc_handler = proc_dointvec, + }, + { + .ctl_name = FS_MAX_USER_MOUNTS, + .procname = max_user_mounts, + .data = max_user_mounts, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec, + }, + { .ctl_name = KERN_SETUID_DUMPABLE, .procname = suid_dumpable, .data = suid_dumpable, Index: linux/Documentation/filesystems/proc.txt === --- linux.orig/Documentation/filesystems/proc.txt 2007-04-11 18:27:44.0 +0200 +++ linux/Documentation/filesystems/proc.txt2007-04-12 13:32:14.0 +0200 @@ -923,6 +923,15 @@ reaches aio-max-nr then io_setup will fa raising aio-max-nr does not result in the pre-allocation or re-sizing of any kernel data structures. +nr_user_mounts and max_user_mounts +-- + +These represent the number of user mounts and the maximum number of +user mounts respectively. User mounts may be created by +unprivileged users. User mounts may also be created with sysadmin +privileges on behalf of a user, in which case nr_user_mounts may +exceed max_user_mounts. + 2.2 /proc/sys/fs/binfmt_misc - Miscellaneous binary formats --- Index: linux/fs/namespace.c === --- linux.orig/fs/namespace.c 2007-04-11 20:08:05.0 +0200 +++ linux/fs/namespace.c2007-04-12 13:29:05.0 +0200 @@ -39,6 +39,9 @@ static int hash_mask __read_mostly, hash static struct kmem_cache *mnt_cache __read_mostly; static struct rw_semaphore namespace_sem; +int nr_user_mounts; +int max_user_mounts = 1024; + /* /sys/fs */ decl_subsys(fs, NULL, NULL); EXPORT_SYMBOL_GPL(fs_subsys); @@ -227,11 +230,30 @@ static struct vfsmount *skip_mnt_tree(st return p; } +static void dec_nr_user_mounts(void) +{ + spin_lock(vfsmount_lock); + nr_user_mounts--; + spin_unlock(vfsmount_lock); +} + static void set_mnt_user(struct vfsmount *mnt) { BUG_ON(mnt-mnt_flags MNT_USER); mnt-mnt_uid = current-uid; mnt-mnt_flags |= MNT_USER; + spin_lock(vfsmount_lock); + nr_user_mounts++; + spin_unlock(vfsmount_lock); +} + +static void clear_mnt_user(struct vfsmount *mnt) +{ + if (mnt-mnt_flags MNT_USER) { + mnt-mnt_uid = 0; + mnt-mnt_flags = ~MNT_USER; + dec_nr_user_mounts(); + } } static struct vfsmount *clone_mnt(struct vfsmount *old, struct dentry *root, @@ -283,6 +305,7 @@ static inline void __mntput(struct vfsmo { struct super_block *sb = mnt-mnt_sb; dput(mnt-mnt_root); + clear_mnt_user(mnt); free_vfsmnt(mnt); deactivate_super(sb); } @@ -1023,6 +1046,7 @@ static int do_remount(struct nameidata * down_write(sb-s_umount); err = do_remount_sb(sb, flags, data, 0); if (!err) { + clear_mnt_user(nd-mnt); nd-mnt-mnt_flags = mnt_flags; if (flags MS_SETUSER) set_mnt_user(nd-mnt); Index: linux/include/linux/fs.h
[patch 08/10] put declaration of put_filesystem() in fs.h
From: Miklos Szeredi [EMAIL PROTECTED] Declarations go into headers. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] --- Index: linux/fs/super.c === --- linux.orig/fs/super.c 2007-04-13 12:26:11.0 +0200 +++ linux/fs/super.c2007-04-13 13:25:40.0 +0200 @@ -40,10 +40,6 @@ #include asm/uaccess.h -void get_filesystem(struct file_system_type *fs); -void put_filesystem(struct file_system_type *fs); -struct file_system_type *get_fs_type(const char *name); - LIST_HEAD(super_blocks); DEFINE_SPINLOCK(sb_lock); Index: linux/include/linux/fs.h === --- linux.orig/include/linux/fs.h 2007-04-13 13:23:22.0 +0200 +++ linux/include/linux/fs.h2007-04-13 13:25:40.0 +0200 @@ -1922,6 +1922,8 @@ extern int vfs_fstat(unsigned int, struc extern int vfs_ioctl(struct file *, unsigned int, unsigned int, unsigned long); +extern void get_filesystem(struct file_system_type *fs); +extern void put_filesystem(struct file_system_type *fs); extern struct file_system_type *get_fs_type(const char *name); extern struct super_block *get_super(struct block_device *); extern struct super_block *user_get_super(dev_t); -- - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 06/10] propagate error values from clone_mnt
From: Miklos Szeredi [EMAIL PROTECTED] Allow clone_mnt() to return errors other than ENOMEM. This will be used for returning a different error value when the number of user mounts goes over the limit. Fix copy_tree() to return EPERM for unbindable mounts. Don't propagate further from dup_mnt_ns() as that copy_tree() can only fail with -ENOMEM. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] --- Index: linux/fs/namespace.c === --- linux.orig/fs/namespace.c 2007-04-13 13:22:17.0 +0200 +++ linux/fs/namespace.c2007-04-13 13:25:35.0 +0200 @@ -261,42 +261,42 @@ static struct vfsmount *clone_mnt(struct { struct super_block *sb = old-mnt_sb; struct vfsmount *mnt = alloc_vfsmnt(old-mnt_devname); + if (!mnt) + return ERR_PTR(-ENOMEM); - if (mnt) { - mnt-mnt_flags = old-mnt_flags; - atomic_inc(sb-s_active); - mnt-mnt_sb = sb; - mnt-mnt_root = dget(root); - mnt-mnt_mountpoint = mnt-mnt_root; - mnt-mnt_parent = mnt; - - /* don't copy the MNT_USER flag */ - mnt-mnt_flags = ~MNT_USER; - if (flag CL_SETUSER) - set_mnt_user(mnt); - - if (flag CL_SLAVE) { - list_add(mnt-mnt_slave, old-mnt_slave_list); - mnt-mnt_master = old; - CLEAR_MNT_SHARED(mnt); - } else { - if ((flag CL_PROPAGATION) || IS_MNT_SHARED(old)) - list_add(mnt-mnt_share, old-mnt_share); - if (IS_MNT_SLAVE(old)) - list_add(mnt-mnt_slave, old-mnt_slave); - mnt-mnt_master = old-mnt_master; - } - if (flag CL_MAKE_SHARED) - set_mnt_shared(mnt); + mnt-mnt_flags = old-mnt_flags; + atomic_inc(sb-s_active); + mnt-mnt_sb = sb; + mnt-mnt_root = dget(root); + mnt-mnt_mountpoint = mnt-mnt_root; + mnt-mnt_parent = mnt; + + /* don't copy the MNT_USER flag */ + mnt-mnt_flags = ~MNT_USER; + if (flag CL_SETUSER) + set_mnt_user(mnt); - /* stick the duplicate mount on the same expiry list -* as the original if that was on one */ - if (flag CL_EXPIRE) { - spin_lock(vfsmount_lock); - if (!list_empty(old-mnt_expire)) - list_add(mnt-mnt_expire, old-mnt_expire); - spin_unlock(vfsmount_lock); - } + if (flag CL_SLAVE) { + list_add(mnt-mnt_slave, old-mnt_slave_list); + mnt-mnt_master = old; + CLEAR_MNT_SHARED(mnt); + } else { + if ((flag CL_PROPAGATION) || IS_MNT_SHARED(old)) + list_add(mnt-mnt_share, old-mnt_share); + if (IS_MNT_SLAVE(old)) + list_add(mnt-mnt_slave, old-mnt_slave); + mnt-mnt_master = old-mnt_master; + } + if (flag CL_MAKE_SHARED) + set_mnt_shared(mnt); + + /* stick the duplicate mount on the same expiry list +* as the original if that was on one */ + if (flag CL_EXPIRE) { + spin_lock(vfsmount_lock); + if (!list_empty(old-mnt_expire)) + list_add(mnt-mnt_expire, old-mnt_expire); + spin_unlock(vfsmount_lock); } return mnt; } @@ -782,11 +782,11 @@ struct vfsmount *copy_tree(struct vfsmou struct nameidata nd; if (!(flag CL_COPY_ALL) IS_MNT_UNBINDABLE(mnt)) - return NULL; + return ERR_PTR(-EPERM); res = q = clone_mnt(mnt, dentry, flag); - if (!q) - goto Enomem; + if (IS_ERR(q)) + goto error; q-mnt_mountpoint = mnt-mnt_mountpoint; p = mnt; @@ -807,8 +807,8 @@ struct vfsmount *copy_tree(struct vfsmou nd.mnt = q; nd.dentry = p-mnt_mountpoint; q = clone_mnt(p, p-mnt_root, flag); - if (!q) - goto Enomem; + if (IS_ERR(q)) + goto error; spin_lock(vfsmount_lock); list_add_tail(q-mnt_list, res-mnt_list); attach_mnt(q, nd); @@ -816,7 +816,7 @@ struct vfsmount *copy_tree(struct vfsmou } } return res; -Enomem: + error: if (res) { LIST_HEAD(umount_list); spin_lock(vfsmount_lock); @@ -824,7 +824,7 @@ Enomem: spin_unlock(vfsmount_lock); release_mounts(umount_list); } - return
[patch 10/10] allow unprivileged fuse mounts
From: Miklos Szeredi [EMAIL PROTECTED] Use FS_SAFE for fuse fs type, but not for fuseblk. FUSE was designed from the beginning to be safe for unprivileged users. This has also been verified in practice over many years. In addition unprivileged fuse mounts require the usermnt mount option to be set on the parent mount, which is more strict than the current userspace policy. This will enable future installations to remove the suid-root fusermount utility. Don't require the user_id= and group_id= options for unprivileged mounts, but if they are present, verify them for sanity. Disallow the allow_other option for unprivileged mounts. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] --- Index: linux/fs/fuse/inode.c === --- linux.orig/fs/fuse/inode.c 2007-04-13 14:20:23.0 +0200 +++ linux/fs/fuse/inode.c 2007-04-13 14:20:27.0 +0200 @@ -311,6 +311,19 @@ static int parse_fuse_opt(char *opt, str d-max_read = ~0; d-blksize = 512; + /* +* For unprivileged mounts use current uid/gid. Still allow +* user_id and group_id options for compatibility, but +* only if they match these values. +*/ + if (!capable(CAP_SYS_ADMIN)) { + d-user_id = current-uid; + d-user_id_present = 1; + d-group_id = current-gid; + d-group_id_present = 1; + + } + while ((p = strsep(opt, ,)) != NULL) { int token; int value; @@ -339,6 +352,8 @@ static int parse_fuse_opt(char *opt, str case OPT_USER_ID: if (match_int(args[0], value)) return 0; + if (d-user_id_present d-user_id != value) + return 0; d-user_id = value; d-user_id_present = 1; break; @@ -346,6 +361,8 @@ static int parse_fuse_opt(char *opt, str case OPT_GROUP_ID: if (match_int(args[0], value)) return 0; + if (d-group_id_present d-group_id != value) + return 0; d-group_id = value; d-group_id_present = 1; break; @@ -536,6 +553,10 @@ static int fuse_fill_super(struct super_ if (!parse_fuse_opt((char *) data, d, is_bdev)) return -EINVAL; + /* This is a privileged option */ + if ((d.flags FUSE_ALLOW_OTHER) !capable(CAP_SYS_ADMIN)) + return -EPERM; + if (is_bdev) { #ifdef CONFIG_BLOCK if (!sb_set_blocksize(sb, d.blksize)) @@ -639,6 +660,7 @@ static struct file_system_type fuse_fs_t .fs_flags = FS_HAS_SUBTYPE, .get_sb = fuse_get_sb, .kill_sb= kill_anon_super, + .fs_flags = FS_SAFE, }; #ifdef CONFIG_BLOCK -- - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 05/10] Add permit user submounts flag to vfsmount
From: Miklos Szeredi [EMAIL PROTECTED] If MNT_USERMNT flag is not set in the target vfsmount, then unprivileged mounts will be denied. By default this flag is cleared, and can be set on new mounts, on remounts or with the MS_SETFLAGS option. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] --- Index: linux/fs/namespace.c === --- linux.orig/fs/namespace.c 2007-04-13 13:20:12.0 +0200 +++ linux/fs/namespace.c2007-04-13 13:35:40.0 +0200 @@ -411,6 +411,7 @@ static int show_vfsmnt(struct seq_file * { MNT_NOATIME, ,noatime }, { MNT_NODIRATIME, ,nodiratime }, { MNT_RELATIME, ,relatime }, + { MNT_USERMNT, ,usermnt }, { 0, NULL } }; struct proc_fs_info *fs_infop; @@ -1505,9 +1506,11 @@ long do_mount(char *dev_name, char *dir_ mnt_flags |= MNT_NODIRATIME; if (flags MS_RELATIME) mnt_flags |= MNT_RELATIME; + if (flags MS_USERMNT) + mnt_flags |= MNT_USERMNT; flags = ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE | - MS_NOATIME | MS_NODIRATIME | MS_RELATIME); + MS_NOATIME | MS_NODIRATIME | MS_RELATIME | MS_USERMNT); /* ... and get the mountpoint */ retval = path_lookup(dir_name, LOOKUP_FOLLOW, nd); Index: linux/include/linux/mount.h === --- linux.orig/include/linux/mount.h2007-04-13 13:17:08.0 +0200 +++ linux/include/linux/mount.h 2007-04-13 13:22:17.0 +0200 @@ -28,6 +28,7 @@ struct mnt_namespace; #define MNT_NOATIME0x08 #define MNT_NODIRATIME 0x10 #define MNT_RELATIME 0x20 +#define MNT_USERMNT0x40 #define MNT_SHRINKABLE 0x100 #define MNT_USER 0x200 Index: linux/include/linux/fs.h === --- linux.orig/include/linux/fs.h 2007-04-13 13:23:05.0 +0200 +++ linux/include/linux/fs.h2007-04-13 13:35:34.0 +0200 @@ -130,6 +130,7 @@ extern int dir_notify_enable; #define MS_SETFLAGS(123) /* set specified mount flags */ #define MS_CLEARFLAGS (124) /* clear specified mount flags */ /* MS_SETFLAGS | MS_CLEARFLAGS: change mount flags to specified */ +#define MS_USERMNT (125) /* permit unpriv. submounts under this mount */ #define MS_ACTIVE (130) #define MS_NOUSER (131) -- - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 09/10] allow unprivileged mounts
From: Miklos Szeredi [EMAIL PROTECTED] Define a new fs flag FS_SAFE, which denotes, that unprivileged mounting of this filesystem may not constitute a security problem. Since most filesystems haven't been designed with unprivileged mounting in mind, a thorough audit is needed before setting this flag. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] --- Index: linux/fs/namespace.c === --- linux.orig/fs/namespace.c 2007-04-13 13:35:29.0 +0200 +++ linux/fs/namespace.c2007-04-13 13:35:30.0 +0200 @@ -785,7 +785,8 @@ asmlinkage long sys_oldumount(char __use * o if it's a sticky directory, it must be owned by the user * o it must not be an append-only file/directory */ -static int mount_is_safe(struct nameidata *nd, int *flags) +static int mount_is_safe(struct nameidata *nd, struct file_system_type *type, +int *flags) { struct inode *inode = nd-dentry-d_inode; @@ -795,6 +796,9 @@ static int mount_is_safe(struct nameidat if (!(nd-mnt-mnt_flags MNT_USERMNT)) return -EPERM; + if (type !(type-fs_flags FS_SAFE)) + return -EPERM; + if (!S_ISDIR(inode-i_mode) !S_ISREG(inode-i_mode)) return -EPERM; @@ -1035,7 +1039,7 @@ static int do_loopback(struct nameidata int clone_flags; struct nameidata old_nd; struct vfsmount *mnt = NULL; - int err = mount_is_safe(nd, flags); + int err = mount_is_safe(nd, NULL, flags); if (err) return err; if (!old_name || !*old_name) @@ -1197,26 +1201,46 @@ out: * create a new mount for userspace and request it to be added into the * namespace's tree */ -static int do_new_mount(struct nameidata *nd, char *type, int flags, +static int do_new_mount(struct nameidata *nd, char *fstype, int flags, int mnt_flags, char *name, void *data) { + int err; struct vfsmount *mnt; + struct file_system_type *type; - if (!type || !memchr(type, 0, PAGE_SIZE)) + if (!fstype || !memchr(fstype, 0, PAGE_SIZE)) return -EINVAL; - /* we need capabilities... */ - if (!capable(CAP_SYS_ADMIN)) - return -EPERM; + type = get_fs_type(fstype); + if (!type) + return -ENODEV; - mnt = do_kern_mount(type, flags ~MS_SETUSER, name, data); - if (IS_ERR(mnt)) + err = mount_is_safe(nd, type, flags); + if (err) + goto out_put_filesystem; + + if (flags MS_SETUSER) { + err = reserve_user_mount(); + if (err) + goto out_put_filesystem; + } + + mnt = vfs_kern_mount(type, flags ~MS_SETUSER, name, data); + put_filesystem(type); + if (IS_ERR(mnt)) { + if (flags MS_SETUSER) + dec_nr_user_mounts(); return PTR_ERR(mnt); + } if (flags MS_SETUSER) - set_mnt_user(mnt); + __set_mnt_user(mnt); return do_add_mount(mnt, nd, mnt_flags, NULL); + + out_put_filesystem: + put_filesystem(type); + return err; } /* @@ -1246,7 +1270,7 @@ int do_add_mount(struct vfsmount *newmnt if (S_ISLNK(newmnt-mnt_root-d_inode-i_mode)) goto unlock; - /* MNT_USER was set earlier */ + /* some flags may have been set earlier */ newmnt-mnt_flags |= mnt_flags; if ((err = graft_tree(newmnt, nd))) goto unlock; Index: linux/include/linux/fs.h === --- linux.orig/include/linux/fs.h 2007-04-13 13:35:29.0 +0200 +++ linux/include/linux/fs.h2007-04-13 13:35:30.0 +0200 @@ -96,6 +96,7 @@ extern int dir_notify_enable; #define FS_REQUIRES_DEV 1 #define FS_BINARY_MOUNTDATA 2 #define FS_HAS_SUBTYPE 4 +#define FS_SAFE 8 /* Safe to mount by unprivileged users */ #define FS_REVAL_DOT 16384 /* Check the paths ., .. for staleness */ #define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() * during rename() internally. -- - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 0/8] unprivileged mount syscall
Arn't there ways to escape chroot jails? Serge had pointed me to a URL which showed chroots can be escaped. And if that is true than having all user's private mount tree in the same namespace can be a security issue? No. In fact chrooting the user into /share/$USER will actually _grant_ a privilege to the user, instead of taking it away. It allows the user to modify it's root namespace, which it wouldn't be able to in the initial namespace. So even if the user could escape from the chroot (which I doubt), s/he would not be able to do any harm, since unprivileged mounting would be restricted to /share. Also /share/$USER should only have read/search permission for $USER or no permissions at all, which would mean, that other users' namespaces would be safe from tampering as well. A couple of points. - chroot can be escaped, it is just a chdir for the root directory it is not a security feature. The only security is that you have to be root to call chdir. A carefully done namespace setup won't have that issue. - While it may not violate security as far as what a user is allowed to modify it may violate security as far as what a user is allowed to see. I think that's just up to the permissions in the global namespace. In this example if you 'chmod 0 /share' there won't be anything for the user to see. There are interesting per login cases as well such as allowing a user to replicate their mount tree from another machine when they log in. When /home is on a network filesystem this can be very practical and can allow propagation of mounts across machines not just across a single login session. Yeah, sounds interesting, but I think it's better to get the basics working first, and then we can start to think about the extras. Btw, there's nothing that prevents cloning the namespace _after_ chrooting into the per-user tree. That would still be simpler than doing it the other way round: first creating per-session namespaces and then setting up mount propagation between them. Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag
That depends. Current patches check the unprivileged submounts allowed under this mount flag only on the requested mount and not on the propagated mounts. Do you see a problem with this? I think privileges of this sort should propagate. If I read what you just said correctly if I have a private mount namespace I won't be able to mount anything unless when it was setup the unprivileged submount command was explicitly set. By design yes. Why is that a problem? Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag
Also for bind-mount and remount operations the flag has to be propagated down its propagation tree. Otherwise a unpriviledged mount in a shared mount wont get reflected in its peers and slaves, leading to unidentical shared-subtrees. That's an interesting question. Do we want shared mounts to be totally identical, including mnt_flags? It doesn't look as if do_remount() guarantees that currently. Depends on the semantics of each of the flags. Some flags like of the read/write flag, would not interfere with the propagation semantics AFAICT. But this one certainly seems to interfere. That depends. Current patches check the unprivileged submounts allowed under this mount flag only on the requested mount and not on the propagated mounts. Do you see a problem with this? Don't see a problem if the flag is propagated to all peers and slave mounts. If not, I see a problem. What if the propagated mount has its flag set to not do un-priviledged mounts, whereas the requested mount has it allowed? Then the mount is allowed. It is up to the sysadmin/distro to design set up the propagations in a way that this is not a problem. I think it would be much less clear conceptually, if unprivileged mounting would have to check propagations as well. Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ZFS with Linux: An Open Plea
FUSE is nice for trying out new and interresting ideas in userspace - it has its uses. Yes, but it is not really for the end-user. To paraphrase another, it is mostly academic. Oh? I thought those ~10,000 downloads of SSHFS and ~200,000 downloads of NTFS-3G were end users.(*) Maybe I was wrong though. Thanks for the clarification. Miklos (*) Numbers obviously don't include packaged installations - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 05/10] Add permit user submounts flag to vfsmount
From: Miklos Szeredi [EMAIL PROTECTED] If MNT_USERMNT flag is not set in the target vfsmount, then MNT_USER and MNT_USERMNT? I claim no way will people keep those straight. How about MNT_ALLOWUSER and MNT_USER? Umm, is allowuser more clear than usermnt? What is allowed to the user? allowusermnt may be more descriptive, but it's a bit too long. I don't think it matters all that much, the user will have to look up the semantics in the manpage anyway. Is nosuid descriptive? Not very much, but we got used to it. Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag
Interesting So far even today these things can happen, however they are sufficiently unlikely the tools don't account for them. Once a hostile user can cause them things are more of a problem. (Unless you want to tackle each problem legacy tool one at a time to remove problems - i.e. deluser should umount everything under /home/hallyn before deleting, backup should be spawned from it's own namespace cloned right after boot or just back up on one filesystem, etc.) I don't see a way that backup and deluser won't need to be modified to work properly in a system where non-priveleged mounts are allowed, at least they will need to account for /share. That said it is clearly a hazard if we enable this functionality by default. If we setup a pam module that triggers on login and perhaps when cron and at jobs run to setup an additional mount namespace I think keeping applications locked away in their own mount namespace is sufficient to avoid hostile users from doing unexpected things to the initial mount namespace. So unless I am mistake it should be relatively simple to prevent user space from encountering problems. That still leaves the question of how we handle systems with an old user space that is insufficiently robust to deal with mounts occurring at unexpected locations. I think a simple sysctl to enable/disable of non-priveleged mounts defaulting to disabled is enough. Am I correct or will it be more difficult than just a little pam module to ensure non-trusted users never run in the initial mount namespace? I'm still not sure, what your problem is. With the v3 of the usermounts patchset, by default, user mounts are disabled, because the allow unpriv submounts flag is cleared on all mounts. There are several options available to sysadmins and distro builders to enable user mounts in a secure way: - pam module, which creates a private namespace, and sets allow unpriv submounts on the mounts within the namespace - pam module, which rbinds / onto /mnt/ns/$USER, and chroots into /mnt/ns/$USER, then sets the allow unpriv submounts on the mounts under /mnt/ns/$USER. - sysadmin creates /mnt/usermounts writable to all users, with sticky bit (same as /tmp), does mount --bind /mnt/usermounts /mnt/usermounts and sets the allow unpriv submounts on /mnt/usermounts. All of these are perfectly safe wrt userdel and backup (assuming it doesn't try back up /mnt). Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 05/10] Add permit user submounts flag to vfsmount
MNT_USER and MNT_USERMNT? I claim no way will people keep those straight. How about MNT_ALLOWUSER and MNT_USER? Umm, is allowuser more clear than usermnt? What is allowed to the I think so, yes. One makes it clear that we're talking about allowing user (somethings :), one might just as well mean this is a user mount. user? allowusermnt may be more descriptive, but it's a bit too long. Yes, if it weren't too long it would by far have been my preference. Maybe despite the length we should still go with it... I don't think it matters all that much, the user will have to look up the semantics in the manpage anyway. Is nosuid descriptive? Not very much, but we got used to it. nosuid is quite clear. Is it? Shouldn't these be allowsuid, noallowsuid, allowexec, noallowexec? See, we mentally add the allow quite easily. MNT_USER and MNT_USERMNT are so confusing that in the time I go from quitting the manpage to foregrounding my editor, I may have already forgotten which was which. Well, to the user they are always in the form user=123 and usermnt, so they are not as easy to confuse. But I feel a bit stupid bickering about this, because it isn't so important. allowuser or allowusermnt are fine by me if you think they are substantially better than usermnt. Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag
I'm a bit lost about what is currently done and who advocates for what. It seems to me the MNT_ALLOWUSERMNT (or whatever :) flag should be propagated. In the /share rbind+chroot example, I assume the admin would start by doing mount --bind /share /share mount --make-slave /share mount --bind -o allow_user_mounts /share (or whatever) mount --make-shared /share then on login, pam does chroot /share/$USER or some sort of mount --bind /share /home/$USER/root chroot /home/$USER/root or whatever. In any case, the user cannot make user mounts except under /share, and any cloned namespaces will still allow user mounts. I don't quite understand your method. This is how I think of it: mount --make-rshared / mkdir -p /mnt/ns/$USER mount --rbind / /mnt/ns/$USER mount --make-rslave /mnt/ns/$USER mount --set-flags --recursive -oallowusermnt /mnt/ns/$USER chroot /mnt/ns/$USER su - $USER I did actually try something equivalent (without the fancy mount commands though), and it worked fine. The only problem is the proliferation of mounts in /proc/mounts. There was a recently posted patch in AppArmor, that at least hides unreachable mounts from /proc/mounts, so the user wouldn't see all those. But it could still be pretty confusing to the sysadmin. So in that sense doing it the complicated way, by first cloning the namespace, and then copying and sharing mounts individually which need to be shared could relieve this somewhat. Another point: user mounts under /proc and /sys shouldn't be allowed. There are files there (at least in /proc) that are seemingly writable by the user, but they are still not writable in the sense, that normal files are. Anyway, there are lots of userspace policy issues, but those don't impact the kernel part. As for the original question of propagating the allowusermnt flag, I think it doesn't matter, as long as it's consistent and documented. Propagating some mount flags and not propagating others is inconsistent and confusing, so I wouldn't want that. Currently remount doesn't propagate mount flags, that may be a bug, dunno. Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag
I'm still not sure, what your problem is. My problem right now is that I see a serious complexity escalation in the user interface that we must support indefinitely. I see us taking a nice powerful concept and seriously watering it down. To some extent we have to avoid confusing suid applications. (I would so love to remove the SUID bit...). I'm being contrary to ensure we have a good code review. OK. And it's very much appreciated :) I have heard it said that there are two kinds of design. Something so simple it obviously has no deficiencies. Something so complex it has no obvious deficiencies. I am very much afraid we are slipping the mount namespace into the latter category of work. Which is a bad bad thing for core OS feature. I've tried to make this unprivileged mount thing as simple as possible, and no simpler. If we can make it even simpler, all the better. With the v3 of the usermounts patchset, by default, user mounts are disabled, because the allow unpriv submounts flag is cleared on all mounts. There are several options available to sysadmins and distro builders to enable user mounts in a secure way: - pam module, which creates a private namespace, and sets allow unpriv submounts on the mounts within the namespace - pam module, which rbinds / onto /mnt/ns/$USER, and chroots into /mnt/ns/$USER, then sets the allow unpriv submounts on the mounts under /mnt/ns/$USER. In part this really disturbs me because we now have two mechanisms for controlling the scope of what a user can do. You mean rbind+chroot and clone(CLONE_NS)? Yes, those are two different mechanisms achieving very similar results. But what has this to do with unprivileged mounts? A flag or a new namespace. Two mechanisms to accomplish the same thing sound wrong, and hard to manage. The flag permitting the unprivileged mounts (which we now agreed to name allowusermnt) is used in both cases. Just creating a new namespace doesn't always imply that you want to allow user mounts inside, does it? These are orthogonal features. - sysadmin creates /mnt/usermounts writable to all users, with sticky bit (same as /tmp), does mount --bind /mnt/usermounts /mnt/usermounts and sets the allow unpriv submounts on /mnt/usermounts. All of these are perfectly safe wrt userdel and backup (assuming it doesn't try back up /mnt). I also don't understand at all the user= mount flag and options. The user=UID or (or MNT_USER flag) serves multiple purposes: - help mount(8) move away from /etc/mtab - allow unprivileged umounts - account user mounts All it seemed to be used for was adding permissions to unmount. In user space to deal with the lack of any form of untrusted mounts I can understand this. In kernel space this seems to be more of a problem. Why is handling unprivileged mounts in kernel different from handling them in userspace in this respect? Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag
mount --make-rshared / mkdir -p /mnt/ns/$USER mount --rbind / /mnt/ns/$USER mount --make-rslave /mnt/ns/$USER This was my main point - that the tree in which users can mount will be a slave of /, so that propagating the are user mounts allowed flag among peers is safe and intuitive. I think it's equally intuitive if flags are not propagated. After all, I may want to set the mount flag _only_ on this particular mount, and not anything else. This is something I wouln't be sure about either way without consulting the man. True. But the kernel functionality you provide enables both ways so no problem in either case :) Another point: user mounts under /proc and /sys shouldn't be allowed. There are files there (at least in /proc) that are seemingly writable by the user, but they are still not writable in the sense, that normal files are. Good point. Anyway, there are lots of userspace policy issues, but those don't impact the kernel part. Though it might make sense to enforce /proc and /sys not allowing user mounts under them in the kernel. I think not, it would just be another policy decision having to be made in the kernel on an fs by fs basis, and getting it wrong would be much worse than getting it wrong in userspace. Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag
I'm a bit lost about what is currently done and who advocates for what. It seems to me the MNT_ALLOWUSERMNT (or whatever :) flag should be propagated. In the /share rbind+chroot example, I assume the admin would start by doing mount --bind /share /share mount --make-slave /share mount --bind -o allow_user_mounts /share (or whatever) mount --make-shared /share then on login, pam does chroot /share/$USER or some sort of mount --bind /share /home/$USER/root chroot /home/$USER/root or whatever. In any case, the user cannot make user mounts except under /share, and any cloned namespaces will still allow user mounts. I don't quite understand your method. This is how I think of it: mount --make-rshared / mkdir -p /mnt/ns/$USER mount --rbind / /mnt/ns/$USER mount --make-rslave /mnt/ns/$USER mount --set-flags --recursive -oallowusermnt /mnt/ns/$USER chroot /mnt/ns/$USER su - $USER I did actually try something equivalent (without the fancy mount commands though), and it worked fine. The only problem is the proliferation of mounts in /proc/mounts. There was a recently posted patch in AppArmor, that at least hides unreachable mounts from /proc/mounts, so the user wouldn't see all those. But it could still be pretty confusing to the sysadmin. unbindable mounts were designed to overcome the proliferation problem. Your steps should be something like this: mount --make-rshared / mkdir -p /mnt/ns mount --bind /mnt/ns /mnt/ns mount --make-unbindable /mnt/ns mkdir -p /mnt/ns/$USER mount --rbind / /mnt/ns/$USER mount --make-rslave /mnt/ns/$USER mount --set-flags --recursive -oallowusermnt /mnt/ns/$USER chroot /mnt/ns/$USER su - $USER try this and your proliferation problem will disappear. :-) Right, this is needed. My problem wasn't actually this (which would only have hit, if I tried with more than one user), just that the number of mounts in /proc/mounts grows linearly with the number of users. That can't be helped in such an easy way unfortunately. Propagating some mount flags and not propagating others is inconsistent and confusing, so I wouldn't want that. Currently remount doesn't propagate mount flags, that may be a bug, For consistency reason, one can propagate all the flags. But propagating only those flags that interfere with shared-subtree semantics should suffice. I still don't believe not propagating allowusermnt interferes with mount propagation. In my posted patches the mount (including propagations) is allowed based on the allowusermnt flag on the parent of the requested mount. The flag is _not_ checked during propagation. Allowing this and other flags to NOT be propagated just makes it possible to have a set of shared mounts with asymmetric properties, which may actually be desirable. Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag
I've tried to make this unprivileged mount thing as simple as possible, and no simpler. If we can make it even simpler, all the better. We are certainly much more complex then the code in plan9 (just read through it) so I think we have room for improvement. Just for reference what I saw in plan 9 was: - No super user checks in it's mount, unmount, or namespace creation paths. - A flag to deny new mounts but not new bind mounts (for administrative purposes the comment said). Our differences from plan9. - suid capable binaries. (SUID please go away). - A history of programs assuming only root could call mount/unmount. I hate suid as well. _The_ motivation behind this patchset was to get rid of fusermount, a suid mount helper for fuse. But I don't think suid is going away, and definitely not overnight. Also I don't think we want to require auditing userspace before enabling user mounts. If I understand correctly, your proposal is to get rid of MNT_USER and MNT_ALLOWUSERMNT and allow/deny unprivileged mounts and umounts based on a boolean sysctl flag and on a check if the target namespace is the initial namespace or not. And maybe add some extra checks which prevent ugliness from happening with suid programs. Is this correct? If so, how are we going to make sure this won't break existing userspace without doing a full audit of all suid programs in every distro that wants this feature? Also how are we going to prevent the user from creating millions of mounts, and using up all the kernel memory for vfsmounts? Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag
Allowing this and other flags to NOT be propagated just makes it possible to have a set of shared mounts with asymmetric properties, which may actually be desirable. The shared mount feature was designed to ensure that the mount remained identical at all the locations. OK, so remount not propagating mount flags is a bug then? Now designing features to make it un-identical but still naming it shared, will break its original purpose. Slave mounts were designed to make it asymmetric. What if I want to modify flags in a master mount, but not the slave mount? Would I be screwed? For example: mount is read-only in both master and slave. I want to mark it read-write in master but not in slave. What do I do? Whatever feature that is desired to be exploited; can that be exploited with the current set of semantics that we have? Is there a real need to make the mounts asymmetric but at the same time name them as shared? Maybe I dont understand what the desired application is? I do think this question of propagating mount flags is totally independent of user mounts. As it stands, currently remount doesn't propagate mount flags, and I don't see any compelling reasons why it should. The patchset introduces a new mount flag allowusermnt, but I don't see any compelling reason to propagate this flag _either_. Please say so if you do have such a reason. As I've explained, having this flag set differently in parts of a propagation tree does not interfere with or break propagation in any way. Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag
I've tried to make this unprivileged mount thing as simple as possible, and no simpler. If we can make it even simpler, all the better. We are certainly much more complex then the code in plan9 (just read through it) so I think we have room for improvement. Just for reference what I saw in plan 9 was: - No super user checks in it's mount, unmount, or namespace creation paths. - A flag to deny new mounts but not new bind mounts (for administrative purposes the comment said). Our differences from plan9. - suid capable binaries. (SUID please go away). - A history of programs assuming only root could call mount/unmount. I hate suid as well. _The_ motivation behind this patchset was to get rid of fusermount, a suid mount helper for fuse. But I don't think suid is going away, and definitely not overnight. Also I don't think we want to require auditing userspace before enabling user mounts. If I understand correctly, your proposal is to get rid of MNT_USER and MNT_ALLOWUSERMNT and allow/deny unprivileged mounts and umounts based on a boolean sysctl flag and on a check if the target namespace is the initial namespace or not. And maybe add some extra checks which prevent ugliness from happening with suid programs. Is this correct? If so, how are we going to make sure this won't break existing userspace without doing a full audit of all suid programs in every distro that wants this feature? Also how are we going to prevent the user from creating millions of mounts, and using up all the kernel memory for vfsmounts? Don't forget that almost all mount flags are per-superblock. How are you planning on dealing with the case that one user mounts a filesystem read-only, while another is trying to mount the same one read-write? Yeah, I forgot, the per-mount read-only patches are not yet in. That doesn't really change my agrument though. _If_ the flag is per mount, then it makes sense to be able to change it on a master and not on a slave. If mount flags are propagated, this is not possible. Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag
Don't forget that almost all mount flags are per-superblock. How are you planning on dealing with the case that one user mounts a filesystem read-only, while another is trying to mount the same one read-write? Yeah, I forgot, the per-mount read-only patches are not yet in. That doesn't really change my agrument though. _If_ the flag is per mount, then it makes sense to be able to change it on a master and not on a slave. If mount flags are propagated, this is not possible. Read-only isn't the only issue. On something like NFS, there are flags to set the security flavour, turn on/off encryption etc. If I mount your home directory using no encryption in my namespace, for instance, then neither you nor the administrator will be able to turn it on afterwards in yours without first unmounting it from mine so that the superblock is destroyed. OK, that's interesting, but I fail to grasp the relevance of this to unprivileged mounts. Or are you thinking of unprivileged NFS mounts? Well, think again, because that involves _much_ more than it seems at first glance. Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag
I've tried to make this unprivileged mount thing as simple as possible, and no simpler. If we can make it even simpler, all the better. We are certainly much more complex then the code in plan9 (just read through it) so I think we have room for improvement. Just for reference what I saw in plan 9 was: - No super user checks in it's mount, unmount, or namespace creation paths. - A flag to deny new mounts but not new bind mounts (for administrative purposes the comment said). Our differences from plan9. - suid capable binaries. (SUID please go away). - A history of programs assuming only root could call mount/unmount. I hate suid as well. _The_ motivation behind this patchset was to get rid of fusermount, a suid mount helper for fuse. But I don't think suid is going away, and definitely not overnight. Agreed, unless it just happens that killing is equally as hard as doing non-privileged mounts. Which I really don't think will be the case. suid executables can always be replaced by a non-privileged part that connect to a daemon on localhost. Also I don't think we want to require auditing userspace before enabling user mounts. Given the complexity of the code to avoids audits adds, and especially the uncertainty of how all of the pieces add up together I really think we do need to audit user space (but only in a limited way). This is a paradigm shift in how mounts are managed and to get the full benefit of it we need to be prepared to deal with the paradigm shift. Essentially what the audit must ensure is that (a) non-root users are always run in a non-default namespace so mounts and unmounts they generate will not have global effect. But unprivileged proceses are not only created through login. What happens, when some process does setuid(NONZERO). With your proposal it now gained the privilege of mounting in the initial namespace. IOW every program calling setuid() now has to be audited for any problems this could cause. (b) If we setup something like the proposed /share that administrative tools know how to treat it properly. That is not an especially hard audit of user space and it noticeably reduces the complexity of what we must implement. If I understand correctly, your proposal is to get rid of MNT_USER and MNT_ALLOWUSERMNT and allow/deny unprivileged mounts and umounts based on a boolean sysctl flag and on a check if the target namespace is the initial namespace or not. No. I really do not want to treat the initial namespace in a special way from an implementation point of view. Ah. From a distro point of view I don't think we should ever allow a user into the initial mount namespace, and that is what we should audit. All of the ways a user can login to a machine. We need to audit the login paths anyway if we are going to do anything interesting with the mount namespace. So I see this as no real additional overhead to make things usable. And maybe add some extra checks which prevent ugliness from happening with suid programs. Is this correct? Definitely add some extra permission checks to prevent ugliness from happening with suid executables. In the general case the problem with suid executables is that they expect parts of the mount namespace that a user cannot modify not to be modified. Ensuring that our permission checks keep that promise for mount and umount is going to be a bit challenging, because we need to think through a lot of scenarios. But it is not that hard. If so, how are we going to make sure this won't break existing userspace without doing a full audit of all suid programs in every distro that wants this feature? The permission checks to ensure you can not modify a filesystem with mounts in a way that you could not modify it by creating or deleting files should be enough. Umm, well. What if user mounts a filesystem read-only. Then wants to unmount, but hey, it's read-only, no permission to write, no unmount... That's just a stupid example, but it shows, that file permissions are not always useful for determining what you can mount, and especially what you can unmount. That probably means that we are restricted to mounting filesystems with the equivalent of uid=$(id -u) gid=$(id -g). That is if you aren't root all files in the filesystem you cause to be mounted must be owned by you. That way you have permission to unmount or delete them. That rules out unprivileged bind mounts. At the very least the user who causes a mount to be created should be the owner and have complete control over the directory mount point. So that they can at least theoretically remove all of the files and directories at the mount point (which would be equivalent to unmounting it). Checking the permissions on the mountpoint to allow unmounting is - rather inelegant: user can't see those permissions, can only determine if
Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag
As I said earlier, I see a case where two mounts that are peers of each other can become un-identical if we dont propagate the allowusermnt. As a practical example. /tmp and /mnt are peers of each other. /tmp has its allowusermnt flag set, which has not been propagated to /mnt. now a normal-user mounts an ext2 file system under /tmp at /tmp/1 unfortunately the mount wont appear under /mnt/1 Argh, that is not true. That's what I've been trying to explain to you all along. I now realize you did, but I failed to catch it. sorry :-( It is my fault also. I will add better description to the patch headers. The propagation will be done _regardless_ of the flag. The flag is only checked for the parent of the _requested_ mount. If it is allowed there, the mount, including any propagations are allowed. If it's denied, then obviously it's denied everywhere. and in case if you allow the mount to appear under /mnt/1, you will break unpriviledge mounts semantics which promises: a normal user will not be able to mount at a location that does not allow user-mounts. No, it does not promise that. The flag just promises, that the user cannot _request_ a mount on the parent mount. ok. if the ability for a normal user to mount something *indirectly* under a mount that has its 'allowusermnt flag' unset, is acceptable under the definition of 'allowusermnt', i guess my only choice is to accept it. :-) It is the only sane way I think. Maybe the MS_SETFLAGS/MS_CLEARFLAGS mount operation could have an option for propagating the given flag(s). Then it's really up to the sysadmin, to determine in each case what the desired behavior is. Yes, you are right, that most of the time propagating the allowusermnt between peers may be the right thing to do. But for example propagating it from master to slave may not be a good thing at all. So it shouldn't be for the kernel to decide. Thanks, Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag
Checking the permissions on the mountpoint to allow unmounting is - rather inelegant: user can't see those permissions, can only determine if umount is allowed by trial and error - may be a security hole, e.g.: sysadmin: mkdir -m 777 /mnt/disk mount /dev/hda2 /mnt/disk Of course the user doesn't have the right to delete the contents of the mount, yet the permissions on the mountpoint would imply that s/he has permission to umount the disk. It is becoming increasingly apparent, that mount/umount permission based on file permissions is inherently broken: 1) there are user-writable files under /proc/$PID/, which definitely shouldn't be allowed to be overmounted 2) if user mounts an fs read-only, then wants to create a submount of this, it will fail with the current patchset Solving 2) should be trivial: submounting a mount owned by the user should be always allowed regardless of the file permissions. Maybe this could be generaized to say, that a mount can be submounted by an unprivileged user IFF parent mount is owned by said user. This would get rid of some of the complications in the current patchset, namely the functionality of MNT_ALLOWUSERMNT and MNT_USER flags would be merged, and the permission checking would be removed. For example on login, the user could get a private namespace set up some that the home directory is owned by the user, and hence can be freely submounted: clone(CLONE_NEWNS) mount --bind /home/$USER /home/$USER mount --remount -ouser=$USER /home/$USER This is of course more limiting than allowing mounts based on file permissions, but it's also a lot cleaner. Hmm? Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 09/12] mm: count unstable pages per BDI
Count per BDI unstable pages. I'm wondering, is it really worth having this category separate from per BDI brity pages? With the exception of the export to sysfs, always the sum of unstable + dirty is used. Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 11/12] mm: per device dirty threshold
+static inline unsigned long bdi_stat_delta(void) +{ +#ifdef CONFIG_SMP + return NR_CPUS * FBC_BATCH; Shouln't this be multiplied by the number of counters to sum? I.e. 3 if dirty and unstable are separate, and 2 if they are not. Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 09/12] mm: count unstable pages per BDI
Index: linux-2.6/fs/buffer.c === --- linux-2.6.orig/fs/buffer.c2007-04-19 19:59:26.0 +0200 +++ linux-2.6/fs/buffer.c 2007-04-19 20:35:39.0 +0200 @@ -733,7 +733,7 @@ int __set_page_dirty_buffers(struct page if (page-mapping) {/* Race with truncate? */ if (mapping_cap_account_dirty(mapping)) { __inc_zone_page_state(page, NR_FILE_DIRTY); - __inc_bdi_stat(mapping-backing_dev_info, BDI_DIRTY); + __inc_bdi_stat(mapping-backing_dev_info, BDI_RECLAIM); This name suggests it's _under_ reclaim, which is not true. You might rather want to call it BDI_RECLAIMABLE, or something similar in meaning. Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 2/8] allow unprivileged umount
From: Miklos Szeredi [EMAIL PROTECTED] The owner doesn't need sysadmin capabilities to call umount(). Similar behavior as umount(8) on mounts having user=UID option in /etc/mtab. The difference is that umount also checks /etc/fstab, presumably to exclude another mount on the same mountpoint. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] --- Index: linux/fs/namespace.c === --- linux.orig/fs/namespace.c 2007-04-20 11:55:05.0 +0200 +++ linux/fs/namespace.c2007-04-20 11:55:06.0 +0200 @@ -659,6 +659,25 @@ static int do_umount(struct vfsmount *mn } /* + * umount is permitted for + * - sysadmin + * - mount owner, if not forced umount + */ +static bool permit_umount(struct vfsmount *mnt, int flags) +{ + if (capable(CAP_SYS_ADMIN)) + return true; + + if (!(mnt-mnt_flags MNT_USER)) + return false; + + if (flags MNT_FORCE) + return false; + + return mnt-mnt_uid == current-uid; +} + +/* * Now umount can handle mount points as well as block devices. * This is important for filesystems which use unnamed block devices. * @@ -681,7 +700,7 @@ asmlinkage long sys_umount(char __user * goto dput_and_out; retval = -EPERM; - if (!capable(CAP_SYS_ADMIN)) + if (!permit_umount(nd.mnt, flags)) goto dput_and_out; retval = do_umount(nd.mnt, flags); -- - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 1/8] add user mounts to the kernel
From: Miklos Szeredi [EMAIL PROTECTED] Add ownership information to mounts. A new mount flag, MS_SETUSER is used to make a mount owned by a user. If this flag is specified, then the owner will be set to the current real user id and the mount will be marked with the MNT_USER flag. On remount don't preserve previous owner, and treat MS_SETUSER as for a new mount. The MS_SETUSER flag is ignored on mount move. The MNT_USER flag is not copied on any kind of mount cloning: namespace creation, binding or propagation. For bind mounts the cloned mount(s) are set to MNT_USER depending on the MS_SETUSER mount flag. In all the other cases MNT_USER is always cleared. For MNT_USER mounts a user=UID option is added to /proc/PID/mounts. This is compatible with how mount ownership is stored in /etc/mtab. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] --- Index: linux/fs/namespace.c === --- linux.orig/fs/namespace.c 2007-04-20 11:55:02.0 +0200 +++ linux/fs/namespace.c2007-04-20 11:55:05.0 +0200 @@ -227,6 +227,13 @@ static struct vfsmount *skip_mnt_tree(st return p; } +static void set_mnt_user(struct vfsmount *mnt) +{ + BUG_ON(mnt-mnt_flags MNT_USER); + mnt-mnt_uid = current-uid; + mnt-mnt_flags |= MNT_USER; +} + static struct vfsmount *clone_mnt(struct vfsmount *old, struct dentry *root, int flag) { @@ -241,6 +248,11 @@ static struct vfsmount *clone_mnt(struct mnt-mnt_mountpoint = mnt-mnt_root; mnt-mnt_parent = mnt; + /* don't copy the MNT_USER flag */ + mnt-mnt_flags = ~MNT_USER; + if (flag CL_SETUSER) + set_mnt_user(mnt); + if (flag CL_SLAVE) { list_add(mnt-mnt_slave, old-mnt_slave_list); mnt-mnt_master = old; @@ -403,6 +415,8 @@ static int show_vfsmnt(struct seq_file * if (mnt-mnt_flags fs_infop-flag) seq_puts(m, fs_infop-str); } + if (mnt-mnt_flags MNT_USER) + seq_printf(m, ,user=%i, mnt-mnt_uid); if (mnt-mnt_sb-s_op-show_options) err = mnt-mnt_sb-s_op-show_options(m, mnt); seq_puts(m, 0 0\n); @@ -920,8 +934,9 @@ static int do_change_type(struct nameida /* * do loopback mount. */ -static int do_loopback(struct nameidata *nd, char *old_name, int recurse) +static int do_loopback(struct nameidata *nd, char *old_name, int flags) { + int clone_flags; struct nameidata old_nd; struct vfsmount *mnt = NULL; int err = mount_is_safe(nd); @@ -941,11 +956,12 @@ static int do_loopback(struct nameidata if (!check_mnt(nd-mnt) || !check_mnt(old_nd.mnt)) goto out; + clone_flags = (flags MS_SETUSER) ? CL_SETUSER : 0; err = -ENOMEM; - if (recurse) - mnt = copy_tree(old_nd.mnt, old_nd.dentry, 0); + if (flags MS_REC) + mnt = copy_tree(old_nd.mnt, old_nd.dentry, clone_flags); else - mnt = clone_mnt(old_nd.mnt, old_nd.dentry, 0); + mnt = clone_mnt(old_nd.mnt, old_nd.dentry, clone_flags); if (!mnt) goto out; @@ -987,8 +1003,11 @@ static int do_remount(struct nameidata * down_write(sb-s_umount); err = do_remount_sb(sb, flags, data, 0); - if (!err) + if (!err) { nd-mnt-mnt_flags = mnt_flags; + if (flags MS_SETUSER) + set_mnt_user(nd-mnt); + } up_write(sb-s_umount); if (!err) security_sb_post_remount(nd-mnt, flags, data); @@ -1093,10 +1112,13 @@ static int do_new_mount(struct nameidata if (!capable(CAP_SYS_ADMIN)) return -EPERM; - mnt = do_kern_mount(type, flags, name, data); + mnt = do_kern_mount(type, flags ~MS_SETUSER, name, data); if (IS_ERR(mnt)) return PTR_ERR(mnt); + if (flags MS_SETUSER) + set_mnt_user(mnt); + return do_add_mount(mnt, nd, mnt_flags, NULL); } @@ -1127,7 +1149,8 @@ int do_add_mount(struct vfsmount *newmnt if (S_ISLNK(newmnt-mnt_root-d_inode-i_mode)) goto unlock; - newmnt-mnt_flags = mnt_flags; + /* MNT_USER was set earlier */ + newmnt-mnt_flags |= mnt_flags; if ((err = graft_tree(newmnt, nd))) goto unlock; @@ -1447,7 +1470,7 @@ long do_mount(char *dev_name, char *dir_ retval = do_remount(nd, flags ~MS_REMOUNT, mnt_flags, data_page); else if (flags MS_BIND) - retval = do_loopback(nd, dev_name, flags MS_REC); + retval = do_loopback(nd, dev_name, flags); else if (flags (MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE
[patch 4/8] propagate error values from clone_mnt
From: Miklos Szeredi [EMAIL PROTECTED] Allow clone_mnt() to return errors other than ENOMEM. This will be used for returning a different error value when the number of user mounts goes over the limit. Fix copy_tree() to return EPERM for unbindable mounts. Don't propagate further from dup_mnt_ns() as that copy_tree() can only fail with -ENOMEM. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] --- Index: linux/fs/namespace.c === --- linux.orig/fs/namespace.c 2007-04-20 11:55:07.0 +0200 +++ linux/fs/namespace.c2007-04-20 11:55:09.0 +0200 @@ -261,42 +261,42 @@ static struct vfsmount *clone_mnt(struct { struct super_block *sb = old-mnt_sb; struct vfsmount *mnt = alloc_vfsmnt(old-mnt_devname); + if (!mnt) + return ERR_PTR(-ENOMEM); - if (mnt) { - mnt-mnt_flags = old-mnt_flags; - atomic_inc(sb-s_active); - mnt-mnt_sb = sb; - mnt-mnt_root = dget(root); - mnt-mnt_mountpoint = mnt-mnt_root; - mnt-mnt_parent = mnt; - - /* don't copy the MNT_USER flag */ - mnt-mnt_flags = ~MNT_USER; - if (flag CL_SETUSER) - set_mnt_user(mnt); - - if (flag CL_SLAVE) { - list_add(mnt-mnt_slave, old-mnt_slave_list); - mnt-mnt_master = old; - CLEAR_MNT_SHARED(mnt); - } else { - if ((flag CL_PROPAGATION) || IS_MNT_SHARED(old)) - list_add(mnt-mnt_share, old-mnt_share); - if (IS_MNT_SLAVE(old)) - list_add(mnt-mnt_slave, old-mnt_slave); - mnt-mnt_master = old-mnt_master; - } - if (flag CL_MAKE_SHARED) - set_mnt_shared(mnt); + mnt-mnt_flags = old-mnt_flags; + atomic_inc(sb-s_active); + mnt-mnt_sb = sb; + mnt-mnt_root = dget(root); + mnt-mnt_mountpoint = mnt-mnt_root; + mnt-mnt_parent = mnt; + + /* don't copy the MNT_USER flag */ + mnt-mnt_flags = ~MNT_USER; + if (flag CL_SETUSER) + set_mnt_user(mnt); - /* stick the duplicate mount on the same expiry list -* as the original if that was on one */ - if (flag CL_EXPIRE) { - spin_lock(vfsmount_lock); - if (!list_empty(old-mnt_expire)) - list_add(mnt-mnt_expire, old-mnt_expire); - spin_unlock(vfsmount_lock); - } + if (flag CL_SLAVE) { + list_add(mnt-mnt_slave, old-mnt_slave_list); + mnt-mnt_master = old; + CLEAR_MNT_SHARED(mnt); + } else { + if ((flag CL_PROPAGATION) || IS_MNT_SHARED(old)) + list_add(mnt-mnt_share, old-mnt_share); + if (IS_MNT_SLAVE(old)) + list_add(mnt-mnt_slave, old-mnt_slave); + mnt-mnt_master = old-mnt_master; + } + if (flag CL_MAKE_SHARED) + set_mnt_shared(mnt); + + /* stick the duplicate mount on the same expiry list +* as the original if that was on one */ + if (flag CL_EXPIRE) { + spin_lock(vfsmount_lock); + if (!list_empty(old-mnt_expire)) + list_add(mnt-mnt_expire, old-mnt_expire); + spin_unlock(vfsmount_lock); } return mnt; } @@ -781,11 +781,11 @@ struct vfsmount *copy_tree(struct vfsmou struct nameidata nd; if (!(flag CL_COPY_ALL) IS_MNT_UNBINDABLE(mnt)) - return NULL; + return ERR_PTR(-EPERM); res = q = clone_mnt(mnt, dentry, flag); - if (!q) - goto Enomem; + if (IS_ERR(q)) + goto error; q-mnt_mountpoint = mnt-mnt_mountpoint; p = mnt; @@ -806,8 +806,8 @@ struct vfsmount *copy_tree(struct vfsmou nd.mnt = q; nd.dentry = p-mnt_mountpoint; q = clone_mnt(p, p-mnt_root, flag); - if (!q) - goto Enomem; + if (IS_ERR(q)) + goto error; spin_lock(vfsmount_lock); list_add_tail(q-mnt_list, res-mnt_list); attach_mnt(q, nd); @@ -815,7 +815,7 @@ struct vfsmount *copy_tree(struct vfsmou } } return res; -Enomem: + error: if (res) { LIST_HEAD(umount_list); spin_lock(vfsmount_lock); @@ -823,7 +823,7 @@ Enomem: spin_unlock(vfsmount_lock); release_mounts(umount_list); } - return
[patch 6/8] put declaration of put_filesystem() in fs.h
From: Miklos Szeredi [EMAIL PROTECTED] Declarations go into headers. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] --- Index: linux/fs/super.c === --- linux.orig/fs/super.c 2007-04-20 11:55:02.0 +0200 +++ linux/fs/super.c2007-04-20 11:55:11.0 +0200 @@ -40,10 +40,6 @@ #include asm/uaccess.h -void get_filesystem(struct file_system_type *fs); -void put_filesystem(struct file_system_type *fs); -struct file_system_type *get_fs_type(const char *name); - LIST_HEAD(super_blocks); DEFINE_SPINLOCK(sb_lock); Index: linux/include/linux/fs.h === --- linux.orig/include/linux/fs.h 2007-04-20 11:55:07.0 +0200 +++ linux/include/linux/fs.h2007-04-20 11:55:11.0 +0200 @@ -1918,6 +1918,8 @@ extern int vfs_fstat(unsigned int, struc extern int vfs_ioctl(struct file *, unsigned int, unsigned int, unsigned long); +extern void get_filesystem(struct file_system_type *fs); +extern void put_filesystem(struct file_system_type *fs); extern struct file_system_type *get_fs_type(const char *name); extern struct super_block *get_super(struct block_device *); extern struct super_block *user_get_super(dev_t); -- - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 5/8] allow unprivileged bind mounts
From: Miklos Szeredi [EMAIL PROTECTED] Allow bind mounts to unprivileged users if the following conditions are met: - mountpoint is not a symlink or special file - parent mount is owned by the user - the number of user mounts is below the maximum Unprivileged mounts imply MS_SETUSER, and will also have the nosuid and nodev mount flags set. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] --- Index: linux/fs/namespace.c === --- linux.orig/fs/namespace.c 2007-04-20 11:55:09.0 +0200 +++ linux/fs/namespace.c2007-04-20 11:55:10.0 +0200 @@ -237,11 +237,30 @@ static void dec_nr_user_mounts(void) spin_unlock(vfsmount_lock); } -static void set_mnt_user(struct vfsmount *mnt) +static int reserve_user_mount(void) +{ + int err = 0; + spin_lock(vfsmount_lock); + if (nr_user_mounts = max_user_mounts !capable(CAP_SYS_ADMIN)) + err = -EPERM; + else + nr_user_mounts++; + spin_unlock(vfsmount_lock); + return err; +} + +static void __set_mnt_user(struct vfsmount *mnt) { BUG_ON(mnt-mnt_flags MNT_USER); mnt-mnt_uid = current-uid; mnt-mnt_flags |= MNT_USER; + if (!capable(CAP_SYS_ADMIN)) + mnt-mnt_flags |= MNT_NOSUID | MNT_NODEV; +} + +static void set_mnt_user(struct vfsmount *mnt) +{ + __set_mnt_user(mnt); spin_lock(vfsmount_lock); nr_user_mounts++; spin_unlock(vfsmount_lock); @@ -260,9 +279,16 @@ static struct vfsmount *clone_mnt(struct int flag) { struct super_block *sb = old-mnt_sb; - struct vfsmount *mnt = alloc_vfsmnt(old-mnt_devname); + struct vfsmount *mnt; + + if (flag CL_SETUSER) { + int err = reserve_user_mount(); + if (err) + return ERR_PTR(err); + } + mnt = alloc_vfsmnt(old-mnt_devname); if (!mnt) - return ERR_PTR(-ENOMEM); + goto alloc_failed; mnt-mnt_flags = old-mnt_flags; atomic_inc(sb-s_active); @@ -274,7 +300,7 @@ static struct vfsmount *clone_mnt(struct /* don't copy the MNT_USER flag */ mnt-mnt_flags = ~MNT_USER; if (flag CL_SETUSER) - set_mnt_user(mnt); + __set_mnt_user(mnt); if (flag CL_SLAVE) { list_add(mnt-mnt_slave, old-mnt_slave_list); @@ -299,6 +325,11 @@ static struct vfsmount *clone_mnt(struct spin_unlock(vfsmount_lock); } return mnt; + + alloc_failed: + if (flag CL_SETUSER) + dec_nr_user_mounts(); + return ERR_PTR(-ENOMEM); } static inline void __mntput(struct vfsmount *mnt) @@ -745,22 +776,29 @@ asmlinkage long sys_oldumount(char __use #endif -static int mount_is_safe(struct nameidata *nd) +/* + * Conditions for unprivileged mounts are: + * - mountpoint is not a symlink or special file + * - mountpoint is in a mount owned by the user + */ +static bool permit_mount(struct nameidata *nd, int *flags) { + struct inode *inode = nd-dentry-d_inode; + if (capable(CAP_SYS_ADMIN)) - return 0; - return -EPERM; -#ifdef notyet - if (S_ISLNK(nd-dentry-d_inode-i_mode)) - return -EPERM; - if (nd-dentry-d_inode-i_mode S_ISVTX) { - if (current-uid != nd-dentry-d_inode-i_uid) - return -EPERM; - } - if (vfs_permission(nd, MAY_WRITE)) - return -EPERM; - return 0; -#endif + return true; + + if (!S_ISDIR(inode-i_mode) !S_ISREG(inode-i_mode)) + return false; + + if (!(nd-mnt-mnt_flags MNT_USER)) + return false; + + if (nd-mnt-mnt_uid != current-uid) + return false; + + *flags |= MS_SETUSER; + return true; } static int lives_below_in_same_fs(struct dentry *d, struct dentry *dentry) @@ -981,9 +1019,10 @@ static int do_loopback(struct nameidata int clone_flags; struct nameidata old_nd; struct vfsmount *mnt = NULL; - int err = mount_is_safe(nd); - if (err) - return err; + int err; + + if (!permit_mount(nd, flags)) + return -EPERM; if (!old_name || !*old_name) return -EINVAL; err = path_lookup(old_name, LOOKUP_FOLLOW, old_nd); -- - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 8/8] allow unprivileged fuse mounts
From: Miklos Szeredi [EMAIL PROTECTED] Use FS_SAFE for fuse fs type, but not for fuseblk. FUSE was designed from the beginning to be safe for unprivileged users. This has also been verified in practice over many years. In addition unprivileged mounts require the parent mount to be owned by the user, which is more strict than the current userspace policy. This will enable future installations to remove the suid-root fusermount utility. Don't require the user_id= and group_id= options for unprivileged mounts, but if they are present, verify them for sanity. Disallow the allow_other option for unprivileged mounts. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] --- Index: linux/fs/fuse/inode.c === --- linux.orig/fs/fuse/inode.c 2007-04-20 11:55:01.0 +0200 +++ linux/fs/fuse/inode.c 2007-04-20 11:55:14.0 +0200 @@ -311,6 +311,19 @@ static int parse_fuse_opt(char *opt, str d-max_read = ~0; d-blksize = 512; + /* +* For unprivileged mounts use current uid/gid. Still allow +* user_id and group_id options for compatibility, but +* only if they match these values. +*/ + if (!capable(CAP_SYS_ADMIN)) { + d-user_id = current-uid; + d-user_id_present = 1; + d-group_id = current-gid; + d-group_id_present = 1; + + } + while ((p = strsep(opt, ,)) != NULL) { int token; int value; @@ -339,6 +352,8 @@ static int parse_fuse_opt(char *opt, str case OPT_USER_ID: if (match_int(args[0], value)) return 0; + if (d-user_id_present d-user_id != value) + return 0; d-user_id = value; d-user_id_present = 1; break; @@ -346,6 +361,8 @@ static int parse_fuse_opt(char *opt, str case OPT_GROUP_ID: if (match_int(args[0], value)) return 0; + if (d-group_id_present d-group_id != value) + return 0; d-group_id = value; d-group_id_present = 1; break; @@ -536,6 +553,10 @@ static int fuse_fill_super(struct super_ if (!parse_fuse_opt((char *) data, d, is_bdev)) return -EINVAL; + /* This is a privileged option */ + if ((d.flags FUSE_ALLOW_OTHER) !capable(CAP_SYS_ADMIN)) + return -EPERM; + if (is_bdev) { #ifdef CONFIG_BLOCK if (!sb_set_blocksize(sb, d.blksize)) @@ -639,6 +660,7 @@ static struct file_system_type fuse_fs_t .fs_flags = FS_HAS_SUBTYPE, .get_sb = fuse_get_sb, .kill_sb= kill_anon_super, + .fs_flags = FS_SAFE, }; #ifdef CONFIG_BLOCK -- - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 3/8] account user mounts
From: Miklos Szeredi [EMAIL PROTECTED] Add sysctl variables for accounting and limiting the number of user mounts. The maximum number of user mounts is set to 1024 by default. This won't in itself enable user mounts, setting a mount to be owned by a user is first needed Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] --- Index: linux/include/linux/sysctl.h === --- linux.orig/include/linux/sysctl.h 2007-04-20 11:55:02.0 +0200 +++ linux/include/linux/sysctl.h2007-04-20 11:55:07.0 +0200 @@ -818,6 +818,8 @@ enum FS_AIO_NR=18, /* current system-wide number of aio requests */ FS_AIO_MAX_NR=19, /* system-wide maximum number of aio requests */ FS_INOTIFY=20, /* inotify submenu */ + FS_NR_USER_MOUNTS=21, /* int:current number of user mounts */ + FS_MAX_USER_MOUNTS=22, /* int:maximum number of user mounts */ FS_OCFS2=988, /* ocfs2 */ }; Index: linux/kernel/sysctl.c === --- linux.orig/kernel/sysctl.c 2007-04-20 11:55:02.0 +0200 +++ linux/kernel/sysctl.c 2007-04-20 11:55:07.0 +0200 @@ -1063,6 +1063,22 @@ static ctl_table fs_table[] = { #endif #endif { + .ctl_name = FS_NR_USER_MOUNTS, + .procname = nr_user_mounts, + .data = nr_user_mounts, + .maxlen = sizeof(int), + .mode = 0444, + .proc_handler = proc_dointvec, + }, + { + .ctl_name = FS_MAX_USER_MOUNTS, + .procname = max_user_mounts, + .data = max_user_mounts, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec, + }, + { .ctl_name = KERN_SETUID_DUMPABLE, .procname = suid_dumpable, .data = suid_dumpable, Index: linux/Documentation/filesystems/proc.txt === --- linux.orig/Documentation/filesystems/proc.txt 2007-04-20 11:55:02.0 +0200 +++ linux/Documentation/filesystems/proc.txt2007-04-20 11:55:07.0 +0200 @@ -923,6 +923,15 @@ reaches aio-max-nr then io_setup will fa raising aio-max-nr does not result in the pre-allocation or re-sizing of any kernel data structures. +nr_user_mounts and max_user_mounts +-- + +These represent the number of user mounts and the maximum number of +user mounts respectively. User mounts may be created by +unprivileged users. User mounts may also be created with sysadmin +privileges on behalf of a user, in which case nr_user_mounts may +exceed max_user_mounts. + 2.2 /proc/sys/fs/binfmt_misc - Miscellaneous binary formats --- Index: linux/fs/namespace.c === --- linux.orig/fs/namespace.c 2007-04-20 11:55:06.0 +0200 +++ linux/fs/namespace.c2007-04-20 11:55:07.0 +0200 @@ -39,6 +39,9 @@ static int hash_mask __read_mostly, hash static struct kmem_cache *mnt_cache __read_mostly; static struct rw_semaphore namespace_sem; +int nr_user_mounts; +int max_user_mounts = 1024; + /* /sys/fs */ decl_subsys(fs, NULL, NULL); EXPORT_SYMBOL_GPL(fs_subsys); @@ -227,11 +230,30 @@ static struct vfsmount *skip_mnt_tree(st return p; } +static void dec_nr_user_mounts(void) +{ + spin_lock(vfsmount_lock); + nr_user_mounts--; + spin_unlock(vfsmount_lock); +} + static void set_mnt_user(struct vfsmount *mnt) { BUG_ON(mnt-mnt_flags MNT_USER); mnt-mnt_uid = current-uid; mnt-mnt_flags |= MNT_USER; + spin_lock(vfsmount_lock); + nr_user_mounts++; + spin_unlock(vfsmount_lock); +} + +static void clear_mnt_user(struct vfsmount *mnt) +{ + if (mnt-mnt_flags MNT_USER) { + mnt-mnt_uid = 0; + mnt-mnt_flags = ~MNT_USER; + dec_nr_user_mounts(); + } } static struct vfsmount *clone_mnt(struct vfsmount *old, struct dentry *root, @@ -283,6 +305,7 @@ static inline void __mntput(struct vfsmo { struct super_block *sb = mnt-mnt_sb; dput(mnt-mnt_root); + clear_mnt_user(mnt); free_vfsmnt(mnt); deactivate_super(sb); } @@ -1023,6 +1046,7 @@ static int do_remount(struct nameidata * down_write(sb-s_umount); err = do_remount_sb(sb, flags, data, 0); if (!err) { + clear_mnt_user(nd-mnt); nd-mnt-mnt_flags = mnt_flags; if (flags MS_SETUSER) set_mnt_user(nd-mnt); Index: linux/include/linux/fs.h
[patch 7/8] allow unprivileged mounts
From: Miklos Szeredi [EMAIL PROTECTED] Define a new fs flag FS_SAFE, which denotes, that unprivileged mounting of this filesystem may not constitute a security problem. Since most filesystems haven't been designed with unprivileged mounting in mind, a thorough audit is needed before setting this flag. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] --- Index: linux/fs/namespace.c === --- linux.orig/fs/namespace.c 2007-04-20 11:55:10.0 +0200 +++ linux/fs/namespace.c2007-04-20 11:55:13.0 +0200 @@ -781,13 +781,17 @@ asmlinkage long sys_oldumount(char __use * - mountpoint is not a symlink or special file * - mountpoint is in a mount owned by the user */ -static bool permit_mount(struct nameidata *nd, int *flags) +static bool permit_mount(struct nameidata *nd, struct file_system_type *type, +int *flags) { struct inode *inode = nd-dentry-d_inode; if (capable(CAP_SYS_ADMIN)) return true; + if (type !(type-fs_flags FS_SAFE)) + return false; + if (!S_ISDIR(inode-i_mode) !S_ISREG(inode-i_mode)) return false; @@ -1021,7 +1025,7 @@ static int do_loopback(struct nameidata struct vfsmount *mnt = NULL; int err; - if (!permit_mount(nd, flags)) + if (!permit_mount(nd, NULL, flags)) return -EPERM; if (!old_name || !*old_name) return -EINVAL; @@ -1182,26 +1186,46 @@ out: * create a new mount for userspace and request it to be added into the * namespace's tree */ -static int do_new_mount(struct nameidata *nd, char *type, int flags, +static int do_new_mount(struct nameidata *nd, char *fstype, int flags, int mnt_flags, char *name, void *data) { + int err; struct vfsmount *mnt; + struct file_system_type *type; - if (!type || !memchr(type, 0, PAGE_SIZE)) + if (!fstype || !memchr(fstype, 0, PAGE_SIZE)) return -EINVAL; - /* we need capabilities... */ - if (!capable(CAP_SYS_ADMIN)) - return -EPERM; - - mnt = do_kern_mount(type, flags ~MS_SETUSER, name, data); - if (IS_ERR(mnt)) + type = get_fs_type(fstype); + if (!type) + return -ENODEV; + + err = -EPERM; + if (!permit_mount(nd, type, flags)) + goto out_put_filesystem; + + if (flags MS_SETUSER) { + err = reserve_user_mount(); + if (err) + goto out_put_filesystem; + } + + mnt = vfs_kern_mount(type, flags ~MS_SETUSER, name, data); + put_filesystem(type); + if (IS_ERR(mnt)) { + if (flags MS_SETUSER) + dec_nr_user_mounts(); return PTR_ERR(mnt); + } if (flags MS_SETUSER) - set_mnt_user(mnt); + __set_mnt_user(mnt); return do_add_mount(mnt, nd, mnt_flags, NULL); + + out_put_filesystem: + put_filesystem(type); + return err; } /* @@ -1231,7 +1255,7 @@ int do_add_mount(struct vfsmount *newmnt if (S_ISLNK(newmnt-mnt_root-d_inode-i_mode)) goto unlock; - /* MNT_USER was set earlier */ + /* some flags may have been set earlier */ newmnt-mnt_flags |= mnt_flags; if ((err = graft_tree(newmnt, nd))) goto unlock; Index: linux/include/linux/fs.h === --- linux.orig/include/linux/fs.h 2007-04-20 11:55:11.0 +0200 +++ linux/include/linux/fs.h2007-04-20 11:55:13.0 +0200 @@ -96,6 +96,7 @@ extern int dir_notify_enable; #define FS_REQUIRES_DEV 1 #define FS_BINARY_MOUNTDATA 2 #define FS_HAS_SUBTYPE 4 +#define FS_SAFE 8 /* Safe to mount by unprivileged users */ #define FS_REVAL_DOT 16384 /* Check the paths ., .. for staleness */ #define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() * during rename() internally. -- - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 0/8] mount ownership and unprivileged mount syscall (v4)
This patchset has now been bared to the lowest common denominator that everybody can agree on. Or at least there weren't any objections to this proposal. Andrew, please consider it for -mm. Thanks, Miklos v3 - v4: - simplify interface as much as possible, now only a single option (user=UID) is used to control everything - no longer allow/deny mounting based on file/directory permissions, that approach does not always make sense This patchset adds support for keeping mount ownership information in the kernel, and allow unprivileged mount(2) and umount(2) in certain cases. The mount owner has the following privileges: - unmount the owned mount - create a submount under the owned mount The sysadmin can set the owner explicitly on mount and remount. When an unprivileged user creates a mount, then the owner is automatically set to the user. The following use cases are envisioned: 1) Private namespace, with selected mounts owned by user. E.g. /home/$USER is a good candidate for allowing unpriv mounts and unmounts within. 2) Private namespace, with all mounts owned by user and having the nosuid flag. User can mount and umount anywhere within the namespace, but suid programs will not work. 3) Global namespace, with a designated directory, which is a mount owned by the user. E.g. /mnt/users/$USER is set up so that it is bind mounted onto itself, and set to be owned by $USER. The user can add/remove mounts only under this directory. The following extra security measures are taken for unprivileged mounts: - usermounts are limited by a sysctl tunable - force nosuid,nodev mount options on the created mount -- - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [d_path 0/7] Fixes to d_path: Respin
I gave a chroot example that showed that in the current implementation, you can get pretty random clashes between mounts; there are other cases with lazy unmounts as well. Irrelevant as well. If you create chroot problems it's your problem. The fact is that if you have a normal setup the code works fine. All other situations cannot be handled with the current kernel interface. This does not give anybody the right to say since the code doesn't always work we can break it completely. That's completely unacceptable. I'm not sure I understand the situation completely. What exactly is broken in libc by removing unreachable mounts from /proc/mounts? Is it the situation when - file descriptor is opened - process does chroot - process does fstatvfs on file descriptor ? In that case currently fstatvfs() _usually_ gives the correct results, but can give wrong results if mounts paths accidently clash in /proc/mounts? Also isn't it the case, that fstatvfs() or statvfs() performed within the chroot could also give incorrect result for a _reachable_ mount if it clashes with an unreachable mount? If this is the case, I would think that removing the unreachable mounts from /proc/mounts, would actually be fixing this second case, which is more likely to be used anyway. BTW, this patch, or at least a predecessor is in -mm, and it very much feels the Right Thing(tm). The /proc/mounts under a chroot environment actually looks sane, instead of some random crap, that it was previously. While we should make every effort to keep the kernel interfaces stable, this shouldn't prevent us from fixing bugs. And this one is clearly a bug, even if not a very serious one. Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 1/8] add user mounts to the kernel
From: Miklos Szeredi [EMAIL PROTECTED] Add ownership information to mounts. A new mount flag, MS_SETUSER is used to make a mount owned by a user. If this flag is specified, then the owner will be set to the current real user id and the mount will be marked with the MNT_USER flag. On remount don't preserve previous owner, and treat MS_SETUSER as for a new mount. The MS_SETUSER flag is ignored on mount move. So is a modified mount(8) needed? If so, is there some convenient way in which testers can get hold of it? I haven't yet added this to mount(8). I'll do that and post patches. There's an ultra sophisticated mount tester program I use. It's slightly user unfriendly... Miklos #include stdio.h #include stdlib.h #include string.h #include sys/mount.h int main(int argc, char *argv[]) { int res; int un = (strrchr(argv[0], '/') ?: argv[0] - 1)[1] == 'u'; if ((un argc != 3) || (!un argc != 6)) { fprintf(stderr, usage: %s %s\n, argv[0], un ? dir flags : dev dir type flags opts); return 1; } if (un) res = umount2(argv[1], atoi(argv[2])); else res = mount(argv[1], argv[2], argv[3], atoi(argv[4]), argv[5]); if (res == -1) { perror(argv[0]); return 1; } return 0; } - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/8] allow unprivileged umount
+static bool permit_umount(struct vfsmount *mnt, int flags) +{ ... + return mnt-mnt_uid == current-uid; +} Yes, this seems very wrong. I'd have thought that comparing user_struct*'s would get us a heck of a lot closer to being able to support aliasing of UIDs between different namespaces. OK, I'll fix this up. Actually an earlier version of this patch did use user_struct's but I'd changed it to uids, because it's simpler. I didn't think about this being contrary to the id namespaces thing. Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 7/8] allow unprivileged mounts
On Fri, 20 Apr 2007 12:25:39 +0200 Miklos Szeredi [EMAIL PROTECTED] wrote: Define a new fs flag FS_SAFE, which denotes, that unprivileged mounting of this filesystem may not constitute a security problem. Since most filesystems haven't been designed with unprivileged mounting in mind, a thorough audit is needed before setting this flag. Practically speaking, is there any realistic likelihood that any filesystem apart from FUSE will ever use this? V9FS people did express an interest in this. Yeah, I should've CC-ed them, but forgot. Sorry. Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 8/8] allow unprivileged fuse mounts
Use FS_SAFE for fuse fs type, but not for fuseblk. FUSE was designed from the beginning to be safe for unprivileged users. This has also been verified in practice over many years. How does FUSE do this? There are obvious cases like crafting a filesystem which has setuid executables or world-writeable device nodes or whatever. I'm sure there are lots of other cases. Where is FUSE's implementation of all this protection described? Most of it is in Documentation/filesystems/fuse.txt, some of it is code comments. Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 7/8] allow unprivileged mounts
Define a new fs flag FS_SAFE, which denotes, that unprivileged mounting of this filesystem may not constitute a security problem. Since most filesystems haven't been designed with unprivileged mounting in mind, a thorough audit is needed before setting this flag. Practically speaking, is there any realistic likelihood that any filesystem apart from FUSE will ever use this? V9FS people did express an interest in this. Yeah, I should've CC-ed them, but forgot. Sorry. And CIFS maybe. They also have an unprivileged mounting suid hack. But I'm not very optimistic about CIFS, seeing some of the code, that's in there. Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 10/10] mm: per device dirty threshold
On Fri, 20 Apr 2007 17:52:04 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote: Scale writeback cache per backing device, proportional to its writeout speed. By decoupling the BDI dirty thresholds a number of problems we currently have will go away, namely: - mutual interference starvation (for any number of BDIs); - deadlocks with stacked BDIs (loop, FUSE and local NFS mounts). It might be that all dirty pages are for a single BDI while other BDIs are idling. By giving each BDI a 'fair' share of the dirty limit, each one can have dirty pages outstanding and make progress. A global threshold also creates a deadlock for stacked BDIs; when A writes to B, and A generates enough dirty pages to get throttled, B will never start writeback until the dirty pages go away. Again, by giving each BDI its own 'independent' dirty limit, this problem is avoided. So the problem is to determine how to distribute the total dirty limit across the BDIs fairly and efficiently. A DBI that has a large dirty limit but does not have any dirty pages outstanding is a waste. What is done is to keep a floating proportion between the DBIs based on writeback completions. This way faster/more active devices get a larger share than slower/idle devices. This is a pretty major improvement to various nasty corner-cases, if it works. Does it work? Please describe the testing you did, and the results. Has this been confirmed to fix Miklos's FUSE and loopback problems? I haven't yet tested it (will do), but I'm sure it does solve the deadlock in balance_dirty_pages(), if for no other reason, that when the queue is idle (no dirty or writeback pages), then it allowes the caller to dirty some more pages. The other deadlock, in throttle_vm_writeout() is still to be solved. Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 10/10] mm: per device dirty threshold
The other deadlock, in throttle_vm_writeout() is still to be solved. Let's go back to the original changelog: Author: marcelo.tosatti marcelo.tosatti Date: Tue Mar 8 17:25:19 2005 + [PATCH] vm: pageout throttling With silly pageout testcases it is possible to place huge amounts of memory under I/O. With a large request queue (CFQ uses 8192 requests) it is possible to place _all_ memory under I/O at the same time. This means that all memory is pinned and unreclaimable and the VM gets upset and goes oom. The patch limits the amount of memory which is under pageout writeout to be a little more than the amount of memory at which balance_dirty_pages() callers will synchronously throttle. This means that heavy pageout activity can starve heavy writeback activity completely, but heavy writeback activity will not cause starvation of pageout. Because we don't want a simple `dd' to be causing excessive latencies in page reclaim. Signed-off-by: Andrew Morton [EMAIL PROTECTED] Signed-off-by: Linus Torvalds [EMAIL PROTECTED] (A good one! I wrote it ;)) I believe that the combination of dirty-page-tracking and its calls to balance_dirty_pages() mean that we can now never get more than dirty_ratio of memory into the dirty-or-writeback condition. The vm scanner can convert dirty pages into clean, under-writeback pages, but it cannot increase the total of dirty+writeback. What about swapout? That can increase the number of writeback pages, without decreasing the number of dirty pages, no? Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/8] allow unprivileged umount
On Sat, 21 Apr 2007 10:09:42 +0200 Miklos Szeredi [EMAIL PROTECTED] wrote: +static bool permit_umount(struct vfsmount *mnt, int flags) +{ ... + return mnt-mnt_uid == current-uid; +} Yes, this seems very wrong. I'd have thought that comparing user_struct*'s would get us a heck of a lot closer to being able to support aliasing of UIDs between different namespaces. OK, I'll fix this up. Actually an earlier version of this patch did use user_struct's but I'd changed it to uids, because it's simpler. OK.. I didn't think about this being contrary to the id namespaces thing. Well I was madly assuming that when serarate UID namespaces are in use, UID 42 in container A will have a different user_struct from UID 42 in container B. I'd suggest that we provoke an opinion from Eric co before you do work on this. That is what I what I have been thinking as well, Does this mean, that containers will need this? Or that you don't know yet? storing a user struct on each mount point seems sane, plus it allows per user mount rlimits which is major plus. Especially since we seem to be doing accounting only for user mounts a per user rlimit seems good. I'm not against per-user rlimits for mounts, but I'd rather do this later... To get the user we should be user fs_uid as HPA suggested. fsuid is exclusively used for checking file permissions, which we don't do here anymore. So while it could be argued, that mount() _is_ a filesystem operation, it is really a different sort of filesystem operation than the rest. OTOH it wouldn't hurt to use fsuid instead of ruid... Finally I'm pretty certain the capability we should care about in this context is CAP_SETUID. Instead of CAP_SYS_ADMIN. If we have CAP_SETUID we can become which ever user owns the mount, and the root user in a container needs this, so he can run login programs. So changing the appropriate super user checks from CAP_SYS_ADMIN to CAP_SETUID I think is the right thing todo. That's a flawed logic. If you want to mount as a specific user, and you have CAP_SETUID, then just use set*uid() and then mount(). Changing the capability check for mount() would break the userspace ABI. Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 1/8] add user mounts to the kernel
The MNT_USER flag is not copied on any kind of mount cloning: namespace creation, binding or propagation. I half agree, and as an initial approximation this works. Ultimately we should be at the point that for mount propagation that we copy the owner of the from the owner of our parent mount at the propagation destination. Yes, that sounds the most sane. Ram, what do you think? + if (mnt-mnt_flags MNT_USER) + seq_printf(m, ,user=%i, mnt-mnt_uid); How about making the test if (mnt-mnt_user != root_user) We don't want to treat root_user special. That's what capabilities were invented for. Index: linux/include/linux/fs.h === --- linux.orig/include/linux/fs.h 2007-04-20 11:55:02.0 +0200 +++ linux/include/linux/fs.h2007-04-20 11:55:05.0 +0200 @@ -123,6 +123,7 @@ extern int dir_notify_enable; #define MS_SLAVE (119) /* change to slave */ #define MS_SHARED (120) /* change to shared */ #define MS_RELATIME(121) /* Update atime relative to mtime/ctime. */ +#define MS_SETUSER (122) /* set mnt_uid to current user */ If we unconditionally use the fsuid I think we can get away without this flag. That coudl work if we wouldn't have to worry about breaking the user interface. As it is, we cannot be sure, that existing callers of mount(2) don't have fsuid set to some random value. #define MNT_SHRINKABLE 0x100 +#define MNT_USER 0x200 If we assign a user to all mount points and root gets to own the initial set of mounts then we don't need the internal MNT_USER flag. I think we do want to treat owned mounts special, rather than treating user=0 mounts special. + + uid_t mnt_uid; /* owner of the mount */ Can we please make this a user struct. That requires a bit of reference counting but it has uid namespace benefits as well as making it easy to implement per user mount rlimits. OK, can you ellaborate, what the uid namespace benifits are? Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/8] allow unprivileged umount
I suspect we can allow MNT_FORCE for non-privileged users as well if we can trust the filesystem. I don't think so. MNT_FORCE has side effects on the superblock. So a user shouldn't be able to force an unmount on a bind mount s/he did, but there's no problem with allowing plain/lazy unmounts. We could possibly allow MNT_FORCE, for FS_SAFE filesystems. Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 3/8] account user mounts
From: Miklos Szeredi [EMAIL PROTECTED] Add sysctl variables for accounting and limiting the number of user mounts. The maximum number of user mounts is set to 1024 by default. This won't in itself enable user mounts, setting a mount to be owned by a user is first needed Since each mount has a user can we just make this a per user rlimit? If we are going to implement a sysctl at this point I think it should be a global limit that doesn't care if who you are. Even root can have recursive mounts that attempt to get out of control. Recursive bind mounts are done carefully enough, so they don't get out of control. Recursive mount propagations can get out of control. But root can shoot itself in the foot any number of ways, and it's not for the kernel to police that. Also currently you are not checking the max_users. It looks like you do this in a later patch but still it is a little strange to allow user own mounts and have accounting but to not check the limit at this state. Yeah, but at this stage user mounts are not yet allowed, so this is safe. Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 5/8] allow unprivileged bind mounts
From: Miklos Szeredi [EMAIL PROTECTED] Allow bind mounts to unprivileged users if the following conditions are met: - mountpoint is not a symlink or special file Why? This sounds like a left over from when we were checking permissions. Hmm, yes. Don't know. Maybe only the symlink check. Bind mounts of directory over non-directy, and vica versa are already excluded, even for root. - parent mount is owned by the user - the number of user mounts is below the maximum Unprivileged mounts imply MS_SETUSER, and will also have the nosuid and nodev mount flags set. So in principle I agree, but in detail I disagree. capable(CAP_SETUID) should be required to leave MNT_NOSUID clear. capable(CAP_MKNOD) should be required to leave MNT_NODEV clear. I.e. We should not special case this as a user mount but rather simply check to see if the user performing the mount has the appropriate capabilities to allow the flags. Sounds sane. Will fix. Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 8/8] allow unprivileged fuse mounts
+ /* +* For unprivileged mounts use current uid/gid. Still allow +* user_id and group_id options for compatibility, but +* only if they match these values. +*/ + if (!capable(CAP_SYS_ADMIN)) { + d-user_id = current-uid; + d-user_id_present = 1; + d-group_id = current-gid; + d-group_id_present = 1; + + } CAP_SETUID is the appropriate capability... This is not a dimension we have not fully explored. What is the problem with a user controlled mount having different uid and gid values. Yes they map into different users but how is this a problem. The only problem that I can recall is the historic chown problem where you could give files to other users and mess up their quotas. Or is the problem other users writing to this user controlled filesystem? Yes. Or even just a suid process trying to access the user controlled filesystem. See Documentation/filesystems/fuse.txt for the gory details. Eric, thanks for the detailed review :) Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/8] allow unprivileged umount
Does this mean, that containers will need this? Or that you don't know yet? The uid namespace is something we have to handle carefully and we have not decided on the final design. What is clear is that all permission checks will need to become either (uid namspace, uid) tuple comparisons. Or struct user pointer comparisons. To see if we are talking about the same uid. So the eventual uid namespace combined with the possibility for rlimits if we use struct user *. See to make using a struct user a clear win. OK, if we don't yet know, I'd rather leave this for later. It will be trivial to change to user_struct if we want per-user rlimits. storing a user struct on each mount point seems sane, plus it allows per user mount rlimits which is major plus. Especially since we seem to be doing accounting only for user mounts a per user rlimit seems good. I'm not against per-user rlimits for mounts, but I'd rather do this later... Then let's add a non-discriminate limit. Instead of a limit that applies only to root. See reply to relevant patch. To get the user we should be user fs_uid as HPA suggested. fsuid is exclusively used for checking file permissions, which we don't do here anymore. So while it could be argued, that mount() _is_ a filesystem operation, it is really a different sort of filesystem operation than the rest. OTOH it wouldn't hurt to use fsuid instead of ruid... Yes. I may be confused but I'm pretty certain we want either the fsuid or the euid to be the mount owner. ruid just looks wrong. The fsuid is a special case of the effective uid. Which is who we should perform operations as. Unless I'm just confused. Definitely not euid. Euid is the one which is effective, i.e. it will basically always be zero for a privileged mount(). Ruid is the one which is returned by getuid(). If a user execs a suid-root program, then ruid will be the id of the user, while euid will be zero. Finally I'm pretty certain the capability we should care about in this context is CAP_SETUID. Instead of CAP_SYS_ADMIN. If we have CAP_SETUID we can become which ever user owns the mount, and the root user in a container needs this, so he can run login programs. So changing the appropriate super user checks from CAP_SYS_ADMIN to CAP_SETUID I think is the right thing todo. That's a flawed logic. If you want to mount as a specific user, and you have CAP_SETUID, then just use set*uid() and then mount(). Totally agreed for mount. Changing the capability check for mount() would break the userspace ABI. Sorry I apparently wasn't clear. CAP_SETUID should be the capability check for umount. The argument applies to umount as well. For compatibility, we _need_ the CAP_SYS_ADMIN check. And if program has CAP_SETUID but not CAP_SYS_ADMIN, it can just set the id to the mount owner before calling umount. Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 1/8] add user mounts to the kernel
+if (mnt-mnt_flags MNT_USER) +seq_printf(m, ,user=%i, mnt-mnt_uid); How about making the test if (mnt-mnt_user != root_user) We don't want to treat root_user special. That's what capabilities were invented for. For the print statement? What ever it is minor. It is a user interface, not a print statement. Your suggested change would be vetoed by any number of people. So either we have all mounts having owners, AND have /proc/mounts add user=0 to all mounts. While I don't _think_ this would actually break userspace, it would definitely make people complain. The other choice is what the current patchset does: is to have legacy mounts without owners, and new generation mounts with owners having user=UID in /proc/mounts, regardless of the value of UID. So I want to minimize the changes needed to existing programs. Now if all we have to do is specify MS_SETUSER when root a user with CAP_SETUID is setting up a mount as a user other then himself then I don't much care. If we have to call MS_SETUSER as unprivileged users You don't. Unprivileged mounts _imply_ MS_SETUSER. + +uid_t mnt_uid; /* owner of the mount */ Can we please make this a user struct. That requires a bit of reference counting but it has uid namespace benefits as well as making it easy to implement per user mount rlimits. OK, can you ellaborate, what the uid namespace benifits are? In the uid namespace the comparison is simpler as are the propagations rules. Basically if you use a struct user you will never need to care about a uid namespace. If you don't we will have to tear through this code another time. Well, OK. I'll do the user_struct thing then. Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 3/8] account user mounts
From: Miklos Szeredi [EMAIL PROTECTED] Add sysctl variables for accounting and limiting the number of user mounts. The maximum number of user mounts is set to 1024 by default. This won't in itself enable user mounts, setting a mount to be owned by a user is first needed Since each mount has a user can we just make this a per user rlimit? If we are going to implement a sysctl at this point I think it should be a global limit that doesn't care if who you are. Even root can have recursive mounts that attempt to get out of control. Recursive bind mounts are done carefully enough, so they don't get out of control. Recursive mount propagations can get out of control. But root can shoot itself in the foot any number of ways, and it's not for the kernel to police that. Yes. It is. This is mostly about removing special cases. We routinely have limits on resources that are global and apply to root along with every one else. Root can change them but they still apply to root. Things like the number of inodes in the system or the total number of files. There's no max_inodes any more. As for max_files: get_empty_filp(): /* * Privileged users can go above max_files */ if (get_nr_files() = files_stat.max_files !capable(CAP_SYS_ADMIN)) { Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 7/8] allow unprivileged mounts
On Apr 21 2007 10:57, Eric W. Biederman wrote: tmpfs! tmpfs is a possible problem because it can consume lots of ram/swap. Which is why it has limits on the amount of space it can consume. Users can gobble up all RAM and swap already today. (Unless they are confined into an rlimit, which, in most systems, is not the case.) And in case /dev/shm exists, they can already fill it without running into an rlimit early. There are systems that care about rlimits and there is strong intersection between caring about rlimits and user mounts. Although I do agree that it looks like we have gotten lazy with the default mount options for /dev/shm. Going a little farther any filesystem that is safe to put on a usb stick and mount automatically should ultimately be safe for unprivileged mounts as well. Actually, that's not as simple. For the usb stick or cdrom you need physical access to the machine. And once you have that you basically have full control over the system anyway. But with block filesystems, the user would still need access to the device (currently kernel doesn't even check this I think). So it may make sense to mark all block based filesystems safe, and defer permission checking to user access on the block device. But the safe flag is still needed for filesystems, which don't have such an additional access checking, such as network filesystems. Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 1/8] add user mounts to the kernel
+ +uid_t mnt_uid; /* owner of the mount */ Can we please make this a user struct. That requires a bit of reference counting but it has uid namespace benefits as well as making it easy to implement per user mount rlimits. OK, can you ellaborate, what the uid namespace benifits are? In the uid namespace the comparison is simpler as are the propagations rules. Basically if you use a struct user you will never need to care about a uid namespace. I tried to implement it but got stuck on this: fsuid doesn't have a user_struct in task_struct (yet), so we'd now have to convert current-fsuid to a user_struct. This can be done with alloc_uid(), but this can fail, bringing in extra error handling complexity. Also we'd have to compare current-fsuid with a user_struct, which we don't yet know how will actually be done in the future. So it seems, we still have to care about the uid namespace, at least if fsuid is preferred to ruid. Anyway, here's a patch fixing the other things you brought up, and which I agree with. Does this look OK? Thanks, Miklos Index: linux/fs/namespace.c === --- linux.orig/fs/namespace.c 2007-04-22 17:48:18.0 +0200 +++ linux/fs/namespace.c2007-04-22 18:19:51.0 +0200 @@ -252,10 +252,12 @@ static int reserve_user_mount(void) static void __set_mnt_user(struct vfsmount *mnt) { BUG_ON(mnt-mnt_flags MNT_USER); - mnt-mnt_uid = current-uid; + mnt-mnt_uid = current-fsuid; mnt-mnt_flags |= MNT_USER; - if (!capable(CAP_SYS_ADMIN)) - mnt-mnt_flags |= MNT_NOSUID | MNT_NODEV; + if (!capable(CAP_SETUID)) + mnt-mnt_flags |= MNT_NOSUID; + if (!capable(CAP_MKNOD)) + mnt-mnt_flags |= MNT_NODEV; } static void set_mnt_user(struct vfsmount *mnt) @@ -725,10 +727,10 @@ static bool permit_umount(struct vfsmoun if (!(mnt-mnt_flags MNT_USER)) return false; - if (flags MNT_FORCE) + if ((flags MNT_FORCE) !(mnt-mnt_sb-s_type-fs_flags FS_SAFE)) return false; - return mnt-mnt_uid == current-uid; + return mnt-mnt_uid == current-fsuid; } /* @@ -792,13 +794,13 @@ static bool permit_mount(struct nameidat if (type !(type-fs_flags FS_SAFE)) return false; - if (!S_ISDIR(inode-i_mode) !S_ISREG(inode-i_mode)) + if (S_ISLNK(inode-i_mode)) return false; if (!(nd-mnt-mnt_flags MNT_USER)) return false; - if (nd-mnt-mnt_uid != current-uid) + if (nd-mnt-mnt_uid != current-fsuid) return false; *flags |= MS_SETUSER; - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 10/10] mm: per device dirty threshold
The other deadlock, in throttle_vm_writeout() is still to be solved. Let's go back to the original changelog: Author: marcelo.tosatti marcelo.tosatti Date: Tue Mar 8 17:25:19 2005 + [PATCH] vm: pageout throttling With silly pageout testcases it is possible to place huge amounts of memory under I/O. With a large request queue (CFQ uses 8192 requests) it is possible to place _all_ memory under I/O at the same time. This means that all memory is pinned and unreclaimable and the VM gets upset and goes oom. The patch limits the amount of memory which is under pageout writeout to be a little more than the amount of memory at which balance_dirty_pages() callers will synchronously throttle. This means that heavy pageout activity can starve heavy writeback activity completely, but heavy writeback activity will not cause starvation of pageout. Because we don't want a simple `dd' to be causing excessive latencies in page reclaim. Signed-off-by: Andrew Morton [EMAIL PROTECTED] Signed-off-by: Linus Torvalds [EMAIL PROTECTED] (A good one! I wrote it ;)) I believe that the combination of dirty-page-tracking and its calls to balance_dirty_pages() mean that we can now never get more than dirty_ratio of memory into the dirty-or-writeback condition. The vm scanner can convert dirty pages into clean, under-writeback pages, but it cannot increase the total of dirty+writeback. What about swapout? That can increase the number of writeback pages, without decreasing the number of dirty pages, no? Could we not solve that by enabling cap_account_writeback on swapper_space, and thereby account swap writeback pages. Then the VM knows it has outstanding IO and need not panic. Hmm, I'm not sure that would be right, because then those writeback pages would be accounted twice: once for swapper_space, and once for the real device. So there's a condition, when lots of anonymous pages are turned into swap-cache writeback pages, and we should somehow throttle this, because This means that all memory is pinned and unreclaimable and the VM gets upset and goes oom. although, it's not quite clear in my mind, how the VM gets upset about this. One way to throttle just the swapout activity, is to do the per-bdi accounting on swapper_space, and limit the number of writeback pages to e.g. the global threshold + 10%, which is basically what throttle_vm_writeout() currently does, only now it does it indiscriminately, and not just on swap writeback pages. Does this make any sense? Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 10/10] mm: per device dirty threshold
This is probably a reasonable thing to do but it doesn't feel like the right place. I think get_dirty_limits should return the raw threshold, and balance_dirty_pages should do both tests - the bdi-local test and the system-wide test. Ok, that makes sense I guess. Well, my narrow minded world view says it's not such a good idea, because it would again introduce the deadlock scenario, we're trying to avoid. In a sense allowing a queue to go over the global limit just a little bit is a good thing. Actually the very original code does that: if writeback was started for write_chunk number of pages, then we allow ratelimit (8) _new_ pages to be dirtied, effectively ignoring the global limit. That's why I've been saying, that the current code is so unfair: if there are lots of dirty pages to be written back to a particular device, then balance_dirty_pages() allows the dirty producer to make even more pages dirty, but if there are _no_ dirty pages for a device, and we are over the limit, then that dirty producer is allowed absolutely no new dirty pages until the global counts subside. I'm still not quite sure what purpose the above soft limiting serves. It seems to just give advantage to writers, which managed to accumulate lots of dirty pages, and then can convert that into even more dirtyings. Would it make sense to remove this behavior, and ensure that balance_dirty_pages() doesn't return until the per-queue limits have been complied with? Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 10/10] mm: per device dirty threshold
This is probably a reasonable thing to do but it doesn't feel like the right place. I think get_dirty_limits should return the raw threshold, and balance_dirty_pages should do both tests - the bdi-local test and the system-wide test. Ok, that makes sense I guess. Well, my narrow minded world view says it's not such a good idea, because it would again introduce the deadlock scenario, we're trying to avoid. I was only referring to the placement of the clipping; and exactly where that happens does not affect the deadlock. OK. In a sense allowing a queue to go over the global limit just a little bit is a good thing. Actually the very original code does that: if writeback was started for write_chunk number of pages, then we allow ratelimit (8) _new_ pages to be dirtied, effectively ignoring the global limit. It might be time to get rid of that rate-limiting. balance_dirty_pages()'s fast path is not nearly as heavy as it used to be. All these fancy counter systems have removed quite a bit of iteration from there. Hmm. The rate limiting probably makes lots of sense for dirty_exceeded==0, when ratelimit can be a nice large value. For dirty_exceeded==1 it may make sense to disable ratelimiting, OTOH having a granularity of 8 pages probably doesn't matter, because of the granularity of the percpu counter is usually larger (except on UP). That's why I've been saying, that the current code is so unfair: if there are lots of dirty pages to be written back to a particular device, then balance_dirty_pages() allows the dirty producer to make even more pages dirty, but if there are _no_ dirty pages for a device, and we are over the limit, then that dirty producer is allowed absolutely no new dirty pages until the global counts subside. Well, that got fixed on a per device basis with this patch, it is still true for multiple tasks writing to the same device. Yes, this is the part of this patchset I'm personally interested in ;) I'm still not quite sure what purpose the above soft limiting serves. It seems to just give advantage to writers, which managed to accumulate lots of dirty pages, and then can convert that into even more dirtyings. The queues only limit the actual in-flight writeback pages, balance_dirty_pages() considers all pages that might become writeback as well as those that are. Would it make sense to remove this behavior, and ensure that balance_dirty_pages() doesn't return until the per-queue limits have been complied with? I don't think that will help, balance_dirty_pages drives the queues. That is, it converts pages from mere dirty to writeback. Yes. But current logic says, that if you convert write_chunk dirty to writeback, you are allowed to dirty ratelimit more. D: number of dirty pages W: number of writeback pages L: global limit C: write_chunk = ratelimit_pages * 1.5 R: ratelimit If D+W = L, then R = 8 Let's assume, that D == L and W == 0. And that all of the dirty pages belong to a single device. Also for simplicity, lets assume an infinite length queue, and a slow device. Then while converting the dirty pages to writeback, D / C * R new dirty pages can be created. So when all existing dirty have been converted: D = L / C * R W = L D + W = L * (1 + R / C) So we see, that we're now even more above the limit than before the conversion. This means, that we starve writers to other devices, which don't have as many dirty pages, because until the slow device doesn't finish these writes they will not get to do anything. Your patch helps this in that if the other writers have an empty queue and no dirty, they will be allowed to slowly start writing. But they will not gain their full share until the slow dirty-hog goes below the global limit, which may take some time. So I think the logical thing to do, is if the dirty-hog is over it's queue limit, don't let it dirty any more until it's dirty+writeback go below the limit. That allowes other devices to more quickly gain their share of dirty pages. Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 10/10] mm: per device dirty threshold
Ahh, now I see; I had totally blocked out these few lines: pages_written += write_chunk - wbc.nr_to_write; if (pages_written = write_chunk) break; /* We've done our duty */ yeah, those look dubious indeed... And reading back Neil's comments, I think he agrees. Shall we just kill those? I think we should. Athough I'm a little afraid, that Akpm will tell me again, that I'm a stupid git, and that those lines are in fact vitally important ;) Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 10/10] mm: per device dirty threshold
Ahh, now I see; I had totally blocked out these few lines: pages_written += write_chunk - wbc.nr_to_write; if (pages_written = write_chunk) break; /* We've done our duty */ yeah, those look dubious indeed... And reading back Neil's comments, I think he agrees. Shall we just kill those? I think we should. Athough I'm a little afraid, that Akpm will tell me again, that I'm a stupid git, and that those lines are in fact vitally important ;) It depends what they're replaced with. That code is there, iirc, to prevent a process from getting stuck in balance_dirty_pages() forever due to the dirtying activity of other processes. hm, we ask the process to write write_chunk pages each go around the loop. So if it wrote write-chunk/2 pages on the first pass it might end up writing write_chunk*1.5 pages total. I guess that's rare and doesn't matter much if it does happen - the upper bound is write_chunk*2-1, I think. Right, but I think the problem is that its dirty - writeback, not dirty - writeback completed. Ie. they don't guarantee progress, it could be that the total nr_reclaimable + nr_writeback will steadily increase due to this break. How about ensuring that vm_writeout_total increases least 2*sync_writeback_pages() during our stay in balance_dirty_pages(). That way we have the guarantee that more pages get written out than can be dirtied. No, because that's a global counter, which many writers could be looking at. We'd need a per-task writeout counter, but when finishing the write we don't know anymore which task it was performed for. Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 10/10] mm: per device dirty threshold
On Tue, 24 Apr 2007 12:12:18 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote: On Tue, 2007-04-24 at 03:00 -0700, Andrew Morton wrote: On Tue, 24 Apr 2007 11:47:20 +0200 Miklos Szeredi [EMAIL PROTECTED] wrote: Ahh, now I see; I had totally blocked out these few lines: pages_written += write_chunk - wbc.nr_to_write; if (pages_written = write_chunk) break; /* We've done our duty */ yeah, those look dubious indeed... And reading back Neil's comments, I think he agrees. Shall we just kill those? I think we should. Athough I'm a little afraid, that Akpm will tell me again, that I'm a stupid git, and that those lines are in fact vitally important ;) It depends what they're replaced with. That code is there, iirc, to prevent a process from getting stuck in balance_dirty_pages() forever due to the dirtying activity of other processes. hm, we ask the process to write write_chunk pages each go around the loop. So if it wrote write-chunk/2 pages on the first pass it might end up writing write_chunk*1.5 pages total. I guess that's rare and doesn't matter much if it does happen - the upper bound is write_chunk*2-1, I think. Right, but I think the problem is that its dirty - writeback, not dirty - writeback completed. Ie. they don't guarantee progress, it could be that the total nr_reclaimable + nr_writeback will steadily increase due to this break. Don't think so. We call balance_dirty_pages() once per ratelimit_pages dirtyings and when we get there, we write 1.5*ratelimit_pages pages. No, we _start_ writeback for 1.5*ratelimit_pages pages, but do not wait for those writebacks to finish. So for a slow device and a fast writer, dirty+writeback can indeed increase beyond the dirty threshold. Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 10/10] mm: per device dirty threshold
No, we _start_ writeback for 1.5*ratelimit_pages pages, but do not wait for those writebacks to finish. So for a slow device and a fast writer, dirty+writeback can indeed increase beyond the dirty threshold. Nope, try it. If a process dirties 1000 pages it'll then go into balance_dirty_pages() and start writeback against 1,500 pages. When we hit dirty_ratio that process will be required to write back 1,500 pages for each eight pages which it dirtied. We'll quickly reach the stage where there are no longer 1,500 pages to be written back and the process will block in balance_dirty_pages() until the dirty+writeback level subsides. OK. I was confused by this: static long ratelimit_pages = 32; and didn't realize, that that 32 is totally irrelevant. So I'm still right, that for N dirty pages, the writer is allowed to dirty N/1500*8 more dirty pages, but I agree, that this isn't really an issue. Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 0/8] mount ownership and unprivileged mount syscall (v4)
The following extra security measures are taken for unprivileged mounts: - usermounts are limited by a sysctl tunable - force nosuid,nodev mount options on the created mount The original userspace user= solution also implies the noexec option by default (you can override the default by exec option). Unlike nosuid and nodev, I don't think noexec has real security benefits. It means the kernel based solution is not fully compatible ;-( Oh, I don't think that matters. For traditional /etc/fstab based user mounts, mount(8) will have to remain suid-root, the kernel can't replace the fstab check. In fact the latest patches don't even support these legacy user mounts too well: setting the owner of a mount gives not only umount privilege, but the ability to submount. This is not necessarily a good thing for these kinds of user mounts. We could add a new nosubmount or similar flag, to prevent submounting, but that again would go against the simplicity of the current approach, so I'm not sure it's worth it. Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch] unprivileged mounts update
From: Miklos Szeredi [EMAIL PROTECTED] - refine adding nosuid and nodev flags for unprivileged mounts: o add nosuid, only if mounter doesn't have CAP_SETUID capability o add nodev, only if mounter doesn't have CAP_MKNOD capability - allow unprivileged forced unmount, but only for FS_SAFE filesystems - allow mounting over special files, but not symlinks - for mounting and umounting check fsuid instead of ruid Thanks to everyone for the comments, with special thanks to Serge Hallyn and Eric Biederman. For testing the new functionality provided by this patchset a simple tool similar in syntax to mount(8) is available from: http://www.kernel.org/pub/linux/kernel/people/mszeredi/mmount Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] --- Index: linux/fs/namespace.c === --- linux.orig/fs/namespace.c 2007-04-22 17:48:18.0 +0200 +++ linux/fs/namespace.c2007-04-22 18:19:51.0 +0200 @@ -252,10 +252,12 @@ static int reserve_user_mount(void) static void __set_mnt_user(struct vfsmount *mnt) { BUG_ON(mnt-mnt_flags MNT_USER); - mnt-mnt_uid = current-uid; + mnt-mnt_uid = current-fsuid; mnt-mnt_flags |= MNT_USER; - if (!capable(CAP_SYS_ADMIN)) - mnt-mnt_flags |= MNT_NOSUID | MNT_NODEV; + if (!capable(CAP_SETUID)) + mnt-mnt_flags |= MNT_NOSUID; + if (!capable(CAP_MKNOD)) + mnt-mnt_flags |= MNT_NODEV; } static void set_mnt_user(struct vfsmount *mnt) @@ -725,10 +727,10 @@ static bool permit_umount(struct vfsmoun if (!(mnt-mnt_flags MNT_USER)) return false; - if (flags MNT_FORCE) + if ((flags MNT_FORCE) !(mnt-mnt_sb-s_type-fs_flags FS_SAFE)) return false; - return mnt-mnt_uid == current-uid; + return mnt-mnt_uid == current-fsuid; } /* @@ -792,13 +794,13 @@ static bool permit_mount(struct nameidat if (type !(type-fs_flags FS_SAFE)) return false; - if (!S_ISDIR(inode-i_mode) !S_ISREG(inode-i_mode)) + if (S_ISLNK(inode-i_mode)) return false; if (!(nd-mnt-mnt_flags MNT_USER)) return false; - if (nd-mnt-mnt_uid != current-uid) + if (nd-mnt-mnt_uid != current-fsuid) return false; *flags |= MS_SETUSER; - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 3/8] per backing_dev dirty and writeback page accounting
I have no idea how serious the scalability problems with this are. If they are serious, different solutions can probably be found for the above, but this is certainly the simplest. Atomic operations to a single per-backing device from all CPUs at once? That's a pretty serious scalability issue and it will cause a major performance regression for XFS. OK. How about just accounting writeback pages? That should be much less of a problem, since normally writeback is started from pdflush/kupdate in large batches without any concurrency. Or is it possible to export the state of the device queue to mm? E.g. could balance_dirty_pages() query the backing dev if there are any outstanding write requests? I'd call this a showstopper right now - maybe you need to look at something like the ZVC code that Christoph Lameter wrote, perhaps? That's rather a heavyweight approach for this I think. The only info balance_dirty_pages() really needs is whether there are any dirty+writeback bound for the backing dev or not. It knows about the diry pages, since it calls writeback_inodes() which scans the dirty pages for this backing dev looking for ones to write out. If after returning from writeback_inodes() wbc-nr_to_write didn't decrease and wbc-pages_skipped is zero then we know that there are no more dirty pages for the device. Or at least there are no dirty pages which aren't already under writeback. Thanks, Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 3/8] per backing_dev dirty and writeback page accounting
I'll try to explain the reason for the deadlock first. IIUC, your problem is that there's another bdi that holds all the dirty pages, and this throttle loop never flushes pages from that other bdi and we sleep instead. It seems to me that the fundamental problem is that to clean the pages we need to flush both bdi's, not just the bdi we are directly dirtying. This is what happens: write fault on upper filesystem balance_dirty_pages submit write requests loop ... --- fuse IPC --- [fuse loopback fs thread 1] read request sys_write mutex_lock(i_mutex) ... balance_dirty_pages submit write requests loop ... write requests completed ... dirty still over limit ... ... loop forever [fuse loopback fs thread 1] read request sys_write mute_lock(i_mutex) blocks So the queue for the upper filesystem is full. The queue for the lower filesystem is empty. There are no dirty pages in the lower filesystem. So kicking pdflush for the lower filesystem doesn't help, there's nothing to do. balance_dirty_pages() for the lower filesystem should just realize that there's nothing to do and return, and then there would be progress. So there's there's really no need to do any accounting, just some logic to determine that a backing dev is nearly or completely quiescent. And getting out of this tight situation doesn't have to be efficient. This is probably a very rare corner case, that almost never happens in real life, only with aggressive test tools like bash_shared_mapping. OK. How about just accounting writeback pages? That should be much less of a problem, since normally writeback is started from pdflush/kupdate in large batches without any concurrency. Except when you are throttling you bounce the cacheline around each cpu as it triggers foreground writeback. Yeah, we'd loose a bit of CPU, but not any write performance, since it is being throttled back anyway. Or is it possible to export the state of the device queue to mm? E.g. could balance_dirty_pages() query the backing dev if there are any outstanding write requests? Not directly - writeback_in_progress(bdi) is a coarse measure indicating pdflush is active on this bdi, which implies outstanding write requests). Hmm, not quite what I need. I'd call this a showstopper right now - maybe you need to look at something like the ZVC code that Christoph Lameter wrote, perhaps? That's rather a heavyweight approach for this I think. But if you want to use per-page accounting, you are going to need a per-cpu or per-zone set of counters on each bdi to do this without introducing regressions. Yes, this is an option, but I hope for a simpler solution. Thanks, Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 3/8] per backing_dev dirty and writeback page accounting
IIUC, your problem is that there's another bdi that holds all the dirty pages, and this throttle loop never flushes pages from that other bdi and we sleep instead. It seems to me that the fundamental problem is that to clean the pages we need to flush both bdi's, not just the bdi we are directly dirtying. This is what happens: write fault on upper filesystem balance_dirty_pages submit write requests loop ... Isn't this loop transferring the dirty state from the upper filesystem to the lower filesystem? What this loop is doing is putting write requests in the request queue, and in so doing transforming page state from dirty to writeback. What I don't see here is how the pages on this filesystem are not getting cleaned if the lower filesystem is being flushed properly. Because the lower filesystem writes back one request, but then gets stuck in balance_dirty_pages before returning. So the write request is never completed. The problem is that balance_dirty_pages is waiting for the condition that the global number of dirty+writeback pages goes below the threshold. But this condition can only be satisfied if balance_dirty_pages() returns. I'm probably missing something big and obvious, but I'm not familiar with the exact workings of FUSE so please excuse my ignorance --- fuse IPC --- [fuse loopback fs thread 1] This is the lower filesystem? Or a callback thread for doing the write requests to the lower filesystem? This is the fuse daemon. It's a normal process that reads requests from /dev/fuse, serves these requests then writes the reply back onto /dev/fuse. It is usually multithreaded, so it can serve many requests in parallel. The loopback filesystem serves the requests by issuing the relevant filesystem syscalls on the underlying fs. read request sys_write mutex_lock(i_mutex) ... balance_dirty_pages submit write requests loop ... write requests completed ... dirty still over limit ... ... loop forever Hmmm - the situation in balance_dirty_pages() after an attempt to writeback_inodes(wbc) that has written nothing because there is nothing to write would be: wbc-nr_write == write_chunk wbc-pages_skipped == 0 wbc-encountered_congestion == 0 !bdi_congested(wbc-bdi) What happens if you make that an exit condition to the loop? That's almost right. The only problem is that even if there's no congestion, the device queue can be holding a great amount of yet unwritten pages. So exiting on this condition would mean, that dirty+writeback could go way over the threshold. How much this would be a problem? I don't know, I guess it depends on many things: how many queues, how many requests per queue, how many bytes per request. Or alternatively, adding another bit to the wbc structure to say there was nothing to do and setting that if we find list_empty(sb-s_dirty) when trying to flush dirty inodes. [ FWIW, this may also solve another problem of fast block devices being throttled incorrectly when a slow block dev is consuming all the dirty pages... ] There may be a patch floating around, which I think basically does this, but only as long as the dirty+writeback are over a soft limit, but under the hard limit. When over the the hard limit, balance_dirty_pages still loops until dirty+writeback go below the threshold. Thanks, Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 3/8] per backing_dev dirty and writeback page accounting
Only if the queue depth is not bound. Queue depths are bound and so the distance we can go over the threshold is limited. This is the fundamental principle on which the throttling is based. Hence, if the queue is not full, then we will have either written dirty pages to it (i.e wbc-nr_write != write_chunk so we will throttle or continue normally if write_chunk was written) or we have no more dirty pages left. Having no dirty pages left on the bdi and it not being congested means we effectively have a clean, idle bdi. We should not be trying to throttle writeback here - we can't do anything to improve the situation by continuing to try to do writeback on this bdi, so we may as well give up and let the writer continue. Once we have dirty pages on the bdi, we'll get throttled appropriately. OK, you convinced me. How about this patch? I introduced a new wbc counter, that sums the number of dirty pages encountered, including ones already under writeback. Dave, big thanks for your insights. Miklos Index: linux/include/linux/writeback.h === --- linux.orig/include/linux/writeback.h2007-03-14 22:43:42.0 +0100 +++ linux/include/linux/writeback.h 2007-03-14 22:58:56.0 +0100 @@ -44,6 +44,7 @@ struct writeback_control { long nr_to_write; /* Write this many pages, and decrement this for each page written */ long pages_skipped; /* Pages which were not written */ + long nr_dirty; /* Number of dirty pages encountered */ /* * For a_ops-writepages(): is start or end are non-zero then this is Index: linux/mm/page-writeback.c === --- linux.orig/mm/page-writeback.c 2007-03-14 22:41:01.0 +0100 +++ linux/mm/page-writeback.c 2007-03-14 23:00:20.0 +0100 @@ -220,6 +220,17 @@ static void balance_dirty_pages(struct a pages_written += write_chunk - wbc.nr_to_write; if (pages_written = write_chunk) break; /* We've done our duty */ + + /* +* If just a few dirty pages were encountered, and +* the queue is not congested, then allow this dirty +* producer to continue. This resolves the deadlock +* that happens when one filesystem writes back data +* through another. It should also help when a slow +* device is completely blocking other writes. +*/ + if (wbc.nr_dirty 8 !bdi_write_congested(bdi)) + break; } congestion_wait(WRITE, HZ/10); } @@ -612,6 +623,7 @@ retry: min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) { unsigned i; + wbc-nr_dirty += nr_pages; scanned = 1; for (i = 0; i nr_pages; i++) { struct page *page = pvec.pages[i]; - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] fix quadratic behavior of shrink_dcache_parent()
Unfortunately this patch doesn't completely solve this problem, since the system will still be hosed due to all memory being used up by dentries. And I bet the OOM killer won't find the real target (du) but will kill anything before that. So the second part of the problem is to somehow limit the number of dentries used. Not easy... At least with this patch (if I am reading it correctly), once the offending culprit is identified and the programs are killed off, everything will go back to normal without a reboot. Yes, it's basically a normal OOM situation. What makes it worse than usual, is that it's _kernel_ memory being used up which is not attributed to any process. So from the OOM killer's point of view it looks like none of the userspace programs are responsible for the OOM, and it will just select targets randomly. By the time it has come to the actual culprits (the filesystem daemon and 'du') it will have killed almost everything useful. So it's definitely better than without the patch, but still has room for improvement. Thanks, Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH] add filesystem subtype support
There's a slight problem with filesystem type representation in fuse based filesystems. From the kernel's view, there are just two filesystem types: fuse and fuseblk. From the user's view there are lots of different filesystem types. The user is not even much concerned if the filesystem is fuse based or not. So there's a conflict of interest in how this should be represented in fstab, mtab and /proc/mounts. The current scheme is to encode the real filesystem type in the mount source. So an sshfs mount looks like this: [EMAIL PROTECTED]:/ /mnt/serverfuse rw,nosuid,nodev,... This url-ish syntax works OK for sshfs and similar filesystems. However for block device based filesystems (ntfs-3g, zfs) it doesn't work, since the kernel expects the mount source to be a real device name. A possibly better scheme would be to encode the real type in the type field as type.subtype. So fuse mounts would look like this: /dev/hda1 /mnt/windows fuseblk.ntfs-3g rw,... [EMAIL PROTECTED]:/ /mnt/serverfuse.sshfsrw,nosuid,nodev,... This patch adds the necessary code to the kernel so that this can be correctly displayed in /proc/mounts. --- Index: linux/fs/filesystems.c === --- linux.orig/fs/filesystems.c 2007-02-12 12:42:55.0 +0100 +++ linux/fs/filesystems.c 2007-02-12 12:43:00.0 +0100 @@ -42,11 +42,12 @@ void put_filesystem(struct file_system_t module_put(fs-owner); } -static struct file_system_type **find_filesystem(const char *name) +static struct file_system_type **find_filesystem(const char *name, unsigned len) { struct file_system_type **p; for (p=file_systems; *p; p=(*p)-next) - if (strcmp((*p)-name,name) == 0) + if (strlen((*p)-name) == len + strncmp((*p)-name, name, len) == 0) break; return p; } @@ -69,11 +70,12 @@ int register_filesystem(struct file_syst int res = 0; struct file_system_type ** p; + BUG_ON(strchr(fs-name, '.')); if (fs-next) return -EBUSY; INIT_LIST_HEAD(fs-fs_supers); write_lock(file_systems_lock); - p = find_filesystem(fs-name); + p = find_filesystem(fs-name, strlen(fs-name)); if (*p) res = -EBUSY; else @@ -216,19 +218,26 @@ int get_filesystem_list(char * buf) struct file_system_type *get_fs_type(const char *name) { struct file_system_type *fs; + const char *dot = strchr(name, '.'); + unsigned len = dot ? dot - name : strlen(name); read_lock(file_systems_lock); - fs = *(find_filesystem(name)); + fs = *(find_filesystem(name, len)); if (fs !try_module_get(fs-owner)) fs = NULL; read_unlock(file_systems_lock); - if (!fs (request_module(%s, name) == 0)) { + if (!fs (request_module(%.*s, len, name) == 0)) { read_lock(file_systems_lock); - fs = *(find_filesystem(name)); + fs = *(find_filesystem(name, len)); if (fs !try_module_get(fs-owner)) fs = NULL; read_unlock(file_systems_lock); } + + if (dot fs !(fs-fs_flags FS_HAS_SUBTYPE)) { + put_filesystem(fs); + fs = NULL; + } return fs; } Index: linux/fs/fuse/inode.c === --- linux.orig/fs/fuse/inode.c 2007-02-12 12:42:55.0 +0100 +++ linux/fs/fuse/inode.c 2007-02-12 12:43:00.0 +0100 @@ -634,6 +634,7 @@ static int fuse_get_sb(struct file_syste static struct file_system_type fuse_fs_type = { .owner = THIS_MODULE, .name = fuse, + .fs_flags = FS_HAS_SUBTYPE, .get_sb = fuse_get_sb, .kill_sb= kill_anon_super, }; @@ -650,6 +651,7 @@ static int fuse_get_sb_blk(struct file_s static struct file_system_type fuseblk_fs_type = { .owner = THIS_MODULE, .name = fuseblk, + .fs_flags = FS_HAS_SUBTYPE, .get_sb = fuse_get_sb_blk, .kill_sb= kill_block_super, .fs_flags = FS_REQUIRES_DEV, Index: linux/fs/namespace.c === --- linux.orig/fs/namespace.c 2007-02-12 12:42:55.0 +0100 +++ linux/fs/namespace.c2007-02-12 12:43:00.0 +0100 @@ -378,6 +378,10 @@ static int show_vfsmnt(struct seq_file * seq_path(m, mnt, mnt-mnt_root, \t\n\\); seq_putc(m, ' '); mangle(m, mnt-mnt_sb-s_type-name); + if (mnt-mnt_sb-s_subtype mnt-mnt_sb-s_subtype[0]) { + seq_putc(m, '.'); + mangle(m, mnt-mnt_sb-s_subtype); + } seq_puts(m, mnt-mnt_sb-s_flags MS_RDONLY ? ro : rw); for (fs_infop =
Re: [RFC PATCH] add filesystem subtype support
There's a slight problem with filesystem type representation in fuse based filesystems. From the kernel's view, there are just two filesystem types: fuse and fuseblk. From the user's view there are lots of different filesystem types. The user is not even much concerned if the filesystem is fuse based or not. Yes. Those who are concerned with the fstype and mount like mount -t fstype device mountpoint apparently expect mount/fstab line like device mountpoint fstype ... The problem with supporting /dev/hda1 /mnt/windows ntfs-3g rw,... type syntax in /proc/mounts is that it would add much more complexity to the kernel. The same could be achieved in userspace by teaching the mount utility about special filesystems like ntfs-3g. This patch aims for the minium footprint for a feature, that really doesn't concern the kernel at all. Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH] add filesystem subtype support
-static struct file_system_type **find_filesystem(const char *name) +static struct file_system_type **find_filesystem(const char *name, unsigned len) { struct file_system_type **p; for (p=file_systems; *p; p=(*p)-next) -if (strcmp((*p)-name,name) == 0) +if (strlen((*p)-name) == len +strncmp((*p)-name, name, len) == 0) break; return p; } Question btw, why does this function return a struct file_system_type ** at all? Would not struct file_system_type * suffice? It's used in register_filesystem() to get the end of the list pointer. Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [uml-devel] UML hang with 100% CPU
Strangely enough after continuing in gdb, UML is back to normal, and I can't make it hang any more. It must be something timing related. Can you see if the patch below fixes it? Yay! Got my nice fast UML back instead of ugly slow QEmu ;) Seems to work perfectly now. Thanks, Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] consolidate generic_writepages and mpage_writepages
From: Miklos Szeredi [EMAIL PROTECTED] Clean up massive code duplication between mpage_writepages() and generic_writepages(). The new generic function, write_cache_pages() takes a function pointer argument, which will be called for each page to be written. Maybe cifs_writepages() too can use this infrastructure, but I'm not touching that with a ten-foot pole. The upcoming page writeback support in fuse will also want this. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] --- Index: linux/fs/mpage.c === --- linux.orig/fs/mpage.c 2007-02-16 16:54:41.0 +0100 +++ linux/fs/mpage.c2007-02-16 16:54:42.0 +0100 @@ -458,11 +458,18 @@ EXPORT_SYMBOL(mpage_readpage); * written, so it can intelligently allocate a suitably-sized BIO. For now, * just allocate full-size (16-page) BIOs. */ -static struct bio * -__mpage_writepage(struct bio *bio, struct page *page, get_block_t get_block, - sector_t *last_block_in_bio, int *ret, struct writeback_control *wbc, - writepage_t writepage_fn) +struct mpage_data { + struct bio *bio; + sector_t last_block_in_bio; + get_block_t *get_block; + unsigned use_writepage; +}; + +static int __mpage_writepage(struct page *page, struct writeback_control *wbc, +void *data) { + struct mpage_data *mpd = data; + struct bio *bio = mpd-bio; struct address_space *mapping = page-mapping; struct inode *inode = page-mapping-host; const unsigned blkbits = inode-i_blkbits; @@ -480,6 +487,7 @@ __mpage_writepage(struct bio *bio, struc int length; struct buffer_head map_bh; loff_t i_size = i_size_read(inode); + int ret = 0; if (page_has_buffers(page)) { struct buffer_head *head = page_buffers(page); @@ -542,7 +550,7 @@ __mpage_writepage(struct bio *bio, struc map_bh.b_state = 0; map_bh.b_size = 1 blkbits; - if (get_block(inode, block_in_file, map_bh, 1)) + if (mpd-get_block(inode, block_in_file, map_bh, 1)) goto confused; if (buffer_new(map_bh)) unmap_underlying_metadata(map_bh.b_bdev, @@ -591,7 +599,7 @@ page_is_mapped: /* * This page will go to BIO. Do we need to send this BIO off first? */ - if (bio *last_block_in_bio != blocks[0] - 1) + if (bio mpd-last_block_in_bio != blocks[0] - 1) bio = mpage_bio_submit(WRITE, bio); alloc_new: @@ -648,7 +656,7 @@ alloc_new: boundary_block, 1 blkbits); } } else { - *last_block_in_bio = blocks[blocks_per_page - 1]; + mpd-last_block_in_bio = blocks[blocks_per_page - 1]; } goto out; @@ -656,18 +664,19 @@ confused: if (bio) bio = mpage_bio_submit(WRITE, bio); - if (writepage_fn) { - *ret = (*writepage_fn)(page, wbc); + if (mpd-use_writepage) { + ret = mapping-a_ops-writepage(page, wbc); } else { - *ret = -EAGAIN; + ret = -EAGAIN; goto out; } /* * The caller has a ref on the inode, so *mapping is stable */ - mapping_set_error(mapping, *ret); + mapping_set_error(mapping, ret); out: - return bio; + mpd-bio = bio; + return ret; } /** @@ -690,120 +699,27 @@ out: * the call was made get new I/O started against them. If wbc-sync_mode is * WB_SYNC_ALL then we were called for data integrity and we must wait for * existing IO to complete. - * - * If you fix this you should check generic_writepages() also! */ int mpage_writepages(struct address_space *mapping, struct writeback_control *wbc, get_block_t get_block) { - struct backing_dev_info *bdi = mapping-backing_dev_info; - struct bio *bio = NULL; - sector_t last_block_in_bio = 0; - int ret = 0; - int done = 0; - int (*writepage)(struct page *page, struct writeback_control *wbc); - struct pagevec pvec; - int nr_pages; - pgoff_t index; - pgoff_t end;/* Inclusive */ - int scanned = 0; - int range_whole = 0; - - if (wbc-nonblocking bdi_write_congested(bdi)) { - wbc-encountered_congestion = 1; - return 0; - } + int ret; - writepage = NULL; - if (get_block == NULL) - writepage = mapping-a_ops-writepage; - - pagevec_init(pvec, 0); - if (wbc-range_cyclic) { - index = mapping-writeback_index; /* Start from prev offset */ - end = -1; - } else { - index = wbc-range_start PAGE_CACHE_SHIFT; - end = wbc-range_end PAGE_CACHE_SHIFT; - if (wbc
Re: [Fwd: [PATCH] consolidate generic_writepages and mpage_writepages]
Maybe cifs_writepages() too can use this infrastructure, but I'm not touching that with a ten-foot pole. The cifs case ought to be one of the simpler ones, pseudo-code is pretty easy, the hard part is all of the stuff unrelated to cifs: Ideally if there were generic functions to help out, cifs writepages would look roughly like the following cifs_writepages(struct address_space *mapping, struct writeback_control *wbc) { while (no more pages to write) { /* find writeable file handle for this inode */ /* find the biggest set of contiguous pages that total less than wsize */ if (packet signing is enabled) /* write lock pages so they can not be changed under us while we are calculating the checksum */ CIFSSMBWrite2(tree_connection, network_file_handle, array of iovecs, number of iovecs); if(packet signing was enabled) /* unlock pages */ if(error) { set page errors if (mounted hard ) continue; /* retry */ else /* if no retry possible */ return error to caller; } update bytes written statistics update index to point to next set of pages } /* end while loop */ } write_cache_pages() now takes care of the while (no more pages to write) { } part. All you have to do is to make the body of the loop into a function and pass it a structure with the data you want to preserve between invocations. The hard part is untangling the breaks and continues. Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
dirty balancing deadlock
I was testing the new fuse shared writable mmap support, and finding that bash-shared-mapping deadlocks (which isn't so strange ;). What is more strange is that this is not an OOM situation at all, with plenty of free and cached pages. A little more investigation shows that a similar deadlock happens reliably with bash-shared-mapping on a loopback mount, even if only half the total memory is used. The cause is slightly different in the two cases: - loopback mount: allocation by the underlying filesystem is stalled on throttle_vm_writeout() - fuse-loop: page dirtying on the underlying filesystem is stalled on balance_dirty_pages() In both cases the underlying fs is totally innocent, with no dirty/writback pages, yet it's waiting for the global dirty+writeback to go below the threshold, which obviously won't, until the allocation/dirtying succeeds. I'm not quite sure what the solution is, and asking for thoughts. Ideas: - per filesystem dirty counters. If filesystem is clean (or dirty is below some minimum), then balance_dirty_pages() should no wait any more - throttle_vm_writeout() was meant to throttle swapping, no? So in that case there should be a separate swap-writback counter Thanks, Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dirty balancing deadlock
I was testing the new fuse shared writable mmap support, and finding that bash-shared-mapping deadlocks (which isn't so strange ;). What is more strange is that this is not an OOM situation at all, with plenty of free and cached pages. A little more investigation shows that a similar deadlock happens reliably with bash-shared-mapping on a loopback mount, even if only half the total memory is used. The cause is slightly different in the two cases: - loopback mount: allocation by the underlying filesystem is stalled on throttle_vm_writeout() - fuse-loop: page dirtying on the underlying filesystem is stalled on balance_dirty_pages() In both cases the underlying fs is totally innocent, with no dirty/writback pages, yet it's waiting for the global dirty+writeback to go below the threshold, which obviously won't, until the allocation/dirtying succeeds. I'm not quite sure what the solution is, and asking for thoughts. But these things don't just throttle. They also perform large amounts of writeback, which causes the dirty levels to subside. From your description it appears that this writeback isn't happening, or isn't working. How come? - filesystems A and B - write to A will end up as write to B - dirty pages in A manage to go over dirty_threshold - page writeback is started from A - this triggers writeback for a couple of pages in B - writeback finishes normally, but dirty+writeback pages are still over threshold - balance_dirty_pages in B gets stuck, nothing ever moves after this At least this is my theory for what happens. Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dirty balancing deadlock
Andrew Morton wrote: On Sun, 18 Feb 2007 19:28:18 +0100 Miklos Szeredi [EMAIL PROTECTED] wrote: I was testing the new fuse shared writable mmap support, and finding that bash-shared-mapping deadlocks (which isn't so strange ;). What is more strange is that this is not an OOM situation at all, with plenty of free and cached pages. A little more investigation shows that a similar deadlock happens reliably with bash-shared-mapping on a loopback mount, even if only half the total memory is used. The cause is slightly different in the two cases: - loopback mount: allocation by the underlying filesystem is stalled on throttle_vm_writeout() - fuse-loop: page dirtying on the underlying filesystem is stalled on balance_dirty_pages() In both cases the underlying fs is totally innocent, with no dirty/writback pages, yet it's waiting for the global dirty+writeback to go below the threshold, which obviously won't, until the allocation/dirtying succeeds. I'm not quite sure what the solution is, and asking for thoughts. But these things don't just throttle. They also perform large amounts of writeback, which causes the dirty levels to subside. From your description it appears that this writeback isn't happening, or isn't working. How come? Is the fuse daemon trying to do writeback to itself, perhaps? That is, trying to write out data to the FUSE filesystem, for which it is also the server. No. It's trying to write out data to a different filesystem. Trying to write out data to itself very obviously deadlocks, but that doesn't affect anything beside the stupid filesystem itself, and there are mechanisms for aborting such a situation (forced umount, abort through fuse-control filesystem). Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dirty balancing deadlock
I was testing the new fuse shared writable mmap support, and finding that bash-shared-mapping deadlocks (which isn't so strange ;). What is more strange is that this is not an OOM situation at all, with plenty of free and cached pages. A little more investigation shows that a similar deadlock happens reliably with bash-shared-mapping on a loopback mount, even if only half the total memory is used. The cause is slightly different in the two cases: - loopback mount: allocation by the underlying filesystem is stalled on throttle_vm_writeout() - fuse-loop: page dirtying on the underlying filesystem is stalled on balance_dirty_pages() In both cases the underlying fs is totally innocent, with no dirty/writback pages, yet it's waiting for the global dirty+writeback to go below the threshold, which obviously won't, until the allocation/dirtying succeeds. I'm not quite sure what the solution is, and asking for thoughts. But these things don't just throttle. They also perform large amounts of writeback, which causes the dirty levels to subside. From your description it appears that this writeback isn't happening, or isn't working. How come? - filesystems A and B - write to A will end up as write to B - dirty pages in A manage to go over dirty_threshold - page writeback is started from A - this triggers writeback for a couple of pages in B - writeback finishes normally, but dirty+writeback pages are still over threshold - balance_dirty_pages in B gets stuck, nothing ever moves after this At least this is my theory for what happens. Is B a real filesystem? Yes. If so, writes to B will decrease the dirty memory threshold. Yes, but not by enough. Say A dirties a 1100 pages, limit is 1000. Some pages queued for writeback (doesn't matter how much). B writes back 1, 1099 dirty remain in A, zero in B. balance_dirty_pages() for B doesn't know that there's nothing more to write back for B, it's just waiting there for those 1099, which'll never get written. The writeout code _should_ just sit there transferring dirtyiness from A to B and cleaning pages via B, looping around, alternating between both. What does sysrq-t say? This is the fuse daemon thread that got stuck. There are lots of others that are stuck on some ext3 mutex as a result of this. fusexmp_fh_no D 40045401 0 527493 533 495 (NOTLB) 088d55f8 0001 08dcfb14 0805d8cb 08a09b78 088d55f8 08dc8000 08dc8000 08dcfb3c 0805a38a 08a09680 088d5100 08dcfb2c 08dc8000 08dc8000 0847c300 088d5100 08a09680 08dcfb94 08182fe6 08a09680 088d5100 08a09680 Call Trace: 08dcfb00: [0805d8cb] switch_to_skas+0x3b/0x83 08dcfb18: [0805a38a] _switch_to+0x49/0x99 08dcfb40: [08182fe6] schedule+0x246/0x547 08dcfb98: [08183a03] schedule_timeout+0x4e/0xb6 08dcfbcc: [08183991] io_schedule_timeout+0x11/0x20 08dcfbd4: [080a0cf2] congestion_wait+0x72/0x87 08dcfc04: [0809c693] balance_dirty_pages+0xa8/0x153 08dcfc5c: [0809c7bf] balance_dirty_pages_ratelimited_nr+0x43/0x45 08dcfc68: [080992b5] generic_file_buffered_write+0x3e3/0x6f5 08dcfd20: [0809988e] __generic_file_aio_write_nolock+0x2c7/0x5dd 08dcfda8: [08099cb6] generic_file_aio_write+0x55/0xc7 08dcfddc: [080ea1e6] ext3_file_write+0x39/0xaf 08dcfe04: [080b060b] do_sync_write+0xd8/0x10e 08dcfebc: [080b06e3] vfs_write+0xa2/0x1cb 08dcfeec: [080b09b8] sys_pwrite64+0x65/0x69 08dcff10: [0805dd54] handle_syscall+0x90/0xbc 08dcff64: [0806d56c] handle_trap+0x27/0x121 08dcff8c: [0806dc65] userspace+0x1de/0x226 08dcffe4: [0805da19] fork_handler+0x76/0x88 08dcfffc: [d4cf0007] 0xd4cf0007 /proc/vmstat: nr_anon_pages 668 nr_mapped 3168 nr_file_pages 5191 nr_slab_reclaimable 173 nr_slab_unreclaimable 494 nr_page_table_pages 65 nr_dirty 2174 nr_writeback 10 nr_unstable 0 nr_bounce 0 nr_vmscan_write 0 pgpgin 10955 pgpgout 421091 pswpin 0 pswpout 0 pgalloc_dma 0 pgalloc_normal 268761 pgfree 269709 pgactivate 128287 pgdeactivate 31253 pgfault 237350 pgmajfault 4340 pgrefill_dma 0 pgrefill_normal 127899 pgsteal_dma 0 pgsteal_normal 46892 pgscan_kswapd_dma 0 pgscan_kswapd_normal 47104 pgscan_direct_dma 0 pgscan_direct_normal 36544 pginodesteal 0 slabs_scanned 2048 kswapd_steal 25083 kswapd_inodesteal 335 pageoutrun 656 allocstall 423 pgrotated 0 Breakpoint 3, balance_dirty_pages (mapping=0xa01feb0) at mm/page-writeback.c:202 202 dirty_exceeded = 1; (gdb) p dirty_thresh $1 = 2113 (gdb) For completeness' sake, here's the backtrace for the stuck loopback as well: loop0 D BFFFE101 0 499 5 50059 (L-TLB) 088cc578 0001 09197c4c 0805d8cb 084fe6f8 088cc578 0919 0919 09197c74 0805a38a 084fe200 088cc080 09197c64 0919 0919 086d9c80 088cc080 084fe200 09197ccc 08182ab6 084fe200 088cc080 084fe200 Call Trace:
Re: dirty balancing deadlock
If so, writes to B will decrease the dirty memory threshold. Yes, but not by enough. Say A dirties a 1100 pages, limit is 1000. Some pages queued for writeback (doesn't matter how much). B writes back 1, 1099 dirty remain in A, zero in B. balance_dirty_pages() for B doesn't know that there's nothing more to write back for B, it's just waiting there for those 1099, which'll never get written. hm, OK, arguable. I guess something like this.. Doesn't help the fuse case, but does seem to help the loopback mount one. For fuse it's worse with the patch: now the write triggered by the balance recurses into fuse, with disastrous results, since the fuse writeback is now blocked on the userspace queue. fusexmp_fh_no D 40136678 0 505494 506 504 (NOTLB) 08982b78 0001 08f9f9b4 0805d8cb 089a75f8 08982b78 08f98000 08f98000 08f9f9dc 0805a38a 089a7100 08982680 08f9f9cc 08f98000 08f98000 085d8300 08982680 089a7100 08f9fa34 08183006 089a7100 08982680 089a7100 Call Trace: 08f9f9a0: [0805d8cb] switch_to_skas+0x3b/0x83 08f9f9b8: [0805a38a] _switch_to+0x49/0x99 08f9f9e0: [08183006] schedule+0x246/0x547 08f9fa38: [08103c7e] fuse_get_req_wp+0xe9/0x14a 08f9fa70: [08103d2e] fuse_writepage+0x4f/0x12c 08f9faac: [0809ce3f] __writepage+0x1e/0x3d 08f9fac0: [0809cd39] write_cache_pages+0x222/0x30a 08f9fb44: [0809ce8d] generic_writepages+0x2f/0x35 08f9fb5c: [0809ced6] do_writepages+0x43/0x45 08f9fb70: [080cb8d2] __writeback_single_inode+0xbc/0x173 08f9fbb8: [080cbb30] sync_sb_inodes+0x1a7/0x260 08f9fbe8: [080cbc54] writeback_inodes+0x6b/0x81 08f9fc04: [0809c640] balance_dirty_pages+0x55/0x153 08f9fc5c: [0809c7bf] balance_dirty_pages_ratelimited_nr+0x43/0x45 08f9fc68: [080992b5] generic_file_buffered_write+0x3e3/0x6f5 08f9fd20: [0809988e] __generic_file_aio_write_nolock+0x2c7/0x5dd 08f9fda8: [08099cb6] generic_file_aio_write+0x55/0xc7 08f9fddc: [080ea206] ext3_file_write+0x39/0xaf 08f9fe04: [080b060b] do_sync_write+0xd8/0x10e 08f9febc: [080b06e3] vfs_write+0xa2/0x1cb 08f9feec: [080b09b8] sys_pwrite64+0x65/0x69 08f9ff10: [0805dd54] handle_syscall+0x90/0xbc 08f9ff64: [0806d56c] handle_trap+0x27/0x121 08f9ff8c: [0806dc65] userspace+0x1de/0x226 08f9ffe4: [0805da19] fork_handler+0x76/0x88 08f9fffc: [] nosmp+0xf7fb7000/0x14 but where's pdflush? It should be busily transferring dirtiness from A to B. The transfer of dirtyness from A to B goes through the narrow channel of i_mutex. And once that is plugged by the stuck balance_dirty_pages() nothing else can pass through. The writeout code _should_ just sit there transferring dirtyiness from A to B and cleaning pages via B, looping around, alternating between both. What does sysrq-t say? This is the fuse daemon thread that got stuck. Where's pdflsuh? Doing nothing I guess. The request queue for the fuse filesystem is full, so writepage with wbc-nonblocking=1 will be skipped. pdflush D 40045401 023 52412 (L-TLB) 088d5bf8 0001 08907df8 0805d8cb 088d55f8 088d5bf8 0890 0890 08907e20 0805a38a 088d5100 088d5700 08907e10 0890 0890 0847c300 088d5700 088d5100 08907e78 08182fe6 088d5100 088d5700 088d5100 Call Trace: 08907de4: [0805d8cb] switch_to_skas+0x3b/0x83 08907dfc: [0805a38a] _switch_to+0x49/0x99 08907e24: [08182fe6] schedule+0x246/0x547 08907e7c: [08183a03] schedule_timeout+0x4e/0xb6 08907eb0: [08183991] io_schedule_timeout+0x11/0x20 08907eb8: [080a0cf2] congestion_wait+0x72/0x87 08907ee8: [0809c860] background_writeout+0x35/0xa4 08907f38: [0809d41e] __pdflush+0xae/0x152 08907f54: [0809d4f5] pdflush+0x33/0x39 08907f84: [0808a03a] kthread+0xa7/0xab 08907fb4: [0806a0f1] run_kernel_thread+0x41/0x50 08907fe0: [0805d975] new_thread_handler+0x62/0x8b 08907ffc: [] nosmp+0xf7fb7000/0x14 pdflush D 40045401 024 52523 (L-TLB) 081e1458 0001 088ffe00 0805d8cb 088d5bf8 081e1458 088f8000 088f8000 088ffe28 0805a38a 088d5700 081e0f60 088ffe18 088f8000 088f8000 0847c300 081e0f60 088d5700 088ffe80 08182fe6 088d5700 081e0f60 088d5700 Call Trace: 088ffdec: [0805d8cb] switch_to_skas+0x3b/0x83 088ffe04: [0805a38a] _switch_to+0x49/0x99 088ffe2c: [08182fe6] schedule+0x246/0x547 088ffe84: [08183a03] schedule_timeout+0x4e/0xb6 088ffeb8: [08183991] io_schedule_timeout+0x11/0x20 088ffec0: [080a0cf2] congestion_wait+0x72/0x87 088ffef0: [0809c98c] wb_kupdate+0x93/0xd9 088fff38: [0809d41e] __pdflush+0xae/0x152 088fff54: [0809d4f5] pdflush+0x33/0x39 088fff84: [0808a03a] kthread+0xa7/0xab 088fffb4: [0806a0f1] run_kernel_thread+0x41/0x50 088fffe0: [0805d975] new_thread_handler+0x62/0x8b 088c: [] nosmp+0xf7fb7000/0x14 - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please
Re: dirty balancing deadlock
--- a/fs/fs-writeback.c~a +++ a/fs/fs-writeback.c @@ -356,7 +356,7 @@ int generic_sync_sb_inodes(struct super_ continue; /* Skip a congested blockdev */ } - if (wbc-bdi bdi != wbc-bdi) { + if (wbc-bdi bdi != wbc-bdi bdi_write_congested(bdi)) { if (!sb_is_blkdev_sb(sb)) break; /* fs has the wrong queue */ list_move(inode-i_list, sb-s_dirty); Checking bdi_write_congested(bdi) is not reliable, since the queue can become congested _after_ the check is done. Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dirty balancing deadlock
If so, writes to B will decrease the dirty memory threshold. Yes, but not by enough. Say A dirties a 1100 pages, limit is 1000. Some pages queued for writeback (doesn't matter how much). B writes back 1, 1099 dirty remain in A, zero in B. balance_dirty_pages() for B doesn't know that there's nothing more to write back for B, it's just waiting there for those 1099, which'll never get written. hm, OK, arguable. I guess something like this.. Doesn't help the fuse case, but does seem to help the loopback mount one. No sorry, it doesn't even help the loopback deadlock. It sometimes takes quite a while to trigger... Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dirty balancing deadlock
If so, writes to B will decrease the dirty memory threshold. Yes, but not by enough. Say A dirties a 1100 pages, limit is 1000. Some pages queued for writeback (doesn't matter how much). B writes back 1, 1099 dirty remain in A, zero in B. balance_dirty_pages() for B doesn't know that there's nothing more to write back for B, it's just waiting there for those 1099, which'll never get written. hm, OK, arguable. I guess something like this.. Doesn't help the fuse case, but does seem to help the loopback mount one. For fuse it's worse with the patch: now the write triggered by the balance recurses into fuse, with disastrous results, since the fuse writeback is now blocked on the userspace queue. fusexmp_fh_no D 40136678 0 505494 506 504 (NOTLB) 08982b78 0001 08f9f9b4 0805d8cb 089a75f8 08982b78 08f98000 08f98000 08f9f9dc 0805a38a 089a7100 08982680 08f9f9cc 08f98000 08f98000 085d8300 08982680 089a7100 08f9fa34 08183006 089a7100 08982680 089a7100 Call Trace: 08f9f9a0: [0805d8cb] switch_to_skas+0x3b/0x83 08f9f9b8: [0805a38a] _switch_to+0x49/0x99 08f9f9e0: [08183006] schedule+0x246/0x547 08f9fa38: [08103c7e] fuse_get_req_wp+0xe9/0x14a 08f9fa70: [08103d2e] fuse_writepage+0x4f/0x12c In general, writepage is supposed to do work without blocking on expensive locks that will get pdflush and dirty reclaim stuck in this fashion. You'll probably have to take the same approach reiserfs does in data=journal mode, which is leaving the page dirty if fuse_get_req_wp is going to block without making progress. Pdflush, and dirty reclaim set wbc-nonblocking to true. balance_dirty_pages and fsync don't. The problem here is that Andrew's patch is wrong to let balance_dirty_pages() try to write back pages from a different queue. Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dirty balancing deadlock
In general, writepage is supposed to do work without blocking on expensive locks that will get pdflush and dirty reclaim stuck in this fashion. You'll probably have to take the same approach reiserfs does in data=journal mode, which is leaving the page dirty if fuse_get_req_wp is going to block without making progress. Pdflush, and dirty reclaim set wbc-nonblocking to true. balance_dirty_pages and fsync don't. The problem here is that Andrew's patch is wrong to let balance_dirty_pages() try to write back pages from a different queue. async or sync, writepage is supposed to either make progress or bail. loopback aside, if the fuse call is blocking long term, you're going to run into problems. Hmm, like what? Thanks, Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dirty balancing deadlock
How about this? Solves the FUSE deadlock, but not the throttle_vm_writeout() one. I'll try to tackle that one as well. If the per-bdi dirty counter goes below 16, balance_dirty_pages() returns. Does the constant need to tunable? If it's too large, then the global threshold is more easily exceeded. If it's too small, then in a tight situation progress will be slower. Thanks, Miklos Index: linux/mm/page-writeback.c === --- linux.orig/mm/page-writeback.c 2007-02-19 17:32:41.0 +0100 +++ linux/mm/page-writeback.c 2007-02-19 18:05:28.0 +0100 @@ -198,6 +198,25 @@ static void balance_dirty_pages(struct a dirty_thresh) break; + /* +* Acquit this producer if there's little or nothing +* to write back to this particular queue +* +* Without this check a deadlock is possible in the +* following case: +* +* - filesystem A writes data through filesystem B +* - filesystem A has dirty pages over dirty_thresh +* - writeback is started, this triggers a write in B +* - balance_dirty_pages() is called synchronously +* - the write to B blocks +* - the writeback completes, but dirty is still over threshold +* - the blocking write prevents futher writes from happening +*/ + if (atomic_long_read(bdi-nr_dirty) + + atomic_long_read(bdi-nr_writeback) 16) + break; + if (!dirty_exceeded) dirty_exceeded = 1; - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dirty balancing deadlock
Solves the FUSE deadlock, but not the throttle_vm_writeout() one. I'll try to tackle that one as well. If the per-bdi dirty counter goes below 16, balance_dirty_pages() returns. Does the constant need to tunable? If it's too large, then the global threshold is more easily exceeded. If it's too small, then in a tight situation progress will be slower. Similar in spirit, this should solve the deadlock on throttle_vm_writeout(). Totally untested. Does this approach look workable? Thanks, Miklos Index: linux/include/linux/swap.h === --- linux.orig/include/linux/swap.h 2007-02-19 23:39:36.0 +0100 +++ linux/include/linux/swap.h 2007-02-20 00:03:38.0 +0100 @@ -277,10 +277,14 @@ static inline void disable_swap_token(vo put_swap_token(swap_token_mm); } +#define nr_swap_writeback \ + atomic_long_read(swapper_space.backing_dev_info-nr_writeback) + #else /* CONFIG_SWAP */ #define total_swap_pages 0 #define total_swapcache_pages 0UL +#define nr_swap_writeback 0UL #define si_swapinfo(val) \ do { (val)-freeswap = (val)-totalswap = 0; } while (0) Index: linux/mm/page-writeback.c === --- linux.orig/mm/page-writeback.c 2007-02-19 23:43:03.0 +0100 +++ linux/mm/page-writeback.c 2007-02-20 00:03:49.0 +0100 @@ -33,6 +33,7 @@ #include linux/syscalls.h #include linux/buffer_head.h #include linux/pagevec.h +#include linux/swap.h /* * The maximum number of pages to writeout in a single bdflush/kupdate @@ -332,6 +333,9 @@ void throttle_vm_writeout(void) if (global_page_state(NR_UNSTABLE_NFS) + global_page_state(NR_WRITEBACK) = dirty_thresh) break; + + if (nr_swap_writeback 16) + break; congestion_wait(WRITE, HZ/10); } } Index: linux/mm/page_io.c === --- linux.orig/mm/page_io.c 2007-02-19 23:24:23.0 +0100 +++ linux/mm/page_io.c 2007-02-19 23:42:21.0 +0100 @@ -70,6 +70,7 @@ static int end_swap_bio_write(struct bio ClearPageReclaim(page); } end_page_writeback(page); + atomic_long_dec(swapper_space.backing_dev_info-nr_writeback); bio_put(bio); return 0; } @@ -121,6 +122,7 @@ int swap_writepage(struct page *page, st if (wbc-sync_mode == WB_SYNC_ALL) rw |= (1 BIO_RW_SYNC); count_vm_event(PSWPOUT); + atomic_long_inc(swapper_space.backing_dev_info-nr_writeback); set_page_writeback(page); unlock_page(page); submit_bio(rw, bio); - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 1/3] fix illogical behavior in balance_dirty_pages()
This is a slightly different take on the fix for the deadlock in fuse with dirty balancing. David Chinner convinced me, that per-bdi counters are too expensive, and that it's not worth trying to account the number of pages under writeback, as they will be limited by the queue anyway. From: Miklos Szeredi [EMAIL PROTECTED] Current behavior of balance_dirty_pages() is to try to start writeout into the specified queue for at least write_chunk number of pages. If write_chunk pages have been submitted, then return. However if there are less than write_chunk dirty pages for this queue, then it doesn't return, waiting for the global dirty+writeback counters to subside, but without doing any actual work. This is illogical behavior: it allows more dirtyings while there are dirty pages, but stops further dirtying completely if there are no more dirty pages. It also makes a deadlock possible when one filesystem is writing data through another, and the balance_dirty_pages() for the lower filesystem is stalling the writeback for the upper filesystem's data (*). So the exit condition should instead be: submitted at least write_chunk number of pages OR submitted ALL the dirty pages destined for this backing dev AND backing dev is not congested To do this, introduce a new counter in writeback_control, which counts the number of dirty pages encountered during writeback. This includes all dirty pages, even those which are already under writeback but have been dirtied again, and those which have been skipped due to having locked buffers. If this counter is zero after trying to submit some pages for writeback, and the backing dev is uncongested, then don't wait any more. After this, newly dirtied pages can quickly be written back to this backing dev. If there are globally no more pages to submit for writeback (nr_reclaimable == 0), then also don't wait for ever, only while this backing dev is congested. (*) For more info on this deadlock, see the following discussions: http://lkml.org/lkml/2007/3/1/9 http://lkml.org/lkml/2007/3/12/16 Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] --- Index: linux/include/linux/writeback.h === --- linux.orig/include/linux/writeback.h2007-03-24 22:06:56.0 +0100 +++ linux/include/linux/writeback.h 2007-03-24 22:29:02.0 +0100 @@ -44,6 +44,7 @@ struct writeback_control { long nr_to_write; /* Write this many pages, and decrement this for each page written */ long pages_skipped; /* Pages which were not written */ + long nr_dirty; /* Number of dirty pages encountered */ /* * For a_ops-writepages(): is start or end are non-zero then this is Index: linux/mm/page-writeback.c === --- linux.orig/mm/page-writeback.c 2007-03-24 22:06:56.0 +0100 +++ linux/mm/page-writeback.c 2007-03-24 22:29:02.0 +0100 @@ -207,7 +207,15 @@ static void balance_dirty_pages(struct a * written to the server's write cache, but has not yet * been flushed to permanent storage. */ - if (nr_reclaimable) { + if (!nr_reclaimable) { + /* +* If there's nothing more to write back and this queue +* is uncongested, then it is possible to quickly +* write out some more data, so let's not wait +*/ + if (!bdi_write_congested(bdi)) + break; + } else { writeback_inodes(wbc); get_dirty_limits(background_thresh, dirty_thresh, mapping); @@ -220,6 +228,14 @@ static void balance_dirty_pages(struct a pages_written += write_chunk - wbc.nr_to_write; if (pages_written = write_chunk) break; /* We've done our duty */ + + /* +* If there are no more dirty pages for this backing +* backing dev, and the queue is not congested, then +* it is possible to quickly write out some more data +*/ + if (!wbc.nr_dirty !bdi_write_congested(bdi)) + break; } congestion_wait(WRITE, HZ/10); } @@ -619,6 +635,7 @@ retry: min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) { unsigned i; + wbc-nr_dirty += nr_pages; scanned = 1; for (i = 0; i nr_pages; i
[patch 2/3] remove throttle_vm_writeout()
From: Miklos Szeredi [EMAIL PROTECTED] Remove this function. It's purpose was to limit the global number of writeback pages from submitted by direct reclaim. But this is equally well accomplished by limited queue lengths. When this function was added, the device queues had much larger default lengths (8192 requests, now it's 128), causing problems. When writable shared mapping support is added to fuse, this function would be able to cause a deadlock if the userspace filesystem needs to allocate memory while writing back dirty pages. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] --- Index: linux/include/linux/writeback.h === --- linux.orig/include/linux/writeback.h2007-03-24 22:07:00.0 +0100 +++ linux/include/linux/writeback.h 2007-03-24 22:28:52.0 +0100 @@ -85,7 +85,6 @@ static inline void wait_on_inode(struct int wakeup_pdflush(long nr_pages); void laptop_io_completion(void); void laptop_sync_completion(void); -void throttle_vm_writeout(gfp_t gfp_mask); /* These are exported to sysctl. */ extern int dirty_background_ratio; Index: linux/mm/page-writeback.c === --- linux.orig/mm/page-writeback.c 2007-03-24 22:07:00.0 +0100 +++ linux/mm/page-writeback.c 2007-03-24 22:28:52.0 +0100 @@ -312,37 +312,6 @@ void balance_dirty_pages_ratelimited_nr( } EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr); -void throttle_vm_writeout(gfp_t gfp_mask) -{ - long background_thresh; - long dirty_thresh; - - if ((gfp_mask (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO)) { - /* -* The caller might hold locks which can prevent IO completion -* or progress in the filesystem. So we cannot just sit here -* waiting for IO to complete. -*/ - congestion_wait(WRITE, HZ/10); - return; - } - -for ( ; ; ) { - get_dirty_limits(background_thresh, dirty_thresh, NULL); - -/* - * Boost the allowable dirty threshold a bit for page - * allocators so they don't get DoS'ed by heavy writers - */ -dirty_thresh += dirty_thresh / 10; /* wh... */ - -if (global_page_state(NR_UNSTABLE_NFS) + - global_page_state(NR_WRITEBACK) = dirty_thresh) - break; -congestion_wait(WRITE, HZ/10); -} -} - /* * writeback at least _min_pages, and keep writing until the amount of dirty * memory is less than the background threshold, or until we're all clean. Index: linux/mm/vmscan.c === --- linux.orig/mm/vmscan.c 2007-03-24 22:06:53.0 +0100 +++ linux/mm/vmscan.c 2007-03-24 22:07:03.0 +0100 @@ -952,8 +952,6 @@ static unsigned long shrink_zone(int pri } } - throttle_vm_writeout(sc-gfp_mask); - atomic_dec(zone-reclaim_in_progress); return nr_reclaimed; } - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 3/3] balance dirty pages from loop device
From: Miklos Szeredi [EMAIL PROTECTED] The function do_lo_send_aops() should call balance_dirty_pages_ratelimited() after each page similarly to generic_file_buffered_write(). Without this, writing the loop device directly (not through a filesystem) is very slow, and also slows the whole system down, because nr_dirty is constantly over the limit. Beware: this patch without the fix to balance_dirty_pages() makes a loopback mounted filesystem prone to deadlock. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] --- Index: linux/drivers/block/loop.c === --- linux.orig/drivers/block/loop.c 2007-03-24 21:00:40.0 +0100 +++ linux/drivers/block/loop.c 2007-03-24 22:07:06.0 +0100 @@ -275,6 +275,8 @@ static int do_lo_send_aops(struct loop_d pos += size; unlock_page(page); page_cache_release(page); + balance_dirty_pages_ratelimited(mapping); + cond_resched(); } ret = 0; out: - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch] add file position info to proc
From: Miklos Szeredi [EMAIL PROTECTED] This patch adds support for finding out the current file position, open flags and possibly other info in the future. These new entries are added: /proc/PID/fdinfo/FD /proc/PID/task/TID/fdinfo/FD For each fd the information is provided in the following format: pos:1234 flags: 012 Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] --- Index: linux/fs/proc/base.c === --- linux.orig/fs/proc/base.c 2007-03-24 19:00:48.0 +0100 +++ linux/fs/proc/base.c2007-03-24 22:28:14.0 +0100 @@ -1199,7 +1199,10 @@ out: return ~0U; } -static int proc_fd_link(struct inode *inode, struct dentry **dentry, struct vfsmount **mnt) +#define PROC_FDINFO_MAX 64 + +static int proc_fd_info(struct inode *inode, struct dentry **dentry, + struct vfsmount **mnt, char *info) { struct task_struct *task = get_proc_task(inode); struct files_struct *files = NULL; @@ -1218,8 +1221,16 @@ static int proc_fd_link(struct inode *in spin_lock(files-file_lock); file = fcheck_files(files, fd); if (file) { - *mnt = mntget(file-f_path.mnt); - *dentry = dget(file-f_path.dentry); + if (mnt) + *mnt = mntget(file-f_path.mnt); + if (dentry) + *dentry = dget(file-f_path.dentry); + if (info) + snprintf(info, PROC_FDINFO_MAX, +pos:\t%lli\n +flags:\t0%o\n, +(long long) file-f_pos, +file-f_flags); spin_unlock(files-file_lock); put_files_struct(files); return 0; @@ -1230,6 +1241,12 @@ static int proc_fd_link(struct inode *in return -ENOENT; } +static int proc_fd_link(struct inode *inode, struct dentry **dentry, + struct vfsmount **mnt) +{ + return proc_fd_info(inode, dentry, mnt, NULL); +} + static int tid_fd_revalidate(struct dentry *dentry, struct nameidata *nd) { struct inode *inode = dentry-d_inode; @@ -1325,7 +1342,9 @@ out_iput: goto out; } -static struct dentry *proc_lookupfd(struct inode * dir, struct dentry * dentry, struct nameidata *nd) +static struct dentry *proc_lookupfd_common(struct inode *dir, + struct dentry *dentry, + instantiate_t instantiate) { struct task_struct *task = get_proc_task(dir); unsigned fd = name_to_int(dentry); @@ -1336,23 +1355,15 @@ static struct dentry *proc_lookupfd(stru if (fd == ~0U) goto out; - result = proc_fd_instantiate(dir, dentry, task, fd); + result = instantiate(dir, dentry, task, fd); out: put_task_struct(task); out_no_task: return result; } -static int proc_fd_fill_cache(struct file *filp, void *dirent, filldir_t filldir, - struct task_struct *task, int fd) -{ - char name[PROC_NUMBUF]; - int len = snprintf(name, sizeof(name), %d, fd); - return proc_fill_cache(filp, dirent, filldir, name, len, - proc_fd_instantiate, task, fd); -} - -static int proc_readfd(struct file * filp, void * dirent, filldir_t filldir) +static int proc_readfd_common(struct file * filp, void * dirent, + filldir_t filldir, instantiate_t instantiate) { struct dentry *dentry = filp-f_path.dentry; struct inode *inode = dentry-d_inode; @@ -1388,12 +1399,17 @@ static int proc_readfd(struct file * fil for (fd = filp-f_pos-2; fd fdt-max_fds; fd++, filp-f_pos++) { + char name[PROC_NUMBUF]; + int len; if (!fcheck_files(files, fd)) continue; rcu_read_unlock(); - if (proc_fd_fill_cache(filp, dirent, filldir, p, fd) 0) { + len = snprintf(name, sizeof(name), %d, fd); + if (proc_fill_cache(filp, dirent, filldir, + name, len, instantiate, + p, fd) 0) { rcu_read_lock(); break; } @@ -1408,6 +1424,32 @@ out_no_task: return retval; } +static struct dentry *proc_lookupfd(struct inode *dir, struct dentry *dentry
[patch 1/3] split mmap
From: Miklos Szeredi [EMAIL PROTECTED] This is a straightforward split of do_mmap_pgoff() into two functions: - do_mmap_pgoff() checks the parameters, and calculates the vma flags. Then it calls - mmap_region(), which does the actual mapping Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] --- Index: linux/mm/mmap.c === --- linux.orig/mm/mmap.c2007-03-24 21:00:40.0 +0100 +++ linux/mm/mmap.c 2007-03-24 22:28:52.0 +0100 @@ -893,14 +893,11 @@ unsigned long do_mmap_pgoff(struct file unsigned long flags, unsigned long pgoff) { struct mm_struct * mm = current-mm; - struct vm_area_struct * vma, * prev; struct inode *inode; unsigned int vm_flags; - int correct_wcount = 0; int error; - struct rb_node ** rb_link, * rb_parent; int accountable = 1; - unsigned long charged = 0, reqprot = prot; + unsigned long reqprot = prot; /* * Does the application expect PROT_READ to imply PROT_EXEC? @@ -1025,7 +1022,25 @@ unsigned long do_mmap_pgoff(struct file error = security_file_mmap(file, reqprot, prot, flags); if (error) return error; - + + return mmap_region(file, addr, len, flags, vm_flags, pgoff, + accountable); +} +EXPORT_SYMBOL(do_mmap_pgoff); + +unsigned long mmap_region(struct file *file, unsigned long addr, + unsigned long len, unsigned long flags, + unsigned int vm_flags, unsigned long pgoff, + int accountable) +{ + struct mm_struct *mm = current-mm; + struct vm_area_struct *vma, *prev; + int correct_wcount = 0; + int error; + struct rb_node **rb_link, *rb_parent; + unsigned long charged = 0; + struct inode *inode = file ? file-f_path.dentry-d_inode : NULL; + /* Clear old maps */ error = -ENOMEM; munmap_back: @@ -1174,8 +1189,6 @@ unacct_error: return error; } -EXPORT_SYMBOL(do_mmap_pgoff); - /* Get an address range which is currently unmapped. * For shmat() with addr=0. * Index: linux/include/linux/mm.h === --- linux.orig/include/linux/mm.h 2007-03-24 21:00:40.0 +0100 +++ linux/include/linux/mm.h2007-03-24 22:28:52.0 +0100 @@ -1035,6 +1035,10 @@ extern unsigned long get_unmapped_area(s extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr, unsigned long len, unsigned long prot, unsigned long flag, unsigned long pgoff); +extern unsigned long mmap_region(struct file *file, unsigned long addr, + unsigned long len, unsigned long flags, + unsigned int vm_flags, unsigned long pgoff, + int accountable); static inline unsigned long do_mmap(struct file *file, unsigned long addr, unsigned long len, unsigned long prot, - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 2/3] only allow nonlinear vmas for ram backed filesystems
From: Miklos Szeredi [EMAIL PROTECTED] Dirty page accounting/limiting doesn't work for nonlinear mappings, so for non-ram backed filesystems emulate with linear mappings. This retains ABI compatibility with previous kernels at minimal code cost. All known users of nonlinear mappings actually use tmpfs, so this shouldn't have any negative effect. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] --- Index: linux-2.6.21-rc4-mm1/mm/fremap.c === --- linux-2.6.21-rc4-mm1.orig/mm/fremap.c 2007-03-24 22:30:05.0 +0100 +++ linux-2.6.21-rc4-mm1/mm/fremap.c2007-03-24 22:37:59.0 +0100 @@ -181,6 +181,24 @@ asmlinkage long sys_remap_file_pages(uns goto retry; } mapping = vma-vm_file-f_mapping; + /* +* page_mkclean doesn't work on nonlinear vmas, so if dirty +* pages need to be accounted, emulate with linear vmas. +*/ + if (mapping_cap_account_dirty(mapping)) { + unsigned long addr; + + flags = MAP_NONBLOCK; + addr = mmap_region(vma-vm_file, start, size, flags, + vma-vm_flags, pgoff, 1); + if (IS_ERR_VALUE(addr)) + err = addr; + else { + BUG_ON(addr != start); + err = 0; + } + goto out; + } spin_lock(mapping-i_mmap_lock); flush_dcache_mmap_lock(mapping); vma-vm_flags |= VM_NONLINEAR; - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/