from:"Miklos Szeredi"

Re: [patch 0/8] unprivileged mount syscall

2007-04-16 Thread Miklos Szeredi

 Arn't there ways to escape chroot jails? Serge had pointed me to a URL
 which showed chroots can be escaped. And if that is true than having all
 user's private mount tree in the same namespace can be a security issue?

No.  In fact chrooting the user into /share/$USER will actually
_grant_ a privilege to the user, instead of taking it away.  It allows
the user to modify it's root namespace, which it wouldn't be able to
in the initial namespace.

So even if the user could escape from the chroot (which I doubt), s/he
would not be able to do any harm, since unprivileged mounting would be
restricted to /share.  Also /share/$USER should only have read/search
permission for $USER or no permissions at all, which would mean, that
other users' namespaces would be safe from tampering as well.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag

2007-04-16 Thread Miklos Szeredi

  Given the existence of shared subtrees allowing/denying this at the
  mount
  namespace level is silly and wrong.
  
  If we need more than just the filesystem permission checks can we
  make it a mount flag settable with mount and remount that allows
  non-privileged users the ability to create mount points under it
  in directories they have full read/write access to.
 
 Also for bind-mount and remount operations the flag has to be propagated
 down its propagation tree.  Otherwise a unpriviledged mount in a shared
 mount wont get reflected in its peers and slaves, leading to unidentical
 shared-subtrees.

That's an interesting question.  Do we want shared mounts to be
totally identical, including mnt_flags?  It doesn't look as if
do_remount() guarantees that currently.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag

2007-04-16 Thread Miklos Szeredi

   Also for bind-mount and remount operations the flag has to be propagated
   down its propagation tree.  Otherwise a unpriviledged mount in a shared
   mount wont get reflected in its peers and slaves, leading to unidentical
   shared-subtrees.
  
  That's an interesting question.  Do we want shared mounts to be
  totally identical, including mnt_flags?  It doesn't look as if
  do_remount() guarantees that currently.
 
 Depends on the semantics of each of the flags. Some flags like of the
 read/write flag, would not interfere with the propagation semantics
 AFAICT.  But this one certainly seems to interfere.

That depends.  Current patches check the unprivileged submounts
allowed under this mount flag only on the requested mount and not on
the propagated mounts.  Do you see a problem with this?

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch 01/10] add user mounts to the kernel

2007-04-16 Thread Miklos Szeredi

From: Miklos Szeredi [EMAIL PROTECTED]

Add ownership information to mounts.

A new mount flag, MS_SETUSER is used to make a mount owned by a user.
If this flag is specified, then the owner will be set to the current
real user id and the mount will be marked with the MNT_USER flag.  On
remount don't preserve previous owner, and treat MS_SETUSER as for a
new mount.  The MS_SETUSER flag is ignored on mount move.

The MNT_USER flag is not copied on any kind of mount cloning:
namespace creation, binding or propagation.  For bind mounts the
cloned mount(s) are set to MNT_USER depending on the MS_SETUSER mount
flag.  In all the other cases MNT_USER is always cleared.

For MNT_USER mounts a user=UID option is added to /proc/PID/mounts.
This is compatible with how mount ownership is stored in /etc/mtab.

It is expected, that in the future mount(8) will use MS_SETUSER to
store mount ownership within the kernel.  This would help in
situations, where /etc/mtab is difficult or impossible to work with,
e.g. when using mount propagation.

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---

Index: linux/fs/namespace.c
===
--- linux.orig/fs/namespace.c   2007-04-13 12:24:15.0 +0200
+++ linux/fs/namespace.c2007-04-13 13:03:44.0 +0200
@@ -227,6 +227,13 @@ static struct vfsmount *skip_mnt_tree(st
return p;
 }
 
+static void set_mnt_user(struct vfsmount *mnt)
+{
+   BUG_ON(mnt-mnt_flags  MNT_USER);
+   mnt-mnt_uid = current-uid;
+   mnt-mnt_flags |= MNT_USER;
+}
+
 static struct vfsmount *clone_mnt(struct vfsmount *old, struct dentry *root,
int flag)
 {
@@ -241,6 +248,11 @@ static struct vfsmount *clone_mnt(struct
mnt-mnt_mountpoint = mnt-mnt_root;
mnt-mnt_parent = mnt;
 
+   /* don't copy the MNT_USER flag */
+   mnt-mnt_flags = ~MNT_USER;
+   if (flag  CL_SETUSER)
+   set_mnt_user(mnt);
+
if (flag  CL_SLAVE) {
list_add(mnt-mnt_slave, old-mnt_slave_list);
mnt-mnt_master = old;
@@ -403,6 +415,8 @@ static int show_vfsmnt(struct seq_file *
if (mnt-mnt_flags  fs_infop-flag)
seq_puts(m, fs_infop-str);
}
+   if (mnt-mnt_flags  MNT_USER)
+   seq_printf(m, ,user=%i, mnt-mnt_uid);
if (mnt-mnt_sb-s_op-show_options)
err = mnt-mnt_sb-s_op-show_options(m, mnt);
seq_puts(m,  0 0\n);
@@ -920,8 +934,9 @@ static int do_change_type(struct nameida
 /*
  * do loopback mount.
  */
-static int do_loopback(struct nameidata *nd, char *old_name, int recurse)
+static int do_loopback(struct nameidata *nd, char *old_name, int flags)
 {
+   int clone_flags;
struct nameidata old_nd;
struct vfsmount *mnt = NULL;
int err = mount_is_safe(nd);
@@ -941,11 +956,12 @@ static int do_loopback(struct nameidata 
if (!check_mnt(nd-mnt) || !check_mnt(old_nd.mnt))
goto out;
 
+   clone_flags = (flags  MS_SETUSER) ? CL_SETUSER : 0;
err = -ENOMEM;
-   if (recurse)
-   mnt = copy_tree(old_nd.mnt, old_nd.dentry, 0);
+   if (flags  MS_REC)
+   mnt = copy_tree(old_nd.mnt, old_nd.dentry, clone_flags);
else
-   mnt = clone_mnt(old_nd.mnt, old_nd.dentry, 0);
+   mnt = clone_mnt(old_nd.mnt, old_nd.dentry, clone_flags);
 
if (!mnt)
goto out;
@@ -987,8 +1003,11 @@ static int do_remount(struct nameidata *
 
down_write(sb-s_umount);
err = do_remount_sb(sb, flags, data, 0);
-   if (!err)
+   if (!err) {
nd-mnt-mnt_flags = mnt_flags;
+   if (flags  MS_SETUSER)
+   set_mnt_user(nd-mnt);
+   }
up_write(sb-s_umount);
if (!err)
security_sb_post_remount(nd-mnt, flags, data);
@@ -1093,10 +1112,13 @@ static int do_new_mount(struct nameidata
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
 
-   mnt = do_kern_mount(type, flags, name, data);
+   mnt = do_kern_mount(type, flags  ~MS_SETUSER, name, data);
if (IS_ERR(mnt))
return PTR_ERR(mnt);
 
+   if (flags  MS_SETUSER)
+   set_mnt_user(mnt);
+
return do_add_mount(mnt, nd, mnt_flags, NULL);
 }
 
@@ -1127,7 +1149,8 @@ int do_add_mount(struct vfsmount *newmnt
if (S_ISLNK(newmnt-mnt_root-d_inode-i_mode))
goto unlock;
 
-   newmnt-mnt_flags = mnt_flags;
+   /* MNT_USER was set earlier */
+   newmnt-mnt_flags |= mnt_flags;
if ((err = graft_tree(newmnt, nd)))
goto unlock;
 
@@ -1447,7 +1470,7 @@ long do_mount(char *dev_name, char *dir_
retval = do_remount(nd, flags  ~MS_REMOUNT, mnt_flags,
data_page);
else

[patch 02/10] allow unprivileged umount

2007-04-16 Thread Miklos Szeredi

From: Miklos Szeredi [EMAIL PROTECTED]

The owner doesn't need sysadmin capabilities to call umount().

Similar behavior as umount(8) on mounts having user=UID option in
/etc/mtab.  The difference is that umount also checks /etc/fstab,
presumably to exclude another mount on the same mountpoint.

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---

Index: linux/fs/namespace.c
===
--- linux.orig/fs/namespace.c   2007-04-11 20:07:51.0 +0200
+++ linux/fs/namespace.c2007-04-11 20:08:05.0 +0200
@@ -659,6 +659,25 @@ static int do_umount(struct vfsmount *mn
 }
 
 /*
+ * umount is permitted for
+ *  - sysadmin
+ *  - mount owner, if not forced umount
+ */
+static bool permit_umount(struct vfsmount *mnt, int flags)
+{
+   if (capable(CAP_SYS_ADMIN))
+   return true;
+
+   if (!(mnt-mnt_flags  MNT_USER))
+   return false;
+
+   if (flags  MNT_FORCE)
+   return false;
+
+   return mnt-mnt_uid == current-uid;
+}
+
+/*
  * Now umount can handle mount points as well as block devices.
  * This is important for filesystems which use unnamed block devices.
  *
@@ -681,7 +700,7 @@ asmlinkage long sys_umount(char __user *
goto dput_and_out;
 
retval = -EPERM;
-   if (!capable(CAP_SYS_ADMIN))
+   if (!permit_umount(nd.mnt, flags))
goto dput_and_out;
 
retval = do_umount(nd.mnt, flags);

--
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch 07/10] allow unprivileged bind mounts

2007-04-16 Thread Miklos Szeredi

From: Miklos Szeredi [EMAIL PROTECTED]

Allow bind mounts to unprivileged users if the following conditions
are met:

  - user submounts are permitted on the mountpoint's mount
  - mountpoint is not a symlink or special file
  - mountpoint is not a sticky directory or is owned by the current user
  - mountpoint is writable by user
  - the number of user mounts is below the maximum

Unprivileged mounts imply MS_SETUSER, and will also have the nosuid
and nodev mount flags set.

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---

Index: linux/fs/namespace.c
===
--- linux.orig/fs/namespace.c   2007-04-13 13:35:53.0 +0200
+++ linux/fs/namespace.c2007-04-13 14:17:39.0 +0200
@@ -237,11 +237,30 @@ static void dec_nr_user_mounts(void)
spin_unlock(vfsmount_lock);
 }
 
-static void set_mnt_user(struct vfsmount *mnt)
+static int reserve_user_mount(void)
+{
+   int err = 0;
+   spin_lock(vfsmount_lock);
+   if (nr_user_mounts = max_user_mounts  !capable(CAP_SYS_ADMIN))
+   err = -EPERM;
+   else
+   nr_user_mounts++;
+   spin_unlock(vfsmount_lock);
+   return err;
+}
+
+static void __set_mnt_user(struct vfsmount *mnt)
 {
BUG_ON(mnt-mnt_flags  MNT_USER);
mnt-mnt_uid = current-uid;
mnt-mnt_flags |= MNT_USER;
+   if (!capable(CAP_SYS_ADMIN))
+   mnt-mnt_flags |= MNT_NOSUID | MNT_NODEV;
+}
+
+static void set_mnt_user(struct vfsmount *mnt)
+{
+   __set_mnt_user(mnt);
spin_lock(vfsmount_lock);
nr_user_mounts++;
spin_unlock(vfsmount_lock);
@@ -260,9 +279,16 @@ static struct vfsmount *clone_mnt(struct
int flag)
 {
struct super_block *sb = old-mnt_sb;
-   struct vfsmount *mnt = alloc_vfsmnt(old-mnt_devname);
+   struct vfsmount *mnt;
+
+   if (flag  CL_SETUSER) {
+   int err = reserve_user_mount();
+   if (err)
+   return ERR_PTR(err);
+   }
+   mnt = alloc_vfsmnt(old-mnt_devname);
if (!mnt)
-   return ERR_PTR(-ENOMEM);
+   goto alloc_failed;
 
mnt-mnt_flags = old-mnt_flags;
atomic_inc(sb-s_active);
@@ -274,7 +300,7 @@ static struct vfsmount *clone_mnt(struct
/* don't copy the MNT_USER flag */
mnt-mnt_flags = ~MNT_USER;
if (flag  CL_SETUSER)
-   set_mnt_user(mnt);
+   __set_mnt_user(mnt);
 
if (flag  CL_SLAVE) {
list_add(mnt-mnt_slave, old-mnt_slave_list);
@@ -299,6 +325,11 @@ static struct vfsmount *clone_mnt(struct
spin_unlock(vfsmount_lock);
}
return mnt;
+
+ alloc_failed:
+   if (flag  CL_SETUSER)
+   dec_nr_user_mounts();
+   return ERR_PTR(-ENOMEM);
 }
 
 static inline void __mntput(struct vfsmount *mnt)
@@ -746,22 +777,35 @@ asmlinkage long sys_oldumount(char __use
 
 #endif
 
-static int mount_is_safe(struct nameidata *nd)
+/*
+ * Conditions for unprivileged mounts are:
+ * - user submounts are permitted under this mount
+ * - mountpoint is not a symlink or special file
+ * - mountpoint is absolutely writable by user
+ *  o if it's a sticky directory, it must be owned by the user
+ *  o it must not be an append-only file/directory
+ */
+static int mount_is_safe(struct nameidata *nd, int *flags)
 {
+   struct inode *inode = nd-dentry-d_inode;
+
if (capable(CAP_SYS_ADMIN))
return 0;
-   return -EPERM;
-#ifdef notyet
-   if (S_ISLNK(nd-dentry-d_inode-i_mode))
+
+   if (!(nd-mnt-mnt_flags  MNT_USERMNT))
return -EPERM;
-   if (nd-dentry-d_inode-i_mode  S_ISVTX) {
-   if (current-uid != nd-dentry-d_inode-i_uid)
-   return -EPERM;
-   }
-   if (vfs_permission(nd, MAY_WRITE))
+
+   if (!S_ISDIR(inode-i_mode)  !S_ISREG(inode-i_mode))
+   return -EPERM;
+
+   if ((inode-i_mode  S_ISVTX)  current-fsuid != inode-i_uid)
return -EPERM;
+
+   if (vfs_permission(nd, MAY_WRITE) || IS_APPEND(inode))
+   return -EPERM;
+
+   *flags |= MS_SETUSER;
return 0;
-#endif
 }
 
 static int lives_below_in_same_fs(struct dentry *d, struct dentry *dentry)
@@ -991,7 +1035,7 @@ static int do_loopback(struct nameidata 
int clone_flags;
struct nameidata old_nd;
struct vfsmount *mnt = NULL;
-   int err = mount_is_safe(nd);
+   int err = mount_is_safe(nd, flags);
if (err)
return err;
if (!old_name || !*old_name)

--
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch 04/10] allow per-mount flags to be set/cleared individually

2007-04-16 Thread Miklos Szeredi

With MS_REMOUNT, it is difficult to change mount flags individually,
without touching other mount flags or options, having to parse
/proc/mounts to get the old flag values and options.

It is also difficult to change a mount flag recursively, having to
walk the mount tree in userspace.

This patch solves this problem by generalizing do_change_type() so
that not only the propagation property can be changed, but mnt_flags
can be set/cleared individually.

From: Miklos Szeredi [EMAIL PROTECTED]

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---

Index: linux/fs/namespace.c
===
--- linux.orig/fs/namespace.c   2007-04-13 13:04:04.0 +0200
+++ linux/fs/namespace.c2007-04-13 13:04:29.0 +0200
@@ -955,19 +956,28 @@ out_unlock:
 /*
  * recursively change the type of the mountpoint.
  */
-static int do_change_type(struct nameidata *nd, int flag)
+static int do_change_mnt(struct nameidata *nd, int flag, int mnt_flags)
 {
struct vfsmount *m, *mnt = nd-mnt;
int recurse = flag  MS_REC;
-   int type = flag  ~MS_REC;
+   int type = flag  (MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE);
 
if (nd-dentry != nd-mnt-mnt_root)
return -EINVAL;
 
down_write(namespace_sem);
spin_lock(vfsmount_lock);
-   for (m = mnt; m; m = (recurse ? next_mnt(m, mnt) : NULL))
-   change_mnt_propagation(m, type);
+   for (m = mnt; m; m = (recurse ? next_mnt(m, mnt) : NULL)) {
+   if (type)
+   change_mnt_propagation(m, type);
+
+   if (flag  (MS_SETFLAGS | MS_CLEARFLAGS))
+   m-mnt_flags = mnt_flags;
+   else if (flag  MS_SETFLAGS)
+   m-mnt_flags |= mnt_flags;
+   else if (flag  MS_CLEARFLAGS)
+   m-mnt_flags = ~mnt_flags;
+   }
spin_unlock(vfsmount_lock);
up_write(namespace_sem);
return 0;
@@ -1514,8 +1526,9 @@ long do_mount(char *dev_name, char *dir_
data_page);
else if (flags  MS_BIND)
retval = do_loopback(nd, dev_name, flags);
-   else if (flags  (MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE))
-   retval = do_change_type(nd, flags);
+   else if (flags  (MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE |
+ MS_SETFLAGS | MS_CLEARFLAGS))
+   retval = do_change_mnt(nd, flags, mnt_flags);
else if (flags  MS_MOVE)
retval = do_move_mount(nd, dev_name);
else
Index: linux/include/linux/fs.h
===
--- linux.orig/include/linux/fs.h   2007-04-13 13:04:04.0 +0200
+++ linux/include/linux/fs.h2007-04-13 13:04:29.0 +0200
@@ -127,6 +127,9 @@ extern int dir_notify_enable;
 #define MS_SHARED  (120) /* change to shared */
 #define MS_RELATIME(121) /* Update atime relative to mtime/ctime. */
 #define MS_SETUSER (122) /* set mnt_uid to current user */
+#define MS_SETFLAGS(123) /* set specified mount flags */
+#define MS_CLEARFLAGS  (124) /* clear specified mount flags */
+/* MS_SETFLAGS | MS_CLEARFLAGS: change mount flags to specified */
 #define MS_ACTIVE  (130)
 #define MS_NOUSER  (131)
 

--
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch 00/10] mount ownership and unprivileged mount syscall (v3)

2007-04-16 Thread Miklos Szeredi

This patchset adds support for keeping mount ownership information in
the kernel, and allow unprivileged mount(2) and umount(2) in certain
cases.

This can be useful for the following reasons:

- mount(8) can store ownership (user=XY option) in the kernel
  instead, or in addition to storing it in /etc/mtab.  For example if
  private namespaces are used with mount propagations /etc/mtab
  becomes unworkable, but using /proc/mounts works fine

- fuse won't need a special suid-root mount/umount utility.  Plain
  umount(8) can easily be made to work with unprivileged fuse mounts

- users can use bind mounts without having to pre-configure them in
  /etc/fstab

The following security measures are taken for unprivileged mounts:

 - only allow submounting under mounts which have a special mount flag set
 - only allow mounting on files/directories writable by the user
 - limit the number of user mounts
 - force nosuid,nodev mount options

Changes from the previous submissions:

 - add mount flags to set/clear mnt_flags individually
 - add usermnt mount flag.  If it is set, then allow unprivileged
   submounts under this mount
 - make max number of user mounts default to 1024, since now the
   usermnt flag will prevent user mounts by default

--
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch 03/10] account user mounts

2007-04-16 Thread Miklos Szeredi

From: Miklos Szeredi [EMAIL PROTECTED]

Add sysctl variables for accounting and limiting the number of user
mounts.

The maximum number of user mounts is set to 1024 by default.  This
won't in itself enable user mounts, setting the permit user submount
mount flag will also be needed.

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---

Index: linux/include/linux/sysctl.h
===
--- linux.orig/include/linux/sysctl.h   2007-04-11 18:27:46.0 +0200
+++ linux/include/linux/sysctl.h2007-04-11 20:08:16.0 +0200
@@ -818,6 +818,8 @@ enum
FS_AIO_NR=18,   /* current system-wide number of aio requests */
FS_AIO_MAX_NR=19,   /* system-wide maximum number of aio requests */
FS_INOTIFY=20,  /* inotify submenu */
+   FS_NR_USER_MOUNTS=21,   /* int:current number of user mounts */
+   FS_MAX_USER_MOUNTS=22,  /* int:maximum number of user mounts */
FS_OCFS2=988,   /* ocfs2 */
 };
 
Index: linux/kernel/sysctl.c
===
--- linux.orig/kernel/sysctl.c  2007-04-11 18:27:46.0 +0200
+++ linux/kernel/sysctl.c   2007-04-11 20:08:16.0 +0200
@@ -1063,6 +1063,22 @@ static ctl_table fs_table[] = {
 #endif 
 #endif
{
+   .ctl_name   = FS_NR_USER_MOUNTS,
+   .procname   = nr_user_mounts,
+   .data   = nr_user_mounts,
+   .maxlen = sizeof(int),
+   .mode   = 0444,
+   .proc_handler   = proc_dointvec,
+   },
+   {
+   .ctl_name   = FS_MAX_USER_MOUNTS,
+   .procname   = max_user_mounts,
+   .data   = max_user_mounts,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   },
+   {
.ctl_name   = KERN_SETUID_DUMPABLE,
.procname   = suid_dumpable,
.data   = suid_dumpable,
Index: linux/Documentation/filesystems/proc.txt
===
--- linux.orig/Documentation/filesystems/proc.txt   2007-04-11 
18:27:44.0 +0200
+++ linux/Documentation/filesystems/proc.txt2007-04-12 13:32:14.0 
+0200
@@ -923,6 +923,15 @@ reaches aio-max-nr then io_setup will fa
 raising aio-max-nr does not result in the pre-allocation or re-sizing
 of any kernel data structures.
 
+nr_user_mounts and max_user_mounts
+--
+
+These represent the number of user mounts and the maximum number of
+user mounts respectively.  User mounts may be created by
+unprivileged users.  User mounts may also be created with sysadmin
+privileges on behalf of a user, in which case nr_user_mounts may
+exceed max_user_mounts.
+
 2.2 /proc/sys/fs/binfmt_misc - Miscellaneous binary formats
 ---
 
Index: linux/fs/namespace.c
===
--- linux.orig/fs/namespace.c   2007-04-11 20:08:05.0 +0200
+++ linux/fs/namespace.c2007-04-12 13:29:05.0 +0200
@@ -39,6 +39,9 @@ static int hash_mask __read_mostly, hash
 static struct kmem_cache *mnt_cache __read_mostly;
 static struct rw_semaphore namespace_sem;
 
+int nr_user_mounts;
+int max_user_mounts = 1024;
+
 /* /sys/fs */
 decl_subsys(fs, NULL, NULL);
 EXPORT_SYMBOL_GPL(fs_subsys);
@@ -227,11 +230,30 @@ static struct vfsmount *skip_mnt_tree(st
return p;
 }
 
+static void dec_nr_user_mounts(void)
+{
+   spin_lock(vfsmount_lock);
+   nr_user_mounts--;
+   spin_unlock(vfsmount_lock);
+}
+
 static void set_mnt_user(struct vfsmount *mnt)
 {
BUG_ON(mnt-mnt_flags  MNT_USER);
mnt-mnt_uid = current-uid;
mnt-mnt_flags |= MNT_USER;
+   spin_lock(vfsmount_lock);
+   nr_user_mounts++;
+   spin_unlock(vfsmount_lock);
+}
+
+static void clear_mnt_user(struct vfsmount *mnt)
+{
+   if (mnt-mnt_flags  MNT_USER) {
+   mnt-mnt_uid = 0;
+   mnt-mnt_flags = ~MNT_USER;
+   dec_nr_user_mounts();
+   }
 }
 
 static struct vfsmount *clone_mnt(struct vfsmount *old, struct dentry *root,
@@ -283,6 +305,7 @@ static inline void __mntput(struct vfsmo
 {
struct super_block *sb = mnt-mnt_sb;
dput(mnt-mnt_root);
+   clear_mnt_user(mnt);
free_vfsmnt(mnt);
deactivate_super(sb);
 }
@@ -1023,6 +1046,7 @@ static int do_remount(struct nameidata *
down_write(sb-s_umount);
err = do_remount_sb(sb, flags, data, 0);
if (!err) {
+   clear_mnt_user(nd-mnt);
nd-mnt-mnt_flags = mnt_flags;
if (flags  MS_SETUSER)
set_mnt_user(nd-mnt);
Index: linux/include/linux/fs.h

[patch 08/10] put declaration of put_filesystem() in fs.h

2007-04-16 Thread Miklos Szeredi

From: Miklos Szeredi [EMAIL PROTECTED]

Declarations go into headers.

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---

Index: linux/fs/super.c
===
--- linux.orig/fs/super.c   2007-04-13 12:26:11.0 +0200
+++ linux/fs/super.c2007-04-13 13:25:40.0 +0200
@@ -40,10 +40,6 @@
 #include asm/uaccess.h
 
 
-void get_filesystem(struct file_system_type *fs);
-void put_filesystem(struct file_system_type *fs);
-struct file_system_type *get_fs_type(const char *name);
-
 LIST_HEAD(super_blocks);
 DEFINE_SPINLOCK(sb_lock);
 
Index: linux/include/linux/fs.h
===
--- linux.orig/include/linux/fs.h   2007-04-13 13:23:22.0 +0200
+++ linux/include/linux/fs.h2007-04-13 13:25:40.0 +0200
@@ -1922,6 +1922,8 @@ extern int vfs_fstat(unsigned int, struc
 
 extern int vfs_ioctl(struct file *, unsigned int, unsigned int, unsigned long);
 
+extern void get_filesystem(struct file_system_type *fs);
+extern void put_filesystem(struct file_system_type *fs);
 extern struct file_system_type *get_fs_type(const char *name);
 extern struct super_block *get_super(struct block_device *);
 extern struct super_block *user_get_super(dev_t);

--
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch 06/10] propagate error values from clone_mnt

2007-04-16 Thread Miklos Szeredi

From: Miklos Szeredi [EMAIL PROTECTED]

Allow clone_mnt() to return errors other than ENOMEM.  This will be
used for returning a different error value when the number of user
mounts goes over the limit.

Fix copy_tree() to return EPERM for unbindable mounts.

Don't propagate further from dup_mnt_ns() as that copy_tree() can only
fail with -ENOMEM.

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---

Index: linux/fs/namespace.c
===
--- linux.orig/fs/namespace.c   2007-04-13 13:22:17.0 +0200
+++ linux/fs/namespace.c2007-04-13 13:25:35.0 +0200
@@ -261,42 +261,42 @@ static struct vfsmount *clone_mnt(struct
 {
struct super_block *sb = old-mnt_sb;
struct vfsmount *mnt = alloc_vfsmnt(old-mnt_devname);
+   if (!mnt)
+   return ERR_PTR(-ENOMEM);
 
-   if (mnt) {
-   mnt-mnt_flags = old-mnt_flags;
-   atomic_inc(sb-s_active);
-   mnt-mnt_sb = sb;
-   mnt-mnt_root = dget(root);
-   mnt-mnt_mountpoint = mnt-mnt_root;
-   mnt-mnt_parent = mnt;
-
-   /* don't copy the MNT_USER flag */
-   mnt-mnt_flags = ~MNT_USER;
-   if (flag  CL_SETUSER)
-   set_mnt_user(mnt);
-
-   if (flag  CL_SLAVE) {
-   list_add(mnt-mnt_slave, old-mnt_slave_list);
-   mnt-mnt_master = old;
-   CLEAR_MNT_SHARED(mnt);
-   } else {
-   if ((flag  CL_PROPAGATION) || IS_MNT_SHARED(old))
-   list_add(mnt-mnt_share, old-mnt_share);
-   if (IS_MNT_SLAVE(old))
-   list_add(mnt-mnt_slave, old-mnt_slave);
-   mnt-mnt_master = old-mnt_master;
-   }
-   if (flag  CL_MAKE_SHARED)
-   set_mnt_shared(mnt);
+   mnt-mnt_flags = old-mnt_flags;
+   atomic_inc(sb-s_active);
+   mnt-mnt_sb = sb;
+   mnt-mnt_root = dget(root);
+   mnt-mnt_mountpoint = mnt-mnt_root;
+   mnt-mnt_parent = mnt;
+
+   /* don't copy the MNT_USER flag */
+   mnt-mnt_flags = ~MNT_USER;
+   if (flag  CL_SETUSER)
+   set_mnt_user(mnt);
 
-   /* stick the duplicate mount on the same expiry list
-* as the original if that was on one */
-   if (flag  CL_EXPIRE) {
-   spin_lock(vfsmount_lock);
-   if (!list_empty(old-mnt_expire))
-   list_add(mnt-mnt_expire, old-mnt_expire);
-   spin_unlock(vfsmount_lock);
-   }
+   if (flag  CL_SLAVE) {
+   list_add(mnt-mnt_slave, old-mnt_slave_list);
+   mnt-mnt_master = old;
+   CLEAR_MNT_SHARED(mnt);
+   } else {
+   if ((flag  CL_PROPAGATION) || IS_MNT_SHARED(old))
+   list_add(mnt-mnt_share, old-mnt_share);
+   if (IS_MNT_SLAVE(old))
+   list_add(mnt-mnt_slave, old-mnt_slave);
+   mnt-mnt_master = old-mnt_master;
+   }
+   if (flag  CL_MAKE_SHARED)
+   set_mnt_shared(mnt);
+
+   /* stick the duplicate mount on the same expiry list
+* as the original if that was on one */
+   if (flag  CL_EXPIRE) {
+   spin_lock(vfsmount_lock);
+   if (!list_empty(old-mnt_expire))
+   list_add(mnt-mnt_expire, old-mnt_expire);
+   spin_unlock(vfsmount_lock);
}
return mnt;
 }
@@ -782,11 +782,11 @@ struct vfsmount *copy_tree(struct vfsmou
struct nameidata nd;
 
if (!(flag  CL_COPY_ALL)  IS_MNT_UNBINDABLE(mnt))
-   return NULL;
+   return ERR_PTR(-EPERM);
 
res = q = clone_mnt(mnt, dentry, flag);
-   if (!q)
-   goto Enomem;
+   if (IS_ERR(q))
+   goto error;
q-mnt_mountpoint = mnt-mnt_mountpoint;
 
p = mnt;
@@ -807,8 +807,8 @@ struct vfsmount *copy_tree(struct vfsmou
nd.mnt = q;
nd.dentry = p-mnt_mountpoint;
q = clone_mnt(p, p-mnt_root, flag);
-   if (!q)
-   goto Enomem;
+   if (IS_ERR(q))
+   goto error;
spin_lock(vfsmount_lock);
list_add_tail(q-mnt_list, res-mnt_list);
attach_mnt(q, nd);
@@ -816,7 +816,7 @@ struct vfsmount *copy_tree(struct vfsmou
}
}
return res;
-Enomem:
+ error:
if (res) {
LIST_HEAD(umount_list);
spin_lock(vfsmount_lock);
@@ -824,7 +824,7 @@ Enomem:
spin_unlock(vfsmount_lock);
release_mounts(umount_list);
}
-   return

[patch 10/10] allow unprivileged fuse mounts

2007-04-16 Thread Miklos Szeredi

From: Miklos Szeredi [EMAIL PROTECTED]

Use FS_SAFE for fuse fs type, but not for fuseblk.

FUSE was designed from the beginning to be safe for unprivileged
users.  This has also been verified in practice over many years.  In
addition unprivileged fuse mounts require the usermnt mount option
to be set on the parent mount, which is more strict than the current
userspace policy.

This will enable future installations to remove the suid-root
fusermount utility.

Don't require the user_id= and group_id= options for unprivileged
mounts, but if they are present, verify them for sanity.

Disallow the allow_other option for unprivileged mounts.

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---

Index: linux/fs/fuse/inode.c
===
--- linux.orig/fs/fuse/inode.c  2007-04-13 14:20:23.0 +0200
+++ linux/fs/fuse/inode.c   2007-04-13 14:20:27.0 +0200
@@ -311,6 +311,19 @@ static int parse_fuse_opt(char *opt, str
d-max_read = ~0;
d-blksize = 512;
 
+   /*
+* For unprivileged mounts use current uid/gid.  Still allow
+* user_id and group_id options for compatibility, but
+* only if they match these values.
+*/
+   if (!capable(CAP_SYS_ADMIN)) {
+   d-user_id = current-uid;
+   d-user_id_present = 1;
+   d-group_id = current-gid;
+   d-group_id_present = 1;
+
+   }
+
while ((p = strsep(opt, ,)) != NULL) {
int token;
int value;
@@ -339,6 +352,8 @@ static int parse_fuse_opt(char *opt, str
case OPT_USER_ID:
if (match_int(args[0], value))
return 0;
+   if (d-user_id_present  d-user_id != value)
+   return 0;
d-user_id = value;
d-user_id_present = 1;
break;
@@ -346,6 +361,8 @@ static int parse_fuse_opt(char *opt, str
case OPT_GROUP_ID:
if (match_int(args[0], value))
return 0;
+   if (d-group_id_present  d-group_id != value)
+   return 0;
d-group_id = value;
d-group_id_present = 1;
break;
@@ -536,6 +553,10 @@ static int fuse_fill_super(struct super_
if (!parse_fuse_opt((char *) data, d, is_bdev))
return -EINVAL;
 
+   /* This is a privileged option */
+   if ((d.flags  FUSE_ALLOW_OTHER)  !capable(CAP_SYS_ADMIN))
+   return -EPERM;
+
if (is_bdev) {
 #ifdef CONFIG_BLOCK
if (!sb_set_blocksize(sb, d.blksize))
@@ -639,6 +660,7 @@ static struct file_system_type fuse_fs_t
.fs_flags   = FS_HAS_SUBTYPE,
.get_sb = fuse_get_sb,
.kill_sb= kill_anon_super,
+   .fs_flags   = FS_SAFE,
 };
 
 #ifdef CONFIG_BLOCK

--
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch 05/10] Add permit user submounts flag to vfsmount

2007-04-16 Thread Miklos Szeredi

From: Miklos Szeredi [EMAIL PROTECTED]

If MNT_USERMNT flag is not set in the target vfsmount, then
unprivileged mounts will be denied.

By default this flag is cleared, and can be set on new mounts, on
remounts or with the MS_SETFLAGS option.

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---

Index: linux/fs/namespace.c
===
--- linux.orig/fs/namespace.c   2007-04-13 13:20:12.0 +0200
+++ linux/fs/namespace.c2007-04-13 13:35:40.0 +0200
@@ -411,6 +411,7 @@ static int show_vfsmnt(struct seq_file *
{ MNT_NOATIME, ,noatime },
{ MNT_NODIRATIME, ,nodiratime },
{ MNT_RELATIME, ,relatime },
+   { MNT_USERMNT, ,usermnt },
{ 0, NULL }
};
struct proc_fs_info *fs_infop;
@@ -1505,9 +1506,11 @@ long do_mount(char *dev_name, char *dir_
mnt_flags |= MNT_NODIRATIME;
if (flags  MS_RELATIME)
mnt_flags |= MNT_RELATIME;
+   if (flags  MS_USERMNT)
+   mnt_flags |= MNT_USERMNT;
 
flags = ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE |
-  MS_NOATIME | MS_NODIRATIME | MS_RELATIME);
+  MS_NOATIME | MS_NODIRATIME | MS_RELATIME | MS_USERMNT);
 
/* ... and get the mountpoint */
retval = path_lookup(dir_name, LOOKUP_FOLLOW, nd);
Index: linux/include/linux/mount.h
===
--- linux.orig/include/linux/mount.h2007-04-13 13:17:08.0 +0200
+++ linux/include/linux/mount.h 2007-04-13 13:22:17.0 +0200
@@ -28,6 +28,7 @@ struct mnt_namespace;
 #define MNT_NOATIME0x08
 #define MNT_NODIRATIME 0x10
 #define MNT_RELATIME   0x20
+#define MNT_USERMNT0x40
 
 #define MNT_SHRINKABLE 0x100
 #define MNT_USER   0x200
Index: linux/include/linux/fs.h
===
--- linux.orig/include/linux/fs.h   2007-04-13 13:23:05.0 +0200
+++ linux/include/linux/fs.h2007-04-13 13:35:34.0 +0200
@@ -130,6 +130,7 @@ extern int dir_notify_enable;
 #define MS_SETFLAGS(123) /* set specified mount flags */
 #define MS_CLEARFLAGS  (124) /* clear specified mount flags */
 /* MS_SETFLAGS | MS_CLEARFLAGS: change mount flags to specified */
+#define MS_USERMNT (125) /* permit unpriv. submounts under this mount */
 #define MS_ACTIVE  (130)
 #define MS_NOUSER  (131)
 

--
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch 09/10] allow unprivileged mounts

2007-04-16 Thread Miklos Szeredi

From: Miklos Szeredi [EMAIL PROTECTED]

Define a new fs flag FS_SAFE, which denotes, that unprivileged
mounting of this filesystem may not constitute a security problem.

Since most filesystems haven't been designed with unprivileged
mounting in mind, a thorough audit is needed before setting this flag.

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---

Index: linux/fs/namespace.c
===
--- linux.orig/fs/namespace.c   2007-04-13 13:35:29.0 +0200
+++ linux/fs/namespace.c2007-04-13 13:35:30.0 +0200
@@ -785,7 +785,8 @@ asmlinkage long sys_oldumount(char __use
  *  o if it's a sticky directory, it must be owned by the user
  *  o it must not be an append-only file/directory
  */
-static int mount_is_safe(struct nameidata *nd, int *flags)
+static int mount_is_safe(struct nameidata *nd, struct file_system_type *type,
+int *flags)
 {
struct inode *inode = nd-dentry-d_inode;
 
@@ -795,6 +796,9 @@ static int mount_is_safe(struct nameidat
if (!(nd-mnt-mnt_flags  MNT_USERMNT))
return -EPERM;
 
+   if (type  !(type-fs_flags  FS_SAFE))
+   return -EPERM;
+
if (!S_ISDIR(inode-i_mode)  !S_ISREG(inode-i_mode))
return -EPERM;
 
@@ -1035,7 +1039,7 @@ static int do_loopback(struct nameidata 
int clone_flags;
struct nameidata old_nd;
struct vfsmount *mnt = NULL;
-   int err = mount_is_safe(nd, flags);
+   int err = mount_is_safe(nd, NULL, flags);
if (err)
return err;
if (!old_name || !*old_name)
@@ -1197,26 +1201,46 @@ out:
  * create a new mount for userspace and request it to be added into the
  * namespace's tree
  */
-static int do_new_mount(struct nameidata *nd, char *type, int flags,
+static int do_new_mount(struct nameidata *nd, char *fstype, int flags,
int mnt_flags, char *name, void *data)
 {
+   int err;
struct vfsmount *mnt;
+   struct file_system_type *type;
 
-   if (!type || !memchr(type, 0, PAGE_SIZE))
+   if (!fstype || !memchr(fstype, 0, PAGE_SIZE))
return -EINVAL;
 
-   /* we need capabilities... */
-   if (!capable(CAP_SYS_ADMIN))
-   return -EPERM;
+   type = get_fs_type(fstype);
+   if (!type)
+   return -ENODEV;
 
-   mnt = do_kern_mount(type, flags  ~MS_SETUSER, name, data);
-   if (IS_ERR(mnt))
+   err = mount_is_safe(nd, type, flags);
+   if (err)
+   goto out_put_filesystem;
+
+   if (flags  MS_SETUSER) {
+   err = reserve_user_mount();
+   if (err)
+   goto out_put_filesystem;
+   }
+
+   mnt = vfs_kern_mount(type, flags  ~MS_SETUSER, name, data);
+   put_filesystem(type);
+   if (IS_ERR(mnt)) {
+   if (flags  MS_SETUSER)
+   dec_nr_user_mounts();
return PTR_ERR(mnt);
+   }
 
if (flags  MS_SETUSER)
-   set_mnt_user(mnt);
+   __set_mnt_user(mnt);
 
return do_add_mount(mnt, nd, mnt_flags, NULL);
+
+ out_put_filesystem:
+   put_filesystem(type);
+   return err;
 }
 
 /*
@@ -1246,7 +1270,7 @@ int do_add_mount(struct vfsmount *newmnt
if (S_ISLNK(newmnt-mnt_root-d_inode-i_mode))
goto unlock;
 
-   /* MNT_USER was set earlier */
+   /* some flags may have been set earlier */
newmnt-mnt_flags |= mnt_flags;
if ((err = graft_tree(newmnt, nd)))
goto unlock;
Index: linux/include/linux/fs.h
===
--- linux.orig/include/linux/fs.h   2007-04-13 13:35:29.0 +0200
+++ linux/include/linux/fs.h2007-04-13 13:35:30.0 +0200
@@ -96,6 +96,7 @@ extern int dir_notify_enable;
 #define FS_REQUIRES_DEV 1 
 #define FS_BINARY_MOUNTDATA 2
 #define FS_HAS_SUBTYPE 4
+#define FS_SAFE 8  /* Safe to mount by unprivileged users */
 #define FS_REVAL_DOT   16384   /* Check the paths ., .. for staleness */
 #define FS_RENAME_DOES_D_MOVE  32768   /* FS will handle d_move()
 * during rename() internally.

--
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 0/8] unprivileged mount syscall

2007-04-16 Thread Miklos Szeredi

  Arn't there ways to escape chroot jails? Serge had pointed me to a URL
  which showed chroots can be escaped. And if that is true than having all
  user's private mount tree in the same namespace can be a security issue?
 
  No.  In fact chrooting the user into /share/$USER will actually
  _grant_ a privilege to the user, instead of taking it away.  It allows
  the user to modify it's root namespace, which it wouldn't be able to
  in the initial namespace.
 
  So even if the user could escape from the chroot (which I doubt), s/he
  would not be able to do any harm, since unprivileged mounting would be
  restricted to /share.  Also /share/$USER should only have read/search
  permission for $USER or no permissions at all, which would mean, that
  other users' namespaces would be safe from tampering as well.
 
 A couple of points.
 - chroot can be escaped, it is just a chdir for the root directory
 it is not a security feature.  The only security is that you have to
 be root to call chdir.  A carefully done namespace setup won't have
 that issue.
 
 - While it may not violate security as far as what a user is allowed
 to modify it may violate security as far as what a user is allowed
 to see.

I think that's just up to the permissions in the global namespace.  In
this example if you 'chmod 0 /share' there won't be anything for the
user to see.

 There are interesting per login cases as well such as allowing a
 user to replicate their mount tree from another machine when they
 log in.  When /home is on a network filesystem this can be very
 practical and can allow propagation of mounts across machines not
 just across a single login session.

Yeah, sounds interesting, but I think it's better to get the basics
working first, and then we can start to think about the extras.

Btw, there's nothing that prevents cloning the namespace _after_
chrooting into the per-user tree.  That would still be simpler than
doing it the other way round: first creating per-session namespaces
and then setting up mount propagation between them.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag

2007-04-16 Thread Miklos Szeredi

  That depends.  Current patches check the unprivileged submounts
  allowed under this mount flag only on the requested mount and not on
  the propagated mounts.  Do you see a problem with this?
 
 I think privileges of this sort should propagate.  If I read what you
 just said correctly if I have a private mount namespace I won't be able
 to mount anything unless when it was setup the unprivileged submount
 command was explicitly set.

By design yes.  Why is that a problem?

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag

2007-04-16 Thread Miklos Szeredi

 Also for bind-mount and remount operations the flag has to be 
 propagated
 down its propagation tree.  Otherwise a unpriviledged mount in a 
 shared
 mount wont get reflected in its peers and slaves, leading to 
 unidentical
 shared-subtrees.

That's an interesting question.  Do we want shared mounts to be
totally identical, including mnt_flags?  It doesn't look as if
do_remount() guarantees that currently.
   
   Depends on the semantics of each of the flags. Some flags like of the
   read/write flag, would not interfere with the propagation semantics
   AFAICT.  But this one certainly seems to interfere.
  
  That depends.  Current patches check the unprivileged submounts
  allowed under this mount flag only on the requested mount and not on
  the propagated mounts.  Do you see a problem with this?
 
 Don't see a problem if the flag is propagated to all peers and slave
 mounts. 
 
 If not, I see a problem. What if the propagated mount has its flag set
 to not do un-priviledged mounts, whereas the requested mount has it
 allowed?

Then the mount is allowed.

It is up to the sysadmin/distro to design set up the propagations in a
way that this is not a problem.

I think it would be much less clear conceptually, if unprivileged
mounting would have to check propagations as well.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ZFS with Linux: An Open Plea

2007-04-17 Thread Miklos Szeredi

  FUSE is nice for trying out new and interresting ideas in userspace -
  it has its uses.
 
 Yes, but it is not really for the end-user. To paraphrase another, it is  
 mostly academic.

Oh?  I thought those ~10,000 downloads of SSHFS and ~200,000 downloads
of NTFS-3G were end users.(*)

Maybe I was wrong though.  Thanks for the clarification.

Miklos

(*) Numbers obviously don't include packaged installations
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 05/10] Add permit user submounts flag to vfsmount

2007-04-17 Thread Miklos Szeredi

  From: Miklos Szeredi [EMAIL PROTECTED]
  
  If MNT_USERMNT flag is not set in the target vfsmount, then
 
 MNT_USER and MNT_USERMNT?  I claim no way will people keep those
 straight.  How about MNT_ALLOWUSER and MNT_USER?

Umm, is allowuser more clear than usermnt?  What is allowed to the
user?  allowusermnt may be more descriptive, but it's a bit too
long.

I don't think it matters all that much, the user will have to look up
the semantics in the manpage anyway.  Is nosuid descriptive?  Not
very much, but we got used to it.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag

2007-04-17 Thread Miklos Szeredi

 Interesting
 
 So far even today these things can happen, however they are sufficiently
 unlikely the tools don't account for them.
 
 Once a hostile user can cause them things are more of a problem.
 
  (Unless you want to tackle each problem legacy tool one at a time to
  remove problems - i.e. deluser should umount everything under
  /home/hallyn before deleting, backup should be spawned from it's own
  namespace cloned right after boot or just back up on one filesystem,
  etc.)
 
 I don't see a way that backup and deluser won't need to be modified
 to work properly in a system where non-priveleged mounts are allowed,
 at least they will need to account for /share.
 
 That said it is clearly a hazard if we enable this functionality by
 default.
 
 If we setup a pam module that triggers on login and perhaps when
 cron and at jobs run to setup an additional mount namespace I think
 keeping applications locked away in their own mount namespace is
 sufficient to avoid hostile users from doing unexpected things to
 the initial mount namespace.  So unless I am mistake it should be
 relatively simple to prevent user space from encountering problems.
 
 That still leaves the question of how we handle systems with an old
 user space that is insufficiently robust to deal with mounts occurring
 at unexpected locations.
 
 
   I think a simple sysctl to enable/disable of non-priveleged mounts 
   defaulting to disabled is enough.
 
 Am I correct or will it be more difficult than just a little pam
 module to ensure non-trusted users never run in the initial mount
 namespace?

I'm still not sure, what your problem is.

With the v3 of the usermounts patchset, by default, user mounts are
disabled, because the allow unpriv submounts flag is cleared on all
mounts.

There are several options available to sysadmins and distro builders
to enable user mounts in a secure way:

  - pam module, which creates a private namespace, and sets allow
unpriv submounts on the mounts within the namespace

  - pam module, which rbinds / onto /mnt/ns/$USER, and chroots into
/mnt/ns/$USER, then sets the allow unpriv submounts on the
mounts under /mnt/ns/$USER.

  - sysadmin creates /mnt/usermounts writable to all users, with
sticky bit (same as /tmp), does mount --bind /mnt/usermounts
/mnt/usermounts and sets the allow unpriv submounts on
/mnt/usermounts.

All of these are perfectly safe wrt userdel and backup (assuming it
doesn't try back up /mnt).

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 05/10] Add permit user submounts flag to vfsmount

2007-04-17 Thread Miklos Szeredi

   MNT_USER and MNT_USERMNT?  I claim no way will people keep those
   straight.  How about MNT_ALLOWUSER and MNT_USER?
  
  Umm, is allowuser more clear than usermnt?  What is allowed to the
 
 I think so, yes.  One makes it clear that we're talking about allowing
 user (somethings :), one might just as well mean this is a user mount.
 
  user?  allowusermnt may be more descriptive, but it's a bit too
  long.
 
 Yes, if it weren't too long it would by far have been my preference.
 Maybe despite the length we should still go with it...
 
  I don't think it matters all that much, the user will have to look up
  the semantics in the manpage anyway.  Is nosuid descriptive?  Not
  very much, but we got used to it.
 
 nosuid is quite clear.

Is it?  Shouldn't these be allowsuid, noallowsuid, allowexec,
noallowexec?

See, we mentally add the allow quite easily.

 MNT_USER and MNT_USERMNT are so confusing that in the time I go from
 quitting the manpage to foregrounding my editor, I may have already
 forgotten which was which.

Well, to the user they are always in the form user=123 and
usermnt, so they are not as easy to confuse.

But I feel a bit stupid bickering about this, because it isn't so
important.  allowuser or allowusermnt are fine by me if you think
they are substantially better than usermnt.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag

2007-04-17 Thread Miklos Szeredi

 I'm a bit lost about what is currently done and who advocates for what.
 
 It seems to me the MNT_ALLOWUSERMNT (or whatever :) flag should be
 propagated.  In the /share rbind+chroot example, I assume the admin
 would start by doing
 
   mount --bind /share /share
   mount --make-slave /share
   mount --bind -o allow_user_mounts /share (or whatever)
   mount --make-shared /share
 
 then on login, pam does
 
   chroot /share/$USER
 
 or some sort of
 
   mount --bind /share /home/$USER/root
   chroot /home/$USER/root
 
 or whatever.  In any case, the user cannot make user mounts except under
 /share, and any cloned namespaces will still allow user mounts.

I don't quite understand your method.  This is how I think of it:

mount --make-rshared /
mkdir -p /mnt/ns/$USER
mount --rbind / /mnt/ns/$USER
mount --make-rslave /mnt/ns/$USER
mount --set-flags --recursive -oallowusermnt /mnt/ns/$USER
chroot /mnt/ns/$USER
su - $USER

I did actually try something equivalent (without the fancy mount
commands though), and it worked fine.  The only problem is the
proliferation of mounts in /proc/mounts.  There was a recently posted
patch in AppArmor, that at least hides unreachable mounts from
/proc/mounts, so the user wouldn't see all those.  But it could still
be pretty confusing to the sysadmin.

So in that sense doing it the complicated way, by first cloning the
namespace, and then copying and sharing mounts individually which need
to be shared could relieve this somewhat.

Another point: user mounts under /proc and /sys shouldn't be allowed.
There are files there (at least in /proc) that are seemingly writable
by the user, but they are still not writable in the sense, that
normal files are.

Anyway, there are lots of userspace policy issues, but those don't
impact the kernel part.

As for the original question of propagating the allowusermnt flag, I
think it doesn't matter, as long as it's consistent and documented.

Propagating some mount flags and not propagating others is
inconsistent and confusing, so I wouldn't want that.  Currently
remount doesn't propagate mount flags, that may be a bug, dunno.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag

2007-04-17 Thread Miklos Szeredi

  I'm still not sure, what your problem is.
 
 My problem right now is that I see a serious complexity escalation in
 the user interface that we must support indefinitely.
 
 I see us taking a nice powerful concept and seriously watering it down.
 To some extent we have to avoid confusing suid applications.  (I would
 so love to remove the SUID bit...).
 
 I'm being contrary to ensure we have a good code review.

OK.  And it's very much appreciated :)

 I have heard it said that there are two kinds of design.  Something
 so simple it obviously has no deficiencies.  Something so complex it has
 no obvious deficiencies.  I am very much afraid we are slipping the
 mount namespace into the latter category of work.  Which is a bad
 bad thing for core OS feature.

I've tried to make this unprivileged mount thing as simple as
possible, and no simpler.  If we can make it even simpler, all the
better.

  With the v3 of the usermounts patchset, by default, user mounts are
  disabled, because the allow unpriv submounts flag is cleared on all
  mounts.
 
  There are several options available to sysadmins and distro builders
  to enable user mounts in a secure way:
 
- pam module, which creates a private namespace, and sets allow
  unpriv submounts on the mounts within the namespace
 
- pam module, which rbinds / onto /mnt/ns/$USER, and chroots into
  /mnt/ns/$USER, then sets the allow unpriv submounts on the
  mounts under /mnt/ns/$USER.
 
 In part this really disturbs me because we now have two mechanisms for
 controlling the scope of what a user can do.

You mean rbind+chroot and clone(CLONE_NS)?  Yes, those are two
different mechanisms achieving very similar results.  But what has
this to do with unprivileged mounts?

 A flag or a new namespace.  Two mechanisms to accomplish the same
 thing sound wrong, and hard to manage.

The flag permitting the unprivileged mounts (which we now agreed to
name allowusermnt) is used in both cases.

Just creating a new namespace doesn't always imply that you want to
allow user mounts inside, does it?  These are orthogonal features.

- sysadmin creates /mnt/usermounts writable to all users, with
  sticky bit (same as /tmp), does mount --bind /mnt/usermounts
  /mnt/usermounts and sets the allow unpriv submounts on
  /mnt/usermounts.
 
  All of these are perfectly safe wrt userdel and backup (assuming it
  doesn't try back up /mnt).
 
 I also don't understand at all the user= mount flag and options.

The user=UID or (or MNT_USER flag) serves multiple purposes:

  - help mount(8) move away from /etc/mtab
  - allow unprivileged umounts
  - account user mounts

 All it seemed to be used for was adding permissions to unmount.  In user
 space to deal with the lack of any form of untrusted mounts I can understand
 this.  In kernel space this seems to be more of a problem.

Why is handling unprivileged mounts in kernel different from handling
them in userspace in this respect?

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag

2007-04-17 Thread Miklos Szeredi

  mount --make-rshared /
  mkdir -p /mnt/ns/$USER
  mount --rbind / /mnt/ns/$USER
  mount --make-rslave /mnt/ns/$USER
 
 This was my main point - that the tree in which users can mount will be
 a slave of /, so that propagating the are user mounts allowed flag
 among peers is safe and intuitive.

I think it's equally intuitive if flags are not propagated.  After
all, I may want to set the mount flag _only_ on this particular mount,
and not anything else.

This is something I wouln't be sure about either way without
consulting the man.

 True.  But the kernel functionality you provide enables both ways so no
 problem in either case :)
 
  Another point: user mounts under /proc and /sys shouldn't be allowed.
  There are files there (at least in /proc) that are seemingly writable
  by the user, but they are still not writable in the sense, that
  normal files are.
 
 Good point.
 
  Anyway, there are lots of userspace policy issues, but those don't
  impact the kernel part.
 
 Though it might make sense to enforce /proc and /sys not allowing user
 mounts under them in the kernel.

I think not, it would just be another policy decision having to be
made in the kernel on an fs by fs basis, and getting it wrong would be
much worse than getting it wrong in userspace.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag

2007-04-17 Thread Miklos Szeredi

   I'm a bit lost about what is currently done and who advocates for what.
   
   It seems to me the MNT_ALLOWUSERMNT (or whatever :) flag should be
   propagated.  In the /share rbind+chroot example, I assume the admin
   would start by doing
   
 mount --bind /share /share
 mount --make-slave /share
 mount --bind -o allow_user_mounts /share (or whatever)
 mount --make-shared /share
   
   then on login, pam does
   
 chroot /share/$USER
   
   or some sort of
   
 mount --bind /share /home/$USER/root
 chroot /home/$USER/root
   
   or whatever.  In any case, the user cannot make user mounts except under
   /share, and any cloned namespaces will still allow user mounts.
  
  I don't quite understand your method.  This is how I think of it:
  
  mount --make-rshared /
  mkdir -p /mnt/ns/$USER
  mount --rbind / /mnt/ns/$USER
  mount --make-rslave /mnt/ns/$USER
  mount --set-flags --recursive -oallowusermnt /mnt/ns/$USER
  chroot /mnt/ns/$USER
  su - $USER
  
  I did actually try something equivalent (without the fancy mount
  commands though), and it worked fine.  The only problem is the
  proliferation of mounts in /proc/mounts.  There was a recently posted
  patch in AppArmor, that at least hides unreachable mounts from
  /proc/mounts, so the user wouldn't see all those.  But it could still
  be pretty confusing to the sysadmin.
 
 unbindable mounts were designed to overcome the proliferation problem.
 
 Your steps should be something like this:
 
 mount --make-rshared /
 mkdir -p /mnt/ns
 mount --bind /mnt/ns /mnt/ns
 mount --make-unbindable /mnt/ns
 mkdir -p /mnt/ns/$USER
 mount --rbind / /mnt/ns/$USER
 mount --make-rslave /mnt/ns/$USER
 mount --set-flags --recursive -oallowusermnt /mnt/ns/$USER
 chroot /mnt/ns/$USER
 su - $USER
 
 try this and your proliferation problem will disappear. :-)

Right, this is needed.

My problem wasn't actually this (which would only have hit, if I tried
with more than one user), just that the number of mounts in
/proc/mounts grows linearly with the number of users.

That can't be helped in such an easy way unfortunately.

  Propagating some mount flags and not propagating others is
  inconsistent and confusing, so I wouldn't want that.  Currently
  remount doesn't propagate mount flags, that may be a bug, 
 
 For consistency reason, one can propagate all the flags. But
 propagating only those flags that interfere with shared-subtree
 semantics should suffice.

I still don't believe not propagating allowusermnt interferes with
mount propagation.  In my posted patches the mount (including
propagations) is allowed based on the allowusermnt flag on the
parent of the requested mount.  The flag is _not_ checked during
propagation.

Allowing this and other flags to NOT be propagated just makes it
possible to have a set of shared mounts with asymmetric properties,
which may actually be desirable.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag

2007-04-18 Thread Miklos Szeredi

  I've tried to make this unprivileged mount thing as simple as
  possible, and no simpler.  If we can make it even simpler, all the
  better.
 
 We are certainly much more complex then the code in plan9 (just
 read through it) so I think we have room for improvement.
 
 Just for reference what I saw in plan 9 was:
 - No super user checks in it's mount, unmount, or namespace creation paths.
 - A flag to deny new mounts but not new bind mounts (for administrative 
 purposes
   the comment said).
 
 Our differences from plan9.
 - suid capable binaries. (SUID please go away).
 - A history of programs assuming only root could call mount/unmount.

I hate suid as well.  _The_ motivation behind this patchset was to get
rid of fusermount, a suid mount helper for fuse.

But I don't think suid is going away, and definitely not overnight.
Also I don't think we want to require auditing userspace before
enabling user mounts.

If I understand correctly, your proposal is to get rid of MNT_USER and
MNT_ALLOWUSERMNT and allow/deny unprivileged mounts and umounts based
on a boolean sysctl flag and on a check if the target namespace is the
initial namespace or not.  And maybe add some extra checks which
prevent ugliness from happening with suid programs.  Is this correct?

If so, how are we going to make sure this won't break existing
userspace without doing a full audit of all suid programs in every
distro that wants this feature?

Also how are we going to prevent the user from creating millions of
mounts, and using up all the kernel memory for vfsmounts?

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag

2007-04-18 Thread Miklos Szeredi

  Allowing this and other flags to NOT be propagated just makes it
  possible to have a set of shared mounts with asymmetric properties,
  which may actually be desirable.
 
 The shared mount feature was designed to ensure that the mount remained
 identical at all the locations.

OK, so remount not propagating mount flags is a bug then?

 Now designing features to make it un-identical but still naming it
 shared, will break its original purpose.  Slave mounts were designed
 to make it asymmetric.

What if I want to modify flags in a master mount, but not the slave
mount?  Would I be screwed?  For example: mount is read-only in both
master and slave.  I want to mark it read-write in master but not in
slave.  What do I do?

 Whatever feature that is desired to be exploited; can that be exploited
 with the current set of semantics that we have? Is there a real need to
 make the mounts asymmetric but at the same time name them as shared?
 Maybe I dont understand what the desired application is? 

I do think this question of propagating mount flags is totally
independent of user mounts.

As it stands, currently remount doesn't propagate mount flags, and I
don't see any compelling reasons why it should.

The patchset introduces a new mount flag allowusermnt, but I don't
see any compelling reason to propagate this flag _either_.

Please say so if you do have such a reason.  As I've explained, having
this flag set differently in parts of a propagation tree does not
interfere with or break propagation in any way.

Miklos


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag

2007-04-18 Thread Miklos Szeredi

I've tried to make this unprivileged mount thing as simple as
possible, and no simpler.  If we can make it even simpler, all the
better.
   
   We are certainly much more complex then the code in plan9 (just
   read through it) so I think we have room for improvement.
   
   Just for reference what I saw in plan 9 was:
   - No super user checks in it's mount, unmount, or namespace creation 
   paths.
   - A flag to deny new mounts but not new bind mounts (for administrative 
   purposes
 the comment said).
   
   Our differences from plan9.
   - suid capable binaries. (SUID please go away).
   - A history of programs assuming only root could call mount/unmount.
  
  I hate suid as well.  _The_ motivation behind this patchset was to get
  rid of fusermount, a suid mount helper for fuse.
  
  But I don't think suid is going away, and definitely not overnight.
  Also I don't think we want to require auditing userspace before
  enabling user mounts.
  
  If I understand correctly, your proposal is to get rid of MNT_USER and
  MNT_ALLOWUSERMNT and allow/deny unprivileged mounts and umounts based
  on a boolean sysctl flag and on a check if the target namespace is the
  initial namespace or not.  And maybe add some extra checks which
  prevent ugliness from happening with suid programs.  Is this correct?
  
  If so, how are we going to make sure this won't break existing
  userspace without doing a full audit of all suid programs in every
  distro that wants this feature?
  
  Also how are we going to prevent the user from creating millions of
  mounts, and using up all the kernel memory for vfsmounts?
 
 Don't forget that almost all mount flags are per-superblock. How are you
 planning on dealing with the case that one user mounts a filesystem
 read-only, while another is trying to mount the same one read-write?

Yeah, I forgot, the per-mount read-only patches are not yet in.

That doesn't really change my agrument though.  _If_ the flag is per
mount, then it makes sense to be able to change it on a master and not
on a slave.  If mount flags are propagated, this is not possible.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag

2007-04-18 Thread Miklos Szeredi

   Don't forget that almost all mount flags are per-superblock. How are you
   planning on dealing with the case that one user mounts a filesystem
   read-only, while another is trying to mount the same one read-write?
  
  Yeah, I forgot, the per-mount read-only patches are not yet in.
  
  That doesn't really change my agrument though.  _If_ the flag is per
  mount, then it makes sense to be able to change it on a master and not
  on a slave.  If mount flags are propagated, this is not possible.
 
 Read-only isn't the only issue. On something like NFS, there are flags
 to set the security flavour, turn on/off encryption etc.
 
 If I mount your home directory using no encryption in my namespace, for
 instance, then neither you nor the administrator will be able to turn it
 on afterwards in yours without first unmounting it from mine so that the
 superblock is destroyed.

OK, that's interesting, but I fail to grasp the relevance of this to
unprivileged mounts.

Or are you thinking of unprivileged NFS mounts?  Well, think again,
because that involves _much_ more than it seems at first glance.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag

2007-04-18 Thread Miklos Szeredi

   I've tried to make this unprivileged mount thing as simple as
   possible, and no simpler.  If we can make it even simpler, all the
   better.
  
  We are certainly much more complex then the code in plan9 (just
  read through it) so I think we have room for improvement.
  
  Just for reference what I saw in plan 9 was:
  - No super user checks in it's mount, unmount, or namespace creation paths.
  - A flag to deny new mounts but not new bind mounts (for administrative
  purposes
the comment said).
  
  Our differences from plan9.
  - suid capable binaries. (SUID please go away).
  - A history of programs assuming only root could call mount/unmount.
 
  I hate suid as well.  _The_ motivation behind this patchset was to get
  rid of fusermount, a suid mount helper for fuse.
 
  But I don't think suid is going away, and definitely not overnight.
 
 Agreed, unless it just happens that killing is equally as hard as
 doing non-privileged mounts.   Which I really don't think will be
 the case.  suid executables can always be replaced by a non-privileged
 part that connect to a daemon on localhost.
 
  Also I don't think we want to require auditing userspace before
  enabling user mounts.
 
 Given the complexity of the code to avoids audits adds, and especially
 the uncertainty of how all of the pieces add up together I really
 think we do need to audit user space (but only in a limited way).
 
 This is a paradigm shift in how mounts are managed and to get
 the full benefit of it we need to be prepared to deal with the
 paradigm shift.
 
 Essentially what the audit must ensure is that 
 (a) non-root users are always run in a non-default namespace so
 mounts and unmounts they generate will not have global effect.

But unprivileged proceses are not only created through login.  What
happens, when some process does setuid(NONZERO).  With your proposal
it now gained the privilege of mounting in the initial namespace.  IOW
every program calling setuid() now has to be audited for any problems
this could cause.

 (b) If we setup something like the proposed /share that administrative
 tools know how to treat it properly.
 
 That is not an especially hard audit of user space and it noticeably
 reduces the complexity of what we must implement.
 
  If I understand correctly, your proposal is to get rid of MNT_USER and
  MNT_ALLOWUSERMNT and allow/deny unprivileged mounts and umounts based
  on a boolean sysctl flag and on a check if the target namespace is the
  initial namespace or not.  
 
 No.  I really do not want to treat the initial namespace in a special way
 from an implementation point of view.

Ah.

 From a distro point of view I don't think we should ever allow a user
 into the initial mount namespace, and that is what we should audit.
 All of the ways a user can login to a machine.
 
 We need to audit the login paths anyway if we are going to do anything
 interesting with the mount namespace.  So I see this as no real
 additional overhead to make things usable.
 
  And maybe add some extra checks which
  prevent ugliness from happening with suid programs.  Is this correct?
 
 Definitely add some extra permission checks to prevent ugliness from
 happening with suid executables.  In the general case the problem
 with suid executables is that they expect parts of the mount namespace
 that a user cannot modify not to be modified.
 
 Ensuring that our permission checks keep that promise for mount
 and umount is going to be a bit challenging, because we need to
 think through a lot of scenarios.  But it is not that hard.
 
  If so, how are we going to make sure this won't break existing
  userspace without doing a full audit of all suid programs in every
  distro that wants this feature?
 
 The permission checks to ensure you can not modify a filesystem with
 mounts in a way that you could not modify it by creating or deleting
 files should be enough.

Umm, well.  What if user mounts a filesystem read-only.  Then wants to
unmount, but hey, it's read-only, no permission to write, no
unmount...

That's just a stupid example, but it shows, that file permissions are
not always useful for determining what you can mount, and especially
what you can unmount.

 That probably means that we are restricted to mounting filesystems
 with the equivalent of uid=$(id -u) gid=$(id -g).  That is if you
 aren't root all files in the filesystem you cause to be mounted
 must be owned by you.  That way you have permission to unmount or
 delete them.

That rules out unprivileged bind mounts.

 At the very least the user who causes a mount to be created should
 be the owner and have complete control over the directory mount point.
 So that they can at least theoretically remove all of the files and
 directories at the mount point (which would be equivalent to
 unmounting it).

Checking the permissions on the mountpoint to allow unmounting is

  - rather inelegant: user can't see those permissions, can only
determine if

Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag

2007-04-19 Thread Miklos Szeredi

   As I said earlier, I see a case where two mounts that are peers of each
   other can become un-identical if we dont propagate the allowusermnt.
   
   As a practical example.
   
   /tmp and /mnt are peers of each other.
   /tmp has its allowusermnt flag set, which has not been propagated
   to /mnt.
   
   now a normal-user mounts an ext2 file system under /tmp at /tmp/1
   
   unfortunately the mount wont appear under /mnt/1 
  
  Argh, that is not true.  That's what I've been trying to explain to
  you all along.
 
 I now realize you did, but I failed to catch it. sorry :-(

It is my fault also.  I will add better description to the patch
headers.

  The propagation will be done _regardless_ of the flag.  The flag is
  only checked for the parent of the _requested_ mount.  If it is
  allowed there, the mount, including any propagations are allowed.  If
  it's denied, then obviously it's denied everywhere.
  
   and in case if you allow the mount to appear under /mnt/1, you will
   break unpriviledge mounts semantics which promises: a normal user will
   not be able to mount at a location that does not allow user-mounts.
  
  No, it does not promise that.  The flag just promises, that the user
  cannot _request_ a mount on the parent mount.
 
 ok. if the ability for a normal user to mount something *indirectly*
 under a mount that has its 'allowusermnt flag' unset, 
 is acceptable under the definition of 'allowusermnt', i guess my only
 choice is to accept it. :-)

It is the only sane way I think.

Maybe the MS_SETFLAGS/MS_CLEARFLAGS mount operation could have an
option for propagating the given flag(s).  Then it's really up to the
sysadmin, to determine in each case what the desired behavior is.

Yes, you are right, that most of the time propagating the
allowusermnt between peers may be the right thing to do.  But for
example propagating it from master to slave may not be a good thing at
all.  So it shouldn't be for the kernel to decide.

Thanks,
Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Devel] Re: [patch 05/10] add permit user mounts in new namespace clone flag

2007-04-19 Thread Miklos Szeredi

 Checking the permissions on the mountpoint to allow unmounting is
 
   - rather inelegant: user can't see those permissions, can only
 determine if umount is allowed by trial and error
 
   - may be a security hole, e.g.:
 
   sysadmin:
 
   mkdir -m 777 /mnt/disk
   mount /dev/hda2 /mnt/disk
 
 Of course the user doesn't have the right to delete the contents of
 the mount, yet the permissions on the mountpoint would imply that s/he
 has permission to umount the disk.

It is becoming increasingly apparent, that mount/umount permission
based on file permissions is inherently broken:

1) there are user-writable files under /proc/$PID/, which definitely
   shouldn't be allowed to be overmounted

2) if user mounts an fs read-only, then wants to create a submount of
   this, it will fail with the current patchset

Solving 2) should be trivial: submounting a mount owned by the user
should be always allowed regardless of the file permissions.

Maybe this could be generaized to say, that a mount can be submounted
by an unprivileged user IFF parent mount is owned by said user.

This would get rid of some of the complications in the current
patchset, namely the functionality of MNT_ALLOWUSERMNT and MNT_USER
flags would be merged, and the permission checking would be removed.

For example on login, the user could get a private namespace set up
some that the home directory is owned by the user, and hence can be
freely submounted:

  clone(CLONE_NEWNS)
  mount --bind /home/$USER /home/$USER
  mount --remount -ouser=$USER /home/$USER

This is of course more limiting than allowing mounts based on file
permissions, but it's also a lot cleaner.

Hmm?

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 09/12] mm: count unstable pages per BDI

2007-04-19 Thread Miklos Szeredi

 Count per BDI unstable pages.
 

I'm wondering, is it really worth having this category separate from
per BDI brity pages?

With the exception of the export to sysfs, always the sum of unstable
+ dirty is used.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 11/12] mm: per device dirty threshold

2007-04-19 Thread Miklos Szeredi

 +static inline unsigned long bdi_stat_delta(void)
 +{
 +#ifdef CONFIG_SMP
 + return NR_CPUS * FBC_BATCH;

Shouln't this be multiplied by the number of counters to sum?  I.e. 3
if dirty and unstable are separate, and 2 if they are not.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 09/12] mm: count unstable pages per BDI

2007-04-19 Thread Miklos Szeredi

 Index: linux-2.6/fs/buffer.c
 ===
 --- linux-2.6.orig/fs/buffer.c2007-04-19 19:59:26.0 +0200
 +++ linux-2.6/fs/buffer.c 2007-04-19 20:35:39.0 +0200
 @@ -733,7 +733,7 @@ int __set_page_dirty_buffers(struct page
   if (page-mapping) {/* Race with truncate? */
   if (mapping_cap_account_dirty(mapping)) {
   __inc_zone_page_state(page, NR_FILE_DIRTY);
 - __inc_bdi_stat(mapping-backing_dev_info, BDI_DIRTY);
 + __inc_bdi_stat(mapping-backing_dev_info, BDI_RECLAIM);

This name suggests it's _under_ reclaim, which is not true.  You might
rather want to call it BDI_RECLAIMABLE, or something similar in
meaning.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch 2/8] allow unprivileged umount

2007-04-20 Thread Miklos Szeredi

From: Miklos Szeredi [EMAIL PROTECTED]

The owner doesn't need sysadmin capabilities to call umount().

Similar behavior as umount(8) on mounts having user=UID option in
/etc/mtab.  The difference is that umount also checks /etc/fstab,
presumably to exclude another mount on the same mountpoint.

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---

Index: linux/fs/namespace.c
===
--- linux.orig/fs/namespace.c   2007-04-20 11:55:05.0 +0200
+++ linux/fs/namespace.c2007-04-20 11:55:06.0 +0200
@@ -659,6 +659,25 @@ static int do_umount(struct vfsmount *mn
 }
 
 /*
+ * umount is permitted for
+ *  - sysadmin
+ *  - mount owner, if not forced umount
+ */
+static bool permit_umount(struct vfsmount *mnt, int flags)
+{
+   if (capable(CAP_SYS_ADMIN))
+   return true;
+
+   if (!(mnt-mnt_flags  MNT_USER))
+   return false;
+
+   if (flags  MNT_FORCE)
+   return false;
+
+   return mnt-mnt_uid == current-uid;
+}
+
+/*
  * Now umount can handle mount points as well as block devices.
  * This is important for filesystems which use unnamed block devices.
  *
@@ -681,7 +700,7 @@ asmlinkage long sys_umount(char __user *
goto dput_and_out;
 
retval = -EPERM;
-   if (!capable(CAP_SYS_ADMIN))
+   if (!permit_umount(nd.mnt, flags))
goto dput_and_out;
 
retval = do_umount(nd.mnt, flags);

--
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch 1/8] add user mounts to the kernel

2007-04-20 Thread Miklos Szeredi

From: Miklos Szeredi [EMAIL PROTECTED]

Add ownership information to mounts.

A new mount flag, MS_SETUSER is used to make a mount owned by a user.
If this flag is specified, then the owner will be set to the current
real user id and the mount will be marked with the MNT_USER flag.  On
remount don't preserve previous owner, and treat MS_SETUSER as for a
new mount.  The MS_SETUSER flag is ignored on mount move.

The MNT_USER flag is not copied on any kind of mount cloning:
namespace creation, binding or propagation.  For bind mounts the
cloned mount(s) are set to MNT_USER depending on the MS_SETUSER mount
flag.  In all the other cases MNT_USER is always cleared.

For MNT_USER mounts a user=UID option is added to /proc/PID/mounts.
This is compatible with how mount ownership is stored in /etc/mtab.

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---

Index: linux/fs/namespace.c
===
--- linux.orig/fs/namespace.c   2007-04-20 11:55:02.0 +0200
+++ linux/fs/namespace.c2007-04-20 11:55:05.0 +0200
@@ -227,6 +227,13 @@ static struct vfsmount *skip_mnt_tree(st
return p;
 }
 
+static void set_mnt_user(struct vfsmount *mnt)
+{
+   BUG_ON(mnt-mnt_flags  MNT_USER);
+   mnt-mnt_uid = current-uid;
+   mnt-mnt_flags |= MNT_USER;
+}
+
 static struct vfsmount *clone_mnt(struct vfsmount *old, struct dentry *root,
int flag)
 {
@@ -241,6 +248,11 @@ static struct vfsmount *clone_mnt(struct
mnt-mnt_mountpoint = mnt-mnt_root;
mnt-mnt_parent = mnt;
 
+   /* don't copy the MNT_USER flag */
+   mnt-mnt_flags = ~MNT_USER;
+   if (flag  CL_SETUSER)
+   set_mnt_user(mnt);
+
if (flag  CL_SLAVE) {
list_add(mnt-mnt_slave, old-mnt_slave_list);
mnt-mnt_master = old;
@@ -403,6 +415,8 @@ static int show_vfsmnt(struct seq_file *
if (mnt-mnt_flags  fs_infop-flag)
seq_puts(m, fs_infop-str);
}
+   if (mnt-mnt_flags  MNT_USER)
+   seq_printf(m, ,user=%i, mnt-mnt_uid);
if (mnt-mnt_sb-s_op-show_options)
err = mnt-mnt_sb-s_op-show_options(m, mnt);
seq_puts(m,  0 0\n);
@@ -920,8 +934,9 @@ static int do_change_type(struct nameida
 /*
  * do loopback mount.
  */
-static int do_loopback(struct nameidata *nd, char *old_name, int recurse)
+static int do_loopback(struct nameidata *nd, char *old_name, int flags)
 {
+   int clone_flags;
struct nameidata old_nd;
struct vfsmount *mnt = NULL;
int err = mount_is_safe(nd);
@@ -941,11 +956,12 @@ static int do_loopback(struct nameidata 
if (!check_mnt(nd-mnt) || !check_mnt(old_nd.mnt))
goto out;
 
+   clone_flags = (flags  MS_SETUSER) ? CL_SETUSER : 0;
err = -ENOMEM;
-   if (recurse)
-   mnt = copy_tree(old_nd.mnt, old_nd.dentry, 0);
+   if (flags  MS_REC)
+   mnt = copy_tree(old_nd.mnt, old_nd.dentry, clone_flags);
else
-   mnt = clone_mnt(old_nd.mnt, old_nd.dentry, 0);
+   mnt = clone_mnt(old_nd.mnt, old_nd.dentry, clone_flags);
 
if (!mnt)
goto out;
@@ -987,8 +1003,11 @@ static int do_remount(struct nameidata *
 
down_write(sb-s_umount);
err = do_remount_sb(sb, flags, data, 0);
-   if (!err)
+   if (!err) {
nd-mnt-mnt_flags = mnt_flags;
+   if (flags  MS_SETUSER)
+   set_mnt_user(nd-mnt);
+   }
up_write(sb-s_umount);
if (!err)
security_sb_post_remount(nd-mnt, flags, data);
@@ -1093,10 +1112,13 @@ static int do_new_mount(struct nameidata
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
 
-   mnt = do_kern_mount(type, flags, name, data);
+   mnt = do_kern_mount(type, flags  ~MS_SETUSER, name, data);
if (IS_ERR(mnt))
return PTR_ERR(mnt);
 
+   if (flags  MS_SETUSER)
+   set_mnt_user(mnt);
+
return do_add_mount(mnt, nd, mnt_flags, NULL);
 }
 
@@ -1127,7 +1149,8 @@ int do_add_mount(struct vfsmount *newmnt
if (S_ISLNK(newmnt-mnt_root-d_inode-i_mode))
goto unlock;
 
-   newmnt-mnt_flags = mnt_flags;
+   /* MNT_USER was set earlier */
+   newmnt-mnt_flags |= mnt_flags;
if ((err = graft_tree(newmnt, nd)))
goto unlock;
 
@@ -1447,7 +1470,7 @@ long do_mount(char *dev_name, char *dir_
retval = do_remount(nd, flags  ~MS_REMOUNT, mnt_flags,
data_page);
else if (flags  MS_BIND)
-   retval = do_loopback(nd, dev_name, flags  MS_REC);
+   retval = do_loopback(nd, dev_name, flags);
else if (flags  (MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE

[patch 4/8] propagate error values from clone_mnt

2007-04-20 Thread Miklos Szeredi

From: Miklos Szeredi [EMAIL PROTECTED]

Allow clone_mnt() to return errors other than ENOMEM.  This will be
used for returning a different error value when the number of user
mounts goes over the limit.

Fix copy_tree() to return EPERM for unbindable mounts.

Don't propagate further from dup_mnt_ns() as that copy_tree() can only
fail with -ENOMEM.

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---

Index: linux/fs/namespace.c
===
--- linux.orig/fs/namespace.c   2007-04-20 11:55:07.0 +0200
+++ linux/fs/namespace.c2007-04-20 11:55:09.0 +0200
@@ -261,42 +261,42 @@ static struct vfsmount *clone_mnt(struct
 {
struct super_block *sb = old-mnt_sb;
struct vfsmount *mnt = alloc_vfsmnt(old-mnt_devname);
+   if (!mnt)
+   return ERR_PTR(-ENOMEM);
 
-   if (mnt) {
-   mnt-mnt_flags = old-mnt_flags;
-   atomic_inc(sb-s_active);
-   mnt-mnt_sb = sb;
-   mnt-mnt_root = dget(root);
-   mnt-mnt_mountpoint = mnt-mnt_root;
-   mnt-mnt_parent = mnt;
-
-   /* don't copy the MNT_USER flag */
-   mnt-mnt_flags = ~MNT_USER;
-   if (flag  CL_SETUSER)
-   set_mnt_user(mnt);
-
-   if (flag  CL_SLAVE) {
-   list_add(mnt-mnt_slave, old-mnt_slave_list);
-   mnt-mnt_master = old;
-   CLEAR_MNT_SHARED(mnt);
-   } else {
-   if ((flag  CL_PROPAGATION) || IS_MNT_SHARED(old))
-   list_add(mnt-mnt_share, old-mnt_share);
-   if (IS_MNT_SLAVE(old))
-   list_add(mnt-mnt_slave, old-mnt_slave);
-   mnt-mnt_master = old-mnt_master;
-   }
-   if (flag  CL_MAKE_SHARED)
-   set_mnt_shared(mnt);
+   mnt-mnt_flags = old-mnt_flags;
+   atomic_inc(sb-s_active);
+   mnt-mnt_sb = sb;
+   mnt-mnt_root = dget(root);
+   mnt-mnt_mountpoint = mnt-mnt_root;
+   mnt-mnt_parent = mnt;
+
+   /* don't copy the MNT_USER flag */
+   mnt-mnt_flags = ~MNT_USER;
+   if (flag  CL_SETUSER)
+   set_mnt_user(mnt);
 
-   /* stick the duplicate mount on the same expiry list
-* as the original if that was on one */
-   if (flag  CL_EXPIRE) {
-   spin_lock(vfsmount_lock);
-   if (!list_empty(old-mnt_expire))
-   list_add(mnt-mnt_expire, old-mnt_expire);
-   spin_unlock(vfsmount_lock);
-   }
+   if (flag  CL_SLAVE) {
+   list_add(mnt-mnt_slave, old-mnt_slave_list);
+   mnt-mnt_master = old;
+   CLEAR_MNT_SHARED(mnt);
+   } else {
+   if ((flag  CL_PROPAGATION) || IS_MNT_SHARED(old))
+   list_add(mnt-mnt_share, old-mnt_share);
+   if (IS_MNT_SLAVE(old))
+   list_add(mnt-mnt_slave, old-mnt_slave);
+   mnt-mnt_master = old-mnt_master;
+   }
+   if (flag  CL_MAKE_SHARED)
+   set_mnt_shared(mnt);
+
+   /* stick the duplicate mount on the same expiry list
+* as the original if that was on one */
+   if (flag  CL_EXPIRE) {
+   spin_lock(vfsmount_lock);
+   if (!list_empty(old-mnt_expire))
+   list_add(mnt-mnt_expire, old-mnt_expire);
+   spin_unlock(vfsmount_lock);
}
return mnt;
 }
@@ -781,11 +781,11 @@ struct vfsmount *copy_tree(struct vfsmou
struct nameidata nd;
 
if (!(flag  CL_COPY_ALL)  IS_MNT_UNBINDABLE(mnt))
-   return NULL;
+   return ERR_PTR(-EPERM);
 
res = q = clone_mnt(mnt, dentry, flag);
-   if (!q)
-   goto Enomem;
+   if (IS_ERR(q))
+   goto error;
q-mnt_mountpoint = mnt-mnt_mountpoint;
 
p = mnt;
@@ -806,8 +806,8 @@ struct vfsmount *copy_tree(struct vfsmou
nd.mnt = q;
nd.dentry = p-mnt_mountpoint;
q = clone_mnt(p, p-mnt_root, flag);
-   if (!q)
-   goto Enomem;
+   if (IS_ERR(q))
+   goto error;
spin_lock(vfsmount_lock);
list_add_tail(q-mnt_list, res-mnt_list);
attach_mnt(q, nd);
@@ -815,7 +815,7 @@ struct vfsmount *copy_tree(struct vfsmou
}
}
return res;
-Enomem:
+ error:
if (res) {
LIST_HEAD(umount_list);
spin_lock(vfsmount_lock);
@@ -823,7 +823,7 @@ Enomem:
spin_unlock(vfsmount_lock);
release_mounts(umount_list);
}
-   return

[patch 6/8] put declaration of put_filesystem() in fs.h

2007-04-20 Thread Miklos Szeredi

From: Miklos Szeredi [EMAIL PROTECTED]

Declarations go into headers.

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---

Index: linux/fs/super.c
===
--- linux.orig/fs/super.c   2007-04-20 11:55:02.0 +0200
+++ linux/fs/super.c2007-04-20 11:55:11.0 +0200
@@ -40,10 +40,6 @@
 #include asm/uaccess.h
 
 
-void get_filesystem(struct file_system_type *fs);
-void put_filesystem(struct file_system_type *fs);
-struct file_system_type *get_fs_type(const char *name);
-
 LIST_HEAD(super_blocks);
 DEFINE_SPINLOCK(sb_lock);
 
Index: linux/include/linux/fs.h
===
--- linux.orig/include/linux/fs.h   2007-04-20 11:55:07.0 +0200
+++ linux/include/linux/fs.h2007-04-20 11:55:11.0 +0200
@@ -1918,6 +1918,8 @@ extern int vfs_fstat(unsigned int, struc
 
 extern int vfs_ioctl(struct file *, unsigned int, unsigned int, unsigned long);
 
+extern void get_filesystem(struct file_system_type *fs);
+extern void put_filesystem(struct file_system_type *fs);
 extern struct file_system_type *get_fs_type(const char *name);
 extern struct super_block *get_super(struct block_device *);
 extern struct super_block *user_get_super(dev_t);

--
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch 5/8] allow unprivileged bind mounts

2007-04-20 Thread Miklos Szeredi

From: Miklos Szeredi [EMAIL PROTECTED]

Allow bind mounts to unprivileged users if the following conditions
are met:

  - mountpoint is not a symlink or special file
  - parent mount is owned by the user
  - the number of user mounts is below the maximum

Unprivileged mounts imply MS_SETUSER, and will also have the nosuid
and nodev mount flags set.

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---

Index: linux/fs/namespace.c
===
--- linux.orig/fs/namespace.c   2007-04-20 11:55:09.0 +0200
+++ linux/fs/namespace.c2007-04-20 11:55:10.0 +0200
@@ -237,11 +237,30 @@ static void dec_nr_user_mounts(void)
spin_unlock(vfsmount_lock);
 }
 
-static void set_mnt_user(struct vfsmount *mnt)
+static int reserve_user_mount(void)
+{
+   int err = 0;
+   spin_lock(vfsmount_lock);
+   if (nr_user_mounts = max_user_mounts  !capable(CAP_SYS_ADMIN))
+   err = -EPERM;
+   else
+   nr_user_mounts++;
+   spin_unlock(vfsmount_lock);
+   return err;
+}
+
+static void __set_mnt_user(struct vfsmount *mnt)
 {
BUG_ON(mnt-mnt_flags  MNT_USER);
mnt-mnt_uid = current-uid;
mnt-mnt_flags |= MNT_USER;
+   if (!capable(CAP_SYS_ADMIN))
+   mnt-mnt_flags |= MNT_NOSUID | MNT_NODEV;
+}
+
+static void set_mnt_user(struct vfsmount *mnt)
+{
+   __set_mnt_user(mnt);
spin_lock(vfsmount_lock);
nr_user_mounts++;
spin_unlock(vfsmount_lock);
@@ -260,9 +279,16 @@ static struct vfsmount *clone_mnt(struct
int flag)
 {
struct super_block *sb = old-mnt_sb;
-   struct vfsmount *mnt = alloc_vfsmnt(old-mnt_devname);
+   struct vfsmount *mnt;
+
+   if (flag  CL_SETUSER) {
+   int err = reserve_user_mount();
+   if (err)
+   return ERR_PTR(err);
+   }
+   mnt = alloc_vfsmnt(old-mnt_devname);
if (!mnt)
-   return ERR_PTR(-ENOMEM);
+   goto alloc_failed;
 
mnt-mnt_flags = old-mnt_flags;
atomic_inc(sb-s_active);
@@ -274,7 +300,7 @@ static struct vfsmount *clone_mnt(struct
/* don't copy the MNT_USER flag */
mnt-mnt_flags = ~MNT_USER;
if (flag  CL_SETUSER)
-   set_mnt_user(mnt);
+   __set_mnt_user(mnt);
 
if (flag  CL_SLAVE) {
list_add(mnt-mnt_slave, old-mnt_slave_list);
@@ -299,6 +325,11 @@ static struct vfsmount *clone_mnt(struct
spin_unlock(vfsmount_lock);
}
return mnt;
+
+ alloc_failed:
+   if (flag  CL_SETUSER)
+   dec_nr_user_mounts();
+   return ERR_PTR(-ENOMEM);
 }
 
 static inline void __mntput(struct vfsmount *mnt)
@@ -745,22 +776,29 @@ asmlinkage long sys_oldumount(char __use
 
 #endif
 
-static int mount_is_safe(struct nameidata *nd)
+/*
+ * Conditions for unprivileged mounts are:
+ * - mountpoint is not a symlink or special file
+ * - mountpoint is in a mount owned by the user
+ */
+static bool permit_mount(struct nameidata *nd, int *flags)
 {
+   struct inode *inode = nd-dentry-d_inode;
+
if (capable(CAP_SYS_ADMIN))
-   return 0;
-   return -EPERM;
-#ifdef notyet
-   if (S_ISLNK(nd-dentry-d_inode-i_mode))
-   return -EPERM;
-   if (nd-dentry-d_inode-i_mode  S_ISVTX) {
-   if (current-uid != nd-dentry-d_inode-i_uid)
-   return -EPERM;
-   }
-   if (vfs_permission(nd, MAY_WRITE))
-   return -EPERM;
-   return 0;
-#endif
+   return true;
+
+   if (!S_ISDIR(inode-i_mode)  !S_ISREG(inode-i_mode))
+   return false;
+
+   if (!(nd-mnt-mnt_flags  MNT_USER))
+   return false;
+
+   if (nd-mnt-mnt_uid != current-uid)
+   return false;
+
+   *flags |= MS_SETUSER;
+   return true;
 }
 
 static int lives_below_in_same_fs(struct dentry *d, struct dentry *dentry)
@@ -981,9 +1019,10 @@ static int do_loopback(struct nameidata 
int clone_flags;
struct nameidata old_nd;
struct vfsmount *mnt = NULL;
-   int err = mount_is_safe(nd);
-   if (err)
-   return err;
+   int err;
+
+   if (!permit_mount(nd, flags))
+   return -EPERM;
if (!old_name || !*old_name)
return -EINVAL;
err = path_lookup(old_name, LOOKUP_FOLLOW, old_nd);

--
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch 8/8] allow unprivileged fuse mounts

2007-04-20 Thread Miklos Szeredi

From: Miklos Szeredi [EMAIL PROTECTED]

Use FS_SAFE for fuse fs type, but not for fuseblk.

FUSE was designed from the beginning to be safe for unprivileged
users.  This has also been verified in practice over many years.  In
addition unprivileged mounts require the parent mount to be owned by
the user, which is more strict than the current userspace policy.

This will enable future installations to remove the suid-root
fusermount utility.

Don't require the user_id= and group_id= options for unprivileged
mounts, but if they are present, verify them for sanity.

Disallow the allow_other option for unprivileged mounts.

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---

Index: linux/fs/fuse/inode.c
===
--- linux.orig/fs/fuse/inode.c  2007-04-20 11:55:01.0 +0200
+++ linux/fs/fuse/inode.c   2007-04-20 11:55:14.0 +0200
@@ -311,6 +311,19 @@ static int parse_fuse_opt(char *opt, str
d-max_read = ~0;
d-blksize = 512;
 
+   /*
+* For unprivileged mounts use current uid/gid.  Still allow
+* user_id and group_id options for compatibility, but
+* only if they match these values.
+*/
+   if (!capable(CAP_SYS_ADMIN)) {
+   d-user_id = current-uid;
+   d-user_id_present = 1;
+   d-group_id = current-gid;
+   d-group_id_present = 1;
+
+   }
+
while ((p = strsep(opt, ,)) != NULL) {
int token;
int value;
@@ -339,6 +352,8 @@ static int parse_fuse_opt(char *opt, str
case OPT_USER_ID:
if (match_int(args[0], value))
return 0;
+   if (d-user_id_present  d-user_id != value)
+   return 0;
d-user_id = value;
d-user_id_present = 1;
break;
@@ -346,6 +361,8 @@ static int parse_fuse_opt(char *opt, str
case OPT_GROUP_ID:
if (match_int(args[0], value))
return 0;
+   if (d-group_id_present  d-group_id != value)
+   return 0;
d-group_id = value;
d-group_id_present = 1;
break;
@@ -536,6 +553,10 @@ static int fuse_fill_super(struct super_
if (!parse_fuse_opt((char *) data, d, is_bdev))
return -EINVAL;
 
+   /* This is a privileged option */
+   if ((d.flags  FUSE_ALLOW_OTHER)  !capable(CAP_SYS_ADMIN))
+   return -EPERM;
+
if (is_bdev) {
 #ifdef CONFIG_BLOCK
if (!sb_set_blocksize(sb, d.blksize))
@@ -639,6 +660,7 @@ static struct file_system_type fuse_fs_t
.fs_flags   = FS_HAS_SUBTYPE,
.get_sb = fuse_get_sb,
.kill_sb= kill_anon_super,
+   .fs_flags   = FS_SAFE,
 };
 
 #ifdef CONFIG_BLOCK

--
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch 3/8] account user mounts

2007-04-20 Thread Miklos Szeredi

From: Miklos Szeredi [EMAIL PROTECTED]

Add sysctl variables for accounting and limiting the number of user
mounts.

The maximum number of user mounts is set to 1024 by default.  This
won't in itself enable user mounts, setting a mount to be owned by a
user is first needed

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---

Index: linux/include/linux/sysctl.h
===
--- linux.orig/include/linux/sysctl.h   2007-04-20 11:55:02.0 +0200
+++ linux/include/linux/sysctl.h2007-04-20 11:55:07.0 +0200
@@ -818,6 +818,8 @@ enum
FS_AIO_NR=18,   /* current system-wide number of aio requests */
FS_AIO_MAX_NR=19,   /* system-wide maximum number of aio requests */
FS_INOTIFY=20,  /* inotify submenu */
+   FS_NR_USER_MOUNTS=21,   /* int:current number of user mounts */
+   FS_MAX_USER_MOUNTS=22,  /* int:maximum number of user mounts */
FS_OCFS2=988,   /* ocfs2 */
 };
 
Index: linux/kernel/sysctl.c
===
--- linux.orig/kernel/sysctl.c  2007-04-20 11:55:02.0 +0200
+++ linux/kernel/sysctl.c   2007-04-20 11:55:07.0 +0200
@@ -1063,6 +1063,22 @@ static ctl_table fs_table[] = {
 #endif 
 #endif
{
+   .ctl_name   = FS_NR_USER_MOUNTS,
+   .procname   = nr_user_mounts,
+   .data   = nr_user_mounts,
+   .maxlen = sizeof(int),
+   .mode   = 0444,
+   .proc_handler   = proc_dointvec,
+   },
+   {
+   .ctl_name   = FS_MAX_USER_MOUNTS,
+   .procname   = max_user_mounts,
+   .data   = max_user_mounts,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   },
+   {
.ctl_name   = KERN_SETUID_DUMPABLE,
.procname   = suid_dumpable,
.data   = suid_dumpable,
Index: linux/Documentation/filesystems/proc.txt
===
--- linux.orig/Documentation/filesystems/proc.txt   2007-04-20 
11:55:02.0 +0200
+++ linux/Documentation/filesystems/proc.txt2007-04-20 11:55:07.0 
+0200
@@ -923,6 +923,15 @@ reaches aio-max-nr then io_setup will fa
 raising aio-max-nr does not result in the pre-allocation or re-sizing
 of any kernel data structures.
 
+nr_user_mounts and max_user_mounts
+--
+
+These represent the number of user mounts and the maximum number of
+user mounts respectively.  User mounts may be created by
+unprivileged users.  User mounts may also be created with sysadmin
+privileges on behalf of a user, in which case nr_user_mounts may
+exceed max_user_mounts.
+
 2.2 /proc/sys/fs/binfmt_misc - Miscellaneous binary formats
 ---
 
Index: linux/fs/namespace.c
===
--- linux.orig/fs/namespace.c   2007-04-20 11:55:06.0 +0200
+++ linux/fs/namespace.c2007-04-20 11:55:07.0 +0200
@@ -39,6 +39,9 @@ static int hash_mask __read_mostly, hash
 static struct kmem_cache *mnt_cache __read_mostly;
 static struct rw_semaphore namespace_sem;
 
+int nr_user_mounts;
+int max_user_mounts = 1024;
+
 /* /sys/fs */
 decl_subsys(fs, NULL, NULL);
 EXPORT_SYMBOL_GPL(fs_subsys);
@@ -227,11 +230,30 @@ static struct vfsmount *skip_mnt_tree(st
return p;
 }
 
+static void dec_nr_user_mounts(void)
+{
+   spin_lock(vfsmount_lock);
+   nr_user_mounts--;
+   spin_unlock(vfsmount_lock);
+}
+
 static void set_mnt_user(struct vfsmount *mnt)
 {
BUG_ON(mnt-mnt_flags  MNT_USER);
mnt-mnt_uid = current-uid;
mnt-mnt_flags |= MNT_USER;
+   spin_lock(vfsmount_lock);
+   nr_user_mounts++;
+   spin_unlock(vfsmount_lock);
+}
+
+static void clear_mnt_user(struct vfsmount *mnt)
+{
+   if (mnt-mnt_flags  MNT_USER) {
+   mnt-mnt_uid = 0;
+   mnt-mnt_flags = ~MNT_USER;
+   dec_nr_user_mounts();
+   }
 }
 
 static struct vfsmount *clone_mnt(struct vfsmount *old, struct dentry *root,
@@ -283,6 +305,7 @@ static inline void __mntput(struct vfsmo
 {
struct super_block *sb = mnt-mnt_sb;
dput(mnt-mnt_root);
+   clear_mnt_user(mnt);
free_vfsmnt(mnt);
deactivate_super(sb);
 }
@@ -1023,6 +1046,7 @@ static int do_remount(struct nameidata *
down_write(sb-s_umount);
err = do_remount_sb(sb, flags, data, 0);
if (!err) {
+   clear_mnt_user(nd-mnt);
nd-mnt-mnt_flags = mnt_flags;
if (flags  MS_SETUSER)
set_mnt_user(nd-mnt);
Index: linux/include/linux/fs.h

[patch 7/8] allow unprivileged mounts

2007-04-20 Thread Miklos Szeredi

From: Miklos Szeredi [EMAIL PROTECTED]

Define a new fs flag FS_SAFE, which denotes, that unprivileged
mounting of this filesystem may not constitute a security problem.

Since most filesystems haven't been designed with unprivileged
mounting in mind, a thorough audit is needed before setting this flag.

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---

Index: linux/fs/namespace.c
===
--- linux.orig/fs/namespace.c   2007-04-20 11:55:10.0 +0200
+++ linux/fs/namespace.c2007-04-20 11:55:13.0 +0200
@@ -781,13 +781,17 @@ asmlinkage long sys_oldumount(char __use
  * - mountpoint is not a symlink or special file
  * - mountpoint is in a mount owned by the user
  */
-static bool permit_mount(struct nameidata *nd, int *flags)
+static bool permit_mount(struct nameidata *nd, struct file_system_type *type,
+int *flags)
 {
struct inode *inode = nd-dentry-d_inode;
 
if (capable(CAP_SYS_ADMIN))
return true;
 
+   if (type  !(type-fs_flags  FS_SAFE))
+   return false;
+
if (!S_ISDIR(inode-i_mode)  !S_ISREG(inode-i_mode))
return false;
 
@@ -1021,7 +1025,7 @@ static int do_loopback(struct nameidata 
struct vfsmount *mnt = NULL;
int err;
 
-   if (!permit_mount(nd, flags))
+   if (!permit_mount(nd, NULL, flags))
return -EPERM;
if (!old_name || !*old_name)
return -EINVAL;
@@ -1182,26 +1186,46 @@ out:
  * create a new mount for userspace and request it to be added into the
  * namespace's tree
  */
-static int do_new_mount(struct nameidata *nd, char *type, int flags,
+static int do_new_mount(struct nameidata *nd, char *fstype, int flags,
int mnt_flags, char *name, void *data)
 {
+   int err;
struct vfsmount *mnt;
+   struct file_system_type *type;
 
-   if (!type || !memchr(type, 0, PAGE_SIZE))
+   if (!fstype || !memchr(fstype, 0, PAGE_SIZE))
return -EINVAL;
 
-   /* we need capabilities... */
-   if (!capable(CAP_SYS_ADMIN))
-   return -EPERM;
-
-   mnt = do_kern_mount(type, flags  ~MS_SETUSER, name, data);
-   if (IS_ERR(mnt))
+   type = get_fs_type(fstype);
+   if (!type)
+   return -ENODEV;
+
+   err = -EPERM;
+   if (!permit_mount(nd, type, flags))
+   goto out_put_filesystem;
+
+   if (flags  MS_SETUSER) {
+   err = reserve_user_mount();
+   if (err)
+   goto out_put_filesystem;
+   }
+
+   mnt = vfs_kern_mount(type, flags  ~MS_SETUSER, name, data);
+   put_filesystem(type);
+   if (IS_ERR(mnt)) {
+   if (flags  MS_SETUSER)
+   dec_nr_user_mounts();
return PTR_ERR(mnt);
+   }
 
if (flags  MS_SETUSER)
-   set_mnt_user(mnt);
+   __set_mnt_user(mnt);
 
return do_add_mount(mnt, nd, mnt_flags, NULL);
+
+ out_put_filesystem:
+   put_filesystem(type);
+   return err;
 }
 
 /*
@@ -1231,7 +1255,7 @@ int do_add_mount(struct vfsmount *newmnt
if (S_ISLNK(newmnt-mnt_root-d_inode-i_mode))
goto unlock;
 
-   /* MNT_USER was set earlier */
+   /* some flags may have been set earlier */
newmnt-mnt_flags |= mnt_flags;
if ((err = graft_tree(newmnt, nd)))
goto unlock;
Index: linux/include/linux/fs.h
===
--- linux.orig/include/linux/fs.h   2007-04-20 11:55:11.0 +0200
+++ linux/include/linux/fs.h2007-04-20 11:55:13.0 +0200
@@ -96,6 +96,7 @@ extern int dir_notify_enable;
 #define FS_REQUIRES_DEV 1 
 #define FS_BINARY_MOUNTDATA 2
 #define FS_HAS_SUBTYPE 4
+#define FS_SAFE 8  /* Safe to mount by unprivileged users */
 #define FS_REVAL_DOT   16384   /* Check the paths ., .. for staleness */
 #define FS_RENAME_DOES_D_MOVE  32768   /* FS will handle d_move()
 * during rename() internally.

--
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch 0/8] mount ownership and unprivileged mount syscall (v4)

2007-04-20 Thread Miklos Szeredi

This patchset has now been bared to the lowest common denominator
that everybody can agree on.  Or at least there weren't any objections
to this proposal.

Andrew, please consider it for -mm.

Thanks,
Miklos


v3 - v4:

 - simplify interface as much as possible, now only a single option
   (user=UID) is used to control everything
 - no longer allow/deny mounting based on file/directory permissions,
   that approach does not always make sense


This patchset adds support for keeping mount ownership information in
the kernel, and allow unprivileged mount(2) and umount(2) in certain
cases.

The mount owner has the following privileges:

  - unmount the owned mount
  - create a submount under the owned mount

The sysadmin can set the owner explicitly on mount and remount.  When
an unprivileged user creates a mount, then the owner is automatically
set to the user.

The following use cases are envisioned:

1) Private namespace, with selected mounts owned by user.
   E.g. /home/$USER is a good candidate for allowing unpriv mounts and
   unmounts within.

2) Private namespace, with all mounts owned by user and having the
   nosuid flag.  User can mount and umount anywhere within the
   namespace, but suid programs will not work.

3) Global namespace, with a designated directory, which is a mount
   owned by the user.  E.g. /mnt/users/$USER is set up so that it is
   bind mounted onto itself, and set to be owned by $USER.  The user
   can add/remove mounts only under this directory.

The following extra security measures are taken for unprivileged
mounts:

 - usermounts are limited by a sysctl tunable
 - force nosuid,nodev mount options on the created mount

--
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [d_path 0/7] Fixes to d_path: Respin

2007-04-20 Thread Miklos Szeredi

  I gave a chroot example that showed that in the current
  implementation, you can get pretty random clashes between mounts; there are
  other cases with lazy unmounts as well.
 
 Irrelevant as well.  If you create chroot problems it's your problem.
 
 The fact is that if you have a normal setup the code works fine.  All
 other situations cannot be handled with the current kernel interface.
 
 This does not give anybody the right to say since the code doesn't
 always work we can break it completely.  That's completely
 unacceptable.

I'm not sure I understand the situation completely.  What exactly is
broken in libc by removing unreachable mounts from /proc/mounts?

Is it the situation when
 - file descriptor is opened
 - process does chroot
 - process does fstatvfs on file descriptor
?

In that case currently fstatvfs() _usually_ gives the correct results,
but can give wrong results if mounts paths accidently clash in
/proc/mounts?

Also isn't it the case, that fstatvfs() or statvfs() performed within
the chroot could also give incorrect result for a _reachable_ mount if
it clashes with an unreachable mount?

If this is the case, I would think that removing the unreachable
mounts from /proc/mounts, would actually be fixing this second case,
which is more likely to be used anyway.

BTW, this patch, or at least a predecessor is in -mm, and it very much
feels the Right Thing(tm).  The /proc/mounts under a chroot
environment actually looks sane, instead of some random crap, that it
was previously.

While we should make every effort to keep the kernel interfaces
stable, this shouldn't prevent us from fixing bugs.  And this one is
clearly a bug, even if not a very serious one.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 1/8] add user mounts to the kernel

2007-04-21 Thread Miklos Szeredi

  From: Miklos Szeredi [EMAIL PROTECTED]
  
  Add ownership information to mounts.
  
  A new mount flag, MS_SETUSER is used to make a mount owned by a user.
  If this flag is specified, then the owner will be set to the current
  real user id and the mount will be marked with the MNT_USER flag.  On
  remount don't preserve previous owner, and treat MS_SETUSER as for a
  new mount.  The MS_SETUSER flag is ignored on mount move.
 
 So is a modified mount(8) needed?  If so, is there some convenient way
 in which testers can get hold of it?

I haven't yet added this to mount(8).  I'll do that and post patches.

There's an ultra sophisticated mount tester program I use.  It's
slightly user unfriendly...

Miklos


#include stdio.h
#include stdlib.h
#include string.h
#include sys/mount.h

int main(int argc, char *argv[])
{
int res;
int un = (strrchr(argv[0], '/') ?: argv[0] - 1)[1] == 'u';
if ((un  argc != 3) || (!un  argc != 6)) {
fprintf(stderr, usage: %s %s\n, argv[0],
un ? dir flags : dev dir type flags opts);
return 1;
}
if (un)
res = umount2(argv[1], atoi(argv[2]));
else
res = mount(argv[1], argv[2], argv[3], atoi(argv[4]), argv[5]);
if (res == -1) {
perror(argv[0]);
return 1;
}
return 0;
}

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 2/8] allow unprivileged umount

2007-04-21 Thread Miklos Szeredi

  +static bool permit_umount(struct vfsmount *mnt, int flags)
  +{
 
  ...
 
  +   return mnt-mnt_uid == current-uid;
  +}
 
 Yes, this seems very wrong.  I'd have thought that comparing user_struct*'s
 would get us a heck of a lot closer to being able to support aliasing of
 UIDs between different namespaces.
 

OK, I'll fix this up.

Actually an earlier version of this patch did use user_struct's but
I'd changed it to uids, because it's simpler.  I didn't think about
this being contrary to the id namespaces thing.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 7/8] allow unprivileged mounts

2007-04-21 Thread Miklos Szeredi

 On Fri, 20 Apr 2007 12:25:39 +0200 Miklos Szeredi [EMAIL PROTECTED] wrote:
 
  Define a new fs flag FS_SAFE, which denotes, that unprivileged
  mounting of this filesystem may not constitute a security problem.
  
  Since most filesystems haven't been designed with unprivileged
  mounting in mind, a thorough audit is needed before setting this flag.
 
 Practically speaking, is there any realistic likelihood that any filesystem
 apart from FUSE will ever use this?

V9FS people did express an interest in this.  Yeah, I should've CC-ed
them, but forgot.  Sorry.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 8/8] allow unprivileged fuse mounts

2007-04-21 Thread Miklos Szeredi

  Use FS_SAFE for fuse fs type, but not for fuseblk.
  
  FUSE was designed from the beginning to be safe for unprivileged
  users.  This has also been verified in practice over many years.
 
 How does FUSE do this?
 
 There are obvious cases like crafting a filesystem which has setuid 
 executables
 or world-writeable device nodes or whatever.  I'm sure there are lots of other
 cases.
 
 Where is FUSE's implementation of all this protection described?

Most of it is in Documentation/filesystems/fuse.txt, some of it is
code comments.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 7/8] allow unprivileged mounts

2007-04-21 Thread Miklos Szeredi

  
   Define a new fs flag FS_SAFE, which denotes, that unprivileged
   mounting of this filesystem may not constitute a security problem.
   
   Since most filesystems haven't been designed with unprivileged
   mounting in mind, a thorough audit is needed before setting this flag.
  
  Practically speaking, is there any realistic likelihood that any filesystem
  apart from FUSE will ever use this?
 
 V9FS people did express an interest in this.  Yeah, I should've CC-ed
 them, but forgot.  Sorry.

And CIFS maybe.  They also have an unprivileged mounting suid hack.
But I'm not very optimistic about CIFS, seeing some of the code,
that's in there.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 10/10] mm: per device dirty threshold

2007-04-21 Thread Miklos Szeredi

 On Fri, 20 Apr 2007 17:52:04 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote:
 
  Scale writeback cache per backing device, proportional to its writeout 
  speed.
  
  By decoupling the BDI dirty thresholds a number of problems we currently 
  have
  will go away, namely:
  
   - mutual interference starvation (for any number of BDIs);
   - deadlocks with stacked BDIs (loop, FUSE and local NFS mounts).
  
  It might be that all dirty pages are for a single BDI while other BDIs are
  idling. By giving each BDI a 'fair' share of the dirty limit, each one can 
  have
  dirty pages outstanding and make progress.
  
  A global threshold also creates a deadlock for stacked BDIs; when A writes 
  to
  B, and A generates enough dirty pages to get throttled, B will never start
  writeback until the dirty pages go away. Again, by giving each BDI its own
  'independent' dirty limit, this problem is avoided.
  
  So the problem is to determine how to distribute the total dirty limit 
  across
  the BDIs fairly and efficiently. A DBI that has a large dirty limit but does
  not have any dirty pages outstanding is a waste.
  
  What is done is to keep a floating proportion between the DBIs based on
  writeback completions. This way faster/more active devices get a larger 
  share
  than slower/idle devices.
 
 This is a pretty major improvement to various nasty corner-cases, if it
 works.
 
 Does it work?  Please describe the testing you did, and the results.
 
 Has this been confirmed to fix Miklos's FUSE and loopback problems?

I haven't yet tested it (will do), but I'm sure it does solve the
deadlock in balance_dirty_pages(), if for no other reason, that when
the queue is idle (no dirty or writeback pages), then it allowes the
caller to dirty some more pages.

The other deadlock, in throttle_vm_writeout() is still to be solved.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 10/10] mm: per device dirty threshold

2007-04-21 Thread Miklos Szeredi

  The other deadlock, in throttle_vm_writeout() is still to be solved.
 
 Let's go back to the original changelog:
 
 Author: marcelo.tosatti marcelo.tosatti
 Date:   Tue Mar 8 17:25:19 2005 +
 
 [PATCH] vm: pageout throttling
 
 With silly pageout testcases it is possible to place huge amounts of 
 memory
 under I/O.  With a large request queue (CFQ uses 8192 requests) it is
 possible to place _all_ memory under I/O at the same time.
 
 This means that all memory is pinned and unreclaimable and the VM gets
 upset and goes oom.
 
 The patch limits the amount of memory which is under pageout writeout to 
 be
 a little more than the amount of memory at which balance_dirty_pages()
 callers will synchronously throttle.
 
 This means that heavy pageout activity can starve heavy writeback activity
 completely, but heavy writeback activity will not cause starvation of
 pageout.  Because we don't want a simple `dd' to be causing excessive
 latencies in page reclaim.
 
 Signed-off-by: Andrew Morton [EMAIL PROTECTED]
 Signed-off-by: Linus Torvalds [EMAIL PROTECTED]
 
 (A good one!  I wrote it ;))
 
 
 I believe that the combination of dirty-page-tracking and its calls to
 balance_dirty_pages() mean that we can now never get more than dirty_ratio
 of memory into the dirty-or-writeback condition.
 
 The vm scanner can convert dirty pages into clean, under-writeback pages,
 but it cannot increase the total of dirty+writeback.

What about swapout?  That can increase the number of writeback pages,
without decreasing the number of dirty pages, no?

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 2/8] allow unprivileged umount

2007-04-22 Thread Miklos Szeredi

  On Sat, 21 Apr 2007 10:09:42 +0200 Miklos Szeredi [EMAIL PROTECTED] wrote:
 
+static bool permit_umount(struct vfsmount *mnt, int flags)
+{
   
...
   
+  return mnt-mnt_uid == current-uid;
+}
   
   Yes, this seems very wrong.  I'd have thought that comparing 
   user_struct*'s
   would get us a heck of a lot closer to being able to support aliasing of
   UIDs between different namespaces.
   
  
  OK, I'll fix this up.
  
  Actually an earlier version of this patch did use user_struct's but
  I'd changed it to uids, because it's simpler.
 
  OK..
 
   I didn't think about
  this being contrary to the id namespaces thing.
 
  Well I was madly assuming that when serarate UID namespaces are in use, UID
  42 in container A will have a different user_struct from UID 42 in
  container B.  I'd suggest that we provoke an opinion from Eric  co before
  you do work on this.
 
 That is what I what I have been thinking as well,

Does this mean, that containers will need this?  Or that you don't
know yet?

 storing a user struct on each mount point seems sane, plus it allows
 per user mount rlimits which is major plus.  Especially since we
 seem to be doing accounting only for user mounts a per user rlimit
 seems good.

I'm not against per-user rlimits for mounts, but I'd rather do this
later...

 To get the user we should be user fs_uid as HPA suggested.

fsuid is exclusively used for checking file permissions, which we
don't do here anymore.  So while it could be argued, that mount() _is_
a filesystem operation, it is really a different sort of filesystem
operation than the rest.

OTOH it wouldn't hurt to use fsuid instead of ruid...

 Finally I'm pretty certain the capability we should care about in
 this context is CAP_SETUID.  Instead of CAP_SYS_ADMIN.
 
 If we have CAP_SETUID we can become which ever user owns the mount,
 and the root user in a container needs this, so he can run login
 programs.  So changing the appropriate super user checks from
 CAP_SYS_ADMIN to CAP_SETUID I think is the right thing todo.

That's a flawed logic.  If you want to mount as a specific user, and
you have CAP_SETUID, then just use set*uid() and then mount().

Changing the capability check for mount() would break the userspace
ABI.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 1/8] add user mounts to the kernel

2007-04-22 Thread Miklos Szeredi

  The MNT_USER flag is not copied on any kind of mount cloning:
  namespace creation, binding or propagation.
 
 I half agree, and as an initial approximation this works.
 Ultimately we should be at the point that for mount propagation
 that we copy the owner of the from the owner of our parent mount
 at the propagation destination.

Yes, that sounds the most sane.

Ram, what do you think?

  +   if (mnt-mnt_flags  MNT_USER)
  +   seq_printf(m, ,user=%i, mnt-mnt_uid);
 How about making the test if (mnt-mnt_user != root_user)

We don't want to treat root_user special.  That's what capabilities
were invented for.

  Index: linux/include/linux/fs.h
  ===
  --- linux.orig/include/linux/fs.h   2007-04-20 11:55:02.0 +0200
  +++ linux/include/linux/fs.h2007-04-20 11:55:05.0 +0200
  @@ -123,6 +123,7 @@ extern int dir_notify_enable;
   #define MS_SLAVE   (119) /* change to slave */
   #define MS_SHARED  (120) /* change to shared */
   #define MS_RELATIME(121) /* Update atime relative to 
  mtime/ctime. */
  +#define MS_SETUSER (122) /* set mnt_uid to current user */
 
 If we unconditionally use the fsuid I think we can get away without
 this flag.

That coudl work if we wouldn't have to worry about breaking the user
interface.  As it is, we cannot be sure, that existing callers of
mount(2) don't have fsuid set to some random value.

   #define MNT_SHRINKABLE 0x100
  +#define MNT_USER   0x200
 
 If we assign a user to all mount points and root gets to own the
 initial set of mounts then we don't need the internal MNT_USER
 flag.

I think we do want to treat owned mounts special, rather than
treating user=0 mounts special.

  +
  +   uid_t mnt_uid;  /* owner of the mount */
 
 Can we please make this a user struct.   That requires a bit of
 reference counting but it has uid namespace benefits as well
 as making it easy to implement per user mount rlimits.

OK, can you ellaborate, what the uid namespace benifits are?

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 2/8] allow unprivileged umount

2007-04-22 Thread Miklos Szeredi

   I suspect we can allow MNT_FORCE for non-privileged users 
   as well if we can trust the filesystem.

I don't think so.  MNT_FORCE has side effects on the superblock.  So a
user shouldn't be able to force an unmount on a bind mount s/he did,
but there's no problem with allowing plain/lazy unmounts.

We could possibly allow MNT_FORCE, for FS_SAFE filesystems.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 3/8] account user mounts

2007-04-22 Thread Miklos Szeredi

  From: Miklos Szeredi [EMAIL PROTECTED]
 
  Add sysctl variables for accounting and limiting the number of user
  mounts.
 
  The maximum number of user mounts is set to 1024 by default.  This
  won't in itself enable user mounts, setting a mount to be owned by a
  user is first needed
 
 Since each mount has a user can we just make this a per user rlimit?
 
 If we are going to implement a sysctl at this point I think it should
 be a global limit that doesn't care if who you are.  Even root can
 have recursive mounts that attempt to get out of control.

Recursive bind mounts are done carefully enough, so they don't get out
of control.

Recursive mount propagations can get out of control.  But root can
shoot itself in the foot any number of ways, and it's not for the
kernel to police that.

 Also currently you are not checking the max_users.  It looks like
 you do this in a later patch but still it is a little strange to
 allow user own mounts and have accounting but to not check the
 limit at this state.

Yeah, but at this stage user mounts are not yet allowed, so this is
safe.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 5/8] allow unprivileged bind mounts

2007-04-22 Thread Miklos Szeredi

  From: Miklos Szeredi [EMAIL PROTECTED]
 
  Allow bind mounts to unprivileged users if the following conditions
  are met:
 
- mountpoint is not a symlink or special file
 
 Why?  This sounds like a left over from when we were checking permissions.

Hmm, yes.  Don't know.  Maybe only the symlink check.

Bind mounts of directory over non-directy, and vica versa are already
excluded, even for root.

- parent mount is owned by the user
- the number of user mounts is below the maximum
 
  Unprivileged mounts imply MS_SETUSER, and will also have the nosuid
  and nodev mount flags set.
 
 So in principle I agree, but in detail I disagree.
 
 capable(CAP_SETUID) should be required to leave MNT_NOSUID clear.
 capable(CAP_MKNOD) should be required to leave MNT_NODEV clear.
 
 I.e.  We should not special case this as a user mount but rather
 simply check to see if the user performing the mount has the appropriate
 capabilities to allow the flags.

Sounds sane.  Will fix.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 8/8] allow unprivileged fuse mounts

2007-04-22 Thread Miklos Szeredi

  +   /*
  +* For unprivileged mounts use current uid/gid.  Still allow
  +* user_id and group_id options for compatibility, but
  +* only if they match these values.
  +*/
  +   if (!capable(CAP_SYS_ADMIN)) {
  +   d-user_id = current-uid;
  +   d-user_id_present = 1;
  +   d-group_id = current-gid;
  +   d-group_id_present = 1;
  +
  +   }
 
 CAP_SETUID is the appropriate capability...
 
 This is not a dimension we have not fully explored.
 What is the problem with a user controlled mount having different
 uid and gid values.
 
 Yes they map into different users but how is this a problem.
 The only problem that I can recall is the historic chown problem
 where you could give files to other users and mess up their quotas.
 
 Or is the problem other users writing to this user controlled
 filesystem?

Yes.  Or even just a suid process trying to access the user controlled
filesystem.  See Documentation/filesystems/fuse.txt for the gory
details.

Eric, thanks for the detailed review :)

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 2/8] allow unprivileged umount

2007-04-22 Thread Miklos Szeredi

  Does this mean, that containers will need this?  Or that you don't
  know yet?
 
 The uid namespace is something we have to handle carefully and we
 have not decided on the final design.
 
 What is clear is that all permission checks will need to become
 either (uid namspace, uid) tuple comparisons.  Or struct user
 pointer comparisons.  To see if we are talking about the same
 uid.
 
 So the eventual uid namespace combined with the possibility
 for rlimits if we use struct user *.  See to make using a struct
 user a clear win.

OK, if we don't yet know, I'd rather leave this for later.  It will be
trivial to change to user_struct if we want per-user rlimits.

  storing a user struct on each mount point seems sane, plus it allows
  per user mount rlimits which is major plus.  Especially since we
  seem to be doing accounting only for user mounts a per user rlimit
  seems good.
 
  I'm not against per-user rlimits for mounts, but I'd rather do this
  later...
 
 Then let's add a non-discriminate limit.  Instead of a limit that
 applies only to root.

See reply to relevant patch.

  To get the user we should be user fs_uid as HPA suggested.
 
  fsuid is exclusively used for checking file permissions, which we
  don't do here anymore.  So while it could be argued, that mount() _is_
  a filesystem operation, it is really a different sort of filesystem
  operation than the rest.
 
  OTOH it wouldn't hurt to use fsuid instead of ruid...
 
 Yes.  I may be confused but I'm pretty certain we want either
 the fsuid or the euid to be the mount owner.  ruid just looks wrong.
 The fsuid is a special case of the effective uid.  Which is who
 we should perform operations as.  Unless I'm just confused.

Definitely not euid.  Euid is the one which is effective, i.e. it will
basically always be zero for a privileged mount().

Ruid is the one which is returned by getuid().  If a user execs a
suid-root program, then ruid will be the id of the user, while euid
will be zero.


  Finally I'm pretty certain the capability we should care about in
  this context is CAP_SETUID.  Instead of CAP_SYS_ADMIN.
  
  If we have CAP_SETUID we can become which ever user owns the mount,
  and the root user in a container needs this, so he can run login
  programs.  So changing the appropriate super user checks from
  CAP_SYS_ADMIN to CAP_SETUID I think is the right thing todo.
 
  That's a flawed logic.  If you want to mount as a specific user, and
  you have CAP_SETUID, then just use set*uid() and then mount().
 
 Totally agreed for mount.
 
  Changing the capability check for mount() would break the userspace
  ABI.
 
 Sorry I apparently wasn't clear.  CAP_SETUID should be the capability
 check for umount.

The argument applies to umount as well.  For compatibility, we _need_
the CAP_SYS_ADMIN check.  And if program has CAP_SETUID but not
CAP_SYS_ADMIN, it can just set the id to the mount owner before
calling umount.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 1/8] add user mounts to the kernel

2007-04-22 Thread Miklos Szeredi

   +if (mnt-mnt_flags  MNT_USER)
   +seq_printf(m, ,user=%i, mnt-mnt_uid);
  How about making the test if (mnt-mnt_user != root_user)
 
  We don't want to treat root_user special.  That's what capabilities
  were invented for.
 
 For the print statement?  What ever it is minor.

It is a user interface, not a print statement.  Your suggested change
would be vetoed by any number of people.  

So either we have all mounts having owners, AND have /proc/mounts add
user=0 to all mounts.  While I don't _think_ this would actually
break userspace, it would definitely make people complain.

The other choice is what the current patchset does: is to have
legacy mounts without owners, and new generation mounts with
owners having user=UID in /proc/mounts, regardless of the value of
UID.

 So I want to minimize the changes needed to existing programs.
 Now if all we have to do is specify MS_SETUSER when root a
 user with CAP_SETUID is setting up a mount as a user other
 then himself then I don't much care.  If we have to call MS_SETUSER
 as unprivileged users

You don't.  Unprivileged mounts _imply_ MS_SETUSER.

   +
   +uid_t mnt_uid;  /* owner of the mount */
  
  Can we please make this a user struct.   That requires a bit of
  reference counting but it has uid namespace benefits as well
  as making it easy to implement per user mount rlimits.
 
  OK, can you ellaborate, what the uid namespace benifits are?
 
 In the uid namespace the comparison is simpler as are the propagations
 rules.  Basically if you use a struct user you will never need to
 care about a uid namespace.  If you don't we will have to tear through
 this code another time.

Well, OK.  I'll do the user_struct thing then.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 3/8] account user mounts

2007-04-22 Thread Miklos Szeredi

   From: Miklos Szeredi [EMAIL PROTECTED]
  
   Add sysctl variables for accounting and limiting the number of user
   mounts.
  
   The maximum number of user mounts is set to 1024 by default.  This
   won't in itself enable user mounts, setting a mount to be owned by a
   user is first needed
  
  Since each mount has a user can we just make this a per user rlimit?
  
  If we are going to implement a sysctl at this point I think it should
  be a global limit that doesn't care if who you are.  Even root can
  have recursive mounts that attempt to get out of control.
 
  Recursive bind mounts are done carefully enough, so they don't get out
  of control.
 
  Recursive mount propagations can get out of control.  But root can
  shoot itself in the foot any number of ways, and it's not for the
  kernel to police that.
 
 Yes.  It is.
 
 This is mostly about removing special cases.
 
 We routinely have limits on resources that are global and apply
 to root along with every one else.  Root can change them but
 they still apply to root.  Things like the number of inodes
 in the system or the total number of files.

There's no max_inodes any more.  As for max_files:

get_empty_filp():

/*
 * Privileged users can go above max_files
 */
if (get_nr_files() = files_stat.max_files  !capable(CAP_SYS_ADMIN)) {

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 7/8] allow unprivileged mounts

2007-04-22 Thread Miklos Szeredi

  On Apr 21 2007 10:57, Eric W. Biederman wrote:
 
  tmpfs!
 
 tmpfs is a possible problem because it can consume lots of ram/swap. 
 Which is why it has limits on the amount of space it can consume. 
 
  Users can gobble up all RAM and swap already today. (Unless they are
  confined into an rlimit, which, in most systems, is not the case.)
  And in case /dev/shm exists, they can already fill it without running
  into an rlimit early.
 
 There are systems that care about rlimits and there is strong intersection
 between caring about rlimits and user mounts.  Although I do agree that
 it looks like we have gotten lazy with the default mount options for
 /dev/shm.
 
 Going a little farther any filesystem that is safe to put on a usb
 stick and mount automatically should ultimately be safe for unprivileged
 mounts as well.

Actually, that's not as simple.

For the usb stick or cdrom you need physical access to the machine.
And once you have that you basically have full control over the system
anyway.

But with block filesystems, the user would still need access to the
device (currently kernel doesn't even check this I think).

So it may make sense to mark all block based filesystems safe, and
defer permission checking to user access on the block device.

But the safe flag is still needed for filesystems, which don't have
such an additional access checking, such as network filesystems.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 1/8] add user mounts to the kernel

2007-04-22 Thread Miklos Szeredi

   +
   +uid_t mnt_uid;  /* owner of the mount */
  
  Can we please make this a user struct.   That requires a bit of
  reference counting but it has uid namespace benefits as well
  as making it easy to implement per user mount rlimits.
 
  OK, can you ellaborate, what the uid namespace benifits are?
 
 In the uid namespace the comparison is simpler as are the propagations
 rules.  Basically if you use a struct user you will never need to
 care about a uid namespace.

I tried to implement it but got stuck on this: fsuid doesn't have a
user_struct in task_struct (yet), so we'd now have to convert
current-fsuid to a user_struct.  This can be done with alloc_uid(),
but this can fail, bringing in extra error handling complexity.

Also we'd have to compare current-fsuid with a user_struct, which we
don't yet know how will actually be done in the future.

So it seems, we still have to care about the uid namespace, at least
if fsuid is preferred to ruid.

Anyway, here's a patch fixing the other things you brought up, and
which I agree with.  Does this look OK?

Thanks,
Miklos


Index: linux/fs/namespace.c
===
--- linux.orig/fs/namespace.c   2007-04-22 17:48:18.0 +0200
+++ linux/fs/namespace.c2007-04-22 18:19:51.0 +0200
@@ -252,10 +252,12 @@ static int reserve_user_mount(void)
 static void __set_mnt_user(struct vfsmount *mnt)
 {
BUG_ON(mnt-mnt_flags  MNT_USER);
-   mnt-mnt_uid = current-uid;
+   mnt-mnt_uid = current-fsuid;
mnt-mnt_flags |= MNT_USER;
-   if (!capable(CAP_SYS_ADMIN))
-   mnt-mnt_flags |= MNT_NOSUID | MNT_NODEV;
+   if (!capable(CAP_SETUID))
+   mnt-mnt_flags |= MNT_NOSUID;
+   if (!capable(CAP_MKNOD))
+   mnt-mnt_flags |= MNT_NODEV;
 }
 
 static void set_mnt_user(struct vfsmount *mnt)
@@ -725,10 +727,10 @@ static bool permit_umount(struct vfsmoun
if (!(mnt-mnt_flags  MNT_USER))
return false;
 
-   if (flags  MNT_FORCE)
+   if ((flags  MNT_FORCE)  !(mnt-mnt_sb-s_type-fs_flags  FS_SAFE))
return false;
 
-   return mnt-mnt_uid == current-uid;
+   return mnt-mnt_uid == current-fsuid;
 }
 
 /*
@@ -792,13 +794,13 @@ static bool permit_mount(struct nameidat
if (type  !(type-fs_flags  FS_SAFE))
return false;
 
-   if (!S_ISDIR(inode-i_mode)  !S_ISREG(inode-i_mode))
+   if (S_ISLNK(inode-i_mode))
return false;
 
if (!(nd-mnt-mnt_flags  MNT_USER))
return false;
 
-   if (nd-mnt-mnt_uid != current-uid)
+   if (nd-mnt-mnt_uid != current-fsuid)
return false;
 
*flags |= MS_SETUSER;
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 10/10] mm: per device dirty threshold

2007-04-23 Thread Miklos Szeredi

The other deadlock, in throttle_vm_writeout() is still to be solved.
   
   Let's go back to the original changelog:
   
   Author: marcelo.tosatti marcelo.tosatti
   Date:   Tue Mar 8 17:25:19 2005 +
   
   [PATCH] vm: pageout throttling
   
   With silly pageout testcases it is possible to place huge amounts of 
   memory
   under I/O.  With a large request queue (CFQ uses 8192 requests) it is
   possible to place _all_ memory under I/O at the same time.
   
   This means that all memory is pinned and unreclaimable and the VM gets
   upset and goes oom.
   
   The patch limits the amount of memory which is under pageout writeout 
   to be
   a little more than the amount of memory at which balance_dirty_pages()
   callers will synchronously throttle.
   
   This means that heavy pageout activity can starve heavy writeback 
   activity
   completely, but heavy writeback activity will not cause starvation of
   pageout.  Because we don't want a simple `dd' to be causing excessive
   latencies in page reclaim.
   
   Signed-off-by: Andrew Morton [EMAIL PROTECTED]
   Signed-off-by: Linus Torvalds [EMAIL PROTECTED]
   
   (A good one!  I wrote it ;))
   
   
   I believe that the combination of dirty-page-tracking and its calls to
   balance_dirty_pages() mean that we can now never get more than dirty_ratio
   of memory into the dirty-or-writeback condition.
   
   The vm scanner can convert dirty pages into clean, under-writeback pages,
   but it cannot increase the total of dirty+writeback.
  
  What about swapout?  That can increase the number of writeback pages,
  without decreasing the number of dirty pages, no?
 
 Could we not solve that by enabling cap_account_writeback on
 swapper_space, and thereby account swap writeback pages. Then the VM
 knows it has outstanding IO and need not panic.

Hmm, I'm not sure that would be right, because then those writeback
pages would be accounted twice: once for swapper_space, and once for
the real device.

So there's a condition, when lots of anonymous pages are turned into
swap-cache writeback pages, and we should somehow throttle this, because

 This means that all memory is pinned and unreclaimable and the VM gets
 upset and goes oom.

although, it's not quite clear in my mind, how the VM gets upset about
this.

One way to throttle just the swapout activity, is to do the per-bdi
accounting on swapper_space, and limit the number of writeback pages
to e.g. the global threshold + 10%, which is basically what
throttle_vm_writeout() currently does, only now it does it
indiscriminately, and not just on swap writeback pages.

Does this make any sense?

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 10/10] mm: per device dirty threshold

2007-04-24 Thread Miklos Szeredi

  This is probably a
   reasonable thing to do but it doesn't feel like the right place.  I
   think get_dirty_limits should return the raw threshold, and
   balance_dirty_pages should do both tests - the bdi-local test and the
   system-wide test.
 
 Ok, that makes sense I guess.

Well, my narrow minded world view says it's not such a good idea,
because it would again introduce the deadlock scenario, we're trying
to avoid.

In a sense allowing a queue to go over the global limit just a little
bit is a good thing.  Actually the very original code does that: if
writeback was started for write_chunk number of pages, then we allow
ratelimit (8) _new_ pages to be dirtied, effectively ignoring the
global limit.

That's why I've been saying, that the current code is so unfair: if
there are lots of dirty pages to be written back to a particular
device, then balance_dirty_pages() allows the dirty producer to make
even more pages dirty, but if there are _no_ dirty pages for a device,
and we are over the limit, then that dirty producer is allowed
absolutely no new dirty pages until the global counts subside.

I'm still not quite sure what purpose the above soft limiting
serves.  It seems to just give advantage to writers, which managed to
accumulate lots of dirty pages, and then can convert that into even
more dirtyings.

Would it make sense to remove this behavior, and ensure that
balance_dirty_pages() doesn't return until the per-queue limits have
been complied with?

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 10/10] mm: per device dirty threshold

2007-04-24 Thread Miklos Szeredi

This is probably a
 reasonable thing to do but it doesn't feel like the right place.  I
 think get_dirty_limits should return the raw threshold, and
 balance_dirty_pages should do both tests - the bdi-local test and the
 system-wide test.
   
   Ok, that makes sense I guess.
  
  Well, my narrow minded world view says it's not such a good idea,
  because it would again introduce the deadlock scenario, we're trying
  to avoid.
 
 I was only referring to the placement of the clipping; and exactly where
 that happens does not affect the deadlock.

OK.

  In a sense allowing a queue to go over the global limit just a little
  bit is a good thing.  Actually the very original code does that: if
  writeback was started for write_chunk number of pages, then we allow
  ratelimit (8) _new_ pages to be dirtied, effectively ignoring the
  global limit.
 
 It might be time to get rid of that rate-limiting.
 balance_dirty_pages()'s fast path is not nearly as heavy as it used to
 be. All these fancy counter systems have removed quite a bit of
 iteration from there.

Hmm.  The rate limiting probably makes lots of sense for
dirty_exceeded==0, when ratelimit can be a nice large value.

For dirty_exceeded==1 it may make sense to disable ratelimiting, OTOH
having a granularity of 8 pages probably doesn't matter, because of
the granularity of the percpu counter is usually larger (except on UP).

  That's why I've been saying, that the current code is so unfair: if
  there are lots of dirty pages to be written back to a particular
  device, then balance_dirty_pages() allows the dirty producer to make
  even more pages dirty, but if there are _no_ dirty pages for a device,
  and we are over the limit, then that dirty producer is allowed
  absolutely no new dirty pages until the global counts subside.
 
 Well, that got fixed on a per device basis with this patch, it is still
 true for multiple tasks writing to the same device.

Yes, this is the part of this patchset I'm personally interested in ;)

  I'm still not quite sure what purpose the above soft limiting
  serves.  It seems to just give advantage to writers, which managed to
  accumulate lots of dirty pages, and then can convert that into even
  more dirtyings.
 
 The queues only limit the actual in-flight writeback pages,
 balance_dirty_pages() considers all pages that might become writeback as
 well as those that are.
 
  Would it make sense to remove this behavior, and ensure that
  balance_dirty_pages() doesn't return until the per-queue limits have
  been complied with?
 
 I don't think that will help, balance_dirty_pages drives the queues.
 That is, it converts pages from mere dirty to writeback.

Yes.  But current logic says, that if you convert write_chunk dirty
to writeback, you are allowed to dirty ratelimit more. 

D: number of dirty pages
W: number of writeback pages
L: global limit
C: write_chunk = ratelimit_pages * 1.5
R: ratelimit

If D+W = L, then R = 8

Let's assume, that D == L and W == 0.  And that all of the dirty pages
belong to a single device.  Also for simplicity, lets assume an
infinite length queue, and a slow device.

Then while converting the dirty pages to writeback, D / C * R new
dirty pages can be created.  So when all existing dirty have been
converted:

  D = L / C * R
  W = L

  D + W = L * (1 + R / C)

So we see, that we're now even more above the limit than before the
conversion.  This means, that we starve writers to other devices,
which don't have as many dirty pages, because until the slow device
doesn't finish these writes they will not get to do anything.

Your patch helps this in that if the other writers have an empty queue
and no dirty, they will be allowed to slowly start writing.  But they
will not gain their full share until the slow dirty-hog goes below the
global limit, which may take some time.

So I think the logical thing to do, is if the dirty-hog is over it's
queue limit, don't let it dirty any more until it's dirty+writeback go
below the limit.  That allowes other devices to more quickly gain
their share of dirty pages.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 10/10] mm: per device dirty threshold

2007-04-24 Thread Miklos Szeredi

 Ahh, now I see; I had totally blocked out these few lines:
 
   pages_written += write_chunk - wbc.nr_to_write;
   if (pages_written = write_chunk)
   break;  /* We've done our duty */
 
 yeah, those look dubious indeed... And reading back Neil's comments, I
 think he agrees.
 
 Shall we just kill those?

I think we should.

Athough I'm a little afraid, that Akpm will tell me again, that I'm a
stupid git, and that those lines are in fact vitally important ;)

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 10/10] mm: per device dirty threshold

2007-04-24 Thread Miklos Szeredi

Ahh, now I see; I had totally blocked out these few lines:

pages_written += write_chunk - wbc.nr_to_write;
if (pages_written = write_chunk)
break;  /* We've done our duty 
*/

yeah, those look dubious indeed... And reading back Neil's comments, I
think he agrees.

Shall we just kill those?
   
   I think we should.
   
   Athough I'm a little afraid, that Akpm will tell me again, that I'm a
   stupid git, and that those lines are in fact vitally important ;)
   
  
  It depends what they're replaced with.
  
  That code is there, iirc, to prevent a process from getting stuck in
  balance_dirty_pages() forever due to the dirtying activity of other
  processes.
  
  hm, we ask the process to write write_chunk pages each go around the loop.
  So if it wrote write-chunk/2 pages on the first pass it might end up writing
  write_chunk*1.5 pages total.  I guess that's rare and doesn't matter much
  if it does happen - the upper bound is write_chunk*2-1, I think.
 
 Right, but I think the problem is that its dirty - writeback, not dirty
 - writeback completed.
 
 Ie. they don't guarantee progress, it could be that the total
 nr_reclaimable + nr_writeback will steadily increase due to this break.
 
 How about ensuring that vm_writeout_total increases least
 2*sync_writeback_pages() during our stay in balance_dirty_pages(). That
 way we have the guarantee that more pages get written out than can be
 dirtied.

No, because that's a global counter, which many writers could be
looking at.

We'd need a per-task writeout counter, but when finishing the write we
don't know anymore which task it was performed for.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 10/10] mm: per device dirty threshold

2007-04-24 Thread Miklos Szeredi

 On Tue, 24 Apr 2007 12:12:18 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote:
 
  On Tue, 2007-04-24 at 03:00 -0700, Andrew Morton wrote:
   On Tue, 24 Apr 2007 11:47:20 +0200 Miklos Szeredi [EMAIL PROTECTED] 
   wrote:
   
 Ahh, now I see; I had totally blocked out these few lines:
 
   pages_written += write_chunk - wbc.nr_to_write;
   if (pages_written = write_chunk)
   break;  /* We've done our duty 
 */
 
 yeah, those look dubious indeed... And reading back Neil's comments, I
 think he agrees.
 
 Shall we just kill those?

I think we should.

Athough I'm a little afraid, that Akpm will tell me again, that I'm a
stupid git, and that those lines are in fact vitally important ;)

   
   It depends what they're replaced with.
   
   That code is there, iirc, to prevent a process from getting stuck in
   balance_dirty_pages() forever due to the dirtying activity of other
   processes.
   
   hm, we ask the process to write write_chunk pages each go around the loop.
   So if it wrote write-chunk/2 pages on the first pass it might end up 
   writing
   write_chunk*1.5 pages total.  I guess that's rare and doesn't matter much
   if it does happen - the upper bound is write_chunk*2-1, I think.
  
  Right, but I think the problem is that its dirty - writeback, not dirty
  - writeback completed.
  
  Ie. they don't guarantee progress, it could be that the total
  nr_reclaimable + nr_writeback will steadily increase due to this break.
 
 Don't think so.  We call balance_dirty_pages() once per ratelimit_pages
 dirtyings and when we get there, we write 1.5*ratelimit_pages pages.

No, we _start_ writeback for 1.5*ratelimit_pages pages, but do not
wait for those writebacks to finish.

So for a slow device and a fast writer, dirty+writeback can indeed
increase beyond the dirty threshold.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 10/10] mm: per device dirty threshold

2007-04-24 Thread Miklos Szeredi

  No, we _start_ writeback for 1.5*ratelimit_pages pages, but do not
  wait for those writebacks to finish.
  
  So for a slow device and a fast writer, dirty+writeback can indeed
  increase beyond the dirty threshold.
  
 
 Nope, try it.
 
 If a process dirties 1000 pages it'll then go into balance_dirty_pages()
 and start writeback against 1,500 pages.  When we hit dirty_ratio that
 process will be required to write back 1,500 pages for each eight pages
 which it dirtied.  We'll quickly reach the stage where there are no longer
 1,500 pages to be written back and the process will block in
 balance_dirty_pages() until the dirty+writeback level subsides.

OK.  I was confused by this:

static long ratelimit_pages = 32;

and didn't realize, that that 32 is totally irrelevant.

So I'm still right, that for N dirty pages, the writer is allowed to
dirty N/1500*8 more dirty pages, but I agree, that this isn't really
an issue.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 0/8] mount ownership and unprivileged mount syscall (v4)

2007-04-25 Thread Miklos Szeredi

  The following extra security measures are taken for unprivileged
  mounts:
  
   - usermounts are limited by a sysctl tunable
   - force nosuid,nodev mount options on the created mount
 
  The original userspace user= solution also implies the noexec
  option by default (you can override the default by exec option).

Unlike nosuid and nodev, I don't think noexec has real security
benefits.

  It means the kernel based solution is not fully compatible ;-(

Oh, I don't think that matters.  For traditional /etc/fstab based user
mounts, mount(8) will have to remain suid-root, the kernel can't
replace the fstab check.

In fact the latest patches don't even support these legacy user
mounts too well: setting the owner of a mount gives not only umount
privilege, but the ability to submount.  This is not necessarily a
good thing for these kinds of user mounts.

We could add a new nosubmount or similar flag, to prevent
submounting, but that again would go against the simplicity of the
current approach, so I'm not sure it's worth it.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch] unprivileged mounts update

2007-04-25 Thread Miklos Szeredi

From: Miklos Szeredi [EMAIL PROTECTED]

- refine adding nosuid and nodev flags for unprivileged mounts:
o add nosuid, only if mounter doesn't have CAP_SETUID capability
o add nodev, only if mounter doesn't have CAP_MKNOD capability

- allow unprivileged forced unmount, but only for FS_SAFE filesystems

- allow mounting over special files, but not symlinks

- for mounting and umounting check fsuid instead of ruid

Thanks to everyone for the comments, with special thanks to Serge
Hallyn and Eric Biederman.

For testing the new functionality provided by this patchset a simple
tool similar in syntax to mount(8) is available from:

  http://www.kernel.org/pub/linux/kernel/people/mszeredi/mmount

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---

Index: linux/fs/namespace.c
===
--- linux.orig/fs/namespace.c   2007-04-22 17:48:18.0 +0200
+++ linux/fs/namespace.c2007-04-22 18:19:51.0 +0200
@@ -252,10 +252,12 @@ static int reserve_user_mount(void)
 static void __set_mnt_user(struct vfsmount *mnt)
 {
BUG_ON(mnt-mnt_flags  MNT_USER);
-   mnt-mnt_uid = current-uid;
+   mnt-mnt_uid = current-fsuid;
mnt-mnt_flags |= MNT_USER;
-   if (!capable(CAP_SYS_ADMIN))
-   mnt-mnt_flags |= MNT_NOSUID | MNT_NODEV;
+   if (!capable(CAP_SETUID))
+   mnt-mnt_flags |= MNT_NOSUID;
+   if (!capable(CAP_MKNOD))
+   mnt-mnt_flags |= MNT_NODEV;
 }
 
 static void set_mnt_user(struct vfsmount *mnt)
@@ -725,10 +727,10 @@ static bool permit_umount(struct vfsmoun
if (!(mnt-mnt_flags  MNT_USER))
return false;
 
-   if (flags  MNT_FORCE)
+   if ((flags  MNT_FORCE)  !(mnt-mnt_sb-s_type-fs_flags  FS_SAFE))
return false;
 
-   return mnt-mnt_uid == current-uid;
+   return mnt-mnt_uid == current-fsuid;
 }
 
 /*
@@ -792,13 +794,13 @@ static bool permit_mount(struct nameidat
if (type  !(type-fs_flags  FS_SAFE))
return false;
 
-   if (!S_ISDIR(inode-i_mode)  !S_ISREG(inode-i_mode))
+   if (S_ISLNK(inode-i_mode))
return false;
 
if (!(nd-mnt-mnt_flags  MNT_USER))
return false;
 
-   if (nd-mnt-mnt_uid != current-uid)
+   if (nd-mnt-mnt_uid != current-fsuid)
return false;
 
*flags |= MS_SETUSER;
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 3/8] per backing_dev dirty and writeback page accounting

2007-03-12 Thread Miklos Szeredi

  I have no idea how serious the scalability problems with this are.  If
  they are serious, different solutions can probably be found for the
  above, but this is certainly the simplest.
 
 Atomic operations to a single per-backing device from all CPUs at once?
 That's a pretty serious scalability issue and it will cause a major
 performance regression for XFS.

OK.  How about just accounting writeback pages?  That should be much
less of a problem, since normally writeback is started from
pdflush/kupdate in large batches without any concurrency.

Or is it possible to export the state of the device queue to mm?
E.g. could balance_dirty_pages() query the backing dev if there are
any outstanding write requests?

 I'd call this a showstopper right now - maybe you need to look at
 something like the ZVC code that Christoph Lameter wrote, perhaps?

That's rather a heavyweight approach for this I think.

The only info balance_dirty_pages() really needs is whether there are
any dirty+writeback bound for the backing dev or not.

It knows about the diry pages, since it calls writeback_inodes() which
scans the dirty pages for this backing dev looking for ones to write
out.  If after returning from writeback_inodes() wbc-nr_to_write
didn't decrease and wbc-pages_skipped is zero then we know that there
are no more dirty pages for the device.  Or at least there are no
dirty pages which aren't already under writeback.

Thanks,
Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 3/8] per backing_dev dirty and writeback page accounting

2007-03-12 Thread Miklos Szeredi

I'll try to explain the reason for the deadlock first.

 IIUC, your problem is that there's another bdi that holds all the
 dirty pages, and this throttle loop never flushes pages from that
 other bdi and we sleep instead. It seems to me that the fundamental
 problem is that to clean the pages we need to flush both bdi's, not
 just the bdi we are directly dirtying.

This is what happens:

write fault on upper filesystem
  balance_dirty_pages
submit write requests
  loop ...
--- fuse IPC ---
[fuse loopback fs thread 1]
read request
sys_write
  mutex_lock(i_mutex)
  ...
 balance_dirty_pages
submit write requests
loop ... write requests completed ... dirty still over limit ... 
... loop forever

[fuse loopback fs thread 1]
read request
sys_write
  mute_lock(i_mutex) blocks

So the queue for the upper filesystem is full.  The queue for the
lower filesystem is empty.  There are no dirty pages in the lower
filesystem.

So kicking pdflush for the lower filesystem doesn't help, there's
nothing to do.  balance_dirty_pages() for the lower filesystem should
just realize that there's nothing to do and return, and then there
would be progress.

So there's there's really no need to do any accounting, just some
logic to determine that a backing dev is nearly or completely
quiescent.

And getting out of this tight situation doesn't have to be efficient.
This is probably a very rare corner case, that almost never happens in
real life, only with aggressive test tools like bash_shared_mapping.

  OK.  How about just accounting writeback pages?  That should be much
  less of a problem, since normally writeback is started from
  pdflush/kupdate in large batches without any concurrency.
 
 Except when you are throttling you bounce the cacheline around
 each cpu as it triggers foreground writeback.

Yeah, we'd loose a bit of CPU, but not any write performance, since it
is being throttled back anyway.

  Or is it possible to export the state of the device queue to mm?
  E.g. could balance_dirty_pages() query the backing dev if there are
  any outstanding write requests?
 
 Not directly - writeback_in_progress(bdi) is a coarse measure
 indicating pdflush is active on this bdi, which implies outstanding
 write requests).

Hmm, not quite what I need.

   I'd call this a showstopper right now - maybe you need to look at
   something like the ZVC code that Christoph Lameter wrote, perhaps?
  
  That's rather a heavyweight approach for this I think.
 
 But if you want to use per-page accounting, you are going to
 need a per-cpu or per-zone set of counters on each bdi to do
 this without introducing regressions.

Yes, this is an option, but I hope for a simpler solution.

Thanks,
Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 3/8] per backing_dev dirty and writeback page accounting

2007-03-13 Thread Miklos Szeredi

   IIUC, your problem is that there's another bdi that holds all the
   dirty pages, and this throttle loop never flushes pages from that
   other bdi and we sleep instead. It seems to me that the fundamental
   problem is that to clean the pages we need to flush both bdi's, not
   just the bdi we are directly dirtying.
  
  This is what happens:
  
  write fault on upper filesystem
balance_dirty_pages
  submit write requests
loop ...
 
 Isn't this loop transferring the dirty state from the upper
 filesystem to the lower filesystem?

What this loop is doing is putting write requests in the request
queue, and in so doing transforming page state from dirty to
writeback.

 What I don't see here is how the pages on this filesystem are not
 getting cleaned if the lower filesystem is being flushed properly.

Because the lower filesystem writes back one request, but then gets
stuck in balance_dirty_pages before returning.  So the write request
is never completed.

The problem is that balance_dirty_pages is waiting for the condition
that the global number of dirty+writeback pages goes below the
threshold.  But this condition can only be satisfied if
balance_dirty_pages() returns.

 I'm probably missing something big and obvious, but I'm not
 familiar with the exact workings of FUSE so please excuse my
 ignorance
 
  --- fuse IPC ---
  [fuse loopback fs thread 1]
 
 This is the lower filesystem? Or a callback thread for
 doing the write requests to the lower filesystem?

This is the fuse daemon.  It's a normal process that reads requests
from /dev/fuse, serves these requests then writes the reply back onto
/dev/fuse.  It is usually multithreaded, so it can serve many requests
in parallel.

The loopback filesystem serves the requests by issuing the relevant
filesystem syscalls on the underlying fs.

  read request
  sys_write
mutex_lock(i_mutex)
...
   balance_dirty_pages
  submit write requests
  loop ... write requests completed ... dirty still over limit ... 
  ... loop forever
 
 Hmmm - the situation in balance_dirty_pages() after an attempt
 to writeback_inodes(wbc) that has written nothing because there
 is nothing to write would be:
 
   wbc-nr_write == write_chunk 
   wbc-pages_skipped == 0 
   wbc-encountered_congestion == 0 
   !bdi_congested(wbc-bdi)
 
 What happens if you make that an exit condition to the loop?

That's almost right.  The only problem is that even if there's no
congestion, the device queue can be holding a great amount of yet
unwritten pages.  So exiting on this condition would mean, that
dirty+writeback could go way over the threshold.

How much this would be a problem?  I don't know, I guess it depends on
many things: how many queues, how many requests per queue, how many
bytes per request.

 Or alternatively, adding another bit to the wbc structure to
 say there was nothing to do and setting that if we find
 list_empty(sb-s_dirty) when trying to flush dirty inodes.
 
 [ FWIW, this may also solve another problem of fast block devices
 being throttled incorrectly when a slow block dev is consuming
 all the dirty pages... ]

There may be a patch floating around, which I think basically does
this, but only as long as the dirty+writeback are over a soft limit,
but under the hard limit.

When over the the hard limit, balance_dirty_pages still loops until
dirty+writeback go below the threshold.

Thanks,
Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 3/8] per backing_dev dirty and writeback page accounting

2007-03-14 Thread Miklos Szeredi

 Only if the queue depth is not bound. Queue depths are bound and so
 the distance we can go over the threshold is limited.  This is the
 fundamental principle on which the throttling is based.
 
 Hence, if the queue is not full, then we will have either written
 dirty pages to it (i.e wbc-nr_write != write_chunk so we will throttle
 or continue normally if write_chunk was written) or we have no more
 dirty pages left.
 
 Having no dirty pages left on the bdi and it not being congested
 means we effectively have a clean, idle bdi. We should not be trying
 to throttle writeback here - we can't do anything to improve the
 situation by continuing to try to do writeback on this bdi, so we
 may as well give up and let the writer continue. Once we have dirty
 pages on the bdi, we'll get throttled appropriately.

OK, you convinced me.

How about this patch?  I introduced a new wbc counter, that sums the
number of dirty pages encountered, including ones already under
writeback.

Dave, big thanks for your insights.

Miklos

Index: linux/include/linux/writeback.h
===
--- linux.orig/include/linux/writeback.h2007-03-14 22:43:42.0 
+0100
+++ linux/include/linux/writeback.h 2007-03-14 22:58:56.0 +0100
@@ -44,6 +44,7 @@ struct writeback_control {
long nr_to_write;   /* Write this many pages, and decrement
   this for each page written */
long pages_skipped; /* Pages which were not written */
+   long nr_dirty;  /* Number of dirty pages encountered */
 
/*
 * For a_ops-writepages(): is start or end are non-zero then this is
Index: linux/mm/page-writeback.c
===
--- linux.orig/mm/page-writeback.c  2007-03-14 22:41:01.0 +0100
+++ linux/mm/page-writeback.c   2007-03-14 23:00:20.0 +0100
@@ -220,6 +220,17 @@ static void balance_dirty_pages(struct a
pages_written += write_chunk - wbc.nr_to_write;
if (pages_written = write_chunk)
break;  /* We've done our duty */
+
+   /*
+* If just a few dirty pages were encountered, and
+* the queue is not congested, then allow this dirty
+* producer to continue.  This resolves the deadlock
+* that happens when one filesystem writes back data
+* through another.  It should also help when a slow
+* device is completely blocking other writes.
+*/
+   if (wbc.nr_dirty  8  !bdi_write_congested(bdi))
+   break;
}
congestion_wait(WRITE, HZ/10);
}
@@ -612,6 +623,7 @@ retry:
  min(end - index, 
(pgoff_t)PAGEVEC_SIZE-1) + 1))) {
unsigned i;
 
+   wbc-nr_dirty += nr_pages;
scanned = 1;
for (i = 0; i  nr_pages; i++) {
struct page *page = pvec.pages[i];
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] fix quadratic behavior of shrink_dcache_parent()

2007-02-11 Thread Miklos Szeredi

  Unfortunately this patch doesn't completely solve this problem, since
  the system will still be hosed due to all memory being used up by
  dentries.  And I bet the OOM killer won't find the real target (du)
  but will kill anything before that.
 
  So the second part of the problem is to somehow limit the number of
  dentries used.  Not easy...
 
 At least with this patch (if I am reading it correctly), once the offending
 culprit is identified and the programs are killed off, everything will go
 back to normal without a reboot.

Yes, it's basically a normal OOM situation.  What makes it worse
than usual, is that it's _kernel_ memory being used up which is not
attributed to any process.

So from the OOM killer's point of view it looks like none of the
userspace programs are responsible for the OOM, and it will just
select targets randomly.

By the time it has come to the actual culprits (the filesystem daemon
and 'du') it will have killed almost everything useful.

So it's definitely better than without the patch, but still has room
for improvement.

Thanks,
Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH] add filesystem subtype support

2007-02-12 Thread Miklos Szeredi

There's a slight problem with filesystem type representation in fuse
based filesystems.

From the kernel's view, there are just two filesystem types: fuse and
fuseblk.  From the user's view there are lots of different filesystem
types.  The user is not even much concerned if the filesystem is fuse
based or not.  So there's a conflict of interest in how this should be
represented in fstab, mtab and /proc/mounts.

The current scheme is to encode the real filesystem type in the mount
source.  So an sshfs mount looks like this:

  [EMAIL PROTECTED]:/   /mnt/serverfuse   rw,nosuid,nodev,...

This url-ish syntax works OK for sshfs and similar filesystems.
However for block device based filesystems (ntfs-3g, zfs) it doesn't
work, since the kernel expects the mount source to be a real device
name.

A possibly better scheme would be to encode the real type in the type
field as type.subtype.  So fuse mounts would look like this:

  /dev/hda1   /mnt/windows   fuseblk.ntfs-3g   rw,...
  [EMAIL PROTECTED]:/   /mnt/serverfuse.sshfsrw,nosuid,nodev,...

This patch adds the necessary code to the kernel so that this can be
correctly displayed in /proc/mounts.

---
Index: linux/fs/filesystems.c
===
--- linux.orig/fs/filesystems.c 2007-02-12 12:42:55.0 +0100
+++ linux/fs/filesystems.c  2007-02-12 12:43:00.0 +0100
@@ -42,11 +42,12 @@ void put_filesystem(struct file_system_t
module_put(fs-owner);
 }
 
-static struct file_system_type **find_filesystem(const char *name)
+static struct file_system_type **find_filesystem(const char *name, unsigned 
len)
 {
struct file_system_type **p;
for (p=file_systems; *p; p=(*p)-next)
-   if (strcmp((*p)-name,name) == 0)
+   if (strlen((*p)-name) == len 
+   strncmp((*p)-name, name, len) == 0)
break;
return p;
 }
@@ -69,11 +70,12 @@ int register_filesystem(struct file_syst
int res = 0;
struct file_system_type ** p;
 
+   BUG_ON(strchr(fs-name, '.'));
if (fs-next)
return -EBUSY;
INIT_LIST_HEAD(fs-fs_supers);
write_lock(file_systems_lock);
-   p = find_filesystem(fs-name);
+   p = find_filesystem(fs-name, strlen(fs-name));
if (*p)
res = -EBUSY;
else
@@ -216,19 +218,26 @@ int get_filesystem_list(char * buf)
 struct file_system_type *get_fs_type(const char *name)
 {
struct file_system_type *fs;
+   const char *dot = strchr(name, '.');
+   unsigned len = dot ? dot - name : strlen(name);
 
read_lock(file_systems_lock);
-   fs = *(find_filesystem(name));
+   fs = *(find_filesystem(name, len));
if (fs  !try_module_get(fs-owner))
fs = NULL;
read_unlock(file_systems_lock);
-   if (!fs  (request_module(%s, name) == 0)) {
+   if (!fs  (request_module(%.*s, len, name) == 0)) {
read_lock(file_systems_lock);
-   fs = *(find_filesystem(name));
+   fs = *(find_filesystem(name, len));
if (fs  !try_module_get(fs-owner))
fs = NULL;
read_unlock(file_systems_lock);
}
+
+   if (dot  fs  !(fs-fs_flags  FS_HAS_SUBTYPE)) {
+   put_filesystem(fs);
+   fs = NULL;
+   }
return fs;
 }
 
Index: linux/fs/fuse/inode.c
===
--- linux.orig/fs/fuse/inode.c  2007-02-12 12:42:55.0 +0100
+++ linux/fs/fuse/inode.c   2007-02-12 12:43:00.0 +0100
@@ -634,6 +634,7 @@ static int fuse_get_sb(struct file_syste
 static struct file_system_type fuse_fs_type = {
.owner  = THIS_MODULE,
.name   = fuse,
+   .fs_flags   = FS_HAS_SUBTYPE,
.get_sb = fuse_get_sb,
.kill_sb= kill_anon_super,
 };
@@ -650,6 +651,7 @@ static int fuse_get_sb_blk(struct file_s
 static struct file_system_type fuseblk_fs_type = {
.owner  = THIS_MODULE,
.name   = fuseblk,
+   .fs_flags   = FS_HAS_SUBTYPE,
.get_sb = fuse_get_sb_blk,
.kill_sb= kill_block_super,
.fs_flags   = FS_REQUIRES_DEV,
Index: linux/fs/namespace.c
===
--- linux.orig/fs/namespace.c   2007-02-12 12:42:55.0 +0100
+++ linux/fs/namespace.c2007-02-12 12:43:00.0 +0100
@@ -378,6 +378,10 @@ static int show_vfsmnt(struct seq_file *
seq_path(m, mnt, mnt-mnt_root,  \t\n\\);
seq_putc(m, ' ');
mangle(m, mnt-mnt_sb-s_type-name);
+   if (mnt-mnt_sb-s_subtype  mnt-mnt_sb-s_subtype[0]) {
+   seq_putc(m, '.');
+   mangle(m, mnt-mnt_sb-s_subtype);
+   }
seq_puts(m, mnt-mnt_sb-s_flags  MS_RDONLY ?  ro :  rw);
for (fs_infop =

Re: [RFC PATCH] add filesystem subtype support

2007-02-12 Thread Miklos Szeredi

  There's a slight problem with filesystem type representation in fuse
  based filesystems.
  
  From the kernel's view, there are just two filesystem types: fuse and
  fuseblk.  From the user's view there are lots of different filesystem
  types.  The user is not even much concerned if the filesystem is fuse
  based or not.  
 
 Yes. Those who are concerned with the fstype and mount like
 
 mount -t fstype device mountpoint
 
 apparently expect mount/fstab line like
 
 device mountpoint fstype ...

The problem with supporting 

  /dev/hda1   /mnt/windows   ntfs-3g   rw,...

type syntax in /proc/mounts is that it would add much more complexity
to the kernel.  The same could be achieved in userspace by teaching
the mount utility about special filesystems like ntfs-3g.

This patch aims for the minium footprint for a feature, that really
doesn't concern the kernel at all.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] add filesystem subtype support

2007-02-12 Thread Miklos Szeredi

 -static struct file_system_type **find_filesystem(const char *name)
 +static struct file_system_type **find_filesystem(const char *name, unsigned 
 len)
  {
  struct file_system_type **p;
  for (p=file_systems; *p; p=(*p)-next)
 -if (strcmp((*p)-name,name) == 0)
 +if (strlen((*p)-name) == len 
 +strncmp((*p)-name, name, len) == 0)
  break;
  return p;
  }
 
 Question btw, why does this function return a struct file_system_type ** at
 all? Would not struct file_system_type * suffice?

It's used in register_filesystem() to get the end of the list pointer.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [uml-devel] UML hang with 100% CPU

2007-02-15 Thread Miklos Szeredi

  Strangely enough after continuing in gdb, UML is back to normal, and I
  can't make it hang any more.  It must be something timing related.
 
 Can you see if the patch below fixes it?

Yay!  Got my nice fast UML back instead of ugly slow QEmu ;)

Seems to work perfectly now.

Thanks,
Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] consolidate generic_writepages and mpage_writepages

2007-02-16 Thread Miklos Szeredi

From: Miklos Szeredi [EMAIL PROTECTED]

Clean up massive code duplication between mpage_writepages() and
generic_writepages().

The new generic function, write_cache_pages() takes a function pointer
argument, which will be called for each page to be written.

Maybe cifs_writepages() too can use this infrastructure, but I'm not
touching that with a ten-foot pole.

The upcoming page writeback support in fuse will also want this.

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---

Index: linux/fs/mpage.c
===
--- linux.orig/fs/mpage.c   2007-02-16 16:54:41.0 +0100
+++ linux/fs/mpage.c2007-02-16 16:54:42.0 +0100
@@ -458,11 +458,18 @@ EXPORT_SYMBOL(mpage_readpage);
  * written, so it can intelligently allocate a suitably-sized BIO.  For now,
  * just allocate full-size (16-page) BIOs.
  */
-static struct bio *
-__mpage_writepage(struct bio *bio, struct page *page, get_block_t get_block,
-   sector_t *last_block_in_bio, int *ret, struct writeback_control *wbc,
-   writepage_t writepage_fn)
+struct mpage_data {
+   struct bio *bio;
+   sector_t last_block_in_bio;
+   get_block_t *get_block;
+   unsigned use_writepage;
+};
+
+static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
+void *data)
 {
+   struct mpage_data *mpd = data;
+   struct bio *bio = mpd-bio;
struct address_space *mapping = page-mapping;
struct inode *inode = page-mapping-host;
const unsigned blkbits = inode-i_blkbits;
@@ -480,6 +487,7 @@ __mpage_writepage(struct bio *bio, struc
int length;
struct buffer_head map_bh;
loff_t i_size = i_size_read(inode);
+   int ret = 0;
 
if (page_has_buffers(page)) {
struct buffer_head *head = page_buffers(page);
@@ -542,7 +550,7 @@ __mpage_writepage(struct bio *bio, struc
 
map_bh.b_state = 0;
map_bh.b_size = 1  blkbits;
-   if (get_block(inode, block_in_file, map_bh, 1))
+   if (mpd-get_block(inode, block_in_file, map_bh, 1))
goto confused;
if (buffer_new(map_bh))
unmap_underlying_metadata(map_bh.b_bdev,
@@ -591,7 +599,7 @@ page_is_mapped:
/*
 * This page will go to BIO.  Do we need to send this BIO off first?
 */
-   if (bio  *last_block_in_bio != blocks[0] - 1)
+   if (bio  mpd-last_block_in_bio != blocks[0] - 1)
bio = mpage_bio_submit(WRITE, bio);
 
 alloc_new:
@@ -648,7 +656,7 @@ alloc_new:
boundary_block, 1  blkbits);
}
} else {
-   *last_block_in_bio = blocks[blocks_per_page - 1];
+   mpd-last_block_in_bio = blocks[blocks_per_page - 1];
}
goto out;
 
@@ -656,18 +664,19 @@ confused:
if (bio)
bio = mpage_bio_submit(WRITE, bio);
 
-   if (writepage_fn) {
-   *ret = (*writepage_fn)(page, wbc);
+   if (mpd-use_writepage) {
+   ret = mapping-a_ops-writepage(page, wbc);
} else {
-   *ret = -EAGAIN;
+   ret = -EAGAIN;
goto out;
}
/*
 * The caller has a ref on the inode, so *mapping is stable
 */
-   mapping_set_error(mapping, *ret);
+   mapping_set_error(mapping, ret);
 out:
-   return bio;
+   mpd-bio = bio;
+   return ret;
 }
 
 /**
@@ -690,120 +699,27 @@ out:
  * the call was made get new I/O started against them.  If wbc-sync_mode is
  * WB_SYNC_ALL then we were called for data integrity and we must wait for
  * existing IO to complete.
- *
- * If you fix this you should check generic_writepages() also!
  */
 int
 mpage_writepages(struct address_space *mapping,
struct writeback_control *wbc, get_block_t get_block)
 {
-   struct backing_dev_info *bdi = mapping-backing_dev_info;
-   struct bio *bio = NULL;
-   sector_t last_block_in_bio = 0;
-   int ret = 0;
-   int done = 0;
-   int (*writepage)(struct page *page, struct writeback_control *wbc);
-   struct pagevec pvec;
-   int nr_pages;
-   pgoff_t index;
-   pgoff_t end;/* Inclusive */
-   int scanned = 0;
-   int range_whole = 0;
-
-   if (wbc-nonblocking  bdi_write_congested(bdi)) {
-   wbc-encountered_congestion = 1;
-   return 0;
-   }
+   int ret;
 
-   writepage = NULL;
-   if (get_block == NULL)
-   writepage = mapping-a_ops-writepage;
-
-   pagevec_init(pvec, 0);
-   if (wbc-range_cyclic) {
-   index = mapping-writeback_index; /* Start from prev offset */
-   end = -1;
-   } else {
-   index = wbc-range_start  PAGE_CACHE_SHIFT;
-   end = wbc-range_end  PAGE_CACHE_SHIFT;
-   if (wbc

Re: [Fwd: [PATCH] consolidate generic_writepages and mpage_writepages]

2007-02-17 Thread Miklos Szeredi

 Maybe cifs_writepages() too can use this infrastructure, but I'm not
 touching that with a ten-foot pole.
   
 
 The cifs case ought to be one of the simpler ones, pseudo-code is pretty 
 easy, the hard part is all of the stuff unrelated to cifs:
 Ideally if there were generic functions to help out, cifs writepages 
 would look roughly like the following
 
 cifs_writepages(struct address_space *mapping, struct writeback_control 
 *wbc)
 {

 while (no more pages to write) {
 /* find writeable file handle for this inode */
 /* find the biggest set of contiguous pages that total less than 
 wsize */
 if (packet signing is enabled)
 /* write lock pages so they can not be changed under us 
 while we are calculating the checksum */
   
CIFSSMBWrite2(tree_connection, network_file_handle, array of 
 iovecs, number of iovecs);
 
 if(packet signing was enabled)
 /* unlock pages */
 
 if(error) {
  set page errors
  if (mounted hard )
continue; /* retry */
  else /* if no retry possible */
return error to caller;
}
 update bytes written statistics
 update index to point to next set of pages
 }  /* end while loop */
 }

write_cache_pages() now takes care of the 

 while (no more pages to write) {
 }

part.  All you have to do is to make the body of the loop into a
function and pass it a structure with the data you want to preserve
between invocations.

The hard part is untangling the breaks and continues.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

dirty balancing deadlock

2007-02-18 Thread Miklos Szeredi

I was testing the new fuse shared writable mmap support, and finding
that bash-shared-mapping deadlocks (which isn't so strange ;).  What
is more strange is that this is not an OOM situation at all, with
plenty of free and cached pages.

A little more investigation shows that a similar deadlock happens
reliably with bash-shared-mapping on a loopback mount, even if only
half the total memory is used.

The cause is slightly different in the two cases:

  - loopback mount: allocation by the underlying filesystem is stalled
on throttle_vm_writeout()

  - fuse-loop: page dirtying on the underlying filesystem is stalled on
balance_dirty_pages()

In both cases the underlying fs is totally innocent, with no
dirty/writback pages, yet it's waiting for the global dirty+writeback
to go below the threshold, which obviously won't, until the
allocation/dirtying succeeds.

I'm not quite sure what the solution is, and asking for thoughts.

Ideas:

  - per filesystem dirty counters.  If filesystem is clean (or dirty
is below some minimum), then balance_dirty_pages() should no wait
any more

  - throttle_vm_writeout() was meant to throttle swapping, no?  So in
that case there should be a separate swap-writback counter

Thanks,
Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: dirty balancing deadlock

2007-02-18 Thread Miklos Szeredi

  I was testing the new fuse shared writable mmap support, and finding
  that bash-shared-mapping deadlocks (which isn't so strange ;).  What
  is more strange is that this is not an OOM situation at all, with
  plenty of free and cached pages.
  
  A little more investigation shows that a similar deadlock happens
  reliably with bash-shared-mapping on a loopback mount, even if only
  half the total memory is used.
  
  The cause is slightly different in the two cases:
  
- loopback mount: allocation by the underlying filesystem is stalled
  on throttle_vm_writeout()
  
- fuse-loop: page dirtying on the underlying filesystem is stalled on
  balance_dirty_pages()
  
  In both cases the underlying fs is totally innocent, with no
  dirty/writback pages, yet it's waiting for the global dirty+writeback
  to go below the threshold, which obviously won't, until the
  allocation/dirtying succeeds.
  
  I'm not quite sure what the solution is, and asking for thoughts.
 
 But  these things don't just throttle.  They also perform large amounts
 of writeback, which causes the dirty levels to subside.
 
 From your description it appears that this writeback isn't happening, or
 isn't working.  How come?

 - filesystems A and B
 - write to A will end up as write to B
 - dirty pages in A manage to go over dirty_threshold
 - page writeback is started from A
 - this triggers writeback for a couple of pages in B
 - writeback finishes normally, but dirty+writeback pages are still
   over threshold
 - balance_dirty_pages in B gets stuck, nothing ever moves after this

At least this is my theory for what happens.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: dirty balancing deadlock

2007-02-18 Thread Miklos Szeredi

 Andrew Morton wrote:
  On Sun, 18 Feb 2007 19:28:18 +0100 Miklos Szeredi [EMAIL PROTECTED] wrote:
  
  I was testing the new fuse shared writable mmap support, and finding
  that bash-shared-mapping deadlocks (which isn't so strange ;).  What
  is more strange is that this is not an OOM situation at all, with
  plenty of free and cached pages.
 
  A little more investigation shows that a similar deadlock happens
  reliably with bash-shared-mapping on a loopback mount, even if only
  half the total memory is used.
 
  The cause is slightly different in the two cases:
 
- loopback mount: allocation by the underlying filesystem is stalled
  on throttle_vm_writeout()
 
- fuse-loop: page dirtying on the underlying filesystem is stalled on
  balance_dirty_pages()
 
  In both cases the underlying fs is totally innocent, with no
  dirty/writback pages, yet it's waiting for the global dirty+writeback
  to go below the threshold, which obviously won't, until the
  allocation/dirtying succeeds.
 
  I'm not quite sure what the solution is, and asking for thoughts.
  
  But  these things don't just throttle.  They also perform large amounts
  of writeback, which causes the dirty levels to subside.
  
 From your description it appears that this writeback isn't happening, or
  isn't working.  How come?
 
 Is the fuse daemon trying to do writeback to itself, perhaps?
 
 That is, trying to write out data to the FUSE filesystem, for which
 it is also the server.

No.  It's trying to write out data to a different filesystem.

Trying to write out data to itself very obviously deadlocks, but that
doesn't affect anything beside the stupid filesystem itself, and there
are mechanisms for aborting such a situation (forced umount, abort
through fuse-control filesystem).

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: dirty balancing deadlock

2007-02-18 Thread Miklos Szeredi

I was testing the new fuse shared writable mmap support, and finding
that bash-shared-mapping deadlocks (which isn't so strange ;).  What
is more strange is that this is not an OOM situation at all, with
plenty of free and cached pages.

A little more investigation shows that a similar deadlock happens
reliably with bash-shared-mapping on a loopback mount, even if only
half the total memory is used.

The cause is slightly different in the two cases:

  - loopback mount: allocation by the underlying filesystem is stalled
on throttle_vm_writeout()

  - fuse-loop: page dirtying on the underlying filesystem is stalled on
balance_dirty_pages()

In both cases the underlying fs is totally innocent, with no
dirty/writback pages, yet it's waiting for the global dirty+writeback
to go below the threshold, which obviously won't, until the
allocation/dirtying succeeds.

I'm not quite sure what the solution is, and asking for thoughts.
   
   But  these things don't just throttle.  They also perform large 
   amounts
   of writeback, which causes the dirty levels to subside.
   
   From your description it appears that this writeback isn't happening, or
   isn't working.  How come?
  
   - filesystems A and B
   - write to A will end up as write to B
   - dirty pages in A manage to go over dirty_threshold
   - page writeback is started from A
   - this triggers writeback for a couple of pages in B
   - writeback finishes normally, but dirty+writeback pages are still
 over threshold
   - balance_dirty_pages in B gets stuck, nothing ever moves after this
  
  At least this is my theory for what happens.
  
 
 Is B a real filesystem?

Yes.

 If so, writes to B will decrease the dirty memory threshold.

Yes, but not by enough.  Say A dirties a 1100 pages, limit is 1000.
Some pages queued for writeback (doesn't matter how much).  B writes
back 1, 1099 dirty remain in A, zero in B.  balance_dirty_pages() for
B doesn't know that there's nothing more to write back for B, it's
just waiting there for those 1099, which'll never get written.

 The writeout code _should_ just sit there transferring dirtyiness from A to
 B and cleaning pages via B, looping around, alternating between both.
 
 What does sysrq-t say?

This is the fuse daemon thread that got stuck.  There are lots of
others that are stuck on some ext3 mutex as a result of this.

fusexmp_fh_no D 40045401 0   527493   533   495 (NOTLB)
088d55f8 0001  08dcfb14 0805d8cb 08a09b78 088d55f8 08dc8000
   08dc8000 08dcfb3c 0805a38a 08a09680 088d5100 08dcfb2c 08dc8000 08dc8000
   0847c300 088d5100 08a09680 08dcfb94 08182fe6 08a09680 088d5100 08a09680 
Call Trace:
08dcfb00:  [0805d8cb] switch_to_skas+0x3b/0x83
08dcfb18:  [0805a38a] _switch_to+0x49/0x99
08dcfb40:  [08182fe6] schedule+0x246/0x547
08dcfb98:  [08183a03] schedule_timeout+0x4e/0xb6
08dcfbcc:  [08183991] io_schedule_timeout+0x11/0x20
08dcfbd4:  [080a0cf2] congestion_wait+0x72/0x87
08dcfc04:  [0809c693] balance_dirty_pages+0xa8/0x153
08dcfc5c:  [0809c7bf] balance_dirty_pages_ratelimited_nr+0x43/0x45
08dcfc68:  [080992b5] generic_file_buffered_write+0x3e3/0x6f5
08dcfd20:  [0809988e] __generic_file_aio_write_nolock+0x2c7/0x5dd
08dcfda8:  [08099cb6] generic_file_aio_write+0x55/0xc7
08dcfddc:  [080ea1e6] ext3_file_write+0x39/0xaf
08dcfe04:  [080b060b] do_sync_write+0xd8/0x10e
08dcfebc:  [080b06e3] vfs_write+0xa2/0x1cb
08dcfeec:  [080b09b8] sys_pwrite64+0x65/0x69
08dcff10:  [0805dd54] handle_syscall+0x90/0xbc
08dcff64:  [0806d56c] handle_trap+0x27/0x121
08dcff8c:  [0806dc65] userspace+0x1de/0x226
08dcffe4:  [0805da19] fork_handler+0x76/0x88
08dcfffc:  [d4cf0007] 0xd4cf0007

/proc/vmstat:

nr_anon_pages 668
nr_mapped 3168
nr_file_pages 5191
nr_slab_reclaimable 173
nr_slab_unreclaimable 494
nr_page_table_pages 65
nr_dirty 2174
nr_writeback 10
nr_unstable 0
nr_bounce 0
nr_vmscan_write 0
pgpgin 10955
pgpgout 421091
pswpin 0
pswpout 0
pgalloc_dma 0
pgalloc_normal 268761
pgfree 269709
pgactivate 128287
pgdeactivate 31253
pgfault 237350
pgmajfault 4340
pgrefill_dma 0
pgrefill_normal 127899
pgsteal_dma 0
pgsteal_normal 46892
pgscan_kswapd_dma 0
pgscan_kswapd_normal 47104
pgscan_direct_dma 0
pgscan_direct_normal 36544
pginodesteal 0
slabs_scanned 2048
kswapd_steal 25083
kswapd_inodesteal 335
pageoutrun 656
allocstall 423
pgrotated 0

Breakpoint 3, balance_dirty_pages (mapping=0xa01feb0)
at mm/page-writeback.c:202
202 dirty_exceeded = 1;
(gdb) p dirty_thresh
$1 = 2113
(gdb)

For completeness' sake, here's the backtrace for the stuck loopback as
well:

loop0 D BFFFE101 0   499  5   50059 (L-TLB)
088cc578 0001  09197c4c 0805d8cb 084fe6f8 088cc578 0919
   0919 09197c74 0805a38a 084fe200 088cc080 09197c64 0919 0919
   086d9c80 088cc080 084fe200 09197ccc 08182ab6 084fe200 088cc080 084fe200 
Call Trace:

Re: dirty balancing deadlock

2007-02-18 Thread Miklos Szeredi

   If so, writes to B will decrease the dirty memory threshold.
  
  Yes, but not by enough.  Say A dirties a 1100 pages, limit is 1000.
  Some pages queued for writeback (doesn't matter how much).  B writes
  back 1, 1099 dirty remain in A, zero in B.  balance_dirty_pages() for
  B doesn't know that there's nothing more to write back for B, it's
  just waiting there for those 1099, which'll never get written.
 
 hm, OK, arguable.  I guess something like this..

Doesn't help the fuse case, but does seem to help the loopback mount
one.

For fuse it's worse with the patch: now the write triggered by the
balance recurses into fuse, with disastrous results, since the fuse
writeback is now blocked on the userspace queue.

fusexmp_fh_no D 40136678 0   505494   506   504 (NOTLB)
08982b78 0001  08f9f9b4 0805d8cb 089a75f8 08982b78 08f98000
   08f98000 08f9f9dc 0805a38a 089a7100 08982680 08f9f9cc 08f98000 08f98000
   085d8300 08982680 089a7100 08f9fa34 08183006 089a7100 08982680 089a7100 
Call Trace:
08f9f9a0:  [0805d8cb] switch_to_skas+0x3b/0x83
08f9f9b8:  [0805a38a] _switch_to+0x49/0x99
08f9f9e0:  [08183006] schedule+0x246/0x547
08f9fa38:  [08103c7e] fuse_get_req_wp+0xe9/0x14a
08f9fa70:  [08103d2e] fuse_writepage+0x4f/0x12c
08f9faac:  [0809ce3f] __writepage+0x1e/0x3d
08f9fac0:  [0809cd39] write_cache_pages+0x222/0x30a
08f9fb44:  [0809ce8d] generic_writepages+0x2f/0x35
08f9fb5c:  [0809ced6] do_writepages+0x43/0x45
08f9fb70:  [080cb8d2] __writeback_single_inode+0xbc/0x173
08f9fbb8:  [080cbb30] sync_sb_inodes+0x1a7/0x260
08f9fbe8:  [080cbc54] writeback_inodes+0x6b/0x81
08f9fc04:  [0809c640] balance_dirty_pages+0x55/0x153
08f9fc5c:  [0809c7bf] balance_dirty_pages_ratelimited_nr+0x43/0x45
08f9fc68:  [080992b5] generic_file_buffered_write+0x3e3/0x6f5
08f9fd20:  [0809988e] __generic_file_aio_write_nolock+0x2c7/0x5dd
08f9fda8:  [08099cb6] generic_file_aio_write+0x55/0xc7
08f9fddc:  [080ea206] ext3_file_write+0x39/0xaf
08f9fe04:  [080b060b] do_sync_write+0xd8/0x10e
08f9febc:  [080b06e3] vfs_write+0xa2/0x1cb
08f9feec:  [080b09b8] sys_pwrite64+0x65/0x69
08f9ff10:  [0805dd54] handle_syscall+0x90/0xbc
08f9ff64:  [0806d56c] handle_trap+0x27/0x121
08f9ff8c:  [0806dc65] userspace+0x1de/0x226
08f9ffe4:  [0805da19] fork_handler+0x76/0x88
08f9fffc:  [] nosmp+0xf7fb7000/0x14


 but where's pdflush?  It should be busily transferring dirtiness from A to
 B.

The transfer of dirtyness from A to B goes through the narrow channel
of i_mutex.  And once that is plugged by the stuck balance_dirty_pages()
nothing else can pass through.

   The writeout code _should_ just sit there transferring dirtyiness from A 
   to
   B and cleaning pages via B, looping around, alternating between both.
   
   What does sysrq-t say?
  
  This is the fuse daemon thread that got stuck.
 
 Where's pdflsuh?

Doing nothing I guess.  The request queue for the fuse filesystem is
full, so writepage with wbc-nonblocking=1 will be skipped.

pdflush   D 40045401 023  52412 (L-TLB)
088d5bf8 0001  08907df8 0805d8cb 088d55f8 088d5bf8 0890
   0890 08907e20 0805a38a 088d5100 088d5700 08907e10 0890 0890
   0847c300 088d5700 088d5100 08907e78 08182fe6 088d5100 088d5700 088d5100 
Call Trace:
08907de4:  [0805d8cb] switch_to_skas+0x3b/0x83
08907dfc:  [0805a38a] _switch_to+0x49/0x99
08907e24:  [08182fe6] schedule+0x246/0x547
08907e7c:  [08183a03] schedule_timeout+0x4e/0xb6
08907eb0:  [08183991] io_schedule_timeout+0x11/0x20
08907eb8:  [080a0cf2] congestion_wait+0x72/0x87
08907ee8:  [0809c860] background_writeout+0x35/0xa4
08907f38:  [0809d41e] __pdflush+0xae/0x152
08907f54:  [0809d4f5] pdflush+0x33/0x39
08907f84:  [0808a03a] kthread+0xa7/0xab
08907fb4:  [0806a0f1] run_kernel_thread+0x41/0x50
08907fe0:  [0805d975] new_thread_handler+0x62/0x8b
08907ffc:  [] nosmp+0xf7fb7000/0x14

pdflush   D 40045401 024  52523 (L-TLB)
081e1458 0001  088ffe00 0805d8cb 088d5bf8 081e1458 088f8000
   088f8000 088ffe28 0805a38a 088d5700 081e0f60 088ffe18 088f8000 088f8000
   0847c300 081e0f60 088d5700 088ffe80 08182fe6 088d5700 081e0f60 088d5700 
Call Trace:
088ffdec:  [0805d8cb] switch_to_skas+0x3b/0x83
088ffe04:  [0805a38a] _switch_to+0x49/0x99
088ffe2c:  [08182fe6] schedule+0x246/0x547
088ffe84:  [08183a03] schedule_timeout+0x4e/0xb6
088ffeb8:  [08183991] io_schedule_timeout+0x11/0x20
088ffec0:  [080a0cf2] congestion_wait+0x72/0x87
088ffef0:  [0809c98c] wb_kupdate+0x93/0xd9
088fff38:  [0809d41e] __pdflush+0xae/0x152
088fff54:  [0809d4f5] pdflush+0x33/0x39
088fff84:  [0808a03a] kthread+0xa7/0xab
088fffb4:  [0806a0f1] run_kernel_thread+0x41/0x50
088fffe0:  [0805d975] new_thread_handler+0x62/0x8b
088c:  [] nosmp+0xf7fb7000/0x14
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please

Re: dirty balancing deadlock

2007-02-18 Thread Miklos Szeredi

 --- a/fs/fs-writeback.c~a
 +++ a/fs/fs-writeback.c
 @@ -356,7 +356,7 @@ int generic_sync_sb_inodes(struct super_
   continue;   /* Skip a congested blockdev */
   }
  
 - if (wbc-bdi  bdi != wbc-bdi) {
 + if (wbc-bdi  bdi != wbc-bdi  bdi_write_congested(bdi)) {
   if (!sb_is_blkdev_sb(sb))
   break;  /* fs has the wrong queue */
   list_move(inode-i_list, sb-s_dirty);

Checking bdi_write_congested(bdi) is not reliable, since the queue can
become congested _after_ the check is done.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: dirty balancing deadlock

2007-02-18 Thread Miklos Szeredi

If so, writes to B will decrease the dirty memory threshold.
   
   Yes, but not by enough.  Say A dirties a 1100 pages, limit is 1000.
   Some pages queued for writeback (doesn't matter how much).  B writes
   back 1, 1099 dirty remain in A, zero in B.  balance_dirty_pages() for
   B doesn't know that there's nothing more to write back for B, it's
   just waiting there for those 1099, which'll never get written.
  
  hm, OK, arguable.  I guess something like this..
 
 Doesn't help the fuse case, but does seem to help the loopback mount
 one.

No sorry, it doesn't even help the loopback deadlock.  It sometimes
takes quite a while to trigger...

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: dirty balancing deadlock

2007-02-18 Thread Miklos Szeredi

 If so, writes to B will decrease the dirty memory threshold.

Yes, but not by enough.  Say A dirties a 1100 pages, limit is 1000.
Some pages queued for writeback (doesn't matter how much).  B writes
back 1, 1099 dirty remain in A, zero in B.  balance_dirty_pages() for
B doesn't know that there's nothing more to write back for B, it's
just waiting there for those 1099, which'll never get written.
   
   hm, OK, arguable.  I guess something like this..
  
  Doesn't help the fuse case, but does seem to help the loopback mount
  one.
  
  For fuse it's worse with the patch: now the write triggered by the
  balance recurses into fuse, with disastrous results, since the fuse
  writeback is now blocked on the userspace queue.
  
  fusexmp_fh_no D 40136678 0   505494   506   504 (NOTLB)
  08982b78 0001  08f9f9b4 0805d8cb 089a75f8 08982b78 08f98000
 08f98000 08f9f9dc 0805a38a 089a7100 08982680 08f9f9cc 08f98000 
  08f98000
 085d8300 08982680 089a7100 08f9fa34 08183006 089a7100 08982680 
  089a7100 Call Trace:
  08f9f9a0:  [0805d8cb] switch_to_skas+0x3b/0x83
  08f9f9b8:  [0805a38a] _switch_to+0x49/0x99
  08f9f9e0:  [08183006] schedule+0x246/0x547
  08f9fa38:  [08103c7e] fuse_get_req_wp+0xe9/0x14a
  08f9fa70:  [08103d2e] fuse_writepage+0x4f/0x12c
 
 In general, writepage is supposed to do work without blocking on
 expensive locks that will get pdflush and dirty reclaim stuck in this
 fashion.  You'll probably have to take the same approach reiserfs does
 in data=journal mode, which is leaving the page dirty if fuse_get_req_wp
 is going to block without making progress.

Pdflush, and dirty reclaim set wbc-nonblocking to true.
balance_dirty_pages and fsync don't.  The problem here is that
Andrew's patch is wrong to let balance_dirty_pages() try to write back
pages from a different queue.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: dirty balancing deadlock

2007-02-18 Thread Miklos Szeredi

   In general, writepage is supposed to do work without blocking on
   expensive locks that will get pdflush and dirty reclaim stuck in this
   fashion.  You'll probably have to take the same approach reiserfs does
   in data=journal mode, which is leaving the page dirty if fuse_get_req_wp
   is going to block without making progress.
  
  Pdflush, and dirty reclaim set wbc-nonblocking to true.
  balance_dirty_pages and fsync don't.  The problem here is that
  Andrew's patch is wrong to let balance_dirty_pages() try to write back
  pages from a different queue.
 
 async or sync, writepage is supposed to either make progress or bail.
 loopback aside, if the fuse call is blocking long term, you're going to
 run into problems.

Hmm, like what?

Thanks,
Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: dirty balancing deadlock

2007-02-19 Thread Miklos Szeredi

How about this?

Solves the FUSE deadlock, but not the throttle_vm_writeout() one.
I'll try to tackle that one as well.

If the per-bdi dirty counter goes below 16, balance_dirty_pages()
returns.

Does the constant need to tunable?  If it's too large, then the global
threshold is more easily exceeded.  If it's too small, then in a tight
situation progress will be slower.

Thanks,
Miklos

Index: linux/mm/page-writeback.c
===
--- linux.orig/mm/page-writeback.c  2007-02-19 17:32:41.0 +0100
+++ linux/mm/page-writeback.c   2007-02-19 18:05:28.0 +0100
@@ -198,6 +198,25 @@ static void balance_dirty_pages(struct a
dirty_thresh)
break;
 
+   /*
+* Acquit this producer if there's little or nothing
+* to write back to this particular queue
+*
+* Without this check a deadlock is possible in the
+* following case:
+*
+* - filesystem A writes data through filesystem B
+* - filesystem A has dirty pages over dirty_thresh
+* - writeback is started, this triggers a write in B
+* - balance_dirty_pages() is called synchronously
+* - the write to B blocks
+* - the writeback completes, but dirty is still over threshold
+* - the blocking write prevents futher writes from happening
+*/
+   if (atomic_long_read(bdi-nr_dirty) +
+   atomic_long_read(bdi-nr_writeback)  16)
+   break;
+
if (!dirty_exceeded)
dirty_exceeded = 1;
 
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: dirty balancing deadlock

2007-02-19 Thread Miklos Szeredi

 Solves the FUSE deadlock, but not the throttle_vm_writeout() one.
 I'll try to tackle that one as well.
 
 If the per-bdi dirty counter goes below 16, balance_dirty_pages()
 returns.
 
 Does the constant need to tunable?  If it's too large, then the global
 threshold is more easily exceeded.  If it's too small, then in a tight
 situation progress will be slower.

Similar in spirit, this should solve the deadlock on throttle_vm_writeout().
Totally untested.

Does this approach look workable?

Thanks,
Miklos


Index: linux/include/linux/swap.h
===
--- linux.orig/include/linux/swap.h 2007-02-19 23:39:36.0 +0100
+++ linux/include/linux/swap.h  2007-02-20 00:03:38.0 +0100
@@ -277,10 +277,14 @@ static inline void disable_swap_token(vo
put_swap_token(swap_token_mm);
 }
 
+#define nr_swap_writeback \
+   atomic_long_read(swapper_space.backing_dev_info-nr_writeback)
+
 #else /* CONFIG_SWAP */
 
 #define total_swap_pages   0
 #define total_swapcache_pages  0UL
+#define nr_swap_writeback  0UL
 
 #define si_swapinfo(val) \
do { (val)-freeswap = (val)-totalswap = 0; } while (0)
Index: linux/mm/page-writeback.c
===
--- linux.orig/mm/page-writeback.c  2007-02-19 23:43:03.0 +0100
+++ linux/mm/page-writeback.c   2007-02-20 00:03:49.0 +0100
@@ -33,6 +33,7 @@
 #include linux/syscalls.h
 #include linux/buffer_head.h
 #include linux/pagevec.h
+#include linux/swap.h
 
 /*
  * The maximum number of pages to writeout in a single bdflush/kupdate
@@ -332,6 +333,9 @@ void throttle_vm_writeout(void)
 if (global_page_state(NR_UNSTABLE_NFS) +
global_page_state(NR_WRITEBACK) = dirty_thresh)
break;
+
+   if (nr_swap_writeback  16)
+   break;
 congestion_wait(WRITE, HZ/10);
 }
 }
Index: linux/mm/page_io.c
===
--- linux.orig/mm/page_io.c 2007-02-19 23:24:23.0 +0100
+++ linux/mm/page_io.c  2007-02-19 23:42:21.0 +0100
@@ -70,6 +70,7 @@ static int end_swap_bio_write(struct bio
ClearPageReclaim(page);
}
end_page_writeback(page);
+   atomic_long_dec(swapper_space.backing_dev_info-nr_writeback);
bio_put(bio);
return 0;
 }
@@ -121,6 +122,7 @@ int swap_writepage(struct page *page, st
if (wbc-sync_mode == WB_SYNC_ALL)
rw |= (1  BIO_RW_SYNC);
count_vm_event(PSWPOUT);
+   atomic_long_inc(swapper_space.backing_dev_info-nr_writeback);
set_page_writeback(page);
unlock_page(page);
submit_bio(rw, bio);
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch 1/3] fix illogical behavior in balance_dirty_pages()

2007-03-24 Thread Miklos Szeredi

This is a slightly different take on the fix for the deadlock in fuse
with dirty balancing.  David Chinner convinced me, that per-bdi
counters are too expensive, and that it's not worth trying to account
the number of pages under writeback, as they will be limited by the
queue anyway.


From: Miklos Szeredi [EMAIL PROTECTED]

Current behavior of balance_dirty_pages() is to try to start writeout
into the specified queue for at least write_chunk number of pages.
If write_chunk pages have been submitted, then return.

However if there are less than write_chunk dirty pages for this
queue, then it doesn't return, waiting for the global dirty+writeback
counters to subside, but without doing any actual work.

This is illogical behavior: it allows more dirtyings while there are
dirty pages, but stops further dirtying completely if there are no
more dirty pages.

It also makes a deadlock possible when one filesystem is writing data
through another, and the balance_dirty_pages() for the lower
filesystem is stalling the writeback for the upper filesystem's
data (*).

So the exit condition should instead be:

  submitted at least write_chunk number of pages
OR
submitted ALL the dirty pages destined for this backing dev
  AND
backing dev is not congested

To do this, introduce a new counter in writeback_control, which counts
the number of dirty pages encountered during writeback.  This includes
all dirty pages, even those which are already under writeback but have
been dirtied again, and those which have been skipped due to having
locked buffers.

If this counter is zero after trying to submit some pages for
writeback, and the backing dev is uncongested, then don't wait any
more.  After this, newly dirtied pages can quickly be written back to
this backing dev.

If there are globally no more pages to submit for writeback
(nr_reclaimable == 0), then also don't wait for ever, only while this
backing dev is congested.

(*) For more info on this deadlock, see the following discussions:

  http://lkml.org/lkml/2007/3/1/9
  http://lkml.org/lkml/2007/3/12/16

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---

Index: linux/include/linux/writeback.h
===
--- linux.orig/include/linux/writeback.h2007-03-24 22:06:56.0 
+0100
+++ linux/include/linux/writeback.h 2007-03-24 22:29:02.0 +0100
@@ -44,6 +44,7 @@ struct writeback_control {
long nr_to_write;   /* Write this many pages, and decrement
   this for each page written */
long pages_skipped; /* Pages which were not written */
+   long nr_dirty;  /* Number of dirty pages encountered */
 
/*
 * For a_ops-writepages(): is start or end are non-zero then this is
Index: linux/mm/page-writeback.c
===
--- linux.orig/mm/page-writeback.c  2007-03-24 22:06:56.0 +0100
+++ linux/mm/page-writeback.c   2007-03-24 22:29:02.0 +0100
@@ -207,7 +207,15 @@ static void balance_dirty_pages(struct a
 * written to the server's write cache, but has not yet
 * been flushed to permanent storage.
 */
-   if (nr_reclaimable) {
+   if (!nr_reclaimable) {
+   /*
+* If there's nothing more to write back and this queue
+* is uncongested,  then it is possible to quickly
+* write out some more data, so let's not wait
+*/
+   if (!bdi_write_congested(bdi))
+   break;
+   } else {
writeback_inodes(wbc);
get_dirty_limits(background_thresh,
dirty_thresh, mapping);
@@ -220,6 +228,14 @@ static void balance_dirty_pages(struct a
pages_written += write_chunk - wbc.nr_to_write;
if (pages_written = write_chunk)
break;  /* We've done our duty */
+
+   /*
+* If there are no more dirty pages for this backing
+* backing dev, and the queue is not congested, then
+* it is possible to quickly write out some more data
+*/
+   if (!wbc.nr_dirty  !bdi_write_congested(bdi))
+   break;
}
congestion_wait(WRITE, HZ/10);
}
@@ -619,6 +635,7 @@ retry:
  min(end - index, 
(pgoff_t)PAGEVEC_SIZE-1) + 1))) {
unsigned i;
 
+   wbc-nr_dirty += nr_pages;
scanned = 1;
for (i = 0; i  nr_pages; i

[patch 2/3] remove throttle_vm_writeout()

2007-03-24 Thread Miklos Szeredi

From: Miklos Szeredi [EMAIL PROTECTED]

Remove this function.  It's purpose was to limit the global number of
writeback pages from submitted by direct reclaim.  But this is equally
well accomplished by limited queue lengths.  When this function was
added, the device queues had much larger default lengths (8192
requests, now it's 128), causing problems.

When writable shared mapping support is added to fuse, this function
would be able to cause a deadlock if the userspace filesystem needs to
allocate memory while writing back dirty pages.

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---

Index: linux/include/linux/writeback.h
===
--- linux.orig/include/linux/writeback.h2007-03-24 22:07:00.0 
+0100
+++ linux/include/linux/writeback.h 2007-03-24 22:28:52.0 +0100
@@ -85,7 +85,6 @@ static inline void wait_on_inode(struct 
 int wakeup_pdflush(long nr_pages);
 void laptop_io_completion(void);
 void laptop_sync_completion(void);
-void throttle_vm_writeout(gfp_t gfp_mask);
 
 /* These are exported to sysctl. */
 extern int dirty_background_ratio;
Index: linux/mm/page-writeback.c
===
--- linux.orig/mm/page-writeback.c  2007-03-24 22:07:00.0 +0100
+++ linux/mm/page-writeback.c   2007-03-24 22:28:52.0 +0100
@@ -312,37 +312,6 @@ void balance_dirty_pages_ratelimited_nr(
 }
 EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
 
-void throttle_vm_writeout(gfp_t gfp_mask)
-{
-   long background_thresh;
-   long dirty_thresh;
-
-   if ((gfp_mask  (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO)) {
-   /*
-* The caller might hold locks which can prevent IO completion
-* or progress in the filesystem.  So we cannot just sit here
-* waiting for IO to complete.
-*/
-   congestion_wait(WRITE, HZ/10);
-   return;
-   }
-
-for ( ; ; ) {
-   get_dirty_limits(background_thresh, dirty_thresh, NULL);
-
-/*
- * Boost the allowable dirty threshold a bit for page
- * allocators so they don't get DoS'ed by heavy writers
- */
-dirty_thresh += dirty_thresh / 10;  /* wh... */
-
-if (global_page_state(NR_UNSTABLE_NFS) +
-   global_page_state(NR_WRITEBACK) = dirty_thresh)
-   break;
-congestion_wait(WRITE, HZ/10);
-}
-}
-
 /*
  * writeback at least _min_pages, and keep writing until the amount of dirty
  * memory is less than the background threshold, or until we're all clean.
Index: linux/mm/vmscan.c
===
--- linux.orig/mm/vmscan.c  2007-03-24 22:06:53.0 +0100
+++ linux/mm/vmscan.c   2007-03-24 22:07:03.0 +0100
@@ -952,8 +952,6 @@ static unsigned long shrink_zone(int pri
}
}
 
-   throttle_vm_writeout(sc-gfp_mask);
-
atomic_dec(zone-reclaim_in_progress);
return nr_reclaimed;
 }
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch 3/3] balance dirty pages from loop device

2007-03-24 Thread Miklos Szeredi

From: Miklos Szeredi [EMAIL PROTECTED]

The function do_lo_send_aops() should call
balance_dirty_pages_ratelimited() after each page similarly to
generic_file_buffered_write().

Without this, writing the loop device directly (not through a
filesystem) is very slow, and also slows the whole system down,
because nr_dirty is constantly over the limit.

Beware: this patch without the fix to balance_dirty_pages() makes a
loopback mounted filesystem prone to deadlock.

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---

Index: linux/drivers/block/loop.c
===
--- linux.orig/drivers/block/loop.c 2007-03-24 21:00:40.0 +0100
+++ linux/drivers/block/loop.c  2007-03-24 22:07:06.0 +0100
@@ -275,6 +275,8 @@ static int do_lo_send_aops(struct loop_d
pos += size;
unlock_page(page);
page_cache_release(page);
+   balance_dirty_pages_ratelimited(mapping);
+   cond_resched();
}
ret = 0;
 out:
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch] add file position info to proc

2007-03-24 Thread Miklos Szeredi

From: Miklos Szeredi [EMAIL PROTECTED]

This patch adds support for finding out the current file position,
open flags and possibly other info in the future.

These new entries are added:

  /proc/PID/fdinfo/FD
  /proc/PID/task/TID/fdinfo/FD

For each fd the information is provided in the following format:

pos:1234
flags:  012

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---

Index: linux/fs/proc/base.c
===
--- linux.orig/fs/proc/base.c   2007-03-24 19:00:48.0 +0100
+++ linux/fs/proc/base.c2007-03-24 22:28:14.0 +0100
@@ -1199,7 +1199,10 @@ out:
return ~0U;
 }
 
-static int proc_fd_link(struct inode *inode, struct dentry **dentry, struct 
vfsmount **mnt)
+#define PROC_FDINFO_MAX 64
+
+static int proc_fd_info(struct inode *inode, struct dentry **dentry,
+   struct vfsmount **mnt, char *info)
 {
struct task_struct *task = get_proc_task(inode);
struct files_struct *files = NULL;
@@ -1218,8 +1221,16 @@ static int proc_fd_link(struct inode *in
spin_lock(files-file_lock);
file = fcheck_files(files, fd);
if (file) {
-   *mnt = mntget(file-f_path.mnt);
-   *dentry = dget(file-f_path.dentry);
+   if (mnt)
+   *mnt = mntget(file-f_path.mnt);
+   if (dentry)
+   *dentry = dget(file-f_path.dentry);
+   if (info)
+   snprintf(info, PROC_FDINFO_MAX,
+pos:\t%lli\n
+flags:\t0%o\n,
+(long long) file-f_pos,
+file-f_flags);
spin_unlock(files-file_lock);
put_files_struct(files);
return 0;
@@ -1230,6 +1241,12 @@ static int proc_fd_link(struct inode *in
return -ENOENT;
 }
 
+static int proc_fd_link(struct inode *inode, struct dentry **dentry,
+   struct vfsmount **mnt)
+{
+   return proc_fd_info(inode, dentry, mnt, NULL);
+}
+
 static int tid_fd_revalidate(struct dentry *dentry, struct nameidata *nd)
 {
struct inode *inode = dentry-d_inode;
@@ -1325,7 +1342,9 @@ out_iput:
goto out;
 }
 
-static struct dentry *proc_lookupfd(struct inode * dir, struct dentry * 
dentry, struct nameidata *nd)
+static struct dentry *proc_lookupfd_common(struct inode *dir,
+  struct dentry *dentry,
+  instantiate_t instantiate)
 {
struct task_struct *task = get_proc_task(dir);
unsigned fd = name_to_int(dentry);
@@ -1336,23 +1355,15 @@ static struct dentry *proc_lookupfd(stru
if (fd == ~0U)
goto out;
 
-   result = proc_fd_instantiate(dir, dentry, task, fd);
+   result = instantiate(dir, dentry, task, fd);
 out:
put_task_struct(task);
 out_no_task:
return result;
 }
 
-static int proc_fd_fill_cache(struct file *filp, void *dirent, filldir_t 
filldir,
-   struct task_struct *task, int fd)
-{
-   char name[PROC_NUMBUF];
-   int len = snprintf(name, sizeof(name), %d, fd);
-   return proc_fill_cache(filp, dirent, filldir, name, len,
-   proc_fd_instantiate, task, fd);
-}
-
-static int proc_readfd(struct file * filp, void * dirent, filldir_t filldir)
+static int proc_readfd_common(struct file * filp, void * dirent,
+ filldir_t filldir, instantiate_t instantiate)
 {
struct dentry *dentry = filp-f_path.dentry;
struct inode *inode = dentry-d_inode;
@@ -1388,12 +1399,17 @@ static int proc_readfd(struct file * fil
for (fd = filp-f_pos-2;
 fd  fdt-max_fds;
 fd++, filp-f_pos++) {
+   char name[PROC_NUMBUF];
+   int len;
 
if (!fcheck_files(files, fd))
continue;
rcu_read_unlock();
 
-   if (proc_fd_fill_cache(filp, dirent, filldir, 
p, fd)  0) {
+   len = snprintf(name, sizeof(name), %d, fd);
+   if (proc_fill_cache(filp, dirent, filldir,
+   name, len, instantiate,
+   p, fd)  0) {
rcu_read_lock();
break;
}
@@ -1408,6 +1424,32 @@ out_no_task:
return retval;
 }
 
+static struct dentry *proc_lookupfd(struct inode *dir, struct dentry *dentry

[patch 1/3] split mmap

2007-03-24 Thread Miklos Szeredi

From: Miklos Szeredi [EMAIL PROTECTED]

This is a straightforward split of do_mmap_pgoff() into two functions:

 - do_mmap_pgoff() checks the parameters, and calculates the vma
   flags.  Then it calls

 - mmap_region(), which does the actual mapping

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---

Index: linux/mm/mmap.c
===
--- linux.orig/mm/mmap.c2007-03-24 21:00:40.0 +0100
+++ linux/mm/mmap.c 2007-03-24 22:28:52.0 +0100
@@ -893,14 +893,11 @@ unsigned long do_mmap_pgoff(struct file 
unsigned long flags, unsigned long pgoff)
 {
struct mm_struct * mm = current-mm;
-   struct vm_area_struct * vma, * prev;
struct inode *inode;
unsigned int vm_flags;
-   int correct_wcount = 0;
int error;
-   struct rb_node ** rb_link, * rb_parent;
int accountable = 1;
-   unsigned long charged = 0, reqprot = prot;
+   unsigned long reqprot = prot;
 
/*
 * Does the application expect PROT_READ to imply PROT_EXEC?
@@ -1025,7 +1022,25 @@ unsigned long do_mmap_pgoff(struct file 
error = security_file_mmap(file, reqprot, prot, flags);
if (error)
return error;
-   
+
+   return mmap_region(file, addr, len, flags, vm_flags, pgoff,
+  accountable);
+}
+EXPORT_SYMBOL(do_mmap_pgoff);
+
+unsigned long mmap_region(struct file *file, unsigned long addr,
+ unsigned long len, unsigned long flags,
+ unsigned int vm_flags, unsigned long pgoff,
+ int accountable)
+{
+   struct mm_struct *mm = current-mm;
+   struct vm_area_struct *vma, *prev;
+   int correct_wcount = 0;
+   int error;
+   struct rb_node **rb_link, *rb_parent;
+   unsigned long charged = 0;
+   struct inode *inode =  file ? file-f_path.dentry-d_inode : NULL;
+
/* Clear old maps */
error = -ENOMEM;
 munmap_back:
@@ -1174,8 +1189,6 @@ unacct_error:
return error;
 }
 
-EXPORT_SYMBOL(do_mmap_pgoff);
-
 /* Get an address range which is currently unmapped.
  * For shmat() with addr=0.
  *
Index: linux/include/linux/mm.h
===
--- linux.orig/include/linux/mm.h   2007-03-24 21:00:40.0 +0100
+++ linux/include/linux/mm.h2007-03-24 22:28:52.0 +0100
@@ -1035,6 +1035,10 @@ extern unsigned long get_unmapped_area(s
 extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot,
unsigned long flag, unsigned long pgoff);
+extern unsigned long mmap_region(struct file *file, unsigned long addr,
+   unsigned long len, unsigned long flags,
+   unsigned int vm_flags, unsigned long pgoff,
+   int accountable);
 
 static inline unsigned long do_mmap(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot,
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch 2/3] only allow nonlinear vmas for ram backed filesystems

2007-03-24 Thread Miklos Szeredi

From: Miklos Szeredi [EMAIL PROTECTED]

Dirty page accounting/limiting doesn't work for nonlinear mappings, so
for non-ram backed filesystems emulate with linear mappings.  This
retains ABI compatibility with previous kernels at minimal code cost.

All known users of nonlinear mappings actually use tmpfs, so this
shouldn't have any negative effect.

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---

Index: linux-2.6.21-rc4-mm1/mm/fremap.c
===
--- linux-2.6.21-rc4-mm1.orig/mm/fremap.c   2007-03-24 22:30:05.0 
+0100
+++ linux-2.6.21-rc4-mm1/mm/fremap.c2007-03-24 22:37:59.0 +0100
@@ -181,6 +181,24 @@ asmlinkage long sys_remap_file_pages(uns
goto retry;
}
mapping = vma-vm_file-f_mapping;
+   /*
+* page_mkclean doesn't work on nonlinear vmas, so if dirty
+* pages need to be accounted, emulate with linear vmas.
+*/
+   if (mapping_cap_account_dirty(mapping)) {
+   unsigned long addr;
+
+   flags = MAP_NONBLOCK;
+   addr = mmap_region(vma-vm_file, start, size, flags,
+  vma-vm_flags, pgoff, 1);
+   if (IS_ERR_VALUE(addr))
+   err = addr;
+   else {
+   BUG_ON(addr != start);
+   err = 0;
+   }
+   goto out;
+   }
spin_lock(mapping-i_mmap_lock);
flush_dcache_mmap_lock(mapping);
vma-vm_flags |= VM_NONLINEAR;
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 5429 matches

Mail list logo