Re: [REVIEW][PATCH 0/6] Wrapping up the vfs support for unprivileged mounts

2018-05-29 Thread Dongsu Park
Hi,

On Thu, May 24, 2018 at 1:22 AM, Eric W. Biederman wrote:
>
> Very slowly the work has been progressing to ensure the vfs has the
> necessary support for mounting filesystems without privilege.
>
> This patchset contains one more core piece of that work, ensuring a few
> more operations that would write back an inode and confuse an existing
> filesystem are denied.
>
> The rest of the changes actually enable userns root to do things with
> filesystems that the userns root has mounted.  Most of these have been
> waiting in the wings a long time, held back because I wanted the core
> of the patchset to be solid before I started allowing additional
> behavior.
>
> It is definitely time for these changes so the effect of s_user_ns
> becomes less theoretical.
>
> The change to allow mknod is new, but consistent with everything else
> and harmless as device nodes on filesystems mounted without privilege
> are ignored.
>
> Unless problems show up during review, I plan to merge these changes.

Thank you for the great work. I have been looking forward to seeing it.
I have just gathered available relevant patches in my branch:

https://github.com/kinvolk/linux/tree/dongsu/fuse-userns-for-4.18

With this branch, I tested sshfs/fuse from a non-init user namespace.
It works as expected. So you can add:

Tested-by: Dongsu Park 

Thanks!
Dongsu

> These changes are also available at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git userns-test
>
> Eric W. Biederman (5):
>   vfs: Don't allow changing the link count of an inode with an invalid uid or gid
>   vfs: Allow userns root to call mknod on owned filesystems.
>   fs: Allow superblock owner to replace invalid owners of inodes
>   fs: Allow superblock owner to access do_remount_sb()
>   capabilities: Allow privileged user in s_user_ns to set security.* xattrs
>
> Seth Forshee (1):
>   fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems
>
>  fs/attr.c            | 36
>  fs/ioctl.c           |  4 ++--
>  fs/namei.c           | 16
>  fs/namespace.c       |  4 ++--
>  security/commoncap.c |  8 ++--
>  5 files changed, 50 insertions(+), 18 deletions(-)
>
> Eric


[RFC PATCH v5 1/2] ima: force re-appraisal on filesystems with FS_IMA_NO_CACHE

2018-02-07 Thread Dongsu Park
From: Alban Crequy <al...@kinvolk.io>

This patch forces files to be re-measured, re-appraised and re-audited
on file systems with the feature flag FS_IMA_NO_CACHE. In that way,
cached integrity results won't be used.

Cc: linux-kernel@vger.kernel.org
Cc: linux-integr...@vger.kernel.org
Cc: linux-security-mod...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: Alexander Viro <v...@zeniv.linux.org.uk>
Cc: Miklos Szeredi <mik...@szeredi.hu>
Cc: Mimi Zohar <zo...@linux.vnet.ibm.com>
Cc: Dmitry Kasatkin <dmitry.kasat...@gmail.com>
Cc: James Morris <jmor...@namei.org>
Cc: Christoph Hellwig <h...@infradead.org>
Acked-by: "Serge E. Hallyn" <se...@hallyn.com>
Acked-by: Seth Forshee <seth.fors...@canonical.com>
Tested-by: Dongsu Park <don...@kinvolk.io>
Signed-off-by: Alban Crequy <al...@kinvolk.io>
Signed-off-by: Dongsu Park <don...@kinvolk.io>
---
 include/linux/fs.h                |  1 +
 security/integrity/ima/ima_main.c | 15 +++++++++++++--
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 511fbaab..ced841ba 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2075,6 +2075,7 @@ struct file_system_type {
 #define FS_BINARY_MOUNTDATA	2
 #define FS_HAS_SUBTYPE		4
 #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
+#define FS_IMA_NO_CACHE		16	/* Force IMA to re-measure, re-appraise, re-audit files */
 #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
 	struct dentry *(*mount) (struct file_system_type *, int,
 		       const char *, void *);
diff --git a/security/integrity/ima/ima_main.c b/security/integrity/ima/ima_main.c
index 6d78cb26..83edbad8 100644
--- a/security/integrity/ima/ima_main.c
+++ b/security/integrity/ima/ima_main.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "ima.h"
 
@@ -228,9 +229,19 @@ static int process_measurement(struct file *file, char *buf, loff_t size,
 		 IMA_APPRAISE_SUBMASK | IMA_APPRAISED_SUBMASK |
 		 IMA_ACTION_FLAGS);
 
-	if (test_and_clear_bit(IMA_CHANGE_XATTR, &iint->atomic_flags))
-		/* reset all flags if ima_inode_setxattr was called */
+	/*
+	 * Reset the measure, appraise and audit cached flags either if:
+	 * - ima_inode_setxattr was called, or
+	 * - based on filesystem feature flag
+	 * forcing the file to be re-evaluated.
+	 */
+	if (test_and_clear_bit(IMA_CHANGE_XATTR, &iint->atomic_flags)) {
 		iint->flags &= ~IMA_DONE_MASK;
+	} else if (inode->i_sb->s_type->fs_flags & FS_IMA_NO_CACHE) {
+		iint->flags &= ~IMA_DONE_MASK;
+		if (action & IMA_MEASURE)
+			iint->measured_pcrs = 0;
+	}
 
 	/* Determine if already appraised/measured based on bitmask
 	 * (IMA_MEASURE, IMA_MEASURED, IMA__APPRAISE, IMA__APPRAISED,
-- 
2.13.6
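
For readers following the process_measurement() hunk in this patch, the
reset logic can be modeled in userspace. This is only a sketch under
stated assumptions: struct model_iint, model_reset_cache() and the IMA_*
bit values below are placeholders invented for illustration and are not
the kernel's definitions. Only the value FS_IMA_NO_CACHE = 16 comes from
the patch itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Value taken from the patch. */
#define FS_IMA_NO_CACHE 16u

/* Placeholder bit values, NOT the kernel's real definitions. */
#define IMA_MEASURE   0x01u
#define IMA_MEASURED  0x02u
#define IMA_APPRAISED 0x08u
#define IMA_AUDITED   0x20u
#define IMA_DONE_MASK (IMA_MEASURED | IMA_APPRAISED | IMA_AUDITED)

/* Hypothetical stand-in for the per-inode integrity cache entry. */
struct model_iint {
	uint32_t flags;         /* cached "already done" bits */
	uint32_t measured_pcrs; /* PCRs the file was already measured into */
	bool xattr_changed;     /* models the IMA_CHANGE_XATTR atomic bit */
};

/* Mirrors the if / else-if structure of the patched code. */
void model_reset_cache(struct model_iint *iint, uint32_t fs_flags,
		       uint32_t action)
{
	bool changed;

	assert(iint != NULL);

	/* Models test_and_clear_bit(): read the bit, then clear it. */
	changed = iint->xattr_changed;
	iint->xattr_changed = false;

	if (changed) {
		iint->flags &= ~IMA_DONE_MASK;
	} else if (fs_flags & FS_IMA_NO_CACHE) {
		iint->flags &= ~IMA_DONE_MASK;
		if (action & IMA_MEASURE)
			iint->measured_pcrs = 0;
	}
}
```

In this model, calling model_reset_cache() with a filesystem whose flags
include FS_IMA_NO_CACHE always drops the cached "done" bits, whereas a
filesystem without the flag keeps them unless an xattr change was
recorded.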



[RFC PATCH v5 0/2] ima,fuse: introduce new fs flag FS_IMA_NO_CACHE

2018-02-07 Thread Dongsu Park
This patchset v5 introduces a new fs flag FS_IMA_NO_CACHE and uses it in
FUSE. This forces files to be re-measured, re-appraised and re-audited
on file systems with the feature flag FS_IMA_NO_CACHE. In that way,
cached integrity results won't be used.

There was a previous attempt (unmerged) with an IMA option named "force" and
using that option for FUSE filesystems. These patches use a different approach
so that the IMA subsystem does not need to know about FUSE.
- https://www.spinics.net/lists/linux-integrity/msg00948.html
- https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1584131.html

Changes since v1: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1587390.html
- include linux-fsdevel mailing list in cc
- mark patch as RFC
- based on next-integrity, without other unmerged FUSE / IMA patches

Changes since v2: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1587678.html
- rename flag to FS_IMA_NO_CACHE
- split patch into 2

Changes since v3: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1592393.html
- make the code simpler by resetting IMA_DONE_MASK

Changes since v4: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1598387.html
- add Acked-by from Miklos
- change ordering of patches as suggested by Miklos
- improve commit messages
- code diff since v4 is empty: only the commit messages and patch ordering were changed

The patchset is also available in our github repo:
  https://github.com/kinvolk/linux/tree/dongsu/fuse-flag-ima-nocache-v5


Alban Crequy (2):
  ima: force re-appraisal on filesystems with FS_IMA_NO_CACHE
  fuse: introduce new fs_type flag FS_IMA_NO_CACHE

 fs/fuse/inode.c                   |  2 +-
 include/linux/fs.h                |  1 +
 security/integrity/ima/ima_main.c | 15 +++++++++++++--
 3 files changed, 15 insertions(+), 3 deletions(-)

-- 
2.13.6
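
The cover letter's point that IMA "does not need to know about FUSE" can
be illustrated with a small userspace sketch. The flag values 4 and 16
are the ones shown in the include/linux/fs.h hunk of this series.
Everything else (struct model_fs_type, model_must_reevaluate() and the
"examplefs" entry) is hypothetical and exists only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag values as they appear in the include/linux/fs.h hunk. */
#define FS_HAS_SUBTYPE  4
#define FS_IMA_NO_CACHE 16

/* Hypothetical, cut-down stand-in for struct file_system_type. */
struct model_fs_type {
	const char *name;
	int fs_flags;
};

/* FUSE opts in by OR-ing the new flag into its fs_flags. */
const struct model_fs_type model_fuse = {
	"fuse", FS_HAS_SUBTYPE | FS_IMA_NO_CACHE
};

/* An imaginary filesystem that keeps the default caching behaviour. */
const struct model_fs_type model_plain = {
	"examplefs", FS_HAS_SUBTYPE
};

/*
 * The integrity side only tests a generic bit. It never compares the
 * filesystem name, so any filesystem can opt in later.
 */
bool model_must_reevaluate(const struct model_fs_type *type)
{
	assert(type != NULL);
	return (type->fs_flags & FS_IMA_NO_CACHE) != 0;
}
```

With this shape, a filesystem other than FUSE would only have to set the
bit in its own type definition. No change on the IMA side would be
needed.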



[RFC PATCH v5 2/2] fuse: introduce new fs_type flag FS_IMA_NO_CACHE

2018-02-07 Thread Dongsu Park
From: Alban Crequy <al...@kinvolk.io>

This new fs_type flag FS_IMA_NO_CACHE means files should be re-measured,
re-appraised and re-audited each time. Cached integrity results should
not be used.

It is useful in FUSE because the userspace FUSE process can change the
underlying files at any time without notifying the kernel. FUSE can be
mounted by unprivileged users either today with fusermount installed
with setuid, or soon with the upcoming patches to allow FUSE mounts in
a non-init user namespace. That makes the issue more visible than for
network filesystems where unprivileged users cannot mount.

How to test this:

The test I did was using a patched version of the memfs FUSE driver
[1][2] and two very simple "hello-world" programs [4] (prog1 prints
"hello world: 1" and prog2 prints "hello world: 2").

I copy prog1 and prog2 into the fuse-memfs mount point, execute them and
check the sha1 hash in
"/sys/kernel/security/ima/ascii_runtime_measurements".

My patch on the memfs FUSE driver added a backdoor command to serve
prog1 when the kernel asks for prog2 or vice-versa. In this way, I can
exec prog1 and get it to print "hello world: 2" without ever replacing
the file via the VFS, so the kernel is not aware of the change.

The test was done using the branch "dongsu/fuse-flag-ima-nocache-v5" [3].

Step by step test procedure:

1. Mount the memfs FUSE using [2]:
rm -f  /tmp/memfs-switch* ; memfs -L DEBUG  /mnt/memfs

2. Copy prog1 and prog2 using [4]:
cp prog1 /mnt/memfs/prog1
cp prog2 /mnt/memfs/prog2

3. Look up the files and let the FUSE driver keep the handles open:
dd if=/mnt/memfs/prog1 bs=1 | (read -n 1 x ; sleep 3600 ) &
dd if=/mnt/memfs/prog2 bs=1 | (read -n 1 x ; sleep 3600 ) &

4. Check the 2 programs work correctly:
$ /mnt/memfs/prog1
hello world: 1
$ /mnt/memfs/prog2
hello world: 2

5. Check the measurements for prog1 and prog2:
$ sudo cat /sys/kernel/security/ima/ascii_runtime_measurements \
| grep /mnt/memfs/prog
10 [...] ima-ng sha1:ac14c9268cd2[...] /mnt/memfs/prog1
10 [...] ima-ng sha1:799cb5d1e06d[...] /mnt/memfs/prog2

6. Use the backdoor command in my patched memfs to redirect file
operations on file handle 3 to file handle 2:
rm -f  /tmp/memfs-switch* ; touch /tmp/memfs-switch-3-2

7. Check how the FUSE driver serves different content for the files:
$ /mnt/memfs/prog1
hello world: 2
$ /mnt/memfs/prog2
hello world: 2

8. Check the measurements:
sudo cat /sys/kernel/security/ima/ascii_runtime_measurements \
| grep /mnt/memfs/prog

Without the patch, there are no new measurements, despite the FUSE
driver having served different executables.

With the patch, I can see additional measurements for prog1 and prog2
with the hashes reversed when the FUSE driver served the alternative
content.

[1] https://github.com/bbengfort/memfs
[2] https://github.com/kinvolk/memfs/commits/alban/switch-files
[3] https://github.com/kinvolk/linux/commits/dongsu/fuse-flag-ima-nocache-v5
[4] https://github.com/kinvolk/fuse-userns-patches/commit/cf1f5750cab0

Cc: linux-kernel@vger.kernel.org
Cc: linux-integr...@vger.kernel.org
Cc: linux-security-mod...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: Alexander Viro <v...@zeniv.linux.org.uk>
Cc: Mimi Zohar <zo...@linux.vnet.ibm.com>
Cc: Dmitry Kasatkin <dmitry.kasat...@gmail.com>
Cc: James Morris <jmor...@namei.org>
Cc: Christoph Hellwig <h...@infradead.org>
Acked-by: Miklos Szeredi <mik...@szeredi.hu>
Acked-by: "Serge E. Hallyn" <se...@hallyn.com>
Acked-by: Seth Forshee <seth.fors...@canonical.com>
Tested-by: Dongsu Park <don...@kinvolk.io>
Signed-off-by: Alban Crequy <al...@kinvolk.io>
---
 fs/fuse/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 624f18bb..0a9e5164 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1205,7 +1205,7 @@ static void fuse_kill_sb_anon(struct super_block *sb)
 static struct file_system_type fuse_fs_type = {
 	.owner		= THIS_MODULE,
 	.name		= "fuse",
-	.fs_flags	= FS_HAS_SUBTYPE,
+	.fs_flags	= FS_HAS_SUBTYPE | FS_IMA_NO_CACHE,
 	.mount		= fuse_mount,
 	.kill_sb	= fuse_kill_sb_anon,
 };
-- 
2.13.6
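
The test procedure in this patch can be mimicked without a kernel or a
FUSE mount. The sketch below is a toy model, not the real test: toy_read()
plays the role of the patched memfs driver, toy_switch stands for touching
/tmp/memfs-switch-3-2, and toy_hash() is a simple djb2 checksum standing
in for the SHA-1 that IMA records. All toy_* names are invented for this
example.

```c
#include <assert.h>
#include <stdbool.h>

/* What the two test programs print, used here as their "content". */
static const char *toy_content[2] = { "hello world: 1", "hello world: 2" };

/* Models the backdoor: when set, every file is served as prog2. */
bool toy_switch = false;

/* Toy driver read path: file 0 is prog1, file 1 is prog2. */
const char *toy_read(int file)
{
	assert(file == 0 || file == 1);
	return toy_switch ? toy_content[1] : toy_content[file];
}

/* djb2 checksum, a stand-in for the real SHA-1 measurement. */
unsigned long toy_hash(const char *s)
{
	unsigned long h = 5381;

	while (*s)
		h = h * 33 + (unsigned char)*s++;
	return h;
}

/* Toy per-file cache of the last measurement. */
struct toy_cache {
	bool measured;
	unsigned long hash;
};

/*
 * Returns the hash the "kernel" believes the file has. With no_cache
 * false, a cached result is reused and the content is not read again.
 * With no_cache true, the content is hashed on every call.
 */
unsigned long toy_measure(struct toy_cache *c, int file, bool no_cache)
{
	if (!c->measured || no_cache) {
		c->hash = toy_hash(toy_read(file));
		c->measured = true;
	}
	return c->hash;
}
```

In this model, measuring prog1 once, setting toy_switch and measuring
again returns the stale prog1 hash when no_cache is false, and the prog2
hash when no_cache is true. That corresponds to steps 7 and 8 without and
with the patch.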



Re: [RFC PATCH v3 2/2] ima: force re-appraisal on filesystems with FS_IMA_NO_CACHE

2018-01-30 Thread Dongsu Park
Hi,

On Mon, Jan 29, 2018 at 6:40 PM, Dongsu Park <don...@kinvolk.io> wrote:
> On Mon, Jan 29, 2018 at 5:33 PM, Mimi Zohar <zo...@linux.vnet.ibm.com> wrote:
>> On Thu, 2018-01-25 at 06:56 -0500, Mimi Zohar wrote:
...
>> Did you get a chance to make the change and test it?
>
> Alban has been on holidays, so he will be back on Wednesday or so.
> So I'll try to understand what you meant in the last email.
>
> As IMA_DONE_MASK contains all other bitmasks, it's possible to
> optimize the code like this:
>
> if (test_and_clear_bit(IMA_CHANGE_XATTR, &iint->atomic_flags)) {
>         iint->flags &= ~IMA_DONE_MASK;
> } else if (inode->i_sb->s_type->fs_flags & FS_IMA_NO_CACHE) {
>         iint->flags &= ~IMA_DONE_MASK;
>         if (action & IMA_MEASURE)
>                 iint->measured_pcrs = 0;
> }
>
> Is that what you want to see? Please let me know if it's not.
> Tomorrow I will try to test with a new patch.

Today I created a new patch, and tested it. It worked fine.
So I've just sent a new patchset v4. Please see:
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1598387.html

Thanks,
Dongsu

> Thanks,
> Dongsu
>
>> Mimi
>>


[RFC PATCH v4 0/2] ima,fuse: introduce new fs flag FS_IMA_NO_CACHE

2018-01-30 Thread Dongsu Park
This patchset v4 introduces a new fs flag FS_IMA_NO_CACHE and uses it in
FUSE. This forces files to be re-measured, re-appraised and re-audited
on file systems with the feature flag FS_IMA_NO_CACHE. In that way,
cached integrity results won't be used.

There was a previous attempt (unmerged) with an IMA option named "force" and
using that option for FUSE filesystems. These patches use a different approach
so that the IMA subsystem does not need to know about FUSE.
- https://www.spinics.net/lists/linux-integrity/msg00948.html
- https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1584131.html

Changes since v1: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1587390.html
- include linux-fsdevel mailing list in cc
- mark patch as RFC
- based on next-integrity, without other unmerged FUSE / IMA patches

Changes since v2: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1587678.html
- rename flag to FS_IMA_NO_CACHE
- split patch into 2

Changes since v3: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1592393.html
- make the code simpler by resetting IMA_DONE_MASK

The patchset is also available in our github repo:
  https://github.com/kinvolk/linux/tree/dongsu/fuse-flag-ima-nocache-v4


Alban Crequy (2):
  fuse: introduce new fs_type flag FS_IMA_NO_CACHE
  ima: force re-appraisal on filesystems with FS_IMA_NO_CACHE

 fs/fuse/inode.c                   |  2 +-
 include/linux/fs.h                |  1 +
 security/integrity/ima/ima_main.c | 15 +++++++++++++--
 3 files changed, 15 insertions(+), 3 deletions(-)

-- 
2.13.6



[RFC PATCH v4 2/2] ima: force re-appraisal on filesystems with FS_IMA_NO_CACHE

2018-01-30 Thread Dongsu Park
From: Alban Crequy <al...@kinvolk.io>

This patch forces files to be re-measured, re-appraised and re-audited
on file systems with the feature flag FS_IMA_NO_CACHE. In that way,
cached integrity results won't be used.

How to test this:

The test I did was using a patched version of the memfs FUSE driver
[1][2] and two very simple "hello-world" programs [4] (prog1 prints
"hello world: 1" and prog2 prints "hello world: 2").

I copy prog1 and prog2 into the fuse-memfs mount point, execute them and
check the sha1 hash in
"/sys/kernel/security/ima/ascii_runtime_measurements".

My patch on the memfs FUSE driver added a backdoor command to serve
prog1 when the kernel asks for prog2 or vice-versa. In this way, I can
exec prog1 and get it to print "hello world: 2" without ever replacing
the file via the VFS, so the kernel is not aware of the change.

The test was done using the branch "dongsu/fuse-flag-ima-nocache-v4" [3].

Step by step test procedure:

1. Mount the memfs FUSE using [2]:
rm -f  /tmp/memfs-switch* ; memfs -L DEBUG  /mnt/memfs

2. Copy prog1 and prog2 using [4]:
cp prog1 /mnt/memfs/prog1
cp prog2 /mnt/memfs/prog2

3. Look up the files and let the FUSE driver keep the handles open:
dd if=/mnt/memfs/prog1 bs=1 | (read -n 1 x ; sleep 3600 ) &
dd if=/mnt/memfs/prog2 bs=1 | (read -n 1 x ; sleep 3600 ) &

4. Check the 2 programs work correctly:
$ /mnt/memfs/prog1
hello world: 1
$ /mnt/memfs/prog2
hello world: 2

5. Check the measurements for prog1 and prog2:
$ sudo cat /sys/kernel/security/ima/ascii_runtime_measurements \
| grep /mnt/memfs/prog
10 [...] ima-ng sha1:ac14c9268cd2[...] /mnt/memfs/prog1
10 [...] ima-ng sha1:799cb5d1e06d[...] /mnt/memfs/prog2

6. Use the backdoor command in my patched memfs to redirect file
operations on file handle 3 to file handle 2:
rm -f  /tmp/memfs-switch* ; touch /tmp/memfs-switch-3-2

7. Check how the FUSE driver serves different content for the files:
$ /mnt/memfs/prog1
hello world: 2
$ /mnt/memfs/prog2
hello world: 2

8. Check the measurements:
sudo cat /sys/kernel/security/ima/ascii_runtime_measurements \
| grep /mnt/memfs/prog

Without the patch, there are no new measurements, despite the FUSE
driver having served different executables.

With the patch, I can see additional measurements for prog1 and prog2
with the hashes reversed when the FUSE driver served the alternative
content.

[1] https://github.com/bbengfort/memfs
[2] https://github.com/kinvolk/memfs/commits/alban/switch-files
[3] https://github.com/kinvolk/linux/commits/dongsu/fuse-flag-ima-nocache-v4
[4] https://github.com/kinvolk/fuse-userns-patches/commit/cf1f5750cab0

Cc: linux-kernel@vger.kernel.org
Cc: linux-integr...@vger.kernel.org
Cc: linux-security-mod...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: Miklos Szeredi <mik...@szeredi.hu>
Cc: Alexander Viro <v...@zeniv.linux.org.uk>
Cc: Mimi Zohar <zo...@linux.vnet.ibm.com>
Cc: Dmitry Kasatkin <dmitry.kasat...@gmail.com>
Cc: James Morris <jmor...@namei.org>
Cc: Christoph Hellwig <h...@infradead.org>
Acked-by: "Serge E. Hallyn" <se...@hallyn.com>
Acked-by: Seth Forshee <seth.fors...@canonical.com>
Tested-by: Dongsu Park <don...@kinvolk.io>
Signed-off-by: Alban Crequy <al...@kinvolk.io>
[dongsu: optimized code to address review comments by Mimi]
Signed-off-by: Dongsu Park <don...@kinvolk.io>
---
 security/integrity/ima/ima_main.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/security/integrity/ima/ima_main.c b/security/integrity/ima/ima_main.c
index 6d78cb26..83edbad8 100644
--- a/security/integrity/ima/ima_main.c
+++ b/security/integrity/ima/ima_main.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "ima.h"
 
@@ -228,9 +229,19 @@ static int process_measurement(struct file *file, char *buf, loff_t size,
 		 IMA_APPRAISE_SUBMASK | IMA_APPRAISED_SUBMASK |
 		 IMA_ACTION_FLAGS);
 
-	if (test_and_clear_bit(IMA_CHANGE_XATTR, &iint->atomic_flags))
-		/* reset all flags if ima_inode_setxattr was called */
+	/*
+	 * Reset the measure, appraise and audit cached flags either if:
+	 * - ima_inode_setxattr was called, or
+	 * - based on filesystem feature flag
+	 * forcing the file to be re-evaluated.
+	 */
+	if (test_and_clear_bit(IMA_CHANGE_XATTR, &iint->atomic_flags)) {
 		iint->flags &= ~IMA_DONE_MASK;
+	} else if (inode->i_sb->s_type->fs_flags & FS_IMA_NO_CACHE) {
+		iint->flags &= ~IMA_DONE_MASK;
+		if (action & IMA_MEASURE)
+			iint->measured_pcrs = 0;
+	}
 
 	/* Determine if already appraised/measured based on bitmask
 	 * (IMA_MEASURE, IMA_MEASURED, IMA__APPRAISE, IMA__APPRAISED,
-- 
2.13.6



[RFC PATCH v4 1/2] fuse: introduce new fs_type flag FS_IMA_NO_CACHE

2018-01-30 Thread Dongsu Park
From: Alban Crequy <al...@kinvolk.io>

This new fs_type flag FS_IMA_NO_CACHE means files should be re-measured,
re-appraised and re-audited each time. Cached integrity results should
not be used.

It is useful in FUSE because the userspace FUSE process can change the
underlying files at any time without notifying the kernel.

Cc: linux-kernel@vger.kernel.org
Cc: linux-integr...@vger.kernel.org
Cc: linux-security-mod...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: Miklos Szeredi <mik...@szeredi.hu>
Cc: Alexander Viro <v...@zeniv.linux.org.uk>
Cc: Mimi Zohar <zo...@linux.vnet.ibm.com>
Cc: Dmitry Kasatkin <dmitry.kasat...@gmail.com>
Cc: James Morris <jmor...@namei.org>
Cc: Christoph Hellwig <h...@infradead.org>
Acked-by: "Serge E. Hallyn" <se...@hallyn.com>
Acked-by: Seth Forshee <seth.fors...@canonical.com>
Tested-by: Dongsu Park <don...@kinvolk.io>
Signed-off-by: Alban Crequy <al...@kinvolk.io>
---
 fs/fuse/inode.c| 2 +-
 include/linux/fs.h | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 624f18bb..0a9e5164 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1205,7 +1205,7 @@ static void fuse_kill_sb_anon(struct super_block *sb)
 static struct file_system_type fuse_fs_type = {
 	.owner		= THIS_MODULE,
 	.name		= "fuse",
-	.fs_flags	= FS_HAS_SUBTYPE,
+	.fs_flags	= FS_HAS_SUBTYPE | FS_IMA_NO_CACHE,
 	.mount		= fuse_mount,
 	.kill_sb	= fuse_kill_sb_anon,
 };
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 511fbaab..ced841ba 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2075,6 +2075,7 @@ struct file_system_type {
 #define FS_BINARY_MOUNTDATA	2
 #define FS_HAS_SUBTYPE		4
 #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
+#define FS_IMA_NO_CACHE		16	/* Force IMA to re-measure, re-appraise, re-audit files */
 #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
 	struct dentry *(*mount) (struct file_system_type *, int,
 		       const char *, void *);
-- 
2.13.6



Re: [RFC PATCH v3 2/2] ima: force re-appraisal on filesystems with FS_IMA_NO_CACHE

2018-01-29 Thread Dongsu Park
Hi Mimi,

On Mon, Jan 29, 2018 at 5:33 PM, Mimi Zohar  wrote:
> Hi Alban,
>
> On Thu, 2018-01-25 at 06:56 -0500, Mimi Zohar wrote:
>> > > @@ -228,9 +229,28 @@ static int process_measurement(struct file *file, char *buf, loff_t size,
>> > >IMA_APPRAISE_SUBMASK | IMA_APPRAISED_SUBMASK |
>> > >IMA_ACTION_FLAGS);
>> > >
>> > > - if (test_and_clear_bit(IMA_CHANGE_XATTR, &iint->atomic_flags))
>> > > - /* reset all flags if ima_inode_setxattr was called */
>> > > + /*
>> > > +  * Reset the measure, appraise and audit cached flags either if:
>> > > +  * - ima_inode_setxattr was called, or
>> > > +  * - based on filesystem feature flag
>> > > +  * forcing the file to be re-evaluated.
>> > > +  */
>> > > + if (test_and_clear_bit(IMA_CHANGE_XATTR, &iint->atomic_flags)) {
>> > >   iint->flags &= ~IMA_DONE_MASK;
>> > > + } else if (inode->i_sb->s_type->fs_flags & FS_IMA_NO_CACHE) {
>> > > + if (action & IMA_MEASURE) {
>> > > + iint->measured_pcrs = 0;
>> > > + iint->flags &=
>> > > + ~(IMA_COLLECTED | IMA_MEASURE | IMA_MEASURED);
>> > > + }
>> > > + if (action & IMA_APPRAISE)
>> > > + iint->flags &=
>> > > + ~(IMA_COLLECTED | IMA_APPRAISE | IMA_APPRAISED |
>> > > +   IMA_APPRAISE_SUBMASK | IMA_APPRAISED_SUBMASK);
>> > > + if (action & IMA_AUDIT)
>> > > + iint->flags &=
>> > > + ~(IMA_COLLECTED | IMA_AUDIT | IMA_AUDITED);
>> > > + }
>> > >
>>
>> Alban, I don't know what I was thinking, but this can be simplified
>> like for the IMA_CHANGE_XATTR case.  Except in the IMA_CHANGE_XATTR
>> case, "measured_pcrs" was already reset, whereas in this case
>> "measured_pcrs" needs to be reset.
>
> Did you get a chance to make the change and test it?

Alban is on holiday and will be back around Wednesday, so in the
meantime I'll try to work out what you meant in your last email.

As IMA_DONE_MASK contains all other bitmasks, it's possible to
optimize the code like this:

	if (test_and_clear_bit(IMA_CHANGE_XATTR, &iint->atomic_flags)) {
		iint->flags &= ~IMA_DONE_MASK;
	} else if (inode->i_sb->s_type->fs_flags & FS_IMA_NO_CACHE) {
		iint->flags &= ~IMA_DONE_MASK;
		if (action & IMA_MEASURE)
			iint->measured_pcrs = 0;
	}

Is that what you want to see? Please let me know if it's not.
Tomorrow I will try to test with a new patch.

Thanks,
Dongsu

> Mimi
>


Re: [PATCH 0/2] turn on force option for FUSE in builtin policies

2018-01-16 Thread Dongsu Park
Hi Mimi,

On Tue, Jan 16, 2018 at 12:23 PM, Mimi Zohar <zo...@linux.vnet.ibm.com> wrote:
> On Tue, 2018-01-16 at 12:09 +0100, Dongsu Park wrote:
>> Since yesterday Alban and I have been working on a different approach
>> that does not depend on IMA rules, nor fsmagic. Please see:
>> https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1587390.html
>>
>> If that's ok, I'm ready to discard this patchset.
>
> You dropped a number of people involved in this discussion and mailing
> lists.  Please post the proposed patch inline as an RFC, cc'ing the
> same people, those involved in the discussion, and previous mailing
> lists, including LSM, integrity, and fsdevel.

Sorry about that. Starting with v2 of the patchset, we will Cc everyone correctly.
Thank you also for the detailed review.

Dongsu

> thanks,
>
> Mimi
>


Re: [PATCH 0/2] turn on force option for FUSE in builtin policies

2018-01-16 Thread Dongsu Park
Hi,

On Thu, Jan 11, 2018 at 8:51 PM, Dongsu Park <don...@kinvolk.io> wrote:
> In case of FUSE filesystem, cached integrity results in IMA could be
> reused, when the userspace FUSE process has changed the
> underlying files. To be able to avoid such cases, we need to turn on
> the force option in builtin policies, for actions of measure and
> appraise. Then integrity values become re-measured and re-appraised.
> In that way, cached integrity results won't be used.

Since yesterday Alban and I have been working on a different approach
that does not depend on IMA rules, nor fsmagic. Please see:
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1587390.html

If that's ok, I'm ready to discard this patchset.

Thanks,
Dongsu

> This patchset depends on the patch "ima: define a new policy option
> named force" by Mimi. [1]  For details on testing the force option,
> please refer to the testing report by Alban. [2]
>
> The first patch is for simply moving FUSE_*SUPER_MAGIC macros to
> include/uapi/linux, to be able to use those in other subsystems like
> security/integrity/ima.
>
> The second patch is actually to turn on the force option for FUSE fs
> in IMA.
>
> [1] https://www.spinics.net/lists/linux-integrity/msg00948.html
> [2] https://marc.info/?l=linux-integrity&m=151559360514676&w=2
>
>
> Dongsu Park (2):
>   fs/fuse: move SUPER_MAGIC definitions to linux/magic.h
>   ima: turn on force option for FUSE in builtin policies
>
>  fs/fuse/control.c   | 3 +--
>  fs/fuse/inode.c | 3 +--
>  include/uapi/linux/magic.h  | 3 +++
>  security/integrity/ima/ima_policy.c | 2 ++
>  4 files changed, 7 insertions(+), 4 deletions(-)
>
> --
> 2.13.6
>


Re: [PATCH 2/2] ima: turn on force option for FUSE in builtin policies

2018-01-16 Thread Dongsu Park
Hi,

On Sun, Jan 14, 2018 at 8:09 PM, kbuild test robot <l...@intel.com> wrote:
> [auto build test ERROR on linus/master]
> [also build test ERROR on v4.15-rc7 next-20180112]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]

As mentioned in the commit message, this patch depends on patches that
are not yet in mainline, nor even in next-integrity.
Please exclude it from kbuild.

Thanks,
Dongsu

> url:
> https://github.com/0day-ci/linux/commits/Dongsu-Park/turn-on-force-option-for-FUSE-in-builtin-policies/20180115-015830
> config: xtensa-allmodconfig (attached as .config)
> compiler: xtensa-linux-gcc (GCC) 7.2.0
> reproduce:
> wget 
> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> make.cross ARCH=xtensa
>
> All errors (new ones prefixed by >>):
>
>>> security/integrity/ima/ima_policy.c:130:74: error: 'IMA_FORCE' undeclared 
>>> here (not in a function); did you mean 'IMA_FUNC'?
>  {.action = MEASURE, .fsmagic = FUSE_SUPER_MAGIC, .flags = IMA_FSMAGIC | 
> IMA_FORCE},
>  
> ^
>  
> IMA_FUNC
>security/integrity/ima/ima_policy.c:158:73: error: invalid operands to 
> binary | (have 'int' and 'struct ima_rule_entry *')
>  {.action = APPRAISE, .fsmagic = FUSE_SUPER_MAGIC, .flags = IMA_FSMAGIC | 
> IMA_FORCE},
> ^
>security/integrity/ima/ima_policy.c:29:21: warning: initialization makes 
> integer from pointer without a cast [-Wint-conversion]
> #define IMA_FSMAGIC 0x0004
> ^
>security/integrity/ima/ima_policy.c:158:61: note: in expansion of macro 
> 'IMA_FSMAGIC'
>  {.action = APPRAISE, .fsmagic = FUSE_SUPER_MAGIC, .flags = IMA_FSMAGIC | 
> IMA_FORCE},
> ^~~
>security/integrity/ima/ima_policy.c:29:21: note: (near initialization for 
> 'default_appraise_rules[14].flags')
> #define IMA_FSMAGIC 0x0004
> ^
>security/integrity/ima/ima_policy.c:158:61: note: in expansion of macro 
> 'IMA_FSMAGIC'
>  {.action = APPRAISE, .fsmagic = FUSE_SUPER_MAGIC, .flags = IMA_FSMAGIC | 
> IMA_FORCE},
> ^~~
>security/integrity/ima/ima_policy.c:29:21: error: initializer element is 
> not constant
> #define IMA_FSMAGIC 0x0004
> ^
>security/integrity/ima/ima_policy.c:158:61: note: in expansion of macro 
> 'IMA_FSMAGIC'
>  {.action = APPRAISE, .fsmagic = FUSE_SUPER_MAGIC, .flags = IMA_FSMAGIC | 
> IMA_FORCE},
> ^~~
>security/integrity/ima/ima_policy.c:29:21: note: (near initialization for 
> 'default_appraise_rules[14].flags')
> #define IMA_FSMAGIC 0x0004
> ^
>security/integrity/ima/ima_policy.c:158:61: note: in expansion of macro 
> 'IMA_FSMAGIC'
>  {.action = APPRAISE, .fsmagic = FUSE_SUPER_MAGIC, .flags = IMA_FSMAGIC | 
> IMA_FORCE},
> ^~~
>
> vim +130 security/integrity/ima/ima_policy.c
>
>115
>116  static struct ima_rule_entry default_measurement_rules[] 
> __ro_after_init = {
>117  {.action = MEASURE, .func = MMAP_CHECK, .mask = MAY_EXEC,
>118   .flags = IMA_FUNC | IMA_MASK},
>119  {.action = MEASURE, .func = BPRM_CHECK, .mask = MAY_EXEC,
>120   .flags = IMA_FUNC | IMA_MASK},
>121  {.action = MEASURE, .func = FILE_CHECK, .mask = MAY_READ,
>122   .uid = GLOBAL_ROOT_UID, .uid_op = _eq,
>123   .flags = IMA_FUNC | IMA_INMASK | IMA_EUID},
>124  {.action = MEASURE, .func = FILE_CHECK, .mask = MAY_READ,
>125   .uid = GLOBAL_ROOT_UID, .uid_op = _eq,
>126   .flags = IMA_FUNC | IMA_INMASK | IMA_UID},
>127  {.action = MEASURE, .func = MODULE_CHECK, .flags = IMA_FUNC},
>128  {.action = MEASURE, .func = FIRMWARE_CHECK, .flags = 
> IMA_FUNC},
>129  {.action = MEASURE, .func = POLICY_CHECK, .flags = IMA_FUNC},
>  > 130  {.action = MEASURE, .fsmagic = FUSE_SUPER_MAGIC, .flags = 
> IMA_FSMAGIC | IMA_FORCE},
>131  };
>132
>
> ---
> 0-DAY kernel test infrastructure                Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all                   Intel Corporation


[PATCH 1/2] fs/fuse: move SUPER_MAGIC definitions to linux/magic.h

2018-01-11 Thread Dongsu Park
To be able to use FUSE_*SUPER_MAGIC macros in other subsystems like
security/integrity/ima, we need to move the definitions from fs/fuse
to include/uapi/linux/.

The FUSE_*SUPER_MAGIC macros are made available to userspace in the
same way as other filesystems.

Cc: linux-fsde...@vger.kernel.org
Cc: linux-integr...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Alban Crequy <al...@kinvolk.io>
Cc: Miklos Szeredi <mik...@szeredi.hu>
Cc: Mimi Zohar <zo...@linux.vnet.ibm.com>
Cc: Seth Forshee <seth.fors...@canonical.com>
Signed-off-by: Dongsu Park <don...@kinvolk.io>
---
 fs/fuse/control.c  | 3 +--
 fs/fuse/inode.c| 3 +--
 include/uapi/linux/magic.h | 3 +++
 3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/fuse/control.c b/fs/fuse/control.c
index b9ea99c5..9015c15c 100644
--- a/fs/fuse/control.c
+++ b/fs/fuse/control.c
@@ -10,8 +10,7 @@
 
 #include 
 #include 
-
-#define FUSE_CTL_SUPER_MAGIC 0x65735543
+#include <linux/magic.h>
 
 /*
  * This is non-NULL when the single instance of the control filesystem
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 8c98edee..57371b77 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include <linux/magic.h>
 
 MODULE_AUTHOR("Miklos Szeredi <mik...@szeredi.hu>");
 MODULE_DESCRIPTION("Filesystem in Userspace");
@@ -49,8 +50,6 @@ MODULE_PARM_DESC(max_user_congthresh,
  "Global limit for the maximum congestion threshold an "
  "unprivileged user can set");
 
-#define FUSE_SUPER_MAGIC 0x65735546
-
 #define FUSE_DEFAULT_BLKSIZE 512
 
 /** Maximum number of outstanding background requests */
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 1a6fee97..1534e99c 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -90,4 +90,7 @@
 #define BALLOON_KVM_MAGIC  0x13661366
 #define ZSMALLOC_MAGIC 0x58295829
 
+#define FUSE_CTL_SUPER_MAGIC   0x65735543
+#define FUSE_SUPER_MAGIC   0x65735546
+
 #endif /* __LINUX_MAGIC_H__ */
-- 
2.13.6



[PATCH 2/2] ima: turn on force option for FUSE in builtin policies

2018-01-11 Thread Dongsu Park
On a FUSE filesystem, IMA may reuse cached integrity results even after
the userspace FUSE process has changed the underlying files. To avoid
this, turn on the force option in the builtin policies for the measure
and appraise actions. Files are then re-measured and re-appraised, and
cached integrity results are not used.

This patch depends on the patch "ima: define a new policy option
named force" by Mimi. [1]

How to test the force option written by Alban:



The test I did was using a patched version of the memfs FUSE driver
[2][3] and two very simple "hello-world" programs [5] (prog1 prints
"hello world: 1" and prog2 prints "hello world: 2").

I copy prog1 and prog2 into the fuse-memfs mount point, execute them and
check the sha1 hash in
"/sys/kernel/security/ima/ascii_runtime_measurements".

My patch on the memfs FUSE driver added a backdoor command to serve
prog1 when the kernel asks for prog2 or vice-versa. In this way, I can
exec prog1 and get it to print "hello world: 2" without ever replacing
the file via the VFS, so the kernel is not aware of the change.

The test was done using the branch "dongsu/fuse-userns-v5-2" [4],
including both this new force option and Sascha's patch ("ima: Use
i_version only when filesystem supports it").


Step by step test procedure:

1. Mount the memfs FUSE using [3]:
rm -f  /tmp/memfs-switch* ; memfs -L DEBUG  /mnt/memfs

2. Copy prog1 and prog2 using [5]
cp prog1 /mnt/memfs/prog1
cp prog2 /mnt/memfs/prog2

3. Look up the files and let the FUSE driver keep the handles open:
dd if=/mnt/memfs/prog1 bs=1 | (read -n 1 x ; sleep 3600 ) &
dd if=/mnt/memfs/prog2 bs=1 | (read -n 1 x ; sleep 3600 ) &

4. Check the 2 programs work correctly:
$ /mnt/memfs/prog1
hello world: 1
$ /mnt/memfs/prog2
hello world: 2

5. Check the measurements for prog1 and prog2:
$ sudo cat /sys/kernel/security/ima/ascii_runtime_measurements | grep /mnt/memfs/prog
10 7ac5aed52061cb09120e977c6d04ee5c7b11c371 ima-ng sha1:ac14c9268cd2811f7a5adea17b27d84f50e1122c /mnt/memfs/prog1
10 9acc17a9a32aec4a676b8f6558e17a3d6c9a78e6 ima-ng sha1:799cb5d1e06d5c37ae7a76ba25ecd1bd01476383 /mnt/memfs/prog2

6. Use the backdoor command in my patched memfs to redirect file
operations on file handle 3 to file handle 2:
rm -f  /tmp/memfs-switch* ; touch /tmp/memfs-switch-3-2

7. Check how the FUSE driver serves different content for the files:
$ /mnt/memfs/prog1
hello world: 2
$ /mnt/memfs/prog2
hello world: 2

8. Check the measurements:
$ sudo cat /sys/kernel/security/ima/ascii_runtime_measurements | grep /mnt/memfs/prog

Without the patches, on a vanilla kernel, there are no new
measurements, despite the FUSE driver having served different
executables.

However, with the "force" option enabled, I can see additional
measurements for prog1 and prog2 with the hashes reversed when the
FUSE driver served the alternative content.
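
The difference between the two outcomes can be illustrated with a toy
userspace model. This is only a sketch and not IMA code: toy_file,
toy_hash() and toy_measure() are made-up names, the hash is a trivial
string hash rather than sha1, and the real cache lives in the kernel's
integrity subsystem. It assumes the force option simply means "do not
trust the cached result, read the content again".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a file served by a userspace filesystem daemon. */
struct toy_file {
    const char *content;  /* what the daemon currently serves */
    unsigned cached_hash; /* last hash recorded by the toy "IMA" */
    bool measured;        /* true once a hash has been cached */
};

/* Trivial string hash (djb2 style), standing in for sha1. */
static unsigned toy_hash(const char *s)
{
    unsigned h = 5381;

    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/*
 * Returns true when a new measurement is recorded.
 * Without force, a cached result is reused and the content is not
 * looked at again. With force, the content is always hashed again.
 */
static bool toy_measure(struct toy_file *f, bool force)
{
    unsigned h;

    assert(f != NULL && f->content != NULL);

    if (f->measured && !force)
        return false; /* cached result reused */

    h = toy_hash(f->content);
    if (f->measured && h == f->cached_hash)
        return false; /* re-measured, content unchanged */

    f->cached_hash = h;
    f->measured = true;
    return true;
}
```

In this model, swapping f->content behind the cache's back corresponds
to the backdoor in the patched memfs: a call without force records
nothing new, while a call with force records a new measurement.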



[1] https://www.spinics.net/lists/linux-integrity/msg00948.html
[2] https://github.com/bbengfort/memfs
[3] https://github.com/kinvolk/memfs/commits/alban/switch-files
[4] https://github.com/kinvolk/linux/commits/dongsu/fuse-userns-v5-2
[5] https://github.com/kinvolk/fuse-userns-patches/commit/cf1f5750cab0

Cc: linux-integr...@vger.kernel.org
Cc: linux-security-mod...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Miklos Szeredi <mik...@szeredi.hu>
Cc: Mimi Zohar <zo...@linux.vnet.ibm.com>
Cc: Seth Forshee <seth.fors...@canonical.com>
Tested-by: Alban Crequy <al...@kinvolk.io>
Signed-off-by: Dongsu Park <don...@kinvolk.io>
---
 security/integrity/ima/ima_policy.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/security/integrity/ima/ima_policy.c b/security/integrity/ima/ima_policy.c
index fddef8f8..8de40d85 100644
--- a/security/integrity/ima/ima_policy.c
+++ b/security/integrity/ima/ima_policy.c
@@ -127,6 +127,7 @@ static struct ima_rule_entry default_measurement_rules[] __ro_after_init = {
 	{.action = MEASURE, .func = MODULE_CHECK, .flags = IMA_FUNC},
 	{.action = MEASURE, .func = FIRMWARE_CHECK, .flags = IMA_FUNC},
 	{.action = MEASURE, .func = POLICY_CHECK, .flags = IMA_FUNC},
+	{.action = MEASURE, .fsmagic = FUSE_SUPER_MAGIC, .flags = IMA_FSMAGIC | IMA_FORCE},
 };
 
 static struct ima_rule_entry default_appraise_rules[] __ro_after_init = {
@@ -154,6 +155,7 @@ static struct ima_rule_entry default_appraise_rules[] __ro_after_init = {
 	{.action = APPRAISE, .fowner = GLOBAL_ROOT_UID, .fowner_op = &uid_eq,
 	 .flags = IMA_FOWNER | IMA_DIGSIG_REQUIRED},
 #endif
+	{.action = APPRAISE, .fsmagic = FUSE_SUPER_MAGIC, .flags = IMA_FSMAGIC | IMA_FORCE},
 };
 
 static struct ima_rule_entry secure_boot_rules[] __ro_after_init = {
-- 
2.13.6



[PATCH 0/2] turn on force option for FUSE in builtin policies

2018-01-11 Thread Dongsu Park
On a FUSE filesystem, IMA could reuse cached integrity results even
after the userspace FUSE process has changed the underlying files.
To avoid that, we need to turn on the force option in the builtin
policies for the measure and appraise actions. Integrity values are
then re-measured and re-appraised instead of being taken from the
cache.

This patchset depends on the patch "ima: define a new policy option
named force" by Mimi. [1]  For details on testing the force option,
please refer to the testing report by Alban. [2]

The first patch simply moves the FUSE_*SUPER_MAGIC macros to
include/uapi/linux/magic.h, so that other subsystems such as
security/integrity/ima can use them.

The second patch turns on the force option for FUSE filesystems
in IMA.
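
As an illustration of what a shared magic number is good for, here is
a minimal userspace sketch that checks whether a path lives on a FUSE
filesystem by comparing the f_type reported by statfs(2) against
FUSE_SUPER_MAGIC. The helper names is_fuse_magic() and
path_is_on_fuse() are made up for this example. The two constants are
copied from the patch and defined locally, so the sketch does not rely
on the installed kernel headers already carrying them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/vfs.h>

/* Values as added to include/uapi/linux/magic.h by the first patch. */
#ifndef FUSE_SUPER_MAGIC
#define FUSE_SUPER_MAGIC 0x65735546
#endif
#ifndef FUSE_CTL_SUPER_MAGIC
#define FUSE_CTL_SUPER_MAGIC 0x65735543
#endif

/* True only for regular FUSE mounts, not for the fusectl filesystem. */
static bool is_fuse_magic(unsigned long f_type)
{
    return f_type == FUSE_SUPER_MAGIC;
}

/*
 * Returns 1 if path is on a FUSE filesystem, 0 if it is not,
 * and -1 if statfs() fails (errno is left as set by statfs()).
 */
static int path_is_on_fuse(const char *path)
{
    struct statfs st;

    assert(path != NULL);
    if (statfs(path, &st) != 0)
        return -1;
    return is_fuse_magic((unsigned long)st.f_type) ? 1 : 0;
}
```

The IMA rules in the second patch match on the same value through the
.fsmagic field, only inside the kernel instead of through statfs(2).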

[1] https://www.spinics.net/lists/linux-integrity/msg00948.html
[2] https://marc.info/?l=linux-integrity&m=151559360514676&w=2


Dongsu Park (2):
  fs/fuse: move SUPER_MAGIC definitions to linux/magic.h
  ima: turn on force option for FUSE in builtin policies

 fs/fuse/control.c                   | 3 +--
 fs/fuse/inode.c                     | 3 +--
 include/uapi/linux/magic.h          | 3 +++
 security/integrity/ima/ima_policy.c | 2 ++
 4 files changed, 7 insertions(+), 4 deletions(-)

-- 
2.13.6



Re: [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes

2018-01-09 Thread Dongsu Park
Hi,

On Fri, Jan 5, 2018 at 8:24 PM, Luis R. Rodriguez <mcg...@kernel.org> wrote:
> On Fri, Dec 22, 2017 at 03:32:27PM +0100, Dongsu Park wrote:
>> diff --git a/fs/attr.c b/fs/attr.c
>> index 12ffdb6f..bf8e94f3 100644
>> --- a/fs/attr.c
>> +++ b/fs/attr.c
>> @@ -18,6 +18,30 @@
>>  #include 
>>  #include 
>>
>> +static bool chown_ok(const struct inode *inode, kuid_t uid)
>> +{
>> + if (uid_eq(current_fsuid(), inode->i_uid) &&
>> + uid_eq(uid, inode->i_uid))
>> + return true;
>> + if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
>> + return true;
>> + if (ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
>> + return true;
>> + return false;
>> +}
>> +
>> +static bool chgrp_ok(const struct inode *inode, kgid_t gid)
>> +{
>> + if (uid_eq(current_fsuid(), inode->i_uid) &&
>> + (in_group_p(gid) || gid_eq(gid, inode->i_gid)))
>> + return true;
>> + if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
>> + return true;
>> + if (ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
>> + return true;
>> + return false;
>> +}
>> +
>>  /**
>>   * setattr_prepare - check if attribute changes to a dentry are allowed
>>   * @dentry:  dentry to check
>> @@ -52,17 +76,11 @@ int setattr_prepare(struct dentry *dentry, struct iattr *attr)
>>   goto kill_priv;
>>
>>   /* Make sure a caller can chown. */
>> - if ((ia_valid & ATTR_UID) &&
>> - (!uid_eq(current_fsuid(), inode->i_uid) ||
>> -  !uid_eq(attr->ia_uid, inode->i_uid)) &&
>> - !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
>> + if ((ia_valid & ATTR_UID) && !chown_ok(inode, attr->ia_uid))
>>   return -EPERM;
>
> I think this patch would read much better and easier to review if it was
> split up by first adding the helpers, and then extending them afterwards.

I'm fine with splitting it up into multiple patches, if the original author
Eric agrees.

>>   /* Make sure caller can chgrp. */
>> - if ((ia_valid & ATTR_GID) &&
>> - (!uid_eq(current_fsuid(), inode->i_uid) ||
>> - (!in_group_p(attr->ia_gid) && !gid_eq(attr->ia_gid, inode->i_gid))) &&
>> - !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
>> + if ((ia_valid & ATTR_GID) && !chgrp_ok(inode, attr->ia_gid))
>>   return -EPERM;
>>
>>   /* Make sure a caller can chmod. */
>> diff --git a/fs/proc/base.c b/fs/proc/base.c
>> index 31934cb9..9d50ec92 100644
>> --- a/fs/proc/base.c
>> +++ b/fs/proc/base.c
>> @@ -665,10 +665,17 @@ int proc_setattr(struct dentry *dentry, struct iattr *attr)
>>  {
>>   int error;
>>   struct inode *inode = d_inode(dentry);
>> + struct user_namespace *s_user_ns;
>>
>>   if (attr->ia_valid & ATTR_MODE)
>>   return -EPERM;
>>
>> + /* Don't let anyone mess with weird proc files */
>> + s_user_ns = inode->i_sb->s_user_ns;
>> + if (!kuid_has_mapping(s_user_ns, inode->i_uid) ||
>> + !kgid_has_mapping(s_user_ns, inode->i_gid))
>> + return -EPERM;
>> +
>>   error = setattr_prepare(dentry, attr);
>>   if (error)
>>   return error;
>
> Are we sure proc is the only special one? How was it first observed that this
> was required for proc? Has anyone tried fuzzing by trying this op with a slew
> of other filesystems on all files?

From my limited knowledge of procfs, it is a little different from
ordinary filesystems. Procfs is not fully namespaced and has many
inconsistencies. Some files under /proc should be owned by the global
root regardless of user namespaces, which is why we need to handle such
special cases for proc. Since it has been like that from the beginning,
it is hard to change fundamentally.
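
To make the proc check in the quoted hunk concrete, here is a toy
userspace model of it. This is a sketch under simplifying assumptions:
a namespace is reduced to a single contiguous range of kernel-side ids
(the real kernel supports several extents and nested namespaces), and
toy_userns, toy_id_has_mapping() and toy_proc_setattr_check() are
made-up names, not kernel APIs.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* One contiguous range of kernel-side ids visible in a namespace. */
struct toy_userns {
    unsigned lower_first; /* first kernel-side id that is mapped */
    unsigned count;       /* number of ids in the range */
};

/* Toy counterpart of kuid_has_mapping() / kgid_has_mapping(). */
static bool toy_id_has_mapping(const struct toy_userns *ns, unsigned id)
{
    return id >= ns->lower_first && id - ns->lower_first < ns->count;
}

/*
 * Mirrors the shape of the added check in proc_setattr(): refuse the
 * attribute change when the inode's owner or group has no mapping in
 * the namespace that owns the superblock.
 */
static int toy_proc_setattr_check(const struct toy_userns *s_user_ns,
                                  unsigned i_uid, unsigned i_gid)
{
    assert(s_user_ns != NULL);

    if (!toy_id_has_mapping(s_user_ns, i_uid) ||
        !toy_id_has_mapping(s_user_ns, i_gid))
        return -EPERM;
    return 0;
}
```

With a namespace that maps kernel ids 100000 to 165535, a file owned by
kernel uid 0 (global root) has no mapping, so the toy check returns
-EPERM for it.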

However, you raise a good point. Besides procfs, other filesystems could
have issues when privileges are relaxed. The question is how we can be
sure that there are no hidden issues. As far as I understand, we would
usually run test suites like LTP
(https://github.com/linux-test-project/ltp.git) to catch such regressions.
Today I ran the LTP tests for fs and containers with the patchset applied.
They seemed to work fine, without failures. Obviously that does not mean
the patchset is bug-free when we are talking about unknown issues.
Please let me know if there are other good ways to find potential issues.

Thanks,
Dongsu

>   Luis


Re: [PATCH v5 00/11] FUSE mounts from non-init user namespaces

2018-01-09 Thread Dongsu Park
Hi,

On Mon, Dec 25, 2017 at 8:05 AM, Eric W. Biederman
<ebied...@xmission.com> wrote:
> Dongsu Park <don...@kinvolk.io> writes:
>
>> This patchset v5 is based on work by Seth Forshee and Eric Biederman.
>> The latest patchset was v4:
>> https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1132206.html
>>
>> At the moment, filesystems backed by physical medium can only be mounted
>> by real root in the initial user namespace. This restriction exists
>> because if it's allowed for root user in non-init user namespaces to
>> mount the filesystem, then it effectively allows the user to control the
>> underlying source of the filesystem. In case of FUSE, the source would
>> mean any underlying device.
>>
>> However, in many use cases such as containers, it's necessary to allow
>> filesystems to be mounted from non-init user namespaces. The goal of this
>> patchset is to allow FUSE filesystems to be mounted from non-init user
>> namespaces. Support for other filesystems like ext4 is not in the
>> scope of this patchset.
>>
>> Let me describe how to test mounting from non-init user namespaces. It's
>> assumed that tests are done via sshfs, a userspace filesystem based on
>> FUSE with ssh as backend. Testing system is Fedora 27.
>
> In general I am for this work, and more bodies and more eyes on it is
> generally better.
>
> I will review this after the New Year, I am out for the holidays right
> now.

Thanks. I'll wait for your review.

Dongsu

> Eric
>
>
>>
>> 
>> $ sudo dnf install -y sshfs
>> $ sudo mkdir -p /mnt/userns
>>
>> ### workaround to get the sshfs permission checks
>> $ sudo chown -R $UID:$UID /etc/ssh/ssh_config.d /usr/share/crypto-policies
>>
>> $ unshare -U -r -m
>> # sshfs root@localhost: /mnt/userns
>>
>> ### You can see sshfs being mounted from a non-init user namespace
>> # mount | grep sshfs
>> root@localhost: on /mnt/userns type fuse.sshfs
>> (rw,nosuid,nodev,relatime,user_id=0,group_id=0)
>>
>> # touch /mnt/userns/test
>> # ls -l /mnt/userns/test
>> -rw-r--r-- 1 root root 0 Dec 11 19:01 /mnt/userns/test
>> 
>>
>> Open another terminal, check the mountpoint from outside the namespace.
>>
>> 
>> $ grep userns /proc/$(pidof sshfs)/mountinfo
>> 131 102 0:35 / /mnt/userns rw,nosuid,nodev,relatime - fuse.sshfs
>> root@localhost: rw,user_id=0,group_id=0
>> 
>>
>> After all tests are done, you can unmount the filesystem
>> inside the namespace.
>>
>> 
>> # fusermount -u /mnt/userns
>> 
>>
>> Changes since v4:
>>  * Remove other parts like ext4 to keep the patchset minimal for FUSE
>>  * Add and change commit messages
>>  * Describe how to test non-init user namespaces
>>
>> TODO:
>>  * Think through potential security implications. There are 2 patches
>>    being prepared for security issues. One is "ima: define a new policy
>>    option named force" by Mimi Zohar, which adds an option to specify
>>    that the results should not be cached:
>>    https://marc.info/?l=linux-integrity&m=151275680115856&w=2
>>    The other one is to basically prevent FUSE results from being cached,
>>    which is still in progress.
>>
>>  * Test IMA/LSMs. Details are written in
>>    https://github.com/kinvolk/fuse-userns-patches/blob/master/tests/TESTING_INTEGRITY.md
>>
>> Patches 1-2 deal with an additional flag of lookup_bdev() to check for
>> additional inode permission.
>>
>> Patches 3-7 allow the superblock owner to change ownership of inodes, and
>> deal with additional capability checks w.r.t user namespaces.
>>
>> Patches 8-10 allow FUSE filesystems to be mounted outside of the init
>> user namespace.
>>
>> Patch 11 handles a corner case of non-root users in EVM.
>>
>> The patchset is also available in our github repo:
>>   https://github.com/kinvolk/linux/tree/dongsu/fuse-userns-v5-1
>>
>>
>> Eric W. Biederman (1):
>>   fs: Allow superblock owner to change ownership of inodes
>>
>> Seth Forshee (10):
>>   block_dev: Support checking inode permissions in lookup_bdev()
>>   mtd: Check permissions towards mtd block device inode when mounting
>>   fs: Don't remove suid for CAP_FSETID for userns root
>>   fs: Allow superblock owner to access do_remount_sb()
>>   capabilities: Allow privileged user in s_user_ns to set security.*
>> xattrs
>>   fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems
>>   fuse: Support fuse filesystems outside of init_user_ns

Re: [PATCH 04/11] fs: Don't remove suid for CAP_FSETID for userns root

2017-12-23 Thread Dongsu Park
Hi,

On Sat, Dec 23, 2017 at 4:26 AM, Serge E. Hallyn <se...@hallyn.com> wrote:
> On Fri, Dec 22, 2017 at 03:32:28PM +0100, Dongsu Park wrote:
>> From: Seth Forshee <seth.fors...@canonical.com>
>>
>> Expand the check in should_remove_suid() to keep privileges for
>
> I realize this description came from Seth, but reading it now,
> 'Expand' seems wrong.  Expanding a check brings to my mind making
> it stricter, not looser.  How about 'Relax the check' ?

Makes sense. Will do.

>> CAP_FSETID in s_user_ns rather than init_user_ns.
>>
>> Patch v4 is available: https://patchwork.kernel.org/patch/8944621/
>>
>> --EWB Changed from ns_capable(sb->s_user_ns, ) to capable_wrt_inode_uidgid
>
> Why exactly?
>
> This is wrong, because capable_wrt_inode_uidgid() does a check
> against current_user_ns, not the  inode->i_sb->s_user_ns

Ah, I see.
I suppose it was changed for the sake of privileged_wrt_inode_uidgid(),
which capable_wrt_inode_uidgid() calls. But as you pointed out, that checks
against current_user_ns, which is wrong here. I would create another
wrapper like capable_userns_wrt_inode_uidgid(), which takes an
additional (struct user_namespace *) parameter, so that it can check
both ns_capable() and privileged_wrt_inode_uidgid().
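
A toy userspace model may help show why the namespace argument matters.
This is only a sketch of the idea, not kernel code: all toy_* names are
made up, a namespace maps a single contiguous id range, and
toy_ns_capable() ignores the rule that the owner of a child namespace
gets all capabilities in it. toy_capable_wrt_inode() models the
existing helper, which always checks against the caller's own
namespace, and toy_capable_userns_wrt_inode() models the proposed
wrapper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_CAP_FSETID (1u << 0)

struct toy_ns {
    const struct toy_ns *parent; /* NULL for the initial namespace */
    unsigned lower_first;        /* first kernel-side id mapped here */
    unsigned count;              /* number of ids mapped */
};

struct toy_cred {
    const struct toy_ns *user_ns; /* namespace the caller lives in */
    unsigned caps;                /* capabilities held in that namespace */
};

struct toy_inode {
    unsigned i_uid, i_gid; /* kernel-side ids */
};

static bool toy_has_mapping(const struct toy_ns *ns, unsigned id)
{
    return id >= ns->lower_first && id - ns->lower_first < ns->count;
}

/* The caller has cap in ns if ns is the caller's namespace or a descendant. */
static bool toy_ns_capable(const struct toy_cred *cred,
                           const struct toy_ns *ns, unsigned cap)
{
    for (; ns; ns = ns->parent)
        if (ns == cred->user_ns)
            return (cred->caps & cap) != 0;
    return false;
}

static bool toy_privileged_wrt_inode(const struct toy_ns *ns,
                                     const struct toy_inode *inode)
{
    return toy_has_mapping(ns, inode->i_uid) &&
           toy_has_mapping(ns, inode->i_gid);
}

/* Existing behaviour: both checks use the caller's own namespace. */
static bool toy_capable_wrt_inode(const struct toy_cred *cred,
                                  const struct toy_inode *inode, unsigned cap)
{
    return toy_ns_capable(cred, cred->user_ns, cap) &&
           toy_privileged_wrt_inode(cred->user_ns, inode);
}

/* Proposed wrapper: both checks use the namespace passed in (s_user_ns). */
static bool toy_capable_userns_wrt_inode(const struct toy_cred *cred,
                                         const struct toy_ns *ns,
                                         const struct toy_inode *inode,
                                         unsigned cap)
{
    return toy_ns_capable(cred, ns, cap) &&
           toy_privileged_wrt_inode(ns, inode);
}
```

In this model, root in a child namespace passes the first helper for
any inode whose ids are mapped in that child namespace, even when the
superblock belongs to the initial namespace. The second helper, called
with the superblock's namespace, refuses in that case.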

Thanks,
Dongsu

>> Cc: linux-fsde...@vger.kernel.org
>> Cc: linux-kernel@vger.kernel.org
>> Cc: Alexander Viro <v...@zeniv.linux.org.uk>
>> Cc: Serge Hallyn <se...@hallyn.com>
>> Signed-off-by: Seth Forshee <seth.fors...@canonical.com>
>> Signed-off-by: Dongsu Park <don...@kinvolk.io>
>> ---
>>  fs/inode.c | 6 ++++--
>>  1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/inode.c b/fs/inode.c
>> index fd401028..6459a437 100644
>> --- a/fs/inode.c
>> +++ b/fs/inode.c
>> @@ -1749,7 +1749,8 @@ EXPORT_SYMBOL(touch_atime);
>>   */
>>  int should_remove_suid(struct dentry *dentry)
>>  {
>> - umode_t mode = d_inode(dentry)->i_mode;
>> + struct inode *inode = d_inode(dentry);
>> + umode_t mode = inode->i_mode;
>>   int kill = 0;
>>
>>   /* suid always must be killed */
>> @@ -1763,7 +1764,8 @@ int should_remove_suid(struct dentry *dentry)
>>   if (unlikely((mode & S_ISGID) && (mode & S_IXGRP)))
>>   kill |= ATTR_KILL_SGID;
>>
>> - if (unlikely(kill && !capable(CAP_FSETID) && S_ISREG(mode)))
>> + if (unlikely(kill && !capable_wrt_inode_uidgid(inode, CAP_FSETID) &&
>> +  S_ISREG(mode)))
>>   return kill;
>>
>>   return 0;
>> --
>> 2.13.6


Re: [PATCH 04/11] fs: Don't remove suid for CAP_FSETID for userns root

2017-12-23 Thread Dongsu Park
Hi,

On Sat, Dec 23, 2017 at 4:26 AM, Serge E. Hallyn  wrote:
> On Fri, Dec 22, 2017 at 03:32:28PM +0100, Dongsu Park wrote:
>> From: Seth Forshee 
>>
>> Expand the check in should_remove_suid() to keep privileges for
>
> I realize this description came from Seth, but reading it now,
> 'Expand' seems wrong.  Expanding a check brings to my mind making
> it stricter, not looser.  How about 'Relax the check' ?

Makes sense. Will do.

>> CAP_FSETID in s_user_ns rather than init_user_ns.
>>
>> Patch v4 is available: https://patchwork.kernel.org/patch/8944621/
>>
>> --EWB Changed from ns_capable(sb->s_user_ns, ) to capable_wrt_inode_uidgid
>
> Why exactly?
>
> This is wrong, because capable_wrt_inode_uidgid() does a check
> against current_user_ns, not the  inode->i_sb->s_user_ns

Ah, I see.
I suppose it was changed for the sake of the privileged_wrt_inode_uidgid()
check that capable_wrt_inode_uidgid() performs. But as you pointed out,
that function checks against current_user_ns, which is wrong here. I would
just add another wrapper, something like capable_userns_wrt_inode_uidgid(),
that takes an additional (struct user_namespace *) parameter so that it
can check both ns_capable() and privileged_wrt_inode_uidgid() against the
given namespace.
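
To make the idea concrete, here is a rough userspace sketch of what such a
wrapper could look like. It is not kernel code: the structs are toy
stand-ins, the caller is passed explicitly instead of using current, the
model_* helpers are simplified assumptions, and the wrapper name is only
the hypothetical one suggested above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace toy model; these structs only mimic the kernel types. */
struct user_namespace {
	const struct user_namespace *parent;	/* NULL for the initial namespace */
	unsigned int first_id;			/* first kernel id mapped into this ns */
	unsigned int count;			/* number of mapped ids */
};

struct inode {
	unsigned int i_uid;
	unsigned int i_gid;
};

struct task {
	const struct user_namespace *user_ns;	/* namespace of the caller */
	unsigned long caps;			/* capabilities held in user_ns */
};

#define CAP_FSETID 4	/* same number as in linux/capability.h */

static bool id_has_mapping(const struct user_namespace *ns, unsigned int id)
{
	return id >= ns->first_id && id - ns->first_id < ns->count;
}

/*
 * Simplified ns_capable(): walk from the target namespace towards the root
 * and test the capability once the caller's namespace is reached.  The
 * kernel's extra rule for namespace owners is left out of this model.
 */
static bool model_ns_capable(const struct task *t,
			     const struct user_namespace *ns, int cap)
{
	assert(t && ns);
	for (; ns; ns = ns->parent)
		if (ns == t->user_ns)
			return (t->caps >> cap) & 1UL;
	return false;
}

/* Same idea as privileged_wrt_inode_uidgid(): both owners must map into ns. */
static bool model_privileged_wrt_inode_uidgid(const struct user_namespace *ns,
					      const struct inode *inode)
{
	return id_has_mapping(ns, inode->i_uid) &&
	       id_has_mapping(ns, inode->i_gid);
}

/*
 * Hypothetical wrapper: the namespace is a parameter instead of
 * current_user_ns(), so a caller could pass inode->i_sb->s_user_ns.
 */
static bool capable_userns_wrt_inode_uidgid(const struct task *t,
					    const struct user_namespace *ns,
					    const struct inode *inode, int cap)
{
	return model_ns_capable(t, ns, cap) &&
	       model_privileged_wrt_inode_uidgid(ns, inode);
}

/* Example fixtures (made-up id ranges). */
static const struct user_namespace init_ns = { NULL, 0, 4294967295u };
static const struct user_namespace child_ns = { &init_ns, 100000, 65536 };
static const struct user_namespace other_ns = { &init_ns, 200000, 65536 };
static const struct task child_root = { &child_ns, 1UL << CAP_FSETID };
static const struct task other_root = { &other_ns, 1UL << CAP_FSETID };
static const struct inode mapped_inode = { 100000, 100000 };
static const struct inode unmapped_inode = { 0, 0 };
```

In this model, child_root passes the check for mapped_inode when the
superblock namespace is child_ns, but fails for unmapped_inode, and
other_root fails for both because its namespace is not child_ns or an
ancestor of it.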

Thanks,
Dongsu

>> Cc: linux-fsde...@vger.kernel.org
>> Cc: linux-kernel@vger.kernel.org
>> Cc: Alexander Viro 
>> Cc: Serge Hallyn 
>> Signed-off-by: Seth Forshee 
>> Signed-off-by: Dongsu Park 
>> ---
>>  fs/inode.c | 6 ++++--
>>  1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/inode.c b/fs/inode.c
>> index fd401028..6459a437 100644
>> --- a/fs/inode.c
>> +++ b/fs/inode.c
>> @@ -1749,7 +1749,8 @@ EXPORT_SYMBOL(touch_atime);
>>   */
>>  int should_remove_suid(struct dentry *dentry)
>>  {
>> -	umode_t mode = d_inode(dentry)->i_mode;
>> +	struct inode *inode = d_inode(dentry);
>> +	umode_t mode = inode->i_mode;
>>  	int kill = 0;
>>
>>  	/* suid always must be killed */
>> @@ -1763,7 +1764,8 @@ int should_remove_suid(struct dentry *dentry)
>>  	if (unlikely((mode & S_ISGID) && (mode & S_IXGRP)))
>>  		kill |= ATTR_KILL_SGID;
>>
>> -	if (unlikely(kill && !capable(CAP_FSETID) && S_ISREG(mode)))
>> +	if (unlikely(kill && !capable_wrt_inode_uidgid(inode, CAP_FSETID) &&
>> +		     S_ISREG(mode)))
>>  		return kill;
>>
>>  	return 0;
>> --
>> 2.13.6


Re: [PATCH 02/11] mtd: Check permissions towards mtd block device inode when mounting

2017-12-23 Thread Dongsu Park
Hi,

On Fri, Dec 22, 2017 at 10:06 PM, Richard Weinberger
<richard.weinber...@gmail.com> wrote:
> Dongsu,
>
> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <don...@kinvolk.io> wrote:
>> From: Seth Forshee <seth.fors...@canonical.com>
>>
>> Unprivileged users should not be able to mount mtd block devices
>> when they lack sufficient privileges towards the block device
>> inode.  Update mount_mtd() to validate that the user has the
>> required access to the inode at the specified path. The check
>> will be skipped for CAP_SYS_ADMIN, so privileged mounts will
>> continue working as before.
>
> What is the big picture of this?
> Can in future an unprivileged user just mount UBIFS?

I'm not sure I'm aware of all the use cases for mtd and ubifs.
As I understand it, many container runtimes these days (docker, lxc, runc,
bubblewrap, etc.) allow unprivileged users to run containers.
That's why the kernel has to perform additional permission checks
that might not have been necessary in the past.
This MTD patch is one of those cases.

> Please note that UBIFS sits on top of a character device and not a block 
> device.

Aha, good to know.

Thanks,
Dongsu

> --
> Thanks,
> //richard


Re: [PATCH 01/11] block_dev: Support checking inode permissions in lookup_bdev()

2017-12-23 Thread Dongsu Park
Hi,

On Fri, Dec 22, 2017 at 7:59 PM, Coly Li <i...@coly.li> wrote:
> On 22/12/2017 10:32 PM, Dongsu Park wrote:
> Hi Dongsu,
>
> Could you please use a macro like NO_PERMISSION_CHECK to replace the
> hard-coded 0? Then I would not need to look up what 0 means in the
> new lookup_bdev().

I see. I'll do that.
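
As a rough illustration of the suggestion, the following userspace sketch
models the mask semantics with a named constant. NO_PERMISSION_CHECK is
only the name proposed in the review, and bdev_permission() with its
granted and admin parameters is a made-up stand-in for the
__inode_permission() and capable(CAP_SYS_ADMIN) calls in the patch.

```c
#include <errno.h>
#include <stdbool.h>

/* Values match the kernel's MAY_* bits in include/linux/fs.h. */
#define MAY_EXEC	0x1
#define MAY_WRITE	0x2
#define MAY_READ	0x4

/* Hypothetical name taken from the review comment. */
#define NO_PERMISSION_CHECK	0

/*
 * Toy model of the new check in lookup_bdev():
 *   granted - MAY_* bits the caller holds on the device inode
 *   admin   - models capable(CAP_SYS_ADMIN)
 *   mask    - MAY_* bits the caller of lookup_bdev() asks for
 * Returns 0 if the lookup may proceed, -EACCES otherwise.
 */
static int bdev_permission(int granted, bool admin, int mask)
{
	if (mask == NO_PERMISSION_CHECK || admin)
		return 0;
	return (granted & mask) == mask ? 0 : -EACCES;
}
```

A call site would then read lookup_bdev(path, NO_PERMISSION_CHECK)
instead of lookup_bdev(path, 0), which says at a glance that the old
behaviour is kept.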

Thanks,
Dongsu

> Thanks.
>
> Coly Li
>
>> ---
>>  drivers/md/bcache/super.c |  2 +-
>>  drivers/md/dm-table.c     |  2 +-
>>  drivers/mtd/mtdsuper.c    |  2 +-
>>  fs/block_dev.c            | 13 ++++++++++---
>>  fs/quota/quota.c          |  2 +-
>>  include/linux/fs.h        |  2 +-
>>  6 files changed, 15 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
>> index b4d28928..acc9d56c 100644
>> --- a/drivers/md/bcache/super.c
>> +++ b/drivers/md/bcache/super.c
>> @@ -1967,7 +1967,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
>>  				  sb);
>>  	if (IS_ERR(bdev)) {
>>  		if (bdev == ERR_PTR(-EBUSY)) {
>> -			bdev = lookup_bdev(strim(path));
>> +			bdev = lookup_bdev(strim(path), 0);
>>  			mutex_lock(&bch_register_lock);
>>  			if (!IS_ERR(bdev) && bch_is_open(bdev))
>>  				err = "device already registered";
>> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
>> index 88130b5d..bca5eaf4 100644
> [snip]
>
>
> --
> Coly Li


[PATCH 01/11] block_dev: Support checking inode permissions in lookup_bdev()

2017-12-22 Thread Dongsu Park
From: Seth Forshee <seth.fors...@canonical.com>

When looking up a block device by path no permission check is
done to verify that the user has access to the block device inode
at the specified path. In some cases it may be necessary to
check permissions towards the inode, such as allowing
unprivileged users to mount block devices in user namespaces.

Add an argument to lookup_bdev() to optionally perform this
permission check. A value of 0 skips the permission check and
behaves the same as before. A non-zero value specifies the mask
of access rights required towards the inode at the specified
path. The check is always skipped if the user has CAP_SYS_ADMIN.

All callers of lookup_bdev() currently pass a mask of 0, so this
patch results in no functional change. Subsequent patches will
add permission checks where appropriate.

Patch v4 is available: https://patchwork.kernel.org/patch/8943601/

Cc: dm-de...@redhat.com
Cc: linux-bca...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux-...@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: Alexander Viro <v...@zeniv.linux.org.uk>
Cc: Jan Kara <j...@suse.com>
Cc: Serge Hallyn <se...@hallyn.com>
Signed-off-by: Seth Forshee <seth.fors...@canonical.com>
Signed-off-by: Dongsu Park <don...@kinvolk.io>
---
 drivers/md/bcache/super.c |  2 +-
 drivers/md/dm-table.c     |  2 +-
 drivers/mtd/mtdsuper.c    |  2 +-
 fs/block_dev.c            | 13 ++++++++++---
 fs/quota/quota.c          |  2 +-
 include/linux/fs.h        |  2 +-
 6 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index b4d28928..acc9d56c 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1967,7 +1967,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 				  sb);
 	if (IS_ERR(bdev)) {
 		if (bdev == ERR_PTR(-EBUSY)) {
-			bdev = lookup_bdev(strim(path));
+			bdev = lookup_bdev(strim(path), 0);
 			mutex_lock(&bch_register_lock);
 			if (!IS_ERR(bdev) && bch_is_open(bdev))
 				err = "device already registered";
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 88130b5d..bca5eaf4 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -410,7 +410,7 @@ dev_t dm_get_dev_t(const char *path)
 	dev_t dev;
 	struct block_device *bdev;
 
-	bdev = lookup_bdev(path);
+	bdev = lookup_bdev(path, 0);
 	if (IS_ERR(bdev))
 		dev = name_to_dev_t(path);
 	else {
diff --git a/drivers/mtd/mtdsuper.c b/drivers/mtd/mtdsuper.c
index e43fea89..4a4d40c0 100644
--- a/drivers/mtd/mtdsuper.c
+++ b/drivers/mtd/mtdsuper.c
@@ -180,7 +180,7 @@ struct dentry *mount_mtd(struct file_system_type *fs_type, int flags,
 	/* try the old way - the hack where we allowed users to mount
 	 * /dev/mtdblock$(n) but didn't actually _use_ the blockdev
 	 */
-	bdev = lookup_bdev(dev_name);
+	bdev = lookup_bdev(dev_name, 0);
 	if (IS_ERR(bdev)) {
 		ret = PTR_ERR(bdev);
 		pr_debug("MTDSB: lookup_bdev() returned %d\n", ret);
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 4a181fcb..5ca06095 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1662,7 +1662,7 @@ struct block_device *blkdev_get_by_path(const char *path, fmode_t mode,
 	struct block_device *bdev;
 	int err;
 
-	bdev = lookup_bdev(path);
+	bdev = lookup_bdev(path, 0);
 	if (IS_ERR(bdev))
 		return bdev;
 
@@ -2052,12 +2052,14 @@ EXPORT_SYMBOL(ioctl_by_bdev);
 /**
  * lookup_bdev  - lookup a struct block_device by name
  * @pathname:	special file representing the block device
+ * @mask:	rights to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
  *
  * Get a reference to the blockdevice at @pathname in the current
  * namespace if possible and return it.  Return ERR_PTR(error)
- * otherwise.
+ * otherwise.  If @mask is non-zero, check for access rights to the
+ * inode at @pathname.
  */
-struct block_device *lookup_bdev(const char *pathname)
+struct block_device *lookup_bdev(const char *pathname, int mask)
 {
 	struct block_device *bdev;
 	struct inode *inode;
@@ -2072,6 +2074,11 @@ struct block_device *lookup_bdev(const char *pathname)
 		return ERR_PTR(error);
 
 	inode = d_backing_inode(path.dentry);
+	if (mask != 0 && !capable(CAP_SYS_ADMIN)) {
+		error = __inode_permission(inode, mask);
+		if (error)
+			goto fail;
+	}
 	error = -ENOTBLK;
 	if (!S_ISBLK(inode->i_mode))
 		goto fail;
diff --git a/fs/quota/quota.c b/fs/quota/quota.c
index 43612e2a..e5d47955 100644
--- a/fs/quota/quota.c
+++ b/fs/quota/quota.c
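@@ -807,7 +807,7 @@ static struct super_block *quotactl_block(const char __user *special, int cmd)
 
 	if (IS_ERR(tmp))
 		return ERR_CAST(tmp);
-	bdev = lookup_bdev(tmp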

[PATCH 02/11] mtd: Check permissions towards mtd block device inode when mounting

2017-12-22 Thread Dongsu Park
From: Seth Forshee <seth.fors...@canonical.com>

Unprivileged users should not be able to mount mtd block devices
when they lack sufficient privileges towards the block device
inode.  Update mount_mtd() to validate that the user has the
required access to the inode at the specified path. The check
will be skipped for CAP_SYS_ADMIN, so privileged mounts will
continue working as before.

Patch v3 is available: https://patchwork.kernel.org/patch/7640011/

Cc: linux-...@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Seth Forshee <seth.fors...@canonical.com>
Signed-off-by: Dongsu Park <don...@kinvolk.io>
---
 drivers/mtd/mtdsuper.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/mtd/mtdsuper.c b/drivers/mtd/mtdsuper.c
index 4a4d40c0..3c8734f3 100644
--- a/drivers/mtd/mtdsuper.c
+++ b/drivers/mtd/mtdsuper.c
@@ -129,6 +129,7 @@ struct dentry *mount_mtd(struct file_system_type *fs_type, int flags,
 #ifdef CONFIG_BLOCK
 	struct block_device *bdev;
 	int ret, major;
+	int perm;
 #endif
 	int mtdnr;
 
@@ -180,7 +181,10 @@ struct dentry *mount_mtd(struct file_system_type *fs_type, int flags,
 	/* try the old way - the hack where we allowed users to mount
 	 * /dev/mtdblock$(n) but didn't actually _use_ the blockdev
 	 */
-	bdev = lookup_bdev(dev_name, 0);
+	perm = MAY_READ;
+	if (!(flags & MS_RDONLY))
+		perm |= MAY_WRITE;
+	bdev = lookup_bdev(dev_name, perm);
 	if (IS_ERR(bdev)) {
 		ret = PTR_ERR(bdev);
 		pr_debug("MTDSB: lookup_bdev() returned %d\n", ret);
-- 
2.13.6



[PATCH 11/11] evm: Don't update hmacs in user ns mounts

2017-12-22 Thread Dongsu Park
From: Seth Forshee <seth.fors...@canonical.com>

The kernel should not calculate new hmacs for mounts done by
non-root users. Update evm_calc_hmac_or_hash() to refuse to
calculate new hmacs for mounts from non-init user namespaces.

Cc: linux-integr...@vger.kernel.org
Cc: linux-security-mod...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: James Morris <james.l.mor...@oracle.com>
Cc: Mimi Zohar <zo...@linux.vnet.ibm.com>
Cc: "Serge E. Hallyn" <se...@hallyn.com>
Signed-off-by: Seth Forshee <seth.fors...@canonical.com>
Signed-off-by: Dongsu Park <don...@kinvolk.io>
---
 security/integrity/evm/evm_crypto.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/security/integrity/evm/evm_crypto.c b/security/integrity/evm/evm_crypto.c
index bcd64baf..729f4545 100644
--- a/security/integrity/evm/evm_crypto.c
+++ b/security/integrity/evm/evm_crypto.c
@@ -190,7 +190,8 @@ static int evm_calc_hmac_or_hash(struct dentry *dentry,
 	int error;
 	int size;
 
-	if (!(inode->i_opflags & IOP_XATTR))
+	if (!(inode->i_opflags & IOP_XATTR) ||
+	    inode->i_sb->s_user_ns != &init_user_ns)
 		return -EOPNOTSUPP;
 
 	desc = init_desc(type);
-- 
2.13.6



[PATCH 10/11] fuse: Allow user namespace mounts

2017-12-22 Thread Dongsu Park
From: Seth Forshee <seth.fors...@canonical.com>

To be able to mount fuse from non-init user namespaces, it is necessary
to set the FS_USERNS_MOUNT flag in fs_flags.

Patch v4 is available: https://patchwork.kernel.org/patch/8944681/

Cc: linux-fsde...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Miklos Szeredi <mszer...@redhat.com>
Signed-off-by: Seth Forshee <seth.fors...@canonical.com>
[dongsu: add a simple commit message]
Signed-off-by: Dongsu Park <don...@kinvolk.io>
---
 fs/fuse/inode.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 7f6b2e55..8c98edee 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1212,7 +1212,7 @@ static void fuse_kill_sb_anon(struct super_block *sb)
 static struct file_system_type fuse_fs_type = {
 	.owner		= THIS_MODULE,
 	.name		= "fuse",
-	.fs_flags	= FS_HAS_SUBTYPE,
+	.fs_flags	= FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
 	.mount		= fuse_mount,
 	.kill_sb	= fuse_kill_sb_anon,
 };
@@ -1244,7 +1244,7 @@ static struct file_system_type fuseblk_fs_type = {
 	.name		= "fuseblk",
 	.mount		= fuse_mount_blk,
 	.kill_sb	= fuse_kill_sb_blk,
-	.fs_flags	= FS_REQUIRES_DEV | FS_HAS_SUBTYPE,
+	.fs_flags	= FS_REQUIRES_DEV | FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
 };
 MODULE_ALIAS_FS("fuseblk");
 
-- 
2.13.6



[PATCH 09/11] fuse: Restrict allow_other to the superblock's namespace or a descendant

2017-12-22 Thread Dongsu Park
From: Seth Forshee <seth.fors...@canonical.com>

Unprivileged users are normally restricted from mounting with the
allow_other option by system policy, but this could be bypassed
for a mount done with user namespace root permissions. In such
cases allow_other should not allow users outside the userns
to access the mount as doing so would give the unprivileged user
the ability to manipulate processes it would otherwise be unable
to manipulate. Restrict allow_other to apply to users in the same
userns used at mount or a descendant of that namespace. Also
export current_in_userns() for use by fuse when built as a
module.

Patch v4 is available: https://patchwork.kernel.org/patch/8944671/

Cc: linux-fsde...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: "Eric W. Biederman" <ebied...@xmission.com>
Cc: Serge Hallyn <se...@hallyn.com>
Cc: Miklos Szeredi <mszer...@redhat.com>
Signed-off-by: Seth Forshee <seth.fors...@canonical.com>
Signed-off-by: Dongsu Park <don...@kinvolk.io>
---
 fs/fuse/dir.c           | 2 +-
 kernel/user_namespace.c | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index ad1cfac1..d41559a0 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1030,7 +1030,7 @@ int fuse_allow_current_process(struct fuse_conn *fc)
 	const struct cred *cred;
 
 	if (fc->allow_other)
-		return 1;
+		return current_in_userns(fc->user_ns);
 
 	cred = current_cred();
 	if (uid_eq(cred->euid, fc->user_id) &&
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 246d4d4c..492c255e 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -1235,6 +1235,7 @@ bool current_in_userns(const struct user_namespace *target_ns)
 {
 	return in_userns(target_ns, current_user_ns());
 }
+EXPORT_SYMBOL(current_in_userns);
 
 static inline struct user_namespace *to_user_ns(struct ns_common *ns)
 {
-- 
2.13.6
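
The "same userns or a descendant" rule from the patch above can be
modelled in userspace roughly as follows. This is only a sketch under
stated assumptions: the struct is a toy stand-in, model_in_userns()
follows the shape of the kernel's in_userns() as far as I know it, and
the same_owner flag replaces the uid/gid comparisons that
fuse_allow_current_process() performs when allow_other is not set.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the kernel's struct user_namespace. */
struct user_namespace {
	const struct user_namespace *parent;	/* NULL for the initial namespace */
	int level;				/* nesting depth, 0 for the initial ns */
};

/*
 * Climb from child towards the root until it is no deeper than ancestor,
 * then compare.  True if child is ancestor or one of its descendants.
 */
static bool model_in_userns(const struct user_namespace *ancestor,
			    const struct user_namespace *child)
{
	for (; child->level > ancestor->level; child = child->parent)
		;
	return child == ancestor;
}

/*
 * Model of the patched access decision:
 *   allow_other - the mount option
 *   mount_ns    - namespace recorded for the fuse connection
 *   caller_ns   - namespace of the process touching the mount
 *   same_owner  - assumed result of the existing uid/gid comparison
 */
static bool model_fuse_allow(bool allow_other,
			     const struct user_namespace *mount_ns,
			     const struct user_namespace *caller_ns,
			     bool same_owner)
{
	if (allow_other)
		return model_in_userns(mount_ns, caller_ns);
	return same_owner;
}

/* Example fixtures: init -> mount_ns -> nested_ns, plus a sibling. */
static const struct user_namespace root_ns = { NULL, 0 };
static const struct user_namespace mount_ns = { &root_ns, 1 };
static const struct user_namespace nested_ns = { &mount_ns, 2 };
static const struct user_namespace sibling_ns = { &root_ns, 1 };
```

With these fixtures, a process in mount_ns or nested_ns is allowed, while
a process in root_ns or sibling_ns is refused even though allow_other is
set.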



[PATCH 07/11] fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems

2017-12-22 Thread Dongsu Park
From: Seth Forshee <seth.fors...@canonical.com>

The user in control of a super block should be allowed to freeze
and thaw it. Relax the restrictions on the FIFREEZE and FITHAW
ioctls to require CAP_SYS_ADMIN in s_user_ns.

Cc: linux-fsde...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Alexander Viro <v...@zeniv.linux.org.uk>
Signed-off-by: Seth Forshee <seth.fors...@canonical.com>
Signed-off-by: Dongsu Park <don...@kinvolk.io>
---
 fs/ioctl.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ioctl.c b/fs/ioctl.c
index 5ace7efb..8c628a8d 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -549,7 +549,7 @@ static int ioctl_fsfreeze(struct file *filp)
 {
 	struct super_block *sb = file_inode(filp)->i_sb;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
 		return -EPERM;
 
 	/* If filesystem doesn't support freeze feature, return. */
@@ -566,7 +566,7 @@ static int ioctl_fsthaw(struct file *filp)
 {
 	struct super_block *sb = file_inode(filp)->i_sb;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
 		return -EPERM;
 
 	/* Thaw */
-- 
2.13.6



[PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns

2017-12-22 Thread Dongsu Park
From: Seth Forshee <seth.fors...@canonical.com>

In order to support mounts from namespaces other than
init_user_ns, fuse must translate uids and gids to/from the
userns of the process servicing requests on /dev/fuse. This
patch does that, with a couple of restrictions on the namespace:

 - The userns for the fuse connection is fixed to the namespace
   from which /dev/fuse is opened.

 - The namespace must be the same as s_user_ns.

These restrictions simplify the implementation by avoiding the
need to pass around userns references and by allowing fuse to
rely on the checks in inode_change_ok for ownership changes.
Either restriction could be relaxed in the future if needed.

For cuse the namespace used for the connection is also simply
current_user_ns() at the time /dev/cuse is opened.
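
The id translation described above can be pictured with a small userspace
model. It is a sketch only: struct uid_map, the model_* helpers and the
example range are made up for illustration, and a real namespace can hold
several mapping extents rather than the single one assumed here.

```c
#include <errno.h>
#include <stdint.h>

#define INVALID_ID ((uint32_t)-1)	/* plays the role of (uid_t)-1 */

/* One mapping extent: ns ids [ns_first, ns_first + count) map to
 * kernel ids [kernel_first, kernel_first + count). */
struct uid_map {
	uint32_t ns_first;
	uint32_t kernel_first;
	uint32_t count;
};

/* Kernel id -> id as seen inside the namespace (like from_kuid()). */
static uint32_t model_from_kuid(const struct uid_map *m, uint32_t kuid)
{
	if (kuid >= m->kernel_first && kuid - m->kernel_first < m->count)
		return m->ns_first + (kuid - m->kernel_first);
	return INVALID_ID;
}

/* Namespace id -> kernel id (like make_kuid()). */
static uint32_t model_make_kuid(const struct uid_map *m, uint32_t uid)
{
	if (uid >= m->ns_first && uid - m->ns_first < m->count)
		return m->kernel_first + (uid - m->ns_first);
	return INVALID_ID;
}

/*
 * Model of filling in a request header: a caller whose fsuid has no
 * mapping in the connection's namespace is refused with -EOVERFLOW,
 * otherwise the translated id is returned.
 */
static int64_t model_request_uid(const struct uid_map *m, uint32_t fsuid)
{
	uint32_t uid = model_from_kuid(m, fsuid);

	if (uid == INVALID_ID)
		return -EOVERFLOW;
	return (int64_t)uid;
}

/* Example: ns ids 0..65535 backed by kernel ids 100000..165535. */
static const struct uid_map example_map = { 0, 100000, 65536 };
```

In this example a caller with kernel uid 100005 appears to the fuse
server as uid 5, while a caller with kernel uid 1000 has no mapping and
the request is rejected.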

Patch v4 is available: https://patchwork.kernel.org/patch/8944661/

Cc: linux-fsde...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Miklos Szeredi <mszer...@redhat.com>
Signed-off-by: Seth Forshee <seth.fors...@canonical.com>
Signed-off-by: Dongsu Park <don...@kinvolk.io>
---
 fs/fuse/cuse.c   |  3 ++-
 fs/fuse/dev.c    | 11 ++++++++---
 fs/fuse/dir.c    | 14 +++++++-------
 fs/fuse/fuse_i.h |  6 +++++-
 fs/fuse/inode.c  | 31 +++++++++++++++++++------------
 5 files changed, 41 insertions(+), 24 deletions(-)

diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index e9e97803..b1b83259 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -48,6 +48,7 @@
 #include <linux/stat.h>
 #include <linux/module.h>
 #include <linux/uio.h>
+#include <linux/user_namespace.h>
 
 #include "fuse_i.h"
 
@@ -498,7 +499,7 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
if (!cc)
return -ENOMEM;
 
-   fuse_conn_init(&cc->fc);
+   fuse_conn_init(&cc->fc, current_user_ns());
 
fud = fuse_dev_alloc(&cc->fc);
if (!fud) {
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 17f0d05b..0f780e16 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req)
 
 static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
 {
-   req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
-   req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
+   req->in.h.uid = from_kuid(fc->user_ns, current_fsuid());
+   req->in.h.gid = from_kgid(fc->user_ns, current_fsgid());
req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
 }
 
@@ -167,6 +167,10 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
__set_bit(FR_WAITING, &req->flags);
if (for_background)
__set_bit(FR_BACKGROUND, &req->flags);
+   if (req->in.h.uid == (uid_t)-1 || req->in.h.gid == (gid_t)-1) {
+   fuse_put_request(fc, req);
+   return ERR_PTR(-EOVERFLOW);
+   }
 
return req;
 
@@ -1260,7 +1264,8 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
in = &req->in;
reqsize = in->h.len;
 
-   if (task_active_pid_ns(current) != fc->pid_ns) {
+   if (task_active_pid_ns(current) != fc->pid_ns ||
+   current_user_ns() != fc->user_ns) {
rcu_read_lock();
in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns));
rcu_read_unlock();
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 24967382..ad1cfac1 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -858,8 +858,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
stat->ino = attr->ino;
stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
stat->nlink = attr->nlink;
-   stat->uid = make_kuid(&init_user_ns, attr->uid);
-   stat->gid = make_kgid(&init_user_ns, attr->gid);
+   stat->uid = make_kuid(fc->user_ns, attr->uid);
+   stat->gid = make_kgid(fc->user_ns, attr->gid);
stat->rdev = inode->i_rdev;
stat->atime.tv_sec = attr->atime;
stat->atime.tv_nsec = attr->atimensec;
@@ -1475,17 +1475,17 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime)
return true;
 }
 
-static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg,
-  bool trust_local_cmtime)
+static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr,
+  struct fuse_setattr_in *arg, bool trust_local_cmtime)
 {
unsigned ivalid = iattr->ia_valid;
 
if (ivalid & ATTR_MODE)
arg->valid |= FATTR_MODE,   arg->mode = iattr->ia_mode;
if (ivalid & ATTR_UID)
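-   arg->valid |= FATTR_UID, arg->uid = from_kuid(&init_user_ns, iattr->ia_uid);
+   arg->valid |= FATTR_UID, arg->uid = from_kuid(fc->user_ns, iattr->ia_uid);
if (ivalid & ATTR_GID)
-   arg->valid |= FATTR_GID, arg->gid = from_kgid(&init_user_ns, iattr->ia_gid);
+   arg->valid |= F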

[PATCH 06/11] capabilities: Allow privileged user in s_user_ns to set security.* xattrs

2017-12-22 Thread Dongsu Park
From: Seth Forshee <seth.fors...@canonical.com>

A privileged user in s_user_ns will generally have the ability to
manipulate the backing store and insert security.* xattrs into
the filesystem directly. Therefore the kernel must be prepared to
handle these xattrs from unprivileged mounts, and it makes little
sense for commoncap to prevent writing these xattrs to the
filesystem. The capability and LSM code have already been updated
to appropriately handle xattrs from unprivileged mounts, so it
is safe to loosen this restriction on setting xattrs.

The exception to this logic is that writing xattrs to a mounted
filesystem may also cause the LSM inode_post_setxattr or
inode_setsecurity callbacks to be invoked. SELinux will deny the
xattr update by virtue of applying mountpoint labeling to
unprivileged userns mounts, and Smack will deny the writes for
any user without global CAP_MAC_ADMIN, so loosening the
capability check in commoncap is safe in this respect as well.
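
As a rough illustration of the decision described above, the following
userspace sketch models the order of the checks in cap_inode_setxattr()
as shown in the diff below. It is not the kernel function: the name
model_cap_inode_setxattr() and the has_ns_admin flag (standing in for
ns_capable(dentry->d_sb->s_user_ns, CAP_SYS_ADMIN)) are assumptions made
for the example.

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Same string values as XATTR_SECURITY_PREFIX and XATTR_NAME_CAPS. */
#define MODEL_SECURITY_PREFIX "security."
#define MODEL_NAME_CAPS       "security.capability"

/*
 * has_ns_admin stands in for the result of
 * ns_capable(dentry->d_sb->s_user_ns, CAP_SYS_ADMIN).
 */
static int model_cap_inode_setxattr(const char *name, bool has_ns_admin)
{
	/* Ignore non-security xattrs */
	if (strncmp(name, MODEL_SECURITY_PREFIX,
		    sizeof(MODEL_SECURITY_PREFIX) - 1) != 0)
		return 0;

	/* security.capability is not gated here (see the diff context). */
	if (strcmp(name, MODEL_NAME_CAPS) == 0)
		return 0;

	/* Any other security.* xattr needs CAP_SYS_ADMIN in s_user_ns. */
	if (!has_ns_admin)
		return -EPERM;
	return 0;
}
```

In this model, root of the user namespace that owns the superblock may
set, for example, security.selinux as far as commoncap is concerned; as
the message notes, the LSM hooks can still deny the write afterwards.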

Patch v4 is available: https://patchwork.kernel.org/patch/8944641/

Cc: linux-security-mod...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: James Morris <james.l.mor...@oracle.com>
Cc: Serge Hallyn <se...@hallyn.com>
Signed-off-by: Seth Forshee <seth.fors...@canonical.com>
Signed-off-by: Dongsu Park <don...@kinvolk.io>
---
 security/commoncap.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/security/commoncap.c b/security/commoncap.c
index 4f8e0934..dd0afef9 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -920,6 +920,8 @@ int cap_bprm_set_creds(struct linux_binprm *bprm)
 int cap_inode_setxattr(struct dentry *dentry, const char *name,
   const void *value, size_t size, int flags)
 {
+   struct user_namespace *user_ns = dentry->d_sb->s_user_ns;
+
/* Ignore non-security xattrs */
if (strncmp(name, XATTR_SECURITY_PREFIX,
sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
@@ -932,7 +934,7 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
if (strcmp(name, XATTR_NAME_CAPS) == 0)
return 0;
 
-   if (!capable(CAP_SYS_ADMIN))
+   if (!ns_capable(user_ns, CAP_SYS_ADMIN))
return -EPERM;
return 0;
 }
@@ -950,6 +952,8 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
  */
 int cap_inode_removexattr(struct dentry *dentry, const char *name)
 {
+   struct user_namespace *user_ns = dentry->d_sb->s_user_ns;
+
/* Ignore non-security xattrs */
if (strncmp(name, XATTR_SECURITY_PREFIX,
sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
@@ -965,7 +969,7 @@ int cap_inode_removexattr(struct dentry *dentry, const char *name)
return 0;
}
 
-   if (!capable(CAP_SYS_ADMIN))
+   if (!ns_capable(user_ns, CAP_SYS_ADMIN))
return -EPERM;
return 0;
 }
-- 
2.13.6



[PATCH 05/11] fs: Allow superblock owner to access do_remount_sb()

2017-12-22 Thread Dongsu Park
From: Seth Forshee <seth.fors...@canonical.com>

Superblock level remounts are currently restricted to global
CAP_SYS_ADMIN, as is the path for changing the root mount to
read only on umount. Loosen both of these permission checks to
also allow CAP_SYS_ADMIN in any namespace which is privileged
towards the userns which originally mounted the filesystem.
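
The phrase "privileged towards the userns" can be sketched with a small
userspace model of how ns_capable() walks the namespace hierarchy. This
is a simplified illustration of my understanding of the capability
check, not the kernel implementation; struct model_userns, struct
model_cred and model_ns_capable_admin() are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a user namespace. */
struct model_userns {
	const struct model_userns *parent; /* NULL for the initial namespace */
	unsigned int owner_uid;            /* uid that created the namespace */
	int level;                         /* 0 for the initial namespace */
};

/* Hypothetical, simplified stand-in for a task's credentials. */
struct model_cred {
	const struct model_userns *user_ns;
	unsigned int euid;
	bool cap_sys_admin; /* CAP_SYS_ADMIN held in cred->user_ns */
};

/* Does cred have CAP_SYS_ADMIN towards the namespace "target"? */
static bool model_ns_capable_admin(const struct model_cred *cred,
				   const struct model_userns *target)
{
	const struct model_userns *ns = target;

	for (;;) {
		/* Same namespace: the task's own capability set decides. */
		if (ns == cred->user_ns)
			return cred->cap_sys_admin;
		/* Target is not below the task's namespace: no privilege. */
		if (ns->level <= cred->user_ns->level)
			return false;
		/* The creator of a child namespace is privileged over it. */
		if (ns->parent == cred->user_ns && ns->owner_uid == cred->euid)
			return true;
		/* Otherwise look one level up. */
		ns = ns->parent;
	}
}
```

Applied to this patch: for a superblock whose s_user_ns is a child
namespace, both root in that child and a sufficiently privileged task in
an ancestor namespace pass the check, while root in the child does not
pass it for a superblock owned by the initial namespace.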

Patch v4 is available: https://patchwork.kernel.org/patch/8944631/

Cc: linux-fsde...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Alexander Viro <v...@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebied...@xmission.com>
Cc: Serge Hallyn <se...@hallyn.com>
Signed-off-by: Seth Forshee <seth.fors...@canonical.com>
Signed-off-by: Dongsu Park <don...@kinvolk.io>
---
 fs/namespace.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index e158ec6b..830040d7 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1589,7 +1589,7 @@ static int do_umount(struct mount *mnt, int flags)
 * Special case for "unmounting" root ...
 * we just try to remount it readonly.
 */
-   if (!capable(CAP_SYS_ADMIN))
+   if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
return -EPERM;
down_write(&sb->s_umount);
if (!sb_rdonly(sb))
@@ -2327,7 +2327,7 @@ static int do_remount(struct path *path, int ms_flags, int sb_flags,
down_write(&sb->s_umount);
if (ms_flags & MS_BIND)
err = change_mount_flags(path->mnt, ms_flags);
-   else if (!capable(CAP_SYS_ADMIN))
+   else if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
err = -EPERM;
else
err = do_remount_sb(sb, sb_flags, data, 0);
-- 
2.13.6



[PATCH 03/11] fs: Allow superblock owner to change ownership of inodes

2017-12-22 Thread Dongsu Park
From: Eric W. Biederman <ebied...@xmission.com>

Allow users with CAP_CHOWN over the superblock of a filesystem to
chown files.  Ordinarily the capable_wrt_inode_uidgid check is
sufficient to allow access to files, but when the underlying filesystem
has uids or gids that don't map to the current user namespace it is
not enough, so the chown permission checks need to be extended to
allow this case.

Calling chown on filesystem nodes whose uid or gid doesn't map is
necessary if those nodes are going to be modified, as writing back
inodes which contain uids or gids that don't map is likely to cause
filesystem corruption of the uid or gid fields.

Once chown has been called, the existing capable_wrt_inode_uidgid
checks are sufficient to allow the owner of a superblock to do anything
the global root user can do with an appropriate set of capabilities.
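
The difference the extra check makes can be sketched in userspace as
follows. This only models the boolean logic of chown_ok() from the diff
below; struct model_chown_ctx and both function names are hypothetical,
and the flags stand in for the results of the real kernel helpers. The
model assumes capable_wrt_inode_uidgid() requires both the capability in
the current namespace and a mapping for the inode's ids.

```c
#include <stdbool.h>

/* Hypothetical inputs; each flag stands in for a kernel helper result. */
struct model_chown_ctx {
	unsigned int fsuid;           /* caller's fsuid */
	unsigned int inode_uid;       /* current owner of the inode */
	bool cap_chown_in_current_ns; /* CAP_CHOWN in the caller's userns */
	bool inode_ids_mapped;        /* inode uid/gid map into that userns */
	bool cap_chown_in_sb_userns;  /* ns_capable(s_user_ns, CAP_CHOWN) */
};

/* Assumed model of capable_wrt_inode_uidgid(inode, CAP_CHOWN). */
static bool model_capable_wrt_inode(const struct model_chown_ctx *c)
{
	return c->cap_chown_in_current_ns && c->inode_ids_mapped;
}

/* Logic before the patch: owner no-op change, or capable_wrt_inode. */
static bool model_chown_ok_before(const struct model_chown_ctx *c,
				  unsigned int new_uid)
{
	if (c->fsuid == c->inode_uid && new_uid == c->inode_uid)
		return true;
	return model_capable_wrt_inode(c);
}

/* Logic after the patch: additionally accept the superblock owner. */
static bool model_chown_ok_after(const struct model_chown_ctx *c,
				 unsigned int new_uid)
{
	if (model_chown_ok_before(c, new_uid))
		return true;
	return c->cap_chown_in_sb_userns;
}
```

For an inode whose owner does not map into the caller's namespace, the
"before" model refuses the chown even for root of the mounting
namespace, whereas the "after" model allows it.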

For the proc filesystem this relaxation of permissions is not safe, as
some files are owned by users (particularly GLOBAL_ROOT_UID) outside
of the control of the mounter of proc, and granting chown access to
those files would be unsafe.  So update setattr on proc to disallow
changing files whose uids or gids are outside of proc's s_user_ns.

The original version of this patch was written by Seth Forshee.  I
have rewritten and rethought this patch enough that it's really not the
same thing (certainly it needs a different description), but he
deserves credit for getting out there and getting the conversation
started, finding the potential gotchas, and putting up with my
semi-paranoid feedback.

Patch v4 is available: https://patchwork.kernel.org/patch/8944611/

Cc: linux-fsde...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Alexander Viro <v...@zeniv.linux.org.uk>
Cc: "Luis R. Rodriguez" <mcg...@kernel.org>
Cc: Kees Cook <keesc...@chromium.org>
Inspired-by: Seth Forshee <seth.fors...@canonical.com>
Signed-off-by: Eric W. Biederman <ebied...@xmission.com>
[saf: Resolve conflicts caused by s/inode_change_ok/setattr_prepare/]
Signed-off-by: Dongsu Park <don...@kinvolk.io>
---
 fs/attr.c | 34 ++
 fs/proc/base.c|  7 +++
 fs/proc/generic.c |  7 +++
 fs/proc/proc_sysctl.c |  7 +++
 4 files changed, 47 insertions(+), 8 deletions(-)

diff --git a/fs/attr.c b/fs/attr.c
index 12ffdb6f..bf8e94f3 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -18,6 +18,30 @@
 #include 
 #include 
 
+static bool chown_ok(const struct inode *inode, kuid_t uid)
+{
+   if (uid_eq(current_fsuid(), inode->i_uid) &&
+   uid_eq(uid, inode->i_uid))
+   return true;
+   if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
+   return true;
+   if (ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
+   return true;
+   return false;
+}
+
+static bool chgrp_ok(const struct inode *inode, kgid_t gid)
+{
+   if (uid_eq(current_fsuid(), inode->i_uid) &&
+   (in_group_p(gid) || gid_eq(gid, inode->i_gid)))
+   return true;
+   if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
+   return true;
+   if (ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
+   return true;
+   return false;
+}
+
 /**
  * setattr_prepare - check if attribute changes to a dentry are allowed
  * @dentry:dentry to check
@@ -52,17 +76,11 @@ int setattr_prepare(struct dentry *dentry, struct iattr *attr)
goto kill_priv;
 
/* Make sure a caller can chown. */
-   if ((ia_valid & ATTR_UID) &&
-   (!uid_eq(current_fsuid(), inode->i_uid) ||
-!uid_eq(attr->ia_uid, inode->i_uid)) &&
-   !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
+   if ((ia_valid & ATTR_UID) && !chown_ok(inode, attr->ia_uid))
return -EPERM;
 
/* Make sure caller can chgrp. */
-   if ((ia_valid & ATTR_GID) &&
-   (!uid_eq(current_fsuid(), inode->i_uid) ||
-   (!in_group_p(attr->ia_gid) && !gid_eq(attr->ia_gid, inode->i_gid))) &&
-   !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
+   if ((ia_valid & ATTR_GID) && !chgrp_ok(inode, attr->ia_gid))
return -EPERM;
 
/* Make sure a caller can chmod. */
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 31934cb9..9d50ec92 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -665,10 +665,17 @@ int proc_setattr(struct dentry *dentry, struct iattr *attr)
 {
int error;
struct inode *inode = d_inode(dentry);
+   struct user_namespace *s_user_ns;
 
if (attr->ia_valid & ATTR_MODE)
return -EPERM;
 
+   /* Don't let anyone mess with weird proc files */
+   s_user_ns = inode->i_sb->s_user_ns;
+   if (!kuid_has_mapping(s_user_ns, inode->i_uid) ||
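+   !kgid_has_mapping(s_user_ns, inode->i_gid))
+   return -EPERM;
+
error = setattr_prepare(dentry, attr);
if (error)
return error;
diff --git a/fs/proc/generic.c b/fs/proc/gen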

[PATCH 04/11] fs: Don't remove suid for CAP_FSETID for userns root

2017-12-22 Thread Dongsu Park
From: Seth Forshee <seth.fors...@canonical.com>

Expand the check in should_remove_suid() to keep privileges for
CAP_FSETID in s_user_ns rather than init_user_ns.

Patch v4 is available: https://patchwork.kernel.org/patch/8944621/

--EWB Changed from ns_capable(sb->s_user_ns, ) to capable_wrt_inode_uidgid

Cc: linux-fsde...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Alexander Viro <v...@zeniv.linux.org.uk>
Cc: Serge Hallyn <se...@hallyn.com>
Signed-off-by: Seth Forshee <seth.fors...@canonical.com>
Signed-off-by: Dongsu Park <don...@kinvolk.io>
---
 fs/inode.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index fd401028..6459a437 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1749,7 +1749,8 @@ EXPORT_SYMBOL(touch_atime);
  */
 int should_remove_suid(struct dentry *dentry)
 {
-   umode_t mode = d_inode(dentry)->i_mode;
+   struct inode *inode = d_inode(dentry);
+   umode_t mode = inode->i_mode;
int kill = 0;
 
/* suid always must be killed */
@@ -1763,7 +1764,8 @@ int should_remove_suid(struct dentry *dentry)
if (unlikely((mode & S_ISGID) && (mode & S_IXGRP)))
kill |= ATTR_KILL_SGID;
 
-   if (unlikely(kill && !capable(CAP_FSETID) && S_ISREG(mode)))
+   if (unlikely(kill && !capable_wrt_inode_uidgid(inode, CAP_FSETID) &&
+S_ISREG(mode)))
return kill;
 
return 0;
-- 
2.13.6



[PATCH v5 00/11] FUSE mounts from non-init user namespaces

2017-12-22 Thread Dongsu Park
This patchset v5 is based on work by Seth Forshee and Eric Biederman.
The latest patchset was v4:
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1132206.html

At the moment, filesystems backed by a physical medium can only be
mounted by real root in the initial user namespace. This restriction
exists because allowing root in a non-init user namespace to mount
such a filesystem would effectively give that user control over the
underlying source of the filesystem. In the case of FUSE, the source
would mean any underlying device.

However, in many use cases such as containers, it's necessary to allow
filesystems to be mounted from non-init user namespaces. The goal of
this patchset is to allow FUSE filesystems to be mounted from non-init
user namespaces. Support for other filesystems like ext4 is not in the
scope of this patchset.

Let me describe how to test mounting from non-init user namespaces. The
tests are assumed to be done via sshfs, a userspace filesystem based on
FUSE with ssh as the backend. The test system is Fedora 27.


$ sudo dnf install -y sshfs
$ sudo mkdir -p /mnt/userns

### workaround to get past the sshfs permission checks
$ sudo chown -R $UID:$UID /etc/ssh/ssh_config.d /usr/share/crypto-policies

$ unshare -U -r -m
# sshfs root@localhost: /mnt/userns

### You can see sshfs being mounted from a non-init user namespace
# mount | grep sshfs
root@localhost: on /mnt/userns type fuse.sshfs
(rw,nosuid,nodev,relatime,user_id=0,group_id=0)

# touch /mnt/userns/test
# ls -l /mnt/userns/test
-rw-r--r-- 1 root root 0 Dec 11 19:01 /mnt/userns/test


Open another terminal and check the mountpoint from outside the namespace.


$ grep userns /proc/$(pidof sshfs)/mountinfo
131 102 0:35 / /mnt/userns rw,nosuid,nodev,relatime - fuse.sshfs
root@localhost: rw,user_id=0,group_id=0


After all tests are done, you can unmount the filesystem
inside the namespace.


# fusermount -u /mnt/userns
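
When scripting the test above, it can help to confirm that the shell
really runs in a non-init user namespace. The sketch below is a
heuristic only: it parses one line in the format of /proc/self/uid_map
and reports whether it is the full identity mapping normally seen in the
initial namespace. The function name model_is_identity_uid_map() is
hypothetical, and a privileged process can create a non-init namespace
with the same mapping, so a "true" result is not proof.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Returns true if "line" (one line of a uid_map file) maps id 0 to id 0
 * for the whole 32-bit range, as is typical for the initial namespace.
 */
static bool model_is_identity_uid_map(const char *line)
{
	unsigned long inside, outside, count;

	if (sscanf(line, "%lu %lu %lu", &inside, &outside, &count) != 3)
		return false;

	return inside == 0 && outside == 0 && count == 4294967295UL;
}
```

Inside the shell started by "unshare -U -r -m", the map is expected to
contain a single-id mapping from 0 to the invoking user's uid, so the
function would return false there.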


Changes since v4:
 * Remove other parts like ext4 to keep the patchset minimal for FUSE
 * Add and change commit messages
 * Describe how to test non-init user namespaces

TODO:
 * Think through potential security implications. There are 2 patches
   being prepared for security issues. One is "ima: define a new policy
   option named force" by Mimi Zohar, which adds an option to specify
   that the results should not be cached:
   https://marc.info/?l=linux-integrity&m=151275680115856&w=2
   The other one is to basically prevent FUSE results from being cached,
   which is still in progress.

 * Test IMA/LSMs. Details are written in
   https://github.com/kinvolk/fuse-userns-patches/blob/master/tests/TESTING_INTEGRITY.md

Patches 1-2 add a flag to lookup_bdev() so that callers can request an
additional inode permission check.

Patches 3-7 allow the superblock owner to change ownership of inodes, and
deal with additional capability checks w.r.t user namespaces.

Patches 8-10 allow FUSE filesystems to be mounted outside of the init
user namespace.

Patch 11 handles a corner case of non-root users in EVM.

The patchset is also available in our github repo:
  https://github.com/kinvolk/linux/tree/dongsu/fuse-userns-v5-1


Eric W. Biederman (1):
  fs: Allow superblock owner to change ownership of inodes

Seth Forshee (10):
  block_dev: Support checking inode permissions in lookup_bdev()
  mtd: Check permissions towards mtd block device inode when mounting
  fs: Don't remove suid for CAP_FSETID for userns root
  fs: Allow superblock owner to access do_remount_sb()
  capabilities: Allow privileged user in s_user_ns to set security.*
xattrs
  fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems
  fuse: Support fuse filesystems outside of init_user_ns
  fuse: Restrict allow_other to the superblock's namespace or a
descendant
  fuse: Allow user namespace mounts
  evm: Don't update hmacs in user ns mounts

 drivers/md/bcache/super.c           |  2 +-
 drivers/md/dm-table.c               |  2 +-
 drivers/mtd/mtdsuper.c              |  6 +-
 fs/attr.c                           | 34 ++
 fs/block_dev.c                      | 13 ++---
 fs/fuse/cuse.c                      |  3 ++-
 fs/fuse/dev.c                       | 11 ---
 fs/fuse/dir.c                       | 16 
 fs/fuse/fuse_i.h                    |  6 +-
 fs/fuse/inode.c                     | 35 +--
 fs/inode.c                          |  6 --
 fs/ioctl.c                          |  4 ++--
 fs/namespace.c                      |  4 ++--
 fs/proc/base.c                      |  7 +++
 fs/proc/generic.c                   |  7 +++
 fs/proc/proc_sysctl.c               |  7 +++
 fs/quota/quota.c                    |  2 +-
 include/linux/fs.h                  |  2 +-
 kernel/user_namespace.c             |  1 +
 security/commoncap.c                |  8 ++--
 security/integrity/evm/evm_crypto.c |  3 ++-
 21 files changed, 127 

Re: [PATCH v2] devpts: allow mounting with uid/gid of uint32_t

2015-08-29 Thread Dongsu Park
On 28.08.2015 15:33, Peter Hurley wrote:
> On 08/18/2015 11:18 AM, Dongsu Park wrote:
> > ---
> >  fs/devpts/inode.c | 20 
> >  1 file changed, 16 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
> > index c35ffdc12bba..49272fae40a7 100644
> > --- a/fs/devpts/inode.c
> > +++ b/fs/devpts/inode.c
> > @@ -188,23 +188,35 @@ static int parse_mount_options(char *data, int op, struct pts_mount_opts *opts)
> >  		token = match_token(p, tokens, args);
> >  		switch (token) {
> >  		case Opt_uid:
> > -			if (match_int(&args[0], &option))
> 
> match_int() => make_kuid/kgid is a widespread pattern in filesystems
> for handling uid/gid mount parameters.
> 
> How about adding a for-purpose string-to-uid/gid function, rather than
> open-coding?

Yeah, that sounds like a good idea.
Do you mean something like this? (on top of the -mm tree)

Thanks,
Dongsu

----

From ccfa5db398ba5ac31c5e0128e88abca1f6d1e6f5 Mon Sep 17 00:00:00 2001
Message-Id: <ccfa5db398ba5ac31c5e0128e88abca1f6d1e6f5.1440844226.git.dp...@posteo.net>
From: Dongsu Park <dp...@posteo.net>
Date: Sat, 29 Aug 2015 12:35:01 +0200
Subject: [PATCH v3] devpts: allow mounting with uid/gid of uint32_t

To allow devpts to be mounted with options of uid/gid of uint32_t,
we need to make use of general parsing API instead of match_int().
So introduce kstrto{uid,gid}(), wrappers around kstrtouint() as well
as make_k{uid,gid}() calls. And then make devpts parse options only
using kstrto{uid,gid}(). Doing that, mounting devpts with uid or
gid > (2^31 - 1) will work as expected, e.g.:

 # mount -t devpts devpts /tmp/devptsdir -o \
   newinstance,ptmxmode=0666,mode=620,uid=3598450688,gid=3598450693

It was originally reported on the GitHub issue tracker of systemd:
https://github.com/systemd/systemd/issues/956
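
The limit described above can be illustrated with a small userspace sketch.
It is not the kernel implementation: parse_id_u32() and fits_in_signed_int()
are hypothetical names, base 10 is an assumption, and the user-namespace
mapping done by make_k{uid,gid}() is left out. It only shows why an id such
as 3598450688 cannot pass through a signed int (maximum 2^31 - 1 =
2147483647) but parses fine as an unsigned 32-bit value, and that the
all-ones value is rejected as the "invalid id" sentinel.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical userspace analogue of a string-to-uid/gid helper.
 * Returns 0 on success, or a negative errno value on failure.
 */
static int parse_id_u32(const char *s, uint32_t *out)
{
	char *end;
	unsigned long long v;

	assert(out != NULL);

	/* Reject empty input, signs and leading whitespace up front. */
	if (s == NULL || *s < '0' || *s > '9')
		return -EINVAL;

	errno = 0;
	v = strtoull(s, &end, 10);
	if (errno == ERANGE || v > UINT32_MAX)
		return -ERANGE;		/* does not fit in 32 bits */
	if (*end != '\0')
		return -EINVAL;		/* trailing garbage */
	if (v == UINT32_MAX)
		return -EINVAL;		/* (uid_t)-1 means "no valid id" */

	*out = (uint32_t)v;
	return 0;
}

/* True if the id would also have survived a signed-int parser. */
static int fits_in_signed_int(uint32_t v)
{
	return v <= (uint32_t)INT_MAX;
}
```

With this sketch, parse_id_u32("3598450688", &id) succeeds while
fits_in_signed_int(id) is false, which matches the failure mode the patch
addresses.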


from v2:
 * rebase on top of -mm tree
 * split common parts for parsing uid/gid into kstrto{uid,gid}()
 * fix minor format.
 * continue to use kstrtouint() suggested by Alexey Dobriyan.

from v1: fix patch format correctly

Cc: Alexey Dobriyan <adobri...@gmail.com>
Reported-by: Alban Crequy <al...@endocode.com>
Suggested-by: Peter Hurley <pe...@hurleysoftware.com>
Signed-off-by: Dongsu Park <dp...@posteo.net>
---
 fs/devpts/inode.c | 19 +--
 include/linux/parse-integer.h |  4 
 lib/kstrtox.c | 40 
 3 files changed, 53 insertions(+), 10 deletions(-)

diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
index c35ffdc12bba..fbbd71005dcb 100644
--- a/fs/devpts/inode.c
+++ b/fs/devpts/inode.c
@@ -181,6 +181,7 @@ static int parse_mount_options(char *data, int op, struct pts_mount_opts *opts)
 		substring_t args[MAX_OPT_ARGS];
 		int token;
 		int option;
+		int rc;
 
 		if (!*p)
 			continue;
@@ -188,20 +189,18 @@ static int parse_mount_options(char *data, int op, struct pts_mount_opts *opts)
 		token = match_token(p, tokens, args);
 		switch (token) {
 		case Opt_uid:
-			if (match_int(&args[0], &option))
-				return -EINVAL;
-			uid = make_kuid(current_user_ns(), option);
-			if (!uid_valid(uid))
-				return -EINVAL;
+			rc = kstrtouid(args[0].from, &uid);
+			if (rc)
+				return rc;
+
 			opts->uid = uid;
 			opts->setuid = 1;
 			break;
 		case Opt_gid:
-			if (match_int(&args[0], &option))
-				return -EINVAL;
-			gid = make_kgid(current_user_ns(), option);
-			if (!gid_valid(gid))
-				return -EINVAL;
+			rc = kstrtogid(args[0].from, &gid);
+			if (rc)
+				return rc;
+
 			opts->gid = gid;
 			opts->setgid = 1;
 			break;
diff --git a/include/linux/parse-integer.h b/include/linux/parse-integer.h
index ba620cdf3df6..2cdc4f418e00 100644
--- a/include/linux/parse-integer.h
+++ b/include/linux/parse-integer.h
@@ -2,6 +2,7 @@
 #define _PARSE_INTEGER_H
 #include <linux/compiler.h>
 #include <linux/types.h>
+#include <linux/uidgid.h>
 
 /*
  * int parse_integer(const char *s, unsigned int base, T *val);
@@ -155,6 +156,9 @@ static inline int __must_check kstrtos8(const char *s, unsigned int base, s8 *re
 	return parse_integer(s, base | PARSE_INTEGER_NEWLINE, res);
 }
 
+int __must_check kstrtouid(const char *uidstr, kuid_t *kuid);
+int __must_check kstrtogid(const char *gidstr, kgid_t *kgid);
+
 int __must_check kstrtoull_from_user(const char __user *s, size_t count, unsigned int base, unsigned long long *res);
 int __must_check kstrtoll_from_user(const char __user *s, size_t count, unsigned in

Re: [PATCH v2] devpts: allow mounting with uid/gid of uint32_t

2015-08-19 Thread Dongsu Park
Hi,

thanks for the review.

On 18.08.2015 16:44, Andrew Morton wrote:
> On Tue, 18 Aug 2015 17:18:19 +0200 Dongsu Park <dp...@posteo.net> wrote:
> 
> > To allow devpts to be mounted with options of uid/gid of uint32_t,
> > use kstrtouint() instead of match_int(). Doing that, mounting devpts
> > with uid or gid > (2^31 - 1) will work as expected, e.g.:
> > 
> >  # mount -t devpts devpts /tmp/devptsdir -o \
> >newinstance,ptmxmode=0666,mode=620,uid=3598450688,gid=3598450693
> > 
> > It was originally by reported on systemd github issues:
> > https://github.com/systemd/systemd/issues/956
> > 
> > --- a/fs/devpts/inode.c
> > +++ b/fs/devpts/inode.c
> > @@ -188,23 +188,35 @@ static int parse_mount_options(char *data, int op, struct pts_mount_opts *opts)
> >  		token = match_token(p, tokens, args);
> >  		switch (token) {
> >  		case Opt_uid:
> > -			if (match_int(&args[0], &option))
> > +		{
> 
> It might be neater to lay this out as
> 
>   case Opt_uid: {

I'll do it.

> > +			char *uidstr = args[0].from;
> > +			uid_t uidval;
> > +			int rc = kstrtouint(uidstr, 0, &uidval);
> 
> This assumes that the architecture/config uses a uint for uid_t.  We
> have no business assuming this - it's an opaque type for a reason.  It
> would be safer to do
> 
>   unsigned long uidl;
> 
>   rc = kstrtoul(uidstr, 0, &uidl);
>   uidval = uidl;

That's a good point. I'll do it.

> > +			if (rc)
> >  				return -EINVAL;
> 
> I don't get it.  From my reading, kstrtouint->parse_integer() returns
> "number of characters parsed or -E".  So this code won't work.  But
> presumably it *does* work, so why?

It's probably because kstrtouint() returns just 0 on success.
That's what functions in the call chain of kstrtouint() -> kstrtoull() ->
_kstrtoull() -> _parse_integer() are actually doing.
_parse_integer() actually returns rv, i.e. number of characters parsed.
But after that, if there's no error, _kstrtoull() simply returns 0.

> Also, we should probably return `rc' here if it's negative, to
> propagate the error which kstrtouint() detected.  That's a minor
> non-back-compatible change but it shouldn't matter.

Okay, I also think that we should return rc. I'll do it.

> otoh, kstrtouint() likes to return -ERANGE when things go wrong. 
> ERANGE means "Math result not representable", which is a nonsensical
> error code in this context.  Sigh, why do people keep doing this.

Hmm, good to know.

Thanks,
Dongsu

> > -			uid = make_kuid(current_user_ns(), option);
> > +			uid = make_kuid(current_user_ns(), uidval);
> >  			if (!uid_valid(uid))
> >  				return -EINVAL;
> >  			opts->uid = uid;
> >  			opts->setuid = 1;
> >  			break;
> >
> > ...
> >
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2] devpts: allow mounting with uid/gid of uint32_t

2015-08-18 Thread Dongsu Park
To allow devpts to be mounted with options of uid/gid of uint32_t,
use kstrtouint() instead of match_int(). Doing that, mounting devpts
with uid or gid > (2^31 - 1) will work as expected, e.g.:

 # mount -t devpts devpts /tmp/devptsdir -o \
   newinstance,ptmxmode=0666,mode=620,uid=3598450688,gid=3598450693

It was originally reported in the systemd GitHub issue tracker:
https://github.com/systemd/systemd/issues/956

from v1: fix patch format correctly

Reported-by: Alban Crequy <al...@endocode.com>
Signed-off-by: Dongsu Park <dp...@posteo.net>
---
 fs/devpts/inode.c | 20 
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
index c35ffdc12bba..49272fae40a7 100644
--- a/fs/devpts/inode.c
+++ b/fs/devpts/inode.c
@@ -188,23 +188,35 @@ static int parse_mount_options(char *data, int op, struct pts_mount_opts *opts)
 		token = match_token(p, tokens, args);
 		switch (token) {
 		case Opt_uid:
-			if (match_int(&args[0], &option))
+		{
+			char *uidstr = args[0].from;
+			uid_t uidval;
+			int rc = kstrtouint(uidstr, 0, &uidval);
+
+			if (rc)
 				return -EINVAL;
-			uid = make_kuid(current_user_ns(), option);
+			uid = make_kuid(current_user_ns(), uidval);
 			if (!uid_valid(uid))
 				return -EINVAL;
 			opts->uid = uid;
 			opts->setuid = 1;
 			break;
+		}
 		case Opt_gid:
-			if (match_int(&args[0], &option))
+		{
+			char *gidstr = args[0].from;
+			gid_t gidval;
+			int rc = kstrtouint(gidstr, 0, &gidval);
+
+			if (rc)
 				return -EINVAL;
-			gid = make_kgid(current_user_ns(), option);
+			gid = make_kgid(current_user_ns(), gidval);
 			if (!gid_valid(gid))
 				return -EINVAL;
 			opts->gid = gid;
 			opts->setgid = 1;
 			break;
+		}
 		case Opt_mode:
 			if (match_octal(&args[0], &option))
 				return -EINVAL;
-- 
2.1.0


[PATCH] devpts: allow mounting with uid/gid of uint32_t

2015-08-18 Thread Dongsu Park
To allow devpts to be mounted with options of uid/gid of uint32_t,
use kstrtouint() instead of match_int(). Doing that, mounting devpts
with uid or gid > (2^31 - 1) will work as expected, e.g.:

 # mount -t devpts devpts /tmp/devptsdir -o \
   newinstance,ptmxmode=0666,mode=620,uid=3598450688,gid=3598450693

It was originally reported in the systemd GitHub issue tracker:
https://github.com/systemd/systemd/issues/956

Reported-by: Alban Crequy <al...@endocode.com>
Signed-off-by: Dongsu Park <dp...@posteo.net>
---
 fs/devpts/inode.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
index c35ffdc12bba..83c3e7368f38 100644
--- a/fs/devpts/inode.c
+++ b/fs/devpts/inode.c
@@ -188,23 +188,33 @@ static int parse_mount_options(char *data, int op, struct pts_mount_opts *opts)
 		token = match_token(p, tokens, args);
 		switch (token) {
 		case Opt_uid:
-			if (match_int(&args[0], &option))
+		{
+			char *uidstr = args[0].from;
+			uid_t uidval;
+			int rc = kstrtouint(uidstr, 0, &uidval);
+			if (rc)
 				return -EINVAL;
-			uid = make_kuid(current_user_ns(), option);
+			uid = make_kuid(current_user_ns(), uidval);
 			if (!uid_valid(uid))
 				return -EINVAL;
 			opts->uid = uid;
 			opts->setuid = 1;
 			break;
+}
 		case Opt_gid:
-			if (match_int(&args[0], &option))
+{
+			char *gidstr = args[0].from;
+			gid_t gidval;
+			int rc = kstrtouint(gidstr, 0, &gidval);
+			if (rc)
 				return -EINVAL;
-			gid = make_kgid(current_user_ns(), option);
+			gid = make_kgid(current_user_ns(), gidval);
 			if (!gid_valid(gid))
 				return -EINVAL;
 			opts->gid = gid;
 			opts->setgid = 1;
 			break;
+}
 		case Opt_mode:
 			if (match_octal(&args[0], &option))
 				return -EINVAL;
-- 
2.1.0


Re: panic with CPU hotplug + blk-mq + scsi-mq

2015-04-20 Thread Dongsu Park
On 21.04.2015 00:48, Ming Lei wrote:
> Thanks for providing that.
> The trick is just in the CPU number and the virtio-scsi hw queue number,
> and that is why I asked for that. :-)
> Now the problem is quite clear: before CPU1 goes offline, suppose
> CPU3 is mapped to hw queue 6; CPU3 will then map to hw queue 5
> after CPU1 is offline. Unfortunately the current code can't allocate
> tags for hw queue 5 even though it becomes mapped.
> The following updated patch (which includes the original patch 2) will fix
> the problem, and patch 1 is required too.
> So the following patch should fix your hotplug issue.
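
The remapping described in the quoted text can be reproduced with a
simplified model. This is a sketch under assumptions, not the kernel's
mapping code: map_cpus() is a hypothetical helper, sibling-thread handling
is left out, and offline CPUs are marked with -1 purely for illustration.
It spreads the online CPUs evenly over the hardware queues, using the
4 CPUs and 8 virtio-scsi queues reported in this thread.

```c
#include <assert.h>

/* Spread nr_cpus online CPUs evenly over nr_queues hardware queues. */
static unsigned int cpu_to_queue_index(unsigned int nr_cpus,
				       unsigned int nr_queues,
				       unsigned int cpu_idx)
{
	assert(nr_cpus > 0);
	return cpu_idx * nr_queues / nr_cpus;
}

/*
 * Hypothetical helper: fill map[cpu] with the hw queue index for every
 * online CPU, or -1 for an offline one. The index passed to
 * cpu_to_queue_index() counts online CPUs only, so taking one CPU
 * offline shifts the queue of every CPU numbered after it.
 */
static void map_cpus(const int *online, unsigned int nr_possible,
		     unsigned int nr_queues, int *map)
{
	unsigned int nr_online = 0, idx = 0, cpu;

	for (cpu = 0; cpu < nr_possible; cpu++)
		if (online[cpu])
			nr_online++;

	for (cpu = 0; cpu < nr_possible; cpu++) {
		if (!online[cpu]) {
			map[cpu] = -1;
			continue;
		}
		map[cpu] = (int)cpu_to_queue_index(nr_online, nr_queues, idx++);
	}
}
```

In this model, CPU3 lands on queue 3 * 8 / 4 = 6 while all four CPUs are
online, and on queue 2 * 8 / 3 = 5 once CPU1 is offline. Queue 5 had no CPU
mapped to it before, which is why it has no tags allocated at that point.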

Yes, it works indeed. Thanks a lot! :-)
You can add:

Tested-by: Dongsu Park 

As the original patch didn't apply, I had to adjust a few minor details, though.
(see below)

Cheers,
Dongsu



From 8c0edcbbdfbab67dc8ae2fd46cca6a86e0cadcba Mon Sep 17 00:00:00 2001
From: Ming Lei 
Date: Sun, 19 Apr 2015 23:32:46 +0800
Subject: [PATCH v1 2/2] blk-mq: fix CPU hotplug handling

Firstly the hctx->tags have to be set as NULL if it is to be disabled
no matter if set->tags[i] is NULL or not in blk_mq_map_swqueue() because
shared tags can be freed already from another request queue.

The same situation has to be considered in blk_mq_hctx_cpu_online() too.

Finally one unmapped hw queue can be remapped after CPU topo is changed,
we need to allocate tags for the hw queue in blk_mq_map_swqueue() too.
Then tags allocation for hw queue can be removed in hctx cpu online
notifier, and it is reasonable to do that after remapping is done.
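
The three points above can be modelled in userspace as a rough sketch. The
types and remap_tags() below are hypothetical stand-ins for struct
blk_mq_tag_set, struct blk_mq_hw_ctx and the tag handling in
blk_mq_map_swqueue(); malloc() and free() merely play the role of
blk_mq_init_rq_map() and blk_mq_free_rq_map().

```c
#include <assert.h>
#include <stdlib.h>

#define NR_HW_QUEUES 8

/* Tag maps shared by every request queue that uses this set. */
struct tag_set {
	void *tags[NR_HW_QUEUES];
};

/* Per-request-queue view of one hardware queue. */
struct hw_queue {
	unsigned int nr_ctx;	/* number of CPUs currently mapped here */
	void *tags;		/* cached copy of set->tags[i] */
};

/* Re-evaluate the tags of every hw queue after the CPU mapping changed. */
static void remap_tags(struct tag_set *set, struct hw_queue *hctx)
{
	unsigned int i;

	assert(set != NULL && hctx != NULL);

	for (i = 0; i < NR_HW_QUEUES; i++) {
		if (!hctx[i].nr_ctx) {
			/* Unmapped: release the shared map if still there. */
			free(set->tags[i]);
			set->tags[i] = NULL;
			/*
			 * Clear the cached pointer unconditionally: another
			 * request queue may already have freed the map.
			 */
			hctx[i].tags = NULL;
			continue;
		}

		/* A previously unmapped queue can become mapped again. */
		if (!set->tags[i])
			set->tags[i] = malloc(64);
		hctx[i].tags = set->tags[i];
	}
}
```

Calling remap_tags() for two hw_queue arrays that share one tag_set shows
why the pointer is cleared outside the inner check: after the first call
frees set->tags[6], the second array would otherwise keep a dangling copy.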

Cc: 
Reported-by: Dongsu Park 
Signed-off-by: Ming Lei 
---
 block/blk-mq.c | 34 +-
 1 file changed, 13 insertions(+), 21 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 078840ce8670..df4b9597e477 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1573,22 +1573,6 @@ static int blk_mq_hctx_cpu_offline(struct blk_mq_hw_ctx *hctx, int cpu)
 	return NOTIFY_OK;
 }
 
-static int blk_mq_hctx_cpu_online(struct blk_mq_hw_ctx *hctx, int cpu)
-{
-	struct request_queue *q = hctx->queue;
-	struct blk_mq_tag_set *set = q->tag_set;
-
-	if (set->tags[hctx->queue_num])
-		return NOTIFY_OK;
-
-	set->tags[hctx->queue_num] = blk_mq_init_rq_map(set, hctx->queue_num);
-	if (!set->tags[hctx->queue_num])
-		return NOTIFY_STOP;
-
-	hctx->tags = set->tags[hctx->queue_num];
-	return NOTIFY_OK;
-}
-
 static int blk_mq_hctx_notify(void *data, unsigned long action,
 			      unsigned int cpu)
 {
@@ -1596,8 +1580,11 @@ static int blk_mq_hctx_notify(void *data, unsigned long action,
 
 	if (action == CPU_DEAD || action == CPU_DEAD_FROZEN)
 		return blk_mq_hctx_cpu_offline(hctx, cpu);
-	else if (action == CPU_ONLINE || action == CPU_ONLINE_FROZEN)
-		return blk_mq_hctx_cpu_online(hctx, cpu);
+
+	/*
+	 * In case of CPU online, tags will be reallocated
+	 * after new mapping is done in blk_mq_map_swqueue().
+	 */
 
 	return NOTIFY_OK;
 }
@@ -1779,6 +1766,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 	unsigned int i;
 	struct blk_mq_hw_ctx *hctx;
 	struct blk_mq_ctx *ctx;
+	struct blk_mq_tag_set *set = q->tag_set;
 
 	queue_for_each_hw_ctx(q, hctx, i) {
 		cpumask_clear(hctx->cpumask);
@@ -1805,16 +1793,20 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 		 * disable it and free the request entries.
 		 */
 		if (!hctx->nr_ctx) {
-			struct blk_mq_tag_set *set = q->tag_set;
-
 			if (set->tags[i]) {
 				blk_mq_free_rq_map(set, set->tags[i], i);
 				set->tags[i] = NULL;
-				hctx->tags = NULL;
 			}
+			hctx->tags = NULL;
 			continue;
 		}
 
+		/* unmapped hw queue can be remapped after CPU topo changed */
+		if (!set->tags[i])
+			set->tags[i] = blk_mq_init_rq_map(set, hctx->queue_num);
+		hctx->tags = set->tags[i];
+		WARN_ON(!hctx->tags);
+
 		/*
 		 * Initialize batch roundrobin counts
 		 */
-- 
2.1.0


Re: panic with CPU hotplug + blk-mq + scsi-mq

2015-04-20 Thread Dongsu Park
On 20.04.2015 21:12, Ming Lei wrote:
> On Mon, Apr 20, 2015 at 4:07 PM, Dongsu Park
>  wrote:
> > Hi Ming,
> >
> > On 18.04.2015 00:23, Ming Lei wrote:
> >> > Does anyone have an idea?
> >>
> >> As far as I can see, at least two problems exist:
> >> - race between timeout and CPU hotplug
> >> - in case of shared tags, during CPU online handling, about setting
> >> and checking hctx->tags
> >>
> >> So could you please test the attached two patches to see if they fix your issue?
> >> I ran them in my VM, and it looks like the oops does disappear.
> >
> > Thanks for the patches.
> > But it still panics also with your patches, both v1 and v2.
> > I tested it multiple times, and hit the bug every time.
> 
> Could you share with us the exact test you are running,
> such as the number of CPUs, the number of virtio-scsi hw queues,
> whether it is multi-LUN or not, and your workload if it is specific?

It would be probably helpful to just share my Qemu command line:

/usr/bin/qemu-system-x86_64 -M pc -cpu host -enable-kvm -m 2048 \
 -smp 4,cores=1,maxcpus=4,threads=1 \
 -object memory-backend-ram,size=1024M,id=ram-node0 \
 -numa node,nodeid=0,cpus=0-1,memdev=ram-node0 \
 -object memory-backend-ram,size=1024M,id=ram-node1 \
 -numa node,nodeid=1,cpus=2-3,memdev=ram-node1 \
 -serial stdio -name vm-0fa2eb90-51f3-4b65-aa72-97cea3ead7bf \
 -uuid 0fa2eb90-51f3-4b65-aa72-97cea3ead7bf \
 -monitor telnet:0.0.0.0:9400,server,nowait \
 -rtc base=utc -boot menu=off,order=c -L /usr/share/qemu \
 -device virtio-scsi-pci,id=scsi0,num_queues=8,bus=pci.0,addr=0x7 \
 -drive file=./mydebian2.qcow2,if=none,id=drive-virtio-disk0,aio=native,cache=writeback \
 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x9,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
 -drive file=./tfile00.img,if=none,id=drive-scsi0-0-0-0,aio=native \
 -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0 \
 -drive file=./tfile01.img,if=none,id=drive-scsi0-0-0-1,aio=native \
 -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=1,drive=drive-scsi0-0-0-1,id=scsi0-0-0-1 \
 -k en-us -vga cirrus -netdev user,id=vnet0,net=192.168.122.0/24 \
 -net nic,vlan=0,model=virtio,macaddr=52:54:00:5b:d7:00 \
 -net tap,vlan=0,ifname=dntap0,vhost=on,script=no,downscript=no \
 -vnc 0.0.0.0:1 -virtfs local,path=/Dev,mount_tag=homedev,security_model=none

(where each of tfile0[01].img is a 16-GiB image)

And there's nothing special about the workload. Inside the guest, I go to
a 9pfs-mounted directory where the kernel source is available.
When I just run 'make install', the guest crashes immediately.
That's the simplest way to make it crash.

Dongsu

> I can not reproduce it in my VM.
> One interesting point is that the oops always happened
> on CPU3 in your tests, looks like the mapping is broken
> for CPU3's ctx in case of CPU 1 offline?
> 
> > Cheers,
> > Dongsu
> >
> >  [beginning of call traces] 
> > [   22.942214] smpboot: CPU 1 is now offline
> > [   30.686284] random: nonblocking pool is initialized
> > [   39.857305] fuse init (API version 7.23)
> > [   40.563853] BUG: unable to handle kernel NULL pointer dereference at 
> > 0018
> > [   40.564005] IP: [] __bt_get.isra.5+0x7d/0x1e0
> > [   40.564005] PGD 7a363067 PUD 7cadc067 PMD 0
> > [   40.564005] Oops:  [#1] SMP
> > [   40.564005] Modules linked in: fuse cpufreq_stats binfmt_misc 9p fscache 
> > dm_round_robin dm_multipath loop r
> > tc_cmos 9pnet_virtio 9pnet serio_raw acpi_cpufreq i2c_piix4 virtio_net
> > [   40.564005] CPU: 3 PID: 6349 Comm: grub-mount Not tainted 4.0.0+ #320
> > [   40.564005] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> > 1.7.5-20140709_153950- 04/01/2014
> > [   40.564005] task: 880079011560 ti: 88007a1c8000 task.ti: 
> > 88007a1c8000
> > [   40.564005] RIP: 0010:[]  [] 
> > __bt_get.isra.5+0x7d/0x1e0
> > [   40.564005] RSP: 0018:88007a1cb838  EFLAGS: 00010246
> > [   40.564005] RAX: 0075 RBX: 88007913c400 RCX: 
> > 0078
> > [   40.564005] RDX: 88007fddbb80 RSI: 0010 RDI: 
> > 88007913c400
> > [   40.564005] RBP: 88007a1cb888 R08: 88007fddbb80 R09: 
> > 0001
> > [   40.564005] R10:  R11: 0001 R12: 
> > 0010
> > [   40.564005] R13: 0010 R14: 88007a1cb988 R15: 
> > 88007fddbb80
> > [   40.564005] FS:  2b7c8b6807c0() GS:88007fc0() 
> > knlGS:
> > [   40.564005] CS:  0010 DS:  ES:  CR0: 80050033
> > [   40.564005] CR2: 0018 CR3: 

Re: panic with CPU hotplug + blk-mq + scsi-mq

2015-04-20 Thread Dongsu Park
q_flags+0x8e/0x100
> > [   47.816324]  [] scsi_test_unit_ready+0x83/0x130
> > [   47.816324]  [] sd_check_events+0x14e/0x1b0
> > [   47.816324]  [] disk_check_events+0x51/0x170
> > [   47.816324]  [] disk_events_workfn+0x1c/0x20
> > [   47.816324]  [] process_one_work+0x1e8/0x800
> > [   47.816324]  [] ? process_one_work+0x15d/0x800
> > [   47.816324]  [] ? worker_thread+0xda/0x470
> > [   47.816324]  [] worker_thread+0x53/0x470
> > [   47.816324]  [] ? process_one_work+0x800/0x800
> > [   47.816324]  [] ? process_one_work+0x800/0x800
> > [   47.816324]  [] kthread+0xf2/0x110
> > [   47.816324]  [] ? trace_hardirqs_on+0xd/0x10
> > [   47.816324]  [] ? kthread_create_on_node+0x230/0x230
> > [   47.816324]  [] ret_from_fork+0x58/0x90
> > [   47.816324]  [] ? kthread_create_on_node+0x230/0x230
> > [   47.816324] Code: 00 48 89 e5 5d 48 8b 40 88 48 c1 e8 02 83 e0 01 c3 66 
> > 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 87 20 04 00 00 55 48 89 e5 
> > <48> 8b 40 98 5d c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
> > [   47.816324] RIP  [] kthread_data+0x10/0x20
> > [   47.816324]  RSP 
> > [   47.816324] CR2: ff98
> > [   47.816324] ---[ end trace 9a650b674f0fae76 ]---
> > [   47.816324] Fixing recursive fault but reboot is needed!
> >  [end of call traces] 

> From 9aed1bd79531d91513cd16ed90872e4349425acc Mon Sep 17 00:00:00 2001
> From: Ming Lei 
> Date: Fri, 17 Apr 2015 23:50:48 -0400
> Subject: [PATCH 1/2] block: blk-mq: fix race between timeout and CPU hotplug
> 
> Firstly during CPU hotplug, even queue is freezed, timeout
> handler still may come and access hctx->tags, which may cause
> use after free, so this patch deactivates timeout handler
> inside CPU hotplug notifier.
> 
> Secondly, tags can be shared by more than one queues, so we
> have to check if the hctx has been disabled, otherwise
> still use-after-free on tags can be triggered.
> 
> Cc: 
> Reported-by: Dongsu Park 
> Signed-off-by: Ming Lei 
> ---
>  block/blk-mq.c | 13 ++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 67f01a0..58a3b4c 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -677,8 +677,11 @@ static void blk_mq_rq_timer(unsigned long priv)
>   data.next = blk_rq_timeout(round_jiffies_up(data.next));
>   mod_timer(&q->timeout, data.next);
>   } else {
> - queue_for_each_hw_ctx(q, hctx, i)
> - blk_mq_tag_idle(hctx);
> + queue_for_each_hw_ctx(q, hctx, i) {
> + /* the hctx may be disabled, so we have to check here */
> + if (hctx->tags)
> + blk_mq_tag_idle(hctx);
> + }
>   }
>  }
>  
> @@ -2085,9 +2088,13 @@ static int blk_mq_queue_reinit_notify(struct 
> notifier_block *nb,
>*/
>   list_for_each_entry(q, &all_q_list, all_q_node)
>   blk_mq_freeze_queue_start(q);
> - list_for_each_entry(q, &all_q_list, all_q_node)
> + list_for_each_entry(q, &all_q_list, all_q_node) {
>   blk_mq_freeze_queue_wait(q);
>  
> + /* deactivate timeout handler */
> + del_timer_sync(&q->timeout);
> + }
> +
>   list_for_each_entry(q, &all_q_list, all_q_node)
>   blk_mq_queue_reinit(q);
>  
> -- 
> 1.9.1
> 

> From 8b70c8612543859173230fbd16a63bacf84ba23a Mon Sep 17 00:00:00 2001
> From: Ming Lei 
> Date: Sat, 18 Apr 2015 00:01:31 -0400
> Subject: [PATCH 2/2] blk-mq: fix CPU hotplug handling
> 
> Firstly the hctx->tags have to be set as NULL if it is to be disabled
> no matter if set->tags[i] is NULL or not in blk_mq_map_swqueue() because
> shared tags can be freed already from another request_queue.
> 
> The same situation has to be considered in blk_mq_hctx_cpu_online()
> too.
> 
> Cc: 
> Reported-by: Dongsu Park 
> Signed-off-by: Ming Lei 
> ---
>  block/blk-mq.c | 17 +++--
>  1 file changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 58a3b4c..612d5c6 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1580,15 +1580,20 @@ static int blk_mq_hctx_cpu_online(struct 
> blk_mq_hw_ctx *hctx, int cpu)
>  {
>   struct request_queue *q = hctx->queue;
>   s

Re: panic with CPU hotplug + blk-mq + scsi-mq

2015-04-20 Thread Dongsu Park
] wq_worker_sleeping+0x15/0xa0
> > [   47.816324]  [816ff757] __schedule+0xa77/0x1080
> > [   47.816324]  [8107cfc6] ? do_exit+0x756/0xbf0
> > [   47.816324]  [8107cffa] ? do_exit+0x78a/0xbf0
> > [   47.816324]  [816ffd97] schedule+0x37/0x90
> > [   47.816324]  [8107d0d6] do_exit+0x866/0xbf0
> > [   47.816324]  [810ec14e] ? kmsg_dump+0xfe/0x200
> > [   47.816324]  [810068ad] oops_end+0x8d/0xd0
> > [   47.816324]  [81047849] no_context+0x119/0x370
> > [   47.816324]  [810ce795] ? cpuacct_charge+0x5/0x1c0
> > [   47.816324]  [810b4a25] ? sched_clock_local+0x25/0x90
> > [   47.816324]  [81047b25] __bad_area_nosemaphore+0x85/0x210
> > [   47.816324]  [81047cc3] bad_area_nosemaphore+0x13/0x20
> > [   47.816324]  [81047fb6] __do_page_fault+0xb6/0x490
> > [   47.816324]  [8104839c] do_page_fault+0xc/0x10
> > [   47.816324]  [817080c2] page_fault+0x22/0x30
> > [   47.816324]  [8140b31d] ? __bt_get.isra.5+0x7d/0x1e0
> > [   47.816324]  [8140b4e5] bt_get+0x65/0x1e0
> > [   47.816324]  [810c9b40] ? wait_woken+0xa0/0xa0
> > [   47.816324]  [8140ba07] blk_mq_get_tag+0xa7/0xd0
> > [   47.816324]  [8140630b] __blk_mq_alloc_request+0x1b/0x200
> > [   47.816324]  [81407f91] blk_mq_alloc_request+0xa1/0x250
> > [   47.816324]  [813fc74c] blk_get_request+0x2c/0xf0
> > [   47.816324]  [810a6acd] ? __might_sleep+0x4d/0x90
> > [   47.816324]  [815747dd] scsi_execute+0x3d/0x1f0
> > [   47.816324]  [815763be] scsi_execute_req_flags+0x8e/0x100
> > [   47.816324]  [81576a43] scsi_test_unit_ready+0x83/0x130
> > [   47.816324]  [8158672e] sd_check_events+0x14e/0x1b0
> > [   47.816324]  [8140e731] disk_check_events+0x51/0x170
> > [   47.816324]  [8140e86c] disk_events_workfn+0x1c/0x20
> > [   47.816324]  [81099128] process_one_work+0x1e8/0x800
> > [   47.816324]  [8109909d] ? process_one_work+0x15d/0x800
> > [   47.816324]  [8109981a] ? worker_thread+0xda/0x470
> > [   47.816324]  [81099793] worker_thread+0x53/0x470
> > [   47.816324]  [81099740] ? process_one_work+0x800/0x800
> > [   47.816324]  [81099740] ? process_one_work+0x800/0x800
> > [   47.816324]  [8109f652] kthread+0xf2/0x110
> > [   47.816324]  [810d3d4d] ? trace_hardirqs_on+0xd/0x10
> > [   47.816324]  [8109f560] ? kthread_create_on_node+0x230/0x230
> > [   47.816324]  [81706308] ret_from_fork+0x58/0x90
> > [   47.816324]  [8109f560] ? kthread_create_on_node+0x230/0x230
> > [   47.816324] Code: 00 48 89 e5 5d 48 8b 40 88 48 c1 e8 02 83 e0 01 c3 66 
> > 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 87 20 04 00 00 55 48 89 e5 
> > <48> 8b 40 98 5d c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
> > [   47.816324] RIP  [810a00d0] kthread_data+0x10/0x20
> > [   47.816324]  RSP 88007906f5e8
> > [   47.816324] CR2: ff98
> > [   47.816324] ---[ end trace 9a650b674f0fae76 ]---
> > [   47.816324] Fixing recursive fault but reboot is needed!
> >  [end of call traces] 

Re: panic with CPU hotplug + blk-mq + scsi-mq

2015-04-20 Thread Dongsu Park
On 20.04.2015 21:12, Ming Lei wrote:
> On Mon, Apr 20, 2015 at 4:07 PM, Dongsu Park
> dongsu.p...@profitbricks.com wrote:
> > Hi Ming,
> >
> > On 18.04.2015 00:23, Ming Lei wrote:
> >> > Does anyone have an idea?
> >>
> >> As far as I can see, at least two problems exist:
> >> - race between timeout and CPU hotplug
> >> - in case of shared tags, during CPU online handling, about setting
> >> and checking hctx->tags
> >>
> >> So could you please test the attached two patches to see if they fix your 
> >> issue?
> >> I run them in my VM, and looks opps does disappear.
> >
> > Thanks for the patches.
> > But it still panics also with your patches, both v1 and v2.
> > I tested it multiple times, and hit the bug every time.
> 
> Could you share us what the exact test you are running?
> Such as, CPU numbers, virtio-scsi hw queue number, and
> multi-lun or not, and your workload if it is specific.

It would probably be helpful to just share my QEMU command line:

/usr/bin/qemu-system-x86_64 -M pc -cpu host -enable-kvm -m 2048 \
 -smp 4,cores=1,maxcpus=4,threads=1 \
 -object memory-backend-ram,size=1024M,id=ram-node0 \
 -numa node,nodeid=0,cpus=0-1,memdev=ram-node0 \
 -object memory-backend-ram,size=1024M,id=ram-node1 \
 -numa node,nodeid=1,cpus=2-3,memdev=ram-node1 \
 -serial stdio -name vm-0fa2eb90-51f3-4b65-aa72-97cea3ead7bf \
 -uuid 0fa2eb90-51f3-4b65-aa72-97cea3ead7bf \
 -monitor telnet:0.0.0.0:9400,server,nowait \
 -rtc base=utc -boot menu=off,order=c -L /usr/share/qemu \
 -device virtio-scsi-pci,id=scsi0,num_queues=8,bus=pci.0,addr=0x7 \
 -drive file=./mydebian2.qcow2,if=none,id=drive-virtio-disk0,aio=native,cache=writeback \
 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x9,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
 -drive file=./tfile00.img,if=none,id=drive-scsi0-0-0-0,aio=native \
 -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0 \
 -drive file=./tfile01.img,if=none,id=drive-scsi0-0-0-1,aio=native \
 -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=1,drive=drive-scsi0-0-0-1,id=scsi0-0-0-1 \
 -k en-us -vga cirrus -netdev user,id=vnet0,net=192.168.122.0/24 \
 -net nic,vlan=0,model=virtio,macaddr=52:54:00:5b:d7:00 \
 -net tap,vlan=0,ifname=dntap0,vhost=on,script=no,downscript=no \
 -vnc 0.0.0.0:1 -virtfs local,path=/Dev,mount_tag=homedev,security_model=none

(where each of tfile0[01].img is a 16-GiB image)

And there's nothing special about the workload. Inside the guest, I go to
a 9pfs-mounted directory where the kernel source is available.
When I just run 'make install', the guest crashes immediately.
That's the simplest way to make it crash.

Dongsu

> I can not reproduce it in my VM.
> One interesting point is that the oops always happened
> on CPU3 in your tests, looks like the mapping is broken
> for CPU3's ctx in case of CPU 1 offline?
> 
> > Cheers,
> > Dongsu
> >
> >  [beginning of call traces] 
> > [   22.942214] smpboot: CPU 1 is now offline
> > [   30.686284] random: nonblocking pool is initialized
> > [   39.857305] fuse init (API version 7.23)
> > [   40.563853] BUG: unable to handle kernel NULL pointer dereference at 
> > 0018
> > [   40.564005] IP: [813b905d] __bt_get.isra.5+0x7d/0x1e0
> > [   40.564005] PGD 7a363067 PUD 7cadc067 PMD 0
> > [   40.564005] Oops:  [#1] SMP
> > [   40.564005] Modules linked in: fuse cpufreq_stats binfmt_misc 9p fscache 
> > dm_round_robin dm_multipath loop r
> > tc_cmos 9pnet_virtio 9pnet serio_raw acpi_cpufreq i2c_piix4 virtio_net
> > [   40.564005] CPU: 3 PID: 6349 Comm: grub-mount Not tainted 4.0.0+ #320
> > [   40.564005] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> > 1.7.5-20140709_153950- 04/01/2014
> > [   40.564005] task: 880079011560 ti: 88007a1c8000 task.ti: 
> > 88007a1c8000
> > [   40.564005] RIP: 0010:[813b905d]  [813b905d] 
> > __bt_get.isra.5+0x7d/0x1e0
> > [   40.564005] RSP: 0018:88007a1cb838  EFLAGS: 00010246
> > [   40.564005] RAX: 0075 RBX: 88007913c400 RCX: 
> > 0078
> > [   40.564005] RDX: 88007fddbb80 RSI: 0010 RDI: 
> > 88007913c400
> > [   40.564005] RBP: 88007a1cb888 R08: 88007fddbb80 R09: 
> > 0001
> > [   40.564005] R10:  R11: 0001 R12: 
> > 0010
> > [   40.564005] R13: 0010 R14: 88007a1cb988 R15: 
> > 88007fddbb80
> > [   40.564005] FS:  2b7c8b6807c0() GS:88007fc0() 
> > knlGS:
> > [   40.564005] CS:  0010 DS:  ES:  CR0: 80050033
> > [   40.564005] CR2: 0018 CR3: 79b0b000 CR4: 
> > 001407e0
> > [   40.564005] Stack:
> > [   40.564005]  88007a1cb918 88007fdd58c0 0078 
> > 813b5d28
> > [   40.564005]  88007a1cb878 88007913c400 0010 
> > 0010
> > [   40.564005]  88007a1cb988 88007fddbb80 88007a1cb908 
> > 813b9225
> > [   40.564005] Call Trace:
> > [   40.564005]  [813b5d28] ? blk_mq_queue_enter+0x98/0x2b0
> > [   40.564005]  [813b9225] bt_get

Re: panic with CPU hotplug + blk-mq + scsi-mq

2015-04-20 Thread Dongsu Park
On 21.04.2015 00:48, Ming Lei wrote:
> Thanks for providing that.
> The trick is just in CPU number and virtio-scsi hw queue number,
> and that is why I asked that, :-)
> Now the problem is quite clear, before CPU1 online, suppose
> CPU3 is mapped hw queue 6, and CPU 3 will map to hw queue 5
> after CPU1 is offline, unfortunately current code can't allocate
> tags for hw queue 5 even it becomes mapped.
> The following updated patch(include original patch 2) will fix
> the problem, and patch 1 is required too.
> So the following patch should fix your hotplug issue.

Yes, it works indeed. Thanks a lot! :-)
You can add:

Tested-by: Dongsu Park dongsu.p...@profitbricks.com

As the original patch didn't apply, I had to change some nitpicks though.
(see below)

Cheers,
Dongsu
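
The remapping described in the quoted explanation (CPU3 on hw queue 6 with all
four CPUs online, hw queue 5 once CPU1 is offline) can be reproduced with a
small user-space model. This is only a sketch: it assumes a proportional
mapping in which the n-th online CPU gets queue n * nr_queues / nr_online, and
the function name model_cpu_to_queue() is hypothetical, not a kernel symbol.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Simplified model (an assumption, not the kernel code): the n-th online
 * CPU is assigned hw queue n * nr_queues / nr_online.
 * Returns -1 for an offline or out-of-range CPU.
 */
int model_cpu_to_queue(const bool *online, int nr_cpus, int nr_queues, int cpu)
{
    int nr_online = 0;
    int index = 0;  /* position of 'cpu' among the online CPUs */

    assert(online != NULL && nr_cpus > 0 && nr_queues > 0);
    if (cpu < 0 || cpu >= nr_cpus || !online[cpu])
        return -1;

    for (int i = 0; i < nr_cpus; i++) {
        if (!online[i])
            continue;
        if (i < cpu)
            index++;
        nr_online++;
    }
    return index * nr_queues / nr_online;
}
```

With 4 CPUs and num_queues=8 as in the QEMU command line, this model moves
CPU3 from queue 6 to queue 5 when CPU1 goes offline, so a queue that had no
software contexts before suddenly needs tags.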



From 8c0edcbbdfbab67dc8ae2fd46cca6a86e0cadcba Mon Sep 17 00:00:00 2001
From: Ming Lei ming@canonical.com
Date: Sun, 19 Apr 2015 23:32:46 +0800
Subject: [PATCH v1 2/2] blk-mq: fix CPU hotplug handling

Firstly the hctx->tags have to be set as NULL if it is to be disabled
no matter if set->tags[i] is NULL or not in blk_mq_map_swqueue() because
shared tags can be freed already from another request queue.

The same situation has to be considered in blk_mq_hctx_cpu_online() too.

Finally one unmapped hw queue can be remapped after CPU topo is changed,
we need to allocate tags for the hw queue in blk_mq_map_swqueue() too.
Then tags allocation for hw queue can be removed in hctx cpu online
notifier, and it is reasonable to do that after remapping is done.

Cc: sta...@vger.kernel.org
Reported-by: Dongsu Park dongsu.p...@profitbricks.com
Signed-off-by: Ming Lei ming@canonical.com
---
 block/blk-mq.c | 34 +-
 1 file changed, 13 insertions(+), 21 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 078840ce8670..df4b9597e477 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1573,22 +1573,6 @@ static int blk_mq_hctx_cpu_offline(struct blk_mq_hw_ctx 
*hctx, int cpu)
return NOTIFY_OK;
 }
 
-static int blk_mq_hctx_cpu_online(struct blk_mq_hw_ctx *hctx, int cpu)
-{
-   struct request_queue *q = hctx->queue;
-   struct blk_mq_tag_set *set = q->tag_set;
-
-   if (set->tags[hctx->queue_num])
-   return NOTIFY_OK;
-
-   set->tags[hctx->queue_num] = blk_mq_init_rq_map(set, hctx->queue_num);
-   if (!set->tags[hctx->queue_num])
-   return NOTIFY_STOP;
-
-   hctx->tags = set->tags[hctx->queue_num];
-   return NOTIFY_OK;
-}
-
 static int blk_mq_hctx_notify(void *data, unsigned long action,
  unsigned int cpu)
 {
@@ -1596,8 +1580,11 @@ static int blk_mq_hctx_notify(void *data, unsigned long 
action,
 
if (action == CPU_DEAD || action == CPU_DEAD_FROZEN)
return blk_mq_hctx_cpu_offline(hctx, cpu);
-   else if (action == CPU_ONLINE || action == CPU_ONLINE_FROZEN)
-   return blk_mq_hctx_cpu_online(hctx, cpu);
+
+   /*
+* In case of CPU online, tags will be reallocated
+* after new mapping is done in blk_mq_map_swqueue().
+*/
 
return NOTIFY_OK;
 }
@@ -1779,6 +1766,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)
unsigned int i;
struct blk_mq_hw_ctx *hctx;
struct blk_mq_ctx *ctx;
+   struct blk_mq_tag_set *set = q->tag_set;
 
queue_for_each_hw_ctx(q, hctx, i) {
cpumask_clear(hctx->cpumask);
@@ -1805,16 +1793,20 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 * disable it and free the request entries.
 */
if (!hctx->nr_ctx) {
-   struct blk_mq_tag_set *set = q->tag_set;
-
if (set->tags[i]) {
blk_mq_free_rq_map(set, set->tags[i], i);
set->tags[i] = NULL;
-   hctx->tags = NULL;
}
+   hctx->tags = NULL;
continue;
}
 
+   /* unmapped hw queue can be remapped after CPU topo changed */
+   if (!set->tags[i])
+   set->tags[i] = blk_mq_init_rq_map(set, hctx->queue_num);
+   hctx->tags = set->tags[i];
+   WARN_ON(!hctx->tags);
+
+
/*
 * Initialize batch roundrobin counts
 */
-- 
2.1.0
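
The bookkeeping this patch changes in blk_mq_map_swqueue() can be modelled in
user space. The sketch below is not kernel code: the model_* types and
function are hypothetical stand-ins, calloc()/free() replace
blk_mq_init_rq_map()/blk_mq_free_rq_map(), and only the two rules from the
patch are kept (always clear hctx->tags for an unmapped queue, and allocate
tags for a queue that has become mapped).

```c
#include <assert.h>
#include <stdlib.h>

#define NR_HW_QUEUES 8

struct model_tags { int depth; };           /* stand-in for struct blk_mq_tags */

struct model_tag_set {
    struct model_tags *tags[NR_HW_QUEUES];  /* per-set tag maps, may be shared */
};

struct model_hctx {
    int nr_ctx;                             /* software queues mapped to this hctx */
    struct model_tags *tags;                /* must mirror set->tags[i] */
};

/* Re-sync every hw queue with the tag set after a remap. Returns 0 or -1. */
int model_map_swqueue(struct model_tag_set *set, struct model_hctx *hctx,
                      int nr_hw_queues)
{
    assert(set != NULL && hctx != NULL);
    assert(nr_hw_queues >= 0 && nr_hw_queues <= NR_HW_QUEUES);

    for (int i = 0; i < nr_hw_queues; i++) {
        if (hctx[i].nr_ctx == 0) {
            /* Unmapped: free the map if present, always clear the pointer. */
            free(set->tags[i]);
            set->tags[i] = NULL;
            hctx[i].tags = NULL;
            continue;
        }
        /* Mapped, possibly for the first time after a topology change. */
        if (set->tags[i] == NULL) {
            set->tags[i] = calloc(1, sizeof(*set->tags[i]));
            if (set->tags[i] == NULL)
                return -1;
        }
        hctx[i].tags = set->tags[i];
    }
    return 0;
}
```

In this model, moving a context from queue 6 to queue 5 and calling
model_map_swqueue() again leaves queue 6 with a NULL tags pointer and queue 5
with a freshly allocated one, which is the state the NULL dereference in
__bt_get() was missing.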



panic with CPU hotplug + blk-mq + scsi-mq

2015-04-17 Thread Dongsu Park
Hi,

there's a critical bug regarding CPU hotplug, blk-mq, and scsi-mq.
Every time a CPU is offlined, some arbitrary range of kernel memory
seems to get corrupted. Then after a while, the kernel panics at random places
when block IOs are issued. (for example, see the call traces below)

This bug is easily reproducible with a QEMU VM running with virtio-scsi,
when its guest kernel is 3.19-rc1 or higher, and when scsi-mq is loaded
with blk-mq enabled. And yes, the 4.0 release is still affected, as well as
Jens' for-4.1/core. How to reproduce:

 # echo 0 > /sys/devices/system/cpu/cpu1/online
 (and issue some block IOs, that's it.)
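
For scripted test runs, the same offlining step can be done from a small C
helper instead of the shell. This is only a sketch: write_value() and
set_cpu_online() are hypothetical helper names, and writing to the real sysfs
file requires root inside the guest.

```c
#include <stdio.h>

/*
 * Write a short string to a sysfs-style file, like `echo VALUE > PATH`.
 * Returns 0 on success, -1 on failure (e.g. missing file or no permission).
 */
int write_value(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;

    int ok = (fputs(value, f) >= 0) && (fputc('\n', f) != EOF);

    /* With stdio buffering the actual write may only happen at close time. */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Take CPU `cpu` offline (online == 0) or bring it back (online != 0). */
int set_cpu_online(int cpu, int online)
{
    char path[64];

    snprintf(path, sizeof(path), "/sys/devices/system/cpu/cpu%d/online", cpu);
    return write_value(path, online ? "1" : "0");
}
```

Calling set_cpu_online(1, 0) corresponds to the echo command above; block IO
still has to be issued afterwards to hit the panic.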

Bisecting between 3.18 and 3.19-rc1, it looks like this bug had been hidden
until commit ccbedf117f01 ("virtio_scsi: support multi hw queue of blk-mq"),
which started to allow virtio-scsi to map virtqueues to hardware queues of
blk-mq. Reverting that commit makes the bug go away. However, I suppose
reverting it would not be the correct solution.

More precisely, every time a CPU hotplug event gets triggered,
the call graph looks like the following:

  blk_mq_queue_reinit_notify()
  -> blk_mq_queue_reinit()
   -> blk_mq_map_swqueue()
    -> blk_mq_free_rq_map()
     -> scsi_exit_request()

From that point, as soon as any address in the request gets modified, an
arbitrary range of memory gets corrupted. My first guess was that
the exit routine might try to deallocate tags->rqs[] where invalid
addresses are stored. But that does not actually seem to be the case,
and cmd->sense_buffer also looks valid.
It's not obvious to me what exactly could go wrong.

Does anyone have an idea?

Regards,
Dongsu

 [beginning of call traces] 
[   47.274292] BUG: unable to handle kernel NULL pointer dereference at 
0018
[   47.275013] IP: [] __bt_get.isra.5+0x7d/0x1e0
[   47.275013] PGD 79c55067 PUD 7ba17067 PMD 0 
[   47.275013] Oops:  [#1] SMP 
[   47.275013] Modules linked in: fuse cpufreq_stats binfmt_misc 9p fscache 
dm_round_robin loop dm_multipath 9pnet_virtio rtc_cmos 9pnet acpi_cpufreq 
serio_raw i2c_piix4 virtio_net
[   47.275013] CPU: 3 PID: 6232 Comm: blkid Not tainted 4.0.0 #303
[   47.275013] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.7.5-20140709_153950- 04/01/2014
[   47.275013] task: 88003dfbc020 ti: 880079bac000 task.ti: 
880079bac000
[   47.275013] RIP: 0010:[]  [] 
__bt_get.isra.5+0x7d/0x1e0
[   47.275013] RSP: 0018:880079baf898  EFLAGS: 00010246
[   47.275013] RAX: 003c RBX: 880079198400 RCX: 0078
[   47.275013] RDX: 88007fddbb80 RSI: 0010 RDI: 880079198400
[   47.275013] RBP: 880079baf8e8 R08: 88007fddbb80 R09: 
[   47.275013] R10: 0001 R11: 0001 R12: 0010
[   47.275013] R13: 0010 R14: 880079baf9e8 R15: 88007fddbb80
[   47.275013] FS:  2b270c049800() GS:88007fc0() 
knlGS:
[   47.275013] CS:  0010 DS:  ES:  CR0: 80050033
[   47.275013] CR2: 0018 CR3: 7ca8d000 CR4: 001407e0
[   47.275013] Stack:
[   47.275013]  880079baf978 88007fdd58c0 0078 
814071ff
[   47.275013]  880079baf8d8 880079198400 0010 
0010
[   47.275013]  880079baf9e8 88007fddbb80 880079baf968 
8140b4e5
[   47.275013] Call Trace:
[   47.275013]  [] ? blk_mq_queue_enter+0x9f/0x2d0
[   47.275013]  [] bt_get+0x65/0x1e0
[   47.275013]  [] ? blk_mq_queue_enter+0x9f/0x2d0
[   47.275013]  [] ? wait_woken+0xa0/0xa0
[   47.275013]  [] blk_mq_get_tag+0xa7/0xd0
[   47.275013]  [] __blk_mq_alloc_request+0x1b/0x200
[   47.275013]  [] blk_mq_map_request+0xd6/0x4e0
[   47.275013]  [] blk_mq_make_request+0x6e/0x2d0
[   47.275013]  [] ? generic_make_request_checks+0x674/0x6a0
[   47.275013]  [] ? bio_add_page+0x5e/0x70
[   47.275013]  [] generic_make_request+0xc0/0x110
[   47.275013]  [] submit_bio+0x68/0x150
[   47.275013]  [] ? lru_cache_add+0x1c/0x50
[   47.275013]  [] mpage_bio_submit+0x2a/0x40
[   47.275013]  [] mpage_readpages+0x10c/0x130
[   47.275013]  [] ? I_BDEV+0x10/0x10
[   47.275013]  [] ? I_BDEV+0x10/0x10
[   47.275013]  [] ? __page_cache_alloc+0x137/0x160
[   47.275013]  [] blkdev_readpages+0x1d/0x20
[   47.275013]  [] __do_page_cache_readahead+0x29f/0x320
[   47.275013]  [] ? __do_page_cache_readahead+0x165/0x320
[   47.275013]  [] force_page_cache_readahead+0x34/0x60
[   47.275013]  [] page_cache_sync_readahead+0x46/0x50
[   47.275013]  [] generic_file_read_iter+0x52c/0x640
[   47.275013]  [] blkdev_read_iter+0x37/0x40
[   47.275013]  [] new_sync_read+0x7e/0xb0
[   47.275013]  [] __vfs_read+0x18/0x50
[   47.275013]  [] vfs_read+0x8d/0x150
[   47.275013]  [] SyS_read+0x49/0xb0
[   47.275013]  [] system_call_fastpath+0x12/0x17
[   47.275013] Code: 97 18 03 00 00 bf 04 00 00 00 41 f7 f1 83 f8 04 0f 43 f8 
b8 ff ff ff ff 44 39 d7 0f 86 c1 00 00 00 41 8b 00 48 89 4d c0 49 89 f5 

Re: [PATCH] dm: fix multipath regression due to initializing wrong request

2015-02-10 Thread Dongsu Park
On 09.02.2015 10:47, Jens Axboe wrote:
> On 02/09/2015 10:35 AM, Mike Snitzer wrote:
> >On Mon, Feb 09 2015 at 12:13P -0500,
> >Mike Snitzer  wrote:
> >
> >Jens and I discussed this further and given that linux-block breaks
> >dm-multipath it is best to fix linux-block and let Linus resolve the
> >merge when I send him the linux-dm pull.
> >
> >Here is the patch to fix the regression:
> 
> Added, thanks. I don't think this is worth rebasing for, so just added to
> the top of for-3.20/core (since that's where the buggy commit was added).

Thanks a lot. Now the branch for-3.20/core works without hitting the BUG.

Dongsu

> -- 
> Jens Axboe
> 


blk-mq crash with dm-multipath in for-3.20/core

2015-02-09 Thread Dongsu Park
Hi Jens,

While testing the linux-block for-3.20/core branch, I hit the BUG shown
below. It's reproducible by running xfstests/xfs/279.

Bisecting showed that the first bad commit is 6d6285c45f5a ("block:
require blk_rq_prep_clone() be given an initialized clone request").
After reverting this commit, the crash disappears.
The linux-dm branch dm-for-3.20 also works fine, without the crash.

As Keith Busch already pointed out in a thread [1], that commit should
not be there in the first place. Commit 102e38b1030e ("dm: split
request structure out from dm_rq_target_io structure") from the linux-dm
tree [2] is going to move the blk_rq_init() call again to __clone_rq().

So commit 6d6285c45f5a should be either reverted or moved to the
linux-dm tree, shouldn't it?

Cheers,
Dongsu

[1] https://www.redhat.com/archives/dm-devel/2015-January/msg00171.html
[2] 
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-for-3.20&id=102e38b1030e883efc022dfdc7b7e7a3de70d1c5

[ cut here ]
kernel BUG at block/blk-core.c:2333!
RIP: 0010: [] blk_dequeue_request+0x78/0x90
Call Trace:
 [] blk_start_request+0x16/0x70
 [] dm_start_request+0x1a/0x50
 [] dm_request_fn+0x2b6/0x3e0
 [] __blk_run_queue+0x37/0x50
 [] queue_unplugged+0x5d/0x230
 [] blk_flush_plug_list+0x1ac/0x230
 [] blk_finish_plug+0x18/0x60
 [] __do_page_cache_readahead+0x2b1/0x320
 [] ? __do_page_cache_readahead+0x165/0x320
 [] ondemand_readahead+0xe2/0x480
 [] ? pagecache_get_page+0x2f/0x200
 [] page_cache_sync_readahead+0x31/0x50
 [] generic_file_read_iter+0x51c/0x630
 [] ? might_fault+0x5e/0xc0
 [] blkdev_read_iter+0x37/0x40
 [] new_sync_read+0x7e/0xb0
 [] __vfs_read+0x18/0x50
 [] vfs_read+0x8d/0x150
 [] SyS_read+0x49/0xb0
 [] system_call_fastpath+0x12/0x17
RIP  [] blk_dequeue_request+0x78/0x90
 RSP 
---[ end trace dcfc3d438518b1aa ]---



Re: cleanup and refactor BLOCK_PC mapping helpers V2

2015-02-09 Thread Dongsu Park
On 05.02.2015 09:28, Jens Axboe wrote:
> On 02/02/2015 06:19 AM, Christoph Hellwig wrote:
> >Jens, do these patches look fine to you?  Any chance to get them into
> >the tree for the 3.20 merge window?
> 
> Yes, I think they look fine. I'll throw them into the testing mix and merge
> them for 3.20.

Thanks a lot, and many thanks also to Christoph.

Dongsu

> -- 
> Jens Axboe
> 


Re: [PATCH v2 2/7] block: rewrite __bio_copy_iov()

2015-01-16 Thread Dongsu Park
Hi Christoph,

On 16.01.2015 03:31, Christoph Hellwig wrote:
> On Thu, Jan 15, 2015 at 10:18:17AM -0800, Christoph Hellwig wrote:
> > This breaks booting a simple KVM VM for me:
> Seems like the issue actually is in the patch before this one, but
> only shows up with this one applied.
> The root cause is that we only copy the iov_iter, but not the
> actual iovecs into the bio_map_data.
> I have a fixed series, which I'll send out together with various
> related cleanups ASAP.

Thanks for testing it and finding the root cause.
Strange, I have never seen the bug.
Maybe I should also test with virtio-scsi, which I don't usually do.

Dongsu


[PATCH v2 3/9] block: allow __blk_queue_bounce() to handle bios larger than BIO_MAX_PAGES

2015-01-12 Thread Dongsu Park
From: Kent Overstreet 

Allow __blk_queue_bounce() to handle bios with more than BIO_MAX_PAGES
segments. Doing that, it becomes possible to simplify the block layer
in the kernel.

The issue is that any code that clones the bio and must clone the biovec
(i.e. it can't use bio_clone_fast()) won't be able to allocate a bio with
more than BIO_MAX_PAGES - bio_alloc_bioset() always fails in that case.

Fortunately, it's easy to make __blk_queue_bounce() just process part of
the bio if necessary, using bi_remaining to count the splits and punting
the rest back to generic_make_request().

Cc: Christoph Hellwig 
Cc: Jens Axboe 
Signed-off-by: Kent Overstreet 
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park 
---
 block/bounce.c | 60 ++
 1 file changed, 52 insertions(+), 8 deletions(-)

diff --git a/block/bounce.c b/block/bounce.c
index ab21ba2..689ea89 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -196,6 +196,43 @@ static int must_snapshot_stable_pages(struct request_queue 
*q, struct bio *bio)
 }
 #endif /* CONFIG_NEED_BOUNCE_POOL */
 
+static struct bio *bio_clone_segments(struct bio *bio_src, gfp_t gfp_mask,
+ struct bio_set *bs, unsigned nsegs)
+{
+   struct bvec_iter iter;
+   struct bio_vec bv;
+   struct bio *bio;
+
+   bio = bio_alloc_bioset(gfp_mask, nsegs, bs);
+   if (!bio)
+   return NULL;
+
+   bio->bi_bdev= bio_src->bi_bdev;
+   bio->bi_rw  = bio_src->bi_rw;
+   bio->bi_iter.bi_sector  = bio_src->bi_iter.bi_sector;
+
+   bio_for_each_segment(bv, bio_src, iter) {
+   bio->bi_io_vec[bio->bi_vcnt++] = bv;
+   bio->bi_iter.bi_size += bv.bv_len;
+   if (!--nsegs)
+   break;
+   }
+
+   if (bio_integrity(bio_src)) {
+   int ret;
+
+   ret = bio_integrity_clone(bio, bio_src, gfp_mask);
+   if (ret < 0) {
+   bio_put(bio);
+   return NULL;
+   }
+   }
+
+   bio_src->bi_iter = iter;
+
+   return bio;
+}
+
 static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
   mempool_t *pool, int force)
 {
@@ -203,17 +240,24 @@ static void __blk_queue_bounce(struct request_queue *q, 
struct bio **bio_orig,
int rw = bio_data_dir(*bio_orig);
struct bio_vec *to, from;
struct bvec_iter iter;
-   unsigned i;
+   int i, nsegs = 0, bounce = force;
 
-   if (force)
-   goto bounce;
-   bio_for_each_segment(from, *bio_orig, iter)
+   bio_for_each_segment(from, *bio_orig, iter) {
+   nsegs++;
if (page_to_pfn(from.bv_page) > queue_bounce_pfn(q))
-   goto bounce;
+   bounce = 1;
+   }
+
+   if (!bounce)
+   return;
 
-   return;
-bounce:
-   bio = bio_clone_bioset(*bio_orig, GFP_NOIO, fs_bio_set);
+   bio = bio_clone_segments(*bio_orig, GFP_NOIO, fs_bio_set,
+min(nsegs, BIO_MAX_PAGES));
+
+   if ((*bio_orig)->bi_iter.bi_size) {
+   atomic_inc(&(*bio_orig)->bi_remaining);
+   generic_make_request(*bio_orig);
+   }
 
bio_for_each_segment_all(to, bio, i) {
struct page *page = to->bv_page;
-- 
2.1.0
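
For readers following the mechanics of patch 3/9: the idea of handling a
bio in pieces of at most BIO_MAX_PAGES segments can be illustrated outside
the kernel. The sketch below is a toy model only. struct toy_src,
toy_take_segments() and toy_count_pieces() are hypothetical names, not
kernel APIs, and the model ignores bi_remaining, bounce pages and integrity
data. It only shows how taking a bounded number of segments and advancing
the source position (the analogue of "bio_src->bi_iter = iter" in
bio_clone_segments()) lets arbitrarily large input be consumed piece by
piece. In the patch itself the remainder is resubmitted through
generic_make_request() rather than looped over.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a bio plus its iterator; NOT the kernel's struct bio. */
struct toy_src {
	const unsigned *seg_len;	/* byte length of each segment */
	size_t nr_segs;			/* total number of segments */
	size_t idx;			/* first segment not yet taken */
};

/*
 * Take at most max_segs segments from src and advance src past them.
 * Returns the number of bytes covered by this piece (0 if src is empty).
 */
static unsigned toy_take_segments(struct toy_src *src, size_t max_segs)
{
	unsigned bytes = 0;

	while (src->idx < src->nr_segs && max_segs > 0) {
		bytes += src->seg_len[src->idx++];
		max_segs--;
	}
	return bytes;
}

/* Count how many pieces are needed when each piece is capped at max_segs. */
static size_t toy_count_pieces(struct toy_src *src, size_t max_segs)
{
	size_t pieces = 0;

	assert(max_segs > 0);	/* a zero cap would never make progress */
	while (src->idx < src->nr_segs) {
		toy_take_segments(src, max_segs);
		pieces++;
	}
	return pieces;
}
```

With five segments and a cap of two segments per piece, the model needs
three pieces; the last piece is simply shorter than the cap.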



[PATCH v2 4/9] bcache: clean up hacks around bio_split_pool

2015-01-12 Thread Dongsu Park
From: Kent Overstreet 

There have been workarounds, only in bcache, for the bio split pool as
well as for submitting bios. Since generic_make_request() is able to handle
arbitrarily sized bios, it's now possible to delete those hacks.

Cc: linux-bca...@vger.kernel.org
Signed-off-by: Kent Overstreet 
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park 
---
 drivers/md/bcache/bcache.h|  18 
 drivers/md/bcache/io.c| 100 +-
 drivers/md/bcache/journal.c   |   4 +-
 drivers/md/bcache/request.c   |  16 +++
 drivers/md/bcache/super.c |  32 +-
 drivers/md/bcache/util.h  |   5 ++-
 drivers/md/bcache/writeback.c |   4 +-
 7 files changed, 18 insertions(+), 161 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 04f7bc2..6b420a5 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -243,19 +243,6 @@ struct keybuf {
DECLARE_ARRAY_ALLOCATOR(struct keybuf_key, freelist, KEYBUF_NR);
 };
 
-struct bio_split_pool {
-   struct bio_set  *bio_split;
-   mempool_t   *bio_split_hook;
-};
-
-struct bio_split_hook {
-   struct closure  cl;
-   struct bio_split_pool   *p;
-   struct bio  *bio;
-   bio_end_io_t*bi_end_io;
-   void*bi_private;
-};
-
 struct bcache_device {
struct closure  cl;
 
@@ -288,8 +275,6 @@ struct bcache_device {
int (*cache_miss)(struct btree *, struct search *,
  struct bio *, unsigned);
int (*ioctl) (struct bcache_device *, fmode_t, unsigned, unsigned long);
-
-   struct bio_split_pool   bio_split_hook;
 };
 
 struct io {
@@ -454,8 +439,6 @@ struct cache {
atomic_long_t   meta_sectors_written;
atomic_long_t   btree_sectors_written;
atomic_long_t   sectors_written;
-
-   struct bio_split_pool   bio_split_hook;
 };
 
 struct gc_stat {
@@ -873,7 +856,6 @@ void bch_bbio_endio(struct cache_set *, struct bio *, int, 
const char *);
 void bch_bbio_free(struct bio *, struct cache_set *);
 struct bio *bch_bbio_alloc(struct cache_set *);
 
-void bch_generic_make_request(struct bio *, struct bio_split_pool *);
 void __bch_submit_bbio(struct bio *, struct cache_set *);
 void bch_submit_bbio(struct bio *, struct cache_set *, struct bkey *, 
unsigned);
 
diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
index fa028fa..86a0bb8 100644
--- a/drivers/md/bcache/io.c
+++ b/drivers/md/bcache/io.c
@@ -11,104 +11,6 @@
 
 #include 
 
-static unsigned bch_bio_max_sectors(struct bio *bio)
-{
-   struct request_queue *q = bdev_get_queue(bio->bi_bdev);
-   struct bio_vec bv;
-   struct bvec_iter iter;
-   unsigned ret = 0, seg = 0;
-
-   if (bio->bi_rw & REQ_DISCARD)
-   return min(bio_sectors(bio), q->limits.max_discard_sectors);
-
-   bio_for_each_segment(bv, bio, iter) {
-   struct bvec_merge_data bvm = {
-   .bi_bdev= bio->bi_bdev,
-   .bi_sector  = bio->bi_iter.bi_sector,
-   .bi_size= ret << 9,
-   .bi_rw  = bio->bi_rw,
-   };
-
-   if (seg == min_t(unsigned, BIO_MAX_PAGES,
-queue_max_segments(q)))
-   break;
-
-   if (q->merge_bvec_fn &&
-   q->merge_bvec_fn(q, &bvm, &bv) < (int) bv.bv_len)
-   break;
-
-   seg++;
-   ret += bv.bv_len >> 9;
-   }
-
-   ret = min(ret, queue_max_sectors(q));
-
-   WARN_ON(!ret);
-   ret = max_t(int, ret, bio_iovec(bio).bv_len >> 9);
-
-   return ret;
-}
-
-static void bch_bio_submit_split_done(struct closure *cl)
-{
-   struct bio_split_hook *s = container_of(cl, struct bio_split_hook, cl);
-
-   s->bio->bi_end_io = s->bi_end_io;
-   s->bio->bi_private = s->bi_private;
-   bio_endio_nodec(s->bio, 0);
-
-   closure_debug_destroy(&s->cl);
-   mempool_free(s, s->p->bio_split_hook);
-}
-
-static void bch_bio_submit_split_endio(struct bio *bio, int error)
-{
-   struct closure *cl = bio->bi_private;
-   struct bio_split_hook *s = container_of(cl, struct bio_split_hook, cl);
-
-   if (error)
-   clear_bit(BIO_UPTODATE, &s->bio->bi_flags);
-
-   bio_put(bio);
-   closure_put(cl);
-}
-
-void bch_generic_make_request(struct bio *bio, struct bio_split_pool *p)
-{
-   struct bio_split_hook *s;
-   struct bio *n;
-
-   if (!bio_has_data(bio) && !(bio->bi_rw & REQ_DISCARD))
-   goto submit;
-
-   if (bio_sectors(bio) <= bch_bio_max_sectors(bio))
-   goto submit;
-
-   s = mempool_alloc(p->bio_split_hook

[PATCH v2 5/9] btrfs: remove bio splitting and merge_bvec_fn() calls

2015-01-12 Thread Dongsu Park
From: Kent Overstreet 

Btrfs has been doing bio splitting from btrfs_map_bio(), by checking
device limits as well as calling ->merge_bvec_fn() etc. That is not
necessary any more, because generic_make_request() is now able to
handle arbitrarily sized bios. So clean up unnecessary code paths.

Cc: Chris Mason 
Cc: Josef Bacik 
Cc: linux-bt...@vger.kernel.org
Signed-off-by: Kent Overstreet 
Signed-off-by: Chris Mason 
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park 
---
 fs/btrfs/volumes.c | 73 --
 1 file changed, 73 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 50c5a87..c627bf8 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5691,34 +5691,6 @@ static noinline void btrfs_schedule_bio(struct 
btrfs_root *root,
 >work);
 }
 
-static int bio_size_ok(struct block_device *bdev, struct bio *bio,
-  sector_t sector)
-{
-   struct bio_vec *prev;
-   struct request_queue *q = bdev_get_queue(bdev);
-   unsigned int max_sectors = queue_max_sectors(q);
-   struct bvec_merge_data bvm = {
-   .bi_bdev = bdev,
-   .bi_sector = sector,
-   .bi_rw = bio->bi_rw,
-   };
-
-   if (WARN_ON(bio->bi_vcnt == 0))
-   return 1;
-
-   prev = &bio->bi_io_vec[bio->bi_vcnt - 1];
-   if (bio_sectors(bio) > max_sectors)
-   return 0;
-
-   if (!q->merge_bvec_fn)
-   return 1;
-
-   bvm.bi_size = bio->bi_iter.bi_size - prev->bv_len;
-   if (q->merge_bvec_fn(q, &bvm, prev) < prev->bv_len)
-   return 0;
-   return 1;
-}
-
 static void submit_stripe_bio(struct btrfs_root *root, struct btrfs_bio *bbio,
  struct bio *bio, u64 physical, int dev_nr,
  int rw, int async)
@@ -5752,38 +5724,6 @@ static void submit_stripe_bio(struct btrfs_root *root, 
struct btrfs_bio *bbio,
btrfsic_submit_bio(rw, bio);
 }
 
-static int breakup_stripe_bio(struct btrfs_root *root, struct btrfs_bio *bbio,
- struct bio *first_bio, struct btrfs_device *dev,
- int dev_nr, int rw, int async)
-{
-   struct bio_vec *bvec = first_bio->bi_io_vec;
-   struct bio *bio;
-   int nr_vecs = bio_get_nr_vecs(dev->bdev);
-   u64 physical = bbio->stripes[dev_nr].physical;
-
-again:
-   bio = btrfs_bio_alloc(dev->bdev, physical >> 9, nr_vecs, GFP_NOFS);
-   if (!bio)
-   return -ENOMEM;
-
-   while (bvec <= (first_bio->bi_io_vec + first_bio->bi_vcnt - 1)) {
-   if (bio_add_page(bio, bvec->bv_page, bvec->bv_len,
-bvec->bv_offset) < bvec->bv_len) {
-   u64 len = bio->bi_iter.bi_size;
-
-   atomic_inc(&bbio->stripes_pending);
-   submit_stripe_bio(root, bbio, bio, physical, dev_nr,
- rw, async);
-   physical += len;
-   goto again;
-   }
-   bvec++;
-   }
-
-   submit_stripe_bio(root, bbio, bio, physical, dev_nr, rw, async);
-   return 0;
-}
-
 static void bbio_error(struct btrfs_bio *bbio, struct bio *bio, u64 logical)
 {
atomic_inc(&bbio->error);
@@ -5862,19 +5802,6 @@ int btrfs_map_bio(struct btrfs_root *root, int rw, 
struct bio *bio,
continue;
}
 
-   /*
-* Check and see if we're ok with this bio based on it's size
-* and offset with the given device.
-*/
-   if (!bio_size_ok(dev->bdev, first_bio,
-bbio->stripes[dev_nr].physical >> 9)) {
-   ret = breakup_stripe_bio(root, bbio, first_bio, dev,
-dev_nr, rw, async_submit);
-   BUG_ON(ret);
-   dev_nr++;
-   continue;
-   }
-
if (dev_nr < total_devs - 1) {
bio = btrfs_bio_clone(first_bio, GFP_NOFS);
BUG_ON(!bio); /* -ENOMEM */
-- 
2.1.0



[PATCH v2 6/9] md/raid5: get rid of bio_fits_rdev()

2015-01-12 Thread Dongsu Park
From: Kent Overstreet 

Remove bio_fits_rdev() completely, because ->merge_bvec_fn() is now
gone. There's no point in calling bio_fits_rdev() only to ensure an
aligned read from rdev.

Cc: Neil Brown 
Cc: linux-r...@vger.kernel.org
Signed-off-by: Kent Overstreet 
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park 
---
 drivers/md/raid5.c | 23 +--
 1 file changed, 1 insertion(+), 22 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index c1b0d52..40e464c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4218,25 +4218,6 @@ static void raid5_align_endio(struct bio *bi, int error)
add_bio_to_retry(raid_bi, conf);
 }
 
-static int bio_fits_rdev(struct bio *bi)
-{
-   struct request_queue *q = bdev_get_queue(bi->bi_bdev);
-
-   if (bio_sectors(bi) > queue_max_sectors(q))
-   return 0;
-   blk_recount_segments(q, bi);
-   if (bi->bi_phys_segments > queue_max_segments(q))
-   return 0;
-
-   if (q->merge_bvec_fn)
-   /* it's too hard to apply the merge_bvec_fn at this stage,
-* just just give up
-*/
-   return 0;
-
-   return 1;
-}
-
 static int chunk_aligned_read(struct mddev *mddev, struct bio * raid_bio)
 {
struct r5conf *conf = mddev->private;
@@ -4290,11 +4271,9 @@ static int chunk_aligned_read(struct mddev *mddev, 
struct bio * raid_bio)
align_bi->bi_bdev =  rdev->bdev;
__clear_bit(BIO_SEG_VALID, &align_bi->bi_flags);
 
-   if (!bio_fits_rdev(align_bi) ||
-   is_badblock(rdev, align_bi->bi_iter.bi_sector,
+   if (is_badblock(rdev, align_bi->bi_iter.bi_sector,
bio_sectors(align_bi),
&first_bad, &bad_sectors)) {
-   /* too big in some way, or has a known bad block */
bio_put(align_bi);
rdev_dec_pending(rdev, mddev);
return 0;
-- 
2.1.0



[PATCH v2 2/9] block: simplify bio_add_page()

2015-01-12 Thread Dongsu Park
From: Kent Overstreet 

Since generic_make_request() can now handle arbitrary size bios, all we
have to do is make sure the bvec array doesn't overflow.
__bio_add_page() no longer needs to call ->merge_bvec_fn(), so we
can get rid of unnecessary code paths.

Note that removing the call to ->merge_bvec_fn() is fine for
bio_add_pc_page(), as SCSI devices usually don't even need that.
The few exceptional cases like pscsi or osd are not affected either.

Cc: Christoph Hellwig 
Cc: Jens Axboe 
Cc: Ming Lin 
Signed-off-by: Kent Overstreet 
[dpark: rebase and resolve merge conflicts, change a couple of comments,
 make bio_add_page() warn once upon a cloned bio.]
Signed-off-by: Dongsu Park 
---
 block/bio.c | 135 +---
 1 file changed, 55 insertions(+), 80 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 7ff846d..136b78b 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -700,9 +700,23 @@ int bio_get_nr_vecs(struct block_device *bdev)
 }
 EXPORT_SYMBOL(bio_get_nr_vecs);
 
-static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
- *page, unsigned int len, unsigned int offset,
- unsigned int max_sectors)
+/**
+ * bio_add_pc_page -   attempt to add page to bio
+ * @q: the target queue
+ * @bio: destination bio
+ * @page: page to add
+ * @len: vec entry length
+ * @offset: vec entry offset
+ *
+ * Attempt to add a page to the bio_vec maplist. This can fail for a
+ * number of reasons, such as the bio being full or target block device
+ * limitations. The target block device must allow bio's up to PAGE_SIZE,
+ * so it is always possible to add a single page to an empty bio.
+ *
+ * This should only be used by REQ_PC bios.
+ */
+int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page
+   *page, unsigned int len, unsigned int offset)
 {
int retried_segments = 0;
struct bio_vec *bvec;
@@ -713,7 +727,7 @@ static int __bio_add_page(struct request_queue *q, struct 
bio *bio, struct page
if (unlikely(bio_flagged(bio, BIO_CLONED)))
return 0;
 
-   if (((bio->bi_iter.bi_size + len) >> 9) > max_sectors)
+   if (((bio->bi_iter.bi_size + len) >> 9) > queue_max_hw_sectors(q))
return 0;
 
/*
@@ -726,28 +740,7 @@ static int __bio_add_page(struct request_queue *q, struct 
bio *bio, struct page
 
if (page == prev->bv_page &&
offset == prev->bv_offset + prev->bv_len) {
-   unsigned int prev_bv_len = prev->bv_len;
prev->bv_len += len;
-
-   if (q->merge_bvec_fn) {
-   struct bvec_merge_data bvm = {
-   /* prev_bvec is already charged in
-  bi_size, discharge it in order to
-  simulate merging updated prev_bvec
-  as new bvec. */
-   .bi_bdev = bio->bi_bdev,
-   .bi_sector = bio->bi_iter.bi_sector,
-   .bi_size = bio->bi_iter.bi_size -
-   prev_bv_len,
-   .bi_rw = bio->bi_rw,
-   };
-
-   if (q->merge_bvec_fn(q, &bvm, prev) < 
prev->bv_len) {
-   prev->bv_len -= len;
-   return 0;
-   }
-   }
-
bio->bi_iter.bi_size += len;
goto done;
}
@@ -790,27 +783,6 @@ static int __bio_add_page(struct request_queue *q, struct 
bio *bio, struct page
blk_recount_segments(q, bio);
}
 
-   /*
-* if queue has other restrictions (eg varying max sector size
-* depending on offset), it can specify a merge_bvec_fn in the
-* queue to get further control
-*/
-   if (q->merge_bvec_fn) {
-   struct bvec_merge_data bvm = {
-   .bi_bdev = bio->bi_bdev,
-   .bi_sector = bio->bi_iter.bi_sector,
-   .bi_size = bio->bi_iter.bi_size - len,
-   .bi_rw = bio->bi_rw,
-   };
-
-   /*
-* merge_bvec_fn() returns number of bytes it can accept
-* at this offset
-*/
-   if (q->merge_bvec_fn(q, &bvm, bvec) < bvec->bv_len)
-   goto failed;
-   }
-
/* If we may be able to merge these biovecs, force a recount */
if (bio->bi_vcnt &g

[PATCH v2 8/9] fs: use helper bio_add_page() instead of open coding on bi_io_vec

2015-01-12 Thread Dongsu Park
From: Kent Overstreet 

Call the existing helper bio_add_page() instead of open-coding
assignments to bi_io_vec[]. This makes some parts of the filesystems
and mm/page_io.c simpler than before.

Acked-by: Dave Kleikamp 
Cc: Christoph Hellwig 
Cc: Al Viro 
Cc: linux-fsde...@vger.kernel.org
Signed-off-by: Kent Overstreet 
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park 
---
 fs/buffer.c |  7 ++-
 fs/jfs/jfs_logmgr.c | 14 --
 mm/page_io.c|  8 +++-
 3 files changed, 9 insertions(+), 20 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index dbe5699..78e63e3 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3022,12 +3022,9 @@ int _submit_bh(int rw, struct buffer_head *bh, unsigned 
long bio_flags)
 
bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
bio->bi_bdev = bh->b_bdev;
-   bio->bi_io_vec[0].bv_page = bh->b_page;
-   bio->bi_io_vec[0].bv_len = bh->b_size;
-   bio->bi_io_vec[0].bv_offset = bh_offset(bh);
 
-   bio->bi_vcnt = 1;
-   bio->bi_iter.bi_size = bh->b_size;
+   bio_add_page(bio, bh->b_page, bh->b_size, bh_offset(bh));
+   BUG_ON(bio->bi_iter.bi_size != bh->b_size);
 
bio->bi_end_io = end_bio_bh_io_sync;
bio->bi_private = bh;
diff --git a/fs/jfs/jfs_logmgr.c b/fs/jfs/jfs_logmgr.c
index bc462dc..46fae06 100644
--- a/fs/jfs/jfs_logmgr.c
+++ b/fs/jfs/jfs_logmgr.c
@@ -1999,12 +1999,9 @@ static int lbmRead(struct jfs_log * log, int pn, struct 
lbuf ** bpp)
 
bio->bi_iter.bi_sector = bp->l_blkno << (log->l2bsize - 9);
bio->bi_bdev = log->bdev;
-   bio->bi_io_vec[0].bv_page = bp->l_page;
-   bio->bi_io_vec[0].bv_len = LOGPSIZE;
-   bio->bi_io_vec[0].bv_offset = bp->l_offset;
 
-   bio->bi_vcnt = 1;
-   bio->bi_iter.bi_size = LOGPSIZE;
+   bio_add_page(bio, bp->l_page, LOGPSIZE, bp->l_offset);
+   BUG_ON(bio->bi_iter.bi_size != LOGPSIZE);
 
bio->bi_end_io = lbmIODone;
bio->bi_private = bp;
@@ -2145,12 +2142,9 @@ static void lbmStartIO(struct lbuf * bp)
bio = bio_alloc(GFP_NOFS, 1);
bio->bi_iter.bi_sector = bp->l_blkno << (log->l2bsize - 9);
bio->bi_bdev = log->bdev;
-   bio->bi_io_vec[0].bv_page = bp->l_page;
-   bio->bi_io_vec[0].bv_len = LOGPSIZE;
-   bio->bi_io_vec[0].bv_offset = bp->l_offset;
 
-   bio->bi_vcnt = 1;
-   bio->bi_iter.bi_size = LOGPSIZE;
+   bio_add_page(bio, bp->l_page, LOGPSIZE, bp->l_offset);
+   BUG_ON(bio->bi_iter.bi_size != LOGPSIZE);
 
bio->bi_end_io = lbmIODone;
bio->bi_private = bp;
diff --git a/mm/page_io.c b/mm/page_io.c
index 955db8b..8c878c7 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -33,12 +33,10 @@ static struct bio *get_swap_bio(gfp_t gfp_flags,
if (bio) {
bio->bi_iter.bi_sector = map_swap_page(page, &bio->bi_bdev);
bio->bi_iter.bi_sector <<= PAGE_SHIFT - 9;
-   bio->bi_io_vec[0].bv_page = page;
-   bio->bi_io_vec[0].bv_len = PAGE_SIZE;
-   bio->bi_io_vec[0].bv_offset = 0;
-   bio->bi_vcnt = 1;
-   bio->bi_iter.bi_size = PAGE_SIZE;
bio->bi_end_io = end_io;
+
+   bio_add_page(bio, page, PAGE_SIZE, 0);
+   BUG_ON(bio->bi_iter.bi_size != PAGE_SIZE);
}
return bio;
 }
-- 
2.1.0
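
As an illustration of why patch 8/9 prefers the helper over filling
bi_io_vec[0] by hand, here is a minimal user-space sketch of an
"add page" helper. It is an assumption-laden toy, not the kernel's
bio_add_page(): struct toy_bio, struct toy_vec, TOY_MAX_VECS and
toy_bio_add_page() are hypothetical names, and queue limits are ignored.
What it does share with the patch is the calling convention the callers
rely on: the helper returns the number of bytes added (0 on failure), keeps
the vector count and total size consistent, and the caller checks the
resulting size, as the BUG_ON() lines in the diff do.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_VECS 4	/* hypothetical capacity of the vector array */

struct toy_vec {
	const void *page;	/* identity of the page, never dereferenced */
	unsigned len;
	unsigned offset;
};

struct toy_bio {
	struct toy_vec vecs[TOY_MAX_VECS];
	size_t vcnt;		/* number of vectors in use */
	unsigned size;		/* total bytes described by the vectors */
};

/* Returns len on success, 0 if the vector array is full. */
static unsigned toy_bio_add_page(struct toy_bio *bio, const void *page,
				 unsigned len, unsigned offset)
{
	assert(bio != NULL);

	/* Extend the last vector when the new range directly follows it. */
	if (bio->vcnt > 0) {
		struct toy_vec *prev = &bio->vecs[bio->vcnt - 1];

		if (page == prev->page &&
		    offset == prev->offset + prev->len) {
			prev->len += len;
			bio->size += len;
			return len;
		}
	}

	/* Otherwise a new slot is needed; fail if none is left. */
	if (bio->vcnt >= TOY_MAX_VECS)
		return 0;

	bio->vecs[bio->vcnt].page = page;
	bio->vecs[bio->vcnt].len = len;
	bio->vecs[bio->vcnt].offset = offset;
	bio->vcnt++;
	bio->size += len;
	return len;
}
```

Because all bookkeeping lives in one function, a caller cannot forget to
update the count or the size, which is the kind of mistake the open-coded
form makes possible.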



[PATCH v2 7/9] block: kill merge_bvec_fn() completely

2015-01-12 Thread Dongsu Park
From: Kent Overstreet 

As generic_make_request() is now able to handle arbitrarily sized bios,
it's no longer necessary for each individual block driver to define its
own ->merge_bvec_fn() callback. Remove every invocation completely.

Cc: Jens Axboe 
Cc: Lars Ellenberg 
Cc: drbd-u...@lists.linbit.com
Cc: Jiri Kosina 
Cc: Yehuda Sadeh 
Cc: Sage Weil 
Cc: Alex Elder 
Cc: ceph-de...@vger.kernel.org
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: dm-de...@redhat.com
Cc: Neil Brown 
Cc: linux-r...@vger.kernel.org
Cc: Christoph Hellwig 
Cc: "Martin K. Petersen" 
Signed-off-by: Kent Overstreet 
[dpark: also remove ->merge_bvec_fn() in dm-thin as well as
 dm-era-target, and resolve merge conflicts]
Signed-off-by: Dongsu Park 
---
 block/blk-merge.c  |  17 +-
 block/blk-settings.c   |  22 
 drivers/block/drbd/drbd_int.h  |   1 -
 drivers/block/drbd/drbd_main.c |   1 -
 drivers/block/drbd/drbd_req.c  |  35 
 drivers/block/pktcdvd.c|  21 ---
 drivers/block/rbd.c|  47 
 drivers/md/dm-cache-target.c   |  21 ---
 drivers/md/dm-crypt.c  |  16 --
 drivers/md/dm-era-target.c |  15 -
 drivers/md/dm-flakey.c |  16 --
 drivers/md/dm-linear.c |  16 --
 drivers/md/dm-snap.c   |  15 -
 drivers/md/dm-stripe.c |  21 ---
 drivers/md/dm-table.c  |   8 ---
 drivers/md/dm-thin.c   |  31 ---
 drivers/md/dm-verity.c |  16 --
 drivers/md/dm.c| 120 +---
 drivers/md/dm.h|   2 -
 drivers/md/linear.c|  46 
 drivers/md/md.c|   2 -
 drivers/md/md.h|   8 ---
 drivers/md/multipath.c |  21 ---
 drivers/md/raid0.c |  57 ---
 drivers/md/raid0.h |   2 -
 drivers/md/raid1.c |  59 +---
 drivers/md/raid10.c| 122 +
 drivers/md/raid5.c |  28 --
 include/linux/blkdev.h |  10 
 include/linux/device-mapper.h  |   4 --
 30 files changed, 9 insertions(+), 791 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 3bc2068..8cd7a83 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -69,24 +69,13 @@ static struct bio *blk_bio_segment_split(struct 
request_queue *q,
struct bio *split;
struct bio_vec bv = { 0 }, bvprv = { 0 };
struct bvec_iter iter;
-   unsigned seg_size = 0, nsegs = 0;
+   unsigned seg_size = 0, nsegs = 0, sectors = 0;
int prev = 0;
 
-   struct bvec_merge_data bvm = {
-   .bi_bdev= bio->bi_bdev,
-   .bi_sector  = bio->bi_iter.bi_sector,
-   .bi_size= 0,
-   .bi_rw  = bio->bi_rw,
-   };
-
bio_for_each_segment(bv, bio, iter) {
-   if (q->merge_bvec_fn &&
-   q->merge_bvec_fn(q, &bvm, &bv) < (int) bv.bv_len)
-   goto split;
-
-   bvm.bi_size += bv.bv_len;
+   sectors += bv.bv_len >> 9;
 
-   if (bvm.bi_size >> 9 > queue_max_sectors(q))
+   if (sectors > queue_max_sectors(q))
goto split;
 
if (prev && blk_queue_cluster(q)) {
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 6ed2cbe..463a10a 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -53,28 +53,6 @@ void blk_queue_unprep_rq(struct request_queue *q, 
unprep_rq_fn *ufn)
 }
 EXPORT_SYMBOL(blk_queue_unprep_rq);
 
-/**
- * blk_queue_merge_bvec - set a merge_bvec function for queue
- * @q: queue
- * @mbfn:  merge_bvec_fn
- *
- * Usually queues have static limitations on the max sectors or segments that
- * we can put in a request. Stacking drivers may have some settings that
- * are dynamic, and thus we have to query the queue whether it is ok to
- * add a new bio_vec to a bio at a given offset or not. If the block device
- * has such limitations, it needs to register a merge_bvec_fn to control
- * the size of bio's sent to it. Note that a block device *must* allow a
- * single page to be added to an empty bio. The block device driver may want
- * to use the bio_split() function to deal with these bio's. By default
- * no merge_bvec_fn is defined for a queue, and only the fixed limits are
- * honored.
- */
-void blk_queue_merge_bvec(struct request_queue *q, merge_bvec_fn *mbfn)
-{
-   q->merge_bvec_fn = mbfn;
-}
-EXPORT_SYMBOL(blk_queue_merge_bvec);
-
 void blk_queue_softirq_done(struct request_queue *q, softirq_done_fn *fn)
 {
q->softirq_done_fn = fn;
diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index b905e98..63ce2b0 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drb

[PATCH v2 9/9] Documentation: update notes in biovecs about arbitrarily sized bios

2015-01-12 Thread Dongsu Park
Update block/biovecs.txt so that it includes a note on what kind of
effects arbitrarily sized bios would bring to the block layer.
Also fix a trivial typo, bio_iter_iovec.

Cc: Christoph Hellwig 
Cc: Kent Overstreet 
Cc: Jonathan Corbet 
Cc: linux-...@vger.kernel.org
Signed-off-by: Dongsu Park 
---
 Documentation/block/biovecs.txt | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/Documentation/block/biovecs.txt b/Documentation/block/biovecs.txt
index 74a32ad..2568958 100644
--- a/Documentation/block/biovecs.txt
+++ b/Documentation/block/biovecs.txt
@@ -24,7 +24,7 @@ particular, presenting the illusion of partially completed 
biovecs so that
 normal code doesn't have to deal with bi_bvec_done.
 
  * Driver code should no longer refer to biovecs directly; we now have
-   bio_iovec() and bio_iovec_iter() macros that return literal struct biovecs,
+   bio_iovec() and bio_iter_iovec() macros that return literal struct biovecs,
constructed from the raw biovecs but taking into account bi_bvec_done and
bi_size.
 
@@ -109,3 +109,11 @@ Other implications:
over all the biovecs in the new bio - which is silly as it's not needed.
 
So, don't use bi_vcnt anymore.
+
+ * The current interface allows the block layer to split bios as needed, so we
+   could eliminate a lot of complexity particularly in stacked drivers. Code
+   that creates bios can then create whatever size bios are convenient, and
+   more importantly stacked drivers don't have to deal with both their own bio
+   size limitations and the limitations of the underlying devices. Thus
+   there's no need to define ->merge_bvec_fn() callbacks for individual block
+   drivers.
-- 
2.1.0


[PATCH v2 2/7] block: rewrite __bio_copy_iov()

2015-01-12 Thread Dongsu Park
Rewrite __bio_copy_iov() so that it calls either the _read() or the
_write() variant, depending on the direction to_iov, which is given as
either READ or WRITE. Moreover, make __bio_copy_iov() take its iov_iter
parameter by value, to avoid awkwardly taking and dereferencing a
pointer over and over.

This commit should contain only literal replacements, without
functional changes.
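
As a rough userspace illustration of the shape described above (not the
kernel code itself), the sketch below dispatches on a direction flag and
passes the iterator by value, so the caller's copy is never advanced.
All toy_* names and types are hypothetical stand-ins; in particular
struct toy_iter is not struct iov_iter, and a plain memcpy() cannot
fault, so the -EFAULT path of the real helpers is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel's READ (0) and WRITE (1). */
enum toy_dir { TOY_READ = 0, TOY_WRITE = 1 };

/* One kernel-side segment, loosely modelled on struct bio_vec. */
struct toy_seg {
	char *base;
	size_t len;
};

/* A flat user buffer plus a byte count; a toy stand-in for struct iov_iter. */
struct toy_iter {
	char *buf;
	size_t count;
};

/* "Read" variant: copy from the iterator into the segments. */
int toy_copy_read(struct toy_seg *segs, size_t nsegs, struct toy_iter iter)
{
	for (size_t i = 0; i < nsegs && iter.count; i++) {
		size_t n = segs[i].len < iter.count ? segs[i].len : iter.count;

		memcpy(segs[i].base, iter.buf, n);
		iter.buf += n;		/* advances only the local copy */
		iter.count -= n;
	}
	return 0;
}

/* "Write" variant: copy from the segments into the iterator. */
int toy_copy_write(struct toy_seg *segs, size_t nsegs, struct toy_iter iter)
{
	for (size_t i = 0; i < nsegs && iter.count; i++) {
		size_t n = segs[i].len < iter.count ? segs[i].len : iter.count;

		memcpy(iter.buf, segs[i].base, n);
		iter.buf += n;		/* advances only the local copy */
		iter.count -= n;
	}
	return 0;
}

/* Thin wrapper that picks the variant from the direction flag. */
int toy_copy(struct toy_seg *segs, size_t nsegs, struct toy_iter iter,
	     enum toy_dir to_iov)
{
	assert(to_iov == TOY_READ || to_iov == TOY_WRITE);
	if (to_iov == TOY_WRITE)
		return toy_copy_write(segs, nsegs, iter);
	return toy_copy_read(segs, nsegs, iter);
}
```

Because the iterator is copied at each call, a caller can reuse the same
struct toy_iter for a second pass without resetting it.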

Suggested-by: Christoph Hellwig 
Cc: Kent Overstreet 
Cc: Jens Axboe 
Cc: Al Viro 
Signed-off-by: Dongsu Park 
---
 block/bio.c | 113 
 1 file changed, 75 insertions(+), 38 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 8267676..7b1aed3 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1046,46 +1046,84 @@ static struct bio_map_data *bio_alloc_map_data(unsigned 
int iov_count,
   sizeof(struct sg_iovec) * iov_count, gfp_mask);
 }
 
-static int __bio_copy_iov(struct bio *bio, const struct iov_iter *iter,
- int to_user, int from_user, int do_free_page)
+/**
+ * __bio_copy_iov_read - copy all pages from iov_iter to bio
+ * @bio: The  bio which describes the I/O as destination
+ * @iter: iov_iter as source
+ *
+ * Copy all pages from iov_iter to bio.
+ * Returns 0 on success, or error on failure.
+ */
+static int __bio_copy_iov_read(struct bio *bio, struct iov_iter iter)
 {
-   int ret = 0, i;
+   int i;
struct bio_vec *bvec;
-   struct iov_iter iov_iter = *iter;
 
bio_for_each_segment_all(bvec, bio, i) {
-   char *bv_addr = page_address(bvec->bv_page);
-   unsigned int bv_len = bvec->bv_len;
-
-   while (bv_len && iov_iter.count) {
-   struct iovec iov = iov_iter_iovec(&iov_iter);
-   unsigned int bytes = min_t(unsigned int, bv_len,
-  iov.iov_len);
-
-   if (!ret) {
-   if (to_user)
-   ret = copy_to_user(iov.iov_base,
-  bv_addr, bytes);
-
-   if (from_user)
-   ret = copy_from_user(bv_addr,
-iov.iov_base,
-bytes);
-
-   if (ret)
-   ret = -EFAULT;
-   }
+   ssize_t ret;
 
-   bv_len -= bytes;
-   bv_addr += bytes;
-   iov_iter_advance(&iov_iter, bytes);
-   }
+   ret = copy_page_from_iter(bvec->bv_page,
+ bvec->bv_offset,
+ bvec->bv_len,
+ &iter);
 
-   if (do_free_page)
-   __free_page(bvec->bv_page);
+   if (!iov_iter_count(&iter))
+   break;
+
+   if (ret < bvec->bv_len)
+   return -EFAULT;
}
 
-   return ret;
+   return 0;
+}
+
+/**
+ * __bio_copy_iov_write - copy all pages from bio to iov_iter
+ * @bio: The  bio which describes the I/O as source
+ * @iter: iov_iter as destination
+ *
+ * Copy all pages from bio to iov_iter.
+ * Returns 0 on success, or error on failure.
+ */
+static int __bio_copy_iov_write(struct bio *bio, struct iov_iter iter)
+{
+   int i;
+   struct bio_vec *bvec;
+
+   bio_for_each_segment_all(bvec, bio, i) {
+   ssize_t ret;
+
+   ret = copy_page_to_iter(bvec->bv_page,
+   bvec->bv_offset,
+   bvec->bv_len,
+   &iter);
+
+   if (!iov_iter_count(&iter))
+   break;
+
+   if (ret < bvec->bv_len)
+   return -EFAULT;
+   }
+
+   return 0;
+}
+
+/**
+ * __bio_copy_iov - copy all pages between bio and iov_iter
+ * @bio: The  bio which describes the I/O
+ * @iter: iov_iter either as source or destination
+ * @to_iov: whether to %READ (0) or %WRITE (1)
+ *
+ * Simple wrapper around __bio_copy_iov_{write,read}().
+ * Returns 0 on success, or the error returned as-is on failure.
+ */
+static inline int __bio_copy_iov(struct bio *bio, struct iov_iter iter,
+int to_iov)
+{
+   if (to_iov == WRITE)
+   return __bio_copy_iov_write(bio, iter);
+   else
+   return __bio_copy_iov_read(bio, iter);
 }
 
 /**
@@ -1106,11 +1144,10 @@ int bio_uncopy_user(struct bio *bio)
 * if we're in a workqueue, the request is orphaned, so
 * don't copy into a random user address space, just free.

[RFC PATCH v2 0/9] simplify block layer based on immutable biovecs

2015-01-12 Thread Dongsu Park
This is the second attempt at simplifying the block layer based on immutable
biovecs. Immutable biovecs, implemented by Kent Overstreet, have been
available in mainline since v3.14. Their original goal was actually to make
generic_make_request() accept arbitrarily sized bios, and to push the
splitting down to the drivers or wherever it's required. See also
discussions in the past, [1] [2] [3] [4].
  
This will bring not only performance improvements, but also a large
reduction in code complexity all over the block layer. A performance gain
is possible because bio_add_page() no longer has to check
unnecessary conditions such as queue limits or whether biovecs are mergeable.
Those checks will be delegated to the driver level. Kent already said that he
benchmarked the impact of this with fio on a Micron P320h, which
showed a clearly positive impact.
  
Moreover, this patchset also allows a lot of code to be deleted, mainly
because of the removal of merge_bvec_fn() callbacks. It has always been
a delicate issue for stacking block drivers (e.g. md and bcache) to
handle merging bios consistently. This simplification will help every
individual block driver avoid such issues.
  
- Patch 01/09 allows generic_make_request() to handle arbitrarily sized bios,
  by making make_request functions call blk_queue_split().
- Patch 02/09 simplifies __bio_add_page() to avoid calling ->merge_bvec_fn().
- Patch 03/09 allows __blk_queue_bounce() to handle bios larger than
  BIO_MAX_PAGES.
- Patch 04/09 gets rid of workarounds in bcache.
- Patch 05/09 removes unnecessary biovec merging parts in btrfs
- Patch 06/09 removes unnecessary biovec merging parts in MD-RAID5.
- Patch 07/09 removes ->merge_bvec_fn() completely, which affects a lot of
  block drivers, such as Ceph RBD, DRBD, device mapper, MD, etc.
- Patch 08/09 makes filesystems use helper bio_add_page().
- Patch 09/09 updates the documentation about biovecs.
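
To make the idea of splitting an arbitrarily sized bio more concrete,
here is a small userspace sketch, under stated assumptions, of how a
split point might be found by counting 512-byte sectors per segment
against a queue limit. toy_split_point() and its parameters are
hypothetical and only mimic the sector accounting; the real splitting
code also has to consider segment counts and clustering, which this
sketch ignores.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Walk the segment lengths (in bytes) and return the index of the first
 * segment that no longer fits under max_sectors, i.e. the first segment
 * of the remainder after a split. Returns nsegs if everything fits.
 */
size_t toy_split_point(const unsigned *seg_len, size_t nsegs,
		       unsigned max_sectors)
{
	unsigned sectors = 0;

	assert(max_sectors > 0);
	for (size_t i = 0; i < nsegs; i++) {
		sectors += seg_len[i] >> 9;	/* bytes to 512-byte sectors */
		if (sectors > max_sectors)
			return i;		/* split before segment i */
	}
	return nsegs;				/* no split needed */
}
```

With three 4096-byte segments (8 sectors each) and a hypothetical limit
of 16 sectors, the first two segments stay in the front bio and the
third one starts the remainder.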

Patches are against 3.19-rc4. These are also available in my git repo at:

  https://github.com/dongsupark/linux.git block-generic-req

It's recommended to apply this patchset on top of its preparation patchset,
i.e. "preparation for block layer simplification". [5]
This patchset is in turn also a prerequisite for other follow-up patchsets,
e.g. multipage biovecs, rewriting plugging, or rewriting direct-IO, which
are yet to be done. This patchset should not bring any regression to
end-users. I have already tested it with xfstests multiple times. On the
other hand, the multipage biovecs part is currently in heavy development,
with the help of Kent and Ming Lin. Those experimental patches are also
available on other branches in my git tree. Once they are done, I'm also
going to post them for review.

Comments are welcome.
Dongsu

Changes in v2:
- split up preparation patches into a separate series
- remove a patch "block: simplify issueing discard, write_same, zeroout",
  as suggested by Christoph Hellwig.
- move a patch "btrfs: make use of immutable biovecs" to the upcoming series.
- minor change in ps3vram suggested by Geoff Levand
- make bio_add_page() warn once on a cloned bio.
- add more comments and commit messages to patch 02 "block: simplify
  bio_add_page()"

[1] https://lkml.org/lkml/2014/11/23/263
[2] https://lkml.org/lkml/2013/11/25/732
[3] https://lkml.org/lkml/2014/2/26/618
[4] https://lkml.org/lkml/2014/12/22/128
[5] https://lkml.org/lkml/2015/1/12/255

Dongsu Park (1):
  Documentation: update notes in biovecs about arbitrarily sized bios

Kent Overstreet (8):
  block: make generic_make_request handle arbitrarily sized bios
  block: simplify bio_add_page()
  block: allow __blk_queue_bounce() to handle bios larger than
BIO_MAX_PAGES
  bcache: clean up hacks around bio_split_pool
  btrfs: remove bio splitting and merge_bvec_fn() calls
  md/raid5: get rid of bio_fits_rdev()
  block: kill merge_bvec_fn() completely
  fs: use helper bio_add_page() instead of open coding on bi_io_vec

 Documentation/block/biovecs.txt |  10 +-
 block/bio.c | 135 +++
 block/blk-core.c|  19 ++--
 block/blk-merge.c   | 140 ++--
 block/blk-mq.c  |   2 +
 block/blk-settings.c|  22 -
 block/bounce.c  |  60 ++--
 drivers/block/drbd/drbd_int.h   |   1 -
 drivers/block/drbd/drbd_main.c  |   1 -
 drivers/block/drbd/drbd_req.c   |  37 +---
 drivers/block/pktcdvd.c |  27 +-
 drivers/block/ps3vram.c |   2 +
 drivers/block/rbd.c |  47 --
 drivers/block/rsxx/dev.c|   2 +
 drivers/block/umem.c|   2 +
 drivers/block/zram/zram_drv.c   |   2 +
 drivers/md/bcache/bcache.h   

[PATCH v2 1/9] block: make generic_make_request handle arbitrarily sized bios

2015-01-12 Thread Dongsu Park
From: Kent Overstreet 

The way the block layer is currently written, it goes to great lengths
to avoid having to split bios; upper layer code (such as bio_add_page())
checks what the underlying device can handle and tries to always create
bios that don't need to be split.

But this approach becomes unwieldy and eventually breaks down with
stacked devices and devices with dynamic limits, and it adds a lot of
complexity. If the block layer could split bios as needed, we could
eliminate a lot of complexity elsewhere - particularly in stacked
drivers. Code that creates bios can then create whatever size bios are
convenient, and more importantly stacked drivers don't have to deal with
both their own bio size limitations and the limitations of the
(potentially multiple) devices underneath them.  In the future this will
let us delete merge_bvec_fn and a bunch of other code.

We do this by adding calls to blk_queue_split() to the various
make_request functions that need it - a few can already handle arbitrary
size bios. Note that we add the call _after_ any call to
blk_queue_bounce(); this means that blk_queue_split() and
blk_recalc_rq_segments() don't need to be concerned with bouncing
affecting segment merging.
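
The claim about stacked drivers can be pictured with a toy userspace
model, given here only as a sketch: each level splits an incoming
request against its own limit and hands the pieces down, without ever
looking at the limits of the devices below it. struct toy_dev and
toy_submit() are hypothetical names, and a request is reduced to a bare
sector count.

```c
#include <assert.h>
#include <stddef.h>

/* A toy block device: its own size limit and an optional lower device. */
struct toy_dev {
	unsigned max_sectors;	/* this level's limit only */
	struct toy_dev *lower;	/* NULL for the bottom device */
	unsigned nr_bios;	/* pieces this level has processed */
};

/*
 * Split the request against this device's limit, then pass each piece
 * to the lower device, which applies its own limit in the same way.
 */
void toy_submit(struct toy_dev *dev, unsigned sectors)
{
	assert(dev->max_sectors > 0);
	while (sectors) {
		unsigned chunk = sectors < dev->max_sectors ?
				 sectors : dev->max_sectors;

		dev->nr_bios++;
		if (dev->lower)
			toy_submit(dev->lower, chunk);
		sectors -= chunk;
	}
}
```

The upper device here never consults the lower device's max_sectors;
whoever creates the request does not need to know either limit.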

Some make_request_fn() callbacks were simple enough to audit and verify
they don't need blk_queue_split() calls. The skipped ones are:

 * nfhd_make_request (arch/m68k/emu/nfblock.c)
 * axon_ram_make_request (arch/powerpc/sysdev/axonram.c)
 * simdisk_make_request (arch/xtensa/platforms/iss/simdisk.c)
 * brd_make_request (ramdisk - drivers/block/brd.c)
 * mtip_submit_request (drivers/block/mtip32xx/mtip32xx.c)
 * loop_make_request
 * null_queue_bio
 * bcache's make_request fns

Some others are almost certainly safe to remove now, but will be left
for future patches.

Cc: Ming Lin 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
Cc: Al Viro 
Cc: Ming Lei 
Cc: Neil Brown 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: dm-de...@redhat.com
Cc: Lars Ellenberg 
Cc: drbd-u...@lists.linbit.com
Cc: Jiri Kosina 
Cc: Geoff Levand 
Cc: Jim Paris 
Cc: Joshua Morris 
Cc: Philip Kelleher 
Cc: Minchan Kim 
Cc: Nitin Gupta 
Cc: Oleg Drokin 
Cc: Andreas Dilger 
Signed-off-by: Kent Overstreet 
[dpark: skip more mq-based drivers, resolve merge conflicts, etc.]
Signed-off-by: Dongsu Park 
---
 block/blk-core.c|  19 ++--
 block/blk-merge.c   | 151 ++--
 block/blk-mq.c  |   2 +
 drivers/block/drbd/drbd_req.c   |   2 +
 drivers/block/pktcdvd.c |   6 +-
 drivers/block/ps3vram.c |   2 +
 drivers/block/rsxx/dev.c|   2 +
 drivers/block/umem.c|   2 +
 drivers/block/zram/zram_drv.c   |   2 +
 drivers/md/dm.c |   2 +
 drivers/md/md.c |   2 +
 drivers/s390/block/dcssblk.c|   2 +
 drivers/s390/block/xpram.c  |   2 +
 drivers/staging/lustre/lustre/llite/lloop.c |   2 +
 include/linux/blkdev.h  |   3 +
 15 files changed, 179 insertions(+), 22 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 30f6153..e86ad75 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -585,6 +585,10 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, 
int node_id)
if (q->id < 0)
goto fail_q;
 
+   q->bio_split = bioset_create(4, 0);
+   if (!q->bio_split)
+   goto fail_id;
+
q->backing_dev_info.ra_pages =
(VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
q->backing_dev_info.state = 0;
@@ -594,7 +598,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, 
int node_id)
 
err = bdi_init(&q->backing_dev_info);
if (err)
-   goto fail_id;
+   goto fail_split;
 
setup_timer(&q->backing_dev_info.laptop_mode_wb_timer,
laptop_mode_timer_fn, (unsigned long) q);
@@ -636,6 +640,8 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, 
int node_id)
 
 fail_bdi:
bdi_destroy(&q->backing_dev_info);
+fail_split:
+   bioset_free(q->bio_split);
 fail_id:
ida_simple_remove(&blk_queue_ida, q->id);
 fail_q:
@@ -1552,6 +1558,8 @@ void blk_queue_bio(struct request_queue *q, struct bio 
*bio)
struct request *req;
unsigned int request_count = 0;
 
+   blk_queue_split(q, &bio, q->bio_split);
+
/*
 * low level driver can indicate that it wants pages above a
 * certain limit bounced to low memory (ie for highmem, or even
@@ -1775,15 +1783,6 @@ generic_make_request_checks(struct bio *bio)
goto end_io;
}
 
-   if (likely(bio_is_rw(bio) &&
-  nr_sectors > queue_max_hw_sectors(q))) {
-   printk(KERN_ERR "bio too big device %s (%u > 

[PATCH v2 4/7] block: refactor bio_get_user_pages() from __bio_map_user_iov()

2015-01-12 Thread Dongsu Park
From: Kent Overstreet 

Split out part of the code that was in __bio_map_user_iov() into
a new function, bio_get_user_pages(). This helper is going to be used
by future block layer rewrites, especially in the direct-IO part.

Note that this relies on the recent change to make
generic_make_request() take arbitrarily sized bios - we're not using
bio_add_page() here.
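
The page arithmetic behind such a helper can be sketched in userspace
as follows. This is only an illustration under assumptions: no pages
are pinned, TOY_PAGE_SIZE is fixed at 4096, and toy_fill_bvecs() and
struct toy_bvec are hypothetical names. It shows how a user range is
turned into per-page (offset, length) pairs, with the first entry
shortened by the in-page offset and the last one trimmed to the
requested length.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096u

/* Offset and length within one page; a toy stand-in for struct bio_vec. */
struct toy_bvec {
	unsigned offset;
	unsigned len;
};

/*
 * Describe the user range [uaddr, uaddr + len) as at most max_vecs
 * per-page entries. Returns the number of entries used and stores the
 * number of bytes they cover in *bytes_out.
 */
size_t toy_fill_bvecs(unsigned long uaddr, size_t len,
		      struct toy_bvec *bv, size_t max_vecs,
		      size_t *bytes_out)
{
	unsigned offset = (unsigned)(uaddr & (TOY_PAGE_SIZE - 1));
	size_t nr_pages = (len + offset + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE;
	size_t bytes;

	assert(bytes_out != NULL);
	if (nr_pages > max_vecs)
		nr_pages = max_vecs;	/* may cover only part of the range */
	if (!nr_pages) {
		*bytes_out = 0;
		return 0;
	}

	for (size_t i = 0; i < nr_pages; i++) {
		bv[i].offset = 0;
		bv[i].len = TOY_PAGE_SIZE;
	}

	/* The first page starts at the in-page offset. */
	bv[0].offset += offset;
	bv[0].len -= offset;

	/* Trim the last page if the range ends before the page does. */
	bytes = nr_pages * TOY_PAGE_SIZE - offset;
	if (bytes > len) {
		bv[nr_pages - 1].len -= (unsigned)(bytes - len);
		bytes = len;
	}

	*bytes_out = bytes;
	return nr_pages;
}
```

For example, an 8192-byte range starting 256 bytes into a page spans
three pages, with entry lengths of 3840, 4096 and 256 bytes.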

Cc: Christoph Hellwig 
Cc: Jens Axboe 
Signed-off-by: Kent Overstreet 
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park 
---
 block/bio.c | 130 +++-
 include/linux/bio.h |   2 +
 2 files changed, 70 insertions(+), 62 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 9ad76ed..7ff846d 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1302,19 +1302,79 @@ struct bio *bio_copy_user(struct request_queue *q, 
struct rq_map_data *map_data,
 }
 EXPORT_SYMBOL(bio_copy_user);
 
+/**
+ * bio_get_user_pages - pin user pages and add them to a biovec
+ * @bio: bio to add pages to
+ * @uaddr: start of user address
+ * @len: length in bytes
+ * @write_to_vm: bool indicating writing to pages or not
+ *
+ * Pins pages for up to @len bytes and appends them to @bio's bvec array. May
+ * pin only part of the requested pages - @bio need not have room for all the
+ * pages and can already have had pages added to it.
+ *
+ * Returns the number of bytes from @len added to @bio.
+ */
+ssize_t bio_get_user_pages(struct bio *bio, struct iov_iter *i, int 
write_to_vm)
+{
+   while (bio->bi_vcnt < bio->bi_max_vecs && iov_iter_count(i)) {
+   struct iovec iov = iov_iter_iovec(i);
+   int ret;
+   unsigned nr_pages, bytes;
+   unsigned offset = offset_in_page(iov.iov_base);
+   struct bio_vec *bv;
+   struct page **pages;
+
+   nr_pages = min_t(size_t,
+DIV_ROUND_UP(iov.iov_len + offset, PAGE_SIZE),
+bio->bi_max_vecs - bio->bi_vcnt);
+
+   bv = &bio->bi_io_vec[bio->bi_vcnt];
+   pages = (void *) bv;
+
+   ret = get_user_pages_fast((unsigned long) iov.iov_base,
+ nr_pages, write_to_vm, pages);
+   if (ret < 0) {
+   if (bio->bi_vcnt)
+   return 0;
+
+   return ret;
+   }
+
+   bio->bi_vcnt += ret;
+   bytes = ret * PAGE_SIZE - offset;
+
+   while (ret--) {
+   bv[ret].bv_page = pages[ret];
+   bv[ret].bv_len = PAGE_SIZE;
+   bv[ret].bv_offset = 0;
+   }
+
+   bv[0].bv_offset += offset;
+   bv[0].bv_len -= offset;
+
+   if (bytes > iov.iov_len) {
+   bio->bi_io_vec[bio->bi_vcnt - 1].bv_len -=
+   bytes - iov.iov_len;
+   bytes = iov.iov_len;
+   }
+
+   bio->bi_iter.bi_size += bytes;
+   iov_iter_advance(i, bytes);
+   }
+
+   return 0;
+}
+EXPORT_SYMBOL(bio_get_user_pages);
+
 static struct bio *__bio_map_user_iov(struct request_queue *q,
  struct block_device *bdev,
  const struct iov_iter *iter,
  int write_to_vm, gfp_t gfp_mask)
 {
-   int j;
+   ssize_t ret;
int nr_pages = 0;
-   struct page **pages;
struct bio *bio;
-   int cur_page = 0;
-   int ret, offset;
-   struct iov_iter i;
-   struct iovec iov;
 
nr_pages = iov_count_pages(iter, queue_dma_alignment(q));
if (nr_pages < 0)
@@ -1327,57 +1387,10 @@ static struct bio *__bio_map_user_iov(struct 
request_queue *q,
if (!bio)
return ERR_PTR(-ENOMEM);
 
-   ret = -ENOMEM;
-   pages = kcalloc(nr_pages, sizeof(struct page *), gfp_mask);
-   if (!pages)
+   ret = bio_get_user_pages(bio, (struct iov_iter *)iter, write_to_vm);
+   if (ret < 0)
goto out;
 
-   iov_for_each(iov, i, *iter) {
-   unsigned long uaddr = (unsigned long) iov.iov_base;
-   unsigned long len = iov.iov_len;
-   unsigned long end = (uaddr + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
-   unsigned long start = uaddr >> PAGE_SHIFT;
-   const int local_nr_pages = end - start;
-   const int page_limit = cur_page + local_nr_pages;
-
-   ret = get_user_pages_fast(uaddr, local_nr_pages,
-   write_to_vm, &pages[cur_page]);
-   if (ret < local_nr_pages) {
-   ret = -EFAULT;
-   goto out_unmap;
-   }
-
-   offset = uaddr & ~PAGE_MASK;
-  
