Re: [REVIEW][PATCH 0/6] Wrapping up the vfs support for unprivileged mounts
Hi,

On Thu, May 24, 2018 at 1:22 AM, Eric W. Biederman wrote:
>
> Very slowly the work has been progressing to ensure the vfs has the
> necessary support for mounting filesystems without privilege.
>
> This patchset contains one more core piece of that work, ensuring a few
> more operations that would write back an inode and confuse an existing
> filesystem are denied.
>
> The rest of the changes actually enable userns root to do things with
> filesystems that the userns root has mounted. Most of these have been
> waiting in the wings a long time, held back because I wanted the core
> of the patchset to be solid before I started allowing additional
> behavior.
>
> It is definitely time for these changes so the effect of s_user_ns
> becomes less theoretical.
>
> The change to allow mknod is new, but consistent with everything else
> and harmless as device nodes on filesystems mounted without privilege
> are ignored.
>
> Unless problems show up during the review I plan to merge these changes.

Thank you for the great work. I have been looking forward to seeing it.

I have just gathered the available relevant patches in my branch:

  https://github.com/kinvolk/linux/tree/dongsu/fuse-userns-for-4.18

With this branch, I tested sshfs/fuse from a non-init user namespace.
It works fine, as expected. So you can add:

Tested-by: Dongsu Park

Thanks!
Dongsu

> These changes are also available at:
> git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git userns-test
>
> Eric W. Biederman (5):
>   vfs: Don't allow changing the link count of an inode with an invalid uid or gid
>   vfs: Allow userns root to call mknod on owned filesystems.
>   fs: Allow superblock owner to replace invalid owners of inodes
>   fs: Allow superblock owner to access do_remount_sb()
>   capabilities: Allow privileged user in s_user_ns to set security.* xattrs
>
> Seth Forshee (1):
>   fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems
>
>  fs/attr.c            | 36
>  fs/ioctl.c           |  4 ++--
>  fs/namei.c           | 16
>  fs/namespace.c       |  4 ++--
>  security/commoncap.c |  8 ++--
>  5 files changed, 50 insertions(+), 18 deletions(-)
>
> Eric
> ___
> Containers mailing list
> contain...@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers
[RFC PATCH v5 1/2] ima: force re-appraisal on filesystems with FS_IMA_NO_CACHE
From: Alban Crequy <al...@kinvolk.io>

This patch forces files to be re-measured, re-appraised and re-audited
on file systems with the feature flag FS_IMA_NO_CACHE. In that way,
cached integrity results won't be used.

Cc: linux-kernel@vger.kernel.org
Cc: linux-integr...@vger.kernel.org
Cc: linux-security-mod...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: Alexander Viro <v...@zeniv.linux.org.uk>
Cc: Miklos Szeredi <mik...@szeredi.hu>
Cc: Mimi Zohar <zo...@linux.vnet.ibm.com>
Cc: Dmitry Kasatkin <dmitry.kasat...@gmail.com>
Cc: James Morris <jmor...@namei.org>
Cc: Christoph Hellwig <h...@infradead.org>
Acked-by: "Serge E. Hallyn" <se...@hallyn.com>
Acked-by: Seth Forshee <seth.fors...@canonical.com>
Tested-by: Dongsu Park <don...@kinvolk.io>
Signed-off-by: Alban Crequy <al...@kinvolk.io>
Signed-off-by: Dongsu Park <don...@kinvolk.io>
---
 include/linux/fs.h                |  1 +
 security/integrity/ima/ima_main.c | 15 +--
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 511fbaab..ced841ba 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2075,6 +2075,7 @@ struct file_system_type {
 #define FS_BINARY_MOUNTDATA	2
 #define FS_HAS_SUBTYPE		4
 #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
+#define FS_IMA_NO_CACHE		16	/* Force IMA to re-measure, re-appraise, re-audit files */
 #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
 	struct dentry *(*mount) (struct file_system_type *, int,
 		       const char *, void *);
diff --git a/security/integrity/ima/ima_main.c b/security/integrity/ima/ima_main.c
index 6d78cb26..83edbad8 100644
--- a/security/integrity/ima/ima_main.c
+++ b/security/integrity/ima/ima_main.c
@@ -24,6 +24,7 @@
 #include
 #include
 #include
+#include
 
 #include "ima.h"
 
@@ -228,9 +229,19 @@ static int process_measurement(struct file *file, char *buf, loff_t size,
 				 IMA_APPRAISE_SUBMASK | IMA_APPRAISED_SUBMASK |
 				 IMA_ACTION_FLAGS);
 
-	if (test_and_clear_bit(IMA_CHANGE_XATTR, &iint->atomic_flags))
-		/* reset all flags if ima_inode_setxattr was called */
+	/*
+	 * Reset the measure, appraise and audit cached flags either if:
+	 * - ima_inode_setxattr was called, or
+	 * - based on filesystem feature flag
+	 * forcing the file to be re-evaluated.
+	 */
+	if (test_and_clear_bit(IMA_CHANGE_XATTR, &iint->atomic_flags)) {
 		iint->flags &= ~IMA_DONE_MASK;
+	} else if (inode->i_sb->s_type->fs_flags & FS_IMA_NO_CACHE) {
+		iint->flags &= ~IMA_DONE_MASK;
+		if (action & IMA_MEASURE)
+			iint->measured_pcrs = 0;
+	}
 
 	/* Determine if already appraised/measured based on bitmask
 	 * (IMA_MEASURE, IMA_MEASURED, IMA__APPRAISE, IMA__APPRAISED,
-- 
2.13.6
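
For readers who want to trace the new branch outside the kernel, the reset logic in the hunk above can be modelled in plain userspace C. This is only a sketch under stated assumptions: `struct iint_model`, `reset_cached_results()` and the `xattr_changed` field are hypothetical stand-ins, and every `IMA_*` value is a placeholder rather than the kernel's definition. Only `FS_IMA_NO_CACHE` (16) is taken from the patch.

```c
#include <stdbool.h>

/* Value taken from the include/linux/fs.h hunk in the patch. */
#define FS_IMA_NO_CACHE 16

/* Placeholder values for illustration; NOT the kernel's definitions. */
#define IMA_MEASURE   0x01UL
#define IMA_MEASURED  0x02UL
#define IMA_APPRAISED 0x04UL
#define IMA_AUDITED   0x08UL
#define IMA_DONE_MASK (IMA_MEASURED | IMA_APPRAISED | IMA_AUDITED)

/* Hypothetical stand-in for the per-inode integrity cache. */
struct iint_model {
	unsigned long flags;         /* cached "already done" bits */
	unsigned long measured_pcrs; /* PCRs this file was already measured into */
	bool xattr_changed;          /* models the IMA_CHANGE_XATTR atomic bit */
};

/*
 * Mirrors the shape of the patched branch: drop cached results either
 * because the xattr changed, or because the filesystem type asks IMA
 * never to trust its cache.
 */
static void reset_cached_results(struct iint_model *iint, int fs_flags,
				 unsigned long action)
{
	if (iint->xattr_changed) {
		iint->xattr_changed = false; /* test-and-clear semantics */
		iint->flags &= ~IMA_DONE_MASK;
	} else if (fs_flags & FS_IMA_NO_CACHE) {
		iint->flags &= ~IMA_DONE_MASK;
		if (action & IMA_MEASURE)
			iint->measured_pcrs = 0;
	}
}
```

Calling `reset_cached_results()` with `fs_flags` set to `FS_IMA_NO_CACHE` clears every bit in `IMA_DONE_MASK`, so the later "already measured/appraised" checks fail and the file is evaluated again. With `fs_flags` set to 0 and no xattr change, the cached state is left untouched.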
[RFC PATCH v5 0/2] ima,fuse: introduce new fs flag FS_IMA_NO_CACHE
This patchset v5 introduces a new fs flag FS_IMA_NO_CACHE and uses it in
FUSE. This forces files to be re-measured, re-appraised and re-audited
on file systems with the feature flag FS_IMA_NO_CACHE. In that way,
cached integrity results won't be used.

There was a previous attempt (unmerged) with an IMA option named "force"
and using that option for FUSE filesystems. These patches use a
different approach so that the IMA subsystem does not need to know about
FUSE.
- https://www.spinics.net/lists/linux-integrity/msg00948.html
- https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1584131.html

Changes since v1: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1587390.html
- include linux-fsdevel mailing list in cc
- mark patch as RFC
- based on next-integrity, without other unmerged FUSE / IMA patches

Changes since v2: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1587678.html
- rename flag to FS_IMA_NO_CACHE
- split patch into 2

Changes since v3: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1592393.html
- make the code simpler by resetting IMA_DONE_MASK

Changes since v4: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1598387.html
- add Acked-by from Miklos
- change ordering of patches as suggested by Miklos
- improve commit messages
- code diff since v4 is empty: only commit messages and ordering were changed

The patchset is also available in our github repo:
  https://github.com/kinvolk/linux/tree/dongsu/fuse-flag-ima-nocache-v5

Alban Crequy (2):
  ima: force re-appraisal on filesystems with FS_IMA_NO_CACHE
  fuse: introduce new fs_type flag FS_IMA_NO_CACHE

 fs/fuse/inode.c                   |  2 +-
 include/linux/fs.h                |  1 +
 security/integrity/ima/ima_main.c | 15 +--
 3 files changed, 15 insertions(+), 3 deletions(-)

-- 
2.13.6
[RFC PATCH v5 2/2] fuse: introduce new fs_type flag FS_IMA_NO_CACHE
From: Alban Crequy <al...@kinvolk.io>

This new fs_type flag FS_IMA_NO_CACHE means files should be re-measured,
re-appraised and re-audited each time. Cached integrity results should
not be used.

It is useful in FUSE because the userspace FUSE process can change the
underlying files at any time without notifying the kernel. FUSE can be
mounted by unprivileged users either today with fusermount installed
with setuid, or soon with the upcoming patches to allow FUSE mounts in
a non-init user namespace. That makes the issue more visible than for
network filesystems, which unprivileged users cannot mount.

How to test this:

The test I did was using a patched version of the memfs FUSE driver
[1][2] and two very simple "hello-world" programs [4] (prog1 prints
"hello world: 1" and prog2 prints "hello world: 2").

I copy prog1 and prog2 into the fuse-memfs mount point, execute them and
check the sha1 hash in
"/sys/kernel/security/ima/ascii_runtime_measurements".

My patch on the memfs FUSE driver added a backdoor command to serve
prog1 when the kernel asks for prog2 or vice-versa. In this way, I can
exec prog1 and get it to print "hello world: 2" without ever replacing
the file via the VFS, so the kernel is not aware of the change.

The test was done using the branch "dongsu/fuse-flag-ima-nocache-v5" [3].

Step by step test procedure:

1. Mount the memfs FUSE using [2]:

 rm -f /tmp/memfs-switch* ; memfs -L DEBUG /mnt/memfs

2. Copy prog1 and prog2 using [4]:

 cp prog1 /mnt/memfs/prog1
 cp prog2 /mnt/memfs/prog2

3. Look up the files and let the FUSE driver keep the handles open:

 dd if=/mnt/memfs/prog1 bs=1 | (read -n 1 x ; sleep 3600 ) &
 dd if=/mnt/memfs/prog2 bs=1 | (read -n 1 x ; sleep 3600 ) &

4. Check that the 2 programs work correctly:

 $ /mnt/memfs/prog1
 hello world: 1
 $ /mnt/memfs/prog2
 hello world: 2

5. Check the measurements for prog1 and prog2:

 $ sudo cat /sys/kernel/security/ima/ascii_runtime_measurements \
 	| grep /mnt/memfs/prog
 10 [...] ima-ng sha1:ac14c9268cd2[...] /mnt/memfs/prog1
 10 [...] ima-ng sha1:799cb5d1e06d[...] /mnt/memfs/prog2

6. Use the backdoor command in my patched memfs to redirect file
   operations on file handle 3 to file handle 2:

 rm -f /tmp/memfs-switch* ; touch /tmp/memfs-switch-3-2

7. Check how the FUSE driver serves different content for the files:

 $ /mnt/memfs/prog1
 hello world: 2
 $ /mnt/memfs/prog2
 hello world: 2

8. Check the measurements:

 sudo cat /sys/kernel/security/ima/ascii_runtime_measurements \
 	| grep /mnt/memfs/prog

Without the patch, there are no new measurements, despite the FUSE
driver having served different executables.

With the patch, I can see additional measurements for prog1 and prog2
with the hashes reversed when the FUSE driver served the alternative
content.

[1] https://github.com/bbengfort/memfs
[2] https://github.com/kinvolk/memfs/commits/alban/switch-files
[3] https://github.com/kinvolk/linux/commits/dongsu/fuse-flag-ima-nocache-v5
[4] https://github.com/kinvolk/fuse-userns-patches/commit/cf1f5750cab0

Cc: linux-kernel@vger.kernel.org
Cc: linux-integr...@vger.kernel.org
Cc: linux-security-mod...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: Alexander Viro <v...@zeniv.linux.org.uk>
Cc: Mimi Zohar <zo...@linux.vnet.ibm.com>
Cc: Dmitry Kasatkin <dmitry.kasat...@gmail.com>
Cc: James Morris <jmor...@namei.org>
Cc: Christoph Hellwig <h...@infradead.org>
Acked-by: Miklos Szeredi <mik...@szeredi.hu>
Acked-by: "Serge E. Hallyn" <se...@hallyn.com>
Acked-by: Seth Forshee <seth.fors...@canonical.com>
Tested-by: Dongsu Park <don...@kinvolk.io>
Signed-off-by: Alban Crequy <al...@kinvolk.io>
---
 fs/fuse/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 624f18bb..0a9e5164 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1205,7 +1205,7 @@ static void fuse_kill_sb_anon(struct super_block *sb)
 static struct file_system_type fuse_fs_type = {
 	.owner		= THIS_MODULE,
 	.name		= "fuse",
-	.fs_flags	= FS_HAS_SUBTYPE,
+	.fs_flags	= FS_HAS_SUBTYPE | FS_IMA_NO_CACHE,
 	.mount		= fuse_mount,
 	.kill_sb	= fuse_kill_sb_anon,
 };
-- 
2.13.6
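
As a small illustration of what the one-line change to `fuse_fs_type` means for callers, here is a userspace sketch of the flag test. The two flag values come from the `include/linux/fs.h` hunk in patch 1/2. `struct fs_type_model`, `ima_must_not_cache()` and the two example instances are hypothetical names made up for this sketch, not kernel symbols.

```c
#include <stdbool.h>

/* Values as they appear in the include/linux/fs.h hunk of patch 1/2. */
#define FS_HAS_SUBTYPE  4
#define FS_IMA_NO_CACHE 16

/* Hypothetical, trimmed-down model of struct file_system_type. */
struct fs_type_model {
	const char *name;
	int fs_flags;
};

/* After the patch, a FUSE-like type carries both flags. */
static const struct fs_type_model fuse_like = {
	"fuse", FS_HAS_SUBTYPE | FS_IMA_NO_CACHE
};

/* A type that does not opt in keeps IMA's cached results. */
static const struct fs_type_model cached_fs = {
	"cached", FS_HAS_SUBTYPE
};

/* The per-type check that the IMA side performs on fs_flags. */
static bool ima_must_not_cache(const struct fs_type_model *type)
{
	return (type->fs_flags & FS_IMA_NO_CACHE) != 0;
}
```

Because the flag lives on the filesystem type rather than on a mount or a superblock, every FUSE mount opts in at once, and IMA only has to test a bit instead of knowing anything about FUSE.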
Re: [RFC PATCH v3 2/2] ima: force re-appraisal on filesystems with FS_IMA_NO_CACHE
Hi,

On Mon, Jan 29, 2018 at 6:40 PM, Dongsu Park <don...@kinvolk.io> wrote:
> On Mon, Jan 29, 2018 at 5:33 PM, Mimi Zohar <zo...@linux.vnet.ibm.com> wrote:
>> On Thu, 2018-01-25 at 06:56 -0500, Mimi Zohar wrote:
...
>> Did you get a chance to make the change and test it?
>
> Alban has been on holidays, so he will be back on Wednesday or so.
> So I'll try to understand what you meant in the last email.
>
> As IMA_DONE_MASK contains all other bitmasks, it's possible to
> optimize the code like this:
>
>	if (test_and_clear_bit(IMA_CHANGE_XATTR, &iint->atomic_flags)) {
>		iint->flags &= ~IMA_DONE_MASK;
>	} else if (inode->i_sb->s_type->fs_flags & FS_IMA_NO_CACHE) {
>		iint->flags &= ~IMA_DONE_MASK;
>		if (action & IMA_MEASURE)
>			iint->measured_pcrs = 0;
>	}
>
> Is that what you want to see? Please let me know if it's not.
> Tomorrow I will try to test with a new patch.

Today I created a new patch, and tested it. It worked fine.
So I've just sent a new patchset v4. Please see:

https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1598387.html

Thanks,
Dongsu

> Thanks,
> Dongsu
>
>> Mimi
>>
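
The remark that IMA_DONE_MASK contains the other bitmasks is what makes the single `&= ~IMA_DONE_MASK` sufficient. A minimal sketch of that equivalence follows. It assumes the earlier version cleared the cached-result bits one group at a time, which this thread does not show. The bit values and both helper functions are placeholders invented for illustration, not the kernel's definitions.

```c
/* Placeholder bit values: assumptions for illustration only. */
#define IMA_MEASURED  0x01UL
#define IMA_APPRAISED 0x02UL
#define IMA_AUDITED   0x04UL

/* The composite mask is the union of the per-result bits. */
#define IMA_DONE_MASK (IMA_MEASURED | IMA_APPRAISED | IMA_AUDITED)

/* Clearing each cached-result bit separately (hypothetical older style). */
static unsigned long clear_individually(unsigned long flags)
{
	flags &= ~IMA_MEASURED;
	flags &= ~IMA_APPRAISED;
	flags &= ~IMA_AUDITED;
	return flags;
}

/* Clearing them all at once through the composite mask. */
static unsigned long clear_with_done_mask(unsigned long flags)
{
	return flags & ~IMA_DONE_MASK;
}
```

Both helpers return the same value for any input, since clearing a union of bits is the same as clearing each member in turn. Bits outside the mask, such as policy action flags, survive either way.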
[RFC PATCH v4 0/2] ima,fuse: introduce new fs flag FS_IMA_NO_CACHE
This patchset v4 introduces a new fs flag FS_IMA_NO_CACHE and uses it in
FUSE. This forces files to be re-measured, re-appraised and re-audited
on file systems with the feature flag FS_IMA_NO_CACHE. In that way,
cached integrity results won't be used.

There was a previous attempt (unmerged) with an IMA option named "force"
and using that option for FUSE filesystems. These patches use a
different approach so that the IMA subsystem does not need to know about
FUSE.
- https://www.spinics.net/lists/linux-integrity/msg00948.html
- https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1584131.html

Changes since v1: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1587390.html
- include linux-fsdevel mailing list in cc
- mark patch as RFC
- based on next-integrity, without other unmerged FUSE / IMA patches

Changes since v2: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1587678.html
- rename flag to FS_IMA_NO_CACHE
- split patch into 2

Changes since v3: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1592393.html
- make the code simpler by resetting IMA_DONE_MASK

The patchset is also available in our github repo:
  https://github.com/kinvolk/linux/tree/dongsu/fuse-flag-ima-nocache-v4

Alban Crequy (2):
  fuse: introduce new fs_type flag FS_IMA_NO_CACHE
  ima: force re-appraisal on filesystems with FS_IMA_NO_CACHE

 fs/fuse/inode.c                   |  2 +-
 include/linux/fs.h                |  1 +
 security/integrity/ima/ima_main.c | 15 +--
 3 files changed, 15 insertions(+), 3 deletions(-)

-- 
2.13.6
[RFC PATCH v4 2/2] ima: force re-appraisal on filesystems with FS_IMA_NO_CACHE
From: Alban Crequy <al...@kinvolk.io>

This patch forces files to be re-measured, re-appraised and re-audited
on file systems with the feature flag FS_IMA_NO_CACHE. In that way,
cached integrity results won't be used.

How to test this:

The test I did was using a patched version of the memfs FUSE driver
[1][2] and two very simple "hello-world" programs [4] (prog1 prints
"hello world: 1" and prog2 prints "hello world: 2").

I copy prog1 and prog2 into the fuse-memfs mount point, execute them and
check the sha1 hash in
"/sys/kernel/security/ima/ascii_runtime_measurements".

My patch on the memfs FUSE driver added a backdoor command to serve
prog1 when the kernel asks for prog2 or vice-versa. In this way, I can
exec prog1 and get it to print "hello world: 2" without ever replacing
the file via the VFS, so the kernel is not aware of the change.

The test was done using the branch "dongsu/fuse-flag-ima-nocache-v4" [3].

Step by step test procedure:

1. Mount the memfs FUSE using [2]:

 rm -f /tmp/memfs-switch* ; memfs -L DEBUG /mnt/memfs

2. Copy prog1 and prog2 using [4]:

 cp prog1 /mnt/memfs/prog1
 cp prog2 /mnt/memfs/prog2

3. Look up the files and let the FUSE driver keep the handles open:

 dd if=/mnt/memfs/prog1 bs=1 | (read -n 1 x ; sleep 3600 ) &
 dd if=/mnt/memfs/prog2 bs=1 | (read -n 1 x ; sleep 3600 ) &

4. Check that the 2 programs work correctly:

 $ /mnt/memfs/prog1
 hello world: 1
 $ /mnt/memfs/prog2
 hello world: 2

5. Check the measurements for prog1 and prog2:

 $ sudo cat /sys/kernel/security/ima/ascii_runtime_measurements \
 	| grep /mnt/memfs/prog
 10 [...] ima-ng sha1:ac14c9268cd2[...] /mnt/memfs/prog1
 10 [...] ima-ng sha1:799cb5d1e06d[...] /mnt/memfs/prog2

6. Use the backdoor command in my patched memfs to redirect file
   operations on file handle 3 to file handle 2:

 rm -f /tmp/memfs-switch* ; touch /tmp/memfs-switch-3-2

7. Check how the FUSE driver serves different content for the files:

 $ /mnt/memfs/prog1
 hello world: 2
 $ /mnt/memfs/prog2
 hello world: 2

8. Check the measurements:

 sudo cat /sys/kernel/security/ima/ascii_runtime_measurements \
 	| grep /mnt/memfs/prog

Without the patch, there are no new measurements, despite the FUSE
driver having served different executables.

With the patch, I can see additional measurements for prog1 and prog2
with the hashes reversed when the FUSE driver served the alternative
content.

[1] https://github.com/bbengfort/memfs
[2] https://github.com/kinvolk/memfs/commits/alban/switch-files
[3] https://github.com/kinvolk/linux/commits/dongsu/fuse-flag-ima-nocache-v4
[4] https://github.com/kinvolk/fuse-userns-patches/commit/cf1f5750cab0

Cc: linux-kernel@vger.kernel.org
Cc: linux-integr...@vger.kernel.org
Cc: linux-security-mod...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: Miklos Szeredi <mik...@szeredi.hu>
Cc: Alexander Viro <v...@zeniv.linux.org.uk>
Cc: Mimi Zohar <zo...@linux.vnet.ibm.com>
Cc: Dmitry Kasatkin <dmitry.kasat...@gmail.com>
Cc: James Morris <jmor...@namei.org>
Cc: Christoph Hellwig <h...@infradead.org>
Acked-by: "Serge E. Hallyn" <se...@hallyn.com>
Acked-by: Seth Forshee <seth.fors...@canonical.com>
Tested-by: Dongsu Park <don...@kinvolk.io>
Signed-off-by: Alban Crequy <al...@kinvolk.io>
[dongsu: optimized code to address review comments by Mimi]
Signed-off-by: Dongsu Park <don...@kinvolk.io>
---
 security/integrity/ima/ima_main.c | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/security/integrity/ima/ima_main.c b/security/integrity/ima/ima_main.c
index 6d78cb26..83edbad8 100644
--- a/security/integrity/ima/ima_main.c
+++ b/security/integrity/ima/ima_main.c
@@ -24,6 +24,7 @@
 #include
 #include
 #include
+#include
 
 #include "ima.h"
 
@@ -228,9 +229,19 @@ static int process_measurement(struct file *file, char *buf, loff_t size,
 				 IMA_APPRAISE_SUBMASK | IMA_APPRAISED_SUBMASK |
 				 IMA_ACTION_FLAGS);
 
-	if (test_and_clear_bit(IMA_CHANGE_XATTR, &iint->atomic_flags))
-		/* reset all flags if ima_inode_setxattr was called */
+	/*
+	 * Reset the measure, appraise and audit cached flags either if:
+	 * - ima_inode_setxattr was called, or
+	 * - based on filesystem feature flag
+	 * forcing the file to be re-evaluated.
+	 */
+	if (test_and_clear_bit(IMA_CHANGE_XATTR, &iint->atomic_flags)) {
 		iint->flags &= ~IMA_DONE_MASK;
+	} else if (inode->i_sb->s_type->fs_flags & FS_IMA_NO_CACHE) {
+		iint->flags &= ~IMA_DONE_MASK;
+		if (action & IMA_MEASURE)
+			iint->measured_pcrs = 0;
+	}
 
 	/* Determine if already appraised/measured based on bitmask
 	 * (IMA_MEASURE, IMA_MEASURED, IMA__APPRAISE, IMA__APPRAISED,
-- 
2.13.6
[RFC PATCH v4 2/2] ima: force re-appraisal on filesystems with FS_IMA_NO_CACHE
From: Alban Crequy

This patch forces files to be re-measured, re-appraised and re-audited
on file systems with the feature flag FS_IMA_NO_CACHE. In that way,
cached integrity results won't be used.

How to test this:

The test I did was using a patched version of the memfs FUSE driver
[1][2] and two very simple "hello-world" programs [4] (prog1 prints
"hello world: 1" and prog2 prints "hello world: 2").

I copy prog1 and prog2 into the fuse-memfs mount point, execute them and
check the sha1 hash in
"/sys/kernel/security/ima/ascii_runtime_measurements".

My patch on the memfs FUSE driver added a backdoor command to serve
prog1 when the kernel asks for prog2 or vice-versa. In this way, I can
exec prog1 and get it to print "hello world: 2" without ever replacing
the file via the VFS, so the kernel is not aware of the change.

The test was done using the branch "dongsu/fuse-flag-ima-nocache-v4" [3].

Step by step test procedure:

1. Mount the memfs FUSE using [2]:

rm -f /tmp/memfs-switch* ; memfs -L DEBUG /mnt/memfs

2. Copy prog1 and prog2 using [4]:

cp prog1 /mnt/memfs/prog1
cp prog2 /mnt/memfs/prog2

3. Look up the files and let the FUSE driver keep the handles open:

dd if=/mnt/memfs/prog1 bs=1 | (read -n 1 x ; sleep 3600 ) &
dd if=/mnt/memfs/prog2 bs=1 | (read -n 1 x ; sleep 3600 ) &

4. Check that the 2 programs work correctly:

$ /mnt/memfs/prog1
hello world: 1
$ /mnt/memfs/prog2
hello world: 2

5. Check the measurements for prog1 and prog2:

$ sudo cat /sys/kernel/security/ima/ascii_runtime_measurements \
    | grep /mnt/memfs/prog
10 [...] ima-ng sha1:ac14c9268cd2[...] /mnt/memfs/prog1
10 [...] ima-ng sha1:799cb5d1e06d[...] /mnt/memfs/prog2

6. Use the backdoor command in my patched memfs to redirect file
operations on file handle 3 to file handle 2:

rm -f /tmp/memfs-switch* ; touch /tmp/memfs-switch-3-2

7. Check how the FUSE driver serves different content for the files:

$ /mnt/memfs/prog1
hello world: 2
$ /mnt/memfs/prog2
hello world: 2

8. Check the measurements:

sudo cat /sys/kernel/security/ima/ascii_runtime_measurements \
    | grep /mnt/memfs/prog

Without the patch, there are no new measurements, despite the FUSE
driver having served different executables.

With the patch, I can see additional measurements for prog1 and prog2
with the hashes reversed when the FUSE driver served the alternative
content.

[1] https://github.com/bbengfort/memfs
[2] https://github.com/kinvolk/memfs/commits/alban/switch-files
[3] https://github.com/kinvolk/linux/commits/dongsu/fuse-flag-ima-nocache-v4
[4] https://github.com/kinvolk/fuse-userns-patches/commit/cf1f5750cab0

Cc: linux-kernel@vger.kernel.org
Cc: linux-integr...@vger.kernel.org
Cc: linux-security-mod...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: Miklos Szeredi
Cc: Alexander Viro
Cc: Mimi Zohar
Cc: Dmitry Kasatkin
Cc: James Morris
Cc: Christoph Hellwig
Acked-by: "Serge E. Hallyn"
Acked-by: Seth Forshee
Tested-by: Dongsu Park
Signed-off-by: Alban Crequy
[dongsu: optimized code to address review comments by Mimi]
Signed-off-by: Dongsu Park
---
 security/integrity/ima/ima_main.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/security/integrity/ima/ima_main.c b/security/integrity/ima/ima_main.c
index 6d78cb26..83edbad8 100644
--- a/security/integrity/ima/ima_main.c
+++ b/security/integrity/ima/ima_main.c
@@ -24,6 +24,7 @@
 #include
 #include
 #include
+#include

 #include "ima.h"

@@ -228,9 +229,19 @@ static int process_measurement(struct file *file, char *buf, loff_t size,
 			 IMA_APPRAISE_SUBMASK | IMA_APPRAISED_SUBMASK |
 			 IMA_ACTION_FLAGS);

-	if (test_and_clear_bit(IMA_CHANGE_XATTR, &iint->atomic_flags))
-		/* reset all flags if ima_inode_setxattr was called */
+	/*
+	 * Reset the measure, appraise and audit cached flags either if:
+	 * - ima_inode_setxattr was called, or
+	 * - based on filesystem feature flag
+	 * forcing the file to be re-evaluated.
+	 */
+	if (test_and_clear_bit(IMA_CHANGE_XATTR, &iint->atomic_flags)) {
 		iint->flags &= ~IMA_DONE_MASK;
+	} else if (inode->i_sb->s_type->fs_flags & FS_IMA_NO_CACHE) {
+		iint->flags &= ~IMA_DONE_MASK;
+		if (action & IMA_MEASURE)
+			iint->measured_pcrs = 0;
+	}

 	/* Determine if already appraised/measured based on bitmask
 	 * (IMA_MEASURE, IMA_MEASURED, IMA__APPRAISE, IMA__APPRAISED,
--
2.13.6
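Steps 5 and 8 of the test procedure compare the measurement list before and after the switch. A helper along the following lines could count how many entries the list holds for a given path, so a re-measurement shows up as an increased count. This is only a sketch: the name count_measurements is made up for illustration, it assumes the ima-ng line layout shown in step 5 (PCR, template hash, template name, algo:digest, path), and it does not handle paths containing spaces.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the patch: count the entries for
 * `path` in a buffer holding the text of ascii_runtime_measurements.
 * Assumes each line looks like:
 *   <pcr> <template-hash> ima-ng <algo>:<digest> <path>
 */
static int count_measurements(const char *text, const char *path)
{
	int count = 0;

	assert(text != NULL && path != NULL);

	while (*text != '\0') {
		char line[512];
		char file[256];
		size_t len = strcspn(text, "\n");	/* length of this line */

		if (len < sizeof(line)) {
			memcpy(line, text, len);
			line[len] = '\0';
			/* Skip pcr, template hash, template name and digest. */
			if (sscanf(line, "%*d %*s %*s %*s %255s", file) == 1 &&
			    strcmp(file, path) == 0)
				count++;
		}
		text += len;
		if (*text == '\n')
			text++;
	}
	return count;
}
```

With the patch applied, one would expect the count for /mnt/memfs/prog1 to grow between step 5 and step 8; without it, the count should stay the same.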
[RFC PATCH v4 1/2] fuse: introduce new fs_type flag FS_IMA_NO_CACHE
From: Alban Crequy <al...@kinvolk.io>

This new fs_type flag FS_IMA_NO_CACHE means files should be re-measured,
re-appraised and re-audited each time. Cached integrity results should
not be used.

It is useful in FUSE because the userspace FUSE process can change the
underlying files at any time without notifying the kernel.

Cc: linux-kernel@vger.kernel.org
Cc: linux-integr...@vger.kernel.org
Cc: linux-security-mod...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: Miklos Szeredi <mik...@szeredi.hu>
Cc: Alexander Viro <v...@zeniv.linux.org.uk>
Cc: Mimi Zohar <zo...@linux.vnet.ibm.com>
Cc: Dmitry Kasatkin <dmitry.kasat...@gmail.com>
Cc: James Morris <jmor...@namei.org>
Cc: Christoph Hellwig <h...@infradead.org>
Acked-by: "Serge E. Hallyn" <se...@hallyn.com>
Acked-by: Seth Forshee <seth.fors...@canonical.com>
Tested-by: Dongsu Park <don...@kinvolk.io>
Signed-off-by: Alban Crequy <al...@kinvolk.io>
---
 fs/fuse/inode.c    | 2 +-
 include/linux/fs.h | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 624f18bb..0a9e5164 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1205,7 +1205,7 @@ static void fuse_kill_sb_anon(struct super_block *sb)
 static struct file_system_type fuse_fs_type = {
 	.owner		= THIS_MODULE,
 	.name		= "fuse",
-	.fs_flags	= FS_HAS_SUBTYPE,
+	.fs_flags	= FS_HAS_SUBTYPE | FS_IMA_NO_CACHE,
 	.mount		= fuse_mount,
 	.kill_sb	= fuse_kill_sb_anon,
 };
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 511fbaab..ced841ba 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2075,6 +2075,7 @@ struct file_system_type {
 #define FS_BINARY_MOUNTDATA	2
 #define FS_HAS_SUBTYPE		4
 #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
+#define FS_IMA_NO_CACHE		16	/* Force IMA to re-measure, re-appraise, re-audit files */
 #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
 	struct dentry *(*mount) (struct file_system_type *, int,
 		       const char *, void *);
--
2.13.6
Re: [RFC PATCH v3 2/2] ima: force re-appraisal on filesystems with FS_IMA_NO_CACHE
Hi Mimi,

On Mon, Jan 29, 2018 at 5:33 PM, Mimi Zohar wrote:
> Hi Alban,
>
> On Thu, 2018-01-25 at 06:56 -0500, Mimi Zohar wrote:
>> > > @@ -228,9 +229,28 @@ static int process_measurement(struct file *file, char *buf, loff_t size,
>> > >  			 IMA_APPRAISE_SUBMASK | IMA_APPRAISED_SUBMASK |
>> > >  			 IMA_ACTION_FLAGS);
>> > >
>> > > -	if (test_and_clear_bit(IMA_CHANGE_XATTR, &iint->atomic_flags))
>> > > -		/* reset all flags if ima_inode_setxattr was called */
>> > > +	/*
>> > > +	 * Reset the measure, appraise and audit cached flags either if:
>> > > +	 * - ima_inode_setxattr was called, or
>> > > +	 * - based on filesystem feature flag
>> > > +	 * forcing the file to be re-evaluated.
>> > > +	 */
>> > > +	if (test_and_clear_bit(IMA_CHANGE_XATTR, &iint->atomic_flags)) {
>> > >  		iint->flags &= ~IMA_DONE_MASK;
>> > > +	} else if (inode->i_sb->s_type->fs_flags & FS_IMA_NO_CACHE) {
>> > > +		if (action & IMA_MEASURE) {
>> > > +			iint->measured_pcrs = 0;
>> > > +			iint->flags &=
>> > > +			    ~(IMA_COLLECTED | IMA_MEASURE | IMA_MEASURED);
>> > > +		}
>> > > +		if (action & IMA_APPRAISE)
>> > > +			iint->flags &=
>> > > +			    ~(IMA_COLLECTED | IMA_APPRAISE | IMA_APPRAISED |
>> > > +			      IMA_APPRAISE_SUBMASK | IMA_APPRAISED_SUBMASK);
>> > > +		if (action & IMA_AUDIT)
>> > > +			iint->flags &=
>> > > +			    ~(IMA_COLLECTED | IMA_AUDIT | IMA_AUDITED);
>> > > +	}
>> > >
>>
>> Alban, I don't know what I was thinking, but this can be simplified
>> like for the IMA_CHANGE_XATTR case. Except in the IMA_CHANGE_XATTR
>> case, "measured_pcrs" was already reset, whereas in this case
>> "measured_pcrs" needs to be reset.
>
> Did you get a chance to make the change and test it?

Alban has been on holidays, so he will be back on Wednesday or so.
So I'll try to understand what you meant in the last email.

As IMA_DONE_MASK contains all other bitmasks, it's possible to optimize
the code like this:

	if (test_and_clear_bit(IMA_CHANGE_XATTR, &iint->atomic_flags)) {
		iint->flags &= ~IMA_DONE_MASK;
	} else if (inode->i_sb->s_type->fs_flags & FS_IMA_NO_CACHE) {
		iint->flags &= ~IMA_DONE_MASK;
		if (action & IMA_MEASURE)
			iint->measured_pcrs = 0;
	}

Is that what you want to see? Please let me know if it's not.
Tomorrow I will try to test with a new patch.

Thanks,
Dongsu

> Mimi
>
Re: [PATCH 0/2] turn on force option for FUSE in builtin policies
Hi Mimi,

On Tue, Jan 16, 2018 at 12:23 PM, Mimi Zohar <zo...@linux.vnet.ibm.com> wrote:
> On Tue, 2018-01-16 at 12:09 +0100, Dongsu Park wrote:
>> Since yesterday Alban and I have been working on a different approach
>> that does not depend on IMA rules, nor fsmagic. Please see:
>> https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1587390.html
>>
>> If that's ok, I'm ready to discard this patchset.
>
> You dropped a number of people involved in this discussion and mailing
> lists. Please post the proposed patch inline as an RFC, cc'ing the
> same people, those involved in the discussion, and previous mailing
> lists, including LSM, integrity, and fsdevel.

Sorry about that.
Starting from the patchset v2, we will add Cc correctly.
Thank you also for the detailed review.

Dongsu

> thanks,
>
> Mimi
>
Re: [PATCH 0/2] turn on force option for FUSE in builtin policies
Hi,

On Thu, Jan 11, 2018 at 8:51 PM, Dongsu Park <don...@kinvolk.io> wrote:
> In case of FUSE filesystem, cached integrity results in IMA could be
> reused, when the userspace FUSE process has changed the
> underlying files. To be able to avoid such cases, we need to turn on
> the force option in builtin policies, for actions of measure and
> appraise. Then integrity values become re-measured and re-appraised.
> In that way, cached integrity results won't be used.

Since yesterday Alban and I have been working on a different approach
that does not depend on IMA rules, nor fsmagic. Please see:
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1587390.html

If that's ok, I'm ready to discard this patchset.

Thanks,
Dongsu

> This patchset depends on the patch "ima: define a new policy option
> named force" by Mimi. [1] For details on testing the force option,
> please refer to the testing report by Alban. [2]
>
> The first patch is for simply moving FUSE_*SUPER_MAGIC macros to
> include/uapi/linux, to be able to use those in other subsystems like
> security/integrity/ima.
>
> The second patch is actually to turn on the force option for FUSE fs
> in IMA.
>
> [1] https://www.spinics.net/lists/linux-integrity/msg00948.html
> [2] https://marc.info/?l=linux-integrity&m=151559360514676&w=2
>
>
> Dongsu Park (2):
>   fs/fuse: move SUPER_MAGIC definitions to linux/magic.h
>   ima: turn on force option for FUSE in builtin policies
>
>  fs/fuse/control.c                   | 3 +--
>  fs/fuse/inode.c                     | 3 +--
>  include/uapi/linux/magic.h          | 3 +++
>  security/integrity/ima/ima_policy.c | 2 ++
>  4 files changed, 7 insertions(+), 4 deletions(-)
>
> --
> 2.13.6
>
Re: [PATCH 2/2] ima: turn on force option for FUSE in builtin policies
Hi,

On Sun, Jan 14, 2018 at 8:09 PM, kbuild test robot <l...@intel.com> wrote:
> [auto build test ERROR on linus/master]
> [also build test ERROR on v4.15-rc7 next-20180112]
> [if your patch is applied to the wrong git tree, please drop us a note to
> help improve the system]

As already mentioned in the commit message, this patch depends on
patches that are not yet in the mainline, or even in next-integrity.
So please exclude it from kbuild.

Thanks,
Dongsu

> url:
> https://github.com/0day-ci/linux/commits/Dongsu-Park/turn-on-force-option-for-FUSE-in-builtin-policies/20180115-015830
> config: xtensa-allmodconfig (attached as .config)
> compiler: xtensa-linux-gcc (GCC) 7.2.0
> reproduce:
>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # save the attached .config to linux build tree
>         make.cross ARCH=xtensa
>
> All errors (new ones prefixed by >>):
>
>>> security/integrity/ima/ima_policy.c:130:74: error: 'IMA_FORCE' undeclared here (not in a function); did you mean 'IMA_FUNC'?
>      {.action = MEASURE, .fsmagic = FUSE_SUPER_MAGIC, .flags = IMA_FSMAGIC | IMA_FORCE},
>                                                                              ^
>                                                                              IMA_FUNC
>    security/integrity/ima/ima_policy.c:158:73: error: invalid operands to binary | (have 'int' and 'struct ima_rule_entry *')
>      {.action = APPRAISE, .fsmagic = FUSE_SUPER_MAGIC, .flags = IMA_FSMAGIC | IMA_FORCE},
>                                                                             ^
>    security/integrity/ima/ima_policy.c:29:21: warning: initialization makes integer from pointer without a cast [-Wint-conversion]
>     #define IMA_FSMAGIC 0x0004
>                         ^
>    security/integrity/ima/ima_policy.c:158:61: note: in expansion of macro 'IMA_FSMAGIC'
>      {.action = APPRAISE, .fsmagic = FUSE_SUPER_MAGIC, .flags = IMA_FSMAGIC | IMA_FORCE},
>                                                                 ^~~
>    security/integrity/ima/ima_policy.c:29:21: note: (near initialization for 'default_appraise_rules[14].flags')
>     #define IMA_FSMAGIC 0x0004
>                         ^
>    security/integrity/ima/ima_policy.c:158:61: note: in expansion of macro 'IMA_FSMAGIC'
>      {.action = APPRAISE, .fsmagic = FUSE_SUPER_MAGIC, .flags = IMA_FSMAGIC | IMA_FORCE},
>                                                                 ^~~
>    security/integrity/ima/ima_policy.c:29:21: error: initializer element is not constant
>     #define IMA_FSMAGIC 0x0004
>                         ^
>    security/integrity/ima/ima_policy.c:158:61: note: in expansion of macro 'IMA_FSMAGIC'
>      {.action = APPRAISE, .fsmagic = FUSE_SUPER_MAGIC, .flags = IMA_FSMAGIC | IMA_FORCE},
>                                                                 ^~~
>    security/integrity/ima/ima_policy.c:29:21: note: (near initialization for 'default_appraise_rules[14].flags')
>     #define IMA_FSMAGIC 0x0004
>                         ^
>    security/integrity/ima/ima_policy.c:158:61: note: in expansion of macro 'IMA_FSMAGIC'
>      {.action = APPRAISE, .fsmagic = FUSE_SUPER_MAGIC, .flags = IMA_FSMAGIC | IMA_FORCE},
>                                                                 ^~~
>
> vim +130 security/integrity/ima/ima_policy.c
>
>    115
>    116	static struct ima_rule_entry default_measurement_rules[] __ro_after_init = {
>    117		{.action = MEASURE, .func = MMAP_CHECK, .mask = MAY_EXEC,
>    118		 .flags = IMA_FUNC | IMA_MASK},
>    119		{.action = MEASURE, .func = BPRM_CHECK, .mask = MAY_EXEC,
>    120		 .flags = IMA_FUNC | IMA_MASK},
>    121		{.action = MEASURE, .func = FILE_CHECK, .mask = MAY_READ,
>    122		 .uid = GLOBAL_ROOT_UID, .uid_op = &uid_eq,
>    123		 .flags = IMA_FUNC | IMA_INMASK | IMA_EUID},
>    124		{.action = MEASURE, .func = FILE_CHECK, .mask = MAY_READ,
>    125		 .uid = GLOBAL_ROOT_UID, .uid_op = &uid_eq,
>    126		 .flags = IMA_FUNC | IMA_INMASK | IMA_UID},
>    127		{.action = MEASURE, .func = MODULE_CHECK, .flags = IMA_FUNC},
>    128		{.action = MEASURE, .func = FIRMWARE_CHECK, .flags = IMA_FUNC},
>    129		{.action = MEASURE, .func = POLICY_CHECK, .flags = IMA_FUNC},
>  > 130		{.action = MEASURE, .fsmagic = FUSE_SUPER_MAGIC, .flags = IMA_FSMAGIC | IMA_FORCE},
>    131	};
>    132
>
> ---
> 0-DAY kernel test infrastructure                Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
[PATCH 1/2] fs/fuse: move SUPER_MAGIC definitions to linux/magic.h
To be able to use FUSE_*SUPER_MAGIC macros in other subsystems like
security/integrity/ima, we need to move the definitions from fs/fuse to
include/uapi/linux/.

The FUSE_*SUPER_MAGIC macros are made available to userspace in the same
way as other filesystems.

Cc: linux-fsde...@vger.kernel.org
Cc: linux-integr...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Alban Crequy <al...@kinvolk.io>
Cc: Miklos Szeredi <mik...@szeredi.hu>
Cc: Mimi Zohar <zo...@linux.vnet.ibm.com>
Cc: Seth Forshee <seth.fors...@canonical.com>
Signed-off-by: Dongsu Park <don...@kinvolk.io>
---
 fs/fuse/control.c          | 3 +--
 fs/fuse/inode.c            | 3 +--
 include/uapi/linux/magic.h | 3 +++
 3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/fuse/control.c b/fs/fuse/control.c
index b9ea99c5..9015c15c 100644
--- a/fs/fuse/control.c
+++ b/fs/fuse/control.c
@@ -10,8 +10,7 @@
 #include
 #include
-
-#define FUSE_CTL_SUPER_MAGIC 0x65735543
+#include <linux/magic.h>

 /*
  * This is non-NULL when the single instance of the control filesystem
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 8c98edee..57371b77 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -22,6 +22,7 @@
 #include
 #include
 #include
+#include <linux/magic.h>

 MODULE_AUTHOR("Miklos Szeredi <mik...@szeredi.hu>");
 MODULE_DESCRIPTION("Filesystem in Userspace");
@@ -49,8 +50,6 @@ MODULE_PARM_DESC(max_user_congthresh,
  "Global limit for the maximum congestion threshold an "
  "unprivileged user can set");

-#define FUSE_SUPER_MAGIC 0x65735546
-
 #define FUSE_DEFAULT_BLKSIZE 512

 /** Maximum number of outstanding background requests */
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 1a6fee97..1534e99c 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -90,4 +90,7 @@
 #define BALLOON_KVM_MAGIC	0x13661366
 #define ZSMALLOC_MAGIC		0x58295829

+#define FUSE_CTL_SUPER_MAGIC	0x65735543
+#define FUSE_SUPER_MAGIC	0x65735546
+
 #endif /* __LINUX_MAGIC_H__ */
--
2.13.6
[PATCH 2/2] ima: turn on force option for FUSE in builtin policies
In the case of a FUSE filesystem, cached integrity results in IMA could
be reused even when the userspace FUSE process has changed the
underlying files. To avoid such cases, we need to turn on the force
option in builtin policies, for the measure and appraise actions. Then
integrity values are re-measured and re-appraised. In that way, cached
integrity results won't be used.

This patch depends on the patch "ima: define a new policy option named
force" by Mimi. [1]

How to test the force option, written by Alban:

The test I did was using a patched version of the memfs FUSE driver
[2][3] and two very simple "hello-world" programs [5] (prog1 prints
"hello world: 1" and prog2 prints "hello world: 2").

I copy prog1 and prog2 into the fuse-memfs mount point, execute them and
check the sha1 hash in
"/sys/kernel/security/ima/ascii_runtime_measurements".

My patch on the memfs FUSE driver added a backdoor command to serve
prog1 when the kernel asks for prog2 or vice-versa. In this way, I can
exec prog1 and get it to print "hello world: 2" without ever replacing
the file via the VFS, so the kernel is not aware of the change.

The test was done using the branch "dongsu/fuse-userns-v5-2" [4],
including both this new force option and Sascha's patch ("ima: Use
i_version only when filesystem supports it").

Step by step test procedure:

1. Mount the memfs FUSE using [3]:

rm -f /tmp/memfs-switch* ; memfs -L DEBUG /mnt/memfs

2. Copy prog1 and prog2 using [5]:

cp prog1 /mnt/memfs/prog1
cp prog2 /mnt/memfs/prog2

3. Look up the files and let the FUSE driver keep the handles open:

dd if=/mnt/memfs/prog1 bs=1 | (read -n 1 x ; sleep 3600 ) &
dd if=/mnt/memfs/prog2 bs=1 | (read -n 1 x ; sleep 3600 ) &

4. Check that the 2 programs work correctly:

$ /mnt/memfs/prog1
hello world: 1
$ /mnt/memfs/prog2
hello world: 2

5. Check the measurements for prog1 and prog2:

$ sudo cat /sys/kernel/security/ima/ascii_runtime_measurements|grep /mnt/memfs/prog
10 7ac5aed52061cb09120e977c6d04ee5c7b11c371 ima-ng sha1:ac14c9268cd2811f7a5adea17b27d84f50e1122c /mnt/memfs/prog1
10 9acc17a9a32aec4a676b8f6558e17a3d6c9a78e6 ima-ng sha1:799cb5d1e06d5c37ae7a76ba25ecd1bd01476383 /mnt/memfs/prog2

6. Use the backdoor command in my patched memfs to redirect file
operations on file handle 3 to file handle 2:

rm -f /tmp/memfs-switch* ; touch /tmp/memfs-switch-3-2

7. Check how the FUSE driver serves different content for the files:

$ /mnt/memfs/prog1
hello world: 2
$ /mnt/memfs/prog2
hello world: 2

8. Check the measurements:

sudo cat /sys/kernel/security/ima/ascii_runtime_measurements|grep /mnt/memfs/prog

Without the patches, on a vanilla kernel, there are no new measurements,
despite the FUSE driver having served different executables.

However, with the "force" option enabled, I can see additional
measurements for prog1 and prog2 with the hashes reversed when the FUSE
driver served the alternative content.

[1] https://www.spinics.net/lists/linux-integrity/msg00948.html
[2] https://github.com/bbengfort/memfs
[3] https://github.com/kinvolk/memfs/commits/alban/switch-files
[4] https://github.com/kinvolk/linux/commits/dongsu/fuse-userns-v5-2
[5] https://github.com/kinvolk/fuse-userns-patches/commit/cf1f5750cab0

Cc: linux-integr...@vger.kernel.org
Cc: linux-security-mod...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Miklos Szeredi <mik...@szeredi.hu>
Cc: Mimi Zohar <zo...@linux.vnet.ibm.com>
Cc: Seth Forshee <seth.fors...@canonical.com>
Tested-by: Alban Crequy <al...@kinvolk.io>
Signed-off-by: Dongsu Park <don...@kinvolk.io>
---
 security/integrity/ima/ima_policy.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/security/integrity/ima/ima_policy.c b/security/integrity/ima/ima_policy.c
index fddef8f8..8de40d85 100644
--- a/security/integrity/ima/ima_policy.c
+++ b/security/integrity/ima/ima_policy.c
@@ -127,6 +127,7 @@ static struct ima_rule_entry default_measurement_rules[] __ro_after_init = {
 	{.action = MEASURE, .func = MODULE_CHECK, .flags = IMA_FUNC},
 	{.action = MEASURE, .func = FIRMWARE_CHECK, .flags = IMA_FUNC},
 	{.action = MEASURE, .func = POLICY_CHECK, .flags = IMA_FUNC},
+	{.action = MEASURE, .fsmagic = FUSE_SUPER_MAGIC, .flags = IMA_FSMAGIC | IMA_FORCE},
 };

 static struct ima_rule_entry default_appraise_rules[] __ro_after_init = {
@@ -154,6 +155,7 @@ static struct ima_rule_entry default_appraise_rules[] __ro_after_init = {
 	{.action = APPRAISE, .fowner = GLOBAL_ROOT_UID, .fowner_op = &uid_eq,
 	 .flags = IMA_FOWNER | IMA_DIGSIG_REQUIRED},
 #endif
+	{.action = APPRAISE, .fsmagic = FUSE_SUPER_MAGIC, .flags = IMA_FSMAGIC | IMA_FORCE},
 };

 static struct ima_rule_entry secure_boot_rules[] __ro_after_init = {
--
2.13.6
[PATCH 0/2] turn on force option for FUSE in builtin policies
In the case of a FUSE filesystem, cached integrity results in IMA could
be reused even after the userspace FUSE process has changed the
underlying files. To avoid such cases, we need to turn on the force
option in the builtin policies, for both the measure and appraise
actions. Integrity values are then re-measured and re-appraised, so
cached integrity results won't be used.

This patchset depends on the patch "ima: define a new policy option named
force" by Mimi. [1]

For details on testing the force option, please refer to the testing
report by Alban. [2]

The first patch simply moves the FUSE_*SUPER_MAGIC macros to
include/uapi/linux, so that they can be used in other subsystems such as
security/integrity/ima. The second patch actually turns on the force
option for FUSE in IMA.

[1] https://www.spinics.net/lists/linux-integrity/msg00948.html
[2] https://marc.info/?l=linux-integrity&m=151559360514676&w=2

Dongsu Park (2):
  fs/fuse: move SUPER_MAGIC definitions to linux/magic.h
  ima: turn on force option for FUSE in builtin policies

 fs/fuse/control.c                   | 3 +--
 fs/fuse/inode.c                     | 3 +--
 include/uapi/linux/magic.h          | 3 +++
 security/integrity/ima/ima_policy.c | 2 ++
 4 files changed, 7 insertions(+), 4 deletions(-)

-- 
2.13.6
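For readers wondering why the magic numbers belong in a uapi header: the same value that IMA compares against the superblock's magic is what userspace sees in the f_type field returned by statfs(2). The following is a minimal userspace sketch of that comparison, not part of the patchset. The helper names fs_type_is_fuse() and path_is_on_fuse() are made up for illustration, and the constants are defined locally in case the installed linux/magic.h predates the move.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/vfs.h>	/* statfs(2) */

/* Values taken from the kernel's FUSE sources. They are defined here
 * only as a fallback for headers that do not yet export them. */
#ifndef FUSE_SUPER_MAGIC
#define FUSE_SUPER_MAGIC 0x65735546
#endif
#ifndef FUSE_CTL_SUPER_MAGIC
#define FUSE_CTL_SUPER_MAGIC 0x65735543
#endif

/* Hypothetical helper: does this f_type value denote a FUSE mount? */
static bool fs_type_is_fuse(long f_type)
{
	return f_type == FUSE_SUPER_MAGIC;
}

/* Hypothetical helper: 1 if path lives on FUSE, 0 if not,
 * -1 if statfs() itself fails (for example, path does not exist). */
static int path_is_on_fuse(const char *path)
{
	struct statfs sfs;

	if (statfs(path, &sfs) != 0)
		return -1;
	return fs_type_is_fuse((long)sfs.f_type) ? 1 : 0;
}
```

Calling path_is_on_fuse("/mnt/memfs") on the test setup described by Alban would be expected to return 1, assuming memfs is mounted there.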
Re: [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes
Hi,

On Fri, Jan 5, 2018 at 8:24 PM, Luis R. Rodriguez <mcg...@kernel.org> wrote:
> On Fri, Dec 22, 2017 at 03:32:27PM +0100, Dongsu Park wrote:
>> diff --git a/fs/attr.c b/fs/attr.c
>> index 12ffdb6f..bf8e94f3 100644
>> --- a/fs/attr.c
>> +++ b/fs/attr.c
>> @@ -18,6 +18,30 @@
>>  #include
>>  #include
>>
>> +static bool chown_ok(const struct inode *inode, kuid_t uid)
>> +{
>> +	if (uid_eq(current_fsuid(), inode->i_uid) &&
>> +	    uid_eq(uid, inode->i_uid))
>> +		return true;
>> +	if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
>> +		return true;
>> +	if (ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
>> +		return true;
>> +	return false;
>> +}
>> +
>> +static bool chgrp_ok(const struct inode *inode, kgid_t gid)
>> +{
>> +	if (uid_eq(current_fsuid(), inode->i_uid) &&
>> +	    (in_group_p(gid) || gid_eq(gid, inode->i_gid)))
>> +		return true;
>> +	if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
>> +		return true;
>> +	if (ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
>> +		return true;
>> +	return false;
>> +}
>> +
>>  /**
>>   * setattr_prepare - check if attribute changes to a dentry are allowed
>>   * @dentry:	dentry to check
>> @@ -52,17 +76,11 @@ int setattr_prepare(struct dentry *dentry, struct iattr *attr)
>>  		goto kill_priv;
>>
>>  	/* Make sure a caller can chown. */
>> -	if ((ia_valid & ATTR_UID) &&
>> -	    (!uid_eq(current_fsuid(), inode->i_uid) ||
>> -	     !uid_eq(attr->ia_uid, inode->i_uid)) &&
>> -	    !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
>> +	if ((ia_valid & ATTR_UID) && !chown_ok(inode, attr->ia_uid))
>>  		return -EPERM;
>
> I think this patch would read much better and easier to review if it was
> split up by first adding the helpers, and then extending them afterwards.

I'm fine with splitting it up into multiple patches, if the original
author Eric agrees.

>>  	/* Make sure caller can chgrp. */
>> -	if ((ia_valid & ATTR_GID) &&
>> -	    (!uid_eq(current_fsuid(), inode->i_uid) ||
>> -	    (!in_group_p(attr->ia_gid) && !gid_eq(attr->ia_gid, inode->i_gid))) &&
>> -	    !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
>> +	if ((ia_valid & ATTR_GID) && !chgrp_ok(inode, attr->ia_gid))
>>  		return -EPERM;
>>
>>  	/* Make sure a caller can chmod. */
>> diff --git a/fs/proc/base.c b/fs/proc/base.c
>> index 31934cb9..9d50ec92 100644
>> --- a/fs/proc/base.c
>> +++ b/fs/proc/base.c
>> @@ -665,10 +665,17 @@ int proc_setattr(struct dentry *dentry, struct iattr *attr)
>>  {
>>  	int error;
>>  	struct inode *inode = d_inode(dentry);
>> +	struct user_namespace *s_user_ns;
>>
>>  	if (attr->ia_valid & ATTR_MODE)
>>  		return -EPERM;
>>
>> +	/* Don't let anyone mess with weird proc files */
>> +	s_user_ns = inode->i_sb->s_user_ns;
>> +	if (!kuid_has_mapping(s_user_ns, inode->i_uid) ||
>> +	    !kgid_has_mapping(s_user_ns, inode->i_gid))
>> +		return -EPERM;
>> +
>>  	error = setattr_prepare(dentry, attr);
>>  	if (error)
>>  		return error;
>
> Are we sure proc is the only special one? How was it observed first that
> this was required for proc? Has anyone tried fuzzing by trying this op
> with a slew of other filesystems on all files?

From my limited knowledge about procfs, I suppose that procfs is a
little different from ordinary filesystems. Procfs is not exactly
namespaced, and it has many inconsistencies. Some files under /proc
should be owned by the global root, regardless of user namespaces.
That's why we need to handle such special cases for proc. As it has
been like that since the beginning, it's hard to change it
fundamentally.

However, you have good points. Other than procfs, there could be other
filesystems that have potential issues when relaxing privileges.
The question is how we can be sure that there are no hidden issues.

From my understanding, we could usually run test suites like LTP
(https://github.com/linux-test-project/ltp.git) to catch such
regressions. Today I have run the LTP tests for fs & containers, with
the patchset included. It seemed to work fine without failures.
Obviously that doesn't mean it's completely bug-free, when we are
talking about unknown issues. Please let me know if there are other
good ways to figure out potential issues.

Thanks,
Dongsu

> Luis
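To make the proc_setattr() check discussed above more concrete: an id "has a mapping" in a user namespace when it falls inside one of the ranges written to that namespace's uid_map (or gid_map). Below is a minimal userspace model of that lookup, written as a sketch under that assumption. It is not the kernel implementation, and the names id_extent, model_from_kid() and model_id_has_mapping() are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One line of /proc/<pid>/uid_map: ids [first, first + count) inside the
 * namespace correspond to [lower_first, lower_first + count) outside. */
struct id_extent {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

#define MODEL_INVALID_ID ((uint32_t)-1)

/* Translate an id as seen by the kernel into the namespace's view,
 * or return MODEL_INVALID_ID if no extent covers it. */
static uint32_t model_from_kid(const struct id_extent *map, size_t n,
			       uint32_t kid)
{
	for (size_t i = 0; i < n; i++) {
		if (kid >= map[i].lower_first &&
		    kid - map[i].lower_first < map[i].count)
			return map[i].first + (kid - map[i].lower_first);
	}
	return MODEL_INVALID_ID;
}

/* Model of the "has mapping" test used in the quoted proc_setattr() hunk. */
static bool model_id_has_mapping(const struct id_extent *map, size_t n,
				 uint32_t kid)
{
	return model_from_kid(map, n, kid) != MODEL_INVALID_ID;
}
```

With a single-extent map such as "0 1000 1" (what "unshare -U -r" sets up for a user with uid 1000), uid 1000 is mapped while the global root uid 0 is not, which is why files owned by the global root are rejected by the new check.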
Re: [PATCH v5 00/11] FUSE mounts from non-init user namespaces
Hi,

On Mon, Dec 25, 2017 at 8:05 AM, Eric W. Biederman wrote:
> Dongsu Park writes:
>
>> This patchset v5 is based on work by Seth Forshee and Eric Biederman.
>> The latest patchset was v4:
>> https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1132206.html
>>
>> At the moment, filesystems backed by physical medium can only be mounted
>> by real root in the initial user namespace. This restriction exists
>> because if it's allowed for root user in non-init user namespaces to
>> mount the filesystem, then it effectively allows the user to control the
>> underlying source of the filesystem. In case of FUSE, the source would
>> mean any underlying device.
>>
>> However, in many use cases such as containers, it's necessary to allow
>> filesystems to be mounted from non-init user namespaces. Goal of this
>> patchset is to allow FUSE filesystems to be mounted from non-init user
>> namespaces. Support for other filesystems like ext4 are not in the
>> scope of this patchset.
>>
>> Let me describe how to test mounting from non-init user namespaces. It's
>> assumed that tests are done via sshfs, a userspace filesystem based on
>> FUSE with ssh as backend. Testing system is Fedora 27.
>
> In general I am for this work, and more bodies and more eyes on it is
> generally better.
>
> I will review this after the New Year, I am out for the holidays right
> now.

Thanks. I'll wait for your review.

Dongsu

> Eric
>
>
>>
>>
>> $ sudo dnf install -y sshfs
>> $ sudo mkdir -p /mnt/userns
>>
>> ### workaround to get the sshfs permission checks
>> $ sudo chown -R $UID:$UID /etc/ssh/ssh_config.d /usr/share/crypto-policies
>>
>> $ unshare -U -r -m
>> # sshfs root@localhost: /mnt/userns
>>
>> ### You can see sshfs being mounted from a non-init user namespace
>> # mount | grep sshfs
>> root@localhost: on /mnt/userns type fuse.sshfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0)
>>
>> # touch /mnt/userns/test
>> # ls -l /mnt/userns/test
>> -rw-r--r-- 1 root root 0 Dec 11 19:01 /mnt/userns/test
>>
>>
>> Open another terminal, check the mountpoint from outside the namespace.
>>
>>
>> $ grep userns /proc/$(pidof sshfs)/mountinfo
>> 131 102 0:35 / /mnt/userns rw,nosuid,nodev,relatime - fuse.sshfs root@localhost: rw,user_id=0,group_id=0
>>
>>
>> After all tests are done, you can unmount the filesystem
>> inside the namespace.
>>
>>
>> # fusermount -u /mnt/userns
>>
>>
>> Changes since v4:
>>  * Remove other parts like ext4 to keep the patchset minimal for FUSE
>>  * Add and change commit messages
>>  * Describe how to test non-init user namespaces
>>
>> TODO:
>>  * Think through potential security implications. There are 2 patches
>>    being prepared for security issues. One is "ima: define a new policy
>>    option named force" by Mimi Zohar, which adds an option to specify
>>    that the results should not be cached:
>>    https://marc.info/?l=linux-integrity&m=151275680115856&w=2
>>    The other one is to basically prevent FUSE results from being cached,
>>    which is still in progress.
>>
>>  * Test IMA/LSMs. Details are written in
>>    https://github.com/kinvolk/fuse-userns-patches/blob/master/tests/TESTING_INTEGRITY.md
>>
>> Patches 1-2 deal with an additional flag of lookup_bdev() to check for
>> additional inode permission.
>>
>> Patches 3-7 allow the superblock owner to change ownership of inodes, and
>> deal with additional capability checks w.r.t user namespaces.
>>
>> Patches 8-10 allow FUSE filesystems to be mounted outside of the init
>> user namespace.
>>
>> Patch 11 handles a corner case of non-root users in EVM.
>>
>> The patchset is also available in our github repo:
>>   https://github.com/kinvolk/linux/tree/dongsu/fuse-userns-v5-1
>>
>>
>> Eric W. Biederman (1):
>>   fs: Allow superblock owner to change ownership of inodes
>>
>> Seth Forshee (10):
>>   block_dev: Support checking inode permissions in lookup_bdev()
>>   mtd: Check permissions towards mtd block device inode when mounting
>>   fs: Don't remove suid for CAP_FSETID for userns root
>>   fs: Allow superblock owner to access do_remount_sb()
>>   capabilities: Allow privileged user in s_user_ns to set security.* xattrs
>>   fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems
>>   fuse: Support fuse filesystems outside of init_user_ns
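The mountinfo check in the quoted test procedure can also be done programmatically. The following is a minimal sketch, assuming the documented /proc/<pid>/mountinfo layout in which the filesystem type is the first field after the " - " separator. The function name mountinfo_fstype() is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy the filesystem type from one mountinfo line into out.
 * Returns 0 on success, -1 if the separator is missing or out is
 * too small to hold the type plus a terminating NUL. */
static int mountinfo_fstype(const char *line, char *out, size_t outlen)
{
	const char *sep = strstr(line, " - ");
	size_t len;

	if (!sep || outlen == 0)
		return -1;
	sep += 3;			/* skip past " - " */
	len = strcspn(sep, " \n");	/* type ends at next space or newline */
	if (len == 0 || len >= outlen)
		return -1;
	memcpy(out, sep, len);
	out[len] = '\0';
	return 0;
}
```

Fed the mountinfo line shown in the quoted cover letter, this yields "fuse.sshfs", which confirms that the mount created inside the user namespace is a FUSE mount.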
Re: [PATCH 04/11] fs: Don't remove suid for CAP_FSETID for userns root
Hi,

On Sat, Dec 23, 2017 at 4:26 AM, Serge E. Hallyn <se...@hallyn.com> wrote:
> On Fri, Dec 22, 2017 at 03:32:28PM +0100, Dongsu Park wrote:
>> From: Seth Forshee <seth.fors...@canonical.com>
>>
>> Expand the check in should_remove_suid() to keep privileges for
>
> I realize this description came from Seth, but reading it now,
> 'Expand' seems wrong. Expanding a check brings to my mind making
> it stricter, not looser. How about 'Relax the check' ?

Makes sense. Will do.

>> CAP_FSETID in s_user_ns rather than init_user_ns.
>>
>> Patch v4 is available: https://patchwork.kernel.org/patch/8944621/
>>
>> --EWB Changed from ns_capable(sb->s_user_ns, ) to capable_wrt_inode_uidgid
>
> Why exactly?
>
> This is wrong, because capable_wrt_inode_uidgid() does a check
> against current_user_ns, not the inode->i_sb->s_user_ns

Ah. I see.
I suppose it was changed probably for the privileged_wrt_inode_uidgid()
called by capable_wrt_inode_uidgid().
But as you pointed out, that checks against current_user_ns, which is
wrong.

I would just create another wrapper like
capable_userns_wrt_inode_uidgid(), which takes an additional parameter
of (struct user_namespace *), to be able to check for both ns_capable()
and privileged_wrt_inode_uidgid().

Thanks,
Dongsu

>> Cc: linux-fsde...@vger.kernel.org
>> Cc: linux-kernel@vger.kernel.org
>> Cc: Alexander Viro <v...@zeniv.linux.org.uk>
>> Cc: Serge Hallyn <se...@hallyn.com>
>> Signed-off-by: Seth Forshee <seth.fors...@canonical.com>
>> Signed-off-by: Dongsu Park <don...@kinvolk.io>
>> ---
>>  fs/inode.c | 6 ++++--
>>  1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/inode.c b/fs/inode.c
>> index fd401028..6459a437 100644
>> --- a/fs/inode.c
>> +++ b/fs/inode.c
>> @@ -1749,7 +1749,8 @@ EXPORT_SYMBOL(touch_atime);
>>   */
>>  int should_remove_suid(struct dentry *dentry)
>>  {
>> -	umode_t mode = d_inode(dentry)->i_mode;
>> +	struct inode *inode = d_inode(dentry);
>> +	umode_t mode = inode->i_mode;
>>  	int kill = 0;
>>
>>  	/* suid always must be killed */
>> @@ -1763,7 +1764,8 @@ int should_remove_suid(struct dentry *dentry)
>>  	if (unlikely((mode & S_ISGID) && (mode & S_IXGRP)))
>>  		kill |= ATTR_KILL_SGID;
>>
>> -	if (unlikely(kill && !capable(CAP_FSETID) && S_ISREG(mode)))
>> +	if (unlikely(kill && !capable_wrt_inode_uidgid(inode, CAP_FSETID) &&
>> +		     S_ISREG(mode)))
>>  		return kill;
>>
>>  	return 0;
>> --
>> 2.13.6
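To show which mode bits and which capability outcome lead to which result in should_remove_suid(), here is a userspace model of the decision in the quoted hunk. It is a sketch only: the capability check is reduced to a boolean parameter (so it says nothing about which user namespace the check should be made against, which is the point under review), and the MODEL_KILL_* values are local stand-ins rather than the kernel's ATTR_KILL_* constants.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>

/* Local stand-ins for ATTR_KILL_SUID / ATTR_KILL_SGID; only their
 * distinctness matters for this model. */
#define MODEL_KILL_SUID 0x1
#define MODEL_KILL_SGID 0x2

/* Returns the set of bits that would have to be cleared on write,
 * or 0 if nothing needs to be cleared. */
static int model_should_remove_suid(mode_t mode, bool has_fsetid)
{
	int kill = 0;

	/* suid always must be killed */
	if (mode & S_ISUID)
		kill = MODEL_KILL_SUID;

	/* sgid only counts when group-execute is set; without it the
	 * bit historically meant mandatory locking, not setgid */
	if ((mode & S_ISGID) && (mode & S_IXGRP))
		kill |= MODEL_KILL_SGID;

	/* a caller holding CAP_FSETID keeps the bits; non-regular
	 * files are never affected */
	if (kill && !has_fsetid && S_ISREG(mode))
		return kill;

	return 0;
}
```

For example, a regular file with mode 04755 written by a caller without the capability yields MODEL_KILL_SUID, while the same file written by a caller for whom the capability check succeeds yields 0. The review comment above is about making that capability check refer to s_user_ns rather than current_user_ns.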
Re: [PATCH 02/11] mtd: Check permissions towards mtd block device inode when mounting
Hi,

On Fri, Dec 22, 2017 at 10:06 PM, Richard Weinberger <richard.weinber...@gmail.com> wrote:
> Dongsu,
>
> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <don...@kinvolk.io> wrote:
>> From: Seth Forshee <seth.fors...@canonical.com>
>>
>> Unprivileged users should not be able to mount mtd block devices
>> when they lack sufficient privileges towards the block device
>> inode. Update mount_mtd() to validate that the user has the
>> required access to the inode at the specified path. The check
>> will be skipped for CAP_SYS_ADMIN, so privileged mounts will
>> continue working as before.
>
> What is the big picture of this?
> Can in future an unprivileged user just mount UBIFS?

I'm not sure I'm aware of all use cases w.r.t mtd & ubifs.

To my understanding, in these days many container runtimes allow
unprivileged users to run containers. (docker, lxc, runc, bubblewrap, etc)
That's why the kernel should deal with additional permission checks
that might have not been necessary in the past.
This MTD patch is one of those special cases.

> Please note that UBIFS sits on top of a character device and not a block
> device.

Aha, good to know.

Thanks,
Dongsu

> --
> Thanks,
> //richard
Re: [PATCH 01/11] block_dev: Support checking inode permissions in lookup_bdev()
Hi,

On Fri, Dec 22, 2017 at 7:59 PM, Coly Li <i...@coly.li> wrote:
> On 22/12/2017 10:32 PM, Dongsu Park wrote:
> Hi Dongsu,
>
> Could you please use a macro like NO_PERMISSION_CHECK to replace hard
> coded 0 ? At least for me, I don't need to check what does 0 mean in the
> new lookup_bdev().

I see. I'll do that.

Thanks,
Dongsu

> Thanks.
>
> Coly Li
>
>> ---
>>  drivers/md/bcache/super.c | 2 +-
>>  drivers/md/dm-table.c     | 2 +-
>>  drivers/mtd/mtdsuper.c    | 2 +-
>>  fs/block_dev.c            | 13 ++---
>>  fs/quota/quota.c          | 2 +-
>>  include/linux/fs.h        | 2 +-
>>  6 files changed, 15 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
>> index b4d28928..acc9d56c 100644
>> --- a/drivers/md/bcache/super.c
>> +++ b/drivers/md/bcache/super.c
>> @@ -1967,7 +1967,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
>>  					  sb);
>>  		if (IS_ERR(bdev)) {
>>  			if (bdev == ERR_PTR(-EBUSY)) {
>> -				bdev = lookup_bdev(strim(path));
>> +				bdev = lookup_bdev(strim(path), 0);
>>  				mutex_lock(&bch_register_lock);
>>  				if (!IS_ERR(bdev) && bch_is_open(bdev))
>>  					err = "device already registered";
>> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
>> index 88130b5d..bca5eaf4 100644
> [snip]
>
>
> --
> Coly Li
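As an illustration of the suggestion above, here is a userspace model of the mask semantics given in the patch description (a mask of 0 skips the check, a non-zero mask lists the access rights required, and CAP_SYS_ADMIN always skips the check). It is a sketch, not the patch itself: NO_PERMISSION_CHECK is the name proposed in review and does not exist in the posted patch, the MODEL_MAY_* bits are local to this example, and model_bdev_access() is a made-up helper.

```c
#include <assert.h>
#include <stdbool.h>

/* Access bits local to this sketch, playing the role of the
 * kernel's MAY_* permission masks. */
#define MODEL_MAY_EXEC  0x1
#define MODEL_MAY_WRITE 0x2
#define MODEL_MAY_READ  0x4

/* Name proposed in review; hypothetical until a patch defines it. */
#define NO_PERMISSION_CHECK 0

/* Returns 0 if the lookup may proceed, -1 (standing in for -EACCES)
 * if the caller lacks some of the rights requested in mask. */
static int model_bdev_access(int granted, int mask, bool has_sys_admin)
{
	if (mask == NO_PERMISSION_CHECK || has_sys_admin)
		return 0;
	return (granted & mask) == mask ? 0 : -1;
}
```

A call site then reads model_bdev_access(granted, NO_PERMISSION_CHECK, false) instead of passing a bare 0, which is the readability gain the review comment asks for.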
[PATCH 01/11] block_dev: Support checking inode permissions in lookup_bdev()
From: Seth Forshee <seth.fors...@canonical.com>

When looking up a block device by path no permission check is
done to verify that the user has access to the block device inode
at the specified path. In some cases it may be necessary to
check permissions towards the inode, such as allowing
unprivileged users to mount block devices in user namespaces.

Add an argument to lookup_bdev() to optionally perform this
permission check. A value of 0 skips the permission check and
behaves the same as before. A non-zero value specifies the mask
of access rights required towards the inode at the specified
path. The check is always skipped if the user has CAP_SYS_ADMIN.

All callers of lookup_bdev() currently pass a mask of 0, so this
patch results in no functional change. Subsequent patches will
add permission checks where appropriate.

Patch v4 is available: https://patchwork.kernel.org/patch/8943601/

Cc: dm-de...@redhat.com
Cc: linux-bca...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux-...@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: Alexander Viro <v...@zeniv.linux.org.uk>
Cc: Jan Kara <j...@suse.com>
Cc: Serge Hallyn <se...@hallyn.com>
Signed-off-by: Seth Forshee <seth.fors...@canonical.com>
Signed-off-by: Dongsu Park <don...@kinvolk.io>
---
 drivers/md/bcache/super.c |  2 +-
 drivers/md/dm-table.c     |  2 +-
 drivers/mtd/mtdsuper.c    |  2 +-
 fs/block_dev.c            | 13 ++++++++++---
 fs/quota/quota.c          |  2 +-
 include/linux/fs.h        |  2 +-
 6 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index b4d28928..acc9d56c 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1967,7 +1967,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 				  sb);
 	if (IS_ERR(bdev)) {
 		if (bdev == ERR_PTR(-EBUSY)) {
-			bdev = lookup_bdev(strim(path));
+			bdev = lookup_bdev(strim(path), 0);
 			mutex_lock(&bch_register_lock);
 			if (!IS_ERR(bdev) && bch_is_open(bdev))
 				err = "device already registered";
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 88130b5d..bca5eaf4 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -410,7 +410,7 @@ dev_t dm_get_dev_t(const char *path)
 	dev_t dev;
 	struct block_device *bdev;
 
-	bdev = lookup_bdev(path);
+	bdev = lookup_bdev(path, 0);
 	if (IS_ERR(bdev))
 		dev = name_to_dev_t(path);
 	else {
diff --git a/drivers/mtd/mtdsuper.c b/drivers/mtd/mtdsuper.c
index e43fea89..4a4d40c0 100644
--- a/drivers/mtd/mtdsuper.c
+++ b/drivers/mtd/mtdsuper.c
@@ -180,7 +180,7 @@ struct dentry *mount_mtd(struct file_system_type *fs_type, int flags,
 	/* try the old way - the hack where we allowed users to mount
 	 * /dev/mtdblock$(n) but didn't actually _use_ the blockdev
 	 */
-	bdev = lookup_bdev(dev_name);
+	bdev = lookup_bdev(dev_name, 0);
 	if (IS_ERR(bdev)) {
 		ret = PTR_ERR(bdev);
 		pr_debug("MTDSB: lookup_bdev() returned %d\n", ret);
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 4a181fcb..5ca06095 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1662,7 +1662,7 @@ struct block_device *blkdev_get_by_path(const char *path, fmode_t mode,
 	struct block_device *bdev;
 	int err;
 
-	bdev = lookup_bdev(path);
+	bdev = lookup_bdev(path, 0);
 	if (IS_ERR(bdev))
 		return bdev;
 
@@ -2052,12 +2052,14 @@ EXPORT_SYMBOL(ioctl_by_bdev);
 /**
  * lookup_bdev - lookup a struct block_device by name
  * @pathname:	special file representing the block device
+ * @mask:	rights to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
  *
  * Get a reference to the blockdevice at @pathname in the current
  * namespace if possible and return it. Return ERR_PTR(error)
- * otherwise.
+ * otherwise. If @mask is non-zero, check for access rights to the
+ * inode at @pathname.
  */
-struct block_device *lookup_bdev(const char *pathname)
+struct block_device *lookup_bdev(const char *pathname, int mask)
 {
 	struct block_device *bdev;
 	struct inode *inode;
@@ -2072,6 +2074,11 @@ struct block_device *lookup_bdev(const char *pathname)
 		return ERR_PTR(error);
 
 	inode = d_backing_inode(path.dentry);
+	if (mask != 0 && !capable(CAP_SYS_ADMIN)) {
+		error = __inode_permission(inode, mask);
+		if (error)
+			goto fail;
+	}
 	error = -ENOTBLK;
 	if (!S_ISBLK(inode->i_mode))
 		goto fail;
diff --git a/fs/quota/quota.c b/fs/quota/quota.c
index 43612e2a..e5d47955 100644
--- a/fs/quota/quota.c
+++ b/fs/quota/quota.c
@@ -807,7 +807,7 @@ static struct super_block *quotactl_block(const char __user *special, int cmd)
 	if (IS_ERR(tmp))
 		return ERR_CAST(tmp);
-	bdev = lookup_bdev(tmp
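A minimal user-space sketch of the gating described in the commit message above, not the kernel code: a mask of 0 skips the check, CAP_SYS_ADMIN skips it, and otherwise every requested right must be held. struct model_inode and model_lookup_check() are hypothetical names used only for illustration. NO_PERMISSION_CHECK is the macro name suggested in review and is not defined by the posted patch. The real check goes through __inode_permission(), which this model reduces to a plain bitmask comparison.

```c
#include <stdbool.h>
#include <errno.h>

/* Same bit values as the kernel's MAY_* flags in include/linux/fs.h. */
#define MAY_EXEC  0x1
#define MAY_WRITE 0x2
#define MAY_READ  0x4

/* Name proposed in review for "skip the check"; hypothetical here. */
#define NO_PERMISSION_CHECK 0

/* Hypothetical stand-in for an inode: only the rights the caller holds. */
struct model_inode {
	int granted;	/* bitmask of MAY_* rights */
};

/*
 * Models only the new check added to lookup_bdev().
 * Returns 0 when the lookup may proceed, -EACCES otherwise.
 */
static int model_lookup_check(const struct model_inode *inode, int mask,
			      bool has_cap_sys_admin)
{
	/* A zero mask or CAP_SYS_ADMIN bypasses the permission check. */
	if (mask == NO_PERMISSION_CHECK || has_cap_sys_admin)
		return 0;
	/* Every requested right must be granted. */
	if ((inode->granted & mask) != mask)
		return -EACCES;
	return 0;
}
```

With an inode that grants only MAY_READ, a request for MAY_READ | MAY_WRITE is refused unless the caller is modelled as holding CAP_SYS_ADMIN.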
[PATCH 02/11] mtd: Check permissions towards mtd block device inode when mounting
From: Seth Forshee <seth.fors...@canonical.com>

Unprivileged users should not be able to mount mtd block devices
when they lack sufficient privileges towards the block device
inode. Update mount_mtd() to validate that the user has the
required access to the inode at the specified path. The check
will be skipped for CAP_SYS_ADMIN, so privileged mounts will
continue working as before.

Patch v3 is available: https://patchwork.kernel.org/patch/7640011/

Cc: linux-...@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Seth Forshee <seth.fors...@canonical.com>
Signed-off-by: Dongsu Park <don...@kinvolk.io>
---
 drivers/mtd/mtdsuper.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/mtd/mtdsuper.c b/drivers/mtd/mtdsuper.c
index 4a4d40c0..3c8734f3 100644
--- a/drivers/mtd/mtdsuper.c
+++ b/drivers/mtd/mtdsuper.c
@@ -129,6 +129,7 @@ struct dentry *mount_mtd(struct file_system_type *fs_type, int flags,
 #ifdef CONFIG_BLOCK
 	struct block_device *bdev;
 	int ret, major;
+	int perm;
 #endif
 	int mtdnr;
 
@@ -180,7 +181,10 @@ struct dentry *mount_mtd(struct file_system_type *fs_type, int flags,
 	/* try the old way - the hack where we allowed users to mount
 	 * /dev/mtdblock$(n) but didn't actually _use_ the blockdev
 	 */
-	bdev = lookup_bdev(dev_name, 0);
+	perm = MAY_READ;
+	if (!(flags & MS_RDONLY))
+		perm |= MAY_WRITE;
+	bdev = lookup_bdev(dev_name, perm);
 	if (IS_ERR(bdev)) {
 		ret = PTR_ERR(bdev);
 		pr_debug("MTDSB: lookup_bdev() returned %d\n", ret);
-- 
2.13.6
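The mask derivation in the hunk above can be pulled out into a small stand-alone function for illustration. This is a sketch only: mount_access_mask() is a hypothetical helper name that does not exist in the patch, and the constants are redefined locally with the values the kernel uses so that the snippet builds in user space.

```c
/* Same bit values as the kernel's MAY_* flags and MS_RDONLY. */
#define MAY_WRITE 0x2
#define MAY_READ  0x4
#define MS_RDONLY 1

/*
 * Hypothetical helper: access rights a mount needs on the device
 * inode. Read access is always required; write access is required
 * unless the mount is read-only.
 */
static int mount_access_mask(unsigned long flags)
{
	int perm = MAY_READ;

	if (!(flags & MS_RDONLY))
		perm |= MAY_WRITE;
	return perm;
}
```

A read-only mount therefore asks only for MAY_READ, so a user with read-only access to the device node could still mount it with MS_RDONLY.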
[PATCH 11/11] evm: Don't update hmacs in user ns mounts
From: Seth Forshee <seth.fors...@canonical.com>

The kernel should not calculate new hmacs for mounts done by
non-root users. Update evm_calc_hmac_or_hash() to refuse to
calculate new hmacs for mounts for non-init user namespaces.

Cc: linux-integr...@vger.kernel.org
Cc: linux-security-mod...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: James Morris <james.l.mor...@oracle.com>
Cc: Mimi Zohar <zo...@linux.vnet.ibm.com>
Cc: "Serge E. Hallyn" <se...@hallyn.com>
Signed-off-by: Seth Forshee <seth.fors...@canonical.com>
Signed-off-by: Dongsu Park <don...@kinvolk.io>
---
 security/integrity/evm/evm_crypto.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/security/integrity/evm/evm_crypto.c b/security/integrity/evm/evm_crypto.c
index bcd64baf..729f4545 100644
--- a/security/integrity/evm/evm_crypto.c
+++ b/security/integrity/evm/evm_crypto.c
@@ -190,7 +190,8 @@ static int evm_calc_hmac_or_hash(struct dentry *dentry,
 	int error;
 	int size;
 
-	if (!(inode->i_opflags & IOP_XATTR))
+	if (!(inode->i_opflags & IOP_XATTR) ||
+	    inode->i_sb->s_user_ns != &init_user_ns)
 		return -EOPNOTSUPP;
 
 	desc = init_desc(type);
-- 
2.13.6
[PATCH 10/11] fuse: Allow user namespace mounts
From: Seth Forshee <seth.fors...@canonical.com>

To be able to mount fuse from non-init user namespaces, it's necessary
to set the FS_USERNS_MOUNT flag in fs_flags.

Patch v4 is available: https://patchwork.kernel.org/patch/8944681/

Cc: linux-fsde...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Miklos Szeredi <mszer...@redhat.com>
Signed-off-by: Seth Forshee <seth.fors...@canonical.com>
[dongsu: add a simple commit message]
Signed-off-by: Dongsu Park <don...@kinvolk.io>
---
 fs/fuse/inode.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 7f6b2e55..8c98edee 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1212,7 +1212,7 @@ static void fuse_kill_sb_anon(struct super_block *sb)
 static struct file_system_type fuse_fs_type = {
 	.owner		= THIS_MODULE,
 	.name		= "fuse",
-	.fs_flags	= FS_HAS_SUBTYPE,
+	.fs_flags	= FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
 	.mount		= fuse_mount,
 	.kill_sb	= fuse_kill_sb_anon,
 };
@@ -1244,7 +1244,7 @@ static struct file_system_type fuseblk_fs_type = {
 	.name		= "fuseblk",
 	.mount		= fuse_mount_blk,
 	.kill_sb	= fuse_kill_sb_blk,
-	.fs_flags	= FS_REQUIRES_DEV | FS_HAS_SUBTYPE,
+	.fs_flags	= FS_REQUIRES_DEV | FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
 };
 MODULE_ALIAS_FS("fuseblk");
 
-- 
2.13.6
[PATCH 09/11] fuse: Restrict allow_other to the superblock's namespace or a descendant
From: Seth Forshee <seth.fors...@canonical.com>

Unprivileged users are normally restricted from mounting with the
allow_other option by system policy, but this could be bypassed
for a mount done with user namespace root permissions. In such
cases allow_other should not allow users outside the userns
to access the mount as doing so would give the unprivileged user
the ability to manipulate processes it would otherwise be unable
to manipulate. Restrict allow_other to apply to users in the same
userns used at mount or a descendant of that namespace. Also
export current_in_userns() for use by fuse when built as a
module.

Patch v4 is available: https://patchwork.kernel.org/patch/8944671/

Cc: linux-fsde...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: "Eric W. Biederman" <ebied...@xmission.com>
Cc: Serge Hallyn <se...@hallyn.com>
Cc: Miklos Szeredi <mszer...@redhat.com>
Signed-off-by: Seth Forshee <seth.fors...@canonical.com>
Signed-off-by: Dongsu Park <don...@kinvolk.io>
---
 fs/fuse/dir.c           | 2 +-
 kernel/user_namespace.c | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index ad1cfac1..d41559a0 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1030,7 +1030,7 @@ int fuse_allow_current_process(struct fuse_conn *fc)
 	const struct cred *cred;
 
 	if (fc->allow_other)
-		return 1;
+		return current_in_userns(fc->user_ns);
 
 	cred = current_cred();
 	if (uid_eq(cred->euid, fc->user_id) &&
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 246d4d4c..492c255e 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -1235,6 +1235,7 @@ bool current_in_userns(const struct user_namespace *target_ns)
 {
 	return in_userns(target_ns, current_user_ns());
 }
+EXPORT_SYMBOL(current_in_userns);
 
 static inline struct user_namespace *to_user_ns(struct ns_common *ns)
 {
-- 
2.13.6
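The "same userns or a descendant" rule can be illustrated with a small user-space model. This is a sketch under stated assumptions, not the kernel implementation: struct model_userns, model_in_userns() and model_allow_other() are hypothetical names, and the walk up the parent chain is modelled on how in_userns() is understood to compare nesting levels.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a user namespace: a parent link and a depth. */
struct model_userns {
	const struct model_userns *parent;	/* NULL for the initial namespace */
	int level;				/* 0 for the initial namespace */
};

/* True if child is ancestor itself or is nested somewhere below it. */
static bool model_in_userns(const struct model_userns *ancestor,
			    const struct model_userns *child)
{
	const struct model_userns *ns;

	/* Climb from child until we reach the ancestor's depth. */
	for (ns = child; ns->level > ancestor->level; ns = ns->parent)
		;
	return ns == ancestor;
}

/* Models fuse_allow_current_process() when allow_other is set. */
static bool model_allow_other(const struct model_userns *mount_ns,
			      const struct model_userns *current_ns)
{
	return model_in_userns(mount_ns, current_ns);
}
```

In this model a process in the mounting namespace or in a namespace nested below it is allowed, while a process in the parent or in a sibling namespace is not, which is the restriction the commit message describes.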
[PATCH 07/11] fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems
From: Seth Forshee <seth.fors...@canonical.com>

The user in control of a super block should be allowed to freeze
and thaw it. Relax the restrictions on the FIFREEZE and FITHAW
ioctls to require CAP_SYS_ADMIN in s_user_ns.

Cc: linux-fsde...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Alexander Viro <v...@zeniv.linux.org.uk>
Signed-off-by: Seth Forshee <seth.fors...@canonical.com>
Signed-off-by: Dongsu Park <don...@kinvolk.io>
---
 fs/ioctl.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ioctl.c b/fs/ioctl.c
index 5ace7efb..8c628a8d 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -549,7 +549,7 @@ static int ioctl_fsfreeze(struct file *filp)
 {
 	struct super_block *sb = file_inode(filp)->i_sb;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
 		return -EPERM;
 
 	/* If filesystem doesn't support freeze feature, return. */
@@ -566,7 +566,7 @@ static int ioctl_fsthaw(struct file *filp)
 {
 	struct super_block *sb = file_inode(filp)->i_sb;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
 		return -EPERM;
 
 	/* Thaw */
-- 
2.13.6
[PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns
From: Seth Forshee <seth.fors...@canonical.com>

In order to support mounts from namespaces other than
init_user_ns, fuse must translate uids and gids to/from the
userns of the process servicing requests on /dev/fuse. This
patch does that, with a couple of restrictions on the namespace:

 - The userns for the fuse connection is fixed to the namespace
   from which /dev/fuse is opened.

 - The namespace must be the same as s_user_ns.

These restrictions simplify the implementation by avoiding the
need to pass around userns references and by allowing fuse to
rely on the checks in inode_change_ok for ownership changes.
Either restriction could be relaxed in the future if needed.

For cuse the namespace used for the connection is also simply
current_user_ns() at the time /dev/cuse is opened.

Patch v4 is available: https://patchwork.kernel.org/patch/8944661/

Cc: linux-fsde...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Miklos Szeredi <mszer...@redhat.com>
Signed-off-by: Seth Forshee <seth.fors...@canonical.com>
Signed-off-by: Dongsu Park <don...@kinvolk.io>
---
 fs/fuse/cuse.c   |  3 ++-
 fs/fuse/dev.c    | 11 ++++++++---
 fs/fuse/dir.c    | 14 +++---
 fs/fuse/fuse_i.h |  6 +-
 fs/fuse/inode.c  | 31 +++
 5 files changed, 41 insertions(+), 24 deletions(-)

diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index e9e97803..b1b83259 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -48,6 +48,7 @@
 #include
 #include
 #include
+#include
 
 #include "fuse_i.h"
 
@@ -498,7 +499,7 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
 	if (!cc)
 		return -ENOMEM;
 
-	fuse_conn_init(&cc->fc);
+	fuse_conn_init(&cc->fc, current_user_ns());
 
 	fud = fuse_dev_alloc(&cc->fc);
 	if (!fud) {
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 17f0d05b..0f780e16 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req)
 
 static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
 {
-	req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
-	req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
+	req->in.h.uid = from_kuid(fc->user_ns, current_fsuid());
+	req->in.h.gid = from_kgid(fc->user_ns, current_fsgid());
 	req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
 }
 
@@ -167,6 +167,10 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
 	__set_bit(FR_WAITING, &req->flags);
 	if (for_background)
 		__set_bit(FR_BACKGROUND, &req->flags);
+	if (req->in.h.uid == (uid_t)-1 || req->in.h.gid == (gid_t)-1) {
+		fuse_put_request(fc, req);
+		return ERR_PTR(-EOVERFLOW);
+	}
 
 	return req;
 
@@ -1260,7 +1264,8 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
 	in = &req->in;
 	reqsize = in->h.len;
 
-	if (task_active_pid_ns(current) != fc->pid_ns) {
+	if (task_active_pid_ns(current) != fc->pid_ns ||
+	    current_user_ns() != fc->user_ns) {
 		rcu_read_lock();
 		in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns));
 		rcu_read_unlock();
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 24967382..ad1cfac1 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -858,8 +858,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
 	stat->ino = attr->ino;
 	stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
 	stat->nlink = attr->nlink;
-	stat->uid = make_kuid(&init_user_ns, attr->uid);
-	stat->gid = make_kgid(&init_user_ns, attr->gid);
+	stat->uid = make_kuid(fc->user_ns, attr->uid);
+	stat->gid = make_kgid(fc->user_ns, attr->gid);
 	stat->rdev = inode->i_rdev;
 	stat->atime.tv_sec = attr->atime;
 	stat->atime.tv_nsec = attr->atimensec;
@@ -1475,17 +1475,17 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime)
 	return true;
 }
 
-static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg,
-			   bool trust_local_cmtime)
+static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr,
+			   struct fuse_setattr_in *arg, bool trust_local_cmtime)
 {
 	unsigned ivalid = iattr->ia_valid;
 
 	if (ivalid & ATTR_MODE)
 		arg->valid |= FATTR_MODE,   arg->mode = iattr->ia_mode;
 	if (ivalid & ATTR_UID)
-		arg->valid |= FATTR_UID,    arg->uid = from_kuid(&init_user_ns, iattr->ia_uid);
+		arg->valid |= FATTR_UID,    arg->uid = from_kuid(fc->user_ns, iattr->ia_uid);
 	if (ivalid & ATTR_GID)
-
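The switch from from_kuid_munged() to from_kuid(), together with the new -EOVERFLOW check, means a request is refused when the caller's ids have no mapping in the connection's namespace. A rough user-space model of that behaviour, assuming a simplified extent-based id map: struct model_extent, model_from_kid() and model_req_init() are hypothetical names, and a single map is shared by uids and gids purely to keep the sketch short.

```c
#include <stdint.h>
#include <errno.h>

/* Marker for "no mapping", matching the (uid_t)-1 test in the patch. */
#define INVALID_ID ((uint32_t)-1)

/*
 * Hypothetical mapping extent: namespace ids [first, first + count)
 * correspond to kernel ids [lower_first, lower_first + count).
 */
struct model_extent {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

/* Kernel id to namespace id; INVALID_ID when no extent covers it. */
static uint32_t model_from_kid(const struct model_extent *map, int n,
			       uint32_t kid)
{
	for (int i = 0; i < n; i++) {
		if (kid >= map[i].lower_first &&
		    kid - map[i].lower_first < map[i].count)
			return map[i].first + (kid - map[i].lower_first);
	}
	return INVALID_ID;
}

/*
 * Models the request setup: translate both ids, then refuse the
 * request if either one is unmapped.
 */
static int model_req_init(const struct model_extent *map, int n,
			  uint32_t kuid, uint32_t kgid,
			  uint32_t *uid_out, uint32_t *gid_out)
{
	*uid_out = model_from_kid(map, n, kuid);
	*gid_out = model_from_kid(map, n, kgid);
	if (*uid_out == INVALID_ID || *gid_out == INVALID_ID)
		return -EOVERFLOW;
	return 0;
}
```

With a map of { 0, 100000, 65536 }, kernel uid 100000 is seen by the fuse server as uid 0, while a caller with kernel uid 0 is outside the map and its request fails with -EOVERFLOW instead of being sent with an overflow id.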
[PATCH 06/11] capabilities: Allow privileged user in s_user_ns to set security.* xattrs
From: Seth Forshee <seth.fors...@canonical.com>

A privileged user in s_user_ns will generally have the ability to
manipulate the backing store and insert security.* xattrs into
the filesystem directly. Therefore the kernel must be prepared to
handle these xattrs from unprivileged mounts, and it makes little
sense for commoncap to prevent writing these xattrs to the
filesystem. The capability and LSM code have already been updated
to appropriately handle xattrs from unprivileged mounts, so it
is safe to loosen this restriction on setting xattrs.

The exception to this logic is that writing xattrs to a mounted
filesystem may also cause the LSM inode_post_setxattr or
inode_setsecurity callbacks to be invoked. SELinux will deny the
xattr update by virtue of applying mountpoint labeling to
unprivileged userns mounts, and Smack will deny the writes for
any user without global CAP_MAC_ADMIN, so loosening the
capability check in commoncap is safe in this respect as well.

Patch v4 is available: https://patchwork.kernel.org/patch/8944641/

Cc: linux-security-mod...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: James Morris <james.l.mor...@oracle.com>
Cc: Serge Hallyn <se...@hallyn.com>
Signed-off-by: Seth Forshee <seth.fors...@canonical.com>
Signed-off-by: Dongsu Park <don...@kinvolk.io>
---
 security/commoncap.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/security/commoncap.c b/security/commoncap.c
index 4f8e0934..dd0afef9 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -920,6 +920,8 @@ int cap_bprm_set_creds(struct linux_binprm *bprm)
 int cap_inode_setxattr(struct dentry *dentry, const char *name,
 		       const void *value, size_t size, int flags)
 {
+	struct user_namespace *user_ns = dentry->d_sb->s_user_ns;
+
 	/* Ignore non-security xattrs */
 	if (strncmp(name, XATTR_SECURITY_PREFIX,
 			sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
@@ -932,7 +934,7 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
 	if (strcmp(name, XATTR_NAME_CAPS) == 0)
 		return 0;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!ns_capable(user_ns, CAP_SYS_ADMIN))
 		return -EPERM;
 	return 0;
 }
@@ -950,6 +952,8 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
  */
 int cap_inode_removexattr(struct dentry *dentry, const char *name)
 {
+	struct user_namespace *user_ns = dentry->d_sb->s_user_ns;
+
 	/* Ignore non-security xattrs */
 	if (strncmp(name, XATTR_SECURITY_PREFIX,
 			sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
@@ -965,7 +969,7 @@ int cap_inode_removexattr(struct dentry *dentry, const char *name)
 		return 0;
 	}
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!ns_capable(user_ns, CAP_SYS_ADMIN))
 		return -EPERM;
 	return 0;
 }
-- 
2.13.6
[PATCH 06/11] capabilities: Allow privileged user in s_user_ns to set security.* xattrs
From: Seth Forshee

A privileged user in s_user_ns will generally have the ability to
manipulate the backing store and insert security.* xattrs into the
filesystem directly. Therefore the kernel must be prepared to handle
these xattrs from unprivileged mounts, and it makes little sense for
commoncap to prevent writing these xattrs to the filesystem. The
capability and LSM code have already been updated to appropriately
handle xattrs from unprivileged mounts, so it is safe to loosen this
restriction on setting xattrs.

The exception to this logic is that writing xattrs to a mounted
filesystem may also cause the LSM inode_post_setxattr or
inode_setsecurity callbacks to be invoked. SELinux will deny the xattr
update by virtue of applying mountpoint labeling to unprivileged userns
mounts, and Smack will deny the writes for any user without global
CAP_MAC_ADMIN, so loosening the capability check in commoncap is safe in
this respect as well.

Patch v4 is available: https://patchwork.kernel.org/patch/8944641/

Cc: linux-security-mod...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: James Morris
Cc: Serge Hallyn
Signed-off-by: Seth Forshee
Signed-off-by: Dongsu Park
---
 security/commoncap.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/security/commoncap.c b/security/commoncap.c
index 4f8e0934..dd0afef9 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -920,6 +920,8 @@ int cap_bprm_set_creds(struct linux_binprm *bprm)
 int cap_inode_setxattr(struct dentry *dentry, const char *name,
 		       const void *value, size_t size, int flags)
 {
+	struct user_namespace *user_ns = dentry->d_sb->s_user_ns;
+
 	/* Ignore non-security xattrs */
 	if (strncmp(name, XATTR_SECURITY_PREFIX,
 			sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
@@ -932,7 +934,7 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
 	if (strcmp(name, XATTR_NAME_CAPS) == 0)
 		return 0;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!ns_capable(user_ns, CAP_SYS_ADMIN))
 		return -EPERM;
 	return 0;
 }
@@ -950,6 +952,8 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
  */
 int cap_inode_removexattr(struct dentry *dentry, const char *name)
 {
+	struct user_namespace *user_ns = dentry->d_sb->s_user_ns;
+
 	/* Ignore non-security xattrs */
 	if (strncmp(name, XATTR_SECURITY_PREFIX,
 			sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
@@ -965,7 +969,7 @@ int cap_inode_removexattr(struct dentry *dentry, const char *name)
 		return 0;
 	}
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!ns_capable(user_ns, CAP_SYS_ADMIN))
 		return -EPERM;
 	return 0;
 }
-- 
2.13.6
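For readers unfamiliar with the difference between capable() and
ns_capable(), here is a rough userspace model of the namespace walk. It is
only a sketch: the model_* types and functions are made up for illustration
and are not kernel APIs, and the real check also consults LSMs and the full
capability sets. It assumes the usual rule that a task has a capability
towards a namespace if it holds it in that namespace or an ancestor, or if
it created the namespace from the direct parent.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct user_namespace: only the parent link
 * and the (kernel-internal) uid of the creator are modelled. */
struct model_userns {
	const struct model_userns *parent;	/* NULL for the initial namespace */
	unsigned int owner;			/* uid that created this namespace */
};

/* Hypothetical stand-in for a task's credentials. */
struct model_cred {
	const struct model_userns *user_ns;	/* namespace the task lives in */
	unsigned int euid;			/* kernel-internal effective uid */
	bool has_cap_sys_admin;			/* CAP_SYS_ADMIN in its own namespace */
};

/* Model of ns_capable(target, CAP_SYS_ADMIN): walk from the target
 * namespace towards the root looking for the caller's namespace. */
static bool model_ns_capable(const struct model_cred *cred,
			     const struct model_userns *target)
{
	for (const struct model_userns *ns = target; ns; ns = ns->parent) {
		if (ns == cred->user_ns)
			return cred->has_cap_sys_admin;
		/* The creator of a namespace is privileged towards it,
		 * provided the creator lives in the direct parent. */
		if (ns->parent == cred->user_ns && ns->owner == cred->euid)
			return true;
	}
	return false;
}

/* Model of capable(CAP_SYS_ADMIN): the same check against the initial
 * namespace, which is what the patch stops requiring. */
static bool model_capable(const struct model_cred *cred,
			  const struct model_userns *init_ns)
{
	return model_ns_capable(cred, init_ns);
}
```

With this model, root inside a container passes the check against the
namespace that mounted the filesystem but still fails the global one, which
is the behaviour change the patch is after.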
[PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns
From: Seth Forshee

In order to support mounts from namespaces other than init_user_ns, fuse
must translate uids and gids to/from the userns of the process servicing
requests on /dev/fuse. This patch does that, with a couple of
restrictions on the namespace:

 - The userns for the fuse connection is fixed to the namespace
   from which /dev/fuse is opened.

 - The namespace must be the same as s_user_ns.

These restrictions simplify the implementation by avoiding the need to
pass around userns references and by allowing fuse to rely on the checks
in inode_change_ok for ownership changes. Either restriction could be
relaxed in the future if needed.

For cuse the namespace used for the connection is also simply
current_user_ns() at the time /dev/cuse is opened.

Patch v4 is available: https://patchwork.kernel.org/patch/8944661/

Cc: linux-fsde...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Miklos Szeredi
Signed-off-by: Seth Forshee
Signed-off-by: Dongsu Park
---
 fs/fuse/cuse.c   |  3 ++-
 fs/fuse/dev.c    | 11 ---
 fs/fuse/dir.c    | 14 +++---
 fs/fuse/fuse_i.h |  6 +-
 fs/fuse/inode.c  | 31 +++
 5 files changed, 41 insertions(+), 24 deletions(-)

diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index e9e97803..b1b83259 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -48,6 +48,7 @@
 #include <linux/stat.h>
 #include <linux/module.h>
 #include <linux/uio.h>
+#include <linux/user_namespace.h>
 
 #include "fuse_i.h"
 
@@ -498,7 +499,7 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
 	if (!cc)
 		return -ENOMEM;
 
-	fuse_conn_init(&cc->fc);
+	fuse_conn_init(&cc->fc, current_user_ns());
 
 	fud = fuse_dev_alloc(&cc->fc);
 	if (!fud) {
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 17f0d05b..0f780e16 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req)
 
 static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
 {
-	req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
-	req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
+	req->in.h.uid = from_kuid(fc->user_ns, current_fsuid());
+	req->in.h.gid = from_kgid(fc->user_ns, current_fsgid());
 	req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
 }
 
@@ -167,6 +167,10 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
 	__set_bit(FR_WAITING, &req->flags);
 	if (for_background)
 		__set_bit(FR_BACKGROUND, &req->flags);
+	if (req->in.h.uid == (uid_t)-1 || req->in.h.gid == (gid_t)-1) {
+		fuse_put_request(fc, req);
+		return ERR_PTR(-EOVERFLOW);
+	}
 
 	return req;
 
@@ -1260,7 +1264,8 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
 	in = &req->in;
 	reqsize = in->h.len;
 
-	if (task_active_pid_ns(current) != fc->pid_ns) {
+	if (task_active_pid_ns(current) != fc->pid_ns ||
+	    current_user_ns() != fc->user_ns) {
 		rcu_read_lock();
 		in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns));
 		rcu_read_unlock();
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 24967382..ad1cfac1 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -858,8 +858,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
 	stat->ino = attr->ino;
 	stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
 	stat->nlink = attr->nlink;
-	stat->uid = make_kuid(&init_user_ns, attr->uid);
-	stat->gid = make_kgid(&init_user_ns, attr->gid);
+	stat->uid = make_kuid(fc->user_ns, attr->uid);
+	stat->gid = make_kgid(fc->user_ns, attr->gid);
 	stat->rdev = inode->i_rdev;
 	stat->atime.tv_sec = attr->atime;
 	stat->atime.tv_nsec = attr->atimensec;
@@ -1475,17 +1475,17 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime)
 	return true;
 }
 
-static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg,
-			   bool trust_local_cmtime)
+static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr,
+			   struct fuse_setattr_in *arg, bool trust_local_cmtime)
 {
 	unsigned ivalid = iattr->ia_valid;
 
 	if (ivalid & ATTR_MODE)
 		arg->valid |= FATTR_MODE,   arg->mode = iattr->ia_mode;
 	if (ivalid & ATTR_UID)
-		arg->valid |= FATTR_UID,    arg->uid = from_kuid(&init_user_ns, iattr->ia_uid);
+		arg->valid |= FATTR_UID,    arg->uid = from_kuid(fc->user_ns, iattr->ia_uid);
 	if (ivalid & ATTR_GID)
-		arg->valid |= FATTR_GID,    arg->gid = from_kgid(&init_user_ns, iattr->ia_gid);
+		arg->valid |= F
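The uid/gid translation this patch relies on can be sketched in userspace
as follows. This is a simplified model, assuming a user namespace with a
single mapping extent; the model_* names are invented for the example and
are not the kernel's from_kuid()/make_kuid() implementations. It shows why
the patch adds the EOVERFLOW check: a caller whose ids have no mapping in
the connection's namespace cannot be represented on the wire.

```c
#include <errno.h>
#include <stdint.h>

#define MODEL_INVALID_ID ((uint32_t)-1)

/* Hypothetical single-extent id map, like one line of /proc/PID/uid_map. */
struct model_uid_extent {
	uint32_t first;		/* first id as seen inside the namespace */
	uint32_t lower_first;	/* first kernel-internal id it maps to */
	uint32_t count;		/* number of ids in the extent */
};

/* Namespace-local id -> kernel-internal id (models make_kuid()). */
static uint32_t model_make_kuid(const struct model_uid_extent *m, uint32_t id)
{
	if (id >= m->first && id - m->first < m->count)
		return m->lower_first + (id - m->first);
	return MODEL_INVALID_ID;
}

/* Kernel-internal id -> namespace-local id (models from_kuid()). */
static uint32_t model_from_kuid(const struct model_uid_extent *m, uint32_t kuid)
{
	if (kuid >= m->lower_first && kuid - m->lower_first < m->count)
		return m->first + (kuid - m->lower_first);
	return MODEL_INVALID_ID;
}

/* Models the check added to __fuse_get_req(): refuse the request when
 * the caller's uid cannot be expressed in the connection's namespace. */
static int model_fuse_req_init(const struct model_uid_extent *conn_map,
			       uint32_t caller_kuid, uint32_t *wire_uid)
{
	*wire_uid = model_from_kuid(conn_map, caller_kuid);
	if (*wire_uid == MODEL_INVALID_ID)
		return -EOVERFLOW;
	return 0;
}
```

Note that the unmunged from_kuid() is what makes the failure visible; the
old from_kuid_munged() call would have silently substituted the overflow
uid instead.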
[PATCH 05/11] fs: Allow superblock owner to access do_remount_sb()
From: Seth Forshee <seth.fors...@canonical.com>

Superblock level remounts are currently restricted to global
CAP_SYS_ADMIN, as is the path for changing the root mount to
read only on umount. Loosen both of these permission checks to
also allow CAP_SYS_ADMIN in any namespace which is privileged
towards the userns which originally mounted the filesystem.

Patch v4 is available: https://patchwork.kernel.org/patch/8944631/

Cc: linux-fsde...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Alexander Viro <v...@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebied...@xmission.com>
Cc: Serge Hallyn <se...@hallyn.com>
Signed-off-by: Seth Forshee <seth.fors...@canonical.com>
Signed-off-by: Dongsu Park <don...@kinvolk.io>
---
 fs/namespace.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index e158ec6b..830040d7 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1589,7 +1589,7 @@ static int do_umount(struct mount *mnt, int flags)
 		 * Special case for "unmounting" root ...
 		 * we just try to remount it readonly.
 		 */
-		if (!capable(CAP_SYS_ADMIN))
+		if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
 			return -EPERM;
 		down_write(&sb->s_umount);
 		if (!sb_rdonly(sb))
@@ -2327,7 +2327,7 @@ static int do_remount(struct path *path, int ms_flags, int sb_flags,
 	down_write(&sb->s_umount);
 	if (ms_flags & MS_BIND)
 		err = change_mount_flags(path->mnt, ms_flags);
-	else if (!capable(CAP_SYS_ADMIN))
+	else if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
 		err = -EPERM;
 	else
 		err = do_remount_sb(sb, sb_flags, data, 0);
-- 
2.13.6
[PATCH 03/11] fs: Allow superblock owner to change ownership of inodes
From: Eric W. Biederman <ebied...@xmission.com>

Allow users with CAP_CHOWN over the superblock of a filesystem to
chown files. Ordinarily the capable_wrt_inode_uidgid check is
sufficient to allow access to files, but when the underlying filesystem
has uids or gids that don't map to the current user namespace it
is not enough, so the chown permission checks need to be extended to
allow this case.

Calling chown on filesystem nodes whose uid or gid don't map is
necessary if those nodes are going to be modified, as writing back
inodes which contain uids or gids that don't map is likely to cause
filesystem corruption of the uid or gid fields.

Once chown has been called the existing capable_wrt_inode_uidgid
checks are sufficient to allow the owner of a superblock to do anything
the global root user can do with an appropriate set of capabilities.

For the proc filesystem this relaxation of permissions is not safe, as
some files are owned by users (particularly GLOBAL_ROOT_UID) outside
of the control of the mounter of the proc and that would be unsafe to
grant chown access to. So update setattr on proc to disallow changing
files whose uids or gids are outside of proc's s_user_ns.

The original version of this patch was written by: Seth Forshee. I
have rewritten and rethought this patch enough so it's really not the
same thing (certainly it needs a different description), but he
deserves credit for getting out there and getting the conversation
started, and finding the potential gotchas and putting up with my
semi-paranoid feedback.

Patch v4 is available: https://patchwork.kernel.org/patch/8944611/

Cc: linux-fsde...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Alexander Viro <v...@zeniv.linux.org.uk>
Cc: "Luis R. Rodriguez" <mcg...@kernel.org>
Cc: Kees Cook <keesc...@chromium.org>
Inspired-by: Seth Forshee <seth.fors...@canonical.com>
Signed-off-by: Eric W. Biederman <ebied...@xmission.com>
[saf: Resolve conflicts caused by s/inode_change_ok/setattr_prepare/]
Signed-off-by: Dongsu Park <don...@kinvolk.io>
---
 fs/attr.c             | 34 ++
 fs/proc/base.c        |  7 +++
 fs/proc/generic.c     |  7 +++
 fs/proc/proc_sysctl.c |  7 +++
 4 files changed, 47 insertions(+), 8 deletions(-)

diff --git a/fs/attr.c b/fs/attr.c
index 12ffdb6f..bf8e94f3 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -18,6 +18,30 @@
 #include <linux/evm.h>
 #include <linux/ima.h>
 
+static bool chown_ok(const struct inode *inode, kuid_t uid)
+{
+	if (uid_eq(current_fsuid(), inode->i_uid) &&
+	    uid_eq(uid, inode->i_uid))
+		return true;
+	if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
+		return true;
+	if (ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
+		return true;
+	return false;
+}
+
+static bool chgrp_ok(const struct inode *inode, kgid_t gid)
+{
+	if (uid_eq(current_fsuid(), inode->i_uid) &&
+	    (in_group_p(gid) || gid_eq(gid, inode->i_gid)))
+		return true;
+	if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
+		return true;
+	if (ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
+		return true;
+	return false;
+}
+
 /**
  * setattr_prepare - check if attribute changes to a dentry are allowed
  * @dentry:	dentry to check
@@ -52,17 +76,11 @@ int setattr_prepare(struct dentry *dentry, struct iattr *attr)
 		goto kill_priv;
 
 	/* Make sure a caller can chown. */
-	if ((ia_valid & ATTR_UID) &&
-	    (!uid_eq(current_fsuid(), inode->i_uid) ||
-	     !uid_eq(attr->ia_uid, inode->i_uid)) &&
-	    !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
+	if ((ia_valid & ATTR_UID) && !chown_ok(inode, attr->ia_uid))
 		return -EPERM;
 
 	/* Make sure caller can chgrp. */
-	if ((ia_valid & ATTR_GID) &&
-	    (!uid_eq(current_fsuid(), inode->i_uid) ||
-	    (!in_group_p(attr->ia_gid) && !gid_eq(attr->ia_gid, inode->i_gid))) &&
-	    !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
+	if ((ia_valid & ATTR_GID) && !chgrp_ok(inode, attr->ia_gid))
 		return -EPERM;
 
 	/* Make sure a caller can chmod. */
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 31934cb9..9d50ec92 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -665,10 +665,17 @@ int proc_setattr(struct dentry *dentry, struct iattr *attr)
 {
 	int error;
 	struct inode *inode = d_inode(dentry);
+	struct user_namespace *s_user_ns;
 
 	if (attr->ia_valid & ATTR_MODE)
 		return -EPERM;
 
+	/* Don't let anyone mess with weird proc files */
+	s_user_ns = inode->i_sb->s_user_ns;
+	if (!kuid_has_mapping(s_user_ns, inode->i_uid) ||
+	    !kgid_has_mapping(s_user_ns, inode->i_gid))
+		return -EPERM;
+
 	error = setattr_prepare(dentry, attr);
 	if (error)
 		return error;
diff --git a/fs/proc/generic.c b/fs/proc/gen
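To make the effect of the new chown_ok() helper easier to see, here is a
small userspace model comparing it with the open-coded condition it
replaces. It is a sketch only: the two capability checks are passed in as
precomputed booleans, and struct model_chown_ctx and the model_* functions
are hypothetical names, not kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of everything the check looks at. */
struct model_chown_ctx {
	uint32_t fsuid;		/* caller's filesystem uid */
	uint32_t inode_uid;	/* current owner of the inode */
	bool cap_wrt_inode;	/* capable_wrt_inode_uidgid(inode, CAP_CHOWN) */
	bool cap_in_sb_userns;	/* ns_capable(sb->s_user_ns, CAP_CHOWN) */
};

/* Pre-patch logic: the negation of the removed -EPERM condition. */
static bool model_chown_ok_old(const struct model_chown_ctx *c,
			       uint32_t new_uid)
{
	return !((c->fsuid != c->inode_uid || new_uid != c->inode_uid) &&
		 !c->cap_wrt_inode);
}

/* Post-patch logic, following chown_ok() from the diff. */
static bool model_chown_ok(const struct model_chown_ctx *c, uint32_t new_uid)
{
	if (c->fsuid == c->inode_uid && new_uid == c->inode_uid)
		return true;	/* owner "changing" the uid to itself */
	if (c->cap_wrt_inode)
		return true;
	if (c->cap_in_sb_userns)
		return true;	/* new: the superblock owner may chown */
	return false;
}
```

The only case where the two functions disagree is the one the commit
message describes: the inode's ids do not map into the caller's namespace,
so capable_wrt_inode_uidgid() fails, but the caller holds CAP_CHOWN in the
superblock's s_user_ns.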
[PATCH 04/11] fs: Don't remove suid for CAP_FSETID for userns root
From: Seth Forshee <seth.fors...@canonical.com>

Expand the check in should_remove_suid() to keep privileges for
CAP_FSETID in s_user_ns rather than init_user_ns.

Patch v4 is available: https://patchwork.kernel.org/patch/8944621/

--EWB Changed from ns_capable(sb->s_user_ns, ) to capable_wrt_inode_uidgid

Cc: linux-fsde...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Alexander Viro <v...@zeniv.linux.org.uk>
Cc: Serge Hallyn <se...@hallyn.com>
Signed-off-by: Seth Forshee <seth.fors...@canonical.com>
Signed-off-by: Dongsu Park <don...@kinvolk.io>
---
 fs/inode.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index fd401028..6459a437 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1749,7 +1749,8 @@ EXPORT_SYMBOL(touch_atime);
  */
 int should_remove_suid(struct dentry *dentry)
 {
-	umode_t mode = d_inode(dentry)->i_mode;
+	struct inode *inode = d_inode(dentry);
+	umode_t mode = inode->i_mode;
 	int kill = 0;
 
 	/* suid always must be killed */
@@ -1763,7 +1764,8 @@ int should_remove_suid(struct dentry *dentry)
 	if (unlikely((mode & S_ISGID) && (mode & S_IXGRP)))
 		kill |= ATTR_KILL_SGID;
 
-	if (unlikely(kill && !capable(CAP_FSETID) && S_ISREG(mode)))
+	if (unlikely(kill && !capable_wrt_inode_uidgid(inode, CAP_FSETID) &&
+		     S_ISREG(mode)))
 		return kill;
 
 	return 0;
-- 
2.13.6
[PATCH v5 00/11] FUSE mounts from non-init user namespaces
This patchset v5 is based on work by Seth Forshee and Eric Biederman.
The latest patchset was v4:
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1132206.html

At the moment, filesystems backed by a physical medium can only be
mounted by real root in the initial user namespace. This restriction
exists because allowing the root user of a non-init user namespace to
mount such a filesystem would effectively let that user control the
underlying source of the filesystem. In the case of FUSE, the source
would mean any underlying device.

However, in many use cases such as containers, it's necessary to allow
filesystems to be mounted from non-init user namespaces. The goal of
this patchset is to allow FUSE filesystems to be mounted from non-init
user namespaces. Support for other filesystems like ext4 is not in the
scope of this patchset.

Let me describe how to test mounting from non-init user namespaces. It's
assumed that tests are done via sshfs, a userspace filesystem based on
FUSE with ssh as backend. The test system is Fedora 27.

    $ sudo dnf install -y sshfs
    $ sudo mkdir -p /mnt/userns

    ### workaround to get the sshfs permission checks
    $ sudo chown -R $UID:$UID /etc/ssh/ssh_config.d /usr/share/crypto-policies

    $ unshare -U -r -m
    # sshfs root@localhost: /mnt/userns

    ### You can see sshfs being mounted from a non-init user namespace
    # mount | grep sshfs
    root@localhost: on /mnt/userns type fuse.sshfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0)

    # touch /mnt/userns/test
    # ls -l /mnt/userns/test
    -rw-r--r-- 1 root root 0 Dec 11 19:01 /mnt/userns/test

Open another terminal and check the mountpoint from outside the namespace.

    $ grep userns /proc/$(pidof sshfs)/mountinfo
    131 102 0:35 / /mnt/userns rw,nosuid,nodev,relatime - fuse.sshfs root@localhost: rw,user_id=0,group_id=0

After all tests are done, you can unmount the filesystem
inside the namespace.

    # fusermount -u /mnt/userns

Changes since v4:
 * Remove other parts like ext4 to keep the patchset minimal for FUSE
 * Add and change commit messages
 * Describe how to test non-init user namespaces

TODO:
 * Think through potential security implications. There are 2 patches
   being prepared for security issues. One is "ima: define a new policy
   option named force" by Mimi Zohar, which adds an option to specify
   that the results should not be cached:
   https://marc.info/?l=linux-integrity&m=151275680115856&w=2
   The other one is to basically prevent FUSE results from being cached,
   which is still in progress.

 * Test IMA/LSMs. Details are written in
   https://github.com/kinvolk/fuse-userns-patches/blob/master/tests/TESTING_INTEGRITY.md

Patches 1-2 deal with an additional flag of lookup_bdev() to check for
additional inode permission.

Patches 3-7 allow the superblock owner to change ownership of inodes, and
deal with additional capability checks w.r.t. user namespaces.

Patches 8-10 allow FUSE filesystems to be mounted outside of the init
user namespace.

Patch 11 handles a corner case of non-root users in EVM.

The patchset is also available in our github repo:
  https://github.com/kinvolk/linux/tree/dongsu/fuse-userns-v5-1

Eric W. Biederman (1):
  fs: Allow superblock owner to change ownership of inodes

Seth Forshee (10):
  block_dev: Support checking inode permissions in lookup_bdev()
  mtd: Check permissions towards mtd block device inode when mounting
  fs: Don't remove suid for CAP_FSETID for userns root
  fs: Allow superblock owner to access do_remount_sb()
  capabilities: Allow privileged user in s_user_ns to set security.*
    xattrs
  fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems
  fuse: Support fuse filesystems outside of init_user_ns
  fuse: Restrict allow_other to the superblock's namespace or a
    descendant
  fuse: Allow user namespace mounts
  evm: Don't update hmacs in user ns mounts

 drivers/md/bcache/super.c           |  2 +-
 drivers/md/dm-table.c               |  2 +-
 drivers/mtd/mtdsuper.c              |  6 +-
 fs/attr.c                           | 34 ++
 fs/block_dev.c                      | 13 ++---
 fs/fuse/cuse.c                      |  3 ++-
 fs/fuse/dev.c                       | 11 ---
 fs/fuse/dir.c                       | 16
 fs/fuse/fuse_i.h                    |  6 +-
 fs/fuse/inode.c                     | 35 +--
 fs/inode.c                          |  6 --
 fs/ioctl.c                          |  4 ++--
 fs/namespace.c                      |  4 ++--
 fs/proc/base.c                      |  7 +++
 fs/proc/generic.c                   |  7 +++
 fs/proc/proc_sysctl.c               |  7 +++
 fs/quota/quota.c                    |  2 +-
 include/linux/fs.h                  |  2 +-
 kernel/user_namespace.c             |  1 +
 security/commoncap.c                |  8 ++--
 security/integrity/evm/evm_crypto.c |  3 ++-
 21 files changed, 127
Re: [PATCH v2] devpts: allow mounting with uid/gid of uint32_t
On 28.08.2015 15:33, Peter Hurley wrote:
> On 08/18/2015 11:18 AM, Dongsu Park wrote:
> > ---
> >  fs/devpts/inode.c | 20 ++++++++++++++++----
> >  1 file changed, 16 insertions(+), 4 deletions(-)
> >
> > diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
> > index c35ffdc12bba..49272fae40a7 100644
> > --- a/fs/devpts/inode.c
> > +++ b/fs/devpts/inode.c
> > @@ -188,23 +188,35 @@ static int parse_mount_options(char *data, int op, struct pts_mount_opts *opts)
> >  		token = match_token(p, tokens, args);
> >  		switch (token) {
> >  		case Opt_uid:
> > -			if (match_int(&args[0], &option))
>
> match_int() => make_kuid/kgid is a widespread pattern in filesystems
> for handling uid/gid mount parameters.
>
> How about adding a for-purpose string-to-uid/gid function, rather than
> open-coding?

Yeah, that sounds like a good idea.
Do you mean something like this? (on top of -mm tree)

Thanks,
Dongsu

----

From ccfa5db398ba5ac31c5e0128e88abca1f6d1e6f5 Mon Sep 17 00:00:00 2001
Message-Id: <ccfa5db398ba5ac31c5e0128e88abca1f6d1e6f5.1440844226.git.dp...@posteo.net>
From: Dongsu Park
Date: Sat, 29 Aug 2015 12:35:01 +0200
Subject: [PATCH v3] devpts: allow mounting with uid/gid of uint32_t

To allow devpts to be mounted with uid/gid options in the full uint32_t
range, we need to make use of the general parsing API instead of
match_int(), which parses into a signed int.

So introduce kstrto{uid,gid}(), wrappers around kstrtouint() as well as
make_k{uid,gid}() calls. And then make devpts parse options only using
kstrto{uid,gid}().

Doing that, mounting devpts with uid or gid > (2^31 - 1) will work as
expected, e.g.:

 # mount -t devpts devpts /tmp/devptsdir -o \
   newinstance,ptmxmode=0666,mode=620,uid=3598450688,gid=3598450693

It was originally reported on the GitHub issue tracker of systemd:
https://github.com/systemd/systemd/issues/956

from v2:
 * rebase on top of -mm tree
 * split common parts for parsing uid/gid into kstrto{uid,gid}()
 * fix minor format.
 * continue to use kstrtouint() suggested by Alexey Dobriyan.

from v1: fix patch format correctly

Cc: Alexey Dobriyan
Reported-by: Alban Crequy
Suggested-by: Peter Hurley
Signed-off-by: Dongsu Park
---
 fs/devpts/inode.c             | 19 +++++++++----------
 include/linux/parse-integer.h |  4 ++++
 lib/kstrtox.c                 | 40 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 53 insertions(+), 10 deletions(-)

diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
index c35ffdc12bba..fbbd71005dcb 100644
--- a/fs/devpts/inode.c
+++ b/fs/devpts/inode.c
@@ -181,6 +181,7 @@ static int parse_mount_options(char *data, int op, struct pts_mount_opts *opts)
 		substring_t args[MAX_OPT_ARGS];
 		int token;
 		int option;
+		int rc;
 
 		if (!*p)
 			continue;
@@ -188,20 +189,18 @@ static int parse_mount_options(char *data, int op, struct pts_mount_opts *opts)
 		token = match_token(p, tokens, args);
 		switch (token) {
 		case Opt_uid:
-			if (match_int(&args[0], &option))
-				return -EINVAL;
-			uid = make_kuid(current_user_ns(), option);
-			if (!uid_valid(uid))
-				return -EINVAL;
+			rc = kstrtouid(args[0].from, &uid);
+			if (rc)
+				return rc;
+
 			opts->uid = uid;
 			opts->setuid = 1;
 			break;
 		case Opt_gid:
-			if (match_int(&args[0], &option))
-				return -EINVAL;
-			gid = make_kgid(current_user_ns(), option);
-			if (!gid_valid(gid))
-				return -EINVAL;
+			rc = kstrtogid(args[0].from, &gid);
+			if (rc)
+				return rc;
+
 			opts->gid = gid;
 			opts->setgid = 1;
 			break;
diff --git a/include/linux/parse-integer.h b/include/linux/parse-integer.h
index ba620cdf3df6..2cdc4f418e00 100644
--- a/include/linux/parse-integer.h
+++ b/include/linux/parse-integer.h
@@ -2,6 +2,7 @@
 #define _PARSE_INTEGER_H
 #include <linux/compiler.h>
 #include <linux/types.h>
+#include <linux/uidgid.h>
 
 /*
  * int parse_integer(const char *s, unsigned int base, T *val);
@@ -155,6 +156,9 @@ static inline int __must_check kstrtos8(const char *s, unsigned int base, s8 *re
 	return parse_integer(s, base | PARSE_INTEGER_NEWLINE, res);
 }
 
+int __must_check kstrtouid(const char *uidstr, kuid_t *kuid);
+int __must_check kstrtogid(const char *gidstr, kgid_t *kgid);
+
 int __must_check kstrtoull_from_user(const char __user *s, size_t count, unsigned int base, unsigned long long *res);
 int __must_check kstrtoll_from_user(const char __user *s, size_t count, unsigned in
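The lib/kstrtox.c part of the patch is cut off here, so the following is
not the submitted kstrtouid() but a userspace sketch of the idea described
in the commit message: parse as an unsigned 32-bit value instead of a
signed int, then reject the reserved invalid id. The model_* function names
are made up, base 10 is assumed for simplicity, and the user namespace
mapping step done by make_kuid() is left out.

```c
#include <errno.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/* Old behaviour, roughly: parse into a signed int the way match_int()
 * does, so anything above INT_MAX is rejected. */
static int model_parse_id_int(const char *s, int *out)
{
	char *end;
	long v;

	errno = 0;
	v = strtol(s, &end, 0);
	if (errno || end == s || *end != '\0')
		return -EINVAL;
	if (v < INT_MIN || v > INT_MAX)
		return -ERANGE;
	*out = (int)v;
	return 0;
}

/* New behaviour, roughly: accept the whole uint32_t range except
 * (uid_t)-1, which is reserved as the invalid id. */
static int model_parse_id_u32(const char *s, uint32_t *out)
{
	char *end;
	unsigned long long v;

	/* strtoull() would accept a sign or leading blanks; we do not. */
	if (s[0] < '0' || s[0] > '9')
		return -EINVAL;

	errno = 0;
	v = strtoull(s, &end, 10);
	if (errno || *end != '\0')
		return -EINVAL;
	if (v > UINT32_MAX)
		return -ERANGE;
	if ((uint32_t)v == (uint32_t)-1)
		return -EINVAL;

	*out = (uint32_t)v;
	return 0;
}
```

With the uid from the commit message, 3598450688, the signed variant fails
while the unsigned one succeeds, which is the bug reported in the systemd
issue.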
Re: [PATCH v2] devpts: allow mounting with uid/gid of uint32_t
On 28.08.2015 15:33, Peter Hurley wrote:
> On 08/18/2015 11:18 AM, Dongsu Park wrote:
> > ---
> >  fs/devpts/inode.c | 20
> >  1 file changed, 16 insertions(+), 4 deletions(-)
> >
> > diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
> > index c35ffdc12bba..49272fae40a7 100644
> > --- a/fs/devpts/inode.c
> > +++ b/fs/devpts/inode.c
> > @@ -188,23 +188,35 @@ static int parse_mount_options(char *data, int op, struct pts_mount_opts *opts)
> >  		token = match_token(p, tokens, args);
> >  		switch (token) {
> >  		case Opt_uid:
> > -			if (match_int(&args[0], &option))
>
> match_int() => make_kuid/kgid is a widespread pattern in filesystems
> for handling uid/gid mount parameters.
>
> How about adding a for-purpose string-to-uid/gid function, rather than
> open-coding?

Yeah, that sounds like a good idea.
Do you mean probably something like this? (on top of -mm tree)

Thanks,
Dongsu

From ccfa5db398ba5ac31c5e0128e88abca1f6d1e6f5 Mon Sep 17 00:00:00 2001
Message-Id: ccfa5db398ba5ac31c5e0128e88abca1f6d1e6f5.1440844226.git.dp...@posteo.net
From: Dongsu Park dp...@posteo.net
Date: Sat, 29 Aug 2015 12:35:01 +0200
Subject: [PATCH v3] devpts: allow mounting with uid/gid of uint32_t

To allow devpts to be mounted with options of uid/gid of uint32_t, we need
to make use of general parsing API instead of match_int(). So introduce
kstrto{uid,gid}(), wrappers around kstrtouint() as well as make_k{uid,gid}()
calls. And then make devpts parse options only using kstrto{uid,gid}().

Doing that, mounting devpts with uid or gid > (2^31 - 1) will work as
expected, e.g.:

 # mount -t devpts devpts /tmp/devptsdir -o \
   newinstance,ptmxmode=0666,mode=620,uid=3598450688,gid=3598450693

It was originally reported on the GitHub issue tracker of systemd:
https://github.com/systemd/systemd/issues/956

from v2:
 * rebase on top of -mm tree
 * split common parts for parsing uid/gid into kstrto{uid,gid}()
 * fix minor format.
 * continue to use kstrtouint() suggested by Alexey Dobriyan.

from v1: fix patch format correctly

Cc: Alexey Dobriyan adobri...@gmail.com
Reported-by: Alban Crequy al...@endocode.com
Suggested-by: Peter Hurley pe...@hurleysoftware.com
Signed-off-by: Dongsu Park dp...@posteo.net
---
 fs/devpts/inode.c             | 19 +--
 include/linux/parse-integer.h |  4
 lib/kstrtox.c                 | 40
 3 files changed, 53 insertions(+), 10 deletions(-)

diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
index c35ffdc12bba..fbbd71005dcb 100644
--- a/fs/devpts/inode.c
+++ b/fs/devpts/inode.c
@@ -181,6 +181,7 @@ static int parse_mount_options(char *data, int op, struct pts_mount_opts *opts)
 		substring_t args[MAX_OPT_ARGS];
 		int token;
 		int option;
+		int rc;
 
 		if (!*p)
 			continue;
@@ -188,20 +189,18 @@ static int parse_mount_options(char *data, int op, struct pts_mount_opts *opts)
 		token = match_token(p, tokens, args);
 		switch (token) {
 		case Opt_uid:
-			if (match_int(&args[0], &option))
-				return -EINVAL;
-			uid = make_kuid(current_user_ns(), option);
-			if (!uid_valid(uid))
-				return -EINVAL;
+			rc = kstrtouid(args[0].from, &uid);
+			if (rc)
+				return rc;
+
 			opts->uid = uid;
 			opts->setuid = 1;
 			break;
 		case Opt_gid:
-			if (match_int(&args[0], &option))
-				return -EINVAL;
-			gid = make_kgid(current_user_ns(), option);
-			if (!gid_valid(gid))
-				return -EINVAL;
+			rc = kstrtogid(args[0].from, &gid);
+			if (rc)
+				return rc;
+
 			opts->gid = gid;
 			opts->setgid = 1;
 			break;
diff --git a/include/linux/parse-integer.h b/include/linux/parse-integer.h
index ba620cdf3df6..2cdc4f418e00 100644
--- a/include/linux/parse-integer.h
+++ b/include/linux/parse-integer.h
@@ -2,6 +2,7 @@
 #define _PARSE_INTEGER_H
 #include <linux/compiler.h>
 #include <linux/types.h>
+#include <linux/uidgid.h>
 
 /*
  * int parse_integer(const char *s, unsigned int base, T *val);
@@ -155,6 +156,9 @@ static inline int __must_check kstrtos8(const char *s, unsigned int base, s8 *re
 	return parse_integer(s, base | PARSE_INTEGER_NEWLINE, res);
 }
 
+int __must_check kstrtouid(const char *uidstr, kuid_t *kuid);
+int __must_check kstrtogid(const char *gidstr, kgid_t *kgid);
+
 int __must_check kstrtoull_from_user(const char __user *s, size_t count, unsigned int base, unsigned long long *res);
 int
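The idea of a single for-purpose string-to-uid/gid helper can be tried out in
userspace. What follows is only a minimal sketch under stated assumptions, not
the lib/kstrtox.c code from the patch (which is not shown in full here): the
names my_kuid_t, my_kgid_t, INVALID_ID, str_to_id(), str_to_uid() and
str_to_gid() are hypothetical, the input is assumed to be decimal, and the
all-ones value is assumed to be the "invalid id" sentinel.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumption: the all-ones value marks "no such id", like (uid_t)-1. */
#define INVALID_ID UINT32_MAX

/* Hypothetical stand-ins for kuid_t/kgid_t: distinct wrapper types. */
typedef struct { uint32_t val; } my_kuid_t;
typedef struct { uint32_t val; } my_kgid_t;

/*
 * Shared worker: decimal string -> 32-bit id.
 * Returns 0 on success and a negative errno value on failure.
 */
int str_to_id(const char *s, uint32_t *id)
{
	char *end;
	unsigned long long v;

	assert(s != NULL && id != NULL);

	/* Reject empty strings, signs and leading blanks up front. */
	if (*s < '0' || *s > '9')
		return -EINVAL;

	errno = 0;
	v = strtoull(s, &end, 10);

	/* Trailing garbage, overflow, or the invalid sentinel itself. */
	if (*end != '\0' || errno == ERANGE || v >= INVALID_ID)
		return -EINVAL;

	*id = (uint32_t)v;
	return 0;
}

/* Thin typed wrappers, so callers cannot mix up uids and gids. */
int str_to_uid(const char *s, my_kuid_t *uid)
{
	return str_to_id(s, &uid->val);
}

int str_to_gid(const char *s, my_kgid_t *gid)
{
	return str_to_id(s, &gid->val);
}
```

With such a helper, a caller needs just one call and one error check per mount
option, which is the shape the v3 hunk for fs/devpts/inode.c takes.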
Re: [PATCH v2] devpts: allow mounting with uid/gid of uint32_t
Hi,

thanks for the review.

On 18.08.2015 16:44, Andrew Morton wrote:
> On Tue, 18 Aug 2015 17:18:19 +0200 Dongsu Park wrote:
>
> > To allow devpts to be mounted with options of uid/gid of uint32_t,
> > use kstrtouint() instead of match_int(). Doing that, mounting devpts
> > with uid or gid > (2^31 - 1) will work as expected, e.g.:
> >
> >  # mount -t devpts devpts /tmp/devptsdir -o \
> >    newinstance,ptmxmode=0666,mode=620,uid=3598450688,gid=3598450693
> >
> > It was originally reported on systemd github issues:
> > https://github.com/systemd/systemd/issues/956
> >
> > --- a/fs/devpts/inode.c
> > +++ b/fs/devpts/inode.c
> > @@ -188,23 +188,35 @@ static int parse_mount_options(char *data, int op, struct pts_mount_opts *opts)
> >  		token = match_token(p, tokens, args);
> >  		switch (token) {
> >  		case Opt_uid:
> > -			if (match_int(&args[0], &option))
> > +		{
>
> It might be neater to lay this out as
>
> 	case Opt_uid: {

I'll do it.

> > +			char *uidstr = args[0].from;
> > +			uid_t uidval;
> > +			int rc = kstrtouint(uidstr, 0, &uidval);
>
> This assumes that the architecture/config uses a uint for uid_t. We
> have no business assuming this - it's an opaque type for a reason. It
> would be safer to do
>
> 	unsigned long uidl;
>
> 	rc = kstrtoul(uidstr, 0, &uidl);
> 	uidval = uidl;

That's a good point. I'll do it.

> > +			if (rc)
> >  				return -EINVAL;
>
> I don't get it. From my reading, kstrtouint->parse_integer() returns
> "number of characters parsed or -E". So this code won't work. But
> presumably it *does* work, so why?

It's probably because kstrtouint() returns just 0 on success.
That's what the functions in the call chain
kstrtouint() -> kstrtoull() -> _kstrtoull() -> _parse_integer()
are actually doing. _parse_integer() returns rv, i.e. the number of
characters parsed. But after that, if there's no error, _kstrtoull()
simply returns 0.

> Also, we should probably return `rc' here if it's negative, to
> propagate the error which kstrtouint() detected. That's a minor
> non-back-compatible change but it shouldn't matter.

Okay, I also think that we should return rc. I'll do it.

> otoh, kstrtouint() likes to return -ERANGE when things go wrong.
> ERANGE means "Math result not representable", which is a nonsensical
> error code in this context. Sigh, why do people keep doing this.

Hmm, good to know.

Thanks,
Dongsu

> > -			uid = make_kuid(current_user_ns(), option);
> > +			uid = make_kuid(current_user_ns(), uidval);
> >  			if (!uid_valid(uid))
> >  				return -EINVAL;
> >  			opts->uid = uid;
> >  			opts->setuid = 1;
> >  			break;
> >
> > ...
> >
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
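Two points from this exchange, the signed-int range limit and the "0 on
success, negative errno on failure" convention, can be illustrated with a small
userspace sketch. It is not kernel code: parse_u32() and parse_s31() are
hypothetical names, decimal input is assumed (the patch itself passes base 0),
and the limits are only meant to mimic an unsigned 32-bit parser versus one
restricted to the non-negative range of a signed int, as match_int() is.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for a kstrtouint()-style parser.
 * Parses into a wider type first, then range-checks against the target.
 * Returns 0 on success, -EINVAL for malformed input, -ERANGE on overflow.
 */
int parse_u32(const char *s, uint32_t *res)
{
	char *end;
	unsigned long long v;

	assert(s != NULL && res != NULL);

	/* Reject empty strings, signs and leading blanks. */
	if (*s < '0' || *s > '9')
		return -EINVAL;

	errno = 0;
	v = strtoull(s, &end, 10);

	/* Tolerate a single trailing newline, nothing else. */
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;

	if (errno == ERANGE || v > UINT32_MAX)
		return -ERANGE;

	*res = (uint32_t)v;
	return 0;
}

/* Same parser, but limited to the non-negative range of a signed int. */
int parse_s31(const char *s, int *res)
{
	uint32_t v;
	int rc = parse_u32(s, &v);

	if (rc)
		return rc;	/* propagate the parser's own error code */
	if (v > (uint32_t)INT_MAX)
		return -ERANGE;

	*res = (int)v;
	return 0;
}
```

The uid from the commit message, 3598450688, is accepted by parse_u32() and
rejected by parse_s31(), which is the failure mode the patch addresses. The
caller only ever tests `if (rc)` and returns rc, as agreed in the review.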
[PATCH v2] devpts: allow mounting with uid/gid of uint32_t
To allow devpts to be mounted with options of uid/gid of uint32_t,
use kstrtouint() instead of match_int(). Doing that, mounting devpts
with uid or gid > (2^31 - 1) will work as expected, e.g.:

 # mount -t devpts devpts /tmp/devptsdir -o \
   newinstance,ptmxmode=0666,mode=620,uid=3598450688,gid=3598450693

It was originally reported on systemd github issues:
https://github.com/systemd/systemd/issues/956

from v1: fix patch format correctly

Reported-by: Alban Crequy
Signed-off-by: Dongsu Park
---
 fs/devpts/inode.c | 20
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
index c35ffdc12bba..49272fae40a7 100644
--- a/fs/devpts/inode.c
+++ b/fs/devpts/inode.c
@@ -188,23 +188,35 @@ static int parse_mount_options(char *data, int op, struct pts_mount_opts *opts)
 		token = match_token(p, tokens, args);
 		switch (token) {
 		case Opt_uid:
-			if (match_int(&args[0], &option))
+		{
+			char *uidstr = args[0].from;
+			uid_t uidval;
+			int rc = kstrtouint(uidstr, 0, &uidval);
+
+			if (rc)
 				return -EINVAL;
-			uid = make_kuid(current_user_ns(), option);
+			uid = make_kuid(current_user_ns(), uidval);
 			if (!uid_valid(uid))
 				return -EINVAL;
 			opts->uid = uid;
 			opts->setuid = 1;
 			break;
+		}
 		case Opt_gid:
-			if (match_int(&args[0], &option))
+		{
+			char *gidstr = args[0].from;
+			gid_t gidval;
+			int rc = kstrtouint(gidstr, 0, &gidval);
+
+			if (rc)
 				return -EINVAL;
-			gid = make_kgid(current_user_ns(), option);
+			gid = make_kgid(current_user_ns(), gidval);
 			if (!gid_valid(gid))
 				return -EINVAL;
 			opts->gid = gid;
 			opts->setgid = 1;
 			break;
+		}
 		case Opt_mode:
 			if (match_octal(&args[0], &option))
 				return -EINVAL;
-- 
2.1.0
[PATCH] devpts: allow mounting with uid/gid of uint32_t
To allow devpts to be mounted with options of uid/gid of uint32_t,
use kstrtouint() instead of match_int(). Doing that, mounting devpts
with uid or gid > (2^31 - 1) will work as expected, e.g.:

 # mount -t devpts devpts /tmp/devptsdir -o \
   newinstance,ptmxmode=0666,mode=620,uid=3598450688,gid=3598450693

It was originally reported on systemd github issues:
https://github.com/systemd/systemd/issues/956

Reported-by: Alban Crequy
Signed-off-by: Dongsu Park
---
 fs/devpts/inode.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
index c35ffdc12bba..83c3e7368f38 100644
--- a/fs/devpts/inode.c
+++ b/fs/devpts/inode.c
@@ -188,23 +188,33 @@ static int parse_mount_options(char *data, int op, struct pts_mount_opts *opts)
 		token = match_token(p, tokens, args);
 		switch (token) {
 		case Opt_uid:
-			if (match_int(&args[0], &option))
+		{
+			char *uidstr = args[0].from;
+			uid_t uidval;
+			int rc = kstrtouint(uidstr, 0, &uidval);
+			if (rc)
 				return -EINVAL;
-			uid = make_kuid(current_user_ns(), option);
+			uid = make_kuid(current_user_ns(), uidval);
 			if (!uid_valid(uid))
 				return -EINVAL;
 			opts->uid = uid;
 			opts->setuid = 1;
 			break;
+}
 		case Opt_gid:
-			if (match_int(&args[0], &option))
+{
+			char *gidstr = args[0].from;
+			gid_t gidval;
+			int rc = kstrtouint(gidstr, 0, &gidval);
+			if (rc)
 				return -EINVAL;
-			gid = make_kgid(current_user_ns(), option);
+			gid = make_kgid(current_user_ns(), gidval);
 			if (!gid_valid(gid))
 				return -EINVAL;
 			opts->gid = gid;
 			opts->setgid = 1;
 			break;
+}
 		case Opt_mode:
 			if (match_octal(&args[0], &option))
 				return -EINVAL;
-- 
2.1.0
Re: panic with CPU hotplug + blk-mq + scsi-mq
On 21.04.2015 00:48, Ming Lei wrote:
> Thanks for providing that.
>
> The trick is just in CPU number and virtio-scsi hw queue number,
> and that is why I asked that, :-)
>
> Now the problem is quite clear, before CPU1 online, suppose
> CPU3 is mapped hw queue 6, and CPU 3 will map to hw queue 5
> after CPU1 is offline, unfortunately current code can't allocate
> tags for hw queue 5 even it becomes mapped.
>
> The following updated patch(include original patch 2) will fix
> the problem, and patch 1 is required too.
>
> So the following patch should fix your hotplug issue.

Yes, it works indeed. Thanks a lot! :-)
You can add:

Tested-by: Dongsu Park

As the original patch didn't apply, I had to change some nitpicks though.
(see below)

Cheers,
Dongsu

From 8c0edcbbdfbab67dc8ae2fd46cca6a86e0cadcba Mon Sep 17 00:00:00 2001
From: Ming Lei
Date: Sun, 19 Apr 2015 23:32:46 +0800
Subject: [PATCH v1 2/2] blk-mq: fix CPU hotplug handling

Firstly the hctx->tags have to be set as NULL if it is to be disabled
no matter if set->tags[i] is NULL or not in blk_mq_map_swqueue() because
shared tags can be freed already from another request queue.

The same situation has to be considered in blk_mq_hctx_cpu_online()
too.

Finally one unmapped hw queue can be remapped after CPU topo is changed,
we need to allocate tags for the hw queue in blk_mq_map_swqueue() too.
Then tags allocation for hw queue can be removed in hctx cpu online
notifier, and it is reasonable to do that after remapping is done.

Cc:
Reported-by: Dongsu Park
Signed-off-by: Ming Lei
---
 block/blk-mq.c | 34 +-
 1 file changed, 13 insertions(+), 21 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 078840ce8670..df4b9597e477 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1573,22 +1573,6 @@ static int blk_mq_hctx_cpu_offline(struct blk_mq_hw_ctx *hctx, int cpu)
 	return NOTIFY_OK;
 }
 
-static int blk_mq_hctx_cpu_online(struct blk_mq_hw_ctx *hctx, int cpu)
-{
-	struct request_queue *q = hctx->queue;
-	struct blk_mq_tag_set *set = q->tag_set;
-
-	if (set->tags[hctx->queue_num])
-		return NOTIFY_OK;
-
-	set->tags[hctx->queue_num] = blk_mq_init_rq_map(set, hctx->queue_num);
-	if (!set->tags[hctx->queue_num])
-		return NOTIFY_STOP;
-
-	hctx->tags = set->tags[hctx->queue_num];
-	return NOTIFY_OK;
-}
-
 static int blk_mq_hctx_notify(void *data, unsigned long action,
 			      unsigned int cpu)
 {
@@ -1596,8 +1580,11 @@ static int blk_mq_hctx_notify(void *data, unsigned long action,
 
 	if (action == CPU_DEAD || action == CPU_DEAD_FROZEN)
 		return blk_mq_hctx_cpu_offline(hctx, cpu);
-	else if (action == CPU_ONLINE || action == CPU_ONLINE_FROZEN)
-		return blk_mq_hctx_cpu_online(hctx, cpu);
+
+	/*
+	 * In case of CPU online, tags will be reallocated
+	 * after new mapping is done in blk_mq_map_swqueue().
+	 */
 
 	return NOTIFY_OK;
 }
@@ -1779,6 +1766,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 	unsigned int i;
 	struct blk_mq_hw_ctx *hctx;
 	struct blk_mq_ctx *ctx;
+	struct blk_mq_tag_set *set = q->tag_set;
 
 	queue_for_each_hw_ctx(q, hctx, i) {
 		cpumask_clear(hctx->cpumask);
@@ -1805,16 +1793,20 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 		 * disable it and free the request entries.
 		 */
 		if (!hctx->nr_ctx) {
-			struct blk_mq_tag_set *set = q->tag_set;
-
 			if (set->tags[i]) {
 				blk_mq_free_rq_map(set, set->tags[i], i);
 				set->tags[i] = NULL;
-				hctx->tags = NULL;
 			}
+			hctx->tags = NULL;
 			continue;
 		}
 
+		/* unmapped hw queue can be remapped after CPU topo changed */
+		if (!set->tags[i])
+			set->tags[i] = blk_mq_init_rq_map(set, hctx->queue_num);
+		hctx->tags = set->tags[i];
+		WARN_ON(!hctx->tags);
+
 		/*
 		 * Initialize batch roundrobin counts
 		 */
-- 
2.1.0
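The remapping rule that this patch moves into blk_mq_map_swqueue() can be
modelled in a few lines of userspace C. This is a toy sketch under stated
assumptions and not blk-mq code: struct toy_hctx, struct toy_set,
NR_HW_QUEUES and toy_map_swqueue() are hypothetical names, a plain malloc()
stands in for blk_mq_init_rq_map(), and free() stands in for
blk_mq_free_rq_map().

```c
#include <assert.h>
#include <stdlib.h>

#define NR_HW_QUEUES 8	/* assumption: matches num_queues=8 in the report */

struct toy_hctx {
	int   nr_ctx;	/* number of online CPUs mapped to this hw queue */
	void *tags;	/* what the request allocation path dereferences */
};

struct toy_set {
	void *tags[NR_HW_QUEUES];	/* tag maps owned by the shared tag set */
};

/* Re-evaluate every hw queue after the CPU -> hw queue mapping changed. */
void toy_map_swqueue(struct toy_set *set, struct toy_hctx *hctx)
{
	assert(set != NULL && hctx != NULL);

	for (int i = 0; i < NR_HW_QUEUES; i++) {
		if (hctx[i].nr_ctx == 0) {
			/*
			 * Unmapped queue: release the tags if they still
			 * exist, and clear the hctx pointer in any case.
			 */
			free(set->tags[i]);
			set->tags[i] = NULL;
			hctx[i].tags = NULL;
			continue;
		}

		/* Mapped, possibly only since this remap: ensure tags exist. */
		if (set->tags[i] == NULL)
			set->tags[i] = malloc(64);
		hctx[i].tags = set->tags[i];
	}
}
```

In this model, moving the only mapped CPU from queue 6 to queue 5 and calling
toy_map_swqueue() again leaves queue 5 with valid tags and queue 6 with a NULL
pointer, which is the situation Ming describes for CPU3 after CPU1 goes
offline.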
Re: panic with CPU hotplug + blk-mq + scsi-mq
On 20.04.2015 21:12, Ming Lei wrote:
> On Mon, Apr 20, 2015 at 4:07 PM, Dongsu Park wrote:
> > Hi Ming,
> >
> > On 18.04.2015 00:23, Ming Lei wrote:
> >> > Does anyone have an idea?
> >>
> >> As far as I can see, at least two problems exist:
> >> - race between timeout and CPU hotplug
> >> - in case of shared tags, during CPU online handling, about setting
> >>   and checking hctx->tags
> >>
> >> So could you please test the attached two patches to see if they fix your
> >> issue?
> >> I run them in my VM, and looks oops does disappear.
> >
> > Thanks for the patches.
> > But it still panics also with your patches, both v1 and v2.
> > I tested it multiple times, and hit the bug every time.
>
> Could you share us what the exact test you are running?
> Such as, CPU numbers, virtio-scsi hw queue number, and
> multi-lun or not, and your workload if it is specific.

It would be probably helpful to just share my Qemu command line:

/usr/bin/qemu-system-x86_64 -M pc -cpu host -enable-kvm -m 2048 \
 -smp 4,cores=1,maxcpus=4,threads=1 \
 -object memory-backend-ram,size=1024M,id=ram-node0 \
 -numa node,nodeid=0,cpus=0-1,memdev=ram-node0 \
 -object memory-backend-ram,size=1024M,id=ram-node1 \
 -numa node,nodeid=1,cpus=2-3,memdev=ram-node1 \
 -serial stdio -name vm-0fa2eb90-51f3-4b65-aa72-97cea3ead7bf \
 -uuid 0fa2eb90-51f3-4b65-aa72-97cea3ead7bf \
 -monitor telnet:0.0.0.0:9400,server,nowait \
 -rtc base=utc -boot menu=off,order=c -L /usr/share/qemu \
 -device virtio-scsi-pci,id=scsi0,num_queues=8,bus=pci.0,addr=0x7 \
 -drive file=./mydebian2.qcow2,if=none,id=drive-virtio-disk0,aio=native,cache=writeback \
 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x9,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
 -drive file=./tfile00.img,if=none,id=drive-scsi0-0-0-0,aio=native \
 -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0 \
 -drive file=./tfile01.img,if=none,id=drive-scsi0-0-0-1,aio=native \
 -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=1,drive=drive-scsi0-0-0-1,id=scsi0-0-0-1 \
 -k en-us -vga cirrus -netdev user,id=vnet0,net=192.168.122.0/24 \
 -net nic,vlan=0,model=virtio,macaddr=52:54:00:5b:d7:00 \
 -net tap,vlan=0,ifname=dntap0,vhost=on,script=no,downscript=no \
 -vnc 0.0.0.0:1 -virtfs local,path=/Dev,mount_tag=homedev,security_model=none

(where each of tfile0[01].img is a 16-GiB image)

And there's nothing special about the workload. Inside the guest, I go to
a 9pfs-mounted directory, where the kernel source is available.
When I just do 'make install', the guest immediately crashes.
That's the simplest way to make it crash.

Dongsu

> I can not reproduce it in my VM.
>
> One interesting point is that the oops always happened
> on CPU3 in your tests, looks like the mapping is broken
> for CPU3's ctx in case of CPU 1 offline?
>
> > Cheers,
> > Dongsu
> >
> > [beginning of call traces]
> > [ 22.942214] smpboot: CPU 1 is now offline
> > [ 30.686284] random: nonblocking pool is initialized
> > [ 39.857305] fuse init (API version 7.23)
> > [ 40.563853] BUG: unable to handle kernel NULL pointer dereference at 0018
> > [ 40.564005] IP: [] __bt_get.isra.5+0x7d/0x1e0
> > [ 40.564005] PGD 7a363067 PUD 7cadc067 PMD 0
> > [ 40.564005] Oops: [#1] SMP
> > [ 40.564005] Modules linked in: fuse cpufreq_stats binfmt_misc 9p fscache dm_round_robin dm_multipath loop rtc_cmos 9pnet_virtio 9pnet serio_raw acpi_cpufreq i2c_piix4 virtio_net
> > [ 40.564005] CPU: 3 PID: 6349 Comm: grub-mount Not tainted 4.0.0+ #320
> > [ 40.564005] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153950- 04/01/2014
> > [ 40.564005] task: 880079011560 ti: 88007a1c8000 task.ti: 88007a1c8000
> > [ 40.564005] RIP: 0010:[] [] __bt_get.isra.5+0x7d/0x1e0
> > [ 40.564005] RSP: 0018:88007a1cb838 EFLAGS: 00010246
> > [ 40.564005] RAX: 0075 RBX: 88007913c400 RCX: 0078
> > [ 40.564005] RDX: 88007fddbb80 RSI: 0010 RDI: 88007913c400
> > [ 40.564005] RBP: 88007a1cb888 R08: 88007fddbb80 R09: 0001
> > [ 40.564005] R10: R11: 0001 R12: 0010
> > [ 40.564005] R13: 0010 R14: 88007a1cb988 R15: 88007fddbb80
> > [ 40.564005] FS: 2b7c8b6807c0() GS:88007fc0() knlGS:
> > [ 40.564005] CS: 0010 DS: ES: CR0: 80050033
> > [ 40.564005] CR2: 0018 CR3:
Re: panic with CPU hotplug + blk-mq + scsi-mq
q_flags+0x8e/0x100
> > [ 47.816324] [] scsi_test_unit_ready+0x83/0x130
> > [ 47.816324] [] sd_check_events+0x14e/0x1b0
> > [ 47.816324] [] disk_check_events+0x51/0x170
> > [ 47.816324] [] disk_events_workfn+0x1c/0x20
> > [ 47.816324] [] process_one_work+0x1e8/0x800
> > [ 47.816324] [] ? process_one_work+0x15d/0x800
> > [ 47.816324] [] ? worker_thread+0xda/0x470
> > [ 47.816324] [] worker_thread+0x53/0x470
> > [ 47.816324] [] ? process_one_work+0x800/0x800
> > [ 47.816324] [] ? process_one_work+0x800/0x800
> > [ 47.816324] [] kthread+0xf2/0x110
> > [ 47.816324] [] ? trace_hardirqs_on+0xd/0x10
> > [ 47.816324] [] ? kthread_create_on_node+0x230/0x230
> > [ 47.816324] [] ret_from_fork+0x58/0x90
> > [ 47.816324] [] ? kthread_create_on_node+0x230/0x230
> > [ 47.816324] Code: 00 48 89 e5 5d 48 8b 40 88 48 c1 e8 02 83 e0 01 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 87 20 04 00 00 55 48 89 e5 <48> 8b 40 98 5d c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
> > [ 47.816324] RIP [] kthread_data+0x10/0x20
> > [ 47.816324] RSP
> > [ 47.816324] CR2: ff98
> > [ 47.816324] ---[ end trace 9a650b674f0fae76 ]---
> > [ 47.816324] Fixing recursive fault but reboot is needed!
> > [end of call traces]

> From 9aed1bd79531d91513cd16ed90872e4349425acc Mon Sep 17 00:00:00 2001
> From: Ming Lei
> Date: Fri, 17 Apr 2015 23:50:48 -0400
> Subject: [PATCH 1/2] block: blk-mq: fix race between timeout and CPU hotplug
>
> Firstly during CPU hotplug, even queue is freezed, timeout
> handler still may come and access hctx->tags, which may cause
> use after free, so this patch deactivates timeout handler
> inside CPU hotplug notifier.
>
> Secondly, tags can be shared by more than one queues, so we
> have to check if the hctx has been disabled, otherwise
> still use-after-free on tags can be triggered.
>
> Cc:
> Reported-by: Dongsu Park
> Signed-off-by: Ming Lei
> ---
>  block/blk-mq.c | 13 ++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 67f01a0..58a3b4c 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -677,8 +677,11 @@ static void blk_mq_rq_timer(unsigned long priv)
>  		data.next = blk_rq_timeout(round_jiffies_up(data.next));
>  		mod_timer(&q->timeout, data.next);
>  	} else {
> -		queue_for_each_hw_ctx(q, hctx, i)
> -			blk_mq_tag_idle(hctx);
> +		queue_for_each_hw_ctx(q, hctx, i) {
> +			/* the hctx may be disabled, so we have to check here */
> +			if (hctx->tags)
> +				blk_mq_tag_idle(hctx);
> +		}
>  	}
>  }
>
> @@ -2085,9 +2088,13 @@ static int blk_mq_queue_reinit_notify(struct notifier_block *nb,
>  	 */
>  	list_for_each_entry(q, &all_q_list, all_q_node)
>  		blk_mq_freeze_queue_start(q);
> -	list_for_each_entry(q, &all_q_list, all_q_node)
> +	list_for_each_entry(q, &all_q_list, all_q_node) {
>  		blk_mq_freeze_queue_wait(q);
>
> +		/* deactivate timeout handler */
> +		del_timer_sync(&q->timeout);
> +	}
> +
>  	list_for_each_entry(q, &all_q_list, all_q_node)
>  		blk_mq_queue_reinit(q);
>
> --
> 1.9.1
>
> From 8b70c8612543859173230fbd16a63bacf84ba23a Mon Sep 17 00:00:00 2001
> From: Ming Lei
> Date: Sat, 18 Apr 2015 00:01:31 -0400
> Subject: [PATCH 2/2] blk-mq: fix CPU hotplug handling
>
> Firstly the hctx->tags have to be set as NULL if it is to be disabled
> no matter if set->tags[i] is NULL or not in blk_mq_map_swqueue() because
> shared tags can be freed already from another request_queue.
>
> The same situation has to be considered in blk_mq_hctx_cpu_online()
> too.
>
> Cc:
> Reported-by: Dongsu Park
> Signed-off-by: Ming Lei
> ---
>  block/blk-mq.c | 17 +++--
>  1 file changed, 11 insertions(+), 6 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 58a3b4c..612d5c6 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1580,15 +1580,20 @@ static int blk_mq_hctx_cpu_online(struct blk_mq_hw_ctx *hctx, int cpu)
>  {
>  	struct request_queue *q = hctx->queue;
>  	s
Re: panic with CPU hotplug + blk-mq + scsi-mq
] wq_worker_sleeping+0x15/0xa0
> > [ 47.816324] [816ff757] __schedule+0xa77/0x1080
> > [ 47.816324] [8107cfc6] ? do_exit+0x756/0xbf0
> > [ 47.816324] [8107cffa] ? do_exit+0x78a/0xbf0
> > [ 47.816324] [816ffd97] schedule+0x37/0x90
> > [ 47.816324] [8107d0d6] do_exit+0x866/0xbf0
> > [ 47.816324] [810ec14e] ? kmsg_dump+0xfe/0x200
> > [ 47.816324] [810068ad] oops_end+0x8d/0xd0
> > [ 47.816324] [81047849] no_context+0x119/0x370
> > [ 47.816324] [810ce795] ? cpuacct_charge+0x5/0x1c0
> > [ 47.816324] [810b4a25] ? sched_clock_local+0x25/0x90
> > [ 47.816324] [81047b25] __bad_area_nosemaphore+0x85/0x210
> > [ 47.816324] [81047cc3] bad_area_nosemaphore+0x13/0x20
> > [ 47.816324] [81047fb6] __do_page_fault+0xb6/0x490
> > [ 47.816324] [8104839c] do_page_fault+0xc/0x10
> > [ 47.816324] [817080c2] page_fault+0x22/0x30
> > [ 47.816324] [8140b31d] ? __bt_get.isra.5+0x7d/0x1e0
> > [ 47.816324] [8140b4e5] bt_get+0x65/0x1e0
> > [ 47.816324] [810c9b40] ? wait_woken+0xa0/0xa0
> > [ 47.816324] [8140ba07] blk_mq_get_tag+0xa7/0xd0
> > [ 47.816324] [8140630b] __blk_mq_alloc_request+0x1b/0x200
> > [ 47.816324] [81407f91] blk_mq_alloc_request+0xa1/0x250
> > [ 47.816324] [813fc74c] blk_get_request+0x2c/0xf0
> > [ 47.816324] [810a6acd] ? __might_sleep+0x4d/0x90
> > [ 47.816324] [815747dd] scsi_execute+0x3d/0x1f0
> > [ 47.816324] [815763be] scsi_execute_req_flags+0x8e/0x100
> > [ 47.816324] [81576a43] scsi_test_unit_ready+0x83/0x130
> > [ 47.816324] [8158672e] sd_check_events+0x14e/0x1b0
> > [ 47.816324] [8140e731] disk_check_events+0x51/0x170
> > [ 47.816324] [8140e86c] disk_events_workfn+0x1c/0x20
> > [ 47.816324] [81099128] process_one_work+0x1e8/0x800
> > [ 47.816324] [8109909d] ? process_one_work+0x15d/0x800
> > [ 47.816324] [8109981a] ? worker_thread+0xda/0x470
> > [ 47.816324] [81099793] worker_thread+0x53/0x470
> > [ 47.816324] [81099740] ? process_one_work+0x800/0x800
> > [ 47.816324] [81099740] ? process_one_work+0x800/0x800
> > [ 47.816324] [8109f652] kthread+0xf2/0x110
> > [ 47.816324] [810d3d4d] ? trace_hardirqs_on+0xd/0x10
> > [ 47.816324] [8109f560] ? kthread_create_on_node+0x230/0x230
> > [ 47.816324] [81706308] ret_from_fork+0x58/0x90
> > [ 47.816324] [8109f560] ? kthread_create_on_node+0x230/0x230
> > [ 47.816324] Code: 00 48 89 e5 5d 48 8b 40 88 48 c1 e8 02 83 e0 01 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 87 20 04 00 00 55 48 89 e5 <48> 8b 40 98 5d c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
> > [ 47.816324] RIP [810a00d0] kthread_data+0x10/0x20
> > [ 47.816324] RSP 88007906f5e8
> > [ 47.816324] CR2: ff98
> > [ 47.816324] ---[ end trace 9a650b674f0fae76 ]---
> > [ 47.816324] Fixing recursive fault but reboot is needed!
> > [end of call traces]

> From 9aed1bd79531d91513cd16ed90872e4349425acc Mon Sep 17 00:00:00 2001
> From: Ming Lei ming@canonical.com
> Date: Fri, 17 Apr 2015 23:50:48 -0400
> Subject: [PATCH 1/2] block: blk-mq: fix race between timeout and CPU hotplug
>
> Firstly during CPU hotplug, even queue is freezed, timeout
> handler still may come and access hctx->tags, which may cause
> use after free, so this patch deactivates timeout handler
> inside CPU hotplug notifier.
>
> Secondly, tags can be shared by more than one queues, so we
> have to check if the hctx has been disabled, otherwise
> still use-after-free on tags can be triggered.
>
> Cc: sta...@vger.kernel.org
> Reported-by: Dongsu Park dongsu.p...@profitbricks.com
> Signed-off-by: Ming Lei ming@canonical.com
> ---
>  block/blk-mq.c | 13 ++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 67f01a0..58a3b4c 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -677,8 +677,11 @@ static void blk_mq_rq_timer(unsigned long priv)
>  		data.next = blk_rq_timeout(round_jiffies_up(data.next));
>  		mod_timer(&q->timeout, data.next);
>  	} else {
> -		queue_for_each_hw_ctx(q, hctx, i)
> -			blk_mq_tag_idle(hctx);
> +		queue_for_each_hw_ctx(q, hctx, i) {
> +			/* the hctx may be disabled, so we have to check here */
> +			if (hctx->tags)
> +				blk_mq_tag_idle(hctx);
> +		}
>  	}
>  }
>
> @@ -2085,9 +2088,13 @@ static int blk_mq_queue_reinit_notify(struct notifier_block *nb,
>  	 */
>  	list_for_each_entry(q, &all_q_list
Re: panic with CPU hotplug + blk-mq + scsi-mq
On 20.04.2015 21:12, Ming Lei wrote: On Mon, Apr 20, 2015 at 4:07 PM, Dongsu Park dongsu.p...@profitbricks.com wrote: Hi Ming, On 18.04.2015 00:23, Ming Lei wrote: Does anyone have an idea? As far as I can see, at least two problems exist: - race between timeout and CPU hotplug - in case of shared tags, during CPU online handling, about setting and checking hctx-tags So could you please test the attached two patches to see if they fix your issue? I run them in my VM, and looks opps does disappear. Thanks for the patches. But it still panics also with your patches, both v1 and v2. I tested it multiple times, and hit the bug every time. Could you share us what the exact test you are running? Such as, CPU numbers, virtio-scsi hw queue number, and multi-lun or not, and your workload if it is specific. It would be probably helpful to just share my Qemu command line: /usr/bin/qemu-system-x86_64 -M pc -cpu host -enable-kvm -m 2048 \ -smp 4,cores=1,maxcpus=4,threads=1 \ -object memory-backend-ram,size=1024M,id=ram-node0 \ -numa node,nodeid=0,cpus=0-1,memdev=ram-node0 \ -object memory-backend-ram,size=1024M,id=ram-node1 \ -numa node,nodeid=1,cpus=2-3,memdev=ram-node1 \ -serial stdio -name vm-0fa2eb90-51f3-4b65-aa72-97cea3ead7bf \ -uuid 0fa2eb90-51f3-4b65-aa72-97cea3ead7bf \ -monitor telnet:0.0.0.0:9400,server,nowait \ -rtc base=utc -boot menu=off,order=c -L /usr/share/qemu \ -device virtio-scsi-pci,id=scsi0,num_queues=8,bus=pci.0,addr=0x7 \ -drive file=./mydebian2.qcow2,if=none,id=drive-virtio-disk0,aio=native,cache=writeback \ -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x9,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \ -drive file=./tfile00.img,if=none,id=drive-scsi0-0-0-0,aio=native \ -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0 \ -drive file=./tfile01.img,if=none,id=drive-scsi0-0-0-1,aio=native \ -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=1,drive=drive-scsi0-0-0-1,id=scsi0-0-0-1 \ -k en-us -vga cirrus 
-netdev user,id=vnet0,net=192.168.122.0/24 \ -net nic,vlan=0,model=virtio,macaddr=52:54:00:5b:d7:00 \ -net tap,vlan=0,ifname=dntap0,vhost=on,script=no,downscript=no \ -vnc 0.0.0.0:1 -virtfs local,path=/Dev,mount_tag=homedev,security_model=none (where each of tfile0[01].img is 16-GiB image) And there's nothing special about workload. Inside the guest, I go to a 9pfs-mounted directory, where kernel source is available. When I just do 'make install', then the guest immediately crashes. That's the simplest way to make it crash. Dongsu I can not reproduce it in my VM. One interesting point is that the oops always happened on CPU3 in your tests, looks like the mapping is broken for CPU3's ctx in case of CPU 1 offline? Cheers, Dongsu [beginning of call traces] [ 22.942214] smpboot: CPU 1 is now offline [ 30.686284] random: nonblocking pool is initialized [ 39.857305] fuse init (API version 7.23) [ 40.563853] BUG: unable to handle kernel NULL pointer dereference at 0018 [ 40.564005] IP: [813b905d] __bt_get.isra.5+0x7d/0x1e0 [ 40.564005] PGD 7a363067 PUD 7cadc067 PMD 0 [ 40.564005] Oops: [#1] SMP [ 40.564005] Modules linked in: fuse cpufreq_stats binfmt_misc 9p fscache dm_round_robin dm_multipath loop r tc_cmos 9pnet_virtio 9pnet serio_raw acpi_cpufreq i2c_piix4 virtio_net [ 40.564005] CPU: 3 PID: 6349 Comm: grub-mount Not tainted 4.0.0+ #320 [ 40.564005] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153950- 04/01/2014 [ 40.564005] task: 880079011560 ti: 88007a1c8000 task.ti: 88007a1c8000 [ 40.564005] RIP: 0010:[813b905d] [813b905d] __bt_get.isra.5+0x7d/0x1e0 [ 40.564005] RSP: 0018:88007a1cb838 EFLAGS: 00010246 [ 40.564005] RAX: 0075 RBX: 88007913c400 RCX: 0078 [ 40.564005] RDX: 88007fddbb80 RSI: 0010 RDI: 88007913c400 [ 40.564005] RBP: 88007a1cb888 R08: 88007fddbb80 R09: 0001 [ 40.564005] R10: R11: 0001 R12: 0010 [ 40.564005] R13: 0010 R14: 88007a1cb988 R15: 88007fddbb80 [ 40.564005] FS: 2b7c8b6807c0() GS:88007fc0() knlGS: [ 40.564005] CS: 0010 
DS: ES: CR0: 80050033 [ 40.564005] CR2: 0018 CR3: 79b0b000 CR4: 001407e0 [ 40.564005] Stack: [ 40.564005] 88007a1cb918 88007fdd58c0 0078 813b5d28 [ 40.564005] 88007a1cb878 88007913c400 0010 0010 [ 40.564005] 88007a1cb988 88007fddbb80 88007a1cb908 813b9225 [ 40.564005] Call Trace: [ 40.564005] [813b5d28] ? blk_mq_queue_enter+0x98/0x2b0 [ 40.564005] [813b9225] bt_get
Re: panic with CPU hotplug + blk-mq + scsi-mq
On 21.04.2015 00:48, Ming Lei wrote:
> Thanks for providing that. The trick is just in CPU number and
> virtio-scsi hw queue number, and that is why I asked that, :-)
>
> Now the problem is quite clear: while CPU1 is online, suppose CPU3 is
> mapped to hw queue 6; CPU3 will map to hw queue 5 after CPU1 is
> offline. Unfortunately the current code can't allocate tags for hw
> queue 5 even though it becomes mapped.
>
> The following updated patch (which includes the original patch 2) will
> fix the problem, and patch 1 is required too.
>
> So the following patch should fix your hotplug issue.

Yes, it works indeed. Thanks a lot! :-)
You can add:

Tested-by: Dongsu Park dongsu.p...@profitbricks.com

As the original patch didn't apply, I had to change some nitpicks though.
(see below)

Cheers,
Dongsu

From 8c0edcbbdfbab67dc8ae2fd46cca6a86e0cadcba Mon Sep 17 00:00:00 2001
From: Ming Lei ming@canonical.com
Date: Sun, 19 Apr 2015 23:32:46 +0800
Subject: [PATCH v1 2/2] blk-mq: fix CPU hotplug handling

Firstly, hctx->tags has to be set to NULL if the hctx is to be disabled,
no matter whether set->tags[i] is NULL or not in blk_mq_map_swqueue(),
because shared tags can already have been freed from another request
queue.

The same situation has to be considered in blk_mq_hctx_cpu_online() too.

Finally, an unmapped hw queue can be remapped after the CPU topology is
changed, so we need to allocate tags for the hw queue in
blk_mq_map_swqueue() too. Then tags allocation for the hw queue can be
removed from the hctx cpu online notifier, and it is reasonable to do
that after remapping is done.

Cc: sta...@vger.kernel.org
Reported-by: Dongsu Park dongsu.p...@profitbricks.com
Signed-off-by: Ming Lei ming@canonical.com
---
 block/blk-mq.c | 34 +++++++++++++---------------------
 1 file changed, 13 insertions(+), 21 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 078840ce8670..df4b9597e477 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1573,22 +1573,6 @@ static int blk_mq_hctx_cpu_offline(struct blk_mq_hw_ctx *hctx, int cpu)
 	return NOTIFY_OK;
 }
 
-static int blk_mq_hctx_cpu_online(struct blk_mq_hw_ctx *hctx, int cpu)
-{
-	struct request_queue *q = hctx->queue;
-	struct blk_mq_tag_set *set = q->tag_set;
-
-	if (set->tags[hctx->queue_num])
-		return NOTIFY_OK;
-
-	set->tags[hctx->queue_num] = blk_mq_init_rq_map(set, hctx->queue_num);
-	if (!set->tags[hctx->queue_num])
-		return NOTIFY_STOP;
-
-	hctx->tags = set->tags[hctx->queue_num];
-	return NOTIFY_OK;
-}
-
 static int blk_mq_hctx_notify(void *data, unsigned long action,
 			      unsigned int cpu)
 {
@@ -1596,8 +1580,11 @@ static int blk_mq_hctx_notify(void *data, unsigned long action,
 
 	if (action == CPU_DEAD || action == CPU_DEAD_FROZEN)
 		return blk_mq_hctx_cpu_offline(hctx, cpu);
-	else if (action == CPU_ONLINE || action == CPU_ONLINE_FROZEN)
-		return blk_mq_hctx_cpu_online(hctx, cpu);
+
+	/*
+	 * In case of CPU online, tags will be reallocated
+	 * after new mapping is done in blk_mq_map_swqueue().
+	 */
 
 	return NOTIFY_OK;
 }
@@ -1779,6 +1766,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 	unsigned int i;
 	struct blk_mq_hw_ctx *hctx;
 	struct blk_mq_ctx *ctx;
+	struct blk_mq_tag_set *set = q->tag_set;
 
 	queue_for_each_hw_ctx(q, hctx, i) {
 		cpumask_clear(hctx->cpumask);
@@ -1805,16 +1793,20 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 		 * disable it and free the request entries.
 		 */
 		if (!hctx->nr_ctx) {
-			struct blk_mq_tag_set *set = q->tag_set;
-
 			if (set->tags[i]) {
 				blk_mq_free_rq_map(set, set->tags[i], i);
 				set->tags[i] = NULL;
-				hctx->tags = NULL;
 			}
+			hctx->tags = NULL;
 			continue;
 		}
 
+		/* unmapped hw queue can be remapped after CPU topo changed */
+		if (!set->tags[i])
+			set->tags[i] = blk_mq_init_rq_map(set, hctx->queue_num);
+		hctx->tags = set->tags[i];
+		WARN_ON(!hctx->tags);
+
 		/*
 		 * Initialize batch roundrobin counts
 		 */
-- 
2.1.0
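The remapping problem Ming describes (CPU3 moving from hw queue 6 to hw queue 5 once CPU1 goes offline) can be illustrated with a small user-space model in C. This is only a sketch, not kernel code: the `toy_*` names, the `tag_storage` array and the spreading rule "n-th online CPU goes to queue n * NR_QUEUES / nr_online" are assumptions chosen so that the numbers match the 4-CPU, 8-queue setup from this thread. The `alloc_on_remap` flag stands in for the behaviour the patch adds to blk_mq_map_swqueue().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS   4	/* matches -smp 4 in the reported setup */
#define NR_QUEUES 8	/* matches num_queues=8 on virtio-scsi */

/* Toy stand-in for a hardware queue: tags == NULL means "no tag map". */
struct toy_hctx {
	int nr_ctx;		/* number of online CPUs mapped to this queue */
	const int *tags;	/* points into tag_storage[] when allocated */
};

static int tag_storage[NR_QUEUES];	/* dummy backing store for tag maps */

/*
 * Simplified spreading rule (an assumption for this sketch): the n-th
 * online CPU out of nr_online is mapped to queue n * NR_QUEUES / nr_online.
 */
static int toy_cpu_to_queue(const bool online[NR_CPUS], int cpu)
{
	int nr_online = 0, index = 0;

	assert(cpu >= 0 && cpu < NR_CPUS && online[cpu]);
	for (int i = 0; i < NR_CPUS; i++) {
		if (!online[i])
			continue;
		if (i < cpu)
			index++;
		nr_online++;
	}
	return index * NR_QUEUES / nr_online;
}

/*
 * Rebuild the CPU-to-queue mapping. Unmapped queues always lose their
 * tags. A queue that becomes mapped only gets tags when alloc_on_remap
 * is true; with false it stays at NULL, which models the reported bug.
 */
static void toy_map_swqueue(struct toy_hctx hctx[NR_QUEUES],
			    const bool online[NR_CPUS], bool alloc_on_remap)
{
	for (int q = 0; q < NR_QUEUES; q++)
		hctx[q].nr_ctx = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (online[cpu])
			hctx[toy_cpu_to_queue(online, cpu)].nr_ctx++;

	for (int q = 0; q < NR_QUEUES; q++) {
		if (!hctx[q].nr_ctx) {
			hctx[q].tags = NULL;	/* unmapped: drop the tag map */
			continue;
		}
		if (!hctx[q].tags && alloc_on_remap)
			hctx[q].tags = &tag_storage[q];
	}
}
```

In this model, calling toy_map_swqueue() with all four CPUs online gives queue 6 a tag map and leaves queue 5 without one. Marking CPU1 offline and remapping with alloc_on_remap set to false leaves CPU3 pointing at queue 5 with tags still NULL, which is the situation that ends in the NULL pointer dereference in __bt_get(). Remapping with alloc_on_remap set to true gives queue 5 a tag map.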
panic with CPU hotplug + blk-mq + scsi-mq
Hi,

there's a critical bug regarding CPU hotplug, blk-mq, and scsi-mq.
Every time a CPU is offlined, some arbitrary range of kernel memory
seems to get corrupted. Then after a while, the kernel panics at random
places when block IOs are issued. (for example, see the call traces below)

This bug is easily reproducible with a Qemu VM running with virtio-scsi,
when its guest kernel is 3.19-rc1 or higher, and when scsi-mq is loaded
with blk-mq enabled. And yes, the 4.0 release is still affected, as well
as Jens' for-4.1/core.

How to reproduce:

  # echo 0 > /sys/devices/system/cpu/cpu1/online
  (and issue some block IOs, that's it.)

Bisecting between 3.18 and 3.19-rc1, it looks like this bug had been
hidden until commit ccbedf117f01 ("virtio_scsi: support multi hw queue of
blk-mq"), which started to allow virtio-scsi to map virtqueues to hardware
queues of blk-mq. Reverting that commit makes the bug go away. However,
I suppose reverting it would not be a correct solution.

More precisely, every time a CPU hotplug event gets triggered,
the call graph looks like the following:

  blk_mq_queue_reinit_notify()
   -> blk_mq_queue_reinit()
   -> blk_mq_map_swqueue()
   -> blk_mq_free_rq_map()
   -> scsi_exit_request()

From that point, as soon as any address in the request gets modified,
an arbitrary range of memory gets corrupted. My first guess was that the
exit routine might try to deallocate tags->rqs[] where invalid addresses
are stored. But it looks like that is not the case, and
cmd->sense_buffer also looks valid. It's not obvious to me exactly what
goes wrong.

Does anyone have an idea?
Regards, Dongsu [beginning of call traces] [ 47.274292] BUG: unable to handle kernel NULL pointer dereference at 0018 [ 47.275013] IP: [] __bt_get.isra.5+0x7d/0x1e0 [ 47.275013] PGD 79c55067 PUD 7ba17067 PMD 0 [ 47.275013] Oops: [#1] SMP [ 47.275013] Modules linked in: fuse cpufreq_stats binfmt_misc 9p fscache dm_round_robin loop dm_multipath 9pnet_virtio rtc_cmos 9pnet acpi_cpufreq serio_raw i2c_piix4 virtio_net [ 47.275013] CPU: 3 PID: 6232 Comm: blkid Not tainted 4.0.0 #303 [ 47.275013] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153950- 04/01/2014 [ 47.275013] task: 88003dfbc020 ti: 880079bac000 task.ti: 880079bac000 [ 47.275013] RIP: 0010:[] [] __bt_get.isra.5+0x7d/0x1e0 [ 47.275013] RSP: 0018:880079baf898 EFLAGS: 00010246 [ 47.275013] RAX: 003c RBX: 880079198400 RCX: 0078 [ 47.275013] RDX: 88007fddbb80 RSI: 0010 RDI: 880079198400 [ 47.275013] RBP: 880079baf8e8 R08: 88007fddbb80 R09: [ 47.275013] R10: 0001 R11: 0001 R12: 0010 [ 47.275013] R13: 0010 R14: 880079baf9e8 R15: 88007fddbb80 [ 47.275013] FS: 2b270c049800() GS:88007fc0() knlGS: [ 47.275013] CS: 0010 DS: ES: CR0: 80050033 [ 47.275013] CR2: 0018 CR3: 7ca8d000 CR4: 001407e0 [ 47.275013] Stack: [ 47.275013] 880079baf978 88007fdd58c0 0078 814071ff [ 47.275013] 880079baf8d8 880079198400 0010 0010 [ 47.275013] 880079baf9e8 88007fddbb80 880079baf968 8140b4e5 [ 47.275013] Call Trace: [ 47.275013] [] ? blk_mq_queue_enter+0x9f/0x2d0 [ 47.275013] [] bt_get+0x65/0x1e0 [ 47.275013] [] ? blk_mq_queue_enter+0x9f/0x2d0 [ 47.275013] [] ? wait_woken+0xa0/0xa0 [ 47.275013] [] blk_mq_get_tag+0xa7/0xd0 [ 47.275013] [] __blk_mq_alloc_request+0x1b/0x200 [ 47.275013] [] blk_mq_map_request+0xd6/0x4e0 [ 47.275013] [] blk_mq_make_request+0x6e/0x2d0 [ 47.275013] [] ? generic_make_request_checks+0x674/0x6a0 [ 47.275013] [] ? bio_add_page+0x5e/0x70 [ 47.275013] [] generic_make_request+0xc0/0x110 [ 47.275013] [] submit_bio+0x68/0x150 [ 47.275013] [] ? 
lru_cache_add+0x1c/0x50 [ 47.275013] [] mpage_bio_submit+0x2a/0x40 [ 47.275013] [] mpage_readpages+0x10c/0x130 [ 47.275013] [] ? I_BDEV+0x10/0x10 [ 47.275013] [] ? I_BDEV+0x10/0x10 [ 47.275013] [] ? __page_cache_alloc+0x137/0x160 [ 47.275013] [] blkdev_readpages+0x1d/0x20 [ 47.275013] [] __do_page_cache_readahead+0x29f/0x320 [ 47.275013] [] ? __do_page_cache_readahead+0x165/0x320 [ 47.275013] [] force_page_cache_readahead+0x34/0x60 [ 47.275013] [] page_cache_sync_readahead+0x46/0x50 [ 47.275013] [] generic_file_read_iter+0x52c/0x640 [ 47.275013] [] blkdev_read_iter+0x37/0x40 [ 47.275013] [] new_sync_read+0x7e/0xb0 [ 47.275013] [] __vfs_read+0x18/0x50 [ 47.275013] [] vfs_read+0x8d/0x150 [ 47.275013] [] SyS_read+0x49/0xb0 [ 47.275013] [] system_call_fastpath+0x12/0x17 [ 47.275013] Code: 97 18 03 00 00 bf 04 00 00 00 41 f7 f1 83 f8 04 0f 43 f8 b8 ff ff ff ff 44 39 d7 0f 86 c1 00 00 00 41 8b 00 48 89 4d c0 49 89 f5
Re: [PATCH] dm: fix multipath regression due to initializing wrong request
On 09.02.2015 10:47, Jens Axboe wrote:
> On 02/09/2015 10:35 AM, Mike Snitzer wrote:
> >On Mon, Feb 09 2015 at 12:13P -0500,
> >Mike Snitzer wrote:
> >
> >Jens and I discussed this further and given that linux-block breaks
> >dm-multipath it is best to fix linux-block and let Linus resolve the
> >merge when I send him the linux-dm pull.
> >
> >Here is the patch to fix the regression:
>
> Added, thanks. I don't think this is worth rebasing for, so just added to
> the top of for-3.20/core (since that's where the buggy commit was added).

Thanks a lot.
Now the branch for-3.20/core works without hitting the BUG.

Dongsu

> --
> Jens Axboe
blk-mq crash with dm-multipath in for-3.20/core
Hi Jens, during testing with the linux-block for-3.20/core branch, I hit a BUG like below. It's reproducible by running xfstests/xfs/279. Bisecting showed that the first bad commit is 6d6285c45f5a ("block: require blk_rq_prep_clone() be given an initialized clone request"). With reverting this commit, the crash disappears. The linux-dm's branch dm-for-3.20 works fine without crash too. As pointed out already by Keith Busch in a thread, [1] that commit should not be there in the first place. Commit 102e38b1030e ("dm: split request structure out from dm_rq_target_io structure") from linux-dm tree [2] is going to move the blk_rq_init() call again to __clone_rq(). So that commit 6d6285c45f5a should be either reverted, or moved to linux-dm tree, doesn't it? Cheers, Dongsu [1] https://www.redhat.com/archives/dm-devel/2015-January/msg00171.html [2] https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-for-3.20=102e38b1030e883efc022dfdc7b7e7a3de70d1c5 [ cut here ] kernel BUG at block/blk-core.c:2333! RIP: 0010: [] blk_dequeue_request+0x78/0x90 Call Trace: [] blk_start_request+0x16/0x70 [] dm_start_request+0x1a/0x50 [] dm_request_fn+0x2b6/0x3e0 [] __blk_run_queue+0x37/0x50 [] queue_unplugged+0x5d/0x230 [] blk_flush_plug_list+0x1ac/0x230 [] blk_finish_plug+0x18/0x60 [] __do_page_cache_readahead+0x2b1/0x320 [] ? __do_page_cache_readahead+0x165/0x320 [] ondemand_readahead+0xe2/0x480 [] ? pagecache_get_page+0x2f/0x200 [] page_cache_sync_readahead+0x31/0x50 [] generic_file_read_iter+0x51c/0x630 [] ? 
might_fault+0x5e/0xc0 [] blkdev_read_iter+0x37/0x40 [] new_sync_read+0x7e/0xb0 [] __vfs_read+0x18/0x50 [] vfs_read+0x8d/0x150 [] SyS_read+0x49/0xb0 [] system_call_fastpath+0x12/0x17 RIP [] blk_dequeue_request+0x78/0x90 RSP ---[ end trace dcfc3d438518b1aa ]--- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: cleanup and refactor BLOCK_PC mapping helpers V2
On 05.02.2015 09:28, Jens Axboe wrote:
> On 02/02/2015 06:19 AM, Christoph Hellwig wrote:
> >Jens, do these patches look fine to you? Any chance to get them into
> >the tree for the 3.20 merge window?
>
> Yes, I think they look fine. I'll throw them into the testing mix and merge
> them for 3.20.

Thanks a lot, and many thanks also to Christoph.

Dongsu

> --
> Jens Axboe
Re: [PATCH v2 2/7] block: rewrite __bio_copy_iov()
Hi Christoph,

On 16.01.2015 03:31, Christoph Hellwig wrote:
> On Thu, Jan 15, 2015 at 10:18:17AM -0800, Christoph Hellwig wrote:
> > This breaks booting a simple KVM VM for me:
>
> Seems like the issue actually is in the patch before this one, but
> only shows up with this one applied.
>
> The root cause is that we only copy the iov_iter, but not the
> actual iovecs into the bio_map_data.
>
> I have a fixed series, which I'll send out together with various
> related cleanups ASAP.

Thanks for testing it and finding out the root cause.
Strange, I have never seen the bug. Maybe I should also test it with
virtio-scsi, which I don't usually do.

Dongsu
[PATCH v2 3/9] block: allow __blk_queue_bounce() to handle bios larger than BIO_MAX_PAGES
From: Kent Overstreet Allow __blk_queue_bounce() to handle bios with more than BIO_MAX_PAGES segments. Doing that, it becomes possible to simplify the block layer in the kernel. The issue is that any code that clones the bio and must clone the biovec (i.e. it can't use bio_clone_fast()) won't be able to allocate a bio with more than BIO_MAX_PAGES - bio_alloc_bioset() always fails in that case. Fortunately, it's easy to make __blk_queue_bounce() just process part of the bio if necessary, using bi_remaining to count the splits and punting the rest back to generic_make_request(). Cc: Christoph Hellwig Cc: Jens Axboe Signed-off-by: Kent Overstreet [dpark: add more description in commit message] Signed-off-by: Dongsu Park --- block/bounce.c | 60 ++ 1 file changed, 52 insertions(+), 8 deletions(-) diff --git a/block/bounce.c b/block/bounce.c index ab21ba2..689ea89 100644 --- a/block/bounce.c +++ b/block/bounce.c @@ -196,6 +196,43 @@ static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio) } #endif /* CONFIG_NEED_BOUNCE_POOL */ +static struct bio *bio_clone_segments(struct bio *bio_src, gfp_t gfp_mask, + struct bio_set *bs, unsigned nsegs) +{ + struct bvec_iter iter; + struct bio_vec bv; + struct bio *bio; + + bio = bio_alloc_bioset(gfp_mask, nsegs, bs); + if (!bio) + return NULL; + + bio->bi_bdev= bio_src->bi_bdev; + bio->bi_rw = bio_src->bi_rw; + bio->bi_iter.bi_sector = bio_src->bi_iter.bi_sector; + + bio_for_each_segment(bv, bio_src, iter) { + bio->bi_io_vec[bio->bi_vcnt++] = bv; + bio->bi_iter.bi_size += bv.bv_len; + if (!--nsegs) + break; + } + + if (bio_integrity(bio_src)) { + int ret; + + ret = bio_integrity_clone(bio, bio_src, gfp_mask); + if (ret < 0) { + bio_put(bio); + return NULL; + } + } + + bio_src->bi_iter = iter; + + return bio; +} + static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig, mempool_t *pool, int force) { @@ -203,17 +240,24 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio 
**bio_orig, int rw = bio_data_dir(*bio_orig); struct bio_vec *to, from; struct bvec_iter iter; - unsigned i; + int i, nsegs = 0, bounce = force; - if (force) - goto bounce; - bio_for_each_segment(from, *bio_orig, iter) + bio_for_each_segment(from, *bio_orig, iter) { + nsegs++; if (page_to_pfn(from.bv_page) > queue_bounce_pfn(q)) - goto bounce; + bounce = 1; + } + + if (!bounce) + return; - return; -bounce: - bio = bio_clone_bioset(*bio_orig, GFP_NOIO, fs_bio_set); + bio = bio_clone_segments(*bio_orig, GFP_NOIO, fs_bio_set, +min(nsegs, BIO_MAX_PAGES)); + + if ((*bio_orig)->bi_iter.bi_size) { + atomic_inc(&(*bio_orig)->bi_remaining); + generic_make_request(*bio_orig); + } bio_for_each_segment_all(to, bio, i) { struct page *page = to->bv_page; -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
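The idea in the commit message above, processing at most BIO_MAX_PAGES segments per pass and handing the remainder back for another pass, can be sketched in plain user-space C. This is only an illustration under stated assumptions: `toy_iter`, `toy_take_front()` and `toy_count_passes()` are hypothetical names, `TOY_MAX_SEGS` is a stand-in for BIO_MAX_PAGES, and the explicit loop replaces the resubmission through generic_make_request() that the patch relies on.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_SEGS 256	/* illustrative stand-in for BIO_MAX_PAGES */

/* Toy iterator over a request: pos segments out of total are consumed. */
struct toy_iter {
	size_t pos;
	size_t total;
};

/*
 * Consume up to max_segs segments from the front of the request and
 * advance the iterator, similar in spirit to cloning only the first
 * min(nsegs, BIO_MAX_PAGES) segments. Returns the number consumed.
 */
static size_t toy_take_front(struct toy_iter *it, size_t max_segs)
{
	size_t left = it->total - it->pos;
	size_t n = left < max_segs ? left : max_segs;

	assert(max_segs > 0);
	it->pos += n;
	return n;
}

/*
 * Count how many passes are needed to work through nsegs segments when
 * each pass handles at most max_segs of them. Each loop iteration models
 * one trip of the remaining part through the submission path.
 */
static size_t toy_count_passes(size_t nsegs, size_t max_segs)
{
	struct toy_iter it = { 0, nsegs };
	size_t passes = 0;

	while (it.pos < it.total) {
		toy_take_front(&it, max_segs);
		passes++;
	}
	return passes;
}
```

For example, a request with 1000 segments and a per-pass limit of 256 needs four passes in this model, while one with exactly 256 segments needs a single pass and is never split.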
[PATCH v2 4/9] bcache: clean up hacks around bio_split_pool
From: Kent Overstreet

Until now, bcache has carried its own workarounds for splitting bios
(bio_split_pool) as well as for submitting them. Since
generic_make_request() is able to handle arbitrarily sized bios, it's now
possible to delete those hacks.

Cc: linux-bca...@vger.kernel.org
Signed-off-by: Kent Overstreet
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park
---
 drivers/md/bcache/bcache.h    |  18
 drivers/md/bcache/io.c        | 100 +-
 drivers/md/bcache/journal.c   |   4 +-
 drivers/md/bcache/request.c   |  16 +++
 drivers/md/bcache/super.c     |  32 +-
 drivers/md/bcache/util.h      |   5 ++-
 drivers/md/bcache/writeback.c |   4 +-
 7 files changed, 18 insertions(+), 161 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 04f7bc2..6b420a5 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -243,19 +243,6 @@ struct keybuf {
 	DECLARE_ARRAY_ALLOCATOR(struct keybuf_key, freelist, KEYBUF_NR);
 };
-struct bio_split_pool {
-	struct bio_set		*bio_split;
-	mempool_t		*bio_split_hook;
-};
-
-struct bio_split_hook {
-	struct closure		cl;
-	struct bio_split_pool	*p;
-	struct bio		*bio;
-	bio_end_io_t		*bi_end_io;
-	void			*bi_private;
-};
-
 struct bcache_device {
 	struct closure		cl;
@@ -288,8 +275,6 @@ struct bcache_device {
 	int (*cache_miss)(struct btree *, struct search *,
 			  struct bio *, unsigned);
 	int (*ioctl) (struct bcache_device *, fmode_t, unsigned, unsigned long);
-
-	struct bio_split_pool	bio_split_hook;
 };
 struct io {
@@ -454,8 +439,6 @@ struct cache {
 	atomic_long_t		meta_sectors_written;
 	atomic_long_t		btree_sectors_written;
 	atomic_long_t		sectors_written;
-
-	struct bio_split_pool	bio_split_hook;
 };
 struct gc_stat {
@@ -873,7 +856,6 @@ void bch_bbio_endio(struct cache_set *, struct bio *, int, const char *);
 void bch_bbio_free(struct bio *, struct cache_set *);
 struct bio *bch_bbio_alloc(struct cache_set *);
-void bch_generic_make_request(struct bio *, struct bio_split_pool *);
 void __bch_submit_bbio(struct bio *, struct cache_set *);
 void bch_submit_bbio(struct bio *, struct cache_set *, struct bkey *, unsigned);
diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
index fa028fa..86a0bb8 100644
--- a/drivers/md/bcache/io.c
+++ b/drivers/md/bcache/io.c
@@ -11,104 +11,6 @@
 #include <linux/blkdev.h>
-static unsigned bch_bio_max_sectors(struct bio *bio)
-{
-	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
-	struct bio_vec bv;
-	struct bvec_iter iter;
-	unsigned ret = 0, seg = 0;
-
-	if (bio->bi_rw & REQ_DISCARD)
-		return min(bio_sectors(bio), q->limits.max_discard_sectors);
-
-	bio_for_each_segment(bv, bio, iter) {
-		struct bvec_merge_data bvm = {
-			.bi_bdev	= bio->bi_bdev,
-			.bi_sector	= bio->bi_iter.bi_sector,
-			.bi_size	= ret << 9,
-			.bi_rw		= bio->bi_rw,
-		};
-
-		if (seg == min_t(unsigned, BIO_MAX_PAGES,
-				 queue_max_segments(q)))
-			break;
-
-		if (q->merge_bvec_fn &&
-		    q->merge_bvec_fn(q, &bvm, &bv) < (int) bv.bv_len)
-			break;
-
-		seg++;
-		ret += bv.bv_len >> 9;
-	}
-
-	ret = min(ret, queue_max_sectors(q));
-
-	WARN_ON(!ret);
-	ret = max_t(int, ret, bio_iovec(bio).bv_len >> 9);
-
-	return ret;
-}
-
-static void bch_bio_submit_split_done(struct closure *cl)
-{
-	struct bio_split_hook *s = container_of(cl, struct bio_split_hook, cl);
-
-	s->bio->bi_end_io = s->bi_end_io;
-	s->bio->bi_private = s->bi_private;
-	bio_endio_nodec(s->bio, 0);
-
-	closure_debug_destroy(&s->cl);
-	mempool_free(s, s->p->bio_split_hook);
-}
-
-static void bch_bio_submit_split_endio(struct bio *bio, int error)
-{
-	struct closure *cl = bio->bi_private;
-	struct bio_split_hook *s = container_of(cl, struct bio_split_hook, cl);
-
-	if (error)
-		clear_bit(BIO_UPTODATE, &s->bio->bi_flags);
-
-	bio_put(bio);
-	closure_put(cl);
-}
-
-void bch_generic_make_request(struct bio *bio, struct bio_split_pool *p)
-{
-	struct bio_split_hook *s;
-	struct bio *n;
-
-	if (!bio_has_data(bio) && !(bio->bi_rw & REQ_DISCARD))
-		goto submit;
-
-	if (bio_sectors(bio) <= bch_bio_max_sectors(bio))
-		goto submit;
-
-	s = mempool_alloc(p->bio_split_hook
[PATCH v2 5/9] btrfs: remove bio splitting and merge_bvec_fn() calls
From: Kent Overstreet

Btrfs has been doing bio splitting from btrfs_map_bio(), by checking
device limits as well as calling ->merge_bvec_fn() etc. That is no longer
necessary, because generic_make_request() is now able to handle
arbitrarily sized bios. So clean up the unnecessary code paths.

Cc: Chris Mason
Cc: Josef Bacik
Cc: linux-bt...@vger.kernel.org
Signed-off-by: Kent Overstreet
Signed-off-by: Chris Mason
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park
---
 fs/btrfs/volumes.c | 73 --
 1 file changed, 73 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 50c5a87..c627bf8 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5691,34 +5691,6 @@ static noinline void btrfs_schedule_bio(struct btrfs_root *root,
 				 &device->work);
 }
-static int bio_size_ok(struct block_device *bdev, struct bio *bio,
-		       sector_t sector)
-{
-	struct bio_vec *prev;
-	struct request_queue *q = bdev_get_queue(bdev);
-	unsigned int max_sectors = queue_max_sectors(q);
-	struct bvec_merge_data bvm = {
-		.bi_bdev = bdev,
-		.bi_sector = sector,
-		.bi_rw = bio->bi_rw,
-	};
-
-	if (WARN_ON(bio->bi_vcnt == 0))
-		return 1;
-
-	prev = &bio->bi_io_vec[bio->bi_vcnt - 1];
-	if (bio_sectors(bio) > max_sectors)
-		return 0;
-
-	if (!q->merge_bvec_fn)
-		return 1;
-
-	bvm.bi_size = bio->bi_iter.bi_size - prev->bv_len;
-	if (q->merge_bvec_fn(q, &bvm, prev) < prev->bv_len)
-		return 0;
-	return 1;
-}
-
 static void submit_stripe_bio(struct btrfs_root *root, struct btrfs_bio *bbio,
 			      struct bio *bio, u64 physical, int dev_nr,
 			      int rw, int async)
@@ -5752,38 +5724,6 @@ static void submit_stripe_bio(struct btrfs_root *root, struct btrfs_bio *bbio,
 	btrfsic_submit_bio(rw, bio);
 }
-static int breakup_stripe_bio(struct btrfs_root *root, struct btrfs_bio *bbio,
-			      struct bio *first_bio, struct btrfs_device *dev,
-			      int dev_nr, int rw, int async)
-{
-	struct bio_vec *bvec = first_bio->bi_io_vec;
-	struct bio *bio;
-	int nr_vecs = bio_get_nr_vecs(dev->bdev);
-	u64 physical = bbio->stripes[dev_nr].physical;
-
-again:
-	bio = btrfs_bio_alloc(dev->bdev, physical >> 9, nr_vecs, GFP_NOFS);
-	if (!bio)
-		return -ENOMEM;
-
-	while (bvec <= (first_bio->bi_io_vec + first_bio->bi_vcnt - 1)) {
-		if (bio_add_page(bio, bvec->bv_page, bvec->bv_len,
-				 bvec->bv_offset) < bvec->bv_len) {
-			u64 len = bio->bi_iter.bi_size;
-
-			atomic_inc(&bbio->stripes_pending);
-			submit_stripe_bio(root, bbio, bio, physical, dev_nr,
-					  rw, async);
-			physical += len;
-			goto again;
-		}
-		bvec++;
-	}
-
-	submit_stripe_bio(root, bbio, bio, physical, dev_nr, rw, async);
-	return 0;
-}
-
 static void bbio_error(struct btrfs_bio *bbio, struct bio *bio, u64 logical)
 {
 	atomic_inc(&bbio->error);
@@ -5862,19 +5802,6 @@ int btrfs_map_bio(struct btrfs_root *root, int rw, struct bio *bio,
 			continue;
 		}
-		/*
-		 * Check and see if we're ok with this bio based on it's size
-		 * and offset with the given device.
-		 */
-		if (!bio_size_ok(dev->bdev, first_bio,
-				 bbio->stripes[dev_nr].physical >> 9)) {
-			ret = breakup_stripe_bio(root, bbio, first_bio, dev,
-						 dev_nr, rw, async_submit);
-			BUG_ON(ret);
-			dev_nr++;
-			continue;
-		}
-
 		if (dev_nr < total_devs - 1) {
 			bio = btrfs_bio_clone(first_bio, GFP_NOFS);
 			BUG_ON(!bio); /* -ENOMEM */
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2 6/9] md/raid5: get rid of bio_fits_rdev()
From: Kent Overstreet

Remove bio_fits_rdev() completely, because ->merge_bvec_fn() is now gone.
There's no point in calling bio_fits_rdev() only to ensure an aligned read
from the rdev.

Cc: Neil Brown
Cc: linux-r...@vger.kernel.org
Signed-off-by: Kent Overstreet
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park
---
 drivers/md/raid5.c | 23 +--
 1 file changed, 1 insertion(+), 22 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index c1b0d52..40e464c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4218,25 +4218,6 @@ static void raid5_align_endio(struct bio *bi, int error)
 	add_bio_to_retry(raid_bi, conf);
 }
-static int bio_fits_rdev(struct bio *bi)
-{
-	struct request_queue *q = bdev_get_queue(bi->bi_bdev);
-
-	if (bio_sectors(bi) > queue_max_sectors(q))
-		return 0;
-	blk_recount_segments(q, bi);
-	if (bi->bi_phys_segments > queue_max_segments(q))
-		return 0;
-
-	if (q->merge_bvec_fn)
-		/* it's too hard to apply the merge_bvec_fn at this stage,
-		 * just just give up
-		 */
-		return 0;
-
-	return 1;
-}
-
 static int chunk_aligned_read(struct mddev *mddev, struct bio * raid_bio)
 {
 	struct r5conf *conf = mddev->private;
@@ -4290,11 +4271,9 @@ static int chunk_aligned_read(struct mddev *mddev, struct bio * raid_bio)
 		align_bi->bi_bdev = rdev->bdev;
 		__clear_bit(BIO_SEG_VALID, &align_bi->bi_flags);
-		if (!bio_fits_rdev(align_bi) ||
-		    is_badblock(rdev, align_bi->bi_iter.bi_sector,
+		if (is_badblock(rdev, align_bi->bi_iter.bi_sector,
 				bio_sectors(align_bi),
 				&first_bad, &bad_sectors)) {
-			/* too big in some way, or has a known bad block */
 			bio_put(align_bi);
 			rdev_dec_pending(rdev, mddev);
 			return 0;
-- 
2.1.0
[PATCH v2 2/9] block: simplify bio_add_page()
From: Kent Overstreet

Since generic_make_request() can now handle arbitrarily sized bios, all we
have to do is make sure the bvec array doesn't overflow. __bio_add_page()
no longer needs to call ->merge_bvec_fn(), which lets us get rid of
unnecessary code paths.

Note that removing the call to ->merge_bvec_fn() is fine for
bio_add_pc_page(), as SCSI devices usually don't even need that. The few
exceptional cases like pscsi or osd are not affected either.

Cc: Christoph Hellwig
Cc: Jens Axboe
Cc: Ming Lin
Signed-off-by: Kent Overstreet
[dpark: rebase and resolve merge conflicts, change a couple of comments,
 make bio_add_page() warn once upon a cloned bio.]
Signed-off-by: Dongsu Park
---
 block/bio.c | 135 +---
 1 file changed, 55 insertions(+), 80 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 7ff846d..136b78b 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -700,9 +700,23 @@ int bio_get_nr_vecs(struct block_device *bdev)
 }
 EXPORT_SYMBOL(bio_get_nr_vecs);
-static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
-			  *page, unsigned int len, unsigned int offset,
-			  unsigned int max_sectors)
+/**
+ *	bio_add_pc_page	-	attempt to add page to bio
+ *	@q: the target queue
+ *	@bio: destination bio
+ *	@page: page to add
+ *	@len: vec entry length
+ *	@offset: vec entry offset
+ *
+ *	Attempt to add a page to the bio_vec maplist. This can fail for a
+ *	number of reasons, such as the bio being full or target block device
+ *	limitations. The target block device must allow bio's up to PAGE_SIZE,
+ *	so it is always possible to add a single page to an empty bio.
+ *
+ *	This should only be used by REQ_PC bios.
+ */
+int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page
+		    *page, unsigned int len, unsigned int offset)
 {
 	int retried_segments = 0;
 	struct bio_vec *bvec;
@@ -713,7 +727,7 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
 	if (unlikely(bio_flagged(bio, BIO_CLONED)))
 		return 0;
-	if (((bio->bi_iter.bi_size + len) >> 9) > max_sectors)
+	if (((bio->bi_iter.bi_size + len) >> 9) > queue_max_hw_sectors(q))
 		return 0;
 	/*
@@ -726,28 +740,7 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
 		if (page == prev->bv_page &&
 		    offset == prev->bv_offset + prev->bv_len) {
-			unsigned int prev_bv_len = prev->bv_len;
 			prev->bv_len += len;
-
-			if (q->merge_bvec_fn) {
-				struct bvec_merge_data bvm = {
-					/* prev_bvec is already charged in
-					   bi_size, discharge it in order to
-					   simulate merging updated prev_bvec
-					   as new bvec. */
-					.bi_bdev = bio->bi_bdev,
-					.bi_sector = bio->bi_iter.bi_sector,
-					.bi_size = bio->bi_iter.bi_size -
-						prev_bv_len,
-					.bi_rw = bio->bi_rw,
-				};
-
-				if (q->merge_bvec_fn(q, &bvm, prev) < prev->bv_len) {
-					prev->bv_len -= len;
-					return 0;
-				}
-			}
-
 			bio->bi_iter.bi_size += len;
 			goto done;
 		}
@@ -790,27 +783,6 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
 		blk_recount_segments(q, bio);
 	}
-	/*
-	 * if queue has other restrictions (eg varying max sector size
-	 * depending on offset), it can specify a merge_bvec_fn in the
-	 * queue to get further control
-	 */
-	if (q->merge_bvec_fn) {
-		struct bvec_merge_data bvm = {
-			.bi_bdev = bio->bi_bdev,
-			.bi_sector = bio->bi_iter.bi_sector,
-			.bi_size = bio->bi_iter.bi_size - len,
-			.bi_rw = bio->bi_rw,
-		};
-
-		/*
-		 * merge_bvec_fn() returns number of bytes it can accept
-		 * at this offset
-		 */
-		if (q->merge_bvec_fn(q, &bvm, bvec) < bvec->bv_len)
-			goto failed;
-	}
-
 	/* If we may be able to merge these biovecs, force a recount */
 	if (bio->bi_vcnt >
[PATCH v2 8/9] fs: use helper bio_add_page() instead of open coding on bi_io_vec
From: Kent Overstreet

Call the pre-defined helper bio_add_page() instead of open-coding the
setup of bi_io_vec[]. This makes some parts of the filesystems and
mm/page_io.c simpler than before.

Acked-by: Dave Kleikamp
Cc: Christoph Hellwig
Cc: Al Viro
Cc: linux-fsde...@vger.kernel.org
Signed-off-by: Kent Overstreet
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park
---
 fs/buffer.c         |  7 ++-
 fs/jfs/jfs_logmgr.c | 14 --
 mm/page_io.c        |  8 +++-
 3 files changed, 9 insertions(+), 20 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index dbe5699..78e63e3 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3022,12 +3022,9 @@ int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
 	bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
 	bio->bi_bdev = bh->b_bdev;
-	bio->bi_io_vec[0].bv_page = bh->b_page;
-	bio->bi_io_vec[0].bv_len = bh->b_size;
-	bio->bi_io_vec[0].bv_offset = bh_offset(bh);
-	bio->bi_vcnt = 1;
-	bio->bi_iter.bi_size = bh->b_size;
+	bio_add_page(bio, bh->b_page, bh->b_size, bh_offset(bh));
+	BUG_ON(bio->bi_iter.bi_size != bh->b_size);
 	bio->bi_end_io = end_bio_bh_io_sync;
 	bio->bi_private = bh;
diff --git a/fs/jfs/jfs_logmgr.c b/fs/jfs/jfs_logmgr.c
index bc462dc..46fae06 100644
--- a/fs/jfs/jfs_logmgr.c
+++ b/fs/jfs/jfs_logmgr.c
@@ -1999,12 +1999,9 @@ static int lbmRead(struct jfs_log * log, int pn, struct lbuf ** bpp)
 	bio->bi_iter.bi_sector = bp->l_blkno << (log->l2bsize - 9);
 	bio->bi_bdev = log->bdev;
-	bio->bi_io_vec[0].bv_page = bp->l_page;
-	bio->bi_io_vec[0].bv_len = LOGPSIZE;
-	bio->bi_io_vec[0].bv_offset = bp->l_offset;
-	bio->bi_vcnt = 1;
-	bio->bi_iter.bi_size = LOGPSIZE;
+	bio_add_page(bio, bp->l_page, LOGPSIZE, bp->l_offset);
+	BUG_ON(bio->bi_iter.bi_size != LOGPSIZE);
 	bio->bi_end_io = lbmIODone;
 	bio->bi_private = bp;
@@ -2145,12 +2142,9 @@ static void lbmStartIO(struct lbuf * bp)
 	bio = bio_alloc(GFP_NOFS, 1);
 	bio->bi_iter.bi_sector = bp->l_blkno << (log->l2bsize - 9);
 	bio->bi_bdev = log->bdev;
-	bio->bi_io_vec[0].bv_page = bp->l_page;
-	bio->bi_io_vec[0].bv_len = LOGPSIZE;
-	bio->bi_io_vec[0].bv_offset = bp->l_offset;
-	bio->bi_vcnt = 1;
-	bio->bi_iter.bi_size = LOGPSIZE;
+	bio_add_page(bio, bp->l_page, LOGPSIZE, bp->l_offset);
+	BUG_ON(bio->bi_iter.bi_size != LOGPSIZE);
 	bio->bi_end_io = lbmIODone;
 	bio->bi_private = bp;
diff --git a/mm/page_io.c b/mm/page_io.c
index 955db8b..8c878c7 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -33,12 +33,10 @@ static struct bio *get_swap_bio(gfp_t gfp_flags,
 	if (bio) {
 		bio->bi_iter.bi_sector = map_swap_page(page, &bio->bi_bdev);
 		bio->bi_iter.bi_sector <<= PAGE_SHIFT - 9;
-		bio->bi_io_vec[0].bv_page = page;
-		bio->bi_io_vec[0].bv_len = PAGE_SIZE;
-		bio->bi_io_vec[0].bv_offset = 0;
-		bio->bi_vcnt = 1;
-		bio->bi_iter.bi_size = PAGE_SIZE;
 		bio->bi_end_io = end_io;
+
+		bio_add_page(bio, page, PAGE_SIZE, 0);
+		BUG_ON(bio->bi_iter.bi_size != PAGE_SIZE);
 	}
 	return bio;
 }
-- 
2.1.0
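As a rough userspace illustration of why the helper is preferable to filling bi_io_vec[0] by hand, the sketch below models a bio as a small fixed array. The types and the function toy_bio_add_page() are hypothetical stand-ins invented for this example, not kernel APIs; the only property borrowed from the patch is that the helper returns the number of bytes it actually added, so the caller can check the result instead of assuming the slot was free.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the kernel types; NOT the real struct bio / bio_vec. */
struct toy_vec {
	const void *page;
	unsigned int len;
	unsigned int offset;
};

struct toy_bio {
	struct toy_vec vecs[4];	/* fixed-size vector array */
	unsigned int vcnt;	/* entries currently in use */
	unsigned int max_vecs;	/* usable capacity of vecs[] (at most 4) */
	unsigned int size;	/* total bytes described by the entries */
};

/*
 * Hypothetical helper modelled on the bio_add_page() contract: append one
 * entry and return the number of bytes added, or 0 if the array is full.
 */
static unsigned int toy_bio_add_page(struct toy_bio *bio, const void *page,
				     unsigned int len, unsigned int offset)
{
	if (bio->vcnt >= bio->max_vecs)
		return 0;

	bio->vecs[bio->vcnt].page = page;
	bio->vecs[bio->vcnt].len = len;
	bio->vecs[bio->vcnt].offset = offset;
	bio->vcnt++;
	bio->size += len;
	return len;
}
```

A caller that allocated room for a single entry would add its page once and verify that the recorded size matches the requested length, which is what the BUG_ON() lines in the patch do for the real bio.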
[PATCH v2 7/9] block: kill merge_bvec_fn() completely
From: Kent Overstreet

As generic_make_request() is now able to handle arbitrarily sized bios,
it's no longer necessary for each individual block driver to define its
own ->merge_bvec_fn() callback. Remove every invocation completely.

Cc: Jens Axboe
Cc: Lars Ellenberg
Cc: drbd-u...@lists.linbit.com
Cc: Jiri Kosina
Cc: Yehuda Sadeh
Cc: Sage Weil
Cc: Alex Elder
Cc: ceph-de...@vger.kernel.org
Cc: Alasdair Kergon
Cc: Mike Snitzer
Cc: dm-de...@redhat.com
Cc: Neil Brown
Cc: linux-r...@vger.kernel.org
Cc: Christoph Hellwig
Cc: "Martin K. Petersen"
Signed-off-by: Kent Overstreet
[dpark: also remove ->merge_bvec_fn() in dm-thin as well as
 dm-era-target, and resolve merge conflicts]
Signed-off-by: Dongsu Park
---
 block/blk-merge.c              |  17 +-
 block/blk-settings.c           |  22
 drivers/block/drbd/drbd_int.h  |   1 -
 drivers/block/drbd/drbd_main.c |   1 -
 drivers/block/drbd/drbd_req.c  |  35
 drivers/block/pktcdvd.c        |  21 ---
 drivers/block/rbd.c            |  47
 drivers/md/dm-cache-target.c   |  21 ---
 drivers/md/dm-crypt.c          |  16 --
 drivers/md/dm-era-target.c     |  15 -
 drivers/md/dm-flakey.c         |  16 --
 drivers/md/dm-linear.c         |  16 --
 drivers/md/dm-snap.c           |  15 -
 drivers/md/dm-stripe.c         |  21 ---
 drivers/md/dm-table.c          |   8 ---
 drivers/md/dm-thin.c           |  31 ---
 drivers/md/dm-verity.c         |  16 --
 drivers/md/dm.c                | 120 +---
 drivers/md/dm.h                |   2 -
 drivers/md/linear.c            |  46
 drivers/md/md.c                |   2 -
 drivers/md/md.h                |   8 ---
 drivers/md/multipath.c         |  21 ---
 drivers/md/raid0.c             |  57 ---
 drivers/md/raid0.h             |   2 -
 drivers/md/raid1.c             |  59 +---
 drivers/md/raid10.c            | 122 +
 drivers/md/raid5.c             |  28 --
 include/linux/blkdev.h         |  10
 include/linux/device-mapper.h  |   4 --
 30 files changed, 9 insertions(+), 791 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 3bc2068..8cd7a83 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -69,24 +69,13 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 	struct bio *split;
 	struct bio_vec bv = { 0 }, bvprv = { 0 };
 	struct bvec_iter iter;
-	unsigned seg_size = 0, nsegs = 0;
+	unsigned seg_size = 0, nsegs = 0, sectors = 0;
 	int prev = 0;
-	struct bvec_merge_data bvm = {
-		.bi_bdev	= bio->bi_bdev,
-		.bi_sector	= bio->bi_iter.bi_sector,
-		.bi_size	= 0,
-		.bi_rw		= bio->bi_rw,
-	};
-
 	bio_for_each_segment(bv, bio, iter) {
-		if (q->merge_bvec_fn &&
-		    q->merge_bvec_fn(q, &bvm, &bv) < (int) bv.bv_len)
-			goto split;
-
-		bvm.bi_size += bv.bv_len;
+		sectors += bv.bv_len >> 9;
-		if (bvm.bi_size >> 9 > queue_max_sectors(q))
+		if (sectors > queue_max_sectors(q))
 			goto split;
 		if (prev && blk_queue_cluster(q)) {
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 6ed2cbe..463a10a 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -53,28 +53,6 @@ void blk_queue_unprep_rq(struct request_queue *q, unprep_rq_fn *ufn)
 }
 EXPORT_SYMBOL(blk_queue_unprep_rq);
-/**
- * blk_queue_merge_bvec - set a merge_bvec function for queue
- * @q:		queue
- * @mbfn:	merge_bvec_fn
- *
- * Usually queues have static limitations on the max sectors or segments that
- * we can put in a request. Stacking drivers may have some settings that
- * are dynamic, and thus we have to query the queue whether it is ok to
- * add a new bio_vec to a bio at a given offset or not. If the block device
- * has such limitations, it needs to register a merge_bvec_fn to control
- * the size of bio's sent to it. Note that a block device *must* allow a
- * single page to be added to an empty bio. The block device driver may want
- * to use the bio_split() function to deal with these bio's. By default
- * no merge_bvec_fn is defined for a queue, and only the fixed limits are
- * honored.
- */
-void blk_queue_merge_bvec(struct request_queue *q, merge_bvec_fn *mbfn)
-{
-	q->merge_bvec_fn = mbfn;
-}
-EXPORT_SYMBOL(blk_queue_merge_bvec);
-
 void blk_queue_softirq_done(struct request_queue *q, softirq_done_fn *fn)
 {
 	q->softirq_done_fn = fn;
diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drb
[PATCH v2 9/9] Documentation: update notes in biovecs about arbitrarily sized bios
Update block/biovecs.txt so that it includes a note on what kind of
effects arbitrarily sized bios would bring to the block layer.
Also fix a trivial typo, bio_iter_iovec.

Cc: Christoph Hellwig
Cc: Kent Overstreet
Cc: Jonathan Corbet
Cc: linux-...@vger.kernel.org
Signed-off-by: Dongsu Park
---
 Documentation/block/biovecs.txt | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/Documentation/block/biovecs.txt b/Documentation/block/biovecs.txt
index 74a32ad..2568958 100644
--- a/Documentation/block/biovecs.txt
+++ b/Documentation/block/biovecs.txt
@@ -24,7 +24,7 @@ particular, presenting the illusion of partially completed biovecs so that
 normal code doesn't have to deal with bi_bvec_done.
  * Driver code should no longer refer to biovecs directly; we now have
-   bio_iovec() and bio_iovec_iter() macros that return literal struct biovecs,
+   bio_iovec() and bio_iter_iovec() macros that return literal struct biovecs,
    constructed from the raw biovecs but taking into account bi_bvec_done and
    bi_size.
@@ -109,3 +109,11 @@ Other implications:
    over all the biovecs in the new bio - which is silly as it's not needed.
    So, don't use bi_vcnt anymore.
+
+ * The current interface allows the block layer to split bios as needed, so we
+   could eliminate a lot of complexity particularly in stacked drivers. Code
+   that creates bios can then create whatever size bios are convenient, and
+   more importantly stacked drivers don't have to deal with both their own bio
+   size limitations and the limitations of the underlying devices. Thus
+   there's no need to define ->merge_bvec_fn() callbacks for individual block
+   drivers.
-- 
2.1.0
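To make the note about splitting more concrete, here is a minimal userspace sketch of the idea: the submitter builds one request of any size, and the lower layer carves off pieces no larger than its own limit. Everything in it (struct toy_req, toy_split_front(), toy_submit()) is a hypothetical model written for this illustration and assumes a single static max_sectors limit; it is not how blk_queue_split() is implemented.

```c
#include <assert.h>

/* Toy request: a starting sector and a length in 512-byte sectors. */
struct toy_req {
	unsigned long long sector;
	unsigned int nr_sectors;
};

/*
 * Carve a piece of at most max_sectors off the front of *req and return
 * it; *req is advanced so that it describes the remainder.
 */
static struct toy_req toy_split_front(struct toy_req *req,
				      unsigned int max_sectors)
{
	struct toy_req front = *req;

	if (front.nr_sectors > max_sectors)
		front.nr_sectors = max_sectors;

	req->sector += front.nr_sectors;
	req->nr_sectors -= front.nr_sectors;
	return front;
}

/*
 * "Submit" a request of arbitrary size to a device with the given limit
 * and return how many pieces it was split into. A limit of 0 is treated
 * as invalid and yields 0 pieces.
 */
static unsigned int toy_submit(struct toy_req req, unsigned int max_sectors)
{
	unsigned int pieces = 0;

	if (max_sectors == 0)
		return 0;

	while (req.nr_sectors) {
		toy_split_front(&req, max_sectors);
		pieces++;
	}
	return pieces;
}
```

With a limit of 256 sectors, a 2048-sector request comes out as eight pieces while a 100-sector request passes through as one; the code that built the request never needs to know the limit.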
[PATCH v2 2/7] block: rewrite __bio_copy_iov()
Rewrite __bio_copy_iov() so that it calls either the _read() or the
_write() variant, as determined by the direction to_iov, which is given as
either READ or WRITE. Moreover, make __bio_copy_iov() take its iov_iter
parameter by value, to avoid awkward situations like repeatedly taking the
address of the iterator and dereferencing it again.

This commit should contain only literal replacements, without functional
changes.

Suggested-by: Christoph Hellwig
Cc: Kent Overstreet
Cc: Jens Axboe
Cc: Al Viro
Signed-off-by: Dongsu Park
---
 block/bio.c | 113
 1 file changed, 75 insertions(+), 38 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 8267676..7b1aed3 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1046,46 +1046,84 @@ static struct bio_map_data *bio_alloc_map_data(unsigned int iov_count,
 		       sizeof(struct sg_iovec) * iov_count, gfp_mask);
 }
-static int __bio_copy_iov(struct bio *bio, const struct iov_iter *iter,
-			  int to_user, int from_user, int do_free_page)
+/**
+ * __bio_copy_iov_read - copy all pages from iov_iter to bio
+ * @bio: The bio which describes the I/O as destination
+ * @iter: iov_iter as source
+ *
+ * Copy all pages from iov_iter to bio.
+ * Returns 0 on success, or error on failure.
+ */
+static int __bio_copy_iov_read(struct bio *bio, struct iov_iter iter)
 {
-	int ret = 0, i;
+	int i;
 	struct bio_vec *bvec;
-	struct iov_iter iov_iter = *iter;
 	bio_for_each_segment_all(bvec, bio, i) {
-		char *bv_addr = page_address(bvec->bv_page);
-		unsigned int bv_len = bvec->bv_len;
-
-		while (bv_len && iov_iter.count) {
-			struct iovec iov = iov_iter_iovec(&iov_iter);
-			unsigned int bytes = min_t(unsigned int, bv_len,
-						   iov.iov_len);
-
-			if (!ret) {
-				if (to_user)
-					ret = copy_to_user(iov.iov_base,
-							   bv_addr, bytes);
-
-				if (from_user)
-					ret = copy_from_user(bv_addr,
-							     iov.iov_base,
-							     bytes);
-
-				if (ret)
-					ret = -EFAULT;
-			}
+		ssize_t ret;
-			bv_len -= bytes;
-			bv_addr += bytes;
-			iov_iter_advance(&iov_iter, bytes);
-		}
+		ret = copy_page_from_iter(bvec->bv_page,
+					  bvec->bv_offset,
+					  bvec->bv_len,
+					  &iter);
-		if (do_free_page)
-			__free_page(bvec->bv_page);
+		if (!iov_iter_count(&iter))
+			break;
+
+		if (ret < bvec->bv_len)
+			return -EFAULT;
 	}
-	return ret;
+	return 0;
+}
+
+/**
+ * __bio_copy_iov_write - copy all pages from bio to iov_iter
+ * @bio: The bio which describes the I/O as source
+ * @iter: iov_iter as destination
+ *
+ * Copy all pages from bio to iov_iter.
+ * Returns 0 on success, or error on failure.
+ */
+static int __bio_copy_iov_write(struct bio *bio, struct iov_iter iter)
+{
+	int i;
+	struct bio_vec *bvec;
+
+	bio_for_each_segment_all(bvec, bio, i) {
+		ssize_t ret;
+
+		ret = copy_page_to_iter(bvec->bv_page,
+					bvec->bv_offset,
+					bvec->bv_len,
+					&iter);
+
+		if (!iov_iter_count(&iter))
+			break;
+
+		if (ret < bvec->bv_len)
+			return -EFAULT;
+	}
+
+	return 0;
+}
+
+/**
+ * __bio_copy_iov - copy all pages between bio and iov_iter
+ * @bio: The bio which describes the I/O
+ * @iter: iov_iter either as source or destination
+ * @to_iov: whether to %READ (0) or %WRITE (1)
+ *
+ * Simple wrapper around __bio_copy_iov_{write,read}().
+ * Returns 0 on success, or the error returned as-is on failure.
+ */
+static inline int __bio_copy_iov(struct bio *bio, struct iov_iter iter,
+				 int to_iov)
+{
+	if (to_iov == WRITE)
+		return __bio_copy_iov_write(bio, iter);
+	else
+		return __bio_copy_iov_read(bio, iter);
 }
 /**
@@ -1106,11 +1144,10 @@ int bio_uncopy_user(struct bio *bio)
 		 * if we're in a workqueue, the request is orphaned, so
 		 * don't copy into a random user address space, just free.
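The by-value iterator parameter mentioned in the commit message can be illustrated outside the kernel. The sketch below uses a hypothetical struct toy_iter and made-up helpers (none of these are kernel names) to show the effect: the callee advances its own private copy while copying segment by segment, and the caller's iterator is left untouched, so no explicit local copy is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy iterator over a flat source buffer; NOT the kernel's struct iov_iter. */
struct toy_iter {
	const char *buf;	/* next byte to hand out */
	size_t count;		/* bytes remaining */
};

/* Copy up to len bytes out of the iterator and advance it. */
static size_t toy_copy_from_iter(char *dst, size_t len, struct toy_iter *it)
{
	size_t n = len < it->count ? len : it->count;

	memcpy(dst, it->buf, n);
	it->buf += n;
	it->count -= n;
	return n;
}

/*
 * Fill nr_segs segments of seg_len bytes each from the iterator. The
 * iterator is taken BY VALUE, so advancing it here does not affect the
 * caller's copy. Returns the total number of bytes copied.
 */
static size_t toy_copy_all(char *dst, size_t seg_len, size_t nr_segs,
			   struct toy_iter iter)
{
	size_t total = 0;
	size_t i;

	for (i = 0; i < nr_segs; i++) {
		total += toy_copy_from_iter(dst + i * seg_len, seg_len, &iter);
		if (!iter.count)
			break;	/* source exhausted */
	}
	return total;
}
```

The trade-off is one struct copy per call in exchange for simpler call sites, which is acceptable here because the iterator is a small fixed-size structure.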
[RFC PATCH v2 0/9] simplify block layer based on immutable biovecs
This is the second attempt at simplifying the block layer based on
immutable biovecs. Immutable biovecs, implemented by Kent Overstreet, have
been available in mainline since v3.14. Their original goal was actually
to make generic_make_request() accept arbitrarily sized bios, and to push
the splitting down to the drivers or wherever it's required. See also
discussions in the past, [1] [2] [3] [4].

This will bring not only performance improvements, but also a large
reduction in code complexity all over the block layer. The performance
gain is possible because bio_add_page() no longer has to check unnecessary
conditions such as queue limits or whether biovecs are mergeable. Those
checks will be delegated to the driver level. Kent already said that he
benchmarked the impact of this with fio on a micron p320h, which showed a
definite positive impact.

Moreover, this patchset also allows a lot of code to be deleted, mainly
because of the removal of merge_bvec_fn() callbacks. We have been aware
that handling bio merging consistently has always been a delicate issue
for stacking block drivers (e.g. md and bcache). This simplification will
help every individual block driver avoid such issues.

- Patch 01/09 allows generic_make_request to handle arbitrarily sized
  bios, by making make_request functions call blk_queue_split().
- Patch 02/09 simplifies __bio_add_page() to avoid calling
  ->merge_bvec_fn().
- Patch 03/09 allows queue_bounce to handle bios with > BIO_MAX_PAGES
- Patch 04/09 gets rid of workarounds in bcache.
- Patch 05/09 removes unnecessary biovec merging parts in btrfs
- Patch 06/09 removes unnecessary biovec merging parts in MD-RAID5.
- Patch 07/09 removes ->merge_bvec_fn() completely, which affects a lot
  of block drivers, such as Ceph RBD, DRBD, device mapper, MD, etc.
- Patch 08/09 makes filesystems use helper bio_add_page().
- Patch 09/09 updates document about biovecs.

Patches are against 3.19-rc4. These are also available in my git repo at:

  https://github.com/dongsupark/linux.git block-generic-req

It's recommended to apply this patchset on top of its preparation
patchset, i.e. "preparation for block layer simplification". [5]

This patchset is in turn also a prerequisite of other consecutive
patchsets, e.g. multipage biovecs, rewriting plugging, or rewriting
direct-IO, which are yet to be done.

This patchset should not bring any regression to end-users. I already
tested it with xfstests multiple times. On the other hand, the multipage
biovecs part is currently in heavy development, with help of Kent and Ming
Lin. Those experimental patches are also available on other branches on my
git tree. Once they are done, I'm also going to post them to get reviews.

Comments are welcome.
Dongsu

Changes in v2:
- split up preparation patches into a separate series
- remove a patch "block: simplify issueing discard, write_same, zeroout",
  as suggested by Christoph Hellwig.
- move a patch "btrfs: make use of immutable biovecs" to the upcoming
  series.
- minor change in ps3vram suggested by Geoff Levand
- make bio_add_page() warn once on a cloned bio.
- add more comments and commit messages to patch 02 "block: simplify
  bio_add_page()"

[1] https://lkml.org/lkml/2014/11/23/263
[2] https://lkml.org/lkml/2013/11/25/732
[3] https://lkml.org/lkml/2014/2/26/618
[4] https://lkml.org/lkml/2014/12/22/128
[5] https://lkml.org/lkml/2015/1/12/255

Dongsu Park (1):
  Documentation: update notes in biovecs about arbitrarily sized bios

Kent Overstreet (8):
  block: make generic_make_request handle arbitrarily sized bios
  block: simplify bio_add_page()
  block: allow __blk_queue_bounce() to handle bios larger than
    BIO_MAX_PAGES
  bcache: clean up hacks around bio_split_pool
  btrfs: remove bio splitting and merge_bvec_fn() calls
  md/raid5: get rid of bio_fits_rdev()
  block: kill merge_bvec_fn() completely
  fs: use helper bio_add_page() instead of open coding on bi_io_vec

 Documentation/block/biovecs.txt |  10 +-
 block/bio.c                     | 135 +++
 block/blk-core.c                |  19 ++--
 block/blk-merge.c               | 140 ++--
 block/blk-mq.c                  |   2 +
 block/blk-settings.c            |  22 -
 block/bounce.c                  |  60 ++--
 drivers/block/drbd/drbd_int.h   |   1 -
 drivers/block/drbd/drbd_main.c  |   1 -
 drivers/block/drbd/drbd_req.c   |  37 +---
 drivers/block/pktcdvd.c         |  27 +-
 drivers/block/ps3vram.c         |   2 +
 drivers/block/rbd.c             |  47 --
 drivers/block/rsxx/dev.c        |   2 +
 drivers/block/umem.c            |   2 +
 drivers/block/zram/zram_drv.c   |   2 +
 drivers/md/bcache/bcache.h
[PATCH v2 1/9] block: make generic_make_request handle arbitrarily sized bios
From: Kent Overstreet The way the block layer is currently written, it goes to great lengths to avoid having to split bios; upper layer code (such as bio_add_page()) checks what the underlying device can handle and tries to always create bios that don't need to be split. But this approach becomes unwieldy and eventually breaks down with stacked devices and devices with dynamic limits, and it adds a lot of complexity. If the block layer could split bios as needed, we could eliminate a lot of complexity elsewhere - particularly in stacked drivers. Code that creates bios can then create whatever size bios are convenient, and more importantly stacked drivers don't have to deal with both their own bio size limitations and the limitations of the (potentially multiple) devices underneath them. In the future this will let us delete merge_bvec_fn and a bunch of other code. We do this by adding calls to blk_queue_split() to the various make_request functions that need it - a few can already handle arbitrary size bios. Note that we add the call _after_ any call to blk_queue_bounce(); this means that blk_queue_split() and blk_recalc_rq_segments() don't need to be concerned with bouncing affecting segment merging. Some make_request_fn() callbacks were simple enough to audit and verify they don't need blk_queue_split() calls. The skipped ones are: * nfhd_make_request (arch/m68k/emu/nfblock.c) * axon_ram_make_request (arch/powerpc/sysdev/axonram.c) * simdisk_make_request (arch/xtensa/platforms/iss/simdisk.c) * brd_make_request (ramdisk - drivers/block/brd.c) * mtip_submit_request (drivers/block/mtip32xx/mtip32xx.c) * loop_make_request * null_queue_bio * bcache's make_request fns Some others are almost certainly safe to remove now, but will be left for future patches. 
Cc: Ming Lin Cc: Jens Axboe Cc: Christoph Hellwig Cc: Al Viro Cc: Ming Lei Cc: Neil Brown Cc: Alasdair Kergon Cc: Mike Snitzer Cc: dm-de...@redhat.com Cc: Lars Ellenberg Cc: drbd-u...@lists.linbit.com Cc: Jiri Kosina Cc: Geoff Levand Cc: Jim Paris Cc: Joshua Morris Cc: Philip Kelleher Cc: Minchan Kim Cc: Nitin Gupta Cc: Oleg Drokin Cc: Andreas Dilger Signed-off-by: Kent Overstreet [dpark: skip more mq-based drivers, resolve merge conflicts, etc.] Signed-off-by: Dongsu Park --- block/blk-core.c| 19 ++-- block/blk-merge.c | 151 ++-- block/blk-mq.c | 2 + drivers/block/drbd/drbd_req.c | 2 + drivers/block/pktcdvd.c | 6 +- drivers/block/ps3vram.c | 2 + drivers/block/rsxx/dev.c| 2 + drivers/block/umem.c| 2 + drivers/block/zram/zram_drv.c | 2 + drivers/md/dm.c | 2 + drivers/md/md.c | 2 + drivers/s390/block/dcssblk.c| 2 + drivers/s390/block/xpram.c | 2 + drivers/staging/lustre/lustre/llite/lloop.c | 2 + include/linux/blkdev.h | 3 + 15 files changed, 179 insertions(+), 22 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 30f6153..e86ad75 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -585,6 +585,10 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id) if (q->id < 0) goto fail_q; + q->bio_split = bioset_create(4, 0); + if (!q->bio_split) + goto fail_id; + q->backing_dev_info.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE; q->backing_dev_info.state = 0; @@ -594,7 +598,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id) err = bdi_init(>backing_dev_info); if (err) - goto fail_id; + goto fail_split; setup_timer(>backing_dev_info.laptop_mode_wb_timer, laptop_mode_timer_fn, (unsigned long) q); @@ -636,6 +640,8 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id) fail_bdi: bdi_destroy(>backing_dev_info); +fail_split: + bioset_free(q->bio_split); fail_id: ida_simple_remove(_queue_ida, q->id); fail_q: @@ -1552,6 +1558,8 @@ void blk_queue_bio(struct request_queue *q, struct bio *bio) 
struct request *req; unsigned int request_count = 0; + blk_queue_split(q, &bio, q->bio_split); + /* * low level driver can indicate that it wants pages above a * certain limit bounced to low memory (ie for highmem, or even @@ -1775,15 +1783,6 @@ generic_make_request_checks(struct bio *bio) goto end_io; } - if (likely(bio_is_rw(bio) && - nr_sectors > queue_max_hw_sectors(q))) { - printk(KERN_ERR "bio too big device %s (%u >
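The hunks above drop the "bio too big" rejection in generic_make_request_checks() and instead call blk_queue_split() at the top of blk_queue_bio(). As a rough userspace illustration of that idea only, the sketch below cuts an oversized I/O range into pieces no larger than a per-queue limit instead of failing it. It is not the kernel implementation: `struct chunk` and `split_io()` are hypothetical names invented for this example, and it only models a max-sectors limit, whereas the real splitting code presumably has to honor other queue limits as well.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one piece of a split I/O; not a kernel type. */
struct chunk {
	unsigned long long start;	/* first sector of this piece */
	unsigned int nr_sectors;	/* length of this piece in sectors */
};

/*
 * Split the range [start, start + nr_sectors) into pieces of at most
 * max_sectors each. Up to cap pieces are stored in out; the return value
 * is the total number of pieces the range needs, which may exceed cap.
 */
static size_t split_io(unsigned long long start, unsigned int nr_sectors,
		       unsigned int max_sectors, struct chunk *out, size_t cap)
{
	size_t n = 0;

	/* A zero limit would never make progress. */
	assert(max_sectors > 0);

	while (nr_sectors > 0) {
		unsigned int len = nr_sectors < max_sectors ?
				   nr_sectors : max_sectors;

		if (n < cap) {
			out[n].start = start;
			out[n].nr_sectors = len;
		}
		n++;
		start += len;
		nr_sectors -= len;
	}

	return n;
}
```

With a limit of 1024 sectors, a 2500-sector request starting at sector 0 comes back as three pieces: two of 1024 sectors and a final one of 452. In the old code path shown in the removed lines, such a request would have been rejected with an error.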
[PATCH v2 4/7] block: refactor bio_get_user_pages() from __bio_map_user_iov()
From: Kent Overstreet Split part of the code in __bio_map_user_iov() out into a new function, bio_get_user_pages(). This helper will be used by upcoming block layer rework, especially in the direct-IO path. Note that this relies on the recent change to make generic_make_request() take arbitrarily sized bios - we're not using bio_add_page() here. Cc: Christoph Hellwig Cc: Jens Axboe Signed-off-by: Kent Overstreet [dpark: add more description in commit message] Signed-off-by: Dongsu Park --- block/bio.c | 130 +++- include/linux/bio.h | 2 + 2 files changed, 70 insertions(+), 62 deletions(-) diff --git a/block/bio.c b/block/bio.c index 9ad76ed..7ff846d 100644 --- a/block/bio.c +++ b/block/bio.c @@ -1302,19 +1302,79 @@ struct bio *bio_copy_user(struct request_queue *q, struct rq_map_data *map_data, } EXPORT_SYMBOL(bio_copy_user); +/** + * bio_get_user_pages - pin user pages and add them to a biovec + * @bio: bio to add pages to + * @uaddr: start of user address + * @len: length in bytes + * @write_to_vm: bool indicating writing to pages or not + * + * Pins pages for up to @len bytes and appends them to @bio's bvec array. May + * pin only part of the requested pages - @bio need not have room for all the + * pages and can already have had pages added to it. + * + * Returns the number of bytes from @len added to @bio. 
+ */ +ssize_t bio_get_user_pages(struct bio *bio, struct iov_iter *i, int write_to_vm) +{ + while (bio->bi_vcnt < bio->bi_max_vecs && iov_iter_count(i)) { + struct iovec iov = iov_iter_iovec(i); + int ret; + unsigned nr_pages, bytes; + unsigned offset = offset_in_page(iov.iov_base); + struct bio_vec *bv; + struct page **pages; + + nr_pages = min_t(size_t, + DIV_ROUND_UP(iov.iov_len + offset, PAGE_SIZE), + bio->bi_max_vecs - bio->bi_vcnt); + + bv = &bio->bi_io_vec[bio->bi_vcnt]; + pages = (void *) bv; + + ret = get_user_pages_fast((unsigned long) iov.iov_base, + nr_pages, write_to_vm, pages); + if (ret < 0) { + if (bio->bi_vcnt) + return 0; + + return ret; + } + + bio->bi_vcnt += ret; + bytes = ret * PAGE_SIZE - offset; + + while (ret--) { + bv[ret].bv_page = pages[ret]; + bv[ret].bv_len = PAGE_SIZE; + bv[ret].bv_offset = 0; + } + + bv[0].bv_offset += offset; + bv[0].bv_len -= offset; + + if (bytes > iov.iov_len) { + bio->bi_io_vec[bio->bi_vcnt - 1].bv_len -= + bytes - iov.iov_len; + bytes = iov.iov_len; + } + + bio->bi_iter.bi_size += bytes; + iov_iter_advance(i, bytes); + } + + return 0; +} +EXPORT_SYMBOL(bio_get_user_pages); + static struct bio *__bio_map_user_iov(struct request_queue *q, struct block_device *bdev, const struct iov_iter *iter, int write_to_vm, gfp_t gfp_mask) { - int j; + ssize_t ret; int nr_pages = 0; - struct page **pages; struct bio *bio; - int cur_page = 0; - int ret, offset; - struct iov_iter i; - struct iovec iov; nr_pages = iov_count_pages(iter, queue_dma_alignment(q)); if (nr_pages < 0) @@ -1327,57 +1387,10 @@ static struct bio *__bio_map_user_iov(struct request_queue *q, if (!bio) return ERR_PTR(-ENOMEM); - ret = -ENOMEM; - pages = kcalloc(nr_pages, sizeof(struct page *), gfp_mask); - if (!pages) + ret = bio_get_user_pages(bio, (struct iov_iter *)iter, write_to_vm); + if (ret < 0) goto out; - iov_for_each(iov, i, *iter) { - unsigned long uaddr = (unsigned long) iov.iov_base; - unsigned long len = iov.iov_len; - unsigned long end = (uaddr + len 
+ PAGE_SIZE - 1) >> PAGE_SHIFT; - unsigned long start = uaddr >> PAGE_SHIFT; - const int local_nr_pages = end - start; - const int page_limit = cur_page + local_nr_pages; - - ret = get_user_pages_fast(uaddr, local_nr_pages, - write_to_vm, &pages[cur_page]); - if (ret < local_nr_pages) { - ret = -EFAULT; - goto out_unmap; - } - - offset = uaddr & ~PAGE_MASK; -