Re: strace of io_uring events?
On Tue, Jul 21, 2020, at 11:58 AM, Stefano Garzarella wrote: > my use case concerns virtualization. The idea, that I described in the > proposal of io-uring restrictions [1], is to share io_uring CQ and SQ queues > with a guest VM for block operations. Virtualization being a strong security barrier is in eternal conflict with maximizing performance. All of these "let's add a special guest/host channel" are high risk areas. And this effort in particular - is it *really* worth it to expose a brand new, fast moving Linux kernel interface (that probably hasn't been fuzzed as much as it needs to be) to virtual machines? People who want maximum performance at the cost of a bit of security already have the choice to use Linux containers, where they can use io_uring natively.
Re: kernel BUG at mm/hugetlb.c:LINE!
On Tue, May 12, 2020, at 11:04 AM, Miklos Szeredi wrote: > > However, in this syzbot test case the 'file' is in an overlayfs filesystem > > created as follows: > > > > mkdir("./file0", 000) = 0 > > mount(NULL, "./file0", "hugetlbfs", MS_MANDLOCK|MS_POSIXACL, NULL) = 0 > > chdir("./file0")= 0 > > mkdir("./file1", 000) = 0 > > mkdir("./bus", 000) = 0 > > mkdir("./file0", 000) = 0 > > mount("\177ELF\2\1\1", "./bus", "overlay", 0, > > "lowerdir=./bus,workdir=./file1,u"...) = 0 Is there any actual valid use case for mounting an overlayfs on top of hugetlbfs? I can't think of one. Why isn't the response to this to instead only allow mounting overlayfs on top of basically a set of whitelisted filesystems?
Re: aio poll, io_pgetevents and a new in-kernel poll API V3
On Thu, Jan 18, 2018, at 11:44 AM, Jeff Moyer wrote: > Jeff Moyerwrites: > > > FYI, this kernel has issues. It will boot up, but I don't have > > networking, and even rebooting doesn't succeed. I'm looking into it. > > A bisect lands on: eventfd: switch to ->poll_mask. That's not super > helpful, though. I did run the ltp eventfd2 tests, and they all pass. FWIW: https://git.gnome.org/browse/glib/commit/?id=3904c8761a60dbadbdfaf98fe23ff19cbdcc4a9a Since that a lot of userspace (including NetworkManager) uses eventfd. I haven't tried this patchset myself but I'd look at what the GLib mainloop is doing.
Re: aio poll, io_pgetevents and a new in-kernel poll API V3
On Thu, Jan 18, 2018, at 11:44 AM, Jeff Moyer wrote: > Jeff Moyer writes: > > > FYI, this kernel has issues. It will boot up, but I don't have > > networking, and even rebooting doesn't succeed. I'm looking into it. > > A bisect lands on: eventfd: switch to ->poll_mask. That's not super > helpful, though. I did run the ltp eventfd2 tests, and they all pass. FWIW: https://git.gnome.org/browse/glib/commit/?id=3904c8761a60dbadbdfaf98fe23ff19cbdcc4a9a Since that a lot of userspace (including NetworkManager) uses eventfd. I haven't tried this patchset myself but I'd look at what the GLib mainloop is doing.
Re: [RFC] EPOLL_KILLME: New flag to epoll_wait() that subscribes process to death row (new syscall)
On Wed, Nov 1, 2017, at 03:02 PM, Shawn Landden wrote: > > This solves the fact that epoll_pwait() already is a 6 argument (maximum > allowed) syscall. But what if the process has multiple epoll() instances in > multiple threads? Well, that's a subset of the general question of - what is the interaction of this system call and threading? It looks like you've prototyped this out in userspace with systemd, but from a quick glance at the current git, systemd's threading is limited doing sync()/fsync() and gethostbyname() async. But languages with a GC tend to at least use a background thread for that, and of course lots of modern userspace makes heavy use of multithreading (or variants like goroutines). A common pattern though is to have a "main thread" that acts as a control point and runs the mainloop (particularly for anything with a GUI). That's going to be the thing calling prctl(SET_IDLE) - but I think its idle state should implicitly affect the whole process, since for a lot of apps those other threads are going to just be "background". It'd probably then be an error to use prctl(SET_IDLE) in more than one thread ever? (Although that might break in golang due to the way goroutines can be migrated across threads) That'd probably be a good "generality test" - what would it take to have this system call be used for a simple golang webserver app that's e.g. socket activated by systemd, or a Kubernetes service? Or another really interesting case would be qemu; make it easy to flag VMs as always having this state (most of my testing VMs are like this; it's OK if they get destroyed, I just reinitialize them from the gold state). Going back to threading - a tricky thing we should handle in general is when userspace libraries create threads that are unknown to the app; the "async gethostbyname()" is a good example. To be conservative we'd likely need to "fail non-idle", but figure out some way tell the kernel for e.g. GC threads that they're still idle.
Re: [RFC] EPOLL_KILLME: New flag to epoll_wait() that subscribes process to death row (new syscall)
On Wed, Nov 1, 2017, at 03:02 PM, Shawn Landden wrote: > > This solves the fact that epoll_pwait() already is a 6 argument (maximum > allowed) syscall. But what if the process has multiple epoll() instances in > multiple threads? Well, that's a subset of the general question of - what is the interaction of this system call and threading? It looks like you've prototyped this out in userspace with systemd, but from a quick glance at the current git, systemd's threading is limited doing sync()/fsync() and gethostbyname() async. But languages with a GC tend to at least use a background thread for that, and of course lots of modern userspace makes heavy use of multithreading (or variants like goroutines). A common pattern though is to have a "main thread" that acts as a control point and runs the mainloop (particularly for anything with a GUI). That's going to be the thing calling prctl(SET_IDLE) - but I think its idle state should implicitly affect the whole process, since for a lot of apps those other threads are going to just be "background". It'd probably then be an error to use prctl(SET_IDLE) in more than one thread ever? (Although that might break in golang due to the way goroutines can be migrated across threads) That'd probably be a good "generality test" - what would it take to have this system call be used for a simple golang webserver app that's e.g. socket activated by systemd, or a Kubernetes service? Or another really interesting case would be qemu; make it easy to flag VMs as always having this state (most of my testing VMs are like this; it's OK if they get destroyed, I just reinitialize them from the gold state). Going back to threading - a tricky thing we should handle in general is when userspace libraries create threads that are unknown to the app; the "async gethostbyname()" is a good example. To be conservative we'd likely need to "fail non-idle", but figure out some way tell the kernel for e.g. GC threads that they're still idle.
Re: [RFC] EPOLL_KILLME: New flag to epoll_wait() that subscribes process to death row (new syscall)
On Wed, Nov 1, 2017, at 11:16 AM, Colin Walters wrote: > > as the maintainer of glib2 which is used by a *lot* of things; I'm not (I meant to say "a" maintainer) Also, while I'm not an expert in Android, I think the "what to kill" logic there lives in userspace, right? So it feels like we should expose this state in e.g. /proc and allow userspace daemons (e.g. systemd, kubelet) to perform idle collection too, even if the system isn't actually low on resources from the kernel's perspective. And doing that requires some sort of kill(pid, SIGKILL_IF_IDLE) or so?
Re: [RFC] EPOLL_KILLME: New flag to epoll_wait() that subscribes process to death row (new syscall)
On Wed, Nov 1, 2017, at 11:16 AM, Colin Walters wrote: > > as the maintainer of glib2 which is used by a *lot* of things; I'm not (I meant to say "a" maintainer) Also, while I'm not an expert in Android, I think the "what to kill" logic there lives in userspace, right? So it feels like we should expose this state in e.g. /proc and allow userspace daemons (e.g. systemd, kubelet) to perform idle collection too, even if the system isn't actually low on resources from the kernel's perspective. And doing that requires some sort of kill(pid, SIGKILL_IF_IDLE) or so?
Re: [RFC] EPOLL_KILLME: New flag to epoll_wait() that subscribes process to death row (new syscall)
On Wed, Nov 1, 2017, at 01:32 AM, Shawn Landden wrote: > It is common for services to be stateless around their main event loop. > If a process passes the EPOLL_KILLME flag to epoll_wait5() then it > signals to the kernel that epoll_wait5() may not complete, and the kernel > may send SIGKILL if resources get tight. > I've thought about something like this in the past too and would love to see it land. Bigger picture, this also comes up in (server) container environments, see e.g.: https://docs.openshift.com/container-platform/3.3/admin_guide/idling_applications.html There's going to be a long slog getting apps to actually make use of this, but I suspect if it gets wrapped up nicely in some "framework" libraries for C/C++, and be bound in the language ecosystems like golang we could see a fair amount of adoption on the order of a year or two. However, while I understand why it feels natural to tie this to epoll, as the maintainer of glib2 which is used by a *lot* of things; I'm not sure we're going to port to epoll anytime soon. Why not just make this a prctl()? It's not like it's really any less racy to do: prctl(PR_SET_IDLE) epoll() and this also allows: prctl(PR_SET_IDLE) poll() And as this is most often just going to be an optional hint it's easier to e.g. just ignore EINVAL from the prctl().
Re: [RFC] EPOLL_KILLME: New flag to epoll_wait() that subscribes process to death row (new syscall)
On Wed, Nov 1, 2017, at 01:32 AM, Shawn Landden wrote: > It is common for services to be stateless around their main event loop. > If a process passes the EPOLL_KILLME flag to epoll_wait5() then it > signals to the kernel that epoll_wait5() may not complete, and the kernel > may send SIGKILL if resources get tight. > I've thought about something like this in the past too and would love to see it land. Bigger picture, this also comes up in (server) container environments, see e.g.: https://docs.openshift.com/container-platform/3.3/admin_guide/idling_applications.html There's going to be a long slog getting apps to actually make use of this, but I suspect if it gets wrapped up nicely in some "framework" libraries for C/C++, and be bound in the language ecosystems like golang we could see a fair amount of adoption on the order of a year or two. However, while I understand why it feels natural to tie this to epoll, as the maintainer of glib2 which is used by a *lot* of things; I'm not sure we're going to port to epoll anytime soon. Why not just make this a prctl()? It's not like it's really any less racy to do: prctl(PR_SET_IDLE) epoll() and this also allows: prctl(PR_SET_IDLE) poll() And as this is most often just going to be an optional hint it's easier to e.g. just ignore EINVAL from the prctl().
Re: [PATCH 1/3] autofs - fix AT_NO_AUTOMOUNT not being honored
On Tue, Aug 8, 2017, at 12:26 AM, Ian Kent wrote: > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -3022,8 +3022,7 @@ static inline int vfs_lstat(const char __user *name, > struct kstat *stat) > static inline int vfs_fstatat(int dfd, const char __user *filename, > struct kstat *stat, int flags) > { > - return vfs_statx(dfd, filename, flags | AT_NO_AUTOMOUNT, > - stat, STATX_BASIC_STATS); > + return vfs_statx(dfd, filename, flags, stat, STATX_BASIC_STATS); > } > static inline int vfs_fstat(int fd, struct kstat *stat) > { This is reverting the fstatat() prat of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=deccf497d804a4c5fca2dbfad2f104675a6f9102 Which itself seems weird to me - it looks like we were unconditionally forcing on AT_NO_AUTOMOUNT regardless of what userspace passed? So perhaps a Fixes: deccf497d804a4c5fca2dbfad2f104675a6f9102 is appropriate here? I understand that for stat()/lstat() we didn't expose the option to userspace, so the behavior was...ah, there's this note in man-pages (man-pages-4.09-3.fc26.noarch): > On Linux, lstat() will generally not trigger automounter action, whereas > stat() will (but see fstatat(2)). I have no idea of the history here, but maybe it makes sense to drop the AT_NO_AUTOMOUNT from the vfs_stat() too?
Re: [PATCH 1/3] autofs - fix AT_NO_AUTOMOUNT not being honored
On Tue, Aug 8, 2017, at 12:26 AM, Ian Kent wrote: > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -3022,8 +3022,7 @@ static inline int vfs_lstat(const char __user *name, > struct kstat *stat) > static inline int vfs_fstatat(int dfd, const char __user *filename, > struct kstat *stat, int flags) > { > - return vfs_statx(dfd, filename, flags | AT_NO_AUTOMOUNT, > - stat, STATX_BASIC_STATS); > + return vfs_statx(dfd, filename, flags, stat, STATX_BASIC_STATS); > } > static inline int vfs_fstat(int fd, struct kstat *stat) > { This is reverting the fstatat() prat of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=deccf497d804a4c5fca2dbfad2f104675a6f9102 Which itself seems weird to me - it looks like we were unconditionally forcing on AT_NO_AUTOMOUNT regardless of what userspace passed? So perhaps a Fixes: deccf497d804a4c5fca2dbfad2f104675a6f9102 is appropriate here? I understand that for stat()/lstat() we didn't expose the option to userspace, so the behavior was...ah, there's this note in man-pages (man-pages-4.09-3.fc26.noarch): > On Linux, lstat() will generally not trigger automounter action, whereas > stat() will (but see fstatat(2)). I have no idea of the history here, but maybe it makes sense to drop the AT_NO_AUTOMOUNT from the vfs_stat() too?
Re: [PATCH 1/3] fs, xfs: introduce S_IOMAP_IMMUTABLE
On Mon, Jul 31, 2017, at 02:23 PM, Darrick J. Wong wrote: > I don't think F_SEAL_{SHRINK,GROW} prevents reflinking or CoW of file data, > which are two things that cannot happen under S_IOMAP_IMMUTABLE that > aren't size changes. From the implementation it looks like shrink and > grow are only supposed to disallow changes to i_size, not i_blocks (or > the file block map). True. > Then again, I suppose F_SEAL_* only work on shmem, so maybe it simply > isn't defined for any other filesystem...? e.g. it doesn't prohibit > reflink, but the only fs implementing seals doesn't support reflink. > > > > Seals cannot be removed, which is too strict for the S_IOMAP_IMMUTABLE > user cases being presented. To be clear, the set of use cases is swap files and DAX, right? Or is there anything else? I can't imagine why anyone would want to turn a swap file back into a regular file. I haven't fully followed DAX, but I'd take your word for it if people want to be able to remove the flag after. Anyways, I think your broader point is right; the use cases are different enough that it doesn't make sense to try to add S_CONTENT_IMMUTABLE (or however one decides to call it) at the same time.
Re: [PATCH 1/3] fs, xfs: introduce S_IOMAP_IMMUTABLE
On Mon, Jul 31, 2017, at 02:23 PM, Darrick J. Wong wrote: > I don't think F_SEAL_{SHRINK,GROW} prevents reflinking or CoW of file data, > which are two things that cannot happen under S_IOMAP_IMMUTABLE that > aren't size changes. From the implementation it looks like shrink and > grow are only supposed to disallow changes to i_size, not i_blocks (or > the file block map). True. > Then again, I suppose F_SEAL_* only work on shmem, so maybe it simply > isn't defined for any other filesystem...? e.g. it doesn't prohibit > reflink, but the only fs implementing seals doesn't support reflink. > > > > Seals cannot be removed, which is too strict for the S_IOMAP_IMMUTABLE > user cases being presented. To be clear, the set of use cases is swap files and DAX, right? Or is there anything else? I can't imagine why anyone would want to turn a swap file back into a regular file. I haven't fully followed DAX, but I'd take your word for it if people want to be able to remove the flag after. Anyways, I think your broader point is right; the use cases are different enough that it doesn't make sense to try to add S_CONTENT_IMMUTABLE (or however one decides to call it) at the same time.
Re: [PATCH 1/3] fs, xfs: introduce S_IOMAP_IMMUTABLE
On Mon, Jul 31, 2017, at 12:32 PM, Colin Walters wrote: > On Mon, Jul 31, 2017, at 12:29 PM, Dan Williams wrote: > > > > How is S_CONTENTS_IMMUTABLE different than S_IMMUTABLE? > > We still want the ability to make hardlinks. Also of course, symmetrically, to unlink. If we used S_IMMUTABLE for /etc/sudoers, it'd still be racy since one would have to transiently remove the flag in order to replace it with a new version. Related to this topic is the fact that S_IMMUTABLE is itself mutable; I think once S_IMMUTABLE_CONTENTS is set, it would not be able to made mutable again. Also I just remembered that since then memfd_create() and more notably fcntl(F_ADD_SEALS) landed - in fact it already has flags for what we want here AFAICS. Your S_IOMAP_IMMUTABLE is fcntl(F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW) and mine just adds in F_SEAL_WRITE. I think there was some discussion of the seals for persistent files when memfd_create() landed, but I can't find it offhand.
Re: [PATCH 1/3] fs, xfs: introduce S_IOMAP_IMMUTABLE
On Mon, Jul 31, 2017, at 12:32 PM, Colin Walters wrote: > On Mon, Jul 31, 2017, at 12:29 PM, Dan Williams wrote: > > > > How is S_CONTENTS_IMMUTABLE different than S_IMMUTABLE? > > We still want the ability to make hardlinks. Also of course, symmetrically, to unlink. If we used S_IMMUTABLE for /etc/sudoers, it'd still be racy since one would have to transiently remove the flag in order to replace it with a new version. Related to this topic is the fact that S_IMMUTABLE is itself mutable; I think once S_IMMUTABLE_CONTENTS is set, it would not be able to made mutable again. Also I just remembered that since then memfd_create() and more notably fcntl(F_ADD_SEALS) landed - in fact it already has flags for what we want here AFAICS. Your S_IOMAP_IMMUTABLE is fcntl(F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW) and mine just adds in F_SEAL_WRITE. I think there was some discussion of the seals for persistent files when memfd_create() landed, but I can't find it offhand.
Re: [PATCH 1/3] fs, xfs: introduce S_IOMAP_IMMUTABLE
On Mon, Jul 31, 2017, at 12:29 PM, Dan Williams wrote: > > How is S_CONTENTS_IMMUTABLE different than S_IMMUTABLE? We still want the ability to make hardlinks.
Re: [PATCH 1/3] fs, xfs: introduce S_IOMAP_IMMUTABLE
On Mon, Jul 31, 2017, at 12:29 PM, Dan Williams wrote: > > How is S_CONTENTS_IMMUTABLE different than S_IMMUTABLE? We still want the ability to make hardlinks.
Re: [PATCH 1/3] fs, xfs: introduce S_IOMAP_IMMUTABLE
On Sat, Jul 29, 2017, at 03:43 PM, Dan Williams wrote: > An inode with this flag set indicates that the file's block map cannot > be changed, no size change, deletion, hole-punch, range collapse, or > reflink. > > The implementation of toggling the flag and sealing the state of the > extent map is saved for a later patch. The functionality provided by > S_IOMAP_IMMUTABLE, once toggle support is added, will be a superset of > that provided by S_SWAPFILE, and it is targeted to replace it. > > For now, only xfs and the core vfs are updated to consider the new flag. Quite a while ago I started a request for O_OBJECT: http://www.spinics.net/lists/linux-fsdevel/msg75085.html A few months ago I was thinking about that more and realized it'd likely be more palatable to land as an inode flag, like you're doing here. Now, S_IOMAP_IMMUTABLE would be quite close to the semantics I want for ostree, except we also want to disallow changes to the inode uid, gid or mode. (Extended attributes are a whole other story; but I'd like to at least disallow changes to the security. namespace). The goal here is mostly about resilience to *accidental* changes; think an admin doing `cp /path/to/binary /usr/bin/bash` which does open(O_TRUNC), which would hence corrupt all hardlinks. S_IOMAP_IMMUTABLE would give a lot of great protection against those types of accidental changes - most of them are either going to be open(O_TRUNC) or O_APPEND. Since you're touching various write paths here, perhaps we can also add S_CONTENTS_IMMUTABLE or something at the same time? If this lands as is - I'm quite likely to change ostree to use it; any objections to that? As mentioned in the thread, there are several other cases of "content immutable" files in userspace, such as QEMU "qcow2", git objects. And really the most classic example is /etc/sudoers and the need for a special "visudo" program to really ensure that editors don't do in-place overwrites. But it'd be great if we can use this push to also land "content immutabilty" or however we decide to call it.
Re: [PATCH 1/3] fs, xfs: introduce S_IOMAP_IMMUTABLE
On Sat, Jul 29, 2017, at 03:43 PM, Dan Williams wrote: > An inode with this flag set indicates that the file's block map cannot > be changed, no size change, deletion, hole-punch, range collapse, or > reflink. > > The implementation of toggling the flag and sealing the state of the > extent map is saved for a later patch. The functionality provided by > S_IOMAP_IMMUTABLE, once toggle support is added, will be a superset of > that provided by S_SWAPFILE, and it is targeted to replace it. > > For now, only xfs and the core vfs are updated to consider the new flag. Quite a while ago I started a request for O_OBJECT: http://www.spinics.net/lists/linux-fsdevel/msg75085.html A few months ago I was thinking about that more and realized it'd likely be more palatable to land as an inode flag, like you're doing here. Now, S_IOMAP_IMMUTABLE would be quite close to the semantics I want for ostree, except we also want to disallow changes to the inode uid, gid or mode. (Extended attributes are a whole other story; but I'd like to at least disallow changes to the security. namespace). The goal here is mostly about resilience to *accidental* changes; think an admin doing `cp /path/to/binary /usr/bin/bash` which does open(O_TRUNC), which would hence corrupt all hardlinks. S_IOMAP_IMMUTABLE would give a lot of great protection against those types of accidental changes - most of them are either going to be open(O_TRUNC) or O_APPEND. Since you're touching various write paths here, perhaps we can also add S_CONTENTS_IMMUTABLE or something at the same time? If this lands as is - I'm quite likely to change ostree to use it; any objections to that? As mentioned in the thread, there are several other cases of "content immutable" files in userspace, such as QEMU "qcow2", git objects. And really the most classic example is /etc/sudoers and the need for a special "visudo" program to really ensure that editors don't do in-place overwrites. But it'd be great if we can use this push to also land "content immutabilty" or however we decide to call it.
Re: [RFC PATCH] KEYS: Allow a live daemon in a namespace to service request_key upcalls
On Wed, May 31, 2017, at 10:47 AM, David Howells wrote: > > So if the mount-in-container is mostly init containers, and for init > > containers you have *all* namespaces usually, does it make > > sense to have the capability to match namespaces for key services? > > Yes. > > If I don't provide it, someone will complain that I haven't provided it. I don't think it's as clear cut as that. I'd agree that where possible, it makes sense to be general since it's hard to envision every use case, but in this particualr case there are risks around security and clear maintenance issues. Providing a feature without *known* users in a security-sensitive context seems to me to be something to think twice about. > > Something that worries me a lot here is for e.g. Docker today, the default > > is uid 0 but not CAP_SYS_ADMIN. We don't want a container that I run > > with --host=net to be able to read the "host" keyrings, even if it shares > > the host network namespace. > > This is a separate issue. Kind of - what I'm getting at is that today, changing any of the kernel upcalls is a fully privileged operation. Most container userspace tools mount /proc read-only to ensure that even uid 0 containers can't change it by default. (Wait, is /sbin/request-key really hardcoded in the kernel? I guess that's good from the perspective of not having containers be able to change it like they could /proc/sys/kernel/core_pattern if it was writable, but it seems surprising) Anyways, if we now expose a method to configure this, we have to look at the increase in attack surface. In practice again, most container implementations I'm aware of use seccomp today to simply block off access to the keyring. As I mentioned Docker does it, so does flatpak: https://github.com/flatpak/flatpak/blob/ea7077fcd431fb98fe85cd815cbd2ec13df58d09/common/flatpak-run.c#L4007 and Chrome: https://cs.chromium.org/chromium/src/sandbox/linux/seccomp-bpf-helpers/syscall_sets.cc?q=keyctl=C=791 But I'm a bit uncertain about *relying* on the seccomp filtering. Particularly because we do want the "init container" approach to work and be able to use the kernel keyring, but also not be able to affect other containers or the host. > Keys may be relevant in different namespaces, which makes namespacing them > hard to achieve. For instance, dns-, idmapper- and rxrpc-type keys should > probably be differentiated by network namespace. > > However, it might be worth creating a keyrings namespace. Hm, yes - I think I was conflating CLONE_NEWUSER with uid 0's keyring but that's a distinct thing from the kernel keyrings. > > Basically my instinct here is to be conservative and have KEYCTL_SERVICE_ADD > > require CAP_SYS_ADMIN and only affect the userns keyring. > > "Affect" in what sense? Operate on at all - view/read/write/search etc. At a high level I'm glad you're looking at the "service fd" model instead of upcalls - I do think it'll get us to a better place. The main thing I'm getting at first though is making sure we're not introducing new security issues, and that the new proposed API makes sense for some of the important userspace use cases.
Re: [RFC PATCH] KEYS: Allow a live daemon in a namespace to service request_key upcalls
On Wed, May 31, 2017, at 10:47 AM, David Howells wrote: > > So if the mount-in-container is mostly init containers, and for init > > containers you have *all* namespaces usually, does it make > > sense to have the capability to match namespaces for key services? > > Yes. > > If I don't provide it, someone will complain that I haven't provided it. I don't think it's as clear cut as that. I'd agree that where possible, it makes sense to be general since it's hard to envision every use case, but in this particualr case there are risks around security and clear maintenance issues. Providing a feature without *known* users in a security-sensitive context seems to me to be something to think twice about. > > Something that worries me a lot here is for e.g. Docker today, the default > > is uid 0 but not CAP_SYS_ADMIN. We don't want a container that I run > > with --host=net to be able to read the "host" keyrings, even if it shares > > the host network namespace. > > This is a separate issue. Kind of - what I'm getting at is that today, changing any of the kernel upcalls is a fully privileged operation. Most container userspace tools mount /proc read-only to ensure that even uid 0 containers can't change it by default. (Wait, is /sbin/request-key really hardcoded in the kernel? I guess that's good from the perspective of not having containers be able to change it like they could /proc/sys/kernel/core_pattern if it was writable, but it seems surprising) Anyways, if we now expose a method to configure this, we have to look at the increase in attack surface. In practice again, most container implementations I'm aware of use seccomp today to simply block off access to the keyring. As I mentioned Docker does it, so does flatpak: https://github.com/flatpak/flatpak/blob/ea7077fcd431fb98fe85cd815cbd2ec13df58d09/common/flatpak-run.c#L4007 and Chrome: https://cs.chromium.org/chromium/src/sandbox/linux/seccomp-bpf-helpers/syscall_sets.cc?q=keyctl=C=791 But I'm a bit uncertain about *relying* on the seccomp filtering. Particularly because we do want the "init container" approach to work and be able to use the kernel keyring, but also not be able to affect other containers or the host. > Keys may be relevant in different namespaces, which makes namespacing them > hard to achieve. For instance, dns-, idmapper- and rxrpc-type keys should > probably be differentiated by network namespace. > > However, it might be worth creating a keyrings namespace. Hm, yes - I think I was conflating CLONE_NEWUSER with uid 0's keyring but that's a distinct thing from the kernel keyrings. > > Basically my instinct here is to be conservative and have KEYCTL_SERVICE_ADD > > require CAP_SYS_ADMIN and only affect the userns keyring. > > "Affect" in what sense? Operate on at all - view/read/write/search etc. At a high level I'm glad you're looking at the "service fd" model instead of upcalls - I do think it'll get us to a better place. The main thing I'm getting at first though is making sure we're not introducing new security issues, and that the new proposed API makes sense for some of the important userspace use cases.
Re: [RFC PATCH] KEYS: Allow a live daemon in a namespace to service request_key upcalls
On Tue, May 30, 2017, at 12:08 PM, David Howells wrote: > > KEY_SERVICE_NS_UTS > KEY_SERVICE_NS_IPC > KEY_SERVICE_NS_MNT > KEY_SERVICE_NS_PID > KEY_SERVICE_NS_NET > KEY_SERVICE_NS_CGROUP Any reasons not to reuse the CLONE_ flags? > which select those namespaces of the caller that must match for the daemon > to be given the request. So, for example, a daemon that wanted to service > DNS requests from the kernel would do something like: > > keyctl(KEYCTL_SERVICE_ADD, fd, "dns_resolver", KEY_SERVICE_NS_NET); At least to me, it's not clear what the use cases really are. Do we expect people to e.g. set up NFS mounts that require keys/DNS inside "a container" (and if in a container, with what namespaces?) My offhand feeling is that most of the mounting-in-container case is mostly "init container" where you migrate a VM -> container and run sshd etc. in the container. There's a whole thread on extending what filesystems can be mounted with a userns CAP_SYS_ADMIN but personally I doubt that's going to really happen given how unrealistic it is for the kernel to parse potentially hostile filesystems. So if the mount-in-container is mostly init containers, and for init containers you have *all* namespaces usually, does it make sense to have the capability to match namespaces for key services? Something that worries me a lot here is for e.g. Docker today, the default is uid 0 but not CAP_SYS_ADMIN. We don't want a container that I run with --host=net to be able to read the "host" keyrings, even if it shares the host network namespace. Today for Docker the default seccomp policy prevents access to keyctl() at all because it's only with user namespaces that the kerying is namespaced. Basically my instinct here is to be conservative and have KEYCTL_SERVICE_ADD require CAP_SYS_ADMIN and only affect the userns keyring.
Re: [RFC PATCH] KEYS: Allow a live daemon in a namespace to service request_key upcalls
On Tue, May 30, 2017, at 12:08 PM, David Howells wrote: > > KEY_SERVICE_NS_UTS > KEY_SERVICE_NS_IPC > KEY_SERVICE_NS_MNT > KEY_SERVICE_NS_PID > KEY_SERVICE_NS_NET > KEY_SERVICE_NS_CGROUP Any reasons not to reuse the CLONE_ flags? > which select those namespaces of the caller that must match for the daemon > to be given the request. So, for example, a daemon that wanted to service > DNS requests from the kernel would do something like: > > keyctl(KEYCTL_SERVICE_ADD, fd, "dns_resolver", KEY_SERVICE_NS_NET); At least to me, it's not clear what the use cases really are. Do we expect people to e.g. set up NFS mounts that require keys/DNS inside "a container" (and if in a container, with what namespaces?) My offhand feeling is that most of the mounting-in-container case is mostly "init container" where you migrate a VM -> container and run sshd etc. in the container. There's a whole thread on extending what filesystems can be mounted with a userns CAP_SYS_ADMIN but personally I doubt that's going to really happen given how unrealistic it is for the kernel to parse potentially hostile filesystems. So if the mount-in-container is mostly init containers, and for init containers you have *all* namespaces usually, does it make sense to have the capability to match namespaces for key services? Something that worries me a lot here is for e.g. Docker today, the default is uid 0 but not CAP_SYS_ADMIN. We don't want a container that I run with --host=net to be able to read the "host" keyrings, even if it shares the host network namespace. Today for Docker the default seccomp policy prevents access to keyctl() at all because it's only with user namespaces that the kerying is namespaced. Basically my instinct here is to be conservative and have KEYCTL_SERVICE_ADD require CAP_SYS_ADMIN and only affect the userns keyring.
Re: [RFC][PATCH 0/9] Make containers kernel objects
On Tue, May 23, 2017, at 11:31 AM, Jeff Layton wrote: > > nfsdcltrack was originally nfsdcld, a long running daemon that used > rpc_pipefs to talk to the kernel. That meant that you had to make sure > it gets enabled by systemd (or sysvinit, etc). If it dies, then you also > want to ensure that it gets restarted lest the kernel server hang, > etc... > > It was pretty universally hated, as it was just one more daemon that you > needed to run to work a proper nfs server. So, I was encouraged to > switch it to a call_usermodehelper upcall and since then it has just > worked, as long as the binary is installed. Note that with the "read()/write() fd" model you don't need a whole process just to do that...the functionality could be rolled into systemd or equivalent easily enough. > "You're doing it wrong. You just need to run all of these services as > long-running daemons." Also, I imagine we could figure out a clean model to do *activation* from kernel -> userspace too. systemd's socket activation model where pid 1 activates units on demand is quite nice and obviates the need to configure things on in advance.
Re: [RFC][PATCH 0/9] Make containers kernel objects
On Tue, May 23, 2017, at 11:31 AM, Jeff Layton wrote: > > nfsdcltrack was originally nfsdcld, a long running daemon that used > rpc_pipefs to talk to the kernel. That meant that you had to make sure > it gets enabled by systemd (or sysvinit, etc). If it dies, then you also > want to ensure that it gets restarted lest the kernel server hang, > etc... > > It was pretty universally hated, as it was just one more daemon that you > needed to run to work a proper nfs server. So, I was encouraged to > switch it to a call_usermodehelper upcall and since then it has just > worked, as long as the binary is installed. Note that with the "read()/write() fd" model you don't need a whole process just to do that...the functionality could be rolled into systemd or equivalent easily enough. > "You're doing it wrong. You just need to run all of these services as > long-running daemons." Also, I imagine we could figure out a clean model to do *activation* from kernel -> userspace too. systemd's socket activation model where pid 1 activates units on demand is quite nice and obviates the need to configure things on in advance.
Re: [RFC][PATCH 0/9] Make containers kernel objects
On Tue, May 23, 2017, at 10:30 AM, Djalal Harouni wrote: > > Maybe it depends on the cases, a general approach can be too difficult > to handle especially from the security point. Maybe it is better to > identify what operations need what context, and a userspace > service/proxy can act using kthreadd with the right context... at > least the shift to this model has been done for years now in the > mobile industry. Why not drop the upcall model in favor of having userspace monitor events via a (more efficient) protocol and react to them on its own? It's just generally more flexible and avoids all of those issues like replicating the seccomp configuration, etc. Something like inotify/signalfd could be a precedent around having a read()/poll()able fd. /proc/keys-requests ? Then if you create a new user namespace, and open /proc/keys-requests, the kernel will always write to that instead of calling /sbin/request-key.
Re: [RFC][PATCH 0/9] Make containers kernel objects
On Tue, May 23, 2017, at 10:30 AM, Djalal Harouni wrote: > > Maybe it depends on the cases, a general approach can be too difficult > to handle especially from the security point. Maybe it is better to > identify what operations need what context, and a userspace > service/proxy can act using kthreadd with the right context... at > least the shift to this model has been done for years now in the > mobile industry. Why not drop the upcall model in favor of having userspace monitor events via a (more efficient) protocol and react to them on its own? It's just generally more flexible and avoids all of those issues like replicating the seccomp configuration, etc. Something like inotify/signalfd could be a precedent around having a read()/poll()able fd. /proc/keys-requests ? Then if you create a new user namespace, and open /proc/keys-requests, the kernel will always write to that instead of calling /sbin/request-key.
Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
On Wed, May 10, 2017, at 12:18 AM, Ian Kent wrote: > The fstatat(2) and statx() calls can pass the flag AT_NO_AUTOMOUNT > which is meant to clear the LOOKUP_AUTOMOUNT flag and prevent triggering > of an automount by the call. But this flag is unconditionally cleared > for all stat family system calls except statx(). > Here's the GLib patch to make use of this: https://bugzilla.gnome.org/show_bug.cgi?id=782554
Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
On Wed, May 10, 2017, at 12:18 AM, Ian Kent wrote: > The fstatat(2) and statx() calls can pass the flag AT_NO_AUTOMOUNT > which is meant to clear the LOOKUP_AUTOMOUNT flag and prevent triggering > of an automount by the call. But this flag is unconditionally cleared > for all stat family system calls except statx(). > Here's the GLib patch to make use of this: https://bugzilla.gnome.org/show_bug.cgi?id=782554
Re: [PATCH 1/2] vfs: implement fchmodat2() syscall
On Tue, Apr 11, 2017, at 02:07 PM, Eric Blake wrote: > > A good idea on the surface. But reading the man page of openat(), the > section on O_PATH says: >The file > itself is not opened, and other file operations (e.g., > read(2), > write(2), fchmod(2), fchown(2), fgetxattr(2), mmap(2)) > fail with > the error EBADF. Right, though more topically I'd have expected fchmodat() (not fchmod()) to take AT_EMPTY_PATH, just like fstatat() does. But it doesn't appear to be supported...oh, even at the syscall level, interesting. Ah, I see, glibc does: int fchmodat (int fd, const char *file, mode_t mode, int flag) { if (flag & ~AT_SYMLINK_NOFOLLOW) return INLINE_SYSCALL_ERROR_RETURN_VALUE (EINVAL); ... } And indeed the syscall doesn't have flags, bringing us back to the start here. Sorry, that seems obvious in retrospect, but I was "working forwards" from the O_PATH userspace API mindset.
Re: [PATCH 1/2] vfs: implement fchmodat2() syscall
On Tue, Apr 11, 2017, at 02:07 PM, Eric Blake wrote: > > A good idea on the surface. But reading the man page of openat(), the > section on O_PATH says: >The file > itself is not opened, and other file operations (e.g., > read(2), > write(2), fchmod(2), fchown(2), fgetxattr(2), mmap(2)) > fail with > the error EBADF. Right, though more topically I'd have expected fchmodat() (not fchmod()) to take AT_EMPTY_PATH, just like fstatat() does. But it doesn't appear to be supported...oh, even at the syscall level, interesting. Ah, I see, glibc does: int fchmodat (int fd, const char *file, mode_t mode, int flag) { if (flag & ~AT_SYMLINK_NOFOLLOW) return INLINE_SYSCALL_ERROR_RETURN_VALUE (EINVAL); ... } And indeed the syscall doesn't have flags, bringing us back to the start here. Sorry, that seems obvious in retrospect, but I was "working forwards" from the O_PATH userspace API mindset.
Re: [PATCH 1/2] vfs: implement fchmodat2() syscall
On Tue, Feb 28, 2017, at 02:23 PM, Eric Blake wrote: > Might also be worth mentioning that this patch is required in order to > solve CVE-2016-9602, per discussion at > https://lists.gnu.org/archive/html/qemu-devel/2017-02/msg06089.html I only briefly looked at this, but can't `open(..., O_PATH)` be used to solve this today?
Re: [PATCH 1/2] vfs: implement fchmodat2() syscall
On Tue, Feb 28, 2017, at 02:23 PM, Eric Blake wrote: > Might also be worth mentioning that this patch is required in order to > solve CVE-2016-9602, per discussion at > https://lists.gnu.org/archive/html/qemu-devel/2017-02/msg06089.html I only briefly looked at this, but can't `open(..., O_PATH)` be used to solve this today?
Re: [RFC] [PATCH] Add a "nolinks" mount option.
On Wed, Oct 19, 2016, at 07:28 AM, Mattias Nissler wrote: > > Note that O_NOFOLLOW only affects the final path component. If there's > a symlink in any of the parent directories, that'll still be traversed > even with O_NOFOLLOW. This situation is less risky as an attacker will > have to deal with the restriction of a fixed filename in the last > component, but might still be exploitable. Yeah, I meant that you'd walk the path string in userspace one by one. That said the "fstat at the end and check device" seems a lot better, or perhaps the mount namespaces could help. Also, don't forget about `setfsuid()`. > The difficulty lies in applying these measures of precaution > system-wide. This affects most init scripts and daemons, and > everything else that keeps state on the writable file system. One thing to note is that at least in the freedesktop.org/GNOME etc. side of things, we basically never have privileged processes accessing user home directories anymore. A good example is that GDM used to read ~username/.config/face.png or something like that to show the user's picture on the login screen, and that was subject to many of the same risks. But we've basically across the board migrated to a model where the unprivileged user session talks to privileged daemons via a DBus (or other) API. In this case, the picture data is stored in accountsservice. NetworkManager is another big example of this, where e.g. WiFi credentials can be per user, and the session passes them to the privileged daemon over DBus, rather than having the privileged process try to parse config files in the user's homedir. It's a lot easier to secure.
Re: [RFC] [PATCH] Add a "nolinks" mount option.
On Wed, Oct 19, 2016, at 07:28 AM, Mattias Nissler wrote: > > Note that O_NOFOLLOW only affects the final path component. If there's > a symlink in any of the parent directories, that'll still be traversed > even with O_NOFOLLOW. This situation is less risky as an attacker will > have to deal with the restriction of a fixed filename in the last > component, but might still be exploitable. Yeah, I meant that you'd walk the path string in userspace one by one. That said the "fstat at the end and check device" seems a lot better, or perhaps the mount namespaces could help. Also, don't forget about `setfsuid()`. > The difficulty lies in applying these measures of precaution > system-wide. This affects most init scripts and daemons, and > everything else that keeps state on the writable file system. One thing to note is that at least in the freedesktop.org/GNOME etc. side of things, we basically never have privileged processes accessing user home directories anymore. A good example is that GDM used to read ~username/.config/face.png or something like that to show the user's picture on the login screen, and that was subject to many of the same risks. But we've basically across the board migrated to a model where the unprivileged user session talks to privileged daemons via a DBus (or other) API. In this case, the picture data is stored in accountsservice. NetworkManager is another big example of this, where e.g. WiFi credentials can be per user, and the session passes them to the privileged daemon over DBus, rather than having the privileged process try to parse config files in the user's homedir. It's a lot easier to secure.
Re: [RFC] [PATCH] Add a "nolinks" mount option.
On Mon, Oct 17, 2016, at 09:02 AM, Mattias Nissler wrote: > OK, no more feedback thus far. Is there generally any interest in a > mount option to avoid path name aliasing resulting in target file > confusion? Perhaps a version that only disables symlinks instead of > also hard-disabling files hard-linked to multiple locations (those are > much lower risk for the situation I care about)? So the situation here is a (privileged) process that is trying to read/write to a filesystem tree writable by other processes that are in a separate security domain? That's a classic situation that requires extreme care, and I am doubtful that symlinks are the only issue you're facing. For example, if this process is also *parsing* any data there, there's another whole source of risk. I suspect for you it wouldn't be too hard to have a "follow untrusted path" helper function, it's possible to implement in userspace safely with O_NOFOLLOW etc. Regardless too, it sounds like what you want more is a "same filesystem" traversal (stat and compare devices). Or does it even need to handle full traversal? Would it have mitigated the security issue to fstat() any files you opened and verified they were from the writable partition you expected?
Re: [RFC] [PATCH] Add a "nolinks" mount option.
On Mon, Oct 17, 2016, at 09:02 AM, Mattias Nissler wrote: > OK, no more feedback thus far. Is there generally any interest in a > mount option to avoid path name aliasing resulting in target file > confusion? Perhaps a version that only disables symlinks instead of > also hard-disabling files hard-linked to multiple locations (those are > much lower risk for the situation I care about)? So the situation here is a (privileged) process that is trying to read/write to a filesystem tree writable by other processes that are in a separate security domain? That's a classic situation that requires extreme care, and I am doubtful that symlinks are the only issue you're facing. For example, if this process is also *parsing* any data there, there's another whole source of risk. I suspect for you it wouldn't be too hard to have a "follow untrusted path" helper function, it's possible to implement in userspace safely with O_NOFOLLOW etc. Regardless too, it sounds like what you want more is a "same filesystem" traversal (stat and compare devices). Or does it even need to handle full traversal? Would it have mitigated the security issue to fstat() any files you opened and verified they were from the writable partition you expected?
Re: [PATCH v2 00/10] userns: sysctl limits for namespaces
On Thu, Jul 21, 2016, at 12:39 PM, Eric W. Biederman wrote: > > This patchset addresses two use cases: > - Implement a sane upper bound on the number of namespaces. > - Provide a way for sandboxes to limit the attack surface from > namespaces. Perhaps this is obvious, but since you didn't quite explicitly state it; do you see this as obsoleting the existing downstream patches mentioned in: https://lwn.net/Articles/673597/ It seems conceptually similar to Kees' original approach, right? The high level makes sense to me...most interesting is per-userns sysctls. I'll note most current container managers mount /proc/sys read-only, and Docker specifically drops CAP_SYS_RESOURCE by default, so they'd likely need to learn how to undo that if one wanted to support recursive container usage. We'd probably need to evaluate the safety of having /proc/sys writable generally. (Also it's rather common to filter out CLONE_NEWUSER via seccomp, but that's easy to undo) But that's the flip side - if we're aiming primarily for an upstreamable way to *limit* namespace usage, it seems sane to me.
Re: [PATCH v2 00/10] userns: sysctl limits for namespaces
On Thu, Jul 21, 2016, at 12:39 PM, Eric W. Biederman wrote: > > This patchset addresses two use cases: > - Implement a sane upper bound on the number of namespaces. > - Provide a way for sandboxes to limit the attack surface from > namespaces. Perhaps this is obvious, but since you didn't quite explicitly state it; do you see this as obsoleting the existing downstream patches mentioned in: https://lwn.net/Articles/673597/ It seems conceptually similar to Kees' original approach, right? The high level makes sense to me...most interesting is per-userns sysctls. I'll note most current container managers mount /proc/sys read-only, and Docker specifically drops CAP_SYS_RESOURCE by default, so they'd likely need to learn how to undo that if one wanted to support recursive container usage. We'd probably need to evaluate the safety of having /proc/sys writable generally. (Also it's rather common to filter out CLONE_NEWUSER via seccomp, but that's easy to undo) But that's the flip side - if we're aiming primarily for an upstreamable way to *limit* namespace usage, it seems sane to me.
Re: [PATCH v2 0/2] vfs: Define new syscall getumask.
On Wed, Apr 13, 2016, at 08:57 AM, Richard W.M. Jones wrote: > It's not possible to read the process umask without also modifying it, > which is what umask(2) does. A library cannot read umask safely, > especially if the main program might be multithreaded. I assume you just want to do this from a shared library so you can determine whether or not you need to call fchown() after making files and the like? If that's the case it'd be good to note it in the commit message. BTW...it might be a good idea to add a flags argument: https://lwn.net/Articles/585415/ Did you consider calling this `umask2`, having the initial version only support retrieving it via a UMASK_GET flag, and lay the groundwork to support setting a threadsafe umask with a UMASK_SET_THREAD flag?
Re: [PATCH v2 0/2] vfs: Define new syscall getumask.
On Wed, Apr 13, 2016, at 08:57 AM, Richard W.M. Jones wrote: > It's not possible to read the process umask without also modifying it, > which is what umask(2) does. A library cannot read umask safely, > especially if the main program might be multithreaded. I assume you just want to do this from a shared library so you can determine whether or not you need to call fchown() after making files and the like? If that's the case it'd be good to note it in the commit message. BTW...it might be a good idea to add a flags argument: https://lwn.net/Articles/585415/ Did you consider calling this `umask2`, having the initial version only support retrieving it via a UMASK_GET flag, and lay the groundwork to support setting a threadsafe umask with a UMASK_SET_THREAD flag?
Re: Thoughts on tightening up user namespace creation
On Wed, Mar 9, 2016, at 01:14 PM, Kees Cook wrote: > On Mon, Mar 7, 2016 at 9:15 PM, Andy Lutomirskiwrote: > > Hi all- > > > > There are several users and distros that are nervous about user > > namespaces from an attack surface point of view. > > > > - RHEL and Arch have userns disabled. > > > > - Ubuntu requires CAP_SYS_ADMIN > > > > - Kees periodically proposes to upstream some sysctl to control > > userns creation. > > And here's another ring0 escalation flaw, made available to > unprivileged users because of userns: > > https://code.google.com/p/google-security-research/issues/detail?id=758 Looks like Andy won't have to eat his hat ;) > The change in attack surface is _substantial_. We must have a way to > globally disable userns. No one would object if it was enabled but only accessible to CAP_SYS_ADMIN though, right? This could be useful for writing setuid binaries that expose some of the features, but e.g. not CAP_NET_ADMIN. Andy's suggestion of having this be a per-namespace setting makes sense to me. Currently some container tools that do use userns are by default denying it to be recursive (Sandstorm.io and Docker 1.10 at least) by using a seccomp filter on clone(). If we had this setting that filter wouldn't be necessary, and would solve the issue that seccomp filters aren't robust against the kernel adding new API, e.g. a new CLONE_NEWUSER_NONEWPRIVS which might enable chroot() but not CAP_NET_ADMIN.
Re: Thoughts on tightening up user namespace creation
On Wed, Mar 9, 2016, at 01:14 PM, Kees Cook wrote: > On Mon, Mar 7, 2016 at 9:15 PM, Andy Lutomirski wrote: > > Hi all- > > > > There are several users and distros that are nervous about user > > namespaces from an attack surface point of view. > > > > - RHEL and Arch have userns disabled. > > > > - Ubuntu requires CAP_SYS_ADMIN > > > > - Kees periodically proposes to upstream some sysctl to control > > userns creation. > > And here's another ring0 escalation flaw, made available to > unprivileged users because of userns: > > https://code.google.com/p/google-security-research/issues/detail?id=758 Looks like Andy won't have to eat his hat ;) > The change in attack surface is _substantial_. We must have a way to > globally disable userns. No one would object if it was enabled but only accessible to CAP_SYS_ADMIN though, right? This could be useful for writing setuid binaries that expose some of the features, but e.g. not CAP_NET_ADMIN. Andy's suggestion of having this be a per-namespace setting makes sense to me. Currently some container tools that do use userns are by default denying it to be recursive (Sandstorm.io and Docker 1.10 at least) by using a seccomp filter on clone(). If we had this setting that filter wouldn't be necessary, and would solve the issue that seccomp filters aren't robust against the kernel adding new API, e.g. a new CLONE_NEWUSER_NONEWPRIVS which might enable chroot() but not CAP_NET_ADMIN.
Re: [PATCH v3 0/7] User namespace mount updates
On Thu, Nov 19, 2015, at 02:53 AM, Richard Weinberger wrote: > Erm, I don't want this in the kernel. That's why I've proposed the lklfuse > approach. I already said this before but just to repeat, since I'm confused: How would "lklfuse" be different from http://libguestfs.org/ which we at Red Hat (and a number of other organizations) use quite widely now for build systems, debugging etc. In the end it's just running the kernel in KVM with a custom protocol, with support for non-filesystem things like "install a bootloader", and it already supports FUSE. I'm pretty firmly with Al here - the attack surface increase here is too great, and we'd likely turn this off if it even did make it into the kernel. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 0/7] User namespace mount updates
On Thu, Nov 19, 2015, at 02:53 AM, Richard Weinberger wrote: > Erm, I don't want this in the kernel. That's why I've proposed the lklfuse > approach. I already said this before but just to repeat, since I'm confused: How would "lklfuse" be different from http://libguestfs.org/ which we at Red Hat (and a number of other organizations) use quite widely now for build systems, debugging etc. In the end it's just running the kernel in KVM with a custom protocol, with support for non-filesystem things like "install a bootloader", and it already supports FUSE. I'm pretty firmly with Al here - the attack surface increase here is too great, and we'd likely turn this off if it even did make it into the kernel. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/7] Initial support for user namespace owned mounts
It's worth noting here that I think a lot of the use cases for unprivileged mounts are testing/development type things, and these are pretty well covered by: http://libguestfs.org/ Basically it just runs the host kernel in a VM, and the userspace is a minimal agent that you can talk to over virtio. You can use the API, or `guestmount` exposes it via FUSE. It doesn't magically make the kernel filesystems robust against untrusted input, but in the case of compromise, it's an "unprivileged" VM. I've used it for several projects and been quite happy. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/7] Initial support for user namespace owned mounts
It's worth noting here that I think a lot of the use cases for unprivileged mounts are testing/development type things, and these are pretty well covered by: http://libguestfs.org/ Basically it just runs the host kernel in a VM, and the userspace is a minimal agent that you can talk to over virtio. You can use the API, or `guestmount` exposes it via FUSE. It doesn't magically make the kernel filesystems robust against untrusted input, but in the case of compromise, it's an unprivileged VM. I've used it for several projects and been quite happy. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/7] Initial support for user namespace owned mounts
On Thu, Jul 16, 2015, at 12:47 AM, Eric W. Biederman wrote: > With that said desktop environments have for a long time been > automatically mounting whichever filesystem you place in your computer, > so in practice what this is really about is trying to align the kernel > with how people use filesystems. There is a large attack surface difference between mounting a device that someone physically plugged into the computer (and note typically it's required that the active console be unlocked as well[1]) versus allowing any "unprivileged" process at any time to do it. Many server setups use "unprivileged" uids that otherwise wouldn't be able to exploit bugs in filesystem code. [1] https://bugzilla.gnome.org/show_bug.cgi?id=653520 "AutomountManager also keeps track of the current session availability (using the ConsoleKit and gnome-screensaver DBus interfaces) and inhibits mounting if the current session is locked, or another session is in use instead." -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/7] Initial support for user namespace owned mounts
On Thu, Jul 16, 2015, at 12:47 AM, Eric W. Biederman wrote: With that said desktop environments have for a long time been automatically mounting whichever filesystem you place in your computer, so in practice what this is really about is trying to align the kernel with how people use filesystems. There is a large attack surface difference between mounting a device that someone physically plugged into the computer (and note typically it's required that the active console be unlocked as well[1]) versus allowing any unprivileged process at any time to do it. Many server setups use unprivileged uids that otherwise wouldn't be able to exploit bugs in filesystem code. [1] https://bugzilla.gnome.org/show_bug.cgi?id=653520 AutomountManager also keeps track of the current session availability (using the ConsoleKit and gnome-screensaver DBus interfaces) and inhibits mounting if the current session is locked, or another session is in use instead. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] overlayfs: Warn on copy up if a process has a R/O fd open to the lower file
On Thu, May 28, 2015, at 02:46 PM, David Howells wrote: > Print a warning when overlayfs copies up a file if the process that triggered > the copy up has a R/O fd open to the lower file being copied up. Why not an option to make it an error? If one's goal is to use overlayfs without apps potentially corrupting data, better to fail fast. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] overlayfs: Warn on copy up if a process has a R/O fd open to the lower file
On Thu, May 28, 2015, at 02:46 PM, David Howells wrote: Print a warning when overlayfs copies up a file if the process that triggered the copy up has a R/O fd open to the lower file being copied up. Why not an option to make it an error? If one's goal is to use overlayfs without apps potentially corrupting data, better to fail fast. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/6] File Sealing & memfd_create()
On Thu, Apr 10, 2014 at 3:15 PM, Andy Lutomirski wrote: COW links can do this already, I think. Of course, you'll have to use a filesystem that supports them. COW is nice if the filesystem supports them, but my userspace code needs to be filesystem agnostic. Because of that, the design for userspace simply doesn't allow arbitrary writes. Instead, I have to painfully audit every rpm %post/dpkg postinst type script to ensure they break hardlinks, and furthermore only allow executing scripts that are known to do so. But I think even in a btrfs world it'd still be useful to mark files as content-immutable. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/6] File Sealing & memfd_create()
On Thu, Mar 20, 2014 at 11:32 AM, ty...@mit.edu wrote: Looking at your patches, and what files you are modifying, you are enforcing this in the low-level file system. I would love for this to be implemented in the filesystem level as well. Something like the ext4 immutable bit, but with the ability to still make hardlinks would be *very* useful for OSTree. And anyone else that uses hardlinks as a data source. The vserver people do something similiar: http://linux-vserver.org/util-vserver:Vhashify At the moment I have a read-only bind mount over /usr, but what I really want is to make the individual objects in the object store in /ostree/repo/objects be immutable, so even if a user or app navigates out to /sysroot they still can't mutate them (or the link targets in the visible /usr). -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/6] File Sealing memfd_create()
On Thu, Mar 20, 2014 at 11:32 AM, ty...@mit.edu wrote: Looking at your patches, and what files you are modifying, you are enforcing this in the low-level file system. I would love for this to be implemented in the filesystem level as well. Something like the ext4 immutable bit, but with the ability to still make hardlinks would be *very* useful for OSTree. And anyone else that uses hardlinks as a data source. The vserver people do something similiar: http://linux-vserver.org/util-vserver:Vhashify At the moment I have a read-only bind mount over /usr, but what I really want is to make the individual objects in the object store in /ostree/repo/objects be immutable, so even if a user or app navigates out to /sysroot they still can't mutate them (or the link targets in the visible /usr). -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/6] File Sealing memfd_create()
On Thu, Apr 10, 2014 at 3:15 PM, Andy Lutomirski l...@amacapital.net wrote: COW links can do this already, I think. Of course, you'll have to use a filesystem that supports them. COW is nice if the filesystem supports them, but my userspace code needs to be filesystem agnostic. Because of that, the design for userspace simply doesn't allow arbitrary writes. Instead, I have to painfully audit every rpm %post/dpkg postinst type script to ensure they break hardlinks, and furthermore only allow executing scripts that are known to do so. But I think even in a btrfs world it'd still be useful to mark files as content-immutable. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Need for llistxattrat()/lgetxattrat()/lsetxattrat()
Hi, So at the moment the kernel has pretty comprehensive support for the *at() variants of file API calls, and at one point I was rewriting my app to be a modern Unix citizen and use the *at variants, but I ran into one admittedly obscure corner case, which is the lack of API calls to manipulate the extended attributes of symbolic links relative to a dirfd. Correctly handling xattrs (in particular, SELinux labels) is a key part of my application, so for now I'll just live with pathnames and their drawbacks, but: Is this intentional or an oversight? If the latter, should I try a patch? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Need for llistxattrat()/lgetxattrat()/lsetxattrat()
Hi, So at the moment the kernel has pretty comprehensive support for the *at() variants of file API calls, and at one point I was rewriting my app to be a modern Unix citizen and use the *at variants, but I ran into one admittedly obscure corner case, which is the lack of API calls to manipulate the extended attributes of symbolic links relative to a dirfd. Correctly handling xattrs (in particular, SELinux labels) is a key part of my application, so for now I'll just live with pathnames and their drawbacks, but: Is this intentional or an oversight? If the latter, should I try a patch? -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] teach argv_split() to ignore the spaces surrounded by \e
On Mon, 2013-05-13 at 16:35 +0200, Oleg Nesterov wrote: > Yes, we can change format_corename() to construct "argv" by hand, and > this was my iniital plan. But perhaps it would be better to not uglify > this code even more? Sure this \e is less code, but it seems pretty ugly to me. Maybe a way to keep fs/coredump.c sane would be always constructing an argv, and then in the !ispipe case just join them into one string. Though I'm still inclined to change systemd to read /proc/pid/cmdline like abrt does; that way it works on current kernels too. For what it's worth I noticed this problem with dconf, which uses g_thread_new ("dconf worker", ...), and g_thread_new uses prctl (PR_SET_NAME). -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] teach argv_split() to ignore the spaces surrounded by \e
On Mon, 2013-05-13 at 16:35 +0200, Oleg Nesterov wrote: Yes, we can change format_corename() to construct argv by hand, and this was my iniital plan. But perhaps it would be better to not uglify this code even more? Sure this \e is less code, but it seems pretty ugly to me. Maybe a way to keep fs/coredump.c sane would be always constructing an argv, and then in the !ispipe case just join them into one string. Though I'm still inclined to change systemd to read /proc/pid/cmdline like abrt does; that way it works on current kernels too. For what it's worth I noticed this problem with dconf, which uses g_thread_new (dconf worker, ...), and g_thread_new uses prctl (PR_SET_NAME). -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [systemd-devel] [PATCH 2/2] coredump: Handle programs with spaces in COMM
On Fri, 2013-05-03 at 17:08 +0200, Lennart Poettering wrote: > It sounds really wrong to first merge this into one string and then > split it up again. It sounds much more sensible to instead just pass the > string array around all the time. What's the reason to make this one > string first? I'm wondering if there are compatibility concerns; abrt wouldn't care from what I can tell if we just changed the kernel. systemd-coredump is just plain broken right now. I'll look for the source to the Ubuntu one... -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [systemd-devel] [PATCH 2/2] coredump: Handle programs with spaces in COMM
On Fri, 2013-05-03 at 17:08 +0200, Lennart Poettering wrote: It sounds really wrong to first merge this into one string and then split it up again. It sounds much more sensible to instead just pass the string array around all the time. What's the reason to make this one string first? I'm wondering if there are compatibility concerns; abrt wouldn't care from what I can tell if we just changed the kernel. systemd-coredump is just plain broken right now. I'll look for the source to the Ubuntu one... -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
linux-user-chroot 2012.2
Hi, This is the release of linux-user-chroot 2012.2. The major change now is that it makes use of Andy's new PR_SET_NO_NEW_PRIVS. This doesn't close any security hole I'm aware of - our previous use of the MS_NOSUID bind mount over / should work - but, belt and suspenders as they say. The code: http://git.gnome.org/browse/linux-user-chroot/commit/?id=515c714471d0b5923f6633ef44a2270b23656ee9 As for how linux-user-chroot and PR_SET_NO_NEW_PRIVS relate, see this thread: http://thread.gmane.org/gmane.linux.kernel.lsm/15339 Summary --- This tool allows regular (non-root) users to call chroot(2), create Linux bind mounts, and use some Linux container features. It's primarily intended for use by build systems. Project information --- There's no web page yet; send patches to Colin Walters -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
linux-user-chroot 2012.2
Hi, This is the release of linux-user-chroot 2012.2. The major change now is that it makes use of Andy's new PR_SET_NO_NEW_PRIVS. This doesn't close any security hole I'm aware of - our previous use of the MS_NOSUID bind mount over / should work - but, belt and suspenders as they say. The code: http://git.gnome.org/browse/linux-user-chroot/commit/?id=515c714471d0b5923f6633ef44a2270b23656ee9 As for how linux-user-chroot and PR_SET_NO_NEW_PRIVS relate, see this thread: http://thread.gmane.org/gmane.linux.kernel.lsm/15339 Summary --- This tool allows regular (non-root) users to call chroot(2), create Linux bind mounts, and use some Linux container features. It's primarily intended for use by build systems. Project information --- There's no web page yet; send patches to Colin Walters walt...@verbum.org -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/