Re: strace of io_uring events?

2020-07-23 Thread Colin Walters
On Tue, Jul 21, 2020, at 11:58 AM, Stefano Garzarella wrote:

> my use case concerns virtualization. The idea, that I described in the
> proposal of io-uring restrictions [1], is to share io_uring CQ and SQ queues
> with a guest VM for block operations.

Virtualization being a strong security barrier is in eternal conflict with 
maximizing performance.
All of these "let's add a special guest/host channel" are high risk areas.

And this effort in particular - is it *really* worth it to expose a brand new, 
fast moving Linux kernel interface (that probably hasn't been fuzzed as much as 
it needs to be) to virtual machines?

People who want maximum performance at the cost of a bit of security already 
have the choice to use Linux containers, where they can use io_uring natively.



Re: kernel BUG at mm/hugetlb.c:LINE!

2020-05-18 Thread Colin Walters



On Tue, May 12, 2020, at 11:04 AM, Miklos Szeredi wrote:

> > However, in this syzbot test case the 'file' is in an overlayfs filesystem
> > created as follows:
> >
> > mkdir("./file0", 000)   = 0
> > mount(NULL, "./file0", "hugetlbfs", MS_MANDLOCK|MS_POSIXACL, NULL) = 0
> > chdir("./file0")= 0
> > mkdir("./file1", 000)   = 0
> > mkdir("./bus", 000) = 0
> > mkdir("./file0", 000)   = 0
> > mount("\177ELF\2\1\1", "./bus", "overlay", 0, 
> > "lowerdir=./bus,workdir=./file1,u"...) = 0

Is there any actual valid use case for mounting an overlayfs on top of 
hugetlbfs?  I can't think of one.  Why isn't the response to this to instead 
only allow mounting overlayfs on top of basically a set of whitelisted 
filesystems?



Re: aio poll, io_pgetevents and a new in-kernel poll API V3

2018-01-18 Thread Colin Walters


On Thu, Jan 18, 2018, at 11:44 AM, Jeff Moyer wrote:
> Jeff Moyer  writes:
> 
> > FYI, this kernel has issues.  It will boot up, but I don't have
> > networking, and even rebooting doesn't succeed.  I'm looking into it.
> 
> A bisect lands on: eventfd: switch to ->poll_mask.  That's not super
> helpful, though.  I did run the ltp eventfd2 tests, and they all pass.

FWIW: 
https://git.gnome.org/browse/glib/commit/?id=3904c8761a60dbadbdfaf98fe23ff19cbdcc4a9a

Since that a lot of userspace (including NetworkManager) uses eventfd. I haven't
tried this patchset myself but I'd look at what the GLib mainloop is doing.


Re: aio poll, io_pgetevents and a new in-kernel poll API V3

2018-01-18 Thread Colin Walters


On Thu, Jan 18, 2018, at 11:44 AM, Jeff Moyer wrote:
> Jeff Moyer  writes:
> 
> > FYI, this kernel has issues.  It will boot up, but I don't have
> > networking, and even rebooting doesn't succeed.  I'm looking into it.
> 
> A bisect lands on: eventfd: switch to ->poll_mask.  That's not super
> helpful, though.  I did run the ltp eventfd2 tests, and they all pass.

FWIW: 
https://git.gnome.org/browse/glib/commit/?id=3904c8761a60dbadbdfaf98fe23ff19cbdcc4a9a

Since that a lot of userspace (including NetworkManager) uses eventfd. I haven't
tried this patchset myself but I'd look at what the GLib mainloop is doing.


Re: [RFC] EPOLL_KILLME: New flag to epoll_wait() that subscribes process to death row (new syscall)

2017-11-01 Thread Colin Walters
On Wed, Nov 1, 2017, at 03:02 PM, Shawn Landden wrote:
> 
> This solves the fact that epoll_pwait() already is a 6 argument (maximum 
> allowed) syscall. But what if the process has multiple epoll() instances in 
> multiple threads? 

Well, that's a subset of the general question of - what is the interaction
of this system call and threading?  It looks like you've prototyped this
out in userspace with systemd, but from a quick glance at the current git,
systemd's threading is limited doing sync()/fsync() and gethostbyname() async.

But languages with a GC tend to at least use a background thread for that,
and of course lots of modern userspace makes heavy use of multithreading
(or variants like goroutines).

A common pattern though is to have a "main thread" that acts as a control
point and runs the mainloop (particularly for anything with a GUI).   That's
going to be the thing calling prctl(SET_IDLE) - but I think its idle state 
should implicitly
affect the whole process, since for a lot of apps those other threads are going 
to
just be "background".

It'd probably then be an error to use prctl(SET_IDLE) in more than one thread
ever?  (Although that might break in golang due to the way goroutines can
be migrated across threads)

That'd probably be a good "generality test" - what would it take to have
this system call be used for a simple golang webserver app that's e.g.
socket activated by systemd, or a Kubernetes service?  Or another
really interesting case would be qemu; make it easy to flag VMs as always
having this state (most of my testing VMs are like this; it's OK if they get
destroyed, I just reinitialize them from the gold state).

Going back to threading - a tricky thing we should handle in general
is when userspace libraries create threads that are unknown to the app;
the "async gethostbyname()" is a good example.  To be conservative we'd
likely need to "fail non-idle", but figure out some way tell the kernel
for e.g. GC threads that they're still idle.


Re: [RFC] EPOLL_KILLME: New flag to epoll_wait() that subscribes process to death row (new syscall)

2017-11-01 Thread Colin Walters
On Wed, Nov 1, 2017, at 03:02 PM, Shawn Landden wrote:
> 
> This solves the fact that epoll_pwait() already is a 6 argument (maximum 
> allowed) syscall. But what if the process has multiple epoll() instances in 
> multiple threads? 

Well, that's a subset of the general question of - what is the interaction
of this system call and threading?  It looks like you've prototyped this
out in userspace with systemd, but from a quick glance at the current git,
systemd's threading is limited doing sync()/fsync() and gethostbyname() async.

But languages with a GC tend to at least use a background thread for that,
and of course lots of modern userspace makes heavy use of multithreading
(or variants like goroutines).

A common pattern though is to have a "main thread" that acts as a control
point and runs the mainloop (particularly for anything with a GUI).   That's
going to be the thing calling prctl(SET_IDLE) - but I think its idle state 
should implicitly
affect the whole process, since for a lot of apps those other threads are going 
to
just be "background".

It'd probably then be an error to use prctl(SET_IDLE) in more than one thread
ever?  (Although that might break in golang due to the way goroutines can
be migrated across threads)

That'd probably be a good "generality test" - what would it take to have
this system call be used for a simple golang webserver app that's e.g.
socket activated by systemd, or a Kubernetes service?  Or another
really interesting case would be qemu; make it easy to flag VMs as always
having this state (most of my testing VMs are like this; it's OK if they get
destroyed, I just reinitialize them from the gold state).

Going back to threading - a tricky thing we should handle in general
is when userspace libraries create threads that are unknown to the app;
the "async gethostbyname()" is a good example.  To be conservative we'd
likely need to "fail non-idle", but figure out some way tell the kernel
for e.g. GC threads that they're still idle.


Re: [RFC] EPOLL_KILLME: New flag to epoll_wait() that subscribes process to death row (new syscall)

2017-11-01 Thread Colin Walters


On Wed, Nov 1, 2017, at 11:16 AM, Colin Walters wrote:
>
> as the maintainer of glib2 which is used by a *lot* of things; I'm not

(I meant to say "a" maintainer)

Also, while I'm not an expert in Android, I think the "what to kill" logic
there lives in userspace, right?   So it feels like we should expose this
state in e.g. /proc and allow userspace daemons (e.g. systemd, kubelet) to 
perform
idle collection too, even if the system isn't actually low on resources
from the kernel's perspective.

And doing that requires some sort of kill(pid, SIGKILL_IF_IDLE) or so?


Re: [RFC] EPOLL_KILLME: New flag to epoll_wait() that subscribes process to death row (new syscall)

2017-11-01 Thread Colin Walters


On Wed, Nov 1, 2017, at 11:16 AM, Colin Walters wrote:
>
> as the maintainer of glib2 which is used by a *lot* of things; I'm not

(I meant to say "a" maintainer)

Also, while I'm not an expert in Android, I think the "what to kill" logic
there lives in userspace, right?   So it feels like we should expose this
state in e.g. /proc and allow userspace daemons (e.g. systemd, kubelet) to 
perform
idle collection too, even if the system isn't actually low on resources
from the kernel's perspective.

And doing that requires some sort of kill(pid, SIGKILL_IF_IDLE) or so?


Re: [RFC] EPOLL_KILLME: New flag to epoll_wait() that subscribes process to death row (new syscall)

2017-11-01 Thread Colin Walters


On Wed, Nov 1, 2017, at 01:32 AM, Shawn Landden wrote:
> It is common for services to be stateless around their main event loop.
> If a process passes the EPOLL_KILLME flag to epoll_wait5() then it
> signals to the kernel that epoll_wait5() may not complete, and the kernel
> may send SIGKILL if resources get tight.
> 

I've thought about something like this in the past too and would love
to see it land.  Bigger picture, this also comes up in (server) container
environments, see e.g.:

https://docs.openshift.com/container-platform/3.3/admin_guide/idling_applications.html

There's going to be a long slog getting apps to actually make use
of this, but I suspect if it gets wrapped up nicely in some "framework"
libraries for C/C++, and be bound in the language ecosystems like golang
we could see a fair amount of adoption on the order of a year or two.

However, while I understand why it feels natural to tie this to epoll,
as the maintainer of glib2 which is used by a *lot* of things; I'm not
sure we're going to port to epoll anytime soon.

Why not just make this a prctl()?  It's not like it's really any less racy to 
do:

prctl(PR_SET_IDLE)
epoll()

and this also allows:

prctl(PR_SET_IDLE)
poll()

And as this is most often just going to be an optional hint it's easier to e.g. 
just ignore EINVAL
from the prctl().



Re: [RFC] EPOLL_KILLME: New flag to epoll_wait() that subscribes process to death row (new syscall)

2017-11-01 Thread Colin Walters


On Wed, Nov 1, 2017, at 01:32 AM, Shawn Landden wrote:
> It is common for services to be stateless around their main event loop.
> If a process passes the EPOLL_KILLME flag to epoll_wait5() then it
> signals to the kernel that epoll_wait5() may not complete, and the kernel
> may send SIGKILL if resources get tight.
> 

I've thought about something like this in the past too and would love
to see it land.  Bigger picture, this also comes up in (server) container
environments, see e.g.:

https://docs.openshift.com/container-platform/3.3/admin_guide/idling_applications.html

There's going to be a long slog getting apps to actually make use
of this, but I suspect if it gets wrapped up nicely in some "framework"
libraries for C/C++, and be bound in the language ecosystems like golang
we could see a fair amount of adoption on the order of a year or two.

However, while I understand why it feels natural to tie this to epoll,
as the maintainer of glib2 which is used by a *lot* of things; I'm not
sure we're going to port to epoll anytime soon.

Why not just make this a prctl()?  It's not like it's really any less racy to 
do:

prctl(PR_SET_IDLE)
epoll()

and this also allows:

prctl(PR_SET_IDLE)
poll()

And as this is most often just going to be an optional hint it's easier to e.g. 
just ignore EINVAL
from the prctl().



Re: [PATCH 1/3] autofs - fix AT_NO_AUTOMOUNT not being honored

2017-08-08 Thread Colin Walters
On Tue, Aug 8, 2017, at 12:26 AM, Ian Kent wrote:

> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3022,8 +3022,7 @@ static inline int vfs_lstat(const char __user *name, 
> struct kstat *stat)
>  static inline int vfs_fstatat(int dfd, const char __user *filename,
> struct kstat *stat, int flags)
>  {
> - return vfs_statx(dfd, filename, flags | AT_NO_AUTOMOUNT,
> -  stat, STATX_BASIC_STATS);
> + return vfs_statx(dfd, filename, flags, stat, STATX_BASIC_STATS);
>  }
>  static inline int vfs_fstat(int fd, struct kstat *stat)
>  {

This is reverting the fstatat() prat of
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=deccf497d804a4c5fca2dbfad2f104675a6f9102
Which itself seems weird to me - it looks like we were unconditionally
forcing on AT_NO_AUTOMOUNT regardless of what userspace passed?
So perhaps a
Fixes: deccf497d804a4c5fca2dbfad2f104675a6f9102
is appropriate here?

I understand that for stat()/lstat() we didn't expose the option to userspace,
so the behavior was...ah, there's this note in man-pages 
(man-pages-4.09-3.fc26.noarch):

> On Linux, lstat() will generally not trigger automounter action, whereas 
> stat() will (but see fstatat(2)).

I have no idea of the history here, but maybe it makes sense to drop
the AT_NO_AUTOMOUNT from the vfs_stat() too?


Re: [PATCH 1/3] autofs - fix AT_NO_AUTOMOUNT not being honored

2017-08-08 Thread Colin Walters
On Tue, Aug 8, 2017, at 12:26 AM, Ian Kent wrote:

> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3022,8 +3022,7 @@ static inline int vfs_lstat(const char __user *name, 
> struct kstat *stat)
>  static inline int vfs_fstatat(int dfd, const char __user *filename,
> struct kstat *stat, int flags)
>  {
> - return vfs_statx(dfd, filename, flags | AT_NO_AUTOMOUNT,
> -  stat, STATX_BASIC_STATS);
> + return vfs_statx(dfd, filename, flags, stat, STATX_BASIC_STATS);
>  }
>  static inline int vfs_fstat(int fd, struct kstat *stat)
>  {

This is reverting the fstatat() prat of
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=deccf497d804a4c5fca2dbfad2f104675a6f9102
Which itself seems weird to me - it looks like we were unconditionally
forcing on AT_NO_AUTOMOUNT regardless of what userspace passed?
So perhaps a
Fixes: deccf497d804a4c5fca2dbfad2f104675a6f9102
is appropriate here?

I understand that for stat()/lstat() we didn't expose the option to userspace,
so the behavior was...ah, there's this note in man-pages 
(man-pages-4.09-3.fc26.noarch):

> On Linux, lstat() will generally not trigger automounter action, whereas 
> stat() will (but see fstatat(2)).

I have no idea of the history here, but maybe it makes sense to drop
the AT_NO_AUTOMOUNT from the vfs_stat() too?


Re: [PATCH 1/3] fs, xfs: introduce S_IOMAP_IMMUTABLE

2017-07-31 Thread Colin Walters
On Mon, Jul 31, 2017, at 02:23 PM, Darrick J. Wong wrote:

> I don't think F_SEAL_{SHRINK,GROW} prevents reflinking or CoW of file data,
> which are two things that cannot happen under S_IOMAP_IMMUTABLE that
> aren't size changes.  From the implementation it looks like shrink and
> grow are only supposed to disallow changes to i_size, not i_blocks (or
> the file block map).

True. 

> Then again, I suppose F_SEAL_* only work on shmem, so maybe it simply
> isn't defined for any other filesystem...?  e.g. it doesn't prohibit
> reflink, but the only fs implementing seals doesn't support reflink.
> 
> 
> 
> Seals cannot be removed, which is too strict for the S_IOMAP_IMMUTABLE
> user cases being presented.

To be clear, the set of use cases is swap files and DAX, right?  Or is there 
anything else?
I can't imagine why anyone would want to turn a swap file back into a regular 
file.
I haven't fully followed DAX, but I'd take your word for it if people want to
be able to remove the flag after.

Anyways, I think your broader point is right; the use cases are different enough
that it doesn't make sense to try to add S_CONTENT_IMMUTABLE (or however 
one decides to call it) at the same time.


Re: [PATCH 1/3] fs, xfs: introduce S_IOMAP_IMMUTABLE

2017-07-31 Thread Colin Walters
On Mon, Jul 31, 2017, at 02:23 PM, Darrick J. Wong wrote:

> I don't think F_SEAL_{SHRINK,GROW} prevents reflinking or CoW of file data,
> which are two things that cannot happen under S_IOMAP_IMMUTABLE that
> aren't size changes.  From the implementation it looks like shrink and
> grow are only supposed to disallow changes to i_size, not i_blocks (or
> the file block map).

True. 

> Then again, I suppose F_SEAL_* only work on shmem, so maybe it simply
> isn't defined for any other filesystem...?  e.g. it doesn't prohibit
> reflink, but the only fs implementing seals doesn't support reflink.
> 
> 
> 
> Seals cannot be removed, which is too strict for the S_IOMAP_IMMUTABLE
> user cases being presented.

To be clear, the set of use cases is swap files and DAX, right?  Or is there 
anything else?
I can't imagine why anyone would want to turn a swap file back into a regular 
file.
I haven't fully followed DAX, but I'd take your word for it if people want to
be able to remove the flag after.

Anyways, I think your broader point is right; the use cases are different enough
that it doesn't make sense to try to add S_CONTENT_IMMUTABLE (or however 
one decides to call it) at the same time.


Re: [PATCH 1/3] fs, xfs: introduce S_IOMAP_IMMUTABLE

2017-07-31 Thread Colin Walters


On Mon, Jul 31, 2017, at 12:32 PM, Colin Walters wrote:
> On Mon, Jul 31, 2017, at 12:29 PM, Dan Williams wrote:
> > 
> > How is S_CONTENTS_IMMUTABLE different than S_IMMUTABLE?
> 
> We still want the ability to make hardlinks.

Also of course, symmetrically, to unlink.   If we used S_IMMUTABLE for 
/etc/sudoers,
it'd still be racy since one would have to transiently remove the flag in order
to replace it with a new version.

Related to this topic is the fact that S_IMMUTABLE is itself mutable; I
think once S_IMMUTABLE_CONTENTS is set, it would not be able to made
mutable again.  

Also I just remembered that since then memfd_create() and more notably
fcntl(F_ADD_SEALS) landed - in fact it already has flags for what we want
here AFAICS.  Your S_IOMAP_IMMUTABLE is fcntl(F_ADD_SEALS, F_SEAL_SHRINK | 
F_SEAL_GROW)
and mine just adds in F_SEAL_WRITE.  I think there was some discussion
of the seals for persistent files when memfd_create() landed, but I can't
find it offhand.


Re: [PATCH 1/3] fs, xfs: introduce S_IOMAP_IMMUTABLE

2017-07-31 Thread Colin Walters


On Mon, Jul 31, 2017, at 12:32 PM, Colin Walters wrote:
> On Mon, Jul 31, 2017, at 12:29 PM, Dan Williams wrote:
> > 
> > How is S_CONTENTS_IMMUTABLE different than S_IMMUTABLE?
> 
> We still want the ability to make hardlinks.

Also of course, symmetrically, to unlink.   If we used S_IMMUTABLE for 
/etc/sudoers,
it'd still be racy since one would have to transiently remove the flag in order
to replace it with a new version.

Related to this topic is the fact that S_IMMUTABLE is itself mutable; I
think once S_IMMUTABLE_CONTENTS is set, it would not be able to made
mutable again.  

Also I just remembered that since then memfd_create() and more notably
fcntl(F_ADD_SEALS) landed - in fact it already has flags for what we want
here AFAICS.  Your S_IOMAP_IMMUTABLE is fcntl(F_ADD_SEALS, F_SEAL_SHRINK | 
F_SEAL_GROW)
and mine just adds in F_SEAL_WRITE.  I think there was some discussion
of the seals for persistent files when memfd_create() landed, but I can't
find it offhand.


Re: [PATCH 1/3] fs, xfs: introduce S_IOMAP_IMMUTABLE

2017-07-31 Thread Colin Walters
On Mon, Jul 31, 2017, at 12:29 PM, Dan Williams wrote:
> 
> How is S_CONTENTS_IMMUTABLE different than S_IMMUTABLE?

We still want the ability to make hardlinks.


Re: [PATCH 1/3] fs, xfs: introduce S_IOMAP_IMMUTABLE

2017-07-31 Thread Colin Walters
On Mon, Jul 31, 2017, at 12:29 PM, Dan Williams wrote:
> 
> How is S_CONTENTS_IMMUTABLE different than S_IMMUTABLE?

We still want the ability to make hardlinks.


Re: [PATCH 1/3] fs, xfs: introduce S_IOMAP_IMMUTABLE

2017-07-31 Thread Colin Walters
On Sat, Jul 29, 2017, at 03:43 PM, Dan Williams wrote:
> An inode with this flag set indicates that the file's block map cannot
> be changed, no size change, deletion, hole-punch, range collapse, or
> reflink.
> 
> The implementation of toggling the flag and sealing the state of the
> extent map is saved for a later patch. The functionality provided by
> S_IOMAP_IMMUTABLE, once toggle support is added, will be a superset of
> that provided by S_SWAPFILE, and it is targeted to replace it.
> 
> For now, only xfs and the core vfs are updated to consider the new flag.

Quite a while ago I started a request for O_OBJECT:
http://www.spinics.net/lists/linux-fsdevel/msg75085.html
A few months ago I was thinking about that more and realized
it'd likely be more palatable to land as an inode flag, like
you're doing here.

Now, S_IOMAP_IMMUTABLE would be quite close to the semantics
I want for ostree, except we also want to disallow
changes to the inode uid, gid or mode.  (Extended attributes are 
a whole other story; but I'd like to at least disallow changes to the
security. namespace).

The goal here is mostly about resilience to *accidental* changes;
think an admin doing `cp /path/to/binary /usr/bin/bash` which
does open(O_TRUNC), which would hence corrupt all hardlinks.

S_IOMAP_IMMUTABLE would give a lot of great protection against
those types of accidental changes - most of them are either going
to be open(O_TRUNC) or O_APPEND.   Since you're touching various
write paths here, perhaps we can also add
S_CONTENTS_IMMUTABLE or something at the same time?

If this lands as is - I'm quite likely to change ostree to use it;
any objections to that?  As mentioned in the thread, there are several
other cases of "content immutable" files in userspace, such as
QEMU "qcow2", git objects.  And really the most classic example is
/etc/sudoers and the need for a special "visudo" program to really
ensure that editors don't do in-place overwrites.

But it'd be great if we can use this push to also land "content immutabilty"
or however we decide to call it.



Re: [PATCH 1/3] fs, xfs: introduce S_IOMAP_IMMUTABLE

2017-07-31 Thread Colin Walters
On Sat, Jul 29, 2017, at 03:43 PM, Dan Williams wrote:
> An inode with this flag set indicates that the file's block map cannot
> be changed, no size change, deletion, hole-punch, range collapse, or
> reflink.
> 
> The implementation of toggling the flag and sealing the state of the
> extent map is saved for a later patch. The functionality provided by
> S_IOMAP_IMMUTABLE, once toggle support is added, will be a superset of
> that provided by S_SWAPFILE, and it is targeted to replace it.
> 
> For now, only xfs and the core vfs are updated to consider the new flag.

Quite a while ago I started a request for O_OBJECT:
http://www.spinics.net/lists/linux-fsdevel/msg75085.html
A few months ago I was thinking about that more and realized
it'd likely be more palatable to land as an inode flag, like
you're doing here.

Now, S_IOMAP_IMMUTABLE would be quite close to the semantics
I want for ostree, except we also want to disallow
changes to the inode uid, gid or mode.  (Extended attributes are 
a whole other story; but I'd like to at least disallow changes to the
security. namespace).

The goal here is mostly about resilience to *accidental* changes;
think an admin doing `cp /path/to/binary /usr/bin/bash` which
does open(O_TRUNC), which would hence corrupt all hardlinks.

S_IOMAP_IMMUTABLE would give a lot of great protection against
those types of accidental changes - most of them are either going
to be open(O_TRUNC) or O_APPEND.   Since you're touching various
write paths here, perhaps we can also add
S_CONTENTS_IMMUTABLE or something at the same time?

If this lands as is - I'm quite likely to change ostree to use it;
any objections to that?  As mentioned in the thread, there are several
other cases of "content immutable" files in userspace, such as
QEMU "qcow2", git objects.  And really the most classic example is
/etc/sudoers and the need for a special "visudo" program to really
ensure that editors don't do in-place overwrites.

But it'd be great if we can use this push to also land "content immutabilty"
or however we decide to call it.



Re: [RFC PATCH] KEYS: Allow a live daemon in a namespace to service request_key upcalls

2017-05-31 Thread Colin Walters
On Wed, May 31, 2017, at 10:47 AM, David Howells wrote:


> > So if the mount-in-container is mostly init containers, and for init
> > containers you have *all* namespaces usually, does it make
> > sense to have the capability to match namespaces for key services?
> 
> Yes.
> 
> If I don't provide it, someone will complain that I haven't provided it.

I don't think it's as clear cut as that.  I'd agree that where possible, it 
makes
sense to be general since it's hard to envision every use case, but in this 
particualr
case there are risks around security and clear maintenance issues.  Providing
a feature without *known* users in a security-sensitive context seems to
me to be something to think twice about.

> > Something that worries me a lot here is for e.g. Docker today, the default
> > is uid 0 but not CAP_SYS_ADMIN.  We don't want a container that I run
> > with --host=net to be able to read the "host" keyrings, even if it shares
> > the host network namespace.
> 
> This is a separate issue.

Kind of - what I'm getting at is that today, changing any of the kernel
upcalls is a fully privileged operation.  Most container userspace tools
mount /proc read-only to ensure that even uid 0 containers can't change it
by default.

(Wait, is /sbin/request-key really hardcoded in the kernel?  I guess that's
 good from the perspective of not having containers be able to change it 
 like they could /proc/sys/kernel/core_pattern if it was writable, but
 it seems surprising)

Anyways, if we now expose a method to configure this, we have to look
at the increase in attack surface.

In practice again, most container implementations I'm aware of use
seccomp today to simply block off access to the keyring.   As I mentioned
Docker does it, so does flatpak:
https://github.com/flatpak/flatpak/blob/ea7077fcd431fb98fe85cd815cbd2ec13df58d09/common/flatpak-run.c#L4007
and Chrome:
https://cs.chromium.org/chromium/src/sandbox/linux/seccomp-bpf-helpers/syscall_sets.cc?q=keyctl=C=791

But I'm a bit uncertain about *relying* on the seccomp filtering.  Particularly
because we do want the "init container" approach to work and be able
to use the kernel keyring, but also not be able to affect other containers
or the host.

> Keys may be relevant in different namespaces, which makes namespacing them
> hard to achieve.  For instance, dns-, idmapper- and rxrpc-type keys should
> probably be differentiated by network namespace.
> 
> However, it might be worth creating a keyrings namespace.

Hm, yes - I think I was conflating CLONE_NEWUSER with uid 0's keyring
but that's a distinct thing from the kernel keyrings.

> > Basically my instinct here is to be conservative and have KEYCTL_SERVICE_ADD
> > require CAP_SYS_ADMIN and only affect the userns keyring.
> 
> "Affect" in what sense?

Operate on at all - view/read/write/search etc.

At a high level I'm glad you're looking at the "service fd" model instead of
upcalls - I do think it'll get us to a better place.  The main thing I'm getting
at first though is making sure we're not introducing new security issues, and 
that the
new proposed API makes sense for some of the important userspace use cases.


Re: [RFC PATCH] KEYS: Allow a live daemon in a namespace to service request_key upcalls

2017-05-31 Thread Colin Walters
On Wed, May 31, 2017, at 10:47 AM, David Howells wrote:


> > So if the mount-in-container is mostly init containers, and for init
> > containers you have *all* namespaces usually, does it make
> > sense to have the capability to match namespaces for key services?
> 
> Yes.
> 
> If I don't provide it, someone will complain that I haven't provided it.

I don't think it's as clear cut as that.  I'd agree that where possible, it 
makes
sense to be general since it's hard to envision every use case, but in this 
particualr
case there are risks around security and clear maintenance issues.  Providing
a feature without *known* users in a security-sensitive context seems to
me to be something to think twice about.

> > Something that worries me a lot here is for e.g. Docker today, the default
> > is uid 0 but not CAP_SYS_ADMIN.  We don't want a container that I run
> > with --host=net to be able to read the "host" keyrings, even if it shares
> > the host network namespace.
> 
> This is a separate issue.

Kind of - what I'm getting at is that today, changing any of the kernel
upcalls is a fully privileged operation.  Most container userspace tools
mount /proc read-only to ensure that even uid 0 containers can't change it
by default.

(Wait, is /sbin/request-key really hardcoded in the kernel?  I guess that's
 good from the perspective of not having containers be able to change it 
 like they could /proc/sys/kernel/core_pattern if it was writable, but
 it seems surprising)

Anyways, if we now expose a method to configure this, we have to look
at the increase in attack surface.

In practice again, most container implementations I'm aware of use
seccomp today to simply block off access to the keyring.   As I mentioned
Docker does it, so does flatpak:
https://github.com/flatpak/flatpak/blob/ea7077fcd431fb98fe85cd815cbd2ec13df58d09/common/flatpak-run.c#L4007
and Chrome:
https://cs.chromium.org/chromium/src/sandbox/linux/seccomp-bpf-helpers/syscall_sets.cc?q=keyctl=C=791

But I'm a bit uncertain about *relying* on the seccomp filtering.  Particularly
because we do want the "init container" approach to work and be able
to use the kernel keyring, but also not be able to affect other containers
or the host.

> Keys may be relevant in different namespaces, which makes namespacing them
> hard to achieve.  For instance, dns-, idmapper- and rxrpc-type keys should
> probably be differentiated by network namespace.
> 
> However, it might be worth creating a keyrings namespace.

Hm, yes - I think I was conflating CLONE_NEWUSER with uid 0's keyring
but that's a distinct thing from the kernel keyrings.

> > Basically my instinct here is to be conservative and have KEYCTL_SERVICE_ADD
> > require CAP_SYS_ADMIN and only affect the userns keyring.
> 
> "Affect" in what sense?

Operate on at all - view/read/write/search etc.

At a high level I'm glad you're looking at the "service fd" model instead of
upcalls - I do think it'll get us to a better place.  The main thing I'm getting
at first though is making sure we're not introducing new security issues, and 
that the
new proposed API makes sense for some of the important userspace use cases.


Re: [RFC PATCH] KEYS: Allow a live daemon in a namespace to service request_key upcalls

2017-05-31 Thread Colin Walters
On Tue, May 30, 2017, at 12:08 PM, David Howells wrote:
>
>   KEY_SERVICE_NS_UTS
>   KEY_SERVICE_NS_IPC
>   KEY_SERVICE_NS_MNT
>   KEY_SERVICE_NS_PID
>   KEY_SERVICE_NS_NET
>   KEY_SERVICE_NS_CGROUP

Any reasons not to reuse the CLONE_ flags?

> which select those namespaces of the caller that must match for the daemon
> to be given the request.  So, for example, a daemon that wanted to service
> DNS requests from the kernel would do something like:
> 
>   keyctl(KEYCTL_SERVICE_ADD, fd, "dns_resolver", KEY_SERVICE_NS_NET);

At least to me, it's not clear what the use cases really are.  Do we expect
people to e.g. set up NFS mounts that require keys/DNS inside "a container"
(and if in a container, with what namespaces?)

My offhand feeling is that most of the mounting-in-container case is mostly
"init container" where you migrate a VM -> container and run sshd etc.
in the container.  

There's a whole thread on extending what filesystems can be mounted with
a userns CAP_SYS_ADMIN but personally I doubt that's going to really happen
given how unrealistic it is for the kernel to parse potentially hostile 
filesystems.

So if the mount-in-container is mostly init containers, and for init
containers you have *all* namespaces usually, does it make
sense to have the capability to match namespaces for key services?

Something that worries me a lot here is for e.g. Docker today, the default
is uid 0 but not CAP_SYS_ADMIN.  We don't want a container that I run
with --host=net to be able to read the "host" keyrings, even if it shares
the host network namespace.

Today for Docker the default seccomp policy prevents access to keyctl()
at all because it's only with user namespaces that the kerying is namespaced.

Basically my instinct here is to be conservative and have KEYCTL_SERVICE_ADD
require CAP_SYS_ADMIN and only affect the userns keyring.


Re: [RFC PATCH] KEYS: Allow a live daemon in a namespace to service request_key upcalls

2017-05-31 Thread Colin Walters
On Tue, May 30, 2017, at 12:08 PM, David Howells wrote:
>
>   KEY_SERVICE_NS_UTS
>   KEY_SERVICE_NS_IPC
>   KEY_SERVICE_NS_MNT
>   KEY_SERVICE_NS_PID
>   KEY_SERVICE_NS_NET
>   KEY_SERVICE_NS_CGROUP

Any reasons not to reuse the CLONE_ flags?

> which select those namespaces of the caller that must match for the daemon
> to be given the request.  So, for example, a daemon that wanted to service
> DNS requests from the kernel would do something like:
> 
>   keyctl(KEYCTL_SERVICE_ADD, fd, "dns_resolver", KEY_SERVICE_NS_NET);

At least to me, it's not clear what the use cases really are.  Do we expect
people to e.g. set up NFS mounts that require keys/DNS inside "a container"
(and if in a container, with what namespaces?)

My offhand feeling is that most of the mounting-in-container case is mostly
"init container" where you migrate a VM -> container and run sshd etc.
in the container.  

There's a whole thread on extending what filesystems can be mounted with
a userns CAP_SYS_ADMIN but personally I doubt that's going to really happen
given how unrealistic it is for the kernel to parse potentially hostile 
filesystems.

So if the mount-in-container is mostly init containers, and for init
containers you have *all* namespaces usually, does it make
sense to have the capability to match namespaces for key services?

Something that worries me a lot here is for e.g. Docker today, the default
is uid 0 but not CAP_SYS_ADMIN.  We don't want a container that I run
with --host=net to be able to read the "host" keyrings, even if it shares
the host network namespace.

Today for Docker the default seccomp policy prevents access to keyctl()
at all because it's only with user namespaces that the kerying is namespaced.

Basically my instinct here is to be conservative and have KEYCTL_SERVICE_ADD
require CAP_SYS_ADMIN and only affect the userns keyring.


Re: [RFC][PATCH 0/9] Make containers kernel objects

2017-05-23 Thread Colin Walters
On Tue, May 23, 2017, at 11:31 AM, Jeff Layton wrote:
> 
> nfsdcltrack was originally nfsdcld, a long running daemon that used
> rpc_pipefs to talk to the kernel. That meant that you had to make sure
> it gets enabled by systemd (or sysvinit, etc). If it dies, then you also
> want to ensure that it gets restarted lest the kernel server hang,
> etc...
> 
> It was pretty universally hated, as it was just one more daemon that you
> needed to run to work a proper nfs server. So, I was encouraged to
> switch it to a call_usermodehelper upcall and since then it has just
> worked, as long as the binary is installed.

Note that with the "read()/write() fd" model you don't need
a whole process just to do that...the functionality could be rolled into systemd
or equivalent easily enough.

> "You're doing it wrong. You just need to run all of these services as
> long-running daemons."

Also, I imagine we could figure out a clean model to do *activation*
from kernel -> userspace too.  systemd's socket activation model
where pid 1 activates units on demand is quite nice and obviates
the need to configure things on in advance.


Re: [RFC][PATCH 0/9] Make containers kernel objects

2017-05-23 Thread Colin Walters
On Tue, May 23, 2017, at 11:31 AM, Jeff Layton wrote:
> 
> nfsdcltrack was originally nfsdcld, a long running daemon that used
> rpc_pipefs to talk to the kernel. That meant that you had to make sure
> it gets enabled by systemd (or sysvinit, etc). If it dies, then you also
> want to ensure that it gets restarted lest the kernel server hang,
> etc...
> 
> It was pretty universally hated, as it was just one more daemon that you
> needed to run to work a proper nfs server. So, I was encouraged to
> switch it to a call_usermodehelper upcall and since then it has just
> worked, as long as the binary is installed.

Note that with the "read()/write() fd" model you don't need
a whole process just to do that...the functionality could be rolled into systemd
or equivalent easily enough.

> "You're doing it wrong. You just need to run all of these services as
> long-running daemons."

Also, I imagine we could figure out a clean model to do *activation*
from kernel -> userspace too.  systemd's socket activation model
where pid 1 activates units on demand is quite nice and obviates
the need to configure things on in advance.


Re: [RFC][PATCH 0/9] Make containers kernel objects

2017-05-23 Thread Colin Walters
On Tue, May 23, 2017, at 10:30 AM, Djalal Harouni wrote:
>
> Maybe it depends on the cases, a general approach can be too difficult
> to handle especially from the security point. Maybe it is better to
> identify what operations need what context, and a userspace
> service/proxy can act using kthreadd with the right context... at
> least the shift to this model has been done for years now in the
> mobile industry.

Why not drop the upcall model in favor of having userspace
monitor events via a (more efficient) protocol and react to them on its own?
It's just generally more flexible and avoids all of those issues like
replicating the seccomp configuration, etc.

Something like inotify/signalfd could be a precedent around having a 
read()/poll()able
fd.  /proc/keys-requests ?

Then if you create a new user namespace, and open /proc/keys-requests, the
kernel will always write to that instead of calling /sbin/request-key.


Re: [RFC][PATCH 0/9] Make containers kernel objects

2017-05-23 Thread Colin Walters
On Tue, May 23, 2017, at 10:30 AM, Djalal Harouni wrote:
>
> Maybe it depends on the cases, a general approach can be too difficult
> to handle especially from the security point. Maybe it is better to
> identify what operations need what context, and a userspace
> service/proxy can act using kthreadd with the right context... at
> least the shift to this model has been done for years now in the
> mobile industry.

Why not drop the upcall model in favor of having userspace
monitor events via a (more efficient) protocol and react to them on its own?
It's just generally more flexible and avoids all of those issues like
replicating the seccomp configuration, etc.

Something like inotify/signalfd could be a precedent around having a 
read()/poll()able
fd.  /proc/keys-requests ?

Then if you create a new user namespace, and open /proc/keys-requests, the
kernel will always write to that instead of calling /sbin/request-key.


Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored

2017-05-12 Thread Colin Walters
On Wed, May 10, 2017, at 12:18 AM, Ian Kent wrote:
> The fstatat(2) and statx() calls can pass the flag AT_NO_AUTOMOUNT
> which is meant to clear the LOOKUP_AUTOMOUNT flag and prevent triggering
> of an automount by the call. But this flag is unconditionally cleared
> for all stat family system calls except statx().
> 

Here's the GLib patch to make use of this:
https://bugzilla.gnome.org/show_bug.cgi?id=782554



Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored

2017-05-12 Thread Colin Walters
On Wed, May 10, 2017, at 12:18 AM, Ian Kent wrote:
> The fstatat(2) and statx() calls can pass the flag AT_NO_AUTOMOUNT
> which is meant to clear the LOOKUP_AUTOMOUNT flag and prevent triggering
> of an automount by the call. But this flag is unconditionally cleared
> for all stat family system calls except statx().
> 

Here's the GLib patch to make use of this:
https://bugzilla.gnome.org/show_bug.cgi?id=782554



Re: [PATCH 1/2] vfs: implement fchmodat2() syscall

2017-04-11 Thread Colin Walters
On Tue, Apr 11, 2017, at 02:07 PM, Eric Blake wrote:
> 
> A good idea on the surface. But reading the man page of openat(), the
> section on O_PATH says:
>The  file
>   itself  is not opened, and other file operations (e.g.,
> read(2),
>   write(2), fchmod(2), fchown(2), fgetxattr(2), mmap(2))
> fail with
>   the error EBADF.

Right, though more topically I'd have expected
fchmodat() (not fchmod()) to take AT_EMPTY_PATH,
just like fstatat() does.

But it doesn't appear to be supported...oh, even at
the syscall level, interesting.   Ah, I see, glibc does:

int
fchmodat (int fd, const char *file, mode_t mode, int flag)
{
  if (flag & ~AT_SYMLINK_NOFOLLOW)
return INLINE_SYSCALL_ERROR_RETURN_VALUE (EINVAL);
...
}

And indeed the syscall doesn't have flags, bringing us back
to the start here.   Sorry, that seems obvious in retrospect,
but I was "working forwards" from the O_PATH userspace API
mindset.




Re: [PATCH 1/2] vfs: implement fchmodat2() syscall

2017-04-11 Thread Colin Walters
On Tue, Apr 11, 2017, at 02:07 PM, Eric Blake wrote:
> 
> A good idea on the surface. But reading the man page of openat(), the
> section on O_PATH says:
>The  file
>   itself  is not opened, and other file operations (e.g.,
> read(2),
>   write(2), fchmod(2), fchown(2), fgetxattr(2), mmap(2))
> fail with
>   the error EBADF.

Right, though more topically I'd have expected
fchmodat() (not fchmod()) to take AT_EMPTY_PATH,
just like fstatat() does.

But it doesn't appear to be supported...oh, even at
the syscall level, interesting.   Ah, I see, glibc does:

int
fchmodat (int fd, const char *file, mode_t mode, int flag)
{
  if (flag & ~AT_SYMLINK_NOFOLLOW)
return INLINE_SYSCALL_ERROR_RETURN_VALUE (EINVAL);
...
}

And indeed the syscall doesn't have flags, bringing us back
to the start here.   Sorry, that seems obvious in retrospect,
but I was "working forwards" from the O_PATH userspace API
mindset.




Re: [PATCH 1/2] vfs: implement fchmodat2() syscall

2017-04-11 Thread Colin Walters


On Tue, Feb 28, 2017, at 02:23 PM, Eric Blake wrote:

> Might also be worth mentioning that this patch is required in order to
> solve CVE-2016-9602, per discussion at
> https://lists.gnu.org/archive/html/qemu-devel/2017-02/msg06089.html

I only briefly looked at this, but can't `open(..., O_PATH)` be used to solve
this today?


Re: [PATCH 1/2] vfs: implement fchmodat2() syscall

2017-04-11 Thread Colin Walters


On Tue, Feb 28, 2017, at 02:23 PM, Eric Blake wrote:

> Might also be worth mentioning that this patch is required in order to
> solve CVE-2016-9602, per discussion at
> https://lists.gnu.org/archive/html/qemu-devel/2017-02/msg06089.html

I only briefly looked at this, but can't `open(..., O_PATH)` be used to solve
this today?


Re: [RFC] [PATCH] Add a "nolinks" mount option.

2016-10-19 Thread Colin Walters
On Wed, Oct 19, 2016, at 07:28 AM, Mattias Nissler wrote:
> 
> Note that O_NOFOLLOW only affects the final path component. If there's
> a symlink in any of the parent directories, that'll still be traversed
> even with O_NOFOLLOW. This situation is less risky as an attacker will
> have to deal with the restriction of a fixed filename in the last
> component, but might still be exploitable.

Yeah, I meant that you'd walk the path string in userspace one by
one. That said the "fstat at the end and check device" seems a
lot better, or perhaps the mount namespaces could help.

Also, don't forget about `setfsuid()`.

> The difficulty lies in applying these measures of precaution
> system-wide. This affects most init scripts and daemons, and
> everything else that keeps state on the writable file system. 

One thing to note is that at least in the freedesktop.org/GNOME etc.
side of things, we basically never have privileged processes
accessing user home directories anymore.

A good example is that GDM used to read ~username/.config/face.png
or something like that to show the user's picture on the login screen, and that 
was
subject to many of the same risks.

But we've basically across the board migrated to a model where
the unprivileged user session talks to privileged daemons via
a DBus (or other) API.  In this case, the picture data is stored
in accountsservice.  NetworkManager is another big
example of this, where e.g. WiFi credentials can be per user, and
the session passes them to the privileged daemon over DBus,
rather than having the privileged process try to parse config files
in the user's homedir.   It's a lot easier to secure.


Re: [RFC] [PATCH] Add a "nolinks" mount option.

2016-10-19 Thread Colin Walters
On Wed, Oct 19, 2016, at 07:28 AM, Mattias Nissler wrote:
> 
> Note that O_NOFOLLOW only affects the final path component. If there's
> a symlink in any of the parent directories, that'll still be traversed
> even with O_NOFOLLOW. This situation is less risky as an attacker will
> have to deal with the restriction of a fixed filename in the last
> component, but might still be exploitable.

Yeah, I meant that you'd walk the path string in userspace one by
one. That said the "fstat at the end and check device" seems a
lot better, or perhaps the mount namespaces could help.

Also, don't forget about `setfsuid()`.

> The difficulty lies in applying these measures of precaution
> system-wide. This affects most init scripts and daemons, and
> everything else that keeps state on the writable file system. 

One thing to note is that at least in the freedesktop.org/GNOME etc.
side of things, we basically never have privileged processes
accessing user home directories anymore.

A good example is that GDM used to read ~username/.config/face.png
or something like that to show the user's picture on the login screen, and that 
was
subject to many of the same risks.

But we've basically across the board migrated to a model where
the unprivileged user session talks to privileged daemons via
a DBus (or other) API.  In this case, the picture data is stored
in accountsservice.  NetworkManager is another big
example of this, where e.g. WiFi credentials can be per user, and
the session passes them to the privileged daemon over DBus,
rather than having the privileged process try to parse config files
in the user's homedir.   It's a lot easier to secure.


Re: [RFC] [PATCH] Add a "nolinks" mount option.

2016-10-18 Thread Colin Walters
On Mon, Oct 17, 2016, at 09:02 AM, Mattias Nissler wrote:
> OK, no more feedback thus far. Is there generally any interest in a
> mount option to avoid path name aliasing resulting in target file
> confusion? Perhaps a version that only disables symlinks instead of
> also hard-disabling files hard-linked to multiple locations (those are
> much lower risk for the situation I care about)?

So the situation here is a (privileged) process that is trying to read/write
to a filesystem tree writable by other processes that are in a separate
security domain?

That's a classic situation that requires extreme care, and I am doubtful
that symlinks are the only issue you're facing.  For example, if this
process is also *parsing* any data there, there's another whole source
of risk.

I suspect for you it wouldn't be too hard to have a "follow untrusted
path" helper function, it's possible to implement in userspace safely
with O_NOFOLLOW etc.

Regardless too, it sounds like what you want more is a
"same filesystem" traversal (stat and compare devices). 

Or does it even need to handle full traversal?  Would it have mitigated
the security issue to fstat() any files you opened and verified they
were from the writable partition you expected?





Re: [RFC] [PATCH] Add a "nolinks" mount option.

2016-10-18 Thread Colin Walters
On Mon, Oct 17, 2016, at 09:02 AM, Mattias Nissler wrote:
> OK, no more feedback thus far. Is there generally any interest in a
> mount option to avoid path name aliasing resulting in target file
> confusion? Perhaps a version that only disables symlinks instead of
> also hard-disabling files hard-linked to multiple locations (those are
> much lower risk for the situation I care about)?

So the situation here is a (privileged) process that is trying to read/write
to a filesystem tree writable by other processes that are in a separate
security domain?

That's a classic situation that requires extreme care, and I am doubtful
that symlinks are the only issue you're facing.  For example, if this
process is also *parsing* any data there, there's another whole source
of risk.

I suspect for you it wouldn't be too hard to have a "follow untrusted
path" helper function, it's possible to implement in userspace safely
with O_NOFOLLOW etc.

Regardless too, it sounds like what you want more is a
"same filesystem" traversal (stat and compare devices). 

Or does it even need to handle full traversal?  Would it have mitigated
the security issue to fstat() any files you opened and verified they
were from the writable partition you expected?





Re: [PATCH v2 00/10] userns: sysctl limits for namespaces

2016-07-22 Thread Colin Walters
On Thu, Jul 21, 2016, at 12:39 PM, Eric W. Biederman wrote:
> 
> This patchset addresses two use cases:
> - Implement a sane upper bound on the number of namespaces.
> - Provide a way for sandboxes to limit the attack surface from
>   namespaces.

Perhaps this is obvious, but since you didn't quite explicitly state it;
do you see this as obsoleting the existing downstream patches
mentioned in:
https://lwn.net/Articles/673597/
It seems conceptually similar to Kees' original approach, right?

The high level makes sense to me...most interesting is
per-userns sysctls.  I'll note most current container managers
mount /proc/sys read-only, and Docker specifically drops
CAP_SYS_RESOURCE by default, so they'd likely need to learn
how to undo that if one wanted to support recursive container usage.
We'd probably need to evaluate the safety of having /proc/sys
writable generally.  (Also it's rather common to filter out CLONE_NEWUSER
via seccomp, but that's easy to undo)

But that's the flip side - if we're aiming primarily for an upstreamable
way to *limit* namespace usage, it seems sane to me.



Re: [PATCH v2 00/10] userns: sysctl limits for namespaces

2016-07-22 Thread Colin Walters
On Thu, Jul 21, 2016, at 12:39 PM, Eric W. Biederman wrote:
> 
> This patchset addresses two use cases:
> - Implement a sane upper bound on the number of namespaces.
> - Provide a way for sandboxes to limit the attack surface from
>   namespaces.

Perhaps this is obvious, but since you didn't quite explicitly state it;
do you see this as obsoleting the existing downstream patches
mentioned in:
https://lwn.net/Articles/673597/
It seems conceptually similar to Kees' original approach, right?

The high level makes sense to me...most interesting is
per-userns sysctls.  I'll note most current container managers
mount /proc/sys read-only, and Docker specifically drops
CAP_SYS_RESOURCE by default, so they'd likely need to learn
how to undo that if one wanted to support recursive container usage.
We'd probably need to evaluate the safety of having /proc/sys
writable generally.  (Also it's rather common to filter out CLONE_NEWUSER
via seccomp, but that's easy to undo)

But that's the flip side - if we're aiming primarily for an upstreamable
way to *limit* namespace usage, it seems sane to me.



Re: [PATCH v2 0/2] vfs: Define new syscall getumask.

2016-04-13 Thread Colin Walters
On Wed, Apr 13, 2016, at 08:57 AM, Richard W.M. Jones wrote:

> It's not possible to read the process umask without also modifying it,
> which is what umask(2) does.  A library cannot read umask safely,
> especially if the main program might be multithreaded.

I assume you just want to do this from a shared library so you can
determine whether or not you need to call fchown() after making files
and the like?  If that's the case it'd be good to note it in the commit
message.

BTW...it might be a good idea to add a flags argument:
https://lwn.net/Articles/585415/

Did you consider calling this `umask2`, having the initial version only support
retrieving it via a UMASK_GET flag, and lay the groundwork to support
setting a threadsafe umask with a UMASK_SET_THREAD flag?


Re: [PATCH v2 0/2] vfs: Define new syscall getumask.

2016-04-13 Thread Colin Walters
On Wed, Apr 13, 2016, at 08:57 AM, Richard W.M. Jones wrote:

> It's not possible to read the process umask without also modifying it,
> which is what umask(2) does.  A library cannot read umask safely,
> especially if the main program might be multithreaded.

I assume you just want to do this from a shared library so you can
determine whether or not you need to call fchown() after making files
and the like?  If that's the case it'd be good to note it in the commit
message.

BTW...it might be a good idea to add a flags argument:
https://lwn.net/Articles/585415/

Did you consider calling this `umask2`, having the initial version only support
retrieving it via a UMASK_GET flag, and lay the groundwork to support
setting a threadsafe umask with a UMASK_SET_THREAD flag?


Re: Thoughts on tightening up user namespace creation

2016-03-09 Thread Colin Walters
On Wed, Mar 9, 2016, at 01:14 PM, Kees Cook wrote:
> On Mon, Mar 7, 2016 at 9:15 PM, Andy Lutomirski  wrote:
> > Hi all-
> >
> > There are several users and distros that are nervous about user
> > namespaces from an attack surface point of view.
> >
> >  - RHEL and Arch have userns disabled.
> >
> >  - Ubuntu requires CAP_SYS_ADMIN
> >
> >  - Kees periodically proposes to upstream some sysctl to control
> > userns creation.
> 
> And here's another ring0 escalation flaw, made available to
> unprivileged users because of userns:
> 
> https://code.google.com/p/google-security-research/issues/detail?id=758

Looks like Andy won't have to eat his hat ;)

> The change in attack surface is _substantial_. We must have a way to
> globally disable userns.

No one would object if it was enabled but only accessible to
CAP_SYS_ADMIN though, right?  This could be useful for 
writing setuid binaries that expose some of the features, but e.g. not
CAP_NET_ADMIN.

Andy's suggestion of having this be a per-namespace setting makes
sense to me.  Currently some container tools that do use userns
are by default denying it to be recursive (Sandstorm.io and Docker 1.10 at 
least)
by using a seccomp filter on clone().  If we had this setting that
filter wouldn't be necessary, and would solve the issue that seccomp filters
aren't robust against the kernel adding new API, e.g. a new 
CLONE_NEWUSER_NONEWPRIVS
which might enable chroot() but not CAP_NET_ADMIN.



Re: Thoughts on tightening up user namespace creation

2016-03-09 Thread Colin Walters
On Wed, Mar 9, 2016, at 01:14 PM, Kees Cook wrote:
> On Mon, Mar 7, 2016 at 9:15 PM, Andy Lutomirski  wrote:
> > Hi all-
> >
> > There are several users and distros that are nervous about user
> > namespaces from an attack surface point of view.
> >
> >  - RHEL and Arch have userns disabled.
> >
> >  - Ubuntu requires CAP_SYS_ADMIN
> >
> >  - Kees periodically proposes to upstream some sysctl to control
> > userns creation.
> 
> And here's another ring0 escalation flaw, made available to
> unprivileged users because of userns:
> 
> https://code.google.com/p/google-security-research/issues/detail?id=758

Looks like Andy won't have to eat his hat ;)

> The change in attack surface is _substantial_. We must have a way to
> globally disable userns.

No one would object if it was enabled but only accessible to
CAP_SYS_ADMIN though, right?  This could be useful for 
writing setuid binaries that expose some of the features, but e.g. not
CAP_NET_ADMIN.

Andy's suggestion of having this be a per-namespace setting makes
sense to me.  Currently some container tools that do use userns
are by default denying it to be recursive (Sandstorm.io and Docker 1.10 at 
least)
by using a seccomp filter on clone().  If we had this setting that
filter wouldn't be necessary, and would solve the issue that seccomp filters
aren't robust against the kernel adding new API, e.g. a new 
CLONE_NEWUSER_NONEWPRIVS
which might enable chroot() but not CAP_NET_ADMIN.



Re: [PATCH v3 0/7] User namespace mount updates

2015-11-19 Thread Colin Walters
On Thu, Nov 19, 2015, at 02:53 AM, Richard Weinberger wrote:

> Erm, I don't want this in the kernel. That's why I've proposed the lklfuse 
> approach.

I already said this before but just to repeat, since I'm confused:

How would "lklfuse" be different from http://libguestfs.org/
which we at Red Hat (and a number of other organizations)
use quite widely now for build systems, debugging etc.

In the end it's just running the kernel in KVM with a custom protocol,
with support for non-filesystem things like "install a bootloader",
and it already supports FUSE.

I'm pretty firmly with Al here - the attack surface increase here
is too great, and we'd likely turn this off if it even did make it
into the kernel.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 0/7] User namespace mount updates

2015-11-19 Thread Colin Walters
On Thu, Nov 19, 2015, at 02:53 AM, Richard Weinberger wrote:

> Erm, I don't want this in the kernel. That's why I've proposed the lklfuse 
> approach.

I already said this before but just to repeat, since I'm confused:

How would "lklfuse" be different from http://libguestfs.org/
which we at Red Hat (and a number of other organizations)
use quite widely now for build systems, debugging etc.

In the end it's just running the kernel in KVM with a custom protocol,
with support for non-filesystem things like "install a bootloader",
and it already supports FUSE.

I'm pretty firmly with Al here - the attack surface increase here
is too great, and we'd likely turn this off if it even did make it
into the kernel.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/7] Initial support for user namespace owned mounts

2015-07-30 Thread Colin Walters
It's worth noting here that I think a lot of the use cases
for unprivileged mounts are testing/development type things,
and these are pretty well covered by:

http://libguestfs.org/

Basically it just runs the host kernel in a VM, and the userspace
is a minimal agent that you can talk to over virtio.  You can use
the API, or `guestmount` exposes it via FUSE.

It doesn't magically make the kernel filesystems robust against
untrusted input, but in the case of compromise, it's an
"unprivileged" VM.  I've used it for several projects and been
quite happy.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/7] Initial support for user namespace owned mounts

2015-07-30 Thread Colin Walters
It's worth noting here that I think a lot of the use cases
for unprivileged mounts are testing/development type things,
and these are pretty well covered by:

http://libguestfs.org/

Basically it just runs the host kernel in a VM, and the userspace
is a minimal agent that you can talk to over virtio.  You can use
the API, or `guestmount` exposes it via FUSE.

It doesn't magically make the kernel filesystems robust against
untrusted input, but in the case of compromise, it's an
unprivileged VM.  I've used it for several projects and been
quite happy.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/7] Initial support for user namespace owned mounts

2015-07-20 Thread Colin Walters
On Thu, Jul 16, 2015, at 12:47 AM, Eric W. Biederman wrote:

> With that said desktop environments have for a long time been
> automatically mounting whichever filesystem you place in your computer,
> so in practice what this is really about is trying to align the kernel
> with how people use filesystems.

There is a large attack surface difference between mounting a device
that someone physically plugged into the computer (and note typically
it's required that the active console be unlocked as well[1]) versus
allowing any "unprivileged" process at any time to do it.

Many server setups use "unprivileged" uids that otherwise wouldn't
be able to exploit bugs in filesystem code.

[1] https://bugzilla.gnome.org/show_bug.cgi?id=653520
"AutomountManager also keeps track of the current session availability
(using the ConsoleKit and gnome-screensaver DBus interfaces) and
inhibits mounting if the current session is locked, or another session
is in use instead."

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/7] Initial support for user namespace owned mounts

2015-07-20 Thread Colin Walters
On Thu, Jul 16, 2015, at 12:47 AM, Eric W. Biederman wrote:

 With that said desktop environments have for a long time been
 automatically mounting whichever filesystem you place in your computer,
 so in practice what this is really about is trying to align the kernel
 with how people use filesystems.

There is a large attack surface difference between mounting a device
that someone physically plugged into the computer (and note typically
it's required that the active console be unlocked as well[1]) versus
allowing any unprivileged process at any time to do it.

Many server setups use unprivileged uids that otherwise wouldn't
be able to exploit bugs in filesystem code.

[1] https://bugzilla.gnome.org/show_bug.cgi?id=653520
AutomountManager also keeps track of the current session availability
(using the ConsoleKit and gnome-screensaver DBus interfaces) and
inhibits mounting if the current session is locked, or another session
is in use instead.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] overlayfs: Warn on copy up if a process has a R/O fd open to the lower file

2015-06-10 Thread Colin Walters
On Thu, May 28, 2015, at 02:46 PM, David Howells wrote:
> Print a warning when overlayfs copies up a file if the process that triggered
> the copy up has a R/O fd open to the lower file being copied up.

Why not an option to make it an error?  If one's goal is to use overlayfs
without apps potentially corrupting data, better to fail fast.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] overlayfs: Warn on copy up if a process has a R/O fd open to the lower file

2015-06-10 Thread Colin Walters
On Thu, May 28, 2015, at 02:46 PM, David Howells wrote:
 Print a warning when overlayfs copies up a file if the process that triggered
 the copy up has a R/O fd open to the lower file being copied up.

Why not an option to make it an error?  If one's goal is to use overlayfs
without apps potentially corrupting data, better to fail fast.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/6] File Sealing & memfd_create()

2014-04-10 Thread Colin Walters
On Thu, Apr 10, 2014 at 3:15 PM, Andy Lutomirski  
wrote:



COW links can do this already, I think.  Of course, you'll have to 
use a

filesystem that supports them.


COW is nice if the filesystem supports them, but my userspace code 
needs to be filesystem agnostic.  Because of that, the design for 
userspace simply doesn't allow arbitrary writes.


Instead, I have to painfully audit every rpm %post/dpkg postinst type 
script to ensure they break hardlinks, and furthermore only allow 
executing scripts that are known to do so.


But I think even in a btrfs world it'd still be useful to mark files as 
content-immutable.





--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/6] File Sealing & memfd_create()

2014-04-10 Thread Colin Walters

On Thu, Mar 20, 2014 at 11:32 AM, ty...@mit.edu wrote:


Looking at your patches, and what files you are modifying, you are
enforcing this in the low-level file system.


I would love for this to be implemented in the filesystem level as 
well.  Something like the ext4 immutable bit, but with the ability to 
still make hardlinks would be *very* useful for OSTree.  And anyone 
else that uses hardlinks as a data source.  The vserver people do 
something similiar:

http://linux-vserver.org/util-vserver:Vhashify

At the moment I have a read-only bind mount over /usr, but what I 
really want is to make the individual objects in the object store in 
/ostree/repo/objects be immutable, so even if a user or app navigates 
out to /sysroot they still can't mutate them (or the link targets in 
the visible /usr).





--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/6] File Sealing memfd_create()

2014-04-10 Thread Colin Walters

On Thu, Mar 20, 2014 at 11:32 AM, ty...@mit.edu wrote:


Looking at your patches, and what files you are modifying, you are
enforcing this in the low-level file system.


I would love for this to be implemented in the filesystem level as 
well.  Something like the ext4 immutable bit, but with the ability to 
still make hardlinks would be *very* useful for OSTree.  And anyone 
else that uses hardlinks as a data source.  The vserver people do 
something similiar:

http://linux-vserver.org/util-vserver:Vhashify

At the moment I have a read-only bind mount over /usr, but what I 
really want is to make the individual objects in the object store in 
/ostree/repo/objects be immutable, so even if a user or app navigates 
out to /sysroot they still can't mutate them (or the link targets in 
the visible /usr).





--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/6] File Sealing memfd_create()

2014-04-10 Thread Colin Walters
On Thu, Apr 10, 2014 at 3:15 PM, Andy Lutomirski l...@amacapital.net 
wrote:



COW links can do this already, I think.  Of course, you'll have to 
use a

filesystem that supports them.


COW is nice if the filesystem supports them, but my userspace code 
needs to be filesystem agnostic.  Because of that, the design for 
userspace simply doesn't allow arbitrary writes.


Instead, I have to painfully audit every rpm %post/dpkg postinst type 
script to ensure they break hardlinks, and furthermore only allow 
executing scripts that are known to do so.


But I think even in a btrfs world it'd still be useful to mark files as 
content-immutable.





--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Need for llistxattrat()/lgetxattrat()/lsetxattrat()

2013-07-26 Thread Colin Walters
Hi,

So at the moment the kernel has pretty comprehensive support for the
*at() variants of file API calls, and at one point I was rewriting my
app to be a modern Unix citizen and use the *at variants, but I ran into
one admittedly obscure corner case, which is the lack of API calls to
manipulate the extended attributes of symbolic links relative to a
dirfd.  

Correctly handling xattrs (in particular, SELinux labels) is a key part
of my application, so for now I'll just live with pathnames and their
drawbacks, but:

Is this intentional or an oversight?  If the latter, should I try a
patch?


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Need for llistxattrat()/lgetxattrat()/lsetxattrat()

2013-07-26 Thread Colin Walters
Hi,

So at the moment the kernel has pretty comprehensive support for the
*at() variants of file API calls, and at one point I was rewriting my
app to be a modern Unix citizen and use the *at variants, but I ran into
one admittedly obscure corner case, which is the lack of API calls to
manipulate the extended attributes of symbolic links relative to a
dirfd.  

Correctly handling xattrs (in particular, SELinux labels) is a key part
of my application, so for now I'll just live with pathnames and their
drawbacks, but:

Is this intentional or an oversight?  If the latter, should I try a
patch?


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] teach argv_split() to ignore the spaces surrounded by \e

2013-05-13 Thread Colin Walters
On Mon, 2013-05-13 at 16:35 +0200, Oleg Nesterov wrote:

> Yes, we can change format_corename() to construct "argv" by hand, and
> this was my iniital plan. But perhaps it would be better to not uglify
> this code even more?

Sure this \e is less code, but it seems pretty ugly to me.  Maybe a way
to keep fs/coredump.c sane would be always constructing an argv, and
then in the !ispipe case just join them into one string.

Though I'm still inclined to change systemd to read /proc/pid/cmdline
like abrt does; that way it works on current kernels too.

For what it's worth I noticed this problem with dconf, which uses
g_thread_new ("dconf worker", ...), and g_thread_new uses prctl
(PR_SET_NAME).  

 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] teach argv_split() to ignore the spaces surrounded by \e

2013-05-13 Thread Colin Walters
On Mon, 2013-05-13 at 16:35 +0200, Oleg Nesterov wrote:

 Yes, we can change format_corename() to construct argv by hand, and
 this was my iniital plan. But perhaps it would be better to not uglify
 this code even more?

Sure this \e is less code, but it seems pretty ugly to me.  Maybe a way
to keep fs/coredump.c sane would be always constructing an argv, and
then in the !ispipe case just join them into one string.

Though I'm still inclined to change systemd to read /proc/pid/cmdline
like abrt does; that way it works on current kernels too.

For what it's worth I noticed this problem with dconf, which uses
g_thread_new (dconf worker, ...), and g_thread_new uses prctl
(PR_SET_NAME).  

 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [systemd-devel] [PATCH 2/2] coredump: Handle programs with spaces in COMM

2013-05-04 Thread Colin Walters
On Fri, 2013-05-03 at 17:08 +0200, Lennart Poettering wrote:

> It sounds really wrong to first merge this into one string and then
> split it up again. It sounds much more sensible to instead just pass the
> string array around all the time. What's the reason to make this one
> string first?

I'm wondering if there are compatibility concerns; abrt wouldn't care
from what I can tell if we just changed the kernel.  systemd-coredump is
just plain broken right now.  I'll look for the source to the Ubuntu
one...


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [systemd-devel] [PATCH 2/2] coredump: Handle programs with spaces in COMM

2013-05-04 Thread Colin Walters
On Fri, 2013-05-03 at 17:08 +0200, Lennart Poettering wrote:

 It sounds really wrong to first merge this into one string and then
 split it up again. It sounds much more sensible to instead just pass the
 string array around all the time. What's the reason to make this one
 string first?

I'm wondering if there are compatibility concerns; abrt wouldn't care
from what I can tell if we just changed the kernel.  systemd-coredump is
just plain broken right now.  I'll look for the source to the Ubuntu
one...


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


linux-user-chroot 2012.2

2012-08-10 Thread Colin Walters
Hi,

This is the release of linux-user-chroot 2012.2.  The major change now
is that it makes use of Andy's new PR_SET_NO_NEW_PRIVS.  This doesn't
close any security hole I'm aware of - our previous use of the MS_NOSUID
bind mount over / should work - but, belt and suspenders as they say.

The code:
http://git.gnome.org/browse/linux-user-chroot/commit/?id=515c714471d0b5923f6633ef44a2270b23656ee9

As for how linux-user-chroot and PR_SET_NO_NEW_PRIVS relate, see this
thread:
http://thread.gmane.org/gmane.linux.kernel.lsm/15339

Summary
---

This tool allows regular (non-root) users to call chroot(2), create
Linux bind mounts, and use some Linux container features.  It's
primarily intended for use by build systems.

Project information
---

There's no web page yet; send patches to
Colin Walters 



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


linux-user-chroot 2012.2

2012-08-10 Thread Colin Walters
Hi,

This is the release of linux-user-chroot 2012.2.  The major change now
is that it makes use of Andy's new PR_SET_NO_NEW_PRIVS.  This doesn't
close any security hole I'm aware of - our previous use of the MS_NOSUID
bind mount over / should work - but, belt and suspenders as they say.

The code:
http://git.gnome.org/browse/linux-user-chroot/commit/?id=515c714471d0b5923f6633ef44a2270b23656ee9

As for how linux-user-chroot and PR_SET_NO_NEW_PRIVS relate, see this
thread:
http://thread.gmane.org/gmane.linux.kernel.lsm/15339

Summary
---

This tool allows regular (non-root) users to call chroot(2), create
Linux bind mounts, and use some Linux container features.  It's
primarily intended for use by build systems.

Project information
---

There's no web page yet; send patches to
Colin Walters walt...@verbum.org



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/