Re: [PATCH 3/7] Add a UFFD_SECURE flag to the userfaultfd API.

2019-10-23 Thread Andy Lutomirski
On Wed, Oct 23, 2019 at 12:10 PM Andrea Arcangeli  wrote:
>
> Hello,
>
> On Sat, Oct 12, 2019 at 06:14:23PM -0700, Andy Lutomirski wrote:
> > [adding more people because this is going to be an ABI break, sigh]
>
> That wouldn't break the ABI, no more than when if you boot a kernel
> built with CONFIG_USERFAULTFD=n.
>
> All non-cooperative features can be removed any time in a backwards
> compatible way, the only precaution is to mark their feature bits as
> reserved so they can't be reused for something else later.
>
> > least severely restricted.  A .read implementation MUST NOT ACT ON THE
> > CALLING TASK.  Ever.  Just imagine the effect of passing a userfaultfd
> > as stdin to a setuid program.
>
> With UFFD_EVENT_FORK, the newly created uffd that controls the child,
> is not passed to the parent nor to the child. Instead it's passed to
> the CRIU monitor only, which has to be already running as root and is
> fully trusted and acts a hypervisor (despite there is no hypervisor).
>
> By the time execve runs and any suid bit in the execve'd inode becomes
> relevant, well before the new userland executable code can run, the
> kernel throws away the "old_mm" controlled by any uffd and all
> attached uffds are released as well.
>
> All I found is your "A .read implementation MUST NOT ACT ON THE
> CALLING TASK" as an explanation that something is broken but I need
> further clarification.

There are two things going on here.

1. Daniel wants to add LSM labels to userfaultfd objects.  This seems
reasonable to me.  The question, as I understand it, is: who is the
subject that creates a uffd referring to a forked child?  I'm sure
this is solvable in any number of straightforward ways, but I think
it's less important than:

2. The existing ABI is busted independently of #1.  Suppose you call
userfaultfd to get a userfaultfd and enable UFFD_FEATURE_EVENT_FORK.
Then you do:

$ sudo <&[userfaultfd number]

Sudo will read it and get a new fd unexpectedly added to its fd table.
It's worse if SCM_RIGHTS is involved.

So I think we either need to declare that UFFD_FEATURE_EVENT_FORK is
only usable by global root or we need to remove it and maybe re-add it
in some other form.


--Andy


Re: [PATCH 3/7] Add a UFFD_SECURE flag to the userfaultfd API.

2019-10-23 Thread Andrea Arcangeli
Hello,

On Sat, Oct 12, 2019 at 06:14:23PM -0700, Andy Lutomirski wrote:
> [adding more people because this is going to be an ABI break, sigh]

That wouldn't break the ABI, no more than when if you boot a kernel
built with CONFIG_USERFAULTFD=n.

All non-cooperative features can be removed any time in a backwards
compatible way, the only precaution is to mark their feature bits as
reserved so they can't be reused for something else later.

> least severely restricted.  A .read implementation MUST NOT ACT ON THE
> CALLING TASK.  Ever.  Just imagine the effect of passing a userfaultfd
> as stdin to a setuid program.

With UFFD_EVENT_FORK, the newly created uffd that controls the child,
is not passed to the parent nor to the child. Instead it's passed to
the CRIU monitor only, which has to be already running as root and is
fully trusted and acts a hypervisor (despite there is no hypervisor).

By the time execve runs and any suid bit in the execve'd inode becomes
relevant, well before the new userland executable code can run, the
kernel throws away the "old_mm" controlled by any uffd and all
attached uffds are released as well.

All I found is your "A .read implementation MUST NOT ACT ON THE
CALLING TASK" as an explanation that something is broken but I need
further clarification.

Of course I can see you can always open a uffd and pass it to any task
you are going to execve on, but that simply means the suid program
will be able to control you, not the other way around. If you don't
want to be controlled by the next task, no matter if suid or not, just
don't that. What I don't see is how you're going to control the suid
binary from the outside, the suid binary at most will block in the
poll, read and write syscalls and get garbage or write some garbage
and get an error, it won't get signals and it cannot block in any page
fault either, it's not immediately clear what's out of ordinary.

On Mon, Oct 14, 2019 at 06:04:22PM +0200, Jann Horn wrote:
> FWIW, 
> 
> just shows the kernel, kernel selftests, and strace code for decoding
> syscall arguments. CRIU uses it though (probably for postcopy live
> migration / lazy migration?), I guess that code isn't in debian for
> some reason.

https://criu.org/Userfaultfd#Limitations

The CRIU developers did a truly amazing job by making container post
copy live migration work great for a subset of apps, that alone was an
amazing achievement. Is that achievement enough to use post copy live
migration of bare metal containers in production? Unfortunately
probably not and not just in debian.

If you're wrong and UFFDIO_EVENT_FORK isn't currently buggy and in
turn it isn't causing further maintenance burden, there is no hurry of
removing them, but in the long term, if none of the non-cooperative
features find its way in production (like it was reasonable to expect
initially), they must be removed from the kernel anyway, not just
UFFD_EVEN_FORK but all non-cooperative features associated with it.

In my view the kernel is complex enough that we can't keep in the
kernel anything that isn't actively used in production.

If you're right and UFFDIO_EVENT_FORK is flawed beyond repair well
then we should remove all non cooperative features right now.

Or is someone out there using CRIU --lazy-pages in production?

Virtual machine machine postcopy live migration is shipped in
production for years and it's fully reliable and by design it cannot
suffer from any of the above limitations.

In my view there's simply no justification not to use virtual machines
when the alternative requires so much more code to be written and so
much more complexity to be dealt with.

However the higher complexity happened in lots areas of the kernel
already where things got extremely complex just to avoid using virtual
machines, despite the end result is less secure, for the only sake of
slightly higher consolidation (especially if you ignore the existence
millisecond guest boot time, direct mapped pmem nvdimm, virtfs and
free page hinting).

The non-cooperative features of userfaultfd in principle aren't
fundamentally different from other cgroup vs KVM tradeoffs, so 1) it
wasn't apparent they wouldn't be used in production eventually and 2)
it didn't sound fair enough not to give a chance to bare metal
containers to achieve feature parity with VM (again with a much higher
code complexity, but that was to be expected and it has apparently
been dealt with in other areas with more or less satisfactory
results).

While at it, as far as userfaultfd is concerned I'd rather see people
spend time to write a malloc library that uses userfaultfd with the
UFFD_FEATURE_SIGBUS features and it replaces mmap with UFFDIO_ZEROPAGE
(plus adding the THP accelleration currently missing) and munmap with
MADV_DONTNEED to do allocations and freeing of memory with full
scalability without ever hitting on the map sem for writing. This is
already possible without 

Re: [PATCH 3/7] Add a UFFD_SECURE flag to the userfaultfd API.

2019-10-23 Thread Andy Lutomirski
On Wed, Oct 23, 2019 at 5:44 AM Mike Rapoport  wrote:
>
> On Wed, Oct 23, 2019 at 10:29:20AM +0300, Cyrill Gorcunov wrote:
> > On Tue, Oct 22, 2019 at 09:11:04PM -0700, Andy Lutomirski wrote:
> > > Trying again.  It looks like I used the wrong address for Pavel.
> >
> > Thanks for CC Andy! I must confess I didn't dive into userfaultfd engine
> > personally but let me CC more people involved from criu side. (overquoting
> > left untouched for their sake).
>
> Thanks for CC Cyrill!
>
>
> > > On Sat, Oct 12, 2019 at 6:14 PM Andy Lutomirski  wrote:
> > > >
> > > > [adding more people because this is going to be an ABI break, sigh]
> > > >
> > > > On Sat, Oct 12, 2019 at 5:52 PM Daniel Colascione  
> > > > wrote:
> > > > >
> > > > > On Sat, Oct 12, 2019 at 4:10 PM Andy Lutomirski  
> > > > > wrote:
> > > > > >
> > > > > > On Sat, Oct 12, 2019 at 12:16 PM Daniel Colascione 
> > > > > >  wrote:
> > > > > > >
> > > > > > > The new secure flag makes userfaultfd use a new "secure" anonymous
> > > > > > > file object instead of the default one, letting security modules
> > > > > > > supervise userfaultfd use.
> > > > > > >
> > > > > > > Requiring that users pass a new flag lets us avoid changing the
> > > > > > > semantics for existing callers.
> > > > > >
> > > > > > Is there any good reason not to make this be the default?
> > > > > >
> > > > > >
> > > > > > The only downside I can see is that it would increase the memory 
> > > > > > usage
> > > > > > of userfaultfd(), but that doesn't seem like such a big deal.  A
> > > > > > lighter-weight alternative would be to have a single inode shared by
> > > > > > all userfaultfd instances, which would require a somewhat different
> > > > > > internal anon_inode API.
> > > > >
> > > > > I'd also prefer to just make SELinux use mandatory, but there's a
> > > > > nasty interaction with UFFD_EVENT_FORK. Adding a new UFFD_SECURE mode
> > > > > which blocks UFFD_EVENT_FORK sidesteps this problem. Maybe you know a
> > > > > better way to deal with it.
> > > > >
> > > > > Right now, when a process with a UFFD-managed VMA using
> > > > > UFFD_EVENT_FORK forks, we make a new userfaultfd_ctx out of thin air
> > > > > and enqueue it on the message queue for the parent process. When we
> > > > > dequeue that context, we get to resolve_userfault_fork, which makes up
> > > > > a new UFFD file object out of thin air in the context of the reading
> > > > > process. Following normal SELinux rules, the SID attached to that new
> > > > > file object would be the task SID of the process *reading* the fork
> > > > > event, not the SID of the new fork child. That seems wrong, because
> > > > > the label we give to the UFFD should correspond to the label of the
> > > > > process that UFFD controls.
>
> I must admit I have no idea about how SELinux works, but what's wrong with
> making the new UFFD object to inherit the properties of the "original" one?
>
> The new file object is created in the context of the same task that owns
> the initial userfault file descriptor and it is used by the same task. So
> if you have a process that registers some of its VMAs with userfaultfd
> and enables UFFD_EVENT_FORK, the same process controls UFFD of itself and
> its children.

I'm not actually convinced this is a problem.

What *is* a problem is touching the file descriptor table at all from
read(2).  That's a big no-no.

--Andy


Re: [PATCH 3/7] Add a UFFD_SECURE flag to the userfaultfd API.

2019-10-23 Thread Mike Rapoport
On Wed, Oct 23, 2019 at 10:29:20AM +0300, Cyrill Gorcunov wrote:
> On Tue, Oct 22, 2019 at 09:11:04PM -0700, Andy Lutomirski wrote:
> > Trying again.  It looks like I used the wrong address for Pavel.
> 
> Thanks for CC Andy! I must confess I didn't dive into userfaultfd engine
> personally but let me CC more people involved from criu side. (overquoting
> left untouched for their sake).

Thanks for CC Cyrill!

 
> > On Sat, Oct 12, 2019 at 6:14 PM Andy Lutomirski  wrote:
> > >
> > > [adding more people because this is going to be an ABI break, sigh]
> > >
> > > On Sat, Oct 12, 2019 at 5:52 PM Daniel Colascione  
> > > wrote:
> > > >
> > > > On Sat, Oct 12, 2019 at 4:10 PM Andy Lutomirski  wrote:
> > > > >
> > > > > On Sat, Oct 12, 2019 at 12:16 PM Daniel Colascione 
> > > > >  wrote:
> > > > > >
> > > > > > The new secure flag makes userfaultfd use a new "secure" anonymous
> > > > > > file object instead of the default one, letting security modules
> > > > > > supervise userfaultfd use.
> > > > > >
> > > > > > Requiring that users pass a new flag lets us avoid changing the
> > > > > > semantics for existing callers.
> > > > >
> > > > > Is there any good reason not to make this be the default?
> > > > >
> > > > >
> > > > > The only downside I can see is that it would increase the memory usage
> > > > > of userfaultfd(), but that doesn't seem like such a big deal.  A
> > > > > lighter-weight alternative would be to have a single inode shared by
> > > > > all userfaultfd instances, which would require a somewhat different
> > > > > internal anon_inode API.
> > > >
> > > > I'd also prefer to just make SELinux use mandatory, but there's a
> > > > nasty interaction with UFFD_EVENT_FORK. Adding a new UFFD_SECURE mode
> > > > which blocks UFFD_EVENT_FORK sidesteps this problem. Maybe you know a
> > > > better way to deal with it.
> > > >
> > > > Right now, when a process with a UFFD-managed VMA using
> > > > UFFD_EVENT_FORK forks, we make a new userfaultfd_ctx out of thin air
> > > > and enqueue it on the message queue for the parent process. When we
> > > > dequeue that context, we get to resolve_userfault_fork, which makes up
> > > > a new UFFD file object out of thin air in the context of the reading
> > > > process. Following normal SELinux rules, the SID attached to that new
> > > > file object would be the task SID of the process *reading* the fork
> > > > event, not the SID of the new fork child. That seems wrong, because
> > > > the label we give to the UFFD should correspond to the label of the
> > > > process that UFFD controls.

I must admit I have no idea about how SELinux works, but what's wrong with
making the new UFFD object to inherit the properties of the "original" one?

The new file object is created in the context of the same task that owns
the initial userfault file descriptor and it is used by the same task. So
if you have a process that registers some of its VMAs with userfaultfd
and enables UFFD_EVENT_FORK, the same process controls UFFD of itself and
its children.

> > >
> > > ...
> > >
> > > > But maybe we can go further: let's separate authentication and
> > > > authorization, as we do in other LSM hooks. Let's split my
> > > > inode_init_security_anon into two hooks, inode_init_security_anon and
> > > > inode_create_anon. We'd define the former to just initialize the file
> > > > object's security information --- in the SELinux case, figuring out
> > > > its class and SID --- and define the latter to answer the yes/no
> > > > question of whether a particular anonymous inode creation should be
> > > > allowed. Normally, anon_inode_getfile2() would just call both hooks.
> > > > We'd add another anon_inode_getfd flag, ANON_INODE_SKIP_AUTHORIZATION
> > > > or something, that would tell anon_inode_getfile2() to skip calling
> > > > the authorization hook, effectively making the creation always
> > > > succeed. We can then make the UFFD code pass
> > > > ANON_INODE_SKIP_AUTHORIZATION when it's creating a file object in the
> > > > fork child while creating UFFD_EVENT_FORK messages.
> > >
> > > That sounds like an improvement.  Or maybe just teach SELinux that
> > > this particular fd creation is actually making an anon_inode that is a
> > > child of an existing anon inode and that the context should be copied
> > > or whatever SELinux wants to do.  Like this, maybe:
> > >
> > > static int resolve_userfault_fork(struct userfaultfd_ctx *ctx,
> > >   struct userfaultfd_ctx *new,
> > >   struct uffd_msg *msg)
> > > {
> > > int fd;
> > >
> > > Change this:
> > >
> > > fd = anon_inode_getfd("[userfaultfd]", _fops, new,
> > >   O_RDWR | (new->flags & 
> > > UFFD_SHARED_FCNTL_FLAGS));
> > >
> > > to something like:
> > >
> > >   fd = anon_inode_make_child_fd(..., ctx->inode, ...);
> > >
> > > where ctx->inode is the one context's inode.
> > >
> > > *** HOWEVER *** !!!
> > >
> > > Now that 

Re: [PATCH 3/7] Add a UFFD_SECURE flag to the userfaultfd API.

2019-10-23 Thread Cyrill Gorcunov
On Tue, Oct 22, 2019 at 09:11:04PM -0700, Andy Lutomirski wrote:
> Trying again.  It looks like I used the wrong address for Pavel.

Thanks for CC Andy! I must confess I didn't dive into userfaultfd engine
personally but let me CC more people involved from criu side. (overquoting
left untouched for their sake).

> 
> On Sat, Oct 12, 2019 at 6:14 PM Andy Lutomirski  wrote:
> >
> > [adding more people because this is going to be an ABI break, sigh]
> >
> > On Sat, Oct 12, 2019 at 5:52 PM Daniel Colascione  wrote:
> > >
> > > On Sat, Oct 12, 2019 at 4:10 PM Andy Lutomirski  wrote:
> > > >
> > > > On Sat, Oct 12, 2019 at 12:16 PM Daniel Colascione  
> > > > wrote:
> > > > >
> > > > > The new secure flag makes userfaultfd use a new "secure" anonymous
> > > > > file object instead of the default one, letting security modules
> > > > > supervise userfaultfd use.
> > > > >
> > > > > Requiring that users pass a new flag lets us avoid changing the
> > > > > semantics for existing callers.
> > > >
> > > > Is there any good reason not to make this be the default?
> > > >
> > > >
> > > > The only downside I can see is that it would increase the memory usage
> > > > of userfaultfd(), but that doesn't seem like such a big deal.  A
> > > > lighter-weight alternative would be to have a single inode shared by
> > > > all userfaultfd instances, which would require a somewhat different
> > > > internal anon_inode API.
> > >
> > > I'd also prefer to just make SELinux use mandatory, but there's a
> > > nasty interaction with UFFD_EVENT_FORK. Adding a new UFFD_SECURE mode
> > > which blocks UFFD_EVENT_FORK sidesteps this problem. Maybe you know a
> > > better way to deal with it.
> >
> > ...
> >
> > > But maybe we can go further: let's separate authentication and
> > > authorization, as we do in other LSM hooks. Let's split my
> > > inode_init_security_anon into two hooks, inode_init_security_anon and
> > > inode_create_anon. We'd define the former to just initialize the file
> > > object's security information --- in the SELinux case, figuring out
> > > its class and SID --- and define the latter to answer the yes/no
> > > question of whether a particular anonymous inode creation should be
> > > allowed. Normally, anon_inode_getfile2() would just call both hooks.
> > > We'd add another anon_inode_getfd flag, ANON_INODE_SKIP_AUTHORIZATION
> > > or something, that would tell anon_inode_getfile2() to skip calling
> > > the authorization hook, effectively making the creation always
> > > succeed. We can then make the UFFD code pass
> > > ANON_INODE_SKIP_AUTHORIZATION when it's creating a file object in the
> > > fork child while creating UFFD_EVENT_FORK messages.
> >
> > That sounds like an improvement.  Or maybe just teach SELinux that
> > this particular fd creation is actually making an anon_inode that is a
> > child of an existing anon inode and that the context should be copied
> > or whatever SELinux wants to do.  Like this, maybe:
> >
> > static int resolve_userfault_fork(struct userfaultfd_ctx *ctx,
> >   struct userfaultfd_ctx *new,
> >   struct uffd_msg *msg)
> > {
> > int fd;
> >
> > Change this:
> >
> > fd = anon_inode_getfd("[userfaultfd]", _fops, new,
> >   O_RDWR | (new->flags & 
> > UFFD_SHARED_FCNTL_FLAGS));
> >
> > to something like:
> >
> >   fd = anon_inode_make_child_fd(..., ctx->inode, ...);
> >
> > where ctx->inode is the one context's inode.
> >
> > *** HOWEVER *** !!!
> >
> > Now that you've pointed this mechanism out, it is utterly and
> > completely broken and should be removed from the kernel outright or at
> > least severely restricted.  A .read implementation MUST NOT ACT ON THE
> > CALLING TASK.  Ever.  Just imagine the effect of passing a userfaultfd
> > as stdin to a setuid program.
> >
> > So I think the right solution might be to attempt to *remove*
> > UFFD_EVENT_FORK.  Maybe the solution is to say that, unless the
> > creator of a userfaultfd() has global CAP_SYS_ADMIN, then it cannot
> > use UFFD_FEATURE_EVENT_FORK) and print a warning (once) when
> > UFFD_FEATURE_EVENT_FORK is allowed.  And, after some suitable
> > deprecation period, just remove it.  If it's genuinely useful, it
> > needs an entirely new API based on ioctl() or a syscall.  Or even
> > recvmsg() :)
> >
> > And UFFD_SECURE should just become automatic, since you don't have a
> > problem any more. :-p
> >
> > --Andy
> 

Cyrill


Re: [PATCH 3/7] Add a UFFD_SECURE flag to the userfaultfd API.

2019-10-22 Thread Andy Lutomirski
Trying again.  It looks like I used the wrong address for Pavel.

On Sat, Oct 12, 2019 at 6:14 PM Andy Lutomirski  wrote:
>
> [adding more people because this is going to be an ABI break, sigh]
>
> On Sat, Oct 12, 2019 at 5:52 PM Daniel Colascione  wrote:
> >
> > On Sat, Oct 12, 2019 at 4:10 PM Andy Lutomirski  wrote:
> > >
> > > On Sat, Oct 12, 2019 at 12:16 PM Daniel Colascione  
> > > wrote:
> > > >
> > > > The new secure flag makes userfaultfd use a new "secure" anonymous
> > > > file object instead of the default one, letting security modules
> > > > supervise userfaultfd use.
> > > >
> > > > Requiring that users pass a new flag lets us avoid changing the
> > > > semantics for existing callers.
> > >
> > > Is there any good reason not to make this be the default?
> > >
> > >
> > > The only downside I can see is that it would increase the memory usage
> > > of userfaultfd(), but that doesn't seem like such a big deal.  A
> > > lighter-weight alternative would be to have a single inode shared by
> > > all userfaultfd instances, which would require a somewhat different
> > > internal anon_inode API.
> >
> > I'd also prefer to just make SELinux use mandatory, but there's a
> > nasty interaction with UFFD_EVENT_FORK. Adding a new UFFD_SECURE mode
> > which blocks UFFD_EVENT_FORK sidesteps this problem. Maybe you know a
> > better way to deal with it.
>
> ...
>
> > But maybe we can go further: let's separate authentication and
> > authorization, as we do in other LSM hooks. Let's split my
> > inode_init_security_anon into two hooks, inode_init_security_anon and
> > inode_create_anon. We'd define the former to just initialize the file
> > object's security information --- in the SELinux case, figuring out
> > its class and SID --- and define the latter to answer the yes/no
> > question of whether a particular anonymous inode creation should be
> > allowed. Normally, anon_inode_getfile2() would just call both hooks.
> > We'd add another anon_inode_getfd flag, ANON_INODE_SKIP_AUTHORIZATION
> > or something, that would tell anon_inode_getfile2() to skip calling
> > the authorization hook, effectively making the creation always
> > succeed. We can then make the UFFD code pass
> > ANON_INODE_SKIP_AUTHORIZATION when it's creating a file object in the
> > fork child while creating UFFD_EVENT_FORK messages.
>
> That sounds like an improvement.  Or maybe just teach SELinux that
> this particular fd creation is actually making an anon_inode that is a
> child of an existing anon inode and that the context should be copied
> or whatever SELinux wants to do.  Like this, maybe:
>
> static int resolve_userfault_fork(struct userfaultfd_ctx *ctx,
>   struct userfaultfd_ctx *new,
>   struct uffd_msg *msg)
> {
> int fd;
>
> Change this:
>
> fd = anon_inode_getfd("[userfaultfd]", _fops, new,
>   O_RDWR | (new->flags & 
> UFFD_SHARED_FCNTL_FLAGS));
>
> to something like:
>
>   fd = anon_inode_make_child_fd(..., ctx->inode, ...);
>
> where ctx->inode is the one context's inode.
>
> *** HOWEVER *** !!!
>
> Now that you've pointed this mechanism out, it is utterly and
> completely broken and should be removed from the kernel outright or at
> least severely restricted.  A .read implementation MUST NOT ACT ON THE
> CALLING TASK.  Ever.  Just imagine the effect of passing a userfaultfd
> as stdin to a setuid program.
>
> So I think the right solution might be to attempt to *remove*
> UFFD_EVENT_FORK.  Maybe the solution is to say that, unless the
> creator of a userfaultfd() has global CAP_SYS_ADMIN, then it cannot
> use UFFD_FEATURE_EVENT_FORK) and print a warning (once) when
> UFFD_FEATURE_EVENT_FORK is allowed.  And, after some suitable
> deprecation period, just remove it.  If it's genuinely useful, it
> needs an entirely new API based on ioctl() or a syscall.  Or even
> recvmsg() :)
>
> And UFFD_SECURE should just become automatic, since you don't have a
> problem any more. :-p
>
> --Andy


Re: [PATCH 3/7] Add a UFFD_SECURE flag to the userfaultfd API.

2019-10-22 Thread Daniel Colascione
On Sat, Oct 12, 2019 at 6:14 PM Andy Lutomirski  wrote:
> [adding more people because this is going to be an ABI break, sigh]

Just pinging this thread. I'd like to rev my patch and I'm not sure
what we want to do about problem Andy identified. Are we removing
UFFD_EVENT_FORK?


Re: [PATCH 3/7] Add a UFFD_SECURE flag to the userfaultfd API.

2019-10-14 Thread Jann Horn
On Sun, Oct 13, 2019 at 3:14 AM Andy Lutomirski  wrote:
> [adding more people because this is going to be an ABI break, sigh]
> On Sat, Oct 12, 2019 at 5:52 PM Daniel Colascione  wrote:
> > On Sat, Oct 12, 2019 at 4:10 PM Andy Lutomirski  wrote:
> > > On Sat, Oct 12, 2019 at 12:16 PM Daniel Colascione  
> > > wrote:
> > > > The new secure flag makes userfaultfd use a new "secure" anonymous
> > > > file object instead of the default one, letting security modules
> > > > supervise userfaultfd use.
> > > >
> > > > Requiring that users pass a new flag lets us avoid changing the
> > > > semantics for existing callers.
> > >
> > > Is there any good reason not to make this be the default?
> > >
> > >
> > > The only downside I can see is that it would increase the memory usage
> > > of userfaultfd(), but that doesn't seem like such a big deal.  A
> > > lighter-weight alternative would be to have a single inode shared by
> > > all userfaultfd instances, which would require a somewhat different
> > > internal anon_inode API.
> >
> > I'd also prefer to just make SELinux use mandatory, but there's a
> > nasty interaction with UFFD_EVENT_FORK. Adding a new UFFD_SECURE mode
> > which blocks UFFD_EVENT_FORK sidesteps this problem. Maybe you know a
> > better way to deal with it.
[...]
> Now that you've pointed this mechanism out, it is utterly and
> completely broken and should be removed from the kernel outright or at
> least severely restricted.  A .read implementation MUST NOT ACT ON THE
> CALLING TASK.  Ever.  Just imagine the effect of passing a userfaultfd
> as stdin to a setuid program.
>
> So I think the right solution might be to attempt to *remove*
> UFFD_EVENT_FORK.  Maybe the solution is to say that, unless the
> creator of a userfaultfd() has global CAP_SYS_ADMIN, then it cannot
> use UFFD_FEATURE_EVENT_FORK) and print a warning (once) when
> UFFD_FEATURE_EVENT_FORK is allowed.  And, after some suitable
> deprecation period, just remove it.  If it's genuinely useful, it
> needs an entirely new API based on ioctl() or a syscall.  Or even
> recvmsg() :)

FWIW, 
just shows the kernel, kernel selftests, and strace code for decoding
syscall arguments. CRIU uses it though (probably for postcopy live
migration / lazy migration?), I guess that code isn't in debian for
some reason.


Re: [PATCH 3/7] Add a UFFD_SECURE flag to the userfaultfd API.

2019-10-12 Thread Daniel Colascione
On Sat, Oct 12, 2019 at 6:14 PM Andy Lutomirski  wrote:
>
..
>
> > But maybe we can go further: let's separate authentication and
> > authorization, as we do in other LSM hooks. Let's split my
> > inode_init_security_anon into two hooks, inode_init_security_anon and
> > inode_create_anon. We'd define the former to just initialize the file
> > object's security information --- in the SELinux case, figuring out
> > its class and SID --- and define the latter to answer the yes/no
> > question of whether a particular anonymous inode creation should be
> > allowed. Normally, anon_inode_getfile2() would just call both hooks.
> > We'd add another anon_inode_getfd flag, ANON_INODE_SKIP_AUTHORIZATION
> > or something, that would tell anon_inode_getfile2() to skip calling
> > the authorization hook, effectively making the creation always
> > succeed. We can then make the UFFD code pass
> > ANON_INODE_SKIP_AUTHORIZATION when it's creating a file object in the
> > fork child while creating UFFD_EVENT_FORK messages.
>
> That sounds like an improvement.  Or maybe just teach SELinux that
> this particular fd creation is actually making an anon_inode that is a
> child of an existing anon inode and that the context should be copied
> or whatever SELinux wants to do.  Like this, maybe:
>
> static int resolve_userfault_fork(struct userfaultfd_ctx *ctx,
>   struct userfaultfd_ctx *new,
>   struct uffd_msg *msg)
> {
> int fd;
>
> Change this:
>
> fd = anon_inode_getfd("[userfaultfd]", _fops, new,
>   O_RDWR | (new->flags & 
> UFFD_SHARED_FCNTL_FLAGS));
>
> to something like:
>
>   fd = anon_inode_make_child_fd(..., ctx->inode, ...);
>
> where ctx->inode is the one context's inode.

Yeah. I figured we could just add a special-purpose hook for this
case. Having a special hook for this one case feels ugly though, and
at copy_mm time, we don't have a PID for the new child yet --- I don't
know whether LSMs would care about that. But maybe this is one of
those "doctor, it hurts when I do this!" situations and this child
process difficulty is just a hint that some other design might work
better.

> Now that you've pointed this mechanism out, it is utterly and
> completely broken and should be removed from the kernel outright or at
> least severely restricted.  A .read implementation MUST NOT ACT ON THE
> CALLING TASK.  Ever.  Just imagine the effect of passing a userfaultfd
> as stdin to a setuid program.
>
> So I think the right solution might be to attempt to *remove*
> UFFD_EVENT_FORK.  Maybe the solution is to say that, unless the
> creator of a userfaultfd() has global CAP_SYS_ADMIN, then it cannot
> use UFFD_FEATURE_EVENT_FORK) and print a warning (once) when
> UFFD_FEATURE_EVENT_FORK is allowed.  And, after some suitable
> deprecation period, just remove it.  If it's genuinely useful, it
> needs an entirely new API based on ioctl() or a syscall.  Or even
> recvmsg() :)

IMHO, userfaultfd should have been a datagram socket from the start.
As you point out, it's a good fit for the UFFD protocol, which
involves FD passing and a fixed message size.

> And UFFD_SECURE should just become automatic, since you don't have a
> problem any more. :-p

Agreed. I'll wait to hear what everyone else has to say.


Re: [PATCH 3/7] Add a UFFD_SECURE flag to the userfaultfd API.

2019-10-12 Thread Andy Lutomirski
[adding more people because this is going to be an ABI break, sigh]

On Sat, Oct 12, 2019 at 5:52 PM Daniel Colascione  wrote:
>
> On Sat, Oct 12, 2019 at 4:10 PM Andy Lutomirski  wrote:
> >
> > On Sat, Oct 12, 2019 at 12:16 PM Daniel Colascione  
> > wrote:
> > >
> > > The new secure flag makes userfaultfd use a new "secure" anonymous
> > > file object instead of the default one, letting security modules
> > > supervise userfaultfd use.
> > >
> > > Requiring that users pass a new flag lets us avoid changing the
> > > semantics for existing callers.
> >
> > Is there any good reason not to make this be the default?
> >
> >
> > The only downside I can see is that it would increase the memory usage
> > of userfaultfd(), but that doesn't seem like such a big deal.  A
> > lighter-weight alternative would be to have a single inode shared by
> > all userfaultfd instances, which would require a somewhat different
> > internal anon_inode API.
>
> I'd also prefer to just make SELinux use mandatory, but there's a
> nasty interaction with UFFD_EVENT_FORK. Adding a new UFFD_SECURE mode
> which blocks UFFD_EVENT_FORK sidesteps this problem. Maybe you know a
> better way to deal with it.

...

> But maybe we can go further: let's separate authentication and
> authorization, as we do in other LSM hooks. Let's split my
> inode_init_security_anon into two hooks, inode_init_security_anon and
> inode_create_anon. We'd define the former to just initialize the file
> object's security information --- in the SELinux case, figuring out
> its class and SID --- and define the latter to answer the yes/no
> question of whether a particular anonymous inode creation should be
> allowed. Normally, anon_inode_getfile2() would just call both hooks.
> We'd add another anon_inode_getfd flag, ANON_INODE_SKIP_AUTHORIZATION
> or something, that would tell anon_inode_getfile2() to skip calling
> the authorization hook, effectively making the creation always
> succeed. We can then make the UFFD code pass
> ANON_INODE_SKIP_AUTHORIZATION when it's creating a file object in the
> fork child while creating UFFD_EVENT_FORK messages.

That sounds like an improvement.  Or maybe just teach SELinux that
this particular fd creation is actually making an anon_inode that is a
child of an existing anon inode and that the context should be copied
or whatever SELinux wants to do.  Like this, maybe:

static int resolve_userfault_fork(struct userfaultfd_ctx *ctx,
  struct userfaultfd_ctx *new,
  struct uffd_msg *msg)
{
int fd;

Change this:

fd = anon_inode_getfd("[userfaultfd]", _fops, new,
  O_RDWR | (new->flags & UFFD_SHARED_FCNTL_FLAGS));

to something like:

  fd = anon_inode_make_child_fd(..., ctx->inode, ...);

where ctx->inode is the one context's inode.

*** HOWEVER *** !!!

Now that you've pointed this mechanism out, it is utterly and
completely broken and should be removed from the kernel outright or at
least severely restricted.  A .read implementation MUST NOT ACT ON THE
CALLING TASK.  Ever.  Just imagine the effect of passing a userfaultfd
as stdin to a setuid program.

So I think the right solution might be to attempt to *remove*
UFFD_EVENT_FORK.  Maybe the solution is to say that, unless the
creator of a userfaultfd() has global CAP_SYS_ADMIN, then it cannot
use UFFD_FEATURE_EVENT_FORK) and print a warning (once) when
UFFD_FEATURE_EVENT_FORK is allowed.  And, after some suitable
deprecation period, just remove it.  If it's genuinely useful, it
needs an entirely new API based on ioctl() or a syscall.  Or even
recvmsg() :)

And UFFD_SECURE should just become automatic, since you don't have a
problem any more. :-p

--Andy


Re: [PATCH 3/7] Add a UFFD_SECURE flag to the userfaultfd API.

2019-10-12 Thread Daniel Colascione
On Sat, Oct 12, 2019 at 4:10 PM Andy Lutomirski  wrote:
>
> On Sat, Oct 12, 2019 at 12:16 PM Daniel Colascione  wrote:
> >
> > The new secure flag makes userfaultfd use a new "secure" anonymous
> > file object instead of the default one, letting security modules
> > supervise userfaultfd use.
> >
> > Requiring that users pass a new flag lets us avoid changing the
> > semantics for existing callers.
>
> Is there any good reason not to make this be the default?
>
>
> The only downside I can see is that it would increase the memory usage
> of userfaultfd(), but that doesn't seem like such a big deal.  A
> lighter-weight alternative would be to have a single inode shared by
> all userfaultfd instances, which would require a somewhat different
> internal anon_inode API.

I'd also prefer to just make SELinux use mandatory, but there's a
nasty interaction with UFFD_EVENT_FORK. Adding a new UFFD_SECURE mode
which blocks UFFD_EVENT_FORK sidesteps this problem. Maybe you know a
better way to deal with it.

Right now, when a process with a UFFD-managed VMA using
UFFD_EVENT_FORK forks, we make a new userfaultfd_ctx out of thin air
and enqueue it on the message queue for the parent process. When we
dequeue that context, we get to resolve_userfault_fork, which makes up
a new UFFD file object out of thin air in the context of the reading
process. Following normal SELinux rules, the SID attached to that new
file object would be the task SID of the process *reading* the fork
event, not the SID of the new fork child. That seems wrong, because
the label we give to the UFFD should correspond to the label of the
process that UFFD controls.

To try to solve this problem, we can move the file object creation to
the fork child and enqueue the file object itself instead of just the
userfaultfd_ctx, treating the dequeue as a file-descriptor-receive
operation just like a recvmsg of an AF_UNIX socket with SCM_RIGHTS.
(This approach seems more elegant anyway, since it reflects what's
actually going on.) The trouble the early-file-object-creation
approach is that the fork child may not be allowed to create UFFD file
objects on its own and an LSM can't tell the difference between
UFFD_EVENT_FORK handling creating the file object and the fork child
just calling userfaultfd(), meaning an LSM could veto the creation of
the file object for the fork event. We can't just create a
non-ANON_INODE_SECURE file object instead: that would defeat the whole
purpose of supervising UFFD using SELinux.

But maybe we can go further: let's separate authentication and
authorization, as we do in other LSM hooks. Let's split my
inode_init_security_anon into two hooks, inode_init_security_anon and
inode_create_anon. We'd define the former to just initialize the file
object's security information --- in the SELinux case, figuring out
its class and SID --- and define the latter to answer the yes/no
question of whether a particular anonymous inode creation should be
allowed. Normally, anon_inode_getfile2() would just call both hooks.
We'd add another anon_inode_getfd flag, ANON_INODE_SKIP_AUTHORIZATION
or something, that would tell anon_inode_getfile2() to skip calling
the authorization hook, effectively making the creation always
succeed. We can then make the UFFD code pass
ANON_INODE_SKIP_AUTHORIZATION when it's creating a file object in the
fork child while creating UFFD_EVENT_FORK messages.

Granted, UFFD fork processing doesn't actually occur in the fork
child, but in copy_mm, in the parent --- but the right thing should
happen anyway, right?

I'm open to suggestions. In the meantime, I figured we'd just define a
UFFD_SECURE and make it incompatible with UFFD_EVENT_FORK.

> In any event, I don't think that "make me visible to SELinux" should
> be a choice that user code makes.

Right. The new unprivileged_userfaultfd setting is ugly, but it at
least removes the ability of unprivileged users to opt out of SELinux
supervision.


Re: [PATCH 3/7] Add a UFFD_SECURE flag to the userfaultfd API.

2019-10-12 Thread Andy Lutomirski
On Sat, Oct 12, 2019 at 12:16 PM Daniel Colascione  wrote:
>
> The new secure flag makes userfaultfd use a new "secure" anonymous
> file object instead of the default one, letting security modules
> supervise userfaultfd use.
>
> Requiring that users pass a new flag lets us avoid changing the
> semantics for existing callers.

Is there any good reason not to make this be the default?

The only downside I can see is that it would increase the memory usage
of userfaultfd(), but that doesn't seem like such a big deal.  A
lighter-weight alternative would be to have a single inode shared by
all userfaultfd instances, which would require a somewhat different
internal anon_inode API.

In any event, I don't think that "make me visible to SELinux" should
be a choice that user code makes.

--Andy


[PATCH 3/7] Add a UFFD_SECURE flag to the userfaultfd API.

2019-10-12 Thread Daniel Colascione
The new secure flag makes userfaultfd use a new "secure" anonymous
file object instead of the default one, letting security modules
supervise userfaultfd use.

Requiring that users pass a new flag lets us avoid changing the
semantics for existing callers.

Signed-off-by: Daniel Colascione 
---
 fs/userfaultfd.c | 28 +---
 include/uapi/linux/userfaultfd.h |  8 
 2 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index f9fd18670e22..29f920fb236e 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1022,6 +1022,13 @@ static int resolve_userfault_fork(struct userfaultfd_ctx 
*ctx,
 {
int fd;
 
+   /*
+* Using a secure-mode UFFD to monitor forks isn't supported
+* right now.
+*/
+   if (new->flags & UFFD_SECURE)
+   return -EOPNOTSUPP;
+
fd = anon_inode_getfd("[userfaultfd]", _fops, new,
  O_RDWR | (new->flags & UFFD_SHARED_FCNTL_FLAGS));
if (fd < 0)
@@ -1841,6 +1848,18 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
ret = -EINVAL;
goto out;
}
+   if ((ctx->flags & UFFD_SECURE) &&
+   (features & UFFD_FEATURE_EVENT_FORK)) {
+   /*
+* We don't support UFFD_FEATURE_EVENT_FORK on a
+* secure-mode UFFD: doing so would need us to
+* construct the new file object in the context of the
+* fork child, and it's not worth it right now.
+*/
+   ret = -EINVAL;
+   goto out;
+   }
+
/* report all available features and ioctls to userland */
uffdio_api.features = UFFD_API_FEATURES;
uffdio_api.ioctls = UFFD_API_IOCTLS;
@@ -1942,6 +1961,7 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
 {
struct userfaultfd_ctx *ctx;
int fd;
+   static const int uffd_flags = UFFD_SECURE;
 
if (!sysctl_unprivileged_userfaultfd && !capable(CAP_SYS_PTRACE))
return -EPERM;
@@ -1951,8 +1971,9 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
/* Check the UFFD_* constants for consistency.  */
BUILD_BUG_ON(UFFD_CLOEXEC != O_CLOEXEC);
BUILD_BUG_ON(UFFD_NONBLOCK != O_NONBLOCK);
+   BUILD_BUG_ON(UFFD_SHARED_FCNTL_FLAGS & uffd_flags);
 
-   if (flags & ~UFFD_SHARED_FCNTL_FLAGS)
+   if (flags & ~(UFFD_SHARED_FCNTL_FLAGS | uffd_flags))
return -EINVAL;
 
ctx = kmem_cache_alloc(userfaultfd_ctx_cachep, GFP_KERNEL);
@@ -1969,8 +1990,9 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
/* prevent the mm struct to be freed */
mmgrab(ctx->mm);
 
-   fd = anon_inode_getfd("[userfaultfd]", _fops, ctx,
- O_RDWR | (flags & UFFD_SHARED_FCNTL_FLAGS));
+   fd = anon_inode_getfd2("[userfaultfd]", _fops, ctx,
+  O_RDWR | (flags & UFFD_SHARED_FCNTL_FLAGS),
+  ((flags & UFFD_SECURE) ? ANON_INODE_SECURE : 0));
if (fd < 0) {
mmdrop(ctx->mm);
kmem_cache_free(userfaultfd_ctx_cachep, ctx);
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 48f1a7c2f1f0..12d7d40d7f25 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -231,4 +231,12 @@ struct uffdio_zeropage {
__s64 zeropage;
 };
 
+/*
+ * Flags for the userfaultfd(2) system call itself.
+ */
+
+/*
+ * Create a userfaultfd with MAC security checks enabled.
+ */
+#define UFFD_SECURE 1
 #endif /* _LINUX_USERFAULTFD_H */
-- 
2.23.0.700.g56cf767bdb-goog