Re: [RFC 0/3] seccomp trap to userspace

2018-03-16 Thread Christian Brauner
On Fri, Mar 16, 2018 at 09:01:47AM -0700, Andy Lutomirski wrote:
> 
> 
> > On Mar 16, 2018, at 7:47 AM, Christian Brauner 
> >  wrote:
> > 
> >> On Fri, Mar 16, 2018 at 12:46:55AM +, Andy Lutomirski wrote:
> 
> 
> I bet I confused everyone with a blatant typo:
> 
> >> 
> >> Hmm, I think we have to be very careful to avoid nasty races.  I think
> >> the correct approach is to notice the signal and send a message to the
> >> listener that a signal is pending but to take no additional action.
> >> If the handler ends up completing the syscall with a successful
> >> return, we don't want to replace it with -EINTR.  IOW the code looks
> >> kind of like:
> >> 
> >> send_to_listener("hey I got a signal");
> 
> That should be “hey I got a syscall”.   D’oh!

Ha ok, that's what led me to believe that listener != handler and I was
trying to make sense of thise. :)

Thanks!
Christian

> 
> >> wait_ret = wait_interruptible for the listener to reply;
> >> if (wait_ret == -EINTR) {
> > 
> > Hm, so from the pseudo-code it looks like: The handler would inform the
> > listener that it received a signal (either from the syscall requester or
> > from somewhere else) and then wait for the listener to reply to that
> > message.  This would allow the listener to decide what action it wants
> > the handler to take based on the signal, i.e. either cancel the request
> > or retry?  The comment makes it sound like that the handler doesn't
> > really wait on the listener when it receives a signal it simply moves
> > on.
> 
> It keeps waiting killably but not interruptibly. 
> 
> > So no "taking no additional action" here means not have the handler
> > decide to abort but the listener?
> 
> If by “handler” you mean kernel, then yes. 
> 
> There’s no userspace syscall handler involved. From the kernel’s perspective, 
> a syscall is never still in progress when a signal handler is invoked — we 
> only actually invoke syscall handlers in prepare_exit_to_usermode() or the 
> non-x86 equivalent and the functions it calls. While a syscall is running, 
> the kernel might notice that a signal is pending and do one of a few things:
> 
> 1. Just keep going. Not all syscalls can be interrupted. 
> 
> 2. Try to finish early. If a send() call has already sent some but not all 
> data, it can stop waiting and return the number of bytes sent.
> 
> 3. Abort with -EINTR.
> 
> 4. Abort with -ERESTARTSYS or one of its relatives. These fiddle with user 
> registers in a somewhat unpleasant way to pretend that the syscall never 
> actually happened.  This works for syscalls that wait with an absolute 
> timeout, for example. 
> 
> 5. Set up restart_syscall() magic, rewrite regs so it looks like the user was 
> about to call restart_syscall() when the signal happened, and abort. 
> 
> In all cases, the signal is dealt with afterwards. This could result in 
> changing regs to call the handler or in simply returning. 
> 
> 1-3 should work fully in seccomp. The only issue is that the kernel doesn’t 
> know *which* to do, nor can the kernel force the listener to abort cleanly, 
> so I think we have  no real choice but to let the listener decide. 
> 
> 4 could be supported just like 1-3. 5 is awful, and I don’t think we should 
> support it for user listeners. 


Re: [RFC 0/3] seccomp trap to userspace

2018-03-16 Thread Christian Brauner
On Fri, Mar 16, 2018 at 09:01:47AM -0700, Andy Lutomirski wrote:
> 
> 
> > On Mar 16, 2018, at 7:47 AM, Christian Brauner 
> >  wrote:
> > 
> >> On Fri, Mar 16, 2018 at 12:46:55AM +, Andy Lutomirski wrote:
> 
> 
> I bet I confused everyone with a blatant typo:
> 
> >> 
> >> Hmm, I think we have to be very careful to avoid nasty races.  I think
> >> the correct approach is to notice the signal and send a message to the
> >> listener that a signal is pending but to take no additional action.
> >> If the handler ends up completing the syscall with a successful
> >> return, we don't want to replace it with -EINTR.  IOW the code looks
> >> kind of like:
> >> 
> >> send_to_listener("hey I got a signal");
> 
> That should be “hey I got a syscall”.   D’oh!

Ha ok, that's what led me to believe that listener != handler and I was
trying to make sense of thise. :)

Thanks!
Christian

> 
> >> wait_ret = wait_interruptible for the listener to reply;
> >> if (wait_ret == -EINTR) {
> > 
> > Hm, so from the pseudo-code it looks like: The handler would inform the
> > listener that it received a signal (either from the syscall requester or
> > from somewhere else) and then wait for the listener to reply to that
> > message.  This would allow the listener to decide what action it wants
> > the handler to take based on the signal, i.e. either cancel the request
> > or retry?  The comment makes it sound like that the handler doesn't
> > really wait on the listener when it receives a signal it simply moves
> > on.
> 
> It keeps waiting killably but not interruptibly. 
> 
> > So no "taking no additional action" here means not have the handler
> > decide to abort but the listener?
> 
> If by “handler” you mean kernel, then yes. 
> 
> There’s no userspace syscall handler involved. From the kernel’s perspective, 
> a syscall is never still in progress when a signal handler is invoked — we 
> only actually invoke syscall handlers in prepare_exit_to_usermode() or the 
> non-x86 equivalent and the functions it calls. While a syscall is running, 
> the kernel might notice that a signal is pending and do one of a few things:
> 
> 1. Just keep going. Not all syscalls can be interrupted. 
> 
> 2. Try to finish early. If a send() call has already sent some but not all 
> data, it can stop waiting and return the number of bytes sent.
> 
> 3. Abort with -EINTR.
> 
> 4. Abort with -ERESTARTSYS or one of its relatives. These fiddle with user 
> registers in a somewhat unpleasant way to pretend that the syscall never 
> actually happened.  This works for syscalls that wait with an absolute 
> timeout, for example. 
> 
> 5. Set up restart_syscall() magic, rewrite regs so it looks like the user was 
> about to call restart_syscall() when the signal happened, and abort. 
> 
> In all cases, the signal is dealt with afterwards. This could result in 
> changing regs to call the handler or in simply returning. 
> 
> 1-3 should work fully in seccomp. The only issue is that the kernel doesn’t 
> know *which* to do, nor can the kernel force the listener to abort cleanly, 
> so I think we have  no real choice but to let the listener decide. 
> 
> 4 could be supported just like 1-3. 5 is awful, and I don’t think we should 
> support it for user listeners. 


Re: [RFC 0/3] seccomp trap to userspace

2018-03-16 Thread Andy Lutomirski


> On Mar 16, 2018, at 7:47 AM, Christian Brauner 
>  wrote:
> 
>> On Fri, Mar 16, 2018 at 12:46:55AM +, Andy Lutomirski wrote:


I bet I confused everyone with a blatant typo:

>> 
>> Hmm, I think we have to be very careful to avoid nasty races.  I think
>> the correct approach is to notice the signal and send a message to the
>> listener that a signal is pending but to take no additional action.
>> If the handler ends up completing the syscall with a successful
>> return, we don't want to replace it with -EINTR.  IOW the code looks
>> kind of like:
>> 
>> send_to_listener("hey I got a signal");

That should be “hey I got a syscall”.   D’oh!

>> wait_ret = wait_interruptible for the listener to reply;
>> if (wait_ret == -EINTR) {
> 
> Hm, so from the pseudo-code it looks like: The handler would inform the
> listener that it received a signal (either from the syscall requester or
> from somewhere else) and then wait for the listener to reply to that
> message.  This would allow the listener to decide what action it wants
> the handler to take based on the signal, i.e. either cancel the request
> or retry?  The comment makes it sound like that the handler doesn't
> really wait on the listener when it receives a signal it simply moves
> on.

It keeps waiting killably but not interruptibly. 

> So no "taking no additional action" here means not have the handler
> decide to abort but the listener?

If by “handler” you mean kernel, then yes. 

There’s no userspace syscall handler involved. From the kernel’s perspective, a 
syscall is never still in progress when a signal handler is invoked — we only 
actually invoke syscall handlers in prepare_exit_to_usermode() or the non-x86 
equivalent and the functions it calls. While a syscall is running, the kernel 
might notice that a signal is pending and do one of a few things:

1. Just keep going. Not all syscalls can be interrupted. 

2. Try to finish early. If a send() call has already sent some but not all 
data, it can stop waiting and return the number of bytes sent.

3. Abort with -EINTR.

4. Abort with -ERESTARTSYS or one of its relatives. These fiddle with user 
registers in a somewhat unpleasant way to pretend that the syscall never 
actually happened.  This works for syscalls that wait with an absolute timeout, 
for example. 

5. Set up restart_syscall() magic, rewrite regs so it looks like the user was 
about to call restart_syscall() when the signal happened, and abort. 

In all cases, the signal is dealt with afterwards. This could result in 
changing regs to call the handler or in simply returning. 

1-3 should work fully in seccomp. The only issue is that the kernel doesn’t 
know *which* to do, nor can the kernel force the listener to abort cleanly, so 
I think we have  no real choice but to let the listener decide. 

4 could be supported just like 1-3. 5 is awful, and I don’t think we should 
support it for user listeners. 


Re: [RFC 0/3] seccomp trap to userspace

2018-03-16 Thread Andy Lutomirski


> On Mar 16, 2018, at 7:47 AM, Christian Brauner 
>  wrote:
> 
>> On Fri, Mar 16, 2018 at 12:46:55AM +, Andy Lutomirski wrote:


I bet I confused everyone with a blatant typo:

>> 
>> Hmm, I think we have to be very careful to avoid nasty races.  I think
>> the correct approach is to notice the signal and send a message to the
>> listener that a signal is pending but to take no additional action.
>> If the handler ends up completing the syscall with a successful
>> return, we don't want to replace it with -EINTR.  IOW the code looks
>> kind of like:
>> 
>> send_to_listener("hey I got a signal");

That should be “hey I got a syscall”.   D’oh!

>> wait_ret = wait_interruptible for the listener to reply;
>> if (wait_ret == -EINTR) {
> 
> Hm, so from the pseudo-code it looks like: The handler would inform the
> listener that it received a signal (either from the syscall requester or
> from somewhere else) and then wait for the listener to reply to that
> message.  This would allow the listener to decide what action it wants
> the handler to take based on the signal, i.e. either cancel the request
> or retry?  The comment makes it sound like that the handler doesn't
> really wait on the listener when it receives a signal it simply moves
> on.

It keeps waiting killably but not interruptibly. 

> So no "taking no additional action" here means not have the handler
> decide to abort but the listener?

If by “handler” you mean kernel, then yes. 

There’s no userspace syscall handler involved. From the kernel’s perspective, a 
syscall is never still in progress when a signal handler is invoked — we only 
actually invoke syscall handlers in prepare_exit_to_usermode() or the non-x86 
equivalent and the functions it calls. While a syscall is running, the kernel 
might notice that a signal is pending and do one of a few things:

1. Just keep going. Not all syscalls can be interrupted. 

2. Try to finish early. If a send() call has already sent some but not all 
data, it can stop waiting and return the number of bytes sent.

3. Abort with -EINTR.

4. Abort with -ERESTARTSYS or one of its relatives. These fiddle with user 
registers in a somewhat unpleasant way to pretend that the syscall never 
actually happened.  This works for syscalls that wait with an absolute timeout, 
for example. 

5. Set up restart_syscall() magic, rewrite regs so it looks like the user was 
about to call restart_syscall() when the signal happened, and abort. 

In all cases, the signal is dealt with afterwards. This could result in 
changing regs to call the handler or in simply returning. 

1-3 should work fully in seccomp. The only issue is that the kernel doesn’t 
know *which* to do, nor can the kernel force the listener to abort cleanly, so 
I think we have  no real choice but to let the listener decide. 

4 could be supported just like 1-3. 5 is awful, and I don’t think we should 
support it for user listeners. 


Re: [RFC 0/3] seccomp trap to userspace

2018-03-16 Thread Christian Brauner
On Fri, Mar 16, 2018 at 12:46:55AM +, Andy Lutomirski wrote:
> On Thu, Mar 15, 2018 at 5:35 PM, Tycho Andersen  wrote:
> > Hi Andy,
> >
> > On Thu, Mar 15, 2018 at 05:11:32PM +, Andy Lutomirski wrote:
> >> On Thu, Mar 15, 2018 at 5:05 PM, Serge E. Hallyn  wrote:
> >> > Hm, synchronously - that brings to mind a thought...  I should re-look at
> >> > Tycho's patches first, but, if I'm in a container, start some syscall 
> >> > that
> >> > gets trapped to userspace, then I hit ctrl-c.  I'd like to be able to 
> >> > have
> >> > the handler be interrupted and have it return -EINTR.  Is that going to
> >> > be possible with the synchronous approach?
> >>
> >> I think so, but it should be possible with the classic async approach
> >> too.  The main issue is the difference between a classic filter like
> >> this (pseudocode):
> >>
> >> if (nr == SYS_mount) return TRAP_TO_USERSPACE;
> >>
> >> and the eBPF variant:
> >>
> >> if (nr == SYS_mount) trap_to_userspace();
> >
> > Sargun started a private design discussion thread that I don't think
> > you were on, but Alexei said something to the effect of "eBPF programs
> > will never wait on userspace", so I'm not sure we can do something
> > like this in an eBPF program. I'm cc-ing him here again to confirm,
> > but I doubt things have changed.
> >
> >> I admit that it's still not 100% clear to me that the latter is
> >> genuinely more useful than the former.
> >>
> >> The case where I think the synchronous function call is a huge win is this 
> >> one:
> >>
> >> if (nr  == SYS_mount) {
> >>   log("Someone called mount with args %lx\n", ...);
> >>   return RET_KILL;
> >> }
> >>
> >> The idea being that the log message wouldn't show up in the kernel log
> >> -- it would get sent to the listener socket belonging to whoever
> >> created the filter, and that process could then go and log it
> >> properly.  This would work perfectly in containers and in totally
> >> unprivileged applications like Chromium.
> >
> > The current implementation can't do exactly this, but you could do:
> >
> > if (nr == SYS_mount) {
> > log(...);
> > kill(pid, SIGKILL);
> > }
> >
> > from the handler instead.
> >
> > I guess Serge is asking a slightly different question: what if the
> > task gets e.g. SIGINT from the user doing a ^C or SIGALARM or
> > something, we should probably send the handler some sort of message or
> > interrupt to let it know that the syscall was cancelled. Right now the
> > current set doesn't behave that way, and the handler will just
> > continue on its merry way and get an EINVAL when it tries to respond
> > with the cancelled cookie.
> 
> Hmm, I think we have to be very careful to avoid nasty races.  I think
> the correct approach is to notice the signal and send a message to the
> listener that a signal is pending but to take no additional action.
> If the handler ends up completing the syscall with a successful
> return, we don't want to replace it with -EINTR.  IOW the code looks
> kind of like:
> 
> send_to_listener("hey I got a signal");
> wait_ret = wait_interruptible for the listener to reply;
> if (wait_ret == -EINTR) {

Hm, so from the pseudo-code it looks like: The handler would inform the
listener that it received a signal (either from the syscall requester or
from somewhere else) and then wait for the listener to reply to that
message.  This would allow the listener to decide what action it wants
the handler to take based on the signal, i.e. either cancel the request
or retry?  The comment makes it sound like that the handler doesn't
really wait on the listener when it receives a signal it simply moves
on.
So no "taking no additional action" here means not have the handler
decide to abort but the listener?

Sorry if I'm being dense.

Christian

>   send_to_listener("hey there's a signal");
>   wait_ret = wait_killable for the listener to reply to the original request;
> }
> 
> if (wait_ret == -EINTR) {
>   /* hmm, this next line might not actually be necessary, but it's
> harmless and possibly useful */
>   send_to_listener("hey we're going away");
>   /* and stop waiting */
> }
> 
> ... actually handle the result.
> 
> --Andy
> ___
> Containers mailing list
> contain...@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers


Re: [RFC 0/3] seccomp trap to userspace

2018-03-16 Thread Christian Brauner
On Fri, Mar 16, 2018 at 12:46:55AM +, Andy Lutomirski wrote:
> On Thu, Mar 15, 2018 at 5:35 PM, Tycho Andersen  wrote:
> > Hi Andy,
> >
> > On Thu, Mar 15, 2018 at 05:11:32PM +, Andy Lutomirski wrote:
> >> On Thu, Mar 15, 2018 at 5:05 PM, Serge E. Hallyn  wrote:
> >> > Hm, synchronously - that brings to mind a thought...  I should re-look at
> >> > Tycho's patches first, but, if I'm in a container, start some syscall 
> >> > that
> >> > gets trapped to userspace, then I hit ctrl-c.  I'd like to be able to 
> >> > have
> >> > the handler be interrupted and have it return -EINTR.  Is that going to
> >> > be possible with the synchronous approach?
> >>
> >> I think so, but it should be possible with the classic async approach
> >> too.  The main issue is the difference between a classic filter like
> >> this (pseudocode):
> >>
> >> if (nr == SYS_mount) return TRAP_TO_USERSPACE;
> >>
> >> and the eBPF variant:
> >>
> >> if (nr == SYS_mount) trap_to_userspace();
> >
> > Sargun started a private design discussion thread that I don't think
> > you were on, but Alexei said something to the effect of "eBPF programs
> > will never wait on userspace", so I'm not sure we can do something
> > like this in an eBPF program. I'm cc-ing him here again to confirm,
> > but I doubt things have changed.
> >
> >> I admit that it's still not 100% clear to me that the latter is
> >> genuinely more useful than the former.
> >>
> >> The case where I think the synchronous function call is a huge win is this 
> >> one:
> >>
> >> if (nr  == SYS_mount) {
> >>   log("Someone called mount with args %lx\n", ...);
> >>   return RET_KILL;
> >> }
> >>
> >> The idea being that the log message wouldn't show up in the kernel log
> >> -- it would get sent to the listener socket belonging to whoever
> >> created the filter, and that process could then go and log it
> >> properly.  This would work perfectly in containers and in totally
> >> unprivileged applications like Chromium.
> >
> > The current implementation can't do exactly this, but you could do:
> >
> > if (nr == SYS_mount) {
> > log(...);
> > kill(pid, SIGKILL);
> > }
> >
> > from the handler instead.
> >
> > I guess Serge is asking a slightly different question: what if the
> > task gets e.g. SIGINT from the user doing a ^C or SIGALARM or
> > something, we should probably send the handler some sort of message or
> > interrupt to let it know that the syscall was cancelled. Right now the
> > current set doesn't behave that way, and the handler will just
> > continue on its merry way and get an EINVAL when it tries to respond
> > with the cancelled cookie.
> 
> Hmm, I think we have to be very careful to avoid nasty races.  I think
> the correct approach is to notice the signal and send a message to the
> listener that a signal is pending but to take no additional action.
> If the handler ends up completing the syscall with a successful
> return, we don't want to replace it with -EINTR.  IOW the code looks
> kind of like:
> 
> send_to_listener("hey I got a signal");
> wait_ret = wait_interruptible for the listener to reply;
> if (wait_ret == -EINTR) {

Hm, so from the pseudo-code it looks like: The handler would inform the
listener that it received a signal (either from the syscall requester or
from somewhere else) and then wait for the listener to reply to that
message.  This would allow the listener to decide what action it wants
the handler to take based on the signal, i.e. either cancel the request
or retry?  The comment makes it sound like that the handler doesn't
really wait on the listener when it receives a signal it simply moves
on.
So no "taking no additional action" here means not have the handler
decide to abort but the listener?

Sorry if I'm being dense.

Christian

>   send_to_listener("hey there's a signal");
>   wait_ret = wait_killable for the listener to reply to the original request;
> }
> 
> if (wait_ret == -EINTR) {
>   /* hmm, this next line might not actually be necessary, but it's
> harmless and possibly useful */
>   send_to_listener("hey we're going away");
>   /* and stop waiting */
> }
> 
> ... actually handle the result.
> 
> --Andy
> ___
> Containers mailing list
> contain...@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers


Re: [RFC 0/3] seccomp trap to userspace

2018-03-15 Thread Andy Lutomirski
On Thu, Mar 15, 2018 at 5:35 PM, Tycho Andersen  wrote:
> Hi Andy,
>
> On Thu, Mar 15, 2018 at 05:11:32PM +, Andy Lutomirski wrote:
>> On Thu, Mar 15, 2018 at 5:05 PM, Serge E. Hallyn  wrote:
>> > Hm, synchronously - that brings to mind a thought...  I should re-look at
>> > Tycho's patches first, but, if I'm in a container, start some syscall that
>> > gets trapped to userspace, then I hit ctrl-c.  I'd like to be able to have
>> > the handler be interrupted and have it return -EINTR.  Is that going to
>> > be possible with the synchronous approach?
>>
>> I think so, but it should be possible with the classic async approach
>> too.  The main issue is the difference between a classic filter like
>> this (pseudocode):
>>
>> if (nr == SYS_mount) return TRAP_TO_USERSPACE;
>>
>> and the eBPF variant:
>>
>> if (nr == SYS_mount) trap_to_userspace();
>
> Sargun started a private design discussion thread that I don't think
> you were on, but Alexei said something to the effect of "eBPF programs
> will never wait on userspace", so I'm not sure we can do something
> like this in an eBPF program. I'm cc-ing him here again to confirm,
> but I doubt things have changed.
>
>> I admit that it's still not 100% clear to me that the latter is
>> genuinely more useful than the former.
>>
>> The case where I think the synchronous function call is a huge win is this 
>> one:
>>
>> if (nr  == SYS_mount) {
>>   log("Someone called mount with args %lx\n", ...);
>>   return RET_KILL;
>> }
>>
>> The idea being that the log message wouldn't show up in the kernel log
>> -- it would get sent to the listener socket belonging to whoever
>> created the filter, and that process could then go and log it
>> properly.  This would work perfectly in containers and in totally
>> unprivileged applications like Chromium.
>
> The current implementation can't do exactly this, but you could do:
>
> if (nr == SYS_mount) {
> log(...);
> kill(pid, SIGKILL);
> }
>
> from the handler instead.
>
> I guess Serge is asking a slightly different question: what if the
> task gets e.g. SIGINT from the user doing a ^C or SIGALARM or
> something, we should probably send the handler some sort of message or
> interrupt to let it know that the syscall was cancelled. Right now the
> current set doesn't behave that way, and the handler will just
> continue on its merry way and get an EINVAL when it tries to respond
> with the cancelled cookie.

Hmm, I think we have to be very careful to avoid nasty races.  I think
the correct approach is to notice the signal and send a message to the
listener that a signal is pending but to take no additional action.
If the handler ends up completing the syscall with a successful
return, we don't want to replace it with -EINTR.  IOW the code looks
kind of like:

send_to_listener("hey I got a signal");
wait_ret = wait_interruptible for the listener to reply;
if (wait_ret == -EINTR) {
  send_to_listener("hey there's a signal");
  wait_ret = wait_killable for the listener to reply to the original request;
}

if (wait_ret == -EINTR) {
  /* hmm, this next line might not actually be necessary, but it's
harmless and possibly useful */
  send_to_listener("hey we're going away");
  /* and stop waiting */
}

... actually handle the result.

--Andy


Re: [RFC 0/3] seccomp trap to userspace

2018-03-15 Thread Andy Lutomirski
On Thu, Mar 15, 2018 at 5:35 PM, Tycho Andersen  wrote:
> Hi Andy,
>
> On Thu, Mar 15, 2018 at 05:11:32PM +, Andy Lutomirski wrote:
>> On Thu, Mar 15, 2018 at 5:05 PM, Serge E. Hallyn  wrote:
>> > Hm, synchronously - that brings to mind a thought...  I should re-look at
>> > Tycho's patches first, but, if I'm in a container, start some syscall that
>> > gets trapped to userspace, then I hit ctrl-c.  I'd like to be able to have
>> > the handler be interrupted and have it return -EINTR.  Is that going to
>> > be possible with the synchronous approach?
>>
>> I think so, but it should be possible with the classic async approach
>> too.  The main issue is the difference between a classic filter like
>> this (pseudocode):
>>
>> if (nr == SYS_mount) return TRAP_TO_USERSPACE;
>>
>> and the eBPF variant:
>>
>> if (nr == SYS_mount) trap_to_userspace();
>
> Sargun started a private design discussion thread that I don't think
> you were on, but Alexei said something to the effect of "eBPF programs
> will never wait on userspace", so I'm not sure we can do something
> like this in an eBPF program. I'm cc-ing him here again to confirm,
> but I doubt things have changed.
>
>> I admit that it's still not 100% clear to me that the latter is
>> genuinely more useful than the former.
>>
>> The case where I think the synchronous function call is a huge win is this 
>> one:
>>
>> if (nr  == SYS_mount) {
>>   log("Someone called mount with args %lx\n", ...);
>>   return RET_KILL;
>> }
>>
>> The idea being that the log message wouldn't show up in the kernel log
>> -- it would get sent to the listener socket belonging to whoever
>> created the filter, and that process could then go and log it
>> properly.  This would work perfectly in containers and in totally
>> unprivileged applications like Chromium.
>
> The current implementation can't do exactly this, but you could do:
>
> if (nr == SYS_mount) {
> log(...);
> kill(pid, SIGKILL);
> }
>
> from the handler instead.
>
> I guess Serge is asking a slightly different question: what if the
> task gets e.g. SIGINT from the user doing a ^C or SIGALARM or
> something, we should probably send the handler some sort of message or
> interrupt to let it know that the syscall was cancelled. Right now the
> current set doesn't behave that way, and the handler will just
> continue on its merry way and get an EINVAL when it tries to respond
> with the cancelled cookie.

Hmm, I think we have to be very careful to avoid nasty races.  I think
the correct approach is to notice the signal and send a message to the
listener that a signal is pending but to take no additional action.
If the handler ends up completing the syscall with a successful
return, we don't want to replace it with -EINTR.  IOW the code looks
kind of like:

send_to_listener("hey I got a signal");
wait_ret = wait_interruptible for the listener to reply;
if (wait_ret == -EINTR) {
  send_to_listener("hey there's a signal");
  wait_ret = wait_killable for the listener to reply to the original request;
}

if (wait_ret == -EINTR) {
  /* hmm, this next line might not actually be necessary, but it's
harmless and possibly useful */
  send_to_listener("hey we're going away");
  /* and stop waiting */
}

... actually handle the result.

--Andy


Re: [RFC 0/3] seccomp trap to userspace

2018-03-15 Thread Tycho Andersen
Hi Andy,

On Thu, Mar 15, 2018 at 05:11:32PM +, Andy Lutomirski wrote:
> On Thu, Mar 15, 2018 at 5:05 PM, Serge E. Hallyn  wrote:
> > Hm, synchronously - that brings to mind a thought...  I should re-look at
> > Tycho's patches first, but, if I'm in a container, start some syscall that
> > gets trapped to userspace, then I hit ctrl-c.  I'd like to be able to have
> > the handler be interrupted and have it return -EINTR.  Is that going to
> > be possible with the synchronous approach?
> 
> I think so, but it should be possible with the classic async approach
> too.  The main issue is the difference between a classic filter like
> this (pseudocode):
> 
> if (nr == SYS_mount) return TRAP_TO_USERSPACE;
> 
> and the eBPF variant:
> 
> if (nr == SYS_mount) trap_to_userspace();

Sargun started a private design discussion thread that I don't think
you were on, but Alexei said something to the effect of "eBPF programs
will never wait on userspace", so I'm not sure we can do something
like this in an eBPF program. I'm cc-ing him here again to confirm,
but I doubt things have changed.

> I admit that it's still not 100% clear to me that the latter is
> genuinely more useful than the former.
> 
> The case where I think the synchronous function call is a huge win is this 
> one:
> 
> if (nr  == SYS_mount) {
>   log("Someone called mount with args %lx\n", ...);
>   return RET_KILL;
> }
> 
> The idea being that the log message wouldn't show up in the kernel log
> -- it would get sent to the listener socket belonging to whoever
> created the filter, and that process could then go and log it
> properly.  This would work perfectly in containers and in totally
> unprivileged applications like Chromium.

The current implementation can't do exactly this, but you could do:

if (nr == SYS_mount) {
log(...);
kill(pid, SIGKILL);
}

from the handler instead.

I guess Serge is asking a slightly different question: what if the
task gets e.g. SIGINT from the user doing a ^C or SIGALARM or
something, we should probably send the handler some sort of message or
interrupt to let it know that the syscall was cancelled. Right now the
current set doesn't behave that way, and the handler will just
continue on its merry way and get an EINVAL when it tries to respond
with the cancelled cookie.

Anyway, I think these last two points can be addressed with the
approach from this series. The notification to the handler about a
cancelled syscall might be slightly awkward, but I'll take a look.

Cheers,

Tycho


Re: [RFC 0/3] seccomp trap to userspace

2018-03-15 Thread Tycho Andersen
Hi Andy,

On Thu, Mar 15, 2018 at 05:11:32PM +, Andy Lutomirski wrote:
> On Thu, Mar 15, 2018 at 5:05 PM, Serge E. Hallyn  wrote:
> > Hm, synchronously - that brings to mind a thought...  I should re-look at
> > Tycho's patches first, but, if I'm in a container, start some syscall that
> > gets trapped to userspace, then I hit ctrl-c.  I'd like to be able to have
> > the handler be interrupted and have it return -EINTR.  Is that going to
> > be possible with the synchronous approach?
> 
> I think so, but it should be possible with the classic async approach
> too.  The main issue is the difference between a classic filter like
> this (pseudocode):
> 
> if (nr == SYS_mount) return TRAP_TO_USERSPACE;
> 
> and the eBPF variant:
> 
> if (nr == SYS_mount) trap_to_userspace();

Sargun started a private design discussion thread that I don't think
you were on, but Alexei said something to the effect of "eBPF programs
will never wait on userspace", so I'm not sure we can do something
like this in an eBPF program. I'm cc-ing him here again to confirm,
but I doubt things have changed.

> I admit that it's still not 100% clear to me that the latter is
> genuinely more useful than the former.
> 
> The case where I think the synchronous function call is a huge win is this 
> one:
> 
> if (nr  == SYS_mount) {
>   log("Someone called mount with args %lx\n", ...);
>   return RET_KILL;
> }
> 
> The idea being that the log message wouldn't show up in the kernel log
> -- it would get sent to the listener socket belonging to whoever
> created the filter, and that process could then go and log it
> properly.  This would work perfectly in containers and in totally
> unprivileged applications like Chromium.

The current implementation can't do exactly this, but you could do:

if (nr == SYS_mount) {
log(...);
kill(pid, SIGKILL);
}

from the handler instead.

I guess Serge is asking a slightly different question: what if the
task gets e.g. SIGINT from the user doing a ^C or SIGALARM or
something, we should probably send the handler some sort of message or
interrupt to let it know that the syscall was cancelled. Right now the
current set doesn't behave that way, and the handler will just
continue on its merry way and get an EINVAL when it tries to respond
with the cancelled cookie.

Anyway, I think these last two points can be addressed with the
approach from this series. The notification to the handler about a
cancelled syscall might be slightly awkward, but I'll take a look.

Cheers,

Tycho


Re: [RFC 0/3] seccomp trap to userspace

2018-03-15 Thread Andy Lutomirski
On Thu, Mar 15, 2018 at 5:25 PM, Christian Brauner
 wrote:
> On Thu, Mar 15, 2018 at 05:11:32PM +, Andy Lutomirski wrote:
>> On Thu, Mar 15, 2018 at 5:05 PM, Serge E. Hallyn  wrote:
>> > Quoting Andy Lutomirski (l...@kernel.org):
>> >> On Thu, Mar 15, 2018 at 4:09 PM, Christian Brauner
>> >>  wrote:
>> >> > On Sun, Feb 04, 2018 at 11:49:43AM +0100, Tycho Andersen wrote:
>> >> >> Several months ago at Linux Plumber's, we had a discussion about 
>> >> >> adding a
>> >> >> feature to seccomp which would allow seccomp to trigger a notification 
>> >> >> for some
>> >> >> other process. Here's a draft of that feature.
>> >> >>
>> >> >> Patch 1 contains the bulk of it, patches 2 & 3 offer an alternative 
>> >> >> way to
>> >> >> acquire the fd that receives notifications via ptrace (the method in 
>> >> >> patch 1
>> >> >> poses some problems). Other suggestions for how to acquire one of 
>> >> >> these fds
>> >> >> would be welcome.
>> >> >>
>> >> >> Take a close look at the synchronization. I think I've got it right, 
>> >> >> but I
>> >> >> probably don't :)
>> >> >>
>> >> >> Thanks!
>> >> >>
>> >> >> Tycho Andersen (3):
>> >> >>   seccomp: add a return code to trap to userspace
>> >> >>   seccomp: hoist out filter resolving logic
>> >> >>   seccomp: add a way to get a listener fd from ptrace
>> >> >>
>> >> >>  arch/Kconfig  |   7 +
>> >> >>  include/linux/seccomp.h   |  14 +-
>> >> >>  include/uapi/linux/ptrace.h   |   1 +
>> >> >>  include/uapi/linux/seccomp.h  |  18 +-
>> >> >>  kernel/ptrace.c   |   4 +
>> >> >>  kernel/seccomp.c  | 467 
>> >> >> --
>> >> >>  tools/testing/selftests/seccomp/seccomp_bpf.c | 180 +-
>> >> >>  7 files changed, 653 insertions(+), 38 deletions(-)
>> >> >
>> >> > Hey,
>> >> >
>> >> > So, I've been following the discussion silently in the background and I
>> >> > see that it got sidetracked into seccomp + ebpf. While I can see that
>> >> > there is value in adding epbf support to seccomp I'd really like to see
>> >> > this decoupled from this patchset. Afaict, this patchset would just work
>> >> > fine without the ebpf portion (but I might be just have missed the
>> >> > point). So if possible I would like to see a second version of this with
>> >> > the comments accounted for and - if possible - have this up for merging
>> >> > independent of the ebpf patchset that's floating around.
>> >> >
>> >>
>> >> The issue is that it might be (and, then again, might not be) nicer to
>> >> to *synchronously* call out to the monitor in the filter.  eBPF can do
>> >> that very cleanly, whereas classic BPF can't.
>> >
>> > Hm, synchronously - that brings to mind a thought...  I should re-look at
>> > Tycho's patches first, but, if I'm in a container, start some syscall that
>> > gets trapped to userspace, then I hit ctrl-c.  I'd like to be able to have
>> > the handler be interrupted and have it return -EINTR.  Is that going to
>> > be possible with the synchronous approach?
>>
>> I think so, but it should be possible with the classic async approach
>> too.  The main issue is the difference between a classic filter like
>> this (pseudocode):
>>
>> if (nr == SYS_mount) return TRAP_TO_USERSPACE;
>>
>> and the eBPF variant:
>>
>> if (nr == SYS_mount) trap_to_userspace();
>>
>> I admit that it's still not 100% clear to me that the latter is
>> genuinely more useful than the former.
>
> We've just discussed this on irc and the fact that most problems can be
> addressed by interfaces we already have makes it questionable what ebpf
> brings to the game here. Especially since the discussion gave the
> impression that if ebpf ever makes it to seccomp it will basically be
> because it allows a nice implementation of the trap to userspace. If
> it's even unclear whether it is really the better choice for this task
> then we could consider to no try and make this patchset use it. (I
> probably sound way more polemic than I intend to.)
>

No argument from me.

To be clear, I don't think that trap to userspace should block on eBPF
at all.  It was just a coincidence that both patches showed up around
the same time, and if they actually work well together, then it could
make sense to combine them.  But if trap to userspace ends up working
perfectly without eBPF, that's fine too.


Re: [RFC 0/3] seccomp trap to userspace

2018-03-15 Thread Andy Lutomirski
On Thu, Mar 15, 2018 at 5:25 PM, Christian Brauner
 wrote:
> On Thu, Mar 15, 2018 at 05:11:32PM +, Andy Lutomirski wrote:
>> On Thu, Mar 15, 2018 at 5:05 PM, Serge E. Hallyn  wrote:
>> > Quoting Andy Lutomirski (l...@kernel.org):
>> >> On Thu, Mar 15, 2018 at 4:09 PM, Christian Brauner
>> >>  wrote:
>> >> > On Sun, Feb 04, 2018 at 11:49:43AM +0100, Tycho Andersen wrote:
>> >> >> Several months ago at Linux Plumber's, we had a discussion about 
>> >> >> adding a
>> >> >> feature to seccomp which would allow seccomp to trigger a notification 
>> >> >> for some
>> >> >> other process. Here's a draft of that feature.
>> >> >>
>> >> >> Patch 1 contains the bulk of it, patches 2 & 3 offer an alternative 
>> >> >> way to
>> >> >> acquire the fd that receives notifications via ptrace (the method in 
>> >> >> patch 1
>> >> >> poses some problems). Other suggestions for how to acquire one of 
>> >> >> these fds
>> >> >> would be welcome.
>> >> >>
>> >> >> Take a close look at the synchronization. I think I've got it right, 
>> >> >> but I
>> >> >> probably don't :)
>> >> >>
>> >> >> Thanks!
>> >> >>
>> >> >> Tycho Andersen (3):
>> >> >>   seccomp: add a return code to trap to userspace
>> >> >>   seccomp: hoist out filter resolving logic
>> >> >>   seccomp: add a way to get a listener fd from ptrace
>> >> >>
>> >> >>  arch/Kconfig  |   7 +
>> >> >>  include/linux/seccomp.h   |  14 +-
>> >> >>  include/uapi/linux/ptrace.h   |   1 +
>> >> >>  include/uapi/linux/seccomp.h  |  18 +-
>> >> >>  kernel/ptrace.c   |   4 +
>> >> >>  kernel/seccomp.c  | 467 
>> >> >> --
>> >> >>  tools/testing/selftests/seccomp/seccomp_bpf.c | 180 +-
>> >> >>  7 files changed, 653 insertions(+), 38 deletions(-)
>> >> >
>> >> > Hey,
>> >> >
>> >> > So, I've been following the discussion silently in the background and I
>> >> > see that it got sidetracked into seccomp + ebpf. While I can see that
>> >> > there is value in adding epbf support to seccomp I'd really like to see
>> >> > this decoupled from this patchset. Afaict, this patchset would just work
>> >> > fine without the ebpf portion (but I might be just have missed the
>> >> > point). So if possible I would like to see a second version of this with
>> >> > the comments accounted for and - if possible - have this up for merging
>> >> > independent of the ebpf patchset that's floating around.
>> >> >
>> >>
>> >> The issue is that it might be (and, then again, might not be) nicer to
>> >> to *synchronously* call out to the monitor in the filter.  eBPF can do
>> >> that very cleanly, whereas classic BPF can't.
>> >
>> > Hm, synchronously - that brings to mind a thought...  I should re-look at
>> > Tycho's patches first, but, if I'm in a container, start some syscall that
>> > gets trapped to userspace, then I hit ctrl-c.  I'd like to be able to have
>> > the handler be interrupted and have it return -EINTR.  Is that going to
>> > be possible with the synchronous approach?
>>
>> I think so, but it should be possible with the classic async approach
>> too.  The main issue is the difference between a classic filter like
>> this (pseudocode):
>>
>> if (nr == SYS_mount) return TRAP_TO_USERSPACE;
>>
>> and the eBPF variant:
>>
>> if (nr == SYS_mount) trap_to_userspace();
>>
>> I admit that it's still not 100% clear to me that the latter is
>> genuinely more useful than the former.
>
> We've just discussed this on irc and the fact that most problems can be
> addressed by interfaces we already have makes it questionable what ebpf
> brings to the game here. Especially since the discussion gave the
> impression that if ebpf ever makes it to seccomp it will basically be
> because it allows a nice implementation of the trap to userspace. If
> it's even unclear whether it is really the better choice for this task
> then we could consider to no try and make this patchset use it. (I
> probably sound way more polemic than I intend to.)
>

No argument from me.

To be clear, I don't think that trap to userspace should block on eBPF
at all.  It was just a coincidence that both patches showed up around
the same time, and if they actually work well together, then it could
make sense to combine them.  But if trap to userspace ends up working
perfectly without eBPF, that's fine too.


Re: [RFC 0/3] seccomp trap to userspace

2018-03-15 Thread Christian Brauner
On Thu, Mar 15, 2018 at 05:11:32PM +, Andy Lutomirski wrote:
> On Thu, Mar 15, 2018 at 5:05 PM, Serge E. Hallyn  wrote:
> > Quoting Andy Lutomirski (l...@kernel.org):
> >> On Thu, Mar 15, 2018 at 4:09 PM, Christian Brauner
> >>  wrote:
> >> > On Sun, Feb 04, 2018 at 11:49:43AM +0100, Tycho Andersen wrote:
> >> >> Several months ago at Linux Plumber's, we had a discussion about adding 
> >> >> a
> >> >> feature to seccomp which would allow seccomp to trigger a notification 
> >> >> for some
> >> >> other process. Here's a draft of that feature.
> >> >>
> >> >> Patch 1 contains the bulk of it, patches 2 & 3 offer an alternative way 
> >> >> to
> >> >> acquire the fd that receives notifications via ptrace (the method in 
> >> >> patch 1
> >> >> poses some problems). Other suggestions for how to acquire one of these 
> >> >> fds
> >> >> would be welcome.
> >> >>
> >> >> Take a close look at the synchronization. I think I've got it right, 
> >> >> but I
> >> >> probably don't :)
> >> >>
> >> >> Thanks!
> >> >>
> >> >> Tycho Andersen (3):
> >> >>   seccomp: add a return code to trap to userspace
> >> >>   seccomp: hoist out filter resolving logic
> >> >>   seccomp: add a way to get a listener fd from ptrace
> >> >>
> >> >>  arch/Kconfig  |   7 +
> >> >>  include/linux/seccomp.h   |  14 +-
> >> >>  include/uapi/linux/ptrace.h   |   1 +
> >> >>  include/uapi/linux/seccomp.h  |  18 +-
> >> >>  kernel/ptrace.c   |   4 +
> >> >>  kernel/seccomp.c  | 467 
> >> >> --
> >> >>  tools/testing/selftests/seccomp/seccomp_bpf.c | 180 +-
> >> >>  7 files changed, 653 insertions(+), 38 deletions(-)
> >> >
> >> > Hey,
> >> >
> >> > So, I've been following the discussion silently in the background and I
> >> > see that it got sidetracked into seccomp + ebpf. While I can see that
> >> > there is value in adding epbf support to seccomp I'd really like to see
> >> > this decoupled from this patchset. Afaict, this patchset would just work
> >> > fine without the ebpf portion (but I might be just have missed the
> >> > point). So if possible I would like to see a second version of this with
> >> > the comments accounted for and - if possible - have this up for merging
> >> > independent of the ebpf patchset that's floating around.
> >> >
> >>
> >> The issue is that it might be (and, then again, might not be) nicer to
> >> to *synchronously* call out to the monitor in the filter.  eBPF can do
> >> that very cleanly, whereas classic BPF can't.
> >
> > Hm, synchronously - that brings to mind a thought...  I should re-look at
> > Tycho's patches first, but, if I'm in a container, start some syscall that
> > gets trapped to userspace, then I hit ctrl-c.  I'd like to be able to have
> > the handler be interrupted and have it return -EINTR.  Is that going to
> > be possible with the synchronous approach?
> 
> I think so, but it should be possible with the classic async approach
> too.  The main issue is the difference between a classic filter like
> this (pseudocode):
> 
> if (nr == SYS_mount) return TRAP_TO_USERSPACE;
> 
> and the eBPF variant:
> 
> if (nr == SYS_mount) trap_to_userspace();
> 
> I admit that it's still not 100% clear to me that the latter is
> genuinely more useful than the former.

We've just discussed this on irc and the fact that most problems can be
addressed by interfaces we already have makes it questionable what ebpf
brings to the game here. Especially since the discussion gave the
impression that if ebpf ever makes it to seccomp it will basically be
because it allows a nice implementation of the trap to userspace. If
it's even unclear whether it is really the better choice for this task
then we could consider to no try and make this patchset use it. (I
probably sound way more polemic than I intend to.)

> 
> The case where I think the synchronous function call is a huge win is this 
> one:
> 
> if (nr  == SYS_mount) {
>   log("Someone called mount with args %lx\n", ...);
>   return RET_KILL;
> }
> 
> The idea being that the log message wouldn't show up in the kernel log
> -- it would get sent to the listener socket belonging to whoever
> created the filter, and that process could then go and log it
> properly.  This would work perfectly in containers and in totally
> unprivileged applications like Chromium.

Hm, that is a decent point but that's also a non-essential feature. I
also wonder if there's any reason to not simply extend it to use ebpf
later if seccomp every uses it?

Christian


Re: [RFC 0/3] seccomp trap to userspace

2018-03-15 Thread Christian Brauner
On Thu, Mar 15, 2018 at 05:11:32PM +, Andy Lutomirski wrote:
> On Thu, Mar 15, 2018 at 5:05 PM, Serge E. Hallyn  wrote:
> > Quoting Andy Lutomirski (l...@kernel.org):
> >> On Thu, Mar 15, 2018 at 4:09 PM, Christian Brauner
> >>  wrote:
> >> > On Sun, Feb 04, 2018 at 11:49:43AM +0100, Tycho Andersen wrote:
> >> >> Several months ago at Linux Plumber's, we had a discussion about adding 
> >> >> a
> >> >> feature to seccomp which would allow seccomp to trigger a notification 
> >> >> for some
> >> >> other process. Here's a draft of that feature.
> >> >>
> >> >> Patch 1 contains the bulk of it, patches 2 & 3 offer an alternative way 
> >> >> to
> >> >> acquire the fd that receives notifications via ptrace (the method in 
> >> >> patch 1
> >> >> poses some problems). Other suggestions for how to acquire one of these 
> >> >> fds
> >> >> would be welcome.
> >> >>
> >> >> Take a close look at the synchronization. I think I've got it right, 
> >> >> but I
> >> >> probably don't :)
> >> >>
> >> >> Thanks!
> >> >>
> >> >> Tycho Andersen (3):
> >> >>   seccomp: add a return code to trap to userspace
> >> >>   seccomp: hoist out filter resolving logic
> >> >>   seccomp: add a way to get a listener fd from ptrace
> >> >>
> >> >>  arch/Kconfig  |   7 +
> >> >>  include/linux/seccomp.h   |  14 +-
> >> >>  include/uapi/linux/ptrace.h   |   1 +
> >> >>  include/uapi/linux/seccomp.h  |  18 +-
> >> >>  kernel/ptrace.c   |   4 +
> >> >>  kernel/seccomp.c  | 467 
> >> >> --
> >> >>  tools/testing/selftests/seccomp/seccomp_bpf.c | 180 +-
> >> >>  7 files changed, 653 insertions(+), 38 deletions(-)
> >> >
> >> > Hey,
> >> >
> >> > So, I've been following the discussion silently in the background and I
> >> > see that it got sidetracked into seccomp + ebpf. While I can see that
> >> > there is value in adding epbf support to seccomp I'd really like to see
> >> > this decoupled from this patchset. Afaict, this patchset would just work
> >> > fine without the ebpf portion (but I might be just have missed the
> >> > point). So if possible I would like to see a second version of this with
> >> > the comments accounted for and - if possible - have this up for merging
> >> > independent of the ebpf patchset that's floating around.
> >> >
> >>
> >> The issue is that it might be (and, then again, might not be) nicer to
> >> to *synchronously* call out to the monitor in the filter.  eBPF can do
> >> that very cleanly, whereas classic BPF can't.
> >
> > Hm, synchronously - that brings to mind a thought...  I should re-look at
> > Tycho's patches first, but, if I'm in a container, start some syscall that
> > gets trapped to userspace, then I hit ctrl-c.  I'd like to be able to have
> > the handler be interrupted and have it return -EINTR.  Is that going to
> > be possible with the synchronous approach?
> 
> I think so, but it should be possible with the classic async approach
> too.  The main issue is the difference between a classic filter like
> this (pseudocode):
> 
> if (nr == SYS_mount) return TRAP_TO_USERSPACE;
> 
> and the eBPF variant:
> 
> if (nr == SYS_mount) trap_to_userspace();
> 
> I admit that it's still not 100% clear to me that the latter is
> genuinely more useful than the former.

We've just discussed this on irc and the fact that most problems can be
addressed by interfaces we already have makes it questionable what ebpf
brings to the game here. Especially since the discussion gave the
impression that if ebpf ever makes it to seccomp it will basically be
because it allows a nice implementation of the trap to userspace. If
it's even unclear whether it is really the better choice for this task
then we could consider to no try and make this patchset use it. (I
probably sound way more polemic than I intend to.)

> 
> The case where I think the synchronous function call is a huge win is this 
> one:
> 
> if (nr  == SYS_mount) {
>   log("Someone called mount with args %lx\n", ...);
>   return RET_KILL;
> }
> 
> The idea being that the log message wouldn't show up in the kernel log
> -- it would get sent to the listener socket belonging to whoever
> created the filter, and that process could then go and log it
> properly.  This would work perfectly in containers and in totally
> unprivileged applications like Chromium.

Hm, that is a decent point but that's also a non-essential feature. I
also wonder if there's any reason to not simply extend it to use ebpf
later if seccomp every uses it?

Christian


Re: [RFC 0/3] seccomp trap to userspace

2018-03-15 Thread Andy Lutomirski
On Thu, Mar 15, 2018 at 5:05 PM, Serge E. Hallyn  wrote:
> Quoting Andy Lutomirski (l...@kernel.org):
>> On Thu, Mar 15, 2018 at 4:09 PM, Christian Brauner
>>  wrote:
>> > On Sun, Feb 04, 2018 at 11:49:43AM +0100, Tycho Andersen wrote:
>> >> Several months ago at Linux Plumber's, we had a discussion about adding a
>> >> feature to seccomp which would allow seccomp to trigger a notification 
>> >> for some
>> >> other process. Here's a draft of that feature.
>> >>
>> >> Patch 1 contains the bulk of it, patches 2 & 3 offer an alternative way to
>> >> acquire the fd that receives notifications via ptrace (the method in 
>> >> patch 1
>> >> poses some problems). Other suggestions for how to acquire one of these 
>> >> fds
>> >> would be welcome.
>> >>
>> >> Take a close look at the synchronization. I think I've got it right, but I
>> >> probably don't :)
>> >>
>> >> Thanks!
>> >>
>> >> Tycho Andersen (3):
>> >>   seccomp: add a return code to trap to userspace
>> >>   seccomp: hoist out filter resolving logic
>> >>   seccomp: add a way to get a listener fd from ptrace
>> >>
>> >>  arch/Kconfig  |   7 +
>> >>  include/linux/seccomp.h   |  14 +-
>> >>  include/uapi/linux/ptrace.h   |   1 +
>> >>  include/uapi/linux/seccomp.h  |  18 +-
>> >>  kernel/ptrace.c   |   4 +
>> >>  kernel/seccomp.c  | 467 
>> >> --
>> >>  tools/testing/selftests/seccomp/seccomp_bpf.c | 180 +-
>> >>  7 files changed, 653 insertions(+), 38 deletions(-)
>> >
>> > Hey,
>> >
>> > So, I've been following the discussion silently in the background and I
>> > see that it got sidetracked into seccomp + ebpf. While I can see that
>> > there is value in adding epbf support to seccomp I'd really like to see
>> > this decoupled from this patchset. Afaict, this patchset would just work
>> > fine without the ebpf portion (but I might be just have missed the
>> > point). So if possible I would like to see a second version of this with
>> > the comments accounted for and - if possible - have this up for merging
>> > independent of the ebpf patchset that's floating around.
>> >
>>
>> The issue is that it might be (and, then again, might not be) nicer to
>> to *synchronously* call out to the monitor in the filter.  eBPF can do
>> that very cleanly, whereas classic BPF can't.
>
> Hm, synchronously - that brings to mind a thought...  I should re-look at
> Tycho's patches first, but, if I'm in a container, start some syscall that
> gets trapped to userspace, then I hit ctrl-c.  I'd like to be able to have
> the handler be interrupted and have it return -EINTR.  Is that going to
> be possible with the synchronous approach?

I think so, but it should be possible with the classic async approach
too.  The main issue is the difference between a classic filter like
this (pseudocode):

if (nr == SYS_mount) return TRAP_TO_USERSPACE;

and the eBPF variant:

if (nr == SYS_mount) trap_to_userspace();

I admit that it's still not 100% clear to me that the latter is
genuinely more useful than the former.

The case where I think the synchronous function call is a huge win is this one:

if (nr  == SYS_mount) {
  log("Someone called mount with args %lx\n", ...);
  return RET_KILL;
}

The idea being that the log message wouldn't show up in the kernel log
-- it would get sent to the listener socket belonging to whoever
created the filter, and that process could then go and log it
properly.  This would work perfectly in containers and in totally
unprivileged applications like Chromium.


Re: [RFC 0/3] seccomp trap to userspace

2018-03-15 Thread Andy Lutomirski
On Thu, Mar 15, 2018 at 5:05 PM, Serge E. Hallyn  wrote:
> Quoting Andy Lutomirski (l...@kernel.org):
>> On Thu, Mar 15, 2018 at 4:09 PM, Christian Brauner
>>  wrote:
>> > On Sun, Feb 04, 2018 at 11:49:43AM +0100, Tycho Andersen wrote:
>> >> Several months ago at Linux Plumber's, we had a discussion about adding a
>> >> feature to seccomp which would allow seccomp to trigger a notification 
>> >> for some
>> >> other process. Here's a draft of that feature.
>> >>
>> >> Patch 1 contains the bulk of it, patches 2 & 3 offer an alternative way to
>> >> acquire the fd that receives notifications via ptrace (the method in 
>> >> patch 1
>> >> poses some problems). Other suggestions for how to acquire one of these 
>> >> fds
>> >> would be welcome.
>> >>
>> >> Take a close look at the synchronization. I think I've got it right, but I
>> >> probably don't :)
>> >>
>> >> Thanks!
>> >>
>> >> Tycho Andersen (3):
>> >>   seccomp: add a return code to trap to userspace
>> >>   seccomp: hoist out filter resolving logic
>> >>   seccomp: add a way to get a listener fd from ptrace
>> >>
>> >>  arch/Kconfig  |   7 +
>> >>  include/linux/seccomp.h   |  14 +-
>> >>  include/uapi/linux/ptrace.h   |   1 +
>> >>  include/uapi/linux/seccomp.h  |  18 +-
>> >>  kernel/ptrace.c   |   4 +
>> >>  kernel/seccomp.c  | 467 
>> >> --
>> >>  tools/testing/selftests/seccomp/seccomp_bpf.c | 180 +-
>> >>  7 files changed, 653 insertions(+), 38 deletions(-)
>> >
>> > Hey,
>> >
>> > So, I've been following the discussion silently in the background and I
>> > see that it got sidetracked into seccomp + ebpf. While I can see that
>> > there is value in adding epbf support to seccomp I'd really like to see
>> > this decoupled from this patchset. Afaict, this patchset would just work
>> > fine without the ebpf portion (but I might be just have missed the
>> > point). So if possible I would like to see a second version of this with
>> > the comments accounted for and - if possible - have this up for merging
>> > independent of the ebpf patchset that's floating around.
>> >
>>
>> The issue is that it might be (and, then again, might not be) nicer to
>> to *synchronously* call out to the monitor in the filter.  eBPF can do
>> that very cleanly, whereas classic BPF can't.
>
> Hm, synchronously - that brings to mind a thought...  I should re-look at
> Tycho's patches first, but, if I'm in a container, start some syscall that
> gets trapped to userspace, then I hit ctrl-c.  I'd like to be able to have
> the handler be interrupted and have it return -EINTR.  Is that going to
> be possible with the synchronous approach?

I think so, but it should be possible with the classic async approach
too.  The main issue is the difference between a classic filter like
this (pseudocode):

if (nr == SYS_mount) return TRAP_TO_USERSPACE;

and the eBPF variant:

if (nr == SYS_mount) trap_to_userspace();

I admit that it's still not 100% clear to me that the latter is
genuinely more useful than the former.

The case where I think the synchronous function call is a huge win is this one:

if (nr  == SYS_mount) {
  log("Someone called mount with args %lx\n", ...);
  return RET_KILL;
}

The idea being that the log message wouldn't show up in the kernel log
-- it would get sent to the listener socket belonging to whoever
created the filter, and that process could then go and log it
properly.  This would work perfectly in containers and in totally
unprivileged applications like Chromium.


Re: [RFC 0/3] seccomp trap to userspace

2018-03-15 Thread Serge E. Hallyn
Quoting Andy Lutomirski (l...@kernel.org):
> On Thu, Mar 15, 2018 at 4:09 PM, Christian Brauner
>  wrote:
> > On Sun, Feb 04, 2018 at 11:49:43AM +0100, Tycho Andersen wrote:
> >> Several months ago at Linux Plumber's, we had a discussion about adding a
> >> feature to seccomp which would allow seccomp to trigger a notification for 
> >> some
> >> other process. Here's a draft of that feature.
> >>
> >> Patch 1 contains the bulk of it, patches 2 & 3 offer an alternative way to
> >> acquire the fd that receives notifications via ptrace (the method in patch 
> >> 1
> >> poses some problems). Other suggestions for how to acquire one of these fds
> >> would be welcome.
> >>
> >> Take a close look at the synchronization. I think I've got it right, but I
> >> probably don't :)
> >>
> >> Thanks!
> >>
> >> Tycho Andersen (3):
> >>   seccomp: add a return code to trap to userspace
> >>   seccomp: hoist out filter resolving logic
> >>   seccomp: add a way to get a listener fd from ptrace
> >>
> >>  arch/Kconfig  |   7 +
> >>  include/linux/seccomp.h   |  14 +-
> >>  include/uapi/linux/ptrace.h   |   1 +
> >>  include/uapi/linux/seccomp.h  |  18 +-
> >>  kernel/ptrace.c   |   4 +
> >>  kernel/seccomp.c  | 467 
> >> --
> >>  tools/testing/selftests/seccomp/seccomp_bpf.c | 180 +-
> >>  7 files changed, 653 insertions(+), 38 deletions(-)
> >
> > Hey,
> >
> > So, I've been following the discussion silently in the background and I
> > see that it got sidetracked into seccomp + ebpf. While I can see that
> > there is value in adding epbf support to seccomp I'd really like to see
> > this decoupled from this patchset. Afaict, this patchset would just work
> > fine without the ebpf portion (but I might be just have missed the
> > point). So if possible I would like to see a second version of this with
> > the comments accounted for and - if possible - have this up for merging
> > independent of the ebpf patchset that's floating around.
> >
> 
> The issue is that it might be (and, then again, might not be) nicer to
> to *synchronously* call out to the monitor in the filter.  eBPF can do
> that very cleanly, whereas classic BPF can't.

Hm, synchronously - that brings to mind a thought...  I should re-look at
Tycho's patches first, but, if I'm in a container, start some syscall that
gets trapped to userspace, then I hit ctrl-c.  I'd like to be able to have
the handler be interrupted and have it return -EINTR.  Is that going to
be possible with the synchronous approach?


Re: [RFC 0/3] seccomp trap to userspace

2018-03-15 Thread Serge E. Hallyn
Quoting Andy Lutomirski (l...@kernel.org):
> On Thu, Mar 15, 2018 at 4:09 PM, Christian Brauner
>  wrote:
> > On Sun, Feb 04, 2018 at 11:49:43AM +0100, Tycho Andersen wrote:
> >> Several months ago at Linux Plumber's, we had a discussion about adding a
> >> feature to seccomp which would allow seccomp to trigger a notification for 
> >> some
> >> other process. Here's a draft of that feature.
> >>
> >> Patch 1 contains the bulk of it, patches 2 & 3 offer an alternative way to
> >> acquire the fd that receives notifications via ptrace (the method in patch 
> >> 1
> >> poses some problems). Other suggestions for how to acquire one of these fds
> >> would be welcome.
> >>
> >> Take a close look at the synchronization. I think I've got it right, but I
> >> probably don't :)
> >>
> >> Thanks!
> >>
> >> Tycho Andersen (3):
> >>   seccomp: add a return code to trap to userspace
> >>   seccomp: hoist out filter resolving logic
> >>   seccomp: add a way to get a listener fd from ptrace
> >>
> >>  arch/Kconfig  |   7 +
> >>  include/linux/seccomp.h   |  14 +-
> >>  include/uapi/linux/ptrace.h   |   1 +
> >>  include/uapi/linux/seccomp.h  |  18 +-
> >>  kernel/ptrace.c   |   4 +
> >>  kernel/seccomp.c  | 467 
> >> --
> >>  tools/testing/selftests/seccomp/seccomp_bpf.c | 180 +-
> >>  7 files changed, 653 insertions(+), 38 deletions(-)
> >
> > Hey,
> >
> > So, I've been following the discussion silently in the background and I
> > see that it got sidetracked into seccomp + ebpf. While I can see that
> > there is value in adding epbf support to seccomp I'd really like to see
> > this decoupled from this patchset. Afaict, this patchset would just work
> > fine without the ebpf portion (but I might be just have missed the
> > point). So if possible I would like to see a second version of this with
> > the comments accounted for and - if possible - have this up for merging
> > independent of the ebpf patchset that's floating around.
> >
> 
> The issue is that it might be (and, then again, might not be) nicer to
> to *synchronously* call out to the monitor in the filter.  eBPF can do
> that very cleanly, whereas classic BPF can't.

Hm, synchronously - that brings to mind a thought...  I should re-look at
Tycho's patches first, but, if I'm in a container, start some syscall that
gets trapped to userspace, then I hit ctrl-c.  I'd like to be able to have
the handler be interrupted and have it return -EINTR.  Is that going to
be possible with the synchronous approach?


Re: [RFC 0/3] seccomp trap to userspace

2018-03-15 Thread Andy Lutomirski
On Thu, Mar 15, 2018 at 4:09 PM, Christian Brauner
 wrote:
> On Sun, Feb 04, 2018 at 11:49:43AM +0100, Tycho Andersen wrote:
>> Several months ago at Linux Plumber's, we had a discussion about adding a
>> feature to seccomp which would allow seccomp to trigger a notification for 
>> some
>> other process. Here's a draft of that feature.
>>
>> Patch 1 contains the bulk of it, patches 2 & 3 offer an alternative way to
>> acquire the fd that receives notifications via ptrace (the method in patch 1
>> poses some problems). Other suggestions for how to acquire one of these fds
>> would be welcome.
>>
>> Take a close look at the synchronization. I think I've got it right, but I
>> probably don't :)
>>
>> Thanks!
>>
>> Tycho Andersen (3):
>>   seccomp: add a return code to trap to userspace
>>   seccomp: hoist out filter resolving logic
>>   seccomp: add a way to get a listener fd from ptrace
>>
>>  arch/Kconfig  |   7 +
>>  include/linux/seccomp.h   |  14 +-
>>  include/uapi/linux/ptrace.h   |   1 +
>>  include/uapi/linux/seccomp.h  |  18 +-
>>  kernel/ptrace.c   |   4 +
>>  kernel/seccomp.c  | 467 
>> --
>>  tools/testing/selftests/seccomp/seccomp_bpf.c | 180 +-
>>  7 files changed, 653 insertions(+), 38 deletions(-)
>
> Hey,
>
> So, I've been following the discussion silently in the background and I
> see that it got sidetracked into seccomp + ebpf. While I can see that
> there is value in adding epbf support to seccomp I'd really like to see
> this decoupled from this patchset. Afaict, this patchset would just work
> fine without the ebpf portion (but I might be just have missed the
> point). So if possible I would like to see a second version of this with
> the comments accounted for and - if possible - have this up for merging
> independent of the ebpf patchset that's floating around.
>

The issue is that it might be (and, then again, might not be) nicer to
to *synchronously* call out to the monitor in the filter.  eBPF can do
that very cleanly, whereas classic BPF can't.


Re: [RFC 0/3] seccomp trap to userspace

2018-03-15 Thread Andy Lutomirski
On Thu, Mar 15, 2018 at 4:09 PM, Christian Brauner
 wrote:
> On Sun, Feb 04, 2018 at 11:49:43AM +0100, Tycho Andersen wrote:
>> Several months ago at Linux Plumber's, we had a discussion about adding a
>> feature to seccomp which would allow seccomp to trigger a notification for 
>> some
>> other process. Here's a draft of that feature.
>>
>> Patch 1 contains the bulk of it, patches 2 & 3 offer an alternative way to
>> acquire the fd that receives notifications via ptrace (the method in patch 1
>> poses some problems). Other suggestions for how to acquire one of these fds
>> would be welcome.
>>
>> Take a close look at the synchronization. I think I've got it right, but I
>> probably don't :)
>>
>> Thanks!
>>
>> Tycho Andersen (3):
>>   seccomp: add a return code to trap to userspace
>>   seccomp: hoist out filter resolving logic
>>   seccomp: add a way to get a listener fd from ptrace
>>
>>  arch/Kconfig  |   7 +
>>  include/linux/seccomp.h   |  14 +-
>>  include/uapi/linux/ptrace.h   |   1 +
>>  include/uapi/linux/seccomp.h  |  18 +-
>>  kernel/ptrace.c   |   4 +
>>  kernel/seccomp.c  | 467 
>> --
>>  tools/testing/selftests/seccomp/seccomp_bpf.c | 180 +-
>>  7 files changed, 653 insertions(+), 38 deletions(-)
>
> Hey,
>
> So, I've been following the discussion silently in the background and I
> see that it got sidetracked into seccomp + ebpf. While I can see that
> there is value in adding epbf support to seccomp I'd really like to see
> this decoupled from this patchset. Afaict, this patchset would just work
> fine without the ebpf portion (but I might be just have missed the
> point). So if possible I would like to see a second version of this with
> the comments accounted for and - if possible - have this up for merging
> independent of the ebpf patchset that's floating around.
>

The issue is that it might be (and, then again, might not be) nicer to
to *synchronously* call out to the monitor in the filter.  eBPF can do
that very cleanly, whereas classic BPF can't.


Re: [RFC 0/3] seccomp trap to userspace

2018-03-15 Thread Christian Brauner
On Sun, Feb 04, 2018 at 11:49:43AM +0100, Tycho Andersen wrote:
> Several months ago at Linux Plumber's, we had a discussion about adding a
> feature to seccomp which would allow seccomp to trigger a notification for 
> some
> other process. Here's a draft of that feature.
> 
> Patch 1 contains the bulk of it, patches 2 & 3 offer an alternative way to
> acquire the fd that receives notifications via ptrace (the method in patch 1
> poses some problems). Other suggestions for how to acquire one of these fds
> would be welcome.
> 
> Take a close look at the synchronization. I think I've got it right, but I
> probably don't :)
> 
> Thanks!
> 
> Tycho Andersen (3):
>   seccomp: add a return code to trap to userspace
>   seccomp: hoist out filter resolving logic
>   seccomp: add a way to get a listener fd from ptrace
> 
>  arch/Kconfig  |   7 +
>  include/linux/seccomp.h   |  14 +-
>  include/uapi/linux/ptrace.h   |   1 +
>  include/uapi/linux/seccomp.h  |  18 +-
>  kernel/ptrace.c   |   4 +
>  kernel/seccomp.c  | 467 
> --
>  tools/testing/selftests/seccomp/seccomp_bpf.c | 180 +-
>  7 files changed, 653 insertions(+), 38 deletions(-)

Hey,

So, I've been following the discussion silently in the background and I
see that it got sidetracked into seccomp + ebpf. While I can see that
there is value in adding epbf support to seccomp I'd really like to see
this decoupled from this patchset. Afaict, this patchset would just work
fine without the ebpf portion (but I might be just have missed the
point). So if possible I would like to see a second version of this with
the comments accounted for and - if possible - have this up for merging
independent of the ebpf patchset that's floating around.

Christian


Re: [RFC 0/3] seccomp trap to userspace

2018-03-15 Thread Christian Brauner
On Sun, Feb 04, 2018 at 11:49:43AM +0100, Tycho Andersen wrote:
> Several months ago at Linux Plumber's, we had a discussion about adding a
> feature to seccomp which would allow seccomp to trigger a notification for 
> some
> other process. Here's a draft of that feature.
> 
> Patch 1 contains the bulk of it, patches 2 & 3 offer an alternative way to
> acquire the fd that receives notifications via ptrace (the method in patch 1
> poses some problems). Other suggestions for how to acquire one of these fds
> would be welcome.
> 
> Take a close look at the synchronization. I think I've got it right, but I
> probably don't :)
> 
> Thanks!
> 
> Tycho Andersen (3):
>   seccomp: add a return code to trap to userspace
>   seccomp: hoist out filter resolving logic
>   seccomp: add a way to get a listener fd from ptrace
> 
>  arch/Kconfig  |   7 +
>  include/linux/seccomp.h   |  14 +-
>  include/uapi/linux/ptrace.h   |   1 +
>  include/uapi/linux/seccomp.h  |  18 +-
>  kernel/ptrace.c   |   4 +
>  kernel/seccomp.c  | 467 
> --
>  tools/testing/selftests/seccomp/seccomp_bpf.c | 180 +-
>  7 files changed, 653 insertions(+), 38 deletions(-)

Hey,

So, I've been following the discussion silently in the background and I
see that it got sidetracked into seccomp + ebpf. While I can see that
there is value in adding epbf support to seccomp I'd really like to see
this decoupled from this patchset. Afaict, this patchset would just work
fine without the ebpf portion (but I might be just have missed the
point). So if possible I would like to see a second version of this with
the comments accounted for and - if possible - have this up for merging
independent of the ebpf patchset that's floating around.

Christian


[RFC 0/3] seccomp trap to userspace

2018-02-04 Thread Tycho Andersen
Several months ago at Linux Plumber's, we had a discussion about adding a
feature to seccomp which would allow seccomp to trigger a notification for some
other process. Here's a draft of that feature.

Patch 1 contains the bulk of it, patches 2 & 3 offer an alternative way to
acquire the fd that receives notifications via ptrace (the method in patch 1
poses some problems). Other suggestions for how to acquire one of these fds
would be welcome.

Take a close look at the synchronization. I think I've got it right, but I
probably don't :)

Thanks!

Tycho Andersen (3):
  seccomp: add a return code to trap to userspace
  seccomp: hoist out filter resolving logic
  seccomp: add a way to get a listener fd from ptrace

 arch/Kconfig  |   7 +
 include/linux/seccomp.h   |  14 +-
 include/uapi/linux/ptrace.h   |   1 +
 include/uapi/linux/seccomp.h  |  18 +-
 kernel/ptrace.c   |   4 +
 kernel/seccomp.c  | 467 --
 tools/testing/selftests/seccomp/seccomp_bpf.c | 180 +-
 7 files changed, 653 insertions(+), 38 deletions(-)

-- 
2.14.1



[RFC 0/3] seccomp trap to userspace

2018-02-04 Thread Tycho Andersen
Several months ago at Linux Plumber's, we had a discussion about adding a
feature to seccomp which would allow seccomp to trigger a notification for some
other process. Here's a draft of that feature.

Patch 1 contains the bulk of it, patches 2 & 3 offer an alternative way to
acquire the fd that receives notifications via ptrace (the method in patch 1
poses some problems). Other suggestions for how to acquire one of these fds
would be welcome.

Take a close look at the synchronization. I think I've got it right, but I
probably don't :)

Thanks!

Tycho Andersen (3):
  seccomp: add a return code to trap to userspace
  seccomp: hoist out filter resolving logic
  seccomp: add a way to get a listener fd from ptrace

 arch/Kconfig  |   7 +
 include/linux/seccomp.h   |  14 +-
 include/uapi/linux/ptrace.h   |   1 +
 include/uapi/linux/seccomp.h  |  18 +-
 kernel/ptrace.c   |   4 +
 kernel/seccomp.c  | 467 --
 tools/testing/selftests/seccomp/seccomp_bpf.c | 180 +-
 7 files changed, 653 insertions(+), 38 deletions(-)

-- 
2.14.1