Re: For review: seccomp_user_notif(2) manual page [v2]

2020-11-02 Thread Jann Horn
On Mon, Nov 2, 2020 at 8:50 PM Sargun Dhillon  wrote:
> On Mon, Nov 2, 2020 at 11:45 AM Michael Kerrisk (man-pages)
>  wrote:
> >Caveats regarding blocking system calls
> >Suppose that the target performs a blocking system call (e.g.,
> >accept(2)) that the supervisor should handle.  The supervisor
> >might then in turn execute the same blocking system call.
> >
> >In this scenario, it is important to note that if the target's
> >system call is now interrupted by a signal, the supervisor is not
> >informed of this.  If the supervisor does not take suitable steps
> >to actively discover that the target's system call has been
> >canceled, various difficulties can occur.  Taking the example of
> >accept(2), the supervisor might remain blocked in its accept(2)
> >holding a port number that the target (which, after the
> >interruption by the signal handler, perhaps closed  its listening
> >socket) might expect to be able to reuse in a bind(2) call.
> >
> >Therefore, when the supervisor wishes to emulate a blocking system
> >call, it must do so in such a way that it gets informed if the
> >target's system call is interrupted by a signal handler.  For
> >example, if the supervisor itself executes the same blocking
> >system call, then it could employ a separate thread that uses the
> >SECCOMP_IOCTL_NOTIF_ID_VALID operation to check if the target is
> >still blocked in its system call.  Alternatively, in the accept(2)
> >example, the supervisor might use poll(2) to monitor both the
> >notification file descriptor (so as as to discover when the
> >target's accept(2) call has been interrupted) and the listening
> >file descriptor (so as to know when a connection is available).
> >
> >If the target's system call is interrupted, the supervisor must
> >take care to release resources (e.g., file descriptors) that it
> >acquired on behalf of the target.
> >
> > Does that seem okay?
> >
> This is far clearer than my explanation. The one thing is that *just*
> poll is not good enough, you would poll, with some timeout, and when
> that timeout is hit, check if all the current notifications are valid,
> as poll isn't woken up when an in progress notification goes off
> AFAIK.

Arguably that's so terrible that it qualifies for being in the BUGS
section of the manpage.

If you want this to be fixed properly, I recommend that someone
implements my proposal from
,
unless you can come up with something better.


Re: For review: seccomp_user_notif(2) manual page [v2]

2020-11-02 Thread Sargun Dhillon
On Mon, Nov 2, 2020 at 11:45 AM Michael Kerrisk (man-pages)
 wrote:
>
> Hello Sargun,
>
> Thanks for your reply!
>
> On 11/2/20 9:07 AM, Sargun Dhillon wrote:
> > On Sat, Oct 31, 2020 at 9:27 AM Michael Kerrisk (man-pages)
> >  wrote:
> >>
> >> Hello Sargun,
> >>
> >> Thanks for your reply.
> >>
> >> On 10/30/20 9:27 PM, Sargun Dhillon wrote:
> >>> On Thu, Oct 29, 2020 at 09:37:21PM +0100, Michael Kerrisk (man-pages)
> >>> wrote:
> >>
> >> [...]
> >>
> > I think I commented in another thread somewhere that the
> > supervisor is not notified if the syscall is preempted. Therefore
> > if it is performing a preemptible, long-running syscall, you need
> > to poll SECCOMP_IOCTL_NOTIF_ID_VALID in the background, otherwise
> > you can end up in a bad situation -- like leaking resources, or
> > holding on to file descriptors after the program under
> > supervision has intended to release them.
> 
>  It's been a long day, and I'm not sure I reallu understand this.
>  Could you outline the scnario in more detail?
> 
> >>> S: Sets up filter + interception for accept T: socket(AF_INET,
> >>> SOCK_STREAM, 0) = 7 T: bind(7, {127.0.0.1, }, ..) T: listen(7,
> >>> 10) T: pidfd_getfd(T, 7) = 7 # For the sake of discussion.
> >>
> >> Presumably, the preceding line should have been:
> >>
> >> S: pidfd_getfd(T, 7) = 7 # For the sake of discussion.
> >> (s/T:/S:/)
> >>
> >> right?
> >
> > Right.
> >>
> >>
> >>> T: accept(7, ...) S: Intercepts accept S: Does accept in background
> >>> T: Receives signal, and accept(...) responds in EINTR T: close(7) S:
> >>> Still running accept(7, ), holding port , so if now T
> >>> retries to bind to port , things fail.
> >>
> >> Okay -- I understand. Presumably the solution here is not to
> >> block in accept(), but rather to use poll() to monitor both the
> >> notification FD and the listening socket FD?
> >>
> > You need to have some kind of mechanism to periodically check
> > if the notification is still alive, and preempt the accept. It doesn't
> > matter how exactly you "background" the accept (threads, or
> > O_NONBLOCK + epoll).
> >
> > The thing is you need to make sure that when the process
> > cancels a syscall, you need to release the resources you
> > may have acquired on its behalf or bad things can happen.
> >
>
> Got it. I added the following text:
>
>Caveats regarding blocking system calls
>Suppose that the target performs a blocking system call (e.g.,
>accept(2)) that the supervisor should handle.  The supervisor
>might then in turn execute the same blocking system call.
>
>In this scenario, it is important to note that if the target's
>system call is now interrupted by a signal, the supervisor is not
>informed of this.  If the supervisor does not take suitable steps
>to actively discover that the target's system call has been
>canceled, various difficulties can occur.  Taking the example of
>accept(2), the supervisor might remain blocked in its accept(2)
>holding a port number that the target (which, after the
>interruption by the signal handler, perhaps closed  its listening
>socket) might expect to be able to reuse in a bind(2) call.
>
>Therefore, when the supervisor wishes to emulate a blocking system
>call, it must do so in such a way that it gets informed if the
>target's system call is interrupted by a signal handler.  For
>example, if the supervisor itself executes the same blocking
>system call, then it could employ a separate thread that uses the
>SECCOMP_IOCTL_NOTIF_ID_VALID operation to check if the target is
>still blocked in its system call.  Alternatively, in the accept(2)
>example, the supervisor might use poll(2) to monitor both the
>notification file descriptor (so as as to discover when the
>target's accept(2) call has been interrupted) and the listening
>file descriptor (so as to know when a connection is available).
>
>If the target's system call is interrupted, the supervisor must
>take care to release resources (e.g., file descriptors) that it
>acquired on behalf of the target.
>
> Does that seem okay?
>
This is far clearer than my explanation. The one thing is that *just*
poll is not good enough, you would poll, with some timeout, and when
that timeout is hit, check if all the current notifications are valid,
as poll isn't woken up when an in progress notification goes off
AFAIK.

> > ENOENT The cookie number is not valid. This can happen if a
> > response has already been sent, or if the syscall was
> > interrupted
> >
> > EBADF If the file descriptor specified in srcfd is invalid, or if
> > the fd is out of range of the destination program.
> 
>  The piece "or if the fd is out of range of the destination program"
>  is not clear to me. Can you say 

Re: For review: seccomp_user_notif(2) manual page [v2]

2020-11-02 Thread Michael Kerrisk (man-pages)
Hello Sargun,

Thanks for your reply!

On 11/2/20 9:07 AM, Sargun Dhillon wrote:
> On Sat, Oct 31, 2020 at 9:27 AM Michael Kerrisk (man-pages)
>  wrote:
>>
>> Hello Sargun,
>>
>> Thanks for your reply.
>>
>> On 10/30/20 9:27 PM, Sargun Dhillon wrote:
>>> On Thu, Oct 29, 2020 at 09:37:21PM +0100, Michael Kerrisk (man-pages)
>>> wrote:
>>
>> [...]
>>
> I think I commented in another thread somewhere that the
> supervisor is not notified if the syscall is preempted. Therefore
> if it is performing a preemptible, long-running syscall, you need
> to poll SECCOMP_IOCTL_NOTIF_ID_VALID in the background, otherwise
> you can end up in a bad situation -- like leaking resources, or
> holding on to file descriptors after the program under
> supervision has intended to release them.

 It's been a long day, and I'm not sure I reallu understand this.
 Could you outline the scnario in more detail?

>>> S: Sets up filter + interception for accept T: socket(AF_INET,
>>> SOCK_STREAM, 0) = 7 T: bind(7, {127.0.0.1, }, ..) T: listen(7,
>>> 10) T: pidfd_getfd(T, 7) = 7 # For the sake of discussion.
>>
>> Presumably, the preceding line should have been:
>>
>> S: pidfd_getfd(T, 7) = 7 # For the sake of discussion.
>> (s/T:/S:/)
>>
>> right?
> 
> Right.
>>
>>
>>> T: accept(7, ...) S: Intercepts accept S: Does accept in background
>>> T: Receives signal, and accept(...) responds in EINTR T: close(7) S:
>>> Still running accept(7, ), holding port , so if now T
>>> retries to bind to port , things fail.
>>
>> Okay -- I understand. Presumably the solution here is not to
>> block in accept(), but rather to use poll() to monitor both the
>> notification FD and the listening socket FD?
>>
> You need to have some kind of mechanism to periodically check
> if the notification is still alive, and preempt the accept. It doesn't
> matter how exactly you "background" the accept (threads, or
> O_NONBLOCK + epoll).
> 
> The thing is you need to make sure that when the process
> cancels a syscall, you need to release the resources you
> may have acquired on its behalf or bad things can happen.
> 

Got it. I added the following text:

   Caveats regarding blocking system calls
   Suppose that the target performs a blocking system call (e.g.,
   accept(2)) that the supervisor should handle.  The supervisor
   might then in turn execute the same blocking system call.

   In this scenario, it is important to note that if the target's
   system call is now interrupted by a signal, the supervisor is not
   informed of this.  If the supervisor does not take suitable steps
   to actively discover that the target's system call has been
   canceled, various difficulties can occur.  Taking the example of
   accept(2), the supervisor might remain blocked in its accept(2)
   holding a port number that the target (which, after the
   interruption by the signal handler, perhaps closed  its listening
   socket) might expect to be able to reuse in a bind(2) call.

   Therefore, when the supervisor wishes to emulate a blocking system
   call, it must do so in such a way that it gets informed if the
   target's system call is interrupted by a signal handler.  For
   example, if the supervisor itself executes the same blocking
   system call, then it could employ a separate thread that uses the
   SECCOMP_IOCTL_NOTIF_ID_VALID operation to check if the target is
   still blocked in its system call.  Alternatively, in the accept(2)
   example, the supervisor might use poll(2) to monitor both the
   notification file descriptor (so as as to discover when the
   target's accept(2) call has been interrupted) and the listening
   file descriptor (so as to know when a connection is available).

   If the target's system call is interrupted, the supervisor must
   take care to release resources (e.g., file descriptors) that it
   acquired on behalf of the target.

Does that seem okay?

> ENOENT The cookie number is not valid. This can happen if a
> response has already been sent, or if the syscall was
> interrupted
>
> EBADF If the file descriptor specified in srcfd is invalid, or if
> the fd is out of range of the destination program.

 The piece "or if the fd is out of range of the destination program"
 is not clear to me. Can you say some more please.

>>>
>>> IIRC the maximum fd range is specific in proc by some sysctl named
>>> nr_open. It's also evaluated against RLIMITs, and nr_max.
>>>
>>> If nr-open (maximum fds open per process, iiirc) is 1000, even if 10
>>> FDs are open, it wont work if newfd is 1001.
>>
>> Actually, the relevant limit seems to be just the RLIMIT_NOFILE
>> resource limit at least in my reading of fs/file.c::replace_fd().
>> So I made the text
>>
>>   EBADF  Allocating the file descriptor in the target would
>>  

Re: For review: seccomp_user_notif(2) manual page [v2]

2020-11-02 Thread Jann Horn
On Sat, Oct 31, 2020 at 9:51 AM Michael Kerrisk (man-pages)
 wrote:
> On 10/30/20 8:20 PM, Jann Horn wrote:
> > On Thu, Oct 29, 2020 at 8:14 PM Michael Kerrisk (man-pages)
> >  wrote:
> >> On 10/29/20 2:42 AM, Jann Horn wrote:
> >>> As discussed at
> >>> ,
> >>> we need to re-check checkNotificationIdIsValid() after reading remote
> >>> memory but before using the read value in any way. Otherwise, the
> >>> syscall could in the meantime get interrupted by a signal handler, the
> >>> signal handler could return, and then the function that performed the
> >>> syscall could free() allocations or return (thereby freeing buffers on
> >>> the stack).
> >>>
> >>> In essence, this pread() is (unavoidably) a potential use-after-free
> >>> read; and to make that not have any security impact, we need to check
> >>> whether UAF read occurred before using the read value. This should
> >>> probably be called out elsewhere in the manpage, too...
> >>>
> >>> Now, of course, **reading** is the easy case. The difficult case is if
> >>> we have to **write** to the remote process... because then we can't
> >>> play games like that. If we write data to a freed pointer, we're
> >>> screwed, that's it. (And for somewhat unrelated bonus fun, consider
> >>> that /proc/$pid/mem is originally intended for process debugging,
> >>> including installing breakpoints, and will therefore happily write
> >>> over "readonly" private mappings, such as typical mappings of
> >>> executable code.)
> >>>
> >>> So, h... I guess if anyone wants to actually write memory back to
> >>> the target process, we'd better come up with some dedicated API for
> >>> that, using an ioctl on the seccomp fd that magically freezes the
> >>> target process inside the syscall while writing to its memory, or
> >>> something like that? And until then, the manpage should have a big fat
> >>> warning that writing to the target's memory is simply not possible
> >>> (safely).
> >>
> >> Thank you for your very clear explanation! It turned out to be
> >> trivially easy to demonstrate this issue with a slightly modified
> >> version of my program.
> >>
> >> As well as the change to the code example that I already mentioned
> >> my reply of a few hours ago, I've added the following text to the
> >> page:
> >>
> >>Caveats regarding the use of /proc/[tid]/mem
> >>The discussion above noted the need to use the
> >>SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) when opening the
> >>/proc/[tid]/mem file of the target to avoid the possibility of
> >>accessing the memory of the wrong process in the event that the
> >>target terminates and its ID is recycled by another (unrelated)
> >>thread.  However, the use of this ioctl(2) operation is also
> >>necessary in other situations, as explained in the following
> >>pargraphs.
> >
> > (nit: paragraphs)
>
> I spotted that one also already. But thanks for reading carefully!
>
> >>Consider the following scenario, where the supervisor tries to
> >>read the pathname argument of a target's blocked mount(2) system
> >>call:
> > [...]
> >> Seem okay?
> >
> > Yeah, sounds good.
> >
> >> By the way, is there any analogous kind of issue concerning
> >> pidfd_getfd()? I'm thinking not, but I wonder if I've missed
> >> something.
> >
> > When it is used by a seccomp supervisor, you mean? I think basically
> > the same thing applies - when resource identifiers (such as memory
> > addresses or file descriptors) are passed to a syscall, it generally
> > has to be assumed that those identifiers may become invalid and be
> > reused as soon as the syscall has returned.
>
> I probably needed to be more explicit. Would the following (i.e., a
> single cookie check) not be sufficient to handle the above scenario.
> Here, the target is making a syscall a system call that employs the
> file descriptor 'tfd':
>
> T: makes syscall that triggers notification
> S: Get notification
> S: pidfd = pidfd_open(T, 0);
> S: sfd = pifd_getfd(pidfd, tfd, 0)
> S: check that the cookie is still valid
> S: do operation with sfd [*]
>
> By contrast, I can see that we might want to do multiple cookie
> checks in the /proc/PID/mem case, since the supervisor might do
> multiple reads.

Aaah, okay. I didn't really understand the question at first.

> Or, do you mean: there really needs to be another cookie check after
> the point [*], since, if the the target's syscall was interrupted
> and 'tfd' was closed/resused, then the supervisor would be operating
> with a file descriptor that refers to an open file description
> (a "struct file") that is no longer meaningful in the target?
> (Thinking about it, I think this probably is what you mean, but
> I want to confirm.)

I wasn't thinking about your actual question when I wrote that. :P

I think you could argue that leaving out the first cookie check 

Re: For review: seccomp_user_notif(2) manual page [v2]

2020-11-02 Thread Jann Horn
On Sat, Oct 31, 2020 at 9:31 AM Michael Kerrisk (man-pages)
 wrote:
> On 10/30/20 8:14 PM, Jann Horn wrote:
> > With the caveat that a cancelled syscall
> > could've also led to the memory being munmap()ed, so the nread==0 case
> > could also happen legitimately - so you might want to move this check
> > up above the nread==0 (mm went away) and nread==-1 (mm still exists,
> > but read from address failed, errno EIO) checks if the error message
> > shouldn't appear spuriously.
>
> In any case, I've been refactoring (simplifying) that code a little.
> I haven't so far rearranged the order of the checks, but I already
> log message for the nread==0 case. (Instead, there will eventually
> be an error when the response is sent.)
>
> I also haven't exactly tested the scenario you describe in the
> seccomp unotify scenario, but I think the above is not correct. Here
> are two scenarios I did test, simply with mmap() and /proc/PID/mem
> (no seccomp involved):
>
> Scenario 1:
> A creates a mapping at address X
> B opens /proc/A/mem and and lseeks on resulting FD to offset X
> A terminates
> B reads from FD ==> read() returns 0 (EOF)
>
> Scenario 2:
> A creates a mapping at address X
> B opens /proc/A/mem and and lseeks on resulting FD to offset X
> A unmaps mapping at address X
> B reads from FD ==> read() returns -1 / EIO.
>
> That last scenario seems to contradict what you say, since I
> think you meant that in this case read() should return 0 in
> that case. Have I misunderstood you?

Sorry, I messed up the description when I wrote that. Yes, this looks
as expected - EIO if the VMA is gone, 0 if the mm_users of the
mm_struct have dropped to zero because all tasks that use the mm have
exited.


Re: For review: seccomp_user_notif(2) manual page [v2]

2020-11-02 Thread Sargun Dhillon
On Sat, Oct 31, 2020 at 9:27 AM Michael Kerrisk (man-pages)
 wrote:
>
> Hello Sargun,
>
> Thanks for your reply.
>
> On 10/30/20 9:27 PM, Sargun Dhillon wrote:
> > On Thu, Oct 29, 2020 at 09:37:21PM +0100, Michael Kerrisk (man-pages)
> > wrote:
>
> [...]
>
> >>> I think I commented in another thread somewhere that the
> >>> supervisor is not notified if the syscall is preempted. Therefore
> >>> if it is performing a preemptible, long-running syscall, you need
> >>> to poll SECCOMP_IOCTL_NOTIF_ID_VALID in the background, otherwise
> >>> you can end up in a bad situation -- like leaking resources, or
> >>> holding on to file descriptors after the program under
> >>> supervision has intended to release them.
> >>
> >> It's been a long day, and I'm not sure I reallu understand this.
> >> Could you outline the scnario in more detail?
> >>
> > S: Sets up filter + interception for accept T: socket(AF_INET,
> > SOCK_STREAM, 0) = 7 T: bind(7, {127.0.0.1, }, ..) T: listen(7,
> > 10) T: pidfd_getfd(T, 7) = 7 # For the sake of discussion.
>
> Presumably, the preceding line should have been:
>
> S: pidfd_getfd(T, 7) = 7 # For the sake of discussion.
> (s/T:/S:/)
>
> right?

Right.
>
>
> > T: accept(7, ...) S: Intercepts accept S: Does accept in background
> > T: Receives signal, and accept(...) responds in EINTR T: close(7) S:
> > Still running accept(7, ), holding port , so if now T
> > retries to bind to port , things fail.
>
> Okay -- I understand. Presumably the solution here is not to
> block in accept(), but rather to use poll() to monitor both the
> notification FD and the listening socket FD?
>
You need to have some kind of mechanism to periodically check
if the notification is still alive, and preempt the accept. It doesn't
matter how exactly you "background" the accept (threads, or
O_NONBLOCK + epoll).

The thing is you need to make sure that when the process
cancels a syscall, you need to release the resources you
may have acquired on its behalf or bad things can happen.

> >>> A very specific example is if you're performing an accept on
> >>> behalf of the program generating the notification, and the
> >>> program intends to reuse the port. You can get into all sorts of
> >>> awkward situations there.
> >>
> >> [...]
> >>
> > See above
>
> [...]
>
> >>> In addition, if it is a socket, it inherits the cgroup v1 classid
> >>> and netprioidx of the receiving process.
> >>>
> >>> The argument of this is as follows:
> >>>
> >>> struct seccomp_notif_addfd { __u64 id; __u32 flags; __u32 srcfd;
> >>> __u32 newfd; __u32 newfd_flags; };
> >>>
> >>> id This is the cookie value that was obtained using
> >>> SECCOMP_IOCTL_NOTIF_RECV.
> >>>
> >>> flags A bitmask that includes zero or more of the
> >>> SECCOMP_ADDFD_FLAG_* bits set
> >>>
> >>> SECCOMP_ADDFD_FLAG_SETFD - Use dup2 (or dup3?) like semantics
> >>> when copying the file descriptor.
> >>>
> >>> srcfd The file descriptor number to copy in the supervisor
> >>> process.
> >>>
> >>> newfd If the SECCOMP_ADDFD_FLAG_SETFD flag is specified this will
> >>> be the file descriptor that is used in the dup2 semantics. If
> >>> this file descriptor exists in the receiving process, it is
> >>> closed and replaced by this file descriptor in an atomic fashion.
> >>> If the copy process fails due to a MAC failure, or if srcfd is
> >>> invalid, the newfd will not be closed in the receiving process.
> >>
> >> Great description!
> >>
> >>> If SECCOMP_ADDFD_FLAG_SETFD it not set, then this value must be
> >>> 0.
> >>>
> >>> newfd_flags The file descriptor flags to set on the file
> >>> descriptor after it has been received by the process. The only
> >>> flag that can currently be specified is O_CLOEXEC.
> >>>
> >>> On success, this operation returns the file descriptor number in
> >>> the receiving process. On failure, -1 is returned.
> >>>
> >>> It can fail with the following error codes:
> >>>
> >>> EINPROGRESS The cookie number specified hasn't been received by
> >>> the listener
> >>
> >> I don't understand this. Can you say more about the scenario?
> >>
> >
> > This should not really happen. But if you do a ADDFD(...), on a
> > notification *before* you've received it, you will get this error. So
> > for example,
> > --> epoll() -> returns
> > --> RECV(...) cookie id is 777
> > --> epoll(...) -> returns
> > <-- ioctl(ADDFD, id = 778) # Notice how we haven't done a receive yet
> > where we've received a notification for 778.
>
> Got it. Looking also at the source code, I came up with the
> following:
>
>   EINPROGRESS
>  The user-space notification specified in the id
>  field exists but has not yet been fetched (by a
>  SECCOMP_IOCTL_NOTIF_RECV) or has already been
>  responded to (by a SECCOMP_IOCTL_NOTIF_SEND).
>
> Does that seem okay?
>
Looks good to me.

> >>> ENOENT The cookie number is not valid. This can happen if a
> >>> response has already been sent, or 

Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-31 Thread Michael Kerrisk (man-pages)
Hello Sargun,

Thanks for your reply.

On 10/30/20 9:27 PM, Sargun Dhillon wrote:
> On Thu, Oct 29, 2020 at 09:37:21PM +0100, Michael Kerrisk (man-pages)
> wrote:

[...]

>>> I think I commented in another thread somewhere that the
>>> supervisor is not notified if the syscall is preempted. Therefore
>>> if it is performing a preemptible, long-running syscall, you need
>>> to poll SECCOMP_IOCTL_NOTIF_ID_VALID in the background, otherwise
>>> you can end up in a bad situation -- like leaking resources, or
>>> holding on to file descriptors after the program under
>>> supervision has intended to release them.
>> 
>> It's been a long day, and I'm not sure I reallu understand this. 
>> Could you outline the scnario in more detail?
>> 
> S: Sets up filter + interception for accept T: socket(AF_INET,
> SOCK_STREAM, 0) = 7 T: bind(7, {127.0.0.1, }, ..) T: listen(7,
> 10) T: pidfd_getfd(T, 7) = 7 # For the sake of discussion.

Presumably, the preceding line should have been:

S: pidfd_getfd(T, 7) = 7 # For the sake of discussion.
(s/T:/S:/)

right?

> T: accept(7, ...) S: Intercepts accept S: Does accept in background 
> T: Receives signal, and accept(...) responds in EINTR T: close(7) S:
> Still running accept(7, ), holding port , so if now T
> retries to bind to port , things fail.

Okay -- I understand. Presumably the solution here is not to 
block in accept(), but rather to use poll() to monitor both the
notification FD and the listening socket FD?

>>> A very specific example is if you're performing an accept on
>>> behalf of the program generating the notification, and the
>>> program intends to reuse the port. You can get into all sorts of
>>> awkward situations there.
>> 
>> [...]
>> 
> See above

[...]

>>> In addition, if it is a socket, it inherits the cgroup v1 classid
>>> and netprioidx of the receiving process.
>>> 
>>> The argument of this is as follows:
>>> 
>>> struct seccomp_notif_addfd { __u64 id; __u32 flags; __u32 srcfd; 
>>> __u32 newfd; __u32 newfd_flags; };
>>> 
>>> id This is the cookie value that was obtained using 
>>> SECCOMP_IOCTL_NOTIF_RECV.
>>> 
>>> flags A bitmask that includes zero or more of the 
>>> SECCOMP_ADDFD_FLAG_* bits set
>>> 
>>> SECCOMP_ADDFD_FLAG_SETFD - Use dup2 (or dup3?) like semantics
>>> when copying the file descriptor.
>>> 
>>> srcfd The file descriptor number to copy in the supervisor
>>> process.
>>> 
>>> newfd If the SECCOMP_ADDFD_FLAG_SETFD flag is specified this will
>>> be the file descriptor that is used in the dup2 semantics. If
>>> this file descriptor exists in the receiving process, it is
>>> closed and replaced by this file descriptor in an atomic fashion.
>>> If the copy process fails due to a MAC failure, or if srcfd is
>>> invalid, the newfd will not be closed in the receiving process.
>> 
>> Great description!
>> 
>>> If SECCOMP_ADDFD_FLAG_SETFD it not set, then this value must be
>>> 0.
>>> 
>>> newfd_flags The file descriptor flags to set on the file
>>> descriptor after it has been received by the process. The only
>>> flag that can currently be specified is O_CLOEXEC.
>>> 
>>> On success, this operation returns the file descriptor number in
>>> the receiving process. On failure, -1 is returned.
>>> 
>>> It can fail with the following error codes:
>>> 
>>> EINPROGRESS The cookie number specified hasn't been received by
>>> the listener
>> 
>> I don't understand this. Can you say more about the scenario?
>> 
> 
> This should not really happen. But if you do a ADDFD(...), on a
> notification *before* you've received it, you will get this error. So
> for example, 
> --> epoll() -> returns 
> --> RECV(...) cookie id is 777
> --> epoll(...) -> returns
> <-- ioctl(ADDFD, id = 778) # Notice how we haven't done a receive yet
> where we've received a notification for 778.

Got it. Looking also at the source code, I came up with the 
following:

  EINPROGRESS
 The user-space notification specified in the id
 field exists but has not yet been fetched (by a
 SECCOMP_IOCTL_NOTIF_RECV) or has already been
 responded to (by a SECCOMP_IOCTL_NOTIF_SEND).

Does that seem okay?

>>> ENOENT The cookie number is not valid. This can happen if a
>>> response has already been sent, or if the syscall was
>>> interrupted
>>> 
>>> EBADF If the file descriptor specified in srcfd is invalid, or if
>>> the fd is out of range of the destination program.
>> 
>> The piece "or if the fd is out of range of the destination program"
>> is not clear to me. Can you say some more please.
>> 
> 
> IIRC the maximum fd range is specific in proc by some sysctl named 
> nr_open. It's also evaluated against RLIMITs, and nr_max.
> 
> If nr-open (maximum fds open per process, iiirc) is 1000, even if 10
> FDs are open, it wont work if newfd is 1001.

Actually, the relevant limit seems to be just the RLIMIT_NOFILE
resource limit at least in my reading of fs/file.c::replace_fd().
So 

Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-31 Thread Michael Kerrisk (man-pages)
On 10/30/20 8:20 PM, Jann Horn wrote:
> On Thu, Oct 29, 2020 at 8:14 PM Michael Kerrisk (man-pages)
>  wrote:
>> On 10/29/20 2:42 AM, Jann Horn wrote:
>>> As discussed at
>>> ,
>>> we need to re-check checkNotificationIdIsValid() after reading remote
>>> memory but before using the read value in any way. Otherwise, the
>>> syscall could in the meantime get interrupted by a signal handler, the
>>> signal handler could return, and then the function that performed the
>>> syscall could free() allocations or return (thereby freeing buffers on
>>> the stack).
>>>
>>> In essence, this pread() is (unavoidably) a potential use-after-free
>>> read; and to make that not have any security impact, we need to check
>>> whether UAF read occurred before using the read value. This should
>>> probably be called out elsewhere in the manpage, too...
>>>
>>> Now, of course, **reading** is the easy case. The difficult case is if
>>> we have to **write** to the remote process... because then we can't
>>> play games like that. If we write data to a freed pointer, we're
>>> screwed, that's it. (And for somewhat unrelated bonus fun, consider
>>> that /proc/$pid/mem is originally intended for process debugging,
>>> including installing breakpoints, and will therefore happily write
>>> over "readonly" private mappings, such as typical mappings of
>>> executable code.)
>>>
>>> So, h... I guess if anyone wants to actually write memory back to
>>> the target process, we'd better come up with some dedicated API for
>>> that, using an ioctl on the seccomp fd that magically freezes the
>>> target process inside the syscall while writing to its memory, or
>>> something like that? And until then, the manpage should have a big fat
>>> warning that writing to the target's memory is simply not possible
>>> (safely).
>>
>> Thank you for your very clear explanation! It turned out to be
>> trivially easy to demonstrate this issue with a slightly modified
>> version of my program.
>>
>> As well as the change to the code example that I already mentioned
>> my reply of a few hours ago, I've added the following text to the
>> page:
>>
>>Caveats regarding the use of /proc/[tid]/mem
>>The discussion above noted the need to use the
>>SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) when opening the
>>/proc/[tid]/mem file of the target to avoid the possibility of
>>accessing the memory of the wrong process in the event that the
>>target terminates and its ID is recycled by another (unrelated)
>>thread.  However, the use of this ioctl(2) operation is also
>>necessary in other situations, as explained in the following
>>pargraphs.
> 
> (nit: paragraphs)

I spotted that one also already. But thanks for reading carefully!

>>Consider the following scenario, where the supervisor tries to
>>read the pathname argument of a target's blocked mount(2) system
>>call:
> [...]
>> Seem okay?
> 
> Yeah, sounds good.
> 
>> By the way, is there any analogous kind of issue concerning
>> pidfd_getfd()? I'm thinking not, but I wonder if I've missed
>> something.
> 
> When it is used by a seccomp supervisor, you mean? I think basically
> the same thing applies - when resource identifiers (such as memory
> addresses or file descriptors) are passed to a syscall, it generally
> has to be assumed that those identifiers may become invalid and be
> reused as soon as the syscall has returned.

I probably needed to be more explicit. Would the following (i.e., a
single cookie check) not be sufficient to handle the above scenario.
Here, the target is making a syscall a system call that employs the
file descriptor 'tfd':

T: makes syscall that triggers notification
S: Get notification
S: pidfd = pidfd_open(T, 0);
S: sfd = pifd_getfd(pidfd, tfd, 0)
S: check that the cookie is still valid
S: do operation with sfd [*]

By contrast, I can see that we might want to do multiple cookie
checks in the /proc/PID/mem case, since the supervisor might do
multiple reads.

Or, do you mean: there really needs to be another cookie check after
the point [*], since, if the the target's syscall was interrupted
and 'tfd' was closed/resused, then the supervisor would be operating
with a file descriptor that refers to an open file description
(a "struct file") that is no longer meaningful in the target?
(Thinking about it, I think this probably is what you mean, but 
I want to confirm.)

Thanks,

Michael
-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-31 Thread Michael Kerrisk (man-pages)
On 10/30/20 8:14 PM, Jann Horn wrote:
> On Thu, Oct 29, 2020 at 3:19 PM Michael Kerrisk (man-pages)
>  wrote:
>> On 10/29/20 2:42 AM, Jann Horn wrote:
>>> On Mon, Oct 26, 2020 at 10:55 AM Michael Kerrisk (man-pages)
>>>  wrote:
static bool
getTargetPathname(struct seccomp_notif *req, int notifyFd,
  char *path, size_t len)
{
char procMemPath[PATH_MAX];

snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", 
 req->pid);

int procMemFd = open(procMemPath, O_RDONLY);
if (procMemFd == -1)
errExit("\tS: open");

/* Check that the process whose info we are accessing is still 
 alive.
   If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed
   in checkNotificationIdIsValid()) succeeds, we know that the
   /proc/PID/mem file descriptor that we opened corresponds to 
 the
   process for which we received a notification. If that process
   subsequently terminates, then read() on that file descriptor
   will return 0 (EOF). */

checkNotificationIdIsValid(notifyFd, req->id);

/* Read bytes at the location containing the pathname argument
   (i.e., the first argument) of the mkdir(2) call */

ssize_t nread = pread(procMemFd, path, len, req->data.args[0]);
if (nread == -1)
errExit("pread");
>>>
>>> As discussed at
>>> ,
>>> we need to re-check checkNotificationIdIsValid() after reading remote
>>> memory but before using the read value in any way. Otherwise, the
>>> syscall could in the meantime get interrupted by a signal handler, the
>>> signal handler could return, and then the function that performed the
>>> syscall could free() allocations or return (thereby freeing buffers on
>>> the stack).
>>>
>>> In essence, this pread() is (unavoidably) a potential use-after-free
>>> read; and to make that not have any security impact, we need to check
>>> whether UAF read occurred before using the read value. This should
>>> probably be called out elsewhere in the manpage, too...
>>
>> Thanks very much for pointing me at this!
>>
>> So, I want to conform that the fix to the code is as simple as
>> adding a check following the pread() call, something like:
>>
>> [[
>>  ssize_t nread = pread(procMemFd, path, len, req->data.args[argNum]);
>>  if (nread == -1)
>> errExit("Supervisor: pread");
>>
>>  if (nread == 0) {
>> fprintf(stderr, "\tS: pread() of /proc/PID/mem "
>> "returned 0 (EOF)\n");
>> exit(EXIT_FAILURE);
>>  }
>>
>>  if (close(procMemFd) == -1)
>> errExit("Supervisor: close-/proc/PID/mem");
>>
>> +/* Once again check that the notification ID is still valid. The
>> +   case we are particularly concerned about here is that just
>> +   before we fetched the pathname, the target's blocked system
>> +   call was interrupted by a signal handler, and after the handler
>> +   returned, the target carried on execution (past the interrupted
>> +   system call). In that case, we have no guarantees about what we
>> +   are reading, since the target's memory may have been arbitrarily
>> +   changed by subsequent operations. */
>> +
>> +if (!notificationIdIsValid(notifyFd, req->id, "post-open"))
>> +return false;
>> +
>>  /* We have no guarantees about what was in the memory of the target
>> process. We therefore treat the buffer returned by pread() as
>> untrusted input. The buffer should be terminated by a null byte;
>> if not, then we will trigger an error for the target process. */
>>
>>  if (strnlen(path, nread) < nread)
>>  return true;
>> ]]
> 
> Yeah, that should do the job. 

Thanks.

> With the caveat that a cancelled syscall
> could've also led to the memory being munmap()ed, so the nread==0 case
> could also happen legitimately - so you might want to move this check
> up above the nread==0 (mm went away) and nread==-1 (mm still exists,
> but read from address failed, errno EIO) checks if the error message
> shouldn't appear spuriously.

In any case, I've been refactoring (simplifying) that code a little.
I haven't so far rearranged the order of the checks, but I already
log message for the nread==0 case. (Instead, there will eventually
be an error when the response is sent.)

I also haven't exactly tested the scenario you describe in the
seccomp unotify scenario, but I think the above is not correct. Here 
are two scenarios I did test, simply with mmap() and /proc/PID/mem
(no seccomp involved): 

Scenario 1:
A creates a mapping at address X
B opens /proc/A/mem and and lseeks on 

Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-30 Thread Sargun Dhillon
On Thu, Oct 29, 2020 at 09:37:21PM +0100, Michael Kerrisk (man-pages) wrote:
> Hello Sargun,,
> 
> On 10/29/20 9:53 AM, Sargun Dhillon wrote:
> > On Mon, Oct 26, 2020 at 10:55:04AM +0100, Michael Kerrisk (man-pages) wrote:
> 
> [...]
> 
> >>ioctl(2) operations
> >>The following ioctl(2) operations are provided to support seccomp
> >>user-space notification.  For each of these operations, the first
> >>(file descriptor) argument of ioctl(2) is the listening file
> >>descriptor returned by a call to seccomp(2) with the
> >>SECCOMP_FILTER_FLAG_NEW_LISTENER flag.
> >>
> >>SECCOMP_IOCTL_NOTIF_RECV
> >>   This operation is used to obtain a user-space notification
> >>   event.  If no such event is currently pending, the
> >>   operation blocks until an event occurs.  The third
> >>   ioctl(2) argument is a pointer to a structure of the
> >>   following form which contains information about the event.
> >>   This structure must be zeroed out before the call.
> >>
> >>   struct seccomp_notif {
> >>   __u64  id;  /* Cookie */
> >>   __u32  pid; /* TID of target thread */
> >>   __u32  flags;   /* Currently unused (0) */
> >>   struct seccomp_data data;   /* See seccomp(2) */
> >>   };
> >>
> >>   The fields in this structure are as follows:
> >>
> >>   id This is a cookie for the notification.  Each such
> >>  cookie is guaranteed to be unique for the
> >>  corresponding seccomp filter.
> >>
> >>  · It can be used with the
> >>SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation
> >>to verify that the target is still alive.
> >>
> >>  · When returning a notification response to the
> >>kernel, the supervisor must include the cookie
> >>value in the seccomp_notif_resp structure that is
> >>specified as the argument of the
> >>SECCOMP_IOCTL_NOTIF_SEND operation.
> >>
> >>   pidThis is the thread ID of the target thread that
> >>  triggered the notification event.
> >>
> >>   flags  This is a bit mask of flags providing further
> >>  information on the event.  In the current
> >>  implementation, this field is always zero.
> >>
> >>   data   This is a seccomp_data structure containing
> >>  information about the system call that triggered
> >>  the notification.  This is the same structure that
> >>  is passed to the seccomp filter.  See seccomp(2)
> >>  for details of this structure.
> >>
> >>   On success, this operation returns 0; on failure, -1 is
> >>   returned, and errno is set to indicate the cause of the
> >>   error.  This operation can fail with the following errors:
> >>
> >>   EINVAL (since Linux 5.5)
> >>  The seccomp_notif structure that was passed to the
> >>  call contained nonzero fields.
> >>
> >>   ENOENT The target thread was killed by a signal as the
> >>  notification information was being generated, or
> >>  the target's (blocked) system call was interrupted
> >>  by a signal handler.
> >>
> > 
> > I think I commented in another thread somewhere that the supervisor is not 
> > notified if the syscall is preempted. Therefore if it is performing a 
> > preemptible, long-running syscall, you need to poll
> > SECCOMP_IOCTL_NOTIF_ID_VALID in the background, otherwise you can
> > end up in a bad situation -- like leaking resources, or holding on to
> > file descriptors after the program under supervision has intended to
> > release them.
> 
> It's been a long day, and I'm not sure I reallu understand this.
> Could you outline the scnario in more detail?
> 
S: Sets up filter + interception for accept
T: socket(AF_INET, SOCK_STREAM, 0) = 7
T: bind(7, {127.0.0.1, }, ..)
T: listen(7, 10)
T: pidfd_getfd(T, 7) = 7 # For the sake of discussion.
T: accept(7, ...)
S: Intercepts accept
S: Does accept in background
T: Receives signal, and accept(...) responds in EINTR
T: close(7)
S: Still running accept(7, ), holding port , so if now T retries
   to bind to port , things fail.

> > A very specific example is if you're performing an accept on behalf
> > of the program generating the notification, and the program intends
> > to reuse the port. You can get into all sorts of awkward situations
> > there.
> 
> [...]
> 
See above

> > 

Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-30 Thread Michael Kerrisk (man-pages)
On 10/30/20 8:24 PM, Jann Horn wrote:
> On Thu, Oct 29, 2020 at 8:53 PM Michael Kerrisk (man-pages)
>  wrote:
>> On 10/29/20 4:26 PM, Christian Brauner wrote:
>>> I like this manpage. I think this is the most comprehensive explanation
>>> of any seccomp feature
>>
>> Thanks (at least, I think so...)
>>
>>> and somewhat understandable.
>>   
>>
>> (... but I'm not sure ;-).)
> 
> Relevant: http://tinefetz.net/files/gimgs/78_78_17.jpg

Perfekt :-).


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-30 Thread Jann Horn
On Thu, Oct 29, 2020 at 8:14 PM Michael Kerrisk (man-pages)
 wrote:
> On 10/29/20 2:42 AM, Jann Horn wrote:
> > As discussed at
> > ,
> > we need to re-check checkNotificationIdIsValid() after reading remote
> > memory but before using the read value in any way. Otherwise, the
> > syscall could in the meantime get interrupted by a signal handler, the
> > signal handler could return, and then the function that performed the
> > syscall could free() allocations or return (thereby freeing buffers on
> > the stack).
> >
> > In essence, this pread() is (unavoidably) a potential use-after-free
> > read; and to make that not have any security impact, we need to check
> > whether UAF read occurred before using the read value. This should
> > probably be called out elsewhere in the manpage, too...
> >
> > Now, of course, **reading** is the easy case. The difficult case is if
> > we have to **write** to the remote process... because then we can't
> > play games like that. If we write data to a freed pointer, we're
> > screwed, that's it. (And for somewhat unrelated bonus fun, consider
> > that /proc/$pid/mem is originally intended for process debugging,
> > including installing breakpoints, and will therefore happily write
> > over "readonly" private mappings, such as typical mappings of
> > executable code.)
> >
> > So, h... I guess if anyone wants to actually write memory back to
> > the target process, we'd better come up with some dedicated API for
> > that, using an ioctl on the seccomp fd that magically freezes the
> > target process inside the syscall while writing to its memory, or
> > something like that? And until then, the manpage should have a big fat
> > warning that writing to the target's memory is simply not possible
> > (safely).
>
> Thank you for your very clear explanation! It turned out to be
> trivially easy to demonstrate this issue with a slightly modified
> version of my program.
>
> As well as the change to the code example that I already mentioned
> my reply of a few hours ago, I've added the following text to the
> page:
>
>Caveats regarding the use of /proc/[tid]/mem
>The discussion above noted the need to use the
>SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) when opening the
>/proc/[tid]/mem file of the target to avoid the possibility of
>accessing the memory of the wrong process in the event that the
>target terminates and its ID is recycled by another (unrelated)
>thread.  However, the use of this ioctl(2) operation is also
>necessary in other situations, as explained in the following
>pargraphs.

(nit: paragraphs)

>Consider the following scenario, where the supervisor tries to
>read the pathname argument of a target's blocked mount(2) system
>call:
[...]
> Seem okay?

Yeah, sounds good.

> By the way, is there any analogous kind of issue concerning
> pidfd_getfd()? I'm thinking not, but I wonder if I've missed
> something.

When it is used by a seccomp supervisor, you mean? I think basically
the same thing applies - when resource identifiers (such as memory
addresses or file descriptors) are passed to a syscall, it generally
has to be assumed that those identifiers may become invalid and be
reused as soon as the syscall has returned.


Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-30 Thread Jann Horn
On Thu, Oct 29, 2020 at 8:53 PM Michael Kerrisk (man-pages)
 wrote:
> On 10/29/20 4:26 PM, Christian Brauner wrote:
> > I like this manpage. I think this is the most comprehensive explanation
> > of any seccomp feature
>
> Thanks (at least, I think so...)
>
> > and somewhat understandable.
>   
>
> (... but I'm not sure ;-).)

Relevant: http://tinefetz.net/files/gimgs/78_78_17.jpg


Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-30 Thread Jann Horn
On Thu, Oct 29, 2020 at 3:19 PM Michael Kerrisk (man-pages)
 wrote:
> On 10/29/20 2:42 AM, Jann Horn wrote:
> > On Mon, Oct 26, 2020 at 10:55 AM Michael Kerrisk (man-pages)
> >  wrote:
> >>static bool
> >>getTargetPathname(struct seccomp_notif *req, int notifyFd,
> >>  char *path, size_t len)
> >>{
> >>char procMemPath[PATH_MAX];
> >>
> >>snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", 
> >> req->pid);
> >>
> >>int procMemFd = open(procMemPath, O_RDONLY);
> >>if (procMemFd == -1)
> >>errExit("\tS: open");
> >>
> >>/* Check that the process whose info we are accessing is still 
> >> alive.
> >>   If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed
> >>   in checkNotificationIdIsValid()) succeeds, we know that the
> >>   /proc/PID/mem file descriptor that we opened corresponds to 
> >> the
> >>   process for which we received a notification. If that process
> >>   subsequently terminates, then read() on that file descriptor
> >>   will return 0 (EOF). */
> >>
> >>checkNotificationIdIsValid(notifyFd, req->id);
> >>
> >>/* Read bytes at the location containing the pathname argument
> >>   (i.e., the first argument) of the mkdir(2) call */
> >>
> >>ssize_t nread = pread(procMemFd, path, len, req->data.args[0]);
> >>if (nread == -1)
> >>errExit("pread");
> >
> > As discussed at
> > ,
> > we need to re-check checkNotificationIdIsValid() after reading remote
> > memory but before using the read value in any way. Otherwise, the
> > syscall could in the meantime get interrupted by a signal handler, the
> > signal handler could return, and then the function that performed the
> > syscall could free() allocations or return (thereby freeing buffers on
> > the stack).
> >
> > In essence, this pread() is (unavoidably) a potential use-after-free
> > read; and to make that not have any security impact, we need to check
> > whether UAF read occurred before using the read value. This should
> > probably be called out elsewhere in the manpage, too...
>
> Thanks very much for pointing me at this!
>
> So, I want to conform that the fix to the code is as simple as
> adding a check following the pread() call, something like:
>
> [[
>  ssize_t nread = pread(procMemFd, path, len, req->data.args[argNum]);
>  if (nread == -1)
> errExit("Supervisor: pread");
>
>  if (nread == 0) {
> fprintf(stderr, "\tS: pread() of /proc/PID/mem "
> "returned 0 (EOF)\n");
> exit(EXIT_FAILURE);
>  }
>
>  if (close(procMemFd) == -1)
> errExit("Supervisor: close-/proc/PID/mem");
>
> +/* Once again check that the notification ID is still valid. The
> +   case we are particularly concerned about here is that just
> +   before we fetched the pathname, the target's blocked system
> +   call was interrupted by a signal handler, and after the handler
> +   returned, the target carried on execution (past the interrupted
> +   system call). In that case, we have no guarantees about what we
> +   are reading, since the target's memory may have been arbitrarily
> +   changed by subsequent operations. */
> +
> +if (!notificationIdIsValid(notifyFd, req->id, "post-open"))
> +return false;
> +
>  /* We have no guarantees about what was in the memory of the target
> process. We therefore treat the buffer returned by pread() as
> untrusted input. The buffer should be terminated by a null byte;
> if not, then we will trigger an error for the target process. */
>
>  if (strnlen(path, nread) < nread)
>  return true;
> ]]

Yeah, that should do the job. With the caveat that a cancelled syscall
could've also led to the memory being munmap()ed, so the nread==0 case
could also happen legitimately - so you might want to move this check
up above the nread==0 (mm went away) and nread==-1 (mm still exists,
but read from address failed, errno EIO) checks if the error message
shouldn't appear spuriously.


Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-29 Thread Michael Kerrisk (man-pages)
Hello Sargun,,

On 10/29/20 9:53 AM, Sargun Dhillon wrote:
> On Mon, Oct 26, 2020 at 10:55:04AM +0100, Michael Kerrisk (man-pages) wrote:

[...]

>>ioctl(2) operations
>>The following ioctl(2) operations are provided to support seccomp
>>user-space notification.  For each of these operations, the first
>>(file descriptor) argument of ioctl(2) is the listening file
>>descriptor returned by a call to seccomp(2) with the
>>SECCOMP_FILTER_FLAG_NEW_LISTENER flag.
>>
>>SECCOMP_IOCTL_NOTIF_RECV
>>   This operation is used to obtain a user-space notification
>>   event.  If no such event is currently pending, the
>>   operation blocks until an event occurs.  The third
>>   ioctl(2) argument is a pointer to a structure of the
>>   following form which contains information about the event.
>>   This structure must be zeroed out before the call.
>>
>>   struct seccomp_notif {
>>   __u64  id;  /* Cookie */
>>   __u32  pid; /* TID of target thread */
>>   __u32  flags;   /* Currently unused (0) */
>>   struct seccomp_data data;   /* See seccomp(2) */
>>   };
>>
>>   The fields in this structure are as follows:
>>
>>   id This is a cookie for the notification.  Each such
>>  cookie is guaranteed to be unique for the
>>  corresponding seccomp filter.
>>
>>  · It can be used with the
>>SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation
>>to verify that the target is still alive.
>>
>>  · When returning a notification response to the
>>kernel, the supervisor must include the cookie
>>value in the seccomp_notif_resp structure that is
>>specified as the argument of the
>>SECCOMP_IOCTL_NOTIF_SEND operation.
>>
>>   pidThis is the thread ID of the target thread that
>>  triggered the notification event.
>>
>>   flags  This is a bit mask of flags providing further
>>  information on the event.  In the current
>>  implementation, this field is always zero.
>>
>>   data   This is a seccomp_data structure containing
>>  information about the system call that triggered
>>  the notification.  This is the same structure that
>>  is passed to the seccomp filter.  See seccomp(2)
>>  for details of this structure.
>>
>>   On success, this operation returns 0; on failure, -1 is
>>   returned, and errno is set to indicate the cause of the
>>   error.  This operation can fail with the following errors:
>>
>>   EINVAL (since Linux 5.5)
>>  The seccomp_notif structure that was passed to the
>>  call contained nonzero fields.
>>
>>   ENOENT The target thread was killed by a signal as the
>>  notification information was being generated, or
>>  the target's (blocked) system call was interrupted
>>  by a signal handler.
>>
>>┌─┐
>>│FIXME│
>>├─┤
>>│From my experiments, it appears that if a│
>>│SECCOMP_IOCTL_NOTIF_RECV is done after the target│
>>│thread terminates, then the ioctl() simply blocks│
>>│(rather than returning an error to indicate that the │
>>│target no longer exists).│
>>│ │
>>│I found that surprising, and it required some│
>>│contortions in the example program.  It was not  │
>>│possible to code my SIGCHLD handler (which reaps the │
>>│zombie when the worker/target terminates) to simply  │
>>│set a flag checked in the main handleNotifications() │
>>│loop, since this created an unavoidable race where   │
>>│the child might terminate just after I had checked   │
>>│the flag, but before I blocked (forever!) in the │
>>│SECCOMP_IOCTL_NOTIF_RECV operation. Instead, I had   │
>>│to code the signal handler to simply call _exit(2)   │
>>│in order to terminate the parent process (the│
>>│supervisor). │
>>│ │
>>│Is this expected behavior? It seems to me 

Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-29 Thread Michael Kerrisk (man-pages)
Hello Christian

Thanks for taking a look at the page.

On 10/29/20 4:26 PM, Christian Brauner wrote:
> On Mon, Oct 26, 2020 at 10:55:04AM +0100, Michael Kerrisk (man-pages) wrote:
>> Hi all (and especially Tycho and Sargun),
>>
>> Following review comments on the first draft (thanks to Jann, Kees,
>> Christian and Tycho), I've made a lot of changes to this page.
>> I've also added a few FIXMEs relating to outstanding API issues.
>> I'd like a second pass review of the page before I release it.
>> But also, this mail serves as a way of noting the outstanding API
>> issues.
>>
>> Tycho: I still have an outstanding question for you at [2].
>>
>> Sargun: can you please prepare something on SECCOMP_ADDFD_FLAG_SETFD
>> and SECCOMP_IOCTL_NOTIF_ADDFD to be added to this page?
>>
>> I've shown the rendered version of the page below. The page source
>> currently sits in a branch at
>> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=seccomp_user_notif
>>
>> At this point, I'm mainly interested in feedback about the FIXMEs,
>> some of which relate to the text of the page itself, while the
>> others relate to the various outstanding API issues. The first 
>> FIXME provides a small opportunity for some bikeshedding :-);
> 
> I like this manpage. I think this is the most comprehensive explanation
> of any seccomp feature

Thanks (at least, I think so...)

> and somewhat understandable.
  

(... but I'm not sure ;-).)

> Just tiny comments below, hopefully no bike-shedding though. :)

Most relevant point for bikeshedding is the page name. I plan 
to change it to seccomp_unotify(2) (shorter, reads better out loud).

>> Thanks,
>>
>> Michael
>>
>> [1] 
>> https://lore.kernel.org/linux-man/45f07f17-18b6-d187-0914-6f341fe90...@gmail.com/
>> [2] 
>> https://lore.kernel.org/linux-man/8f20d586-9609-ef83-c85a-272e37e68...@gmail.com/
>>
>> =
>>
>> SECCOMP_USER_NOTIF(2)   Linux Programmer's Manual  SECCOMP_USER_NOTIF(2)

[...]

>>An overview of the steps performed by the target and the
>>supervisor is as follows:
>>
>>1. The target establishes a seccomp filter in the usual manner,
>>   but with two differences:
>>
>>   · The seccomp(2) flags argument includes the flag
>> SECCOMP_FILTER_FLAG_NEW_LISTENER.  Consequently, the return
>> value of the (successful) seccomp(2) call is a new
>> "listening" file descriptor that can be used to receive
>> notifications.  Only one "listening" seccomp filter can be
>> installed for a thread.
>>
>> ┌─┐
>> │FIXME│
>> ├─┤
>> │Is the last sentence above correct?  │
>> │ │
>> │Kees Cook (25 Oct 2020) notes:   │
>> │ │
>> │I like this limitation, but I expect that it'll need │
>> │to change in the future. Even with LSMs, we see the  │
>> │need for arbitrary stacking, and the idea of there   │
>> │being only 1 supervisor will eventually break down.  │
>> │Right now there is only 1 because only container │
>> │managers are using this feature. But if some daemon  │
>> │starts using it to isolate some thread, suddenly it  │
>> │might break if a container manager is trying to  │
>> │listen to it too, etc. I expect it won't be needed   │
>> │soon, but I do think it'll change.   │
>> │ │
>> └─┘
>>
>>   · In cases where it is appropriate, the seccomp filter returns
>> the action value SECCOMP_RET_USER_NOTIF.  This return value
>> will trigger a notification event.
>>
>>2. In order that the supervisor can obtain notifications using
>>   the listening file descriptor, (a duplicate of) that file
>>   descriptor must be passed from the target to the supervisor.
>>   One way in which this could be done is by passing the file
>>   descriptor over a UNIX domain socket connection between the
>>   target and the supervisor (using the SCM_RIGHTS ancillary
>>   message type described in unix(7)).
> 
> Fwiw, on newer kernels you could also use pidfd_getfd() for that.

Thanks. I added that to the text.

>>3. The supervisor will receive notification events on the
>>   listening file descriptor.  These events are returned as
>>   structures of type seccomp_notif.  Because this structure and
>>   its size may evolve over kernel versions, 

Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-29 Thread Michael Kerrisk (man-pages)
Hello Jann,

On 10/29/20 2:42 AM, Jann Horn wrote:
> On Mon, Oct 26, 2020 at 10:55 AM Michael Kerrisk (man-pages)
>  wrote:
>>static bool
>>getTargetPathname(struct seccomp_notif *req, int notifyFd,
>>  char *path, size_t len)
>>{
>>char procMemPath[PATH_MAX];
>>
>>snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", 
>> req->pid);
>>
>>int procMemFd = open(procMemPath, O_RDONLY);
>>if (procMemFd == -1)
>>errExit("\tS: open");
>>
>>/* Check that the process whose info we are accessing is still 
>> alive.
>>   If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed
>>   in checkNotificationIdIsValid()) succeeds, we know that the
>>   /proc/PID/mem file descriptor that we opened corresponds to the
>>   process for which we received a notification. If that process
>>   subsequently terminates, then read() on that file descriptor
>>   will return 0 (EOF). */
>>
>>checkNotificationIdIsValid(notifyFd, req->id);
>>
>>/* Read bytes at the location containing the pathname argument
>>   (i.e., the first argument) of the mkdir(2) call */
>>
>>ssize_t nread = pread(procMemFd, path, len, req->data.args[0]);
>>if (nread == -1)
>>errExit("pread");
> 
> As discussed at
> ,
> we need to re-check checkNotificationIdIsValid() after reading remote
> memory but before using the read value in any way. Otherwise, the
> syscall could in the meantime get interrupted by a signal handler, the
> signal handler could return, and then the function that performed the
> syscall could free() allocations or return (thereby freeing buffers on
> the stack).
> 
> In essence, this pread() is (unavoidably) a potential use-after-free
> read; and to make that not have any security impact, we need to check
> whether UAF read occurred before using the read value. This should
> probably be called out elsewhere in the manpage, too...
> 
> Now, of course, **reading** is the easy case. The difficult case is if
> we have to **write** to the remote process... because then we can't
> play games like that. If we write data to a freed pointer, we're
> screwed, that's it. (And for somewhat unrelated bonus fun, consider
> that /proc/$pid/mem is originally intended for process debugging,
> including installing breakpoints, and will therefore happily write
> over "readonly" private mappings, such as typical mappings of
> executable code.)
> 
> So, h... I guess if anyone wants to actually write memory back to
> the target process, we'd better come up with some dedicated API for
> that, using an ioctl on the seccomp fd that magically freezes the
> target process inside the syscall while writing to its memory, or
> something like that? And until then, the manpage should have a big fat
> warning that writing to the target's memory is simply not possible
> (safely).

Thank you for your very clear explanation! It turned out to be 
trivially easy to demonstrate this issue with a slightly modified
version of my program.

As well as the change to the code example that I already mentioned
my reply of a few hours ago, I've added the following text to the 
page:

   Caveats regarding the use of /proc/[tid]/mem
   The discussion above noted the need to use the
   SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) when opening the
   /proc/[tid]/mem file of the target to avoid the possibility of
   accessing the memory of the wrong process in the event that the
   target terminates and its ID is recycled by another (unrelated)
   thread.  However, the use of this ioctl(2) operation is also
   necessary in other situations, as explained in the following
   pargraphs.

   Consider the following scenario, where the supervisor tries to
   read the pathname argument of a target's blocked mount(2) system
   call:

   • From one of its functions (func()), the target calls mount(2),
 which triggers a user-space notification and causes the target
 to block.

   • The supervisor receives the notification, opens
 /proc/[tid]/mem, and (successfully) performs the
 SECCOMP_IOCTL_NOTIF_ID_VALID check.

   • The target receives a signal, which causes the mount(2) to
 abort.

   • The signal handler executes in the target, and returns.

   • Upon return from the handler, the execution of func() resumes,
 and it returns (and perhaps other functions are called,
 overwriting the memory that had been used for the stack frame
 of func()).

   • Using the address provided in the notification information, the
 supervisor reads from the target's memory location that used to
 contain the 

Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-29 Thread Christian Brauner
On Mon, Oct 26, 2020 at 10:55:04AM +0100, Michael Kerrisk (man-pages) wrote:
> Hi all (and especially Tycho and Sargun),
> 
> Following review comments on the first draft (thanks to Jann, Kees,
> Christian and Tycho), I've made a lot of changes to this page.
> I've also added a few FIXMEs relating to outstanding API issues.
> I'd like a second pass review of the page before I release it.
> But also, this mail serves as a way of noting the outstanding API
> issues.
> 
> Tycho: I still have an outstanding question for you at [2].
> 
> Sargun: can you please prepare something on SECCOMP_ADDFD_FLAG_SETFD
> and SECCOMP_IOCTL_NOTIF_ADDFD to be added to this page?
> 
> I've shown the rendered version of the page below. The page source
> currently sits in a branch at
> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=seccomp_user_notif
> 
> At this point, I'm mainly interested in feedback about the FIXMEs,
> some of which relate to the text of the page itself, while the
> others relate to the various outstanding API issues. The first 
> FIXME provides a small opportunity for some bikeshedding :-);

I like this manpage. I think this is the most comprehensive explanation
of any seccomp feature and somewhat understandable.

Just tiny comments below, hopefully no bike-shedding though. :)

> 
> 
> Thanks,
> 
> Michael
> 
> [1] 
> https://lore.kernel.org/linux-man/45f07f17-18b6-d187-0914-6f341fe90...@gmail.com/
> [2] 
> https://lore.kernel.org/linux-man/8f20d586-9609-ef83-c85a-272e37e68...@gmail.com/
> 
> =
> 
> SECCOMP_USER_NOTIF(2)   Linux Programmer's Manual  SECCOMP_USER_NOTIF(2)
> 
> NAME
>seccomp_user_notif - Seccomp user-space notification mechanism
> 
>┌─┐
>│FIXME│
>├─┤
>│Might "seccomp_unotify(2)" be a better name for this │
>│page?  It's slightly shorter to type, and perhaps│
>│reads better when spoken.│
>└─┘
> 
> SYNOPSIS
>#include 
>#include 
>#include 
> 
>int seccomp(unsigned int operation, unsigned int flags, void *args);
> 
>#include 
> 
>int ioctl(int fd, SECCOMP_IOCTL_NOTIF_RECV,
>  struct seccomp_notif *req);
>int ioctl(int fd, SECCOMP_IOCTL_NOTIF_SEND,
>  struct seccomp_notif_resp *resp);
>int ioctl(int fd, SECCOMP_IOCTL_NOTIF_ID_VALID, __u64 *id);
> 
> DESCRIPTION
>This page describes the user-space notification mechanism
>provided by the Secure Computing (seccomp) facility.  As well as
>the use of the SECCOMP_FILTER_FLAG_NEW_LISTENER flag, the
>SECCOMP_RET_USER_NOTIF action value, and the
>SECCOMP_GET_NOTIF_SIZES operation described in seccomp(2), this
>mechanism involves the use of a number of related ioctl(2)
>operations (described below).
> 
>Overview
>In conventional usage of a seccomp filter, the decision about how
>to treat a system call is made by the filter itself.  By
>contrast, the user-space notification mechanism allows the
>seccomp filter to delegate the handling of the system call to
>another user-space process.  Note that this mechanism is
>explicitly not intended as a method implementing security policy;
>see NOTES.
> 
>In the discussion that follows, the thread(s) on which the
>seccomp filter is installed is (are) referred to as the target,
>and the process that is notified by the user-space notification
>mechanism is referred to as the supervisor.
> 
>A suitably privileged supervisor can use the user-space
>notification mechanism to perform actions on behalf of the
>target.  The advantage of the user-space notification mechanism
>is that the supervisor will usually be able to retrieve
>information about the target and the performed system call that
>the seccomp filter itself cannot.  (A seccomp filter is limited
>in the information it can obtain and the actions that it can
>perform because it is running on a virtual machine inside the
>kernel.)
> 
>An overview of the steps performed by the target and the
>supervisor is as follows:
> 
>1. The target establishes a seccomp filter in the usual manner,
>   but with two differences:
> 
>   · The seccomp(2) flags argument includes the flag
> SECCOMP_FILTER_FLAG_NEW_LISTENER.  Consequently, the return
> value of the (successful) seccomp(2) call is a new
> "listening" file descriptor that can be used to receive
> notifications.  Only one "listening" seccomp filter can be
> installed for a thread.
> 
> 

Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-29 Thread Michael Kerrisk (man-pages)
Hello Jann,

On 10/29/20 2:42 AM, Jann Horn wrote:
> On Mon, Oct 26, 2020 at 10:55 AM Michael Kerrisk (man-pages)
>  wrote:
>>static bool
>>getTargetPathname(struct seccomp_notif *req, int notifyFd,
>>  char *path, size_t len)
>>{
>>char procMemPath[PATH_MAX];
>>
>>snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", 
>> req->pid);
>>
>>int procMemFd = open(procMemPath, O_RDONLY);
>>if (procMemFd == -1)
>>errExit("\tS: open");
>>
>>/* Check that the process whose info we are accessing is still 
>> alive.
>>   If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed
>>   in checkNotificationIdIsValid()) succeeds, we know that the
>>   /proc/PID/mem file descriptor that we opened corresponds to the
>>   process for which we received a notification. If that process
>>   subsequently terminates, then read() on that file descriptor
>>   will return 0 (EOF). */
>>
>>checkNotificationIdIsValid(notifyFd, req->id);
>>
>>/* Read bytes at the location containing the pathname argument
>>   (i.e., the first argument) of the mkdir(2) call */
>>
>>ssize_t nread = pread(procMemFd, path, len, req->data.args[0]);
>>if (nread == -1)
>>errExit("pread");
> 
> As discussed at
> ,
> we need to re-check checkNotificationIdIsValid() after reading remote
> memory but before using the read value in any way. Otherwise, the
> syscall could in the meantime get interrupted by a signal handler, the
> signal handler could return, and then the function that performed the
> syscall could free() allocations or return (thereby freeing buffers on
> the stack).
> 
> In essence, this pread() is (unavoidably) a potential use-after-free
> read; and to make that not have any security impact, we need to check
> whether UAF read occurred before using the read value. This should
> probably be called out elsewhere in the manpage, too...

Thanks very much for pointing me at this!

So, I want to conform that the fix to the code is as simple as
adding a check following the pread() call, something like:

[[
 ssize_t nread = pread(procMemFd, path, len, req->data.args[argNum]);
 if (nread == -1)
errExit("Supervisor: pread");
 
 if (nread == 0) {
fprintf(stderr, "\tS: pread() of /proc/PID/mem "
"returned 0 (EOF)\n");
exit(EXIT_FAILURE);
 }
 
 if (close(procMemFd) == -1)
errExit("Supervisor: close-/proc/PID/mem");
 
+/* Once again check that the notification ID is still valid. The
+   case we are particularly concerned about here is that just
+   before we fetched the pathname, the target's blocked system
+   call was interrupted by a signal handler, and after the handler
+   returned, the target carried on execution (past the interrupted
+   system call). In that case, we have no guarantees about what we
+   are reading, since the target's memory may have been arbitrarily
+   changed by subsequent operations. */
+
+if (!notificationIdIsValid(notifyFd, req->id, "post-open"))
+return false;
+
 /* We have no guarantees about what was in the memory of the target
process. We therefore treat the buffer returned by pread() as
untrusted input. The buffer should be terminated by a null byte;
if not, then we will trigger an error for the target process. */
 
 if (strnlen(path, nread) < nread)
 return true;
]]

> Now, of course, **reading** is the easy case. The difficult case is if
> we have to **write** to the remote process... because then we can't
> play games like that. If we write data to a freed pointer, we're
> screwed, that's it. (And for somewhat unrelated bonus fun, consider
> that /proc/$pid/mem is originally intended for process debugging,
> including installing breakpoints, and will therefore happily write
> over "readonly" private mappings, such as typical mappings of
> executable code.)
> 
> So, h... I guess if anyone wants to actually write memory back to
> the target process, we'd better come up with some dedicated API for
> that, using an ioctl on the seccomp fd that magically freezes the
> target process inside the syscall while writing to its memory, or
> something like that? And until then, the manpage should have a big fat
> warning that writing to the target's memory is simply not possible
> (safely).
> 
>>if (nread == 0) {
>>fprintf(stderr, "\tS: pread() of /proc/PID/mem "
>>"returned 0 (EOF)\n");
>>exit(EXIT_FAILURE);
>>}
> .

I'll think over some changes to the text of the manual page.

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages 

Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-29 Thread Sargun Dhillon
On Mon, Oct 26, 2020 at 10:55:04AM +0100, Michael Kerrisk (man-pages) wrote:
> Hi all (and especially Tycho and Sargun),
> 
> Following review comments on the first draft (thanks to Jann, Kees,
> Christian and Tycho), I've made a lot of changes to this page.
> I've also added a few FIXMEs relating to outstanding API issues.
> I'd like a second pass review of the page before I release it.
> But also, this mail serves as a way of noting the outstanding API
> issues.
> 
> Tycho: I still have an outstanding question for you at [2].
> 
> Sargun: can you please prepare something on SECCOMP_ADDFD_FLAG_SETFD
> and SECCOMP_IOCTL_NOTIF_ADDFD to be added to this page?
> 
> I've shown the rendered version of the page below. The page source
> currently sits in a branch at
> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=seccomp_user_notif
> 
> At this point, I'm mainly interested in feedback about the FIXMEs,
> some of which relate to the text of the page itself, while the
> others relate to the various outstanding API issues. The first 
> FIXME provides a small opportunity for some bikeshedding :-);
> 
> 
> Thanks,
> 
> Michael
> 
> [1] 
> https://lore.kernel.org/linux-man/45f07f17-18b6-d187-0914-6f341fe90...@gmail.com/
> [2] 
> https://lore.kernel.org/linux-man/8f20d586-9609-ef83-c85a-272e37e68...@gmail.com/
> 
> =
> 
> SECCOMP_USER_NOTIF(2)   Linux Programmer's Manual  SECCOMP_USER_NOTIF(2)
> 
> NAME
>seccomp_user_notif - Seccomp user-space notification mechanism
> 
>┌─┐
>│FIXME│
>├─┤
>│Might "seccomp_unotify(2)" be a better name for this │
>│page?  It's slightly shorter to type, and perhaps│
>│reads better when spoken.│
>└─┘
> 
> SYNOPSIS
>#include 
>#include 
>#include 
> 
>int seccomp(unsigned int operation, unsigned int flags, void *args);
> 
>#include 
> 
>int ioctl(int fd, SECCOMP_IOCTL_NOTIF_RECV,
>  struct seccomp_notif *req);
>int ioctl(int fd, SECCOMP_IOCTL_NOTIF_SEND,
>  struct seccomp_notif_resp *resp);
>int ioctl(int fd, SECCOMP_IOCTL_NOTIF_ID_VALID, __u64 *id);
> 
> DESCRIPTION
>This page describes the user-space notification mechanism
>provided by the Secure Computing (seccomp) facility.  As well as
>the use of the SECCOMP_FILTER_FLAG_NEW_LISTENER flag, the
>SECCOMP_RET_USER_NOTIF action value, and the
>SECCOMP_GET_NOTIF_SIZES operation described in seccomp(2), this
>mechanism involves the use of a number of related ioctl(2)
>operations (described below).
> 
>Overview
>In conventional usage of a seccomp filter, the decision about how
>to treat a system call is made by the filter itself.  By
>contrast, the user-space notification mechanism allows the
>seccomp filter to delegate the handling of the system call to
>another user-space process.  Note that this mechanism is
>explicitly not intended as a method implementing security policy;
>see NOTES.
> 
>In the discussion that follows, the thread(s) on which the
>seccomp filter is installed is (are) referred to as the target,
>and the process that is notified by the user-space notification
>mechanism is referred to as the supervisor.
> 
>A suitably privileged supervisor can use the user-space
>notification mechanism to perform actions on behalf of the
>target.  The advantage of the user-space notification mechanism
>is that the supervisor will usually be able to retrieve
>information about the target and the performed system call that
>the seccomp filter itself cannot.  (A seccomp filter is limited
>in the information it can obtain and the actions that it can
>perform because it is running on a virtual machine inside the
>kernel.)
> 
>An overview of the steps performed by the target and the
>supervisor is as follows:
> 
>1. The target establishes a seccomp filter in the usual manner,
>   but with two differences:
> 
>   · The seccomp(2) flags argument includes the flag
> SECCOMP_FILTER_FLAG_NEW_LISTENER.  Consequently, the return
> value of the (successful) seccomp(2) call is a new
> "listening" file descriptor that can be used to receive
> notifications.  Only one "listening" seccomp filter can be
> installed for a thread.
> 
> ┌─┐
> │FIXME│
> 

Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-29 Thread Jann Horn
On Thu, Oct 29, 2020 at 3:04 AM Tycho Andersen  wrote:
> On Thu, Oct 29, 2020 at 02:42:58AM +0100, Jann Horn wrote:
> > On Mon, Oct 26, 2020 at 10:55 AM Michael Kerrisk (man-pages)
> >  wrote:
> > >static bool
> > >getTargetPathname(struct seccomp_notif *req, int notifyFd,
> > >  char *path, size_t len)
> > >{
> > >char procMemPath[PATH_MAX];
> > >
> > >snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", 
> > > req->pid);
> > >
> > >int procMemFd = open(procMemPath, O_RDONLY);
> > >if (procMemFd == -1)
> > >errExit("\tS: open");
> > >
> > >/* Check that the process whose info we are accessing is still 
> > > alive.
> > >   If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed
> > >   in checkNotificationIdIsValid()) succeeds, we know that the
> > >   /proc/PID/mem file descriptor that we opened corresponds to 
> > > the
> > >   process for which we received a notification. If that 
> > > process
> > >   subsequently terminates, then read() on that file descriptor
> > >   will return 0 (EOF). */
> > >
> > >checkNotificationIdIsValid(notifyFd, req->id);
> > >
> > >/* Read bytes at the location containing the pathname argument
> > >   (i.e., the first argument) of the mkdir(2) call */
> > >
> > >ssize_t nread = pread(procMemFd, path, len, req->data.args[0]);
> > >if (nread == -1)
> > >errExit("pread");
> >
> > As discussed at
> > ,
> > we need to re-check checkNotificationIdIsValid() after reading remote
> > memory but before using the read value in any way. Otherwise, the
> > syscall could in the meantime get interrupted by a signal handler, the
> > signal handler could return, and then the function that performed the
> > syscall could free() allocations or return (thereby freeing buffers on
> > the stack).
> >
> > In essence, this pread() is (unavoidably) a potential use-after-free
> > read; and to make that not have any security impact, we need to check
> > whether UAF read occurred before using the read value. This should
> > probably be called out elsewhere in the manpage, too...
> >
> > Now, of course, **reading** is the easy case. The difficult case is if
> > we have to **write** to the remote process... because then we can't
> > play games like that. If we write data to a freed pointer, we're
> > screwed, that's it. (And for somewhat unrelated bonus fun, consider
> > that /proc/$pid/mem is originally intended for process debugging,
> > including installing breakpoints, and will therefore happily write
> > over "readonly" private mappings, such as typical mappings of
> > executable code.)
> >
> > So, h... I guess if anyone wants to actually write memory back to
> > the target process, we'd better come up with some dedicated API for
> > that, using an ioctl on the seccomp fd that magically freezes the
>
> By freeze here you mean a killable wait instead of an interruptible
> wait, right?

Nope, nonkillable.

Consider the case of vfork(), where a target process does something like this:

void spawn_executable(char **argv, char **envv) {
  pid_t child = vfork();
  if (child == 0) {
char path[1000];
sprintf(path, ...);
execve(path, argv, envv);
  }
}

and the seccomp notifier wants to look at the execve() path (as a
somewhat silly example). The child process is just borrowing the
parent's stack, and as soon as the child either gets far enough into
execve() or dies, the parent continues using that stack. So keeping
the child in killable sleep would not be enough to prevent reuse of
the child's stack.


But conceptually that's not really a big problem - we already have a
way to force the target task to stay inside the seccomp code no matter
if it gets SIGKILLed or whatever, and that is to take the notify_lock.
When the target task wakes up and wants to continue executing, it has
to first get through mutex_lock(>notify_lock) - and that will
always block until the lock is free. So we could e.g. do something
like:

 - Grab references to the source pages in the supervisor's address
space with get_user_pages_fast().
 - Take mmap_sem on the target.
 - Grab references to pages in the relevant range with pin_user_pages_remote().
 - Drop the mmap_sem.
 - Take the notify_lock.
 - Recheck whether the notification with the right ID is still there.
 - Copy data from the pinned source pages to the pinned target pages.
 - Drop the notify_lock.
 - Drop the page references.

and this way we would still guarantee that the target process would
only be blocked in noninterruptible sleep for a small amount of time
(and would not be indirectly blocked on sleeping operations through
the mutex). It'd be pretty straightforward, I think. But as 

Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-28 Thread Jann Horn
On Mon, Oct 26, 2020 at 10:55 AM Michael Kerrisk (man-pages)
 wrote:
>static bool
>getTargetPathname(struct seccomp_notif *req, int notifyFd,
>  char *path, size_t len)
>{
>char procMemPath[PATH_MAX];
>
>snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", 
> req->pid);
>
>int procMemFd = open(procMemPath, O_RDONLY);
>if (procMemFd == -1)
>errExit("\tS: open");
>
>/* Check that the process whose info we are accessing is still 
> alive.
>   If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed
>   in checkNotificationIdIsValid()) succeeds, we know that the
>   /proc/PID/mem file descriptor that we opened corresponds to the
>   process for which we received a notification. If that process
>   subsequently terminates, then read() on that file descriptor
>   will return 0 (EOF). */
>
>checkNotificationIdIsValid(notifyFd, req->id);
>
>/* Read bytes at the location containing the pathname argument
>   (i.e., the first argument) of the mkdir(2) call */
>
>ssize_t nread = pread(procMemFd, path, len, req->data.args[0]);
>if (nread == -1)
>errExit("pread");

As discussed at
,
we need to re-check checkNotificationIdIsValid() after reading remote
memory but before using the read value in any way. Otherwise, the
syscall could in the meantime get interrupted by a signal handler, the
signal handler could return, and then the function that performed the
syscall could free() allocations or return (thereby freeing buffers on
the stack).

In essence, this pread() is (unavoidably) a potential use-after-free
read; and to make that not have any security impact, we need to check
whether UAF read occurred before using the read value. This should
probably be called out elsewhere in the manpage, too...

Now, of course, **reading** is the easy case. The difficult case is if
we have to **write** to the remote process... because then we can't
play games like that. If we write data to a freed pointer, we're
screwed, that's it. (And for somewhat unrelated bonus fun, consider
that /proc/$pid/mem is originally intended for process debugging,
including installing breakpoints, and will therefore happily write
over "readonly" private mappings, such as typical mappings of
executable code.)

So, h... I guess if anyone wants to actually write memory back to
the target process, we'd better come up with some dedicated API for
that, using an ioctl on the seccomp fd that magically freezes the
target process inside the syscall while writing to its memory, or
something like that? And until then, the manpage should have a big fat
warning that writing to the target's memory is simply not possible
(safely).

>if (nread == 0) {
>fprintf(stderr, "\tS: pread() of /proc/PID/mem "
>"returned 0 (EOF)\n");
>exit(EXIT_FAILURE);
>}


Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-26 Thread Tycho Andersen
On Mon, Oct 26, 2020 at 10:55:04AM +0100, Michael Kerrisk (man-pages) wrote:
> Hi all (and especially Tycho and Sargun),
> 
> Following review comments on the first draft (thanks to Jann, Kees,
> Christian and Tycho), I've made a lot of changes to this page.
> I've also added a few FIXMEs relating to outstanding API issues.
> I'd like a second pass review of the page before I release it.
> But also, this mail serves as a way of noting the outstanding API
> issues.
> 
> Tycho: I still have an outstanding question for you at [2].
> [2] 
> https://lore.kernel.org/linux-man/8f20d586-9609-ef83-c85a-272e37e68...@gmail.com/

I don't have that thread in my inbox any more, but I can reply here:
no, I don't know any users of this info, but I also don't anticipate
knowing how people will all use this feature :)

Tycho


Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-26 Thread Tycho Andersen
On Mon, Oct 26, 2020 at 03:30:29PM +0100, Michael Kerrisk (man-pages) wrote:
> Hi Tycho,
> 
> Thanks for getting back to me.
> 
> On Mon, 26 Oct 2020 at 14:54, Tycho Andersen  wrote:
> >
> > On Mon, Oct 26, 2020 at 10:55:04AM +0100, Michael Kerrisk (man-pages) wrote:
> > > Hi all (and especially Tycho and Sargun),
> > >
> > > Following review comments on the first draft (thanks to Jann, Kees,
> > > Christian and Tycho), I've made a lot of changes to this page.
> > > I've also added a few FIXMEs relating to outstanding API issues.
> > > I'd like a second pass review of the page before I release it.
> > > But also, this mail serves as a way of noting the outstanding API
> > > issues.
> > >
> > > Tycho: I still have an outstanding question for you at [2].
> > > [2] 
> > > https://lore.kernel.org/linux-man/8f20d586-9609-ef83-c85a-272e37e68...@gmail.com/
> >
> > I don't have that thread in my inbox any more, but I can reply here:
> > no, I don't know any users of this info, but I also don't anticipate
> > knowing how people will all use this feature :)
> 
> Yes, but my questions were:
> 
> [[
> [1] So, I think maybe I now understand what you intended with setting
> POLLOUT: the notification has been received ("read") and now the
> FD can be used to NOTIFY_SEND ("write") a response. Right?
> 
> [2] If that's correct, I don't have a problem with it. I just wonder:
> is it useful? IOW: are there situations where the process doing the
> NOTIFY_SEND might want to test for POLLOUT because the it doesn't
> know whether a NOTIFY_RECV has occurred?
> ]]
> 
> So, do I understand right in [1]? (The implication from your reply is
> yes, but I want to be sure...)

Yes.

> For [2], my question was not about users, but *use cases*. The
> question I asked myself is: why does the feature exist? Hence my
> question [2] reworded: "when you designed this, did you have in mind
> scenarios here the process doing the NOTIFY_SEND might need to test
> for POLLOUT because it doesn't know whether a NOTIFY_RECV has
> occurred?"

I did not.

Tycho


Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-26 Thread Michael Kerrisk (man-pages)
Hi Tycho,

Thanks for getting back to me.

On Mon, 26 Oct 2020 at 14:54, Tycho Andersen  wrote:
>
> On Mon, Oct 26, 2020 at 10:55:04AM +0100, Michael Kerrisk (man-pages) wrote:
> > Hi all (and especially Tycho and Sargun),
> >
> > Following review comments on the first draft (thanks to Jann, Kees,
> > Christian and Tycho), I've made a lot of changes to this page.
> > I've also added a few FIXMEs relating to outstanding API issues.
> > I'd like a second pass review of the page before I release it.
> > But also, this mail serves as a way of noting the outstanding API
> > issues.
> >
> > Tycho: I still have an outstanding question for you at [2].
> > [2] 
> > https://lore.kernel.org/linux-man/8f20d586-9609-ef83-c85a-272e37e68...@gmail.com/
>
> I don't have that thread in my inbox any more, but I can reply here:
> no, I don't know any users of this info, but I also don't anticipate
> knowing how people will all use this feature :)

Yes, but my questions were:

[[
[1] So, I think maybe I now understand what you intended with setting
POLLOUT: the notification has been received ("read") and now the
FD can be used to NOTIFY_SEND ("write") a response. Right?

[2] If that's correct, I don't have a problem with it. I just wonder:
is it useful? IOW: are there situations where the process doing the
NOTIFY_SEND might want to test for POLLOUT because the it doesn't
know whether a NOTIFY_RECV has occurred?
]]

So, do I understand right in [1]? (The implication from your reply is
yes, but I want to be sure...)

For [2], my question was not about users, but *use cases*. The
question I asked myself is: why does the feature exist? Hence my
question [2] reworded: "when you designed this, did you have in mind
scenarios here the process doing the NOTIFY_SEND might need to test
for POLLOUT because it doesn't know whether a NOTIFY_RECV has
occurred?"

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


For review: seccomp_user_notif(2) manual page [v2]

2020-10-26 Thread Michael Kerrisk (man-pages)
Hi all (and especially Tycho and Sargun),

Following review comments on the first draft (thanks to Jann, Kees,
Christian and Tycho), I've made a lot of changes to this page.
I've also added a few FIXMEs relating to outstanding API issues.
I'd like a second pass review of the page before I release it.
But also, this mail serves as a way of noting the outstanding API
issues.

Tycho: I still have an outstanding question for you at [2].

Sargun: can you please prepare something on SECCOMP_ADDFD_FLAG_SETFD
and SECCOMP_IOCTL_NOTIF_ADDFD to be added to this page?

I've shown the rendered version of the page below. The page source
currently sits in a branch at
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=seccomp_user_notif

At this point, I'm mainly interested in feedback about the FIXMEs,
some of which relate to the text of the page itself, while the
others relate to the various outstanding API issues. The first 
FIXME provides a small opportunity for some bikeshedding :-);


Thanks,

Michael

[1] 
https://lore.kernel.org/linux-man/45f07f17-18b6-d187-0914-6f341fe90...@gmail.com/
[2] 
https://lore.kernel.org/linux-man/8f20d586-9609-ef83-c85a-272e37e68...@gmail.com/

=

SECCOMP_USER_NOTIF(2)   Linux Programmer's Manual  SECCOMP_USER_NOTIF(2)

NAME
   seccomp_user_notif - Seccomp user-space notification mechanism

   ┌─┐
   │FIXME│
   ├─┤
   │Might "seccomp_unotify(2)" be a better name for this │
   │page?  It's slightly shorter to type, and perhaps│
   │reads better when spoken.│
   └─┘

SYNOPSIS
   #include 
   #include 
   #include 

   int seccomp(unsigned int operation, unsigned int flags, void *args);

   #include 

   int ioctl(int fd, SECCOMP_IOCTL_NOTIF_RECV,
 struct seccomp_notif *req);
   int ioctl(int fd, SECCOMP_IOCTL_NOTIF_SEND,
 struct seccomp_notif_resp *resp);
   int ioctl(int fd, SECCOMP_IOCTL_NOTIF_ID_VALID, __u64 *id);

DESCRIPTION
   This page describes the user-space notification mechanism
   provided by the Secure Computing (seccomp) facility.  As well as
   the use of the SECCOMP_FILTER_FLAG_NEW_LISTENER flag, the
   SECCOMP_RET_USER_NOTIF action value, and the
   SECCOMP_GET_NOTIF_SIZES operation described in seccomp(2), this
   mechanism involves the use of a number of related ioctl(2)
   operations (described below).

   Overview
   In conventional usage of a seccomp filter, the decision about how
   to treat a system call is made by the filter itself.  By
   contrast, the user-space notification mechanism allows the
   seccomp filter to delegate the handling of the system call to
   another user-space process.  Note that this mechanism is
   explicitly not intended as a method implementing security policy;
   see NOTES.

   In the discussion that follows, the thread(s) on which the
   seccomp filter is installed is (are) referred to as the target,
   and the process that is notified by the user-space notification
   mechanism is referred to as the supervisor.

   A suitably privileged supervisor can use the user-space
   notification mechanism to perform actions on behalf of the
   target.  The advantage of the user-space notification mechanism
   is that the supervisor will usually be able to retrieve
   information about the target and the performed system call that
   the seccomp filter itself cannot.  (A seccomp filter is limited
   in the information it can obtain and the actions that it can
   perform because it is running on a virtual machine inside the
   kernel.)

   An overview of the steps performed by the target and the
   supervisor is as follows:

   1. The target establishes a seccomp filter in the usual manner,
  but with two differences:

  · The seccomp(2) flags argument includes the flag
SECCOMP_FILTER_FLAG_NEW_LISTENER.  Consequently, the return
value of the (successful) seccomp(2) call is a new
"listening" file descriptor that can be used to receive
notifications.  Only one "listening" seccomp filter can be
installed for a thread.

┌─┐
│FIXME│
├─┤
│Is the last sentence above correct?  │
│ │
│Kees Cook (25 Oct 2020) notes:   │
│ │
│I like this