Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-28 Thread Joey Pabalinas
On Wed, Nov 28, 2018 at 11:05:49PM +0100, Christian Brauner wrote:
> On Wed, Nov 28, 2018 at 11:45:34AM -1000, Joey Pabalinas wrote:
> > On Mon, Nov 19, 2018 at 11:32:39AM +0100, Christian Brauner wrote:
> > > + if (info) {
> > > + ret = __copy_siginfo_from_user(sig, &kinfo, info);
> > > + if (unlikely(ret))
> > > + goto err;
> > 
> 
> I think you're misreading here. It jumps to:
> 
> err:
> fdput(f);
>   return ret;
> 
> and does propagate the error. This is also an old iteration of the patch.

Whoops, that I am. Sorry about that.

-- 
Cheers,
Joey Pabalinas


signature.asc
Description: PGP signature


Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-28 Thread Christian Brauner
On Wed, Nov 28, 2018 at 11:45:34AM -1000, Joey Pabalinas wrote:
> On Mon, Nov 19, 2018 at 11:32:39AM +0100, Christian Brauner wrote:
> > +   if (info) {
> > +   ret = __copy_siginfo_from_user(sig, &kinfo, info);
> > +   if (unlikely(ret))
> > +   goto err;
> 

I think you're misreading here. It jumps to:

err:
fdput(f);
return ret;

and does propagate the error. This is also an old iteration of the patch.

Christian

> What's the reason you don't propagate up the errors from 
> __copy_siginfo_from_user()?
> Granted, I admit that -E2BIG is kind of weird to return, but -EFAULT seems 
> like a
> fairly sane error.
> 
> Or is there some reason it's more useful to just return -EINVAL for all of the
> failure cases here?
> 
> -- 
> Cheers,
> Joey Pabalinas




Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-28 Thread Joey Pabalinas
On Mon, Nov 19, 2018 at 11:32:39AM +0100, Christian Brauner wrote:
> + if (info) {
> + ret = __copy_siginfo_from_user(sig, &kinfo, info);
> + if (unlikely(ret))
> + goto err;

What's the reason you don't propagate up the errors from 
__copy_siginfo_from_user()?
Granted, I admit that -E2BIG is kind of weird to return, but -EFAULT seems like 
a
fairly sane error.

Or is there some reason it's more useful to just return -EINVAL for all of the
failure cases here?

-- 
Cheers,
Joey Pabalinas


signature.asc
Description: PGP signature


Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-21 Thread Serge E. Hallyn
On Mon, Nov 19, 2018 at 03:39:54PM -0700, Tycho Andersen wrote:
> On Mon, Nov 19, 2018 at 11:32:39AM +0100, Christian Brauner wrote:
> >
> > +/**
> > + *  sys_procfd_signal - send a signal to a process through a process file
> > + *  descriptor
> > + *  @fd: the file descriptor of the process
> > + *  @sig: signal to be sent
> > + *  @info: the signal info
> > + *  @flags: future flags to be passed
> > + */
> > +SYSCALL_DEFINE4(procfd_signal, int, fd, int, sig, siginfo_t __user *, info,
> > +   int, flags)
> > +{
> 
> Can I just register an objection here that I think using a syscall
> just for this is silly?
> 
> My understanding is that the concern is that some code might do:
> 
> unknown_fd = recv_fd();
> ioctl(unknown_fd, SOME_IOCTL, NULL); // where SOME_IOCTL == PROC_FD_KILL
> // whoops, unknown_fd was a procfd and we killed a task!

This could just be my own mental model, but for something like "kill a
task", an ioctl just seems wrong.  Syscall seems more natural.

I'd ack either method.

-serge


Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-21 Thread Serge E. Hallyn
On Tue, Nov 20, 2018 at 08:23:43AM +1100, Aleksa Sarai wrote:
> On 2018-11-20, Aleksa Sarai  wrote:
> > On 2018-11-19, Christian Brauner  wrote:
> > > On Tue, Nov 20, 2018 at 07:28:57AM +1100, Aleksa Sarai wrote:
> > > > On 2018-11-19, Christian Brauner  wrote:
> > > > > + if (info) {
> > > > > + ret = __copy_siginfo_from_user(sig, &kinfo, info);
> > > > > + if (unlikely(ret))
> > > > > + goto err;
> > > > > + /*
> > > > > +  * Not even root can pretend to send signals from the 
> > > > > kernel.
> > > > > +  * Nor can they impersonate a kill()/tgkill(), which 
> > > > > adds
> > > > > +  * source info.
> > > > > +  */
> > > > > + ret = -EPERM;
> > > > > + if ((kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL) &&
> > > > > + (task_pid(current) != pid))
> > > > > + goto err;
> > > > > + } else {
> > > > > + prepare_kill_siginfo(sig, &kinfo);
> > > > > + }
> > > > 
> > > > I wonder whether we should also have a pidns restriction here, since
> > > > currently it isn't possible for a container process using a pidns to
> > > > signal processes outside its pidns. AFAICS, this isn't done through an
> > > > explicit check -- it's a side-effect of processes in a pidns not being
> > > > able to address non-descendant-pidns processes.
> > > > 
> > > > But maybe it's reasonable to allow sending a procfd to a different pidns
> > > > and the same operations working on it? If we extend the procfd API to
> > > 
> > > No, I don't think so. I really don't want any fancy semantics in here.
> > > Fancy doesn't get merged and fancy is hard to maintain. So we should do
> > > something like:
> > > 
> > > if (proc_pid_ns() != current_pid_ns)
> > >   return EINVAL
> > 
> > This isn't quite sufficient. The key thing is that you have to be in an
> > *ancestor* (or same) pidns, not the *same* pidns. Ideally you can re-use
> > the check already in pidns_get_parent, and expose it. It would be
> > something as trivial as:
> > 
> > bool pidns_is_descendant(struct pid_namespace *ns,
> >  struct pid_namespace *ancestor)
> > {
> > for (;;) {
> > if (!ns)
> > return false;
> > if (ns == ancestor)
> > break;
> > ns = ns->parent;
> > }
> > return true;
> > }
> > 
> > And you can rewrite pidns_get_parent to use it. So you would instead be
> > doing:
> > 
> > if (pidns_is_descendant(proc_pid_ns, task_active_pid_ns(current)))
> > return -EPERM;
> 
> Scratch the last bit, -EPERM is wrong here. I would argue that -EINVAL
> is *somewhat* wrong because arguable the more semantically consistent
> error (with kill(2)) would be -ESRCH -- but then you're mixing the "pid
> is dead" and "pid is not visible to you" cases. I'm not sure what the
> right errno would be here (I'm sure some of the LKML greybeards will
> have a better clue.) :P

Actually I like EXDEV for this.  ERMOTE also works.


Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-21 Thread Serge E. Hallyn
On Tue, Nov 20, 2018 at 11:31:13AM +0100, Christian Brauner wrote:
> On Mon, Nov 19, 2018 at 10:59:12PM -0600, Eric W. Biederman wrote:
> > Daniel Colascione  writes:
> > 
> > > On Mon, Nov 19, 2018 at 1:37 PM Christian Brauner  
> > > wrote:
> > >>
> > >> On Mon, Nov 19, 2018 at 01:26:22PM -0800, Daniel Colascione wrote:
> > >> > On Mon, Nov 19, 2018 at 1:21 PM, Christian Brauner 
> > >> >  wrote:
> > >> > > That can be done without a loop by comparing the level counter for 
> > >> > > the
> > >> > > two pid namespaces.
> > >> > >
> > >> > >>
> > >> > >> And you can rewrite pidns_get_parent to use it. So you would 
> > >> > >> instead be
> > >> > >> doing:
> > >> > >>
> > >> > >> if (pidns_is_descendant(proc_pid_ns, 
> > >> > >> task_active_pid_ns(current)))
> > >> > >> return -EPERM;
> > >> > >>
> > >> > >> (Or you can just copy the 5-line loop into procfd_signal -- though I
> > >> > >> imagine we'll need this for all of the procfd_* APIs.)
> > >> >
> > >> > Why is any of this even necessary? Why does the child namespace we're
> > >> > considering even have a file descriptor to its ancestor's procfs? If
> > >>
> > >> Because you can send file descriptors between processes and container
> > >> runtimes tend to do that.
> > >
> > > Right. But why *would* a container runtime send one of these procfs
> > > FDs to a container?
> > >
> > >> > it has one of these FDs, it can already *read* all sorts of
> > >> > information it really shouldn't be able to acquire, so the additional
> > >> > ability to send a signal (subject to the usual permission checks)
> > >> > feels like sticking a finger in a dike that's already well-perforated.
> > >> > IMHO, we shouldn't bother with this check. The patch would be simpler
> > >> > without it.
> > >>
> > >> We will definitely not allow signaling processes in an ancestor pid
> > >> namespace! That is a security issue! I can imagine container runtimes
> > >> killing their monitoring process etc. pp. Not happening, unless someone
> > >> with deep expertise in signals can convince me otherwise.
> > >
> > > If parent namespace procfs FDs or mounts really can leak into child
> > > namespaces as easily as Aleksa says, then I don't mind adding the
> > > check. I was under the impression that if you find yourself in this
> > > situation, you already have a big problem.
> > 
> > There is one big reason to have the check, and I have not seen it
> > mentioned yet in this thread.
> > 
> > When SI_USER is set we report the pid of the sender of the signal in
> > si_pid.  When the signal comes from the kernel si_pid == 0.  When signal
> > is sent from an ancestor pid namespace si_pid also equals 0 (which is
> > reasonable).
> > 
> > A signal out to a process in a parent pid namespace such as SIGCHLD is
> > reasonable as we can map the pid.  I really don't see the point of
> > forbidding that.  From the perspective of the process in the parent pid
> > namespace it is just another process in it's pid namespace.  So it
> > should pose no problem from the perspective of the receiving process.
> > 
> > A signal to a process in a pid namespace that is neither a parent nor a
> > descendent pid namespace would be a problem, as there is no well defined
> > notion of what si_pid should be set to.  So for that case perhaps we
> > should have something like a noprocess pid that we can set.  Perhaps we
> > could set si_pid to 0x.  That would take a small extension to
> > pid_nr_ns.
> > 
> > File descriptors are not namespaced.  It is completely legitimate to use
> > file descriptors to get around limitations of namespaces.
> 
> Frankly, I don't see a good argument for why we would allow that even if
> safe. I have not heard a legitimate use-case or need for this.
> At this point I care about very simple semantics. Being able to signal
> into ancestor pid namespaces and cousin namespaces is interesting but
> makes the syscall more brittle and harder to understand.

Yeah, I'm with you on that.  We can always open that door later if a good
use case comes up, but I prefer simple at first.

> Changing pid_nr_ns() might be the solution but this function is called
> all over the place in the kernel and I'm not going to risk breaking
> something by changing it for a feature that no one so far has ever
> asked for.
> If you are ok with this then we should hold off on this. We can always
> add this feature later by removing the check when someone has a use-case
> for it.
> I'll send a v2 of the patch that keeps the restriction for now. If you
> insist on it being removed we can make the change in a follow-up
> iteration.
> 
> Christian
> 
> > 
> > Adding limitations to a file descriptor based api because someone else
> > can't set up their processes in such a way as to get the restrictions
> > they are looking for seems very sad.
> > 
> > Frankly I think it is one of the better features of namespaces that we
> > have to carefully handle and define these cases so that when the
> > inevitable le

Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-20 Thread Christian Brauner
On Mon, Nov 19, 2018 at 10:59:12PM -0600, Eric W. Biederman wrote:
> Daniel Colascione  writes:
> 
> > On Mon, Nov 19, 2018 at 1:37 PM Christian Brauner  
> > wrote:
> >>
> >> On Mon, Nov 19, 2018 at 01:26:22PM -0800, Daniel Colascione wrote:
> >> > On Mon, Nov 19, 2018 at 1:21 PM, Christian Brauner 
> >> >  wrote:
> >> > > That can be done without a loop by comparing the level counter for the
> >> > > two pid namespaces.
> >> > >
> >> > >>
> >> > >> And you can rewrite pidns_get_parent to use it. So you would instead 
> >> > >> be
> >> > >> doing:
> >> > >>
> >> > >> if (pidns_is_descendant(proc_pid_ns, task_active_pid_ns(current)))
> >> > >> return -EPERM;
> >> > >>
> >> > >> (Or you can just copy the 5-line loop into procfd_signal -- though I
> >> > >> imagine we'll need this for all of the procfd_* APIs.)
> >> >
> >> > Why is any of this even necessary? Why does the child namespace we're
> >> > considering even have a file descriptor to its ancestor's procfs? If
> >>
> >> Because you can send file descriptors between processes and container
> >> runtimes tend to do that.
> >
> > Right. But why *would* a container runtime send one of these procfs
> > FDs to a container?
> >
> >> > it has one of these FDs, it can already *read* all sorts of
> >> > information it really shouldn't be able to acquire, so the additional
> >> > ability to send a signal (subject to the usual permission checks)
> >> > feels like sticking a finger in a dike that's already well-perforated.
> >> > IMHO, we shouldn't bother with this check. The patch would be simpler
> >> > without it.
> >>
> >> We will definitely not allow signaling processes in an ancestor pid
> >> namespace! That is a security issue! I can imagine container runtimes
> >> killing their monitoring process etc. pp. Not happening, unless someone
> >> with deep expertise in signals can convince me otherwise.
> >
> > If parent namespace procfs FDs or mounts really can leak into child
> > namespaces as easily as Aleksa says, then I don't mind adding the
> > check. I was under the impression that if you find yourself in this
> > situation, you already have a big problem.
> 
> There is one big reason to have the check, and I have not seen it
> mentioned yet in this thread.
> 
> When SI_USER is set we report the pid of the sender of the signal in
> si_pid.  When the signal comes from the kernel si_pid == 0.  When signal
> is sent from an ancestor pid namespace si_pid also equals 0 (which is
> reasonable).
> 
> A signal out to a process in a parent pid namespace such as SIGCHLD is
> reasonable as we can map the pid.  I really don't see the point of
> forbidding that.  From the perspective of the process in the parent pid
> namespace it is just another process in it's pid namespace.  So it
> should pose no problem from the perspective of the receiving process.
> 
> A signal to a process in a pid namespace that is neither a parent nor a
> descendent pid namespace would be a problem, as there is no well defined
> notion of what si_pid should be set to.  So for that case perhaps we
> should have something like a noprocess pid that we can set.  Perhaps we
> could set si_pid to 0x.  That would take a small extension to
> pid_nr_ns.
> 
> File descriptors are not namespaced.  It is completely legitimate to use
> file descriptors to get around limitations of namespaces.

Frankly, I don't see a good argument for why we would allow that even if
safe. I have not heard a legitimate use-case or need for this.
At this point I care about very simple semantics. Being able to signal
into ancestor pid namespaces and cousin namespaces is interesting but
makes the syscall more brittle and harder to understand.
Changing pid_nr_ns() might be the solution but this function is called
all over the place in the kernel and I'm not going to risk breaking
something by changing it for a feature that no one so far has ever
asked for.
If you are ok with this then we should hold off on this. We can always
add this feature later by removing the check when someone has a use-case
for it.
I'll send a v2 of the patch that keeps the restriction for now. If you
insist on it being removed we can make the change in a follow-up
iteration.

Christian

> 
> Adding limitations to a file descriptor based api because someone else
> can't set up their processes in such a way as to get the restrictions
> they are looking for seems very sad.
> 
> Frankly I think it is one of the better features of namespaces that we
> have to carefully handle and define these cases so that when the
> inevitable leaks happen you are not immediately in a world of hurt.  All
> of the other permission checks etc continue to do their job.  Plus you
> are prepared for the case when someone wants their containers to have an
> interesting communication primitive.
> 
> Eric
> 
> 
> 
> 


Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-19 Thread Eric W. Biederman
Daniel Colascione  writes:

> On Mon, Nov 19, 2018 at 1:37 PM Christian Brauner  
> wrote:
>>
>> On Mon, Nov 19, 2018 at 01:26:22PM -0800, Daniel Colascione wrote:
>> > On Mon, Nov 19, 2018 at 1:21 PM, Christian Brauner  
>> > wrote:
>> > > That can be done without a loop by comparing the level counter for the
>> > > two pid namespaces.
>> > >
>> > >>
>> > >> And you can rewrite pidns_get_parent to use it. So you would instead be
>> > >> doing:
>> > >>
>> > >> if (pidns_is_descendant(proc_pid_ns, task_active_pid_ns(current)))
>> > >> return -EPERM;
>> > >>
>> > >> (Or you can just copy the 5-line loop into procfd_signal -- though I
>> > >> imagine we'll need this for all of the procfd_* APIs.)
>> >
>> > Why is any of this even necessary? Why does the child namespace we're
>> > considering even have a file descriptor to its ancestor's procfs? If
>>
>> Because you can send file descriptors between processes and container
>> runtimes tend to do that.
>
> Right. But why *would* a container runtime send one of these procfs
> FDs to a container?
>
>> > it has one of these FDs, it can already *read* all sorts of
>> > information it really shouldn't be able to acquire, so the additional
>> > ability to send a signal (subject to the usual permission checks)
>> > feels like sticking a finger in a dike that's already well-perforated.
>> > IMHO, we shouldn't bother with this check. The patch would be simpler
>> > without it.
>>
>> We will definitely not allow signaling processes in an ancestor pid
>> namespace! That is a security issue! I can imagine container runtimes
>> killing their monitoring process etc. pp. Not happening, unless someone
>> with deep expertise in signals can convince me otherwise.
>
> If parent namespace procfs FDs or mounts really can leak into child
> namespaces as easily as Aleksa says, then I don't mind adding the
> check. I was under the impression that if you find yourself in this
> situation, you already have a big problem.

There is one big reason to have the check, and I have not seen it
mentioned yet in this thread.

When SI_USER is set we report the pid of the sender of the signal in
si_pid.  When the signal comes from the kernel si_pid == 0.  When signal
is sent from an ancestor pid namespace si_pid also equals 0 (which is
reasonable).

A signal out to a process in a parent pid namespace such as SIGCHLD is
reasonable as we can map the pid.  I really don't see the point of
forbidding that.  From the perspective of the process in the parent pid
namespace it is just another process in it's pid namespace.  So it
should pose no problem from the perspective of the receiving process.

A signal to a process in a pid namespace that is neither a parent nor a
descendent pid namespace would be a problem, as there is no well defined
notion of what si_pid should be set to.  So for that case perhaps we
should have something like a noprocess pid that we can set.  Perhaps we
could set si_pid to 0x.  That would take a small extension to
pid_nr_ns.

File descriptors are not namespaced.  It is completely legitimate to use
file descriptors to get around limitations of namespaces.

Adding limitations to a file descriptor based api because someone else
can't set up their processes in such a way as to get the restrictions
they are looking for seems very sad.

Frankly I think it is one of the better features of namespaces that we
have to carefully handle and define these cases so that when the
inevitable leaks happen you are not immediately in a world of hurt.  All
of the other permission checks etc continue to do their job.  Plus you
are prepared for the case when someone wants their containers to have an
interesting communication primitive.

Eric






Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-19 Thread Daniel Colascione
On Mon, Nov 19, 2018 at 4:28 PM Andy Lutomirski  wrote:
>
> On Mon, Nov 19, 2018 at 3:07 PM Tycho Andersen  wrote:
> > > These tools also care about ioctls. Adding a system call is a pain,
> > > but the solution is to make adding system calls less of a pain, not to
> > > permanently make the Linux ABI worse.
> >
> > For user-defined values of "worse" :)
> >
>
> I tend to agree with Tycho here.  But I'm wondering if it might be
> worth considering a better ioctl.
>
> /me dons flame-proof hat
>
> We could do:
>
> long better_ioctl(int fd, u32 nr, const void *inbuf, size_t inlen,
> const void *outbuf, size_t outlen);
>
> and have a central table in the kernel listing all possible nr values
> along with which driver they belong to.  We could have a sane
> signature and get rid of the nr collision problem.

The essential difference between a regular system call and an ioctl is
that in the former, the invoked kernel-side code depends on the
operation number, and in the latter, the invoked kernel-side code
depends on the operation number and file descriptor type. By creating
a new kind of collision-free ioctl, all you've done is re-invent the
system call, but with a funky calling convention and less operand
space. It makes no sense. Previous system call multiplexers --- e.g.,
socketcall --- are widely regarded as mistakes, and there's no reason
to repeat these mistakes.

System call numbers are not scarce, and your other proposal to clean
up the x86 numbering will make wiring up a new system call less
annoying. The *only* purpose of an ioctl is to solve the system call
numbering coordination problem --- if the invoked kernel-side code
depends on (DRIVER, OPERATION_NUMBER), and DRIVER can vary out-of-tree
with ioctl, ioctl lets out-of-tree code expose interfaces. For in-tree
code, this problem doesn't exist, so there's no reason to use the
awkward ioctl workaround!


Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-19 Thread Andy Lutomirski
On Mon, Nov 19, 2018 at 4:33 PM Christian Brauner  wrote:
>
> On Mon, Nov 19, 2018 at 04:27:49PM -0800, Andy Lutomirski wrote:
> > On Mon, Nov 19, 2018 at 3:07 PM Tycho Andersen  wrote:
> > > > These tools also care about ioctls. Adding a system call is a pain,
> > > > but the solution is to make adding system calls less of a pain, not to
> > > > permanently make the Linux ABI worse.
> > >
> > > For user-defined values of "worse" :)
> > >
> >
> > I tend to agree with Tycho here.  But I'm wondering if it might be
> > worth considering a better ioctl.
> >
> > /me dons flame-proof hat
> >
> > We could do:
> >
> > long better_ioctl(int fd, u32 nr, const void *inbuf, size_t inlen,
> > const void *outbuf, size_t outlen);
>
> I'm the writer of this patch so take this with a grain of salt.
> I think it is a bad idea to stop this patch and force us to introduce a
> new type of ioctl().

I agree completely.

> An ioctl() is also not easy to use for this task because we want to add
> a siginfo_t argument which I actually think provides value and makes
> this interface more useful.
>

You could always have a struct procfd_kill and pass a pointer to
*that*.  But sure, it's ugly.


Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-19 Thread Christian Brauner
On Mon, Nov 19, 2018 at 04:27:49PM -0800, Andy Lutomirski wrote:
> On Mon, Nov 19, 2018 at 3:07 PM Tycho Andersen  wrote:
> > > These tools also care about ioctls. Adding a system call is a pain,
> > > but the solution is to make adding system calls less of a pain, not to
> > > permanently make the Linux ABI worse.
> >
> > For user-defined values of "worse" :)
> >
> 
> I tend to agree with Tycho here.  But I'm wondering if it might be
> worth considering a better ioctl.
> 
> /me dons flame-proof hat
> 
> We could do:
> 
> long better_ioctl(int fd, u32 nr, const void *inbuf, size_t inlen,
> const void *outbuf, size_t outlen);

I'm the writer of this patch so take this with a grain of salt.
I think it is a bad idea to stop this patch and force us to introduce a
new type of ioctl().
An ioctl() is also not easy to use for this task because we want to add
a siginfo_t argument which I actually think provides value and makes
this interface more useful.

> 
> and have a central table in the kernel listing all possible nr values
> along with which driver they belong to.  We could have a sane
> signature and get rid of the nr collision problem.
> 
> The major problem I see is that u32 isn't really enough to have a sane
> way to allow out-of-tree drivers to use this, and that we can't
> readily use anything bigger than u32 without indirection because we're
> out of syscall argument space.


Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-19 Thread Andy Lutomirski
On Mon, Nov 19, 2018 at 3:07 PM Tycho Andersen  wrote:
> > These tools also care about ioctls. Adding a system call is a pain,
> > but the solution is to make adding system calls less of a pain, not to
> > permanently make the Linux ABI worse.
>
> For user-defined values of "worse" :)
>

I tend to agree with Tycho here.  But I'm wondering if it might be
worth considering a better ioctl.

/me dons flame-proof hat

We could do:

long better_ioctl(int fd, u32 nr, const void *inbuf, size_t inlen,
const void *outbuf, size_t outlen);

and have a central table in the kernel listing all possible nr values
along with which driver they belong to.  We could have a sane
signature and get rid of the nr collision problem.

The major problem I see is that u32 isn't really enough to have a sane
way to allow out-of-tree drivers to use this, and that we can't
readily use anything bigger than u32 without indirection because we're
out of syscall argument space.


Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-19 Thread Christian Brauner
On Tue, Nov 20, 2018 at 07:37:58AM +0800, kbuild test robot wrote:
> Hi Christian,
> 
> Thank you for the patch! Yet something to improve:
> 
> [auto build test ERROR on linus/master]
> [also build test ERROR on v4.20-rc3]
> [cannot apply to next-20181119]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]
> 
> url:
> https://github.com/0day-ci/linux/commits/Christian-Brauner/proc-allow-signaling-processes-via-file-descriptors/20181120-063836
> config: riscv-tinyconfig (attached as .config)
> compiler: riscv64-linux-gcc (GCC) 8.1.0
> reproduce:
> wget 
> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> GCC_VERSION=8.1.0 make.cross ARCH=riscv 
> 
> All errors (new ones prefixed by >>):
> 
>kernel/signal.c: In function '__do_sys_procfd_signal':
> >> kernel/signal.c:3341:7: error: implicit declaration of function 
> >> 'proc_is_procfd'; did you mean 'clockid_to_fd'? 
> >> [-Werror=implicit-function-declaration]
>  if (!proc_is_procfd(f.file))
>   ^~

On my radar and fixed. This happens when CONFIG_PROC_FS unset.

>   clockid_to_fd
>cc1: some warnings being treated as errors
> 
> vim +3341 kernel/signal.c
> 
>   3314
>   3315/**
>   3316 *  sys_procfd_signal - send a signal to a process through a 
> process file
>   3317 *  descriptor
>   3318 *  @fd: the file descriptor of the process
>   3319 *  @sig: signal to be sent
>   3320 *  @info: the signal info
>   3321 *  @flags: future flags to be passed
>   3322 */
>   3323SYSCALL_DEFINE4(procfd_signal, int, fd, int, sig, siginfo_t 
> __user *, info,
>   3324int, flags)
>   3325{
>   3326int ret;
>   3327struct pid *pid;
>   3328kernel_siginfo_t kinfo;
>   3329struct fd f;
>   3330
>   3331/* Enforce flags be set to 0 until we add an extension. 
> */
>   3332if (flags)
>   return -EINVAL;
>   3334
>   3335f = fdget_raw(fd);
>   3336if (!f.file)
>   3337return -EBADF;
>   3338
>   3339ret = -EINVAL;
>   3340/* Is this a process file descriptor? */
> > 3341if (!proc_is_procfd(f.file))
>   3342goto err;
>   3343
>   3344pid = f.file->private_data;
>   3345if (!pid)
>   3346goto err;
>   3347
>   3348if (info) {
>   3349ret = __copy_siginfo_from_user(sig, &kinfo, 
> info);
>   3350if (unlikely(ret))
>   3351goto err;
>   3352/*
>   3353 * Not even root can pretend to send signals 
> from the kernel.
>   3354 * Nor can they impersonate a kill()/tgkill(), 
> which adds
>   3355 * source info.
>   3356 */
>   3357ret = -EPERM;
>   3358if ((kinfo.si_code >= 0 || kinfo.si_code == 
> SI_TKILL) &&
>   3359(task_pid(current) != pid))
>   3360goto err;
>   3361} else {
>   3362prepare_kill_siginfo(sig, &kinfo);
>   3363}
>   3364
>   3365ret = kill_pid_info(sig, &kinfo, pid);
>   3366
>   3367err:
>   3368fdput(f);
>   3369return ret;
>   3370}
>   3371
> 
> ---
> 0-DAY kernel test infrastructureOpen Source Technology Center
> https://lists.01.org/pipermail/kbuild-all   Intel Corporation




Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-19 Thread kbuild test robot
Hi Christian,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v4.20-rc3]
[cannot apply to next-20181119]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Christian-Brauner/proc-allow-signaling-processes-via-file-descriptors/20181120-063836
config: riscv-tinyconfig (attached as .config)
compiler: riscv64-linux-gcc (GCC) 8.1.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=8.1.0 make.cross ARCH=riscv 

All warnings (new ones prefixed by >>):

>> :1338:2: warning: #warning syscall procfd_signal not implemented 
>> [-Wcpp]

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-19 Thread kbuild test robot
Hi Christian,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.20-rc3]
[cannot apply to next-20181119]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Christian-Brauner/proc-allow-signaling-processes-via-file-descriptors/20181120-063836
config: riscv-tinyconfig (attached as .config)
compiler: riscv64-linux-gcc (GCC) 8.1.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=8.1.0 make.cross ARCH=riscv 

All errors (new ones prefixed by >>):

   kernel/signal.c: In function '__do_sys_procfd_signal':
>> kernel/signal.c:3341:7: error: implicit declaration of function 
>> 'proc_is_procfd'; did you mean 'clockid_to_fd'? 
>> [-Werror=implicit-function-declaration]
 if (!proc_is_procfd(f.file))
  ^~
  clockid_to_fd
   cc1: some warnings being treated as errors

vim +3341 kernel/signal.c

  3314  
  3315  /**
  3316   *  sys_procfd_signal - send a signal to a process through a process 
file
  3317   *  descriptor
  3318   *  @fd: the file descriptor of the process
  3319   *  @sig: signal to be sent
  3320   *  @info: the signal info
  3321   *  @flags: future flags to be passed
  3322   */
  3323  SYSCALL_DEFINE4(procfd_signal, int, fd, int, sig, siginfo_t __user *, 
info,
  3324  int, flags)
  3325  {
  3326  int ret;
  3327  struct pid *pid;
  3328  kernel_siginfo_t kinfo;
  3329  struct fd f;
  3330  
  3331  /* Enforce flags be set to 0 until we add an extension. */
  3332  if (flags)
    return -EINVAL;
  3334  
  3335  f = fdget_raw(fd);
  3336  if (!f.file)
  3337  return -EBADF;
  3338  
  3339  ret = -EINVAL;
  3340  /* Is this a process file descriptor? */
> 3341  if (!proc_is_procfd(f.file))
  3342  goto err;
  3343  
  3344  pid = f.file->private_data;
  3345  if (!pid)
  3346  goto err;
  3347  
  3348  if (info) {
  3349  ret = __copy_siginfo_from_user(sig, &kinfo, info);
  3350  if (unlikely(ret))
  3351  goto err;
  3352  /*
  3353   * Not even root can pretend to send signals from the 
kernel.
  3354   * Nor can they impersonate a kill()/tgkill(), which 
adds
  3355   * source info.
  3356   */
  3357  ret = -EPERM;
  3358  if ((kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL) &&
  3359  (task_pid(current) != pid))
  3360  goto err;
  3361  } else {
  3362  prepare_kill_siginfo(sig, &kinfo);
  3363  }
  3364  
  3365  ret = kill_pid_info(sig, &kinfo, pid);
  3366  
  3367  err:
  3368  fdput(f);
  3369  return ret;
  3370  }
  3371  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-19 Thread Tycho Andersen
On Mon, Nov 19, 2018 at 02:49:22PM -0800, Daniel Colascione wrote:
> On Mon, Nov 19, 2018 at 2:40 PM Tycho Andersen  wrote:
> > Can I just register an objection here that I think using a syscall
> > just for this is silly?
> 
> Yes, you can argue that the bikeshed should be ioctl-colored and not
> syscall-colored.
> 
> > My understanding is that the concern is that some code might do:
> >
> > unknown_fd = recv_fd();
> > ioctl(unknown_fd, SOME_IOCTL, NULL); // where SOME_IOCTL == PROC_FD_KILL
> > // whoops, unknown_fd was a procfd and we killed a task!
> >
> > In my experience when writing fd sending/receiving code, the sender and
> > receiver are fairly tightly coupled. Has anyone ever actually fixed a
> > bug where they had an fd that they lost track of what "type" it was
> > and screwed up like this? It seems completely theoretical to me.
> 
> Yes, I have fixed bugs of this form.
> 
> > The ioctl() approach has the benefit of being extensible.
> 
> The system call table is also extensible.

But not infinitely so. The x32 ABI starts at 512, and right now I see
334 syscalls on x86_64. So the next 178 people can say "let's just
define a syscall", and after that? I suppose we could move to setting
BIT(10), but how much userspace assumes > 512 => compat syscall and
blocks it via seccomp or whatever?

Contrast that with the ioctl space, which is an unsigned long and
fairly sparse still (Documentation/ioctl/ioctl-number.txt).

> ioctl is for when a given piece of functionality *can't*
> realistically get its own system call because it's separated from
> the main kernel somehow. procfs is a core part of the kernel, so we
> can and should expose interfaces to it using system calls.

I suppose it's obvious, but I disagree.

> > Adding a
> > syscall means that everyone has to do all the boilerplate for each new
> > pid op in the kernel, arches, libc, strace, etc.
> 
> These tools also care about ioctls. Adding a system call is a pain,
> but the solution is to make adding system calls less of a pain, not to
> permanently make the Linux ABI worse.

For user-defined values of "worse" :)

Tycho


Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-19 Thread Daniel Colascione
On Mon, Nov 19, 2018 at 2:40 PM Tycho Andersen  wrote:
> Can I just register an objection here that I think using a syscall
> just for this is silly?

Yes, you can argue that the bikeshed should be ioctl-colored and not
syscall-colored.

> My understanding is that the concern is that some code might do:
>
> unknown_fd = recv_fd();
> ioctl(unknown_fd, SOME_IOCTL, NULL); // where SOME_IOCTL == PROC_FD_KILL
> // whoops, unknown_fd was a procfd and we killed a task!
>
> In my experience when writing fd sending/receiving code, the sender and
> receiver are fairly tightly coupled. Has anyone ever actually fixed a
> bug where they had an fd that they lost track of what "type" it was
> and screwed up like this? It seems completely theoretical to me.

Yes, I have fixed bugs of this form.

> The ioctl() approach has the benefit of being extensible.

The system call table is also extensible. ioctl is for when a given
piece of functionality *can't* realistically get its own system call
because it's separated from the main kernel somehow. procfs is a core
part of the kernel, so we can and should expose interfaces to it using
system calls.

> Adding a
> syscall means that everyone has to do all the boilerplate for each new
> pid op in the kernel, arches, libc, strace, etc.

These tools also care about ioctls. Adding a system call is a pain,
but the solution is to make adding system calls less of a pain, not to
permanently make the Linux ABI worse.


Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-19 Thread Tycho Andersen
On Mon, Nov 19, 2018 at 11:32:39AM +0100, Christian Brauner wrote:
>
> +/**
> + *  sys_procfd_signal - send a signal to a process through a process file
> + *  descriptor
> + *  @fd: the file descriptor of the process
> + *  @sig: signal to be sent
> + *  @info: the signal info
> + *  @flags: future flags to be passed
> + */
> +SYSCALL_DEFINE4(procfd_signal, int, fd, int, sig, siginfo_t __user *, info,
> + int, flags)
> +{

Can I just register an objection here that I think using a syscall
just for this is silly?

My understanding is that the concern is that some code might do:

unknown_fd = recv_fd();
ioctl(unknown_fd, SOME_IOCTL, NULL); // where SOME_IOCTL == PROC_FD_KILL
// whoops, unknown_fd was a procfd and we killed a task!

In my experience when writing fd sending/receiving code, the sender and
receiver are fairly tightly coupled. Has anyone ever actually fixed a
bug where they had an fd that they lost track of what "type" it was
and screwed up like this? It seems completely theoretical to me.

The ioctl() approach has the benefit of being extensible. Adding a
syscall means that everyone has to do all the boilerplate for each new
pid op in the kernel, arches, libc, strace, etc.

Tycho


Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-19 Thread Daniel Colascione
On Mon, Nov 19, 2018 at 1:37 PM Christian Brauner  wrote:
>
> On Mon, Nov 19, 2018 at 01:26:22PM -0800, Daniel Colascione wrote:
> > On Mon, Nov 19, 2018 at 1:21 PM, Christian Brauner  
> > wrote:
> > > That can be done without a loop by comparing the level counter for the
> > > two pid namespaces.
> > >
> > >>
> > >> And you can rewrite pidns_get_parent to use it. So you would instead be
> > >> doing:
> > >>
> > >> if (pidns_is_descendant(proc_pid_ns, task_active_pid_ns(current)))
> > >> return -EPERM;
> > >>
> > >> (Or you can just copy the 5-line loop into procfd_signal -- though I
> > >> imagine we'll need this for all of the procfd_* APIs.)
> >
> > Why is any of this even necessary? Why does the child namespace we're
> > considering even have a file descriptor to its ancestor's procfs? If
>
> Because you can send file descriptors between processes and container
> runtimes tend to do that.

Right. But why *would* a container runtime send one of these procfs
FDs to a container?

> > it has one of these FDs, it can already *read* all sorts of
> > information it really shouldn't be able to acquire, so the additional
> > ability to send a signal (subject to the usual permission checks)
> > feels like sticking a finger in a dike that's already well-perforated.
> > IMHO, we shouldn't bother with this check. The patch would be simpler
> > without it.
>
> We will definitely not allow signaling processes in an ancestor pid
> namespace! That is a security issue! I can imagine container runtimes
> killing their monitoring process etc. pp. Not happening, unless someone
> with deep expertise in signals can convince me otherwise.

If parent namespace procfs FDs or mounts really can leak into child
namespaces as easily as Aleksa says, then I don't mind adding the
check. I was under the impression that if you find yourself in this
situation, you already have a big problem.


Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-19 Thread Christian Brauner
On Mon, Nov 19, 2018 at 01:26:22PM -0800, Daniel Colascione wrote:
> On Mon, Nov 19, 2018 at 1:21 PM, Christian Brauner  
> wrote:
> > That can be done without a loop by comparing the level counter for the
> > two pid namespaces.
> >
> >>
> >> And you can rewrite pidns_get_parent to use it. So you would instead be
> >> doing:
> >>
> >> if (pidns_is_descendant(proc_pid_ns, task_active_pid_ns(current)))
> >> return -EPERM;
> >>
> >> (Or you can just copy the 5-line loop into procfd_signal -- though I
> >> imagine we'll need this for all of the procfd_* APIs.)
> 
> Why is any of this even necessary? Why does the child namespace we're
> considering even have a file descriptor to its ancestor's procfs? If

Because you can send file descriptors between processes and container
runtimes tend to do that.

> it has one of these FDs, it can already *read* all sorts of
> information it really shouldn't be able to acquire, so the additional
> ability to send a signal (subject to the usual permission checks)
> feels like sticking a finger in a dike that's already well-perforated.
> IMHO, we shouldn't bother with this check. The patch would be simpler
> without it.

We will definitely not allow signaling processes in an ancestor pid
namespace! That is a security issue! I can imagine container runtimes
killing their monitoring process etc. pp. Not happening, unless someone
with deep expertise in signals can convince me otherwise.


Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-19 Thread Aleksa Sarai
On 2018-11-19, Daniel Colascione  wrote:
> On Mon, Nov 19, 2018 at 1:21 PM, Christian Brauner  
> wrote:
> > That can be done without a loop by comparing the level counter for the
> > two pid namespaces.
> >
> >>
> >> And you can rewrite pidns_get_parent to use it. So you would instead be
> >> doing:
> >>
> >> if (pidns_is_descendant(proc_pid_ns, task_active_pid_ns(current)))
> >> return -EPERM;
> >>
> >> (Or you can just copy the 5-line loop into procfd_signal -- though I
> >> imagine we'll need this for all of the procfd_* APIs.)
> 
> Why is any of this even necessary? Why does the child namespace we're
> considering even have a file descriptor to its ancestor's procfs? If
> it has one of these FDs, it can already *read* all sorts of
> information it really shouldn't be able to acquire, so the additional
> ability to send a signal (subject to the usual permission checks)
> feels like sticking a finger in a dike that's already well-perforated.
> IMHO, we shouldn't bother with this check. The patch would be simpler
> without it.

First of all, currently it isn't possible to signal processes in an
ancestor pidns. Given the long thread about exit code visibility
semantics, I'm sure you see why bringing up this question is reasonable.

Some people (stupidly) bind-mount / into containers. There were several
CVEs in both LXC and runc where you could access the host filesystem
(including the host /proc). I'd prefer to not provide a mechanism for
such escalations to start sending signals to host processes, since I
don't see a strong reason why it should be allowed (and allowing it
would add more cracks to the isolation of pidns).

I think there is a huge difference between having read access to /proc
and being able to use /proc to signal processes which you ordinarily
would not be able to signal.

And another important point is that of semantics.

If we move forward with procfd_new() and the rest of the API we are
discussing, I'd argue we'd want to allow passing an nsfs fd to specify
what pidns we want the process to be created in (for procfd_new()). This
will obviously require a permission check to make sure we aren't
creating processes in a parent pidns -- and so for consistency all
procfd_* operations should have similar checks.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH



signature.asc
Description: PGP signature


Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-19 Thread Daniel Colascione
On Mon, Nov 19, 2018 at 1:21 PM, Christian Brauner  wrote:
> That can be done without a loop by comparing the level counter for the
> two pid namespaces.
>
>>
>> And you can rewrite pidns_get_parent to use it. So you would instead be
>> doing:
>>
>> if (pidns_is_descendant(proc_pid_ns, task_active_pid_ns(current)))
>> return -EPERM;
>>
>> (Or you can just copy the 5-line loop into procfd_signal -- though I
>> imagine we'll need this for all of the procfd_* APIs.)

Why is any of this even necessary? Why does the child namespace we're
considering even have a file descriptor to its ancestor's procfs? If
it has one of these FDs, it can already *read* all sorts of
information it really shouldn't be able to acquire, so the additional
ability to send a signal (subject to the usual permission checks)
feels like sticking a finger in a dike that's already well-perforated.
IMHO, we shouldn't bother with this check. The patch would be simpler
without it.


Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-19 Thread Aleksa Sarai
On 2018-11-19, Christian Brauner  wrote:
> On Tue, Nov 20, 2018 at 08:18:10AM +1100, Aleksa Sarai wrote:
> > On 2018-11-19, Christian Brauner  wrote:
> > > On Tue, Nov 20, 2018 at 07:28:57AM +1100, Aleksa Sarai wrote:
> > > > On 2018-11-19, Christian Brauner  wrote:
> > > > > + if (info) {
> > > > > + ret = __copy_siginfo_from_user(sig, &kinfo, info);
> > > > > + if (unlikely(ret))
> > > > > + goto err;
> > > > > + /*
> > > > > +  * Not even root can pretend to send signals from the 
> > > > > kernel.
> > > > > +  * Nor can they impersonate a kill()/tgkill(), which 
> > > > > adds
> > > > > +  * source info.
> > > > > +  */
> > > > > + ret = -EPERM;
> > > > > + if ((kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL) &&
> > > > > + (task_pid(current) != pid))
> > > > > + goto err;
> > > > > + } else {
> > > > > + prepare_kill_siginfo(sig, &kinfo);
> > > > > + }
> > > > 
> > > > I wonder whether we should also have a pidns restriction here, since
> > > > currently it isn't possible for a container process using a pidns to
> > > > signal processes outside its pidns. AFAICS, this isn't done through an
> > > > explicit check -- it's a side-effect of processes in a pidns not being
> > > > able to address non-descendant-pidns processes.
> > > > 
> > > > But maybe it's reasonable to allow sending a procfd to a different pidns
> > > > and the same operations working on it? If we extend the procfd API to
> > > 
> > > No, I don't think so. I really don't want any fancy semantics in here.
> > > Fancy doesn't get merged and fancy is hard to maintain. So we should do
> > > something like:
> > > 
> > > if (proc_pid_ns() != current_pid_ns)
> > >   return EINVAL
> > 
> > This isn't quite sufficient. The key thing is that you have to be in an
> > *ancestor* (or same) pidns, not the *same* pidns. Ideally you can re-use
> > the check already in pidns_get_parent, and expose it. It would be
> > something as trivial as:
> > 
> > bool pidns_is_descendant(struct pid_namespace *ns,
> >  struct pid_namespace *ancestor)
> > {
> > for (;;) {
> > if (!ns)
> > return false;
> > if (ns == ancestor)
> > break;
> > ns = ns->parent;
> > }
> > return true;
> > }
> 
> That can be done without a loop by comparing the level counter for the
> two pid namespaces.

If so, we can refactor how pidns_get_parent() and family work. :P

But yes, I agree with doing the above check.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH



signature.asc
Description: PGP signature


Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-19 Thread Aleksa Sarai
On 2018-11-20, Aleksa Sarai  wrote:
> On 2018-11-19, Christian Brauner  wrote:
> > On Tue, Nov 20, 2018 at 07:28:57AM +1100, Aleksa Sarai wrote:
> > > On 2018-11-19, Christian Brauner  wrote:
> > > > +   if (info) {
> > > > +   ret = __copy_siginfo_from_user(sig, &kinfo, info);
> > > > +   if (unlikely(ret))
> > > > +   goto err;
> > > > +   /*
> > > > +* Not even root can pretend to send signals from the 
> > > > kernel.
> > > > +* Nor can they impersonate a kill()/tgkill(), which 
> > > > adds
> > > > +* source info.
> > > > +*/
> > > > +   ret = -EPERM;
> > > > +   if ((kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL) &&
> > > > +   (task_pid(current) != pid))
> > > > +   goto err;
> > > > +   } else {
> > > > +   prepare_kill_siginfo(sig, &kinfo);
> > > > +   }
> > > 
> > > I wonder whether we should also have a pidns restriction here, since
> > > currently it isn't possible for a container process using a pidns to
> > > signal processes outside its pidns. AFAICS, this isn't done through an
> > > explicit check -- it's a side-effect of processes in a pidns not being
> > > able to address non-descendant-pidns processes.
> > > 
> > > But maybe it's reasonable to allow sending a procfd to a different pidns
> > > and the same operations working on it? If we extend the procfd API to
> > 
> > No, I don't think so. I really don't want any fancy semantics in here.
> > Fancy doesn't get merged and fancy is hard to maintain. So we should do
> > something like:
> > 
> > if (proc_pid_ns() != current_pid_ns)
> > return EINVAL
> 
> This isn't quite sufficient. The key thing is that you have to be in an
> *ancestor* (or same) pidns, not the *same* pidns. Ideally you can re-use
> the check already in pidns_get_parent, and expose it. It would be
> something as trivial as:
> 
> bool pidns_is_descendant(struct pid_namespace *ns,
>  struct pid_namespace *ancestor)
> {
> for (;;) {
> if (!ns)
> return false;
> if (ns == ancestor)
> break;
> ns = ns->parent;
> }
> return true;
> }
> 
> And you can rewrite pidns_get_parent to use it. So you would instead be
> doing:
> 
> if (pidns_is_descendant(proc_pid_ns, task_active_pid_ns(current)))
> return -EPERM;

Scratch the last bit, -EPERM is wrong here. I would argue that -EINVAL
is *somewhat* wrong because arguable the more semantically consistent
error (with kill(2)) would be -ESRCH -- but then you're mixing the "pid
is dead" and "pid is not visible to you" cases. I'm not sure what the
right errno would be here (I'm sure some of the LKML greybeards will
have a better clue.) :P

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH



signature.asc
Description: PGP signature


Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-19 Thread Christian Brauner
On Tue, Nov 20, 2018 at 08:18:10AM +1100, Aleksa Sarai wrote:
> On 2018-11-19, Christian Brauner  wrote:
> > On Tue, Nov 20, 2018 at 07:28:57AM +1100, Aleksa Sarai wrote:
> > > On 2018-11-19, Christian Brauner  wrote:
> > > > +   if (info) {
> > > > +   ret = __copy_siginfo_from_user(sig, &kinfo, info);
> > > > +   if (unlikely(ret))
> > > > +   goto err;
> > > > +   /*
> > > > +* Not even root can pretend to send signals from the 
> > > > kernel.
> > > > +* Nor can they impersonate a kill()/tgkill(), which 
> > > > adds
> > > > +* source info.
> > > > +*/
> > > > +   ret = -EPERM;
> > > > +   if ((kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL) &&
> > > > +   (task_pid(current) != pid))
> > > > +   goto err;
> > > > +   } else {
> > > > +   prepare_kill_siginfo(sig, &kinfo);
> > > > +   }
> > > 
> > > I wonder whether we should also have a pidns restriction here, since
> > > currently it isn't possible for a container process using a pidns to
> > > signal processes outside its pidns. AFAICS, this isn't done through an
> > > explicit check -- it's a side-effect of processes in a pidns not being
> > > able to address non-descendant-pidns processes.
> > > 
> > > But maybe it's reasonable to allow sending a procfd to a different pidns
> > > and the same operations working on it? If we extend the procfd API to
> > 
> > No, I don't think so. I really don't want any fancy semantics in here.
> > Fancy doesn't get merged and fancy is hard to maintain. So we should do
> > something like:
> > 
> > if (proc_pid_ns() != current_pid_ns)
> > return EINVAL
> 
> This isn't quite sufficient. The key thing is that you have to be in an
> *ancestor* (or same) pidns, not the *same* pidns. Ideally you can re-use
> the check already in pidns_get_parent, and expose it. It would be
> something as trivial as:
> 
> bool pidns_is_descendant(struct pid_namespace *ns,
>  struct pid_namespace *ancestor)
> {
> for (;;) {
> if (!ns)
> return false;
> if (ns == ancestor)
> break;
> ns = ns->parent;
> }
> return true;
> }

That can be done without a loop by comparing the level counter for the
two pid namespaces.

> 
> And you can rewrite pidns_get_parent to use it. So you would instead be
> doing:
> 
> if (pidns_is_descendant(proc_pid_ns, task_active_pid_ns(current)))
> return -EPERM;
> 
> (Or you can just copy the 5-line loop into procfd_signal -- though I
> imagine we'll need this for all of the procfd_* APIs.)
> 
> -- 
> Aleksa Sarai
> Senior Software Engineer (Containers)
> SUSE Linux GmbH
> 




Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-19 Thread Christian Brauner
On Tue, Nov 20, 2018 at 08:18:10AM +1100, Aleksa Sarai wrote:
> On 2018-11-19, Christian Brauner  wrote:
> > On Tue, Nov 20, 2018 at 07:28:57AM +1100, Aleksa Sarai wrote:
> > > On 2018-11-19, Christian Brauner  wrote:
> > > > +   if (info) {
> > > > +   ret = __copy_siginfo_from_user(sig, &kinfo, info);
> > > > +   if (unlikely(ret))
> > > > +   goto err;
> > > > +   /*
> > > > +* Not even root can pretend to send signals from the 
> > > > kernel.
> > > > +* Nor can they impersonate a kill()/tgkill(), which 
> > > > adds
> > > > +* source info.
> > > > +*/
> > > > +   ret = -EPERM;
> > > > +   if ((kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL) &&
> > > > +   (task_pid(current) != pid))
> > > > +   goto err;
> > > > +   } else {
> > > > +   prepare_kill_siginfo(sig, &kinfo);
> > > > +   }
> > > 
> > > I wonder whether we should also have a pidns restriction here, since
> > > currently it isn't possible for a container process using a pidns to
> > > signal processes outside its pidns. AFAICS, this isn't done through an
> > > explicit check -- it's a side-effect of processes in a pidns not being
> > > able to address non-descendant-pidns processes.
> > > 
> > > But maybe it's reasonable to allow sending a procfd to a different pidns
> > > and the same operations working on it? If we extend the procfd API to
> > 
> > No, I don't think so. I really don't want any fancy semantics in here.
> > Fancy doesn't get merged and fancy is hard to maintain. So we should do
> > something like:
> > 
> > if (proc_pid_ns() != current_pid_ns)
> > return EINVAL
> 
> This isn't quite sufficient. The key thing is that you have to be in an
> *ancestor* (or same) pidns, not the *same* pidns. Ideally you can re-use

See my next mail.

> the check already in pidns_get_parent, and expose it. It would be
> something as trivial as:
> 
> bool pidns_is_descendant(struct pid_namespace *ns,
>  struct pid_namespace *ancestor)
> {
> for (;;) {
> if (!ns)
> return false;
> if (ns == ancestor)
> break;
> ns = ns->parent;
> }
> return true;
> }
> 
> And you can rewrite pidns_get_parent to use it. So you would instead be
> doing:
> 
> if (pidns_is_descendant(proc_pid_ns, task_active_pid_ns(current)))
> return -EPERM;
> 
> (Or you can just copy the 5-line loop into procfd_signal -- though I
> imagine we'll need this for all of the procfd_* APIs.)
> 
> -- 
> Aleksa Sarai
> Senior Software Engineer (Containers)
> SUSE Linux GmbH
> 




Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-19 Thread Aleksa Sarai
On 2018-11-19, Christian Brauner  wrote:
> On Tue, Nov 20, 2018 at 07:28:57AM +1100, Aleksa Sarai wrote:
> > On 2018-11-19, Christian Brauner  wrote:
> > > + if (info) {
> > > + ret = __copy_siginfo_from_user(sig, &kinfo, info);
> > > + if (unlikely(ret))
> > > + goto err;
> > > + /*
> > > +  * Not even root can pretend to send signals from the kernel.
> > > +  * Nor can they impersonate a kill()/tgkill(), which adds
> > > +  * source info.
> > > +  */
> > > + ret = -EPERM;
> > > + if ((kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL) &&
> > > + (task_pid(current) != pid))
> > > + goto err;
> > > + } else {
> > > + prepare_kill_siginfo(sig, &kinfo);
> > > + }
> > 
> > I wonder whether we should also have a pidns restriction here, since
> > currently it isn't possible for a container process using a pidns to
> > signal processes outside its pidns. AFAICS, this isn't done through an
> > explicit check -- it's a side-effect of processes in a pidns not being
> > able to address non-descendant-pidns processes.
> > 
> > But maybe it's reasonable to allow sending a procfd to a different pidns
> > and the same operations working on it? If we extend the procfd API to
> 
> No, I don't think so. I really don't want any fancy semantics in here.
> Fancy doesn't get merged and fancy is hard to maintain. So we should do
> something like:
> 
> if (proc_pid_ns() != current_pid_ns)
>   return EINVAL

This isn't quite sufficient. The key thing is that you have to be in an
*ancestor* (or same) pidns, not the *same* pidns. Ideally you can re-use
the check already in pidns_get_parent, and expose it. It would be
something as trivial as:

bool pidns_is_descendant(struct pid_namespace *ns,
 struct pid_namespace *ancestor)
{
for (;;) {
if (!ns)
return false;
if (ns == ancestor)
break;
ns = ns->parent;
}
return true;
}

And you can rewrite pidns_get_parent to use it. So you would instead be
doing:

if (pidns_is_descendant(proc_pid_ns, task_active_pid_ns(current)))
return -EPERM;

(Or you can just copy the 5-line loop into procfd_signal -- though I
imagine we'll need this for all of the procfd_* APIs.)

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH



signature.asc
Description: PGP signature


Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-19 Thread Christian Brauner
On Mon, Nov 19, 2018 at 09:55:18PM +0100, Christian Brauner wrote:
> On Tue, Nov 20, 2018 at 07:28:57AM +1100, Aleksa Sarai wrote:
> > On 2018-11-19, Christian Brauner  wrote:
> > > + if (info) {
> > > + ret = __copy_siginfo_from_user(sig, &kinfo, info);
> > > + if (unlikely(ret))
> > > + goto err;
> > > + /*
> > > +  * Not even root can pretend to send signals from the kernel.
> > > +  * Nor can they impersonate a kill()/tgkill(), which adds
> > > +  * source info.
> > > +  */
> > > + ret = -EPERM;
> > > + if ((kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL) &&
> > > + (task_pid(current) != pid))
> > > + goto err;
> > > + } else {
> > > + prepare_kill_siginfo(sig, &kinfo);
> > > + }
> > 
> > I wonder whether we should also have a pidns restriction here, since
> > currently it isn't possible for a container process using a pidns to
> > signal processes outside its pidns. AFAICS, this isn't done through an
> > explicit check -- it's a side-effect of processes in a pidns not being
> > able to address non-descendant-pidns processes.
> > 
> > But maybe it's reasonable to allow sending a procfd to a different pidns
> > and the same operations working on it? If we extend the procfd API to
> 
> No, I don't think so. I really don't want any fancy semantics in here.
> Fancy doesn't get merged and fancy is hard to maintain. So we should do
> something like:
> 
> if (proc_pid_ns() != current_pid_ns)
>   return EINVAL

To be more precise, we need to detect if fd refers to an ancestor pidns
and if so return EINVAL.

> 
> > allow process creation this would allow a container to create a process
> > outside its pidns.
> > 
> > -- 
> > Aleksa Sarai
> > Senior Software Engineer (Containers)
> > SUSE Linux GmbH
> > 
> 
> 


Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-19 Thread Christian Brauner
On Tue, Nov 20, 2018 at 07:28:57AM +1100, Aleksa Sarai wrote:
> On 2018-11-19, Christian Brauner  wrote:
> > +   if (info) {
> > +   ret = __copy_siginfo_from_user(sig, &kinfo, info);
> > +   if (unlikely(ret))
> > +   goto err;
> > +   /*
> > +* Not even root can pretend to send signals from the kernel.
> > +* Nor can they impersonate a kill()/tgkill(), which adds
> > +* source info.
> > +*/
> > +   ret = -EPERM;
> > +   if ((kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL) &&
> > +   (task_pid(current) != pid))
> > +   goto err;
> > +   } else {
> > +   prepare_kill_siginfo(sig, &kinfo);
> > +   }
> 
> I wonder whether we should also have a pidns restriction here, since
> currently it isn't possible for a container process using a pidns to
> signal processes outside its pidns. AFAICS, this isn't done through an
> explicit check -- it's a side-effect of processes in a pidns not being
> able to address non-descendant-pidns processes.
> 
> But maybe it's reasonable to allow sending a procfd to a different pidns
> and the same operations working on it? If we extend the procfd API to

No, I don't think so. I really don't want any fancy semantics in here.
Fancy doesn't get merged and fancy is hard to maintain. So we should do
something like:

if (proc_pid_ns() != current_pid_ns)
return EINVAL

> allow process creation this would allow a container to create a process
> outside its pidns.
> 
> -- 
> Aleksa Sarai
> Senior Software Engineer (Containers)
> SUSE Linux GmbH
> 




Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-19 Thread Aleksa Sarai
On 2018-11-19, Christian Brauner  wrote:
> + if (info) {
> + ret = __copy_siginfo_from_user(sig, &kinfo, info);
> + if (unlikely(ret))
> + goto err;
> + /*
> +  * Not even root can pretend to send signals from the kernel.
> +  * Nor can they impersonate a kill()/tgkill(), which adds
> +  * source info.
> +  */
> + ret = -EPERM;
> + if ((kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL) &&
> + (task_pid(current) != pid))
> + goto err;
> + } else {
> + prepare_kill_siginfo(sig, &kinfo);
> + }

I wonder whether we should also have a pidns restriction here, since
currently it isn't possible for a container process using a pidns to
signal processes outside its pidns. AFAICS, this isn't done through an
explicit check -- it's a side-effect of processes in a pidns not being
able to address non-descendant-pidns processes.

But maybe it's reasonable to allow sending a procfd to a different pidns
and the same operations working on it? If we extend the procfd API to
allow process creation this would allow a container to create a process
outside its pidns.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH



signature.asc
Description: PGP signature


Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-19 Thread Daniel Colascione
Yep. That's also what I was talking about, FWIW.

On Mon, Nov 19, 2018 at 11:31 AM, Christian Brauner
 wrote:
> On Mon, Nov 19, 2018 at 01:02:06PM -0600, Eric W. Biederman wrote:
>> Christian Brauner  writes:
>>
>> > On Mon, Nov 19, 2018 at 07:59:24AM -0800, Daniel Colascione wrote:
>> >> You never addressed my comment on the previous patch about your use of
>> >
>> > Sorry, that thread exploded so quickly that I might have missed it.
>> >
>> >> private_data here. Why can't you use the struct pid reference that's
>> >> already in the inode?
>> >
>> > If that's what people prefer we can probably use that. There was
>> > precedent for stashing away such data in fs/proc/base.c already for
>> > various other things including user namespaces and struct mm so I
>> > followed this model. A prior version of my patch (I didn't send out) did
>> > retrive the inode via proc_pid() in .open() took an additional reference
>> > via get_pid() and dropped it in .release(). Do we prefer that?
>>
>> If you are using proc// directories as your file descriptors, you
>> don't need to add an open or a release method at all.  The existing file
>> descriptors hold a reference to the inode which holds a reference the
>> the struct pid.
>>
>> The only time you need to get a reference is when you need a task
>> and kill_pid_info already performs that work for you.
>
> Oh, I see what you and Andy are saying now. Sweet, that means we can
> trim down the patch even more. Less code, less headache.
>
> Thanks!
>
>>
>> So using proc_pid is all you need to do to get the pid from the existing
>> file descriptors.
>>
>> Eric
>>


Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-19 Thread Christian Brauner
On Mon, Nov 19, 2018 at 01:02:06PM -0600, Eric W. Biederman wrote:
> Christian Brauner  writes:
> 
> > On Mon, Nov 19, 2018 at 07:59:24AM -0800, Daniel Colascione wrote:
> >> You never addressed my comment on the previous patch about your use of
> >
> > Sorry, that thread exploded so quickly that I might have missed it.
> >
> >> private_data here. Why can't you use the struct pid reference that's
> >> already in the inode?
> >
> > If that's what people prefer we can probably use that. There was
> > precedent for stashing away such data in fs/proc/base.c already for
> > various other things including user namespaces and struct mm so I
> > followed this model. A prior version of my patch (I didn't send out) did
> > retrive the inode via proc_pid() in .open() took an additional reference
> > via get_pid() and dropped it in .release(). Do we prefer that?
> 
> If you are using proc// directories as your file descriptors, you
> don't need to add an open or a release method at all.  The existing file
> descriptors hold a reference to the inode which holds a reference the
> the struct pid.
> 
> The only time you need to get a reference is when you need a task
> and kill_pid_info already performs that work for you.

Oh, I see what you and Andy are saying now. Sweet, that means we can
trim down the patch even more. Less code, less headache.

Thanks!

> 
> So using proc_pid is all you need to do to get the pid from the existing
> file descriptors.
> 
> Eric
> 


Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-19 Thread Eric W. Biederman
Christian Brauner  writes:

> On Mon, Nov 19, 2018 at 07:59:24AM -0800, Daniel Colascione wrote:
>> You never addressed my comment on the previous patch about your use of
>
> Sorry, that thread exploded so quickly that I might have missed it.
>
>> private_data here. Why can't you use the struct pid reference that's
>> already in the inode?
>
> If that's what people prefer we can probably use that. There was
> precedent for stashing away such data in fs/proc/base.c already for
> various other things including user namespaces and struct mm so I
> followed this model. A prior version of my patch (I didn't send out) did
> retrive the inode via proc_pid() in .open() took an additional reference
> via get_pid() and dropped it in .release(). Do we prefer that?

If you are using proc// directories as your file descriptors, you
don't need to add an open or a release method at all.  The existing file
descriptors hold a reference to the inode which holds a reference the
the struct pid.

The only time you need to get a reference is when you need a task
and kill_pid_info already performs that work for you.

So using proc_pid is all you need to do to get the pid from the existing
file descriptors.

Eric



Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-19 Thread Christian Brauner
On Mon, Nov 19, 2018 at 07:45:04AM -0800, Andy Lutomirski wrote:
> On Mon, Nov 19, 2018 at 2:33 AM Christian Brauner  
> wrote:
> >
> > The kill() syscall operates on process identifiers. After a process has
> > exited its pid can be reused by another process. If a caller sends a signal
> > to a reused pid it will end up signaling the wrong process. This issue has
> > often surfaced and there has been a push [1] to address this problem.
> >
> > A prior patch has introduced the ability to get a file descriptor
> > referencing struct pid by opening /proc/. This guarantees a stable
> > handle on a process which can be used to send signals to the referenced
> > process. Discussion has shown that a dedicated syscall is preferable over
> > ioctl()s. Thus, the  new syscall procfd_signal() is introduced to solve
> > this problem. It operates on a process file descriptor.
> > The syscall takes an additional siginfo_t and flags argument. If siginfo_t
> > is NULL then procfd_signal() behaves like kill() if it is not NULL it
> > behaves like rt_sigqueueinfo.
> > The flags argument is added to allow for future extensions of this syscall.
> > It currently needs to be passed as 0.
> 
> A few questions.  First: you've made this work on /proc/PID, but
> should it also work on /proc/PID/task/TID to send signals to a
> specific thread?

Yeah, so I thought about that. Your point being to combine: kill(),
tgkill() aka rt_sigqueueinfo() and rt_tg_sigqueueinfo(). If I understand
this correctly the implication is to also get file descriptors to
/proc/PID/task/TID and pass them to procfd_signal()? Can we hold of on
that one? Adding this in the future should be easily doable by simply
getting /proc/PID/task/TID file descriptors but I would like this
patchset to be as small as possible.

> 
> > +bool proc_is_procfd(const struct file *file)
> > +{
> > +   return d_is_dir(file->f_path.dentry) &&
> > +  (file->f_op == &proc_tgid_base_operations);
> > +}
> 
> Maybe rename to proc_is_tgid_procfd() to leave room for proc_is_tid_procfd()?

Yes, good idea!

> 
> > +   if (info) {
> > +   ret = __copy_siginfo_from_user(sig, &kinfo, info);
> > +   if (unlikely(ret))
> > +   goto err;
> > +   /*
> > +* Not even root can pretend to send signals from the 
> > kernel.
> > +* Nor can they impersonate a kill()/tgkill(), which adds
> > +* source info.
> > +*/
> > +   ret = -EPERM;
> > +   if ((kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL) &&
> > +   (task_pid(current) != pid))
> > +   goto err;
> 
> Is the exception for signaling yourself actually useful here?

I tried to strictly follow the sigqueue-based permission checks. I'm not
comfortable removing this check without signal-experts telling me that
it is safe to do.


Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-19 Thread Christian Brauner
On Mon, Nov 19, 2018 at 07:59:24AM -0800, Daniel Colascione wrote:
> On Mon, Nov 19, 2018 at 2:32 AM, Christian Brauner  
> wrote:
> > The kill() syscall operates on process identifiers. After a process has
> > exited its pid can be reused by another process. If a caller sends a signal
> > to a reused pid it will end up signaling the wrong process. This issue has
> > often surfaced and there has been a push [1] to address this problem.
> >
> > A prior patch has introduced the ability to get a file descriptor
> > referencing struct pid by opening /proc/. This guarantees a stable
> > handle on a process which can be used to send signals to the referenced
> > process. Discussion has shown that a dedicated syscall is preferable over
> > ioctl()s. Thus, the  new syscall procfd_signal() is introduced to solve
> > this problem. It operates on a process file descriptor.
> > The syscall takes an additional siginfo_t and flags argument. If siginfo_t
> > is NULL then procfd_signal() behaves like kill() if it is not NULL it
> > behaves like rt_sigqueueinfo.
> > The flags argument is added to allow for future extensions of this syscall.
> > It currently needs to be passed as 0.
> >
> > With this patch a process can be killed via:
> >
> >  #define _GNU_SOURCE
> >  #include 
> >  #include 
> >  #include 
> >  #include 
> >  #include 
> >  #include 
> >  #include 
> >  #include 
> >  #include 
> >  #include 
> >
> >  int main(int argc, char *argv[])
> >  {
> >  int ret;
> >  char buf[1000];
> >
> >  if (argc < 2)
> >  exit(EXIT_FAILURE);
> >
> >  ret = snprintf(buf, sizeof(buf), "/proc/%s", argv[1]);
> >  if (ret < 0)
> >  exit(EXIT_FAILURE);
> >
> >  int fd = open(buf, O_DIRECTORY | O_CLOEXEC);
> >  if (fd < 0) {
> >  printf("%s - Failed to open \"%s\"\n", strerror(errno), 
> > buf);
> >  exit(EXIT_FAILURE);
> >  }
> >
> >  ret = syscall(__NR_procfd_signal, fd, SIGKILL, NULL, 0);
> >  if (ret < 0) {
> >  printf("Failed to send SIGKILL \"%s\"\n", strerror(errno));
> >  close(fd);
> >  exit(EXIT_FAILURE);
> >  }
> >
> >  close(fd);
> >
> >  exit(EXIT_SUCCESS);
> >  }
> >
> > [1]: https://lkml.org/lkml/2018/11/18/130
> >
> > Cc: "Eric W. Biederman" 
> > Cc: Serge Hallyn 
> > Cc: Jann Horn 
> > Cc: Kees Cook 
> > Cc: Andy Lutomirsky 
> > Cc: Andrew Morton 
> > Cc: Oleg Nesterov 
> > Cc: Aleksa Sarai 
> > Cc: Al Viro 
> > Signed-off-by: Christian Brauner 
> > ---
> > Changelog:
> > v1:
> > - patch introduced
> > ---
> >  arch/x86/entry/syscalls/syscall_32.tbl |  1 +
> >  arch/x86/entry/syscalls/syscall_64.tbl |  1 +
> >  fs/proc/base.c |  6 ++
> >  include/linux/proc_fs.h|  1 +
> >  include/linux/syscalls.h   |  2 +
> >  kernel/signal.c| 76 --
> >  6 files changed, 81 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
> > b/arch/x86/entry/syscalls/syscall_32.tbl
> > index 3cf7b533b3d1..e637eab883e9 100644
> > --- a/arch/x86/entry/syscalls/syscall_32.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> > @@ -398,3 +398,4 @@
> >  384i386arch_prctl  sys_arch_prctl  
> > __ia32_compat_sys_arch_prctl
> >  385i386io_pgetevents   sys_io_pgetevents   
> > __ia32_compat_sys_io_pgetevents
> >  386i386rseqsys_rseq
> > __ia32_sys_rseq
> > +387i386procfd_signal   sys_procfd_signal   
> > __ia32_sys_procfd_signal
> > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
> > b/arch/x86/entry/syscalls/syscall_64.tbl
> > index f0b1709a5ffb..e95f6741ab42 100644
> > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > @@ -343,6 +343,7 @@
> >  332common  statx   __x64_sys_statx
> >  333common  io_pgetevents   __x64_sys_io_pgetevents
> >  334common  rseq__x64_sys_rseq
> > +335common  procfd_signal   __x64_sys_procfd_signal
> >
> >  #
> >  # x32-specific system call numbers start at 512 to avoid cache impact
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index 6365a4fea314..a12c9de92bd0 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -3055,6 +3055,12 @@ static const struct file_operations 
> > proc_tgid_base_operations = {
> > .release= proc_tgid_release,
> >  };
> >
> > +bool proc_is_procfd(const struct file *file)
> > +{
> > +   return d_is_dir(file->f_path.dentry) &&
> > +  (file->f_op == &proc_tgid_base_operations);
> > +}
> > +
> >  static struct dentry *proc_tgid_base_lookup(struct inode *dir, struct 
> > dentry *dentry, unsigned int flags)
> >  {
> > return proc_pident_lookup(d

Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-19 Thread Christian Brauner
On Mon, Nov 19, 2018 at 06:10:53PM +0100, Eugene Syromiatnikov wrote:
> On Mon, Nov 19, 2018 at 11:32:39AM +0100, Christian Brauner wrote:
> > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
> > b/arch/x86/entry/syscalls/syscall_32.tbl
> > index 3cf7b533b3d1..e637eab883e9 100644
> > --- a/arch/x86/entry/syscalls/syscall_32.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> > @@ -398,3 +398,4 @@
> >  384i386arch_prctl  sys_arch_prctl  
> > __ia32_compat_sys_arch_prctl
> >  385i386io_pgetevents   sys_io_pgetevents   
> > __ia32_compat_sys_io_pgetevents
> >  386i386rseqsys_rseq
> > __ia32_sys_rseq
> > +387i386procfd_signal   sys_procfd_signal   
> > __ia32_sys_procfd_signal
> > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
> > b/arch/x86/entry/syscalls/syscall_64.tbl
> > index f0b1709a5ffb..e95f6741ab42 100644
> > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > @@ -343,6 +343,7 @@
> >  332common  statx   __x64_sys_statx
> >  333common  io_pgetevents   __x64_sys_io_pgetevents
> >  334common  rseq__x64_sys_rseq
> > +335common  procfd_signal   __x64_sys_procfd_signal
> 
> You have wired up the syscall on x86 but have not added the syscall number
> to the generic syscall header (include/uapi/asm-generic/unistd.h), why?

Oversight on my part, sorry.


Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-19 Thread Eugene Syromiatnikov
On Mon, Nov 19, 2018 at 11:32:39AM +0100, Christian Brauner wrote:
> +/**
> + *  sys_procfd_signal - send a signal to a process through a process file
> + *  descriptor
> + *  @fd: the file descriptor of the process
> + *  @sig: signal to be sent
> + *  @info: the signal info
> + *  @flags: future flags to be passed
> + */
> +SYSCALL_DEFINE4(procfd_signal, int, fd, int, sig, siginfo_t __user *, info,
> + int, flags)

It could be considered better to use an unsigned type for the "flags"
argument.


Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-19 Thread Eugene Syromiatnikov
On Mon, Nov 19, 2018 at 11:32:39AM +0100, Christian Brauner wrote:
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
> b/arch/x86/entry/syscalls/syscall_32.tbl
> index 3cf7b533b3d1..e637eab883e9 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -398,3 +398,4 @@
>  384  i386arch_prctl  sys_arch_prctl  
> __ia32_compat_sys_arch_prctl
>  385  i386io_pgetevents   sys_io_pgetevents   
> __ia32_compat_sys_io_pgetevents
>  386  i386rseqsys_rseq
> __ia32_sys_rseq
> +387  i386procfd_signal   sys_procfd_signal   
> __ia32_sys_procfd_signal
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
> b/arch/x86/entry/syscalls/syscall_64.tbl
> index f0b1709a5ffb..e95f6741ab42 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -343,6 +343,7 @@
>  332  common  statx   __x64_sys_statx
>  333  common  io_pgetevents   __x64_sys_io_pgetevents
>  334  common  rseq__x64_sys_rseq
> +335  common  procfd_signal   __x64_sys_procfd_signal

You have wired up the syscall on x86 but have not added the syscall number
to the generic syscall header (include/uapi/asm-generic/unistd.h), why?


Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-19 Thread Daniel Colascione
On Mon, Nov 19, 2018 at 2:32 AM, Christian Brauner  wrote:
> The kill() syscall operates on process identifiers. After a process has
> exited its pid can be reused by another process. If a caller sends a signal
> to a reused pid it will end up signaling the wrong process. This issue has
> often surfaced and there has been a push [1] to address this problem.
>
> A prior patch has introduced the ability to get a file descriptor
> referencing struct pid by opening /proc/. This guarantees a stable
> handle on a process which can be used to send signals to the referenced
> process. Discussion has shown that a dedicated syscall is preferable over
> ioctl()s. Thus, the  new syscall procfd_signal() is introduced to solve
> this problem. It operates on a process file descriptor.
> The syscall takes an additional siginfo_t and flags argument. If siginfo_t
> is NULL then procfd_signal() behaves like kill() if it is not NULL it
> behaves like rt_sigqueueinfo.
> The flags argument is added to allow for future extensions of this syscall.
> It currently needs to be passed as 0.
>
> With this patch a process can be killed via:
>
>  #define _GNU_SOURCE
>  #include 
>  #include 
>  #include 
>  #include 
>  #include 
>  #include 
>  #include 
>  #include 
>  #include 
>  #include 
>
>  int main(int argc, char *argv[])
>  {
>  int ret;
>  char buf[1000];
>
>  if (argc < 2)
>  exit(EXIT_FAILURE);
>
>  ret = snprintf(buf, sizeof(buf), "/proc/%s", argv[1]);
>  if (ret < 0)
>  exit(EXIT_FAILURE);
>
>  int fd = open(buf, O_DIRECTORY | O_CLOEXEC);
>  if (fd < 0) {
>  printf("%s - Failed to open \"%s\"\n", strerror(errno), buf);
>  exit(EXIT_FAILURE);
>  }
>
>  ret = syscall(__NR_procfd_signal, fd, SIGKILL, NULL, 0);
>  if (ret < 0) {
>  printf("Failed to send SIGKILL \"%s\"\n", strerror(errno));
>  close(fd);
>  exit(EXIT_FAILURE);
>  }
>
>  close(fd);
>
>  exit(EXIT_SUCCESS);
>  }
>
> [1]: https://lkml.org/lkml/2018/11/18/130
>
> Cc: "Eric W. Biederman" 
> Cc: Serge Hallyn 
> Cc: Jann Horn 
> Cc: Kees Cook 
> Cc: Andy Lutomirsky 
> Cc: Andrew Morton 
> Cc: Oleg Nesterov 
> Cc: Aleksa Sarai 
> Cc: Al Viro 
> Signed-off-by: Christian Brauner 
> ---
> Changelog:
> v1:
> - patch introduced
> ---
>  arch/x86/entry/syscalls/syscall_32.tbl |  1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |  1 +
>  fs/proc/base.c |  6 ++
>  include/linux/proc_fs.h|  1 +
>  include/linux/syscalls.h   |  2 +
>  kernel/signal.c| 76 --
>  6 files changed, 81 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
> b/arch/x86/entry/syscalls/syscall_32.tbl
> index 3cf7b533b3d1..e637eab883e9 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -398,3 +398,4 @@
>  384i386arch_prctl  sys_arch_prctl  
> __ia32_compat_sys_arch_prctl
>  385i386io_pgetevents   sys_io_pgetevents   
> __ia32_compat_sys_io_pgetevents
>  386i386rseqsys_rseq
> __ia32_sys_rseq
> +387i386procfd_signal   sys_procfd_signal   
> __ia32_sys_procfd_signal
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
> b/arch/x86/entry/syscalls/syscall_64.tbl
> index f0b1709a5ffb..e95f6741ab42 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -343,6 +343,7 @@
>  332common  statx   __x64_sys_statx
>  333common  io_pgetevents   __x64_sys_io_pgetevents
>  334common  rseq__x64_sys_rseq
> +335common  procfd_signal   __x64_sys_procfd_signal
>
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 6365a4fea314..a12c9de92bd0 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -3055,6 +3055,12 @@ static const struct file_operations 
> proc_tgid_base_operations = {
> .release= proc_tgid_release,
>  };
>
> +bool proc_is_procfd(const struct file *file)
> +{
> +   return d_is_dir(file->f_path.dentry) &&
> +  (file->f_op == &proc_tgid_base_operations);
> +}
> +
>  static struct dentry *proc_tgid_base_lookup(struct inode *dir, struct dentry 
> *dentry, unsigned int flags)
>  {
> return proc_pident_lookup(dir, dentry,
> diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
> index d0e1f1522a78..2d53a47fba34 100644
> --- a/include/linux/proc_fs.h
> +++ b/include/linux/proc_fs.h
> @@ -73,6 +73,7 @@ struct proc_dir_entry *proc_create_net_single_write(const 
> char *name, umode_t mo
> 

Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-19 Thread Daniel Colascione
On Mon, Nov 19, 2018 at 7:45 AM, Andy Lutomirski  wrote:
> On Mon, Nov 19, 2018 at 2:33 AM Christian Brauner  
> wrote:
>>
>> The kill() syscall operates on process identifiers. After a process has
>> exited its pid can be reused by another process. If a caller sends a signal
>> to a reused pid it will end up signaling the wrong process. This issue has
>> often surfaced and there has been a push [1] to address this problem.
>>
>> A prior patch has introduced the ability to get a file descriptor
>> referencing struct pid by opening /proc/. This guarantees a stable
>> handle on a process which can be used to send signals to the referenced
>> process. Discussion has shown that a dedicated syscall is preferable over
>> ioctl()s. Thus, the  new syscall procfd_signal() is introduced to solve
>> this problem. It operates on a process file descriptor.
>> The syscall takes an additional siginfo_t and flags argument. If siginfo_t
>> is NULL then procfd_signal() behaves like kill() if it is not NULL it
>> behaves like rt_sigqueueinfo.
>> The flags argument is added to allow for future extensions of this syscall.
>> It currently needs to be passed as 0.
>
> A few questions.  First: you've made this work on /proc/PID, but
> should it also work on /proc/PID/task/TID to send signals to a
> specific thread?

+1

>> +   if (info) {
>> +   ret = __copy_siginfo_from_user(sig, &kinfo, info);
>> +   if (unlikely(ret))
>> +   goto err;
>> +   /*
>> +* Not even root can pretend to send signals from the kernel.
>> +* Nor can they impersonate a kill()/tgkill(), which adds
>> +* source info.
>> +*/
>> +   ret = -EPERM;
>> +   if ((kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL) &&
>> +   (task_pid(current) != pid))
>> +   goto err;
>
> Is the exception for signaling yourself actually useful here?

All the signal functions exempt the current process from access
checks. Whether that's useful or not (and I think it is), being
inconsistent here would be wrong.


Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-19 Thread Andy Lutomirski
On Mon, Nov 19, 2018 at 2:33 AM Christian Brauner  wrote:
>
> The kill() syscall operates on process identifiers. After a process has
> exited its pid can be reused by another process. If a caller sends a signal
> to a reused pid it will end up signaling the wrong process. This issue has
> often surfaced and there has been a push [1] to address this problem.
>
> A prior patch has introduced the ability to get a file descriptor
> referencing struct pid by opening /proc/. This guarantees a stable
> handle on a process which can be used to send signals to the referenced
> process. Discussion has shown that a dedicated syscall is preferable over
> ioctl()s. Thus, the  new syscall procfd_signal() is introduced to solve
> this problem. It operates on a process file descriptor.
> The syscall takes an additional siginfo_t and flags argument. If siginfo_t
> is NULL then procfd_signal() behaves like kill() if it is not NULL it
> behaves like rt_sigqueueinfo.
> The flags argument is added to allow for future extensions of this syscall.
> It currently needs to be passed as 0.

A few questions.  First: you've made this work on /proc/PID, but
should it also work on /proc/PID/task/TID to send signals to a
specific thread?

> +bool proc_is_procfd(const struct file *file)
> +{
> +   return d_is_dir(file->f_path.dentry) &&
> +  (file->f_op == &proc_tgid_base_operations);
> +}

Maybe rename to proc_is_tgid_procfd() to leave room for proc_is_tid_procfd()?

> +   if (info) {
> +   ret = __copy_siginfo_from_user(sig, &kinfo, info);
> +   if (unlikely(ret))
> +   goto err;
> +   /*
> +* Not even root can pretend to send signals from the kernel.
> +* Nor can they impersonate a kill()/tgkill(), which adds
> +* source info.
> +*/
> +   ret = -EPERM;
> +   if ((kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL) &&
> +   (task_pid(current) != pid))
> +   goto err;

Is the exception for signaling yourself actually useful here?


[PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-19 Thread Christian Brauner
The kill() syscall operates on process identifiers. After a process has
exited its pid can be reused by another process. If a caller sends a signal
to a reused pid it will end up signaling the wrong process. This issue has
often surfaced and there has been a push [1] to address this problem.

A prior patch has introduced the ability to get a file descriptor
referencing struct pid by opening /proc/. This guarantees a stable
handle on a process which can be used to send signals to the referenced
process. Discussion has shown that a dedicated syscall is preferable over
ioctl()s. Thus, the  new syscall procfd_signal() is introduced to solve
this problem. It operates on a process file descriptor.
The syscall takes an additional siginfo_t and flags argument. If siginfo_t
is NULL then procfd_signal() behaves like kill() if it is not NULL it
behaves like rt_sigqueueinfo.
The flags argument is added to allow for future extensions of this syscall.
It currently needs to be passed as 0.

With this patch a process can be killed via:

 #define _GNU_SOURCE
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 

 int main(int argc, char *argv[])
 {
 int ret;
 char buf[1000];

 if (argc < 2)
 exit(EXIT_FAILURE);

 ret = snprintf(buf, sizeof(buf), "/proc/%s", argv[1]);
 if (ret < 0)
 exit(EXIT_FAILURE);

 int fd = open(buf, O_DIRECTORY | O_CLOEXEC);
 if (fd < 0) {
 printf("%s - Failed to open \"%s\"\n", strerror(errno), buf);
 exit(EXIT_FAILURE);
 }

 ret = syscall(__NR_procfd_signal, fd, SIGKILL, NULL, 0);
 if (ret < 0) {
 printf("Failed to send SIGKILL \"%s\"\n", strerror(errno));
 close(fd);
 exit(EXIT_FAILURE);
 }

 close(fd);

 exit(EXIT_SUCCESS);
 }

[1]: https://lkml.org/lkml/2018/11/18/130

Cc: "Eric W. Biederman" 
Cc: Serge Hallyn 
Cc: Jann Horn 
Cc: Kees Cook 
Cc: Andy Lutomirsky 
Cc: Andrew Morton 
Cc: Oleg Nesterov 
Cc: Aleksa Sarai 
Cc: Al Viro 
Signed-off-by: Christian Brauner 
---
Changelog:
v1:
- patch introduced
---
 arch/x86/entry/syscalls/syscall_32.tbl |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl |  1 +
 fs/proc/base.c |  6 ++
 include/linux/proc_fs.h|  1 +
 include/linux/syscalls.h   |  2 +
 kernel/signal.c| 76 --
 6 files changed, 81 insertions(+), 6 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
b/arch/x86/entry/syscalls/syscall_32.tbl
index 3cf7b533b3d1..e637eab883e9 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -398,3 +398,4 @@
 384i386arch_prctl  sys_arch_prctl  
__ia32_compat_sys_arch_prctl
 385i386io_pgetevents   sys_io_pgetevents   
__ia32_compat_sys_io_pgetevents
 386i386rseqsys_rseq
__ia32_sys_rseq
+387i386procfd_signal   sys_procfd_signal   
__ia32_sys_procfd_signal
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
b/arch/x86/entry/syscalls/syscall_64.tbl
index f0b1709a5ffb..e95f6741ab42 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -343,6 +343,7 @@
 332common  statx   __x64_sys_statx
 333common  io_pgetevents   __x64_sys_io_pgetevents
 334common  rseq__x64_sys_rseq
+335common  procfd_signal   __x64_sys_procfd_signal
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 6365a4fea314..a12c9de92bd0 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3055,6 +3055,12 @@ static const struct file_operations 
proc_tgid_base_operations = {
.release= proc_tgid_release,
 };
 
+bool proc_is_procfd(const struct file *file)
+{
+   return d_is_dir(file->f_path.dentry) &&
+  (file->f_op == &proc_tgid_base_operations);
+}
+
 static struct dentry *proc_tgid_base_lookup(struct inode *dir, struct dentry 
*dentry, unsigned int flags)
 {
return proc_pident_lookup(dir, dentry,
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index d0e1f1522a78..2d53a47fba34 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -73,6 +73,7 @@ struct proc_dir_entry *proc_create_net_single_write(const 
char *name, umode_t mo
int (*show)(struct seq_file 
*, void *),
proc_write_t write,
void *data);
+extern bool proc_is_procfd(const struct file *file);
 
 #else /* CONFIG_PROC_FS */
 
diff --git a/include/linux/syscalls.h b/include/linux/sy