Re: siginfo pid not populated from ptrace?

2018-12-06 Thread Eric W. Biederman
Kees Cook  writes:

> On Thu, Dec 6, 2018 at 1:11 PM Eric W. Biederman  
> wrote:
>>
>> Tycho Andersen  writes:
>>
>> > On Thu, Dec 06, 2018 at 10:48:39AM -0800, Linus Torvalds wrote:
>> >> On Thu, Dec 6, 2018 at 6:40 AM Eric W. Biederman  
>> >> wrote:
>> >> >
>> >> > We have in the past had ptrace users that weren't just about debugging
>> >> > so I don't know that it is fair to just dismiss it as debugging
>> >> > infrastructure.
>> >>
>> >> Absolutely.
>> >>
>> >> Some uses are more than just debug. People occasionally use ptrace
>> >> because it's the only way to do what they want, so you'll find people
>> >> who do it for sandboxing, for example. It's not necessarily designed
>> >> for that, or particularly fast or well-suited for it, but I've
>> >> definitely seen it used that way.
>> >>
>> >> So I don't think the behavioral test breakage like this is necessarily
>> >> a huge deal, and until some "real use" actually shows that it cares it
>> >> might be something we dismiss as "just test", but it very much has the
>> >> potential to hit real uses.
>> >>
>> >> The fact that a behavioral test broke is definitely interesting.
>> >>
>> >> And maybe some of the siginfo allocations could depend on whether the
>> >> signal is actually ever caught or not.
>> >>
>> >> For example, a terminal signal (or one that is ignored) might not need
>> >> siginfo. But if the process is ptraced, maybe that terminal signal
>> >> isn't actually terminal? So we might have situations where we want to
>> >> simply check "is the signal target being ptraced"..
>> >
>> > Yes, something like this, I suppose? It works for me.
>>
>> The challenge is that we could be delivering this to a zombie signal
>> group leader.  At which point we won't deliver it to the target task.
>>
>> Sigh it is probably time that I dig in and figure out how to avoid that
>> case which we need to fix anyway because we can get the permission
>> checks wrong for multi-threaded processes that call setuid and friends.
>>
>> Once that is sorted your small change will at least be safe.
>>
>> Eric
>>
>> > From 3bcaadd56ebb532ab4d481556fcc0826d65efc43 Mon Sep 17 00:00:00 2001
>> > From: Tycho Andersen 
>> > Date: Thu, 6 Dec 2018 12:15:22 -0700
>> > Subject: [PATCH] signal: allocate siginfo when a traced task gets SIGSTOP
>> >
>> > Tracers can view SIGSTOP:
>> >
>> > https://lore.kernel.org/lkml/87zhtthkuy@xmission.com/T/#u
>> >
>> > so let's allocate a siginfo for SIGSTOP when a task is traced.
>> >
>> > Signed-off-by: Tycho Andersen 
>> > ---
>> >  kernel/signal.c | 9 ++---
>> >  1 file changed, 6 insertions(+), 3 deletions(-)
>> >
>> > diff --git a/kernel/signal.c b/kernel/signal.c
>> > index 9a32bc2088c9..ab4ba00119f4 100644
>> > --- a/kernel/signal.c
>> > +++ b/kernel/signal.c
>> > @@ -1056,11 +1056,14 @@ static int __send_signal(int sig, struct 
>> > kernel_siginfo *info, struct task_struc
>> >   goto ret;
>> >
>> >   result = TRACE_SIGNAL_DELIVERED;
>> > +
>> >   /*
>> > -  * Skip useless siginfo allocation for SIGKILL SIGSTOP,
>> > -  * and kernel threads.
>> > +  * Skip useless siginfo allocation for SIGKILL and kernel threads.
>> > +  * SIGSTOP is visible to tracers, so only skip allocation when the 
>> > task
>> > +  * is not traced.
>> >*/
>> > - if (sig_kernel_only(sig) || (t->flags & PF_KTHREAD))
>> > + if ((sig == SIGKILL) || (!task_is_traced(t) && sig == SIGSTOP) ||
>> > + (t->flags & PF_KTHREAD))
>> >   goto out_set;
>> >
>> >   /*
>
> What should we do for v4.20? I need to have the selftests actually
> passing. :)

For v4.20 we need to do one of two things.
1) Present a plausible case that someone will could care about,
   we document it in the commit we can perform my earlier partial revert.

2) Remove the sanity check seccomp_bpf.c

I really just want to ensure we have clear reasoning here.

Eric



Re: [PATCH v4] signal: add taskfd_send_signal() syscall

2018-12-06 Thread Eric W. Biederman
Daniel Colascione  writes:

> On Thu, Dec 6, 2018 at 12:29 PM Eric W. Biederman  
> wrote:
>> Christian Brauner  writes:
>>
>> > [1]: You cannot replicate certain aspects of kill *yet*. We have
>> > established this before. If we want process group support later we do
>> > have the flags argument to extend the sycall.
>>
>> Then you have horrible contradiction in the API.
>>
>> Either the grouping is a property of your file descriptor or the
>> grouping comes from the flags argument.
>>
>> If the grouping is specified in the flags argument then pidfd is the
>> proper name for your file descriptors, and the appropriate prefix
>> for your system call.
>
> Yes and no. "taskfd" is fine, since even if we do add a
> kill-process-group capability, the general facility under discussion
> is still *about* tasks in general, so "taskfd" still tells you in a
> general sense what the thing does. "pidfd" would be wrong, and for the
> same reason that the kernel's "struct pid" is badly-named: the object
> being named is a *task*, and signaling a particular task instead of
> whatever task happens to be labeled with a particular numeric PID at
> the time of all is the whole point of this change.

No.  The object being named by struct pid is an identifier.

The same identifier can label processes, the same identifier can
label threads, the same identifier can label process groups the
same identifier can label sessions.

That is fundamental to how that identifier works.

Now the new file descriptor can either be a handle to the struct pid
the general purpose identifier, or the file descriptor can be a handle
to a process.  Other file descriptor types would be handles to different
kinds of processes.


When people tell me the new interface can handle process groups or
sessions by just adding a flag.  I hear they want to have an handle
to struct pid then general purpose identifier.  Because it doesn't make
any sense at all to add a flag to when you need to add a new file
descriptor type to keep the API consistent.

Later when I hear people say that they want an identifier for a process
and not have a handle to a general purpose identifier I think they need
to think about differnt types of file descriptors for the different
purposes.  The flags argument does not help at all there.

>> If the grouping is a property of your file descriptor it does not make
>> sense to talk about using the flags argument later.
>>
>> Your intention is to add the thread case to support pthreads once the
>> process case is sorted out.  So this is something that needs to be made
>> clear.  Did I miss how you plan to handle threads?
>>
>> And this fundamentally and definitely gets into all of my concerns about
>> proper handling of pid_task and PIDTYPE_TGID etc.
>
> To the extent that it's possible, this system call should mimic the
> behavior of a signal-send to a positive numeric PID (i.e., a specific
> task), so if we change one, we should change both.

There is a special case in kill(2) where we treat an identifier that
only identifies a thread as an identifier of a process.  We should
not copy that.

Making kill(2) only accept the pid of thread group leaders would make
all kinds of sense.  Unfortunately we need to stay bug compatible with
past versions of linux.  So kill(2) won't be changing unless it appears
clear that not a single application cares about that case.

I will eventually be changing kill(2) to not sending signals through
zombie group leaders so the permission checks are performed correctly.

All I am asking is that we don't copy that bug compatibility code into a
new system call, and that use use the current internal APIs in a fashion
that shows you are not replicating misfeatures of old system calls.

I know it takes understanding a little more kernel code but for people
designing a signal sending API it is not unreasonable to ask that they
understand the current APIs and the current code.  The fact that this
discussion has revealed some significant misconceptions of how the
current code works, and what I am asking for concerns me.

Eric


Re: [PATCH v4] signal: add taskfd_send_signal() syscall

2018-12-06 Thread Eric W. Biederman
Christian Brauner  writes:

>> Your intention is to add the thread case to support pthreads once the
>> process case is sorted out.  So this is something that needs to be made
>> clear.  Did I miss how you plan to handle threads?
>
> Yeah, maybe you missed it in the commit message [2] which is based on a
> discussion with Andy [3] and Arnd [4]:

Looking at your references I haven't missed it.  You are not deciding
anything as of yet to keep it simple.  Except you are returning
EOPNOTSUPP.  You are very much intending to do something.

Decide.  Do you use the flags parameter or is the width of the
target depending on the flags.

That is fundamental to how the system call and it's extensions work.
That is fundamental to my review.

Until that is decided.
Nacked-by: "Eric W. Biederman" 

There are a lot of fundamental maintenance issues and you can very easily
get them wrong if you are not clear on the job of the file descriptor
and the job of the flags argument.

I want don't want new crap that we have to abandon that has a nasty set
of bugs because no one wanted to think through the system call all of
the way and understand the corner cases.

Eric


Re: siginfo pid not populated from ptrace?

2018-12-06 Thread Eric W. Biederman
Tycho Andersen  writes:

> On Thu, Dec 06, 2018 at 10:48:39AM -0800, Linus Torvalds wrote:
>> On Thu, Dec 6, 2018 at 6:40 AM Eric W. Biederman  
>> wrote:
>> >
>> > We have in the past had ptrace users that weren't just about debugging
>> > so I don't know that it is fair to just dismiss it as debugging
>> > infrastructure.
>> 
>> Absolutely.
>> 
>> Some uses are more than just debug. People occasionally use ptrace
>> because it's the only way to do what they want, so you'll find people
>> who do it for sandboxing, for example. It's not necessarily designed
>> for that, or particularly fast or well-suited for it, but I've
>> definitely seen it used that way.
>> 
>> So I don't think the behavioral test breakage like this is necessarily
>> a huge deal, and until some "real use" actually shows that it cares it
>> might be something we dismiss as "just test", but it very much has the
>> potential to hit real uses.
>> 
>> The fact that a behavioral test broke is definitely interesting.
>> 
>> And maybe some of the siginfo allocations could depend on whether the
>> signal is actually ever caught or not.
>> 
>> For example, a terminal signal (or one that is ignored) might not need
>> siginfo. But if the process is ptraced, maybe that terminal signal
>> isn't actually terminal? So we might have situations where we want to
>> simply check "is the signal target being ptraced"..
>
> Yes, something like this, I suppose? It works for me.

The challenge is that we could be delivering this to a zombie signal
group leader.  At which point we won't deliver it to the target task.

Sigh it is probably time that I dig in and figure out how to avoid that
case which we need to fix anyway because we can get the permission
checks wrong for multi-threaded processes that call setuid and friends.

Once that is sorted your small change will at least be safe.

Eric

> From 3bcaadd56ebb532ab4d481556fcc0826d65efc43 Mon Sep 17 00:00:00 2001
> From: Tycho Andersen 
> Date: Thu, 6 Dec 2018 12:15:22 -0700
> Subject: [PATCH] signal: allocate siginfo when a traced task gets SIGSTOP
>
> Tracers can view SIGSTOP:
>
> https://lore.kernel.org/lkml/87zhtthkuy@xmission.com/T/#u
>
> so let's allocate a siginfo for SIGSTOP when a task is traced.
>
> Signed-off-by: Tycho Andersen 
> ---
>  kernel/signal.c | 9 ++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 9a32bc2088c9..ab4ba00119f4 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -1056,11 +1056,14 @@ static int __send_signal(int sig, struct 
> kernel_siginfo *info, struct task_struc
>   goto ret;
>  
>   result = TRACE_SIGNAL_DELIVERED;
> +
>   /*
> -  * Skip useless siginfo allocation for SIGKILL SIGSTOP,
> -  * and kernel threads.
> +  * Skip useless siginfo allocation for SIGKILL and kernel threads.
> +  * SIGSTOP is visible to tracers, so only skip allocation when the task
> +  * is not traced.
>*/
> - if (sig_kernel_only(sig) || (t->flags & PF_KTHREAD))
> + if ((sig == SIGKILL) || (!task_is_traced(t) && sig == SIGSTOP) ||
> + (t->flags & PF_KTHREAD))
>   goto out_set;
>  
>   /*


Re: [PATCH v4] signal: add taskfd_send_signal() syscall

2018-12-06 Thread Eric W. Biederman
Christian Brauner  writes:

> On Thu, Dec 06, 2018 at 01:17:24PM -0600, Eric W. Biederman wrote:
>> Christian Brauner  writes:
>> 
>> > On December 7, 2018 4:01:19 AM GMT+13:00, ebied...@xmission.com wrote:
>> >>Christian Brauner  writes:
>> >>
>> >>> The kill() syscall operates on process identifiers (pid). After a
>> >>process
>> >>> has exited its pid can be reused by another process. If a caller
>> >>sends a
>> >>> signal to a reused pid it will end up signaling the wrong process.
>> >>This
>> >>> issue has often surfaced and there has been a push [1] to address
>> >>this
>> >>> problem.
>> >>>
>> >>> This patch uses file descriptors (fd) from proc/ as stable
>> >>handles on
>> >>> struct pid. Even if a pid is recycled the handle will not change. The
>> >>fd
>> >>> can be used to send signals to the process it refers to.
>> >>> Thus, the new syscall taskfd_send_signal() is introduced to solve
>> >>this
>> >>> problem. Instead of pids it operates on process fds (taskfd).
>> >>
>> >>I am not yet thrilled with the taskfd naming.
>> >
>> > Userspace cares about what does this thing operate on?
>> > It operates on processes and threads.
>> > The most common term people use is "task".
>> > I literally "polled" ten non-kernel people for that purpose and asked:
>> > "What term would you use to refer to a process and a thread?"
>> > Turns out it is task. So if find this pretty apt.
>> > Additionally, the proc manpage uses task in the exact same way (also see 
>> > the commit message).
>> > If you can get behind that name even if feeling it's not optimal it would 
>> > be great.
>> 
>> Once I understand why threads and not process groups.  I don't see that
>> logic yet.
>
> The point is: userspace takes "task" to be a generic term for processes
> and tasks. Which is what is important. The term also covers process
> groups for all that its worth. Most of userspace isn't even aware of
> that distinction necessarily.
>
> fd_send_signal() makes the syscall name meaningless: what is userspace
> signaling too? The point being that there's a lot more that you require
> userspace to infer from fd_send_signal() than from task_send_signal()
> where most people get the right idea right away: "signals to a process
> or thread".
>
>> 
>> >>Is there any plan to support sesssions and process groups?
>> >
>> > I don't see the necessity.
>> > As I said in previous mails:
>> > we can emulate all interesting signal syscalls with this one.
>> 
>> I don't know what you mean by all of the interesting signal system
>> calls.  I do know you can not replicate kill(2).
>
> [1]: You cannot replicate certain aspects of kill *yet*. We have
> established this before. If we want process group support later we do
> have the flags argument to extend the sycall.

Then you have horrible contradiction in the API.

Either the grouping is a property of your file descriptor or the
grouping comes from the flags argument.

If the grouping is specified in the flags argument then pidfd is the
proper name for your file descriptors, and the appropriate prefix
for your system call.

If the grouping is a property of your file descriptor it does not make
sense to talk about using the flags argument later.

Your intention is to add the thread case to support pthreads once the
process case is sorted out.  So this is something that needs to be made
clear.  Did I miss how you plan to handle threads?

And this fundamentally and definitely gets into all of my concerns about
proper handling of pid_task and PIDTYPE_TGID etc.

>> Sending signals to a process group the "kill(-pgrp)" case with kill
>> sends the signals to an atomic snapshot of processes.  If the signal
>> is SIGKILL then it is guaranteed that then entire process group is
>> killed with no survivors.
>
> See [1].
>
>> 
>> > We succeeded in doing that.
>> 
>> I am not certain you have.
>
> See [1].
>
>> 
>> > No need to get more fancy.
>> > There's currently no obvious need for more features.
>> > Features should be implemented when someone actually needs them.
>> 
>> That is fair.  I don't understand what you are doing with sending
>> signals to a thread.  That seems like one of the least useful
>> corner cases of sending signals.
>
> It's what glibc and Florian care about for pthreads and their our
> biggest user atm so they get some I'd argue they get some say in this. :)

Fair enough.  If glibc could use this then we certainly have users and a
user case.

Eric


Re: [PATCH v4] signal: add taskfd_send_signal() syscall

2018-12-06 Thread Eric W. Biederman
Christian Brauner  writes:

> On December 7, 2018 4:01:19 AM GMT+13:00, ebied...@xmission.com wrote:
>>Christian Brauner  writes:
>>
>>> The kill() syscall operates on process identifiers (pid). After a
>>process
>>> has exited its pid can be reused by another process. If a caller
>>sends a
>>> signal to a reused pid it will end up signaling the wrong process.
>>This
>>> issue has often surfaced and there has been a push [1] to address
>>this
>>> problem.
>>>
>>> This patch uses file descriptors (fd) from proc/ as stable
>>handles on
>>> struct pid. Even if a pid is recycled the handle will not change. The
>>fd
>>> can be used to send signals to the process it refers to.
>>> Thus, the new syscall taskfd_send_signal() is introduced to solve
>>this
>>> problem. Instead of pids it operates on process fds (taskfd).
>>
>>I am not yet thrilled with the taskfd naming.
>
> Userspace cares about what does this thing operate on?
> It operates on processes and threads.
> The most common term people use is "task".
> I literally "polled" ten non-kernel people for that purpose and asked:
> "What term would you use to refer to a process and a thread?"
> Turns out it is task. So if find this pretty apt.
> Additionally, the proc manpage uses task in the exact same way (also see the 
> commit message).
> If you can get behind that name even if feeling it's not optimal it would be 
> great.

Once I understand why threads and not process groups.  I don't see that
logic yet.

>>Is there any plan to support sesssions and process groups?
>
> I don't see the necessity.
> As I said in previous mails:
> we can emulate all interesting signal syscalls with this one.

I don't know what you mean by all of the interesting signal system
calls.  I do know you can not replicate kill(2).

Sending signals to a process group the "kill(-pgrp)" case with kill
sends the signals to an atomic snapshot of processes.  If the signal
is SIGKILL then it is guaranteed that then entire process group is
killed with no survivors.

> We succeeded in doing that.

I am not certain you have.

> No need to get more fancy.
> There's currently no obvious need for more features.
> Features should be implemented when someone actually needs them.

That is fair.  I don't understand what you are doing with sending
signals to a thread.  That seems like one of the least useful
corner cases of sending signals.

Eric


Re: [PATCH v4] signal: add taskfd_send_signal() syscall

2018-12-06 Thread Eric W. Biederman
Daniel Colascione  writes:

> On Thu, Dec 6, 2018 at 7:02 AM Eric W. Biederman  
> wrote:
>>
>> Christian Brauner  writes:
>>
>> > The kill() syscall operates on process identifiers (pid). After a process
>> > has exited its pid can be reused by another process. If a caller sends a
>> > signal to a reused pid it will end up signaling the wrong process. This
>> > issue has often surfaced and there has been a push [1] to address this
>> > problem.
>> >
>> > This patch uses file descriptors (fd) from proc/ as stable handles on
>> > struct pid. Even if a pid is recycled the handle will not change. The fd
>> > can be used to send signals to the process it refers to.
>> > Thus, the new syscall taskfd_send_signal() is introduced to solve this
>> > problem. Instead of pids it operates on process fds (taskfd).
>>
>> I am not yet thrilled with the taskfd naming.
>
> Both the old and new names were fine. Do you want to suggest a name at
> this point? You can't just say "I don't like this. Guess again"
> forever.

Both names suck, as neither name actually describes what the function is
designed to do.

Most issues happen at the interface between abstractions.  A name that
confuses your users will just make that confusion more likely.  So it is
important that we do the very best with the name that we can do.

We are already having questions about what happens when you perform the
non-sense operation of sending a signal to a zombie.  It comes up
because there are races when a process may die and you are not expecting
it.  That is an issue with the existing signal sending API, that has
caused confusion.   That isn't half as confusing as the naming issue.

A task in linux is a single thread.  A process is all of the threads.
If we are going to support both cases it doesn't make sense to hard code
a single case in the name.

I would be inclined to simplify things and call the syscall something
like "fdkill(int fd, struct siginfo *info, int flags)".  Or perhaps
just "fd_send_signal(int fd, struct siginfo *info, int flags)".

Then we are not overspecifying what the system call does in the name.
Plus it makes it clear that the fd specifies where the signal goes.
Something I see that by your reply below that you were confused about.

>> Is there any plan to support sesssions and process groups?
>
> Such a thing could be added with flags in the future. Why complicate
> this patch?

Actually that isn't the way this is designed.  You would have to have
another kind of file descriptor.  I am asking because it goes to the
question of naming and what we are trying to do here.

We don't need to implement that but we have already looked into this
kind of extensibility.  If we want the extensibility we should make
room for it, or just close the door.  Having the door half open and a
confusing interface is a problem for users.

>> I am concerned about using kill_pid_info.  It does this:
>>
>>
>> rcu_read_lock();
>> p = pid_task(pid, PIDTYPE_PID);
>> if (p)
>> error = group_send_sig_info(sig, info, p, PIDTYPE_TGID);
>> rcu_read_unlock();
>>
>> That pid_task(PIDTYPE_PID) is fine for existing callers that need bug
>> compatibility.  For new interfaces I would strongly prefer 
>> pid_task(PIDTYPE_TGID).
>
> What is the bug that PIDTYPE_PID preserves?

I am not 100% certain all of the bits for this to matter have been
merged yet but we are close enough that it would not be hard to make it
matter.

There are two strange behaviours of ordinary kill on the linux kernel
that I am aware of.

1) kill(thread_id,...) where the thread_id is not the id of the first
   thread and the thread_id thus the pid of the process sends the signal
   to the entire process.  Something that arguably should not happen.

2) kill(pid,...) where the original thread has exited performs the
   permission checks against the dead signal group leader.  Which means
   that the permission checks for sending a signal are very likely wrong
   for a multi-threaded processes that calls a function like setuid.

To fix the second case we are going to have to perform the permission
checks on a non-zombie thread.  That is something that is straight
forward to make work with PIDTYPE_TGID.  It is not so easy to make work
with PIDTYPE_PID.

I looked and it doesn't look like I have merged the logic of having
PIDTYPE_TGID point to a living thread when the signal group leader
exits and becomes a zombie.  It isn't hard but it does require some very
delicate surgery on the code, so that we don't break some of the
historic confusion of threads and process groups.

Eric


Re: [PATCH v4] signal: add taskfd_send_signal() syscall

2018-12-06 Thread Eric W. Biederman
uld be noted that the addition of
> __copy_siginfo_from_user_any() is caused by a bug in the original
> implementation of rt_sigqueueinfo(2) (cf. 12).
> With upcoming rework for syscall handling things might improve
> significantly (cf. [11]) and __copy_siginfo_from_user_any() will not gain
> any additional callers.
>
> /* testing */
> This patch was tested on x64 and x86.
>
> /* userspace usage */
> An asciinema recording for the basic functionality can be found under [9].
> With this patch a process can be killed via:
>
>  #define _GNU_SOURCE
>  #include 
>  #include 
>  #include 
>  #include 
>  #include 
>  #include 
>  #include 
>  #include 
>  #include 
>  #include 
>
>  static inline int do_taskfd_send_signal(int taskfd, int sig, siginfo_t *info,
>  unsigned int flags)
>  {
>  #ifdef __NR_taskfd_send_signal
>  return syscall(__NR_taskfd_send_signal, taskfd, sig, info, flags);
>  #else
>  return -ENOSYS;
>  #endif
>  }
>
>  int main(int argc, char *argv[])
>  {
>  int fd, ret, saved_errno, sig;
>
>  if (argc < 3)
>  exit(EXIT_FAILURE);
>
>  fd = open(argv[1], O_DIRECTORY | O_CLOEXEC);
>  if (fd < 0) {
>  printf("%s - Failed to open \"%s\"\n", strerror(errno), 
> argv[1]);
>  exit(EXIT_FAILURE);
>  }
>
>  sig = atoi(argv[2]);
>
>  printf("Sending signal %d to process %s\n", sig, argv[1]);
>  ret = do_taskfd_send_signal(fd, sig, NULL, 0);
>
>  saved_errno = errno;
>  close(fd);
>  errno = saved_errno;
>
>  if (ret < 0) {
>  printf("%s - Failed to send signal %d to process %s\n",
> strerror(errno), sig, argv[1]);
>  exit(EXIT_FAILURE);
>  }
>
>  exit(EXIT_SUCCESS);
>  }
>
> [1]:  https://lore.kernel.org/lkml/20181029221037.87724-1-dan...@google.com/
> [2]:  https://lore.kernel.org/lkml/874lbtjvtd@oldenburg2.str.redhat.com/
> [3]:  https://lore.kernel.org/lkml/20181204132604.aspfupwjgjx6f...@brauner.io/
> [4]:  https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru...@brauner.io/
> [5]:  https://lore.kernel.org/lkml/20181121213946.ga10...@mail.hallyn.com/
> [6]:  https://lore.kernel.org/lkml/20181120103111.etlqp7zop34v6...@brauner.io/
> [7]:  
> https://lore.kernel.org/lkml/36323361-90bd-41af-ab5b-ee0d7ba02...@amacapital.net/
> [8]:  https://lore.kernel.org/lkml/87tvjxp8pc@xmission.com/
> [9]:  https://asciinema.org/a/IQjuCHew6bnq1cr78yuMv16cy
> [10]: https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru...@brauner.io/
> [11]: 
> https://lore.kernel.org/lkml/f53d6d38-3521-4c20-9034-5af447df6...@amacapital.net/
> [12]: https://lore.kernel.org/lkml/87zhtjn8ck@xmission.com/
>
> Cc: Arnd Bergmann 
> Cc: "Eric W. Biederman" 
> Cc: Serge Hallyn 
> Cc: Jann Horn 
> Cc: Andy Lutomirsky 
> Cc: Andrew Morton 
> Cc: Oleg Nesterov 
> Cc: Aleksa Sarai 
> Cc: Al Viro 
> Cc: Florian Weimer 
> Signed-off-by: Christian Brauner 
> Reviewed-by: Kees Cook 
> ---
> Changelog:
> v4:
> - updated asciinema to use "taskfd_" prefix
> - s/procfd_send_signal/taskfd_send_signal/g
> - s/proc_is_tgid_procfd/tgid_taskfd_to_pid/b
> - s/proc_is_tid_procfd/tid_taskfd_to_pid/b
> - s/__copy_siginfo_from_user_generic/__copy_siginfo_from_user_any/g
> - make it clear that __copy_siginfo_from_user_any() is a workaround caused
>   by a bug in the original implementation of rt_sigqueueinfo()
> - when spoofing signals turn them into regular kill signals if si_code is
>   set to SI_USER
> - make proc_is_t{g}id_procfd() return struct pid to allow proc_pid() to
>   stay private to fs/proc/
> v3:
> - add __copy_siginfo_from_user_generic() to avoid adding compat syscalls
> - s/procfd_signal/procfd_send_signal/g
> - change type of flags argument from int to unsigned int
> - add comment about what happens to zombies
> - add proc_is_tid_procfd()
> - return EOPNOTSUPP when /proc//task/ fd is passed so userspace
>   has a way of knowing that tidfds are not supported currently.
> v2:
> - define __NR_procfd_signal in unistd.h
> - wire up compat syscall
> - s/proc_is_procfd/proc_is_tgid_procfd/g
> - provide stubs when CONFIG_PROC_FS=n
> - move proc_pid() to linux/proc_fs.h header
> - use proc_pid() to grab struct pid from /proc/ fd
> v1:
> - patch introduced
> ---
>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  fs/proc/base.c |  20 +++-
>  inc

Re: siginfo pid not populated from ptrace?

2018-12-06 Thread Eric W. Biederman
Kees Cook  writes:

> On Sat, Dec 1, 2018 at 7:04 AM Eric W. Biederman  
> wrote:
>>
>> Kees Cook  writes:
>>
>> > On Tue, Nov 27, 2018 at 8:44 PM Eric W. Biederman  
>> > wrote:
>> >>
>> >> Kees Cook  writes:
>> >>
>> >> > On Tue, Nov 27, 2018 at 4:38 PM, Kees Cook  
>> >> > wrote:
>> >> >> On Tue, Nov 27, 2018 at 3:21 PM, Tycho Andersen  wrote:
>> >> >>> On Mon, Nov 12, 2018 at 12:24:43PM -0700, Tycho Andersen wrote:
>> >> >>>> On Mon, Nov 12, 2018 at 11:55:38AM -0700, Tycho Andersen wrote:
>> >> >>>> > I haven't manage to reproduce it on stock v4.20-rc2, unfortunately.
>> >> >>>>
>> >> >>>> Ok, now I have,
>> >> >>>>
>> >> >>>> seccomp_bpf.c:2736:global.syscall_restart:Expected getpid() (1493) 
>> >> >>>> == info._sifields._kill.si_pid (0)
>> >> >>>> global.syscall_restart: Test failed at step #22
>> >> >>>
>> >> >>> Seems like this is still happening on v4.20-rc4,
>> >> >>>
>> >> >>> [ RUN  ] global.syscall_restart
>> >> >>> seccomp_bpf.c:2736:global.syscall_restart:Expected getpid() (1901) == 
>> >> >>> info._sifields._kill.si_pid (0)
>> >> >>> global.syscall_restart: Test failed at step #22
>> >> >>
>> >> >> This fails every time for me -- is it still racey for you?
>> >> >>
>> >> >> I'm attempting a bisect, hoping it doesn't _become_ racey for me. ;)
>> >> >
>> >> > This bisects to here for me:
>> >> >
>> >> > commit f149b31557446aff9ca96d4be7e39cc266f6e7cc
>> >> > Author: Eric W. Biederman 
>> >> > Date:   Mon Sep 3 09:50:36 2018 +0200
>> >> >
>> >> > signal: Never allocate siginfo for SIGKILL or SIGSTOP
>> >> >
>> >> > The SIGKILL and SIGSTOP signals are never delivered to userspace so
>> >> > queued siginfo for these signals can never be observed.  Therefore
>> >> > remove the chance of failure by never even attempting to allocate
>> >> > siginfo in those cases.
>> >> >
>> >> > Reviewed-by: Thomas Gleixner 
>> >> > Signed-off-by: "Eric W. Biederman" 
>> >> >
>> >> > They are certainly visible via seccomp ;)
>> >>
>> >> Well SIGSTOP is visible via PTRACE_GETSIGINFO.
>> >>
>> >> I see what is happening now.  Since we don't have queued siginfo
>> >> we generate some as:
>> >> /*
>> >>  * Ok, it wasn't in the queue.  This must be
>> >>  * a fast-pathed signal or we must have been
>> >>  * out of queue space.  So zero out the info.
>> >>  */
>> >> clear_siginfo(info);
>> >> info->si_signo = sig;
>> >> info->si_errno = 0;
>> >> info->si_code = SI_USER;
>> >> info->si_pid = 0;
>> >> info->si_uid = 0;
>> >>
>> >> Which allows last_signfo to be set,
>> >> so despite not really having any siginfo PTRACE_GET_SIGINFO
>> >> has something to return so does not return -EINVAL.
>> >>
>> >> Reconstructing my context that was part of removing SEND_SIG_FORCED
>> >> so this looks like it will take a little more than a revert to fix
>> >> this.
>> >>
>> >> This is definitely a change that is visible to user space.  The logic in
>> >> my patch was definitely wrong with respect to SIGSTOP and
>> >> PTRACE_GETSIGINFO.  Is there something in userspace that actually cares?
>> >> AKA is the idiom that the test seccomp_bpf.c is using something that
>> >> non-test code does?
>> >
>> > I think this would be needed by any ptracer that handled multiple
>> > threads. It needs to figure out which pid stopped. I think it's worth
>> > fixing, yes.
>> >
>> >> The change below should restore the old behavior.  I am just wondering
>> >> if this is something we want to do.  siginfo is allocated with
>> >> GFP_ATOMIC so if your machine is under memory pressur

Re: [PATCH v4] signal: add taskfd_send_signal() syscall

2018-12-06 Thread Eric W. Biederman
Florian Weimer  writes:

> * Eric W. Biederman:
>
>> Floriam are you seeing a problem with this behavior or the way Christian
>> was describing it?
>
> My hope is that you could use taskfd_send_signal one day to send a
> signal to a process which you *known* (based on how you've written your
> application) should be running and not in a zombie state, and get back
> an error if it has exited.
>
> If you get this error, only then you wait on the process, using the file
> descriptor you have, and run some recovery code.
>
> Wouldn't that be a reasonable approach once we've got task descriptors?

Getting an error back if the target was a zombie does seem reasonable,
as in principle it is an easy thing to notice, and post zombie once the
process has been reaped we definitely get an error back.

I also agree that it sounds like an extension, as changing the default
would violate the princile of least surprise.

Eric


Re: [PATCH v4] signal: add taskfd_send_signal() syscall

2018-12-06 Thread Eric W. Biederman
Florian Weimer  writes:

> * Jürg Billeter:
>
>> On Thu, 2018-12-06 at 13:30 +0100, Florian Weimer wrote:
>>> * Christian Brauner:
>>> 
>>> > /* zombies */
>>> > Zombies can be signaled just as any other process. No special error will 
>>> > be
>>> > reported since a zombie state is an unreliable state (cf. [3]).
>>> 
>>> I still disagree with this analysis.  If I know that the target process
>>> is still alive, and it is not, this is a persistent error condition
>>> which can be reliably reported.  Given that someone might send SIGKILL
>>> to the process behind my back, detecting this error condition could be
>>> useful.
>>
>> As I understand it, kill() behaves the same way. I think it's good that
>> this new syscall keeps the behavior as close as possible to kill().
>
> No, kill does not behave in this way because the PID can be reused.
> The error condition is not stable there.

I am not quite certain what is being discussed here.

Posix says:
[ESRCH]
No process or process group can be found corresponding to that 
specified by pid.

The linux man page says:
   ESRCH  The process or process group does not exist.  Note that an
  existing process might be a zombie, a process that has terminated
  execution, but has not yet been wait(2)ed for.

What happens with this new system call is exactly the linux behavior.
Success is returned until the specified process or thread group has
been waited for.

The only difference from the behavior of kill is that because the
association between the file descriptor and the pid is not affected by
pid reuse once ESRCH is returned ESRCH will always be returned.

Floriam are you seeing a problem with this behavior or the way Christian
was describing it?

Eric


Re: [PATCH v3] signal: add procfd_send_signal() syscall

2018-12-05 Thread Eric W. Biederman
ECTORY | O_CLOEXEC);
>  if (fd < 0) {
>  printf("%s - Failed to open \"%s\"\n", strerror(errno), 
> argv[1]);
>  exit(EXIT_FAILURE);
>  }
>
>  sig = atoi(argv[2]);
>
>  printf("Sending signal %d to process %s\n", sig, argv[1]);
>  ret = do_procfd_send_signal(fd, sig, NULL, 0);
>
>  saved_errno = errno;
>  close(fd);
>  errno = saved_errno;
>
>  if (ret < 0) {
>  printf("%s - Failed to send signal %d to process %s\n",
> strerror(errno), sig, argv[1]);
>  exit(EXIT_FAILURE);
>  }
>
>  exit(EXIT_SUCCESS);
>  }
>
> [1]:  https://lkml.org/lkml/2018/11/18/130
> [2]:  https://lore.kernel.org/lkml/874lbtjvtd@oldenburg2.str.redhat.com/
> [3]:  https://lore.kernel.org/lkml/20181204132604.aspfupwjgjx6f...@brauner.io/
> [4]:  https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru...@brauner.io/
> [5]:  https://lore.kernel.org/lkml/20181121213946.ga10...@mail.hallyn.com/
> [6]:  https://lore.kernel.org/lkml/20181120103111.etlqp7zop34v6...@brauner.io/
> [7]:  
> https://lore.kernel.org/lkml/36323361-90bd-41af-ab5b-ee0d7ba02...@amacapital.net/
> [8]:  https://lore.kernel.org/lkml/87tvjxp8pc@xmission.com/
> [9]:  https://asciinema.org/a/X1J8eGhe3vCfBE2b9TXtTaSJ7
> [10]: https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru...@brauner.io/
> [11]: 
> https://lore.kernel.org/lkml/f53d6d38-3521-4c20-9034-5af447df6...@amacapital.net/
>
> Cc: Arnd Bergmann 
> Cc: "Eric W. Biederman" 
> Cc: Kees Cook 
> Cc: Serge Hallyn 
> Cc: Jann Horn 
> Cc: Andy Lutomirsky 
> Cc: Andrew Morton 
> Cc: Oleg Nesterov 
> Cc: Aleksa Sarai 
> Cc: Al Viro 
> Cc: Florian Weimer 
> Signed-off-by: Christian Brauner 
> ---
> Changelog:
> v3:
> - add __copy_siginfo_from_user_generic() to avoid adding compat syscalls
> - s/procfd_signal/procfd_send_signal/g
> - change type of flags argument from int to unsigned int
> - add comment about what happens to zombies
> - add proc_is_tid_procfd()
> - return EOPNOTSUPP when /proc//task/ fd is passed so userspace
>   has a way of knowing that tidfds are not supported currently.
> v2:
> - define __NR_procfd_signal in unistd.h
> - wire up compat syscall
> - s/proc_is_procfd/proc_is_tgid_procfd/g
> - provide stubs when CONFIG_PROC_FS=n
> - move proc_pid() to linux/proc_fs.h header
> - use proc_pid() to grab struct pid from /proc/ fd
> v1:
> - patch introduced
> ---
>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  fs/proc/base.c |  17 +++-
>  fs/proc/internal.h |   5 -
>  include/linux/proc_fs.h|  18 
>  include/linux/syscalls.h   |   3 +
>  include/uapi/asm-generic/unistd.h  |   4 +-
>  kernel/signal.c| 127 +++--
>  8 files changed, 163 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
> b/arch/x86/entry/syscalls/syscall_32.tbl
> index 3cf7b533b3d1..9ab0477d6dc3 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -398,3 +398,4 @@
>  384  i386arch_prctl  sys_arch_prctl  
> __ia32_compat_sys_arch_prctl
>  385  i386io_pgetevents   sys_io_pgetevents   
> __ia32_compat_sys_io_pgetevents
>  386  i386rseqsys_rseq
> __ia32_sys_rseq
> +387  i386procfd_send_signal  sys_procfd_send_signal  
> __ia32_sys_procfd_send_signal
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
> b/arch/x86/entry/syscalls/syscall_64.tbl
> index f0b1709a5ffb..6d22b1130ed0 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -343,6 +343,7 @@
>  332  common  statx   __x64_sys_statx
>  333  common  io_pgetevents   __x64_sys_io_pgetevents
>  334  common  rseq__x64_sys_rseq
> +335  common  procfd_send_signal  __x64_sys_procfd_send_signal
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index ce3465479447..906647d51f6d 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -716,7 +716,10 @@ static int proc_pid_permission(struct inode *inode, int 
> mask)
>   return generic_permission(inode, mask);
>  }
>  
> -
> +struct pid *proc_pid(const struct inode *inode)
> +{
> + return PROC_I(inode)->

Re: [PATCH v3] signal: add procfd_send_signal() syscall

2018-12-05 Thread Eric W. Biederman
ECTORY | O_CLOEXEC);
>  if (fd < 0) {
>  printf("%s - Failed to open \"%s\"\n", strerror(errno), 
> argv[1]);
>  exit(EXIT_FAILURE);
>  }
>
>  sig = atoi(argv[2]);
>
>  printf("Sending signal %d to process %s\n", sig, argv[1]);
>  ret = do_procfd_send_signal(fd, sig, NULL, 0);
>
>  saved_errno = errno;
>  close(fd);
>  errno = saved_errno;
>
>  if (ret < 0) {
>  printf("%s - Failed to send signal %d to process %s\n",
> strerror(errno), sig, argv[1]);
>  exit(EXIT_FAILURE);
>  }
>
>  exit(EXIT_SUCCESS);
>  }
>
> [1]:  https://lkml.org/lkml/2018/11/18/130
> [2]:  https://lore.kernel.org/lkml/874lbtjvtd@oldenburg2.str.redhat.com/
> [3]:  https://lore.kernel.org/lkml/20181204132604.aspfupwjgjx6f...@brauner.io/
> [4]:  https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru...@brauner.io/
> [5]:  https://lore.kernel.org/lkml/20181121213946.ga10...@mail.hallyn.com/
> [6]:  https://lore.kernel.org/lkml/20181120103111.etlqp7zop34v6...@brauner.io/
> [7]:  
> https://lore.kernel.org/lkml/36323361-90bd-41af-ab5b-ee0d7ba02...@amacapital.net/
> [8]:  https://lore.kernel.org/lkml/87tvjxp8pc@xmission.com/
> [9]:  https://asciinema.org/a/X1J8eGhe3vCfBE2b9TXtTaSJ7
> [10]: https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru...@brauner.io/
> [11]: 
> https://lore.kernel.org/lkml/f53d6d38-3521-4c20-9034-5af447df6...@amacapital.net/
>
> Cc: Arnd Bergmann 
> Cc: "Eric W. Biederman" 
> Cc: Kees Cook 
> Cc: Serge Hallyn 
> Cc: Jann Horn 
> Cc: Andy Lutomirsky 
> Cc: Andrew Morton 
> Cc: Oleg Nesterov 
> Cc: Aleksa Sarai 
> Cc: Al Viro 
> Cc: Florian Weimer 
> Signed-off-by: Christian Brauner 
> ---
> Changelog:
> v3:
> - add __copy_siginfo_from_user_generic() to avoid adding compat syscalls
> - s/procfd_signal/procfd_send_signal/g
> - change type of flags argument from int to unsigned int
> - add comment about what happens to zombies
> - add proc_is_tid_procfd()
> - return EOPNOTSUPP when /proc//task/ fd is passed so userspace
>   has a way of knowing that tidfds are not supported currently.
> v2:
> - define __NR_procfd_signal in unistd.h
> - wire up compat syscall
> - s/proc_is_procfd/proc_is_tgid_procfd/g
> - provide stubs when CONFIG_PROC_FS=n
> - move proc_pid() to linux/proc_fs.h header
> - use proc_pid() to grab struct pid from /proc/ fd
> v1:
> - patch introduced
> ---
>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  fs/proc/base.c |  17 +++-
>  fs/proc/internal.h |   5 -
>  include/linux/proc_fs.h|  18 
>  include/linux/syscalls.h   |   3 +
>  include/uapi/asm-generic/unistd.h  |   4 +-
>  kernel/signal.c| 127 +++--
>  8 files changed, 163 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
> b/arch/x86/entry/syscalls/syscall_32.tbl
> index 3cf7b533b3d1..9ab0477d6dc3 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -398,3 +398,4 @@
>  384  i386arch_prctl  sys_arch_prctl  
> __ia32_compat_sys_arch_prctl
>  385  i386io_pgetevents   sys_io_pgetevents   
> __ia32_compat_sys_io_pgetevents
>  386  i386rseqsys_rseq
> __ia32_sys_rseq
> +387  i386procfd_send_signal  sys_procfd_send_signal  
> __ia32_sys_procfd_send_signal
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
> b/arch/x86/entry/syscalls/syscall_64.tbl
> index f0b1709a5ffb..6d22b1130ed0 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -343,6 +343,7 @@
>  332  common  statx   __x64_sys_statx
>  333  common  io_pgetevents   __x64_sys_io_pgetevents
>  334  common  rseq__x64_sys_rseq
> +335  common  procfd_send_signal  __x64_sys_procfd_send_signal
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index ce3465479447..906647d51f6d 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -716,7 +716,10 @@ static int proc_pid_permission(struct inode *inode, int 
> mask)
>   return generic_permission(inode, mask);
>  }
>  
> -
> +struct pid *proc_pid(const struct inode *inode)
> +{
> + return PROC_I(inode)->

Re: [PATCH v2] signal: add procfd_signal() syscall

2018-12-01 Thread Eric W. Biederman
Andy Lutomirski  writes:

>> On Dec 1, 2018, at 7:28 AM, Eric W. Biederman  wrote:
>> 
>> 
>> It just occurs to me that the simple way to implement
>> procfd_sigqueueinfo info is like:
>> 
>> int copy_siginfo_from_user_any(kernel_siginfo_t *info, siginfo_t *uinfo)
>> {
>> #ifdef CONFIG_COMPAT
>>if (in_compat_syscall)
>>return copy_siginfo_from_user32(info, uinfo);
>> #endif
>>return copy_siginfo_from_user(info, uinfo);
>> }
>> 
>> long procfd_sigqueueinfo(int fd, siginfo_t *uinfo)
>> {
>>kernel_siginfo info;
>> 
>>if (copy_siginfo_from_user_any(, uinfo))
>>return -EFAULT;
>>...;
>> }
>> 
>> It looks like there is already a place in ptrace.c that already
>> hand rolls copy_siginfo_from_user_any.
>> 
>> So while I would love to figure out the subset of siginfo_t tha we can
>> just pass through, as I think that would make a better more forward
>> compatible copy_siginfo_from_user32.
>
> Seems reasonable to me. It’s less code overall than any other suggestion, too.
>
>>  I think for this use case we just
>> add the in_compat_syscall test and then we just need to ensure this new
>> system call is placed in the proper places in the syscall table.
>> 
>> Because we will need 3 call sights: x86_64, x32 and ia32.  As the layout
>> changes between those three subarchitecuters.
>> 
>> 
>
> If it’s done this way, it can just be “common” in the 64-bit
> table. And we kick the can a bit farther down the road :)
>
> I’m working on patches to clean up x86’s syscall mess. It’s slow
> because I keep finding new messes.  So far I have rt_sigreturn working
> like every other syscall — whee.
>
> Also, Eric, for your edification, I have a draft patch set to
> radically simplify x86’s signal delivery and return.  Once that’s
> done, I can trivially speed up delivery by a ton by using sysret.

Nice.

Do we care about the performance of synchronous signal delivery (AKA
hardware exceptions) vs ordinary signal delivery.  I get the feeling
there are serious simplifications to be had in that case.

Eric




Re: [PATCH v2] signal: add procfd_signal() syscall

2018-12-01 Thread Eric W. Biederman


It just occurs to me that the simple way to implement
procfd_sigqueueinfo info is like:

int copy_siginfo_from_user_any(kernel_siginfo_t *info, siginfo_t *uinfo)
{
#ifdef CONFIG_COMPAT
if (in_compat_syscall)
return copy_siginfo_from_user32(info, uinfo);
#endif
return copy_siginfo_from_user(info, uinfo);
}

long procfd_sigqueueinfo(int fd, siginfo_t *uinfo)
{
kernel_siginfo info;

if (copy_siginfo_from_user_any(, uinfo))
return -EFAULT;
...;
}

It looks like there is already a place in ptrace.c that already
hand rolls copy_siginfo_from_user_any.

So while I would love to figure out the subset of siginfo_t tha we can
just pass through, as I think that would make a better more forward
compatible copy_siginfo_from_user32.  I think for this use case we just
add the in_compat_syscall test and then we just need to ensure this new
system call is placed in the proper places in the syscall table.

Because we will need 3 call sights: x86_64, x32 and ia32.  As the layout
changes between those three subarchitecuters.

Eric



Re: siginfo pid not populated from ptrace?

2018-12-01 Thread Eric W. Biederman
Kees Cook  writes:

> On Tue, Nov 27, 2018 at 8:44 PM Eric W. Biederman  
> wrote:
>>
>> Kees Cook  writes:
>>
>> > On Tue, Nov 27, 2018 at 4:38 PM, Kees Cook  wrote:
>> >> On Tue, Nov 27, 2018 at 3:21 PM, Tycho Andersen  wrote:
>> >>> On Mon, Nov 12, 2018 at 12:24:43PM -0700, Tycho Andersen wrote:
>> >>>> On Mon, Nov 12, 2018 at 11:55:38AM -0700, Tycho Andersen wrote:
>> >>>> > I haven't manage to reproduce it on stock v4.20-rc2, unfortunately.
>> >>>>
>> >>>> Ok, now I have,
>> >>>>
>> >>>> seccomp_bpf.c:2736:global.syscall_restart:Expected getpid() (1493) == 
>> >>>> info._sifields._kill.si_pid (0)
>> >>>> global.syscall_restart: Test failed at step #22
>> >>>
>> >>> Seems like this is still happening on v4.20-rc4,
>> >>>
>> >>> [ RUN  ] global.syscall_restart
>> >>> seccomp_bpf.c:2736:global.syscall_restart:Expected getpid() (1901) == 
>> >>> info._sifields._kill.si_pid (0)
>> >>> global.syscall_restart: Test failed at step #22
>> >>
>> >> This fails every time for me -- is it still racey for you?
>> >>
>> >> I'm attempting a bisect, hoping it doesn't _become_ racey for me. ;)
>> >
>> > This bisect to here for me:
>> >
>> > commit f149b31557446aff9ca96d4be7e39cc266f6e7cc
>> > Author: Eric W. Biederman 
>> > Date:   Mon Sep 3 09:50:36 2018 +0200
>> >
>> > signal: Never allocate siginfo for SIGKILL or SIGSTOP
>> >
>> > The SIGKILL and SIGSTOP signals are never delivered to userspace so
>> > queued siginfo for these signals can never be observed.  Therefore
>> > remove the chance of failure by never even attempting to allocate
>> > siginfo in those cases.
>> >
>> > Reviewed-by: Thomas Gleixner 
>> > Signed-off-by: "Eric W. Biederman" 
>> >
>> > They are certainly visible via seccomp ;)
>>
>> Well SIGSTOP is visible via PTRACE_GETSIGINFO.
>>
>> I see what is happening now.  Since we don't have queued siginfo
>> we generate some as:
>> /*
>>  * Ok, it wasn't in the queue.  This must be
>>  * a fast-pathed signal or we must have been
>>  * out of queue space.  So zero out the info.
>>  */
>> clear_siginfo(info);
>> info->si_signo = sig;
>> info->si_errno = 0;
>> info->si_code = SI_USER;
>> info->si_pid = 0;
>> info->si_uid = 0;
>>
>> Which allows last_signfo to be set,
>> so despite not really having any siginfo PTRACE_GET_SIGINFO
>> has something to return so does not return -EINVAL.
>>
>> Reconstructing my context that was part of removing SEND_SIG_FORCED
>> so this looks like it will take a little more than a revert to fix
>> this.
>>
>> This is definitely a change that is visible to user space.  The logic in
>> my patch was definitely wrong with respect to SIGSTOP and
>> PTRACE_GETSIGINFO.  Is there something in userspace that actually cares?
>> AKA is the idiom that the test seccomp_bpf.c is using something that
>> non-test code does?
>
> I think this would be needed by any ptracer that handled multiple
> threads. It needs to figure out which pid stopped. I think it's worth
> fixing, yes.
>
>> The change below should restore the old behavior.  I am just wondering
>> if this is something we want to do.  siginfo is allocated with
>> GFP_ATOMIC so if your machine is under memory pressure there is a real
>> chance the allocation can fail.  Which would cause whatever is breaking
>> now to break less deterministically then.
>
> I think memory pressure that would block a 128 byte GFP_ATOMIC
> allocation would mean the system was about to seriously fall over.
> Given the user-facing behavior change and that an existing test was
> already checking for this means we need to fix it.

That sounds good but it is all rubbish.
A) It doesn't matter for tracing multiple processes because ptrace
   only works on a single signal at a time.  AKA if by the time you
   are calling PTRACE_GETSIGINFO you already know which process you are
   working against.
B) Not every signal includes si_pid so even if you didn't know who you
   were talking to this would be an issue.
C) For a non-rt signal we only every try and queue up a signal signal.
   We don

Re: [PATCH v2] signal: add procfd_signal() syscall

2018-12-01 Thread Eric W. Biederman
Arnd Bergmann  writes:

> On Fri, Nov 30, 2018 at 7:56 AM Christian Brauner  
> wrote:
>> On Thu, Nov 29, 2018 at 11:13:57PM -0600, Eric W. Biederman wrote:
>> > Arnd Bergmann  writes:
>> > > On Thu, Nov 29, 2018 at 9:14 PM Andy Lutomirski  
>> > > wrote:
>> > >
>> > > It looks like we already have a 'struct signalfd_siginfo' that is 
>> > > defined in a
>> > > sane architecture-independent way, so I'd suggest we use that.
>> >
>> > Unfortunately it isn't maintained very well.  What you can
>> > express with signalfd_siginfo is a subset what you can express with
>> > siginfo.  Many of the linux extensions to siginfo for exception
>> > information add pointers and have integers right after those pointers.
>> > Not all of those linux specific extentions for exceptions are handled
>> > by signalfd_siginfo (it needs new fields).
>
> Would those fit in the 30 remaining padding bytes?

Probably.  I expect at some point the technique signalfd has been using
won't scale.  Most of what are missing are synchronous exceptions but
the crazy seccomp extensions might be missing as well.  I say crazy
because seccomp places an integer immediately after a pointer which
makes 32bit 64bit compatibility impossible.

>> > As originally defined siginfo had the sigval_t union at the end so it
>> > was perfectly fine on both 32bit and 64bit as it only had a single
>> > pointer in the structure and the other fields were 32bits in size.
>
> The problem with sigval_t of course is that it is incompatible between
> 32-bit and 64-bit. We can add the same information, but at least on
> the syscall level that would have to be a __u64.

But sigval_t with nothing after it is not incompatible between 32bit and
64bit.  With nothing after it sigval_t in struct siginfo already has
the extra 32bits you would need.  It is sort of like transparently
adding a __u64 member to the union.

That is where we really are with sigval_t.   Except for some crazy
corner cases.

>> > Although I do feel the pain as x86_64 has to deal with 3 versions
>> > of siginfo.  A 64bit one.  A 32bit one for ia32.  A 32bit one for x32
>> > with a 64bit si_clock_t.
>
> At least you and Al have managed to get it down to a single source-level
> definition across all architectures, but there is also the (lesser) problem
> that the structure has a slightly different binary layout on each of the
> classic architectures.
>
> If we go with Andy's suggestion of having only a single binary layout
> on x86 for the new call, I'd argue that we should also make it have
> the exact same layout on all other architectures.

Yes.

>> > > We may then also want to make sure that any system call that takes a
>> > > siginfo has a replacement that takes a signalfd_siginfo, and that this
>> > > replacement can be used to implement the old version purely in
>> > > user space.
>> >
>> > If you are not implementing CRIU and reinserting exceptions to yourself.
>> > At most user space wants the ability to implement sigqueue:
>> >
>> > AKA:
>> > sigqueue(pid_t pid, int sig, const union sigval value);
>> >
>> > Well sigqueue with it's own si_codes so the following would cover all
>> > the use cases I know of:
>> > int sendnewsig(pid_t pid, int sig, int si_code, const union sigval value);
>> >
>> > The si_code could even be set to SI_USER to request that the normal
>> > trusted SI_USER values be filled in.  si_code values of < 0 if not
>> > recognized could reasonably safely be treated as the _rt member of
>> > the siginfo union.  Or perhaps better we error out in such a case.
>> >
>> > If we want to be flexible and not have N kinds of system calls that
>> > is the direction I would go.  That is simple, and you don't need any of
>> > the rest.
>
> I'm not sure I understand what you are suggesting here. Would the
> low-level implementation of this be based on procfd, or do you
> mean that would be done for pid_t at the kernel level, plus another
> syscall for procfd?

I was thinking in general and for something that would be enough to back
sigqueue, and normal kill semantics.

For Christians proposed system call I can think of two ways we could go:

long procfd_sigsend(int fd, int sig, int si_code, __u64 sigval);

Where sigval is an appropriately written version of union sigval that
works on both 32bit and 64bit.  Or we could go with:

long procfd_sigqueueinfo(int fd, struct siginfo_subset *info)

Both would support the same set of system calls today and both would
have some similar challenges. 

Re: [PATCH] fs: Make /proc/sys inodes be owned by global root.

2018-12-01 Thread Eric W. Biederman
Radoslaw Burny  writes:

> On Tue, Nov 27, 2018 at 6:29 AM Eric W. Biederman  
> wrote:
>
>  Luis Chamberlain  writes:
>
>  > On Mon, Nov 26, 2018 at 06:26:07PM +0100, Radoslaw Burny wrote:
>  >> Due to a recent commit (d151ddc00498 - fs: Update i_[ug]id_(read|write)
>  >> to translate relative to s_user_ns),
>  >
>  > Recent? This is commit is from 2014 and present upstream since v4.8.
>  > And the commit ID you mentioned in your commit log seems to be
>  > incorrect. I get:
>  >
>  > 81754357770ebd900801231e7bc8d151ddc00498a fs: Update i_[ug]id_(read|write) 
> to translate relative to s_user_ns
>  >
>  >> inodes under /proc/sys have -1
>  >> written to their i_uid/i_gid members if a containing userns does not
>  >> have entries for root in the uid/gid_map.
>  >
>  > Thanks for the description of how to run into the issue described but
>  > is there also a practical use case today where this is happening? I ask
>  > as it would be good to know the severity of the issue in the real world
>  > today.
>
>  People trying to run containers without a root user in the container.
>  It atypical but something doable. 
>
>  >> This wouldn't normally matter, because these values are not used for
>  >> access checks. However, a later change (0bd23d09b874 - Don't modify
>  >> inodes with a uid or gid unknown to the vfs) changes the kernel to
>  >> prevent opens for write if the i_uid/i_gid field in the inode is -1,
>  >> even if the /proc/sys-specific access checks would otherwise pass.
>  >> 
>  >> This causes a problem: in a userns without root mapping, even the
>  >> namespace creator cannot write to e.g. /proc/sys/kernel/shmmax.
>  >> This change fixes the problem by overriding i_uid/i_gid back to
>  >> GLOBAL_ROOT_UID/GID.
>  >
>  > We really need Seth and Eric to provide guidance here as they were
>  > the ones devising this long ago, but to me your solution seems backward.
>  > Why allow any namespace to muck with /proc/sys/ seettings?
>
>  There are many per namespace sysctls. Most of them are in the
>  networking stack.
>
>  > Let's recall that this case was a corner case, and writeback was the
>  > biggest concern, and for that it was decided that you'd simply not get
>  > write access, and so its read only. Its not clear to me if things like
>  > proc were considered. For the regular file case the situation can be
>  > addressed with chown, however we can't chown proc files.
>  >
>  >> Tested: Used a repro program that creates a user namespace without any
>  >> mapping and stat'ed /proc/$PID/root/proc/sys/kernel/shmmax from outside.
>  >> Before the change, it shows uid/gid of 65534,
>  >
>  > I thought you said it would be uid/gid -1 without your patch?
>
>  It is INVALID_UID/INVALID_GID. It is an over simplifcation to call
>  them -1. As they are not a valid value and are never mapped in any
>  user namespace they are displayed as the overflow_uid or overflow_gid
>  which is 65534 by default.
>
>  >> with the change it's 0.
>  >
>  > Note that a good way to also test issues is with the lib/test_sysctl.c
>  > module and the tools/testing/selftests/sysctl/sysctl.sh script, so if
>  > you can device a test there, once we decide what to do that would be
>  > appreciated.
>
>  We spoke about this at LPC. And this is the correct behavioral change.
>
>  The problem is there is a default value for i_uid and i_gid that is
>  correct in the general case. That default value is not corect for
>  sysctl, because proc is weird. As the sysctl permission check in
>  test_perm are all against GLOBAL_ROOT_UID and GLOBAL_ROOT_GID we did not
>  notice that i_uid and i_gid were being set wrong.
>
>  So all this patch does is fix the default values i_uid and i_gid.
>
>  The commit comment seems worth cleaning up. But for the
>  content of the code.
>
>  I expect when I have a few moments I will pick this change up.
>
>  Reviewed-by: "Eric W. Biederman" 
>
>  Eric
>
> Thanks, Eric. Should I send a v2 patch with an updated description,
> or can you just modify the description when applying this one?

I am absolutely swampped and moving at the moment.  Can you please
send a v2 with an updated description.

Thank you,
Eric

>
>  >> Signed-off-by: Radoslaw Burny 
>  >> ---
>  >> fs/proc/proc_sysctl.c | 4 
>  >> 1 file changed, 4 insertions(+)
>  >> 
>  >> diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
>  >> index c5cbbdff3c3d..67379a389658 100644
>  >> --- a/fs/proc/proc_sysctl.c
>  >> +++ b/fs/proc/proc_sysctl.c
>  >> @@ -499,6 +499,10 @@ static struct inode *proc_sys_make_inode(struct 
> super_block *sb,
>  >> 
>  >> if (root->set_ownership)
>  >> root->set_ownership(head, table, >i_uid, >i_gid);
>  >> + else {
>  >> + inode->i_uid = GLOBAL_ROOT_UID;
>  >> + inode->i_gid = GLOBAL_ROOT_GID;
>  >> + }
>  >> 
>  >> out:
>  >> return inode;
>  >> -- 
>  >> 2.20.0.rc0.387.gc7a69e6b6c-goog
>  >> 


Re: [PATCH v2] signal: add procfd_signal() syscall

2018-12-01 Thread Eric W. Biederman
Andy Lutomirski  writes:

> On Fri, Nov 30, 2018 at 3:41 AM Arnd Bergmann  wrote:
>> siginfo_t as it is now still has a number of other downsides, and Andy in
>> particular didn't like the idea of having three new variants on x86
>> (depending on how you count). His alternative suggestion of having
>> a single syscall entry point that takes a 'signfo_t __user *' but interprets
>> it as compat_siginfo depending on in_compat_syscall()/in_x32_syscall()
>> should work correctly, but feels wrong to me, or at least inconsistent
>> with how we do this elsewhere.
>
> BTW, do we consider siginfo_t to be extensible?  If so, and if we pass
> in a pointer, presumably we should pass a length as well.

siginfo is extensible in the sense that the structure is 128 bytes
and we use at most 48 bytes.  siginfo gets embedded in stack frames when
signals get delivered so a size change upwards is non-trivial, and is
possibly and ABI break so I believe a length field would be pointless.

Eric



Re: dcache_readdir NULL inode oops

2018-11-30 Thread Eric W. Biederman
"gre...@linuxfoundation.org"  writes:

> Adding Eric as he touched this code last :)
>
> On Thu, Nov 29, 2018 at 07:25:48PM +, Jan Glauber wrote:
>> On Wed, Nov 28, 2018 at 08:08:06PM +, Will Deacon wrote:
>> > I spent some more time looking at this today...
>> > 
>> > On Fri, Nov 23, 2018 at 06:05:25PM +, Will Deacon wrote:
>> > > Doing some more debugging, it looks like the usual failure case is where
>> > > one CPU clears the inode field in the dentry via:
>> > >
>> > >   devpts_pty_kill()
>> > >   -> d_delete()   // dentry->d_lockref.count == 1
>> > >   -> dentry_unlink_inode()
>> > >
>> > > whilst another CPU gets a pointer to the dentry via:
>> > >
>> > >   sys_getdents64()
>> > >   -> iterate_dir()
>> > >   -> dcache_readdir()
>> > >   -> next_positive()
>> > >
>> > > and explodes on the subsequent inode dereference when trying to pass the
>> > > inode number to dir_emit():
>> > >
>> > >   if (!dir_emit(..., d_inode(next)->i_ino, ...))
>> > >
>> > > Indeed, the hack below triggers a warning, indicating that the inode
>> > > is being cleared concurrently.
>> > >
>> > > I can't work out whether the getdents64() path should hold a refcount
>> > > to stop d_delete() in its tracks, or whether devpts_pty_kill() shouldn't
>> > > be calling d_delete() like this at all.
>> > 
>> > So the issue is that opening /dev/pts/ptmx creates a new pty in /dev/pts,
>> > which disappears when you close /dev/pts/ptmx. Consequently, when we tear
>> > down the dentry for the magic new file, we have to take the i_node rwsem of
>> > the *parent* so that concurrent path walkers don't trip over it whilst its
>> > being freed. I wrote a simple concurrent program to getdents(/dev/pts/) in
>> > one thread, whilst another opens and closes /dev/pts/ptmx: it crashes the
>> > kernel in seconds.
>> 
>> I also made a testcase and verified that your fix is fine. I also tried
>> replacing open-close on /dev/ptmx with mkdir-rmdir but that does not
>> trigger the error.
>> 
>> > Patch below, but I'd still like somebody else to look at this, please.
>> 
>> I wonder why no inode_lock on parent is needed for devpts_pty_new(), but
>> I'm obviously not a VFS expert... So your patch looks good to me and
>> clearly solves the issue.
>> 
>> thanks,
>> Jan
>> 
>> > Will
>> > 
>> > --->8
>> > 
>> > diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
>> > index c53814539070..50ddb95ff84c 100644
>> > --- a/fs/devpts/inode.c
>> > +++ b/fs/devpts/inode.c
>> > @@ -619,11 +619,17 @@ void *devpts_get_priv(struct dentry *dentry)
>> >   */
>> >  void devpts_pty_kill(struct dentry *dentry)
>> >  {
>> > -   WARN_ON_ONCE(dentry->d_sb->s_magic != DEVPTS_SUPER_MAGIC);
>> > +   struct super_block *sb = dentry->d_sb;
>> > +   struct dentry *parent = sb->s_root;
>> > 
>> > +   WARN_ON_ONCE(sb->s_magic != DEVPTS_SUPER_MAGIC);
>
> Side note, I wonder if this is even needed anymore...
>
>> > +
>> > +   inode_lock(parent->d_inode);
>> > dentry->d_fsdata = NULL;
>> > drop_nlink(dentry->d_inode);
>> > d_delete(dentry);
>> > +   inode_unlock(parent->d_inode);
>> > +
>> > dput(dentry);   /* d_alloc_name() in devpts_pty_new() */
>> >  }
>
> This feels right but getting some feedback from others would be good.

This is going to be special at least because we are not coming through
the normal unlink path and we are manipulating the dcache.

This looks plausible.  If this is whats going on then we have had this
bug for a very long time.  I will see if I can make some time.

It looks like in the general case everything is serialized by the
devpts_mutex.  I wonder if just changing the order of operations
here would be enough.

AKA: drop_nlink d_delete then dentry->d_fsdata.  Ugh d_fsdata is not
implicated so that won't help here.

I will look more shortly.  I am in the middle in the move so I don't
have time to complete the analysis today.

Eric



Re: [PATCH] fs: Make /proc/sys inodes be owned by global root.

2018-11-30 Thread Eric W. Biederman
Luis Chamberlain  writes:

> On Mon, Nov 26, 2018 at 11:29:40PM -0600, Eric W. Biederman wrote:
>> Luis Chamberlain  writes:
>> > Thanks for the description of how to run into the issue described but
>> > is there also a practical use case today where this is happening? I ask
>> > as it would be good to know the severity of the issue in the real world
>> > today.
>> 
>> People trying to run containers without a root user in the container.
>> It atypical but something doable.  
>
> My question was if there are generic tools / propreitary tools which are
> doing this widely *today*. Or is this just a custom setup some folks
> use?
>
>> We spoke about this at LPC.  And this is the correct behavioral change.
>> 
>> The problem is there is a default value for i_uid and i_gid that is
>> correct in the general case.  That default value is not corect for
>> sysctl, because proc is weird.  As the sysctl permission check in
>> test_perm are all against GLOBAL_ROOT_UID and GLOBAL_ROOT_GID we did not
>> notice that i_uid and i_gid were being set wrong.
>> 
>> So all this patch does is fix the default values i_uid and i_gid.
>> 
>> The commit comment seems worth cleaning up.  But for the
>> content of the code.
>
> The logic seems sensible then, but are we implicating what a container
> does with its sysctl values onto the entire system? If so, sure, it
> seems you want this for networking purposes as there are a series of
> sysctl values a container may want to muck with, but are we sure we
> want the same for *all* sysctl entries?

No.  Please look at the patch again.  It sets the default uid and gid
for sysctl entries to 0.  AKA GLOBAL_ROOT_UID and GLOBAL_ROOT_GID
because there is a bug and they were not set to that value.

Those are the uids and gids that are tested agasint.  It just happens
you have to be in a weird configuration for this bug to become a problem.

Eric


Re: [PATCH v2] signal: add procfd_signal() syscall

2018-11-29 Thread Eric W. Biederman
Arnd Bergmann  writes:

> On Thu, Nov 29, 2018 at 9:14 PM Andy Lutomirski  wrote:
>> > On Nov 29, 2018, at 11:55 AM, Christian Brauner  
>> > wrote:
>> >> On Thu, Nov 29, 2018 at 11:22:58AM -0800, Andy Lutomirski wrote:
>> >>> On Thu, Nov 29, 2018 at 11:17 AM Christian Brauner 
>> >>>  wrote:
>>  On November 30, 2018 5:54:18 AM GMT+13:00, Andy Lutomirski 
>>   wrote:
>> >>
>> >> The #1 fix would add a copy_siginfo_from_user64() or similar.
>> >
>> > Thanks very much! That all helped a bunch already! I'll try to go the
>> > copy_siginfo_from_user64() way first and see if I can make this work. If
>> > we do this I would however only want to use it for the new syscall first
>> > and not change all other signal syscalls over to it too. I'd rather keep
>> > this patchset focussed and small and do such conversions caused by the
>> > new approach later. Does that sound reasonable?
>>
>> Absolutely. I don’t think we can change old syscalls — the ABI is set in 
>> stone.
>> But for new syscalls, I think the always-64-bit behavior makes sense.
>
> It looks like we already have a 'struct signalfd_siginfo' that is defined in a
> sane architecture-independent way, so I'd suggest we use that.

Unfortunately it isn't maintained very well.  What you can
express with signalfd_siginfo is a subset what you can express with
siginfo.  Many of the linux extensions to siginfo for exception
information add pointers and have integers right after those pointers.
Not all of those linux specific extentions for exceptions are handled
by signalfd_siginfo (it needs new fields).

As originally defined siginfo had the sigval_t union at the end so it
was perfectly fine on both 32bit and 64bit as it only had a single
pointer in the structure and the other fields were 32bits in size.

Although I do feel the pain as x86_64 has to deal with 3 versions
of siginfo.  A 64bit one.  A 32bit one for ia32.  A 32bit one for x32
with a 64bit si_clock_t.

> We may then also want to make sure that any system call that takes a
> siginfo has a replacement that takes a signalfd_siginfo, and that this
> replacement can be used to implement the old version purely in
> user space.

If you are not implementing CRIU and reinserting exceptions to yourself.
At most user space wants the ability to implement sigqueue:

AKA:
sigqueue(pid_t pid, int sig, const union sigval value);

Well sigqueue with it's own si_codes so the following would cover all
the use cases I know of:
int sendnewsig(pid_t pid, int sig, int si_code, const union sigval value);

The si_code could even be set to SI_USER to request that the normal
trusted SI_USER values be filled in.  si_code values of < 0 if not
recognized could reasonably safely be treated as the _rt member of
the siginfo union.  Or perhaps better we error out in such a case.

If we want to be flexible and not have N kinds of system calls that
is the direction I would go.  That is simple, and you don't need any of
the rest.


Alternatively we abandon the mistake of sigqueueinfo and not allow
a signal number in the arguments that differs from the si_signo in the
siginfo and allow passing the entire thing unchanged from sender to
receiver.  That is maximum flexibility.

signalfd_siginfo just sucks in practice.  It is larger that siginfo 104
bytes versus 48.  We must deliver it to userspace as a siginfo so it
must be translated.  Because of the translation signalfd_siginfo adds
no flexiblity in practice, because it can not just be passed through.
Finallay signalfd_siginfo does not have encodings for all of the
siginfo union members, so it fails to be fully general.

Personally if I was to define signalfd_siginfo today I would make it:
struct siginfo_subset {
__u32 sis_signo;
__s32 sis_errno;
__s32 sis_code;
__u32 sis_pad;
__u32 sis_pid;
__u32 sis_uid;
__u64 sis_data (A pointer or integer data field);
};

That is just 32bytes, and is all that is needed for everything
except for synchronous exceptions.  Oh and that happens to be a proper
subset of a any sane siginfo layout, on both 32bit and 64bit.

This is one of those rare times where POSIX is sane and what linux
has implemented is not.

Eric


Re: [PATCH] prctl: add PR_{GET,SET}_KILL_DESCENDANTS_ON_EXIT

2018-11-28 Thread Eric W. Biederman
Oleg Nesterov  writes:

> On 11/27, Jürg Billeter wrote:
>>
>> @@ -704,6 +713,9 @@ static void exit_notify(struct task_struct *tsk, int 
>> group_dead)
>>  struct task_struct *p, *n;
>>  LIST_HEAD(dead);
>>  
>> +if (group_dead && tsk->signal->kill_descendants_on_exit)
>> +walk_process_tree(tsk, kill_descendant_visitor, NULL);
>
> Well, this is not exactly right, at least this is suboptimal in that
> other sub-threads can too call walk_process_tree(kill_descendant_visitor)
> later for no reason.

Oleg I think I am missing something.

Reading kernel/exit.c I see "group_dead = 
atomic_dec_and_test(>signal->live)".
Which seems like enough to ensure exactly one task/thread calls 
walk_process_tree.

Can you explain what I am missing?

Eric



Re: siginfo pid not populated from ptrace?

2018-11-27 Thread Eric W. Biederman
Kees Cook  writes:

> On Tue, Nov 27, 2018 at 4:38 PM, Kees Cook  wrote:
>> On Tue, Nov 27, 2018 at 3:21 PM, Tycho Andersen  wrote:
>>> On Mon, Nov 12, 2018 at 12:24:43PM -0700, Tycho Andersen wrote:
>>>> On Mon, Nov 12, 2018 at 11:55:38AM -0700, Tycho Andersen wrote:
>>>> > I haven't manage to reproduce it on stock v4.20-rc2, unfortunately.
>>>>
>>>> Ok, now I have,
>>>>
>>>> seccomp_bpf.c:2736:global.syscall_restart:Expected getpid() (1493) == 
>>>> info._sifields._kill.si_pid (0)
>>>> global.syscall_restart: Test failed at step #22
>>>
>>> Seems like this is still happening on v4.20-rc4,
>>>
>>> [ RUN  ] global.syscall_restart
>>> seccomp_bpf.c:2736:global.syscall_restart:Expected getpid() (1901) == 
>>> info._sifields._kill.si_pid (0)
>>> global.syscall_restart: Test failed at step #22
>>
>> This fails every time for me -- is it still racey for you?
>>
>> I'm attempting a bisect, hoping it doesn't _become_ racey for me. ;)
>
> This bisect to here for me:
>
> commit f149b31557446aff9ca96d4be7e39cc266f6e7cc
> Author: Eric W. Biederman 
> Date:   Mon Sep 3 09:50:36 2018 +0200
>
> signal: Never allocate siginfo for SIGKILL or SIGSTOP
>
> The SIGKILL and SIGSTOP signals are never delivered to userspace so
>     queued siginfo for these signals can never be observed.  Therefore
> remove the chance of failure by never even attempting to allocate
> siginfo in those cases.
>
> Reviewed-by: Thomas Gleixner 
> Signed-off-by: "Eric W. Biederman" 
>
> They are certainly visible via seccomp ;)

Well SIGSTOP is visible via PTRACE_GETSIGINFO.

I see what is happening now.  Since we don't have queued siginfo
we generate some as:
/*
 * Ok, it wasn't in the queue.  This must be
 * a fast-pathed signal or we must have been
 * out of queue space.  So zero out the info.
 */
clear_siginfo(info);
info->si_signo = sig;
info->si_errno = 0;
info->si_code = SI_USER;
info->si_pid = 0;
info->si_uid = 0;

Which allows last_signfo to be set,
so despite not really having any siginfo PTRACE_GET_SIGINFO
has something to return so does not return -EINVAL.

Reconstructing my context that was part of removing SEND_SIG_FORCED
so this looks like it will take a little more than a revert to fix
this.

This is definitely a change that is visible to user space.  The logic in
my patch was definitely wrong with respect to SIGSTOP and
PTRACE_GETSIGINFO.  Is there something in userspace that actually cares?
AKA is the idiom that the test seccomp_bpf.c is using something that
non-test code does?

The change below should restore the old behavior.  I am just wondering
if this is something we want to do.  siginfo is allocated with
GFP_ATOMIC so if your maching is under memory pressure there is a real
chance the allocation can fail.  Which would cause whatever is breaking
now to break less deterministically then.

If we need to fix this do we need to make siginfo allocation more
reliable?

Eric


diff --git a/kernel/signal.c b/kernel/signal.c
index 4fd431ce4f91..5c34c55bfea4 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1057,10 +1057,10 @@ static int __send_signal(int sig, struct kernel_siginfo 
*info, struct task_struc
 
result = TRACE_SIGNAL_DELIVERED;
/*
-* Skip useless siginfo allocation for SIGKILL SIGSTOP,
+* Skip useless siginfo allocation for SIGKILL,
 * and kernel threads.
 */
-   if (sig_kernel_only(sig) || (t->flags & PF_KTHREAD))
+   if ((sig == SIGKILL) || (t->flags & PF_KTHREAD))
goto out_set;
 
/*



Re: BUG: corrupted list in freeary

2018-11-27 Thread Eric W. Biederman



syzbot  writes:

> Hello,
>
> syzbot found the following crash on:
>
> HEAD commit:e195ca6cb6f2 Merge branch 'for-linus' of git://git.kernel...
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=10d3e6a340
> kernel config:  https://syzkaller.appspot.com/x/.config?x=73e2bc0cb6463446
> dashboard link: https://syzkaller.appspot.com/bug?extid=c92d3646e35bc5d1a909
> compiler:   gcc (GCC) 8.0.1 20180413 (experimental)

I don't have this commit.  I don't have a clue where it find this
commit.  I can't possibly look into this let alone debug this.

That is not a useful but report.  Please fix your bot.

Eric




Re: [PATCH] fs: Make /proc/sys inodes be owned by global root.

2018-11-26 Thread Eric W. Biederman
Luis Chamberlain  writes:

> On Mon, Nov 26, 2018 at 06:26:07PM +0100, Radoslaw Burny wrote:
>> Due to a recent commit (d151ddc00498 - fs: Update i_[ug]id_(read|write)
>> to translate relative to s_user_ns),
>
> Recent? This is commit is from 2014 and present upstream since v4.8.
> And the commit ID you mentioned in your commit log seems to be
> incorrect. I get:
>
> 81754357770ebd900801231e7bc8d151ddc00498a fs: Update i_[ug]id_(read|write) to 
> translate relative to s_user_ns
>
>> inodes under /proc/sys have -1
>> written to their i_uid/i_gid members if a containing userns does not
>> have entries for root in the uid/gid_map.
>
> Thanks for the description of how to run into the issue described but
> is there also a practical use case today where this is happening? I ask
> as it would be good to know the severity of the issue in the real world
> today.

People trying to run containers without a root user in the container.
It atypical but something doable.  

>> This wouldn't normally matter, because these values are not used for
>> access checks. However, a later change (0bd23d09b874 - Don't modify
>> inodes with a uid or gid unknown to the vfs) changes the kernel to
>> prevent opens for write if the i_uid/i_gid field in the inode is -1,
>> even if the /proc/sys-specific access checks would otherwise pass.
>> 
>> This causes a problem: in a userns without root mapping, even the
>> namespace creator cannot write to e.g. /proc/sys/kernel/shmmax.
>> This change fixes the problem by overriding i_uid/i_gid back to
>> GLOBAL_ROOT_UID/GID.
>
> We really need Seth and Eric to provide guidance here as they were
> the ones devising this long ago, but to me your solution seems backward.
> Why allow any namespace to muck with /proc/sys/ seettings?

There are many per namespace sysctls.  Most of them are in the
networking stack.

> Let's recall that this case was a corner case, and writeback was the
> biggest concern, and for that it was decided that you'd simply not get
> write access, and so its read only. Its not clear to me if things like
> proc were considered. For the regular file case the situation can be
> addressed with  chown, however we can't chown proc files.
>
>> Tested: Used a repro program that creates a user namespace without any
>> mapping and stat'ed /proc/$PID/root/proc/sys/kernel/shmmax from outside.
>> Before the change, it shows uid/gid of 65534,
>
> I thought you said it would be uid/gid -1 without your patch?

It is INVALID_UID/INVALID_GID.  It is an over simplifcation to call
them -1.   As they are not a valid value and are never mapped in any
user namespace they are displayed as the overflow_uid or overflow_gid
which is 65534 by default.

>> with the change it's 0.
>
> Note that a good way to also test issues is with the lib/test_sysctl.c
> module and the tools/testing/selftests/sysctl/sysctl.sh script, so if
> you can device a test there, once we decide what to do that would be
> appreciated.

We spoke about this at LPC.  And this is the correct behavioral change.

The problem is there is a default value for i_uid and i_gid that is
correct in the general case.  That default value is not corect for
sysctl, because proc is weird.  As the sysctl permission check in
test_perm are all against GLOBAL_ROOT_UID and GLOBAL_ROOT_GID we did not
notice that i_uid and i_gid were being set wrong.

So all this patch does is fix the default values i_uid and i_gid.

The commit comment seems worth cleaning up.  But for the
content of the code.

I expect when I have a few moments I will pick this change up.

Reviewed-by: "Eric W. Biederman" 

Eric

>> Signed-off-by: Radoslaw Burny 
>> ---
>>  fs/proc/proc_sysctl.c | 4 
>>  1 file changed, 4 insertions(+)
>> 
>> diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
>> index c5cbbdff3c3d..67379a389658 100644
>> --- a/fs/proc/proc_sysctl.c
>> +++ b/fs/proc/proc_sysctl.c
>> @@ -499,6 +499,10 @@ static struct inode *proc_sys_make_inode(struct 
>> super_block *sb,
>>  
>>  if (root->set_ownership)
>>  root->set_ownership(head, table, >i_uid, >i_gid);
>> +else {
>> +inode->i_uid = GLOBAL_ROOT_UID;
>> +inode->i_gid = GLOBAL_ROOT_GID;
>> +}
>>  
>>  out:
>>  return inode;
>> -- 
>> 2.20.0.rc0.387.gc7a69e6b6c-goog
>> 


Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-19 Thread Eric W. Biederman
Daniel Colascione  writes:

> On Mon, Nov 19, 2018 at 1:37 PM Christian Brauner  
> wrote:
>>
>> On Mon, Nov 19, 2018 at 01:26:22PM -0800, Daniel Colascione wrote:
>> > On Mon, Nov 19, 2018 at 1:21 PM, Christian Brauner  
>> > wrote:
>> > > That can be done without a loop by comparing the level counter for the
>> > > two pid namespaces.
>> > >
>> > >>
>> > >> And you can rewrite pidns_get_parent to use it. So you would instead be
>> > >> doing:
>> > >>
>> > >> if (pidns_is_descendant(proc_pid_ns, task_active_pid_ns(current)))
>> > >> return -EPERM;
>> > >>
>> > >> (Or you can just copy the 5-line loop into procfd_signal -- though I
>> > >> imagine we'll need this for all of the procfd_* APIs.)
>> >
>> > Why is any of this even necessary? Why does the child namespace we're
>> > considering even have a file descriptor to its ancestor's procfs? If
>>
>> Because you can send file descriptors between processes and container
>> runtimes tend to do that.
>
> Right. But why *would* a container runtime send one of these procfs
> FDs to a container?
>
>> > it has one of these FDs, it can already *read* all sorts of
>> > information it really shouldn't be able to acquire, so the additional
>> > ability to send a signal (subject to the usual permission checks)
>> > feels like sticking a finger in a dike that's already well-perforated.
>> > IMHO, we shouldn't bother with this check. The patch would be simpler
>> > without it.
>>
>> We will definitely not allow signaling processes in an ancestor pid
>> namespace! That is a security issue! I can imagine container runtimes
>> killing their monitoring process etc. pp. Not happening, unless someone
>> with deep expertise in signals can convince me otherwise.
>
> If parent namespace procfs FDs or mounts really can leak into child
> namespaces as easily as Aleksa says, then I don't mind adding the
> check. I was under the impression that if you find yourself in this
> situation, you already have a big problem.

There is one big reason to have the check, and I have not seen it
mentioned yet in this thread.

When SI_USER is set we report the pid of the sender of the signal in
si_pid.  When the signal comes from the kernel si_pid == 0.  When signal
is sent from an ancestor pid namespace si_pid also equals 0 (which is
reasonable).

A signal out to a process in a parent pid namespace such as SIGCHLD is
reasonable as we can map the pid.  I really don't see the point of
forbidding that.  From the perspective of the process in the parent pid
namespace it is just another process in it's pid namespace.  So it
should pose no problem from the perspective of the receiving process.

A signal to a process in a pid namespace that is neither a parent nor a
descendent pid namespace would be a problem, as there is no well defined
notion of what si_pid should be set to.  So for that case perhaps we
should have something like a noprocess pid that we can set.  Perhaps we
could set si_pid to 0x.  That would take a small extension to
pid_nr_ns.

File descriptors are not namespaced.  It is completely legitimate to use
file descriptors to get around limitations of namespaces.

Adding limitations to a file descriptor based api because someone else
can't set up their processes in such a way as to get the restrictions
they are looking for seems very sad.

Frankly I think it is one of the better features of namespaces that we
have to carefully handle and define these cases so that when the
inevitable leaks happen you are not immediately in a world of hurt.  All
of the other permission checks etc continue to do their job.  Plus you
are prepared for the case when someone wants their containers to have an
interesting communication primitive.

Eric






Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

2018-11-19 Thread Eric W. Biederman
Christian Brauner  writes:

> On Mon, Nov 19, 2018 at 07:59:24AM -0800, Daniel Colascione wrote:
>> You never addressed my comment on the previous patch about your use of
>
> Sorry, that thread exploded so quickly that I might have missed it.
>
>> private_data here. Why can't you use the struct pid reference that's
>> already in the inode?
>
> If that's what people prefer we can probably use that. There was
> precedent for stashing away such data in fs/proc/base.c already for
> various other things including user namespaces and struct mm so I
> followed this model. A prior version of my patch (I didn't send out) did
> retrive the inode via proc_pid() in .open() took an additional reference
> via get_pid() and dropped it in .release(). Do we prefer that?

If you are using proc// directories as your file descriptors, you
don't need to add an open or a release method at all.  The existing file
descriptors hold a reference to the inode which holds a reference the
the struct pid.

The only time you need to get a reference is when you need a task
and kill_pid_info already performs that work for you.

So using proc_pid is all you need to do to get the pid from the existing
file descriptors.

Eric



Re: [PATCH] proc: allow killing processes via file descriptors (Larger pids)

2018-11-19 Thread Eric W. Biederman
Dmitry Safonov <0x7f454...@gmail.com> writes:
>
> So, I just wanted to gently remind about procfs with netlink socket[1].
> It seems to me that whenever you receive() pid information, you
> can request some uniq 64(?) bit number and kill the process using it.
> Whenever uniqueness of 64-bit number to handle pids will be a question
> the netlink message might be painlessly extended to 128 or whatever.

No.

I have seen this requested twice in this thread now, and despite
understanding where everyone is coming from I do not believe it will
be wise to implement larger pids.

The problem is that we then have to make these larger pids live in
the pid namespace, make struct pid larger to hold them and update
CRIU to save and restore them.

All for a very small handful of processes that use this extended API.

Eric


Re: [PATCH] proc: allow killing processes via file descriptors

2018-11-18 Thread Eric W. Biederman
Daniel Colascione  writes:

> On Sun, Nov 18, 2018 at 9:13 AM, Andy Lutomirski  wrote:
>> On Sun, Nov 18, 2018 at 8:29 AM Daniel Colascione  wrote:
>>>
>>> On Sun, Nov 18, 2018 at 8:17 AM, Andy Lutomirski  wrote:
>>> > On Sun, Nov 18, 2018 at 7:53 AM Daniel Colascione  
>>> > wrote:
>>> >>
>>> >> On Sun, Nov 18, 2018 at 7:38 AM, Andy Lutomirski  wrote:
>>> >> > I fully agree that a more comprehensive, less expensive API for
>>> >> > managing processes would be nice.  But I also think that this patch
>>> >> > (using the directory fd and ioctl) is better from a security
>>> >> > perspective than using a new file in /proc.
>>> >>
>>> >> That's an assertion, not an argument. And I'm not opposed to an
>>> >> operation on the directory FD, now that it's clear Linus has banned
>>> >> "write(2)-as-a-command" APIs. I just insist that we implement the API
>>> >> with a system call instead of a less-reliable ioctl due to the
>>> >> inherent namespace collision issues in ioctl command names.
>>> >
>>> > Linus banned it because of bugs iike the ones in the patch.
>>>
>>> Maybe: he didn't provide a reason. What's your point?
>>
>> My point is that an API that involves a file like /proc/PID/kill is
>> very tricky to get right.  Here are some considerations:
>
> Moot. write(2) for this interface is off the table anyway. The right
> approach here is a system call that accepts a /proc/pid directory file
> descriptor, a signal number, and a signal information field (as in
> sigqueue(2)).

If we did not have the permission check challenges and could perform
the permission checks in open, write(2) would be on the table.
Performing write(2) would only be concrend about data.

Unfortunately we have setresuid and exec which make that infeasible
for the kill operations.

>> Now if we had an ioctlat() API, maybe it would make sense.  But we
>> don't, and I think it would be a bit crazy to add one.
>
> A process is not a driver. Why won't this idea of using an ioctl for
> the kill-process-by-dfd thing just won't die? An ioctl has *zero*
> advantage in this context.

An ioctl has an advantage in implementation complexity.  An ioctl is
very much easier to wire up that a system call.

I don't know if that outweighs ioctls disadvantages in long term
maintainability.

Eric



Re: siginfo pid not populated from ptrace?

2018-11-12 Thread Eric W. Biederman
Tycho Andersen  writes:

> On Mon, Nov 12, 2018 at 12:30:25PM -0600, Eric W. Biederman wrote:
>> Tycho Andersen  writes:
>> 
>> > Hi Oleg,
>> >
>> > I've been running some tests on my seccomp series, and in one of the
>> > tests on v4.20-rc2, I noticed,
>> >
>> > [ RUN  ] global.syscall_restart
>> > seccomp_bpf.c:2784:global.syscall_restart:Expected getpid() (1492) == 
>> > info._sifields._kill.si_pid (0)
>> > global.syscall_restart: Test failed at step #22
>> >
>> > which seems unrelated to my series (the kernel was stock v4.20 with my
>> > patches on top).
>> >
>> > I've been running a lot of tests, and only seen this once, so it seems
>> > like a fairly rare race. I tried to look through the code but didn't
>> > see anything obvious. Thoughts?
>> 
>> My guess would be pid namespaces, or stopping for a signal other than
>> SIGSTOP.
>> 
>> If you can get this to reproduce at all it would be interesting to see
>> si_signo and si_code.  So that we can see just which signal is in info,
>> and how it should be decoded. 
>
> Sure, here's what I see,
>
> seccomp_bpf.c:2784:global.syscall_restart:Expected getpid() (2195) == 
> info._sifields._kill.si_pid (0)
> seccomp_bpf.c:2785:global.syscall_restart:si_signo: 19
> seccomp_bpf.c:2786:global.syscall_restart:si_code: 0

That is SI_USER and SIGSTOP.  So as expected. 

>> I see this test at line 2736 in 4.20-rc1 so there are almost 50 lines of
>> change in your version of seccomp_bpf.c.  So I hope I am reading the
>> proper test.
>
> Yes, sorry, that's additional test stuff from my user trap series. I
> haven't manage to reproduce it on stock v4.20-rc2, unfortunately. It
> could be that this is some memory corruption introduced by my series,
> but I'm running these tests with KASAN so hopefully it would complain?

I don't have a clue what KASAN would do.  I think it is mostly concerned
with unitialized memory, so this might be slipping through the cracks
somewhere.

It can be easy to mess up siginfo.  That is part of the reason I just
reworked things to use helpers most of the time when touching siginfo.

Eric


Re: siginfo pid not populated from ptrace?

2018-11-12 Thread Eric W. Biederman
Tycho Andersen  writes:

> Hi Oleg,
>
> I've been running some tests on my seccomp series, and in one of the
> tests on v4.20-rc2, I noticed,
>
> [ RUN  ] global.syscall_restart
> seccomp_bpf.c:2784:global.syscall_restart:Expected getpid() (1492) == 
> info._sifields._kill.si_pid (0)
> global.syscall_restart: Test failed at step #22
>
> which seems unrelated to my series (the kernel was stock v4.20 with my
> patches on top).
>
> I've been running a lot of tests, and only seen this once, so it seems
> like a fairly rare race. I tried to look through the code but didn't
> see anything obvious. Thoughts?

My guess would be pid namespaces, or stopping for a signal other than
SIGSTOP.

If you can get this to reproduce at all it would be interesting to see
si_signo and si_code.  So that we can see just which signal is in info,
and how it should be decoded. 

I see this test at line 2736 in 4.20-rc1 so there are almost 50 lines of
change in your version of seccomp_bpf.c.  So I hope I am reading the
proper test.

Eric


[GIT PULL] namespace fix for v4.20-rc3

2018-11-12 Thread Eric W. Biederman
Linus,

Please pull the for-linus branch from the git tree:

   git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git 
for-linus

   HEAD: 1e9c75fb9c47a75a9aec0cd17db5f6dc36b58e00 mnt: fix __detach_mounts 
infinite loop

Benjamin Coddington noticed an unkillable busy loop in the kernel that
anyone who is sufficiently motivated can trigger.  This bug did not
exist in earlier kernels making this bug a regression.

I have tested the change personally and confirmed that the bug exists
and that the fix works.  This fix has been picked up by linux-next and
hopefully the automated testing bots and no problems have been reported
from those sources.

Ordinarily I would let something like this sit a little longer but I am
going to be away at Linux Plumbers the rest of this week and I am afraid
if I don't send the pull request now this fix will get lost.

Benjamin Coddington (1):
  mnt: fix __detach_mounts infinite loop

 fs/namespace.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Eric


Re: [git pull] mount API series

2018-11-11 Thread Eric W. Biederman
Steven Whitehouse  writes:

> Hi,
>
>
> On 31/10/18 15:38, Eric W. Biederman wrote:
>> Al Viro  writes:
>>
>>> mount API series from David Howells.  Last cycle's objections
>>> had been of the "I'd do it differently" variety and with no such
>>> differently done variants having ever materialized over several
>>> cycles...
>> Absolutely not.
>>
>> My objections fundamentally is that I can find real problems when I look
>> at the code.  Further the changes have not been incremental changes that
>> have evolved the code from one state to another but complete
>> replacements of code that make code review very difficult and bisection
>> completely inapplicable.
>>
>> I also object that this series completely fails to fix the worst but I
>> have ever seen in the mount API.  Whit no real intrest shown in working
>> to fix it.
>>
>> A couple of bugs that I can see quickly.  Several of which I have
>> previously reported:
>>
>> - There is an easily triggered NULL pointer deference with open_tree
>>and mount propagation.
>>
>>
> Can you share some details of what this NULL dereference is? David and
> Al have been working on the changes as requested by Linus later in
> this thread, and they'd like to tidy up this issue too at the same
> time if possible. We are not asking you to actually provide a fix, in
> case you are too busy to do so, however it would be good to know what
> the issue is so that we can make sure that it is resolved in the next
> round of patches,

https://lore.kernel.org/lkml/87bm7n5k1r@xmission.com/

Eric


Re: [PATCH v3] Implement /proc/pid/kill

2018-11-11 Thread Eric W. Biederman
David Laight  writes:

> From: Daniel Colascione
>> Sent: 31 October 2018 19:33
> ...
>> You can't do it today with kill. The idea that keeping a open file
>> descriptor to a /proc/pid or a file within it prevents PID reuse is
>> widespread, but incorrect.
>
> Is there a real good reason why that shouldn't be the case?
> ie Holding a reference on the 'struct pid' being enough to stop reuse.
>
> A patch to do that would be more generally useful.

Holding an open file descriptor to /proc/pid is enough to prevent pid
reuse problems.  If a pid number is reused a new 'struct pid' is
generated.  There is not 'struct pid' reuse.

So in solving this the kernel data structure you would need to hold onto
is a 'struct pid'.  It is just necessary to add an interface to sending
signals to that 'struct pid' and not looking up the 'struct pid' again
by number.

Eric


[GIT PULL] namespace fixes for v4.20-rc2

2018-11-10 Thread Eric W. Biederman


Linus,

Please pull the for-linus branch from the git tree:

   git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git 
for-linus

   HEAD: 9c8e0a1b683525464a2abe9fb4b54404a50ed2b4 mount: Prevent MNT_DETACH 
from disconnecting locked mounts

I believe all of these are simple obviously correct bug fixes.  These
fall into two groups.  Fixing the implementation of MNT_LOCKED which
prevents lesser privileged users from seeing unders mounts created by
more privileged users.  Fixing the extended uid and group mapping in
user namespaces.

As well as ensuring the code looks correct I have spot tested these
changes as well and in my testing the fixes are working.

I have let these changes sit on my branch for a few days as well and
none of the automated testing has found any problems either.

Eric W. Biederman (3):
  mount: Retest MNT_LOCKED in do_umount
  mount: Don't allow copying MNT_UNBINDABLE|MNT_LOCKED mounts
  mount: Prevent MNT_DETACH from disconnecting locked mounts

Jann Horn (1):
  userns: also map extents in the reverse map to kernel IDs

 fs/namespace.c  | 22 +-
 kernel/user_namespace.c | 12 
 2 files changed, 25 insertions(+), 9 deletions(-)

Eric


Re: x86_64 INIT/SIPI Bug

2018-11-08 Thread Eric W. Biederman
Rian Quinn  writes:

> I apologize upfront if this is the wrong place to post this, pretty new to 
> this.
>
> We are working on the Bareflank Hypervisor (www.bareflank.org), and we
> are passing through the INIT/SIPI process (similar to how a VMX
> rootkit from EFI might boot the OS) and we noticed that on Arch Linux,
> the INIT/SIPI process stalls, something we are not seeing on Ubuntu.
>
> Turns out, to fix the issue, we had to turn on cpu_init_udelay=1.
> The problem is, once a hypervisor is turned on, even one that is doing
> nothing but passing through the instructions, the "quick" that is
> detailed below fails as the kernel is not giving the CPU enough time
> to perform a VMExit/VMEntry (even through the VMExit is not doing
> anything).
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/smpboot.c?h=v4.20-rc1#n650
>
> You can see our INIT/SIPI code here if you are interested:
> https://github.com/rianquinn/extended_apis/blob/hyperkernel_1/bfvmm/src/hve/arch/intel_x64/vmexit/init_signal.cpp
>
> The reason I suggest this is a bug is the manual clearly states that a
> wait is required and the "quirk" that turns off this delay prevents
> code like this from working. Would it be possible to either:
> - Turn this off by default, but still allow someone to turn it on if
> they are confident the delay is not needed?
> - Provide a generic way to turn this off (maybe if a hypervisor is
> detected, it defaults to off)?
>
> I'd be more than happy to provide a patch and test, but I'm not sure
> if there is any interest in changing this code.

I would suggest testing either for your hypervisor or simply for being
inside some hypervisor.  As I read the code it is only turning off the
10ms delay on processors where it is know that it is safe to do so.
This is a case where it is not safe to disable the 10ms delay,
so it makes sense not not turn off the delay.

Eric



Re: [PATCH v6 0/1] ns: introduce binfmt_misc namespace

2018-11-01 Thread Eric W. Biederman
Laurent Vivier  writes:

> On 01/11/2018 04:51, Jann Horn wrote:
>> On Thu, Nov 1, 2018 at 3:59 AM James Bottomley
>>  wrote:
>>>
>>> On Tue, 2018-10-16 at 11:52 +0200, Laurent Vivier wrote:
 Hi,

 Any comment on this last version?

 Any chance to be merged?
>>>
>>> I've got a use case for this:  I went to one of the Graphene talks in
>>> Edinburgh and it struck me that we seem to keep reinventing the type of
>>> sandboxing that qemu-user already does.  However if you want to do an
>>> x86 on x86 sandbox, you can't currently use the binfmt_misc mechanism
>>> because that has you running *every* binary on the system emulated.
>>> Doing it per user namespace fixes this problem and allows us to at
>>> least cut down on all the pointless duplication.
>> 
>> Waait. What? qemu-user does not do "sandboxing". qemu-user makes
>> your code slower and *LESS* secure. As far as I know, qemu-user is
>> only intended for purposes like development and testing.
>> 
>
> I think the idea here is not to run qemu, but to use an interpreter
> (something like gVisor) into a container to control the binaries
> execution inside the container without using this interpreter on the
> host itself (container and host shares the same binfmt_misc
> magic/mask).

Please remind me of this patchset after the merge window is over, and if
there are no issues I will take it via my user namespace branch.

Last I looked I had a concern that some of the permission check issues
were being papered over by using override cred instead of fixing the
deaper code.  Sometimes they are necessary but seeing work-arounds
instead of fixes for problems tends to be a maintenance issue, possibly
with security consequences.  Best is if the everyone agrees on how all
of the interfaces work so their are no surprises.

Eric




Re: [git pull] mount API series

2018-10-31 Thread Eric W. Biederman
ebied...@xmission.com (Eric W. Biederman) writes:

> I am going to stop there.  I believe there are more issues in the code.
> I am relieved that I am not seeing the loss of some of the security
> hooks that I thought I saw last time I looked at the code.

Bah.  Now I see the missing security hook.

There are a set of security hooks that allow security modules to parse
mount options.

On a good day they look like:

security_mnt_opts opts;
char *secdata;

secdata = alloc_secdata();
security_sb_copy_data("a,mount,options,string", secdata);

security_init_mnt_opts();
security_parse_opts_str(secdata, );
security_set_mnt_opts(sb, , 0, NULL);
security_free_mnt_opts();

In practice however things are not that explicit.  With
security_sb_kern_mount performing all of the mnt_opts work.

However after the rewrite in the patchset.

The function sb_kern_mount no longer exists and it's replacement
sb_get_tree out of necessity does not call parse_opts_str.  This is
because the mount options can no longer be passed as a string.

The legacy compatibility code also does not call sb_parse_opts_str.

The result is using the existing apis all of the security module command
line parsing except for (btrfs and nfs) no longer works.


The changes are not structured in a way that makes any of this easy to
find.  Which is why I have been saying I wouldn't do it that way.  It
also is the case that this pattern repeats through out the patches.
Replacing code with something brand new, instead of evolving what is
there.  That makes it easy for this kind of thing to slip through.

Eric


Re: [git pull] mount API series

2018-10-31 Thread Eric W. Biederman
Al Viro  writes:

>   mount API series from David Howells.  Last cycle's objections
> had been of the "I'd do it differently" variety and with no such
> differently done variants having ever materialized over several
> cycles...

Absolutely not.

My objections fundamentally is that I can find real problems when I look
at the code.  Further the changes have not been incremental changes that
have evolved the code from one state to another but complete
replacements of code that make code review very difficult and bisection
completely inapplicable.

I also object that this series completely fails to fix the worst but I
have ever seen in the mount API.  Whit no real intrest shown in working
to fix it.

A couple of bugs that I can see quickly.  Several of which I have
previously reported:

- There is an easily triggered NULL pointer deference with open_tree
  and mount propagation.

- Bisection will not work with the cpuset filesystem patch.  At least
  cpuset looks like it may be mountable now.

- The setting of fc->user_ns on proc remains broken.  In particular if you
  create a child user namespace and attempt to mount proc it will succeed
  instead of fail.

- The mqueue filesystem has the same issue as proc.  fc->user_ns is misset.

I suspect I didn't report it well but I reported both the proc and the
mqueue filesystem weeks before the merge window opened.


I am going to stop there.  I believe there are more issues in the code.
I am relieved that I am not seeing the loss of some of the security
hooks that I thought I saw last time I looked at the code.


Given both that I have reported bugs that remain unfixed and it's
non-evolutionary nature that makes this patchset hard to review I have
no confidence in this set of patches.

I think the scope has been too large for anyone to properly review the
code and it should be broken into smaller pieces.  That there were
significant open_tree and move_mount issues being found just before the
merge window opened seems a testament to that.  (AKA the first 3 or so
patches in the patchset and fundamental to the rest).


My apologies that we have not been able communication better and that I
have not been able to clearly point out all of the bugs.   My apologies
I have not yet been able to give more constructive feedback.

Still at the end of the day there remain regressions of existing
functionality in that tree.

Eric


Re: [RFC PATCH] Implement /proc/pid/kill

2018-10-30 Thread Eric W. Biederman
Christian Brauner  writes:

> On Tue, Oct 30, 2018 at 12:12 PM Daniel Colascione  wrote:
>>
>> On Tue, Oct 30, 2018 at 11:04 AM, Christian Brauner
>>  wrote:
>> > On Tue, Oct 30, 2018 at 11:48 AM Daniel Colascione  
>> > wrote:
>> >>
>> >> Why not?
>> >>
>> >> Does your proposed API allow for a race-free pkill, with arbitrary
>> >> selection criteria? This capability is a good litmus test for fixing
>> >> the long-standing Unix process API issues.
>> >
>> > You'd have a handle on the process with an fd so yes, it would be.
>>
>> Thanks. That's good to hear.
>>
>> Any idea on the timetable for this proposal? I'm open to lots of
>> alternative technical approaches, but I don't want this capability to
>> languish for a long time.
>
> Latest end of year likely sooner depending on the feedback I'm getting
> during LPC.

Frankly.  If you want a race free fork variant probably the easiest
thing to do is to return a open copy of the proc directory entry.  Wrapped
in a bind mount so that you can't see beyond that directory in proc.

My only concern would be if a vfsmount per process would be too heavy for such 
a use.

Eric





Re: [RFC PATCH] Implement /proc/pid/kill

2018-10-30 Thread Eric W. Biederman
Aleksa Sarai  writes:

> On 2018-10-29, Daniel Colascione  wrote:
>> Add a simple proc-based kill interface. To use /proc/pid/kill, just
>> write the signal number in base-10 ASCII to the kill file of the
>> process to be killed: for example, 'echo 9 > /proc/$$/kill'.
>> 
>> Semantically, /proc/pid/kill works like kill(2), except that the
>> process ID comes from the proc filesystem context instead of from an
>> explicit system call parameter. This way, it's possible to avoid races
>> between inspecting some aspect of a process and that process's PID
>> being reused for some other process.
>
> (Aside from any UX concerns other folks might have.)
>
> I think it would be a good idea to (at least temporarily) restrict this
> so that only processes that are in the same PID namespace as the /proc
> being resolved through may use this interface. Otherwise you might have
> cases where partial container breakouts can start sending signals to
> PIDs they wouldn't normally be able to address.

No.  That is the container managers job.  If you have the wrong proc
mounted in your container or otherwise allow access to it that is the
fault of the application that set up the container.

The pid namespace limits visibility.  If something becomes visible and
you have permissions over it, it is perfectly reasonable for you to
execute those permissions.

Eric


Re: [RFC PATCH] Implement /proc/pid/kill

2018-10-30 Thread Eric W. Biederman
Daniel Colascione  writes:

> Add a simple proc-based kill interface. To use /proc/pid/kill, just
> write the signal number in base-10 ASCII to the kill file of the
> process to be killed: for example, 'echo 9 > /proc/$$/kill'.
>
> Semantically, /proc/pid/kill works like kill(2), except that the
> process ID comes from the proc filesystem context instead of from an
> explicit system call parameter. This way, it's possible to avoid races
> between inspecting some aspect of a process and that process's PID
> being reused for some other process.
>
> With /proc/pid/kill, it's possible to write a proper race-free and
> safe pkill(1). An approximation follows. A real program might use
> openat(2), having opened a process's /proc/pid directory explicitly,
> with the directory file descriptor serving as a sort of "process
> handle".
>
> #!/bin/bash
> set -euo pipefail
> pat=$1
> for proc_status in /proc/*/status; do (
> cd $(dirname $proc_status)
> readarray proc_argv -d'' < cmdline
> if ((${#proc_argv[@]} > 0)) &&
>[[ ${proc_argv[0]} = *$pat* ]];
> then
> echo 15 > kill
> fi
> ) || true; done
>

In general this looks good.

Unfortunately the permission checks are are subject to a serious
problem.  Even if I don't have permission to kill a process I quite
likely will be allowed to open the file.

Then I just need to find a setuid or setcap executable will write to
stdout or stderr a number.  Then I have killed something I should not
have the privileges to kill.

At a bare minimum you need to perform the permission check using the
credentials of the opener of the file.Which means refactoring
kill_pid so that you can perform the permission check for killing the
application during open.  Given that process credentials can change
completely during exec you also need to rule out a change in process
credentials making it so that the original opener of the file would not
be able to kill the process as it is now.

But overall this looks quite reasaonble.

Eric

> Signed-off-by: Daniel Colascione 
> ---
>  fs/proc/base.c | 39 +++
>  1 file changed, 39 insertions(+)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 7e9f07bf260d..923d62b21e67 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -205,6 +205,44 @@ static int proc_root_link(struct dentry *dentry, struct 
> path *path)
>   return result;
>  }
>  
> +static ssize_t proc_pid_kill_write(struct file *file,
> +const char __user *buf,
> +size_t count, loff_t *ppos)
> +{
> + ssize_t res;
> + int sig;
> + char buffer[4];
> +
> + res = -EINVAL;
> + if (*ppos != 0)
> + goto out;
> +
> + res = -EINVAL;
> + if (count > sizeof(buffer) - 1)
> + goto out;
> +
> + res = -EFAULT;
> + if (copy_from_user(buffer, buf, count))
> + goto out;
> +
> + buffer[count] = '\0';
> + res = kstrtoint(strstrip(buffer), 10, );
> + if (res)
> + goto out;
> +
> + res = kill_pid(proc_pid(file_inode(file)), sig, 0);
> + if (res)
> + goto out;
> + res = count;
> +out:
> + return res;
> +
> +}
> +
> +static const struct file_operations proc_pid_kill_ops = {
> + .write  = proc_pid_kill_write,
> +};
> +
>  static ssize_t get_mm_cmdline(struct mm_struct *mm, char __user *buf,
> size_t count, loff_t *ppos)
>  {
> @@ -2935,6 +2973,7 @@ static const struct pid_entry tgid_base_stuff[] = {
>  #ifdef CONFIG_HAVE_ARCH_TRACEHOOK
>   ONE("syscall",S_IRUSR, proc_pid_syscall),
>  #endif
> + REG("kill",   S_IRUGO | S_IWUGO, proc_pid_kill_ops),
>   REG("cmdline",S_IRUGO, proc_pid_cmdline_ops),
>   ONE("stat",   S_IRUGO, proc_tgid_stat),
>   ONE("statm",  S_IRUGO, proc_pid_statm),


Re: [PATCH v2] proc: use ns_capable instead of capable for timerslack_ns

2018-10-30 Thread Eric W. Biederman
Benjamin Gordon  writes:

> Access to timerslack_ns is controlled by a process having CAP_SYS_NICE
> in its effective capability set, but the current check looks in the root
> namespace instead of the process' user namespace.  Since a process is
> allowed to do other activities controlled by CAP_SYS_NICE inside a
> namespace, it should also be able to adjust timerslack_ns.

Acked-by: "Eric W. Biederman" 

I don't see any fundamental probess with how the processes user
namespace is being accessed.   You can race with setns
and that may result in a descendent user namespace of the current
user namespace being set.  But if you have permissions in the parent
user namespace you will have permissions over a child user namespace.
So the race can't effect the outcome of the ns_capable test.

That and while __task_cred(p) may change it is guaranteed there is a
valid one until __put_task_struct which only happens when a process has
a zero refcount.  Which the success of get_proc_task in before these
checks already ensures is not true.

So from my perspective this looks like a reasonable change.

I don't know how this looks from people who understand the timer bits
and what timerslack does. I suspect it is reasonable as there is no
permission check for changing yourself.

Eric

> Signed-off-by: Benjamin Gordon 
> Cc: John Stultz 
> Cc: "Eric W. Biederman" 
> Cc: Kees Cook 
> Cc: "Serge E. Hallyn" 
> Cc: Thomas Gleixner 
> Cc: Arjan van de Ven 
> Cc: Oren Laadan 
> Cc: Ruchi Kandoi 
> Cc: Rom Lemarchand 
> Cc: Todd Kjos 
> Cc: Colin Cross 
> Cc: Nick Kralevich 
> Cc: Dmitry Shmidt 
> Cc: Elliott Hughes 
> Cc: Android Kernel Team 
> Cc: Andrew Morton 
> ---
>
> Changes from v1:
>   - Use the namespace of the target process instead of the file opener.
> Didn't carry over John Stultz' Acked-by since the changes aren't
> cosmetic.
>
>  fs/proc/base.c | 12 +---
>  1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index c78d8da09b52c..bdc093ba81dd3 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -2385,10 +2385,13 @@ static ssize_t timerslack_ns_write(struct file *file, 
> const char __user *buf,
>   return -ESRCH;
>  
>   if (p != current) {
> - if (!capable(CAP_SYS_NICE)) {
> + rcu_read_lock();
> + if (!ns_capable(__task_cred(p)->user_ns, CAP_SYS_NICE)) {
> + rcu_read_unlock();
>   count = -EPERM;
>   goto out;
>   }
> + rcu_read_unlock();
>  
>   err = security_task_setscheduler(p);
>   if (err) {
> @@ -2421,11 +2424,14 @@ static int timerslack_ns_show(struct seq_file *m, 
> void *v)
>   return -ESRCH;
>  
>   if (p != current) {
> -
> - if (!capable(CAP_SYS_NICE)) {
> + rcu_read_lock();
> + if (!ns_capable(__task_cred(p)->user_ns, CAP_SYS_NICE)) {
> + rcu_read_unlock();
>   err = -EPERM;
>   goto out;
>   }
> + rcu_read_unlock();
> +
>   err = security_task_getscheduler(p);
>   if (err)
>   goto out;


Re: [RFC 00/20] ns: Introduce Time Namespace

2018-10-29 Thread Eric W. Biederman
Thomas Gleixner  writes:

> Andrei,
>
> On Sat, 20 Oct 2018, Andrei Vagin wrote:
>> When a container is migrated to another host, we have to restore its
>> monotonic and boottime clocks, but we still expect that the container
>> will continue using the host real-time clock.
>> 
>> Before stating this series, I was thinking about this, I decided that
>> these cases can be solved independently. Probably, the full isolation of
>> the time sub-system will have much higher overhead than just offsets for
>> a few clocks. And the idea that isolation of the real-time clock should
>> be optional gives us another hint that offsets for monotonic and
>> boot-time clocks can be implemented independently.
>> 
>> Eric and Tomas, what do you think about this? If you agree that these
>> two cases can be implemented separately, what should we do with this
>> series to make it ready to be merged?
>> 
>> I know that we need to:
>> 
>> * look at device drivers that report timestamps in CLOCK_MONOTONIC base.
>
> and CLOCK_BOOTTIME and that's quite a few.
>
>> * forbid changing offsets after creating timers
>
> There are more things to think about. What about interfaces which expose
> boot time or monotonic time in /proc?
>
> Aside of that (I finally came around to look at the series in more detail)
> I'm really unhappy about the unconditional overhead once the Time namespace
> config switch is enabled. This applies especially to the VDSO. We spent
> quite some time recently to squeeze a few cycles out of those functions and
> it would be a pity to pointlessly waste cycles for the !namespace case.
>
> I can see the urge for this, but please let us think it through properly
> before rushing anything in which we are going to regret once we want to do
> more sophisticated time domain management, e.g. support for isolated clock
> real time. I'm worried, that without a clear plan about the overall
> picture, we end up with duct tape which is hard to distangle after the
> fact.
>
> There have been a few other things brought up versus time management in
> general, like the TSN folks utilizing grand clock masters which expose
> random time instead of proper TAI. Plus some requirements for exposing some
> sort of 'monotonic' clocks which are derived from external synchronization
> mechanisms, but should not affect the regular time keeping clocks.
>
> While different issues, these all fall into the category of separate time
> domains, so taking a step back to the drawing board is probably the best
> thing what we can do now.
>
> There are certainly a few things which can be looked at independently,
> e.g. the VDSO mechanics or general mechanisms to avoid plastering the whole
> kernel with these name space functions applying offsets left and right. I
> rather have dedicated core functionality which replaces/amends existing
> timer functions to become time namespace aware.
>
> I'll try to find some time in the next weeks to look deeper into that, but
> I can't promise anything before returning from LPC. Btw, LPC would be a
> great opportunity to discuss that. Are you and the other name space wizards
> there by any chance?

I will be and there are going to be both container and CRIU
mini-conferences.  So there should at least some of us around.

Eric


Re: [PATCH v2] kernel/signal: Signal-based pre-coredump notification

2018-10-25 Thread Eric W. Biederman
Jann Horn  writes:

> On Wed, Oct 24, 2018 at 3:30 PM Eric W. Biederman  
> wrote:
>> Enke Chen  writes:
>> > For simplicity and consistency, this patch provides an implementation
>> > for signal-based fault notification prior to the coredump of a child
>> > process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>> > be used by an application to express its interest and to specify the
>> > signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>> > signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>> >
>> > Changes to prctl(2):
>> >
>> >PR_SET_PREDUMP_SIG (since Linux 4.20.x)
>> >   Set the child pre-coredump signal of the calling process to
>> >   arg2 (either SIGUSR1, or SIUSR2, or SIGCHLD, or 0 to clear).
>> >   This is the signal that the calling process will get prior to
>> >   the coredump of a child process. This value is cleared across
>> >   execve(2), or for the child of a fork(2).
>> >
>> >   When SIGCHLD is specified, the signal code will be set to
>> >   CLD_PREDUMP in such an SIGCHLD signal.
> [...]
>> Ugh.  Your test case is even using signalfd.  So you don't even want
>> this signal to be delivered as a signal.
>
> Just to make sure everyone's on the same page: You're suggesting that
> it might make sense to deliver the pre-dump notification via a new
> type of file instead (along the lines of signalfd, timerfd, eventfd
> and so on)?

My real complaint was that the API was not being tested in the way it
is expected to be used.  Which makes a test pretty much useless as some
aspect userspace could regress and the test would not notice because it
is testing something different.



I do think that a file descriptor based API might be a good alternative
to a signal based API.  The proc connector and signals are not the only
API solution.

The common solution to this problem is that distributions defailt the
rlimit core file size to 0.

Eric


Re: [PATCH] proc: use ns_capable instead of capable for timerslack_ns

2018-10-25 Thread Eric W. Biederman
bmgor...@google.com writes:

> From: Benjamin Gordon 
>
> Access to timerslack_ns is controlled by a process having CAP_SYS_NICE
> in its effective capability set, but the current check looks in the root
> namespace instead of the process' user namespace.  Since a process is
> allowed to do other activities controlled by CAP_SYS_NICE inside a
> namespace, it should also be able to adjust timerslack_ns.

The goal seems legitimate.  However the permission checks look wrong.

In particular the choice of user namespace should be
"p->cred->user_ns".  This will limit this to tasks that have
CAP_SYS_NICE in the same namespace as the task that is being modified.

Testing file->f_cred->user_ns it is testing whoever opened the file and
that could be anyone.

Eric


> Signed-off-by: Benjamin Gordon 
> Cc: John Stultz 
> Cc: Kees Cook 
> Cc: "Serge E. Hallyn" 
> Cc: Thomas Gleixner 
> Cc: Arjan van de Ven 
> Cc: Oren Laadan 
> Cc: Ruchi Kandoi 
> Cc: Rom Lemarchand 
> Cc: Todd Kjos 
> Cc: Colin Cross 
> Cc: Nick Kralevich 
> Cc: Dmitry Shmidt 
> Cc: Elliott Hughes 
> Cc: Android Kernel Team 
> Cc: Andrew Morton 
> ---
>  fs/proc/base.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 7e9f07bf260d..4b50937dff80 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -2356,7 +2356,7 @@ static ssize_t timerslack_ns_write(struct file *file, 
> const char __user *buf,
>   return -ESRCH;
>  
>   if (p != current) {
> - if (!capable(CAP_SYS_NICE)) {
> + if (!ns_capable(file->f_cred->user_ns, CAP_SYS_NICE)) {
>   count = -EPERM;
>   goto out;
>   }
> @@ -2393,7 +2393,7 @@ static int timerslack_ns_show(struct seq_file *m, void 
> *v)
>  
>   if (p != current) {
>  
> - if (!capable(CAP_SYS_NICE)) {
> + if (!ns_capable(seq_user_ns(m), CAP_SYS_NICE)) {
>   err = -EPERM;
>   goto out;
>   }


Re: [PATCH] ARM: mm: Facilitate debugging CONFIG_KUSER_HELPERS disabled

2018-10-25 Thread Eric W. Biederman
Florian Fainelli  writes:

> Some software such as perf makes unconditional use of the special
> [vectors] page which is only provided when CONFIG_KUSER_HELPERS is
> enabled in the kernel.
>
> Facilitate the debugging of such situations by printing a debug message
> to the kernel log showing the task name and the faulting address.

Can't someone trigger this segv deliberately and spam the kerne log?
Doesn't this need something like printk_ratelimit or something to ensure
this message only gets printed once?

Eric

> Suggested-by: Russell King 
> Signed-off-by: Florian Fainelli 
> ---
>  arch/arm/mm/fault.c | 5 +
>  1 file changed, 5 insertions(+)
>
> diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
> index f4ea4c62c613..f17471fbc1c4 100644
> --- a/arch/arm/mm/fault.c
> +++ b/arch/arm/mm/fault.c
> @@ -173,6 +173,11 @@ __do_user_fault(struct task_struct *tsk, unsigned long 
> addr,
>   show_regs(regs);
>   }
>  #endif
> +#ifndef CONFIG_KUSER_HELPERS
> + if ((sig == SIGSEGV) && ((addr & PAGE_MASK) == 0x))
> + printk(KERN_DEBUG "%s: CONFIG_KUSER_HELPERS disabled at 
> 0x%08lx\n",
> +tsk->comm, addr);
> +#endif
>  
>   tsk->thread.address = addr;
>   tsk->thread.error_code = fsr;


Re: [PATCH v2] kernel/signal: Signal-based pre-coredump notification

2018-10-25 Thread Eric W. Biederman
Enke Chen  writes:

> Hi, Eric:
>
> Thanks for your comments. Please see my replies inline.
>
> On 10/24/18 6:29 AM, Eric W. Biederman wrote:
>> Enke Chen  writes:
>> 
>>> For simplicity and consistency, this patch provides an implementation
>>> for signal-based fault notification prior to the coredump of a child
>>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>>> be used by an application to express its interest and to specify the
>>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>>>
>>> Changes to prctl(2):
>>>
>>>PR_SET_PREDUMP_SIG (since Linux 4.20.x)
>>>   Set the child pre-coredump signal of the calling process to
>>>   arg2 (either SIGUSR1, or SIUSR2, or SIGCHLD, or 0 to clear).
>>>   This is the signal that the calling process will get prior to
>>>   the coredump of a child process. This value is cleared across
>>>   execve(2), or for the child of a fork(2).
>>>
>>>   When SIGCHLD is specified, the signal code will be set to
>>>   CLD_PREDUMP in such an SIGCHLD signal.
>> 
>> Your signal handling is still not right.  Please read and comprehend
>> siginfo_layout.
>> 
>> You have not filled in all of the required fields for the SIGCHLD case.
>> For the non SIGCHLD case you are using si_code == 0 == SI_USER which is
>> very wrong.  This is not a user generated signal.
>> 
>> Let me say this slowly.  The pair si_signo si_code determines the union
>> member of struct siginfo.  That needs to be handled consistently. You
>> aren't.  I just finished fixing this up in the entire kernel and now you
>> are trying to add a usage that is worst than most of the bugs I have
>> fixed.  I really don't appreciate having to deal with no bugs.
>> 
>
> My apologies. I will investigate and make them consistent.
>
>> 
>> 
>> Further siginfo can be dropped.  Multiple signals with the same signal
>> number can be consolidated.  What is your plan for dealing with that?
>
> The primary application for the early notification involves a process
> manager which is responsible for re-spawning processes or initiating
> the control-plane fail-over. There are two models:
>
> One model is to have 1:1 relationship between a process manager and
> application process. There can only be one predump-signal (say, SIGUSR1)
> from the child to the parent, and will unlikely be dropped or consolidated.
>
> Another model is to have 1:N where there is only one process manager with
> multiple application processes. One of the RT signal can be used to help
> make it more reliable.

Which suggests you want one of the negative si_codes, and to use the _rt
siginfo member like sigqueue.

>> Other code paths pair with wait to get the information out.  There
>> is no equivalent of wait in your code.
>
> I was not aware of that before.  Let me investigate.
>
>> 
>> Signals can be delayed by quite a bit, scheduling delays etc.  They can
>> not provide any meaningful kind of real time notification.
>> 
>
> The timing requirement is about 50-100 msecs for BFD.  Not sure if that
> qualifies as "real time".  This mechanism has worked well in deployment
> over the years.

It would help if those numbers were put into the patch description so
people can tell if the mechanism is quick enough.

>> So between delays and loss of information signals appear to be a very
>> poor fit for this usecase.
>> 
>> I am concerned about code that does not fit the usecase well because
>> such code winds up as code that no one cares about that must be
>> maintained indefinitely, because somewhere out there there is one use
>> that would break if the interface was removed.  This does not feel like
>> an interface people will want to use and maintain in proper working
>> order forever.
>> 
>> Ugh.  Your test case is even using signalfd.  So you don't even want
>> this signal to be delivered as a signal.
>
> I actually tested sigaction()/waitpid() as well. If there is a preference,
> I can check in the sigaction()/waitpid() version instead.
>
>> 
>> You add an interface that takes a pointer and you don't add a compat
>> interface.  See Oleg's point of just returning the signal number in the
>> return code.
>
> This is what Oleg said "but I won't insist, this is subjective and cosmetic".
>
> It is no big deal either way. It just seems less work if we do not keep
> add

Re: [PATCH v2] kernel/signal: Signal-based pre-coredump notification

2018-10-24 Thread Eric W. Biederman
Enke Chen  writes:

> For simplicity and consistency, this patch provides an implementation
> for signal-based fault notification prior to the coredump of a child
> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
> be used by an application to express its interest and to specify the
> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>
> Changes to prctl(2):
>
>PR_SET_PREDUMP_SIG (since Linux 4.20.x)
>   Set the child pre-coredump signal of the calling process to
>   arg2 (either SIGUSR1, or SIUSR2, or SIGCHLD, or 0 to clear).
>   This is the signal that the calling process will get prior to
>   the coredump of a child process. This value is cleared across
>   execve(2), or for the child of a fork(2).
>
>   When SIGCHLD is specified, the signal code will be set to
>   CLD_PREDUMP in such an SIGCHLD signal.

Your signal handling is still not right.  Please read and comprehend
siginfo_layout.

You have not filled in all of the required fields for the SIGCHLD case.
For the non SIGCHLD case you are using si_code == 0 == SI_USER which is
very wrong.  This is not a user generated signal.

Let me say this slowly.  The pair si_signo si_code determines the union
member of struct siginfo.  That needs to be handled consistently. You
aren't.  I just finished fixing this up in the entire kernel and now you
are trying to add a usage that is worst than most of the bugs I have
fixed.  I really don't appreciate having to deal with no bugs.



Further siginfo can be dropped.  Multiple signals with the same signal
number can be consolidated.  What is your plan for dealing with that?
Other code paths pair with wait to get the information out.  There
is no equivalent of wait in your code.

Signals can be delayed by quite a bit, scheduling delays etc.  They can
not provide any meaningful kind of real time notification.

So between delays and loss of information signals appear to be a very
poor fit for this usecase.

I am concerned about code that does not fit the usecase well because
such code winds up as code that no one cares about that must be
maintained indefinitely, because somewhere out there there is one use
that would break if the interface was removed.  This does not feel like
an interface people will want to use and maintain in proper working
order forever.

Ugh.  Your test case is even using signalfd.  So you don't even want
this signal to be delivered as a signal.

You add an interface that takes a pointer and you don't add a compat
interface.  See Oleg's point of just returning the signal number in the
return code.

Now I am wondering how well prctl works from a 32bit process on a 64bit
kernel.  At first glance it looks like it probably does not work.

Consistency with PDEATHSIG is not a good argument for anything.
PDEATHSIG at the present time is unusable in the real world by most
applications that want something like it.

So far I see an interface that even you don't want to use as designed,
that is implemented incorrectly.

The concern is real and deserves to be addressed.  I don't think signals
are the right way to handle it, and certainly not this patch as it
stands.

Eric


> diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
> index 9ccbf05..a3deba8 100644
> --- a/arch/x86/kernel/signal_compat.c
> +++ b/arch/x86/kernel/signal_compat.c
> @@ -30,7 +30,7 @@ static inline void signal_compat_build_tests(void)
>   BUILD_BUG_ON(NSIGSEGV != 7);
>   BUILD_BUG_ON(NSIGBUS  != 5);
>   BUILD_BUG_ON(NSIGTRAP != 5);
> - BUILD_BUG_ON(NSIGCHLD != 6);
> + BUILD_BUG_ON(NSIGCHLD != 7);
>   BUILD_BUG_ON(NSIGSYS  != 1);
>  
>   /* This is part of the ABI and can never change in size: */
> diff --git a/fs/coredump.c b/fs/coredump.c
> index e42e17e..f11e31f 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -546,6 +546,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>   struct cred *cred;
>   int retval = 0;
>   int ispipe;
> + bool notify;
>   struct files_struct *displaced;
>   /* require nonrelative corefile path and be extra careful */
>   bool need_suid_safe = false;
> @@ -590,6 +591,15 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>   if (retval < 0)
>   goto fail_creds;
>  
> + /*
> +  * Send the pre-coredump signal to the parent if requested.
> +  */
> + read_lock(_lock);
> + notify = do_notify_parent_predump(current);
> + read_unlock(_lock);
> + if (notify)
> + cond_resched();
> +
>   old_cred = override_creds(cred);
>  
>   ispipe = format_corename(, );
> diff --git a/fs/exec.c b/fs/exec.c
> index fc281b7..7714da7 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1181,6 +1181,9 @@ static int de_thread(struct task_struct *tsk)
>   /* we have changed execution domain */
>   

[GIT PULL] siginfo updates for 4.20-rc1

2018-10-22 Thread Eric W. Biederman


Linus,

Please pull the siginfo-linus branch from the git tree:

   git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git 
siginfo-linus

   HEAD: a36700589b85443e28170be59fa11c8a104130a5 signal: Guard against 
negative signal numbers in copy_siginfo_from_user32

I have been slowly sorting out siginfo and this is the culmination of that work.

The primary result is in several ways the signal infrastructure has been made
less error prone.  The code has been updated so that manually specifying
SEND_SIG_FORCED is never necessary.  The conversion to the new siginfo sending
functions is now complete, which makes it difficult to send a signal without
filling in the proper siginfo fields.

At the tail end of the patchset comes the optimization of decreasing the size of
struct siginfo in the kernel from 128 bytes to about 48 bytes on 64bit.  The
fundamental observation that enables this is by definition none of the known
ways to use struct siginfo uses the extra bytes.

This comes at the cost of a small user space observable difference.  For the
rare case of siginfo being injected into the kernel only what can be copied
into kernel_siginfo is delivered to the destination, the rest of the bytes are
set to 0.  For cases where the signal and the si_code are known this is safe,
because we know thos bytes are not used.  For cases where the signal and si_code
combination is unknown the bits that won't fit into struct kernel_siginfo are
tested to verify they are zero, and the send fails if they are not.

I made an extensive search through userspace code and I could not find anything
that would break because of the above change.  If it turns out I did break
something it will take just the revert of a single change to restore
kernel_siginfo to the same size as userspace siginfo.

Testing did reveal dependencies on preferring the signo passed to sigqueueinfo
over si->signo, so bit the bullet and added the complexity necessary to handle
that case.

Testing also revealed bad things can happen if a negative signal number is
passed into the system calls.  Something no sane application will do but
something a malicious program or a fuzzer might do.  So I have fixed the code
that performs the bounds checks to ensure negative signal numbers are handled.


There are minor conflicts between this tree and several other trees.
- The x86 tree
- The y2038 tree
- The arm64 tree
- The x86 tip tree

I think only the resolution of the x86 tip tree is at all difficult.  None of
the conflicts are fundamental.  They are all from changes to other parts of the
code that are just close enough to have context conflicts.  The x86 tip tree
conflict actually involves a conflict from removing a unnecessary pkey parameter
on the siginfo side and a some small refactoring on the x86 side.

Eric W. Biederman (80):
  signal: Always ignore SIGKILL and SIGSTOP sent to the global init
  signal: Properly deliver SIGILL from uprobes
  signal: Properly deliver SIGSEGV from x86 uprobes
  signal: Always deliver the kernel's SIGKILL and SIGSTOP to a pid 
namespace init
  signal: send_sig_all no longer needs SEND_SIG_FORCED
  signal: Remove the siginfo paramater from kernel_dqueue_signal
  signal: Don't send siginfo to kthreads.
  signal: Never allocate siginfo for SIGKILL or SIGSTOP
  signal: Use SEND_SIG_PRIV not SEND_SIG_FORCED with SIGKILL and SIGSTOP
  signal: Remove SEND_SIG_FORCED
  signal/GenWQE: Fix sending of SIGKILL
  tty_io: Use group_send_sig_info in __do_SACK to note it is a session 
being killed
  signal: Use group_send_sig_info to kill all processes in a pid namespace
  signal: Remove specific_send_sig_info
  signal: Pair exports with their functions
  signal: Simplify tracehook_report_syscall_exit
  signal/x86: Inline fill_sigtrap_info in it's only caller send_sigtrap
  signal/x86: Move MCE error reporting out of force_sig_info_fault
  signal/x86: Use send_sig_mceerr as apropriate
  signal/x86: In trace_mpx_bounds_register_exception add __user annotations
  signal/x86: Move mpx siginfo generation into do_bounds
  signal/x86/traps: Factor out show_signal
  signal/x86/traps: Move more code into do_trap_no_signal so it can be 
reused
  signal/x86/traps: Use force_sig_bnderr
  signal/x86/traps: Use force_sig instead of open coding it.
  signal/x86/traps: Simplify trap generation
  signal/x86: Remove pkey parameter from bad_area_nosemaphore
  signal/x86: Remove the pkey parameter from do_sigbus
  signal/x86: Remove pkey parameter from mm_fault_error
  signal/x86: Don't compute pkey in __do_page_fault
  signal/x86: Pass pkey not vma into __bad_area
  signal/x86: Call force_sig_pkuerr from __bad_area_nosemaphore
  signal/x86: Replace force_sig_info_fault with force_sig_fault
  signal/x86: Pass pkey by value
  signal/x86: Use force_sig_fault where appropriate
  signal/powerpc: 

Re: Interrupts, smp_load_acquire(), smp_store_release(), etc.

2018-10-22 Thread Eric W. Biederman
"Paul E. McKenney"  writes:

> On Sat, Oct 20, 2018 at 04:18:37PM -0400, Alan Stern wrote:
>> On Sat, 20 Oct 2018, Paul E. McKenney wrote:
>> 
>> > The second (informal) litmus test has a more interesting Linux-kernel
>> > counterpart:
>> > 
>> >void t1_interrupt(void)
>> >{
>> >r0 = READ_ONCE(y);
>> >smp_store_release(, 1);
>> >}
>> > 
>> >void t1(void)
>> >{
>> >smp_store_release(, 1);
>> >}
>> > 
>> >void t2(void)
>> >{
>> >r1 = smp_load_acquire();
>> >r2 = smp_load_acquire();
>> >}
>> > 
>> > On store-reordering architectures that implement smp_store_release()
>> > as a memory-barrier instruction followed by a store, the interrupt could
>> > arrive betweentimes in t1(), so that there would be no ordering between
>> > t1_interrupt()'s store to x and t1()'s store to y.  This could (again,
>> > in paranoid theory) result in the outcome r0==0 && r1==0 && r2==1.
>> 
>> This is disconcerting only if we assume that t1_interrupt() has to be
>> executed by the same CPU as t1().  If the interrupt could be fielded by
>> a different CPU then the paranoid outcome is perfectly understandable,
>> even in an SC context.
>> 
>> So the question really should be limited to situations where a handler 
>> is forced to execute in the context of a particular thread.  While 
>> POSIX does allow such restrictions for user programs, I'm not aware of 
>> any similar mechanism in the kernel.

> Good point, and I was in fact assuming that t1() and t1_interrupt()
> were executing on the same CPU.
>
> This sort of thing happens naturally in the kernel when both t1()
> and t1_interrupt() are accessing per-CPU variables.

Interrupts have a cpumask of the cpus they may be dlievered on.

I believe networking does in fact have places where percpu actions
happen as well as interrupts pinned to a single cpu.  And yes I agree
percpu variables mean that you do not need to pin an interrupt to a
single cpu to cause this to happen.

Eric


Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

2018-10-21 Thread Eric W. Biederman
David Howells  writes:

> From: Al Viro 
>
> Allow a detached tree created by open_tree(..., OPEN_TREE_CLONE) to be
> attached by move_mount(2).
>
> If by the time of final fput() of OPEN_TREE_CLONE-opened file its tree is
> not detached anymore, it won't be dissolved.  move_mount(2) is adjusted
> to handle detached source.
>
> That gives us equivalents of mount --bind and mount --rbind.

In light of recent conversations about double umount_tree.

Do we want to simply limit ourselves to attaching file->f_path of
a FMODE_NEED_UMOUNT file descriptor and clearing FMODE_NEED_UMOUNT
when it is attached?

Either that or perhaps move the logic into mntput on when to perform the
umount_tree?

Otherwise I believe we start running into surprising situations:

This works:
ofd = open_tree(...);
dup_fd = openat(ofd, "", O_PATH);
mount_move(dup_fd, ...);
close(ofd);
close(dup_fd);

This should fail:
ofd = open_tree(...);
dup_fd = openat(ofd, "", O_PATH);
close(ofd);
mount_move(dup_fd, ...);
close(dup_fd);

Allowing any file descriptor that points to mnt_ns == NULL without
MNT_UMOUNT set seems like it is adding flexibility for no reason.


> Signed-off-by: Al Viro 
> Signed-off-by: David Howells 
> ---
>
>  fs/namespace.c |   26 --
>  1 file changed, 20 insertions(+), 6 deletions(-)
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index dd38141b1723..caf5c55ef555 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1785,8 +1785,10 @@ void dissolve_on_fput(struct vfsmount *mnt)
>  {
>   namespace_lock();
>   lock_mount_hash();
> - mntget(mnt);
> - umount_tree(real_mount(mnt), UMOUNT_CONNECTED);
> + if (!real_mount(mnt)->mnt_ns) {
> + mntget(mnt);
> + umount_tree(real_mount(mnt), UMOUNT_CONNECTED);
> + }

^^ This change should be unnecessary.

>   unlock_mount_hash();
>   namespace_unlock();
>  }
> @@ -2393,6 +2395,7 @@ static int do_move_mount(struct path *old_path, struct 
> path *new_path)
>   struct mount *old;
>   struct mountpoint *mp;
>   int err;
> + bool attached;
>  
>   mp = lock_mount(new_path);
>   err = PTR_ERR(mp);
> @@ -2403,10 +2406,19 @@ static int do_move_mount(struct path *old_path, 
> struct path *new_path)
>   p = real_mount(new_path->mnt);
>  
>   err = -EINVAL;
> - if (!check_mnt(p) || !check_mnt(old))
> + /* The mountpoint must be in our namespace. */
> + if (!check_mnt(p))
> + goto out1;
> + /* The thing moved should be either ours or completely unattached. */
> + if (old->mnt_ns && !check_mnt(old))
>   goto out1;

^^^

!old->mnt_ns  should only be allowed when it comes from a file
 descriptor with FMODE_NEED_UMOUNT.


> - if (!mnt_has_parent(old))
> + attached = mnt_has_parent(old);
> + /*
> +  * We need to allow open_tree(OPEN_TREE_CLONE) followed by
> +  * move_mount(), but mustn't allow "/" to be moved.
> +  */
> + if (old->mnt_ns && !attached)
>   goto out1;
>  
>   if (old->mnt.mnt_flags & MNT_LOCKED)
> @@ -2421,7 +2433,7 @@ static int do_move_mount(struct path *old_path, struct 
> path *new_path)
>   /*
>* Don't move a mount residing in a shared parent.
>*/
> - if (IS_MNT_SHARED(old->mnt_parent))
> + if (attached && IS_MNT_SHARED(old->mnt_parent))
>   goto out1;
>   /*
>* Don't move a mount tree containing unbindable mounts to a destination
> @@ -2435,7 +2447,7 @@ static int do_move_mount(struct path *old_path, struct 
> path *new_path)
>   goto out1;
>  
>   err = attach_recursive_mnt(old, real_mount(new_path->mnt), mp,
> -_path);
> +attached ? _path : NULL);
>   if (err)
>   goto out1;

^^^
Somewhere around here we should have code that clears FMODE_NEED_UMOUNT.


> @@ -3121,6 +3133,8 @@ SYSCALL_DEFINE5(mount, char __user *, dev_name, char 
> __user *, dir_name,
>  
>  /*
>   * Move a mount from one place to another.
> + * In combination with open_tree(OPEN_TREE_CLONE [| AT_RECURSIVE]) it can be
> + * used to copy a mount subtree.
>   *
>   * Note the flags value is a combination of MOVE_MOUNT_* flags.
>   */


Re: [PATCH 01/34] vfs: syscall: Add open_tree(2) to reference or clone a mount [ver #12]

2018-10-21 Thread Eric W. Biederman
David Howells  writes:

> From: Al Viro 
>
> open_tree(dfd, pathname, flags)
>
> Returns an O_PATH-opened file descriptor or an error.
> dfd and pathname specify the location to open, in usual
> fashion (see e.g. fstatat(2)).  flags should be an OR of
> some of the following:
>   * AT_PATH_EMPTY, AT_NO_AUTOMOUNT, AT_SYMLINK_NOFOLLOW -
> same meanings as usual
>   * OPEN_TREE_CLOEXEC - make the resulting descriptor
> close-on-exec
>   * OPEN_TREE_CLONE or OPEN_TREE_CLONE | AT_RECURSIVE -
> instead of opening the location in question, create a detached
> mount tree matching the subtree rooted at location specified by
> dfd/pathname.  With AT_RECURSIVE the entire subtree is cloned,
> without it - only the part within in the mount containing the
> location in question.  In other words, the same as mount --rbind
> or mount --bind would've taken.  The detached tree will be
> dissolved on the final close of obtained file.  Creation of such
> detached trees requires the same capabilities as doing mount --bind.


What happens when mounts propgate to such a detached mount tree?

It looks to me like the test in propagate_one for setting
CL_UNPRIVILEGED will trigger a NULL pointer dereference:

AKA:
/* Notice when we are propagating across user namespaces */
if (m->mnt_ns->user_ns != user_ns)
type |= CL_UNPRIVILEGED;

Since we don't know which mount namespace if any this subtree is going
into the test should become:

if (!m->mnt_ns || (m->mnt_ns->user_ns != user_ns))
type |= CL_UNPRIVILEGED;

If the tree is never attached anywhere it won't hurt as we don't allow
mounts or umounts on the detached subtree.  We don't have enough
information to know about the namespace we copied from if it would cause
CL_UNPRIVILEGED to be set on any given propagation.  Similarly we don't
have any information at all about the mount namespace this tree may
possibily be copied to.

Eric


> Signed-off-by: Al Viro 
> Signed-off-by: David Howells 
> cc: linux-...@vger.kernel.org
> ---
>
>  arch/x86/entry/syscalls/syscall_32.tbl |1 
>  arch/x86/entry/syscalls/syscall_64.tbl |1 
>  fs/file_table.c|9 +-
>  fs/internal.h  |1 
>  fs/namespace.c |  132 
> +++-
>  include/linux/fs.h |7 +-
>  include/linux/syscalls.h   |1 
>  include/uapi/linux/fcntl.h |2 
>  include/uapi/linux/mount.h |   10 ++
>  9 files changed, 137 insertions(+), 27 deletions(-)
>  create mode 100644 include/uapi/linux/mount.h
>
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
> b/arch/x86/entry/syscalls/syscall_32.tbl
> index 3cf7b533b3d1..ea1b413afd47 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -398,3 +398,4 @@
>  384  i386arch_prctl  sys_arch_prctl  
> __ia32_compat_sys_arch_prctl
>  385  i386io_pgetevents   sys_io_pgetevents   
> __ia32_compat_sys_io_pgetevents
>  386  i386rseqsys_rseq
> __ia32_sys_rseq
> +387  i386open_tree   sys_open_tree   
> __ia32_sys_open_tree
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
> b/arch/x86/entry/syscalls/syscall_64.tbl
> index f0b1709a5ffb..0545bed581dc 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -343,6 +343,7 @@
>  332  common  statx   __x64_sys_statx
>  333  common  io_pgetevents   __x64_sys_io_pgetevents
>  334  common  rseq__x64_sys_rseq
> +335  common  open_tree   __x64_sys_open_tree
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/fs/file_table.c b/fs/file_table.c
> index e49af4caf15d..e03c8d121c6c 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -255,6 +255,7 @@ static void __fput(struct file *file)
>   struct dentry *dentry = file->f_path.dentry;
>   struct vfsmount *mnt = file->f_path.mnt;
>   struct inode *inode = file->f_inode;
> + fmode_t mode = file->f_mode;
>  
>   if (unlikely(!(file->f_mode & FMODE_OPENED)))
>   goto out;
> @@ -277,18 +278,20 @@ static void __fput(struct file *file)
>   if (file->f_op->release)
>   file->f_op->release(inode, file);
>   if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL &&
> -  !(file->f_mode & FMODE_PATH))) {
> +  !(mode & FMODE_PATH))) {
>   cdev_put(inode->i_cdev);
>   }
>   fops_put(file->f_op);
>   put_pid(file->f_owner.pid);
> - if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
> + if ((mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
>   i_readcount_dec(inode);
> - if (file->f_mode & FMODE_WRITER) {
> + 

Re: [PATCH 31/34] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #12]

2018-10-17 Thread Eric W. Biederman
David Howells  writes:

> Eric W. Biederman  wrote:
>
>> Davids check will work for bind mounts as well.  It just won't work it
>> just won't work for files or subdirectories of some mountpoint.
>
> Bind mounts have to be done with open_tree() and move_mount().  You can't now
> do fsmount() on something fspick()'d.

But a bind mount will have mnt_root set to the the dentry that was
bound.

Therefore fspick as you are proposing modifying will work for the root
of bind mounts, as well as the root of regular mounts.  My apologies for
not being clear.

Eric



Re: [PATCH 31/34] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #12]

2018-10-17 Thread Eric W. Biederman
Alan Jenkins  writes:

> On 17/10/2018 14:20, David Howells wrote:
>> David Howells  wrote:
>>
>>> I should probably check that the picked point is actually a mountpoint.
>> The root of the mount object at the path specified, that is, perhaps with
>> something like the attached.
>>
>> David
>
>
> I agree.  I'm happy to see this is using the same check as do_remount().
>
>
> * change filesystem flags. dir should be a physical root of filesystem.
> * If you've mounted a non-root directory somewhere and want to do remount
> * on it - tough luck.
> */

Davids check will work for bind mounts as well.  It just won't work it
just won't work for files or subdirectories of some mountpoint.

Eric

>> ---
>> diff --git a/fs/fsopen.c b/fs/fsopen.c
>> index f673e93ac456..a17a233c 100644
>> --- a/fs/fsopen.c
>> +++ b/fs/fsopen.c
>> @@ -186,6 +186,10 @@ SYSCALL_DEFINE3(fspick, int, dfd, const char __user *, 
>> path, unsigned int, flags
>>  if (ret < 0)
>>  goto err;
>>   +  ret = -EINVAL;
>> +if (target.mnt->mnt_root != target.dentry)
>> +goto err_path;
>> +
>>  fc = vfs_new_fs_context(target.dentry->d_sb->s_type, target.dentry,
>>  0, 0, FS_CONTEXT_FOR_RECONFIGURE);
>>  if (IS_ERR(fc)) {
>>


Re: [PATCH v3 1/2] sysctl: handle overflow in proc_get_long

2018-10-16 Thread Eric W. Biederman
Christian Brauner  writes:

> proc_get_long() is a funny function. It uses simple_strtoul() and for a
> good reason. proc_get_long() wants to always succeed the parse and return
> the maybe incorrect value and the trailing characters to check against a
> pre-defined list of acceptable trailing values.
> However, simple_strtoul() explicitly ignores overflows which can cause
> funny things like the following to happen:
>
> echo 18446744073709551616 > /proc/sys/fs/file-max
> cat /proc/sys/fs/file-max
> 0
>
> (Which will cause your system to silently die behind your back.)
>
> On the other hand kstrtoul() does do overflow detection but does not return
> the trailing characters, and also fails the parse when anything other than
> '\n' is a trailing character whereas proc_get_long() wants to be more
> lenient.
>
> Now, before adding another kstrtoul() function let's simply add a static
> parse strtoul_lenient() which:
> - fails on overflow with -ERANGE
> - returns the trailing characters to the caller
>
> The reason why we should fail on ERANGE is that we already do a partial
> fail on overflow right now. Namely, when the TMPBUFLEN is exceeded. So we
> already reject values such as 184467440737095516160 (21 chars) but accept
> values such as 18446744073709551616 (20 chars) but both are overflows. So
> we should just always reject 64bit overflows and not special-case this
> based on the number of chars.
>
> Acked-by: Kees Cook 
> Signed-off-by: Christian Brauner 
> Signed-off-by: Christian Brauner 
> ---
> v2->v3:
> - (Kees) s/#include <../lib/kstrtox.h>/#include "../lib/kstrtox.h"/g
> - (Kees) document strtoul_lenient()
>
> v1->v2:
> - s/sysctl_cap_erange/sysctl_lenient/g
> - consistenly fail on overflow
>
> v0->v1:
> - s/sysctl_strtoul_lenient/strtoul_cap_erange/g
> - (Al) remove bool overflow return argument from strtoul_cap_erange
> - (Al) return ULONG_MAX on ERANGE from strtoul_cap_erange
> - (Dominik) fix spelling in commit message
> ---
>  kernel/sysctl.c | 40 +++-
>  1 file changed, 39 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index cc02050fd0c4..102aa7a65687 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -68,6 +68,8 @@
>  #include 
>  #include 
>  
> +#include "../lib/kstrtox.h"
> +
>  #include 
>  #include 
>  
> @@ -2065,6 +2067,41 @@ static void proc_skip_char(char **buf, size_t *size, 
> const char v)
>   }
>  }
>  
> +/**
> + * strtoul_lenient - parse an ASCII formatted integer from a buffer and only
> + *   fail on overflow
> + *
> + * @cp: kernel buffer containing the string to parse
> + * @endp: pointer to store the trailing characters
> + * @base: the base to use
> + * @res: where the parsed integer will be stored
> + *
> + * In case of success 0 is returned and @res will contain the parsed integer,
> + * @endp will hold any trailing characters.
> + * This function will fail the parse on overflow. If there wasn't an overflow
> + * the function will defer the decision what characters count as invalid to 
> the
> + * caller.
> + */
> +static int strtoul_lenient(const char *cp, char **endp, unsigned int base,
> +unsigned long *res)
> +{
> + unsigned long long result;
> + unsigned int rv;
> +
> + cp = _parse_integer_fixup_radix(cp, );
> + rv = _parse_integer(cp, base, );
> + if ((rv & KSTRTOX_OVERFLOW) || (result != (unsigned long)result))
> + return -ERANGE;
> +
> + cp += rv;
> +
> + if (endp)
> + *endp = (char *)cp;
> +
> + *res = (unsigned long)result;
> + return 0;
> +}
> +
>  #define TMPBUFLEN 22
>  /**
>   * proc_get_long - reads an ASCII formatted integer from a user buffer
> @@ -2108,7 +2145,8 @@ static int proc_get_long(char **buf, size_t *size,
>   if (!isdigit(*p))
>   return -EINVAL;
>  
> - *val = simple_strtoul(p, , 0);
> + if (strtoul_lenient(p, , 0, val))
> + return -EINVAL;

Is it deliberate that on an error stroul_lenient returns -ERANGE but
then proc_get_long returns -EINVAL?  That feels wrong.  The write system
call does not permit -ERANGE or -EINVAL for the contents of the data
so both options appear equally bad from a standards point of view.

I am just wondering what the thinking is here.

>   len = p - tmp;


Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-16 Thread Eric W. Biederman
Enke Chen  writes:

> Hi, Eric:
>
> On 10/15/18 4:28 PM, Eric W. Biederman wrote:

>> With that said I think the best solution would be to figure out how to
>> allow the coredump to run in parallel with the usual exit signal, and
>> exit code reaping of the process> 
>> That would solve the problem for everyone, and would not introduce any
>> new complicated APIs.
>
> That would certainly help. But given the huge deployment of Linux, I don't
> think it would be feasible to change this fundamental behavior (signal post
> coredump).

Of course it will be feasible to change.  Make it a sysctl and keep the
current default and no one will even notice.  Waiting for something that
is happening asynchronously is not be difficult so having the wait
optional should not be a problem.

Right now the default in most distributions is to disable core dumps
entirely.   Which means that you are going to have to find a very
specific situation in which people and applications care about core
dumps happening to break an existing setup.

Then all you have to do to get the non-blocking behavior is to just do:
echo 1 > /proc/sys/kernel_core_async

Then everything else works without modifications and everyone is happy.
Maybe I am wearing rose colored glasses but that looks like all that is
needed and it should be much easier to work with and maintain than
having to modify every manager process to listen for unreliable signals,
and then take action on those unreliable signals.

Eric


Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-16 Thread Eric W. Biederman
Oleg Nesterov  writes:

> On 10/15, Enke Chen wrote:
>>
>> > I don't understand why we need valid_predump_signal() at all.
>>
>> Most of the signals have well-defined semantics, and would not be appropriate
>> for this purpose.
>
> you are going to change the rules anyway.

I will just add that CLD_XXX is only valid with SIGCHLD as they are
signal specific si_codes.  In conjunction with another signal like
SIGUSR it will have another meaning.  I would really appreciate it
if new code does not further complicate siginfo_layout.

>> That is why it is limited to only SIGCHLD, SIGUSR1, SIGUSR2.
>
> Which do not queue. So the parent won't get the 2nd signal if 2 children
> crash at the same time.

We do best effort queueing but we don't guarantee anything.  So yes
this makes signals a very louzy interface for sending this kind of
information.

>> >>   if (sig_kernel_coredump(signr)) {
>> >> + /*
>> >> +  * Notify the parent prior to the coredump if the
>> >> +  * parent is interested in such a notificaiton.
>> >> +  */
>> >> + int p_sig = current->real_parent->predump_signal;
>> >> +
>> >> + if (valid_predump_signal(p_sig)) {
>> >> + read_lock(_lock);
>> >> + do_notify_parent_predump(current);
>> >> + read_unlock(_lock);
>> >> + cond_resched();
>> >
>> > perhaps this should be called by do_coredump() after coredump_wait() kills
>> > all the sub-threads?
>>
>> proc_coredump_connector(current) is located here, they should stay together.
>
> Why?
>
> Once again, other threads are still alive. So if the parent restarts the 
> service
> after it recieves -predump_signal, the new process can "race" with the old 
> thread.

Yes.  It isn't until do_coredump calls coredump_wait that all of the
threads are killed.

Eric



Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Eric W. Biederman
Enke Chen  writes:

> For simplicity and consistency, this patch provides an implementation
> for signal-based fault notification prior to the coredump of a child
> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
> be used by an application to express its interest and to specify the
> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>
> Background:
>
> As the coredump of a process may take time, in certain time-sensitive
> applications it is necessary for a parent process (e.g., a process
> manager) to be notified of a child's imminent death before the coredump
> so that the parent process can act sooner, such as re-spawning an
> application process, or initiating a control-plane fail-over.

You talk about time senstive and then you talk about bash scripts.
I don't think your definition of time-sensitive and my definition match.

With that said I think the best solution would be to figure out how to
allow the coredump to run in parallel with the usual exit signal, and
exit code reaping of the process.

That would solve the problem for everyone, and would not introduce any
new complicated APIs.

Short of that having the prctl in the process that receives the signals
they you are doing is the right way to go.

You are however calling do_notify_parent_predump from the wrong
function, and frankly with the wrong locking.  There are multiple paths
to the do_coredump function so you really want this notification from
do_coredump.

But I still think it would be better to solve the root cause problem and
change the coredump logic to be able to run in parallel with the normal
exit notification and zombie reaping logic.  Then the problem you are
trying to solve goes away and everyone's code gets simpler.

Eric

> Currently there are two ways for a parent process to be notified of a
> child process's state change. One is to use the POSIX signal, and
> another is to use the kernel connector module. The specific events and
> actions are summarized as follows:
>
> Process EventPOSIX SignalConnector-based
> --
> ptrace_attach()  do_notify_parent_cldstop()  proc_ptrace_connector()
>  SIGCHLD / CLD_STOPPED
>
> ptrace_detach()  do_notify_parent_cldstop()  proc_ptrace_connector()
>  SIGCHLD / CLD_CONTINUED
>
> pre_coredump/N/A proc_coredump_connector()
> get_signal()
>
> post_coredump/   do_notify_parent()  proc_exit_connector()
> do_exit()SIGCHLD / exit_signal
> --
>
> As shown in the table, the signal-based pre-coredump notification is not
> currently available. In some cases using a connector-based notification
> can be quite complicated (e.g., when a process manager is written in shell
> scripts and thus is subject to certain inherent limitations), and a
> signal-based notification would be simpler and better suited.
>
> Signed-off-by: Enke Chen 
> ---
>  arch/x86/kernel/signal_compat.c|  2 +-
>  include/linux/sched.h  |  4 ++
>  include/linux/signal.h |  5 +++
>  include/uapi/asm-generic/siginfo.h |  3 +-
>  include/uapi/linux/prctl.h |  4 ++
>  kernel/fork.c  |  1 +
>  kernel/signal.c| 51 +
>  kernel/sys.c   | 77 
> ++
>  8 files changed, 145 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
> index 9ccbf05..a3deba8 100644
> --- a/arch/x86/kernel/signal_compat.c
> +++ b/arch/x86/kernel/signal_compat.c
> @@ -30,7 +30,7 @@ static inline void signal_compat_build_tests(void)
>   BUILD_BUG_ON(NSIGSEGV != 7);
>   BUILD_BUG_ON(NSIGBUS  != 5);
>   BUILD_BUG_ON(NSIGTRAP != 5);
> - BUILD_BUG_ON(NSIGCHLD != 6);
> + BUILD_BUG_ON(NSIGCHLD != 7);
>   BUILD_BUG_ON(NSIGSYS  != 1);
>  
>   /* This is part of the ABI and can never change in size: */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 09026ea..cfb9645 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -696,6 +696,10 @@ struct task_struct {
>   int exit_signal;
>   /* The signal sent when the parent dies: */
>   int pdeath_signal;
> +
> + /* The signal sent prior to a child's coredump: */
> + int predump_signal;
> +
>   /* JOBCTL_*, siglock protected: */
>   unsigned long   jobctl;
>  
> diff --git a/include/linux/signal.h b/include/linux/signal.h
> index 706a499..7cb976d 100644
> --- a/include/linux/signal.h
> +++ b/include/linux/signal.h
> @@ -256,6 +256,11 @@ static inline int valid_signal(unsigned long sig)
>   return sig 

Re: [RFC] Allow user namespace inside chroot

2018-10-15 Thread Eric W. Biederman
Jann Horn  writes:

> On Mon, Oct 15, 2018 at 7:10 PM  wrote:
>> @@ -1281,7 +1285,7 @@ static int userns_install(struct nsproxy *nsproxy, 
>> struct ns_common *ns)
>> return -ENOMEM;
>>
>> put_user_ns(cred->user_ns);
>> -   set_cred_user_ns(cred, get_user_ns(user_ns));
>> +   set_cred_user_ns(cred, get_user_ns(user_ns), 0);
>
> This looks bogus. With this, I think your restriction can be bypassed
> if process A forks a child B, B creates a new user namespace, then A
> enters the user namespace with setns() and has full capabilities. Am I
> missing something?

Nope.  I feel silly for missing that angle.

Even without the full capabilities the userns_install angle will place
you at the root of the mount namespace outside of the chroot.

At which point I have visions of the special cases multiplying like
bunnies make this work.  Without a very strong case I don't like this at all.

Eric





Re: [RFC] Allow user namespace inside chroot

2018-10-15 Thread Eric W. Biederman


Have you considered using pivot_root to drop all of the pieces of the
filesystem you don't want to be visible?  That should be a much better
solution overall.

It is must a matter of:
mount --bind /path/you/would/chroot/to
pivot_root /path/you/would/chroot/to /put/old
umount -l /put/old

You might need to do something like make --rprivate before calling
pivot_root to stop mount propagation to the parent.  But I can't
image it to be a practical problem.


Also note that being in a chroot tends to indicate one of two things,
being in an old build system, or being in some kind of chroot jail.
Because of the jails created with chroot we want to be very careful
with enabling user namespaces in that context.

There have been some very clever people figuring out how to get out of
chroot jails by passing file descriptors between processes and using
things like pivot root.

Even if your analysis is semantically perfect there is the issue of
increasing the attack surface of preexising chroot jails.  I believe
that would make the kernel more vulnerable overall, and for only
a very small simplification of implementation details.

So unless I am missing something I don't see the use case for this that
would not be better served by just properly setting up your mount
namespace, and the attack surface increase of chroot jails makes we
very relucatant to see a change like this.

Eric

nagarathnam.muthus...@oracle.com writes:

> From: Nagarathnam Muthusamy 
>
> Following commit disables the creation of user namespace inside
> the chroot environment.
>
> userns: Don't allow creation if the user is chrooted
>
> commit 3151527ee007b73a0ebd296010f1c0454a919c7d
>
> Consider a system in which a non-root user creates a combination
> of user, pid and mount namespaces and confines a process to it.
> The system will have multiple levels of nested namespaces.
> The root namespace in the system will have lots of directories
> which should not be exposed to the child confined to the set of
> namespaces.
>
> Without chroot, we will have to hide all unwanted directories
> individually using bind mounts and mount namespace. Chroot enables
> us to expose a handpicked list of directories which the child
> can see but if we use chroot we wont be able to create nested
> namespaces.
>
> Allowing a process to create user namespace within a chroot
> environment will enable it to chroot, which in turn can be used
> to escape the jail.
>
> This patch drops the chroot privilege when user namespace is
> created within the chroot environment so the process cannot
> use it to escape the chroot jail. The process can still modify
> the view of the file system using mount namespace but for those
> modifications to be useful, it needs to run a setuid program with
> that intented uid directly mapped into the user namespace as it is
> which is not possible for an unprivileged process.
>
> If there were any other corner cases which were considered while
> deciding to disable the creation of user namespace as a whole
> within the chroot environment please let me know.
>
> Signed-off-by: Nagarathnam Muthusamy
> ---
>  kernel/user_namespace.c | 22 +-
>  1 file changed, 13 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index e5222b5..83d2a70 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -44,7 +44,7 @@ static void dec_user_namespaces(struct ucounts *ucounts)
>   return dec_ucount(ucounts, UCOUNT_USER_NAMESPACES);
>  }
>  
> -static void set_cred_user_ns(struct cred *cred, struct user_namespace 
> *user_ns)
> +static void set_cred_user_ns(struct cred *cred, struct user_namespace 
> *user_ns, int is_chrooted)
>  {
>   /* Start with the same capabilities as init but useless for doing
>* anything as the capabilities are bound to the new user namespace.
> @@ -55,6 +55,11 @@ static void set_cred_user_ns(struct cred *cred, struct 
> user_namespace *user_ns)
>   cred->cap_effective = CAP_FULL_SET;
>   cred->cap_ambient = CAP_EMPTY_SET;
>   cred->cap_bset = CAP_FULL_SET;
> + if (is_chrooted) {
> + cap_lower(cred->cap_permitted, CAP_SYS_CHROOT);
> + cap_lower(cred->cap_effective, CAP_SYS_CHROOT);
> + cap_lower(cred->cap_bset, CAP_SYS_CHROOT);
> + }
>  #ifdef CONFIG_KEYS
>   key_put(cred->request_key_auth);
>   cred->request_key_auth = NULL;
> @@ -78,6 +83,7 @@ int create_user_ns(struct cred *new)
>   kgid_t group = new->egid;
>   struct ucounts *ucounts;
>   int ret, i;
> + int is_chrooted = 0;
>  
>   ret = -ENOSPC;
>   if (parent_ns->level > 32)
> @@ -88,14 +94,12 @@ int create_user_ns(struct cred *new)
>   goto fail;
>  
>   /*
> -  * Verify that we can not violate the policy of which files
> -  * may be accessed that is specified by the root directory,
> -  * by verifing that the root directory is at the root of the
> -  * mount 

Re: linux-next: manual merge of the userns tree with the tip tree

2018-10-14 Thread Eric W. Biederman
Stephen Rothwell  writes:

> Hi all,
>
> On Mon, 15 Oct 2018 15:11:59 +1100 Stephen Rothwell  
> wrote:
>>
>> Today's linux-next merge of the userns tree got a conflict in:
>> 
>>   arch/x86/mm/fault.c
>> 
>> between commit:
>> 
>>   164477c2331b ("x86/mm: Clarify hardware vs. software "error_code"")
>> (and others from that series)
>> 
>> from the tip tree and commits:
>> 
>>   768fd9c69bb5 ("signal/x86: Remove pkey parameter from 
>> bad_area_nosemaphore")
>>   25c102d803ea ("signal/x86: Remove pkey parameter from mm_fault_error")
>> 
>> from the userns tree.
>> 
>> I fixed it up (I think - see below) and can carry the fix as
>> necessary. This is now fixed as far as linux-next is concerned, but any
>> non trivial conflicts should be mentioned to your upstream maintainer
>> when your tree is submitted for merging.  You may also want to consider
>> cooperating with the maintainer of the conflicting tree to minimise any
>> particularly complex conflicts.
>> 
>> -- 
>> Cheers,
>> Stephen Rothwell
>> 
>> diff --cc arch/x86/mm/fault.c
>> index c2e3e5127ebc,8d77700a7883..
>> --- a/arch/x86/mm/fault.c
>> +++ b/arch/x86/mm/fault.c
>>  +/* Handle faults in the user portion of the address space */
>>  +static inline
>>  +void do_user_addr_fault(struct pt_regs *regs,
>>  +   unsigned long hw_error_code,
>>  +   unsigned long address)
>>  +{
>>  +   unsigned long sw_error_code;
>>  +   struct vm_area_struct *vma;
>>  +   struct task_struct *tsk;
>>  +   struct mm_struct *mm;
>>  +   vm_fault_t fault, major = 0;
>>  +   unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
>>  +   u32 pkey;
>
> I missed removing the above line.

Yes.  At first glance with the above change it looks like you got it.

Eric



Re: [tip:x86/mm] x86/mm: Break out user address space handling

2018-10-14 Thread Eric W. Biederman
tip-bot for Dave Hansen  writes:

> Commit-ID:  aa37c51b9421d66f7931c5fdcb9ce80c450974be
> Gitweb: 
> https://git.kernel.org/tip/aa37c51b9421d66f7931c5fdcb9ce80c450974be
> Author: Dave Hansen 
> AuthorDate: Fri, 28 Sep 2018 09:02:23 -0700
> Committer:  Peter Zijlstra 
> CommitDate: Tue, 9 Oct 2018 16:51:15 +0200
>
> x86/mm: Break out user address space handling
>
> The last patch broke out kernel address space handing into its own
> helper.  Now, do the same for user address space handling.
>
> Cc: x...@kernel.org
> Cc: Jann Horn 
> Cc: Sean Christopherson 
> Cc: Thomas Gleixner 
> Cc: Andy Lutomirski 
> Signed-off-by: Dave Hansen 
> Signed-off-by: Peter Zijlstra (Intel) 
> Link: http://lkml.kernel.org/r/20180928160223.9c4f6...@viggo.jf.intel.com
> ---
>  arch/x86/mm/fault.c | 47 ---
>  1 file changed, 28 insertions(+), 19 deletions(-)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index c7e32f453852..0d1f5d39fc63 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -966,6 +966,7 @@ bad_area_access_error(struct pt_regs *regs, unsigned long 
> error_code,
>   __bad_area(regs, error_code, address, vma, SEGV_ACCERR);
>  }
>  
> +/* Handle faults in the kernel portion of the address space */
   ^^
I believe you mean the __user__ portion of the address space.
Given that the call chain is:

do_user_addr_fault
   handle_mm_fault
  do_sigbus  

>  static void
>  do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long 
> address,
> u32 *pkey, unsigned int fault)
> @@ -1254,14 +1255,11 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned 
> long hw_error_code,
>  }
>  NOKPROBE_SYMBOL(do_kern_addr_fault);
>  
> -/*
> - * This routine handles page faults.  It determines the address,
> - * and the problem, and then passes it off to one of the appropriate
> - * routines.
> - */
> -static noinline void
> -__do_page_fault(struct pt_regs *regs, unsigned long hw_error_code,
> - unsigned long address)
> +/* Handle faults in the user portion of the address space */
> +static inline
> +void do_user_addr_fault(struct pt_regs *regs,
> + unsigned long hw_error_code,
> + unsigned long address)
>  {
>   unsigned long sw_error_code;
>   struct vm_area_struct *vma;
> @@ -1274,17 +1272,6 @@ __do_page_fault(struct pt_regs *regs, unsigned long 
> hw_error_code,
>   tsk = current;
>   mm = tsk->mm;
>  
> - prefetchw(>mmap_sem);
> -
> - if (unlikely(kmmio_fault(regs, address)))
> - return;
> -
> - /* Was the fault on kernel-controlled part of the address space? */
> - if (unlikely(fault_in_kernel_space(address))) {
> - do_kern_addr_fault(regs, hw_error_code, address);
> - return;
> - }
> -
>   /* kprobes don't want to hook the spurious faults: */
>   if (unlikely(kprobes_fault(regs)))
>   return;
> @@ -1488,6 +1475,28 @@ good_area:
>  
>   check_v8086_mode(regs, address, tsk);
>  }
> +NOKPROBE_SYMBOL(do_user_addr_fault);
> +
> +/*
> + * This routine handles page faults.  It determines the address,
> + * and the problem, and then passes it off to one of the appropriate
> + * routines.
> + */
> +static noinline void
> +__do_page_fault(struct pt_regs *regs, unsigned long hw_error_code,
> + unsigned long address)
> +{
> + prefetchw(>mm->mmap_sem);
> +
> + if (unlikely(kmmio_fault(regs, address)))
> + return;
> +
> + /* Was the fault on kernel-controlled part of the address space? */
> + if (unlikely(fault_in_kernel_space(address)))
> + do_kern_addr_fault(regs, hw_error_code, address);
> + else
> + do_user_addr_fault(regs, hw_error_code, address);
> +}
>  NOKPROBE_SYMBOL(__do_page_fault);
>  
>  static nokprobe_inline void

Eric


Re: [LKP] 601d5abfea [ 13.686356] BUG: unable to handle kernel paging request at 34ca027e

2018-10-12 Thread Eric W. Biederman
kernel test robot  writes:

> Greetings,
>
> 0day kernel testing robot got the below dmesg and the first bad commit is
>
> https://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git 
> siginfo-next
>

This is an odd one.  This is the first time I recall the lkp test robot
emailing me and telling me that I have fixed an issue.  I don't mind I
am just a little surprised.

Recently it was discovered that on my branch passing a large negative signal
number to rt_sigqueueinfo could cause an oops.  The underlying issues
were fixed by:

b2a2ab527d6d ("signal: Guard against negative signal numbers in 
copy_siginfo_from_user")
a36700589b85 ("signal: Guard against negative signal numbers in 
copy_siginfo_from_user32")

It appears that this issue was found in linux-next before these fixes
were applied.  And then the top of my siginfo-next branch where these
fixes exist was tested.

I have not problem with that sequence of events it is just a little
surprising.

If I have not read this test report correctly please let me know.

Eric

> commit 601d5abfeaf244b86bb68c1e05c6e0d57be2f6b0
> Author: Eric W. Biederman 
> AuthorDate: Fri Oct 5 09:02:48 2018 +0200
> Commit: Eric W. Biederman 
> CommitDate: Mon Oct 8 09:35:26 2018 +0200
>
> signal: In sigqueueinfo prefer sig not si_signo
> 
> Andrei Vagin  reported:
> 
> > Accoding to the man page, the user should not set si_signo, it has to 
> be set
> > by kernel.
> >
> > $ man 2 rt_sigqueueinfo
> >
> > The uinfo argument specifies the data to accompany  the  signal.   
> This
> >argument  is  a  pointer to a structure of type siginfo_t, 
> described in
> >sigaction(2) (and defined  by  including  ).   The  
> caller
> >should set the following fields in this structure:
> >
> >si_code
> >   This  must  be  one of the SI_* codes in the Linux kernel 
> source
> >   file include/asm-generic/siginfo.h, with  the  
> restriction  that
> >   the  code  must  be  negative (i.e., cannot be SI_USER, 
> which is
> >   used by the kernel to indicate a signal  sent  by  
> kill(2))  and
> >   cannot  (since  Linux  2.6.39) be SI_TKILL (which is used 
> by the
> >   kernel to indicate a signal sent using tgkill(2)).
> >
> >si_pid This should be set to a process ID, typically the process 
> ID  of
> >   the sender.
> >
> >si_uid This  should  be set to a user ID, typically the real 
> user ID of
> >   the sender.
> >
> >si_value
> >   This field contains the user data to accompany the 
> signal.   For
> >   more information, see the description of the last (union 
> sigval)
> >   argument of sigqueue(3).
>     >
> >Internally, the kernel sets the si_signo field to the  value  
> specified
> >in  sig,  so that the receiver of the signal can also obtain the 
> signal
> >number via that field.
> >
> > On Tue, Sep 25, 2018 at 07:19:02PM +0200, Eric W. Biederman wrote:
> >>
> >> If there is some application that calls sigqueueinfo directly that has
> >> a problem with this added sanity check we can revisit this when we see
> >> what kind of crazy that application is doing.
> >
> >
> > I already know two "applications" ;)
> >
> > 
> https://github.com/torvalds/linux/blob/master/tools/testing/selftests/ptrace/peeksiginfo.c
> > 
> https://github.com/checkpoint-restore/criu/blob/master/test/zdtm/static/sigpending.c
> >
> > Disclaimer: I'm the author of both of them.
> 
> Looking at the kernel code the historical behavior has alwasy been to 
> prefer
> the signal number passed in by the kernel.
> 
> So sigh.  Implmenet __copy_siginfo_from_user and 
> __copy_siginfo_from_user32 to
> take that signal number and prefer it.  The user of ptrace will still
> use copy_siginfo_from_user and copy_siginfo_from_user32 as they do not and
> never have had a signal number there.
> 
> Luckily this change has never made it farther than linux-next.
> 
> Fixes: e75dc036c445 ("signal: Fail sigqueueinfo if si_signo != sig")
> Reported-by: Andrei Vagin 
> Tested-by: Andrei Vagin 
> Signed-off-by: "Eric W. Biederm

Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

2018-10-11 Thread Eric W. Biederman
David Howells  writes:

> Okay, this appears to fix the cycle-creation problem.
>
> It could probably be improved by comparing sequence numbers as Alan suggests,
> but I need to work out how to get at that.

It should just be a matter of replacing the test
"if (p->mnt.mnt_sb->s_type == )" with "if mnt_ns_loop(p->mnt.mnt_root)"

That would allow reusing 100% of the existing logic, and remove the need
to export file_system_type nsfs;

As your test exists below it will reject a lot more than mount namespace
file descriptors.  It will reject file descriptors for every other
namespace as well.

Eric

> ---
> commit 069c3376f7849044117c866aeafbb1a525f84926
> Author: David Howells 
> Date:   Thu Oct 4 23:18:59 2018 +0100
>
> fixes
>
> diff --git a/fs/internal.h b/fs/internal.h
> index 17029b30e196..47a6c80c3c51 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -172,6 +172,7 @@ extern void mnt_pin_kill(struct mount *m);
>   * fs/nsfs.c
>   */
>  extern const struct dentry_operations ns_dentry_operations;
> +extern struct file_system_type nsfs;
>  
>  /*
>   * fs/ioctl.c
> diff --git a/fs/namespace.c b/fs/namespace.c
> index e969ded7d54b..25ecd8b3c76b 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2388,6 +2388,27 @@ static inline int tree_contains_unbindable(struct 
> mount *mnt)
>   return 0;
>  }
>  
> +/*
> + * Object if there are any nsfs mounts in the specified subtree.  These can 
> act
> + * as pins for mount namespaces that aren't checked by the mount-cycle 
> checking
> + * code, thereby allowing cycles to be made.
> + */
> +static bool check_for_nsfs_mounts(struct mount *subtree)
> +{
> + struct mount *p;
> + bool ret = false;
> +
> + lock_mount_hash();
> + for (p = subtree; p; p = next_mnt(p, subtree))
> + if (p->mnt.mnt_sb->s_type == )
> + goto out;
> +
> + ret = true;
> +out:
> + unlock_mount_hash();
> + return ret;
> +}
> +
>  static int do_move_mount(struct path *old_path, struct path *new_path)
>  {
>   struct path parent_path = {.mnt = NULL, .dentry = NULL};
> @@ -2442,6 +2463,8 @@ static int do_move_mount(struct path *old_path, struct 
> path *new_path)
>   if (IS_MNT_SHARED(p) && tree_contains_unbindable(old))
>   goto out1;
>   err = -ELOOP;
> + if (!check_for_nsfs_mounts(old))
> + goto out1;
>   for (; mnt_has_parent(p); p = p->mnt_parent)
>   if (p == old)
>   goto out1;
> diff --git a/fs/nsfs.c b/fs/nsfs.c
> index f069eb6495b0..d3abcd5c2a23 100644
> --- a/fs/nsfs.c
> +++ b/fs/nsfs.c
> @@ -269,7 +269,7 @@ static struct dentry *nsfs_mount(struct file_system_type 
> *fs_type,
>   return mount_pseudo(fs_type, "nsfs:", _ops,
>   _dentry_operations, NSFS_MAGIC);
>  }
> -static struct file_system_type nsfs = {
> +struct file_system_type nsfs = {
>   .name = "nsfs",
>   .mount = nsfs_mount,
>   .kill_sb = kill_anon_super,


Re: [LKP] 4ce5f9c9e7 [ 1.323881] WARNING: CPU: 0 PID: 1 at mm/slab_common.c:1031 kmalloc_slab

2018-10-10 Thread Eric W. Biederman
ebied...@xmission.com (Eric W. Biederman) writes:

> Sean Christopherson  writes:
>
>> On Wed, Oct 10, 2018 at 05:06:52PM -0500, Eric W. Biederman wrote:
>>> ebied...@xmission.com (Eric W. Biederman) writes:
>>> 
>>> > So I am flummoxed.  I am reading through the code and I don't see
>>> > anything that could trigger this, and when I ran the supplied reproducer
>>> > it did not reproduce for me.
>>> 
>>> Even more so.  With my tool chain the line that reports the failing
>>> address is impossible.
>>> 
>>> [   73.034423] RIP: 0010:copy_siginfo_from_user+0x4d/0xd0
>>> 
>>> With the supplied configureation my tool chain only has 0x30 bytes for
>>> all of copy_siginfo_from_user.  So I can't even begin to guess where
>>> in that function things are failing.
>>> 
>>> Any additional information that you can provide would be a real help
>>> in tracking down this strange failure.
>>
>> I don't have the exact toolchain, but I was able to get somewhat close
>> and may have found a smoking gun.  0x4d in my build is in the general
>> vicinity of "sig_sicodes[sig].limit" in known_siginfo_layout().  This
>> lines up with the register state from the log, e.g. RDI=0500104d8,
>> which is the mask generated by sig_specific_sicodes.  From what I can
>> tell, @sig is never bounds checked.  If the compiler generated an AND
>> instruction to compare against sig_specific_sicodes then that could
>> resolve true with any arbitrary value that happened to collide with
>> sig_specific_sicodes and result in an out-of-bounds access to
>> @sig_sicodes.  siginfo_layout() for example explicitly checks @sig
>> before indexing @sig_sicode, e.g. "sig < ARRAY_SIZE(sig_sicodes)".
>>
>> Maybe this?
>
> But sig is bounds checked.  Even better sig is checked to see if it
> is one of the values in the array.
>
>> From include/linux/signal.h
>
> #define SIG_SPECIFIC_SICODES_MASK (\
>   rt_sigmask(SIGILL)|  rt_sigmask(SIGFPE)| \
>   rt_sigmask(SIGSEGV)   |  rt_sigmask(SIGBUS)| \
>   rt_sigmask(SIGTRAP)   |  rt_sigmask(SIGCHLD)   | \
>   rt_sigmask(SIGPOLL)   |  rt_sigmask(SIGSYS)| \
>   SIGEMT_MASK)
>
> #define siginmask(sig, mask) \
>   ((sig) < SIGRTMIN && (rt_sigmask(sig) & (mask)))
>
> #define sig_specific_sicodes(sig) siginmask(sig, 
> SIG_SPECIFIC_SICODES_MASK)
>
>
>
> Hmm.  I wonder if something is passing in a negative signal number.
> There is not a bounds check for that.  A sufficiently large signal
> number might be the problem here.  Yes.  I can get an oops with
> a sufficiently large negative signal number.
>
> The code will later call valid_signal in check_permissions and
> that will cause the system call to fail, so the issue is just that
> the signal number is not being validated early enough.
>
> On the output path (copy_siginfo_to_user and copy_siginfo_to_user32) the
> signal number should be validated before it ever reaches userspace
> which is why I expect trinity never triggered anything.
>
> There is copy_siginfo_from_user32 and that does call siginfo_layout with
> a possibly negative signal number.  Which has the same potential issues.
>
> So I am going to go with the fix below.  That fixes things in my testing
> and by being unsigned should fix keep negative numbers from being a
> problem.

Sean thank you very much for putting me on the right path to track this
failing test down.

Eric


Re: [LKP] 4ce5f9c9e7 [ 1.323881] WARNING: CPU: 0 PID: 1 at mm/slab_common.c:1031 kmalloc_slab

2018-10-10 Thread Eric W. Biederman
Sean Christopherson  writes:

> On Wed, Oct 10, 2018 at 05:06:52PM -0500, Eric W. Biederman wrote:
>> ebied...@xmission.com (Eric W. Biederman) writes:
>> 
>> > So I am flummoxed.  I am reading through the code and I don't see
>> > anything that could trigger this, and when I ran the supplied reproducer
>> > it did not reproduce for me.
>> 
>> Even more so.  With my tool chain the line that reports the failing
>> address is impossible.
>> 
>> [   73.034423] RIP: 0010:copy_siginfo_from_user+0x4d/0xd0
>> 
>> With the supplied configureation my tool chain only has 0x30 bytes for
>> all of copy_siginfo_from_user.  So I can't even begin to guess where
>> in that function things are failing.
>> 
>> Any additional information that you can provide would be a real help
>> in tracking down this strange failure.
>
> I don't have the exact toolchain, but I was able to get somewhat close
> and may have found a smoking gun.  0x4d in my build is in the general
> vicinity of "sig_sicodes[sig].limit" in known_siginfo_layout().  This
> lines up with the register state from the log, e.g. RDI=0500104d8,
> which is the mask generated by sig_specific_sicodes.  From what I can
> tell, @sig is never bounds checked.  If the compiler generated an AND
> instruction to compare against sig_specific_sicodes then that could
> resolve true with any arbitrary value that happened to collide with
> sig_specific_sicodes and result in an out-of-bounds access to
> @sig_sicodes.  siginfo_layout() for example explicitly checks @sig
> before indexing @sig_sicode, e.g. "sig < ARRAY_SIZE(sig_sicodes)".
>
> Maybe this?

But sig is bounds checked.  Even better sig is checked to see if it
is one of the values in the array.

>From include/linux/signal.h

#define SIG_SPECIFIC_SICODES_MASK (\
rt_sigmask(SIGILL)|  rt_sigmask(SIGFPE)| \
rt_sigmask(SIGSEGV)   |  rt_sigmask(SIGBUS)| \
rt_sigmask(SIGTRAP)   |  rt_sigmask(SIGCHLD)   | \
rt_sigmask(SIGPOLL)   |  rt_sigmask(SIGSYS)| \
SIGEMT_MASK)

#define siginmask(sig, mask) \
((sig) < SIGRTMIN && (rt_sigmask(sig) & (mask)))

#define sig_specific_sicodes(sig)   siginmask(sig, 
SIG_SPECIFIC_SICODES_MASK)



Hmm.  I wonder if something is passing in a negative signal number.
There is not a bounds check for that.  A sufficiently large signal
number might be the problem here.  Yes.  I can get an oops with
a sufficiently large negative signal number.

The code will later call valid_signal in check_permissions and
that will cause the system call to fail, so the issue is just that
the signal number is not being validated early enough.

On the output path (copy_siginfo_to_user and copy_siginfo_to_user32) the
signal number should be validated before it ever reaches userspace
which is why I expect trinity never triggered anything.

There is copy_siginfo_from_user32 and that does call siginfo_layout with
a possibly negative signal number.  Which has the same potential issues.

So I am going to go with the fix below.  That fixes things in my testing
and by being unsigned should fix keep negative numbers from being a
problem.

diff --git a/kernel/signal.c b/kernel/signal.c
index 2bffc5a50183..4fd431ce4f91 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2860,7 +2860,7 @@ static const struct {
[SIGSYS]  = { NSIGSYS,  SIL_SYS },
 };
 
-static bool known_siginfo_layout(int sig, int si_code)
+static bool known_siginfo_layout(unsigned sig, int si_code)
 {
if (si_code == SI_KERNEL)
return true;
@@ -2879,7 +2879,7 @@ static bool known_siginfo_layout(int sig, int si_code)
return false;
 }
 
-enum siginfo_layout siginfo_layout(int sig, int si_code)
+enum siginfo_layout siginfo_layout(unsigned sig, int si_code)
 {
enum siginfo_layout layout = SIL_KILL;
if ((si_code > SI_USER) && (si_code < SI_KERNEL)) {


Re: [Ksummit-discuss] [PATCH v2 0/3] code of conduct fixes

2018-10-10 Thread Eric W. Biederman
James Bottomley  writes:

> Resend to show accumulated tags and also to add a third patch listing
> the TAB as the reporting point as a few people seem to want.  If it
> gets the same level of support, I'll send it in with the other two.


There is also:

> Our Responsibilities
> 
> 
> Maintainers are responsible for clarifying the standards of acceptable 
> behavior
> and are expected to take appropriate and fair corrective action in response to
> any instances of unacceptable behavior.
> 
> Maintainers have the right and responsibility to remove, edit, or reject
> comments, commits, code, wiki edits, issues, and other contributions that are
> not aligned to this Code of Conduct, or to ban temporarily or permanently any
> contributor for other behaviors that they deem inappropriate, threatening,
> offensive, or harmful.

Which is very problematic.
a) In append only logs like git we can not edit history.
   Making it a mainters responsibility to edit the history, to do the
   impossible is a problem.

b) There are no responsibilities of for people who are not Maintainers.
   That is another problem.

c) The entire tone of the reponsibilities section is out of line with a
   community where there are no enforcement powers only the power to
   accept or not accept a patch.  Only the power to persuade not to
   enforce.

Overall in the discussions I have heard people talking about persuading,
educating, and not feeding trolls.   Nowhere have I heard people talking
about policing the community which I understand that responsiblity
section to be talking about.

Increasingly I am getting the feeling that this document does not the
linux development community.  Perhaps a revert and trying to come up
with better language from scratch would be better.

I don't know how to rephrase that reponsibility section but if we don't
go with the revert something looks like it need sot be done there.

Eric










Re: [Ksummit-discuss] [PATCH v2 1/3] code-of-conduct: Fix the ambiguity about collecting email addresses

2018-10-10 Thread Eric W. Biederman
James Bottomley  writes:

> The current code of conduct has an ambiguity in the it considers publishing
> private information such as email addresses unacceptable behaviour.  Since
> the Linux kernel collects and publishes email addresses as part of the patch
> process, add an exception clause for email addresses ordinarily collected by
> the project to correct this ambiguity.

Acked-by: "Eric W. Biederman" 

>
> Fixes: 8a104f8b5867c682 ("Code of Conduct: Let's revamp it.")
> Acked-by: Geert Uytterhoeven 
> Acked-by: Shuah Khan 
> Acked-by: Guenter Roeck 
> Reviewed-by: Alan Cox 
> Reviewed-by: Mauro Carvalho Chehab 
> Signed-off-by: James Bottomley 
> ---
>  Documentation/process/code-of-conduct.rst | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/Documentation/process/code-of-conduct.rst 
> b/Documentation/process/code-of-conduct.rst
> index ab7c24b5478c..aa40e34e7785 100644
> --- a/Documentation/process/code-of-conduct.rst
> +++ b/Documentation/process/code-of-conduct.rst
> @@ -31,7 +31,7 @@ Examples of unacceptable behavior by participants include:
>  * Trolling, insulting/derogatory comments, and personal or political attacks
>  * Public or private harassment
>  * Publishing others’ private information, such as a physical or electronic
> -  address, without explicit permission
> +  address not ordinarily collected by the project, without explicit 
> permission
>  * Other conduct which could reasonably be considered inappropriate in a
>professional setting


Re: [LKP] 4ce5f9c9e7 [ 1.323881] WARNING: CPU: 0 PID: 1 at mm/slab_common.c:1031 kmalloc_slab

2018-10-10 Thread Eric W. Biederman
ebied...@xmission.com (Eric W. Biederman) writes:

> So I am flummoxed.  I am reading through the code and I don't see
> anything that could trigger this, and when I ran the supplied reproducer
> it did not reproduce for me.

Even more so.  With my tool chain the line that reports the failing
address is impossible.

[   73.034423] RIP: 0010:copy_siginfo_from_user+0x4d/0xd0

With the supplied configureation my tool chain only has 0x30 bytes for
all of copy_siginfo_from_user.  So I can't even begin to guess where
in that function things are failing.

Any additional information that you can provide would be a real help
in tracking down this strange failure.

Thank you,
Eric Biederman



Re: [LKP] 4ce5f9c9e7 [ 1.323881] WARNING: CPU: 0 PID: 1 at mm/slab_common.c:1031 kmalloc_slab

2018-10-10 Thread Eric W. Biederman


So I am flummoxed.  I am reading through the code and I don't see
anything that could trigger this, and when I ran the supplied reproducer
it did not reproduce for me.

Plus there is the noise from the kmalloc_slab test that is goofing up
the subject line.

Is there any chance I can get a disassembly of the
copy_siginfo_from_user or post_copy_siginfo_from_user from your build?
I don't have the same tool chain.

Right now I am strongly suspecting that there is a memory stomp
somewhere and the earlier tests just happen on something that is the
pinpointed commit to misbehave.

Either that or it is simply that I don't have the latest and greatest
smep/smap hardware and there is an off by one I am not seeing.

I don't doubt that this test is finding something I haven't figured out
how to see what it is finding, and when I exercise the same code path
with my own tests everything appears to work.

Eric

kernel test robot  writes:

> Greetings,
>
> 0day kernel testing robot got the below dmesg and the first bad commit is
>
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master
>
> commit 4ce5f9c9e7546915c559ffae594e6d73f918db00
> Author: Eric W. Biederman 
> AuthorDate: Tue Sep 25 12:59:31 2018 +0200
> Commit: Eric W. Biederman 
> CommitDate: Wed Oct 3 16:50:39 2018 +0200
>
> signal: Use a smaller struct siginfo in the kernel
> 
> We reserve 128 bytes for struct siginfo but only use about 48 bytes on
> 64bit and 32 bytes on 32bit.  Someday we might use more but it is unlikely
> to be anytime soon.
> 
> Userspace seems content with just enough bytes of siginfo to implement
> sigqueue.  Or in the case of checkpoint/restart reinjecting signals
> the kernel has sent.
> 
> Reducing the stack footprint and the work to copy siginfo around from
> 2 cachelines to 1 cachelines seems worth doing even if I don't have
> benchmarks to show a performance difference.
>     
> Suggested-by: Linus Torvalds 
> Signed-off-by: "Eric W. Biederman" 
>
> ae7795bc61  signal: Distinguish between kernel_siginfo and siginfo
> 4ce5f9c9e7  signal: Use a smaller struct siginfo in the kernel
> 570b7bdeaf  Add linux-next specific files for 20181009
> +---+++---+
> |   | ae7795bc61 | 4ce5f9c9e7 | 
> next-20181009 |
> +---+++---+
> | boot_successes| 0  | 0  | 28
> |
> | boot_failures | 1144   | 280| 8 
> |
> | WARNING:at_mm/slab_common.c:#kmalloc_slab | 1144   | 280|   
> |
> | RIP:kmalloc_slab  | 1144   | 280|   
> |
> | Mem-Info  | 1144   | 280| 8 
> |
> | BUG:unable_to_handle_kernel   | 0  | 5  | 7 
> |
> | Oops:#[##]| 0  | 7  | 8 
> |
> | RIP:copy_siginfo_from_user| 0  | 7  |   
> |
> | Kernel_panic-not_syncing:Fatal_exception  | 0  | 7  | 8 
> |
> | RIP:post_copy_siginfo_from_user   | 0  | 0  | 8 
> |
> +---+++---+
>
> [1.320405] test_overflow: ok: (s8)(0 << 7) == 0
> [1.321071] test_overflow: ok: (s16)(0 << 15) == 0
> [1.321756] test_overflow: ok: (int)(0 << 31) == 0
> [1.322442] test_overflow: ok: (s32)(0 << 31) == 0
> [1.323121] test_overflow: ok: (s64)(0 << 63) == 0
> [1.323881] WARNING: CPU: 0 PID: 1 at mm/slab_common.c:1031 
> kmalloc_slab+0x17/0x70
> [1.324113] CPU: 0 PID: 1 Comm: swapper/0 Tainted: GT 
> 4.19.0-rc1-00077-g4ce5f9c #1
> [1.324113] RIP: 0010:kmalloc_slab+0x17/0x70
> [1.324113] Code: 00 00 00 83 3d 11 78 14 03 02 55 48 89 e5 5d 0f 97 c0 c3 
> 55 48 81 ff 00 00 40 00 48 89 e5 76 0e 31 c0 81 e6 00 02 00 00 75 4b <0f> 0b 
> eb 47 48 81 ff c0 00 00 00 77 19 48 85 ff b8 10 00 00 00 74
> [1.324113] RSP: :88000fc7fd50 EFLAGS: 00010246
> [1.324113] RAX:  RBX: 006000c0 RCX: 
> 88001fb68d47
> [1.324113] RDX: 0001 RSI:  RDI: 
> 
> [1.324113] RBP: 88000fc7fd50 R08: b128ac78 R09: 
> 0001
> [1.324113] R10: 0001 R11:  R12: 
> 88001d814800
> [1.324113] R13:

[REVIEW][PATCH 7/6] signal: In sigqueueinfo prefer sig not si_signo

2018-10-05 Thread Eric W. Biederman


Andrei Vagin  reported:

> Accoding to the man page, the user should not set si_signo, it has to be set
> by kernel.
>
> $ man 2 rt_sigqueueinfo
>
> The uinfo argument specifies the data to accompany  the  signal.   This
>argument  is  a  pointer to a structure of type siginfo_t, described in
>sigaction(2) (and defined  by  including  ).   The  caller
>should set the following fields in this structure:
>
>si_code
>   This  must  be  one of the SI_* codes in the Linux kernel source
>   file include/asm-generic/siginfo.h, with  the  restriction  that
>   the  code  must  be  negative (i.e., cannot be SI_USER, which is
>   used by the kernel to indicate a signal  sent  by  kill(2))  and
>   cannot  (since  Linux  2.6.39) be SI_TKILL (which is used by the
>   kernel to indicate a signal sent using tgkill(2)).
>
>si_pid This should be set to a process ID, typically the process ID  of
>   the sender.
>
>si_uid This  should  be set to a user ID, typically the real user ID of
>   the sender.
>
>si_value
>   This field contains the user data to accompany the signal.   For
>   more information, see the description of the last (union sigval)
>   argument of sigqueue(3).
>
>Internally, the kernel sets the si_signo field to the  value  specified
>in  sig,  so that the receiver of the signal can also obtain the signal
>number via that field.
>
> On Tue, Sep 25, 2018 at 07:19:02PM +0200, Eric W. Biederman wrote:
>>
>> If there is some application that calls sigqueueinfo directly that has
>> a problem with this added sanity check we can revisit this when we see
>> what kind of crazy that application is doing.
>
>
> I already know two "applications" ;)
>
> https://github.com/torvalds/linux/blob/master/tools/testing/selftests/ptrace/peeksiginfo.c
> https://github.com/checkpoint-restore/criu/blob/master/test/zdtm/static/sigpending.c
>
> Disclaimer: I'm the author of both of them.

Looking at the kernel code the historical behavior has alwasy been to prefer
the signal number passed in by the kernel.

So sigh.  Implmenet __copy_siginfo_from_user and __copy_siginfo_from_user32 to
take that signal number and prefer it.  The user of ptrace will still
use copy_siginfo_from_user and copy_siginfo_from_user32 as they do not and
never have had a signal number there.

Luckily this change has never made it farther than linux-next.

Fixes: e75dc036c445 ("signal: Fail sigqueueinfo if si_signo != sig")
Signed-off-by: "Eric W. Biederman" 
---

Andrei can you verify this fixes your programs?
Thank you very much,
Eric


 kernel/signal.c | 141 
 1 file changed, 84 insertions(+), 57 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 1c2dd117fee0..2bffc5a50183 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2925,11 +2925,10 @@ int copy_siginfo_to_user(siginfo_t __user *to, const 
kernel_siginfo_t *from)
return 0;
 }
 
-int copy_siginfo_from_user(kernel_siginfo_t *to, const siginfo_t __user *from)
+static int post_copy_siginfo_from_user(kernel_siginfo_t *info,
+  const siginfo_t __user *from)
 {
-   if (copy_from_user(to, from, sizeof(struct kernel_siginfo)))
-   return -EFAULT;
-   if (unlikely(!known_siginfo_layout(to->si_signo, to->si_code))) {
+   if (unlikely(!known_siginfo_layout(info->si_signo, info->si_code))) {
char __user *expansion = si_expansion(from);
char buf[SI_EXPANSION_SIZE];
int i;
@@ -2949,6 +2948,22 @@ int copy_siginfo_from_user(kernel_siginfo_t *to, const 
siginfo_t __user *from)
return 0;
 }
 
+static int __copy_siginfo_from_user(int signo, kernel_siginfo_t *to,
+   const siginfo_t __user *from)
+{
+   if (copy_from_user(to, from, sizeof(struct kernel_siginfo)))
+   return -EFAULT;
+   to->si_signo = signo;
+   return post_copy_siginfo_from_user(to, from);
+}
+
+int copy_siginfo_from_user(kernel_siginfo_t *to, const siginfo_t __user *from)
+{
+   if (copy_from_user(to, from, sizeof(struct kernel_siginfo)))
+   return -EFAULT;
+   return post_copy_siginfo_from_user(to, from);
+}
+
 #ifdef CONFIG_COMPAT
 int copy_siginfo_to_user32(struct compat_siginfo __user *to,
   const struct kernel_siginfo *from)
@@ -3041,88 +3056,106 @@ int __copy_siginfo_to_user32(struct compat_siginfo 
__user *to,
return 0;
 }
 
-int copy_siginfo_from_user32(struct kernel_siginfo *to,
-const struct compat_sigin

Re: [REVIEW][PATCH 2/6] signal: Fail sigqueueinfo if si_signo != sig

2018-10-05 Thread Eric W. Biederman
Andrei Vagin  writes:

> On Tue, Sep 25, 2018 at 07:19:02PM +0200, Eric W. Biederman wrote:
>> The kernel needs to validate that the contents of struct siginfo make
>> sense as siginfo is copied into the kernel, so that the proper union
>> members can be put in the appropriate locations.  The field si_signo
>> is a fundamental part of that validation.  As such changing the
>> contents of si_signo after the validation make no sense and can result
>> in nonsense values in the kernel.
>
> Accoding to the man page, the user should not set si_signo, it has to be set
> by kernel.

I wanted to say look at copy_siginfo_from_user32 it uses si_signo and it
has always done so.  But before I got to fixing copy_siginfo_from_user32
only looked at si_code.  This is one of the areas where we deliberately
slightly changed the KABI.  To start looking an signo instead of magic
kernel internal si_code values.

So yes.  Looking at si_signo instead of the passed in signo appears to
be a regression all of the way around (except for ptrace) where that
value is not present.  So I will see if I can figure out how to refactor
the code to accomplish that.

I am travelling for the next couple of days so I won't be able to get to
this for a bit.

I am thinking I will need two version of copy_siginfo_from user now.
One for ptrace and one for rt_sgiqueueinfo.  Sigh.

> $ man 2 rt_sigqueueinfo
>
> The uinfo argument specifies the data to accompany  the  signal.   This
>argument  is  a  pointer to a structure of type siginfo_t, described in
>sigaction(2) (and defined  by  including  ).   The  caller
>should set the following fields in this structure:
>
>si_code
>   This  must  be  one of the SI_* codes in the Linux kernel source
>   file include/asm-generic/siginfo.h, with  the  restriction  that
>   the  code  must  be  negative (i.e., cannot be SI_USER, which is
>   used by the kernel to indicate a signal  sent  by  kill(2))  and
>   cannot  (since  Linux  2.6.39) be SI_TKILL (which is used by the
>   kernel to indicate a signal sent using tgkill(2)).
>
>si_pid This should be set to a process ID, typically the process ID  of
>   the sender.
>
>si_uid This  should  be set to a user ID, typically the real user ID of
>   the sender.
>
>si_value
>   This field contains the user data to accompany the signal.   For
>   more information, see the description of the last (union sigval)
>   argument of sigqueue(3).
>
>Internally, the kernel sets the si_signo field to the  value  specified
>in  sig,  so that the receiver of the signal can also obtain the signal
>number via that field.
>
>> 
>> As such simply fail if someone is silly enough to set si_signo out of
>> sync with the signal number passed to sigqueueinfo.
>> 
>> I don't expect a problem as glibc's sigqueue implementation sets
>> "si_signo = sig" and CRIU just returns to the kernel what the kernel
>> gave to it.
>> 
>> If there is some application that calls sigqueueinfo directly that has
>> a problem with this added sanity check we can revisit this when we see
>> what kind of crazy that application is doing.
>
>
> I already know two "applications" ;)
>
> https://github.com/torvalds/linux/blob/master/tools/testing/selftests/ptrace/peeksiginfo.c
> https://github.com/checkpoint-restore/criu/blob/master/test/zdtm/static/sigpending.c
>
> Disclaimer: I'm the author of both of them.
>
>> 
>> Signed-off-by: "Eric W. Biederman" 
>> ---
>>  kernel/signal.c | 6 --
>>  1 file changed, 4 insertions(+), 2 deletions(-)
>> 
>> diff --git a/kernel/signal.c b/kernel/signal.c
>> index 7b49c31d3fdb..e445b0a63faa 100644
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -3306,7 +3306,8 @@ static int do_rt_sigqueueinfo(pid_t pid, int sig, 
>> siginfo_t *info)
>>  (task_pid_vnr(current) != pid))
>>  return -EPERM;
>>  
>> -info->si_signo = sig;
>> +if (info->si_signo != sig)
>> +return -EINVAL;
>>  
>>  /* POSIX.1b doesn't mention process groups.  */
>>  return kill_proc_info(sig, info, pid);
>> @@ -3354,7 +3355,8 @@ static int do_rt_tgsigqueueinfo(pid_t tgid, pid_t pid, 
>> int sig, siginfo_t *info)
>>  (task_pid_vnr(current) != pid))
>>  return -EPERM;
>>  
>> -info->si_signo = sig;
>> +if (info->si_signo != sig)
>> +return -EINVAL;
>>  
>>  return do_send_specific(tgid, pid, sig, info);
>>  }


Re: [REVIEW][PATCH 2/6] signal: Fail sigqueueinfo if si_signo != sig

2018-10-05 Thread Eric W. Biederman
Andrei Vagin  writes:

> On Tue, Sep 25, 2018 at 07:19:02PM +0200, Eric W. Biederman wrote:
>> The kernel needs to validate that the contents of struct siginfo make
>> sense as siginfo is copied into the kernel, so that the proper union
>> members can be put in the appropriate locations.  The field si_signo
>> is a fundamental part of that validation.  As such changing the
>> contents of si_signo after the validation make no sense and can result
>> in nonsense values in the kernel.
>
> Accoding to the man page, the user should not set si_signo, it has to be set
> by kernel.
>
> $ man 2 rt_sigqueueinfo
>
> The uinfo argument specifies the data to accompany  the  signal.   This
>argument  is  a  pointer to a structure of type siginfo_t, described in
>sigaction(2) (and defined  by  including  ).   The  caller
>should set the following fields in this structure:
>
>si_code
>   This  must  be  one of the SI_* codes in the Linux kernel source
>   file include/asm-generic/siginfo.h, with  the  restriction  that
>   the  code  must  be  negative (i.e., cannot be SI_USER, which is
>   used by the kernel to indicate a signal  sent  by  kill(2))  and
>   cannot  (since  Linux  2.6.39) be SI_TKILL (which is used by the
>   kernel to indicate a signal sent using tgkill(2)).
>
>si_pid This should be set to a process ID, typically the process ID  of
>   the sender.
>
>si_uid This  should  be set to a user ID, typically the real user ID of
>   the sender.
>
>si_value
>   This field contains the user data to accompany the signal.   For
>   more information, see the description of the last (union sigval)
>   argument of sigqueue(3).
>
>Internally, the kernel sets the si_signo field to the  value  specified
>in  sig,  so that the receiver of the signal can also obtain the signal
>number via that field.
>
>> 
>> As such simply fail if someone is silly enough to set si_signo out of
>> sync with the signal number passed to sigqueueinfo.
>> 
>> I don't expect a problem as glibc's sigqueue implementation sets
>> "si_signo = sig" and CRIU just returns to the kernel what the kernel
>> gave to it.
>> 
>> If there is some application that calls sigqueueinfo directly that has
>> a problem with this added sanity check we can revisit this when we see
>> what kind of crazy that application is doing.
>
>
> I already know two "applications" ;)
>
> https://github.com/torvalds/linux/blob/master/tools/testing/selftests/ptrace/peeksiginfo.c
> https://github.com/checkpoint-restore/criu/blob/master/test/zdtm/static/sigpending.c
>
> Disclaimer: I'm the author of both of them.

Fair enough.  Then this counts as a regression.  The setting in the
kernel happens in an awkward place and I will see if it can be moved
earlier.

Eric


>> Signed-off-by: "Eric W. Biederman" 
>> ---
>>  kernel/signal.c | 6 --
>>  1 file changed, 4 insertions(+), 2 deletions(-)
>> 
>> diff --git a/kernel/signal.c b/kernel/signal.c
>> index 7b49c31d3fdb..e445b0a63faa 100644
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -3306,7 +3306,8 @@ static int do_rt_sigqueueinfo(pid_t pid, int sig, 
>> siginfo_t *info)
>>  (task_pid_vnr(current) != pid))
>>  return -EPERM;
>>  
>> -info->si_signo = sig;
>> +if (info->si_signo != sig)
>> +return -EINVAL;
>>  
>>  /* POSIX.1b doesn't mention process groups.  */
>>  return kill_proc_info(sig, info, pid);
>> @@ -3354,7 +3355,8 @@ static int do_rt_tgsigqueueinfo(pid_t tgid, pid_t pid, 
>> int sig, siginfo_t *info)
>>  (task_pid_vnr(current) != pid))
>>  return -EPERM;
>>  
>> -info->si_signo = sig;
>> +if (info->si_signo != sig)
>> +return -EINVAL;
>>  
>>  return do_send_specific(tgid, pid, sig, info);
>>  }


Re: Linux 4.19-rc4 released, an apology, and a maintainership note

2018-10-04 Thread Eric W. Biederman
Pavel Snajdr  writes:
>
> We started our organization (vpsFree.org) on top of OpenVZ patch set and are 
> now
> working to get vanilla up to the task of replacing the venerable 2.6.32-based
> OpenVZ 6 Linux-like thing. The new Code of Conduct is a guarantee for us, that
> we won't be laughed out of the room  and that our members won't be demotivated
> to contribute upstream - if we can all agree to be nice on each other; yet we
> still need you guys to tell us, when we're trying stupid things or going about
> things the wrong way, in some way that we will notice & can learn from.
> If I understand the context correctly, the previous "regime" could be the
> culprit, at least to some extent, why still don't have the VM look
> containers with vanilla.

At an implementation level namespaces and cgroups are hard.  Coming up
with a good solid design that is very maintainable and handles all of
the corner cases is difficult.  Very few people choose to do the work
of digging into the details and figuring out what is really needed.

This is not an area where you can hand hold someone.  It really takes
people who are able to work out from first principles what the code will
need to do.

Very often people will propose patches that do solve their specific case
but only do 10% or maybe 20% of what is needed for a general kernel
level solution.  For something that just works and does not cause
maintenance problems in the long run.

Someone has to deep dive and understand all of the problems and sort it
out.

That takes a person that is willing and able to stand up with all of the
rest of the kernel developers as an equal.   It requires listening to
other opinions to see where you need to change and where things are
wrong but it also requires being able to figure things out for yourself
and to come up with solid technical contributions.

I hope I have been something reasonable to work with on this front, if
not please let me know.

I know many other maintainers get hit with such a stream of bad
container ideas that they tend to stop listening.  As their priorities
are elsewhere I don't blame them.

Also don't forget that most of the time to do a good implemenation that it
requires rewriting an entire subsystem to make it container friendly.
Think of the effort that requires, especially when you are not allowed
to cause regressions in the subystem while rewriting it.

Further the only power a maintainer has is to accept patches, to listen
to people, and to express opinions that are worth listening to.  In the
midst of doing all of those things a maintainers time is limited.

You said a change in attitude gives you optimism that you can make work
with the upstream kernel.  I am glad you have optimism as overall the
kernel is a welcoming place.

At the same time there can't be guarantees that people won't be
demontivated.  If you care about the remaining kernel problems for
implementing containers, you need to realize those that are difficult
problems that don't easily admit to solutions.  That is why the problems
still remain.  To get a small flavor just look at how much work I had to
go through to sort out siginfo in the kernel which is just one very
small piece of the larger puzzle.  So please realize that sometimes
actually realizing the scope of the technical difficulties might be
demotivating in and of itself.

Similarly because maintainers have a limited amount of time there are no
guarantees how much we can help others.  We can try but people have to
meet maintainers at least half way in figuring out how things work
themselves, and sometimes there is just not enough time to say anything.
As the old saying goes: "You can lead a horse to water but you can't make
him drink".

So there are no guarantees that people won't be demotivated or that they
will learn what is necessary.  All that we can do is aim to keep
conversations polite and focused on the technical details of the project.
Which should keep things from getting unpleasant at the level of humans
interacting with humans.  I don't think that will give you greater
guarantees beyond that, and it feels like you are reading greater
guarantees into recent events.

I would like to see what you see as missing from the mainline
kernel.  But that is a topic for the containers list, and possibly for
the containers track at Linux Plumbers conference in Vancouver.

Eric


Re: linux-next: manual merge of the userns tree with the y2038 tree

2018-10-04 Thread Eric W. Biederman
Stephen Rothwell  writes:

> Hi Eric,
>
> Today's linux-next merge of the userns tree got a conflict in:
>
>   kernel/signal.c
>
> between commit:
>
>   49c39f8464a9 ("y2038: signal: Change rt_sigtimedwait to use 
> __kernel_timespec")
>
> from the y2038 tree and commit:
>
>   ae7795bc6187 ("signal: Distinguish between kernel_siginfo and siginfo")
>
> from the userns tree.
>
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.

Thank you.

This is good to know about.

Eric


Re: Setting monotonic time?

2018-10-03 Thread Eric W. Biederman
Thomas Gleixner  writes:

> On Wed, 3 Oct 2018, Eric W. Biederman wrote:
>> Direct access to hardware/drivers and not through an abstraction like
>> the vfs (an abstraction over block devices) can legitimately be handled
>> by hotplug events.  I unplug one keyboard I plug in another.
>> 
>> I don't know if the input layer is more of a general abstraction
>> or more of a hardware device.  I have not dug into it but my guess
>> is abstraction from what I have heard.
>> 
>> The scary difficulty here is if after restart input is reporting times
>> in CLOCK_MONOTONIC and the applications in the namespace are talking
>> about times in CLOCK_MONOTONIC_SYNC.  Then there is an issue.  As even
>> with a fixed offset the times don't match up.
>> 
>> So a time namespace absolutely needs to do is figure out how to deal
>> with all of the kernel interfaces reporting times and figure out how to
>> report them in the current time namespace.
>
> So you want to talk to Arnd who is leading the y2038 effort. He knowns how
> many and which interfaces are involved aside of the obvious core timer
> ones. It's quite an amount and the problem is that you really need to do
> that at the interface level, because many of those time stamps are taken in
> contexts which are completely oblivious of name spaces. Ditto for timeouts
> and similar things which are handed in through these interfaces.

Yep.  That sounds right.

Eric


Re: [RFC v2 v2 1/1] ns: add binfmt_misc to the mount namespace

2018-10-03 Thread Eric W. Biederman
Laurent Vivier  writes:

> This patch allows to have a different binftm_misc configuration
> in each container we mount binfmt_misc filesystem with mount namespace
> enabled.
>
> A container started without the CLONE_NEWNS will use the host binfmt_misc
> configuration, otherwise the container starts with an empty binfmt_misc
> interpreters list.
>
> For instance, using "unshare" we can start a chroot of an another
> architecture and configure the binfmt_misc interpreted without being root
> to run the binaries in this chroot.

A couple of things.
As has already been mentioned on your previous version anything that
comes through the filesystem interface needs to lookup up the local
binfmt context not through current but through file->f_dentry->d_sb.
AKA the superblock of the mounted filesystem.

As you have this coded any time a mount namespace is unshared you get a
new binfmt context.  That has a very reasonable chance of breaking
existing userspace.

A mount of binfmt_misc today from within a user namespace is not allowed
which is why I have figured that will be a nice place to trigger
creating a new binfmt context.

It is fundamentally necessary to be able to get a pointer to the binfmt
context from current.  Either stored in an existing namespace or
stored in nsproxy.  Anything else will risk breaking backwards
compatibility with existing user space for no good reason.

What is fundamentally being changed is the behavior of exec.

Changing the behavior of exec needs to be carefully contained or we risk
confusing privileged applications.

I believe your last email to James Bottomley detailed a very strong use
case for this functionality.

As the key gains over the existing kernel is unprivileged use.  As it is
the behavior of exec that is changing.  You definitely need a user
namespace involved.

So I think the simplest would be to hang the binfmt context off of a
user namespace.  But I am open to other ideas.

My primary concern is that we keep the cognitive and the maintenance
burden as small as is reasonably possible so that the costs don't out
weigh the benefit.

Eric


> Signed-off-by: Laurent Vivier 
> ---
>  fs/binfmt_misc.c | 50 +---
>  fs/mount.h   |  8 
>  fs/namespace.c   |  6 ++
>  3 files changed, 40 insertions(+), 24 deletions(-)
>
> diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
> index aa4a7a23ff99..ecb14776c759 100644
> --- a/fs/binfmt_misc.c
> +++ b/fs/binfmt_misc.c
> @@ -25,6 +25,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "internal.h"
>  
> @@ -38,9 +39,6 @@ enum {
>   VERBOSE_STATUS = 1 /* make it zero to save 400 bytes kernel memory */
>  };
>  
> -static LIST_HEAD(entries);
> -static int enabled = 1;
> -
>  enum {Enabled, Magic};
>  #define MISC_FMT_PRESERVE_ARGV0 (1 << 31)
>  #define MISC_FMT_OPEN_BINARY (1 << 30)
> @@ -60,10 +58,7 @@ typedef struct {
>   struct file *interp_file;
>  } Node;
>  
> -static DEFINE_RWLOCK(entries_lock);
>  static struct file_system_type bm_fs_type;
> -static struct vfsmount *bm_mnt;
> -static int entry_count;
>  
>  /*
>   * Max length of the register string.  Determined by:
> @@ -91,7 +86,7 @@ static Node *check_file(struct linux_binprm *bprm)
>   struct list_head *l;
>  
>   /* Walk all the registered handlers. */
> - list_for_each(l, ) {
> + list_for_each(l, _ns(entries)) {
>   Node *e = list_entry(l, Node, list);
>   char *s;
>   int j;
> @@ -135,15 +130,15 @@ static int load_misc_binary(struct linux_binprm *bprm)
>   int fd_binary = -1;
>  
>   retval = -ENOEXEC;
> - if (!enabled)
> + if (!binfmt_ns(enabled))
>   return retval;
>  
>   /* to keep locking time low, we copy the interpreter string */
> - read_lock(_lock);
> + read_lock(_ns(entries_lock));
>   fmt = check_file(bprm);
>   if (fmt)
>   dget(fmt->dentry);
> - read_unlock(_lock);
> + read_unlock(_ns(entries_lock));
>   if (!fmt)
>   return retval;
>  
> @@ -613,15 +608,15 @@ static void kill_node(Node *e)
>  {
>   struct dentry *dentry;
>  
> - write_lock(_lock);
> + write_lock(_ns(entries_lock));
>   list_del_init(>list);
> - write_unlock(_lock);
> + write_unlock(_ns(entries_lock));
>  
>   dentry = e->dentry;
>   drop_nlink(d_inode(dentry));
>   d_drop(dentry);
>   dput(dentry);
> - simple_release_fs(_mnt, _count);
> + simple_release_fs(_ns(bm_mnt), _ns(entry_count));
>  }
>  
>  /* / */
> @@ -716,7 +711,8 @@ static ssize_t bm_register_write(struct file *file, const 
> char __user *buffer,
>   if (!inode)
>   goto out2;
>  
> - err = simple_pin_fs(_fs_type, _mnt, _count);
> + err = simple_pin_fs(_fs_type, _ns(bm_mnt),
> + _ns(entry_count));
>   if (err) {
>   iput(inode);
>   inode = NULL;
> @@ -730,7 +726,8 @@ static ssize_t 

Re: Setting monotonic time?

2018-10-02 Thread Eric W. Biederman
Thomas Gleixner  writes:

> On Tue, 2 Oct 2018, Arnd Bergmann wrote:
>> On Mon, Oct 1, 2018 at 8:53 PM Thomas Gleixner  wrote:
>> >
>> > On Mon, 1 Oct 2018, Eric W. Biederman wrote:
>> > > In the context of process migration there is a simpler subproblem that I
>> > > think it is worth exploring if we can do something about.
>> > >
>> > > For a cluster of machines all running with synchronized
>> > > clocks. CLOCK_REALTIME matches. CLOCK_MONOTNIC does not match between
>> > > machines.   Not having a matching CLOCK_MONOTONIC prevents successful
>> > > process migration between nodes in that cluster.
>> > >
>> > > Would it be possible to allow setting CLOCK_MONOTONIC at the very
>> > > beginning of time?  So that all of the nodes in a cluster can be in
>> > > sync?
>> > >
>> > > No change in skew just in offset for CLOCK_MONOTONIC.
>> > >
>> > > There are also dragons involved in coordinating things so that
>> > > CLOCK_MONOTONIC gets set before CLOCK_MONOTONIC gets used.  So I don't
>> > > know if allowing CLOCK_MONOTONIC to be set would be practical but it
>> > > seems work exploring all on it's own.
>> >
>> > It's used very early on in the kernel, so that would be a major surprise
>> > for many things including user space which has expectations on clock
>> > monotonic.
>> >
>> > It would be reasonably easy to add CLOCK_MONONOTIC_SYNC which can be set in
>> > the way you described and then in name spaces make it possible to magically
>> > map CLOCK_MONOTONIC to CLOCK_MONOTONIC_SYNC.
>> >
>> > It still wouldn't allow to have different NTP/PTP time domains, but might
>> > be a good start to address the main migration headaches.
>> 
>> If we make CLOCK_MONOTONIC settable this way in a namespace,
>> do you think that should include device drivers that report timestamps
>> in CLOCK_MONOTONIC base, or only the timekeeping clock and timer
>> interfaces?
>
> Uurgh. That gets messy very fast.
>
>> Examples for drivers that can report timestamps are input, sound, v4l,
>> and drm. I think most of these can report stamps in either monotonic
>> or realtime base, while socket timestamps notably are always in
>> realtime.
>> 
>> We can probably get away with not setting the timebase for those
>> device drivers as long as the checkpoint/restart and migration features
>> are not expected to restore the state of an open character device
>> in that way. I don't know if that is a reasonable assumption to make
>> for the examples I listed.
>
> No idea. I'm not a container migration wizard.

Direct access to hardware/drivers and not through an abstraction like
the vfs (an abstraction over block devices) can legitimately be handled
by hotplug events.  I unplug one keyboard I plug in another.

I don't know if the input layer is more of a general abstraction
or more of a hardware device.  I have not dug into it but my guess
is abstraction from what I have heard.

The scary difficulty here is if after restart input is reporting times
in CLOCK_MONOTONIC and the applications in the namespace are talking
about times in CLOCK_MONOTONIC_SYNC.  Then there is an issue.  As even
with a fixed offset the times don't match up.

So a time namespace absolutely needs to do is figure out how to deal
with all of the kernel interfaces reporting times and figure out how to
report them in the current time namespace.

Eric








Setting monotonic time?

2018-10-01 Thread Eric W. Biederman


In the context of process migration there is a simpler subproblem that I
think it is worth exploring if we can do something about.

For a cluster of machines all running with synchronized
clocks. CLOCK_REALTIME matches. CLOCK_MONOTNIC does not match between
machines.   Not having a matching CLOCK_MONOTONIC prevents successful
process migration between nodes in that cluster.

Would it be possible to allow setting CLOCK_MONOTONIC at the very
beginning of time?  So that all of the nodes in a cluster can be in
sync?

No change in skew just in offset for CLOCK_MONOTONIC.

There are also dragons involved in coordinating things so that
CLOCK_MONOTONIC gets set before CLOCK_MONOTONIC gets used.  So I don't
know if allowing CLOCK_MONOTONIC to be set would be practical but it
seems work exploring all on it's own.

Dmitry would setting CLOCK_MONOTONIC exactly once at boot time solve
your problem that is you are looking at a time namespace to solve?

Eric


Re: [RFC 00/20] ns: Introduce Time Namespace

2018-10-01 Thread Eric W. Biederman
Thomas Gleixner  writes:

> Eric,
>
> On Fri, 28 Sep 2018, Eric W. Biederman wrote:
>> Thomas Gleixner  writes:
>> > On Wed, 26 Sep 2018, Eric W. Biederman wrote:
>> >> At the same time using the techniques from the nohz work and a little
>> >> smarts I expect we could get the code to scale.
>> >
>> > You'd need to invoke the update when the namespace is switched in and
>> > hasn't been updated since the last tick happened. That might be doable, but
>> > you also need to take the wraparound constraints of the underlying
>> > clocksources into account, which again can cause walking all name spaces
>> > when they are all idle long enough.
>> 
>> The wrap around constraints being how long before the time sources wrap
>> around so you have to read them once per wrap around?  I have not dug
>> deeply enough into the code to see that yet.
>
> It's done by limiting the NOHZ idle time when all CPUs are going into deep
> sleep for a long time, i.e. we make sure that at least one CPU comes back
> sufficiently _before_ the wraparound happens and invokes the update
> function.
>
> It's not so much a problem for TSC, but not every clocksource the kernel
> supports has wraparound times in the range of hundreds of years.
>
> But yes, your idea of keeping track of wraparounds might work. Tricky, but
> looks feasible on first sight, but we should be aware of the dragons.

Oh.  Yes.  Definitely.  A key enabler of any namespace implementation is
figuring out how to tame the dragons.

>> Please pardon me for thinking out load.
>> 
>> There are one or more time sources that we use to compute the time
>> and for each time source we have a conversion from ticks of the
>> time source to nanoseconds.
>> 
>> Each time source needs to be sampled at least once per wrap-around
>> and something incremented so that we don't loose time when looking
>> at that time source.
>> 
>> There are several clocks presented to userspace and they all share the
>> same length of second and are all fundamentally offsets from
>> CLOCK_MONOTONIC.
>
> Yes. That's the readout side. This one is doable. But now look at timers.
>
> If you arm the timer from a name space, then it needs to be converted to
> host time in order to sort it into the hrtimer queue and at some point arm
> the clockevent device for it. This works as long as host and name space
> time have a constant offset and the same skew.
>
> Once the name space time has a different skew this falls apart because the
> armed timer will either expire late or early.
>
> Late might be acceptable, early violates the spec. You could do an extra
> check for rescheduling it, if it's early, but that requires to store the
> name space time accessor in the hrtimer itself because not every timer
> expiry happens so that it can be checked in the name space context (think
> signal based timers). We need to add this extra magic right into
> __hrtimer_run_queues() which is called from the hard and soft interrupt. We
> really don't want to touch all relevant callbacks or syscalls. The latter
> is not sufficient anyway for signal based timer delivery.
>
> That's going to be interesting in terms of synchronization and might also
> cause substantial overhead at least for the timers which belong to name
> spaces.
>
> But that also means that anything which is early can and probably will
> cause rearming of the timer hardware possibly for a very short delta. We
> need to think about whether this can be abused to create interrupt storms.
>
> Now if you accept a bit late, which I'm not really happy about, then you
> surely won't accept very late, i.e. hours, days. But that can happen when
> settimeofday() comes into play. Right now with a single time domain, this
> is easy. When settimeofday() or adjtimex() makes time jump, we just go and
> reprogramm the hardware timers accordingly, which might also result in
> immediate expiry of timers.
>
> But this does not help for time jumps in name spaces because the timer is
> enqueued on the host time base.
>
> And no, we should not think about creating per name space hrtimer queues
> and then have to walk through all of them for finding the first expiring
> timer in order to arm the hardware. That cannot scale.
>
> Walking all hrtimer bases on all CPUs and check all queued timers whether
> they belong to the affected name space does not scale either.
>
> So we'd need to keep track of queued timers belonging to a name space and
> then just handle them. Interesting locking problem and also a scalability
> issue because this might need to be done on all online CPUs. Haven't
> though

Re: [RFC 0/2] ns: introduce binfmt_misc namespace

2018-10-01 Thread Eric W. Biederman
Laurent Vivier  writes:

> Le 01/10/2018 à 09:21, Eric W. Biederman a écrit :
>> Andy Lutomirski  writes:
>> 
>>> On Sun, Sep 30, 2018 at 4:47 PM Laurent Vivier  wrote:
>>>>
>>>> This series introduces a new namespace for binfmt_misc.
>>>>
>>>
>>> This seems conceptually quite reasonable, but I'm wondering if the
>>> number of namespace types is getting out of hand given the current
>>> API.  Should we be considering whether we need a new set of namespace
>>> creation APIs that scale better to larger numbers of namespace types?
>> 
>> I would rather encourage a way to make this part of an existing
>> namespace or find a way to make a mount of binfmt_misc control this.
>> 
>> Hmm.  This looks like something that can be very straight forwardly be
>> made part of the user namespace.  If you ever mount binfmt_misc in the
>> user namespace you get the new behavior.  Otherwise you get the existing
>> behavior.
>
> Thank you. I'll do that.
>
>> A user namespace will definitely be required, as otherwise you run the
>> risk of confusing root (and suid root exectuables0 by being able to
>> change the behavior of executables.
>> 
>> What is the motivation for this?  My impression is that very few people
>> tweak binfmt_misc.
>
> I think more and more people are using an interpreter like qemu
> linux-usermode to have a cross-compilation environment: they bootstrap a
> distro filesystems (with something like debootstrap), and then use
> binfmt_misc to run the compiler inside this environment (see for
> instance [1] [2] [3] or [4] [5]). This is interesting because you have
> more than a cross-compiler with that: you have also all the libraries of
> the target system, you can select exactly which target release you want
> to build to, with the exact same compiler and libraries versions (and
> you can re-use it you want to do maintenance on your project 10 years
> later...)
>
> The problem with this is you need to be root:
> 1- to chroot
> 2- to configure binfmt_misc
>
> We already can use "unshare --map-root-user chroot" to address the point
> 1, and this series tries to address the point 2.
>
> I think it's also interesting to have a per container configuration for
> binfmt_misc when the server administrator configures it and don't want
> to share each user configuration with all the other user ones (in
> something like docker or a cloud application).

OK.  So it sounds like you are already needing a user namespace for
this.   If this is your use case then my proposed method above seems to
fit rather well.  James Bottomley was doing something similar that
connected to personality(2).  That might be worth a look to see if there
is some synergy there.

>> I also don't think this raises to the level where it makes sense to
>> create a new namespace for this.
>
> OK.
>
> Thanks,
> Laurent
>
> [1] https://wiki.debian.org/Arm64Qemu
> [2] https://wiki.debian.org/M68k/sbuildQEMU
> [3] https://wiki.debian.org/RISC-V#Manual_qemu-user_installation
> [4] https://kbeckmann.github.io/2017/05/26/QEMU-instead-of-cross-compiling/
> [5] https://wiki.gentoo.org/wiki/Crossdev_qemu-static-user-chroot


Re: [RFC 0/2] ns: introduce binfmt_misc namespace

2018-10-01 Thread Eric W. Biederman
Andy Lutomirski  writes:

> On Sun, Sep 30, 2018 at 4:47 PM Laurent Vivier  wrote:
>>
>> This series introduces a new namespace for binfmt_misc.
>>
>
> This seems conceptually quite reasonable, but I'm wondering if the
> number of namespace types is getting out of hand given the current
> API.  Should we be considering whether we need a new set of namespace
> creation APIs that scale better to larger numbers of namespace types?

I would rather encourage a way to make this part of an existing
namespace or find a way to make a mount of binfmt_misc control this.

Hmm.  This looks like something that can be very straight forwardly be
made part of the user namespace.  If you ever mount binfmt_misc in the
user namespace you get the new behavior.  Otherwise you get the existing
behavior.

A user namespace will definitely be required, as otherwise you run the
risk of confusing root (and suid root exectuables0 by being able to
change the behavior of executables.

What is the motivation for this?  My impression is that very few people
tweak binfmt_misc.

I also don't think this raises to the level where it makes sense to
create a new namespace for this.

Eric


Re: [RFC 00/20] ns: Introduce Time Namespace

2018-09-28 Thread Eric W. Biederman
Thomas Gleixner  writes:

> On Wed, 26 Sep 2018, Eric W. Biederman wrote:
>> Reading the code the calling sequence there is:
>> tick_sched_do_timer
>>tick_do_update_jiffies64
>>   update_wall_time
>>   timekeeping_advance
>>  timekeepging_update
>> 
>> If I read that properly under the right nohz circumstances that update
>> can be delayed indefinitely.
>> 
>> So I think we could prototype a time namespace that was per
>> timekeeping_update and just had update_wall_time iterate through
>> all of the time namespaces.
>
> Please don't go there. timekeeping_update() is already heavy and walking
> through a gazillion of namespaces will just make it horrible,
>
>> I don't think the naive version would scale to very many time
>> namespaces.
>
> :)
>
>> At the same time using the techniques from the nohz work and a little
>> smarts I expect we could get the code to scale.
>
> You'd need to invoke the update when the namespace is switched in and
> hasn't been updated since the last tick happened. That might be doable, but
> you also need to take the wraparound constraints of the underlying
> clocksources into account, which again can cause walking all name spaces
> when they are all idle long enough.

The wrap around constraints being how long before the time sources wrap
around so you have to read them once per wrap around?  I have not dug
deeply enough into the code to see that yet.

> From there it becomes hairy, because it's not only timekeeping,
> i.e. reading time, this is also affecting all timers which are armed from a
> namespace.
>
> That gets really ugly because when you do settimeofday() or adjtimex() for
> a particular namespace, then you have to search for all armed timers of
> that namespace and adjust them.
>
> The original posix timer code had the same issue because it mapped the
> clock realtime timers to the timer wheel so any setting of the clock caused
> a full walk of all armed timers, disarming, adjusting and requeing
> them. That's horrible not only performance wise, it's also a locking
> nightmare of all sorts.
>
> Add time skew via NTP/PTP into the picture and you might have to adjust
> timers as well, because you need to guarantee that they are not expiring
> early.
>
> I haven't looked through Dimitry's patches yet, but I don't see how this
> can work at all without introducing subtle issues all over the place.

Then it sounds like this will take some more digging.

Please pardon me for thinking out load.

There are one or more time sources that we use to compute the time
and for each time source we have a conversion from ticks of the
time source to nanoseconds.

Each time source needs to be sampled at least once per wrap-around
and something incremented so that we don't loose time when looking
at that time source.

There are several clocks presented to userspace and they all share the
same length of second and are all fundamentally offsets from
CLOCK_MONOTONIC.

I see two fundamental driving cases for a time namespace.
1) Migration from one node to another node in a cluster in almost
   real time.

   The problem is that CLOCK_MONOTONIC between nodes in the cluster
   has not relation ship to each other (except a synchronized length of
   the second).  So applications that migrate can see CLOCK_MONOTONIC
   and CLOCK_BOOTTIME go backwards.

   This is the truly pressing problem and adding some kind of offset
   sounds like it would be the solution.  Possibly by allowing a boot
   time synchronization of CLOCK_BOOTTIME and CLOCK_MONOTONIC.

2) Dealing with two separate time management domains.  Say a machine
   that needes to deal with both something inside of google where they
   slew time to avoid leap time seconds and something in the outside
   world proper UTC time is kept as an offset from TAI with the
   occasional leap seconds.

   In the later case it would fundamentally require having seconds of
   different length.


A pure 64bit nanoseond counter is good for 500 years.  So 64bit
variables can be used to hold time, and everything can be converted from
there.

This suggests we can for ticks have two values.
- The number of ticks from the time source.
- The number of times the ticks would have rolled over.

That sounds like it may be a little simplistic as it would require being
very diligent about firing a timer exactly at rollover and not losing
that, but for a handwaving argument is probably enough to generate
a 64bit tick counter.

If the focus is on a 64bit tick counter then what update_wall_time
has to do is very limited.  Just deal the accounting needed to cope with
tick rollover.

Getting the actual time looks like it would be as simple as now, with
perhaps an extra addition to account for the number of times the tick
counter has ro

Re: [PATCH V6 08/33] csky: Process management and Signal

2018-09-27 Thread Eric W. Biederman
Guo Ren  writes:

> --- /dev/null
> +++ b/arch/csky/abiv2/fpu.c
> +void fpu_fpe(struct pt_regs * regs)
> +{
> + int sig;
> + unsigned int fesr;
> + siginfo_t info;
> +
> + fesr = mfcr("cr<2, 2>");
> +
> + if(fesr & FPE_ILLE){
> + info.si_code = ILL_ILLOPC;
> + sig = SIGILL;
> + }
> + else if(fesr & FPE_IDC){
> + info.si_code = ILL_ILLOPN;
> + sig = SIGILL;
> + }
> + else if(fesr & FPE_FEC){
> + sig = SIGFPE;
> + if(fesr & FPE_IOC){
> + info.si_code = FPE_FLTINV;
> + }
> + else if(fesr & FPE_DZC){
> + info.si_code = FPE_FLTDIV;
> + }
> + else if(fesr & FPE_UFC){
> + info.si_code = FPE_FLTUND;
> + }
> + else if(fesr & FPE_OFC){
> + info.si_code = FPE_FLTOVF;
> + }
> + else if(fesr & FPE_IXC){
> + info.si_code = FPE_FLTRES;
> + }
> + else {
> + info.si_code = NSIGFPE;
> + }
> + }
> + else {
> + info.si_code = NSIGFPE;
> + sig = SIGFPE;
> + }
> + info.si_signo = SIGFPE;
> + info.si_errno = 0;
> + info.si_addr = (void *)regs->pc;
> + force_sig_info(sig, , current);
> +}


This use of sending a signal is buggy.  It results in undefined values
being copied to userspace.

Userspace should never be sent NSIGXXX as a si_code.  You can use
FPE_FLTUNK for this default case.

In new code please use force_sig_fault instead of force_sig_info in new
code.  That saves you the trouble of messing with struct siginfo.

Thank you very much,
Eric Biederman



Re: [PATCH v14 09/19] x86/mm: x86/sgx: Signal SEGV_SGXERR for #PFs w/ PF_SGX

2018-09-27 Thread Eric W. Biederman
Jarkko Sakkinen  writes:

> From: Sean Christopherson 
>
> Signal SIGSEGV(SEGV_SGXERR) for all faults with PF_SGX set in the
> error code.  The PF_SGX bit is set if and only if the #PF is detected
> by the Enclave Page Cache Map (EPCM), which is consulted only after
> an access walks the kernel's page tables, i.e.:
>
>   a. the access was allowed by the kernel
>   b. the kernel's tables have become less restrictive than the EPCM
>   c. the kernel cannot fixup the cause of the fault
>
> Noteably, (b) implies that either the kernel has botched the EPC
> mappings or the EPCM has been invalidated due to a power event.  In
> either case, userspace needs to be alerted so that it can take
> appropriate action, e.g. restart the enclave.  This is reinforced
> by (c) as the kernel doesn't really have any other reasonable option,
> e.g. we could kill the task or panic, but neither is warranted.
>
> Signed-off-by: Sean Christopherson 
> Cc: Dave Hansen 
> Signed-off-by: Jarkko Sakkinen 
> ---
>  arch/x86/mm/fault.c | 20 +---
>  1 file changed, 17 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 85d20516b2f3..3fb2b2838d6c 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -960,10 +960,13 @@ static noinline void
>  bad_area_access_error(struct pt_regs *regs, unsigned long error_code,
> unsigned long address, struct vm_area_struct *vma)
>  {
> + int si_code = SEGV_ACCERR;
> +
>   if (bad_area_access_from_pkeys(error_code, vma))
> - __bad_area(regs, error_code, address, vma, SEGV_PKUERR);
> - else
> - __bad_area(regs, error_code, address, vma, SEGV_ACCERR);
> + si_code = SEGV_PKUERR;
> + else if (unlikely(error_code & X86_PF_SGX))
> + si_code = SEGV_SGXERR;
> + __bad_area(regs, error_code, address, vma, si_code);
>  }

This conflicts with a cleanup in this area I have sitting in linux-next.
It isn't in the x86 tree but you can find my siginfo tree at:

git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git 
siginfo-next

In my tree bad area no longer takes a vma parameter.

If you are going to make changes to the fault handling code this cycle
please let's figure out how to build it on top of my clean ups.

Thank you,
Eric Biederman




Re: [PATCH v14 08/19] signal: x86/sgx: Add SIGSEGV siginfo code for SGX EPCM fault

2018-09-27 Thread Eric W. Biederman
Jarkko Sakkinen  writes:

> From: Sean Christopherson 
>
> The SGX Enclave Page Cache Map (EPCM) is a hardware-managed table
> that enforces accesses to an enclave's EPC page in addition to the
> software-managed kernel page tables, i.e. the effective permissions
> for an EPC page are a logical AND of the kernel's page tables and
> the corresponding EPCM entry.  The primary purpose of the EPCM is
> to prevent a malcious or compromised kernel from attacking an enclave
> by modifying the enclave's page tables.  The EPCM entires for an
> enclave are populated when the enclave is built and verified, using
> metadata provided by the enclave that is included in the measurement
> used to verify the enclave.
>
> In normal operation of a properly functioning, non-malicious kernel
> (and enclave), the EPCM permissions will never trigger a fault, i.e.
> the kernel may make the permissions for an EPC page more restrictive,
> e.g. mark it not-present to swap out the EPC page, but the kernel will
> never make its permissions less restrictive.
>
> But, there is a legitimate scenario in which the kernel's page tables
> can become less restrictive than the EPCM: on current hardware all
> enclaves are destroyed (by hardware) on a transition to S3 or lower
> sleep states, i.e. all EPCM entries are invalid (not-present) after
> the system resumes from its sleep state.
>
> Unfortunately, on CPUs that support only SGX1, EPCM violations result
> in a #GP.  The upside of the #GP is that no kernel changes are needed
> to deal with the EPCM being blasted away by hardware, e.g. userspace
> gets a SIGSEGV, assumes the EPCM was lost and restarts its enclave
> and everyone is happy.  The downside is that userspace has to assume
> the SIGSEGV was because the EPC was lost (or possibly do some leg work
> to rule out other causes).
>
> In SGX2, the oddity of delivering a #GP due to what are inherently
> paging related violations is remedied.  CPUs that support SGX2 deliver
> EPCM violations as #PFs with a new SGX error code bit set.  So, now
> that hardware provides us with a way to unequivocally determine that
> a fault was due to a EPCM violation, define a signfo code for SIGSEGV
> so that the information can be passed onto userspace.
>
> Cc: Dave Hansen 
> Signed-off-by: Sean Christopherson 
> Signed-off-by: Jarkko Sakkinen 
> ---
>  include/uapi/asm-generic/siginfo.h | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/include/uapi/asm-generic/siginfo.h 
> b/include/uapi/asm-generic/siginfo.h
> index 80e2a7227205..fdd898e2325b 100644
> --- a/include/uapi/asm-generic/siginfo.h
> +++ b/include/uapi/asm-generic/siginfo.h
> @@ -225,7 +225,11 @@ typedef struct siginfo {
>  #else
>  # define SEGV_PKUERR 4   /* failed protection key checks */
>  #endif
> +#ifdef __x86_64__
> +#define SEGV_SGXERR  5   /* SGX Enclave Page Cache Map fault */
> +#else
>  #define SEGV_ACCADI  5   /* ADI not enabled for mapped object */
> +#endif

Don't do this crazy ifdef thing.  si_codes are not supposed to be per
architecture.  There are a few historical bugs but with a 32bit space
it is just stupid to add #ifdefs.

Just set.
#define SEGV_SGXERR 8 and increase NSIGSEGV

Anything else is just asking for trouble.  Especially when you want to
get SGX working on itaninum.

>  #define SEGV_ADIDERR 6   /* Disrupting MCD error */
>  #define SEGV_ADIPERR 7   /* Precise MCD exception */
>  #define NSIGSEGV 7

Eric


Re: [REVIEW][PATCH 00/15] signal/arm64: siginfo cleanups

2018-09-27 Thread Eric W. Biederman
Catalin Marinas  writes:

> Hi Eric,
>
> On Mon, Sep 24, 2018 at 11:07:05AM +0200, Eric W. Biederman wrote:
>> This is the continuation of my work to sort out signaling of exceptions
>> with siginfo.  The old signal sending functions by taking a siginfo
>> argument resulted in their callers having to deal with the fiddly nature
>> of siginfo directly.  In small numbers of callers this is not a problem
>> but in the number of callers in the kernel this resulted in cases
>> where fields were not initialized or improperly initialized before
>> being passed to userspace.
>> 
>> To avoid having to worry about those issues I have added new signal
>> sending functions that each deal wit a different siginfo case.  When
>> using these functions there is no room for the fiddly nature of siginfo
>> to cause mistakes.
>> 
>> This is my set of changes to update arm64 to use those functions.
>> Along with some refactoring so those functions can be cleanly used.
>> 
>> Folks please review and double check me.  I think I have kept these
>> changes simple and obviously correct but I am human and mess up
>> sometimes.
>
> Nice clean-up, thanks. I started reviewing the patches, I should finish
> by tomorrow (I also applied them locally to give some testing).
>
>> After these patches have had a chance to be reviewed I plan to merge
>> them by my siginfo tree.  If you would rather take them in the arm64
>> tree let me know.   All of the prerequisites should have been merged
>> through Linus's tree several releases ago.
>
> Either way works for me. There is a trivial conflict in
> force_signal_inject() with the arm64 for-next/core tree so I could as
> well put them on top of this branch and send them during the 4.20
> merging window.

As long as there is a trivial conflict I would like to keep everything
in one tree.

There is a following patchset that manages to reduce the size of struct
siginfo in the kernel that I have also posted for review.With
everything in one tree I can make that change now, and just cross it off
my list of things to worry about.

Eric



Re: [RFC 00/20] ns: Introduce Time Namespace

2018-09-26 Thread Eric W. Biederman
Andrey Vagin  writes:

> On Tue, Sep 25, 2018 at 12:02:32AM +0200, Eric W. Biederman wrote:
>> Andrey Vagin  writes:
>> 
>> > On Fri, Sep 21, 2018 at 02:27:29PM +0200, Eric W. Biederman wrote:
>> >> Dmitry Safonov  writes:
>> >> 
>> >> > Discussions around time virtualization are there for a long time.
>> >> > The first attempt to implement time namespace was in 2006 by Jeff Dike.
>> >> > From that time, the topic appears on and off in various discussions.
>> >> >
>> >> > There are two main use cases for time namespaces:
>> >> > 1. change date and time inside a container;
>> >> > 2. adjust clocks for a container restored from a checkpoint.
>> >> >
>> >> > “It seems like this might be one of the last major obstacles keeping
>> >> > migration from being used in production systems, given that not all
>> >> > containers and connections can be migrated as long as a time dependency
>> >> > is capable of messing it up.” (by github.com/dav-ell)
>> >> >
>> >> > The kernel provides access to several clocks: CLOCK_REALTIME,
>> >> > CLOCK_MONOTONIC, CLOCK_BOOTTIME. Last two clocks are monotonous, but the
>> >> > start points for them are not defined and are different for each running
>> >> > system. When a container is migrated from one node to another, all
>> >> > clocks have to be restored into consistent states; in other words, they
>> >> > have to continue running from the same points where they have been
>> >> > dumped.
>> >> >
>> >> > The main idea behind this patch set is adding per-namespace offsets for
>> >> > system clocks. When a process in a non-root time namespace requests
>> >> > time of a clock, a namespace offset is added to the current value of
>> >> > this clock on a host and the sum is returned.
>> >> >
>> >> > All offsets are placed on a separate page, this allows up to map it as 
>> >> > part of vvar into user processes and use offsets from vdso calls.
>> >> >
>> >> > Now offsets are implemented for CLOCK_MONOTONIC and CLOCK_BOOTTIME
>> >> > clocks.
>> >> >
>> >> > Questions to discuss:
>> >> >
>> >> > * Clone flags exhaustion. Currently there is only one unused clone flag
>> >> > bit left, and it may be worth to use it to extend arguments of the clone
>> >> > system call.
>> >> >
>> >> > * Realtime clock implementation details:
>> >> >   Is having a simple offset enough?
>> >> >   What to do when date and time is changed on the host?
>> >> >   Is there a need to adjust vfs modification and creation times? 
>> >> >   Implementation for adjtime() syscall.
>> >> 
>> >> Overall I support this effort.  In my quick skim this code looked good.
>> >
>> > Hi Eric,
>> >
>> > Thank you for the feedback.
>> >
>> >> 
>> >> My feeling is that we need to be able to support running ntpd and
>> >> support one namespace doing googles smoothing of leap seconds while
>> >> another namespace takes the leap second.
>> >> 
>> >> What I was imagining when I was last thinking about this was one
>> >> instance of struct timekeeper aka tk_core per time namespace.  That
>> >> structure already keeps offsets for all of the various clocks from
>> >> the kerne internal time sources.  What would be needed would be to
>> >> pass in an appropriate time namespace pointer.
>> >> 
>> >> I could be completely wrong as I have not take the time to completely
>> >> trace through the code.  Have you looked at pushing the time namespace
>> >> down as far as tk_core?
>> >> 
>> >> What I think would be the big advantage (besides ntp working) is that
>> >> the bulk of the code could be reused.  Allowing testing of the kernel's
>> >> time code by setting up a new time namespace.  So a person in production
>> >> could setup a time namespace with the time set ahead a little  bit and
>> >> be able to verify that the kernel handles the upcoming leap second
>> >> properly.
>> >>
>> >
>> > It is an interesting idea, but I have a few questions:
>> >
>> > 1. Does it mean that timekeeping_update() wil

[REVIEW][PATCH 5/6] signal: Distinguish between kernel_siginfo and siginfo

2018-09-25 Thread Eric W. Biederman
Linus recently observed that if we did not worry about the padding
member in struct siginfo it is only about 48 bytes, and 48 bytes is
much nicer than 128 bytes for allocating on the stack and copying
around in the kernel.

The obvious thing of only adding the padding when userspace is
including siginfo.h won't work as there are sigframe definitions in
the kernel that embed struct siginfo.

So split siginfo in two; kernel_siginfo and siginfo.  Keeping the
traditional name for the userspace definition.  While the version that
is used internally to the kernel and ultimately will not be padded to
128 bytes is called kernel_siginfo.

The definition of struct kernel_siginfo I have put in include/signal_types.h

A set of buildtime checks has been added to verify the two structures have
the same field offsets.

To make it easy to verify the change kernel_siginfo retains the same
size as siginfo.  The reduction in size comes in a following change.

Signed-off-by: "Eric W. Biederman" 
---
 arch/x86/include/asm/compat.h |   2 +-
 drivers/usb/core/devio.c  |   4 +-
 fs/binfmt_elf.c   |   6 +-
 fs/coredump.c |   2 +-
 fs/fcntl.c|   2 +-
 fs/signalfd.c |   6 +-
 include/linux/binfmts.h   |   2 +-
 include/linux/compat.h|   4 +-
 include/linux/coredump.h  |   4 +-
 include/linux/lsm_hooks.h |   4 +-
 include/linux/posix-timers.h  |   2 +-
 include/linux/ptrace.h|   2 +-
 include/linux/sched.h |   2 +-
 include/linux/sched/signal.h  |  18 ++--
 include/linux/security.h  |   6 +-
 include/linux/signal.h|  15 ++--
 include/linux/signal_types.h  |  11 ++-
 include/trace/events/signal.h |   4 +-
 ipc/mqueue.c  |   2 +-
 kernel/ptrace.c   |  10 +--
 kernel/seccomp.c  |   6 +-
 kernel/signal.c   | 151 ++
 kernel/time/posix-timers.c|   2 +-
 security/apparmor/lsm.c   |   2 +-
 security/security.c   |   2 +-
 security/selinux/hooks.c  |   2 +-
 security/smack/smack_lsm.c|   2 +-
 27 files changed, 165 insertions(+), 110 deletions(-)

diff --git a/arch/x86/include/asm/compat.h b/arch/x86/include/asm/compat.h
index fb97cf7c4137..a0f46bdd9f24 100644
--- a/arch/x86/include/asm/compat.h
+++ b/arch/x86/include/asm/compat.h
@@ -240,6 +240,6 @@ static inline bool in_compat_syscall(void)
 
 struct compat_siginfo;
 int __copy_siginfo_to_user32(struct compat_siginfo __user *to,
-   const siginfo_t *from, bool x32_ABI);
+   const kernel_siginfo_t *from, bool x32_ABI);
 
 #endif /* _ASM_X86_COMPAT_H */
diff --git a/drivers/usb/core/devio.c b/drivers/usb/core/devio.c
index 6ce77b33da61..c260ea8808b0 100644
--- a/drivers/usb/core/devio.c
+++ b/drivers/usb/core/devio.c
@@ -582,7 +582,7 @@ static void async_completed(struct urb *urb)
 {
struct async *as = urb->context;
struct usb_dev_state *ps = as->ps;
-   struct siginfo sinfo;
+   struct kernel_siginfo sinfo;
struct pid *pid = NULL;
const struct cred *cred = NULL;
unsigned long flags;
@@ -2599,7 +2599,7 @@ const struct file_operations usbdev_file_operations = {
 static void usbdev_remove(struct usb_device *udev)
 {
struct usb_dev_state *ps;
-   struct siginfo sinfo;
+   struct kernel_siginfo sinfo;
 
while (!list_empty(>filelist)) {
ps = list_entry(udev->filelist.next, struct usb_dev_state, 
list);
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index efae2fb0930a..54207327f98f 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1580,7 +1580,7 @@ static void fill_auxv_note(struct memelfnote *note, 
struct mm_struct *mm)
 }
 
 static void fill_siginfo_note(struct memelfnote *note, user_siginfo_t 
*csigdata,
-   const siginfo_t *siginfo)
+   const kernel_siginfo_t *siginfo)
 {
mm_segment_t old_fs = get_fs();
set_fs(KERNEL_DS);
@@ -1782,7 +1782,7 @@ static int fill_thread_core_info(struct 
elf_thread_core_info *t,
 
 static int fill_note_info(struct elfhdr *elf, int phdrs,
  struct elf_note_info *info,
- const siginfo_t *siginfo, struct pt_regs *regs)
+ const kernel_siginfo_t *siginfo, struct pt_regs *regs)
 {
struct task_struct *dump_task = current;
const struct user_regset_view *view = task_user_regset_view(dump_task);
@@ -2031,7 +2031,7 @@ static int elf_note_info_init(struct elf_note_info *info)
 
 static int fill_note_info(struct elfhdr *elf, int phdrs,
  struct elf_note_info *info,
- const siginfo_t *siginfo, struct pt_regs *regs)
+ const kernel_siginfo_t *siginfo, struct pt_regs *regs)
 {
struct list_head *t;
struct core_thread *ct;
diff --git a/fs/coredump.c b/fs/coredump.c
index 1e2c87acac9b..e42e17e55bfd 100644
--- a/

[REVIEW][PATCH 6/6] signal: Use a smaller struct siginfo in the kernel

2018-09-25 Thread Eric W. Biederman
We reserve 128 bytes for struct siginfo but only use about 48 bytes on
64bit and 32 bytes on 32bit.  Someday we might use more but it is unlikely
to be anytime soon.

Userspace seems content with just enough bytes of siginfo to implement
sigqueue.  Or in the case of checkpoint/restart reinjecting signals
the kernel has sent.

Reducing the stack footprint and the work to copy siginfo around from
2 cachelines to 1 cachelines seems worth doing even if I don't have
benchmarks to show a performance difference.

Suggested-by: Linus Torvalds 
Signed-off-by: "Eric W. Biederman" 
---
 include/linux/signal.h   |  2 +
 include/linux/signal_types.h |  5 +--
 kernel/signal.c  | 82 
 3 files changed, 67 insertions(+), 22 deletions(-)

diff --git a/include/linux/signal.h b/include/linux/signal.h
index 70031b10b918..706a499d1eb1 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -22,6 +22,8 @@ static inline void clear_siginfo(kernel_siginfo_t *info)
memset(info, 0, sizeof(*info));
 }
 
+#define SI_EXPANSION_SIZE (sizeof(struct siginfo) - sizeof(struct 
kernel_siginfo))
+
 int copy_siginfo_to_user(siginfo_t __user *to, const kernel_siginfo_t *from);
 int copy_siginfo_from_user(kernel_siginfo_t *to, const siginfo_t __user *from);
 
diff --git a/include/linux/signal_types.h b/include/linux/signal_types.h
index 2a40a9c5e4ad..f8a90ae9c6ec 100644
--- a/include/linux/signal_types.h
+++ b/include/linux/signal_types.h
@@ -10,10 +10,7 @@
 #include 
 
 typedef struct kernel_siginfo {
-   union {
-   __SIGINFO;
-   int _si_pad[SI_MAX_SIZE/sizeof(int)];
-   };
+   __SIGINFO;
 } kernel_siginfo_t;
 
 /*
diff --git a/kernel/signal.c b/kernel/signal.c
index 161cad4e448c..1c2dd117fee0 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2844,27 +2844,48 @@ COMPAT_SYSCALL_DEFINE2(rt_sigpending, compat_sigset_t 
__user *, uset,
 }
 #endif
 
+static const struct {
+   unsigned char limit, layout;
+} sig_sicodes[] = {
+   [SIGILL]  = { NSIGILL,  SIL_FAULT },
+   [SIGFPE]  = { NSIGFPE,  SIL_FAULT },
+   [SIGSEGV] = { NSIGSEGV, SIL_FAULT },
+   [SIGBUS]  = { NSIGBUS,  SIL_FAULT },
+   [SIGTRAP] = { NSIGTRAP, SIL_FAULT },
+#if defined(SIGEMT)
+   [SIGEMT]  = { NSIGEMT,  SIL_FAULT },
+#endif
+   [SIGCHLD] = { NSIGCHLD, SIL_CHLD },
+   [SIGPOLL] = { NSIGPOLL, SIL_POLL },
+   [SIGSYS]  = { NSIGSYS,  SIL_SYS },
+};
+
+static bool known_siginfo_layout(int sig, int si_code)
+{
+   if (si_code == SI_KERNEL)
+   return true;
+   else if ((si_code > SI_USER)) {
+   if (sig_specific_sicodes(sig)) {
+   if (si_code <= sig_sicodes[sig].limit)
+   return true;
+   }
+   else if (si_code <= NSIGPOLL)
+   return true;
+   }
+   else if (si_code >= SI_DETHREAD)
+   return true;
+   else if (si_code == SI_ASYNCNL)
+   return true;
+   return false;
+}
+
 enum siginfo_layout siginfo_layout(int sig, int si_code)
 {
enum siginfo_layout layout = SIL_KILL;
if ((si_code > SI_USER) && (si_code < SI_KERNEL)) {
-   static const struct {
-   unsigned char limit, layout;
-   } filter[] = {
-   [SIGILL]  = { NSIGILL,  SIL_FAULT },
-   [SIGFPE]  = { NSIGFPE,  SIL_FAULT },
-   [SIGSEGV] = { NSIGSEGV, SIL_FAULT },
-   [SIGBUS]  = { NSIGBUS,  SIL_FAULT },
-   [SIGTRAP] = { NSIGTRAP, SIL_FAULT },
-#if defined(SIGEMT)
-   [SIGEMT]  = { NSIGEMT,  SIL_FAULT },
-#endif
-   [SIGCHLD] = { NSIGCHLD, SIL_CHLD },
-   [SIGPOLL] = { NSIGPOLL, SIL_POLL },
-   [SIGSYS]  = { NSIGSYS,  SIL_SYS },
-   };
-   if ((sig < ARRAY_SIZE(filter)) && (si_code <= 
filter[sig].limit)) {
-   layout = filter[sig].layout;
+   if ((sig < ARRAY_SIZE(sig_sicodes)) &&
+   (si_code <= sig_sicodes[sig].limit)) {
+   layout = sig_sicodes[sig].layout;
/* Handle the exceptions */
if ((sig == SIGBUS) &&
(si_code >= BUS_MCEERR_AR) && (si_code <= 
BUS_MCEERR_AO))
@@ -2889,17 +2910,42 @@ enum siginfo_layout siginfo_layout(int sig, int si_code)
return layout;
 }
 
+static inline char __user *si_expansion(const siginfo_t __user *info)
+{
+   return ((char __user *)info) + sizeof(struct kernel_siginfo);
+}
+
 int copy_siginfo_to_user(siginfo_t __user *to, const kernel_siginfo_t *from)
 {
+   char __user *expansion = si_expansion(to);
if (copy_to_user(to, from , sizeof(struct kernel_siginfo)

[REVIEW][PATCH 3/6] signal: Remove the need for __ARCH_SI_PREABLE_SIZE and SI_PAD_SIZE

2018-09-25 Thread Eric W. Biederman
Rework the defintion of struct siginfo so that the array padding
struct siginfo to SI_MAX_SIZE can be placed in a union along side of
the rest of the struct siginfo members.  The result is that we no
longer need the __ARCH_SI_PREAMBLE_SIZE or SI_PAD_SIZE definitions.

Signed-off-by: "Eric W. Biederman" 
---
 arch/alpha/include/uapi/asm/siginfo.h   |   1 -
 arch/arm64/include/uapi/asm/Kbuild  |   1 +
 arch/arm64/include/uapi/asm/siginfo.h   |  24 ---
 arch/ia64/include/uapi/asm/siginfo.h|   2 -
 arch/mips/include/uapi/asm/siginfo.h|  11 --
 arch/parisc/include/uapi/asm/Kbuild |   1 +
 arch/parisc/include/uapi/asm/siginfo.h  |  11 --
 arch/powerpc/include/uapi/asm/Kbuild|   1 +
 arch/powerpc/include/uapi/asm/siginfo.h |  18 ---
 arch/riscv/include/uapi/asm/Kbuild  |   1 +
 arch/riscv/include/uapi/asm/siginfo.h   |  24 ---
 arch/s390/include/uapi/asm/Kbuild   |   1 +
 arch/s390/include/uapi/asm/siginfo.h|  17 ---
 arch/sparc/include/uapi/asm/siginfo.h   |   1 -
 arch/x86/include/uapi/asm/siginfo.h |   2 -
 include/uapi/asm-generic/siginfo.h  | 187 
 kernel/signal.c |   3 -
 17 files changed, 99 insertions(+), 207 deletions(-)
 delete mode 100644 arch/arm64/include/uapi/asm/siginfo.h
 delete mode 100644 arch/parisc/include/uapi/asm/siginfo.h
 delete mode 100644 arch/powerpc/include/uapi/asm/siginfo.h
 delete mode 100644 arch/riscv/include/uapi/asm/siginfo.h
 delete mode 100644 arch/s390/include/uapi/asm/siginfo.h

diff --git a/arch/alpha/include/uapi/asm/siginfo.h 
b/arch/alpha/include/uapi/asm/siginfo.h
index db3f0138536f..6e1a2af2f962 100644
--- a/arch/alpha/include/uapi/asm/siginfo.h
+++ b/arch/alpha/include/uapi/asm/siginfo.h
@@ -2,7 +2,6 @@
 #ifndef _ALPHA_SIGINFO_H
 #define _ALPHA_SIGINFO_H
 
-#define __ARCH_SI_PREAMBLE_SIZE(4 * sizeof(int))
 #define __ARCH_SI_TRAPNO
 
 #include 
diff --git a/arch/arm64/include/uapi/asm/Kbuild 
b/arch/arm64/include/uapi/asm/Kbuild
index 198afbf0688f..6c5adf458690 100644
--- a/arch/arm64/include/uapi/asm/Kbuild
+++ b/arch/arm64/include/uapi/asm/Kbuild
@@ -19,3 +19,4 @@ generic-y += swab.h
 generic-y += termbits.h
 generic-y += termios.h
 generic-y += types.h
+generic-y += siginfo.h
diff --git a/arch/arm64/include/uapi/asm/siginfo.h 
b/arch/arm64/include/uapi/asm/siginfo.h
deleted file mode 100644
index 574d12f86039..
--- a/arch/arm64/include/uapi/asm/siginfo.h
+++ /dev/null
@@ -1,24 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
-/*
- * Copyright (C) 2012 ARM Ltd.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program.  If not, see <http://www.gnu.org/licenses/>.
- */
-#ifndef __ASM_SIGINFO_H
-#define __ASM_SIGINFO_H
-
-#define __ARCH_SI_PREAMBLE_SIZE(4 * sizeof(int))
-
-#include 
-
-#endif
diff --git a/arch/ia64/include/uapi/asm/siginfo.h 
b/arch/ia64/include/uapi/asm/siginfo.h
index 52b5af424511..796af1ccaa7e 100644
--- a/arch/ia64/include/uapi/asm/siginfo.h
+++ b/arch/ia64/include/uapi/asm/siginfo.h
@@ -9,8 +9,6 @@
 #define _UAPI_ASM_IA64_SIGINFO_H
 
 
-#define __ARCH_SI_PREAMBLE_SIZE(4 * sizeof(int))
-
 #include 
 
 #define si_imm _sifields._sigfault._imm/* as per UNIX SysV ABI 
spec */
diff --git a/arch/mips/include/uapi/asm/siginfo.h 
b/arch/mips/include/uapi/asm/siginfo.h
index 262504bd59a5..c34c7eef0a1c 100644
--- a/arch/mips/include/uapi/asm/siginfo.h
+++ b/arch/mips/include/uapi/asm/siginfo.h
@@ -14,17 +14,6 @@
 #define __ARCH_SIGEV_PREAMBLE_SIZE (sizeof(long) + 2*sizeof(int))
 #undef __ARCH_SI_TRAPNO /* exception code needs to fill this ...  */
 
-/*
- * Careful to keep union _sifields from shifting ...
- */
-#if _MIPS_SZLONG == 32
-#define __ARCH_SI_PREAMBLE_SIZE (3 * sizeof(int))
-#elif _MIPS_SZLONG == 64
-#define __ARCH_SI_PREAMBLE_SIZE (4 * sizeof(int))
-#else
-#error _MIPS_SZLONG neither 32 nor 64
-#endif
-
 #define __ARCH_HAS_SWAPPED_SIGINFO
 
 #include 
diff --git a/arch/parisc/include/uapi/asm/Kbuild 
b/arch/parisc/include/uapi/asm/Kbuild
index 286ef5a5904b..adb5c64831c7 100644
--- a/arch/parisc/include/uapi/asm/Kbuild
+++ b/arch/parisc/include/uapi/asm/Kbuild
@@ -7,3 +7,4 @@ generic-y += kvm_para.h
 generic-y += param.h
 generic-y += poll.h
 generic-y += resource.h
+generic-y += siginfo.h
diff --git a/arch/parisc/include/uapi/asm/siginfo.h 
b/arch/parisc/include/uapi/asm/siginfo.h
deleted file mode 100644
index 4a1062e05aaf..
--- a/arch/parisc/include/uapi/asm/siginfo

[REVIEW][PATCH 4/6] signal: Introduce copy_siginfo_from_user and use it's return value

2018-09-25 Thread Eric W. Biederman
In preparation for using a smaller version of siginfo in the kernel
introduce copy_siginfo_from_user and use it when siginfo is copied from
userspace.

Make the pattern for using copy_siginfo_from_user and
copy_siginfo_from_user32 to capture the return value and return that
value on error.

This is a necessary prerequisite for using a smaller siginfo
in the kernel than the kernel exports to userspace.

Signed-off-by: "Eric W. Biederman" 
---
 include/linux/signal.h |  1 +
 kernel/ptrace.c| 12 +---
 kernel/signal.c| 25 -
 3 files changed, 22 insertions(+), 16 deletions(-)

diff --git a/include/linux/signal.h b/include/linux/signal.h
index 3d4cd5db30a9..de94c159bfb0 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -22,6 +22,7 @@ static inline void clear_siginfo(struct siginfo *info)
 }
 
 int copy_siginfo_to_user(struct siginfo __user *to, const struct siginfo 
*from);
+int copy_siginfo_from_user(struct siginfo *to, const struct siginfo __user 
*from);
 
 enum siginfo_layout {
SIL_KILL,
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 45f77a1b9c97..a807ff5cc1a9 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -919,9 +919,8 @@ int ptrace_request(struct task_struct *child, long request,
break;
 
case PTRACE_SETSIGINFO:
-   if (copy_from_user(, datavp, sizeof siginfo))
-   ret = -EFAULT;
-   else
+   ret = copy_siginfo_from_user(, datavp);
+   if (!ret)
ret = ptrace_setsiginfo(child, );
break;
 
@@ -1215,10 +1214,9 @@ int compat_ptrace_request(struct task_struct *child, 
compat_long_t request,
break;
 
case PTRACE_SETSIGINFO:
-   if (copy_siginfo_from_user32(
-   , (struct compat_siginfo __user *) datap))
-   ret = -EFAULT;
-   else
+   ret = copy_siginfo_from_user32(
+   , (struct compat_siginfo __user *) datap);
+   if (!ret)
ret = ptrace_setsiginfo(child, );
break;
 #ifdef CONFIG_HAVE_ARCH_TRACEHOOK
diff --git a/kernel/signal.c b/kernel/signal.c
index debb485a76db..c0e289e62d77 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2896,6 +2896,13 @@ int copy_siginfo_to_user(siginfo_t __user *to, const 
siginfo_t *from)
return 0;
 }
 
+int copy_siginfo_from_user(siginfo_t *to, const siginfo_t __user *from)
+{
+   if (copy_from_user(to, from, sizeof(struct siginfo)))
+   return -EFAULT;
+   return 0;
+}
+
 #ifdef CONFIG_COMPAT
 int copy_siginfo_to_user32(struct compat_siginfo __user *to,
   const struct siginfo *from)
@@ -3323,8 +3330,9 @@ SYSCALL_DEFINE3(rt_sigqueueinfo, pid_t, pid, int, sig,
siginfo_t __user *, uinfo)
 {
siginfo_t info;
-   if (copy_from_user(, uinfo, sizeof(siginfo_t)))
-   return -EFAULT;
+   int ret = copy_siginfo_from_user(, uinfo);
+   if (unlikely(ret))
+   return ret;
return do_rt_sigqueueinfo(pid, sig, );
 }
 
@@ -3365,10 +3373,9 @@ SYSCALL_DEFINE4(rt_tgsigqueueinfo, pid_t, tgid, pid_t, 
pid, int, sig,
siginfo_t __user *, uinfo)
 {
siginfo_t info;
-
-   if (copy_from_user(, uinfo, sizeof(siginfo_t)))
-   return -EFAULT;
-
+   int ret = copy_siginfo_from_user(, uinfo);
+   if (unlikely(ret))
+   return ret;
return do_rt_tgsigqueueinfo(tgid, pid, sig, );
 }
 
@@ -3380,9 +3387,9 @@ COMPAT_SYSCALL_DEFINE4(rt_tgsigqueueinfo,
struct compat_siginfo __user *, uinfo)
 {
siginfo_t info;
-
-   if (copy_siginfo_from_user32(, uinfo))
-   return -EFAULT;
+   int ret = copy_siginfo_from_user32(, uinfo);
+   if (unlikely(ret))
+   return ret;
return do_rt_tgsigqueueinfo(tgid, pid, sig, );
 }
 #endif
-- 
2.17.1



[REVIEW][PATCH 2/6] signal: Fail sigqueueinfo if si_signo != sig

2018-09-25 Thread Eric W. Biederman
The kernel needs to validate that the contents of struct siginfo make
sense as siginfo is copied into the kernel, so that the proper union
members can be put in the appropriate locations.  The field si_signo
is a fundamental part of that validation.  As such changing the
contents of si_signo after the validation make no sense and can result
in nonsense values in the kernel.

As such simply fail if someone is silly enough to set si_signo out of
sync with the signal number passed to sigqueueinfo.

I don't expect a problem as glibc's sigqueue implementation sets
"si_signo = sig" and CRIU just returns to the kernel what the kernel
gave to it.

If there is some application that calls sigqueueinfo directly that has
a problem with this added sanity check we can revisit this when we see
what kind of crazy that application is doing.

Signed-off-by: "Eric W. Biederman" 
---
 kernel/signal.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 7b49c31d3fdb..e445b0a63faa 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -3306,7 +3306,8 @@ static int do_rt_sigqueueinfo(pid_t pid, int sig, 
siginfo_t *info)
(task_pid_vnr(current) != pid))
return -EPERM;
 
-   info->si_signo = sig;
+   if (info->si_signo != sig)
+   return -EINVAL;
 
/* POSIX.1b doesn't mention process groups.  */
return kill_proc_info(sig, info, pid);
@@ -3354,7 +3355,8 @@ static int do_rt_tgsigqueueinfo(pid_t tgid, pid_t pid, 
int sig, siginfo_t *info)
(task_pid_vnr(current) != pid))
return -EPERM;
 
-   info->si_signo = sig;
+   if (info->si_signo != sig)
+   return -EINVAL;
 
return do_send_specific(tgid, pid, sig, info);
 }
-- 
2.17.1



[REVIEW][PATCH 1/6] signal/sparc: Move EMT_TAGOVF into the generic siginfo.h

2018-09-25 Thread Eric W. Biederman
When moving all of the architectures specific si_codes into
siginfo.h, I apparently overlooked EMT_TAGOVF.  Move it now.

Remove the now redundant test in siginfo_layout for SIGEMT
as now NSIGEMT is always defined.

Signed-off-by: "Eric W. Biederman" 
---
 arch/sparc/include/uapi/asm/siginfo.h | 6 --
 include/uapi/asm-generic/siginfo.h| 6 ++
 kernel/signal.c   | 2 +-
 3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/sparc/include/uapi/asm/siginfo.h 
b/arch/sparc/include/uapi/asm/siginfo.h
index e7049550ac82..6c820ea0813b 100644
--- a/arch/sparc/include/uapi/asm/siginfo.h
+++ b/arch/sparc/include/uapi/asm/siginfo.h
@@ -17,10 +17,4 @@
 
 #define SI_NOINFO  32767   /* no information in siginfo_t */
 
-/*
- * SIGEMT si_codes
- */
-#define EMT_TAGOVF 1   /* tag overflow */
-#define NSIGEMT1
-
 #endif /* _UAPI__SPARC_SIGINFO_H */
diff --git a/include/uapi/asm-generic/siginfo.h 
b/include/uapi/asm-generic/siginfo.h
index 80e2a7227205..1811b8101937 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -285,6 +285,12 @@ typedef struct siginfo {
 #define SYS_SECCOMP1   /* seccomp triggered */
 #define NSIGSYS1
 
+/*
+ * SIGEMT si_codes
+ */
+#define EMT_TAGOVF 1   /* tag overflow */
+#define NSIGEMT1
+
 /*
  * sigevent definitions
  * 
diff --git a/kernel/signal.c b/kernel/signal.c
index e16278710b36..7b49c31d3fdb 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2856,7 +2856,7 @@ enum siginfo_layout siginfo_layout(int sig, int si_code)
[SIGSEGV] = { NSIGSEGV, SIL_FAULT },
[SIGBUS]  = { NSIGBUS,  SIL_FAULT },
[SIGTRAP] = { NSIGTRAP, SIL_FAULT },
-#if defined(SIGEMT) && defined(NSIGEMT)
+#if defined(SIGEMT)
[SIGEMT]  = { NSIGEMT,  SIL_FAULT },
 #endif
[SIGCHLD] = { NSIGCHLD, SIL_CHLD },
-- 
2.17.1



[REVIEW][PATCH 0/6] signal: Shrinking the kernel's siginfo structure

2018-09-25 Thread Eric W. Biederman


I am posting these patches for review in the hopes that if I have missed
something someone else might catch it.  If no issues turn up I intend to
merge these patches through my siginfo branch.

This set of patches make a few small ABI changes to areas of the linux
ABI where as far as I can determine no one cares.

- sigqueueinfo in all it's variants now fails if si_signo != sig
  instead of quietly changing si_signo.

  The deep issue is that the change happens after the code has already
  verified in copy_siginfom_from_user that the si_signo and si_code
  combination is meaningful.

- copy_siginfo_from_user now fails if the trailing bytes of siginfo
  are not 0, when the signal number and si_code combination is not
  recognized.

  If the si_signo and si_code combination is recognized we know any
  trailing bytes are meaningless as the meaningful bytes are in siginfo.

  This check is to allow people to define new siginfo union members and
  that will fail on older kernels.  The check makes it as safe as
  possible to have a kernel_siginfo that is smaller than the ABI defined
  siginfo that the kernel reads from and writes to userspace.

The net effect of this change is a kernel that only uses 48 bytes for
siginfo internally when the ABI defines siginfo to be 128 bytes.

The first EMT_TAGOVF change is not necesssary to strinking siginfo.

Eric W. Biederman (6):
  signal/sparc: Move EMT_TAGOVF into the generic siginfo.h
  signal: Fail sigqueueinfo if si_signo != sig
  signal: Remove the need for __ARCH_SI_PREABLE_SIZE and SI_PAD_SIZE
  signal: Introduce copy_siginfo_from_user and use it's return value
  signal: Distinguish between kernel_siginfo and siginfo
  signal: Use a smaller struct siginfo in the kernel

 arch/alpha/include/uapi/asm/siginfo.h   |   1 -
 arch/arm64/include/uapi/asm/Kbuild  |   1 +
 arch/arm64/include/uapi/asm/siginfo.h   |  24 ---
 arch/ia64/include/uapi/asm/siginfo.h|   2 -
 arch/mips/include/uapi/asm/siginfo.h|  11 --
 arch/parisc/include/uapi/asm/Kbuild |   1 +
 arch/parisc/include/uapi/asm/siginfo.h  |  11 --
 arch/powerpc/include/uapi/asm/Kbuild|   1 +
 arch/powerpc/include/uapi/asm/siginfo.h |  18 ---
 arch/riscv/include/uapi/asm/Kbuild  |   1 +
 arch/riscv/include/uapi/asm/siginfo.h   |  24 ---
 arch/s390/include/uapi/asm/Kbuild   |   1 +
 arch/s390/include/uapi/asm/siginfo.h|  17 ---
 arch/sparc/include/uapi/asm/siginfo.h   |   7 -
 arch/x86/include/asm/compat.h   |   2 +-
 arch/x86/include/uapi/asm/siginfo.h |   2 -
 drivers/usb/core/devio.c|   4 +-
 fs/binfmt_elf.c |   6 +-
 fs/coredump.c   |   2 +-
 fs/fcntl.c  |   2 +-
 fs/signalfd.c   |   6 +-
 include/linux/binfmts.h |   2 +-
 include/linux/compat.h  |   4 +-
 include/linux/coredump.h|   4 +-
 include/linux/lsm_hooks.h   |   4 +-
 include/linux/posix-timers.h|   2 +-
 include/linux/ptrace.h  |   2 +-
 include/linux/sched.h   |   2 +-
 include/linux/sched/signal.h|  18 +--
 include/linux/security.h|   6 +-
 include/linux/signal.h  |  16 +-
 include/linux/signal_types.h|   8 +-
 include/trace/events/signal.h   |   4 +-
 include/uapi/asm-generic/siginfo.h  | 193 ---
 ipc/mqueue.c|   2 +-
 kernel/ptrace.c |  22 ++-
 kernel/seccomp.c|   6 +-
 kernel/signal.c | 263 ++--
 kernel/time/posix-timers.c  |   2 +-
 security/apparmor/lsm.c |   2 +-
 security/security.c |   2 +-
 security/selinux/hooks.c|   2 +-
 security/smack/smack_lsm.c  |   2 +-
 43 files changed, 356 insertions(+), 356 deletions(-)

Eric



Re: [RFC 00/20] ns: Introduce Time Namespace

2018-09-24 Thread Eric W. Biederman
Andrey Vagin  writes:

> On Fri, Sep 21, 2018 at 02:27:29PM +0200, Eric W. Biederman wrote:
>> Dmitry Safonov  writes:
>> 
>> > Discussions around time virtualization are there for a long time.
>> > The first attempt to implement time namespace was in 2006 by Jeff Dike.
>> > From that time, the topic appears on and off in various discussions.
>> >
>> > There are two main use cases for time namespaces:
>> > 1. change date and time inside a container;
>> > 2. adjust clocks for a container restored from a checkpoint.
>> >
>> > “It seems like this might be one of the last major obstacles keeping
>> > migration from being used in production systems, given that not all
>> > containers and connections can be migrated as long as a time dependency
>> > is capable of messing it up.” (by github.com/dav-ell)
>> >
>> > The kernel provides access to several clocks: CLOCK_REALTIME,
>> > CLOCK_MONOTONIC, CLOCK_BOOTTIME. Last two clocks are monotonous, but the
>> > start points for them are not defined and are different for each running
>> > system. When a container is migrated from one node to another, all
>> > clocks have to be restored into consistent states; in other words, they
>> > have to continue running from the same points where they have been
>> > dumped.
>> >
>> > The main idea behind this patch set is adding per-namespace offsets for
>> > system clocks. When a process in a non-root time namespace requests
>> > time of a clock, a namespace offset is added to the current value of
>> > this clock on a host and the sum is returned.
>> >
>> > All offsets are placed on a separate page, this allows up to map it as 
>> > part of vvar into user processes and use offsets from vdso calls.
>> >
>> > Now offsets are implemented for CLOCK_MONOTONIC and CLOCK_BOOTTIME
>> > clocks.
>> >
>> > Questions to discuss:
>> >
>> > * Clone flags exhaustion. Currently there is only one unused clone flag
>> > bit left, and it may be worth to use it to extend arguments of the clone
>> > system call.
>> >
>> > * Realtime clock implementation details:
>> >   Is having a simple offset enough?
>> >   What to do when date and time is changed on the host?
>> >   Is there a need to adjust vfs modification and creation times? 
>> >   Implementation for adjtime() syscall.
>> 
>> Overall I support this effort.  In my quick skim this code looked good.
>
> Hi Eric,
>
> Thank you for the feedback.
>
>> 
>> My feeling is that we need to be able to support running ntpd and
>> support one namespace doing googles smoothing of leap seconds while
>> another namespace takes the leap second.
>> 
>> What I was imagining when I was last thinking about this was one
>> instance of struct timekeeper aka tk_core per time namespace.  That
>> structure already keeps offsets for all of the various clocks from
>> the kerne internal time sources.  What would be needed would be to
>> pass in an appropriate time namespace pointer.
>> 
>> I could be completely wrong as I have not take the time to completely
>> trace through the code.  Have you looked at pushing the time namespace
>> down as far as tk_core?
>> 
>> What I think would be the big advantage (besides ntp working) is that
>> the bulk of the code could be reused.  Allowing testing of the kernel's
>> time code by setting up a new time namespace.  So a person in production
>> could setup a time namespace with the time set ahead a little  bit and
>> be able to verify that the kernel handles the upcoming leap second
>> properly.
>>
>
> It is an interesting idea, but I have a few questions:
>
> 1. Does it mean that timekeeping_update() will be called for each
> namespace? This functions is called periodically, it updates times on the
> timekeeper structure, updates vsyscall_gtod_data, etc. What will be an
> overhead of this?

I don't know if periodically is a proper characterization.  There may be
a code path that does that.  But from what I can see timekeeping_update
is the guts of settimeofday (and a few related functions).

So it appears to make sense for timekeeping_update to be per namespace.

Hmm.  Looking at what is updated in the vsyscall_gtod_data it does
look like you would have to periodically update things, but I don't know
big that period would be.  As long as the period is reasonably large,
or the time namespaces were sufficiently deschronized it should not
be a problem.  But that is the class of problem that could make
my ide

[REVIEW][PATCH 1/3] signal/unicore32: Use send_sig_fault where appropriate

2018-09-24 Thread Eric W. Biederman
Signed-off-by: "Eric W. Biederman" 
---
 arch/unicore32/kernel/fpu-ucf64.c | 12 +++-
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/arch/unicore32/kernel/fpu-ucf64.c 
b/arch/unicore32/kernel/fpu-ucf64.c
index 8594b168f25e..fc5dad32a982 100644
--- a/arch/unicore32/kernel/fpu-ucf64.c
+++ b/arch/unicore32/kernel/fpu-ucf64.c
@@ -54,14 +54,6 @@
  */
 void ucf64_raise_sigfpe(struct pt_regs *regs)
 {
-   siginfo_t info;
-
-   clear_siginfo();
-
-   info.si_signo = SIGFPE;
-   info.si_code = FPE_FLTUNK;
-   info.si_addr = (void __user *)(instruction_pointer(regs) - 4);
-
/*
 * This is the same as NWFPE, because it's not clear what
 * this is used for
@@ -69,7 +61,9 @@ void ucf64_raise_sigfpe(struct pt_regs *regs)
current->thread.error_code = 0;
current->thread.trap_no = 6;
 
-   send_sig_info(SIGFPE, , current);
+   send_sig_fault(SIGFPE, FPE_FLTUNK,
+  (void __user *)(instruction_pointer(regs) - 4),
+  current);
 }
 
 /*
-- 
2.17.1



[REVIEW][PATCH 3/3] signal/unicore32: Use force_sig_fault where appropriate

2018-09-24 Thread Eric W. Biederman
Signed-off-by: "Eric W. Biederman" 
---
 arch/unicore32/mm/fault.c | 9 +
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c
index a942776110a0..b9a3a50644c1 100644
--- a/arch/unicore32/mm/fault.c
+++ b/arch/unicore32/mm/fault.c
@@ -120,17 +120,10 @@ static void __do_user_fault(struct task_struct *tsk, 
unsigned long addr,
unsigned int fsr, unsigned int sig, int code,
struct pt_regs *regs)
 {
-   struct siginfo si;
-
tsk->thread.address = addr;
tsk->thread.error_code = fsr;
tsk->thread.trap_no = 14;
-   clear_siginfo();
-   si.si_signo = sig;
-   si.si_errno = 0;
-   si.si_code = code;
-   si.si_addr = (void __user *)addr;
-   force_sig_info(sig, , tsk);
+   force_sig_fault(sig, code, (void __user *)addr, tsk);
 }
 
 void do_bad_area(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
-- 
2.17.1



  1   2   3   4   5   6   7   8   9   10   >