Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

2019-05-04 Thread Serge E. Hallyn
On Mon, Apr 29, 2019 at 07:31:43PM +0200, Enrico Weigelt, metux IT consult 
wrote:

Argh.  Sorry, it seems your emails aren't making it into my inbox, only
my once-in-a-long-while-checked lkml folder.  Sorry again.

> On 29.04.19 17:49, Serge E. Hallyn wrote:
> 
> >> * all users are equal - no root at all. the only exception is the
> >>   initial process, which gets the kernel devices mounted into his
> >>   namespace.
> >
> > This does not match my understanding, but I'm most likely wrong.  (I thought
> > there was an actual 'host owner' uid, which mostly is only used for initial
> > process, but is basically root with a different name, and used far less.  No
> > uid transitions without factotum so that it *looked* like no root user).
>
> Not quite (IIRC). The hostowner is just the user who booted the machine,
> the initial process runs under this uname and gets the kernel devices
> bound into his namespace, so he can start fileservers on them.
> 
> Also the caphash device (the one you can create capabilities, eg. for
> user change, which then can be used via capuse device) can only be
> opened once - usually by the host factotum.
> 
> There really is no such thing like root user.
> 
> >> What I'd like to achieve on Linux:
> >> * unprivileged users can have their own mount namespace, where they
> >>   can mount at will (maybe just 9P).
> >
> > No problem, you can do that now.
>
> But only within separate userns, IMHO. (and, when I last tried, plain

"Only within a separate userns" - but why does that matter?  It's just
a different uid mapping.

> users couldn't directly create their userns).

Plain users can definitely create their own userns, directly.  On some
distros there is a kernel knob like

# cat /proc/sys/kernel/unprivileged_userns_clone
1

which, when set to 0, prevents unprivileged users from creating a user
namespace.
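
Whether unprivileged user namespace creation is allowed on a given box
can also be checked from code.  The following is a minimal sketch, not
part of nsexec: the function name probe_unprivileged_userns() is made up
for illustration, and the probe runs in a throwaway child so the caller
keeps its own namespaces.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: try unshare(CLONE_NEWUSER) in a child process.
 * Returns 0 if the user namespace could be created, the (positive) errno
 * value reported by unshare() in the child otherwise, or -1 if the probe
 * itself could not be run.
 */
int probe_unprivileged_userns(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (unshare(CLONE_NEWUSER) == 0)
            _exit(0);
        /* errno values such as EPERM fit into an 8-bit exit status */
        _exit(errno & 0xff);
    }
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

A return value of EPERM would be consistent with the knob above being 0,
though other restrictions can produce the same error.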

> >> * but they still appear as the same normal users to the rest of the
> >>   system
> > 
> > No problem, you can do that now.
> 
> How exactly ? Did I miss something vital ?

By unsharing your namespace and writing the new uid mapping.  You can of
course only map your own uid without using any privileged helpers at all.
And it requires help from a second process, which does the writing to
the uid map file after the first process has unshared.  But you can do it.
For instance, using the nsexec.c at

https://github.com/fcicq/nsexec

You can:

Terminal 1:
shallyn@stp:~/src/nsexec$ ./nsexec -UWm
about to unshare with 1002
Press any key to exec (I am 31157)

Now in terminal 2:

Terminal 2:
shallyn@stp:~/src/nsexec$ echo "0 1000 1" > /proc/31157/uid_map
shallyn@stp:~/src/nsexec$ echo deny > /proc/31157/setgroups
shallyn@stp:~/src/nsexec$ echo "0 1000 1" > /proc/31157/gid_map

Then back in terminal 1:
# id
uid=0(root) gid=0(root) groups=0(root),65534(nogroup)
# mount --bind /etc /mnt
# echo $?
0
# ls /root
ls: cannot open directory '/root': Permission denied

To the rest of the system you look like uid 1000.  You could have
chosen uid 1000 in your new namespace, but then you couldn't mount.
Of course you can nest user namespaces so you could create another,
this time mapping uid 1000 so you look like 1000 to yourself as well.
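
The two-terminal dance above can be sketched in one C program, with the
parent playing the role of terminal 2.  This is an illustrative sketch
only, not what nsexec does internally: the function names are made up,
and the parent maps whatever geteuid()/getegid() return rather than a
hardcoded 1000.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write a short string to /proc/<pid>/<name>; returns 0 or -errno. */
static int write_proc(pid_t pid, const char *name, const char *text)
{
    char path[64];
    int fd, err;

    snprintf(path, sizeof(path), "/proc/%d/%s", (int)pid, name);
    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -errno;
    err = write(fd, text, strlen(text)) < 0 ? -errno : 0;
    close(fd);
    return err;
}

/*
 * Hypothetical helper.  Returns 0 if the child ended up as uid 0 inside
 * its new user namespace, a small positive step number if something
 * failed along the way, or -errno if the setup itself failed.
 */
int become_root_in_own_userns(void)
{
    int ready[2], mapped[2], status, err;
    char map[32], c = 0;
    uid_t uid = geteuid();
    gid_t gid = getegid();
    pid_t pid;

    if (pipe(ready) < 0)
        return -errno;
    if (pipe(mapped) < 0) {
        err = -errno;
        close(ready[0]); close(ready[1]);
        return err;
    }
    pid = fork();
    if (pid < 0) {
        err = -errno;
        close(ready[0]); close(ready[1]);
        close(mapped[0]); close(mapped[1]);
        return err;
    }
    if (pid == 0) {
        close(ready[0]);
        close(mapped[1]);
        if (unshare(CLONE_NEWUSER | CLONE_NEWNS) < 0)
            _exit(1);
        if (write(ready[1], &c, 1) != 1)    /* tell parent: unshared */
            _exit(2);
        if (read(mapped[0], &c, 1) != 1)    /* wait until maps are written */
            _exit(3);
        _exit(getuid() == 0 ? 0 : 4);
    }
    close(ready[1]);
    close(mapped[0]);
    if (read(ready[0], &c, 1) == 1) {
        /* same order as in terminal 2: uid_map, setgroups, gid_map */
        snprintf(map, sizeof(map), "0 %u 1\n", (unsigned)uid);
        err = write_proc(pid, "uid_map", map);
        if (!err)
            err = write_proc(pid, "setgroups", "deny");
        snprintf(map, sizeof(map), "0 %u 1\n", (unsigned)gid);
        if (!err)
            err = write_proc(pid, "gid_map", map);
        if (!err && write(mapped[1], &c, 1) != 1)
            err = -EPIPE;
    }
    close(ready[0]);
    close(mapped[1]);   /* on failure the child sees EOF and exits with 3 */
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -ECHILD;
    return WEXITSTATUS(status);
}
```

The child does not mount anything here; in a real program the mount
calls would go where the child currently just checks getuid().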

> >> * 9p programs (compiled for Linux ABI) can run parallel to traditional
> >>   linux programs within the same user and sessions (eg. from a terminal,
> >>   i can call both the same way)
> >> * namespace modifications affect both equally (eg. I could run ff in
> >>   an own ns)
> > 
> > affect both of what equally?
> 
> mount / bind.
> 
> > That's exactly what user namespaces are for.  You can create a new
> > user namespace, using no privilege at all, with your current uid (i.e.
> > 1000) mapped to whatever uid you like; if you pick 0, then you can
> > unshare all the namespaces you like.
> 
> But I don't like to appear as 'root' in here. I just wanna have my own
> filesystem namespace, nothing more.

Right.  As you know setuid makes that impossible, unfortunately.  That's
where no_new_privs shows promise.
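
For reference, no_new_privs is a per-process bit set through prctl(2).
Once set it cannot be cleared, it is inherited by children, and execve()
will no longer grant privileges through setuid bits.  A minimal sketch
(the function name is made up for illustration):

```c
#include <errno.h>
#include <sys/prctl.h>

/*
 * Hypothetical helper: set no_new_privs for the calling process and read
 * the bit back.  Returns 1 if the bit is set, or -errno on failure.
 */
int enable_no_new_privs(void)
{
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -errno;
    return prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
}
```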

> > Once you unshare mnt_ns, you can mount to your
> > heart's content.  To other processes on the host, your process is
> > uid 1000.
> 
> Is that the uid, I'm appearing to filesystems ?

Yes.

> > Regarding factotum, I agree that with the pidfd work going on etc, it's
> > getting more and more tempting to attempt a switch to that.  Looking back
> > at my folder, I see you posted a kernel patch for it.  I had done the same
> > long ago.  Happy to work with you again on that, and put a simple daemon
> > into shadow package, if util-linux isn't deemed the far better place.
> 
> Yeah :)
> 
> 
> --mtx
> 
> -- 
> Enrico Weigelt, metux IT consult
> Free software and Linux embedded engineering
> i...@metux.net -- +49-151-27565287


Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

2019-04-30 Thread Florian Weimer
* Linus Torvalds:

> On Tue, Apr 30, 2019 at 9:19 AM Linus Torvalds
>  wrote:
>>
>> Of course, if you *don't* need the exact vfork() semantics, clone
>> itself actually very much supports a callback model with a separate
>> stack. You can basically do this:
>>
>>  - allocate new stack for the child
>>  - in trivial asm wrapper, do:
>>     - push the callback address on the child stack
>>     - clone(CLONE_VFORK|CLONE_VM|SIGCHLD, chld_stack, NULL, NULL, NULL)
>>     - "ret"
>>  - free new stack
>>
>> where the "ret" in the child will just go to the callback, while the
>> parent (eventually) just returns from the trivial wrapper and frees
>> the new stack (which by definition is no longer used, since the child
>> has exited or execve'd).
>
> In fact, Florian, maybe this is the solution to your "I want to use
> vfork for posix_spawn(), but I don't know if I can trust it" problem.
>
> Just use clone() directly. On WSL it will presumably just fail, and
> you can then fall back on doing the slow stupid
> fork+pipes-to-communicate.

We already use clone.  I don't know why.  We should add a comment that
provides the reason.

> On valgrind, I don't know what will happen. Maybe it will just do an
> unchecked posix_spawn() because valgrind doesn't catch it?

I think what happens with these emulators is that they use fork (no shared
address space) but suspend the parent thread.  clone with CLONE_VFORK
definitely does not fail.  That mostly works, except that you do not get
back the error code from the execve.  Instead, the process is considered
launched, and the caller collects the exit status from the _exit after
the failed execve.

> Of course, if you *don't* need the exact vfork() semantics, clone
> itself actually very much supports a callback model with a separate
> stack. You can basically do this:
>
>  - allocate new stack for the child
>  - in trivial asm wrapper, do:
>     - push the callback address on the child stack
>     - clone(CLONE_VFORK|CLONE_VM|SIGCHLD, chld_stack, NULL, NULL, NULL)
>     - "ret"
>  - free new stack
>
> where the "ret" in the child will just go to the callback, while the
> parent (eventually) just returns from the trivial wrapper and frees
> the new stack (which by definition is no longer used, since the child
> has exited or execve'd).
> 
> So you can most definitely create a "vfork_with_child_callback()" with
> clone, and it would arguably be a much superior interface to vfork()
> anyway (maybe you'd like to pass in some arguments to the callback too
> - add more stack setup for the child as needed), but it wouldn't be
> the right solution for programs that just want to use the standard BSD
> vfork() model.

As far as we understand the situation, we believe that we absolutely
must block all signals for both the parent thread and the new
subprocess.  Signals can be unblocked in the subprocess, but only after
setting their handlers to SIG_DFL or SIG_IGN.  (Parent signal handlers
cannot run in the subprocess because application-supplied signal
handlers generally do not expect to run with a corrupt TCB—or a
different PID.)

At that point, I wonder if we can just skip the stack setup for the
CLONE_VFORK case and reuse the existing stack.

Thanks,
Florian


Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

2019-04-30 Thread Linus Torvalds
On Tue, Apr 30, 2019 at 5:39 AM Oleg Nesterov  wrote:
>
> Yes, but I am wondering if man vfork should clarify what "child terminates"
> actually means. I mean, the child can do clone(CLONE_THREAD) + sys_exit(),
> this will wake the parent thread up before the child process exits or execs.

That falls solidly into the "give people rope" category.

If the vfork() child wants to mess with the parent, it has many easier
ways to do it than create more threads.

As mentioned, the real problem with vfork() tends to be that the child
unintentionally messes with the parent because it just gets the stack
sharing wrong. No need to add intention there.

> I see nothing wrong, but I was always curious whether it was designed this
> way on purpose or not.

Oh, it's definitely on purpose. Trying to do some nested usage count
would be horrendously complex, and even a trivial "don't allow any
other clone() calls if we already have a vfork completion pending" is
just unnecessary logic.

Because at least in *theory*, there's actually nothing horribly wrong
with allowing a thread to be created during the vfork(). I don't see
the _point_, but it's not conceptually something that couldn't work
(you'd need a separate thread stack etc, but that's normal clone()).

So no, there's no safety or bogus "you can't do that". If you want to
play games after vfork(), go wild.

   Linus


Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

2019-04-30 Thread Linus Torvalds
On Tue, Apr 30, 2019 at 9:19 AM Linus Torvalds
 wrote:
>
> Of course, if you *don't* need the exact vfork() semantics, clone
> itself actually very much supports a callback model with a separate
> stack. You can basically do this:
>
>  - allocate new stack for the child
>  - in trivial asm wrapper, do:
>     - push the callback address on the child stack
>     - clone(CLONE_VFORK|CLONE_VM|SIGCHLD, chld_stack, NULL, NULL, NULL)
>     - "ret"
>  - free new stack
>
> where the "ret" in the child will just go to the callback, while the
> parent (eventually) just returns from the trivial wrapper and frees
> the new stack (which by definition is no longer used, since the child
> has exited or execve'd).

In fact, Florian, maybe this is the solution to your "I want to use
vfork for posix_spawn(), but I don't know if I can trust it" problem.

Just use clone() directly. On WSL it will presumably just fail, and
you can then fall back on doing the slow stupid
fork+pipes-to-communicate.

On valgrind, I don't know what will happen. Maybe it will just do an
unchecked posix_spawn() because valgrind doesn't catch it?

  Linus


Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

2019-04-30 Thread Linus Torvalds
On Tue, Apr 30, 2019 at 1:21 AM Florian Weimer  wrote:
>
> > (In fact, if I recall correctly, the _reason_ we have an explicit
> > 'vfork()' entry point rather than using clone() with magic parameters
> > was that the lack of arguments meant that you didn't have to
> > save/restore any registers in user space, which made the whole stack
> > issue simpler. But it's been two decades, so my memory is bitrotting).
>
> That's an interesting point.  Using a callback-style interface avoids
> that because you never need to restore the registers in the new
> subprocess.  It's still appropriate to use an assembler implementation,
> I think, because it will be more obviously correct.

I agree that a callback interface would have been a whole lot more
obvious and less prone to subtle problems.

But if you want vfork() because the programs you want to build use it,
that's the interface you need.

Of course, if you *don't* need the exact vfork() semantics, clone
itself actually very much supports a callback model with a separate
stack. You can basically do this:

 - allocate new stack for the child
 - in trivial asm wrapper, do:
    - push the callback address on the child stack
    - clone(CLONE_VFORK|CLONE_VM|SIGCHLD, chld_stack, NULL, NULL, NULL)
    - "ret"
 - free new stack

where the "ret" in the child will just go to the callback, while the
parent (eventually) just returns from the trivial wrapper and frees
the new stack (which by definition is no longer used, since the child
has exited or execve'd).

So you can most definitely create a "vfork_with_child_callback()" with
clone, and it would arguably be a much superior interface to vfork()
anyway (maybe you'd like to pass in some arguments to the callback too
- add more stack setup for the child as needed), but it wouldn't be
the right solution for programs that just want to use the standard BSD
vfork() model.

> vfork is also more benign from a memory accounting perspective.  In some
> environments, it's not possible to call fork from a large process
> because the accounting assumes (conservatively) that the new process
> will dirty a lot of its private memory.

Indeed.

 Linus


Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

2019-04-30 Thread Oleg Nesterov
On 04/29, Linus Torvalds wrote:
>
> Linux vfork() is very much a real vfork(). What do you mean?

Yes, but I am wondering if man vfork should clarify what "child terminates"
actually means. I mean, the child can do clone(CLONE_THREAD) + sys_exit(),
this will wake the parent thread up before the child process exits or execs.

I see nothing wrong, but I was always curious whether it was designed this
way on purpose or not.

Oleg.



Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

2019-04-30 Thread Florian Weimer
* Linus Torvalds:

> Note that vfork() is "exciting" for the compiler in much the same way
> "setjmp/longjmp()" is, because of the shared stack use in the child
> and the parent. It is *very* easy to get this wrong and cause massive
> and subtle memory corruption issues because the parent returns to
> something that has been messed up by the child.

Just using a wrapper around vfork is enough for that, if the return
address is saved on the stack.  It's surprisingly hard to write a test
case for that, but the corruption is definitely there.

> (In fact, if I recall correctly, the _reason_ we have an explicit
> 'vfork()' entry point rather than using clone() with magic parameters
> was that the lack of arguments meant that you didn't have to
> save/restore any registers in user space, which made the whole stack
> issue simpler. But it's been two decades, so my memory is bitrotting).

That's an interesting point.  Using a callback-style interface avoids
that because you never need to restore the registers in the new
subprocess.  It's still appropriate to use an assembler implementation,
I think, because it will be more obviously correct.

> Also, particularly if you have a big address space, vfork()+execve()
> can be quite a bit faster than fork()+execve(). Linux fork() is pretty
> efficient, but if you have gigabytes of VM space to copy, it's going
> to take time even if you do it fairly well.

vfork is also more benign from a memory accounting perspective.  In some
environments, it's not possible to call fork from a large process
because the accounting assumes (conservatively) that the new process
will dirty a lot of its private memory.
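
For application code the usual way to sidestep the fork() accounting
problem is to go through posix_spawn() and let libc pick the mechanism.
A minimal sketch, with a made-up helper name and /bin/sh assumed to
exist:

```c
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/*
 * Hypothetical helper: run "sh -c command" through posix_spawn().
 * Returns the command's exit status, or -1 if it could not be spawned
 * or did not exit normally.
 */
int run_shell(const char *command)
{
    char *const argv[] = { "sh", "-c", (char *)command, NULL };
    pid_t pid;
    int status;

    /* posix_spawn() returns an error number instead of setting errno */
    if (posix_spawn(&pid, "/bin/sh", NULL, NULL, argv, environ) != 0)
        return -1;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```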

Thanks,
Florian


Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

2019-04-30 Thread Florian Weimer
* Linus Torvalds:

> On Mon, Apr 29, 2019 at 1:38 PM Florian Weimer  wrote:
>>
>> In Linux-as-the-ABI (as opposed to Linux-as-the-implementation), vfork
>> is sometimes implemented as fork, so applications cannot rely on the
>> vfork behavior regarding the stopped parent and the shared address
>> space.
>
> What broken library does that?
>
> Sure, we didn't have a proper vfork() long long long ago. But that
> predates both git and BK, so it's some time in the 90's. We've had a
> proper vfork() *forever*.

It's not so much about libraries, it's alternative implementations of
the system call interface: valgrind, qemu-user and WSL.  For valgrind
and qemu-user, it's about cloning the internal data structures, so that
the subprocess does not clobber what's in the parent process (which may
have multiple threads and may not be fully blocked due to vfork).  For
WSL, who knows what's going on there.

>> In fact, it would be nice to have a flag we can check in the posix_spawn
>> implementation, so that we can support vfork-as-fork without any run
>> time cost to native Linux.
>
> No. Just make a bug-report to whatever broken library you use. What's
> the point of having a library that can't even get vfork() right? Why
> would you want to have a flag to say "vfork is broken"?

It's apparently quite difficult to fix valgrind and qemu-user.  WSL is
apparently not given the resources it needs, yet a surprising number of
people see it as a viable replacement and report what are essentially
vfork-related bugs.

If I had the flag, I could at least fix posix_spawn in glibc to consult
it before assuming that vfork shares address space.  (The sharing allows
straightforward reporting of the vfork error code, without resorting to
pipes or creating a MAP_SHARED mapping.)  For obvious reasons, I do not
want to apply the workaround unconditionally.
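
To illustrate what "straightforward reporting" means here, a simplified
sketch using plain vfork().  This is not the glibc implementation, the
function name is made up, and POSIX leaves writes like this from a vfork
child undefined; it only works where vfork really shares the address
space, which is exactly the property in question.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: start path with argv.  Returns 0 and stores the
 * pid on success, or the errno value of the failed vfork()/execv().
 */
int spawn_report_errno(const char *path, char *const argv[], pid_t *out_pid)
{
    volatile int exec_errno = 0;    /* written by the child, read by the parent */
    pid_t pid = vfork();

    if (pid < 0)
        return errno;
    if (pid == 0) {
        execv(path, argv);
        /* only reached if execv() failed */
        exec_errno = errno;
        _exit(127);
    }
    /* with a real vfork() the child has exec'd or exited by now */
    if (exec_errno != 0) {
        waitpid(pid, NULL, 0);      /* reap the failed child */
        return exec_errno;
    }
    *out_pid = pid;
    return 0;
}
```

Under a vfork-as-fork implementation the child's write never reaches the
parent, so a failed execv() would be reported as success and only show
up later as exit status 127.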

Thanks,
Florian


Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

2019-04-29 Thread Linus Torvalds
On Mon, Apr 29, 2019 at 5:39 PM Jann Horn  wrote:
>
> ... uuuh, whoops. Turns out I don't know what I'm talking about.

Well, apparently there's some odd libc issue according to Florian, so
there *might* be something to it.

> Nevermind. For some reason I thought vfork() was just
> CLONE_VFORK|SIGCHLD, but now I see I got that completely wrong.

Well, inside the kernel, that's actually *very* close to what vfork() is:

  SYSCALL_DEFINE0(vfork)
  {
          return _do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, 0,
                          0, NULL, NULL, 0);
  }

but that's just an internal implementation detail. It's a real vfork()
and should act as the traditional BSD "share everything" without any
address space copying. The CLONE_VFORK flag is what does the "wait for
child to exit or execve" magic.
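
Both effects can be observed from user space.  A small sketch (the
function name is made up): the child's store is visible to the parent
because of CLONE_VM, and the parent reliably sees it because CLONE_VFORK
keeps it suspended until the child calls _exit().

```c
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical probe.  Returns 1 if a store made by the vfork() child is
 * visible in the parent, 0 if it is not (for instance when vfork is
 * implemented as fork), or -1 if vfork() failed.
 */
int vfork_shares_address_space(void)
{
    volatile int touched = 0;
    int seen;
    pid_t pid = vfork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        touched = 1;    /* lands in the parent's memory under CLONE_VM */
        _exit(0);
    }
    seen = touched;
    waitpid(pid, NULL, 0);
    return seen;
}
```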

Note that vfork() is "exciting" for the compiler in much the same way
"setjmp/longjmp()" is, because of the shared stack use in the child
and the parent. It is *very* easy to get this wrong and cause massive
and subtle memory corruption issues because the parent returns to
something that has been messed up by the child.

That may be why some libc might end up just using "fork()", because it
ends up avoiding bugs in user space.

(In fact, if I recall correctly, the _reason_ we have an explicit
'vfork()' entry point rather than using clone() with magic parameters
was that the lack of arguments meant that you didn't have to
save/restore any registers in user space, which made the whole stack
issue simpler. But it's been two decades, so my memory is bitrotting).

Also, particularly if you have a big address space, vfork()+execve()
can be quite a bit faster than fork()+execve(). Linux fork() is pretty
efficient, but if you have gigabytes of VM space to copy, it's going
to take time even if you do it fairly well.

   Linus


Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

2019-04-29 Thread Jann Horn
On Mon, Apr 29, 2019 at 4:21 PM Linus Torvalds
 wrote:
>
> On Mon, Apr 29, 2019 at 12:55 PM Jann Horn  wrote:
> >
> > ... I guess that already has a name, and it's called vfork(). (Well,
> > except that the Linux vfork() isn't a real vfork().)
>
> What?
>
> Linux vfork() is very much a real vfork(). What do you mean?

... uuuh, whoops. Turns out I don't know what I'm talking about.
Nevermind. For some reason I thought vfork() was just
CLONE_VFORK|SIGCHLD, but now I see I got that completely wrong.


Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

2019-04-29 Thread Linus Torvalds
On Mon, Apr 29, 2019 at 1:38 PM Florian Weimer  wrote:
>
> In Linux-as-the-ABI (as opposed to Linux-as-the-implementation), vfork
> is sometimes implemented as fork, so applications cannot rely on the
> vfork behavior regarding the stopped parent and the shared address
> space.

What broken library does that?

Sure, we didn't have a proper vfork() long long long ago. But that
predates both git and BK, so it's some time in the 90's. We've had a
proper vfork() *forever*.

> In fact, it would be nice to have a flag we can check in the posix_spawn
> implementation, so that we can support vfork-as-fork without any run
> time cost to native Linux.

No. Just make a bug-report to whatever broken library you use. What's
the point of having a library that can't even get vfork() right? Why
would you want to have a flag to say "vfork is broken"?

 Linus


Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

2019-04-29 Thread Christian Brauner
On Mon, Apr 29, 2019 at 10:50 PM Florian Weimer  wrote:
>
> * Jann Horn:
>
> >> int clone_temporary(int (*fn)(void *arg), void *arg, pid_t *child_pid,
> >> )
> >>
> >> and then you'd use it like this to fork off a child process:
> >>
> >> int spawn_shell_subprocess_(void *arg) {
> >>   char *cmdline = arg;
> >>   execl("/bin/sh", "sh", "-c", cmdline);
> >>   return -1;
> >> }
> >> pid_t spawn_shell_subprocess(char *cmdline) {
> >>   pid_t child_pid;
> >>   int res = clone_temporary(spawn_shell_subprocess_, cmdline,
> >> &child_pid, [...]);
> >>   if (res == 0) return child_pid;
> >>   return res;
> >> }
> >>
> >> clone_temporary() could be implemented roughly as follows by the libc
> >> (or other userspace code):
> >>
> >> sigset_t sigset, sigset_old;
> >> sigfillset(&sigset);
> >> sigprocmask(SIG_SETMASK, &sigset, &sigset_old);
> >> int child_pid;
> >> int result = 0;
> >> /* starting here, use inline assembly to ensure that no stack
> >> allocations occur */
> >> long child = syscall(__NR_clone,
> >> CLONE_VM|CLONE_CHILD_SETTID|CLONE_CHILD_CLEARTID|SIGCHLD, $RSP -
> >> ABI_STACK_REDZONE_SIZE, NULL, &child_pid, 0);
> >> if (child == -1) { result = -1; goto reset_sigmask; }
> >> if (child == 0) {
> >>   result = fn(arg);
> >>   syscall(__NR_exit, 0);
> >> }
> >> futex(&child_pid, FUTEX_WAIT, child, NULL);
> >> /* end of no-stack-allocations zone */
> >> reset_sigmask:
> >> sigprocmask(SIG_SETMASK, &sigset_old, NULL);
> >> return result;
> >
> > ... I guess that already has a name, and it's called vfork(). (Well,
> > except that the Linux vfork() isn't a real vfork().)
> >
> > So I guess my question is: Why not vfork()?
>
> Mainly because some users want access to the clone flags, and that's not
> possible with the current userspace wrappers.  The stack setup for the
> undocumented clone wrapper is also cumbersome, and the ia64 peculiarity
> annoying.
>
> For the stack sharing, the callback-based interface looks like the
> absolutely right thing to do to me.  It enforces the notion that you can
> safely return on the child path from a function calling vfork.
>
> > And if vfork() alone isn't flexible enough, alternatively: How about
> > an API that forks a new child in the same address space, and then
> > allows the parent to invoke arbitrary syscalls in the context of the
> > child?
>
> As long as it's not an eBPF script …

You shouldn't even joke about this (I'm serious.).
I'm very certain there are people who'd think this is a good idea.

>
> > You could also build that in userspace if you wanted, I think - just
> > let the child run an assembly loop that reads registers from a unix
> > seqpacket socket, invokes the syscall instruction, and writes the
> > value of the result register back into the seqpacket socket. As long
> > as you use CLONE_VM, you don't have to worry about moving the pointer
> > targets of syscalls. The user-visible API could look like this:
>
> People already use a variant of this, execve'ing twice.  See
> jspawnhelper.
>
> Thanks,
> Florian


Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

2019-04-29 Thread Florian Weimer
* Jann Horn:

>> int clone_temporary(int (*fn)(void *arg), void *arg, pid_t *child_pid,
>> )
>>
>> and then you'd use it like this to fork off a child process:
>>
>> int spawn_shell_subprocess_(void *arg) {
>>   char *cmdline = arg;
>>   execl("/bin/sh", "sh", "-c", cmdline);
>>   return -1;
>> }
>> pid_t spawn_shell_subprocess(char *cmdline) {
>>   pid_t child_pid;
>>   int res = clone_temporary(spawn_shell_subprocess_, cmdline,
>> &child_pid, [...]);
>>   if (res == 0) return child_pid;
>>   return res;
>> }
>>
>> clone_temporary() could be implemented roughly as follows by the libc
>> (or other userspace code):
>>
>> sigset_t sigset, sigset_old;
>> sigfillset(&sigset);
>> sigprocmask(SIG_SETMASK, &sigset, &sigset_old);
>> int child_pid;
>> int result = 0;
>> /* starting here, use inline assembly to ensure that no stack
>> allocations occur */
>> long child = syscall(__NR_clone,
>> CLONE_VM|CLONE_CHILD_SETTID|CLONE_CHILD_CLEARTID|SIGCHLD, $RSP -
>> ABI_STACK_REDZONE_SIZE, NULL, &child_pid, 0);
>> if (child == -1) { result = -1; goto reset_sigmask; }
>> if (child == 0) {
>>   result = fn(arg);
>>   syscall(__NR_exit, 0);
>> }
>> futex(&child_pid, FUTEX_WAIT, child, NULL);
>> /* end of no-stack-allocations zone */
>> reset_sigmask:
>> sigprocmask(SIG_SETMASK, &sigset_old, NULL);
>> return result;
>
> ... I guess that already has a name, and it's called vfork(). (Well,
> except that the Linux vfork() isn't a real vfork().)
>
> So I guess my question is: Why not vfork()?

Mainly because some users want access to the clone flags, and that's not
possible with the current userspace wrappers.  The stack setup for the
undocumented clone wrapper is also cumbersome, and the ia64 peculiarity
annoying.

For the stack sharing, the callback-based interface looks like the
absolutely right thing to do to me.  It enforces the notion that you can
safely return on the child path from a function calling vfork.

> And if vfork() alone isn't flexible enough, alternatively: How about
> an API that forks a new child in the same address space, and then
> allows the parent to invoke arbitrary syscalls in the context of the
> child?

As long as it's not an eBPF script …

> You could also build that in userspace if you wanted, I think - just
> let the child run an assembly loop that reads registers from a unix
> seqpacket socket, invokes the syscall instruction, and writes the
> value of the result register back into the seqpacket socket. As long
> as you use CLONE_VM, you don't have to worry about moving the pointer
> targets of syscalls. The user-visible API could look like this:

People already use a variant of this, execve'ing twice.  See
jspawnhelper.

Thanks,
Florian


Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

2019-04-29 Thread Christian Brauner
On Mon, Apr 29, 2019 at 10:38 PM Florian Weimer  wrote:
>
> * Linus Torvalds:
>
> > On Mon, Apr 29, 2019 at 12:55 PM Jann Horn  wrote:
> >>
> >> ... I guess that already has a name, and it's called vfork(). (Well,
> >> except that the Linux vfork() isn't a real vfork().)
> >
> > What?
> >
> > Linux vfork() is very much a real vfork(). What do you mean?
>
> In Linux-as-the-ABI (as opposed to Linux-as-the-implementation), vfork
> is sometimes implemented as fork, so applications cannot rely on the
> vfork behavior regarding the stopped parent and the shared address
> space.
>
> In fact, it would be nice to have a flag we can check in the posix_spawn
> implementation, so that we can support vfork-as-fork without any run
> time cost to native Linux.

After the next merge window we'll be out of flags if things go as planned.
To address this problem, Jann and I are currently in the middle of working
on a clone version that we intend to send out for discussion afterwards.
If the proposal is acceptable it would bump the number of available flags
significantly, putting things like this within reach.

Christian


Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

2019-04-29 Thread Florian Weimer
* Linus Torvalds:

> On Mon, Apr 29, 2019 at 12:55 PM Jann Horn  wrote:
>>
>> ... I guess that already has a name, and it's called vfork(). (Well,
>> except that the Linux vfork() isn't a real vfork().)
>
> What?
>
> Linux vfork() is very much a real vfork(). What do you mean?

In Linux-as-the-ABI (as opposed to Linux-as-the-implementation), vfork
is sometimes implemented as fork, so applications cannot rely on the
vfork behavior regarding the stopped parent and the shared address
space.

In fact, it would be nice to have a flag we can check in the posix_spawn
implementation, so that we can support vfork-as-fork without any run
time cost to native Linux.

Thanks,
Florian


Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

2019-04-29 Thread Linus Torvalds
On Mon, Apr 29, 2019 at 12:55 PM Jann Horn  wrote:
>
> ... I guess that already has a name, and it's called vfork(). (Well,
> except that the Linux vfork() isn't a real vfork().)

What?

Linux vfork() is very much a real vfork(). What do you mean?

 Linus


Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

2019-04-29 Thread Jann Horn
On Mon, Apr 29, 2019 at 3:30 PM Jann Horn  wrote:
> On Sat, Apr 20, 2019 at 3:14 AM Kevin Easton  wrote:
> > On Mon, Apr 15, 2019 at 01:29:23PM -0700, Andy Lutomirski wrote:
> > > On Mon, Apr 15, 2019 at 12:59 PM Aleksa Sarai  wrote:
> > > >
> > > > On 2019-04-15, Enrico Weigelt, metux IT consult  wrote:
> > > > > > This patchset makes it possible to retrieve pid file descriptors at
> > > > > > process creation time by introducing the new flag CLONE_PIDFD to the
> > > > > > clone() system call as previously discussed.
> > > > >
> > > > > Sorry, for hijacking this thread, but I'm curious on what things to
> > > > > consider when introducing new CLONE_* flags.
> > > > >
> > > > > The reason I'm asking is:
> > > > >
> > > > > I'm working on implementing plan9-like fs namespaces, where unprivileged
> > > > > processes can change their own namespace at will. For that, certain
> > > > > traditional unix'ish things have to be disabled, most notably suid.
> > > > > As forbidding suid can be helpful in other scenarios, too, I thought
> > > > > about making this its own feature. Doing that switch on clone() seems
> > > > > a nice place for that, IMHO.
> > > >
> > > > Just spit-balling -- is no_new_privs not sufficient for this usecase?
> > > > Not granting privileges such as setuid during execve(2) is the main
> > > > point of that flag.
> > > >
> > >
> > > I would personally *love* it if distros started setting no_new_privs
> > > for basically all processes.  And pidfd actually gets us part of the
> > > way toward a straightforward way to make sudo and su still work in a
> > > no_new_privs world: su could call into a daemon that would spawn the
> > > privileged task, and su would get a (read-only!) pidfd back and then
> > > wait for the fd and exit.  I suppose that, done naively, this might
> > > cause some odd effects with respect to tty handling, but I bet it's
> > > solvable.  I suppose it would be nifty if there were a way for a
> > > process, by mutual agreement, to reparent itself to an unrelated
> > > process.
> > >
> > > Anyway, clone(2) is an enormous mess.  Surely the right solution here
> > > is to have a whole new process creation API that takes a big,
> > > extensible struct as an argument, and supports *at least* the full
> > > abilities of posix_spawn() and ideally covers all the use cases for
> > > fork() + do stuff + exec().  It would be nifty if this API also had a
> > > way to say "add no_new_privs and therefore enable extra functionality
> > > that doesn't work without no_new_privs".  This functionality would
> > > include things like returning a future extra-privileged pidfd that
> > > gives ptrace-like access.
> > >
> > > As basic examples, the improved process creation API should take a
> > > list of dup2() operations to perform, fds to remove the O_CLOEXEC flag
> > > from, fds to close (or, maybe even better, a list of fds to *not*
> > > close), a list of rlimit changes to make, a list of signal changes to
> > > make, the ability to set sid, pgrp, uid, gid (as in
> > > setresuid/setresgid), the ability to do capset() operations, etc.  The
> > > posix_spawn() API, for all that it's rather complicated, covers a
> > > bunch of the basics pretty well.
> >
> > The idea of a system call that takes an infinitely-extendable laundry
> > list of operations to perform in kernel space seems quite inelegant, if
> > only for the error-reporting reason.
> >
> > Instead, I suggest that what you'd want is a way to create a new
> > embryonic process that has no address space and isn't yet schedulable.
> > You then just need other-process-directed variants of all the normal
> > setup functions - so pr_openat(pidfd, dirfd, pathname, flags, mode),
> > pr_sigaction(pidfd, signum, act, oldact), pr_dup2(pidfd, oldfd, newfd)
> > etc.
> >
> > Then when it's all set up you pr_execve() to kick it off.
>
> Is this really necessary? I agree that fork()+exec() is suboptimal,
> but if you just want to avoid the cost of duplicating the address
> space, you can AFAICS already do that in userspace with
> clone(CLONE_VM|CLONE_CHILD_SETTID|CLONE_CHILD_CLEARTID|SIGCHLD). Then
> the parent can block on a futex until the child leaves the mm_struct
> through execve() (or by exiting, in the case of an error), and the
> child can temporarily have its stack at the bottom of the caller's
> stack. You could build an API like this around it in userspace:
>
> int clone_temporary(int (*fn)(void *arg), void *arg, pid_t *child_pid,
> )
>
> and then you'd use it like this to fork off a child process:
>
> int spawn_shell_subprocess_(void *arg) {
>   char *cmdline = arg;
>   execl("/bin/sh", "sh", "-c", cmdline, (char *)NULL);
>   return -1;
> }
> pid_t spawn_shell_subprocess(char *cmdline) {
>   pid_t child_pid;
>   int res = clone_temporary(spawn_shell_subprocess_, cmdline,
>                             &child_pid, [...]);
>   if (res == 0) return child_pid;
>   return res;
> }
>
> clone_temporary() could be implemented roughly as follows by the libc
> (or other userspace 

Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

2019-04-29 Thread Jann Horn
On Sat, Apr 20, 2019 at 3:14 AM Kevin Easton  wrote:
> On Mon, Apr 15, 2019 at 01:29:23PM -0700, Andy Lutomirski wrote:
> > On Mon, Apr 15, 2019 at 12:59 PM Aleksa Sarai  wrote:
> > >
> > > On 2019-04-15, Enrico Weigelt, metux IT consult  wrote:
> > > > > This patchset makes it possible to retrieve pid file descriptors at
> > > > > process creation time by introducing the new flag CLONE_PIDFD to the
> > > > > clone() system call as previously discussed.
> > > >
> > > > Sorry for hijacking this thread, but I'm curious about what things to
> > > > consider when introducing new CLONE_* flags.
> > > >
> > > > The reason I'm asking is:
> > > >
> > > > I'm working on implementing plan9-like fs namespaces, where unprivileged
> > > > processes can change their own namespace at will. For that, certain
> > > > traditional unix'ish things have to be disabled, most notably suid.
> > > > As forbidding suid can be helpful in other scenarios, too, I thought
> > > > about making this its own feature. Doing that switch on clone() seems
> > > > a nice place for that, IMHO.
> > >
> > > Just spit-balling -- is no_new_privs not sufficient for this usecase?
> > > Not granting privileges such as setuid during execve(2) is the main
> > > point of that flag.
> > >
> >
> > I would personally *love* it if distros started setting no_new_privs
> > for basically all processes.  And pidfd actually gets us part of the
> > way toward a straightforward way to make sudo and su still work in a
> > no_new_privs world: su could call into a daemon that would spawn the
> > privileged task, and su would get a (read-only!) pidfd back and then
> > wait for the fd and exit.  I suppose that, done naively, this might
> > cause some odd effects with respect to tty handling, but I bet it's
> > solveable.  I suppose it would be nifty if there were a way for a
> > process, by mutual agreement, to reparent itself to an unrelated
> > process.
> >
> > Anyway, clone(2) is an enormous mess.  Surely the right solution here
> > is to have a whole new process creation API that takes a big,
> > extensible struct as an argument, and supports *at least* the full
> > abilities of posix_spawn() and ideally covers all the use cases for
> > fork() + do stuff + exec().  It would be nifty if this API also had a
> > way to say "add no_new_privs and therefore enable extra functionality
> > that doesn't work without no_new_privs".  This functionality would
> > include things like returning a future extra-privileged pidfd that
> > gives ptrace-like access.
> >
> > As basic examples, the improved process creation API should take a
> > list of dup2() operations to perform, fds to remove the O_CLOEXEC flag
> > from, fds to close (or, maybe even better, a list of fds to *not*
> > close), a list of rlimit changes to make, a list of signal changes to
> > make, the ability to set sid, pgrp, uid, gid (as in
> > setresuid/setresgid), the ability to do capset() operations, etc.  The
> > posix_spawn() API, for all that it's rather complicated, covers a
> > bunch of the basics pretty well.
>
> The idea of a system call that takes an infinitely-extendable laundry
> list of operations to perform in kernel space seems quite inelegant, if
> only for the error-reporting reason.
>
> Instead, I suggest that what you'd want is a way to create a new
> embryonic process that has no address space and isn't yet schedulable.
> You then just need other-process-directed variants of all the normal
> setup functions - so pr_openat(pidfd, dirfd, pathname, flags, mode),
> pr_sigaction(pidfd, signum, act, oldact), pr_dup2(pidfd, oldfd, newfd)
> etc.
>
> Then when it's all set up you pr_execve() to kick it off.

Is this really necessary? I agree that fork()+exec() is suboptimal,
but if you just want to avoid the cost of duplicating the address
space, you can AFAICS already do that in userspace with
clone(CLONE_VM|CLONE_CHILD_SETTID|CLONE_CHILD_CLEARTID|SIGCHLD). Then
the parent can block on a futex until the child leaves the mm_struct
through execve() (or by exiting, in the case of an error), and the
child can temporarily have its stack at the bottom of the caller's
stack. You could build an API like this around it in userspace:

int clone_temporary(int (*fn)(void *arg), void *arg, pid_t *child_pid,
)

and then you'd use it like this to fork off a child process:

int spawn_shell_subprocess_(void *arg) {
  char *cmdline = arg;
  execl("/bin/sh", "sh", "-c", cmdline, (char *)NULL);
  return -1;
}
pid_t spawn_shell_subprocess(char *cmdline) {
  pid_t child_pid;
  int res = clone_temporary(spawn_shell_subprocess_, cmdline,
                            &child_pid, [...]);
  if (res == 0) return child_pid;
  return res;
}

clone_temporary() could be implemented roughly as follows by the libc
(or other userspace code):

sigset_t sigset, sigset_old;
sigfillset(&sigset);
sigprocmask(SIG_SETMASK, &sigset, &sigset_old);
int child_pid;
int result = 0;
/* starting here, use inline assembly to ensure that no stack
allocations occur */
long child = syscall(__NR_clone,

Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

2019-04-29 Thread Enrico Weigelt, metux IT consult
On 29.04.19 17:49, Serge E. Hallyn wrote:

>> * all users are equal - no root at all. the only exception is the
>>   initial process, which gets the kernel devices mounted into his
>>   namespace.
>
> This does not match my understanding, but I'm most likely wrong.  (I thought
> there was an actual 'host owner' uid, which mostly is only used for initial
> process, but is basically root with a different name, and used far less.  No
> uid transitions without factotum so that it *looked* like no root user).

Not quite (IIRC). The hostowner is just the user who booted the machine,
the initial process runs under this uname and gets the kernel devices
bound into his namespace, so he can start fileservers on them.

Also the caphash device (the one where you can create capabilities, e.g.
for a user change, which can then be used via the capuse device) can only
be opened once - usually by the host factotum.

There really is no such thing as a root user.

>> What I'd like to achieve on Linux:
>>
>> * unprivileged users can have their own mount namespace, where they
>>   can mount at will (maybe just 9P).
>
> No problem, you can do that now.

But only within separate userns, IMHO. (and, when I last tried, plain
users couldn't directly create their userns).

>> * but they still appear as the same normal users to the rest of the
>>   system
> 
> No problem, you can do that now.

How exactly ? Did I miss something vital ?

>> * 9p programs (compiled for Linux ABI) can run parallel to traditional
>>   linux programs within the same user and sessions (eg. from a terminal,
>>   i can call both the same way)
>> * namespace modifications affect both equally (eg. I could run ff in
>>   an own ns)
> 
> affect both of what equally?

mount / bind.

> That's exactly what user namespaces are for.  You can create a new
> user namespace, using no privilege at all, with your current uid (i.e.
> 1000) mapped to whatever uid you like; if you pick 0, then you can unshare all
> the namespaces you like.  

But I don't like to appear as 'root' in here. I just wanna have my own
filesystem namespace, nothing more.

> Once you unshare mnt_ns, you can mount to your
> heart's content.  To other processes on the host, your process is
> uid 1000.

Is that the uid I appear as to filesystems?

> Regarding factotum, I agree that with the pidfd work going on etc, it's getting
> more and more tempting to attempt a switch to that.  Looking back at my folder,
> I see you posted a kernel patch for it.  I had done the same long ago.  Happy to
> work with you again on that, and put a simple daemon into shadow package, if
> util-linux isn't deemed the far better place.

Yeah :)


--mtx

-- 
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
i...@metux.net -- +49-151-27565287


Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

2019-04-29 Thread Serge E. Hallyn
On Tue, Apr 16, 2019 at 08:32:50PM +0200, Enrico Weigelt, metux IT consult 
wrote:

(Sorry for the late reply, I had missed this one)

> On 15.04.19 17:50, Serge E. Hallyn wrote:
> 
> Hi,
> 
> >> I'm working on implementing plan9-like fs namespaces, where unprivileged
> >> processes can change their own namespace at will. For that, certain
> >
> > Is there any place where we can see previous discussion about this?
>
> Yes, lkml and containers list.
> It's been stalled for a few months, as I'm too busy w/ other things.
>
> > If you have to disable suid anyway, then is there any reason why the
> > existing ability to do this in a private user namespace, with only
> > your own uid mapped (which you can do without any privilege) does not
> > suffice?  That was actually one of the main design goals of user
> > namespaces, to be able to clone(CLONE_NEWUSER), map your current uid,
> > then clone(CLONE_NEWNS) and bind mount at will.
>
> Well, it's not that easy ... maybe I should explain a bit more about how
> Plan9 works, and how I intend to map it into Linux:
> 
> * on plan9, anybody can alter his own fs namespace (bind and mount), as
>   well as spawning new ones
> * basically anything is coming from some fileserver - even devices
>   (eg. there is no such thing like device nodes)
> * access control is done by the individual fileservers, based on the
>   initial authentication (on connecting to the server, before mounting)

yes, so far I'm aware of this,

> * all users are equal - no root at all. the only exception is the
>   initial process, which gets the kernel devices mounted into his
>   namespace.

This does not match my understanding, but I'm most likely wrong.  (I thought
there was an actual 'host owner' uid, which mostly is only used for initial
process, but is basically root with a different name, and used far less.  No
uid transitions without factotum so that it *looked* like no root user).

> What I'd like to achieve on Linux:
> 
> * unprivileged users can have their own mount namespace, where they
>   can mount at will (maybe just 9P).

No problem, you can do that now.

> * but they still appear as the same normal users to the rest of the
>   system

No problem, you can do that now.

> * 9p programs (compiled for Linux ABI) can run parallel to traditional
>   linux programs within the same user and sessions (eg. from a terminal,
>   i can call both the same way)
> * namespace modifications affect both equally (eg. I could run ff in
>   an own ns)

affect both of what equally?

> * these namespaces exist as long as there's one process alive in here

That's sort of how it is now, except you can also pin the namespaces
with their fds.

> * creating a new ns can be done by unprivileged user

That's true now.

> One of the things to make this work (w/o introducing a massive security
> hole) is to disable suid for those processes (actually, one day i'd like to
> get rid of it completely, but that's another story).

That's exactly what user namespaces are for.  You can create a new
user namespace, using no privilege at all, with your current uid (i.e.
1000) mapped to whatever uid you like; if you pick 0, then you can unshare all
the namespaces you like.  Once you unshare mnt_ns, you can mount to your
heart's content.  To other processes on the host, your process is
uid 1000.  Host uid 0 is not mapped into your ns, so you cannot exploit
suid to host root.
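
A rough sketch of the sequence described above, in case it helps. It
assumes the kernel permits unprivileged user namespaces (some distros gate
this behind a sysctl); the function names are made up for this example.
The work is done in a forked child so the caller's own namespaces are left
alone:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write a short string to an existing file; 0 on success, -1 on error. */
static int write_file(const char *path, const char *text)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t len = (ssize_t)strlen(text);
    ssize_t n = write(fd, text, (size_t)len);
    close(fd);
    return n == len ? 0 : -1;
}

/* New userns with our uid/gid mapped to 0, then a private mount ns. */
static int enter_private_mount_ns(void)
{
    /* Read the ids first: after unshare() they show up as the overflow id. */
    uid_t uid = getuid();
    gid_t gid = getgid();
    char map[64];

    if (unshare(CLONE_NEWUSER) != 0)
        return -1;
    snprintf(map, sizeof map, "0 %ld 1", (long)uid);
    if (write_file("/proc/self/uid_map", map) != 0)
        return -1;
    /* An unprivileged writer must deny setgroups before writing gid_map. */
    if (write_file("/proc/self/setgroups", "deny") != 0)
        return -1;
    snprintf(map, sizeof map, "0 %ld 1", (long)gid);
    if (write_file("/proc/self/gid_map", map) != 0)
        return -1;
    return unshare(CLONE_NEWNS);
}

/* Hypothetical wrapper: 0 if the child got its private mount namespace,
 * 1 if the kernel refused (e.g. userns disabled), -1 on fork/wait errors. */
int try_private_mount_ns(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(enter_private_mount_ns() == 0 ? 0 : 1);
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Inside the child, mount(2) and bind mounts work from that point on, while
the rest of the system still sees the process under its original uid.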

Regarding factotum, I agree that with the pidfd work going on etc, it's getting
more and more tempting to attempt a switch to that.  Looking back at my folder,
I see you posted a kernel patch for it.  I had done the same long ago.  Happy to
work with you again on that, and put a simple daemon into shadow package, if
util-linux isn't deemed the far better place.

-serge


Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

2019-04-20 Thread Al Viro
On Mon, Apr 15, 2019 at 01:29:23PM -0700, Andy Lutomirski wrote:

> Anyway, clone(2) is an enormous mess.  Surely the right solution here
> is to have a whole new process creation API that takes a big,
> extensible struct as an argument, and supports *at least* the full
> abilities of posix_spawn() and ideally covers all the use cases for
> fork() + do stuff + exec().  It would be nifty if this API also had a
> way to say "add no_new_privs and therefore enable extra functionality
> that doesn't work without no_new_privs".  This functionality would
> include things like returning a future extra-privileged pidfd that
> gives ptrace-like access.

You had been two weeks too late with that, and a bit too obvious with the use
of "surely" too close to the beginning...

If it was _not_ a belated AFD posting, alt.tasteless is over -> that way...


Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

2019-04-20 Thread Daniel Colascione
On Sat, Apr 20, 2019 at 12:14 AM Kevin Easton  wrote:
> On Mon, Apr 15, 2019 at 01:29:23PM -0700, Andy Lutomirski wrote:
> > On Mon, Apr 15, 2019 at 12:59 PM Aleksa Sarai  wrote:
> > >
> > > On 2019-04-15, Enrico Weigelt, metux IT consult  wrote:
> > > > > This patchset makes it possible to retrieve pid file descriptors at
> > > > > process creation time by introducing the new flag CLONE_PIDFD to the
> > > > > clone() system call as previously discussed.
> > > >
> > > > Sorry for hijacking this thread, but I'm curious about what things to
> > > > consider when introducing new CLONE_* flags.
> > > >
> > > > The reason I'm asking is:
> > > >
> > > > I'm working on implementing plan9-like fs namespaces, where unprivileged
> > > > processes can change their own namespace at will. For that, certain
> > > > traditional unix'ish things have to be disabled, most notably suid.
> > > > As forbidding suid can be helpful in other scenarios, too, I thought
> > > > about making this its own feature. Doing that switch on clone() seems
> > > > a nice place for that, IMHO.
> > >
> > > Just spit-balling -- is no_new_privs not sufficient for this usecase?
> > > Not granting privileges such as setuid during execve(2) is the main
> > > point of that flag.
> > >
> >
> > I would personally *love* it if distros started setting no_new_privs
> > for basically all processes.  And pidfd actually gets us part of the
> > way toward a straightforward way to make sudo and su still work in a
> > no_new_privs world: su could call into a daemon that would spawn the
> > privileged task, and su would get a (read-only!) pidfd back and then
> > wait for the fd and exit.  I suppose that, done naively, this might
> > cause some odd effects with respect to tty handling, but I bet it's
> > solveable.  I suppose it would be nifty if there were a way for a
> > process, by mutual agreement, to reparent itself to an unrelated
> > process.
> >
> > Anyway, clone(2) is an enormous mess.  Surely the right solution here
> > is to have a whole new process creation API that takes a big,
> > extensible struct as an argument, and supports *at least* the full
> > abilities of posix_spawn() and ideally covers all the use cases for
> > fork() + do stuff + exec().  It would be nifty if this API also had a
> > way to say "add no_new_privs and therefore enable extra functionality
> > that doesn't work without no_new_privs".  This functionality would
> > include things like returning a future extra-privileged pidfd that
> > gives ptrace-like access.
> >
> > As basic examples, the improved process creation API should take a
> > list of dup2() operations to perform, fds to remove the O_CLOEXEC flag
> > from, fds to close (or, maybe even better, a list of fds to *not*
> > close), a list of rlimit changes to make, a list of signal changes to
> > make, the ability to set sid, pgrp, uid, gid (as in
> > setresuid/setresgid), the ability to do capset() operations, etc.  The
> > posix_spawn() API, for all that it's rather complicated, covers a
> > bunch of the basics pretty well.
>
> The idea of a system call that takes an infinitely-extendable laundry
> list of operations to perform in kernel space seems quite inelegant, if
> only for the error-reporting reason.
>
> Instead, I suggest that what you'd want is a way to create a new
> embryonic process that has no address space and isn't yet schedulable.
> You then just need other-process-directed variants of all the normal
> setup functions - so pr_openat(pidfd, dirfd, pathname, flags, mode),
> pr_sigaction(pidfd, signum, act, oldact), pr_dup2(pidfd, oldfd, newfd)
> etc.

Providing process-directed versions of these functions would be useful
for a variety of management tasks anyway.

> Then when it's all set up you pr_execve() to kick it off.

Yes. That's the right general approach.


Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

2019-04-20 Thread Christian Brauner
On April 20, 2019 9:14:06 AM GMT+02:00, Kevin Easton  wrote:
>On Mon, Apr 15, 2019 at 01:29:23PM -0700, Andy Lutomirski wrote:
>> On Mon, Apr 15, 2019 at 12:59 PM Aleksa Sarai wrote:
>> >
>> > On 2019-04-15, Enrico Weigelt, metux IT consult wrote:
>> > > > This patchset makes it possible to retrieve pid file descriptors at
>> > > > process creation time by introducing the new flag CLONE_PIDFD to the
>> > > > clone() system call as previously discussed.
>> > >
>> > > Sorry for hijacking this thread, but I'm curious about what things to
>> > > consider when introducing new CLONE_* flags.
>> > >
>> > > The reason I'm asking is:
>> > >
>> > > I'm working on implementing plan9-like fs namespaces, where unprivileged
>> > > processes can change their own namespace at will. For that, certain
>> > > traditional unix'ish things have to be disabled, most notably suid.
>> > > As forbidding suid can be helpful in other scenarios, too, I thought
>> > > about making this its own feature. Doing that switch on clone() seems
>> > > a nice place for that, IMHO.
>> >
>> > Just spit-balling -- is no_new_privs not sufficient for this usecase?
>> > Not granting privileges such as setuid during execve(2) is the main
>> > point of that flag.
>> >
>> 
>> I would personally *love* it if distros started setting no_new_privs
>> for basically all processes.  And pidfd actually gets us part of the
>> way toward a straightforward way to make sudo and su still work in a
>> no_new_privs world: su could call into a daemon that would spawn the
>> privileged task, and su would get a (read-only!) pidfd back and then
>> wait for the fd and exit.  I suppose that, done naively, this might
>> cause some odd effects with respect to tty handling, but I bet it's
>> solveable.  I suppose it would be nifty if there were a way for a
>> process, by mutual agreement, to reparent itself to an unrelated
>> process.
>> 
>> Anyway, clone(2) is an enormous mess.  Surely the right solution here
>> is to have a whole new process creation API that takes a big,
>> extensible struct as an argument, and supports *at least* the full
>> abilities of posix_spawn() and ideally covers all the use cases for
>> fork() + do stuff + exec().  It would be nifty if this API also had a
>> way to say "add no_new_privs and therefore enable extra functionality
>> that doesn't work without no_new_privs".  This functionality would
>> include things like returning a future extra-privileged pidfd that
>> gives ptrace-like access.
>> 
>> As basic examples, the improved process creation API should take a
>> list of dup2() operations to perform, fds to remove the O_CLOEXEC flag
>> from, fds to close (or, maybe even better, a list of fds to *not*
>> close), a list of rlimit changes to make, a list of signal changes to
>> make, the ability to set sid, pgrp, uid, gid (as in
>> setresuid/setresgid), the ability to do capset() operations, etc.  The
>> posix_spawn() API, for all that it's rather complicated, covers a
>> bunch of the basics pretty well.
>
>The idea of a system call that takes an infinitely-extendable laundry
>list of operations to perform in kernel space seems quite inelegant, if
>only for the error-reporting reason.
>
>Instead, I suggest that what you'd want is a way to create a new
>embryonic process that has no address space and isn't yet schedulable.
>You then just need other-process-directed variants of all the normal
>setup functions - so pr_openat(pidfd, dirfd, pathname, flags, mode),
>pr_sigaction(pidfd, signum, act, oldact), pr_dup2(pidfd, oldfd, newfd)
>etc.
>
>Then when it's all set up you pr_execve() to kick it off.
>
>- Kevin

I proposed a version of this a while back when we first started talking
about this.


Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

2019-04-20 Thread Kevin Easton
On Mon, Apr 15, 2019 at 01:29:23PM -0700, Andy Lutomirski wrote:
> On Mon, Apr 15, 2019 at 12:59 PM Aleksa Sarai  wrote:
> >
> > On 2019-04-15, Enrico Weigelt, metux IT consult  wrote:
> > > > This patchset makes it possible to retrieve pid file descriptors at
> > > > process creation time by introducing the new flag CLONE_PIDFD to the
> > > > clone() system call as previously discussed.
> > >
> > > Sorry for hijacking this thread, but I'm curious about what things to
> > > consider when introducing new CLONE_* flags.
> > >
> > > The reason I'm asking is:
> > >
> > > I'm working on implementing plan9-like fs namespaces, where unprivileged
> > > processes can change their own namespace at will. For that, certain
> > > traditional unix'ish things have to be disabled, most notably suid.
> > > As forbidding suid can be helpful in other scenarios, too, I thought
> > > about making this its own feature. Doing that switch on clone() seems
> > > a nice place for that, IMHO.
> >
> > Just spit-balling -- is no_new_privs not sufficient for this usecase?
> > Not granting privileges such as setuid during execve(2) is the main
> > point of that flag.
> >
> 
> I would personally *love* it if distros started setting no_new_privs
> for basically all processes.  And pidfd actually gets us part of the
> way toward a straightforward way to make sudo and su still work in a
> no_new_privs world: su could call into a daemon that would spawn the
> privileged task, and su would get a (read-only!) pidfd back and then
> wait for the fd and exit.  I suppose that, done naively, this might
> cause some odd effects with respect to tty handling, but I bet it's
> solveable.  I suppose it would be nifty if there were a way for a
> process, by mutual agreement, to reparent itself to an unrelated
> process.
> 
> Anyway, clone(2) is an enormous mess.  Surely the right solution here
> is to have a whole new process creation API that takes a big,
> extensible struct as an argument, and supports *at least* the full
> abilities of posix_spawn() and ideally covers all the use cases for
> fork() + do stuff + exec().  It would be nifty if this API also had a
> way to say "add no_new_privs and therefore enable extra functionality
> that doesn't work without no_new_privs".  This functionality would
> include things like returning a future extra-privileged pidfd that
> gives ptrace-like access.
> 
> As basic examples, the improved process creation API should take a
> list of dup2() operations to perform, fds to remove the O_CLOEXEC flag
> from, fds to close (or, maybe even better, a list of fds to *not*
> close), a list of rlimit changes to make, a list of signal changes to
> make, the ability to set sid, pgrp, uid, gid (as in
> setresuid/setresgid), the ability to do capset() operations, etc.  The
> posix_spawn() API, for all that it's rather complicated, covers a
> bunch of the basics pretty well.

The idea of a system call that takes an infinitely-extendable laundry
list of operations to perform in kernel space seems quite inelegant, if
only for the error-reporting reason.

Instead, I suggest that what you'd want is a way to create a new
embryonic process that has no address space and isn't yet schedulable.
You then just need other-process-directed variants of all the normal
setup functions - so pr_openat(pidfd, dirfd, pathname, flags, mode),
pr_sigaction(pidfd, signum, act, oldact), pr_dup2(pidfd, oldfd, newfd)
etc.

Then when it's all set up you pr_execve() to kick it off.

- Kevin



Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

2019-04-18 Thread Enrico Weigelt, metux IT consult
On 17.04.19 14:54, Christian Brauner wrote:

>> Ah, that is a cool thing !
>> I suppose that also works across namespaces ?
>
> Yes, it should. If you hand off the pidfd to another pidns (e.g. via SCM
> credentials) for example.

I thought about things like sending the pidfd via unix socket.
It would be really cool if the receiving process could then control
the referred process (eg. send signals), even if it's in a different
pidns.
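
To make the "send it via unix socket" part concrete, here is a minimal
sketch of passing a file descriptor over an AF_UNIX socket with an
SCM_RIGHTS control message. It is written for an arbitrary fd; the
assumption is that a pidfd would travel the same way, since it is an
ordinary file descriptor. send_fd() and recv_fd() are names made up for
this example:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send fd over the connected Unix socket sock; 0 on success, -1 on error. */
int send_fd(int sock, int fd)
{
    char byte = 0; /* one byte of real data carries the control message */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr align; /* forces correct alignment of buf */
        char buf[CMSG_SPACE(sizeof(int))];
    } ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&msg, 0, sizeof msg);
    memset(&ctl, 0, sizeof ctl);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof ctl.buf;

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd from sock; returns the new fd, or -1 on error. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof ctl.buf;

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The receiver gets its own descriptor number referring to the same open
file description, so what it may do with it is decided by the kernel's
checks on that fd, not by the number itself.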

>> What other things can be done via pidfd ?
> 
> Very basic things right now and until CLONE_PIDFD is accepted (possibly
> for 5.2) we won't enable any more features.
> I'm not a fan of wild speculations and grand schemes that then don't
> come to fruition. :) Right now it's about focussing on somewhat cleanly
> covering the basics. Coming to a consensus there was hard enough. We
> have no intention of making this more complex right now than it needs to
> be.

IMHO, it would be good if it would support all operations that now can
be done via numerical PID, and w/ the permissions of the process who
created that pidfd. Certainly, that would also need some lockdown
mechanism, so the creating process can define what the holder of the
fd can actually do.

--mtx

-- 
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
i...@metux.net -- +49-151-27565287


Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

2019-04-17 Thread Andy Lutomirski



> On Apr 17, 2019, at 5:19 AM, Florian Weimer  wrote:
> 
> * Andy Lutomirski:
> 
>> I would personally *love* it if distros started setting no_new_privs
>> for basically all processes.
> 
> Wouldn't no_new_privs inhibit all security transitions, including those
> that reduce privileges?  And therefore effectively reduce security?

In principle, you still can reduce privileges with no_new_privs.  SELinux
has a whole mechanism for privilege-reducing transitions on exec that works
in no_new_privs mode. Also, all the traditional privilege dropping
techniques work: setresuid(), unshare(), etc are all unaffected.
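
As a small illustration of that point (a sketch only; the helper names are
made up for this example), a process can set no_new_privs through prctl(2)
and still call setresgid()/setresuid() afterwards. Here the calls merely
pin all ids to the current real ids, which any process is allowed to do:

```c
#define _GNU_SOURCE
#include <sys/prctl.h>
#include <sys/types.h>
#include <unistd.h>

/* Set no_new_privs for the calling thread. The flag is inherited across
 * fork/clone/execve and cannot be cleared again. Returns 0 on success. */
int enable_no_new_privs(void)
{
    return prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
}

/* Returns 1 if no_new_privs is set, 0 if it is not, -1 on error. */
int has_no_new_privs(void)
{
    return prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
}

/* Traditional privilege dropping: set real, effective and saved ids to
 * the current real ids. Works the same with or without no_new_privs. */
int drop_to_real_ids(void)
{
    gid_t gid = getgid();
    uid_t uid = getuid();

    if (setresgid(gid, gid, gid) != 0)
        return -1;
    return setresuid(uid, uid, uid);
}
```

After enable_no_new_privs(), an execve() of a setuid binary still runs the
binary, just without the uid change.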

> 
> There seems to be some demand to be able to do large
> parts of container setup using posix_spawn, so we'll probably add
> support for things like writing to arbitrary files eventually.  And of
> course, proper error reporting, so that you can figure out which file
> creation action failed.
> 

ISTM the way to handle this is to have a way to make a container, set it up, 
and then clone/spawn into it.  The current unshare() API is severely awkward.

Maybe the new better kernel spawn API shouldn’t support unshare-like semantics 
at all and should instead work like setns().

Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

2019-04-17 Thread Christian Brauner
On Wed, Apr 17, 2019 at 02:03:16PM +0200, Enrico Weigelt, metux IT consult 
wrote:
> On 16.04.19 23:31, Andy Lutomirski wrote:
> 
> >> How exactly would the pidfd improve this scenario ?
> >> IMHO, would just need to pass the inherited fd's to that daemon (eg.
> >> via unix socket) which then sets them up in the new child process.
> > 
> > It makes it easier to wait until the privileged program exits.
> > Without pidfd, you can't just wait(2) because the program that gets
> > spawned isn't a child.  
> 
> Ah, that is a cool thing !
> I suppose that also works across namespaces ?

Yes, it should. If you hand off the pidfd to another pidns (e.g. via SCM
credentials) for example.
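
For illustration, waiting on a non-child through a pidfd could look
roughly like the sketch below. Note the assumptions: it relies on
pidfd_open(2) and on pidfds being pollable, both of which only landed in
kernels after this discussion (5.3), and the function name is made up for
this example. The raw syscall is used in case the libc has no wrapper:

```c
#include <errno.h>
#include <poll.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: block until process pid has exited.
 * Returns 0 once it exited, 1 if pidfd_open(2) is unavailable
 * (ENOSYS on old kernels; some sandboxes report EPERM), -1 otherwise. */
int wait_for_exit_via_pidfd(pid_t pid)
{
#ifdef SYS_pidfd_open
    struct pollfd pfd;
    int r;
    int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);

    if (pidfd < 0)
        return (errno == ENOSYS || errno == EPERM) ? 1 : -1;

    /* A pidfd becomes readable when the process it refers to terminates. */
    pfd.fd = pidfd;
    pfd.events = POLLIN;
    do {
        r = poll(&pfd, 1, -1);
    } while (r < 0 && errno == EINTR);

    close(pidfd);
    return r == 1 ? 0 : -1;
#else
    (void)pid;
    return 1;
#endif
}
```

Unlike wait(2), this does not reap the process; the real parent still has
to do that.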

> 
> What other things can be done via pidfd ?

Very basic things right now and until CLONE_PIDFD is accepted (possibly
for 5.2) we won't enable any more features.
I'm not a fan of wild speculations and grand schemes that then don't
come to fruition. :) Right now it's about focussing on somewhat cleanly
covering the basics. Coming to a consensus there was hard enough. We
have no intention of making this more complex right now than it needs to
be.

Christian


Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

2019-04-17 Thread Florian Weimer
* Andy Lutomirski:

> I would personally *love* it if distros started setting no_new_privs
> for basically all processes.

Wouldn't no_new_privs inhibit all security transitions, including those
that reduce privileges?  And therefore effectively reduce security?

> Anyway, clone(2) is an enormous mess.  Surely the right solution here
> is to have a whole new process creation API that takes a big,
> extensible struct as an argument, and supports *at least* the full
> abilities of posix_spawn() and ideally covers all the use cases for
> fork() + do stuff + exec().  It would be nifty if this API also had a
> way to say "add no_new_privs and therefore enable extra functionality
> that doesn't work without no_new_privs".  This functionality would
> include things like returning a future extra-privileged pidfd that
> gives ptrace-like access.

I agree that a consistent replacement for the clone system call makes
sense.  I'm not sure if covering everything that posix_spawn could do
would make sense.  There seems to be some demand to be able to do large
parts of container setup using posix_spawn, so we'll probably add
support for things like writing to arbitrary files eventually.  And of
course, proper error reporting, so that you can figure out which file
creation action failed.

Thanks,
Florian


Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

2019-04-17 Thread Enrico Weigelt, metux IT consult
On 16.04.19 23:31, Andy Lutomirski wrote:

>> How exactly would the pidfd improve this scenario ?
>> IMHO, would just need to pass the inherited fd's to that daemon (eg.
>> via unix socket) which then sets them up in the new child process.
> 
> It makes it easier to wait until the privileged program exits.
> Without pidfd, you can't just wait(2) because the program that gets
> spawned isn't a child.  

Ah, that is a cool thing !
I suppose that also works across namespaces ?

What other things can be done via pidfd ?

>> But: how can we handle things like cgroups ?
> 
> Find a secure way to tell the daemon what cgroups to use?

hmm, do we have some fd-handle to cgroups ?
In that case a process could send a handle of his cgroup to some
other process (eg. some "login" daemon) allowing him to join in.

We could look at cgroups more as kind of capabilities instead of
limitations (eg. things like: members of cgroup "net-foo1" are
granted n% of network bandwidth, etc). That would open up completely
new approaches to security and resource control :)

It could go even further: anybody can create new cgroups within his
own, narrow down some limits and pass this to some other agent that
acts on behalf of him and is allowed to use his share of the system
resources for that.


--mtx

-- 
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
i...@metux.net -- +49-151-27565287


Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

2019-04-16 Thread Andy Lutomirski
On Tue, Apr 16, 2019 at 11:46 AM Enrico Weigelt, metux IT consult
 wrote:
>
> On 15.04.19 22:29, Andy Lutomirski wrote:
>
> 
>
> > I would personally *love* it if distros started setting no_new_privs
> > for basically all processes.
>
> Maybe a pam module for that would be fine.
> But this should be configurable per-user, as so many things still rely
> on suid.
>
> Actually, I'd like to move all authentication / privilege switching
> to factotum (login(1), sshd, etc then also could run as unprivileged
> users).
>
> > And pidfd actually gets us part of the way toward a straightforward
> > way to make sudo and su still work in a no_new_privs world: su could
> > call into a daemon that would spawn the privileged task, and su would
> > get a (read-only!) pidfd back and then wait for the fd and exit.
>
> How exactly would the pidfd improve this scenario ?
> IMHO, would just need to pass the inherited fd's to that daemon (eg.
> via unix socket) which then sets them up in the new child process.
>

It makes it easier to wait until the privileged program exits.
Without pidfd, you can't just wait(2) because the program that gets
spawned isn't a child.  With pidfd, the daemon can pass the pidfd
back.  Without pidfd, of course, you can wait by asking the daemon to
tell you when the program exits, but that's uglier IMO.
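
To make the "wait on a non-child" point concrete, here is a rough sketch.
It assumes a kernel where pidfds are pollable and pidfd_open(2) exists
(Linux 5.3 and later, as far as I know), which is newer than the
CLONE_PIDFD series under discussion. pidfd_for() and wait_for_exit() are
names invented for this example, and the fallback syscall number is an
assumption for architectures that use the generic numbering.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <poll.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434  /* assumption: generic syscall number, 5.3+ */
#endif

/* Get a pidfd for an existing process; -1 (e.g. ENOSYS) if unsupported. */
int pidfd_for(pid_t pid)
{
    return (int)syscall(SYS_pidfd_open, pid, 0U);
}

/*
 * Block until the process behind pidfd has exited. Unlike wait(2), this
 * does not require the target to be our child. Returns 0 once the process
 * has exited, -1 on error.
 */
int wait_for_exit(int pidfd)
{
    struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
    int rc;

    do {
        rc = poll(&pfd, 1, -1);
    } while (rc < 0 && errno == EINTR);

    return (rc == 1 && (pfd.revents & POLLIN)) ? 0 : -1;
}
```

In the su scenario, the daemon would hand the pidfd back over the unix
socket and su would simply call something like wait_for_exit() on it.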

> > I suppose that, done naively, this might cause some odd effects with
> > respect to tty handling, but I bet it's solveable.
>
> Yes, signals and process groups would be a bit tricky. Some signals
> could be transmitted in a similar way as ssh does.
>
> But: how can we handle things like cgroups ?

Find a secure way to tell the daemon what cgroups to use?


--Andy


Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

2019-04-16 Thread Enrico Weigelt, metux IT consult
On 15.04.19 22:29, Andy Lutomirski wrote:



> I would personally *love* it if distros started setting no_new_privs
> for basically all processes.

Maybe a pam module for that would be fine.
But this should be configurable per-user, as so many things still rely
on suid.

Actually, I'd like to move all authentication / privilege switching
to factotum (login(1), sshd, etc then also could run as unprivileged
users).

> And pidfd actually gets us part of the way toward a straightforward
> way to make sudo and su still work in a no_new_privs world: su could
> call into a daemon that would spawn the privileged task, and su would
> get a (read-only!) pidfd back and then wait for the fd and exit.

How exactly would the pidfd improve this scenario ?
IMHO, would just need to pass the inherited fd's to that daemon (eg.
via unix socket) which then sets them up in the new child process.

> I suppose that, done naively, this might cause some odd effects with
> respect to tty handling, but I bet it's solveable.

Yes, signals and process groups would be a bit tricky. Some signals
could be transmitted in a similar way as ssh does.

But: how can we handle things like cgroups ?


--mtx

-- 
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
i...@metux.net -- +49-151-27565287


Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

2019-04-16 Thread Enrico Weigelt, metux IT consult
On 15.04.19 21:59, Aleksa Sarai wrote:

> Just spit-balling -- is no_new_privs not sufficient for this usecase?
> Not granting privileges such as setuid during execve(2) is the main
> point of that flag.

Oh, I wasn't aware of that. Thanks.


--mtx

-- 
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
i...@metux.net -- +49-151-27565287


Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

2019-04-16 Thread Enrico Weigelt, metux IT consult
On 15.04.19 17:50, Serge E. Hallyn wrote:

Hi,

>> I'm working on implementing plan9-like fs namespaces, where unprivileged
>> processes can change their own namespace at will. For that, certain
>
> Is there any place where we can see previous discussion about this?

Yes, lkml and the containers list.
It has been stalled for a few months, as I'm too busy w/ other things.

> If you have to disable suid anyway, then is there any reason why the
> existing ability to do this in a private user namespace, with only
> your own uid mapped (which you can do without any privilege) does
> not suffice?  That was actually one of the main design goals of user
> namespaces, to be able to clone(CLONE_NEWUSER), map your current uid,
> then clone(CLONE_NEWNS) and bind mount at will.

Well, it's not that easy ... maybe I should explain a bit more about how
Plan9 works, and how I intend to map it into Linux:

* on plan9, anybody can alter his own fs namespace (bind and mount), as
  well as spawning new ones
* basically anything is coming from some fileserver - even devices
  (eg. there is no such thing like device nodes)
* access control is done by the individual fileservers, based on the
  initial authentication (on connecting to the server, before mounting)
* all users are equal - no root at all. the only exception is the
  initial process, which gets the kernel devices mounted into his
  namespace.

What I'd like to achieve on Linux:

* unprivileged users can have their own mount namespace, where they
  can mount at will (maybe just 9P).
* but they still appear as the same normal users to the rest of the
  system
* 9p programs (compiled for Linux ABI) can run parallel to traditional
  linux programs within the same user and sessions (eg. from a terminal,
  i can call both the same way)
* namespace modifications affect both equally (eg. I could run ff in
  an own ns)
* these namespaces exist as long as there's one process alive in here
* creating a new ns can be done by unprivileged user

One of the things to make this work (w/o introducing a massive security
hole) is to disable suid for those processes (actually, one day i'd like
to get rid of it completely, but that's another story).


--mtx

-- 
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
i...@metux.net -- +49-151-27565287


Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

2019-04-15 Thread Andy Lutomirski
On Mon, Apr 15, 2019 at 2:26 PM Jonathan Kowalski  wrote:
>
> On Mon, Apr 15, 2019 at 9:34 PM Andy Lutomirski  wrote:
> > I would personally *love* it if distros started setting no_new_privs
> > for basically all processes.  And pidfd actually gets us part of the
> > way toward a straightforward way to make sudo and su still work in a
> > no_new_privs world: su could call into a daemon that would spawn the
> > privileged task, and su would get a (read-only!) pidfd back and then
> > wait for the fd and exit.  I suppose that, done naively, this might
> > cause some odd effects with respect to tty handling, but I bet it's
> > solveable.  I suppose it would be nifty if there were a way for a
>
> Hmm, isn't what you're describing roughly what systemd-run -t does? It
> will serialize the argument list, ask PID 1 to create a transient unit
> (go through the polkit stuff), and then set the stdout/stderr and
> stdin of the service to your tty, make it the controlling terminal of
> the process and
> reset it. So I guess it should work with sudo/su just fine too.
>
> There is also s6-sudod (and a s6-sudoc client to it) that works in a
> similar fashion, though it's a lot less fancy.

Cute.  Now we just need distros to work out the kinks and to ship these as
sudo and su :)

>
> > process, by mutual agreement, to reparent itself to an unrelated
> > process.
> >
> > Anyway, clone(2) is an enormous mess.  Surely the right solution here
> > is to have a whole new process creation API that takes a big,
> > extensible struct as an argument, and supports *at least* the full
> > abilities of posix_spawn() and ideally covers all the use cases for
> > fork() + do stuff + exec().  It would be nifty if this API also had a
> > way to say "add no_new_privs and therefore enable extra functionality
> > that doesn't work without no_new_privs".  This functionality would
> > include things like returning a future extra-privileged pidfd that
> > gives ptrace-like access.
>
> My idea was that this intent could be supplied at clone time, you
> could attach ptrace access modes to a pidfd (we could make those a bit
> granular, perhaps) and any API that takes PIDs and checks against the
> caller's ptrace access mode could instead derive so from the pidfd.
> Since killing is a bit convoluted due to setuid binaries, that should
> work if one is CAP_KILL capable in the owning userns of the task, and
> if not that, has permissions to kill and the target has NNP set.

This CAP_KILL trick makes me nervous.  This particular permission is
really quite powerful, and it would need some analysis to conclude
that it's not *more* powerful than CAP_KILL.

> This
> would allow you to bind kill privileges in a way that is compatible
> with both worlds, the upshot being NNP allows for the functionality to
> be available to a lot more of userspace. Of course, this would require
> a new clone version, possibly with taking a clone2 struct which sets a
> few parameters for the process and the flags for the pidfd.
>
> Another point is that you have a pidfd_open (or something else) that
> can create multiple pidfds from a pidfd obtained at clone time and
> create pidfds with varying level of rights. It can also work by taking
> a TID to open a pidfd for an external task (and then for all the
> rights you wish to acquire on it, check against your ambient
> authority).

Indeed.

>
> (Actually, in general, having FMODE_* style bits spanning all methods
> a file descriptor can take (through system calls), with the type of
> object as key (class containing a set), and be able to enable/disable
> them and seal them would be a useful addition, this all happening at
> the struct file level instead of inode level sealing in memfds).

At the risk of saying a dirty word, the Windows API works quite a bit
like this :)


Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

2019-04-15 Thread Jonathan Kowalski
On Mon, Apr 15, 2019 at 9:34 PM Andy Lutomirski  wrote:
>
> On Mon, Apr 15, 2019 at 12:59 PM Aleksa Sarai  wrote:
> >
> > On 2019-04-15, Enrico Weigelt, metux IT consult  wrote:
> > > > This patchset makes it possible to retrieve pid file descriptors at
> > > > process creation time by introducing the new flag CLONE_PIDFD to the
> > > > clone() system call as previously discussed.
> > >
> > > > Sorry for hijacking this thread, but I'm curious about what things to
> > > consider when introducing new CLONE_* flags.
> > >
> > > The reason I'm asking is:
> > >
> > > I'm working on implementing plan9-like fs namespaces, where unprivileged
> > > processes can change their own namespace at will. For that, certain
> > > traditional unix'ish things have to be disabled, most notably suid.
> > > As forbidding suid can be helpful in other scenarios, too, I thought
> > > about making this its own feature. Doing that switch on clone() seems
> > > a nice place for that, IMHO.
> >
> > Just spit-balling -- is no_new_privs not sufficient for this usecase?
> > Not granting privileges such as setuid during execve(2) is the main
> > point of that flag.
> >
>
> I would personally *love* it if distros started setting no_new_privs
> for basically all processes.  And pidfd actually gets us part of the
> way toward a straightforward way to make sudo and su still work in a
> no_new_privs world: su could call into a daemon that would spawn the
> privileged task, and su would get a (read-only!) pidfd back and then
> wait for the fd and exit.  I suppose that, done naively, this might
> cause some odd effects with respect to tty handling, but I bet it's
> solveable.  I suppose it would be nifty if there were a way for a

Hmm, isn't what you're describing roughly what systemd-run -t does? It
will serialize the argument list, ask PID 1 to create a transient unit
(go through the polkit stuff), and then set the stdout/stderr and
stdin of the service to your tty, make it the controlling terminal of
the process and
reset it. So I guess it should work with sudo/su just fine too.

There is also s6-sudod (and a s6-sudoc client to it) that works in a
similar fashion, though it's a lot less fancy.

> process, by mutual agreement, to reparent itself to an unrelated
> process.
>
> Anyway, clone(2) is an enormous mess.  Surely the right solution here
> is to have a whole new process creation API that takes a big,
> extensible struct as an argument, and supports *at least* the full
> abilities of posix_spawn() and ideally covers all the use cases for
> fork() + do stuff + exec().  It would be nifty if this API also had a
> way to say "add no_new_privs and therefore enable extra functionality
> that doesn't work without no_new_privs".  This functionality would
> include things like returning a future extra-privileged pidfd that
> gives ptrace-like access.

My idea was that this intent could be supplied at clone time, you
could attach ptrace access modes to a pidfd (we could make those a bit
granular, perhaps) and any API that takes PIDs and checks against the
caller's ptrace access mode could instead derive so from the pidfd.
Since killing is a bit convoluted due to setuid binaries, that should
work if one is CAP_KILL capable in the owning userns of the task, and
if not that, has permissions to kill and the target has NNP set. This
would allow you to bind kill privileges in a way that is compatible
with both worlds, the upshot being NNP allows for the functionality to
be available to a lot more of userspace. Of course, this would require
a new clone version, possibly with taking a clone2 struct which sets a
few parameters for the process and the flags for the pidfd.

Another point is that you have a pidfd_open (or something else) that
can create multiple pidfds from a pidfd obtained at clone time and
create pidfds with varying level of rights. It can also work by taking
a TID to open a pidfd for an external task (and then for all the
rights you wish to acquire on it, check against your ambient
authority).

(Actually, in general, having FMODE_* style bits spanning all methods
a file descriptor can take (through system calls), with the type of
object as key (class containing a set), and be able to enable/disable
them and seal them would be a useful addition, this all happening at
the struct file level instead of inode level sealing in memfds).

>
> As basic examples, the improved process creation API should take a
> list of dup2() operations to perform, fds to remove the O_CLOEXEC flag
> from, fds to close (or, maybe even better, a list of fds to *not*
> close), a list of rlimit changes to make, a list of signal changes to
> make, the ability to set sid, pgrp, uid, gid (as in
> setresuid/setresgid), the ability to do capset() operations, etc.  The
> posix_spawn() API, for all that it's rather complicated, covers a
> bunch of the basics pretty well.
>
> Sharing the parent's VM, signal set, fd table, etc, should all be
> options, but they should default to 

Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

2019-04-15 Thread Andy Lutomirski
On Mon, Apr 15, 2019 at 12:59 PM Aleksa Sarai  wrote:
>
> On 2019-04-15, Enrico Weigelt, metux IT consult  wrote:
> > > This patchset makes it possible to retrieve pid file descriptors at
> > > process creation time by introducing the new flag CLONE_PIDFD to the
> > > clone() system call as previously discussed.
> >
> > > Sorry for hijacking this thread, but I'm curious about what things to
> > consider when introducing new CLONE_* flags.
> >
> > The reason I'm asking is:
> >
> > I'm working on implementing plan9-like fs namespaces, where unprivileged
> > processes can change their own namespace at will. For that, certain
> > traditional unix'ish things have to be disabled, most notably suid.
> > As forbidding suid can be helpful in other scenarios, too, I thought
> > about making this its own feature. Doing that switch on clone() seems
> > a nice place for that, IMHO.
>
> Just spit-balling -- is no_new_privs not sufficient for this usecase?
> Not granting privileges such as setuid during execve(2) is the main
> point of that flag.
>

I would personally *love* it if distros started setting no_new_privs
for basically all processes.  And pidfd actually gets us part of the
way toward a straightforward way to make sudo and su still work in a
no_new_privs world: su could call into a daemon that would spawn the
privileged task, and su would get a (read-only!) pidfd back and then
wait for the fd and exit.  I suppose that, done naively, this might
cause some odd effects with respect to tty handling, but I bet it's
solveable.  I suppose it would be nifty if there were a way for a
process, by mutual agreement, to reparent itself to an unrelated
process.

Anyway, clone(2) is an enormous mess.  Surely the right solution here
is to have a whole new process creation API that takes a big,
extensible struct as an argument, and supports *at least* the full
abilities of posix_spawn() and ideally covers all the use cases for
fork() + do stuff + exec().  It would be nifty if this API also had a
way to say "add no_new_privs and therefore enable extra functionality
that doesn't work without no_new_privs".  This functionality would
include things like returning a future extra-privileged pidfd that
gives ptrace-like access.

As basic examples, the improved process creation API should take a
list of dup2() operations to perform, fds to remove the O_CLOEXEC flag
from, fds to close (or, maybe even better, a list of fds to *not*
close), a list of rlimit changes to make, a list of signal changes to
make, the ability to set sid, pgrp, uid, gid (as in
setresuid/setresgid), the ability to do capset() operations, etc.  The
posix_spawn() API, for all that it's rather complicated, covers a
bunch of the basics pretty well.
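
As a point of reference for what posix_spawn() already covers, here is a
small sketch that spawns a program with a dup2() file action, a reset
signal mask and its own process group. The wrapper name
spawn_with_stdout() is hypothetical; the posix_spawn_* calls themselves
are the standard POSIX ones.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/*
 * Spawn argv[0] with stdout redirected to out_fd, an empty signal mask and
 * a new process group. Returns the child's pid, or -1 on failure.
 */
pid_t spawn_with_stdout(char *const argv[], int out_fd)
{
    posix_spawn_file_actions_t fa;
    posix_spawnattr_t attr;
    sigset_t empty;
    pid_t pid = -1;
    int rc;

    if (posix_spawn_file_actions_init(&fa) != 0)
        return -1;
    if (posix_spawnattr_init(&attr) != 0) {
        posix_spawn_file_actions_destroy(&fa);
        return -1;
    }

    /* File actions run in the child, in order, before the exec. */
    posix_spawn_file_actions_adddup2(&fa, out_fd, STDOUT_FILENO);
    if (out_fd != STDOUT_FILENO)
        posix_spawn_file_actions_addclose(&fa, out_fd);

    /* Attributes: unblock all signals and start a new process group. */
    sigemptyset(&empty);
    posix_spawnattr_setsigmask(&attr, &empty);
    posix_spawnattr_setpgroup(&attr, 0);
    posix_spawnattr_setflags(&attr,
                             POSIX_SPAWN_SETSIGMASK | POSIX_SPAWN_SETPGROUP);

    rc = posix_spawn(&pid, argv[0], &fa, &attr, argv, environ);

    posix_spawnattr_destroy(&attr);
    posix_spawn_file_actions_destroy(&fa);

    return rc == 0 ? pid : -1;
}
```

Note what is missing here compared to the wish list above: there is no
standard attribute for rlimits, capset() or no_new_privs, which is part
of the argument for an extensible struct.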

Sharing the parent's VM, signal set, fd table, etc, should all be
options, but they should default to *off*.

(Many other operating systems allow one to create a process and gain a
capability to do all kinds of things to that process.  It's a
generally good idea.)

--Andy


Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

2019-04-15 Thread Aleksa Sarai
On 2019-04-15, Enrico Weigelt, metux IT consult  wrote:
> > This patchset makes it possible to retrieve pid file descriptors at
> > process creation time by introducing the new flag CLONE_PIDFD to the
> > clone() system call as previously discussed.
> 
> Sorry for hijacking this thread, but I'm curious about what things to
> consider when introducing new CLONE_* flags.
> 
> The reason I'm asking is:
> 
> I'm working on implementing plan9-like fs namespaces, where unprivileged
> processes can change their own namespace at will. For that, certain
> traditional unix'ish things have to be disabled, most notably suid.
> As forbidding suid can be helpful in other scenarios, too, I thought
> about making this its own feature. Doing that switch on clone() seems
> a nice place for that, IMHO.

Just spit-balling -- is no_new_privs not sufficient for this usecase?
Not granting privileges such as setuid during execve(2) is the main
point of that flag.
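
For anyone who has not used it, a minimal sketch of setting the flag via
prctl(2). The two wrapper names are made up for this example;
PR_SET_NO_NEW_PRIVS and PR_GET_NO_NEW_PRIVS are the real prctl options.

```c
#include <sys/prctl.h>

/*
 * Set no_new_privs on the calling thread. Once set it cannot be cleared,
 * and it is inherited across fork(), clone() and execve().
 * Returns 0 on success, -1 on error.
 */
int enable_no_new_privs(void)
{
    return prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
}

/* Returns 1 if no_new_privs is set, 0 if not, -1 on error. */
int has_no_new_privs(void)
{
    return prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
}
```

After this, an execve() of a setuid binary still runs the binary, but
without the uid change.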

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH




Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

2019-04-15 Thread Serge E. Hallyn
On Mon, Apr 15, 2019 at 12:08:09PM +0200, Enrico Weigelt, metux IT consult 
wrote:
> On 14.04.19 22:14, Christian Brauner wrote:
> 
> Hi folks,
> 
> > This patchset makes it possible to retrieve pid file descriptors at
> > process creation time by introducing the new flag CLONE_PIDFD to the
> > clone() system call as previously discussed.
> 
> Sorry for hijacking this thread, but I'm curious about what things to
> consider when introducing new CLONE_* flags.
> 
> The reason I'm asking is:
> 
> I'm working on implementing plan9-like fs namespaces, where unprivileged
> processes can change their own namespace at will. For that, certain

Is there any place where we can see previous discussion about this?

> traditional unix'ish things have to be disabled, most notably suid.

If you have to disable suid anyway, then is there any reason why the
existing ability to do this in a private user namespace, with only
your own uid mapped (which you can do without any privilege) does
not suffice?  That was actually one of the main design goals of user
namespaces, to be able to clone(CLONE_NEWUSER), map your current uid,
then clone(CLONE_NEWNS) and bind mount at will.
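
A rough sketch of that sequence, for illustration. It uses a single
unshare(CLONE_NEWUSER | CLONE_NEWNS) in a forked child instead of two
clone() calls, which is a simplification on my part, and it assumes the
distro permits unprivileged user namespaces. bind_in_private_ns() and
write_file() are hypothetical helper names.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write a short string to an existing file; returns 0 on success. */
static int write_file(const char *path, const char *text)
{
    int fd = open(path, O_WRONLY | O_CLOEXEC);
    ssize_t n;

    if (fd < 0)
        return -1;
    n = write(fd, text, strlen(text));
    close(fd);
    return n == (ssize_t)strlen(text) ? 0 : -1;
}

/*
 * In a forked child: enter a new user + mount namespace, map only our own
 * uid and gid, then bind-mount src over dst. The mount is visible only
 * inside that namespace and goes away when the child exits.
 * Returns 0 if the bind mount worked, 1 if the namespaces could not be
 * created, 2 if a later step failed, -1 if fork/wait failed.
 */
int bind_in_private_ns(const char *src, const char *dst)
{
    uid_t uid = geteuid();
    gid_t gid = getegid();
    char map[64];
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        if (unshare(CLONE_NEWUSER | CLONE_NEWNS) != 0)
            _exit(1);

        /* Map our uid to itself, so we look like the same user inside. */
        snprintf(map, sizeof(map), "%u %u 1\n", (unsigned)uid, (unsigned)uid);
        if (write_file("/proc/self/uid_map", map) != 0)
            _exit(2);

        /* An unprivileged writer must deny setgroups before gid_map. */
        if (write_file("/proc/self/setgroups", "deny") != 0)
            _exit(2);
        snprintf(map, sizeof(map), "%u %u 1\n", (unsigned)gid, (unsigned)gid);
        if (write_file("/proc/self/gid_map", map) != 0)
            _exit(2);

        /* MS_REC: inherited submounts are locked, so bind recursively. */
        if (mount(src, dst, NULL, MS_BIND | MS_REC, NULL) != 0)
            _exit(2);

        _exit(0);
    }

    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

A real user would exec a shell or program in the child instead of
exiting, so that the namespace stays alive as long as something runs
in it.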

> As forbidding suid can be helpful in other scenarios, too, I thought
> about making this its own feature. Doing that switch on clone() seems
> a nice place for that, IMHO.
> 
> As there might be potentially even more CLONE_* flags in the future,
> and the bitmask size is limited, this raises the question on how to
> proceed with those flag additions in the future.
> 
> What's your thoughts on that ?
> 
> 
> --mtx
> 
> -- 
> Enrico Weigelt, metux IT consult
> Free software and Linux embedded engineering
> i...@metux.net -- +49-151-27565287