re: fexecve

2019-09-08 Thread matthew green
not really commenting on the proposal itself, but ..

> Let us not forget that you need a binary inside the chroot that can
> call fexecve() on a file descriptor or the ability to build such a
> binary.

this is only one buffer overflow away...  ie, strength in
layers would imply you should not rely this.


.mrg.


Re: fexecve

2019-09-08 Thread Mouse
>> (I'd actually _like_ to see something capabilityish, in which case
>> "can use fexecve" would be a capability that could be removed, from
>> init if need be, on systems that care about this sort of thing.)
> Couldn't we have an enable/disable sysctl variable for this?

Certainly.  I would count that as "something capabilityish" - after
all, assuming it's per-process, in what ways, aside from the APIs used
to control it, does that differ from a capability?

Or, to return for a moment to my roots,

$ SET PROC/PRIV=FEXECVE

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTMLmo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: fexecve

2019-09-08 Thread Taylor R Campbell
> Date: Sun, 8 Sep 2019 14:03:03 -0400
> From: Thor Lancelot Simon 
> 
> On Sun, Sep 08, 2019 at 01:23:46PM -0400, Christos Zoulas wrote:
> > 
> > Here's a simple fexecve(2) implementation. Comments?
> 
> I think this is dangerous in systems which use chroot into filesystems
> mounted noexec (or nosuid) and file-descriptor passing into the constrained
> environment to feed data.  Now new executables (and even setuid ones) can
> be fed in, too.
> 
> What can we do about that?

It sounds like you're positing:

- there is a chrooted process A
- there is a colluding process B outside the chroot
- they share a socket
- B can open setuid executables and send their fds over the socket
- A can now execute setuid executables outside the chroot

How is this substantively different from the following?

- there is a chrooted process A
- there is a colluding process B outside the chroot
- they share a socket
- A can ask B to execute files by pathname and B will happily oblige
- A can now execute setuid executables outside the chroot

That is, under what meaningful circumstances can you rule out the
first scenario but not the second one?


Re: fexecve

2019-09-08 Thread Paul Goyette

On Sun, 8 Sep 2019, Mouse wrote:


(2) Losing the command name isn't good; lots of people turn process
accounting on for logging (in fact, I'd assume 99.9% of people who
turn process accounting on use it purely for logging) and it
substantially decreases the utility if it's easily circumvented.


Isn't the command name easy to lose and/or forge already, with links if
nothng else?  In any case, it seems to me this is one reason to make
fexecve() optional.  (I'd actually _like_ to see something
capabilityish, in which case "can use fexecve" would be a capability
that could be removed, from init if need be, on systems that care about
this sort of thing.)


Couldn't we have an enable/disable sysctl variable for this?



(3) Setugid processes should be prohibited, or at least setugid
dynamically-linked processes, because otherwise there's a window
where a live update of a library might be used to run the old binary
with a new set of libraries.


How does fexecve() make anything possible here that wasn't possible
before?  It seems to me that updating .so libraries has always carried
this risk, so I must be missing something.

/~\ The ASCII Mouse
\ / Ribbon Campaign
X  Against HTML mo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B

!DSPAM:5d756850213181273910470!




++--+---+
| Paul Goyette   | PGP Key fingerprint: | E-mail addresses: |
| (Retired)  | FA29 0E3B 35AF E8AE 6651 | p...@whooppee.com |
| Software Developer | 0786 F758 55DE 53BA 7731 | pgoye...@netbsd.org   |
++--+---+


Re: fexecve

2019-09-08 Thread Mouse
> (2) Losing the command name isn't good; lots of people turn process
> accounting on for logging (in fact, I'd assume 99.9% of people who
> turn process accounting on use it purely for logging) and it
> substantially decreases the utility if it's easily circumvented.

Isn't the command name easy to lose and/or forge already, with links if
nothng else?  In any case, it seems to me this is one reason to make
fexecve() optional.  (I'd actually _like_ to see something
capabilityish, in which case "can use fexecve" would be a capability
that could be removed, from init if need be, on systems that care about
this sort of thing.)

> (3) Setugid processes should be prohibited, or at least setugid
> dynamically-linked processes, because otherwise there's a window
> where a live update of a library might be used to run the old binary
> with a new set of libraries.

How does fexecve() make anything possible here that wasn't possible
before?  It seems to me that updating .so libraries has always carried
this risk, so I must be missing something.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTMLmo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: fexecve

2019-09-08 Thread Alexander Nasonov
Christos Zoulas wrote:
> - We can completely dissallow fexecve in chrooted environments.

Full disk encryption (loaded with cgdroot.kmod) requires a complete
system to be chrooted.

-- 
Alex


Re: fexecve

2019-09-08 Thread Mouse
> Here's a simple fexecve(2) implementation.  Comments?

Strikes me as a very good thing to have optionally available.

I'd need to think a good deal more to decide whether I think it's a
reasonable thing to have enabled by default.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTMLmo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: fexecve

2019-09-08 Thread David Holland
On Sun, Sep 08, 2019 at 01:23:46PM -0400, Christos Zoulas wrote:
 > Here's a simple fexecve(2) implementation. Comments?

Two additional things come to mind besides the chroot issue:

(1) For consistency of permission handling, it ought to require that
the file be opened with O_EXEC, and check the execute bit at open
time, not at exec time. However, I imagine that doing so is against
whoever defined fexecve(). It also would create a gap in which a
process could store up references to files that were previously
executable and execute them later, perhaps after they've been
modified. So that isn't unproblematic.

But without that there are going to be a pile of odd corner cases,
which I doubt we'll be able to identify in full beforehand, and some
of which may turn out to be problematic as well. One that comes to
mind is: if you open a program file for write while it's initially
being written out, you can save that handle and use it later (after
chmod) to both alter the program and execute it. On its own this isn't
much use, but I'm not convinced that it can't be in combination with
other circumstances.

(2) Losing the command name isn't good; lots of people turn process
accounting on for logging (in fact, I'd assume 99.9% of people who
turn process accounting on use it purely for logging) and it
substantially decreases the utility if it's easily circumvented. Also,
breaking $ORIGIN for fexecve will probably lead to howling
eventually.

Not trivial to fix though.

(3) Setugid processes should be prohibited, or at least setugid
dynamically-linked processes, because otherwise there's a window where
a live update of a library might be used to run the old binary with a
new set of libraries. At least with ELF this can easily lead to UB that
might be exploitable.

-- 
David A. Holland
dholl...@netbsd.org


Re: fexecve

2019-09-08 Thread Christos Zoulas
In article <20190908183938.ga25...@panix.com>,
Thor Lancelot Simon   wrote:
>On Sun, Sep 08, 2019 at 06:27:11PM -, Christos Zoulas wrote:
>> In article <20190908180303.ga6...@panix.com>,
>> Thor Lancelot Simon   wrote:
>> >On Sun, Sep 08, 2019 at 01:23:46PM -0400, Christos Zoulas wrote:
>> >> 
>> >> Here's a simple fexecve(2) implementation. Comments?
>> >
>> >I think this is dangerous in systems which use chroot into filesystems
>> >mounted noexec (or nosuid) and file-descriptor passing into the constrained
>> >environment to feed data.  Now new executables (and even setuid ones) can
>> >be fed in, too.
>> >
>> >What can we do about that?
>> 
>> - We can completely dissallow fexecve in chrooted environments.
>> 
>> or
>> 
>> - We can check the permissions of the mountpoint of the current working
>>   directory in addition to checking the mountpoint of the executable's
>>   vnode.
>
>I'd like to figure out a way to make this _optional_ in chrooted environments
>because I think in a system designed to use it, it actually could provide a
>security enhancement.  At the same time, I'm worried about the effect on
>systems designed as sketched out above but without this feature in mind.
>
>But I'm having trouble thinking through how it'd work.  A flag of course and
>a test, but on what -- the receive side of the socket when the chroot's
>performed, perhaps?
>
>Or, maybe:
>
>1) Find a way to take the properties of the listen socket from which the
>   received-on socket was cloned into account; so if I chroot-then-listen
>   and I don't have a writable, executable filesystem in which to create
>   my listening socket, I can't receive an executable fd that way
>
>2) At chroot time, block executable fd passing on any socket that hasn't
>   been deliberately marked as "can receive executables" with fcntl
>
>Maybe those two in combination (neither looks easy, from my memory of the
>relevant code, particularly #1) would work?

Let us not forget that you need a binary inside the chroot that can
call fexecve() on a file descriptor or the ability to build such a
binary.

christos



Re: fexecve

2019-09-08 Thread Thor Lancelot Simon
On Sun, Sep 08, 2019 at 06:27:11PM -, Christos Zoulas wrote:
> In article <20190908180303.ga6...@panix.com>,
> Thor Lancelot Simon   wrote:
> >On Sun, Sep 08, 2019 at 01:23:46PM -0400, Christos Zoulas wrote:
> >> 
> >> Here's a simple fexecve(2) implementation. Comments?
> >
> >I think this is dangerous in systems which use chroot into filesystems
> >mounted noexec (or nosuid) and file-descriptor passing into the constrained
> >environment to feed data.  Now new executables (and even setuid ones) can
> >be fed in, too.
> >
> >What can we do about that?
> 
> - We can completely dissallow fexecve in chrooted environments.
> 
> or
> 
> - We can check the permissions of the mountpoint of the current working
>   directory in addition to checking the mountpoint of the executable's
>   vnode.

I'd like to figure out a way to make this _optional_ in chrooted environments
because I think in a system designed to use it, it actually could provide a
security enhancement.  At the same time, I'm worried about the effect on
systems designed as sketched out above but without this feature in mind.

But I'm having trouble thinking through how it'd work.  A flag of course and
a test, but on what -- the receive side of the socket when the chroot's
performed, perhaps?

Or, maybe:

1) Find a way to take the properties of the listen socket from which the
   received-on socket was cloned into account; so if I chroot-then-listen
   and I don't have a writable, executable filesystem in which to create
   my listening socket, I can't receive an executable fd that way

2) At chroot time, block executable fd passing on any socket that hasn't
   been deliberately marked as "can receive executables" with fcntl

Maybe those two in combination (neither looks easy, from my memory of the
relevant code, particularly #1) would work?

Thor


Re: fexecve

2019-09-08 Thread Christos Zoulas
In article <20190908180303.ga6...@panix.com>,
Thor Lancelot Simon   wrote:
>On Sun, Sep 08, 2019 at 01:23:46PM -0400, Christos Zoulas wrote:
>> 
>> Here's a simple fexecve(2) implementation. Comments?
>
>I think this is dangerous in systems which use chroot into filesystems
>mounted noexec (or nosuid) and file-descriptor passing into the constrained
>environment to feed data.  Now new executables (and even setuid ones) can
>be fed in, too.
>
>What can we do about that?

- We can completely dissallow fexecve in chrooted environments.

or

- We can check the permissions of the mountpoint of the current working
  directory in addition to checking the mountpoint of the executable's
  vnode.

christos



Re: fexecve

2019-09-08 Thread Thor Lancelot Simon
On Sun, Sep 08, 2019 at 01:23:46PM -0400, Christos Zoulas wrote:
> 
> Here's a simple fexecve(2) implementation. Comments?

I think this is dangerous in systems which use chroot into filesystems
mounted noexec (or nosuid) and file-descriptor passing into the constrained
environment to feed data.  Now new executables (and even setuid ones) can
be fed in, too.

What can we do about that?

Thor


fexecve

2019-09-08 Thread Christos Zoulas


Here's a simple fexecve(2) implementation. Comments?

christos

Index: include/unistd.h
===
RCS file: /cvsroot/src/include/unistd.h,v
retrieving revision 1.151
diff -u -p -u -r1.151 unistd.h
--- include/unistd.h18 Nov 2018 19:22:23 -  1.151
+++ include/unistd.h8 Sep 2019 17:20:08 -
@@ -107,6 +107,7 @@ int  execle(const char *, const char *, 
 int execlp(const char *, const char *, ...);
 int execv(const char *, char * const *);
 int execve(const char *, char * const *, char * const *);
+int fexecve(int, char * const *, char * const *);
 int execvp(const char *, char * const *);
 pid_t   fork(void);
 longfpathconf(int, int);
Index: sys/kern/exec_elf.c
===
RCS file: /cvsroot/src/sys/kern/exec_elf.c,v
retrieving revision 1.98
diff -u -p -u -r1.98 exec_elf.c
--- sys/kern/exec_elf.c 7 Jun 2019 23:35:52 -   1.98
+++ sys/kern/exec_elf.c 8 Sep 2019 17:20:08 -
@@ -157,6 +157,7 @@ elf_populate_auxv(struct lwp *l, struct 
size_t len, vlen;
AuxInfo ai[ELF_AUX_ENTRIES], *a, *execname;
struct elf_args *ap;
+   char *path = l->l_proc->p_path;
int error;
 
a = ai;
@@ -224,9 +225,11 @@ elf_populate_auxv(struct lwp *l, struct 
a->a_v = l->l_proc->p_stackbase;
a++;
 
-   execname = a;
-   a->a_type = AT_SUN_EXECNAME;
-   a++;
+   if (*path == '/') {
+   execname = a;
+   a->a_type = AT_SUN_EXECNAME;
+   a++;
+   }
 
exec_free_emul_arg(pack);
} else {
@@ -242,7 +245,6 @@ elf_populate_auxv(struct lwp *l, struct 
KASSERT(vlen <= sizeof(ai));
 
if (execname) {
-   char *path = l->l_proc->p_path;
execname->a_v = (uintptr_t)(*stackp + vlen);
len = strlen(path) + 1;
if ((error = copyout(path, (*stackp + vlen), len)) != 0)
Index: sys/kern/kern_exec.c
===
RCS file: /cvsroot/src/sys/kern/kern_exec.c,v
retrieving revision 1.479
diff -u -p -u -r1.479 kern_exec.c
--- sys/kern/kern_exec.c7 Sep 2019 15:34:44 -   1.479
+++ sys/kern/kern_exec.c8 Sep 2019 17:20:08 -
@@ -309,7 +309,8 @@ exec_path_free(struct execve_data *data)
 {  
pathbuf_stringcopy_put(data->ed_pathbuf, data->ed_pathstring);
pathbuf_destroy(data->ed_pathbuf);
-   PNBUF_PUT(data->ed_resolvedpathbuf);
+   if (data->ed_resolvedpathbuf)
+   PNBUF_PUT(data->ed_resolvedpathbuf);
 }
 
 /*
@@ -343,22 +344,35 @@ check_exec(struct lwp *l, struct exec_pa
 {
int error, i;
struct vnode*vp;
-   struct nameidata nd;
size_t  resid;
 
-   // grab the absolute pathbuf here before namei() trashes it.
-   pathbuf_copystring(pb, epp->ep_resolvedname, PATH_MAX);
-   NDINIT(, LOOKUP, FOLLOW | LOCKLEAF | TRYEMULROOT, pb);
+   if (epp->ep_resolvedname) {
+   struct nameidata nd;
 
-   /* first get the vnode */
-   if ((error = namei()) != 0)
-   return error;
-   epp->ep_vp = vp = nd.ni_vp;
+   // grab the absolute pathbuf here before namei() trashes it.
+   pathbuf_copystring(pb, epp->ep_resolvedname, PATH_MAX);
+   NDINIT(, LOOKUP, FOLLOW | LOCKLEAF | TRYEMULROOT, pb);
+
+   /* first get the vnode */
+   if ((error = namei()) != 0)
+   return error;
 
+   epp->ep_vp = vp = nd.ni_vp;
 #ifdef DIAGNOSTIC
-   /* paranoia (take this out once namei stuff stabilizes) */
-   memset(nd.ni_pnbuf, '~', PATH_MAX);
+   /* paranoia (take this out once namei stuff stabilizes) */
+   memset(nd.ni_pnbuf, '~', PATH_MAX);
 #endif
+   } else {
+   struct file *fp;
+
+   if ((error = fd_getvnode(epp->ep_xfd, )) != 0)
+   return error;
+   epp->ep_vp = vp = fp->f_vnode;
+   vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
+   vref(vp);
+   fd_putfile(epp->ep_xfd);
+   }
+
 
/* check access and type */
if (vp->v_type != VREG) {
@@ -387,19 +401,21 @@ check_exec(struct lwp *l, struct exec_pa
/* unlock vp, since we need it unlocked from here on out. */
VOP_UNLOCK(vp);
 
+   if (epp->ep_resolvedname) {
 #if NVERIEXEC > 0
-   error = veriexec_verify(l, vp, epp->ep_resolvedname,
-   epp->ep_flags & EXEC_INDIR ? VERIEXEC_INDIRECT : VERIEXEC_DIRECT,
-   NULL);
-   if (error)
-   goto bad2;
+   error = veriexec_verify(l, vp, epp->ep_resolvedname,
+   epp->ep_flags & EXEC_INDIR ? VERIEXEC_INDIRECT
+

Re: pool guard

2019-09-08 Thread Jason Thorpe



> On Sep 8, 2019, at 4:48 PM, Maxime Villard  wrote:
> 
> "Fixing" this entails first having a UVM that can scale up to and work
> reasonably well with such gigantic amounts of RAM, which is far from being the
> case currently. We will need to have SMP, NUMA, large pages, and in short, we
> will likely have to rewrite UVM entirely.

Hence "at some point".  I am under no illusions that it won't be a large amount 
of work.  At the very least, document the limitation so it's easy to find later.

-- thorpej



Re: pool guard

2019-09-08 Thread Maxime Villard

Le 08/09/2019 à 08:54, Jason Thorpe a écrit :

On Sep 8, 2019, at 9:29 AM, Maxime Villard  wrote:
Don't confuse VA and PA. NetBSD-amd64 supports 16TB of PA, so even if
you have 48TB nvdimms it gets truncated to 16TB. Then, we have 32TB
of VA, twice more than the maximum PA. So again, we are fine.


Yes, but obviously we should fix that at some point.


"Fixing" this entails first having a UVM that can scale up to and work
reasonably well with such gigantic amounts of RAM, which is far from being the
case currently. We will need to have SMP, NUMA, large pages, and in short, we
will likely have to rewrite UVM entirely.

By the time we have done all of that, the 64bit CPUs will already ship with
5-level page trees which multiply by 512 the size of the VA space, so again,
we will be largely fine.

Even otherwise, revisiting pool guards to increase the mapped/unmapped ratio
is a very insignificant and easy task.

I see no problem related to VA/PA amounts.


Re: pool guard

2019-09-08 Thread Jason Thorpe


> On Sep 8, 2019, at 9:29 AM, Maxime Villard  wrote:
> 
> Don't confuse VA and PA. NetBSD-amd64 supports 16TB of PA, so even if
> you have 48TB nvdimms it gets truncated to 16TB. Then, we have 32TB
> of VA, twice more than the maximum PA. So again, we are fine.

Yes, but obviously we should fix that at some point.

-- thorpej



Re: pool guard

2019-09-08 Thread Maxime Villard

Le 07/09/2019 à 23:47, matthew green a écrit :

No performance cost, because these guarded buffers are allocated only when the
pools grow, which is a rare operation that occurs almost only at boot time. No
actual memory consumption either, because unmapped areas don't consume physical
memory, only virtual, and on 64bit arches we have plenty of that - eg 32TB on
amd64, far beyond what we will ever need -, so no problem with consuming VA.


i like this idea, but i would like to point out that HPE already
sell a machine with 48TiB ram and nvdimms are going to explode
the apparently memory in the coming years, so "32TiB" is very far
from "plenty".  we have many challenges to get beyond 8TiB tho,
since we count pages in 'int' all over uvm.


Don't confuse VA and PA. NetBSD-amd64 supports 16TB of PA, so even if
you have 48TB nvdimms it gets truncated to 16TB. Then, we have 32TB
of VA, twice more than the maximum PA. So again, we are fine.