Re: start unlocking kbind(2)

2022-06-24 Thread Scott Cheloha
On Wed, Jun 15, 2022 at 10:40:35PM -0500, Scott Cheloha wrote:
> On Wed, Jun 15, 2022 at 06:17:07PM -0600, Theo de Raadt wrote:
> > Mark Kettenis  wrote:
> > 
> > > Well, I believe that Scott was trying to fix a race condition that can
> > > only happen if code is using kbind(2) incorrectly, i.e. when the
> > > threads deliberately pass different cookies to kbind(2) or execute
> > > kbind(2) from different "text" addresses.
> > > 
> > > I still think the solution is simply to accept that race condition.
> > 
> > Right.
> > 
> > People are not calling kbind.  They are calling syscall(SYS_kbind
> > 
> > The man page says "don't do that".  No user serviceable parts inside.
> > Do not provide to children.
> > 
> > That said, Scott is about to share a diff he and I did a few cycles
> > around, to at least make the call-in transaction be a lock.
> 
> [...]
> 
> This patch reorganizes the logic in sys_kbind() preceding the
> copyin(9) so that we only need to take the kernel lock in a single
> spot if something goes wrong and we need to raise SIGILL.
> 
> It also puts the per-process mutex, ps_mtx, around that logic.  This
> guarantees that the first thread to reach sys_kbind() sets
> ps_kbind_addr and ps_kbind_cookie, even in oddball situations where
> the program isn't using ld.so(1) correctly.
> 
> [...]

10 days, no replies public or private.  I trust that means the patch
is fine.

Here is a tweaked patch with the binding loop wrapped with the kernel
lock.  We can more carefully determine whether uvm_unmap_remove(),
uvm_map_extract(), and uvm_unmap_detach() are MP-safe in a subsequent
patch.  They *look* safe, but I can't be sure, and nobody volunteered
any thoughts or review for the prior patch, so we'll do the
conservative thing and push the kernel lock down, just as dlg@ was
going to do.
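
For the curious, the conservative shape is roughly this (a sketch
against the sys_kbind() code, with the loop body elided):

        /* ps_kbind_* checks already done above, under ps_mtx */
        KERNEL_LOCK();
        while (count > 0) {
                /* uvm_map_extract(), apply the update,
                 * uvm_unmap_remove(), uvm_unmap_detach() --
                 * unchanged, still under the kernel lock */
                count--;
        }
        KERNEL_UNLOCK();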

If something comes of guenther@'s PT_OPENBSD_KBIND idea we can
trivially remove ps_mtx from this code.

OK?

Index: kern/syscalls.master
===
RCS file: /cvs/src/sys/kern/syscalls.master,v
retrieving revision 1.224
diff -u -p -r1.224 syscalls.master
--- kern/syscalls.master  16 May 2022 07:36:04 -  1.224
+++ kern/syscalls.master  24 Jun 2022 14:43:26 -
@@ -194,7 +194,7 @@
const struct timespec *times, int flag); }
 85 STD { int sys_futimens(int fd, \
const struct timespec *times); }
-86 STD { int sys_kbind(const struct __kbind *param, \
+86 STD NOLOCK  { int sys_kbind(const struct __kbind *param, \
size_t psize, int64_t proc_cookie); }
 87 STD NOLOCK  { int sys_clock_gettime(clockid_t clock_id, \
struct timespec *tp); }
Index: kern/init_sysent.c
===
RCS file: /cvs/src/sys/kern/init_sysent.c,v
retrieving revision 1.237
diff -u -p -r1.237 init_sysent.c
--- kern/init_sysent.c  16 May 2022 07:38:10 -  1.237
+++ kern/init_sysent.c  24 Jun 2022 14:43:26 -
@@ -1,4 +1,4 @@
-/* $OpenBSD: init_sysent.c,v 1.237 2022/05/16 07:38:10 mvs Exp $   */
+/* $OpenBSD$   */
 
 /*
  * System call switch table.
@@ -204,7 +204,7 @@ const struct sysent sysent[] = {
sys_utimensat },/* 84 = utimensat */
{ 2, s(struct sys_futimens_args), 0,
sys_futimens }, /* 85 = futimens */
-   { 3, s(struct sys_kbind_args), 0,
+   { 3, s(struct sys_kbind_args), SY_NOLOCK | 0,
sys_kbind },/* 86 = kbind */
{ 2, s(struct sys_clock_gettime_args), SY_NOLOCK | 0,
sys_clock_gettime },/* 87 = clock_gettime */
Index: sys/proc.h
===
RCS file: /cvs/src/sys/sys/proc.h,v
retrieving revision 1.330
diff -u -p -r1.330 proc.h
--- sys/proc.h  13 May 2022 15:32:00 -  1.330
+++ sys/proc.h  24 Jun 2022 14:43:27 -
@@ -234,8 +234,8 @@ struct process {
uint64_t ps_pledge;
uint64_t ps_execpledge;
 
-   int64_t ps_kbind_cookie;
-   u_long  ps_kbind_addr;
+   int64_t ps_kbind_cookie;/* [m] */
+   u_long  ps_kbind_addr;  /* [m] */
 
 /* End area that is copied on creation. */
 #define ps_endcopy ps_refcnt
Index: sys/syscall.h
===
RCS file: /cvs/src/sys/sys/syscall.h,v
retrieving revision 1.234
diff -u -p -r1.234 syscall.h
--- sys/syscall.h   16 May 2022 07:38:10 -  1.234
+++ sys/syscall.h   24 Jun 2022 14:43:27 -
@@ -1,4 +1,4 @@
-/* $OpenBSD: syscall.h,v 1.234 2022/05/16 07:38:10 mvs Exp $   */
+/* $OpenBSD$   */
 
 /*
  * System call numbers.
Index: sys/syscallargs.h
===
RCS file: /cvs/src/sys/sys/syscallargs.h,v
retrieving 

Re: start unlocking kbind(2)

2022-06-15 Thread Scott Cheloha
On Wed, Jun 15, 2022 at 06:17:07PM -0600, Theo de Raadt wrote:
> Mark Kettenis  wrote:
> 
> > Well, I believe that Scott was trying to fix a race condition that can
> > only happen if code is using kbind(2) incorrectly, i.e. when the
> > threads deliberately pass different cookies to kbind(2) or execute
> > kbind(2) from different "text" addresses.
> > 
> > I still think the solution is simply to accept that race condition.
> 
> Right.
> 
> People are not calling kbind.  They are calling syscall(SYS_kbind
> 
> The man page says "don't do that".  No user serviceable parts inside.
> Do not provide to children.
> 
> That said, Scott is about to share a diff he and I did a few cycles
> around, to at least make the call-in transaction be a lock.

Okay, here it is.

This patch reorganizes the logic in sys_kbind() preceding the
copyin(9) so that we only need to take the kernel lock in a single
spot if something goes wrong and we need to raise SIGILL.

It also puts the per-process mutex, ps_mtx, around that logic.  This
guarantees that the first thread to reach sys_kbind() sets
ps_kbind_addr and ps_kbind_cookie, even in oddball situations where
the program isn't using ld.so(1) correctly.

I am aware that this is "unsupported", but on balance I would prefer
that the initialization of those two variables were atomic.  I think
taking the mutex is a very small price to pay for the guarantee that
the "security check" logic always happens the way it was intended to
happen, in the order it was intended to happen.  Note that we are not
wrapping the binding loop up in a lock, just the ps_kbind_* variable
logic.

... if Mark and Philip want to push back on that, I think I would
yield and just drop the mutex.

... also, if Philip goes ahead with the PT_OPENBSD_KBIND thing we no
longer need the mutex because, as I understand it, we would no longer
need the ps_kbind_cookie.

Even in either of those cases, I still think the logic refactoring
shown here is a good thing.

--

I have been running with this now for a day and haven't hit any
panics.  The last time I tried unlocking kbind(2) I hit panics pretty
quickly due to KERNEL_ASSERT_LOCKED() calls down in the binding loop.

Because I'm not seeing that, I assume something has changed down in
UVM and we no longer need to wrap the binding loop with the kernel
lock.

Thoughts?

Index: uvm/uvm_mmap.c
===
RCS file: /cvs/src/sys/uvm/uvm_mmap.c,v
retrieving revision 1.169
diff -u -p -r1.169 uvm_mmap.c
--- uvm/uvm_mmap.c  19 Jan 2022 10:43:48 -  1.169
+++ uvm/uvm_mmap.c  16 Jun 2022 03:36:53 -
@@ -1127,7 +1127,7 @@ sys_kbind(struct proc *p, void *v, regis
size_t psize, s;
u_long pc;
int count, i, extra;
-   int error;
+   int error, sigill = 0;
 
/*
 * extract syscall args from uap
@@ -1135,23 +1135,41 @@ sys_kbind(struct proc *p, void *v, regis
paramp = SCARG(uap, param);
psize = SCARG(uap, psize);
 
-   /* a NULL paramp disables the syscall for the process */
-   if (paramp == NULL) {
-   if (pr->ps_kbind_addr != 0)
-   sigexit(p, SIGILL);
-   pr->ps_kbind_addr = BOGO_PC;
-   return 0;
-   }
-
-   /* security checks */
+   /*
+* If paramp is NULL and we're uninitialized, disable the syscall
+* for the process.  Raise SIGILL if we're already initialized.
+*
+* If paramp is non-NULL and we're uninitialized, do initialization.
+* Otherwise, do security checks and raise SIGILL on failure.
+*/
pc = PROC_PC(p);
-   if (pr->ps_kbind_addr == 0) {
+   mtx_enter(&pr->ps_mtx);
+   if (paramp == NULL) {
+   if (pr->ps_kbind_addr == 0)
+   pr->ps_kbind_addr = BOGO_PC;
+   else
+   sigill = 1;
+   } else if (pr->ps_kbind_addr == 0) {
pr->ps_kbind_addr = pc;
pr->ps_kbind_cookie = SCARG(uap, proc_cookie);
-   } else if (pc != pr->ps_kbind_addr || pc == BOGO_PC)
-   sigexit(p, SIGILL);
-   else if (pr->ps_kbind_cookie != SCARG(uap, proc_cookie))
+   } else if (pc != pr->ps_kbind_addr || pc == BOGO_PC ||
+   pr->ps_kbind_cookie != SCARG(uap, proc_cookie)) {
+   sigill = 1;
+   }
+   mtx_leave(&pr->ps_mtx);
+
+   /* Raise SIGILL if something is off. */
+   if (sigill) {
+   KERNEL_LOCK();
sigexit(p, SIGILL);
+   /* NOTREACHED */
+   KERNEL_UNLOCK();
+   }
+
+   /* We're done if we were disabling the syscall. */
+   if (paramp == NULL)
+   return 0;
+
if (psize < sizeof(struct __kbind) || psize > sizeof(param))
return EINVAL;
if ((error = copyin(paramp, &param, psize)))
Index: kern/syscalls.master
===

Re: start unlocking kbind(2)

2022-06-15 Thread Theo de Raadt
Mark Kettenis  wrote:

> Well, I believe that Scott was trying to fix a race condition that can
> only happen if code is using kbind(2) incorrectly, i.e. when the
> threads deliberately pass different cookies to kbind(2) or execute
> kbind(2) from different "text" addresses.
> 
> I still think the solution is simply to accept that race condition.

Right.

People are not calling kbind.  They are calling syscall(SYS_kbind

The man page says "don't do that".  No user serviceable parts inside.
Do not provide to children.

That said, Scott is about to share a diff he and I did a few cycles
around, to at least make the call-in transaction be a lock.



Re: start unlocking kbind(2)

2022-06-15 Thread Mark Kettenis
> From: Philip Guenther 
> Date: Wed, 15 Jun 2022 10:07:06 -0900
> 
> On Mon, 13 Jun 2022, Theo de Raadt wrote:
> > Scott Cheloha  wrote:
> > > > Am I wrong that kbind is never called twice in the same address space?
> > > 
> > > Isn't this exactly what happened the last time we tried this?
> > 
> > Tried what?  kbind has never been NOLOCK.
> 
> Scott's referring to rev 1.237 of kern_fork.c, where we tried to require the
> first use of kbind be before __tfork(2) was called.  This blew up because
> there's at least two ways to legitimately call pthread_create(3) before
> doing your first lazy binding:
> 
>  1) build with -znow, then dlopen() something not linked with -znow after
> calling pthread_create()
> 
>  2) pass address of pthread_create() to another function which calls 
> through the pointer, no special link options required (just need to 
> split up the &pthread_create and *funcptr enough that the compiler
> won't optimize to a direct call)
> 
> I've thought about how to possibly force a lazy resolution and haven't
> thought up anything that wasn't unmaintainable, and probably
> unimplementable.

Well, I believe that Scott was trying to fix a race condition that can
only happen if code is using kbind(2) incorrectly, i.e. when the
threads deliberately pass different cookies to kbind(2) or execute
kbind(2) from different "text" addresses.

I still think the solution is simply to accept that race condition.



Re: start unlocking kbind(2)

2022-06-15 Thread Philip Guenther
On Mon, 13 Jun 2022, Theo de Raadt wrote:
> Scott Cheloha  wrote:
> > > Am I wrong that kbind is never called twice in the same address space?
> >
> > Isn't this exactly what happened the last time we tried this?
>
> Tried what?  kbind has never been NOLOCK.

Scott's referring to rev 1.237 of kern_fork.c, where we tried to require
the first use of kbind be before __tfork(2) was called.  This blew up
because there's at least two ways to legitimately call pthread_create(3)
before doing your first lazy binding:

 1) build with -znow, then dlopen() something not linked with -znow after
calling pthread_create()

 2) pass address of pthread_create() to another function which calls
through the pointer, no special link options required (just need to
split up the &pthread_create and *funcptr enough that the compiler
won't optimize to a direct call)
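
For case 2 concretely, a minimal hypothetical demo (exact behavior
depends on the toolchain; the point is that the pointer comes from an
eagerly-resolved data relocation, so the indirect call never enters
the lazy-binding stub):

        #include <pthread.h>

        static void *worker(void *arg) { return NULL; }

        /* static initializer: fixed up at startup by a data relocation */
        static int (*create)(pthread_t *, const pthread_attr_t *,
            void *(*)(void *), void *) = pthread_create;

        int main(void)
        {
                pthread_t t;
                create(&t, NULL, worker, NULL); /* a thread exists now */
                pthread_join(t, NULL);  /* first lazy bind, after __tfork */
                return 0;
        }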


I've thought about how to possibly force a lazy resolution and haven't
thought up anything that wasn't unmaintainable, and probably
unimplementable.


So, what could work reliably?

My first thought would be to have ld.so's _dl_boot() do the equivalent of
struct __kbind kb = {
.kb_addr = _syscall_in_dl_bind,
.kb_size = 0,
};
kbind(&kb, sizeof kb, pcookie)

...after teaching the kernel to accept such a first call to kbind and set
pr->ps_kbind_* from them.  That would permit the reimposing of the "first
kbind can't be after __tfork" restriction; is it indeed enough to permit
reading the ps_kbind_* members without locks?
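
A hypothetical sketch of the kernel half of that idea (the zero-size
"registration" convention and the param.uk field name are assumptions
made here for illustration, not committed code):

        /* after the usual copyin() of the __kbind headers into param */
        if (pr->ps_kbind_addr == 0 && count == 1 &&
            param.uk[0].kb_size == 0) {
                /* pure registration from _dl_boot(): remember the
                 * caller and cookie, do no binding work */
                pr->ps_kbind_addr = pc;
                pr->ps_kbind_cookie = SCARG(uap, proc_cookie);
                return 0;
        }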


Another idea is to discard the cookie restriction and just trust the
calling address to be enough.  I mean, we trust it to be sufficient
for sigreturn and that's perhaps more powerful, no?  If we're fine with
that, then how about giving ld.so a PT_OPENBSD_KBIND segment with the
required address, so the kernel can just set it up at exec time?
Eventually, we could disable kbind if said header wasn't present on the ELF
interpreter.
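
Sketched, the exec-time half could look like this (hypothetical:
PT_OPENBSD_KBIND does not exist, and which phdr field carries the
address is an open detail; p_vaddr plus the interpreter's load base
is assumed here):

        for (i = 0; i < eh->e_phnum; i++) {
                if (ph[i].p_type == PT_OPENBSD_KBIND) {
                        pr->ps_kbind_addr = loadbase + ph[i].p_vaddr;
                        break;
                }
        }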

(In a grotesque abuse of the ELF program header, we *could* keep the cookie
by, for example, having the PT_OPENBSD_KBIND p_vaddr/p_memsz identify the
location of the random cookie in ld.so memory and pass the kbind syscall
address via p_paddr.  You are permitted to barf.)


Philip


Re: start unlocking kbind(2)

2022-06-13 Thread Theo de Raadt
Scott Cheloha  wrote:

> > Am I wrong that kbind is never called twice in the same address space?
> 
> Isn't this exactly what happened the last time we tried this?

Tried what?  kbind has never been NOLOCK.

> > Can someone describe a synthetic sequence where racing sys_kbind calls
> > perform the wrong action?
> 
> This is highly synthetic, but:
> 
> If Thread1

Stop right there.

Is this in a static binary or a dynamic binary?

In a dynamic binary, ld.so has used kbind(), and kbind() cannot be used
again.

In a static binary, early_static_init() calls kbind() and locks you out.
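
(For reference, the lock-out is just the NULL-parameter call, sketched
here with the raw syscall since there is no libc stub, and with the
other argument values assumed:

        syscall(SYS_kbind, NULL, (size_t)0, (int64_t)0);

after which ps_kbind_addr is BOGO_PC and any later kbind attempt is
fatal.)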

When were these threads created?

> sets ps_kbind_addr before Thread2 but Thread2 compares
> ps_kbind_cookie with SCARG(uap, proc_cookie) before Thread1 can change
> ps_kbind_cookie from 0 to its own SCARG(uap, proc_cookie), it is possible
> that Thread2 will pass the security check and proceed, even though it
> should have caused a SIGILL.

You cannot call kbind() synthetically, from the manual page:

 kbind is currently intended for use by ld.so(1) only.

That is why there is no libc stub for it.  That's why dlg's demo program
replaces the libc one with a raw assembly version, so that it can call it.

> Thread2 knows ps_kbind_cookie is initially zero, so there is a
> theoretical race.

Again, how did you end up with threads, and how did you call kbind without
using assembly language, and what program would ever do this?

> If you wanted to make this statistically impossible you could
> initialize ps_kbind_cookie to a random value during fork(2) so that
> Thread2 has a 1 in 2^64 chance of guessing the initial
> ps_kbind_cookie value and bypassing the security check as
> described above.

Where did these threads come from before ld.so completed linkage resolution??




Re: start unlocking kbind(2)

2022-06-13 Thread Scott Cheloha
On Sun, Jun 12, 2022 at 12:12:33AM -0600, Theo de Raadt wrote:
> David Gwynne  wrote:
> 
> > On Wed, May 18, 2022 at 07:42:32PM -0600, Theo de Raadt wrote:
> > > Mark Kettenis  wrote:
> > > 
> > > > > Isn't the vm_map_lock enough?
> > > > 
> > > > Could be.  The fast path is going to take that lock anyway.  This
> > > > would require a bit of surgery to uvm_map_extract() to make sure we
> > > > don't take the vm_map_lock twice.  Worth exploring I'd say.
> > > 
> > > I think the vm_map_lock can be dropped before it reaches that code,
> > > because of 3 cases: (1) new kbind lock, (2) a repeated kbind lock and
> > > return, or (3) violation and process termination.
> > > 
> > > So before doing the copyin() and updates, simply vm_map_unlock()
> > > 
> > > Will that work and isn't it simpler than David's proposal?
> > 
> > I'm not super familiar with uvm so it's hard for me to say it would or
> > wouldn't be simpler, but my initial impressions are that it would be a
> > lot more surgery and I wouldn't be confident I did such changes right.
> 
> i don't understand.  As a rule, using fewer locks and mutexes is simpler
> to understand.
> 
> i do not understand the purpose for your diff, and think you have simply
> brushed my questions aside.
> 
> > Releasing a lock and then taking it again quickly can have undesirable
> > consequences too. If something else is waiting on the rwlock,
> > releasing it will wake them up, but they will probably lose because
> > we've already taken it again.
> 
> Then don't release the vm_map_lock and regrab it, but keep it active.
> 
> I think there is a lot of fuss going on here for a system call which, as
> far as I can tell, is never called in a threaded program.  Static libc or
> ld.so always call kbind early on, before threads, precisely ONCE, and any
> later call to it kills the program.  But why do we need a mutex for the
> simple action of observing that ps_kbind_addr is not 0?
> 
> Am I wrong that kbind is never called twice in the same address space?

Isn't this exactly what happened the last time we tried this?

> Can someone describe a synthetic sequence where racing sys_kbind calls
> perform the wrong action?

This is highly synthetic, but:

If Thread1 sets ps_kbind_addr before Thread2 but Thread2 compares
ps_kbind_cookie with SCARG(uap, proc_cookie) before Thread1 can change
ps_kbind_cookie from 0 to its own SCARG(uap, proc_cookie), it is possible
that Thread2 will pass the security check and proceed, even though it
should have caused a SIGILL.

Thread2 knows ps_kbind_cookie is initially zero, so there is a
theoretical race.

If you wanted to make this statistically impossible you could
initialize ps_kbind_cookie to a random value during fork(2) so that
Thread2 has a 1 in 2^64 chance of guessing the initial
ps_kbind_cookie value and bypassing the security check as
described above.

If you do that, then we can do the security check locklessly with
atomic_cas_ulong(9).
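
Roughly, and only as a sketch of the idea (not a complete design; the
cookie store still races, but now a guess has to hit a random 64-bit
value):

        /* at fork time: unguessable initial cookie */
        arc4random_buf(&pr->ps_kbind_cookie,
            sizeof(pr->ps_kbind_cookie));

        /* in sys_kbind(): lockless one-time claim of the caller pc */
        if (atomic_cas_ulong(&pr->ps_kbind_addr, 0, pc) == 0)
                pr->ps_kbind_cookie = SCARG(uap, proc_cookie);
        else if (pr->ps_kbind_addr != pc ||
            pr->ps_kbind_cookie != SCARG(uap, proc_cookie))
                sigexit(p, SIGILL);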



Re: start unlocking kbind(2)

2022-06-12 Thread Theo de Raadt
David Gwynne  wrote:

> On Wed, May 18, 2022 at 07:42:32PM -0600, Theo de Raadt wrote:
> > Mark Kettenis  wrote:
> > 
> > > > Isn't the vm_map_lock enough?
> > > 
> > > Could be.  The fast path is going to take that lock anyway.  This
> > > would require a bit of surgery to uvm_map_extract() to make sure we
> > > don't take the vm_map_lock twice.  Worth exploring I'd say.
> > 
> > I think the vm_map_lock can be dropped before it reaches that code,
> > because of 3 cases: (1) new kbind lock, (2) a repeated kbind lock and
> > return, or (3) violation and process termination.
> > 
> > So before doing the copyin() and updates, simply vm_map_unlock()
> > 
> > Will that work and isn't it simpler than David's proposal?
> 
> I'm not super familiar with uvm so it's hard for me to say it would or
> wouldn't be simpler, but my initial impressions are that it would be a
> lot more surgery and I wouldn't be confident I did such changes right.

i don't understand.  As a rule, using fewer locks and mutexes is simpler
to understand.

i do not understand the purpose for your diff, and think you have simply
brushed my questions aside.

> Releasing a lock and then taking it again quickly can have undesirable
> consequences too. If something else is waiting on the rwlock,
> releasing it will wake them up, but they will probably lose because
> we've already taken it again.

Then don't release the vm_map_lock and regrab it, but keep it active.


I think there is a lot of fuss going on here for a system call which, as
far as I can tell, is never called in a threaded program.  Static libc or
ld.so always call kbind early on, before threads, precisely ONCE, and any
later call to it kills the program.  But why do we need a mutex for the
simple action of observing that ps_kbind_addr is not 0?

Am I wrong that kbind is never called twice in the same address space?

Can someone describe a synthetic sequence where racing sys_kbind calls
perform the wrong action?



Re: start unlocking kbind(2)

2022-06-11 Thread David Gwynne
On Tue, May 31, 2022 at 04:48:16PM +0200, Martin Pieuchot wrote:
> On 18/05/22(Wed) 15:53, Alexander Bluhm wrote:
> > On Tue, May 17, 2022 at 10:44:54AM +1000, David Gwynne wrote:
> > > + cookie = SCARG(uap, proc_cookie);
> > > + if (pr->ps_kbind_addr == pc) {
> > > + membar_datadep_consumer();
> > > + if (pr->ps_kbind_cookie != cookie)
> > > + goto sigill;
> > > + } else {
> > 
> > You must use membar_consumer() here.  membar_datadep_consumer() is
> > a barrier between reading pointer and pointed data.  Only alpha
> > requires membar_datadep_consumer() for that, everywhere else it is
> > a NOP.
> > 
> > > + mtx_enter(&pr->ps_mtx);
> > > + kpc = pr->ps_kbind_addr;
> > 
> > Do we need kpc variable?  I would prefer to read explicit
> > pr->ps_kbind_addr in the two places where we use it.
> > 
> > I think the logic of barriers and mutexes is correct.
> > 
> > with the suggestions above OK bluhm@
> 
> I believe you should go ahead with the current diff.  ok with me.  Moving
> the field under the scope of another lock can be easily done afterward.

Agreed.  I'll commit this version tomorrow, with tweaks to make
bluhm@ happy.

Index: kern/init_sysent.c
===
RCS file: /cvs/src/sys/kern/init_sysent.c,v
retrieving revision 1.236
diff -u -p -r1.236 init_sysent.c
--- kern/init_sysent.c  1 May 2022 23:00:04 -   1.236
+++ kern/init_sysent.c  12 Jun 2022 04:48:06 -
@@ -1,4 +1,4 @@
-/* $OpenBSD: init_sysent.c,v 1.236 2022/05/01 23:00:04 tedu Exp $  */
+/* $OpenBSD$   */
 
 /*
  * System call switch table.
@@ -204,7 +204,7 @@ const struct sysent sysent[] = {
sys_utimensat },/* 84 = utimensat */
{ 2, s(struct sys_futimens_args), 0,
sys_futimens }, /* 85 = futimens */
-   { 3, s(struct sys_kbind_args), 0,
+   { 3, s(struct sys_kbind_args), SY_NOLOCK | 0,
sys_kbind },/* 86 = kbind */
{ 2, s(struct sys_clock_gettime_args), SY_NOLOCK | 0,
sys_clock_gettime },/* 87 = clock_gettime */
Index: kern/syscalls.master
===
RCS file: /cvs/src/sys/kern/syscalls.master,v
retrieving revision 1.223
diff -u -p -r1.223 syscalls.master
--- kern/syscalls.master  24 Feb 2022 07:41:51 -  1.223
+++ kern/syscalls.master  12 Jun 2022 04:48:06 -
@@ -194,7 +194,7 @@
const struct timespec *times, int flag); }
 85 STD { int sys_futimens(int fd, \
const struct timespec *times); }
-86 STD { int sys_kbind(const struct __kbind *param, \
+86 STD NOLOCK  { int sys_kbind(const struct __kbind *param, \
size_t psize, int64_t proc_cookie); }
 87 STD NOLOCK  { int sys_clock_gettime(clockid_t clock_id, \
struct timespec *tp); }
Index: uvm/uvm_mmap.c
===
RCS file: /cvs/src/sys/uvm/uvm_mmap.c,v
retrieving revision 1.169
diff -u -p -r1.169 uvm_mmap.c
--- uvm/uvm_mmap.c  19 Jan 2022 10:43:48 -  1.169
+++ uvm/uvm_mmap.c  12 Jun 2022 04:48:06 -
@@ -70,6 +70,7 @@
 #include <sys/pledge.h>
 #include <sys/unistd.h> /* for KBIND* */
 #include <sys/user.h>
+#include <sys/atomic.h>
 
 #include <machine/exec.h>   /* for __LDPGSZ */
 
@@ -1127,31 +1128,61 @@ sys_kbind(struct proc *p, void *v, regis
size_t psize, s;
u_long pc;
int count, i, extra;
+   uint64_t cookie;
int error;
 
/*
 * extract syscall args from uap
 */
paramp = SCARG(uap, param);
-   psize = SCARG(uap, psize);
 
/* a NULL paramp disables the syscall for the process */
if (paramp == NULL) {
+   mtx_enter(&pr->ps_mtx);
if (pr->ps_kbind_addr != 0)
-   sigexit(p, SIGILL);
+   goto leave_sigill;
pr->ps_kbind_addr = BOGO_PC;
+   mtx_leave(&pr->ps_mtx);
return 0;
}
 
/* security checks */
+
+   /*
+* ps_kbind_addr can only be set to 0 or BOGO_PC by the
+* kernel, not by a call from userland.
+*/
pc = PROC_PC(p);
-   if (pr->ps_kbind_addr == 0) {
-   pr->ps_kbind_addr = pc;
-   pr->ps_kbind_cookie = SCARG(uap, proc_cookie);
-   } else if (pc != pr->ps_kbind_addr || pc == BOGO_PC)
-   sigexit(p, SIGILL);
-   else if (pr->ps_kbind_cookie != SCARG(uap, proc_cookie))
-   sigexit(p, SIGILL);
+   if (pc == 0 || pc == BOGO_PC)
+   goto sigill;
+
+   cookie = SCARG(uap, proc_cookie);
+   if (pr->ps_kbind_addr == pc) {
+   membar_consumer();
+   if (pr->ps_kbind_cookie != cookie)
+   goto sigill;
+   } else {

Re: start unlocking kbind(2)

2022-06-11 Thread David Gwynne
On Wed, May 18, 2022 at 07:42:32PM -0600, Theo de Raadt wrote:
> Mark Kettenis  wrote:
> 
> > > Isn't the vm_map_lock enough?
> > 
> > Could be.  The fast path is going to take that lock anyway.  This
> > would require a bit of surgery to uvm_map_extract() to make sure we
> > don't take the vm_map_lock twice.  Worth exploring I'd say.
> 
> I think the vm_map_lock can be dropped before it reaches that code,
> because of 3 cases: (1) new kbind lock, (2) a repeated kbind lock and
> return, or (3) violation and process termination.
> 
> So before doing the copyin() and updates, simply vm_map_unlock()
> 
> Will that work and isn't it simpler than David's proposal?

I'm not super familiar with uvm so it's hard for me to say it would or
wouldn't be simpler, but my initial impressions are that it would be a
lot more surgery and I wouldn't be confident I did such changes right.

Releasing a lock and then taking it again quickly can have undesirable
consequences too. If something else is waiting on the rwlock,
releasing it will wake them up, but they will probably lose because
we've already taken it again.



Re: start unlocking kbind(2)

2022-05-31 Thread Martin Pieuchot
On 18/05/22(Wed) 15:53, Alexander Bluhm wrote:
> On Tue, May 17, 2022 at 10:44:54AM +1000, David Gwynne wrote:
> > +   cookie = SCARG(uap, proc_cookie);
> > +   if (pr->ps_kbind_addr == pc) {
> > +   membar_datadep_consumer();
> > +   if (pr->ps_kbind_cookie != cookie)
> > +   goto sigill;
> > +   } else {
> 
> You must use membar_consumer() here.  membar_datadep_consumer() is
> a barrier between reading pointer and pointed data.  Only alpha
> requires membar_datadep_consumer() for that, everywhere else it is
> a NOP.
> 
> > + mtx_enter(&pr->ps_mtx);
> > +   kpc = pr->ps_kbind_addr;
> 
> Do we need kpc variable?  I would prefer to read explicit
> pr->ps_kbind_addr in the two places where we use it.
> 
> I think the logic of barriers and mutexes is correct.
> 
> with the suggestions above OK bluhm@

I believe you should go ahead with the current diff.  ok with me.  Moving
the field under the scope of another lock can be easily done afterward.



Re: start unlocking kbind(2)

2022-05-18 Thread Theo de Raadt
Mark Kettenis  wrote:

> > Isn't the vm_map_lock enough?
> 
> Could be.  The fast path is going to take that lock anyway.  This
> would require a bit of surgery to uvm_map_extract() to make sure we
> don't take the vm_map_lock twice.  Worth exploring I'd say.

I think the vm_map_lock can be dropped before it reaches that code,
because of 3 cases: (1) new kbind lock, (2) a repeated kbind lock and
return, or (3) violation and process termination.

So before doing the copyin() and updates, simply vm_map_unlock()

Will that work and isn't it simpler than David's proposal?
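
I.e., something with this shape (a sketch; "map" standing in for the
process vm_map):

        vm_map_lock(map);
        /* ps_kbind_addr/ps_kbind_cookie checks, or first-call setup */
        vm_map_unlock(map);

        /* copyin() and the binding updates follow from here; the
         * extract/remove paths take the map lock themselves */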



Re: start unlocking kbind(2)

2022-05-18 Thread Mark Kettenis
> From: "Theo de Raadt" 
> Date: Wed, 18 May 2022 10:48:01 -0600
> 
> Mark Kettenis  wrote:
> 
> > > I guess there is another question -- should the ps_kbind_* variables
> > > be stored inside the uvmspace, rather than inside pr?
> > 
> > I think there are arguments for both.  But I don't think moving it
> > makes things easier.
> 
> Whatever proc sets the kbind configuration, wins.  Any other proc must
> set the same kbind (if it is different, it gets killed).

Correct.

> Isn't the vm_map_lock enough?

Could be.  The fast path is going to take that lock anyway.  This
would require a bit of surgery to uvm_map_extract() to make sure we
don't take the vm_map_lock twice.  Worth exploring I'd say.



Re: start unlocking kbind(2)

2022-05-18 Thread Theo de Raadt
Mark Kettenis  wrote:

> > I guess there is another question -- should the ps_kbind_* variables
> > be stored inside the uvmspace, rather than inside pr?
> 
> I think there are arguments for both.  But I don't think moving it
> makes things easier.

Whatever proc sets the kbind configuration, wins.  Any other proc must
set the same kbind (if it is different, it gets killed).

Isn't the vm_map_lock enough?





Re: start unlocking kbind(2)

2022-05-18 Thread Mark Kettenis
> From: "Theo de Raadt" 
> Date: Wed, 18 May 2022 10:09:38 -0600
> 
> I guess there is another question -- should the ps_kbind_* variables
> be stored inside the uvmspace, rather than inside pr?

I think there are arguments for both.  But I don't think moving it
makes things easier.



Re: start unlocking kbind(2)

2022-05-18 Thread Mark Kettenis
> From: "Theo de Raadt" 
> Date: Wed, 18 May 2022 10:05:03 -0600
> 
> David Gwynne  wrote:
> 
> > from yet another perspective, broad use of a common lock can end up
> > hurting in the long run because you may end up where everything is
> > serialised and you have to go back and do a ton of work again anyway.
> > 
> > > I think the theory is ps_kbind_addr is fixed to a shared address space
> > > in "pr", if you are threaded there is only one ps_kbind_addr for all the
> > > processes "p" sharing that address space.
> > 
> > yes.
> > 
> > > And execve() uses single_thread_set, which means you can't race?
> > 
> > also yes. the value starts at 0, updates are serialised by the
> > mutex, and the updates don't produce invalid intermediate state.
> 
> so what is the mutex for?  Is this being over-designed?
> 
> On a shared address space, ld.so will be entered before we thread,
> straight out of execve.  I don't think we have constructors starting
> threads, that would be very bad right?

Unfortunately code exists that creates threads in constructors.  At
least that was the conclusion when we had a very similar discussion
some months ago.

It's a seriously bad idea, but the code that actually does this
probably does even worse things.
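
The offending pattern is as small as this (hypothetical example):

        #include <pthread.h>

        static void *worker(void *arg) { return NULL; }

        /* runs before main(), possibly before the first lazy binding */
        __attribute__((constructor))
        static void start_thread_too_early(void)
        {
                pthread_t t;
                pthread_create(&t, NULL, worker, NULL);
        }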

> So there are two cases -- static binary and dynamic.  On one address space,
> kbind is only called once, and there is only one attempt, and any other
> attempt should kill you, because you are doing it manual & wrong.  <-- is
> this correct, guenther?

So in the dynamic case there may be multiple parallel calls to
kbind().  They should all come from the same address though, unless
someone is doing something deliberately nasty.



Re: start unlocking kbind(2)

2022-05-18 Thread Theo de Raadt
I guess there is another question -- should the ps_kbind_* variables
be stored inside the uvmspace, rather than inside pr?



Re: start unlocking kbind(2)

2022-05-18 Thread Theo de Raadt
David Gwynne  wrote:

> from yet another perspective, broad use of a common lock can end up
> hurting in the long run because you may end up where everything is
> serialised and you have to go back and do a ton of work again anyway.
> 
> > I think the theory is ps_kbind_addr is fixed to a shared address space
> > in "pr", if you are threaded there is only one ps_kbind_addr for all the
> > processes "p" sharing that address space.
> 
> yes.
> 
> > And execve() uses single_thread_set, which means you can't race?
> 
> also yes. the value starts at 0, updates are serialised by the
> mutex, and the updates don't produce invalid intermediate state.

so what is the mutex for?  Is this being over-designed?

On a shared address space, ld.so will be entered before we thread,
straight out of execve.  I don't think we have constructors starting
threads, that would be very bad right?

So there are two cases -- static binary and dynamic.  On one address space,
kbind is only called once, and there is only one attempt, and any other
attempt should kill you, because you are doing it manual & wrong.  <-- is
this correct, guenther?




Re: start unlocking kbind(2)

2022-05-18 Thread Alexander Bluhm
On Tue, May 17, 2022 at 10:44:54AM +1000, David Gwynne wrote:
> + cookie = SCARG(uap, proc_cookie);
> + if (pr->ps_kbind_addr == pc) {
> + membar_datadep_consumer();
> + if (pr->ps_kbind_cookie != cookie)
> + goto sigill;
> + } else {

You must use membar_consumer() here.  membar_datadep_consumer() is
a barrier between reading pointer and pointed data.  Only alpha
requires membar_datadep_consumer() for that, everywhere else it is
a NOP.

> + mtx_enter(&pr->ps_mtx);
> + kpc = pr->ps_kbind_addr;

Do we need kpc variable?  I would prefer to read explicit
pr->ps_kbind_addr in the two places where we use it.

I think the logic of barriers and mutexes is correct.

with the suggestions above OK bluhm@
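
For illustration, the pairing relied on looks like this (a sketch; in
the diff the store side runs under ps_mtx):

        /* store side: publish the cookie before the address */
        pr->ps_kbind_cookie = cookie;
        membar_producer();
        pr->ps_kbind_addr = pc;

        /* lockless read side: address first, then the cookie */
        if (pr->ps_kbind_addr == pc) {
                membar_consumer();
                if (pr->ps_kbind_cookie != cookie)
                        goto sigill;
        }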



Re: start unlocking kbind(2)

2022-05-18 Thread David Gwynne
On Tue, May 17, 2022 at 10:14:18AM -0600, Theo de Raadt wrote:
> Martin Pieuchot  wrote:
> 
> > On 17/05/22(Tue) 10:44, David Gwynne wrote:
> > > this narrows the scope of the KERNEL_LOCK in kbind(2) so the syscall
> > > argument checks can be done without the kernel lock.
> > > 
> > > care is taken to allow the pc/cookie checks to run without any lock by
> > > being careful with the order of the checks. all modifications to the
> > > pc/cookie state are serialised by the per process mutex.
> > 
> > I don't understand why it is safe to do the following check without
> > holding a mutex:

because you can't read any invalid intermediate value. in the fast
path here, either it is the expected value when you're using the
interface correctly, or it isn't. if it isn't then you take the "slow"
path by taking the mutex and doing the comprehensive value checks
and updates.

> > 
> > if (pr->ps_kbind_addr == pc)
> > ...
> > 
> > Is there much differences when always grabbing the per-process mutex?

sure, that sequence is maybe half a dozen instructions without the
mutex, and many times that with the mutex.

from a different perspective, i can't see a big difference between
kernel compile times with or without this diff, so right now the mutex
isn't going to make a difference either.

from yet another perspective, broad use of a common lock can end up
hurting in the long run because you may end up where everything is
serialised and you have to go back and do a ton of work again anyway.

> I think the theory is ps_kbind_addr is fixed to a shared address space
> in "pr", if you are threaded there is only one ps_kbind_addr for all the
> processes "p" sharing that address space.

yes.

> And execve() uses single_thread_set, which means you can't race?

also yes. the value starts at 0, updates are serialised by the
mutex, and the updates don't produce invalid intermediate state.



Re: start unlocking kbind(2)

2022-05-17 Thread Theo de Raadt
Martin Pieuchot  wrote:

> On 17/05/22(Tue) 10:44, David Gwynne wrote:
> > this narrows the scope of the KERNEL_LOCK in kbind(2) so the syscall
> > argument checks can be done without the kernel lock.
> > 
> > care is taken to allow the pc/cookie checks to run without any lock by
> > being careful with the order of the checks. all modifications to the
> > pc/cookie state are serialised by the per process mutex.
> 
> I don't understand why it is safe to do the following check without
> holding a mutex:
> 
>   if (pr->ps_kbind_addr == pc)
>   ...
> 
> Is there much differences when always grabbing the per-process mutex?

I think the theory is ps_kbind_addr is fixed to a shared address space
in "pr", if you are threaded there is only one ps_kbind_addr for all the
processes "p" sharing that address space.

And execve() uses single_thread_set, which means you can't race?




Re: start unlocking kbind(2)

2022-05-17 Thread Martin Pieuchot
On 17/05/22(Tue) 10:44, David Gwynne wrote:
> this narrows the scope of the KERNEL_LOCK in kbind(2) so the syscall
> argument checks can be done without the kernel lock.
> 
> care is taken to allow the pc/cookie checks to run without any lock by
> being careful with the order of the checks. all modifications to the
> pc/cookie state are serialised by the per process mutex.

I don't understand why it is safe to do the following check without
holding a mutex:

if (pr->ps_kbind_addr == pc)
...

Is there much differences when always grabbing the per-process mutex?

> i don't know enough about uvm to say whether it is safe to unlock the
> actual memory updates too, but even if i was confident i would still
> prefer to change it as a separate step.

I agree.

> Index: kern/init_sysent.c
> ===
> RCS file: /cvs/src/sys/kern/init_sysent.c,v
> retrieving revision 1.236
> diff -u -p -r1.236 init_sysent.c
> --- kern/init_sysent.c  1 May 2022 23:00:04 -   1.236
> +++ kern/init_sysent.c  17 May 2022 00:36:03 -
> @@ -1,4 +1,4 @@
> -/*   $OpenBSD: init_sysent.c,v 1.236 2022/05/01 23:00:04 tedu Exp $  */
> +/*   $OpenBSD$   */
>  
>  /*
>   * System call switch table.
> @@ -204,7 +204,7 @@ const struct sysent sysent[] = {
>   sys_utimensat },/* 84 = utimensat */
>   { 2, s(struct sys_futimens_args), 0,
>   sys_futimens }, /* 85 = futimens */
> - { 3, s(struct sys_kbind_args), 0,
> + { 3, s(struct sys_kbind_args), SY_NOLOCK | 0,
>   sys_kbind },/* 86 = kbind */
>   { 2, s(struct sys_clock_gettime_args), SY_NOLOCK | 0,
>   sys_clock_gettime },/* 87 = clock_gettime */
> Index: kern/syscalls.master
> ===
> RCS file: /cvs/src/sys/kern/syscalls.master,v
> retrieving revision 1.223
> diff -u -p -r1.223 syscalls.master
> --- kern/syscalls.master  24 Feb 2022 07:41:51 -  1.223
> +++ kern/syscalls.master  17 May 2022 00:36:03 -
> @@ -194,7 +194,7 @@
>   const struct timespec *times, int flag); }
>  85   STD { int sys_futimens(int fd, \
>   const struct timespec *times); }
> -86   STD { int sys_kbind(const struct __kbind *param, \
> +86   STD NOLOCK  { int sys_kbind(const struct __kbind *param, \
>   size_t psize, int64_t proc_cookie); }
>  87   STD NOLOCK  { int sys_clock_gettime(clockid_t clock_id, \
>   struct timespec *tp); }
> Index: uvm/uvm_mmap.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_mmap.c,v
> retrieving revision 1.169
> diff -u -p -r1.169 uvm_mmap.c
> --- uvm/uvm_mmap.c  19 Jan 2022 10:43:48 -  1.169
> +++ uvm/uvm_mmap.c  17 May 2022 00:36:03 -
> @@ -70,6 +70,7 @@
>  #include <sys/pledge.h>
>  #include <sys/unistd.h>   /* for KBIND* */
>  #include <sys/user.h>
> +#include <sys/atomic.h>
>  
>  #include <machine/exec.h> /* for __LDPGSZ */
>  
> @@ -1125,33 +1126,64 @@ sys_kbind(struct proc *p, void *v, regis
>   const char *data;
>   vaddr_t baseva, last_baseva, endva, pageoffset, kva;
>   size_t psize, s;
> - u_long pc;
> + u_long pc, kpc;
>   int count, i, extra;
> + uint64_t cookie;
>   int error;
>  
>   /*
>* extract syscall args from uap
>*/
>   paramp = SCARG(uap, param);
> - psize = SCARG(uap, psize);
>  
>   /* a NULL paramp disables the syscall for the process */
>   if (paramp == NULL) {
> + mtx_enter(&pr->ps_mtx);
>   if (pr->ps_kbind_addr != 0)
> - sigexit(p, SIGILL);
> + goto leave_sigill;
>   pr->ps_kbind_addr = BOGO_PC;
> + mtx_leave(&pr->ps_mtx);
>   return 0;
>   }
>  
>   /* security checks */
> +
> + /*
> +  * ps_kbind_addr can only be set to 0 or BOGO_PC by the
> +  * kernel, not by a call from userland.
> +  */
>   pc = PROC_PC(p);
> - if (pr->ps_kbind_addr == 0) {
> - pr->ps_kbind_addr = pc;
> - pr->ps_kbind_cookie = SCARG(uap, proc_cookie);
> - } else if (pc != pr->ps_kbind_addr || pc == BOGO_PC)
> - sigexit(p, SIGILL);
> - else if (pr->ps_kbind_cookie != SCARG(uap, proc_cookie))
> - sigexit(p, SIGILL);
> + if (pc == 0 || pc == BOGO_PC)
> + goto sigill;
> +
> + cookie = SCARG(uap, proc_cookie);
> + if (pr->ps_kbind_addr == pc) {
> + membar_datadep_consumer();
> + if (pr->ps_kbind_cookie != cookie)
> + goto sigill;
> + } else {
> + mtx_enter(&pr->ps_mtx);
> + kpc = pr->ps_kbind_addr;
> +
> + /*
> +  * If we're the first thread in (kpc is 0), then
> +  * 

start unlocking kbind(2)

2022-05-16 Thread David Gwynne
this narrows the scope of the KERNEL_LOCK in kbind(2) so the syscall
argument checks can be done without the kernel lock.

care is taken to allow the pc/cookie checks to run without any lock by
being careful with the order of the checks. all modifications to the
pc/cookie state are serialised by the per process mutex.

i don't know enough about uvm to say whether it is safe to unlock the
actual memory updates too, but even if i was confident i would still
prefer to change it as a separate step.

ok?

Index: kern/init_sysent.c
===
RCS file: /cvs/src/sys/kern/init_sysent.c,v
retrieving revision 1.236
diff -u -p -r1.236 init_sysent.c
--- kern/init_sysent.c  1 May 2022 23:00:04 -   1.236
+++ kern/init_sysent.c  17 May 2022 00:36:03 -
@@ -1,4 +1,4 @@
-/* $OpenBSD: init_sysent.c,v 1.236 2022/05/01 23:00:04 tedu Exp $  */
+/* $OpenBSD$   */
 
 /*
  * System call switch table.
@@ -204,7 +204,7 @@ const struct sysent sysent[] = {
sys_utimensat },/* 84 = utimensat */
{ 2, s(struct sys_futimens_args), 0,
sys_futimens }, /* 85 = futimens */
-   { 3, s(struct sys_kbind_args), 0,
+   { 3, s(struct sys_kbind_args), SY_NOLOCK | 0,
sys_kbind },/* 86 = kbind */
{ 2, s(struct sys_clock_gettime_args), SY_NOLOCK | 0,
sys_clock_gettime },/* 87 = clock_gettime */
Index: kern/syscalls.master
===
RCS file: /cvs/src/sys/kern/syscalls.master,v
retrieving revision 1.223
diff -u -p -r1.223 syscalls.master
--- kern/syscalls.master  24 Feb 2022 07:41:51 -  1.223
+++ kern/syscalls.master  17 May 2022 00:36:03 -
@@ -194,7 +194,7 @@
const struct timespec *times, int flag); }
 85 STD { int sys_futimens(int fd, \
const struct timespec *times); }
-86 STD { int sys_kbind(const struct __kbind *param, \
+86 STD NOLOCK  { int sys_kbind(const struct __kbind *param, \
size_t psize, int64_t proc_cookie); }
 87 STD NOLOCK  { int sys_clock_gettime(clockid_t clock_id, \
struct timespec *tp); }
Index: uvm/uvm_mmap.c
===
RCS file: /cvs/src/sys/uvm/uvm_mmap.c,v
retrieving revision 1.169
diff -u -p -r1.169 uvm_mmap.c
--- uvm/uvm_mmap.c  19 Jan 2022 10:43:48 -  1.169
+++ uvm/uvm_mmap.c  17 May 2022 00:36:03 -
@@ -70,6 +70,7 @@
 #include <sys/pledge.h>
 #include <sys/unistd.h> /* for KBIND* */
 #include <sys/user.h>
+#include <sys/atomic.h>
 
 #include <machine/exec.h>   /* for __LDPGSZ */
 
@@ -1125,33 +1126,64 @@ sys_kbind(struct proc *p, void *v, regis
const char *data;
vaddr_t baseva, last_baseva, endva, pageoffset, kva;
size_t psize, s;
-   u_long pc;
+   u_long pc, kpc;
int count, i, extra;
+   uint64_t cookie;
int error;
 
/*
 * extract syscall args from uap
 */
paramp = SCARG(uap, param);
-   psize = SCARG(uap, psize);
 
/* a NULL paramp disables the syscall for the process */
if (paramp == NULL) {
+   mtx_enter(&pr->ps_mtx);
if (pr->ps_kbind_addr != 0)
-   sigexit(p, SIGILL);
+   goto leave_sigill;
pr->ps_kbind_addr = BOGO_PC;
+   mtx_leave(&pr->ps_mtx);
return 0;
}
 
/* security checks */
+
+   /*
+* ps_kbind_addr can only be set to 0 or BOGO_PC by the
+* kernel, not by a call from userland.
+*/
pc = PROC_PC(p);
-   if (pr->ps_kbind_addr == 0) {
-   pr->ps_kbind_addr = pc;
-   pr->ps_kbind_cookie = SCARG(uap, proc_cookie);
-   } else if (pc != pr->ps_kbind_addr || pc == BOGO_PC)
-   sigexit(p, SIGILL);
-   else if (pr->ps_kbind_cookie != SCARG(uap, proc_cookie))
-   sigexit(p, SIGILL);
+   if (pc == 0 || pc == BOGO_PC)
+   goto sigill;
+
+   cookie = SCARG(uap, proc_cookie);
+   if (pr->ps_kbind_addr == pc) {
+   membar_datadep_consumer();
+   if (pr->ps_kbind_cookie != cookie)
+   goto sigill;
+   } else {
+   mtx_enter(&pr->ps_mtx);
+   kpc = pr->ps_kbind_addr;
+
+   /*
+* If we're the first thread in (kpc is 0), then
+* remember where the call came from for the future.
+*
+* Maybe another thread raced us and won. pc cannot
+* be BOGO_PC, so this also handles the check for
+* the disabled case.
+*/
+
+   if (kpc == 0) {
+   pr->ps_kbind_cookie = cookie;
+