Re: start unlocking kbind(2)
On Wed, Jun 15, 2022 at 10:40:35PM -0500, Scott Cheloha wrote: > On Wed, Jun 15, 2022 at 06:17:07PM -0600, Theo de Raadt wrote: > > Mark Kettenis wrote: > > > > > Well, I believe that Scott was trying to fix a race condition that can > > > only happen if code is using kbind(2) incorrectly, i.e. when the > > > threads deliberately pass different cookies to kbind(2) or execute > > > kbind(2) from different "text" addresses. > > > > > > I still think the solution is simply to accept that race condition. > > > > Right. > > > > People are not calling kbind. They are calling syscall(SYS_kbind > > > > The man page says "don't do that". No user serviceable parts inside. > > Do not provide to children. > > > > That said, Scott is about to share a diff he and I did a few cycles > > around, to at least make the call-in transaction be a lock. > > [...] > > This patch reorganizes the logic in sys_kbind() preceding the > copyin(9) so that we only need to take the kernel lock in a single > spot if something goes wrong and we need to raise SIGILL. > > It also puts the per-process mutex, ps_mtx, around that logic. This > guarantees that the first thread to reach sys_kbind() sets > ps_kbind_addr and ps_kbind_cookie, even in oddball situations where > the program isn't using ld.so(1) correctly. > > [...] 10 days, no replies public or private. I trust that means the patch is fine. Here is a tweaked patch with the binding loop wrapped with the kernel lock. We can more carefully determine whether uvm_unmap_remove(), uvm_map_extract(), and uvm_unmap_detach() are MP-safe in a subsequent patch. They *look* safe but I can't be sure and nobody volunteered any thoughts or review for the prior patch so we'll do the conservative thing and push the kernel lock down, just as dlg@ was going to do. If something comes of guenther@'s PT_OPENBSD_KBIND idea we can trivially remove ps_mtx from this code. OK? 
Index: kern/syscalls.master === RCS file: /cvs/src/sys/kern/syscalls.master,v retrieving revision 1.224 diff -u -p -r1.224 syscalls.master --- kern/syscalls.master 16 May 2022 07:36:04 - 1.224 +++ kern/syscalls.master 24 Jun 2022 14:43:26 - @@ -194,7 +194,7 @@ const struct timespec *times, int flag); } 85 STD { int sys_futimens(int fd, \ const struct timespec *times); } -86 STD { int sys_kbind(const struct __kbind *param, \ +86 STD NOLOCK { int sys_kbind(const struct __kbind *param, \ size_t psize, int64_t proc_cookie); } 87 STD NOLOCK { int sys_clock_gettime(clockid_t clock_id, \ struct timespec *tp); } Index: kern/init_sysent.c === RCS file: /cvs/src/sys/kern/init_sysent.c,v retrieving revision 1.237 diff -u -p -r1.237 init_sysent.c --- kern/init_sysent.c 16 May 2022 07:38:10 - 1.237 +++ kern/init_sysent.c 24 Jun 2022 14:43:26 - @@ -1,4 +1,4 @@ -/* $OpenBSD: init_sysent.c,v 1.237 2022/05/16 07:38:10 mvs Exp $ */ +/* $OpenBSD$ */ /* * System call switch table. @@ -204,7 +204,7 @@ const struct sysent sysent[] = { sys_utimensat },/* 84 = utimensat */ { 2, s(struct sys_futimens_args), 0, sys_futimens }, /* 85 = futimens */ - { 3, s(struct sys_kbind_args), 0, + { 3, s(struct sys_kbind_args), SY_NOLOCK | 0, sys_kbind },/* 86 = kbind */ { 2, s(struct sys_clock_gettime_args), SY_NOLOCK | 0, sys_clock_gettime },/* 87 = clock_gettime */ Index: sys/proc.h === RCS file: /cvs/src/sys/sys/proc.h,v retrieving revision 1.330 diff -u -p -r1.330 proc.h --- sys/proc.h 13 May 2022 15:32:00 - 1.330 +++ sys/proc.h 24 Jun 2022 14:43:27 - @@ -234,8 +234,8 @@ struct process { uint64_t ps_pledge; uint64_t ps_execpledge; - int64_t ps_kbind_cookie; - u_long ps_kbind_addr; + int64_t ps_kbind_cookie;/* [m] */ + u_long ps_kbind_addr; /* [m] */ /* End area that is copied on creation. 
*/ #define ps_endcopy ps_refcnt Index: sys/syscall.h === RCS file: /cvs/src/sys/sys/syscall.h,v retrieving revision 1.234 diff -u -p -r1.234 syscall.h --- sys/syscall.h 16 May 2022 07:38:10 - 1.234 +++ sys/syscall.h 24 Jun 2022 14:43:27 - @@ -1,4 +1,4 @@ -/* $OpenBSD: syscall.h,v 1.234 2022/05/16 07:38:10 mvs Exp $ */ +/* $OpenBSD$ */ /* * System call numbers. Index: sys/syscallargs.h === RCS file: /cvs/src/sys/sys/syscallargs.h,v retrieving
Re: start unlocking kbind(2)
On Wed, Jun 15, 2022 at 06:17:07PM -0600, Theo de Raadt wrote: > Mark Kettenis wrote: > > > Well, I believe that Scott was trying to fix a race condition that can > > only happen if code is using kbind(2) incorrectly, i.e. when the > > threads deliberately pass different cookies to kbind(2) or execute > > kbind(2) from different "text" addresses. > > > > I still think the solution is simply to accept that race condition. > > Right. > > People are not calling kbind. They are calling syscall(SYS_kbind > > The man page says "don't do that". No user serviceable parts inside. > Do not provide to children. > > That said, Scott is about to share a diff he and I did a few cycles > around, to at least make the call-in transaction be a lock. Okay, here it is. This patch reorganizes the logic in sys_kbind() preceding the copyin(9) so that we only need to take the kernel lock in a single spot if something goes wrong and we need to raise SIGILL. It also puts the per-process mutex, ps_mtx, around that logic. This guarantees that the first thread to reach sys_kbind() sets ps_kbind_addr and ps_kbind_cookie, even in oddball situations where the program isn't using ld.so(1) correctly. I am aware that this is "unsupported", but on balance I would prefer that the initialization of those two variables was atomic. I think taking the mutex is a very small price to pay for the guarantee that the "security check" logic always happens the way it was intended to happen in the order it was intended to happen. Note that we are not wrapping the binding loop up in a lock, just the ps_kbind_* variable logic. ... if Mark and Philip want to push back on that, I think I would yield and just drop the mutex. ... also, if Philip goes ahead with the PT_OPENBSD_KBIND thing we no longer need the mutex because, as I understand it, we would no longer need the ps_kbind_cookie. Even in either of those cases I still think the logic refactoring shown here is a good thing. 
-- I have been running with this now for a day and haven't hit any panics. The last time I tried unlocking kbind(2) I hit panics pretty quickly due to KERNEL_ASSERT_LOCKED() calls down in the binding loop. Because I'm not seeing that I assume this means something has changed down in UVM and we no longer need to wrap the binding loop with the kernel lock. Thoughts? Index: uvm/uvm_mmap.c === RCS file: /cvs/src/sys/uvm/uvm_mmap.c,v retrieving revision 1.169 diff -u -p -r1.169 uvm_mmap.c --- uvm/uvm_mmap.c 19 Jan 2022 10:43:48 - 1.169 +++ uvm/uvm_mmap.c 16 Jun 2022 03:36:53 - @@ -1127,7 +1127,7 @@ sys_kbind(struct proc *p, void *v, regis size_t psize, s; u_long pc; int count, i, extra; - int error; + int error, sigill = 0; /* * extract syscall args from uap @@ -1135,23 +1135,41 @@ sys_kbind(struct proc *p, void *v, regis paramp = SCARG(uap, param); psize = SCARG(uap, psize); - /* a NULL paramp disables the syscall for the process */ - if (paramp == NULL) { - if (pr->ps_kbind_addr != 0) - sigexit(p, SIGILL); - pr->ps_kbind_addr = BOGO_PC; - return 0; - } - - /* security checks */ + /* +* If paramp is NULL and we're uninitialized, disable the syscall +* for the process. Raise SIGILL if we're already initialized. +* +* If paramp is non-NULL and we're uninitialized, do initialization. +* Otherwise, do security checks and raise SIGILL on failure. +*/ pc = PROC_PC(p); - if (pr->ps_kbind_addr == 0) { + mtx_enter(&pr->ps_mtx); + if (paramp == NULL) { + if (pr->ps_kbind_addr == 0) + pr->ps_kbind_addr = BOGO_PC; + else + sigill = 1; + } else if (pr->ps_kbind_addr == 0) { pr->ps_kbind_addr = pc; pr->ps_kbind_cookie = SCARG(uap, proc_cookie); - } else if (pc != pr->ps_kbind_addr || pc == BOGO_PC) - sigexit(p, SIGILL); - else if (pr->ps_kbind_cookie != SCARG(uap, proc_cookie)) + } else if (pc != pr->ps_kbind_addr || pc == BOGO_PC || + pr->ps_kbind_cookie != SCARG(uap, proc_cookie)) { + sigill = 1; + } + mtx_leave(&pr->ps_mtx); + + /* Raise SIGILL if something is off. 
*/ + if (sigill) { + KERNEL_LOCK(); + sigexit(p, SIGILL); + /* NOTREACHED */ + KERNEL_UNLOCK(); + } + + /* We're done if we were disabling the syscall. */ + if (paramp == NULL) + return 0; + if (psize < sizeof(struct __kbind) || psize > sizeof(param)) return EINVAL; if ((error = copyin(paramp, &param, psize))) Index: kern/syscalls.master ===
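As an aside for readers following the diff: its central shape, record the violation while the mutex is held and act on it only after mtx_leave(), can be sketched in userland C. All names here are hypothetical; a C11 spinning flag stands in for the kernel's per-process mutex, and an error return stands in for sigexit() under the kernel lock.

```c
#include <stdatomic.h>

/* Sketch only: these names mimic, but are not, the kernel's. */
static atomic_flag ps_mtx = ATOMIC_FLAG_INIT;
static unsigned long ps_kbind_addr;	/* 0 means uninitialized */

int
kbind_set_addr(unsigned long pc)
{
	int sigill = 0;

	while (atomic_flag_test_and_set(&ps_mtx))
		;				/* mtx_enter() */
	if (ps_kbind_addr == 0)
		ps_kbind_addr = pc;		/* first caller wins */
	else if (ps_kbind_addr != pc)
		sigill = 1;			/* record the violation only */
	atomic_flag_clear(&ps_mtx);		/* mtx_leave() */

	/* Act only after the lock is dropped. */
	return sigill ? -1 : 0;
}
```

The point of the split is that the consequence (sigexit() in the real code) may block or need other locks, so it must not run while a spinning mutex is held.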
Re: start unlocking kbind(2)
Mark Kettenis wrote: > Well, I believe that Scott was trying to fix a race condition that can > only happen if code is using kbind(2) incorrectly, i.e. when the > threads deliberately pass different cookies to kbind(2) or execute > kbind(2) from different "text" addresses. > > I still think the solution is simply to accept that race condition. Right. People are not calling kbind. They are calling syscall(SYS_kbind The man page says "don't do that". No user serviceable parts inside. Do not provide to children. That said, Scott is about to share a diff he and I did a few cycles around, to at least make the call-in transaction be a lock.
Re: start unlocking kbind(2)
> From: Philip Guenther > Date: Wed, 15 Jun 2022 10:07:06 -0900 > > On Mon, 13 Jun 2022, Theo de Raadt wrote: > > Scott Cheloha wrote: > > > > Am I wrong that kbind is never called twice in the same address space? > > > > > > Isn't this exactly what happened the last time we tried this? > > > > Tried what? kbind has never been NOLOCK. > > Scott's referring to rev 1.237 of kern_fork.c, where we tried to require the > first use of kbind be before __tfork(2) was called. This blew up because > there's at least two ways to legitimately call pthread_create(3) before > doing your first lazy binding: > > 1) build with -znow, then dlopen() something not linked with -znow after > calling pthread_create() > > 2) pass address of pthread_create() to another function which calls > through the pointer, no special link options required (just need to > split up the _create and *funcptr enough that the compiler > won't optimize to a direct call) > > I've thought about how to possibly force a lazy resolution and haven't > thought up anything that wasn't unmaintainable, and probably > unimplementable. Well, I believe that Scott was trying to fix a race condition that can only happen if code is using kbind(2) incorrectly, i.e. when the threads deliberately pass different cookies to kbind(2) or execute kbind(2) from different "text" addresses. I still think the solution is simply to accept that race condition.
Re: start unlocking kbind(2)
On Mon, 13 Jun 2022, Theo de Raadt wrote: > Scott Cheloha wrote: > > > Am I wrong that kbind is never called twice in the same address space? > > > > Isn't this exactly what happened the last time we tried this? > > Tried what? kbind has never been NOLOCK. Scott's referring to rev 1.237 of kern_fork.c, where we tried to require the first use of kbind be before __tfork(2) was called. This blew up because there's at least two ways to legitimately call pthread_create(3) before doing your first lazy binding: 1) build with -znow, then dlopen() something not linked with -znow after calling pthread_create() 2) pass address of pthread_create() to another function which calls through the pointer, no special link options required (just need to split up the _create and *funcptr enough that the compiler won't optimize to a direct call) I've thought about how to possibly force a lazy resolution and haven't thought up anything that wasn't unmaintainable, and probably unimplementable. So, what could work reliably? My first thought would be to have ld.so's _dl_boot() do the equivalent of struct __kbind kb = { .kb_addr = _syscall_in_dl_bind, .kb_size = 0, }; kbind(&kb, sizeof kb, pcookie); ...after teaching the kernel to accept such a first call to kbind and set pr->ps_kbind_* from them. That would permit the reimposing of the "first kbind can't be after __tfork" restriction; is it indeed enough to permit testing the ps_kbind_* members without locks? Another idea is to discard the cookie restriction and just trust the calling address to be enough. I mean, we trust it to be sufficient for sigreturn and that's perhaps more powerful, no? If we're fine with that, then how about giving ld.so a PT_OPENBSD_KBIND segment with the required address, so the kernel can just set it up at exec time? Eventually, we could disable kbind if said header wasn't present on the ELF interpreter. 
(In a grotesque abuse of the ELF program header, we *could* keep the cookie by, for example, having the PT_OPENBSD_KBIND p_vaddr/p_memsz identify the location of the random cookie in ld.so memory and pass the kbind syscall address via p_paddr. You are permitted to barf.) Philip
Re: start unlocking kbind(2)
Scott Cheloha wrote: > > Am I wrong that kbind is never called twice in the same address space? > > Isn't this exactly what happened the last time we tried this? Tried what? kbind has never been NOLOCK. > > Can someone describe a synthetic sequence where racing sys_kbind calls > > perform the wrong action? > > This is highly synthetic, but: > > If Thread1 Stop right there. Is this in a static binary or a dynamic binary? In a dynamic binary, ld.so has used kbind(), and kbind() cannot be used again. In a static binary, early_static_init() calls kbind() and locks you out. When were these threads created? > sets ps_kbind_addr before Thread2 but Thread2 compares > ps_kbind_cookie with SCARG(uap, proc_cookie) before Thread1 can change > ps_kbind_cookie from 0 to its own SCARG(uap, proc_cookie) it is possible > that Thread2 will pass the security check and proceed, even though it > should have caused a SIGILL. You cannot call kbind() synthetically, from the manual page: kbind is currently intended for use by ld.so(1) only. That is why there is no libc stub for it. That's why dlg's demo program replaces the libc one with a raw assembly version, so that it can call it. > Thread2 knows ps_kbind_cookie is initially zero, so there is a > theoretical race. Again, how did you end up with threads, and how did you call kbind without using assembly language, and what program would ever do this? > If you wanted to make this statistically impossible you could > initialize ps_kbind_cookie to a random value during fork(2) so that > Thread2 has a 1 in 2^64 chance of guessing the initial > ps_kbind_cookie value and bypassing the security check as > described above. Where did these threads come from before ld.so completed linkage resolution??
Re: start unlocking kbind(2)
On Sun, Jun 12, 2022 at 12:12:33AM -0600, Theo de Raadt wrote: > David Gwynne wrote: > > > On Wed, May 18, 2022 at 07:42:32PM -0600, Theo de Raadt wrote: > > > Mark Kettenis wrote: > > > > > > > > Isn't the vm_map_lock enough? > > > > > > > > Could be. The fast path is going to take that lock anyway. This > > > > would require a bit of surgery to uvm_map_extract() to make sure we > > > > don't take the vm_map_lock twice. Worth exploring I'd say. > > > > > > I think the vm_map_lock can be dropped before it reaches that code, > > > because of 3 cases: (1) new kbind lock, (2) a repeated kbind lock and > > > return, or (3) violation and process termination. > > > > > > So before doing the copyin() and updates, simply vm_map_unlock() > > > > > > Will that work and isn't it simpler than David's proposal? > > > > I'm not super familiar with uvm so it's hard for me to say it would or > > wouldn't be simpler, but my initial impressions are that it would be a > > lot more surgery and I wouldn't be confident I did such changes right. > > i don't understand. As a rule, using fewer locks and mutexes is simpler > to understand. > > i do not understand the purpose for your diff, and think you have simply > brushed my questions aside. > > > Releasing a lock and then taking it again quickly can have undesirable > > consequences too. If something else is waiting on the rwlock, > > releasing it will wake them up, but they will probably lose because > > we've already taken it again. > > Then don't release the vm_map_lock and regrab it, but keep it active. > > I think there is a lot of fuss going on here for a system call which, as > far as I can tell, is never called in a threaded program. Static libc or > ld.so always call kbind early on, before threads, precisely ONCE, and any > later call to it kills the program but why do we need a mutex for the > simple action of observing that ps_kbind_addr is not 0? > > Am I wrong that kbind is never called twice in the same address space? 
Isn't this exactly what happened the last time we tried this? > Can someone describe a synthetic sequence where racing sys_kbind calls > perform the wrong action? This is highly synthetic, but: If Thread1 sets ps_kbind_addr before Thread2 but Thread2 compares ps_kbind_cookie with SCARG(uap, proc_cookie) before Thread1 can change ps_kbind_cookie from 0 to its own SCARG(uap, proc_cookie) it is possible that Thread2 will pass the security check and proceed, even though it should have caused a SIGILL. Thread2 knows ps_kbind_cookie is initially zero, so there is a theoretical race. If you wanted to make this statistically impossible you could initialize ps_kbind_cookie to a random value during fork(2) so that Thread2 has a 1 in 2^64 chance of guessing the initial ps_kbind_cookie value and bypassing the security check as described above. If you do that, then we can do the security check locklessly with atomic_cas_ulong(9).
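Scott's closing suggestion, a lockless check built on atomic_cas_ulong(9) plus a randomized fork-time cookie, can be sketched in userland C11. All names are hypothetical and this is not the kernel code; the fixed constant below stands in for the random value the proposal would draw at fork(2).

```c
#include <stdatomic.h>
#include <stdint.h>

/* Stands in for a value randomized at fork(2) in the proposal. */
#define INITIAL_COOKIE	0x243f6a8885a308d3ULL

static _Atomic uintptr_t ps_kbind_addr;		/* 0 means uninitialized */
static uint64_t ps_kbind_cookie = INITIAL_COOKIE;

int
kbind_check(uintptr_t pc, uint64_t cookie)
{
	uintptr_t zero = 0;

	/* First caller wins the CAS and records its cookie. */
	if (atomic_compare_exchange_strong(&ps_kbind_addr, &zero, pc)) {
		ps_kbind_cookie = cookie;
		return 0;
	}
	/* CAS failed: zero now holds the published address. */
	if (zero != pc || ps_kbind_cookie != cookie)
		return -1;			/* would raise SIGILL */
	return 0;
}
```

A racing thread that loses the CAS and reads the cookie before the winner stores it sees the unguessable initial value and fails the check anyway, which is exactly the 1-in-2^64 "statistically impossible" bypass described above.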
Re: start unlocking kbind(2)
David Gwynne wrote: > On Wed, May 18, 2022 at 07:42:32PM -0600, Theo de Raadt wrote: > > Mark Kettenis wrote: > > > > > > Isn't the vm_map_lock enough? > > > > > > Could be. The fast path is going to take that lock anyway. This > > > would require a bit of surgery to uvm_map_extract() to make sure we > > > don't take the vm_map_lock twice. Worth exploring I'd say. > > > > I think the vm_map_lock can be dropped before it reaches that code, > > because of 3 cases: (1) new kbind lock, (2) a repeated kbind lock and > > return, or (3) violation and process termination. > > > > So before doing the copyin() and updates, simply vm_map_unlock() > > > > Will that work and isn't it simpler than David's proposal? > > I'm not super familiar with uvm so it's hard for me to say it would or > wouldn't be simpler, but my initial impressions are that it would be a > lot more surgery and I wouldn't be confident I did such changes right. i don't understand. As a rule, using fewer locks and mutexes is simpler to understand. i do not understand the purpose for your diff, and think you have simply brushed my questions aside. > Releasing a lock and then taking it again quickly can have undesirable > consequences too. If something else is waiting on the rwlock, > releasing it will wake them up, but they will probably lose because > we've already taken it again. Then don't release the vm_map_lock and regrab it, but keep it active. I think there is a lot of fuss going on here for a system call which, as far as I can tell, is never called in a threaded program. Static libc or ld.so always call kbind early on, before threads, precisely ONCE, and any later call to it kills the program but why do we need a mutex for the simple action of observing that ps_kbind_addr is not 0? Am I wrong that kbind is never called twice in the same address space? Can someone describe a synthetic sequence where racing sys_kbind calls perform the wrong action?
Re: start unlocking kbind(2)
On Tue, May 31, 2022 at 04:48:16PM +0200, Martin Pieuchot wrote: > On 18/05/22(Wed) 15:53, Alexander Bluhm wrote: > > On Tue, May 17, 2022 at 10:44:54AM +1000, David Gwynne wrote: > > > + cookie = SCARG(uap, proc_cookie); > > > + if (pr->ps_kbind_addr == pc) { > > > + membar_datadep_consumer(); > > > + if (pr->ps_kbind_cookie != cookie) > > > + goto sigill; > > > + } else { > > > > You must use membar_consumer() here. membar_datadep_consumer() is > > a barrier between reading pointer and pointed data. Only alpha > > requires membar_datadep_consumer() for that, everywhere else it is > > a NOP. > > > > > + mtx_enter(&pr->ps_mtx); > > > + kpc = pr->ps_kbind_addr; > > > > Do we need kpc variable? I would prefer to read explicit > > pr->ps_kbind_addr in the two places where we use it. > > > > I think the logic of barriers and mutexes is correct. > > > > with the suggestions above OK bluhm@ > > I believe you should go ahead with the current diff. ok with me. Moving > the field under the scope of another lock can be easily done afterward. Agreed. I'll commit this version with tweaks to make bluhm@ happy tomorrow. Index: kern/init_sysent.c === RCS file: /cvs/src/sys/kern/init_sysent.c,v retrieving revision 1.236 diff -u -p -r1.236 init_sysent.c --- kern/init_sysent.c 1 May 2022 23:00:04 - 1.236 +++ kern/init_sysent.c 12 Jun 2022 04:48:06 - @@ -1,4 +1,4 @@ -/* $OpenBSD: init_sysent.c,v 1.236 2022/05/01 23:00:04 tedu Exp $ */ +/* $OpenBSD$ */ /* * System call switch table. 
@@ -204,7 +204,7 @@ const struct sysent sysent[] = { sys_utimensat },/* 84 = utimensat */ { 2, s(struct sys_futimens_args), 0, sys_futimens }, /* 85 = futimens */ - { 3, s(struct sys_kbind_args), 0, + { 3, s(struct sys_kbind_args), SY_NOLOCK | 0, sys_kbind },/* 86 = kbind */ { 2, s(struct sys_clock_gettime_args), SY_NOLOCK | 0, sys_clock_gettime },/* 87 = clock_gettime */ Index: kern/syscalls.master === RCS file: /cvs/src/sys/kern/syscalls.master,v retrieving revision 1.223 diff -u -p -r1.223 syscalls.master --- kern/syscalls.master 24 Feb 2022 07:41:51 - 1.223 +++ kern/syscalls.master 12 Jun 2022 04:48:06 - @@ -194,7 +194,7 @@ const struct timespec *times, int flag); } 85 STD { int sys_futimens(int fd, \ const struct timespec *times); } -86 STD { int sys_kbind(const struct __kbind *param, \ +86 STD NOLOCK { int sys_kbind(const struct __kbind *param, \ size_t psize, int64_t proc_cookie); } 87 STD NOLOCK { int sys_clock_gettime(clockid_t clock_id, \ struct timespec *tp); } Index: uvm/uvm_mmap.c === RCS file: /cvs/src/sys/uvm/uvm_mmap.c,v retrieving revision 1.169 diff -u -p -r1.169 uvm_mmap.c --- uvm/uvm_mmap.c 19 Jan 2022 10:43:48 - 1.169 +++ uvm/uvm_mmap.c 12 Jun 2022 04:48:06 - @@ -70,6 +70,7 @@ #include #include /* for KBIND* */ #include +#include #include /* for __LDPGSZ */ @@ -1127,31 +1128,61 @@ sys_kbind(struct proc *p, void *v, regis size_t psize, s; u_long pc; int count, i, extra; + uint64_t cookie; int error; /* * extract syscall args from uap */ paramp = SCARG(uap, param); - psize = SCARG(uap, psize); /* a NULL paramp disables the syscall for the process */ if (paramp == NULL) { + mtx_enter(&pr->ps_mtx); if (pr->ps_kbind_addr != 0) - sigexit(p, SIGILL); + goto leave_sigill; pr->ps_kbind_addr = BOGO_PC; + mtx_leave(&pr->ps_mtx); return 0; } /* security checks */ + + /* +* ps_kbind_addr can only be set to 0 or BOGO_PC by the +* kernel, not by a call from userland. 
+*/ pc = PROC_PC(p); - if (pr->ps_kbind_addr == 0) { - pr->ps_kbind_addr = pc; - pr->ps_kbind_cookie = SCARG(uap, proc_cookie); - } else if (pc != pr->ps_kbind_addr || pc == BOGO_PC) - sigexit(p, SIGILL); - else if (pr->ps_kbind_cookie != SCARG(uap, proc_cookie)) - sigexit(p, SIGILL); + if (pc == 0 || pc == BOGO_PC) + goto sigill; + + cookie = SCARG(uap, proc_cookie); + if (pr->ps_kbind_addr == pc) { + membar_consumer(); + if (pr->ps_kbind_cookie != cookie) + goto sigill; + } else {
Re: start unlocking kbind(2)
On Wed, May 18, 2022 at 07:42:32PM -0600, Theo de Raadt wrote: > Mark Kettenis wrote: > > > > Isn't the vm_map_lock enough? > > > > Could be. The fast path is going to take that lock anyway. This > > would require a bit of surgery to uvm_map_extract() to make sure we > > don't take the vm_map_lock twice. Worth exploring I'd say. > > I think the vm_map_lock can be dropped before it reaches that code, > because of 3 cases: (1) new kbind lock, (2) a repeated kbind lock and > return, or (3) violation and process termination. > > So before doing the copyin() and updates, simply vm_map_unlock() > > Will that work and isn't it simpler than David's proposal? I'm not super familiar with uvm so it's hard for me to say it would or wouldn't be simpler, but my initial impressions are that it would be a lot more surgery and I wouldn't be confident I did such changes right. Releasing a lock and then taking it again quickly can have undesirable consequences too. If something else is waiting on the rwlock, releasing it will wake them up, but they will probably lose because we've already taken it again.
Re: start unlocking kbind(2)
On 18/05/22(Wed) 15:53, Alexander Bluhm wrote: > On Tue, May 17, 2022 at 10:44:54AM +1000, David Gwynne wrote: > > + cookie = SCARG(uap, proc_cookie); > > + if (pr->ps_kbind_addr == pc) { > > + membar_datadep_consumer(); > > + if (pr->ps_kbind_cookie != cookie) > > + goto sigill; > > + } else { > > You must use membar_consumer() here. membar_datadep_consumer() is > a barrier between reading pointer and pointed data. Only alpha > requires membar_datadep_consumer() for that, everywhere else it is > a NOP. > > > + mtx_enter(&pr->ps_mtx); > > + kpc = pr->ps_kbind_addr; > > Do we need kpc variable? I would prefer to read explicit > pr->ps_kbind_addr in the two places where we use it. > > I think the logic of barriers and mutexes is correct. > > with the suggestions above OK bluhm@ I believe you should go ahead with the current diff. ok with me. Moving the field under the scope of another lock can be easily done afterward.
Re: start unlocking kbind(2)
Mark Kettenis wrote: > > Isn't the vm_map_lock enough? > > Could be. The fast path is going to take that lock anyway. This > would require a bit of surgery to uvm_map_extract() to make sure we > don't take the vm_map_lock twice. Worth exploring I'd say. I think the vm_map_lock can be dropped before it reaches that code, because of 3 cases: (1) new kbind lock, (2) a repeated kbind lock and return, or (3) violation and process termination. So before doing the copyin() and updates, simply vm_map_unlock() Will that work and isn't it simpler than David's proposal?
Re: start unlocking kbind(2)
> From: "Theo de Raadt" > Date: Wed, 18 May 2022 10:48:01 -0600 > > Mark Kettenis wrote: > > > > I guess there is another question -- should the ps_kbind_* variables > > > be stored inside the uvmspace, rather than inside pr? > > > > I think there are arguments for both. But I don't think moving it > > makes things easier. > > Whatever proc sets the kbind configuration, wins. Any other proc must > set the same kbind (if it is different, it gets killed). Correct. > Isn't the vm_map_lock enough? Could be. The fast path is going to take that lock anyway. This would require a bit of surgery to uvm_map_extract() to make sure we don't take the vm_map_lock twice. Worth exploring I'd say.
Re: start unlocking kbind(2)
Mark Kettenis wrote: > > I guess there is another question -- should the ps_kbind_* variables > > be stored inside the uvmspace, rather than inside pr? > > I think there are arguments for both. But I don't think moving it > makes things easier. Whatever proc sets the kbind configuration, wins. Any other proc must set the same kbind (if it is different, it gets killed). Isn't the vm_map_lock enough?
Re: start unlocking kbind(2)
> From: "Theo de Raadt" > Date: Wed, 18 May 2022 10:09:38 -0600 > > I guess there is another question -- should the ps_kbind_* variables > be stored inside the uvmspace, rather than inside pr? I think there are arguments for both. But I don't think moving it makes things easier.
Re: start unlocking kbind(2)
> From: "Theo de Raadt" > Date: Wed, 18 May 2022 10:05:03 -0600 > > David Gwynne wrote: > > > from yet another perspective, broad use of a common lock can end up > > hurting in the long run because you may end up where everything is > > serialised and you have to go back and do a ton of work again anyway. > > > > > I think the theory is ps_kbind_addr is fixed to a shared address space > > > in "pr", if you are threaded there is only one ps_kbind_addr for all the > > > processes "p" sharing that address space. > > > > yes. > > > > > And execve() uses single_thread_set, which means you can't race? > > > > also yes. the value starts at 0, updates are serialised by the > > mutex, and the updates don't produce invalid intermediate state. > > so what is the mutex for? Is this being over-designed? > > On a shared address space, ld.so will be entered before we thread, > straight out of execve. I don't think we have constructors starting > threads, that would be very bad right? Unfortunately code exists that creates threads in constructors. At least that was the conclusion when we had a very similar discussion some months ago. It's a seriously bad idea, but the code that actually does this probably does even worse things. > So there are two cases -- static binary and dynaamic. On one address space, > kbind is only called once, and there is only one attempt, and any other > attempt should kill you, because you are doing it manual & wrong. <-- is > this correct, guenther? So in the dynamic case there may be multiple parallel calls to kbind(). They should all come from the same address though, unless someone is doing something deliberately nasty.
Re: start unlocking kbind(2)
I guess there is another question -- should the ps_kbind_* variables be stored inside the uvmspace, rather than inside pr?
Re: start unlocking kbind(2)
David Gwynne wrote: > from yet another perspective, broad use of a common lock can end up > hurting in the long run because you may end up where everything is > serialised and you have to go back and do a ton of work again anyway. > > > I think the theory is ps_kbind_addr is fixed to a shared address space > > in "pr", if you are threaded there is only one ps_kbind_addr for all the > > processes "p" sharing that address space. > > yes. > > > And execve() uses single_thread_set, which means you can't race? > > also yes. the value starts at 0, updates are serialised by the > mutex, and the updates don't produce invalid intermediate state. so what is the mutex for? Is this being over-designed? On a shared address space, ld.so will be entered before we thread, straight out of execve. I don't think we have constructors starting threads, that would be very bad right? So there are two cases -- static binary and dynamic. On one address space, kbind is only called once, and there is only one attempt, and any other attempt should kill you, because you are doing it manual & wrong. <-- is this correct, guenther?
Re: start unlocking kbind(2)
On Tue, May 17, 2022 at 10:44:54AM +1000, David Gwynne wrote: > + cookie = SCARG(uap, proc_cookie); > + if (pr->ps_kbind_addr == pc) { > + membar_datadep_consumer(); > + if (pr->ps_kbind_cookie != cookie) > + goto sigill; > + } else { You must use membar_consumer() here. membar_datadep_consumer() is a barrier between reading pointer and pointed data. Only alpha requires membar_datadep_consumer() for that, everywhere else it is a NOP. > + mtx_enter(&pr->ps_mtx); > + kpc = pr->ps_kbind_addr; Do we need kpc variable? I would prefer to read explicit pr->ps_kbind_addr in the two places where we use it. I think the logic of barriers and mutexes is correct. with the suggestions above OK bluhm@
Re: start unlocking kbind(2)
On Tue, May 17, 2022 at 10:14:18AM -0600, Theo de Raadt wrote: > Martin Pieuchot wrote: > > > On 17/05/22(Tue) 10:44, David Gwynne wrote: > > > this narrows the scope of the KERNEL_LOCK in kbind(2) so the syscall > > > argument checks can be done without the kernel lock. > > > > > > care is taken to allow the pc/cookie checks to run without any lock by > > > being careful with the order of the checks. all modifications to the > > > pc/cookie state are serialised by the per process mutex. > > > > I don't understand why it is safe to do the following check without > > holding a mutex: because you can't read any invalid intermediate value. in the fast path here, either it is the expected value when you're using the interface correctly, or it isnt. if it isn't then you take the "slow" path by taking the mutex and doing the comprehensive value checks and updates. > > > > if (pr->ps_kbind_addr == pc) > > ... > > > > Is there much differences when always grabbing the per-process mutex? sure, that sequence is maybe half a dozen instructions without the mutex, and many times that with the mutex. from a different perspective, i can't see a big difference between kernel compile times with or without this diff, so right now the mutex isn't going to make a difference either. from yet another perspective, broad use of a common lock can end up hurting in the long run because you may end up where everything is serialised and you have to go back and do a ton of work again anyway. > I think the theory is ps_kbind_addr is fixed to a shared address space > in "pr", if you are threaded there is only one ps_kbind_addr for all the > processes "p" sharing that address space. yes. > And execve() uses single_thread_set, which means you can't race? also yes. the value starts at 0, updates are serialised by the mutex, and the updates don't produce invalid intermediate state.
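The scheme dlg describes, an unlocked fast-path check with a barrier plus a mutex-serialised slow path for initialization, can be sketched in userland C11. Hypothetical names only: an acquire load stands in for membar_consumer(), a release store for the publish ordering, and a spinning flag for ps_mtx.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Sketch only; these names mimic, but are not, the kernel's. */
static atomic_flag ps_mtx = ATOMIC_FLAG_INIT;
static _Atomic uintptr_t ps_kbind_addr;		/* 0 means uninitialized */
static uint64_t ps_kbind_cookie;

int
kbind_check(uintptr_t pc, uint64_t cookie)
{
	uintptr_t addr;
	int error = 0;

	/* Fast path, no lock: the acquire load plays the role of
	 * membar_consumer(), pairing with the release store below. */
	if (atomic_load_explicit(&ps_kbind_addr, memory_order_acquire) == pc)
		return (ps_kbind_cookie == cookie) ? 0 : -1;

	while (atomic_flag_test_and_set(&ps_mtx))
		;					/* mtx_enter() */
	addr = atomic_load_explicit(&ps_kbind_addr, memory_order_relaxed);
	if (addr == 0) {
		/* First caller wins: store the cookie first and publish
		 * the address last, so the fast path can never see a
		 * matching address with an unset cookie. */
		ps_kbind_cookie = cookie;
		atomic_store_explicit(&ps_kbind_addr, pc,
		    memory_order_release);
	} else if (addr != pc || ps_kbind_cookie != cookie)
		error = -1;			/* would raise SIGILL */
	atomic_flag_clear(&ps_mtx);		/* mtx_leave() */
	return error;
}
```

The write ordering is the whole trick: because updates don't produce an invalid intermediate state, the unlocked read is either the expected value or it falls through to the checks under the mutex.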
Re: start unlocking kbind(2)
Martin Pieuchot wrote:

> On 17/05/22(Tue) 10:44, David Gwynne wrote:
> > this narrows the scope of the KERNEL_LOCK in kbind(2) so the syscall
> > argument checks can be done without the kernel lock.
> >
> > care is taken to allow the pc/cookie checks to run without any lock by
> > being careful with the order of the checks. all modifications to the
> > pc/cookie state are serialised by the per process mutex.
>
> I don't understand why it is safe to do the following check without
> holding a mutex:
>
> 	if (pr->ps_kbind_addr == pc)
> 		...
>
> Is there much difference when always grabbing the per-process mutex?

I think the theory is ps_kbind_addr is fixed to a shared address space
in "pr": if you are threaded there is only one ps_kbind_addr for all the
processes "p" sharing that address space.

And execve() uses single_thread_set(), which means you can't race?
Re: start unlocking kbind(2)
On 17/05/22(Tue) 10:44, David Gwynne wrote:
> this narrows the scope of the KERNEL_LOCK in kbind(2) so the syscall
> argument checks can be done without the kernel lock.
>
> care is taken to allow the pc/cookie checks to run without any lock by
> being careful with the order of the checks. all modifications to the
> pc/cookie state are serialised by the per process mutex.

I don't understand why it is safe to do the following check without
holding a mutex:

	if (pr->ps_kbind_addr == pc)
		...

Is there much difference when always grabbing the per-process mutex?

> i don't know enough about uvm to say whether it is safe to unlock the
> actual memory updates too, but even if i was confident i would still
> prefer to change it as a separate step.

I agree.

> Index: kern/init_sysent.c
> ===================================================================
> RCS file: /cvs/src/sys/kern/init_sysent.c,v
> retrieving revision 1.236
> diff -u -p -r1.236 init_sysent.c
> --- kern/init_sysent.c	1 May 2022 23:00:04 -0000	1.236
> +++ kern/init_sysent.c	17 May 2022 00:36:03 -0000
> @@ -1,4 +1,4 @@
> -/* $OpenBSD: init_sysent.c,v 1.236 2022/05/01 23:00:04 tedu Exp $ */
> +/* $OpenBSD$ */
>
>  /*
>   * System call switch table.
> @@ -204,7 +204,7 @@ const struct sysent sysent[] = {
>  	    sys_utimensat },			/* 84 = utimensat */
>  	{ 2, s(struct sys_futimens_args), 0,
>  	    sys_futimens },			/* 85 = futimens */
> -	{ 3, s(struct sys_kbind_args), 0,
> +	{ 3, s(struct sys_kbind_args), SY_NOLOCK | 0,
>  	    sys_kbind },			/* 86 = kbind */
>  	{ 2, s(struct sys_clock_gettime_args), SY_NOLOCK | 0,
>  	    sys_clock_gettime },			/* 87 = clock_gettime */
> Index: kern/syscalls.master
> ===================================================================
> RCS file: /cvs/src/sys/kern/syscalls.master,v
> retrieving revision 1.223
> diff -u -p -r1.223 syscalls.master
> --- kern/syscalls.master	24 Feb 2022 07:41:51 -0000	1.223
> +++ kern/syscalls.master	17 May 2022 00:36:03 -0000
> @@ -194,7 +194,7 @@
>  			    const struct timespec *times, int flag); }
>  85	STD		{ int sys_futimens(int fd, \
>  			    const struct timespec *times); }
> -86	STD		{ int sys_kbind(const struct __kbind *param, \
> +86	STD NOLOCK	{ int sys_kbind(const struct __kbind *param, \
>  			    size_t psize, int64_t proc_cookie); }
>  87	STD NOLOCK	{ int sys_clock_gettime(clockid_t clock_id, \
>  			    struct timespec *tp); }
> Index: uvm/uvm_mmap.c
> ===================================================================
> RCS file: /cvs/src/sys/uvm/uvm_mmap.c,v
> retrieving revision 1.169
> diff -u -p -r1.169 uvm_mmap.c
> --- uvm/uvm_mmap.c	19 Jan 2022 10:43:48 -0000	1.169
> +++ uvm/uvm_mmap.c	17 May 2022 00:36:03 -0000
> @@ -70,6 +70,7 @@
>  #include
>  #include			/* for KBIND* */
>  #include
> +#include
>
>  #include			/* for __LDPGSZ */
>
> @@ -1125,33 +1126,64 @@ sys_kbind(struct proc *p, void *v, regis
>  	const char *data;
>  	vaddr_t baseva, last_baseva, endva, pageoffset, kva;
>  	size_t psize, s;
> -	u_long pc;
> +	u_long pc, kpc;
>  	int count, i, extra;
> +	uint64_t cookie;
>  	int error;
>
>  	/*
>  	 * extract syscall args from uap
>  	 */
>  	paramp = SCARG(uap, param);
> -	psize = SCARG(uap, psize);
>
>  	/* a NULL paramp disables the syscall for the process */
>  	if (paramp == NULL) {
> +		mtx_enter(&pr->ps_mtx);
>  		if (pr->ps_kbind_addr != 0)
> -			sigexit(p, SIGILL);
> +			goto leave_sigill;
>  		pr->ps_kbind_addr = BOGO_PC;
> +		mtx_leave(&pr->ps_mtx);
>  		return 0;
>  	}
>
>  	/* security checks */
> +
> +	/*
> +	 * ps_kbind_addr can only be set to 0 or BOGO_PC by the
> +	 * kernel, not by a call from userland.
> +	 */
>  	pc = PROC_PC(p);
> -	if (pr->ps_kbind_addr == 0) {
> -		pr->ps_kbind_addr = pc;
> -		pr->ps_kbind_cookie = SCARG(uap, proc_cookie);
> -	} else if (pc != pr->ps_kbind_addr || pc == BOGO_PC)
> -		sigexit(p, SIGILL);
> -	else if (pr->ps_kbind_cookie != SCARG(uap, proc_cookie))
> -		sigexit(p, SIGILL);
> +	if (pc == 0 || pc == BOGO_PC)
> +		goto sigill;
> +
> +	cookie = SCARG(uap, proc_cookie);
> +	if (pr->ps_kbind_addr == pc) {
> +		membar_datadep_consumer();
> +		if (pr->ps_kbind_cookie != cookie)
> +			goto sigill;
> +	} else {
> +		mtx_enter(&pr->ps_mtx);
> +		kpc = pr->ps_kbind_addr;
> +
> +		/*
> +		 * If we're the first thread in (kpc is 0), then
> +		 *
start unlocking kbind(2)
this narrows the scope of the KERNEL_LOCK in kbind(2) so the syscall
argument checks can be done without the kernel lock.

care is taken to allow the pc/cookie checks to run without any lock by
being careful with the order of the checks. all modifications to the
pc/cookie state are serialised by the per process mutex.

i don't know enough about uvm to say whether it is safe to unlock the
actual memory updates too, but even if i was confident i would still
prefer to change it as a separate step.

ok?

Index: kern/init_sysent.c
===================================================================
RCS file: /cvs/src/sys/kern/init_sysent.c,v
retrieving revision 1.236
diff -u -p -r1.236 init_sysent.c
--- kern/init_sysent.c	1 May 2022 23:00:04 -0000	1.236
+++ kern/init_sysent.c	17 May 2022 00:36:03 -0000
@@ -1,4 +1,4 @@
-/* $OpenBSD: init_sysent.c,v 1.236 2022/05/01 23:00:04 tedu Exp $ */
+/* $OpenBSD$ */
 
 /*
  * System call switch table.
@@ -204,7 +204,7 @@ const struct sysent sysent[] = {
 	    sys_utimensat },			/* 84 = utimensat */
 	{ 2, s(struct sys_futimens_args), 0,
 	    sys_futimens },			/* 85 = futimens */
-	{ 3, s(struct sys_kbind_args), 0,
+	{ 3, s(struct sys_kbind_args), SY_NOLOCK | 0,
 	    sys_kbind },			/* 86 = kbind */
 	{ 2, s(struct sys_clock_gettime_args), SY_NOLOCK | 0,
 	    sys_clock_gettime },			/* 87 = clock_gettime */
Index: kern/syscalls.master
===================================================================
RCS file: /cvs/src/sys/kern/syscalls.master,v
retrieving revision 1.223
diff -u -p -r1.223 syscalls.master
--- kern/syscalls.master	24 Feb 2022 07:41:51 -0000	1.223
+++ kern/syscalls.master	17 May 2022 00:36:03 -0000
@@ -194,7 +194,7 @@
 			    const struct timespec *times, int flag); }
 85	STD		{ int sys_futimens(int fd, \
 			    const struct timespec *times); }
-86	STD		{ int sys_kbind(const struct __kbind *param, \
+86	STD NOLOCK	{ int sys_kbind(const struct __kbind *param, \
 			    size_t psize, int64_t proc_cookie); }
 87	STD NOLOCK	{ int sys_clock_gettime(clockid_t clock_id, \
 			    struct timespec *tp); }
Index: uvm/uvm_mmap.c
===================================================================
RCS file: /cvs/src/sys/uvm/uvm_mmap.c,v
retrieving revision 1.169
diff -u -p -r1.169 uvm_mmap.c
--- uvm/uvm_mmap.c	19 Jan 2022 10:43:48 -0000	1.169
+++ uvm/uvm_mmap.c	17 May 2022 00:36:03 -0000
@@ -70,6 +70,7 @@
 #include
 #include			/* for KBIND* */
 #include
+#include
 
 #include			/* for __LDPGSZ */
 
@@ -1125,33 +1126,64 @@ sys_kbind(struct proc *p, void *v, regis
 	const char *data;
 	vaddr_t baseva, last_baseva, endva, pageoffset, kva;
 	size_t psize, s;
-	u_long pc;
+	u_long pc, kpc;
 	int count, i, extra;
+	uint64_t cookie;
 	int error;
 
 	/*
 	 * extract syscall args from uap
 	 */
 	paramp = SCARG(uap, param);
-	psize = SCARG(uap, psize);
 
 	/* a NULL paramp disables the syscall for the process */
 	if (paramp == NULL) {
+		mtx_enter(&pr->ps_mtx);
 		if (pr->ps_kbind_addr != 0)
-			sigexit(p, SIGILL);
+			goto leave_sigill;
 		pr->ps_kbind_addr = BOGO_PC;
+		mtx_leave(&pr->ps_mtx);
 		return 0;
 	}
 
 	/* security checks */
+
+	/*
+	 * ps_kbind_addr can only be set to 0 or BOGO_PC by the
+	 * kernel, not by a call from userland.
+	 */
 	pc = PROC_PC(p);
-	if (pr->ps_kbind_addr == 0) {
-		pr->ps_kbind_addr = pc;
-		pr->ps_kbind_cookie = SCARG(uap, proc_cookie);
-	} else if (pc != pr->ps_kbind_addr || pc == BOGO_PC)
-		sigexit(p, SIGILL);
-	else if (pr->ps_kbind_cookie != SCARG(uap, proc_cookie))
-		sigexit(p, SIGILL);
+	if (pc == 0 || pc == BOGO_PC)
+		goto sigill;
+
+	cookie = SCARG(uap, proc_cookie);
+	if (pr->ps_kbind_addr == pc) {
+		membar_datadep_consumer();
+		if (pr->ps_kbind_cookie != cookie)
+			goto sigill;
+	} else {
+		mtx_enter(&pr->ps_mtx);
+		kpc = pr->ps_kbind_addr;
+
+		/*
+		 * If we're the first thread in (kpc is 0), then
+		 * remember where the call came from for the future.
+		 *
+		 * Maybe another thread raced us and won. pc cannot
+		 * be BOGO_PC, so this also handles the check for
+		 * the disabled case.
+		 */
+
+		if (kpc == 0) {
+			pr->ps_kbind_cookie = cookie;
+