kern_exit.c: 2 tiny buglets + a potential uvm leak
Hi,

1) pr->ps_ru is already known to be NULL here, so the code can be shrunk.

2) Missing timeout_del: we clear the two timeouts in struct process, but
   not the one in struct proc.

3) ps_mainproc is allocated in thread_new(), passed to process_new(), and
   then to process_initialize().  ps_mainproc never gets a call to
   uvm_uarea_free().  So, to allow ps_mainproc to be freed when
   process_zap() is finally called, we move the uvm_uarea_free() code
   into proc_free().

Is this analysis correct or wrong?

Thanks

Index: kern/kern_exit.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_exit.c,v
retrieving revision 1.184
diff -u -p -u -p -r1.184 kern_exit.c
--- kern/kern_exit.c	28 Feb 2020 17:03:05 -0000	1.184
+++ kern/kern_exit.c	29 Feb 2020 20:29:16 -0000
@@ -172,12 +172,7 @@ exit1(struct proc *p, int xexit, int xsi
 	rup = pr->ps_ru;
 	if (rup == NULL) {
 		rup = pool_get(&rusage_pool, PR_WAITOK | PR_ZERO);
-		if (pr->ps_ru == NULL) {
-			pr->ps_ru = rup;
-		} else {
-			pool_put(&rusage_pool, rup);
-			rup = pr->ps_ru;
-		}
+		pr->ps_ru = rup;
 	}
 	p->p_siglist = 0;
 	if ((p->p_flag & P_THREAD) == 0)
@@ -390,6 +385,14 @@ exit2(struct proc *p)
 void
 proc_free(struct proc *p)
 {
+	timeout_del(&p->p_sleep_to);
+	/*
+	 * Free the VM resources we're still holding on to.
+	 * We must do this from a valid thread because doing
+	 * so may block.
+	 */
+	uvm_uarea_free(p);
+	p->p_vmspace = NULL;		/* zap the thread's copy */
 	crfree(p->p_ucred);
 	pool_put(&proc_pool, p);
 	nthreads--;
@@ -422,14 +425,6 @@ reaper(void *arg)
 	WITNESS_THREAD_EXIT(p);
 
 	KERNEL_LOCK();
-
-	/*
-	 * Free the VM resources we're still holding on to.
-	 * We must do this from a valid thread because doing
-	 * so may block.
-	 */
-	uvm_uarea_free(p);
-	p->p_vmspace = NULL;		/* zap the thread's copy */
 
 	if (p->p_flag & P_THREAD) {
 		/* Just a thread */
Re: Debugger, traced processes & exit status
Hi,

> @@ -303,6 +312,7 @@ struct process {
>  #define	PS_PLEDGE	0x00100000	/* Has called pledge(2) */
>  #define	PS_WXNEEDED	0x00200000	/* Process may violate W^X */
>  #define	PS_EXECPLEDGE	0x00400000	/* Has exec pledges */
> +#define	PS_ORPHAN	0x00800000	/* Process is on an orphan list */
> 
>  #define	PS_BITS \
>      ("\20" "\01CONTROLT" "\02EXEC" "\03INEXEC" "\04EXITING" "\05SUGID" \

A PS_ORPHAN entry needs to be added to PS_BITS.

Off topic for this diff: kern_exit.c is a weird mix of ISSET() and
open-coded boolean comparisons, i.e. 1) flag_field & FLAG and 2) == 0.
Is it OK to send a follow-up diff using ISSET() or !ISSET() as
applicable throughout this file, so it reads more easily?

Thanks
Re: Debugger, traced processes & exit status
On 28/02/20(Fri) 18:24, Martin Pieuchot wrote:
> Diff below fixes multiple issues with traced processes, exposed by the
> regression test attached:
>
> - When a debugging process exits, give back the traced process to its
>   original parent, if it exists, instead of init(8).
>
> - When a traced process exits, make sure the original parent receives
>   the exit status only after the debugger has seen it.  This is done
>   by keeping a list of 'orphaned' children in the original parent and
>   looking in it in dowait4().
>
> The logic and most of the code comes from FreeBSD, as pointed out by
> guenther@.

Improved version that:

- Introduces process_clear_orphan() to remove a process from an orphan
  list.

- Introduces process_untrace() to give back a process to its original
  parent or to init(8), used in both PT_DETACH & exit1().

- Renames proc_reparent() to process_reparent() for coherency.

Ok?

Index: kern/kern_exit.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_exit.c,v
retrieving revision 1.184
diff -u -p -r1.184 kern_exit.c
--- kern/kern_exit.c	28 Feb 2020 17:03:05 -0000	1.184
+++ kern/kern_exit.c	29 Feb 2020 18:50:09 -0000
@@ -76,6 +76,7 @@
 #endif
 
 void	proc_finish_wait(struct proc *, struct proc *);
+void	process_clear_orphan(struct process *);
 void	process_zap(struct process *);
 void	proc_free(struct proc *);
 void	unveil_destroy(struct process *ps);
@@ -253,21 +254,22 @@ exit1(struct proc *p, int xexit, int xsi
 	}
 
 	/*
-	 * Give orphaned children to init(8).
+	 * Reparent children to their original parent, in case
+	 * they were being traced, or to init(8).
 	 */
 	qr = LIST_FIRST(&pr->ps_children);
 	if (qr)		/* only need this if any child is S_ZOMB */
 		wakeup(initprocess);
 	for (; qr != 0; qr = nqr) {
 		nqr = LIST_NEXT(qr, ps_sibling);
-		proc_reparent(qr, initprocess);
 		/*
 		 * Traced processes are killed since their
 		 * existence means someone is screwing up.
 		 */
 		if (qr->ps_flags & PS_TRACED &&
 		    !(qr->ps_flags & PS_EXITING)) {
-			atomic_clearbits_int(&qr->ps_flags, PS_TRACED);
+			process_untrace(qr);
+
 			/*
 			 * If single threading is active,
 			 * direct the signal to the active
@@ -278,8 +280,19 @@ exit1(struct proc *p, int xexit, int xsi
 				    STHREAD);
 			else
 				prsignal(qr, SIGKILL);
+		} else {
+			process_reparent(qr, initprocess);
 		}
 	}
+
+	/*
+	 * Make sure orphans won't remember the exiting process.
+	 */
+	while ((qr = LIST_FIRST(&pr->ps_orphans)) != NULL) {
+		KASSERT(qr->ps_oppid == pr->ps_pid);
+		qr->ps_oppid = 0;
+		process_clear_orphan(qr);
+	}
 	}
 
 	/* add thread's accumulated rusage into the process's total */
@@ -310,7 +323,7 @@ exit1(struct proc *p, int xexit, int xsi
 	 */
 	if (pr->ps_flags & PS_NOZOMBIE) {
 		struct process *ppr = pr->ps_pptr;
-		proc_reparent(pr, initprocess);
+		process_reparent(pr, initprocess);
 		wakeup(ppr);
 	}
 
@@ -562,6 +575,29 @@ loop:
 				return (0);
 			}
 		}
	}
+	/*
+	 * Look in the orphans list too, to allow the parent to
+	 * collect its child's exit status even if the child is being
+	 * debugged.
+	 *
+	 * Debugger detaches from the parent upon successful
+	 * switch-over from parent to child.  At this point due to
+	 * re-parenting the parent loses the child to the debugger and a
+	 * wait4(2) call would report that it has no children to wait
+	 * for.  By maintaining a list of orphans we allow the parent
+	 * to successfully wait until the child becomes a zombie.
+	 */
+	if (nfound == 0) {
+		LIST_FOREACH(pr, &q->p_p->ps_orphans, ps_orphan) {
+			if ((pr->ps_flags & PS_NOZOMBIE) ||
+			    (pid != WAIT_ANY &&
+			    pr->ps_pid != pid &&
+			    pr->ps_pgid != -pid))
+				continue;
+			nfound++;
+			break;
+		}
+	}
 	if (nfound == 0)
Re: patch for dump for high percentages
This is only hiding a problem in the estimation math, rather than
fixing it.  I don't like it.

Peter J. Philipp wrote:
> Hi,
>
> I have a patch for dump(8), in case it is generally considered bad for
> the percentage done to go over 100.0%.  I checked the archives on
> marc.info for this and didn't see any discussion whether this was a
> topic before.
>
> Here is the odd DUMP message I got on a host:
>
> DUMP: 102.41% done, finished in 0:00
>
> And here is the patch:
>
> Index: optr.c
> ===================================================================
> RCS file: /cvs/src/sbin/dump/optr.c,v
> retrieving revision 1.40
> diff -u -p -u -r1.40 optr.c
> --- optr.c	22 Jan 2019 16:16:26 -0000	1.40
> +++ optr.c	29 Feb 2020 16:19:25 -0000
> @@ -202,6 +202,7 @@ void
>  timeest(void)
>  {
>  	time_t	tnow, deltat;
> +	float	percent;
>  
>  	(void) time(&tnow);
>  	if (tnow >= tschedule) {
> @@ -211,8 +212,9 @@ timeest(void)
>  		deltat = tstart_writing - tnow +
>  			(1.0 * (tnow - tstart_writing))
>  			/ blockswritten * tapesize;
> +		percent = (blockswritten * 100.0) / tapesize;
>  		msg("%3.2f%% done, finished in %lld:%02lld\n",
> -		    (blockswritten * 100.0) / tapesize,
> +		    (percent > 100.0) ? 100.0 : percent,
>  		    (long long)deltat / 3600,
>  		    ((long long)deltat % 3600) / 60);
>  	}
>
> I tested this and it seems to report as before, only once it hits
> 100.0 it will not go higher; I didn't hit that particular code path,
> though.
>
> beta# dump -0uaf - /var | gzip -c > /dev/null
>   DUMP: Date of this level 0 dump: Sat Feb 29 17:20:26 2020
>   DUMP: Date of last level 0 dump: the epoch
>   DUMP: Dumping /dev/rsd2e (/var) to standard output
>   DUMP: mapping (Pass I) [regular files]
>   DUMP: mapping (Pass II) [directories]
>   DUMP: estimated 23891439 tape blocks.
>   DUMP: Volume 1 started at: Sat Feb 29 17:20:37 2020
>   DUMP: dumping (Pass III) [directories]
>   DUMP: dumping (Pass IV) [regular files]
>   DUMP: 25.98% done, finished in 0:14
>
> Let me know what you think,
>
> Regards,
> -peter
Fix use of WITNESS_UNLOCK() in rw_exit_{read,write}()
There is a bug in how rw_exit_read() and rw_exit_write() use
WITNESS_UNLOCK().  The fast paths call WITNESS_UNLOCK() after releasing
the actual lock.  This leads to a use-after-free in the checker if the
lock is dynamically allocated and another thread happens to free it too
early.

The following diff splits rw_exit() into two separate functions: one is
the rw_exit(9) frontend, and the other, rw_do_exit(), is common to
rw_exit(9), rw_exit_read(9) and rw_exit_write(9).  This makes it
possible to call WITNESS_UNLOCK() before the actual lock release
without turning the code into an #ifdef maze.

The diff additionally does the following tweaks:

* Invoke membar_exit_before_atomic() only once even if the slow path is
  taken.  The initial barrier is enough for the release semantics.

* Read rwl_owner for rw_cas() as late as possible so that the CAS
  operation is less likely to fail.  Note that in rw_exit() the purpose
  of the read is different: it determines the type of the lock
  (read / write).

* Put the rw_cas() in rw_exit() (now rw_do_exit()) inside
  __predict_false().  This compacts the resulting machine code a bit.

The code could be simplified by removing the fast path from
rw_exit_read() and rw_exit_write() and always calling rw_do_exit().
However, that would reduce performance a tiny amount; I saw about a
0.5% increase in kernel build time in a test setup.  The patch below
does not seem to affect performance negatively.

OK?
Index: kern/kern_rwlock.c
===================================================================
RCS file: src/sys/kern/kern_rwlock.c,v
retrieving revision 1.44
diff -u -p -r1.44 kern_rwlock.c
--- kern/kern_rwlock.c	30 Nov 2019 11:19:17 -0000	1.44
+++ kern/kern_rwlock.c	29 Feb 2020 16:24:59 -0000
@@ -25,6 +25,8 @@
 #include <sys/atomic.h>
 #include <sys/witness.h>
 
+void	rw_do_exit(struct rwlock *, unsigned long);
+
 /* XXX - temporary measure until proc0 is properly aligned */
 #define RW_PROC(p) (((long)p) & ~RWLOCK_MASK)
 
@@ -129,31 +131,31 @@ rw_enter_write(struct rwlock *rwl)
 void
 rw_exit_read(struct rwlock *rwl)
 {
-	unsigned long owner = rwl->rwl_owner;
+	unsigned long owner;
 
 	rw_assert_rdlock(rwl);
+	WITNESS_UNLOCK(&rwl->rwl_lock_obj, 0);
 
 	membar_exit_before_atomic();
+	owner = rwl->rwl_owner;
 	if (__predict_false((owner & RWLOCK_WAIT) ||
 	    rw_cas(&rwl->rwl_owner, owner, owner - RWLOCK_READ_INCR)))
-		rw_exit(rwl);
-	else
-		WITNESS_UNLOCK(&rwl->rwl_lock_obj, 0);
+		rw_do_exit(rwl, 0);
 }
 
 void
 rw_exit_write(struct rwlock *rwl)
 {
-	unsigned long owner = rwl->rwl_owner;
+	unsigned long owner;
 
 	rw_assert_wrlock(rwl);
+	WITNESS_UNLOCK(&rwl->rwl_lock_obj, LOP_EXCLUSIVE);
 
 	membar_exit_before_atomic();
+	owner = rwl->rwl_owner;
 	if (__predict_false((owner & RWLOCK_WAIT) ||
 	    rw_cas(&rwl->rwl_owner, owner, 0)))
-		rw_exit(rwl);
-	else
-		WITNESS_UNLOCK(&rwl->rwl_lock_obj, LOP_EXCLUSIVE);
+		rw_do_exit(rwl, RWLOCK_WRLOCK);
 }
 
 #ifdef DIAGNOSTIC
@@ -314,22 +316,29 @@ retry:
 void
 rw_exit(struct rwlock *rwl)
 {
-	unsigned long owner = rwl->rwl_owner;
-	int wrlock = owner & RWLOCK_WRLOCK;
-	unsigned long set;
+	unsigned long wrlock;
 
 	/* Avoid deadlocks after panic or in DDB */
 	if (panicstr || db_active)
 		return;
 
+	wrlock = rwl->rwl_owner & RWLOCK_WRLOCK;
 	if (wrlock)
 		rw_assert_wrlock(rwl);
 	else
 		rw_assert_rdlock(rwl);
-
 	WITNESS_UNLOCK(&rwl->rwl_lock_obj, wrlock ? LOP_EXCLUSIVE : 0);
 
 	membar_exit_before_atomic();
+	rw_do_exit(rwl, wrlock);
+}
+
+/* membar_exit_before_atomic() has to precede call of this function. */
+void
+rw_do_exit(struct rwlock *rwl, unsigned long wrlock)
+{
+	unsigned long owner, set;
+
 	do {
 		owner = rwl->rwl_owner;
 		if (wrlock)
@@ -337,7 +346,7 @@ rw_exit(struct rwlock *rwl)
 		else
 			set = (owner - RWLOCK_READ_INCR) &
 			    ~(RWLOCK_WAIT|RWLOCK_WRWANT);
-	} while (rw_cas(&rwl->rwl_owner, owner, set));
+	} while (__predict_false(rw_cas(&rwl->rwl_owner, owner, set)));
 
 	if (owner & RWLOCK_WAIT)
 		wakeup(rwl);
patch for dump for high percentages
Hi,

I have a patch for dump(8), in case it is generally considered bad for
the percentage done to go over 100.0%.  I checked the archives on
marc.info for this and didn't see any discussion whether this was a
topic before.

Here is the odd DUMP message I got on a host:

DUMP: 102.41% done, finished in 0:00

And here is the patch:

Index: optr.c
===================================================================
RCS file: /cvs/src/sbin/dump/optr.c,v
retrieving revision 1.40
diff -u -p -u -r1.40 optr.c
--- optr.c	22 Jan 2019 16:16:26 -0000	1.40
+++ optr.c	29 Feb 2020 16:19:25 -0000
@@ -202,6 +202,7 @@ void
 timeest(void)
 {
 	time_t	tnow, deltat;
+	float	percent;
 
 	(void) time(&tnow);
 	if (tnow >= tschedule) {
@@ -211,8 +212,9 @@ timeest(void)
 		deltat = tstart_writing - tnow +
 			(1.0 * (tnow - tstart_writing))
 			/ blockswritten * tapesize;
+		percent = (blockswritten * 100.0) / tapesize;
 		msg("%3.2f%% done, finished in %lld:%02lld\n",
-		    (blockswritten * 100.0) / tapesize,
+		    (percent > 100.0) ? 100.0 : percent,
 		    (long long)deltat / 3600,
 		    ((long long)deltat % 3600) / 60);
 	}

I tested this and it seems to report as before, only once it hits 100.0
it will not go higher; I didn't hit that particular code path, though.

beta# dump -0uaf - /var | gzip -c > /dev/null
  DUMP: Date of this level 0 dump: Sat Feb 29 17:20:26 2020
  DUMP: Date of last level 0 dump: the epoch
  DUMP: Dumping /dev/rsd2e (/var) to standard output
  DUMP: mapping (Pass I) [regular files]
  DUMP: mapping (Pass II) [directories]
  DUMP: estimated 23891439 tape blocks.
  DUMP: Volume 1 started at: Sat Feb 29 17:20:37 2020
  DUMP: dumping (Pass III) [directories]
  DUMP: dumping (Pass IV) [regular files]
  DUMP: 25.98% done, finished in 0:14

Let me know what you think,

Regards,
-peter
Re: iked(8): Recheck SA proposals when updating policy
On Wed, Feb 26, 2020 at 11:41:49PM +0100, Tobias Heider wrote:
> Due to the design of the IKEv2 protocol, the receiver does not
> know which policy the initiator tries to negotiate an SA for
> until the second exchange (IKE_AUTH).  The IKE_AUTH request contains
> the ID payload, which the responder uses to match a policy (and look
> up authentication keys).  Until then, iked will fall back to the
> default policy and then update it later.
>
> The cryptographic proposal and key exchange, on the other hand, are
> negotiated in the first exchange (IKE_SA_INIT), when the responder
> does not know the initiator's ID and thus the actual policy.
>
> The attached diff adds an additional check to make sure that, when
> the policy changes in IKE_AUTH because the ID belongs to a different
> policy, the negotiated SA's proposal is still compatible with the
> updated policy.

Yes, this makes sense to me.  The diff reads OK, although I'm no iked
expert.  I've been running with it on sparc64 for the last few days,
and all my peers continue to work.