Sorry for the delay in replying today. I fell into a deep pit of DWARF arcana all day and let your message sit (and not even the regularly scheduled DWARF arcana I had intended to work on this week!).
> Yes sure. But I think this part is most important and most hard, and > other changes may depend on locking very much. True enough. Off hand I don't think most of the individual data structure cleanup items I suggested differ much from one kind of locking to another. But if you are actively busy cogitating on the locking, no reason not to keep at that first. (I just didn't want anything easier to get blocked while waiting for feedback from me.) > Sure! The problem is I still can't see how can we do this without > complicating the current code "too much". I spent several hours doing > nothing except thinking about these changes, but all I can say now is: > I need to think more ;) And of course I appreciate your ideas. I figured it might be tough, but more or less all the big hopes depend on it being possible to make it cleaner by going this way. > Just for example, it would be really nice to avoid taking > current->ptrace_mutex > in exit_ptrace() when there are no tracees, but I don't see how can we do this > without races with ptrace_traceme(). But this is a minor detail. Right. > Consider 2 threads, T1 and T2. T2 forks the child C. > > T1 calls do_wait(). It scans T->children and finds nothing. Scans which? T1->children scan will see nothing. So say it has __WNOTHREAD, so then T1 does not also scan T2->children later. > Then it calls ptrace_do_wait() which temporary drops tasklist. > > T2 exits, reparents C to T1. If C is zombie that does do_notify_parent->wake_up_parent. If C is stopped or running, no wakeup. > T1 finds nothing interesting in ->ptraced (not that it matters, but > it is possible the list was not empty when we enter ptrace_do_wait, > but it is empty when we take ->ptrace_mutex). Right, fine. BTW, we need to remember to __set_current_state back to TASK_INTERRUPTIBLE before taking the tasklist_lock again (after the mutex et al presumably changed ->state). > T1 takes tasklist again. We check tsk == T1 has not exited (or we > some other check) and continue. > > do_wait() returns -ECHILD. With __WNOTHREAD, yes. And that's fine. It was true when the wait call was made, and it didn't block. From the user's perspective, the whole wait call happened "before" T2 finished exiting. Without __WNOTHREAD, then T1 scanned T2->children (and T2->ptraced) after the "T1 takes tasklist again" step in your scenario, but C was already gone. If C was already zombie when reparented, T2 did wake_up_parent while T1 had released tasklist_lock. But if there were no other eligible children, then it returns -ECHILD before even trying to block. > We need something more clever here. Yes. I think we can do it with two hacks, which means no more than four or five actual hacks. First, make reparenting always do a wake_up_parent, not just for zombies. Second, upon retaking tasklist_lock, ensure that if we've been woken, we reset *notask_error = 0 so we'll schedule() and not really block, and then restart; or if WNOHANG, just return -ERESTARTNOINTR (or a chosen retval to goto repeat without actual syscall return/restart, e.g. -EAGAIN and do_wait checks for that so we don't set TIF_SIGPENDING). @@ -1611,6 +1611,13 @@ repeat: } while (tsk != current); read_unlock(&tasklist_lock); + if (retval == -EAGAIN) { + if (signal_pending(current)) + retval = -ERESTARTSYS; + else + goto repeat; + } + if (!retval && !(options & WNOHANG)) { retval = -ERESTARTSYS; if (!signal_pending(current)) { For "if we've been woken", it might make sense to roll this in with dusting off and finishing up the patch that optimizes do_wait() with __WNOTHREAD or pid constraints. It uses a custom wake_queue.func function so that a blocked do_wait() uninterested in the dying/stopping waker does not get woken up just to spin around the list and block. The custom wake function could also do something like use a container_of(wait_queue,,) to find the notask_error ("retval" in do_wait) inside a larger struct on the do_wait stack, and poke it to -EAGAIN. That happens under the wait_queue_head lock, so it's synchronized finished after remove_wait_queue(). Maybe also needs: @@ -1622,6 +1629,11 @@ repeat: end: current->state = TASK_RUNNING; remove_wait_queue(¤t->signal->wait_chldexit,&wait); + if ((!retval || retval == -ECHILD) && wait.notask_error == -EAGAIN) { + if (signal_pending()) + return -ERESTARTSYS; + goto start_wait_queue; + } if (infop) { if (retval > 0) retval = 0; Then every wake_up_parent, including the new one for all reparentings, either makes schedule() return, or makes a dry scan after retaking tasklist_lock do a repeat from the top. > Yes... but we can have some nasty corener cases which are not bugs, but > still not good. There's probably only "acceptable" and "wrong" for these corners. (If it's really "not good" so as to care about, it's close enough to call it a bug.) > Two threads T1 and T2. T1 has a TASK_STOPPED child C. T2 in the middle > of sys_ptrace(PTRACE_ATTACH, C). You are a bad, bad man. > T1 does do_wait(WSTOPPED). It is possible that we already see C->ptrace > (so do_wait_thread(T1->children) just clears *notask_error), but we > don't see C in T2->ptraced list. "In the middle" of PTRACE_ATTACH means T2 holds T2->ptrace_mutex. Before it takes the mutex, C->ptrace is clear (unless racing just after D does PTRACE_DETACH to make T2's PTRACE_ATTACH possible). T1 takes T2->ptrace_mutex to examine T2->ptraced. If T2 has just set C->ptrace, it both did so and put C on T2->ptraced while holding the mutex. T1 will see C there. Before that, say C->ptrace was set from prior attach by D. T1 saw C->ptrace in do_wait_thread and did not report C. Now, D does sys_ptrace(PTRACE_DETACH, C), so clears C->ptrace while holding D->ptrace_mutex. Next T1 gets T2->ptrace_mutex before T2's racing PTRACE_ATTACH to C. It does not see C there. Now T1 blocks. This feels similar to the first scenario above. wake_up_parent in PTRACE_DETACH might do it similar to the scheme above. Then T1 restarts (even for WNOHANG). On the second pass, either it sees !C->ptrace in do_wait_thread and reports it stopped as natural parent (correct for it winning the race with PTRACE_ATTACH), or it sees C->ptrace already set again by T2 and then finds C on T2->ptraced. > Oh. Will try to think more ;) I keep tryin' to think but nothin' happens! The original reorganization of do_wait into do_wait_thread/ptrace_do_wait was motivated by eventually doing this conversion. If there is a different way to wholly reorganize it again that makes this cleaner, we can consider that too. Thanks, Roland