Re: [PATCH 1/2] signal: Always notice exiting tasks
Oleg Nesterov writes: > On 02/12, Eric W. Biederman wrote: >> >> > Here I was trying for the simple minimal change and I hit this landmine. >> > Which leaves me with the question of what should be semantics of signal >> > handling after exit. > > Yes, currently it is undefined. Even signal_pending() is random. > >> > I think from dim memory of previous conversations the desired semantics >> > look like: >> > a) Ignore all signal state except for SIGKILL. >> > b) Letting SIGKILL wake up the process should be sufficient. > > signal_wake_up(true) to make fatal_signal_pending() == T, I think. > >> Oleg any ideas on how to make PTRACE_EVENT_EXIT reliably killable? > > My answer is very simple: PTRACE_EVENT_EXIT must not stop if the tracee was > killed by the "real" SIGKILL (not by group_exit/etc), that is all. But this > is another user-visible change, it can equally confuse, say, strace (albeit > not too much iiuc). > > But this needs another discussion. Yes. Quite. I will just point out that as described that logic will rebreak Ivan's program. >> diff --git a/kernel/signal.c b/kernel/signal.c >> index 99fa8ff06fd9..a1f154dca73c 100644 >> --- a/kernel/signal.c >> +++ b/kernel/signal.c >> @@ -2544,6 +2544,9 @@ bool get_signal(struct ksignal *ksig) >> } >> >> fatal: >> + /* No more signals can be pending past this point */ >> + sigdelset(>pending.signal, SIGKILL); > > Well, this is very confusing. In fact, this is not really correct. Say, we > should > not remove the pending SIGKILL if we are going to call do_coredump(). This is > possible if ptrace_signal() was called, or after is_current_pgrp_orphaned() > returns > false. I don't see bugs in it. But it is certainly subtle and that is not what is needed right now. The subtlety is that we will never have a per thread SIGKILL pending unless signal_group_exit is true. So removing when it is not there is harmless. >> + clear_tsk_thread_flag(current, TIF_SIGPENDING); > > I don't understand this change, it looks irrelevant. Possibly makes sense, but > this connects to "semantics of signal handling after exit". As on the other the location is too subtle for the regression fix. The primary motivation is that dequeue_signal calls recalc_sigpending. And in the common case that will result clearing the TIF_SIGPENDING which will result in signal_pending being false. I have not found a location that cares enough to cause a misbehavior if we don't clear TIF_SIGPENDING but it is a practical change and there might be. So if the word of the day is be very conservative and avoid landminds I expect we need the clearing of TIF_SIGPENDING. Hmm. Probably using recalc_sigpending() now that I think about it. > OK, we need a minimal incremental fix for now. I'd suggest to replace > > ksig->info.si_signo = signr = SIGKILL; > if (signal_group_exit(signal)) > goto fatal; > > added by this patch with > > if (__fatal_signal_pending(current)) { > ksig->info.si_signo = signr = SIGKILL; > sigdelset(>pending.signal, SIGKILL); > goto fatal; > } > > __fatal_signal_pending() is cheaper and looks more understandable. I definitely agree that it is much less likely to cause a problem if we move all of the work before jumping to fatal. The cost of both __fatal_signal_pending and signal_group_exit is just a cache line read. So not a big deal wither way. On the other hand __fatal_signal_pending as currently implemented is insanely subtle and arguably a bit confusing. It tests for a SIGKILL in the current pending sigset, to discover the signal group property of if a process as started exiting. In the long run we need our data structures not to be subtle and tricky to use. To do that we need a test of something in signal_struct because it is a per signal group property. Further we need to remove the abuse of the per thread SIGKILL. Since signal_group_exit always implies __fatal_signal_pending in this case and the reverse. I see no reason to use a function that requires we maintain a huge amount of confusing and unnecessary machinery to keep working. All of that plus the signal_group_exit test has been tested and shown to fix an ignored SIGKILL and the only practical problem is it doesn't do one or two little things that dequeue_signal has done that made it impossible to stop in PTRACE_EVENT_EXIT. So for the regression fix let's just do the few little things that dequeue_signal used to do. That gives us a strong guarantee that nothing else was missed. Eric
Re: [PATCH 1/2] signal: Always notice exiting tasks
On 02/12, Eric W. Biederman wrote: > > > Here I was trying for the simple minimal change and I hit this landmine. > > Which leaves me with the question of what should be semantics of signal > > handling after exit. Yes, currently it is undefined. Even signal_pending() is random. > > I think from dim memory of previous conversations the desired semantics > > look like: > > a) Ignore all signal state except for SIGKILL. > > b) Letting SIGKILL wake up the process should be sufficient. signal_wake_up(true) to make fatal_signal_pending() == T, I think. > Oleg any ideas on how to make PTRACE_EVENT_EXIT reliably killable? My answer is very simple: PTRACE_EVENT_EXIT must not stop if the tracee was killed by the "real" SIGKILL (not by group_exit/etc), that is all. But this is another user-visible change, it can equally confuse, say, strace (albeit not too much iiuc). But this needs another discussion. > diff --git a/kernel/signal.c b/kernel/signal.c > index 99fa8ff06fd9..a1f154dca73c 100644 > --- a/kernel/signal.c > +++ b/kernel/signal.c > @@ -2544,6 +2544,9 @@ bool get_signal(struct ksignal *ksig) > } > > fatal: > + /* No more signals can be pending past this point */ > + sigdelset(>pending.signal, SIGKILL); Well, this is very confusing. In fact, this is not really correct. Say, we should not remove the pending SIGKILL if we are going to call do_coredump(). This is possible if ptrace_signal() was called, or after is_current_pgrp_orphaned() returns false. > + clear_tsk_thread_flag(current, TIF_SIGPENDING); I don't understand this change, it looks irrelevant. Possibly makes sense, but this connects to "semantics of signal handling after exit". OK, we need a minimal incremental fix for now. I'd suggest to replace ksig->info.si_signo = signr = SIGKILL; if (signal_group_exit(signal)) goto fatal; added by this patch with if (__fatal_signal_pending(current)) { ksig->info.si_signo = signr = SIGKILL; sigdelset(>pending.signal, SIGKILL); goto fatal; } __fatal_signal_pending() is cheaper and looks more understandable. Oleg.
Re: [PATCH 1/2] signal: Always notice exiting tasks
ebied...@xmission.com (Eric W. Biederman) writes: > Oleg Nesterov writes: > >> sorry again for delay... >> >> On 02/07, Eric W. Biederman wrote: >>> >>> --- a/kernel/signal.c >>> +++ b/kernel/signal.c >>> @@ -2393,6 +2393,11 @@ bool get_signal(struct ksignal *ksig) >>> goto relock; >>> } >>> >>> + /* Has this task already been marked for death? */ >>> + ksig->info.si_signo = signr = SIGKILL; >>> + if (signal_group_exit(signal)) >>> + goto fatal; >>> + >>> for (;;) { >>> struct k_sigaction *ka; >>> >>> @@ -2488,6 +2493,7 @@ bool get_signal(struct ksignal *ksig) >>> continue; >>> } >>> >>> + fatal: >>> spin_unlock_irq(>siglock); >> >> Eric, but this is wrong. At least this is the serious user-visible >> change. >> >> Afaics, with this patch the tracee will never stop in PTRACE_EVENT_EXIT in >> case >> of group_exit/exec, because schedule() in TASK_TRACED state won't block due >> to >> __fatal_signal_pending(). >> >> Yes, yes, as I said many times the semantics of PTRACE_EVENT_EXIT was never >> really >> defined, it depends on /dev/random, but still I don't think we should break >> it even >> more. > > Well it changes PTRACE_EVENT_EXIT I grant that. It looks like that > changes makes PTRACE_EVENT_EXIT is less than useful. > > The only way to perfectly preserve the previous semantics is probably to > do something like my JOBCTL_TASK_EXIT proposal. > > That said I don't think even adding a JOBCTL_TASK_EXIT is enough to have > a reliable stop of ptrace_event_exit after a process has exited. As any > other pending signal can cause problems there as well. > > I have received a report that strace -f in some cases is not noticing > children before they die and it looks like a stop in PTRACE_EVENT_EXIT > would fix that strace behavior. > > Sigh. > > Here I was trying for the simple minimal change and I hit this landmine. > Which leaves me with the question of what should be semantics of signal > handling after exit. > > I think from dim memory of previous conversations the desired semantics > look like: > a) Ignore all signal state except for SIGKILL. > b) Letting SIGKILL wake up the process should be sufficient. > > I will see if I can reproduce the strace failure and see if I can cook > up something minimal that addresses just that. If you have suggestions > I would love to hear them. > > As this was a minimal fix for SIGKILL being broken I have already sent > the fix to Linus. So we are looking at an incremental fix at this > point. In my testing I found something that concerns me. Because we wind up with SIGKILL in shard_pending we can not kill a process in do_exit that has stopped at PTRACE_EVENT_EXIT. That bug seems to go back a long ways. Other than that, it looks like we can do the following to fix the regression I introduced. Oleg any ideas on how to make PTRACE_EVENT_EXIT reliably killable? diff --git a/kernel/signal.c b/kernel/signal.c index 99fa8ff06fd9..a1f154dca73c 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2544,6 +2544,9 @@ bool get_signal(struct ksignal *ksig) } fatal: + /* No more signals can be pending past this point */ + sigdelset(>pending.signal, SIGKILL); + clear_tsk_thread_flag(current, TIF_SIGPENDING); spin_unlock_irq(>siglock); /* Eric
Re: [PATCH 1/2] signal: Always notice exiting tasks
Oleg Nesterov writes: > sorry again for delay... > > On 02/07, Eric W. Biederman wrote: >> >> --- a/kernel/signal.c >> +++ b/kernel/signal.c >> @@ -2393,6 +2393,11 @@ bool get_signal(struct ksignal *ksig) >> goto relock; >> } >> >> +/* Has this task already been marked for death? */ >> +ksig->info.si_signo = signr = SIGKILL; >> +if (signal_group_exit(signal)) >> +goto fatal; >> + >> for (;;) { >> struct k_sigaction *ka; >> >> @@ -2488,6 +2493,7 @@ bool get_signal(struct ksignal *ksig) >> continue; >> } >> >> +fatal: >> spin_unlock_irq(>siglock); > > Eric, but this is wrong. At least this is the serious user-visible > change. > > Afaics, with this patch the tracee will never stop in PTRACE_EVENT_EXIT in > case > of group_exit/exec, because schedule() in TASK_TRACED state won't block due to > __fatal_signal_pending(). > > Yes, yes, as I said many times the semantics of PTRACE_EVENT_EXIT was never > really > defined, it depends on /dev/random, but still I don't think we should break > it even > more. Well it changes PTRACE_EVENT_EXIT I grant that. It looks like that changes makes PTRACE_EVENT_EXIT is less than useful. The only way to perfectly preserve the previous semantics is probably to do something like my JOBCTL_TASK_EXIT proposal. That said I don't think even adding a JOBCTL_TASK_EXIT is enough to have a reliable stop of ptrace_event_exit after a process has exited. As any other pending signal can cause problems there as well. I have received a report that strace -f in some cases is not noticing children before they die and it looks like a stop in PTRACE_EVENT_EXIT would fix that strace behavior. Sigh. Here I was trying for the simple minimal change and I hit this landmine. Which leaves me with the question of what should be semantics of signal handling after exit. I think from dim memory of previous conversations the desired semantics look like: a) Ignore all signal state except for SIGKILL. b) Letting SIGKILL wake up the process should be sufficient. I will see if I can reproduce the strace failure and see if I can cook up something minimal that addresses just that. If you have suggestions I would love to hear them. As this was a minimal fix for SIGKILL being broken I have already sent the fix to Linus. So we are looking at an incremental fix at this point. Eric
Re: [PATCH 1/2] signal: Always notice exiting tasks
sorry again for delay... On 02/07, Eric W. Biederman wrote: > > --- a/kernel/signal.c > +++ b/kernel/signal.c > @@ -2393,6 +2393,11 @@ bool get_signal(struct ksignal *ksig) > goto relock; > } > > + /* Has this task already been marked for death? */ > + ksig->info.si_signo = signr = SIGKILL; > + if (signal_group_exit(signal)) > + goto fatal; > + > for (;;) { > struct k_sigaction *ka; > > @@ -2488,6 +2493,7 @@ bool get_signal(struct ksignal *ksig) > continue; > } > > + fatal: > spin_unlock_irq(>siglock); Eric, but this is wrong. At least this is the serious user-visible change. Afaics, with this patch the tracee will never stop in PTRACE_EVENT_EXIT in case of group_exit/exec, because schedule() in TASK_TRACED state won't block due to __fatal_signal_pending(). Yes, yes, as I said many times the semantics of PTRACE_EVENT_EXIT was never really defined, it depends on /dev/random, but still I don't think we should break it even more. Oleg.
[PATCH 1/2] signal: Always notice exiting tasks
Recently syzkaller was able to create unkillablle processes by creating a timer that is delivered as a thread local signal on SIGHUP, and receiving SIGHUP SA_NODEFERER. Ultimately causing a loop failing to deliver SIGHUP but always trying. Upon examination it turns out part of the problem is actually most of the solution. Since 2.5 signal delivery has found all fatal signals, marked the signal group for death, and queued SIGKILL in every threads thread queue relying on signal->group_exit_code to preserve the information of which was the actual fatal signal. The conversion of all fatal signals to SIGKILL results in the synchronous signal heuristic in next_signal kicking in and preferring SIGHUP to SIGKILL. Which is especially problematic as all fatal signals have already been transformed into SIGKILL. Instead of dequeueing signals and depending upon SIGKILL to be the first signal dequeued, first test if the signal group has already been marked for death. This guarantees that nothing in the signal queue can prevent a process that needs to exit from exiting. Cc: sta...@vger.kernel.org Reported-by: Dmitry Vyukov Ref: ebf5ebe31d2c ("[PATCH] signal-fixes-2.5.59-A4") History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git Signed-off-by: "Eric W. Biederman" --- kernel/signal.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/kernel/signal.c b/kernel/signal.c index 9ca8e5278c8e..5424cb0006bc 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2393,6 +2393,11 @@ bool get_signal(struct ksignal *ksig) goto relock; } + /* Has this task already been marked for death? */ + ksig->info.si_signo = signr = SIGKILL; + if (signal_group_exit(signal)) + goto fatal; + for (;;) { struct k_sigaction *ka; @@ -2488,6 +2493,7 @@ bool get_signal(struct ksignal *ksig) continue; } + fatal: spin_unlock_irq(>siglock); /* -- 2.17.1