Re: [PATCH 1/2] signal: Always notice exiting tasks

2019-02-12 Thread Eric W. Biederman
Oleg Nesterov  writes:

> On 02/12, Eric W. Biederman wrote:
>>
>> > Here I was trying for the simple minimal change and I hit this landmine.
>> > Which leaves me with the question of what should be semantics of signal
>> > handling after exit.
>
> Yes, currently it is undefined. Even signal_pending() is random.
>
>> > I think from dim memory of previous conversations the desired semantics
>> > look like:
>> > a) Ignore all signal state except for SIGKILL.
>> > b) Letting SIGKILL wake up the process should be sufficient.
>
> signal_wake_up(true) to make fatal_signal_pending() == T, I think.
>
>> Oleg any ideas on how to make PTRACE_EVENT_EXIT reliably killable?
>
> My answer is very simple: PTRACE_EVENT_EXIT must not stop if the tracee was
> killed by the "real" SIGKILL (not by group_exit/etc), that is all. But this
> is another user-visible change, it can equally confuse, say, strace (albeit
> not too much iiuc).
>
> But this needs another discussion.

Yes.  Quite.

I will just point out that as described that logic will rebreak Ivan's
program.

>> diff --git a/kernel/signal.c b/kernel/signal.c
>> index 99fa8ff06fd9..a1f154dca73c 100644
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -2544,6 +2544,9 @@ bool get_signal(struct ksignal *ksig)
>> }
>>
>> fatal:
>> +   /* No more signals can be pending past this point */
>> +   sigdelset(>pending.signal, SIGKILL);
>
> Well, this is very confusing. In fact, this is not really correct. Say, we 
> should
> not remove the pending SIGKILL if we are going to call do_coredump(). This is
> possible if ptrace_signal() was called, or after is_current_pgrp_orphaned() 
> returns
> false.

I don't see bugs in it.  But it is certainly subtle and that is not what
is needed right now.

The subtlety is that we will never have a per thread SIGKILL pending
unless signal_group_exit is true.  So removing when it is not there is harmless.

>> +   clear_tsk_thread_flag(current, TIF_SIGPENDING);
>
> I don't understand this change, it looks irrelevant. Possibly makes sense, but
> this connects to "semantics of signal handling after exit".

As on the other the location is too subtle for the regression fix.

The primary motivation is that dequeue_signal calls recalc_sigpending.
And in the common case that will result clearing the TIF_SIGPENDING
which will result in signal_pending being false.

I have not found a location that cares enough to cause a misbehavior
if we don't clear TIF_SIGPENDING but it is a practical change and there
might be.  So if the word of the day is be very conservative and
avoid landminds I expect we need the clearing of TIF_SIGPENDING.

Hmm.  Probably using recalc_sigpending() now that I think about it.

> OK, we need a minimal incremental fix for now. I'd suggest to replace
>
>   ksig->info.si_signo = signr = SIGKILL;
>   if (signal_group_exit(signal))
>   goto fatal;
>
> added by this patch with
>
>   if (__fatal_signal_pending(current)) {
>   ksig->info.si_signo = signr = SIGKILL;
>   sigdelset(>pending.signal, SIGKILL);
>   goto fatal;
>   }
>
> __fatal_signal_pending() is cheaper and looks more understandable.


I definitely agree that it is much less likely to cause a problem
if we move all of the work before jumping to fatal.

The cost of both __fatal_signal_pending and signal_group_exit is just
a cache line read.  So not a big deal wither way.

On the other hand __fatal_signal_pending as currently implemented is
insanely subtle and arguably a bit confusing.  It tests for a SIGKILL in
the current pending sigset, to discover the signal group property of if
a process as started exiting.

In the long run we need our data structures not to be subtle and
tricky to use.  To do that we need a test of something in signal_struct
because it is a per signal group property.  Further we need to remove
the abuse of the per thread SIGKILL.

Since signal_group_exit always implies __fatal_signal_pending in this
case and the reverse.  I see no reason to use a function that requires
we maintain a huge amount of confusing and unnecessary machinery to keep
working.

All of that plus the signal_group_exit test has been tested and shown to
fix an ignored SIGKILL and the only practical problem is it doesn't do
one or two little things that dequeue_signal has done that made it
impossible to stop in PTRACE_EVENT_EXIT.

So for the regression fix let's just do the few little things that
dequeue_signal used to do.  That gives us a strong guarantee that
nothing else was missed.

Eric


Re: [PATCH 1/2] signal: Always notice exiting tasks

2019-02-12 Thread Oleg Nesterov
On 02/12, Eric W. Biederman wrote:
>
> > Here I was trying for the simple minimal change and I hit this landmine.
> > Which leaves me with the question of what should be semantics of signal
> > handling after exit.

Yes, currently it is undefined. Even signal_pending() is random.

> > I think from dim memory of previous conversations the desired semantics
> > look like:
> > a) Ignore all signal state except for SIGKILL.
> > b) Letting SIGKILL wake up the process should be sufficient.

signal_wake_up(true) to make fatal_signal_pending() == T, I think.

> Oleg any ideas on how to make PTRACE_EVENT_EXIT reliably killable?

My answer is very simple: PTRACE_EVENT_EXIT must not stop if the tracee was
killed by the "real" SIGKILL (not by group_exit/etc), that is all. But this
is another user-visible change, it can equally confuse, say, strace (albeit
not too much iiuc).

But this needs another discussion.

> diff --git a/kernel/signal.c b/kernel/signal.c
> index 99fa8ff06fd9..a1f154dca73c 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2544,6 +2544,9 @@ bool get_signal(struct ksignal *ksig)
> }
>
> fatal:
> +   /* No more signals can be pending past this point */
> +   sigdelset(>pending.signal, SIGKILL);

Well, this is very confusing. In fact, this is not really correct. Say, we 
should
not remove the pending SIGKILL if we are going to call do_coredump(). This is
possible if ptrace_signal() was called, or after is_current_pgrp_orphaned() 
returns
false.

> +   clear_tsk_thread_flag(current, TIF_SIGPENDING);

I don't understand this change, it looks irrelevant. Possibly makes sense, but
this connects to "semantics of signal handling after exit".

OK, we need a minimal incremental fix for now. I'd suggest to replace

ksig->info.si_signo = signr = SIGKILL;
if (signal_group_exit(signal))
goto fatal;

added by this patch with

if (__fatal_signal_pending(current)) {
ksig->info.si_signo = signr = SIGKILL;
sigdelset(>pending.signal, SIGKILL);
goto fatal;
}

__fatal_signal_pending() is cheaper and looks more understandable.

Oleg.



Re: [PATCH 1/2] signal: Always notice exiting tasks

2019-02-12 Thread Eric W. Biederman
ebied...@xmission.com (Eric W. Biederman) writes:

> Oleg Nesterov  writes:
>
>> sorry again for delay...
>>
>> On 02/07, Eric W. Biederman wrote:
>>>
>>> --- a/kernel/signal.c
>>> +++ b/kernel/signal.c
>>> @@ -2393,6 +2393,11 @@ bool get_signal(struct ksignal *ksig)
>>> goto relock;
>>> }
>>>  
>>> +   /* Has this task already been marked for death? */
>>> +   ksig->info.si_signo = signr = SIGKILL;
>>> +   if (signal_group_exit(signal))
>>> +   goto fatal;
>>> +
>>> for (;;) {
>>> struct k_sigaction *ka;
>>>  
>>> @@ -2488,6 +2493,7 @@ bool get_signal(struct ksignal *ksig)
>>> continue;
>>> }
>>>  
>>> +   fatal:
>>> spin_unlock_irq(>siglock);
>>
>> Eric, but this is wrong. At least this is the serious user-visible
>> change.
>>
>> Afaics, with this patch the tracee will never stop in PTRACE_EVENT_EXIT in 
>> case
>> of group_exit/exec, because schedule() in TASK_TRACED state won't block due 
>> to
>> __fatal_signal_pending().
>>
>> Yes, yes, as I said many times the semantics of PTRACE_EVENT_EXIT was never 
>> really
>> defined, it depends on /dev/random, but still I don't think we should break 
>> it even
>> more.
>
> Well it changes PTRACE_EVENT_EXIT I grant that.  It looks like that
> changes makes PTRACE_EVENT_EXIT is less than useful.
>
> The only way to perfectly preserve the previous semantics is probably to
> do something like my JOBCTL_TASK_EXIT proposal.
>
> That said I don't think even adding a JOBCTL_TASK_EXIT is enough to have
> a reliable stop of ptrace_event_exit after a process has exited.  As any
> other pending signal can cause problems there as well.
>
> I have received a report that strace -f in some cases is not noticing
> children before they die and it looks like a stop in PTRACE_EVENT_EXIT
> would fix that strace behavior.
>
> Sigh.
>
> Here I was trying for the simple minimal change and I hit this landmine.
> Which leaves me with the question of what should be semantics of signal
> handling after exit.
>
> I think from dim memory of previous conversations the desired semantics
> look like:
> a) Ignore all signal state except for SIGKILL.
> b) Letting SIGKILL wake up the process should be sufficient.
>
> I will see if I can reproduce the strace failure and see if I can cook
> up something minimal that addresses just that.  If you have suggestions
> I would love to hear them.
>
> As this was a minimal fix for SIGKILL being broken I have already sent
> the fix to Linus.  So we are looking at an incremental fix at this
> point.

In my testing I found something that concerns me.  Because we wind up
with SIGKILL in shard_pending we can not kill a process in do_exit that
has stopped at PTRACE_EVENT_EXIT.  That bug seems to go back a long ways.

Other than that, it looks like we can do the following to fix the
regression I introduced.

Oleg any ideas on how to make PTRACE_EVENT_EXIT reliably killable?

diff --git a/kernel/signal.c b/kernel/signal.c
index 99fa8ff06fd9..a1f154dca73c 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2544,6 +2544,9 @@ bool get_signal(struct ksignal *ksig)
}
 
fatal:
+   /* No more signals can be pending past this point */
+   sigdelset(>pending.signal, SIGKILL);
+   clear_tsk_thread_flag(current, TIF_SIGPENDING);
spin_unlock_irq(>siglock);
 
/*

Eric


Re: [PATCH 1/2] signal: Always notice exiting tasks

2019-02-11 Thread Eric W. Biederman
Oleg Nesterov  writes:

> sorry again for delay...
>
> On 02/07, Eric W. Biederman wrote:
>>
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -2393,6 +2393,11 @@ bool get_signal(struct ksignal *ksig)
>>  goto relock;
>>  }
>>  
>> +/* Has this task already been marked for death? */
>> +ksig->info.si_signo = signr = SIGKILL;
>> +if (signal_group_exit(signal))
>> +goto fatal;
>> +
>>  for (;;) {
>>  struct k_sigaction *ka;
>>  
>> @@ -2488,6 +2493,7 @@ bool get_signal(struct ksignal *ksig)
>>  continue;
>>  }
>>  
>> +fatal:
>>  spin_unlock_irq(>siglock);
>
> Eric, but this is wrong. At least this is the serious user-visible
> change.
>
> Afaics, with this patch the tracee will never stop in PTRACE_EVENT_EXIT in 
> case
> of group_exit/exec, because schedule() in TASK_TRACED state won't block due to
> __fatal_signal_pending().
>
> Yes, yes, as I said many times the semantics of PTRACE_EVENT_EXIT was never 
> really
> defined, it depends on /dev/random, but still I don't think we should break 
> it even
> more.

Well it changes PTRACE_EVENT_EXIT I grant that.  It looks like that
changes makes PTRACE_EVENT_EXIT is less than useful.

The only way to perfectly preserve the previous semantics is probably to
do something like my JOBCTL_TASK_EXIT proposal.

That said I don't think even adding a JOBCTL_TASK_EXIT is enough to have
a reliable stop of ptrace_event_exit after a process has exited.  As any
other pending signal can cause problems there as well.

I have received a report that strace -f in some cases is not noticing
children before they die and it looks like a stop in PTRACE_EVENT_EXIT
would fix that strace behavior.

Sigh.

Here I was trying for the simple minimal change and I hit this landmine.
Which leaves me with the question of what should be semantics of signal
handling after exit.

I think from dim memory of previous conversations the desired semantics
look like:
a) Ignore all signal state except for SIGKILL.
b) Letting SIGKILL wake up the process should be sufficient.

I will see if I can reproduce the strace failure and see if I can cook
up something minimal that addresses just that.  If you have suggestions
I would love to hear them.

As this was a minimal fix for SIGKILL being broken I have already sent
the fix to Linus.  So we are looking at an incremental fix at this point.

Eric


Re: [PATCH 1/2] signal: Always notice exiting tasks

2019-02-11 Thread Oleg Nesterov
sorry again for delay...

On 02/07, Eric W. Biederman wrote:
>
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2393,6 +2393,11 @@ bool get_signal(struct ksignal *ksig)
>   goto relock;
>   }
>  
> + /* Has this task already been marked for death? */
> + ksig->info.si_signo = signr = SIGKILL;
> + if (signal_group_exit(signal))
> + goto fatal;
> +
>   for (;;) {
>   struct k_sigaction *ka;
>  
> @@ -2488,6 +2493,7 @@ bool get_signal(struct ksignal *ksig)
>   continue;
>   }
>  
> + fatal:
>   spin_unlock_irq(>siglock);

Eric, but this is wrong. At least this is the serious user-visible change.

Afaics, with this patch the tracee will never stop in PTRACE_EVENT_EXIT in case
of group_exit/exec, because schedule() in TASK_TRACED state won't block due to
__fatal_signal_pending().

Yes, yes, as I said many times the semantics of PTRACE_EVENT_EXIT was never 
really
defined, it depends on /dev/random, but still I don't think we should break it 
even
more.

Oleg.



[PATCH 1/2] signal: Always notice exiting tasks

2019-02-06 Thread Eric W. Biederman


Recently syzkaller was able to create unkillablle processes by
creating a timer that is delivered as a thread local signal on SIGHUP,
and receiving SIGHUP SA_NODEFERER.  Ultimately causing a loop
failing to deliver SIGHUP but always trying.

Upon examination it turns out part of the problem is actually most of
the solution.  Since 2.5 signal delivery has found all fatal signals,
marked the signal group for death, and queued SIGKILL in every threads
thread queue relying on signal->group_exit_code to preserve the
information of which was the actual fatal signal.

The conversion of all fatal signals to SIGKILL results in the
synchronous signal heuristic in next_signal kicking in and preferring
SIGHUP to SIGKILL.  Which is especially problematic as all
fatal signals have already been transformed into SIGKILL.

Instead of dequeueing signals and depending upon SIGKILL to
be the first signal dequeued, first test if the signal group
has already been marked for death.  This guarantees that
nothing in the signal queue can prevent a process that needs
to exit from exiting.

Cc: sta...@vger.kernel.org
Reported-by: Dmitry Vyukov 
Ref: ebf5ebe31d2c ("[PATCH] signal-fixes-2.5.59-A4")
History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Signed-off-by: "Eric W. Biederman" 
---
 kernel/signal.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/kernel/signal.c b/kernel/signal.c
index 9ca8e5278c8e..5424cb0006bc 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2393,6 +2393,11 @@ bool get_signal(struct ksignal *ksig)
goto relock;
}
 
+   /* Has this task already been marked for death? */
+   ksig->info.si_signo = signr = SIGKILL;
+   if (signal_group_exit(signal))
+   goto fatal;
+
for (;;) {
struct k_sigaction *ka;
 
@@ -2488,6 +2493,7 @@ bool get_signal(struct ksignal *ksig)
continue;
}
 
+   fatal:
spin_unlock_irq(>siglock);
 
/*
-- 
2.17.1