Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-22 Thread Enke Chen
Jann,

Thanks for the feedback. I will post a revised patch shortly.

On the related topic of "pdeath_signal", there are several inconsistencies
by preserving the flag across execve(2). The flag is cleared under several
conditions in different places. I will start a separate thread to see if
it can still be cleaned up.

 PR_SET_PDEATHSIG (since Linux 2.1.57)
  Set the parent death signal of the calling process to arg2
  (either a signal value in the range 1..maxsig, or 0 to clear).
  This is the signal that the calling process will get when its
  parent dies.  This value is cleared for the child of a fork(2)
  and (since Linux 2.4.36 / 2.6.23) when executing a set-user-ID
  or set-group-ID binary, or a binary that has associated
  capabilities (see capabilities(7)).  This value is preserved
  across execve(2).

-- Enke

On 10/22/18 8:40 AM, Jann Horn wrote:
> On Sat, Oct 20, 2018 at 1:01 AM Enke Chen  wrote:
>> Regarding the security considerations, it seems simpler and more secure to
>> just clear the "pre-coredump signal" cross execve(2), and let the new program
>> decide for itself.  What do you think?
> 
> I don't have a problem with these semantics.
> 
> I could imagine someone being unhappy about the theoretical race
> window if they want to perform an in-place reexecution of a running
> service, but I don't know whether anyone actually cares about that.
> 
>> Changes to prctl(2):
>>
>> DESCRIPTION
>>
>>PR_SET_PREDUMP_SIG (since Linux 4.20.x)
>>   This allows the calling process to receive a signal (arg2,
>>   if nonzero) from a child process prior to the coredump of
>>   the child process. arg2 must be SIGUSR1, or SIGUSR2, or
>>   SIGCHLD, or 0 (for clear).
>>
>>   When SIGCHLD is specified, the signal code is set to
>>   CLD_PREDUMP in such an SIGCHLD signal.
>>
>>   The value of the pre-coredump signal is cleared across
>>   execve(2), or for the child of a fork(2).
>>
>>PR_GET_PREDUMP_SIG (since Linux 4.20.x)
>>   Return the current value of the pre-coredump signal for the
>>   calling process, in the location pointed to by (int *) arg2.


Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-22 Thread Enke Chen
Jann,

Thanks for the feedback. I will post a revised patch shortly.

On the related topic of "pdeath_signal", there are several inconsistencies
by preserving the flag across execve(2). The flag is cleared under several
conditions in different places. I will start a separate thread to see if
it can still be cleaned up.

 PR_SET_PDEATHSIG (since Linux 2.1.57)
  Set the parent death signal of the calling process to arg2
  (either a signal value in the range 1..maxsig, or 0 to clear).
  This is the signal that the calling process will get when its
  parent dies.  This value is cleared for the child of a fork(2)
  and (since Linux 2.4.36 / 2.6.23) when executing a set-user-ID
  or set-group-ID binary, or a binary that has associated
  capabilities (see capabilities(7)).  This value is preserved
  across execve(2).

-- Enke

On 10/22/18 8:40 AM, Jann Horn wrote:
> On Sat, Oct 20, 2018 at 1:01 AM Enke Chen  wrote:
>> Regarding the security considerations, it seems simpler and more secure to
>> just clear the "pre-coredump signal" cross execve(2), and let the new program
>> decide for itself.  What do you think?
> 
> I don't have a problem with these semantics.
> 
> I could imagine someone being unhappy about the theoretical race
> window if they want to perform an in-place reexecution of a running
> service, but I don't know whether anyone actually cares about that.
> 
>> Changes to prctl(2):
>>
>> DESCRIPTION
>>
>>PR_SET_PREDUMP_SIG (since Linux 4.20.x)
>>   This allows the calling process to receive a signal (arg2,
>>   if nonzero) from a child process prior to the coredump of
>>   the child process. arg2 must be SIGUSR1, or SIGUSR2, or
>>   SIGCHLD, or 0 (for clear).
>>
>>   When SIGCHLD is specified, the signal code is set to
>>   CLD_PREDUMP in such an SIGCHLD signal.
>>
>>   The value of the pre-coredump signal is cleared across
>>   execve(2), or for the child of a fork(2).
>>
>>PR_GET_PREDUMP_SIG (since Linux 4.20.x)
>>   Return the current value of the pre-coredump signal for the
>>   calling process, in the location pointed to by (int *) arg2.


Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-22 Thread Jann Horn
On Sat, Oct 20, 2018 at 1:01 AM Enke Chen  wrote:
> Regarding the security considerations, it seems simpler and more secure to
> just clear the "pre-coredump signal" cross execve(2), and let the new program
> decide for itself.  What do you think?

I don't have a problem with these semantics.

I could imagine someone being unhappy about the theoretical race
window if they want to perform an in-place reexecution of a running
service, but I don't know whether anyone actually cares about that.

> Changes to prctl(2):
>
> DESCRIPTION
>
>PR_SET_PREDUMP_SIG (since Linux 4.20.x)
>   This allows the calling process to receive a signal (arg2,
>   if nonzero) from a child process prior to the coredump of
>   the child process. arg2 must be SIGUSR1, or SIGUSR2, or
>   SIGCHLD, or 0 (for clear).
>
>   When SIGCHLD is specified, the signal code is set to
>   CLD_PREDUMP in such an SIGCHLD signal.
>
>   The value of the pre-coredump signal is cleared across
>   execve(2), or for the child of a fork(2).
>
>PR_GET_PREDUMP_SIG (since Linux 4.20.x)
>   Return the current value of the pre-coredump signal for the
>   calling process, in the location pointed to by (int *) arg2.


Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-22 Thread Jann Horn
On Sat, Oct 20, 2018 at 1:01 AM Enke Chen  wrote:
> Regarding the security considerations, it seems simpler and more secure to
> just clear the "pre-coredump signal" cross execve(2), and let the new program
> decide for itself.  What do you think?

I don't have a problem with these semantics.

I could imagine someone being unhappy about the theoretical race
window if they want to perform an in-place reexecution of a running
service, but I don't know whether anyone actually cares about that.

> Changes to prctl(2):
>
> DESCRIPTION
>
>PR_SET_PREDUMP_SIG (since Linux 4.20.x)
>   This allows the calling process to receive a signal (arg2,
>   if nonzero) from a child process prior to the coredump of
>   the child process. arg2 must be SIGUSR1, or SIGUSR2, or
>   SIGCHLD, or 0 (for clear).
>
>   When SIGCHLD is specified, the signal code is set to
>   CLD_PREDUMP in such an SIGCHLD signal.
>
>   The value of the pre-coredump signal is cleared across
>   execve(2), or for the child of a fork(2).
>
>PR_GET_PREDUMP_SIG (since Linux 4.20.x)
>   Return the current value of the pre-coredump signal for the
>   calling process, in the location pointed to by (int *) arg2.


Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-19 Thread Enke Chen
Hi, Jann:

Regarding the security considerations, it seems simpler and more secure to
just clear the "pre-coredump signal" cross execve(2), and let the new program
decide for itself.  What do you think?

---
Changes to prctl(2):

DESCRIPTION

   PR_SET_PREDUMP_SIG (since Linux 4.20.x)
  This allows the calling process to receive a signal (arg2,
  if nonzero) from a child process prior to the coredump of
  the child process. arg2 must be SIGUSR1, or SIGUSR2, or
  SIGCHLD, or 0 (for clear).

  When SIGCHLD is specified, the signal code is set to
  CLD_PREDUMP in such an SIGCHLD signal.

  The value of the pre-coredump signal is cleared across
  execve(2), or for the child of a fork(2).

   PR_GET_PREDUMP_SIG (since Linux 4.20.x)
  Return the current value of the pre-coredump signal for the
  calling process, in the location pointed to by (int *) arg2.
---

Thanks.  -- Enke

On 10/15/18 11:54 AM, Jann Horn wrote:
> On Mon, Oct 15, 2018 at 8:36 PM Enke Chen  wrote:
>> On 10/13/18 11:27 AM, Jann Horn wrote:
>>> On Sat, Oct 13, 2018 at 2:33 AM Enke Chen  wrote:
 For simplicity and consistency, this patch provides an implementation
 for signal-based fault notification prior to the coredump of a child
 process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
 be used by an application to express its interest and to specify the
 signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
 signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>>>
>>> Your suggested API looks vaguely similar to PR_SET_PDEATHSIG, but with
>>> some important differences:
>>>
>>>  - You don't reset the signal on setuid execution.
> [...]
>>>
>>> For both of these: Are these differences actually necessary, and if
>>> so, can you provide a specific rationale? From a security perspective,
>>> I would very much prefer it if this API had semantics closer to
>>> PR_SET_PDEATHSIG.
>>
> [...]
>>
>> Regarding the impact of "setuid", this property "PR_SET_PREDUMP_SIG" has to
>> do with the application/process whether the signal handler is set for 
>> receiving
>> such a notification.  If it is set, the "uid" should not matter.
> 
> If an attacker's process first calls PR_SET_PREDUMP_SIG, then forks
> off a child, then calls execve() on a setuid binary, the setuid binary
> calls setuid(0), and the attacker-controlled child then crashes, the
> privileged process will receive an unexpected signal that the attacker
> wouldn't have been allowed to send otherwise. For similar reasons, the
> parent death signal is reset when a setuid binary is executed:
> 
> void setup_new_exec(struct linux_binprm * bprm)
> {
> /*
>  * Once here, prepare_binrpm() will not be called any more, so
>  * the final state of setuid/setgid/fscaps can be merged into the
>  * secureexec flag.
>  */
> bprm->secureexec |= bprm->cap_elevated;
> 
> if (bprm->secureexec) {
> /* Make sure parent cannot signal privileged process. */
> current->pdeath_signal = 0;
> [...]
> }
> [...]
> }
> 
> int commit_creds(struct cred *new)
> {
> [...]
> /* dumpability changes */
> if (!uid_eq(old->euid, new->euid) ||
> !gid_eq(old->egid, new->egid) ||
> !uid_eq(old->fsuid, new->fsuid) ||
> !gid_eq(old->fsgid, new->fsgid) ||
> !cred_cap_issubset(old, new)) {
> if (task->mm)
> set_dumpable(task->mm, suid_dumpable);
> task->pdeath_signal = 0;
> smp_wmb();
> }
> [...]
> }
> 
> AppArmor and SELinux also do related changes:
> 
> static void apparmor_bprm_committing_creds(struct linux_binprm *bprm)
> {
> [...]
> /* bail out if unconfined or not changing profile */
> if ((new_label->proxy == label->proxy) ||
> (unconfined(new_label)))
> return;
> 
> aa_inherit_files(bprm->cred, current->files);
> 
> current->pdeath_signal = 0;
> [...]
> }
> 
> static void selinux_bprm_committing_creds(struct linux_binprm *bprm)
> {
> [...]
> new_tsec = bprm->cred->security;
> if (new_tsec->sid == new_tsec->osid)
> return;
> 
> /* Close files for which the new task SID is not authorized. */
> flush_unauthorized_files(bprm->cred, current->files);
> 
> /* Always clear parent death signal on SID transitions. */
> current->pdeath_signal = 0;
> [...]
> }
> 
> You should probably reset the coredump signal in the same places - or
> even better, add a new helper for resetting the parent death signal,
> and then add code for resetting the coredump signal in there.
> 


Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-19 Thread Enke Chen
Hi, Jann:

Regarding the security considerations, it seems simpler and more secure to
just clear the "pre-coredump signal" cross execve(2), and let the new program
decide for itself.  What do you think?

---
Changes to prctl(2):

DESCRIPTION

   PR_SET_PREDUMP_SIG (since Linux 4.20.x)
  This allows the calling process to receive a signal (arg2,
  if nonzero) from a child process prior to the coredump of
  the child process. arg2 must be SIGUSR1, or SIGUSR2, or
  SIGCHLD, or 0 (for clear).

  When SIGCHLD is specified, the signal code is set to
  CLD_PREDUMP in such an SIGCHLD signal.

  The value of the pre-coredump signal is cleared across
  execve(2), or for the child of a fork(2).

   PR_GET_PREDUMP_SIG (since Linux 4.20.x)
  Return the current value of the pre-coredump signal for the
  calling process, in the location pointed to by (int *) arg2.
---

Thanks.  -- Enke

On 10/15/18 11:54 AM, Jann Horn wrote:
> On Mon, Oct 15, 2018 at 8:36 PM Enke Chen  wrote:
>> On 10/13/18 11:27 AM, Jann Horn wrote:
>>> On Sat, Oct 13, 2018 at 2:33 AM Enke Chen  wrote:
 For simplicity and consistency, this patch provides an implementation
 for signal-based fault notification prior to the coredump of a child
 process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
 be used by an application to express its interest and to specify the
 signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
 signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>>>
>>> Your suggested API looks vaguely similar to PR_SET_PDEATHSIG, but with
>>> some important differences:
>>>
>>>  - You don't reset the signal on setuid execution.
> [...]
>>>
>>> For both of these: Are these differences actually necessary, and if
>>> so, can you provide a specific rationale? From a security perspective,
>>> I would very much prefer it if this API had semantics closer to
>>> PR_SET_PDEATHSIG.
>>
> [...]
>>
>> Regarding the impact of "setuid", this property "PR_SET_PREDUMP_SIG" has to
>> do with the application/process whether the signal handler is set for 
>> receiving
>> such a notification.  If it is set, the "uid" should not matter.
> 
> If an attacker's process first calls PR_SET_PREDUMP_SIG, then forks
> off a child, then calls execve() on a setuid binary, the setuid binary
> calls setuid(0), and the attacker-controlled child then crashes, the
> privileged process will receive an unexpected signal that the attacker
> wouldn't have been allowed to send otherwise. For similar reasons, the
> parent death signal is reset when a setuid binary is executed:
> 
> void setup_new_exec(struct linux_binprm * bprm)
> {
> /*
>  * Once here, prepare_binrpm() will not be called any more, so
>  * the final state of setuid/setgid/fscaps can be merged into the
>  * secureexec flag.
>  */
> bprm->secureexec |= bprm->cap_elevated;
> 
> if (bprm->secureexec) {
> /* Make sure parent cannot signal privileged process. */
> current->pdeath_signal = 0;
> [...]
> }
> [...]
> }
> 
> int commit_creds(struct cred *new)
> {
> [...]
> /* dumpability changes */
> if (!uid_eq(old->euid, new->euid) ||
> !gid_eq(old->egid, new->egid) ||
> !uid_eq(old->fsuid, new->fsuid) ||
> !gid_eq(old->fsgid, new->fsgid) ||
> !cred_cap_issubset(old, new)) {
> if (task->mm)
> set_dumpable(task->mm, suid_dumpable);
> task->pdeath_signal = 0;
> smp_wmb();
> }
> [...]
> }
> 
> AppArmor and SELinux also do related changes:
> 
> static void apparmor_bprm_committing_creds(struct linux_binprm *bprm)
> {
> [...]
> /* bail out if unconfined or not changing profile */
> if ((new_label->proxy == label->proxy) ||
> (unconfined(new_label)))
> return;
> 
> aa_inherit_files(bprm->cred, current->files);
> 
> current->pdeath_signal = 0;
> [...]
> }
> 
> static void selinux_bprm_committing_creds(struct linux_binprm *bprm)
> {
> [...]
> new_tsec = bprm->cred->security;
> if (new_tsec->sid == new_tsec->osid)
> return;
> 
> /* Close files for which the new task SID is not authorized. */
> flush_unauthorized_files(bprm->cred, current->files);
> 
> /* Always clear parent death signal on SID transitions. */
> current->pdeath_signal = 0;
> [...]
> }
> 
> You should probably reset the coredump signal in the same places - or
> even better, add a new helper for resetting the parent death signal,
> and then add code for resetting the coredump signal in there.
> 


Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-16 Thread Enke Chen
Hi, Oleg:

On 10/16/18 7:14 AM, Oleg Nesterov wrote:
> On 10/15, Enke Chen wrote:
>>
>>> I don't understand why we need valid_predump_signal() at all.
>>
>> Most of the signals have well-defined semantics, and would not be appropriate
>> for this purpose.
> 
> you are going to change the rules anyway.
> 
>> That is why it is limited to only SIGCHLD, SIGUSR1, SIGUSR2.
> 
> Which do not queue. So the parent won't get the 2nd signal if 2 children
> crash at the same time.
> 
if (sig_kernel_coredump(signr)) {
 +  /*
 +   * Notify the parent prior to the coredump if the
 +   * parent is interested in such a notificaiton.
 +   */
 +  int p_sig = current->real_parent->predump_signal;
 +
 +  if (valid_predump_signal(p_sig)) {
 +  read_lock(_lock);
 +  do_notify_parent_predump(current);
 +  read_unlock(_lock);
 +  cond_resched();
>>>
>>> perhaps this should be called by do_coredump() after coredump_wait() kills
>>> all the sub-threads?
>>
>> proc_coredump_connector(current) is located here, they should stay together.
> 
> Why?
> 
> Once again, other threads are still alive. So if the parent restarts the 
> service
> after it recieves -predump_signal, the new process can "race" with the old 
> thread.

Yes, it is a good idea to do the signal notification in do_coredump() after
coredump_wait(). Will make the change as suggested.

Thanks.  -- Enke


Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-16 Thread Enke Chen
Hi, Oleg:

On 10/16/18 7:14 AM, Oleg Nesterov wrote:
> On 10/15, Enke Chen wrote:
>>
>>> I don't understand why we need valid_predump_signal() at all.
>>
>> Most of the signals have well-defined semantics, and would not be appropriate
>> for this purpose.
> 
> you are going to change the rules anyway.
> 
>> That is why it is limited to only SIGCHLD, SIGUSR1, SIGUSR2.
> 
> Which do not queue. So the parent won't get the 2nd signal if 2 children
> crash at the same time.
> 
if (sig_kernel_coredump(signr)) {
 +  /*
 +   * Notify the parent prior to the coredump if the
 +   * parent is interested in such a notificaiton.
 +   */
 +  int p_sig = current->real_parent->predump_signal;
 +
 +  if (valid_predump_signal(p_sig)) {
 +  read_lock(_lock);
 +  do_notify_parent_predump(current);
 +  read_unlock(_lock);
 +  cond_resched();
>>>
>>> perhaps this should be called by do_coredump() after coredump_wait() kills
>>> all the sub-threads?
>>
>> proc_coredump_connector(current) is located here, they should stay together.
> 
> Why?
> 
> Once again, other threads are still alive. So if the parent restarts the 
> service
> after it recieves -predump_signal, the new process can "race" with the old 
> thread.

Yes, it is a good idea to do the signal notification in do_coredump() after
coredump_wait(). Will make the change as suggested.

Thanks.  -- Enke


Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-16 Thread Eric W. Biederman
Enke Chen  writes:

> Hi, Eric:
>
> On 10/15/18 4:28 PM, Eric W. Biederman wrote:

>> With that said I think the best solution would be to figure out how to
>> allow the coredump to run in parallel with the usual exit signal, and
>> exit code reaping of the process> 
>> That would solve the problem for everyone, and would not introduce any
>> new complicated APIs.
>
> That would certainly help. But given the huge deployment of Linux, I don't
> think it would be feasible to change this fundamental behavior (signal post
> coredump).

Of course it will be feasible to change.  Make it a sysctl and keep the
current default and no one will even notice.  Waiting for something that
is happening asynchronously is not be difficult so having the wait
optional should not be a problem.

Right now the default in most distributions is to disable core dumps
entirely.   Which means that you are going to have to find a very
specific situation in which people and applications care about core
dumps happening to break an existing setup.

Then all you have to do to get the non-blocking behavior is to just do:
echo 1 > /proc/sys/kernel_core_async

Then everything else works without modifications and everyone is happy.
Maybe I am wearing rose colored glasses but that looks like all that is
needed and it should be much easier to work with and maintain than
having to modify every manager process to listen for unreliable signals,
and then take action on those unreliable signals.

Eric


Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-16 Thread Eric W. Biederman
Enke Chen  writes:

> Hi, Eric:
>
> On 10/15/18 4:28 PM, Eric W. Biederman wrote:

>> With that said I think the best solution would be to figure out how to
>> allow the coredump to run in parallel with the usual exit signal, and
>> exit code reaping of the process> 
>> That would solve the problem for everyone, and would not introduce any
>> new complicated APIs.
>
> That would certainly help. But given the huge deployment of Linux, I don't
> think it would be feasible to change this fundamental behavior (signal post
> coredump).

Of course it will be feasible to change.  Make it a sysctl and keep the
current default and no one will even notice.  Waiting for something that
is happening asynchronously is not be difficult so having the wait
optional should not be a problem.

Right now the default in most distributions is to disable core dumps
entirely.   Which means that you are going to have to find a very
specific situation in which people and applications care about core
dumps happening to break an existing setup.

Then all you have to do to get the non-blocking behavior is to just do:
echo 1 > /proc/sys/kernel_core_async

Then everything else works without modifications and everyone is happy.
Maybe I am wearing rose colored glasses but that looks like all that is
needed and it should be much easier to work with and maintain than
having to modify every manager process to listen for unreliable signals,
and then take action on those unreliable signals.

Eric


Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-16 Thread Eric W. Biederman
Oleg Nesterov  writes:

> On 10/15, Enke Chen wrote:
>>
>> > I don't understand why we need valid_predump_signal() at all.
>>
>> Most of the signals have well-defined semantics, and would not be appropriate
>> for this purpose.
>
> you are going to change the rules anyway.

I will just add that CLD_XXX is only valid with SIGCHLD as they are
signal specific si_codes.  In conjunction with another signal like
SIGUSR it will have another meaning.  I would really appreciate it
if new code does not further complicate siginfo_layout.

>> That is why it is limited to only SIGCHLD, SIGUSR1, SIGUSR2.
>
> Which do not queue. So the parent won't get the 2nd signal if 2 children
> crash at the same time.

We do best effort queueing but we don't guarantee anything.  So yes
this makes signals a very louzy interface for sending this kind of
information.

>> >>   if (sig_kernel_coredump(signr)) {
>> >> + /*
>> >> +  * Notify the parent prior to the coredump if the
>> >> +  * parent is interested in such a notificaiton.
>> >> +  */
>> >> + int p_sig = current->real_parent->predump_signal;
>> >> +
>> >> + if (valid_predump_signal(p_sig)) {
>> >> + read_lock(_lock);
>> >> + do_notify_parent_predump(current);
>> >> + read_unlock(_lock);
>> >> + cond_resched();
>> >
>> > perhaps this should be called by do_coredump() after coredump_wait() kills
>> > all the sub-threads?
>>
>> proc_coredump_connector(current) is located here, they should stay together.
>
> Why?
>
> Once again, other threads are still alive. So if the parent restarts the 
> service
> after it recieves -predump_signal, the new process can "race" with the old 
> thread.

Yes.  It isn't until do_coredump calls coredump_wait that all of the
threads are killed.

Eric



Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-16 Thread Eric W. Biederman
Oleg Nesterov  writes:

> On 10/15, Enke Chen wrote:
>>
>> > I don't understand why we need valid_predump_signal() at all.
>>
>> Most of the signals have well-defined semantics, and would not be appropriate
>> for this purpose.
>
> you are going to change the rules anyway.

I will just add that CLD_XXX is only valid with SIGCHLD as they are
signal specific si_codes.  In conjunction with another signal like
SIGUSR it will have another meaning.  I would really appreciate it
if new code does not further complicate siginfo_layout.

>> That is why it is limited to only SIGCHLD, SIGUSR1, SIGUSR2.
>
> Which do not queue. So the parent won't get the 2nd signal if 2 children
> crash at the same time.

We do best effort queueing but we don't guarantee anything.  So yes
this makes signals a very louzy interface for sending this kind of
information.

>> >>   if (sig_kernel_coredump(signr)) {
>> >> + /*
>> >> +  * Notify the parent prior to the coredump if the
>> >> +  * parent is interested in such a notificaiton.
>> >> +  */
>> >> + int p_sig = current->real_parent->predump_signal;
>> >> +
>> >> + if (valid_predump_signal(p_sig)) {
>> >> + read_lock(_lock);
>> >> + do_notify_parent_predump(current);
>> >> + read_unlock(_lock);
>> >> + cond_resched();
>> >
>> > perhaps this should be called by do_coredump() after coredump_wait() kills
>> > all the sub-threads?
>>
>> proc_coredump_connector(current) is located here, they should stay together.
>
> Why?
>
> Once again, other threads are still alive. So if the parent restarts the 
> service
> after it recieves -predump_signal, the new process can "race" with the old 
> thread.

Yes.  It isn't until do_coredump calls coredump_wait that all of the
threads are killed.

Eric



Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-16 Thread Oleg Nesterov
On 10/15, Enke Chen wrote:
>
> > I don't understand why we need valid_predump_signal() at all.
>
> Most of the signals have well-defined semantics, and would not be appropriate
> for this purpose.

you are going to change the rules anyway.

> That is why it is limited to only SIGCHLD, SIGUSR1, SIGUSR2.

Which do not queue. So the parent won't get the 2nd signal if 2 children
crash at the same time.

> >>if (sig_kernel_coredump(signr)) {
> >> +  /*
> >> +   * Notify the parent prior to the coredump if the
> >> +   * parent is interested in such a notificaiton.
> >> +   */
> >> +  int p_sig = current->real_parent->predump_signal;
> >> +
> >> +  if (valid_predump_signal(p_sig)) {
> >> +  read_lock(_lock);
> >> +  do_notify_parent_predump(current);
> >> +  read_unlock(_lock);
> >> +  cond_resched();
> >
> > perhaps this should be called by do_coredump() after coredump_wait() kills
> > all the sub-threads?
>
> proc_coredump_connector(current) is located here, they should stay together.

Why?

Once again, other threads are still alive. So if the parent restarts the service
after it recieves -predump_signal, the new process can "race" with the old 
thread.

Oleg.



Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-16 Thread Oleg Nesterov
On 10/15, Enke Chen wrote:
>
> > I don't understand why we need valid_predump_signal() at all.
>
> Most of the signals have well-defined semantics, and would not be appropriate
> for this purpose.

you are going to change the rules anyway.

> That is why it is limited to only SIGCHLD, SIGUSR1, SIGUSR2.

Which do not queue. So the parent won't get the 2nd signal if 2 children
crash at the same time.

> >>if (sig_kernel_coredump(signr)) {
> >> +  /*
> >> +   * Notify the parent prior to the coredump if the
> >> +   * parent is interested in such a notificaiton.
> >> +   */
> >> +  int p_sig = current->real_parent->predump_signal;
> >> +
> >> +  if (valid_predump_signal(p_sig)) {
> >> +  read_lock(_lock);
> >> +  do_notify_parent_predump(current);
> >> +  read_unlock(_lock);
> >> +  cond_resched();
> >
> > perhaps this should be called by do_coredump() after coredump_wait() kills
> > all the sub-threads?
>
> proc_coredump_connector(current) is located here, they should stay together.

Why?

Once again, other threads are still alive. So if the parent restarts the service
after it recieves -predump_signal, the new process can "race" with the old 
thread.

Oleg.



Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Enke Chen
Hi, Eric:

On 10/15/18 4:28 PM, Eric W. Biederman wrote:
> Enke Chen  writes:
> 
>> For simplicity and consistency, this patch provides an implementation
>> for signal-based fault notification prior to the coredump of a child
>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>> be used by an application to express its interest and to specify the
>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>>
>> Background:
>>
>> As the coredump of a process may take time, in certain time-sensitive
>> applications it is necessary for a parent process (e.g., a process
>> manager) to be notified of a child's imminent death before the coredump
>> so that the parent process can act sooner, such as re-spawning an
>> application process, or initiating a control-plane fail-over.
> 
> You talk about time senstive and then you talk about bash scripts.
> I don't think your definition of time-sensitive and my definition match.

It's certainly not my preference to have a process manager (or one for each
application) written in bash scripts. But they do work, and are deployed.

> 
> With that said I think the best solution would be to figure out how to
> allow the coredump to run in parallel with the usual exit signal, and
> exit code reaping of the process> 
> That would solve the problem for everyone, and would not introduce any
> new complicated APIs.

That would certainly help. But given the huge deployment of Linux, I don't
think it would be feasible to change this fundamental behavior (signal post
coredump).

> 
> Short of that having the prctl in the process that receives the signals
> they you are doing is the right way to go.

Thanks for for the encouragement.

> 
> You are however calling do_notify_parent_predump from the wrong
> function, and frankly with the wrong locking.  There are multiple paths
> to the do_coredump function so you really want this notification from
> do_coredump.

This makes two - Oleg also suggested doing it in do_coredump().
I will look into it, perhaps also relocated proc_coredump_connector().

> 
> But I still think it would be better to solve the root cause problem and
> change the coredump logic to be able to run in parallel with the normal
> exit notification and zombie reaping logic.  Then the problem you are
> trying to solve goes away and everyone's code gets simpler.
> 
> Eric
> 

Thanks.  -- Enke



Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Enke Chen
Hi, Eric:

On 10/15/18 4:28 PM, Eric W. Biederman wrote:
> Enke Chen  writes:
> 
>> For simplicity and consistency, this patch provides an implementation
>> for signal-based fault notification prior to the coredump of a child
>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>> be used by an application to express its interest and to specify the
>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>>
>> Background:
>>
>> As the coredump of a process may take time, in certain time-sensitive
>> applications it is necessary for a parent process (e.g., a process
>> manager) to be notified of a child's imminent death before the coredump
>> so that the parent process can act sooner, such as re-spawning an
>> application process, or initiating a control-plane fail-over.
> 
> You talk about time senstive and then you talk about bash scripts.
> I don't think your definition of time-sensitive and my definition match.

It's certainly not my preference to have a process manager (or one for each
application) written in bash scripts. But they do work, and are deployed.

> 
> With that said I think the best solution would be to figure out how to
> allow the coredump to run in parallel with the usual exit signal, and
> exit code reaping of the process> 
> That would solve the problem for everyone, and would not introduce any
> new complicated APIs.

That would certainly help. But given the huge deployment of Linux, I don't
think it would be feasible to change this fundamental behavior (signal post
coredump).

> 
> Short of that having the prctl in the process that receives the signals
> they you are doing is the right way to go.

Thanks for for the encouragement.

> 
> You are however calling do_notify_parent_predump from the wrong
> function, and frankly with the wrong locking.  There are multiple paths
> to the do_coredump function so you really want this notification from
> do_coredump.

This makes two - Oleg also suggested doing it in do_coredump().
I will look into it, perhaps also relocated proc_coredump_connector().

> 
> But I still think it would be better to solve the root cause problem and
> change the coredump logic to be able to run in parallel with the normal
> exit notification and zombie reaping logic.  Then the problem you are
> trying to solve goes away and everyone's code gets simpler.
> 
> Eric
> 

Thanks.  -- Enke



Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread valdis . kletnieks
On Mon, 15 Oct 2018 18:28:03 -0500, Eric W. Biederman said:
> Enke Chen  writes:
>
> > For simplicity and consistency, this patch provides an implementation
> > for signal-based fault notification prior to the coredump of a child
> > process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
> > be used by an application to express its interest and to specify the
> > signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
> > signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
> >
> > Background:
> >
> > As the coredump of a process may take time, in certain time-sensitive
> > applications it is necessary for a parent process (e.g., a process
> > manager) to be notified of a child's imminent death before the coredump
> > so that the parent process can act sooner, such as re-spawning an
> > application process, or initiating a control-plane fail-over.
>
> You talk about time senstive and then you talk about bash scripts.
> I don't think your definition of time-sensitive and my definition match.

When the process image is measured in hundreds of gigabytes, the corefile can
take a while even by /bin/bash standards.  You want fun, watch an HPC process
manage to OOM a machine with 3T of RAM in a way that produces a full image
coredump.

To network storage.



pgpfkOg6G86es.pgp
Description: PGP signature


Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread valdis . kletnieks
On Mon, 15 Oct 2018 18:28:03 -0500, Eric W. Biederman said:
> Enke Chen  writes:
>
> > For simplicity and consistency, this patch provides an implementation
> > for signal-based fault notification prior to the coredump of a child
> > process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
> > be used by an application to express its interest and to specify the
> > signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
> > signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
> >
> > Background:
> >
> > As the coredump of a process may take time, in certain time-sensitive
> > applications it is necessary for a parent process (e.g., a process
> > manager) to be notified of a child's imminent death before the coredump
> > so that the parent process can act sooner, such as re-spawning an
> > application process, or initiating a control-plane fail-over.
>
> You talk about time senstive and then you talk about bash scripts.
> I don't think your definition of time-sensitive and my definition match.

When the process image is measured in hundreds of gigabytes, the corefile can
take a while even by /bin/bash standards.  You want fun, watch an HPC process
manage to OOM a machine with 3T of RAM in a way that produces a full image
coredump.

To network storage.



pgpfkOg6G86es.pgp
Description: PGP signature


Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Eric W. Biederman
Enke Chen  writes:

> For simplicity and consistency, this patch provides an implementation
> for signal-based fault notification prior to the coredump of a child
> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
> be used by an application to express its interest and to specify the
> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>
> Background:
>
> As the coredump of a process may take time, in certain time-sensitive
> applications it is necessary for a parent process (e.g., a process
> manager) to be notified of a child's imminent death before the coredump
> so that the parent process can act sooner, such as re-spawning an
> application process, or initiating a control-plane fail-over.

You talk about time senstive and then you talk about bash scripts.
I don't think your definition of time-sensitive and my definition match.

With that said I think the best solution would be to figure out how to
allow the coredump to run in parallel with the usual exit signal, and
exit code reaping of the process.

That would solve the problem for everyone, and would not introduce any
new complicated APIs.

Short of that having the prctl in the process that receives the signals
they you are doing is the right way to go.

You are however calling do_notify_parent_predump from the wrong
function, and frankly with the wrong locking.  There are multiple paths
to the do_coredump function so you really want this notification from
do_coredump.

But I still think it would be better to solve the root cause problem and
change the coredump logic to be able to run in parallel with the normal
exit notification and zombie reaping logic.  Then the problem you are
trying to solve goes away and everyone's code gets simpler.

Eric

> Currently there are two ways for a parent process to be notified of a
> child process's state change. One is to use the POSIX signal, and
> another is to use the kernel connector module. The specific events and
> actions are summarized as follows:
>
> Process EventPOSIX SignalConnector-based
> --
> ptrace_attach()  do_notify_parent_cldstop()  proc_ptrace_connector()
>  SIGCHLD / CLD_STOPPED
>
> ptrace_detach()  do_notify_parent_cldstop()  proc_ptrace_connector()
>  SIGCHLD / CLD_CONTINUED
>
> pre_coredump/N/A proc_coredump_connector()
> get_signal()
>
> post_coredump/   do_notify_parent()  proc_exit_connector()
> do_exit()SIGCHLD / exit_signal
> --
>
> As shown in the table, the signal-based pre-coredump notification is not
> currently available. In some cases using a connector-based notification
> can be quite complicated (e.g., when a process manager is written in shell
> scripts and thus is subject to certain inherent limitations), and a
> signal-based notification would be simpler and better suited.
>
> Signed-off-by: Enke Chen 
> ---
>  arch/x86/kernel/signal_compat.c|  2 +-
>  include/linux/sched.h  |  4 ++
>  include/linux/signal.h |  5 +++
>  include/uapi/asm-generic/siginfo.h |  3 +-
>  include/uapi/linux/prctl.h |  4 ++
>  kernel/fork.c  |  1 +
>  kernel/signal.c| 51 +
>  kernel/sys.c   | 77 
> ++
>  8 files changed, 145 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
> index 9ccbf05..a3deba8 100644
> --- a/arch/x86/kernel/signal_compat.c
> +++ b/arch/x86/kernel/signal_compat.c
> @@ -30,7 +30,7 @@ static inline void signal_compat_build_tests(void)
>   BUILD_BUG_ON(NSIGSEGV != 7);
>   BUILD_BUG_ON(NSIGBUS  != 5);
>   BUILD_BUG_ON(NSIGTRAP != 5);
> - BUILD_BUG_ON(NSIGCHLD != 6);
> + BUILD_BUG_ON(NSIGCHLD != 7);
>   BUILD_BUG_ON(NSIGSYS  != 1);
>  
>   /* This is part of the ABI and can never change in size: */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 09026ea..cfb9645 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -696,6 +696,10 @@ struct task_struct {
>   int exit_signal;
>   /* The signal sent when the parent dies: */
>   int pdeath_signal;
> +
> + /* The signal sent prior to a child's coredump: */
> + int predump_signal;
> +
>   /* JOBCTL_*, siglock protected: */
>   unsigned long   jobctl;
>  
> diff --git a/include/linux/signal.h b/include/linux/signal.h
> index 706a499..7cb976d 100644
> --- a/include/linux/signal.h
> +++ b/include/linux/signal.h
> @@ -256,6 +256,11 @@ static inline int valid_signal(unsigned long sig)
>   return sig 

Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Eric W. Biederman
Enke Chen  writes:

> For simplicity and consistency, this patch provides an implementation
> for signal-based fault notification prior to the coredump of a child
> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
> be used by an application to express its interest and to specify the
> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>
> Background:
>
> As the coredump of a process may take time, in certain time-sensitive
> applications it is necessary for a parent process (e.g., a process
> manager) to be notified of a child's imminent death before the coredump
> so that the parent process can act sooner, such as re-spawning an
> application process, or initiating a control-plane fail-over.

You talk about time senstive and then you talk about bash scripts.
I don't think your definition of time-sensitive and my definition match.

With that said I think the best solution would be to figure out how to
allow the coredump to run in parallel with the usual exit signal, and
exit code reaping of the process.

That would solve the problem for everyone, and would not introduce any
new complicated APIs.

Short of that having the prctl in the process that receives the signals
they you are doing is the right way to go.

You are however calling do_notify_parent_predump from the wrong
function, and frankly with the wrong locking.  There are multiple paths
to the do_coredump function so you really want this notification from
do_coredump.

But I still think it would be better to solve the root cause problem and
change the coredump logic to be able to run in parallel with the normal
exit notification and zombie reaping logic.  Then the problem you are
trying to solve goes away and everyone's code gets simpler.

Eric

> Currently there are two ways for a parent process to be notified of a
> child process's state change. One is to use the POSIX signal, and
> another is to use the kernel connector module. The specific events and
> actions are summarized as follows:
>
> Process EventPOSIX SignalConnector-based
> --
> ptrace_attach()  do_notify_parent_cldstop()  proc_ptrace_connector()
>  SIGCHLD / CLD_STOPPED
>
> ptrace_detach()  do_notify_parent_cldstop()  proc_ptrace_connector()
>  SIGCHLD / CLD_CONTINUED
>
> pre_coredump/N/A proc_coredump_connector()
> get_signal()
>
> post_coredump/   do_notify_parent()  proc_exit_connector()
> do_exit()SIGCHLD / exit_signal
> --
>
> As shown in the table, the signal-based pre-coredump notification is not
> currently available. In some cases using a connector-based notification
> can be quite complicated (e.g., when a process manager is written in shell
> scripts and thus is subject to certain inherent limitations), and a
> signal-based notification would be simpler and better suited.
>
> Signed-off-by: Enke Chen 
> ---
>  arch/x86/kernel/signal_compat.c|  2 +-
>  include/linux/sched.h  |  4 ++
>  include/linux/signal.h |  5 +++
>  include/uapi/asm-generic/siginfo.h |  3 +-
>  include/uapi/linux/prctl.h |  4 ++
>  kernel/fork.c  |  1 +
>  kernel/signal.c| 51 +
>  kernel/sys.c   | 77 
> ++
>  8 files changed, 145 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
> index 9ccbf05..a3deba8 100644
> --- a/arch/x86/kernel/signal_compat.c
> +++ b/arch/x86/kernel/signal_compat.c
> @@ -30,7 +30,7 @@ static inline void signal_compat_build_tests(void)
>   BUILD_BUG_ON(NSIGSEGV != 7);
>   BUILD_BUG_ON(NSIGBUS  != 5);
>   BUILD_BUG_ON(NSIGTRAP != 5);
> - BUILD_BUG_ON(NSIGCHLD != 6);
> + BUILD_BUG_ON(NSIGCHLD != 7);
>   BUILD_BUG_ON(NSIGSYS  != 1);
>  
>   /* This is part of the ABI and can never change in size: */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 09026ea..cfb9645 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -696,6 +696,10 @@ struct task_struct {
>   int exit_signal;
>   /* The signal sent when the parent dies: */
>   int pdeath_signal;
> +
> + /* The signal sent prior to a child's coredump: */
> + int predump_signal;
> +
>   /* JOBCTL_*, siglock protected: */
>   unsigned long   jobctl;
>  
> diff --git a/include/linux/signal.h b/include/linux/signal.h
> index 706a499..7cb976d 100644
> --- a/include/linux/signal.h
> +++ b/include/linux/signal.h
> @@ -256,6 +256,11 @@ static inline int valid_signal(unsigned long sig)
>   return sig 

Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Enke Chen
Hi, Alan:

As I replied earlier, I will remove the logic that allows setting on others.
This function "set_predump_signal_perm()" will be gone too.

Thanks.  -- Enke

On 10/15/18 2:21 PM, Alan Cox wrote:
>> +/*
>> + * Returns true if current's euid is same as p's uid or euid,
>> + * or has CAP_SYS_ADMIN.
>> + *
>> + * Called with rcu_read_lock, creds are safe.
>> + *
>> + * Adapted from set_one_prio_perm().
>> + */
>> +static bool set_predump_signal_perm(struct task_struct *p)
>> +{
>> +const struct cred *cred = current_cred(), *pcred = __task_cred(p);
>> +
>> +return uid_eq(pcred->uid, cred->euid) ||
>> +   uid_eq(pcred->euid, cred->euid) ||
>> +   capable(CAP_SYS_ADMIN);
>> +}
> 
> This makes absolutely no security sense whatsoever. The uid and euid of
> the parent and child can both change between the test and the signal
> delivery.
> 
> There are reasons that the child signal control code is incredibly
> careful about either the parent or child using execve or doing a
> privilege change that might pose a risk.
> 
> Until this code gets the same protections I don't believe it's safe.
> 
> Alan
> 


Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Enke Chen
Hi, Alan:

As I replied earlier, I will remove the logic that allows setting on others.
This function "set_predump_signal_perm()" will be gone too.

Thanks.  -- Enke

On 10/15/18 2:21 PM, Alan Cox wrote:
>> +/*
>> + * Returns true if current's euid is same as p's uid or euid,
>> + * or has CAP_SYS_ADMIN.
>> + *
>> + * Called with rcu_read_lock, creds are safe.
>> + *
>> + * Adapted from set_one_prio_perm().
>> + */
>> +static bool set_predump_signal_perm(struct task_struct *p)
>> +{
>> +const struct cred *cred = current_cred(), *pcred = __task_cred(p);
>> +
>> +return uid_eq(pcred->uid, cred->euid) ||
>> +   uid_eq(pcred->euid, cred->euid) ||
>> +   capable(CAP_SYS_ADMIN);
>> +}
> 
> This makes absolutely no security sense whatsoever. The uid and euid of
> the parent and child can both change between the test and the signal
> delivery.
> 
> There are reasons that the child signal control code is incredibly
> careful about either the parent or child using execve or doing a
> privilege change that might pose a risk.
> 
> Until this code gets the same protections I don't believe it's safe.
> 
> Alan
> 


Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Alan Cox
> +/*
> + * Returns true if current's euid is same as p's uid or euid,
> + * or has CAP_SYS_ADMIN.
> + *
> + * Called with rcu_read_lock, creds are safe.
> + *
> + * Adapted from set_one_prio_perm().
> + */
> +static bool set_predump_signal_perm(struct task_struct *p)
> +{
> + const struct cred *cred = current_cred(), *pcred = __task_cred(p);
> +
> + return uid_eq(pcred->uid, cred->euid) ||
> +uid_eq(pcred->euid, cred->euid) ||
> +capable(CAP_SYS_ADMIN);
> +}

This makes absolutely no security sense whatsoever. The uid and euid of
the parent and child can both change between the test and the signal
delivery.

There are reasons that the child signal control code is incredibly
careful about either the parent or child using execve or doing a
privilege change that might pose a risk.

Until this code gets the same protections I don't believe it's safe.

Alan


Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Alan Cox
> +/*
> + * Returns true if current's euid is same as p's uid or euid,
> + * or has CAP_SYS_ADMIN.
> + *
> + * Called with rcu_read_lock, creds are safe.
> + *
> + * Adapted from set_one_prio_perm().
> + */
> +static bool set_predump_signal_perm(struct task_struct *p)
> +{
> + const struct cred *cred = current_cred(), *pcred = __task_cred(p);
> +
> + return uid_eq(pcred->uid, cred->euid) ||
> +uid_eq(pcred->euid, cred->euid) ||
> +capable(CAP_SYS_ADMIN);
> +}

This makes absolutely no security sense whatsoever. The uid and euid of
the parent and child can both change between the test and the signal
delivery.

There are reasons that the child signal control code is incredibly
careful about either the parent or child using execve or doing a
privilege change that might pose a risk.

Until this code gets the same protections I don't believe it's safe.

Alan


Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Enke Chen
Hi, Olge:

>> probably ->predump_signal should be cleared on exec?

As I replied to Jann, will do.

Thanks. -- Enke

On 10/15/18 12:17 PM, Enke Chen wrote:
> Hi, Oleg:
> 
> I missed some of your comments in my previous reply.
> 
> On 10/15/18 5:05 AM, Oleg Nesterov wrote:
>> On 10/12, Enke Chen wrote:
>>>
>>> For simplicity and consistency, this patch provides an implementation
>>> for signal-based fault notification prior to the coredump of a child
>>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>>> be used by an application to express its interest and to specify the
>>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>>
>> To be honest, I can't say I like this new feature...
>>
>>> --- a/include/linux/sched.h
>>> +++ b/include/linux/sched.h
>>> @@ -696,6 +696,10 @@ struct task_struct {
>>> int exit_signal;
>>> /* The signal sent when the parent dies: */
>>> int pdeath_signal;
>>> +
>>> +   /* The signal sent prior to a child's coredump: */
>>> +   int predump_signal;
>>> +
>>
>> At least, I think predump_signal should live in signal_struct, not
>> task_struct.
>>
>> (pdeath_signal too, but it is too late to change (fix) this awkward API).
>>
>>> +static void do_notify_parent_predump(struct task_struct *tsk)
>>> +{
>>> +   struct sighand_struct *sighand;
>>> +   struct task_struct *parent;
>>> +   struct kernel_siginfo info;
>>> +   unsigned long flags;
>>> +   int sig;
>>> +
>>> +   parent = tsk->real_parent;
>>
>> So, debuggere won't be notified, only real_parent...
>>
>>> +   sig = parent->predump_signal;
>>
>> probably ->predump_signal should be cleared on exec?
> 
> 
> Is this not enough in "copy_process()"?
> 
> @@ -1985,6 +1985,7 @@ static __latent_entropy struct task_struct 
> *copy_process(
>   p->dirty_paused_when = 0;
>  
>   p->pdeath_signal = 0;
> + p->predump_signal = 0;
> 
>>
>>> +   /* Check again with tasklist_lock" locked by the caller */
>>> +   if (!valid_predump_signal(sig))
>>> +   return;
>>
>> I don't understand why we need valid_predump_signal() at all.
> 
> Most of the signals have well-defined semantics, and would not be appropriate
> for this purpose.  That is why it is limited to only SIGCHLD, SIGUSR1, 
> SIGUSR2.
> 
>>
>>>  bool get_signal(struct ksignal *ksig)
>>>  {
>>> struct sighand_struct *sighand = current->sighand;
>>> @@ -2497,6 +2535,19 @@ bool get_signal(struct ksignal *ksig)
>>> current->flags |= PF_SIGNALED;
>>>  
>>> if (sig_kernel_coredump(signr)) {
>>> +   /*
>>> +* Notify the parent prior to the coredump if the
>>> +* parent is interested in such a notificaiton.
>>> +*/
>>> +   int p_sig = current->real_parent->predump_signal;
>>> +
>>> +   if (valid_predump_signal(p_sig)) {
>>> +   read_lock(_lock);
>>> +   do_notify_parent_predump(current);
>>> +   read_unlock(_lock);
>>> +   cond_resched();
>>
>> perhaps this should be called by do_coredump() after coredump_wait() kills
>> all the sub-threads?
> 
> proc_coredump_connector(current) is located here, they should stay together.
> 
> Thanks.  -- Enke
> 
>>
>>> +static int prctl_set_predump_signal(struct task_struct *tsk, pid_t pid, 
>>> int sig)
>>> +{
>>> +   struct task_struct *p;
>>> +   int error;
>>> +
>>> +   /* 0 is valid for disabling the feature */
>>> +   if (sig && !valid_predump_signal(sig))
>>> +   return -EINVAL;
>>> +
>>> +   /* For the current task, the common case */
>>> +   if (pid == 0) {
>>> +   tsk->predump_signal = sig;
>>> +   return 0;
>>> +   }
>>> +
>>> +   error = -ESRCH;
>>> +   rcu_read_lock();
>>> +   p = find_task_by_vpid(pid);
>>> +   if (p) {
>>> +   if (!set_predump_signal_perm(p))
>>> +   error = -EPERM;
>>> +   else {
>>> +   error = 0;
>>> +   p->predump_signal = sig;
>>> +   }
>>> +   }
>>> +   rcu_read_unlock();
>>> +   return error;
>>> +}
>>
>> Why? I mean, why do we really want to support the pid != 0 case?
>>
>> Oleg.
>>


Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Enke Chen
Hi, Olge:

>> probably ->predump_signal should be cleared on exec?

As I replied to Jann, will do.

Thanks. -- Enke

On 10/15/18 12:17 PM, Enke Chen wrote:
> Hi, Oleg:
> 
> I missed some of your comments in my previous reply.
> 
> On 10/15/18 5:05 AM, Oleg Nesterov wrote:
>> On 10/12, Enke Chen wrote:
>>>
>>> For simplicity and consistency, this patch provides an implementation
>>> for signal-based fault notification prior to the coredump of a child
>>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>>> be used by an application to express its interest and to specify the
>>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>>
>> To be honest, I can't say I like this new feature...
>>
>>> --- a/include/linux/sched.h
>>> +++ b/include/linux/sched.h
>>> @@ -696,6 +696,10 @@ struct task_struct {
>>> int exit_signal;
>>> /* The signal sent when the parent dies: */
>>> int pdeath_signal;
>>> +
>>> +   /* The signal sent prior to a child's coredump: */
>>> +   int predump_signal;
>>> +
>>
>> At least, I think predump_signal should live in signal_struct, not
>> task_struct.
>>
>> (pdeath_signal too, but it is too late to change (fix) this awkward API).
>>
>>> +static void do_notify_parent_predump(struct task_struct *tsk)
>>> +{
>>> +   struct sighand_struct *sighand;
>>> +   struct task_struct *parent;
>>> +   struct kernel_siginfo info;
>>> +   unsigned long flags;
>>> +   int sig;
>>> +
>>> +   parent = tsk->real_parent;
>>
>> So, debuggere won't be notified, only real_parent...
>>
>>> +   sig = parent->predump_signal;
>>
>> probably ->predump_signal should be cleared on exec?
> 
> 
> Is this not enough in "copy_process()"?
> 
> @@ -1985,6 +1985,7 @@ static __latent_entropy struct task_struct 
> *copy_process(
>   p->dirty_paused_when = 0;
>  
>   p->pdeath_signal = 0;
> + p->predump_signal = 0;
> 
>>
>>> +   /* Check again with tasklist_lock" locked by the caller */
>>> +   if (!valid_predump_signal(sig))
>>> +   return;
>>
>> I don't understand why we need valid_predump_signal() at all.
> 
> Most of the signals have well-defined semantics, and would not be appropriate
> for this purpose.  That is why it is limited to only SIGCHLD, SIGUSR1, 
> SIGUSR2.
> 
>>
>>>  bool get_signal(struct ksignal *ksig)
>>>  {
>>> struct sighand_struct *sighand = current->sighand;
>>> @@ -2497,6 +2535,19 @@ bool get_signal(struct ksignal *ksig)
>>> current->flags |= PF_SIGNALED;
>>>  
>>> if (sig_kernel_coredump(signr)) {
>>> +   /*
>>> +* Notify the parent prior to the coredump if the
>>> +* parent is interested in such a notificaiton.
>>> +*/
>>> +   int p_sig = current->real_parent->predump_signal;
>>> +
>>> +   if (valid_predump_signal(p_sig)) {
>>> +   read_lock(_lock);
>>> +   do_notify_parent_predump(current);
>>> +   read_unlock(_lock);
>>> +   cond_resched();
>>
>> perhaps this should be called by do_coredump() after coredump_wait() kills
>> all the sub-threads?
> 
> proc_coredump_connector(current) is located here, they should stay together.
> 
> Thanks.  -- Enke
> 
>>
>>> +static int prctl_set_predump_signal(struct task_struct *tsk, pid_t pid, 
>>> int sig)
>>> +{
>>> +   struct task_struct *p;
>>> +   int error;
>>> +
>>> +   /* 0 is valid for disabling the feature */
>>> +   if (sig && !valid_predump_signal(sig))
>>> +   return -EINVAL;
>>> +
>>> +   /* For the current task, the common case */
>>> +   if (pid == 0) {
>>> +   tsk->predump_signal = sig;
>>> +   return 0;
>>> +   }
>>> +
>>> +   error = -ESRCH;
>>> +   rcu_read_lock();
>>> +   p = find_task_by_vpid(pid);
>>> +   if (p) {
>>> +   if (!set_predump_signal_perm(p))
>>> +   error = -EPERM;
>>> +   else {
>>> +   error = 0;
>>> +   p->predump_signal = sig;
>>> +   }
>>> +   }
>>> +   rcu_read_unlock();
>>> +   return error;
>>> +}
>>
>> Why? I mean, why do we really want to support the pid != 0 case?
>>
>> Oleg.
>>


Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Enke Chen
Hi, Jann:

Thanks for your detail explanation. Will take care of it.

-- Enke

On 10/15/18 11:54 AM, Jann Horn wrote:
> On Mon, Oct 15, 2018 at 8:36 PM Enke Chen  wrote:
>> On 10/13/18 11:27 AM, Jann Horn wrote:
>>> On Sat, Oct 13, 2018 at 2:33 AM Enke Chen  wrote:
 For simplicity and consistency, this patch provides an implementation
 for signal-based fault notification prior to the coredump of a child
 process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
 be used by an application to express its interest and to specify the
 signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
 signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>>>
>>> Your suggested API looks vaguely similar to PR_SET_PDEATHSIG, but with
>>> some important differences:
>>>
>>>  - You don't reset the signal on setuid execution.
> [...]
>>>
>>> For both of these: Are these differences actually necessary, and if
>>> so, can you provide a specific rationale? From a security perspective,
>>> I would very much prefer it if this API had semantics closer to
>>> PR_SET_PDEATHSIG.
>>
> [...]
>>
>> Regarding the impact of "setuid", this property "PR_SET_PREDUMP_SIG" has to
>> do with the application/process whether the signal handler is set for 
>> receiving
>> such a notification.  If it is set, the "uid" should not matter.
> 
> If an attacker's process first calls PR_SET_PREDUMP_SIG, then forks
> off a child, then calls execve() on a setuid binary, the setuid binary
> calls setuid(0), and the attacker-controlled child then crashes, the
> privileged process will receive an unexpected signal that the attacker
> wouldn't have been allowed to send otherwise. For similar reasons, the
> parent death signal is reset when a setuid binary is executed:
> 
> void setup_new_exec(struct linux_binprm * bprm)
> {
> /*
>  * Once here, prepare_binrpm() will not be called any more, so
>  * the final state of setuid/setgid/fscaps can be merged into the
>  * secureexec flag.
>  */
> bprm->secureexec |= bprm->cap_elevated;
> 
> if (bprm->secureexec) {
> /* Make sure parent cannot signal privileged process. */
> current->pdeath_signal = 0;
> [...]
> }
> [...]
> }
> 
> int commit_creds(struct cred *new)
> {
> [...]
> /* dumpability changes */
> if (!uid_eq(old->euid, new->euid) ||
> !gid_eq(old->egid, new->egid) ||
> !uid_eq(old->fsuid, new->fsuid) ||
> !gid_eq(old->fsgid, new->fsgid) ||
> !cred_cap_issubset(old, new)) {
> if (task->mm)
> set_dumpable(task->mm, suid_dumpable);
> task->pdeath_signal = 0;
> smp_wmb();
> }
> [...]
> }
> 
> AppArmor and SELinux also do related changes:
> 
> static void apparmor_bprm_committing_creds(struct linux_binprm *bprm)
> {
> [...]
> /* bail out if unconfined or not changing profile */
> if ((new_label->proxy == label->proxy) ||
> (unconfined(new_label)))
> return;
> 
> aa_inherit_files(bprm->cred, current->files);
> 
> current->pdeath_signal = 0;
> [...]
> }
> 
> static void selinux_bprm_committing_creds(struct linux_binprm *bprm)
> {
> [...]
> new_tsec = bprm->cred->security;
> if (new_tsec->sid == new_tsec->osid)
> return;
> 
> /* Close files for which the new task SID is not authorized. */
> flush_unauthorized_files(bprm->cred, current->files);
> 
> /* Always clear parent death signal on SID transitions. */
> current->pdeath_signal = 0;
> [...]
> }
> 
> You should probably reset the coredump signal in the same places - or
> even better, add a new helper for resetting the parent death signal,
> and then add code for resetting the coredump signal in there.
> 


Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Enke Chen
Hi, Jann:

Thanks for your detail explanation. Will take care of it.

-- Enke

On 10/15/18 11:54 AM, Jann Horn wrote:
> On Mon, Oct 15, 2018 at 8:36 PM Enke Chen  wrote:
>> On 10/13/18 11:27 AM, Jann Horn wrote:
>>> On Sat, Oct 13, 2018 at 2:33 AM Enke Chen  wrote:
 For simplicity and consistency, this patch provides an implementation
 for signal-based fault notification prior to the coredump of a child
 process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
 be used by an application to express its interest and to specify the
 signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
 signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>>>
>>> Your suggested API looks vaguely similar to PR_SET_PDEATHSIG, but with
>>> some important differences:
>>>
>>>  - You don't reset the signal on setuid execution.
> [...]
>>>
>>> For both of these: Are these differences actually necessary, and if
>>> so, can you provide a specific rationale? From a security perspective,
>>> I would very much prefer it if this API had semantics closer to
>>> PR_SET_PDEATHSIG.
>>
> [...]
>>
>> Regarding the impact of "setuid", this property "PR_SET_PREDUMP_SIG" has to
>> do with the application/process whether the signal handler is set for 
>> receiving
>> such a notification.  If it is set, the "uid" should not matter.
> 
> If an attacker's process first calls PR_SET_PREDUMP_SIG, then forks
> off a child, then calls execve() on a setuid binary, the setuid binary
> calls setuid(0), and the attacker-controlled child then crashes, the
> privileged process will receive an unexpected signal that the attacker
> wouldn't have been allowed to send otherwise. For similar reasons, the
> parent death signal is reset when a setuid binary is executed:
> 
> void setup_new_exec(struct linux_binprm * bprm)
> {
> /*
>  * Once here, prepare_binrpm() will not be called any more, so
>  * the final state of setuid/setgid/fscaps can be merged into the
>  * secureexec flag.
>  */
> bprm->secureexec |= bprm->cap_elevated;
> 
> if (bprm->secureexec) {
> /* Make sure parent cannot signal privileged process. */
> current->pdeath_signal = 0;
> [...]
> }
> [...]
> }
> 
> int commit_creds(struct cred *new)
> {
> [...]
> /* dumpability changes */
> if (!uid_eq(old->euid, new->euid) ||
> !gid_eq(old->egid, new->egid) ||
> !uid_eq(old->fsuid, new->fsuid) ||
> !gid_eq(old->fsgid, new->fsgid) ||
> !cred_cap_issubset(old, new)) {
> if (task->mm)
> set_dumpable(task->mm, suid_dumpable);
> task->pdeath_signal = 0;
> smp_wmb();
> }
> [...]
> }
> 
> AppArmor and SELinux also do related changes:
> 
> static void apparmor_bprm_committing_creds(struct linux_binprm *bprm)
> {
> [...]
> /* bail out if unconfined or not changing profile */
> if ((new_label->proxy == label->proxy) ||
> (unconfined(new_label)))
> return;
> 
> aa_inherit_files(bprm->cred, current->files);
> 
> current->pdeath_signal = 0;
> [...]
> }
> 
> static void selinux_bprm_committing_creds(struct linux_binprm *bprm)
> {
> [...]
> new_tsec = bprm->cred->security;
> if (new_tsec->sid == new_tsec->osid)
> return;
> 
> /* Close files for which the new task SID is not authorized. */
> flush_unauthorized_files(bprm->cred, current->files);
> 
> /* Always clear parent death signal on SID transitions. */
> current->pdeath_signal = 0;
> [...]
> }
> 
> You should probably reset the coredump signal in the same places - or
> even better, add a new helper for resetting the parent death signal,
> and then add code for resetting the coredump signal in there.
> 


Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Enke Chen
Hi, Oleg:

I missed some of your comments in my previous reply.

On 10/15/18 5:05 AM, Oleg Nesterov wrote:
> On 10/12, Enke Chen wrote:
>>
>> For simplicity and consistency, this patch provides an implementation
>> for signal-based fault notification prior to the coredump of a child
>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>> be used by an application to express its interest and to specify the
>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
> 
> To be honest, I can't say I like this new feature...
> 
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -696,6 +696,10 @@ struct task_struct {
>>  int exit_signal;
>>  /* The signal sent when the parent dies: */
>>  int pdeath_signal;
>> +
>> +/* The signal sent prior to a child's coredump: */
>> +int predump_signal;
>> +
> 
> At least, I think predump_signal should live in signal_struct, not
> task_struct.
> 
> (pdeath_signal too, but it is too late to change (fix) this awkward API).
> 
>> +static void do_notify_parent_predump(struct task_struct *tsk)
>> +{
>> +struct sighand_struct *sighand;
>> +struct task_struct *parent;
>> +struct kernel_siginfo info;
>> +unsigned long flags;
>> +int sig;
>> +
>> +parent = tsk->real_parent;
> 
> So, debuggere won't be notified, only real_parent...
> 
>> +sig = parent->predump_signal;
> 
> probably ->predump_signal should be cleared on exec?


Is this not enough in "copy_process()"?

@@ -1985,6 +1985,7 @@ static __latent_entropy struct task_struct *copy_process(
p->dirty_paused_when = 0;
 
p->pdeath_signal = 0;
+   p->predump_signal = 0;

> 
>> +/* Check again with tasklist_lock" locked by the caller */
>> +if (!valid_predump_signal(sig))
>> +return;
> 
> I don't understand why we need valid_predump_signal() at all.

Most of the signals have well-defined semantics, and would not be appropriate
for this purpose.  That is why it is limited to only SIGCHLD, SIGUSR1, SIGUSR2.

> 
>>  bool get_signal(struct ksignal *ksig)
>>  {
>>  struct sighand_struct *sighand = current->sighand;
>> @@ -2497,6 +2535,19 @@ bool get_signal(struct ksignal *ksig)
>>  current->flags |= PF_SIGNALED;
>>  
>>  if (sig_kernel_coredump(signr)) {
>> +/*
>> + * Notify the parent prior to the coredump if the
>> + * parent is interested in such a notificaiton.
>> + */
>> +int p_sig = current->real_parent->predump_signal;
>> +
>> +if (valid_predump_signal(p_sig)) {
>> +read_lock(_lock);
>> +do_notify_parent_predump(current);
>> +read_unlock(_lock);
>> +cond_resched();
> 
> perhaps this should be called by do_coredump() after coredump_wait() kills
> all the sub-threads?

proc_coredump_connector(current) is located here, they should stay together.

Thanks.  -- Enke

> 
>> +static int prctl_set_predump_signal(struct task_struct *tsk, pid_t pid, int 
>> sig)
>> +{
>> +struct task_struct *p;
>> +int error;
>> +
>> +/* 0 is valid for disabling the feature */
>> +if (sig && !valid_predump_signal(sig))
>> +return -EINVAL;
>> +
>> +/* For the current task, the common case */
>> +if (pid == 0) {
>> +tsk->predump_signal = sig;
>> +return 0;
>> +}
>> +
>> +error = -ESRCH;
>> +rcu_read_lock();
>> +p = find_task_by_vpid(pid);
>> +if (p) {
>> +if (!set_predump_signal_perm(p))
>> +error = -EPERM;
>> +else {
>> +error = 0;
>> +p->predump_signal = sig;
>> +}
>> +}
>> +rcu_read_unlock();
>> +return error;
>> +}
> 
> Why? I mean, why do we really want to support the pid != 0 case?
> 
> Oleg.
> 


Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Enke Chen
Hi, Oleg:

I missed some of your comments in my previous reply.

On 10/15/18 5:05 AM, Oleg Nesterov wrote:
> On 10/12, Enke Chen wrote:
>>
>> For simplicity and consistency, this patch provides an implementation
>> for signal-based fault notification prior to the coredump of a child
>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>> be used by an application to express its interest and to specify the
>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
> 
> To be honest, I can't say I like this new feature...
> 
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -696,6 +696,10 @@ struct task_struct {
>>  int exit_signal;
>>  /* The signal sent when the parent dies: */
>>  int pdeath_signal;
>> +
>> +/* The signal sent prior to a child's coredump: */
>> +int predump_signal;
>> +
> 
> At least, I think predump_signal should live in signal_struct, not
> task_struct.
> 
> (pdeath_signal too, but it is too late to change (fix) this awkward API).
> 
>> +static void do_notify_parent_predump(struct task_struct *tsk)
>> +{
>> +struct sighand_struct *sighand;
>> +struct task_struct *parent;
>> +struct kernel_siginfo info;
>> +unsigned long flags;
>> +int sig;
>> +
>> +parent = tsk->real_parent;
> 
> So, debuggere won't be notified, only real_parent...
> 
>> +sig = parent->predump_signal;
> 
> probably ->predump_signal should be cleared on exec?


Is this not enough in "copy_process()"?

@@ -1985,6 +1985,7 @@ static __latent_entropy struct task_struct *copy_process(
p->dirty_paused_when = 0;
 
p->pdeath_signal = 0;
+   p->predump_signal = 0;

> 
>> +/* Check again with tasklist_lock" locked by the caller */
>> +if (!valid_predump_signal(sig))
>> +return;
> 
> I don't understand why we need valid_predump_signal() at all.

Most of the signals have well-defined semantics, and would not be appropriate
for this purpose.  That is why it is limited to only SIGCHLD, SIGUSR1, SIGUSR2.

> 
>>  bool get_signal(struct ksignal *ksig)
>>  {
>>  struct sighand_struct *sighand = current->sighand;
>> @@ -2497,6 +2535,19 @@ bool get_signal(struct ksignal *ksig)
>>  current->flags |= PF_SIGNALED;
>>  
>>  if (sig_kernel_coredump(signr)) {
>> +/*
>> + * Notify the parent prior to the coredump if the
>> + * parent is interested in such a notificaiton.
>> + */
>> +int p_sig = current->real_parent->predump_signal;
>> +
>> +if (valid_predump_signal(p_sig)) {
>> +read_lock(_lock);
>> +do_notify_parent_predump(current);
>> +read_unlock(_lock);
>> +cond_resched();
> 
> perhaps this should be called by do_coredump() after coredump_wait() kills
> all the sub-threads?

proc_coredump_connector(current) is located here, they should stay together.

Thanks.  -- Enke

> 
>> +static int prctl_set_predump_signal(struct task_struct *tsk, pid_t pid, int 
>> sig)
>> +{
>> +struct task_struct *p;
>> +int error;
>> +
>> +/* 0 is valid for disabling the feature */
>> +if (sig && !valid_predump_signal(sig))
>> +return -EINVAL;
>> +
>> +/* For the current task, the common case */
>> +if (pid == 0) {
>> +tsk->predump_signal = sig;
>> +return 0;
>> +}
>> +
>> +error = -ESRCH;
>> +rcu_read_lock();
>> +p = find_task_by_vpid(pid);
>> +if (p) {
>> +if (!set_predump_signal_perm(p))
>> +error = -EPERM;
>> +else {
>> +error = 0;
>> +p->predump_signal = sig;
>> +}
>> +}
>> +rcu_read_unlock();
>> +return error;
>> +}
> 
> Why? I mean, why do we really want to support the pid != 0 case?
> 
> Oleg.
> 


Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Enke Chen
Hi, Oleg:

On 10/15/18 5:05 AM, Oleg Nesterov wrote:
> On 10/12, Enke Chen wrote:
>>
>> For simplicity and consistency, this patch provides an implementation
>> for signal-based fault notification prior to the coredump of a child
>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>> be used by an application to express its interest and to specify the
>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
> 
> To be honest, I can't say I like this new feature...

The requirement for predump notification is real. IMO signal notification
is simpler than "connector" or "signal + connector".

> 
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -696,6 +696,10 @@ struct task_struct {
>>  int exit_signal;
>>  /* The signal sent when the parent dies: */
>>  int pdeath_signal;
>> +
>> +/* The signal sent prior to a child's coredump: */
>> +int predump_signal;
>> +
> 
> At least, I think predump_signal should live in signal_struct, not
> task_struct.

It makes sense as "signal handling" must be consistent in a process.
I was following the wrong example. I will make the change.

> 
> (pdeath_signal too, but it is too late to change (fix) this awkward API).
> 
>> +static void do_notify_parent_predump(struct task_struct *tsk)
>> +{
>> +struct sighand_struct *sighand;
>> +struct task_struct *parent;
>> +struct kernel_siginfo info;
>> +unsigned long flags;
>> +int sig;
>> +
>> +parent = tsk->real_parent;
> 
> So, debuggere won't be notified, only real_parent...
> 
>> +sig = parent->predump_signal;
> 
> probably ->predump_signal should be cleared on exec?
> 
>> +/* Check again with tasklist_lock" locked by the caller */
>> +if (!valid_predump_signal(sig))
>> +return;
> 
> I don't understand why we need valid_predump_signal() at all.
> 
>>  bool get_signal(struct ksignal *ksig)
>>  {
>>  struct sighand_struct *sighand = current->sighand;
>> @@ -2497,6 +2535,19 @@ bool get_signal(struct ksignal *ksig)
>>  current->flags |= PF_SIGNALED;
>>  
>>  if (sig_kernel_coredump(signr)) {
>> +/*
>> + * Notify the parent prior to the coredump if the
>> + * parent is interested in such a notificaiton.
>> + */
>> +int p_sig = current->real_parent->predump_signal;
>> +
>> +if (valid_predump_signal(p_sig)) {
>> +read_lock(_lock);
>> +do_notify_parent_predump(current);
>> +read_unlock(_lock);
>> +cond_resched();
> 
> perhaps this should be called by do_coredump() after coredump_wait() kills
> all the sub-threads?
> 
>> +static int prctl_set_predump_signal(struct task_struct *tsk, pid_t pid, int 
>> sig)
>> +{
>> +struct task_struct *p;
>> +int error;
>> +
>> +/* 0 is valid for disabling the feature */
>> +if (sig && !valid_predump_signal(sig))
>> +return -EINVAL;
>> +
>> +/* For the current task, the common case */
>> +if (pid == 0) {
>> +tsk->predump_signal = sig;
>> +return 0;
>> +}
>> +
>> +error = -ESRCH;
>> +rcu_read_lock();
>> +p = find_task_by_vpid(pid);
>> +if (p) {
>> +if (!set_predump_signal_perm(p))
>> +error = -EPERM;
>> +else {
>> +error = 0;
>> +p->predump_signal = sig;
>> +}
>> +}
>> +rcu_read_unlock();
>> +return error;
>> +}
> 
> Why? I mean, why do we really want to support the pid != 0 case?

I will remove it. Please see my reply to Jann.

Thanks.  -- Enke



Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Enke Chen
Hi, Oleg:

On 10/15/18 5:05 AM, Oleg Nesterov wrote:
> On 10/12, Enke Chen wrote:
>>
>> For simplicity and consistency, this patch provides an implementation
>> for signal-based fault notification prior to the coredump of a child
>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>> be used by an application to express its interest and to specify the
>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
> 
> To be honest, I can't say I like this new feature...

The requirement for predump notification is real. IMO signal notification
is simpler than "connector" or "signal + connector".

> 
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -696,6 +696,10 @@ struct task_struct {
>>  int exit_signal;
>>  /* The signal sent when the parent dies: */
>>  int pdeath_signal;
>> +
>> +/* The signal sent prior to a child's coredump: */
>> +int predump_signal;
>> +
> 
> At least, I think predump_signal should live in signal_struct, not
> task_struct.

It makes sense as "signal handling" must be consistent in a process.
I was following the wrong example. I will make the change.

> 
> (pdeath_signal too, but it is too late to change (fix) this awkward API).
> 
>> +static void do_notify_parent_predump(struct task_struct *tsk)
>> +{
>> +struct sighand_struct *sighand;
>> +struct task_struct *parent;
>> +struct kernel_siginfo info;
>> +unsigned long flags;
>> +int sig;
>> +
>> +parent = tsk->real_parent;
> 
> So, debuggere won't be notified, only real_parent...
> 
>> +sig = parent->predump_signal;
> 
> probably ->predump_signal should be cleared on exec?
> 
>> +/* Check again with tasklist_lock" locked by the caller */
>> +if (!valid_predump_signal(sig))
>> +return;
> 
> I don't understand why we need valid_predump_signal() at all.
> 
>>  bool get_signal(struct ksignal *ksig)
>>  {
>>  struct sighand_struct *sighand = current->sighand;
>> @@ -2497,6 +2535,19 @@ bool get_signal(struct ksignal *ksig)
>>  current->flags |= PF_SIGNALED;
>>  
>>  if (sig_kernel_coredump(signr)) {
>> +/*
>> + * Notify the parent prior to the coredump if the
>> + * parent is interested in such a notificaiton.
>> + */
>> +int p_sig = current->real_parent->predump_signal;
>> +
>> +if (valid_predump_signal(p_sig)) {
>> +read_lock(_lock);
>> +do_notify_parent_predump(current);
>> +read_unlock(_lock);
>> +cond_resched();
> 
> perhaps this should be called by do_coredump() after coredump_wait() kills
> all the sub-threads?
> 
>> +static int prctl_set_predump_signal(struct task_struct *tsk, pid_t pid, int 
>> sig)
>> +{
>> +struct task_struct *p;
>> +int error;
>> +
>> +/* 0 is valid for disabling the feature */
>> +if (sig && !valid_predump_signal(sig))
>> +return -EINVAL;
>> +
>> +/* For the current task, the common case */
>> +if (pid == 0) {
>> +tsk->predump_signal = sig;
>> +return 0;
>> +}
>> +
>> +error = -ESRCH;
>> +rcu_read_lock();
>> +p = find_task_by_vpid(pid);
>> +if (p) {
>> +if (!set_predump_signal_perm(p))
>> +error = -EPERM;
>> +else {
>> +error = 0;
>> +p->predump_signal = sig;
>> +}
>> +}
>> +rcu_read_unlock();
>> +return error;
>> +}
> 
> Why? I mean, why do we really want to support the pid != 0 case?

I will remove it. Please see my reply to Jann.

Thanks.  -- Enke



Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Greg Kroah-Hartman
On Mon, Oct 15, 2018 at 11:49:11AM -0700, Enke Chen wrote:
> Hi, Greg:
> 
> On 10/15/18 11:43 AM, Greg Kroah-Hartman wrote:
> > On Mon, Oct 15, 2018 at 11:16:36AM -0700, Enke Chen wrote:
> >> Hi, Greg:
> >>
> >>> Shouldn't there also be a manpage update, and a kselftest added for this
> >>> new user/kernel api that is being created?
> >>>
> >>
> >> I will submit a patch for manpage update once the code is accepted.
> > 
> > Writing a manpage update is key to see if what you are describing
> > actually matches the code you have submitted.  You should do both at the
> > same time so that they can be reviewed together.
> 
> Ok, will do at the same time. But should I submit it as a separate patch?

Yes please.

> >> Regarding the kselftest, I am not sure.  Once the prctl() is limited to
> >> self (which I will do), the logic would be pretty straightforward. Not
> >> sure if the selftest would add much value.
> > 
> > If you do not have a test for this feature, how do you know it even
> > works at all?  How will you know if it breaks in a future kernel
> > release?  Have you tested this?  If so, how?
> 
> I have the test code. I am just not sure whether I should submit and check
> it in to the kselftest?

Of course you should, why wouldn't you?  Do you want to be the only
person in the world responsible for ensuring that this feature does not
break for the next 20+ years?  :)

thanks,

greg k-h


Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Greg Kroah-Hartman
On Mon, Oct 15, 2018 at 11:49:11AM -0700, Enke Chen wrote:
> Hi, Greg:
> 
> On 10/15/18 11:43 AM, Greg Kroah-Hartman wrote:
> > On Mon, Oct 15, 2018 at 11:16:36AM -0700, Enke Chen wrote:
> >> Hi, Greg:
> >>
> >>> Shouldn't there also be a manpage update, and a kselftest added for this
> >>> new user/kernel api that is being created?
> >>>
> >>
> >> I will submit a patch for manpage update once the code is accepted.
> > 
> > Writing a manpage update is key to see if what you are describing
> > actually matches the code you have submitted.  You should do both at the
> > same time so that they can be reviewed together.
> 
> Ok, will do at the same time. But should I submit it as a separate patch?

Yes please.

> >> Regarding the kselftest, I am not sure.  Once the prctl() is limited to
> >> self (which I will do), the logic would be pretty straightforward. Not
> >> sure if the selftest would add much value.
> > 
> > If you do not have a test for this feature, how do you know it even
> > works at all?  How will you know if it breaks in a future kernel
> > release?  Have you tested this?  If so, how?
> 
> I have the test code. I am just not sure whether I should submit and check
> it in to the kselftest?

Of course you should, why wouldn't you?  Do you want to be the only
person in the world responsible for ensuring that this feature does not
break for the next 20+ years?  :)

thanks,

greg k-h


Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Jann Horn
On Mon, Oct 15, 2018 at 8:36 PM Enke Chen  wrote:
> On 10/13/18 11:27 AM, Jann Horn wrote:
> > On Sat, Oct 13, 2018 at 2:33 AM Enke Chen  wrote:
> >> For simplicity and consistency, this patch provides an implementation
> >> for signal-based fault notification prior to the coredump of a child
> >> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
> >> be used by an application to express its interest and to specify the
> >> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
> >> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
> >
> > Your suggested API looks vaguely similar to PR_SET_PDEATHSIG, but with
> > some important differences:
> >
> >  - You don't reset the signal on setuid execution.
[...]
> >
> > For both of these: Are these differences actually necessary, and if
> > so, can you provide a specific rationale? From a security perspective,
> > I would very much prefer it if this API had semantics closer to
> > PR_SET_PDEATHSIG.
>
[...]
>
> Regarding the impact of "setuid", this property "PR_SET_PREDUMP_SIG" has to
> do with the application/process whether the signal handler is set for 
> receiving
> such a notification.  If it is set, the "uid" should not matter.

If an attacker's process first calls PR_SET_PREDUMP_SIG, then forks
off a child, then calls execve() on a setuid binary, the setuid binary
calls setuid(0), and the attacker-controlled child then crashes, the
privileged process will receive an unexpected signal that the attacker
wouldn't have been allowed to send otherwise. For similar reasons, the
parent death signal is reset when a setuid binary is executed:

void setup_new_exec(struct linux_binprm * bprm)
{
/*
 * Once here, prepare_binrpm() will not be called any more, so
 * the final state of setuid/setgid/fscaps can be merged into the
 * secureexec flag.
 */
bprm->secureexec |= bprm->cap_elevated;

if (bprm->secureexec) {
/* Make sure parent cannot signal privileged process. */
current->pdeath_signal = 0;
[...]
}
[...]
}

int commit_creds(struct cred *new)
{
[...]
/* dumpability changes */
if (!uid_eq(old->euid, new->euid) ||
!gid_eq(old->egid, new->egid) ||
!uid_eq(old->fsuid, new->fsuid) ||
!gid_eq(old->fsgid, new->fsgid) ||
!cred_cap_issubset(old, new)) {
if (task->mm)
set_dumpable(task->mm, suid_dumpable);
task->pdeath_signal = 0;
smp_wmb();
}
[...]
}

AppArmor and SELinux also do related changes:

static void apparmor_bprm_committing_creds(struct linux_binprm *bprm)
{
[...]
/* bail out if unconfined or not changing profile */
if ((new_label->proxy == label->proxy) ||
(unconfined(new_label)))
return;

aa_inherit_files(bprm->cred, current->files);

current->pdeath_signal = 0;
[...]
}

static void selinux_bprm_committing_creds(struct linux_binprm *bprm)
{
[...]
new_tsec = bprm->cred->security;
if (new_tsec->sid == new_tsec->osid)
return;

/* Close files for which the new task SID is not authorized. */
flush_unauthorized_files(bprm->cred, current->files);

/* Always clear parent death signal on SID transitions. */
current->pdeath_signal = 0;
[...]
}

You should probably reset the coredump signal in the same places - or
even better, add a new helper for resetting the parent death signal,
and then add code for resetting the coredump signal in there.


Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Jann Horn
On Mon, Oct 15, 2018 at 8:36 PM Enke Chen  wrote:
> On 10/13/18 11:27 AM, Jann Horn wrote:
> > On Sat, Oct 13, 2018 at 2:33 AM Enke Chen  wrote:
> >> For simplicity and consistency, this patch provides an implementation
> >> for signal-based fault notification prior to the coredump of a child
> >> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
> >> be used by an application to express its interest and to specify the
> >> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
> >> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
> >
> > Your suggested API looks vaguely similar to PR_SET_PDEATHSIG, but with
> > some important differences:
> >
> >  - You don't reset the signal on setuid execution.
[...]
> >
> > For both of these: Are these differences actually necessary, and if
> > so, can you provide a specific rationale? From a security perspective,
> > I would very much prefer it if this API had semantics closer to
> > PR_SET_PDEATHSIG.
>
[...]
>
> Regarding the impact of "setuid", this property "PR_SET_PREDUMP_SIG" has to
> do with the application/process whether the signal handler is set for 
> receiving
> such a notification.  If it is set, the "uid" should not matter.

If an attacker's process first calls PR_SET_PREDUMP_SIG, then forks
off a child, then calls execve() on a setuid binary, the setuid binary
calls setuid(0), and the attacker-controlled child then crashes, the
privileged process will receive an unexpected signal that the attacker
wouldn't have been allowed to send otherwise. For similar reasons, the
parent death signal is reset when a setuid binary is executed:

void setup_new_exec(struct linux_binprm * bprm)
{
/*
 * Once here, prepare_binrpm() will not be called any more, so
 * the final state of setuid/setgid/fscaps can be merged into the
 * secureexec flag.
 */
bprm->secureexec |= bprm->cap_elevated;

if (bprm->secureexec) {
/* Make sure parent cannot signal privileged process. */
current->pdeath_signal = 0;
[...]
}
[...]
}

int commit_creds(struct cred *new)
{
[...]
/* dumpability changes */
if (!uid_eq(old->euid, new->euid) ||
!gid_eq(old->egid, new->egid) ||
!uid_eq(old->fsuid, new->fsuid) ||
!gid_eq(old->fsgid, new->fsgid) ||
!cred_cap_issubset(old, new)) {
if (task->mm)
set_dumpable(task->mm, suid_dumpable);
task->pdeath_signal = 0;
smp_wmb();
}
[...]
}

AppArmor and SELinux also do related changes:

static void apparmor_bprm_committing_creds(struct linux_binprm *bprm)
{
[...]
/* bail out if unconfined or not changing profile */
if ((new_label->proxy == label->proxy) ||
(unconfined(new_label)))
return;

aa_inherit_files(bprm->cred, current->files);

current->pdeath_signal = 0;
[...]
}

static void selinux_bprm_committing_creds(struct linux_binprm *bprm)
{
[...]
new_tsec = bprm->cred->security;
if (new_tsec->sid == new_tsec->osid)
return;

/* Close files for which the new task SID is not authorized. */
flush_unauthorized_files(bprm->cred, current->files);

/* Always clear parent death signal on SID transitions. */
current->pdeath_signal = 0;
[...]
}

You should probably reset the coredump signal in the same places - or
even better, add a new helper for resetting the parent death signal,
and then add code for resetting the coredump signal in there.


Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Enke Chen
Hi, Greg:

On 10/15/18 11:43 AM, Greg Kroah-Hartman wrote:
> On Mon, Oct 15, 2018 at 11:16:36AM -0700, Enke Chen wrote:
>> Hi, Greg:
>>
>>> Shouldn't there also be a manpage update, and a kselftest added for this
>>> new user/kernel api that is being created?
>>>
>>
>> I will submit a patch for manpage update once the code is accepted.
> 
> Writing a manpage update is key to see if what you are describing
> actually matches the code you have submitted.  You should do both at the
> same time so that they can be reviewed together.

Ok, will do at the same time. But should I submit it as a separate patch?

> 
>> Regarding the kselftest, I am not sure.  Once the prctl() is limited to
>> self (which I will do), the logic would be pretty straightforward. Not
>> sure if the selftest would add much value.
> 
> If you do not have a test for this feature, how do you know it even
> works at all?  How will you know if it breaks in a future kernel
> release?  Have you tested this?  If so, how?

I have the test code. I am just not sure whether I should submit and check
it in to the kselftest?

Thanks.  -- Enke



Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Enke Chen
Hi, Greg:

On 10/15/18 11:43 AM, Greg Kroah-Hartman wrote:
> On Mon, Oct 15, 2018 at 11:16:36AM -0700, Enke Chen wrote:
>> Hi, Greg:
>>
>>> Shouldn't there also be a manpage update, and a kselftest added for this
>>> new user/kernel api that is being created?
>>>
>>
>> I will submit a patch for manpage update once the code is accepted.
> 
> Writing a manpage update is key to see if what you are describing
> actually matches the code you have submitted.  You should do both at the
> same time so that they can be reviewed together.

Ok, will do at the same time. But should I submit it as a separate patch?

> 
>> Regarding the kselftest, I am not sure.  Once the prctl() is limited to
>> self (which I will do), the logic would be pretty straightforward. Not
>> sure if the selftest would add much value.
> 
> If you do not have a test for this feature, how do you know it even
> works at all?  How will you know if it breaks in a future kernel
> release?  Have you tested this?  If so, how?

I have the test code. I am just not sure whether I should submit and check
it in to the kselftest?

Thanks.  -- Enke



Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Enke Chen
Hi, Jann:

Thanks a lot for you detailed review. Please see my replied/comments inline.

On 10/13/18 11:27 AM, Jann Horn wrote:
> On Sat, Oct 13, 2018 at 2:33 AM Enke Chen  wrote:
>> For simplicity and consistency, this patch provides an implementation
>> for signal-based fault notification prior to the coredump of a child
>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>> be used by an application to express its interest and to specify the
>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
> 
> Your suggested API looks vaguely similar to PR_SET_PDEATHSIG, but with
> some important differences:
> 
>  - You don't reset the signal on setuid execution.
>  - You permit setting this not just on the current process, but also on 
> others.
> 
> For both of these: Are these differences actually necessary, and if
> so, can you provide a specific rationale? From a security perspective,
> I would very much prefer it if this API had semantics closer to
> PR_SET_PDEATHSIG.

Regarding setting on others, I started with setting for self. But there is
a requirement for supporting the feature for a process manager written in
bash script. That's the reason for allowing the setting on others.

Given the feedback from you and others, I agree that it would be simpler and
more secure to remove the setting on others. We can submit a patch for bash
to support the setting natively.

Regarding the impact of "setuid", this property "PR_SET_PREDUMP_SIG" has to
do with the application/process whether the signal handler is set for receiving
such a notification.  If it is set, the "uid" should not matter.

> 
> [...]
>> diff --git a/kernel/signal.c b/kernel/signal.c
>> index 312b43e..eb4a483 100644
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -2337,6 +2337,44 @@ static int ptrace_signal(int signr, kernel_siginfo_t 
>> *info)
>> return signr;
>>  }
>>
>> +/*
>> + * Let the parent, if so desired, know about the imminent death of a child
>> + * prior to its coredump.
>> + *
>> + * Locking logic is similar to do_notify_parent_cldstop().
>> + */
>> +static void do_notify_parent_predump(struct task_struct *tsk)
>> +{
>> +   struct sighand_struct *sighand;
>> +   struct task_struct *parent;
>> +   struct kernel_siginfo info;
>> +   unsigned long flags;
>> +   int sig;
>> +
>> +   parent = tsk->real_parent;
>> +   sig = parent->predump_signal;
>> +
>> +   /* Check again with "tasklist_lock" locked by the caller */
>> +   if (!valid_predump_signal(sig))
>> +   return;
>> +
>> +   clear_siginfo();
>> +   info.si_signo = sig;
>> +   if (sig == SIGCHLD)
>> +   info.si_code = CLD_PREDUMP;
>> +
>> +   rcu_read_lock();
>> +   info.si_pid = task_pid_nr_ns(tsk, task_active_pid_ns(parent));
>> +   info.si_uid = from_kuid_munged(task_cred_xxx(parent, user_ns),
>> +  task_uid(tsk));
> 
> You're sending a signal from the current namespaces, but with IDs that
> have been mapped into the parent's namespaces? That looks wrong to me.

I am following the example "do_notify_parent_cldstop()" called in the same
routine "get_signal()". If there is a better way, sure I will use it.

> 
>> +   rcu_read_unlock();
>> +
>> +   sighand = parent->sighand;
>> +   spin_lock_irqsave(>siglock, flags);
>> +   __group_send_sig_info(sig, , parent);
>> +   spin_unlock_irqrestore(>siglock, flags);
>> +}
>> +
>>  bool get_signal(struct ksignal *ksig)
>>  {
>> struct sighand_struct *sighand = current->sighand;
>> @@ -2497,6 +2535,19 @@ bool get_signal(struct ksignal *ksig)
>> current->flags |= PF_SIGNALED;
>>
>> if (sig_kernel_coredump(signr)) {
>> +   /*
>> +* Notify the parent prior to the coredump if the
>> +* parent is interested in such a notificaiton.
>> +*/
>> +   int p_sig = current->real_parent->predump_signal;
> 
> current->real_parent is an __rcu member. I think if you run the sparse
> checker against this patch, it's going to complain. Are you allowed to
> access current->real_parent in this context?

Let me check, and get back to you on this one.

> 
>> +   if (valid_predump_signal(p_sig)) {
>> +   read_lock(_lock);
>> +   do_notify_parent_predump(current);
>> +   read_unlock(_lock);
>> +   cond_resched();
>> +   }
>> +
>> if (print_fatal_signals)
>> print_fatal_signal(ksig->info.si_signo);
>> proc_coredump_connector(current);
>> diff --git a/kernel/sys.c b/kernel/sys.c
>> index 123bd73..43eb250d 100644
>> --- a/kernel/sys.c

Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Enke Chen
Hi, Jann:

Thanks a lot for you detailed review. Please see my replied/comments inline.

On 10/13/18 11:27 AM, Jann Horn wrote:
> On Sat, Oct 13, 2018 at 2:33 AM Enke Chen  wrote:
>> For simplicity and consistency, this patch provides an implementation
>> for signal-based fault notification prior to the coredump of a child
>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>> be used by an application to express its interest and to specify the
>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
> 
> Your suggested API looks vaguely similar to PR_SET_PDEATHSIG, but with
> some important differences:
> 
>  - You don't reset the signal on setuid execution.
>  - You permit setting this not just on the current process, but also on 
> others.
> 
> For both of these: Are these differences actually necessary, and if
> so, can you provide a specific rationale? From a security perspective,
> I would very much prefer it if this API had semantics closer to
> PR_SET_PDEATHSIG.

Regarding setting on others, I started with setting for self. But there is
a requirement for supporting the feature for a process manager written in
bash script. That's the reason for allowing the setting on others.

Given the feedback from you and others, I agree that it would be simpler and
more secure to remove the setting on others. We can submit a patch for bash
to support the setting natively.

Regarding the impact of "setuid", this property "PR_SET_PREDUMP_SIG" has to
do with the application/process whether the signal handler is set for receiving
such a notification.  If it is set, the "uid" should not matter.

> 
> [...]
>> diff --git a/kernel/signal.c b/kernel/signal.c
>> index 312b43e..eb4a483 100644
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -2337,6 +2337,44 @@ static int ptrace_signal(int signr, kernel_siginfo_t 
>> *info)
>> return signr;
>>  }
>>
>> +/*
>> + * Let the parent, if so desired, know about the imminent death of a child
>> + * prior to its coredump.
>> + *
>> + * Locking logic is similar to do_notify_parent_cldstop().
>> + */
>> +static void do_notify_parent_predump(struct task_struct *tsk)
>> +{
>> +   struct sighand_struct *sighand;
>> +   struct task_struct *parent;
>> +   struct kernel_siginfo info;
>> +   unsigned long flags;
>> +   int sig;
>> +
>> +   parent = tsk->real_parent;
>> +   sig = parent->predump_signal;
>> +
>> +   /* Check again with "tasklist_lock" locked by the caller */
>> +   if (!valid_predump_signal(sig))
>> +   return;
>> +
>> +   clear_siginfo();
>> +   info.si_signo = sig;
>> +   if (sig == SIGCHLD)
>> +   info.si_code = CLD_PREDUMP;
>> +
>> +   rcu_read_lock();
>> +   info.si_pid = task_pid_nr_ns(tsk, task_active_pid_ns(parent));
>> +   info.si_uid = from_kuid_munged(task_cred_xxx(parent, user_ns),
>> +  task_uid(tsk));
> 
> You're sending a signal from the current namespaces, but with IDs that
> have been mapped into the parent's namespaces? That looks wrong to me.

I am following the example "do_notify_parent_cldstop()" called in the same
routine "get_signal()". If there is a better way, sure I will use it.

> 
>> +   rcu_read_unlock();
>> +
>> +   sighand = parent->sighand;
>> +   spin_lock_irqsave(>siglock, flags);
>> +   __group_send_sig_info(sig, , parent);
>> +   spin_unlock_irqrestore(>siglock, flags);
>> +}
>> +
>>  bool get_signal(struct ksignal *ksig)
>>  {
>> struct sighand_struct *sighand = current->sighand;
>> @@ -2497,6 +2535,19 @@ bool get_signal(struct ksignal *ksig)
>> current->flags |= PF_SIGNALED;
>>
>> if (sig_kernel_coredump(signr)) {
>> +   /*
>> +* Notify the parent prior to the coredump if the
>> +* parent is interested in such a notificaiton.
>> +*/
>> +   int p_sig = current->real_parent->predump_signal;
> 
> current->real_parent is an __rcu member. I think if you run the sparse
> checker against this patch, it's going to complain. Are you allowed to
> access current->real_parent in this context?

Let me check, and get back to you on this one.

> 
>> +   if (valid_predump_signal(p_sig)) {
>> +   read_lock(_lock);
>> +   do_notify_parent_predump(current);
>> +   read_unlock(_lock);
>> +   cond_resched();
>> +   }
>> +
>> if (print_fatal_signals)
>> print_fatal_signal(ksig->info.si_signo);
>> proc_coredump_connector(current);
>> diff --git a/kernel/sys.c b/kernel/sys.c
>> index 123bd73..43eb250d 100644
>> --- a/kernel/sys.c

Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Greg Kroah-Hartman
On Mon, Oct 15, 2018 at 11:16:36AM -0700, Enke Chen wrote:
> Hi, Greg:
> 
> > Shouldn't there also be a manpage update, and a kselftest added for this
> > new user/kernel api that is being created?
> > 
> 
> I will submit a patch for manpage update once the code is accepted.

Writing a manpage update is key to see if what you are describing
actually matches the code you have submitted.  You should do both at the
same time so that they can be reviewed together.

> Regarding the kselftest, I am not sure.  Once the prctl() is limited to
> self (which I will do), the logic would be pretty straightforward. Not
> sure if the selftest would add much value.

If you do not have a test for this feature, how do you know it even
works at all?  How will you know if it breaks in a future kernel
release?  Have you tested this?  If so, how?

thanks,

greg k-h


Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Greg Kroah-Hartman
On Mon, Oct 15, 2018 at 11:16:36AM -0700, Enke Chen wrote:
> Hi, Greg:
> 
> > Shouldn't there also be a manpage update, and a kselftest added for this
> > new user/kernel api that is being created?
> > 
> 
> I will submit a patch for manpage update once the code is accepted.

Writing a manpage update is key to see if what you are describing
actually matches the code you have submitted.  You should do both at the
same time so that they can be reviewed together.

> Regarding the kselftest, I am not sure.  Once the prctl() is limited to
> self (which I will do), the logic would be pretty straightforward. Not
> sure if the selftest would add much value.

If you do not have a test for this feature, how do you know it even
works at all?  How will you know if it breaks in a future kernel
release?  Have you tested this?  If so, how?

thanks,

greg k-h


Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Enke Chen
Hi, Christian:

As I replied to Jann, I will remove the code that does the setting on others
to make the code simpler and more secure.

Thanks.  -- Enke

>> +static bool set_predump_signal_perm(struct task_struct *p)
>> +{
>> +const struct cred *cred = current_cred(), *pcred = __task_cred(p);
>> +
>> +return uid_eq(pcred->uid, cred->euid) ||
>> +   uid_eq(pcred->euid, cred->euid) ||
>> +   capable(CAP_SYS_ADMIN);
> 
> So before proceeding I'd like to discuss at least two points:
> - how does this interact with the dumpability of a process?
> - do we need the capable(CAP_SYS_ADMIN) restriction to init_user_ns?
>   Seems we could make this work per-user-ns just like
>   PRCTL_SET_PDEATHSIG does?
> 
>> +}



Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Enke Chen
Hi, Christian:

As I replied to Jann, I will remove the code that does the setting on others
to make the code simpler and more secure.

Thanks.  -- Enke

>> +static bool set_predump_signal_perm(struct task_struct *p)
>> +{
>> +const struct cred *cred = current_cred(), *pcred = __task_cred(p);
>> +
>> +return uid_eq(pcred->uid, cred->euid) ||
>> +   uid_eq(pcred->euid, cred->euid) ||
>> +   capable(CAP_SYS_ADMIN);
> 
> So before proceeding I'd like to discuss at least two points:
> - how does this interact with the dumpability of a process?
> - do we need the capable(CAP_SYS_ADMIN) restriction to init_user_ns?
>   Seems we could make this work per-user-ns just like
>   PRCTL_SET_PDEATHSIG does?
> 
>> +}



Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Enke Chen
Hi, Greg:

> Shouldn't there also be a manpage update, and a kselftest added for this
> new user/kernel api that is being created?
> 

I will submit a patch for manpage update once the code is accepted.

Regarding the kselftest, I am not sure.  Once the prctl() is limited to
self (which I will do), the logic would be pretty straightforward. Not
sure if the selftest would add much value.

Thanks.  -- Enke

On 10/12/18 11:40 PM, Greg Kroah-Hartman wrote:
> On Fri, Oct 12, 2018 at 05:33:35PM -0700, Enke Chen wrote:
>> For simplicity and consistency, this patch provides an implementation
>> for signal-based fault notification prior to the coredump of a child
>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>> be used by an application to express its interest and to specify the
>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>>
>> Background:
>>
>> As the coredump of a process may take time, in certain time-sensitive
>> applications it is necessary for a parent process (e.g., a process
>> manager) to be notified of a child's imminent death before the coredump
>> so that the parent process can act sooner, such as re-spawning an
>> application process, or initiating a control-plane fail-over.
>>
>> Currently there are two ways for a parent process to be notified of a
>> child process's state change. One is to use the POSIX signal, and
>> another is to use the kernel connector module. The specific events and
>> actions are summarized as follows:
>>
>> Process EventPOSIX SignalConnector-based
>> --
>> ptrace_attach()  do_notify_parent_cldstop()  proc_ptrace_connector()
>>  SIGCHLD / CLD_STOPPED
>>
>> ptrace_detach()  do_notify_parent_cldstop()  proc_ptrace_connector()
>>  SIGCHLD / CLD_CONTINUED
>>
>> pre_coredump/N/A proc_coredump_connector()
>> get_signal()
>>
>> post_coredump/   do_notify_parent()  proc_exit_connector()
>> do_exit()SIGCHLD / exit_signal
>> --
>>
>> As shown in the table, the signal-based pre-coredump notification is not
>> currently available. In some cases using a connector-based notification
>> can be quite complicated (e.g., when a process manager is written in shell
>> scripts and thus is subject to certain inherent limitations), and a
>> signal-based notification would be simpler and better suited.
>>
>> Signed-off-by: Enke Chen 
>> ---
>>  arch/x86/kernel/signal_compat.c|  2 +-
>>  include/linux/sched.h  |  4 ++
>>  include/linux/signal.h |  5 +++
>>  include/uapi/asm-generic/siginfo.h |  3 +-
>>  include/uapi/linux/prctl.h |  4 ++
>>  kernel/fork.c  |  1 +
>>  kernel/signal.c| 51 +
>>  kernel/sys.c   | 77 
>> ++
>>  8 files changed, 145 insertions(+), 2 deletions(-)
> 
> Shouldn't there also be a manpage update, and a kselftest added for this
> new user/kernel api that is being created?
> 
> thanks,
> 
> greg k-h
> 


Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Enke Chen
Hi, Greg:

> Shouldn't there also be a manpage update, and a kselftest added for this
> new user/kernel api that is being created?
> 

I will submit a patch for manpage update once the code is accepted.

Regarding the kselftest, I am not sure.  Once the prctl() is limited to
self (which I will do), the logic would be pretty straightforward. Not
sure if the selftest would add much value.

Thanks.  -- Enke

On 10/12/18 11:40 PM, Greg Kroah-Hartman wrote:
> On Fri, Oct 12, 2018 at 05:33:35PM -0700, Enke Chen wrote:
>> For simplicity and consistency, this patch provides an implementation
>> for signal-based fault notification prior to the coredump of a child
>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>> be used by an application to express its interest and to specify the
>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>>
>> Background:
>>
>> As the coredump of a process may take time, in certain time-sensitive
>> applications it is necessary for a parent process (e.g., a process
>> manager) to be notified of a child's imminent death before the coredump
>> so that the parent process can act sooner, such as re-spawning an
>> application process, or initiating a control-plane fail-over.
>>
>> Currently there are two ways for a parent process to be notified of a
>> child process's state change. One is to use the POSIX signal, and
>> another is to use the kernel connector module. The specific events and
>> actions are summarized as follows:
>>
>> Process EventPOSIX SignalConnector-based
>> --
>> ptrace_attach()  do_notify_parent_cldstop()  proc_ptrace_connector()
>>  SIGCHLD / CLD_STOPPED
>>
>> ptrace_detach()  do_notify_parent_cldstop()  proc_ptrace_connector()
>>  SIGCHLD / CLD_CONTINUED
>>
>> pre_coredump/N/A proc_coredump_connector()
>> get_signal()
>>
>> post_coredump/   do_notify_parent()  proc_exit_connector()
>> do_exit()SIGCHLD / exit_signal
>> --
>>
>> As shown in the table, the signal-based pre-coredump notification is not
>> currently available. In some cases using a connector-based notification
>> can be quite complicated (e.g., when a process manager is written in shell
>> scripts and thus is subject to certain inherent limitations), and a
>> signal-based notification would be simpler and better suited.
>>
>> Signed-off-by: Enke Chen 
>> ---
>>  arch/x86/kernel/signal_compat.c|  2 +-
>>  include/linux/sched.h  |  4 ++
>>  include/linux/signal.h |  5 +++
>>  include/uapi/asm-generic/siginfo.h |  3 +-
>>  include/uapi/linux/prctl.h |  4 ++
>>  kernel/fork.c  |  1 +
>>  kernel/signal.c| 51 +
>>  kernel/sys.c   | 77 
>> ++
>>  8 files changed, 145 insertions(+), 2 deletions(-)
> 
> Shouldn't there also be a manpage update, and a kselftest added for this
> new user/kernel api that is being created?
> 
> thanks,
> 
> greg k-h
> 


Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Oleg Nesterov
On 10/12, Enke Chen wrote:
>
> For simplicity and consistency, this patch provides an implementation
> for signal-based fault notification prior to the coredump of a child
> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
> be used by an application to express its interest and to specify the
> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.

To be honest, I can't say I like this new feature...

> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -696,6 +696,10 @@ struct task_struct {
>   int exit_signal;
>   /* The signal sent when the parent dies: */
>   int pdeath_signal;
> +
> + /* The signal sent prior to a child's coredump: */
> + int predump_signal;
> +

At least, I think predump_signal should live in signal_struct, not
task_struct.

(pdeath_signal too, but it is too late to change (fix) this awkward API).

> +static void do_notify_parent_predump(struct task_struct *tsk)
> +{
> + struct sighand_struct *sighand;
> + struct task_struct *parent;
> + struct kernel_siginfo info;
> + unsigned long flags;
> + int sig;
> +
> + parent = tsk->real_parent;

So, debuggere won't be notified, only real_parent...

> + sig = parent->predump_signal;

probably ->predump_signal should be cleared on exec?

> + /* Check again with tasklist_lock" locked by the caller */
> + if (!valid_predump_signal(sig))
> + return;

I don't understand why we need valid_predump_signal() at all.

>  bool get_signal(struct ksignal *ksig)
>  {
>   struct sighand_struct *sighand = current->sighand;
> @@ -2497,6 +2535,19 @@ bool get_signal(struct ksignal *ksig)
>   current->flags |= PF_SIGNALED;
>  
>   if (sig_kernel_coredump(signr)) {
> + /*
> +  * Notify the parent prior to the coredump if the
> +  * parent is interested in such a notificaiton.
> +  */
> + int p_sig = current->real_parent->predump_signal;
> +
> + if (valid_predump_signal(p_sig)) {
> + read_lock(_lock);
> + do_notify_parent_predump(current);
> + read_unlock(_lock);
> + cond_resched();

perhaps this should be called by do_coredump() after coredump_wait() kills
all the sub-threads?

> +static int prctl_set_predump_signal(struct task_struct *tsk, pid_t pid, int 
> sig)
> +{
> + struct task_struct *p;
> + int error;
> +
> + /* 0 is valid for disabling the feature */
> + if (sig && !valid_predump_signal(sig))
> + return -EINVAL;
> +
> + /* For the current task, the common case */
> + if (pid == 0) {
> + tsk->predump_signal = sig;
> + return 0;
> + }
> +
> + error = -ESRCH;
> + rcu_read_lock();
> + p = find_task_by_vpid(pid);
> + if (p) {
> + if (!set_predump_signal_perm(p))
> + error = -EPERM;
> + else {
> + error = 0;
> + p->predump_signal = sig;
> + }
> + }
> + rcu_read_unlock();
> + return error;
> +}

Why? I mean, why do we really want to support the pid != 0 case?

Oleg.



Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Oleg Nesterov
On 10/12, Enke Chen wrote:
>
> For simplicity and consistency, this patch provides an implementation
> for signal-based fault notification prior to the coredump of a child
> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
> be used by an application to express its interest and to specify the
> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.

To be honest, I can't say I like this new feature...

> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -696,6 +696,10 @@ struct task_struct {
>   int exit_signal;
>   /* The signal sent when the parent dies: */
>   int pdeath_signal;
> +
> + /* The signal sent prior to a child's coredump: */
> + int predump_signal;
> +

At least, I think predump_signal should live in signal_struct, not
task_struct.

(pdeath_signal too, but it is too late to change (fix) this awkward API).

> +static void do_notify_parent_predump(struct task_struct *tsk)
> +{
> + struct sighand_struct *sighand;
> + struct task_struct *parent;
> + struct kernel_siginfo info;
> + unsigned long flags;
> + int sig;
> +
> + parent = tsk->real_parent;

So, debuggere won't be notified, only real_parent...

> + sig = parent->predump_signal;

probably ->predump_signal should be cleared on exec?

> + /* Check again with tasklist_lock" locked by the caller */
> + if (!valid_predump_signal(sig))
> + return;

I don't understand why we need valid_predump_signal() at all.

>  bool get_signal(struct ksignal *ksig)
>  {
>   struct sighand_struct *sighand = current->sighand;
> @@ -2497,6 +2535,19 @@ bool get_signal(struct ksignal *ksig)
>   current->flags |= PF_SIGNALED;
>  
>   if (sig_kernel_coredump(signr)) {
> + /*
> +  * Notify the parent prior to the coredump if the
> +  * parent is interested in such a notificaiton.
> +  */
> + int p_sig = current->real_parent->predump_signal;
> +
> + if (valid_predump_signal(p_sig)) {
> + read_lock(_lock);
> + do_notify_parent_predump(current);
> + read_unlock(_lock);
> + cond_resched();

perhaps this should be called by do_coredump() after coredump_wait() kills
all the sub-threads?

> +static int prctl_set_predump_signal(struct task_struct *tsk, pid_t pid, int 
> sig)
> +{
> + struct task_struct *p;
> + int error;
> +
> + /* 0 is valid for disabling the feature */
> + if (sig && !valid_predump_signal(sig))
> + return -EINVAL;
> +
> + /* For the current task, the common case */
> + if (pid == 0) {
> + tsk->predump_signal = sig;
> + return 0;
> + }
> +
> + error = -ESRCH;
> + rcu_read_lock();
> + p = find_task_by_vpid(pid);
> + if (p) {
> + if (!set_predump_signal_perm(p))
> + error = -EPERM;
> + else {
> + error = 0;
> + p->predump_signal = sig;
> + }
> + }
> + rcu_read_unlock();
> + return error;
> +}

Why? I mean, why do we really want to support the pid != 0 case?

Oleg.



Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-13 Thread Jann Horn
On Sat, Oct 13, 2018 at 2:33 AM Enke Chen  wrote:
> For simplicity and consistency, this patch provides an implementation
> for signal-based fault notification prior to the coredump of a child
> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
> be used by an application to express its interest and to specify the
> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.

Your suggested API looks vaguely similar to PR_SET_PDEATHSIG, but with
some important differences:

 - You don't reset the signal on setuid execution.
 - You permit setting this not just on the current process, but also on others.

For both of these: Are these differences actually necessary, and if
so, can you provide a specific rationale? From a security perspective,
I would very much prefer it if this API had semantics closer to
PR_SET_PDEATHSIG.

[...]
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 312b43e..eb4a483 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2337,6 +2337,44 @@ static int ptrace_signal(int signr, kernel_siginfo_t 
> *info)
> return signr;
>  }
>
> +/*
> + * Let the parent, if so desired, know about the imminent death of a child
> + * prior to its coredump.
> + *
> + * Locking logic is similar to do_notify_parent_cldstop().
> + */
> +static void do_notify_parent_predump(struct task_struct *tsk)
> +{
> +   struct sighand_struct *sighand;
> +   struct task_struct *parent;
> +   struct kernel_siginfo info;
> +   unsigned long flags;
> +   int sig;
> +
> +   parent = tsk->real_parent;
> +   sig = parent->predump_signal;
> +
> +   /* Check again with "tasklist_lock" locked by the caller */
> +   if (!valid_predump_signal(sig))
> +   return;
> +
> +   clear_siginfo();
> +   info.si_signo = sig;
> +   if (sig == SIGCHLD)
> +   info.si_code = CLD_PREDUMP;
> +
> +   rcu_read_lock();
> +   info.si_pid = task_pid_nr_ns(tsk, task_active_pid_ns(parent));
> +   info.si_uid = from_kuid_munged(task_cred_xxx(parent, user_ns),
> +  task_uid(tsk));

You're sending a signal from the current namespaces, but with IDs that
have been mapped into the parent's namespaces? That looks wrong to me.

> +   rcu_read_unlock();
> +
> +   sighand = parent->sighand;
> +   spin_lock_irqsave(>siglock, flags);
> +   __group_send_sig_info(sig, , parent);
> +   spin_unlock_irqrestore(>siglock, flags);
> +}
> +
>  bool get_signal(struct ksignal *ksig)
>  {
> struct sighand_struct *sighand = current->sighand;
> @@ -2497,6 +2535,19 @@ bool get_signal(struct ksignal *ksig)
> current->flags |= PF_SIGNALED;
>
> if (sig_kernel_coredump(signr)) {
> +   /*
> +* Notify the parent prior to the coredump if the
> +* parent is interested in such a notificaiton.
> +*/
> +   int p_sig = current->real_parent->predump_signal;

current->real_parent is an __rcu member. I think if you run the sparse
checker against this patch, it's going to complain. Are you allowed to
access current->real_parent in this context?

> +   if (valid_predump_signal(p_sig)) {
> +   read_lock(_lock);
> +   do_notify_parent_predump(current);
> +   read_unlock(_lock);
> +   cond_resched();
> +   }
> +
> if (print_fatal_signals)
> print_fatal_signal(ksig->info.si_signo);
> proc_coredump_connector(current);
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 123bd73..43eb250d 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2258,6 +2258,76 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct 
> *t, unsigned long which,
> return -EINVAL;
>  }
>
> +static int prctl_get_predump_signal(struct task_struct *tsk, pid_t pid,
> +   int __user *addr)
> +{
> +   struct task_struct *p;
> +   int error;
> +
> +   /* For the current task, the common case */
> +   if (pid == 0) {
> +   put_user(tsk->predump_signal, addr);
> +   return 0;
> +   }
> +
> +   error = -ESRCH;
> +   rcu_read_lock();
> +   p = find_task_by_vpid(pid);
> +   if (p) {
> +   error = 0;
> +   put_user(p->predump_signal, addr);

This is wrong. You can't call put_user() while you're in an RCU
read-side critical section.

As below, I would like it if you could just get rid of this branch,
and restrict this prctl to operating on the current process.

> +   }
> +   rcu_read_unlock();
> +   return error;
> +}
> +
> +/*
> + * Returns true if current's euid is same 

Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-13 Thread Jann Horn
On Sat, Oct 13, 2018 at 2:33 AM Enke Chen  wrote:
> For simplicity and consistency, this patch provides an implementation
> for signal-based fault notification prior to the coredump of a child
> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
> be used by an application to express its interest and to specify the
> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.

Your suggested API looks vaguely similar to PR_SET_PDEATHSIG, but with
some important differences:

 - You don't reset the signal on setuid execution.
 - You permit setting this not just on the current process, but also on others.

For both of these: Are these differences actually necessary, and if
so, can you provide a specific rationale? From a security perspective,
I would very much prefer it if this API had semantics closer to
PR_SET_PDEATHSIG.

[...]
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 312b43e..eb4a483 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2337,6 +2337,44 @@ static int ptrace_signal(int signr, kernel_siginfo_t 
> *info)
> return signr;
>  }
>
> +/*
> + * Let the parent, if so desired, know about the imminent death of a child
> + * prior to its coredump.
> + *
> + * Locking logic is similar to do_notify_parent_cldstop().
> + */
> +static void do_notify_parent_predump(struct task_struct *tsk)
> +{
> +   struct sighand_struct *sighand;
> +   struct task_struct *parent;
> +   struct kernel_siginfo info;
> +   unsigned long flags;
> +   int sig;
> +
> +   parent = tsk->real_parent;
> +   sig = parent->predump_signal;
> +
> +   /* Check again with "tasklist_lock" locked by the caller */
> +   if (!valid_predump_signal(sig))
> +   return;
> +
> +   clear_siginfo();
> +   info.si_signo = sig;
> +   if (sig == SIGCHLD)
> +   info.si_code = CLD_PREDUMP;
> +
> +   rcu_read_lock();
> +   info.si_pid = task_pid_nr_ns(tsk, task_active_pid_ns(parent));
> +   info.si_uid = from_kuid_munged(task_cred_xxx(parent, user_ns),
> +  task_uid(tsk));

You're sending a signal from the current namespaces, but with IDs that
have been mapped into the parent's namespaces? That looks wrong to me.

> +   rcu_read_unlock();
> +
> +   sighand = parent->sighand;
> +   spin_lock_irqsave(>siglock, flags);
> +   __group_send_sig_info(sig, , parent);
> +   spin_unlock_irqrestore(>siglock, flags);
> +}
> +
>  bool get_signal(struct ksignal *ksig)
>  {
> struct sighand_struct *sighand = current->sighand;
> @@ -2497,6 +2535,19 @@ bool get_signal(struct ksignal *ksig)
> current->flags |= PF_SIGNALED;
>
> if (sig_kernel_coredump(signr)) {
> +   /*
> +* Notify the parent prior to the coredump if the
> +* parent is interested in such a notificaiton.
> +*/
> +   int p_sig = current->real_parent->predump_signal;

current->real_parent is an __rcu member. I think if you run the sparse
checker against this patch, it's going to complain. Are you allowed to
access current->real_parent in this context?

> +   if (valid_predump_signal(p_sig)) {
> +   read_lock(_lock);
> +   do_notify_parent_predump(current);
> +   read_unlock(_lock);
> +   cond_resched();
> +   }
> +
> if (print_fatal_signals)
> print_fatal_signal(ksig->info.si_signo);
> proc_coredump_connector(current);
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 123bd73..43eb250d 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2258,6 +2258,76 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct 
> *t, unsigned long which,
> return -EINVAL;
>  }
>
> +static int prctl_get_predump_signal(struct task_struct *tsk, pid_t pid,
> +   int __user *addr)
> +{
> +   struct task_struct *p;
> +   int error;
> +
> +   /* For the current task, the common case */
> +   if (pid == 0) {
> +   put_user(tsk->predump_signal, addr);
> +   return 0;
> +   }
> +
> +   error = -ESRCH;
> +   rcu_read_lock();
> +   p = find_task_by_vpid(pid);
> +   if (p) {
> +   error = 0;
> +   put_user(p->predump_signal, addr);

This is wrong. You can't call put_user() while you're in an RCU
read-side critical section.

As below, I would like it if you could just get rid of this branch,
and restrict this prctl to operating on the current process.

> +   }
> +   rcu_read_unlock();
> +   return error;
> +}
> +
> +/*
> + * Returns true if current's euid is same 

Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-13 Thread Christian Brauner
On Fri, Oct 12, 2018 at 05:33:35PM -0700, Enke Chen wrote:
> For simplicity and consistency, this patch provides an implementation
> for signal-based fault notification prior to the coredump of a child
> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
> be used by an application to express its interest and to specify the
> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
> 
> Background:
> 
> As the coredump of a process may take time, in certain time-sensitive
> applications it is necessary for a parent process (e.g., a process
> manager) to be notified of a child's imminent death before the coredump
> so that the parent process can act sooner, such as re-spawning an
> application process, or initiating a control-plane fail-over.
> 
> Currently there are two ways for a parent process to be notified of a
> child process's state change. One is to use the POSIX signal, and
> another is to use the kernel connector module. The specific events and
> actions are summarized as follows:
> 
> Process EventPOSIX SignalConnector-based
> --
> ptrace_attach()  do_notify_parent_cldstop()  proc_ptrace_connector()
>  SIGCHLD / CLD_STOPPED
> 
> ptrace_detach()  do_notify_parent_cldstop()  proc_ptrace_connector()
>  SIGCHLD / CLD_CONTINUED
> 
> pre_coredump/N/A proc_coredump_connector()
> get_signal()
> 
> post_coredump/   do_notify_parent()  proc_exit_connector()
> do_exit()SIGCHLD / exit_signal
> --
> 
> As shown in the table, the signal-based pre-coredump notification is not
> currently available. In some cases using a connector-based notification
> can be quite complicated (e.g., when a process manager is written in shell
> scripts and thus is subject to certain inherent limitations), and a
> signal-based notification would be simpler and better suited.
> 
> Signed-off-by: Enke Chen 
> ---
>  arch/x86/kernel/signal_compat.c|  2 +-
>  include/linux/sched.h  |  4 ++
>  include/linux/signal.h |  5 +++
>  include/uapi/asm-generic/siginfo.h |  3 +-
>  include/uapi/linux/prctl.h |  4 ++
>  kernel/fork.c  |  1 +
>  kernel/signal.c| 51 +
>  kernel/sys.c   | 77 
> ++
>  8 files changed, 145 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
> index 9ccbf05..a3deba8 100644
> --- a/arch/x86/kernel/signal_compat.c
> +++ b/arch/x86/kernel/signal_compat.c
> @@ -30,7 +30,7 @@ static inline void signal_compat_build_tests(void)
>   BUILD_BUG_ON(NSIGSEGV != 7);
>   BUILD_BUG_ON(NSIGBUS  != 5);
>   BUILD_BUG_ON(NSIGTRAP != 5);
> - BUILD_BUG_ON(NSIGCHLD != 6);
> + BUILD_BUG_ON(NSIGCHLD != 7);
>   BUILD_BUG_ON(NSIGSYS  != 1);
>  
>   /* This is part of the ABI and can never change in size: */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 09026ea..cfb9645 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -696,6 +696,10 @@ struct task_struct {
>   int exit_signal;
>   /* The signal sent when the parent dies: */
>   int pdeath_signal;
> +
> + /* The signal sent prior to a child's coredump: */
> + int predump_signal;
> +
>   /* JOBCTL_*, siglock protected: */
>   unsigned long   jobctl;
>  
> diff --git a/include/linux/signal.h b/include/linux/signal.h
> index 706a499..7cb976d 100644
> --- a/include/linux/signal.h
> +++ b/include/linux/signal.h
> @@ -256,6 +256,11 @@ static inline int valid_signal(unsigned long sig)
>   return sig <= _NSIG ? 1 : 0;
>  }
>  
> +static inline int valid_predump_signal(int sig)
> +{
> + return (sig == SIGCHLD) || (sig == SIGUSR1) || (sig == SIGUSR2);
> +}
> +
>  struct timespec;
>  struct pt_regs;
>  enum pid_type;
> diff --git a/include/uapi/asm-generic/siginfo.h 
> b/include/uapi/asm-generic/siginfo.h
> index cb3d6c2..1a47cef 100644
> --- a/include/uapi/asm-generic/siginfo.h
> +++ b/include/uapi/asm-generic/siginfo.h
> @@ -267,7 +267,8 @@ struct {  \
>  #define CLD_TRAPPED  4   /* traced child has trapped */
>  #define CLD_STOPPED  5   /* child has stopped */
>  #define CLD_CONTINUED6   /* stopped child has continued */
> -#define NSIGCHLD 6
> +#define CLD_PREDUMP  7   /* child is about to dump core */
> +#define NSIGCHLD 7
>  
>  /*
>   * SIGPOLL (or any other signal without signal specific si_codes) si_codes
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 

Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-13 Thread Christian Brauner
On Fri, Oct 12, 2018 at 05:33:35PM -0700, Enke Chen wrote:
> For simplicity and consistency, this patch provides an implementation
> for signal-based fault notification prior to the coredump of a child
> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
> be used by an application to express its interest and to specify the
> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
> 
> Background:
> 
> As the coredump of a process may take time, in certain time-sensitive
> applications it is necessary for a parent process (e.g., a process
> manager) to be notified of a child's imminent death before the coredump
> so that the parent process can act sooner, such as re-spawning an
> application process, or initiating a control-plane fail-over.
> 
> Currently there are two ways for a parent process to be notified of a
> child process's state change. One is to use the POSIX signal, and
> another is to use the kernel connector module. The specific events and
> actions are summarized as follows:
> 
> Process EventPOSIX SignalConnector-based
> --
> ptrace_attach()  do_notify_parent_cldstop()  proc_ptrace_connector()
>  SIGCHLD / CLD_STOPPED
> 
> ptrace_detach()  do_notify_parent_cldstop()  proc_ptrace_connector()
>  SIGCHLD / CLD_CONTINUED
> 
> pre_coredump/N/A proc_coredump_connector()
> get_signal()
> 
> post_coredump/   do_notify_parent()  proc_exit_connector()
> do_exit()SIGCHLD / exit_signal
> --
> 
> As shown in the table, the signal-based pre-coredump notification is not
> currently available. In some cases using a connector-based notification
> can be quite complicated (e.g., when a process manager is written in shell
> scripts and thus is subject to certain inherent limitations), and a
> signal-based notification would be simpler and better suited.
> 
> Signed-off-by: Enke Chen 
> ---
>  arch/x86/kernel/signal_compat.c|  2 +-
>  include/linux/sched.h  |  4 ++
>  include/linux/signal.h |  5 +++
>  include/uapi/asm-generic/siginfo.h |  3 +-
>  include/uapi/linux/prctl.h |  4 ++
>  kernel/fork.c  |  1 +
>  kernel/signal.c| 51 +
>  kernel/sys.c   | 77 
> ++
>  8 files changed, 145 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
> index 9ccbf05..a3deba8 100644
> --- a/arch/x86/kernel/signal_compat.c
> +++ b/arch/x86/kernel/signal_compat.c
> @@ -30,7 +30,7 @@ static inline void signal_compat_build_tests(void)
>   BUILD_BUG_ON(NSIGSEGV != 7);
>   BUILD_BUG_ON(NSIGBUS  != 5);
>   BUILD_BUG_ON(NSIGTRAP != 5);
> - BUILD_BUG_ON(NSIGCHLD != 6);
> + BUILD_BUG_ON(NSIGCHLD != 7);
>   BUILD_BUG_ON(NSIGSYS  != 1);
>  
>   /* This is part of the ABI and can never change in size: */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 09026ea..cfb9645 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -696,6 +696,10 @@ struct task_struct {
>   int exit_signal;
>   /* The signal sent when the parent dies: */
>   int pdeath_signal;
> +
> + /* The signal sent prior to a child's coredump: */
> + int predump_signal;
> +
>   /* JOBCTL_*, siglock protected: */
>   unsigned long   jobctl;
>  
> diff --git a/include/linux/signal.h b/include/linux/signal.h
> index 706a499..7cb976d 100644
> --- a/include/linux/signal.h
> +++ b/include/linux/signal.h
> @@ -256,6 +256,11 @@ static inline int valid_signal(unsigned long sig)
>   return sig <= _NSIG ? 1 : 0;
>  }
>  
> +static inline int valid_predump_signal(int sig)
> +{
> + return (sig == SIGCHLD) || (sig == SIGUSR1) || (sig == SIGUSR2);
> +}
> +
>  struct timespec;
>  struct pt_regs;
>  enum pid_type;
> diff --git a/include/uapi/asm-generic/siginfo.h 
> b/include/uapi/asm-generic/siginfo.h
> index cb3d6c2..1a47cef 100644
> --- a/include/uapi/asm-generic/siginfo.h
> +++ b/include/uapi/asm-generic/siginfo.h
> @@ -267,7 +267,8 @@ struct {  \
>  #define CLD_TRAPPED  4   /* traced child has trapped */
>  #define CLD_STOPPED  5   /* child has stopped */
>  #define CLD_CONTINUED6   /* stopped child has continued */
> -#define NSIGCHLD 6
> +#define CLD_PREDUMP  7   /* child is about to dump core */
> +#define NSIGCHLD 7
>  
>  /*
>   * SIGPOLL (or any other signal without signal specific si_codes) si_codes
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 

Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-13 Thread Greg Kroah-Hartman
On Fri, Oct 12, 2018 at 05:33:35PM -0700, Enke Chen wrote:
> For simplicity and consistency, this patch provides an implementation
> for signal-based fault notification prior to the coredump of a child
> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
> be used by an application to express its interest and to specify the
> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
> 
> Background:
> 
> As the coredump of a process may take time, in certain time-sensitive
> applications it is necessary for a parent process (e.g., a process
> manager) to be notified of a child's imminent death before the coredump
> so that the parent process can act sooner, such as re-spawning an
> application process, or initiating a control-plane fail-over.
> 
> Currently there are two ways for a parent process to be notified of a
> child process's state change. One is to use the POSIX signal, and
> another is to use the kernel connector module. The specific events and
> actions are summarized as follows:
> 
> Process EventPOSIX SignalConnector-based
> --
> ptrace_attach()  do_notify_parent_cldstop()  proc_ptrace_connector()
>  SIGCHLD / CLD_STOPPED
> 
> ptrace_detach()  do_notify_parent_cldstop()  proc_ptrace_connector()
>  SIGCHLD / CLD_CONTINUED
> 
> pre_coredump/N/A proc_coredump_connector()
> get_signal()
> 
> post_coredump/   do_notify_parent()  proc_exit_connector()
> do_exit()SIGCHLD / exit_signal
> --
> 
> As shown in the table, the signal-based pre-coredump notification is not
> currently available. In some cases using a connector-based notification
> can be quite complicated (e.g., when a process manager is written in shell
> scripts and thus is subject to certain inherent limitations), and a
> signal-based notification would be simpler and better suited.
> 
> Signed-off-by: Enke Chen 
> ---
>  arch/x86/kernel/signal_compat.c|  2 +-
>  include/linux/sched.h  |  4 ++
>  include/linux/signal.h |  5 +++
>  include/uapi/asm-generic/siginfo.h |  3 +-
>  include/uapi/linux/prctl.h |  4 ++
>  kernel/fork.c  |  1 +
>  kernel/signal.c| 51 +
>  kernel/sys.c   | 77 
> ++
>  8 files changed, 145 insertions(+), 2 deletions(-)

Shouldn't there also be a manpage update, and a kselftest added for this
new user/kernel api that is being created?

thanks,

greg k-h


Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-13 Thread Greg Kroah-Hartman
On Fri, Oct 12, 2018 at 05:33:35PM -0700, Enke Chen wrote:
> For simplicity and consistency, this patch provides an implementation
> for signal-based fault notification prior to the coredump of a child
> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
> be used by an application to express its interest and to specify the
> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
> 
> Background:
> 
> As the coredump of a process may take time, in certain time-sensitive
> applications it is necessary for a parent process (e.g., a process
> manager) to be notified of a child's imminent death before the coredump
> so that the parent process can act sooner, such as re-spawning an
> application process, or initiating a control-plane fail-over.
> 
> Currently there are two ways for a parent process to be notified of a
> child process's state change. One is to use the POSIX signal, and
> another is to use the kernel connector module. The specific events and
> actions are summarized as follows:
> 
> Process EventPOSIX SignalConnector-based
> --
> ptrace_attach()  do_notify_parent_cldstop()  proc_ptrace_connector()
>  SIGCHLD / CLD_STOPPED
> 
> ptrace_detach()  do_notify_parent_cldstop()  proc_ptrace_connector()
>  SIGCHLD / CLD_CONTINUED
> 
> pre_coredump/N/A proc_coredump_connector()
> get_signal()
> 
> post_coredump/   do_notify_parent()  proc_exit_connector()
> do_exit()SIGCHLD / exit_signal
> --
> 
> As shown in the table, the signal-based pre-coredump notification is not
> currently available. In some cases using a connector-based notification
> can be quite complicated (e.g., when a process manager is written in shell
> scripts and thus is subject to certain inherent limitations), and a
> signal-based notification would be simpler and better suited.
> 
> Signed-off-by: Enke Chen 
> ---
>  arch/x86/kernel/signal_compat.c|  2 +-
>  include/linux/sched.h  |  4 ++
>  include/linux/signal.h |  5 +++
>  include/uapi/asm-generic/siginfo.h |  3 +-
>  include/uapi/linux/prctl.h |  4 ++
>  kernel/fork.c  |  1 +
>  kernel/signal.c| 51 +
>  kernel/sys.c   | 77 
> ++
>  8 files changed, 145 insertions(+), 2 deletions(-)

Shouldn't there also be a manpage update, and a kselftest added for this
new user/kernel api that is being created?

thanks,

greg k-h