Re: [PATCH] uprobes: prevent mutex_lock() under rcu_read_lock()

2024-05-21 Thread Oleg Nesterov
On 05/20, Andrii Nakryiko wrote:
>
> Fixes: 1b8f85defbc8 ("uprobes: prepare uprobe args buffer lazily")
> Reported-by: Breno Leitao 
> Signed-off-by: Andrii Nakryiko 
> ---
>  kernel/trace/trace_uprobe.c | 14 +-
>  1 file changed, 9 insertions(+), 5 deletions(-)

Reviewed-by: Oleg Nesterov 




Re: [PATCHv6 bpf-next 1/9] x86/shstk: Make return uprobe work with shadow stack

2024-05-21 Thread Oleg Nesterov
On 05/21, Jiri Olsa wrote:
>
> Currently the application with enabled shadow stack will crash
> if it sets up return uprobe. The reason is the uretprobe kernel
> code changes the user space task's stack, but does not update
> shadow stack accordingly.
>
> Adding new functions to update values on shadow stack and using
> them in uprobe code to keep shadow stack in sync with uretprobe
> changes to user stack.

I don't think my ack has any value in this area but looks good to me.

Reviewed-by: Oleg Nesterov 


> Fixes: 8b1c23543436 ("x86/shstk: Add return uprobe support")

Hmm... Was this commit ever applied?

Oleg.




Re: [PATCHv5 bpf-next 6/8] x86/shstk: Add return uprobe support

2024-05-15 Thread Oleg Nesterov
On 05/15, Edgecombe, Rick P wrote:
>
> On Wed, 2024-05-15 at 13:35 +0200, Oleg Nesterov wrote:
> >
> > > I'm ok with not using optimized uretprobe when shadow stack is detected
> > > as enabled and we go with current uretprobe in that case
> >
> > But how can we detect it? Again, suppose userspace does
>
> the rdssp instruction returns the value of the shadow stack pointer. On non-
> shadow stack it is a nop. So you could check if the SSP is non-zero to find if
> shadow stack is enabled.

But again, the ret-probed function can enable it before it returns? And we
need to check if it is enabled on the function entry if we want to avoid
sys_uretprobe() in this case. Although I don't understand why we want to
avoid it.

> This would catch most cases, but I guess there is the possibility of it
> getting enabled in a signal that hit between checking and the rest of
> operation.

Or from signal handler.

> Is this uretprobe stuff signal safe in general?

In what sense?

I forgot everything about this code but I can't recall any problem with signals.

Except it doesn't support sigaltstack() + siglongjmp().

Oleg.




Re: [PATCHv5 bpf-next 6/8] x86/shstk: Add return uprobe support

2024-05-15 Thread Oleg Nesterov
On 05/15, Jiri Olsa wrote:
>
> On Wed, May 15, 2024 at 01:19:20PM +0200, Oleg Nesterov wrote:
> > Let me ask a couple of really stupid questions. What if the shadow stack
> > is "shorter" than the normal stack? I mean,
> >
> > enable_shstk()
> > {
> > prctl(ARCH_SHSTK_SHSTK);

I meant ARCH_SHSTK_ENABLE, of course

> > }
> >
> > what happens when enable_shstk() returns?
>
> I think it will crash, there's explanation in the comment in
> tools/testing/selftests/x86/test_shadow_stack.c test

OK, thanks...

But test_shadow_stack.c doesn't do ARCH_PRCTL(ARCH_SHSTK_DISABLE) if
all the tests succeed ? Confused but nevermind.

> > And what is the purpose of fpregs_lock_and_load() ? Why do we need to
> > fpregs_restore_userregs() in shstk_setup() and other places?
> >
> > Oleg.
> >
>




Re: [PATCHv5 bpf-next 6/8] x86/shstk: Add return uprobe support

2024-05-15 Thread Oleg Nesterov
Let me repeat I know nothing about shadow stacks, only tried to
read Documentation/arch/x86/shstk.rst a few minutes ago ;)

On 05/13, Jiri Olsa wrote:
>
> 1) current uretprobe which are not working at the moment and we change
>the top value of shadow stack with shstk_push_frame
> 2) optimized uretprobe which needs to push new frame on shadow stack
>with shstk_update_last_frame
>
> I think we should do 1) and have current uretprobe working with shadow
> stack, which is broken at the moment

Agreed,

> I'm ok with not using optimized uretprobe when shadow stack is detected
> as enabled and we go with current uretprobe in that case

But how can we detect it? Again, suppose userspace does

enable_shstk()
{
prctl(ARCH_SHSTK_SHSTK);
}

what if enable_shstk() is ret-probed ?

Oleg.




Re: [PATCHv5 bpf-next 6/8] x86/shstk: Add return uprobe support

2024-05-15 Thread Oleg Nesterov
Sorry for the late reply, I was on PTO.

On 05/14, Deepak Gupta wrote:
>
> Question,
>
> Is it kernel who is maintaining all return probes, meaning original
> return addresses are saved in kernel data structures on per task basis.

Yes. task_struct->utask->return_instances

See prepare_uretprobe() which inserts the new return_instance with
->orig_ret_vaddr = original return addresses
when the tracee enters the ret-probed function.

> Once uretprobe did its job then
> its kernel who is ensuring return to original return address ?

Yes. See instruction_pointer_set(regs, ri->orig_ret_vaddr) in
handle_trampoline().



I know absolutely nothing about the shadow stacks, trying to read
Documentation/arch/x86/shstk.rst but it doesn't tell me too much...
Where can I find more documentation? I didn't try to google yet.

Upon function return, the processor pops the shadow stack copy
and compares it to the normal stack copy. If the two differ, the
processor raises a control-protection fault.

grep-grep-grep... exc_control_protection I guess.

Let me ask a couple of really stupid questions. What if the shadow stack
is "shorter" than the normal stack? I mean,

enable_shstk()
{
prctl(ARCH_SHSTK_SHSTK);
}

what happens when enable_shstk() returns?


And what is the purpose of fpregs_lock_and_load() ? Why do we need to
fpregs_restore_userregs() in shstk_setup() and other places?

Oleg.




Re: [PATCHv3 bpf-next 1/7] uprobe: Wire up uretprobe system call

2024-04-22 Thread Oleg Nesterov
On 04/21, Jiri Olsa wrote:
>
>  arch/x86/entry/syscalls/syscall_64.tbl | 1 +
>  include/linux/syscalls.h   | 2 ++
>  include/uapi/asm-generic/unistd.h  | 5 -
>  kernel/sys_ni.c| 2 ++
>  4 files changed, 9 insertions(+), 1 deletion(-)

Reviewed-by: Oleg Nesterov 




Re: [PATCHv3 bpf-next 2/7] uprobe: Add uretprobe syscall to speed up return probe

2024-04-22 Thread Oleg Nesterov
On 04/21, Jiri Olsa wrote:
>
>  arch/x86/kernel/uprobes.c | 115 ++
>  include/linux/uprobes.h   |   3 +
>  kernel/events/uprobes.c   |  24 +---
>  3 files changed, 135 insertions(+), 7 deletions(-)

Reviewed-by: Oleg Nesterov 




Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe

2024-04-08 Thread Oleg Nesterov
On 04/08, Jiri Olsa wrote:
>
> On Fri, Apr 05, 2024 at 01:02:30PM +0200, Oleg Nesterov wrote:
> >
> > And what should sys_uretprobe() do if it is not called from the trampoline?
> > I'd prefer force_sig(SIGILL) to punish the abuser ;) OK, OK, EINVAL.
>
> so the similar behaviour with int3 ends up with immediate SIGTRAP
> and not invoking pending uretprobe consumers, like:
>
>   - setup uretprobe for foo
>   - foo() {
>   executes int 3 -> sends SIGTRAP
> }
>
> because the int3 handler checks if it got executed from the uretprobe's
> trampoline.

... or the task has uprobe at this address

> if not it treats that int3 as regular trap

Yes this mimics the "default" behaviour without uprobes/uretprobes

> so I think we should mimic int3 behaviour and:
>
>   - setup uretprobe for foo
>   - foo() {
>  uretprobe_syscall -> check if we got executed from uretprobe's
>  trampoline and send SIGILL if that's not the case

Agreed,

> I think it's better to have the offending process killed right away,
> rather than having more undefined behaviour, waiting for final 'ret'
> instruction that jumps to uretprobe trampoline and causes SIGILL

Agreed. In fact I think it should be also killed if copy_to/from_user()
fails by the same reason.

Oleg.




Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe

2024-04-06 Thread Oleg Nesterov
On 04/06, Masami Hiramatsu wrote:
>
> On Fri, 5 Apr 2024 13:02:30 +0200
> Oleg Nesterov  wrote:
>
> > With or without this patch userspace can also do
> >
> > foo() { <-- retprobe1
> > bar() {
> > jump to xol_area
> > }
> > }
> >
> > handle_trampoline() will handle retprobe1.
>
> This is OK because the execution path has been changed to trampoline,

Agreed, in this case the misuse is more clear. But please see below.

> but the above will continue running bar() after sys_uretprobe().

... and most probably crash

> > sigreturn() can be "improved" too. Say, it could validate sigcontext->ip
> > and return -EINVAL if this addr is not valid. But why?
>
> Because sigreturn() never returns, but sys_uretprobe() will return.

You mean, sys_uretprobe() returns to the next insn after syscall.

Almost certainly yes, but this is not necessarily true. If one of consumers
changes regs->sp sys_uretprobe() "returns" to another location, just like
sys_rt_sigreturn().

That said.

Masami, it is not that I am trying to prove that you are "wrong" ;) No.

I see your points even if I am biased, I understand that my objections are
not 100% "fair".

I am just trying to explain why, rightly or not, I care much less about the
abuse of sys_uretprobe().

Thanks!

Oleg.




Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe

2024-04-05 Thread Oleg Nesterov
On 04/05, Jiri Olsa wrote:
>
> On Fri, Apr 05, 2024 at 10:22:03AM +0900, Masami Hiramatsu wrote:
> >
> > I think this expects setjmp/longjmp as below
> >
> > foo() { <- retprobe1
> > setjmp()
> > bar() { <- retprobe2
> > longjmp()
> > }
> > } <- return to trampoline
> >
> > In this case, we need to skip retprobe2's instance.

Yes,

> > My concern is, if we can not find appropriate return instance, what happen?
> > e.g.
> >
> > foo() { <-- retprobe1
> >bar() { # sp is decremented
> >sys_uretprobe() <-- ??
> > }
> > }
> >
> > It seems sys_uretprobe() will handle retprobe1 at that point instead of
> > SIGILL.
>
> yes, and I think it's fine, you get the consumer called in wrong place,
> but it's your fault and kernel won't crash

Agreed.

With or without this patch userspace can also do

foo() { <-- retprobe1
bar() {
jump to xol_area
}
}

handle_trampoline() will handle retprobe1.

> this can be fixed by checking the syscall is called from the trampoline
> and prevent handle_trampoline call if it's not

Yes, but I still do not think this makes a lot of sense. But I won't argue.

And what should sys_uretprobe() do if it is not called from the trampoline?
I'd prefer force_sig(SIGILL) to punish the abuser ;) OK, OK, EINVAL.

I agree very much with Andrii,

   sigreturn() exists only to allow the implementation of signal handlers.
   It should never be called directly. Details of the arguments (if any)
   passed to sigreturn() vary depending on the architecture.

this is how sys_uretprobe() should be treated/documented.

sigreturn() can be "improved" too. Say, it could validate sigcontext->ip
and return -EINVAL if this addr is not valid. But why?

Oleg.




Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe

2024-04-04 Thread Oleg Nesterov
On 04/05, Masami Hiramatsu wrote:
>
> Can we make this syscall and uprobe behavior clearer? As you said, if
> the application use sigreturn or longjump, it may skip returns and
> shadow stack entries are left in the kernel. In such cases, can uretprobe
> detect it properly, or just crash the process (or process runs wrongly)?

Please see the comment in handle_trampoline(), it tries to detect this case.
This patch should not make any difference.

Oleg.




Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe

2024-04-03 Thread Oleg Nesterov
Again, I leave this to you and Jiri, but

On 04/03, Masami Hiramatsu wrote:
>
> On Wed, 3 Apr 2024 11:47:41 +0200
> > > set in the user function, what happen if the user function directly
> > > calls this syscall? (maybe it consumes shadow stack?)
> >
> > the process should receive SIGILL if there's no pending uretprobe for
> > the current task, or it will trigger uretprobe if there's one pending
>
> No, that is too aggressive and not safe. Since the syscall is exposed to
> user program, it should return appropriate error code instead of SIGILL.

...

> Since the syscall is always exposed to the user program, it should
> - Do nothing and return an error unless it is properly called.
> - check the prerequisites for operation strictly.

We have sys_munmap(). should it check if the caller is going to unmap
the code region which contains regs->ip and do nothing?

I don't think it should. Userspace should blame itself, SIGSEGV is not
"too aggressive" in this case.

> I concern that new system calls introduce vulnerabilities.

Yes, we need to ensure that sys_uretprobe() can only damage the malicious
caller and nothing else.

Oleg.




Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe

2024-04-03 Thread Oleg Nesterov
I leave this to you and Masami, but...

On 04/03, Jiri Olsa wrote:
>
> On Wed, Apr 03, 2024 at 10:07:08AM +0900, Masami Hiramatsu wrote:
> >
> > This is interesting approach. But I doubt we need to add additional
> > syscall just for this purpose. Can't we use another syscall or ioctl?
>
> so the plan is to optimize entry uprobe in a similar way and given
> the syscall is not a scarce resource I wanted to add another syscall
> for that one as well
>
> tbh I'm not sure which syscall or ioctl to reuse for this, it's
> possible to do that, the trampoline will just have to save one or
> more additional registers, but adding new syscall seems cleaner to me

Agreed.

> > Also, we should run syzkaller on this syscall. And if uretprobe is
>
> right, I'll check on syzkaller

I don't understand this concern...

> > set in the user function, what happen if the user function directly
> > calls this syscall? (maybe it consumes shadow stack?)
>
> the process should receive SIGILL if there's no pending uretprobe for
> the current task,

Yes,

> or it will trigger uretprobe if there's one pending

... and corrupt the caller. So what?

> but we could limit the syscall to be executed just from the trampoline,
> that should prevent all the user space use cases, I'll do that in next
> version and add more tests for that

Yes, we can... well, ignoring the race with mremap() from another thread.

But why should we care?

Userspace should not call sys_uretprobe(). Likewise, it should not call
sys_restart_syscall(). Likewise, it should not jump to xol_area.

Of course, userspace (especially syzkaller) _can_ do this. So what?

I think the only thing we need to ensure is that the "malicious" task
which calls sys_uretprobe() can only harm itself, nothing more.

No?

Oleg.




Re: [PATCH v2 0/3] uprobes: two common case speed ups

2024-03-18 Thread Oleg Nesterov
On 03/18, Andrii Nakryiko wrote:
>
> Andrii Nakryiko (3):
>   uprobes: encapsulate preparation of uprobe args buffer
>   uprobes: prepare uprobe args buffer lazily
>   uprobes: add speculative lockless system-wide uprobe filter check
>
>  kernel/trace/trace_uprobe.c | 103 +---
>  1 file changed, 59 insertions(+), 44 deletions(-)

Reviewed-by: Oleg Nesterov 




Re: [PATCH bpf-next 2/3] uprobes: prepare uprobe args buffer lazily

2024-03-13 Thread Oleg Nesterov
Again, looks good to me, but I have a minor nit. Feel free to ignore.

On 03/12, Andrii Nakryiko wrote:
>
>  static void __uprobe_trace_func(struct trace_uprobe *tu,
>   unsigned long func, struct pt_regs *regs,
> - struct uprobe_cpu_buffer *ucb,
> + struct uprobe_cpu_buffer **ucbp,
>   struct trace_event_file *trace_file)
>  {
>   struct uprobe_trace_entry_head *entry;
>   struct trace_event_buffer fbuffer;
> + struct uprobe_cpu_buffer *ucb;
>   void *data;
>   int size, esize;
>   struct trace_event_call *call = trace_probe_event_call(&tu->tp);
>  
> + ucb = *ucbp;
> + if (!ucb) {
> + ucb = prepare_uprobe_buffer(tu, regs);
> + *ucbp = ucb;
> + }

perhaps it would be more clean to pass ucbp to prepare_uprobe_buffer()
and change it to do

if (*ucbp)
return *ucbp;

at the start. Then __uprobe_trace_func() and __uprobe_perf_func() can
simply do

ucb = prepare_uprobe_buffer(tu, regs, ucbp);

> - uprobe_buffer_put(ucb);
> + if (ucb)
> + uprobe_buffer_put(ucb);

Similarly, I think the "ucb != NULL" check should be shifted into
uprobe_buffer_put().

Oleg.




Re: [PATCH bpf-next 1/3] uprobes: encapsulate preparation of uprobe args buffer

2024-03-13 Thread Oleg Nesterov
LGTM, one nit below.

On 03/12, Andrii Nakryiko wrote:
>
> +static struct uprobe_cpu_buffer *prepare_uprobe_buffer(struct trace_uprobe *tu,
> +struct pt_regs *regs)
> +{
> + struct uprobe_cpu_buffer *ucb;
> + int dsize, esize;
> +
> + esize = SIZEOF_TRACE_ENTRY(is_ret_probe(tu));
> + dsize = __get_data_size(&tu->tp, regs);
> +
> + ucb = uprobe_buffer_get();
> + ucb->dsize = dsize;
> +
> + store_trace_args(ucb->buf, &tu->tp, regs, esize, dsize);
> +
> + return ucb;
> +}

OK, but note that every user of ->dsize adds tp.size. So I think you can
simplify this code a bit more if you change prepare_uprobe_buffer() to do

ucb->dsize = tu->tp.size + dsize;

and update the users.

Oleg.




Re: [PATCH bpf-next 3/3] uprobes: add speculative lockless system-wide uprobe filter check

2024-03-13 Thread Oleg Nesterov
I forgot everything about this code, plus it has changed a lot since
I looked at it many years ago, but ...

I think this change is fine but the changelog looks a bit confusing
(overcomplicated) to me.

On 03/12, Andrii Nakryiko wrote:
>
> This patch adds a speculative check before grabbing that rwlock. If
> nr_systemwide is non-zero, lock is skipped and event is passed through.
> From examining existing logic it looks correct and safe to do. If
> nr_systemwide is being modified under rwlock in parallel, we have to
> consider basically just one important race condition: the case when
> nr_systemwide is dropped from one to zero (from
> trace_uprobe_filter_remove()) under filter->rwlock, but
> uprobe_perf_filter() raced and saw it as >0.

Unless I am totally confused, there is nothing new. Even without
this change trace_uprobe_filter_remove() can clear nr_systemwide
right after uprobe_perf_filter() drops filter->rwlock.

And of course, trace_uprobe_filter_add() can change nr_systemwide
from 0 to 1. In this case uprobe_perf_func() can "wrongly" return
UPROBE_HANDLER_REMOVE but we can't avoid this and afaics this is
fine even if handler_chain() does unapply_uprobe(), uprobe_perf_open()
will do uprobe_apply() after that, we can rely on ->register_rwsem.

> In case we speculatively read nr_systemwide as zero, while it was
> incremented in parallel, we'll proceed to grabbing filter->rwlock and
> re-doing the check, this time in lock-protected and non-racy way.

See above...


So I think uprobe_perf_filter() needs filter->rwlock only to iterate
the list, it can check nr_systemwide lockless and this means that you
can also remove the same check in __uprobe_perf_filter(), other callers
trace_uprobe_filter_add/remove check it themselves.


> --- a/kernel/trace/trace_uprobe.c
> +++ b/kernel/trace/trace_uprobe.c
> @@ -1351,6 +1351,10 @@ static bool uprobe_perf_filter(struct uprobe_consumer *uc,
>   tu = container_of(uc, struct trace_uprobe, consumer);
>   filter = tu->tp.event->filter;
>
> + /* speculative check */
> + if (READ_ONCE(filter->nr_systemwide))
> + return true;
> +
>   read_lock(&filter->rwlock);
>   ret = __uprobe_perf_filter(filter, mm);
>   read_unlock(&filter->rwlock);

ACK,

but see above. I think the changelog should be simplified and the
filter->nr_systemwide check in __uprobe_perf_filter() should be
removed. But I won't insist and perhaps I missed something...

Oleg.




Re: [PATCH 2/3] arm64: syscall.h: Add sign extension handling in syscall_get_return_value for compat

2021-04-16 Thread Oleg Nesterov
On 04/16, He Zhe wrote:
>
> --- a/arch/arm64/include/asm/syscall.h
> +++ b/arch/arm64/include/asm/syscall.h
> @@ -44,7 +44,12 @@ static inline long syscall_get_error(struct task_struct *task,
>  static inline long syscall_get_return_value(struct task_struct *task,
>   struct pt_regs *regs)
>  {
> - return regs->regs[0];
> + long val = regs->regs[0];
> +
> + if (is_compat_thread(task_thread_info(task)))
> + val = sign_extend64(val, 31);
> +
> + return val;
>  }

I can't really review these arm-specific patches, but with this change both
syscall_get_error() and is_syscall_success() can use syscall_get_return_value()
to avoid the code duplication.

Oleg.



Re: [PATCH 2/2] ptrace: is_syscall_success: Add syscall return code handling for compat task

2021-04-14 Thread Oleg Nesterov
On 04/15, He Zhe wrote:
>
>
> On 4/15/21 12:55 AM, Oleg Nesterov wrote:
> > I think in_compat_syscall() should be used instead.
> >
> > But this doesn't matter, I still can't understand the problem.
>
> Sorry for not enough clarification.
>
> This was found on an arm64 kernel running with 32-bit user-space application.

OK, but then I think you should add the arm64 version of is_syscall_success()
into arch/arm4/include/asm/ptrace.h and do not touch the generic version ?

Something like arch/arm64/include/asm/syscall.h:syscall_get_error() which uses
is_compat_thread(). Perhaps you can even do

#define is_syscall_success(regs)\
(syscall_get_error(current, regs) == 0)

Oleg.



Re: [PATCH 2/4] arch/x86: implement the process_vm_exec syscall

2021-04-14 Thread Oleg Nesterov
On 04/13, Andrei Vagin wrote:
>
> +static void swap_mm(struct mm_struct *prev_mm, struct mm_struct *target_mm)
> +{
> + struct task_struct *tsk = current;
> + struct mm_struct *active_mm;
> +
> + task_lock(tsk);
> + /* Hold off tlb flush IPIs while switching mm's */
> + local_irq_disable();
> +
> + sync_mm_rss(prev_mm);
> +
> + vmacache_flush(tsk);
> +
> + active_mm = tsk->active_mm;
> + if (active_mm != target_mm) {
> + mmgrab(target_mm);
> + tsk->active_mm = target_mm;
> + }
> + tsk->mm = target_mm;
> + switch_mm_irqs_off(active_mm, target_mm, tsk);
> + local_irq_enable();
> + task_unlock(tsk);
> +#ifdef finish_arch_post_lock_switch
> + finish_arch_post_lock_switch();
> +#endif
> +
> + if (active_mm != target_mm)
> + mmdrop(active_mm);
> +}

I think this should be unified with kthread_use_mm() somehow...

And does it really need the "prev_mm" argument? It must be tsk->mm, no?

Oleg.



Re: [PATCH 2/2] ptrace: is_syscall_success: Add syscall return code handling for compat task

2021-04-14 Thread Oleg Nesterov
On 04/14, David Laight wrote:
>
> From: Oleg Nesterov
> > Sent: 14 April 2021 16:08
> >
> > Add audit maintainers...
> >
> > On 04/14, He Zhe wrote:
> > >
> > > When 32-bit userspace application is running on 64-bit kernel, the 32-bit
> > > syscall return code would be changed from u32 to u64 in regs_return_value
> > > and then changed to s64. Hence the negative return code would be treated
> > > as a positive number and results in a non-error in, for example, audit
> > > like below.
> >
> > Sorry, can't understand. At least on x86_64 even the 32-bit syscall returns
> > long, not u32.
> >
> > Hmm. And afaics on x86 is_compat_task() is only defined if !CONFIG_COMPAT,
> > so this patch looks wrong anyway.
>
> And, as with the other patch a x64_64 64bit process can make both types
> of 32bit system call - so it needs to depend on the system call entry type
> not any type of the task.

I don't understand... but iirc is_compat_task() used to check TS_COMPAT and
this is what we need to detect the 32-bit syscall. But it looks deprecated,
I think in_compat_syscall() should be used instead.

But this doesn't matter, I still can't understand the problem.

Oleg.



Re: [PATCH 2/2] ptrace: is_syscall_success: Add syscall return code handling for compat task

2021-04-14 Thread Oleg Nesterov
Add audit maintainers...

On 04/14, He Zhe wrote:
>
> When 32-bit userspace application is running on 64-bit kernel, the 32-bit
> syscall return code would be changed from u32 to u64 in regs_return_value
> and then changed to s64. Hence the negative return code would be treated
> as a positive number and results in a non-error in, for example, audit
> like below.

Sorry, can't understand. At least on x86_64 even the 32-bit syscall returns
long, not u32.

Hmm. And afaics on x86 is_compat_task() is only defined if !CONFIG_COMPAT,
so this patch looks wrong anyway.

Oleg.

> type=SYSCALL msg=audit(160715.887:582): arch=4028 syscall=322
> success=yes exit=4294967283
> 
> This patch forces the u32->s32->s64 for compat tasks.
> 
> Signed-off-by: He Zhe 
> ---
>  include/linux/ptrace.h | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
> index b5ebf6c01292..bc3056fff8a6 100644
> --- a/include/linux/ptrace.h
> +++ b/include/linux/ptrace.h
> @@ -260,7 +260,9 @@ static inline void ptrace_release_task(struct task_struct 
> *task)
>   * is an error value.  On some systems like ia64 and powerpc they have 
> different
>   * indicators of success/failure and must define their own.
>   */
> -#define is_syscall_success(regs) (!IS_ERR_VALUE((unsigned long)(regs_return_value(regs))))
> +#define is_syscall_success(regs) (!IS_ERR_VALUE(is_compat_task() ? \
> + (unsigned long)(s64)(s32)(regs_return_value(regs)) : \
> + (unsigned long)(regs_return_value(regs))))
>  #endif
>  
>  /*
> -- 
> 2.17.1
> 



Re: Mutual debugging of 2 processes can stuck in unkillable stopped state

2021-04-13 Thread Oleg Nesterov
Hi Igor,

sorry for delay...

On 04/12, Igor Zhbanov wrote:
>
> Hi Oleg,
>
> So what is the cause of this problem?

The cause is clear. And well known ;) And again, this has almost nothing to do
with the mutual debugging.

The tracee sleeps in ptrace_stop(). You send SIGKILL. This wakes the tracee up,
it dequeues the signal, calls do_exit(), and stops again in PTRACE_EVENT_EXIT.
With SIGKILL in signal->shared_pending. This all looks as if the tracee doesn't
react to SIGKILL.

The only problem is that any change can break something which relies on the
current behaviour :/ I'll write another email on this.

Oleg.



Re: [PATCH] block: Fix sys_ioprio_set(.which=IOPRIO_WHO_PGRP) task iteration

2021-04-09 Thread Oleg Nesterov
On 04/08, Jens Axboe wrote:
>
> On 4/8/21 3:46 AM, Peter Zijlstra wrote:
> >
> > do_each_pid_thread() { } while_each_pid_thread() is a double loop and
> > thus break doesn't work as expected. Also, it should be used under
> > tasklist_lock because otherwise we can race against change_pid() for
> > PGID/SID.
>
> Applied, thanks.

Agreed, but can't resist. We can move the "out" label up and avoid the extra
read_unlock(tasklist_lock). IOW, something like below on top of this patch.

Quite possibly this won't change the generated code, gcc is smart enough, but
this makes the code a bit more readable.

Oleg.

--- x/block/ioprio.c~   2021-04-09 12:00:28.066145563 +0200
+++ x/block/ioprio.c2021-04-09 12:02:01.817849618 +0200
@@ -123,11 +123,10 @@
read_lock(&tasklist_lock);
do_each_pid_thread(pgrp, PIDTYPE_PGID, p) {
ret = set_task_ioprio(p, ioprio);
-   if (ret) {
-   read_unlock(&tasklist_lock);
-   goto out;
-   }
+   if (ret)
+   goto out_pgrp;
} while_each_pid_thread(pgrp, PIDTYPE_PGID, p);
+out_pgrp:
read_unlock(&tasklist_lock);
 
break;
@@ -159,7 +158,6 @@
ret = -EINVAL;
}
 
-out:
rcu_read_unlock();
return ret;
 }



Re: perf_buffer.event_list is not RCU-safe?

2021-04-07 Thread Oleg Nesterov
On 04/07, Peter Zijlstra wrote:
>
> On Tue, Apr 06, 2021 at 07:43:53PM +0200, Oleg Nesterov wrote:
> > On 04/06, Oleg Nesterov wrote:
> > >
> > > perf_mmap_close() was added by 9bb5d40cd93c9 ("perf: Fix mmap() 
> > > accounting hole")
> >
> > I meant perf_mmap_close() -> put_event()
> >
> > > and this commit doesn't look right anyway
> >
> > It seems there is another problem or I am totally confused. I do not
> > understand why can we use list_for_each_entry_rcu(event, rb->event_list)
> > if this can race with perf_event_set_output(event) which can move "event"
> > to another list, in this case list_for_each_entry_rcu() can loop forever.
> >
> > perf_mmap_close() even mentions this race and restarts the iteration to
> > avoid it but I don't think this is enough,
> >
> > rcu_read_lock();
> > list_for_each_entry_rcu(event, &rb->event_list, rb_entry) {
> > if (!atomic_long_inc_not_zero(&event->refcount)) {
> > /*
> >  * This event is en-route to free_event() which will
> >  * detach it and remove it from the list.
> >  */
> > continue;
> > }
> >
> > just suppose that "this event" is moved to another list first and after
> > that it goes away so that atomic_long_inc_not_zero() fails; in this case
> > the next iteration will play with event->rb_entry.next, and this is not
> > necessarily "struct perf_event", it can be "list_head event_list".
>
> We observe an RCU GP in ring_buffer_attach(), between detach and attach,
> no?

Aaah yes, I didn't notice cond_synchronize_rcu() in ring_buffer_attach().

Thanks!

Oleg.



Re: [PATCH] task_work: add helper for more targeted task_work canceling

2021-04-06 Thread Oleg Nesterov
On 04/04, Jens Axboe wrote:
>
> +struct callback_head *task_work_cancel_match(struct task_struct *task,
> + bool (*match)(struct callback_head *, void *data), void *data);
^

Feel free to ignore, but how about "typedef task_work_match_t" ?

Either way,

Reviewed-by: Oleg Nesterov 



perf_buffer.event_list is not RCU-safe?

2021-04-06 Thread Oleg Nesterov
On 04/06, Oleg Nesterov wrote:
>
> perf_mmap_close() was added by 9bb5d40cd93c9 ("perf: Fix mmap() accounting hole")

I meant perf_mmap_close() -> put_event()

> and this commit doesn't look right anyway

It seems there is another problem or I am totally confused. I do not
understand why can we use list_for_each_entry_rcu(event, rb->event_list)
if this can race with perf_event_set_output(event) which can move "event"
to another list, in this case list_for_each_entry_rcu() can loop forever.

perf_mmap_close() even mentions this race and restarts the iteration to
avoid it but I don't think this is enough,

rcu_read_lock();
list_for_each_entry_rcu(event, &rb->event_list, rb_entry) {
if (!atomic_long_inc_not_zero(&event->refcount)) {
/*
 * This event is en-route to free_event() which will
 * detach it and remove it from the list.
 */
continue;
}

just suppose that "this event" is moved to another list first and after
that it goes away so that atomic_long_inc_not_zero() fails; in this case
the next iteration will play with event->rb_entry.next, and this is not
necessarily "struct perf_event", it can be "list_head event_list".

Don't we need rb->event_lock ?

Oleg.



perf_mmap_close() -> put_event() -> event.destroy() can deadlock

2021-04-06 Thread Oleg Nesterov
Song, Peter, please take a look.

On 04/02, Hillf Danton wrote:
>
> On Thu, 1 Apr 2021 12:53:26  Oleg Nesterov wrote:
> >
> >Hmm, please correct me, but I don't think so. I think mmap_lock -> 
> >dup_mmap_sem
> >is not possible.
> >
> Given perf_trace_destroy() in the report, it is a couple of steps in the
> subsequent call chain that it likely takes to reach uprobe_unregister().
>
> perf_trace_destroy()
>   perf_trace_event_unreg(p_event)
> tp_event->class->reg(tp_event, TRACE_REG_PERF_UNREGISTER, NULL);
>   trace_uprobe_register()

Well, this is not possible, perf_trace_destroy() is only used by perf_tracepoint
pmu. But you are right anyway, event.destroy == perf_uprobe_destroy can lead to
uprobe_unregister(). Thanks.

---

So. perf_mmap_close() does put_event(event) with mm->mmap_lock held. This can
deadlock if event->destroy == perf_uprobe_destroy: perf_trace_event_close/unreg
takes dup_mmap_sem.


perf_mmap_close() was added by 9bb5d40cd93c9 ("perf: Fix mmap() accounting hole")
and this commit doesn't look right anyway, I'll write another email. However, it
seems that this particular problem was added later by 33ea4b24277b0 ("perf/core:
Implement the 'perf_uprobe' PMU").

Oleg.



Re: [syzbot] possible deadlock in register_for_each_vma

2021-04-01 Thread Oleg Nesterov
On 04/01, Hillf Danton wrote:
>
> If I dont misread it,  the lockdep chain will likely evolve from
>
>event_mutex -> uprobe.register_rwsem -> dup_mmap_sem -> mm.mmap_lock ->
>event_mutex
> to
>dup_mmap_sem -> mm.mmap_lock -> dup_mmap_sem
>
> after this patch as both uprobe_register() and uprobe_unregister() would take
> dup_mmap_sem.

Hmm, please correct me, but I don't think so. I think mmap_lock -> dup_mmap_sem
is not possible.

Oleg.



Re: [syzbot] possible deadlock in register_for_each_vma

2021-03-31 Thread Oleg Nesterov
On 03/28, Hillf Danton wrote:
>
> On Sat, 27 Mar 2021 18:53:08 Oleg Nesterov wrote:
> >Hi Hillf,
> >
> >it seems that you already understand the problem ;) I don't.
>
> It is simpler than you thought - I always blindly believe what syzbot
> reported is true before it turns out false as I am not smarter than it.
> Feel free to laugh loud.

I am not going to laugh. I too think that lockdep is more clever than me.

> >Could you explain in details how double __register is possible ? and how
>
> Taking another look at the report over five minutes may help more?

No. I spent much, much more time and I still can't understand your
patch which adds UPROBE_REGISTERING. Quite possibly your patch is fine,
just I am not smart enough.

And I am a bit surprised you refused to help me.

> >it connects to this lockdep report?
>
> Feel free to show the report is false and ignore my noise.

Well, this particular report looks correct but false-positive to me,
_free_event() is not possible, but I can be easily wrong and we need
to shut up lockdep anyway...


---
Add more CC's. So, we have the following trace

-> #0 (dup_mmap_sem){}-{0:0}:
check_prev_add kernel/locking/lockdep.c:2936 [inline]
check_prevs_add kernel/locking/lockdep.c:3059 [inline]
validate_chain kernel/locking/lockdep.c:3674 [inline]
__lock_acquire+0x2b14/0x54c0 kernel/locking/lockdep.c:4900
lock_acquire kernel/locking/lockdep.c:5510 [inline]
lock_acquire+0x1ab/0x740 kernel/locking/lockdep.c:5475
percpu_down_write+0x95/0x440 kernel/locking/percpu-rwsem.c:217
register_for_each_vma+0x2c/0xc10 kernel/events/uprobes.c:1040
__uprobe_register+0x5c2/0x850 kernel/events/uprobes.c:1181
trace_uprobe_enable kernel/trace/trace_uprobe.c:1065 [inline]
probe_event_enable+0x357/0xa00 kernel/trace/trace_uprobe.c:1134
trace_uprobe_register+0x443/0x880 kernel/trace/trace_uprobe.c:1461
perf_trace_event_reg kernel/trace/trace_event_perf.c:129 [inline]
perf_trace_event_init+0x549/0xa20 kernel/trace/trace_event_perf.c:204
perf_uprobe_init+0x16f/0x210 kernel/trace/trace_event_perf.c:336
perf_uprobe_event_init+0xff/0x1c0 kernel/events/core.c:9754
perf_try_init_event+0x12a/0x560 kernel/events/core.c:11071
perf_init_event kernel/events/core.c:11123 [inline]
perf_event_alloc.part.0+0xe3b/0x3960 kernel/events/core.c:11403
perf_event_alloc kernel/events/core.c:11785 [inline]
__do_sys_perf_event_open+0x647/0x2e60 kernel/events/core.c:11883
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xae


which shows that this path takes

event_mutex -> uprobe.register_rwsem -> dup_mmap_sem -> mm.mmap_lock

Not good. If nothing else, perf_mmap_close() path can take event_mutex under
mm.mmap_lock, so lockdep complains correctly.

But why does perf_uprobe_init() take event_mutex? The comment mentions
uprobe_buffer_enable().

If this is the only reason, then why uprobe_buffer_enable/disable abuse
event_mutex?

IOW, can something like the stupid patch below work? (Just in case... yes
it is very suboptimal, I am just trying to understand the problem).

Song, Namhyung, Peter, what do you think?

Oleg.


--- x/kernel/trace/trace_event_perf.c
+++ x/kernel/trace/trace_event_perf.c
@@ -327,16 +327,9 @@ int perf_uprobe_init(struct perf_event *p_event,
goto out;
}
 
-   /*
-* local trace_uprobe need to hold event_mutex to call
-* uprobe_buffer_enable() and uprobe_buffer_disable().
-* event_mutex is not required for local trace_kprobes.
-*/
-   mutex_lock(&event_mutex);
ret = perf_trace_event_init(tp_event, p_event);
if (ret)
destroy_local_trace_uprobe(tp_event);
-   mutex_unlock(&event_mutex);
 out:
kfree(path);
return ret;
--- x/kernel/trace/trace_uprobe.c
+++ x/kernel/trace/trace_uprobe.c
@@ -857,6 +857,7 @@ struct uprobe_cpu_buffer {
 };
 static struct uprobe_cpu_buffer __percpu *uprobe_cpu_buffer;
 static int uprobe_buffer_refcnt;
+static DEFINE_MUTEX(uprobe_buffer_mutex);
 
 static int uprobe_buffer_init(void)
 {
@@ -894,13 +895,13 @@ static int uprobe_buffer_enable(void)
 {
int ret = 0;
 
-   BUG_ON(!mutex_is_locked(&event_mutex));
-
+   mutex_lock(&uprobe_buffer_mutex);
if (uprobe_buffer_refcnt++ == 0) {
ret = uprobe_buffer_init();
if (ret < 0)
uprobe_buffer_refcnt--;
}
+   mutex_unlock(&uprobe_buffer_mutex);
 
return ret;
 }
@@ -909,8 +910,7 @@ static void uprobe_buffer_disable(void)
 {
int cpu;
 
-   BUG_ON(!mutex_is_locked(&event_mutex));
-
+   mutex_lock(&uprobe_buffer_mutex);
if (--uprobe_buffer_refcnt == 0) {
 

Re: [PATCH v3 06/11] perf: Add support for SIGTRAP on perf events

2021-03-29 Thread Oleg Nesterov
On 03/29, Marco Elver wrote:
>
> So, per off-list discussion, it appears that I should ask to clarify:
> PF_EXISTING or PF_EXITING?

Aaah, sorry Marco.

PF_EXITING, of course.

Oleg.



Re: Mutual debugging of 2 processes can stuck in unkillable stopped state

2021-03-29 Thread Oleg Nesterov
00
> SigBlk: 2000
> SigIgn: 0030
> SigCgt: 000180007007
> CapInh: 
> CapPrm: 
> CapEff: 
> CapBnd: 01ff
> CapAmb: 
> NoNewPrivs: 0
> Seccomp:0
> Seccomp_filters:0
> Speculation_Store_Bypass:   vulnerable
> SpeculationIndirectBranch:  always enabled
> Cpus_allowed:   7
> Cpus_allowed_list:  0-2
> Mems_allowed:   
> ,,,,,,,,,,,,,,,0001
> Mems_allowed_list:  0
> voluntary_ctxt_switches:180
> nonvoluntary_ctxt_switches: 848
> 
> On 29.03.2021 19:49, Oleg Nesterov wrote:
> >On 03/29, Igor Zhbanov wrote:
> >>
> >>Mutual debugging of 2 processes can stuck in unkillable stopped state
> >
> >can't reproduce and can't understand...
> >
> >>Hi!
> >>
> >>When one process, let's say "A", is tracing the another process "B", and the
> >>process "B" is trying to attach to the process "A", then both of them are
> >>getting stuck in the "t+" state. And they are ignoring all of the signals
> >>including the SIGKILL,
> >
> >Why do you think so? What is your kernel version?
> >
> >"t" means TASK_TRACED, SIGKILL should wake it up and terminate.
> >
> >>so it is not possible to terminate them without
> >>a reboot.
> >>
> >>To reproduce:
> >>1) Run two terminals
> >>2) Attach with "strace -p ..." from the first terminal to the shell (bash) 
> >>of
> >>the second terminal.
> >>3) In the second terminal run "exec strace -p ..." to attach to the PID of 
> >>the
> >>first strace.
> >>
> >>Then you'll see that the second strace is hanging without any output. And 
> >>the
> >>first strace will output following and hang too:
> >>ptrace(PTRACE_SEIZE, 11795, NULL,
> >>PTRACE_O_TRACESYSGOOD|PTRACE_O_TRACEEXEC|PTRACE_O_TRACEEXIT
> >>
> >>(The 11795 is the PID of the first strace itself.)
> >>
> >>And in the process list you will see following:
> >>ps awux | grep strace
> >>user   11776  0.0  0.0  24752  2248 pts/3t+   13:53   0:00 strace -p 
> >>11795
> >>user   11795  0.0  0.0  24752  3888 pts/1t+   13:54   0:00 strace -p 
> >>11776
> >
> >OK, may be they sleep in PTRACE_EVENT_EXIT? After you tried to send SIGKILL?
> >
> >please show us the output from "cat /proc/{11795,11776}/stack". And
> >"cat /proc/{11795,11776}/status" just in case.
> 



Re: Mutual debugging of 2 processes can stuck in unkillable stopped state

2021-03-29 Thread Oleg Nesterov
On 03/29, Igor Zhbanov wrote:
>
> Mutual debugging of 2 processes can stuck in unkillable stopped state

can't reproduce and can't understand...

> Hi!
>
> When one process, let's say "A", is tracing the another process "B", and the
> process "B" is trying to attach to the process "A", then both of them are
> getting stuck in the "t+" state. And they are ignoring all of the signals
> including the SIGKILL,

Why do you think so? What is your kernel version?

"t" means TASK_TRACED, SIGKILL should wake it up and terminate.

> so it is not possible to terminate them without
> a reboot.
> 
> To reproduce:
> 1) Run two terminals
> 2) Attach with "strace -p ..." from the first terminal to the shell (bash) of
>the second terminal.
> 3) In the second terminal run "exec strace -p ..." to attach to the PID of the
>first strace.
> 
> Then you'll see that the second strace is hanging without any output. And the
> first strace will output following and hang too:
> ptrace(PTRACE_SEIZE, 11795, NULL,
>PTRACE_O_TRACESYSGOOD|PTRACE_O_TRACEEXEC|PTRACE_O_TRACEEXIT
> 
> (The 11795 is the PID of the first strace itself.)
> 
> And in the process list you will see following:
> ps awux | grep strace
> user   11776  0.0  0.0  24752  2248 pts/3t+   13:53   0:00 strace -p 11795
> user   11795  0.0  0.0  24752  3888 pts/1t+   13:54   0:00 strace -p 11776

OK, may be they sleep in PTRACE_EVENT_EXIT? After you tried to send SIGKILL?

please show us the output from "cat /proc/{11795,11776}/stack". And
"cat /proc/{11795,11776}/status" just in case.

Oleg.



Re: [PATCH v3 06/11] perf: Add support for SIGTRAP on perf events

2021-03-29 Thread Oleg Nesterov
On 03/29, Peter Zijlstra wrote:
>
> On Thu, Mar 25, 2021 at 09:14:39AM +0100, Marco Elver wrote:
> > @@ -6395,6 +6395,13 @@ static void perf_sigtrap(struct perf_event *event)
> >  {
> > struct kernel_siginfo info;
> >
> > +   /*
> > +* This irq_work can race with an exiting task; bail out if sighand has
> > +* already been released in release_task().
> > +*/
> > +   if (!current->sighand)
> > +   return;

This is racy. If "current" has already passed exit_notify(), current->parent
can do release_task() and destroy current->sighand right after the check.

> Urgh.. I'm not entirely sure that check is correct, but I always forget
> the rules with signal. It could be we ought to be testing PF_EXISTING
> instead.

Agreed, PF_EXISTING check makes more sense in any case, the exiting task
can't receive the signal anyway.

Oleg.



Re: [syzbot] possible deadlock in register_for_each_vma

2021-03-27 Thread Oleg Nesterov
Hi Hillf,

it seems that you already understand the problem ;) I don't.

Could you explain in details how double __register is possible ? and how
it connects to this lockdep report?

On 03/27, Hillf Danton wrote:
>
> On Fri, 26 Mar 2021 03:29:19
> > Hello,
> > 
> > syzbot found the following issue on:
> > 
> > HEAD commit:0d02ec6b Linux 5.12-rc4
> > git tree:   upstream
> > console output: https://syzkaller.appspot.com/x/log.txt?x=1719e4aad0
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=5adab0bdee099d7a
> > dashboard link: https://syzkaller.appspot.com/bug?extid=b804f902bbb6bcf290cb
> > 
> > Unfortunately, I don't have any reproducer for this issue yet.
> > 
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+b804f902bbb6bcf29...@syzkaller.appspotmail.com
> > 
> > ==
> > WARNING: possible circular locking dependency detected
> > 5.12.0-rc4-syzkaller #0 Not tainted
> > --
> > syz-executor.3/23522 is trying to acquire lock:
> > 8c03e530 (dup_mmap_sem){}-{0:0}, at: 
> > register_for_each_vma+0x2c/0xc10 kernel/events/uprobes.c:1040
> > 
> > but task is already holding lock:
> > 8880624a8c90 (&uprobe->register_rwsem){+.+.}-{3:3}, at: 
> > __uprobe_register+0x531/0x850 kernel/events/uprobes.c:1177
> > 
> > which lock already depends on the new lock.
> > 
> > 
> > the existing dependency chain (in reverse order) is:
> > 
> > -> #3 (&uprobe->register_rwsem){+.+.}-{3:3}:
> >down_write+0x92/0x150 kernel/locking/rwsem.c:1406
> >__uprobe_register+0x531/0x850 kernel/events/uprobes.c:1177
> >trace_uprobe_enable kernel/trace/trace_uprobe.c:1065 [inline]
> >probe_event_enable+0x357/0xa00 kernel/trace/trace_uprobe.c:1134
> >trace_uprobe_register+0x443/0x880 kernel/trace/trace_uprobe.c:1461
> >perf_trace_event_reg kernel/trace/trace_event_perf.c:129 [inline]
> >perf_trace_event_init+0x549/0xa20 kernel/trace/trace_event_perf.c:204
> >perf_uprobe_init+0x16f/0x210 kernel/trace/trace_event_perf.c:336
> >perf_uprobe_event_init+0xff/0x1c0 kernel/events/core.c:9754
> >perf_try_init_event+0x12a/0x560 kernel/events/core.c:11071
> >perf_init_event kernel/events/core.c:11123 [inline]
> >perf_event_alloc.part.0+0xe3b/0x3960 kernel/events/core.c:11403
> >perf_event_alloc kernel/events/core.c:11785 [inline]
> >__do_sys_perf_event_open+0x647/0x2e60 kernel/events/core.c:11883
> >do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
> >entry_SYSCALL_64_after_hwframe+0x44/0xae
> > 
> > -> #2 (event_mutex){+.+.}-{3:3}:
> >__mutex_lock_common kernel/locking/mutex.c:949 [inline]
> >__mutex_lock+0x139/0x1120 kernel/locking/mutex.c:1096
> >perf_trace_destroy+0x23/0xf0 kernel/trace/trace_event_perf.c:241
> >_free_event+0x2ee/0x1380 kernel/events/core.c:4863
> >put_event kernel/events/core.c:4957 [inline]
> >perf_mmap_close+0x572/0xe10 kernel/events/core.c:6002
> >remove_vma+0xae/0x170 mm/mmap.c:180
> >remove_vma_list mm/mmap.c:2653 [inline]
> >__do_munmap+0x74f/0x11a0 mm/mmap.c:2909
> >do_munmap mm/mmap.c:2917 [inline]
> >munmap_vma_range mm/mmap.c:598 [inline]
> >mmap_region+0x85a/0x1730 mm/mmap.c:1750
> >do_mmap+0xcff/0x11d0 mm/mmap.c:1581
> >vm_mmap_pgoff+0x1b7/0x290 mm/util.c:519
> >ksys_mmap_pgoff+0x49c/0x620 mm/mmap.c:1632
> >do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
> >entry_SYSCALL_64_after_hwframe+0x44/0xae
> > 
> > -> #1 (&mm->mmap_lock#2){}-{3:3}:
> >down_write_killable+0x95/0x170 kernel/locking/rwsem.c:1417
> >mmap_write_lock_killable include/linux/mmap_lock.h:87 [inline]
> >dup_mmap kernel/fork.c:480 [inline]
> >dup_mm+0x12e/0x1380 kernel/fork.c:1368
> >copy_mm kernel/fork.c:1424 [inline]
> >copy_process+0x2b99/0x7150 kernel/fork.c:2107
> >kernel_clone+0xe7/0xab0 kernel/fork.c:2500
> >__do_sys_clone+0xc8/0x110 kernel/fork.c:2617
> >do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
> >entry_SYSCALL_64_after_hwframe+0x44/0xae
> > 
> > -> #0 (dup_mmap_sem){}-{0:0}:
> >check_prev_add kernel/locking/lockdep.c:2936 [inline]
> >check_prevs_add kernel/locking/lockdep.c:3059 [inline]
> >validate_chain kernel/locking/lockdep.c:3674 [inline]
> >__lock_acquire+0x2b14/0x54c0 kernel/locking/lockdep.c:4900
> >lock_acquire kernel/locking/lockdep.c:5510 [inline]
> >lock_acquire+0x1ab/0x740 kernel/locking/lockdep.c:5475
> >percpu_down_write+0x95/0x440 kernel/locking/percpu-rwsem.c:217
> >register_for_each_vma+0x2c/0xc10 kernel/events/uprobes.c:1040
> >__uprobe_register+0x5c2/0x850 kernel/events/uprobes.c:1181
> >trace_uprobe_enable 

Re: [PATCH 2/8] kernel: unmask SIGSTOP for IO threads

2021-03-26 Thread Oleg Nesterov
Jens, sorry, I got lost :/

On 03/25, Jens Axboe wrote:
>
> With IO threads accepting signals, including SIGSTOP,

where can I find this change? Looks like I wasn't cc'ed...

> unmask the
> SIGSTOP signal from the default blocked mask.
> 
> Signed-off-by: Jens Axboe 
> ---
>  kernel/fork.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index d3171e8e88e5..d5a40552910f 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2435,7 +2435,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), 
> void *arg, int node)
>   tsk = copy_process(NULL, 0, node, &args);
>   if (!IS_ERR(tsk)) {
>   sigfillset(&tsk->blocked);
> - sigdelsetmask(&tsk->blocked, sigmask(SIGKILL));
> + sigdelsetmask(&tsk->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));

siginitsetinv(blocked, sigmask(SIGKILL)|sigmask(SIGSTOP)) but this is minor.

To remind, either way this is racy and can't really help.

And if "IO threads accepting signals" then I don't understand why. Sorry,
I must have missed something.

Oleg.



Re: [PATCH 0/2] Don't show PF_IO_WORKER in /proc//task/

2021-03-25 Thread Oleg Nesterov
On 03/25, Linus Torvalds wrote:
>
> The whole "signals are very special for IO threads" thing has caused
> so many problems, that maybe the solution is simply to _not_ make them
> special?

Or may be IO threads should not abuse CLONE_THREAD?

Why does create_io_thread() abuse CLONE_THREAD ?

One reason (I think) is that this implies SIGKILL when the process exits/execs,
anything else?

Oleg.



Re: [PATCH 0/2] Don't show PF_IO_WORKER in /proc//task/

2021-03-25 Thread Oleg Nesterov
On 03/25, Eric W. Biederman wrote:
>
> So looking quickly the flip side of the coin is gdb (and other
> debuggers) needs a way to know these threads are special, so it can know
> not to attach.

may be,

> I suspect getting -EPERM (or possibly a different error code) when
> attempting attach is the right was to know that a thread is not
> available to be debugged.

may be.

But I don't think we can blame gdb. The kernel changed the rules, and this
broke gdb. IOW, I don't agree this is gdb bug.

Oleg.



Re: [PATCH 0/2] Don't show PF_IO_WORKER in /proc//task/

2021-03-25 Thread Oleg Nesterov
I didn't even try to read this series yet, will try tomorrow.

But sorry, I can't resist...

On 03/25, Jens Axboe wrote:
>
> On 3/25/21 1:33 PM, Eric W. Biederman wrote:
> > Jens Axboe  writes:
> >
> >> Hi,
> >>
> >> Stefan reports that attaching to a task with io_uring will leave gdb
> >> very confused and just repeatedly attempting to attach to the IO threads,
> >> even though it receives an -EPERM every time.

Heh. As expected :/

> And arguably it _is_ a gdb bug, but...

Why do you think so?

Oleg.



Re: [RFC][PATCH] task_struct::state frobbing

2021-03-25 Thread Oleg Nesterov
On 03/25, Peter Zijlstra wrote:
>
>  static void ptrace_unfreeze_traced(struct task_struct *task)
>  {
> - if (task->state != __TASK_TRACED)
> + if (READ_ONCE(task->__state) != __TASK_TRACED)
>   return;

this change is correct,

> @@ -201,11 +201,11 @@ static void ptrace_unfreeze_traced(struct task_struct 
> *task)
>* Recheck state under the lock to close this race.
>*/
>   spin_lock_irq(&task->sighand->siglock);
> - if (task->state == __TASK_TRACED) {
> + if (READ_ONCE(task->__state) == __TASK_TRACED) {

this too,

> @@ -240,7 +240,7 @@ static int ptrace_check_attach(struct task_struct *child, 
> bool ignore_state)
>*/
>   read_lock(_lock);
>   if (child->ptrace && child->parent == current) {
> - WARN_ON(child->state == __TASK_TRACED);
> + WARN_ON(task_is_traced(child));
>   /*
>* child->sighand can't be NULL, release_task()
>* does ptrace_unlink() before __exit_signal().
> @@ -257,7 +257,7 @@ static int ptrace_check_attach(struct task_struct *child, 
> bool ignore_state)
>* ptrace_stop() changes ->state back to TASK_RUNNING,
>* so we should not worry about leaking __TASK_TRACED.
>*/
> - WARN_ON(child->state == __TASK_TRACED);
> + WARN_ON(task_is_traced(child));


the two above are not.

"state == __TASK_TRACED" and task_is_traced() are not the same thing.

"state == __TASK_TRACED" means that debugger changed the state from TASK_TRACED
to __TASK_TRACED (iow, removed TASK_WAKEKILL) to ensure the tracee can not run,
this doesn't affect task_is_traced().
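
For reference, the relevant definitions (as in include/linux/sched.h at the time; quoted from memory) show why the two tests differ:

```c
#define TASK_TRACED		(TASK_WAKEKILL | __TASK_TRACED)
#define task_is_traced(task)	((task->state & __TASK_TRACED) != 0)

/* task_is_traced() is true for both TASK_TRACED and __TASK_TRACED;
 * only the exact comparison "state == __TASK_TRACED" detects that
 * the debugger has cleared TASK_WAKEKILL via ptrace_freeze_traced(). */
```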

Oleg.



Re: [PATCH V3] exit: trigger panic when global init has exited

2021-03-24 Thread Oleg Nesterov
Hi,

On 03/23, qianli zhao wrote:
>
> Hi,Oleg
>
> > You certainly don't understand me :/
>
> > Please read my email you quoted below. I didn't mean the current logic.
> > I meant the logic after your patch which moves atomic_dec_and_test() and
> > panic() before exit_signals().
>
> Sorry, I think I see what you mean now.
>
> You mean that after apply my patch,SIGNAL_GROUP_EXIT no longer needs
> to be tested or avoid zap_pid_ns_processes()->BUG().
> Yes,your consideration is correct.

OK, great

> But,my patch has another purpose,protect some key variables(such
> as:task->mm,task->nsproxy,etc) to recover init coredump from
> fulldump,if sub-threads finish do_exit(),

Yes I know.

But the purpose of this SIGNAL_GROUP_EXIT check is not clear and not
documented. That is why I said it should be documented at least in the
changelog.

Oleg.



Re: [patch V5 2/2] signal: Allow tasks to cache one sigqueue struct

2021-03-24 Thread Oleg Nesterov
On 03/23, Thomas Gleixner wrote:
>
>  include/linux/sched.h  |1 +
>  include/linux/signal.h |1 +
>  kernel/exit.c  |1 +
>  kernel/fork.c  |1 +
>  kernel/signal.c|   44 ++--
>  5 files changed, 46 insertions(+), 2 deletions(-)

both patches look good to me, feel free to add

Reviewed-by: Oleg Nesterov 



Re: [patch V4 2/2] signal: Allow tasks to cache one sigqueue struct

2021-03-23 Thread Oleg Nesterov
On 03/22, Thomas Gleixner wrote:
>
> +static void sigqueue_cache_or_free(struct sigqueue *q, bool cache)
> +{
> + /*
> +  * Cache one sigqueue per task. This pairs with the consumer side
> +  * in __sigqueue_alloc() and needs READ/WRITE_ONCE() to prevent the
> +  * compiler from store tearing and to tell KCSAN that the data race
> +  * is intentional when run without holding current->sighand->siglock,
> +  * which is fine as current obviously cannot run __sigqueue_free()
> +  * concurrently.
> +  */
> + if (cache && !READ_ONCE(current->sigqueue_cache))
> + WRITE_ONCE(current->sigqueue_cache, q);
> + else
> + kmem_cache_free(sigqueue_cachep, q);
> +}
> +
> +void exit_task_sigqueue_cache(struct task_struct *tsk)
> +{
> + /* Race free because @tsk is mopped up */
> + struct sigqueue *q = tsk->sigqueue_cache;
> +
> + if (q) {
> + tsk->sigqueue_cache = NULL;
> + /* If task is self reaping, don't cache it back */
> + sigqueue_cache_or_free(q, tsk != current);
  ^^
Still not right or I am totally confused.

tsk != current can be true if an exiting (and autoreaping) sub-thread
releases its group leader.

IOW. Suppose a process has 2 threads, its parent ignores SIGCHLD.

The group leader L exits. Then its sub-thread T exits too and calls
release_task(T). In this case the tsk != current is false.

But after that T calls release_task(L) and L != T is true.

I'd suggest to free tsk->sigqueue_cache in __exit_signal() unconditionally and
remove the "bool cache" argument from sigqueue_cache_or_free().

Oleg.



Re: [PATCH V3] exit: trigger panic when global init has exited

2021-03-23 Thread Oleg Nesterov
On 03/23, qianli zhao wrote:
>
> Hi,Oleg
>
> > No, there is at least one alive init thread. If they all have exited, we 
> > have
> > the thread which calls panic() above.
>
> By current logic, setting PF_EXITING(exit_signals()) is before the
> panic(),

You certainly don't understand me :/

Please read my email you quoted below. I didn't mean the current logic.
I meant the logic after your patch which moves atomic_dec_and_test() and
panic() before exit_signals().

Oleg.

> find_alive_thread() determines the PF_EXITING of all child
> threads, the panic thread's PF_EXITING has been set before panic(),so
> find_alive_thread() thinks this thread also dead, resulting in
> find_alive_thread returning NULL.It is possible to trigger a
> zap_pid_ns_processes()->BUG() in this case.
> 
> exit_signals(tsk);  /* sets PF_EXITING */
> ...
> group_dead = atomic_dec_and_test(&tsk->signal->live);
> if (group_dead) {
> if (unlikely(is_global_init(tsk)))
> panic("Attempted to kill init! exitcode=0x%08x\n",  //PF_EXITING has been set
> tsk->signal->group_exit_code ?: (int)code);
> 
> ===
> 
> > Why do you think so? It can affect _any_ code which runs under
> > "if (group_dead)". Again, I don't see anything wrong, but I didn't even
> > try to audit these code paths.
> 
> Yes,all places where checked the "signal->live" may be affected,but
> even before my changes, each program that checks "signal->live" may
> get different state(group_dead or not), depending on the timing of the
> caller,this situation will not change after my change.
> After my patch,"signal->live--" and other variable are set in a
> different order(such as signal->live and PF_EXITING),this can cause
> abnormalities in the logic associated with these two variables,that is
> my thinking.
> Of course, check all the "signal->live--" path is definitely
> necessary,it's just the case above that we need more attention.
> 
> Thanks
> 
> Oleg Nesterov wrote on Tue, Mar 23, 2021 at 12:37 AM:
> >
> > Hi,
> >
> > It seems that we don't understand each other.
> >
> > If we move atomic_dec_and_test(signal->live) and do
> >
> > if (group_dead && is_global_init)
> > panic(...);
> >
> >
> > before setting PF_EXITING like your patch does, then zap_pid_ns_processes()
> > simply won't be called.
> >
> > Because:
> >
> > On 03/21, qianli zhao wrote:
> > >
> > > Hi,Oleg
> > >
> > > > How? Perhaps I missed something again, but I don't think this is 
> > > > possible.
> > >
> > > > zap_pid_ns_processes() simply won't be called, find_child_reaper() will
> > > > see the !PF_EXITING thread which calls panic().
> > >
> > > > So I think this should be documented somehow, at least in the changelog.
> > >
> > > This problem occurs when both two init threads enter the do_exit,
> > > One of the init thread is syscall sys_exit_group,and set SIGNAL_GROUP_EXIT
> > > The other init thread perform ret_to_user()->get_signal() and found
> > > SIGNAL_GROUP_EXIT is set,then do_group_exit()->do_exit(),since there
> > > are no alive init threads it finally goes to
> > > zap_pid_ns_processes()
> >
> > No, there is at least one alive init thread. If they all have exited, we 
> > have
> > the thread which calls panic() above.
> >
> > > and BUG().
> >
> > so we don't need the SIGNAL_GROUP_EXIT check to avoid this BUG().
> >
> > What have I missed?
> >
> > Oleg.
> >
> 



Re: [PATCH V3] exit: trigger panic when global init has exited

2021-03-22 Thread Oleg Nesterov
On 03/22, Oleg Nesterov wrote:
>
> On 03/22, qianli zhao wrote:
> >
> > Moving the decrement position should only affect between new and old
> > code position of movement of the decrement of
> > signal->live.
>
> Why do you think so? It can affect _any_ code which runs under
> "if (group_dead)". Again, I don't see anything wrong, but I didn't even
> try to audit these code paths.

forgot to mention...

and any code outside of do_exit() which checks signal->live.

Oleg.



Re: [PATCH V3] exit: trigger panic when global init has exited

2021-03-22 Thread Oleg Nesterov
On 03/22, qianli zhao wrote:
>
> Moving the decrement position should only affect between new and old
> code position of movement of the decrement of
> signal->live.

Why do you think so? It can affect _any_ code which runs under
"if (group_dead)". Again, I don't see anything wrong, but I didn't even
try to audit these code paths.

Oleg.



Re: [PATCH V3] exit: trigger panic when global init has exited

2021-03-22 Thread Oleg Nesterov
Hi,

It seems that we don't understand each other.

If we move atomic_dec_and_test(signal->live) and do

if (group_dead && is_global_init)
panic(...);


before setting PF_EXITING like your patch does, then zap_pid_ns_processes()
simply won't be called.

Because:

On 03/21, qianli zhao wrote:
>
> Hi,Oleg
>
> > How? Perhaps I missed something again, but I don't think this is possible.
>
> > zap_pid_ns_processes() simply won't be called, find_child_reaper() will
> > see the !PF_EXITING thread which calls panic().
>
> > So I think this should be documented somehow, at least in the changelog.
>
> This problem occurs when both two init threads enter the do_exit,
> One of the init thread is syscall sys_exit_group,and set SIGNAL_GROUP_EXIT
> The other init thread perform ret_to_user()->get_signal() and found
> SIGNAL_GROUP_EXIT is set,then do_group_exit()->do_exit(),since there
> are no alive init threads it finally goes to
> zap_pid_ns_processes()

No, there is at least one alive init thread. If they all have exited, we have
the thread which calls panic() above.

> and BUG().

so we don't need the SIGNAL_GROUP_EXIT check to avoid this BUG().

What have I missed?

Oleg.



Re: [PATCH 2/2] signal: don't allow STOP on PF_IO_WORKER threads

2021-03-22 Thread Oleg Nesterov
On 03/20, Eric W. Biederman wrote:
>
> But please tell me why is it not the right thing to have the io_uring
> helper threads stop?  Why is it ok for that process to go on consuming
> cpu resources in a non-stoppable way.

Yes, I have the same question ;)

Oleg.



Re: [PATCH 2/2] signal: don't allow STOP on PF_IO_WORKER threads

2021-03-22 Thread Oleg Nesterov
On 03/20, Jens Axboe wrote:
>
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2349,6 +2349,10 @@ static bool do_signal_stop(int signr)
>
>   t = current;
>   while_each_thread(current, t) {
> + /* don't STOP PF_IO_WORKER threads */
> + if (t->flags & PF_IO_WORKER)
> + continue;
> +

This is not enough. At least task_join_group_stop() should check
PF_IO_WORKER too.
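
A hedged sketch of the additional check (hypothetical, not a tested patch):

```c
/* Sketch: without this, a PF_IO_WORKER thread present while a group
 * stop is initiated would still get JOBCTL_STOP_PENDING set on it
 * via task_join_group_stop(). */
void task_join_group_stop(struct task_struct *task)
{
	if (task->flags & PF_IO_WORKER)
		return;
	/* ... existing JOBCTL_STOP_PENDING propagation ... */
}
```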

Oleg.



Re: [PATCHSET 0/2] PF_IO_WORKER signal tweaks

2021-03-22 Thread Oleg Nesterov
On 03/20, Jens Axboe wrote:
>
> Hi,
>
> Been trying to ensure that we do the right thing wrt signals and
> PF_IO_WORKER threads

OMG. Do I understand correctly? create_io_thread() can create a sub-
thread of userspace process which acts as a kernel thread?

Looks like this is the recent feature I wasn't aware... Can't really
comment right now, just some random and possibly wrong notes.

> 1) Just don't allow signals to them in general. We do mask everything
>as blocked, outside of SIGKILL, so things like wants_signal() will
>never return true for them.

This only means that signal_wake_up() won't be called. But the signal
will be queued if sent via tkill/etc, I don't think this is what we want?

A PF_IO_WORKER thread should ignore the signals. But it seems that the
PF_IO_WORKER check in sig_task_ignored() makes no sense and can't help.
I don't think PF_IO_WORKER && SIG_KTHREAD_KERNEL is possible.

Not to mention that sig_ignored() won't even call sig_task_ignored(),
it will return false exactly because the signal is blocked.

Confused.

Plus the the setting of tsk->blocked in create_io_thread() looks racy,
signal_pending() can be already true. And in fact it can't really help,
calculate_sigpending() can set TIF_SIGPENDING after wake_up_new_task()
anyway.

And why does create_io_thread() use lower_32_bits() ? This looks very
confusing. This

.exit_signal= (lower_32_bits(flags) & CSIGNAL);

too. Firstly, the rhs is always zero, secondly it is ignored because
of CLONE_THREAD.


ptrace_attach() checks PF_IO_WORKER too. Yes, but 'gdb -p' will try
to attach to every thread in /proc/pid/tasks, so it will probably just
hang?

Oleg.



Re: [PATCH V3] exit: trigger panic when global init has exited

2021-03-19 Thread Oleg Nesterov
On 03/19, qianli zhao wrote:
>
> I will think about the risks of movement of the decrement of
> signal->live before exit_signal().
> If is difficult to judge movement of the decrement of signal->live is
> safe,how about only test 'signal->live==1' not use group_dead?
>
> Such as:
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 04029e3..87f3595 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -767,6 +767,17 @@ void __noreturn do_exit(long code)
> validate_creds_for_do_exit(tsk);
>
> /*
> +* If global init has exited,
> +* panic immediately to get a useable coredump.
> +*/
> +   if (unlikely(is_global_init(tsk) &&
> +   ((atomic_read(&tsk->signal->live) == 1) ||/*current is
> last init thread*/

Just suppose signal->live == 2 and both init's sub-threads exit at the
same time. They both can see signal->live == 2, panic() won't be called.
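
The difference can be sketched as follows (simplified from the do_exit() code quoted earlier; not a patch):

```c
/* Racy: both exiting sub-threads can read 2 and neither panics. */
if (atomic_read(&tsk->signal->live) == 1)
	panic(...);

/* Race-free: the counter is decremented atomically, so exactly one
 * thread observes the transition to zero. */
group_dead = atomic_dec_and_test(&tsk->signal->live);
if (group_dead && unlikely(is_global_init(tsk)))
	panic("Attempted to kill init! exitcode=0x%08x\n", code);
```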

Oleg.



Re: [PATCH V3] exit: trigger panic when global init has exited

2021-03-19 Thread Oleg Nesterov
On 03/19, qianli zhao wrote:
>
> > But then I don't understand the SIGNAL_GROUP_EXIT check added by your
> > patch. Do we really need it if we want to avoid zap_pid_ns_processes()
> > when the global init exits?
>
> I think check SIGNAL_GROUP_EXIT is necessary,or panic() will happen
> after all init sub-threads do_exit(),so the following two situations
> will happen:
> 1.According to the timing in the changelog,
> zap_pid_ns_processes()->BUG() maybe happened.

How? Perhaps I missed something again, but I don't think this is possible.

zap_pid_ns_processes() simply won't be called, find_child_reaper() will
see the !PF_EXITING thread which calls panic().

So I think this should be documented somehow, at least in the changelog.

Oleg.



Re: [PATCH V3] exit: trigger panic when global init has exited

2021-03-19 Thread Oleg Nesterov
On 03/18, Eric W. Biederman wrote:
>
> Oleg Nesterov  writes:
>
> > On 03/18, qianli zhao wrote:
> >>
> >> In addition, the patch also protects the init process state to
> >> successfully get usable init coredump.
> >
> > Could you spell please?
> >
> > Does this connect to SIGNAL_GROUP_EXIT check? Do you mean that you want
> > to panic earlier, before other init's sub-threads exit?
>
> That is my understanding.
>
> As I understand it this patch has two purposes:
> 1. Avoid the BUG_ON in zap_pid_ns_processes when !CONFIG_PID_NS
> 2. panic as early as possible so exiting threads don't removing
>interesting debugging state.

Yes, this was my understanding too, but the changelog didn't look
clear to me.

And I'd say that it is not that we want to avoid BUG_ON() in
zap_pid_ns_processes() when !CONFIG_PID_NS, we want to avoid
zap_pid_ns_processes() in the root namespace, regardless of
CONFIG_PID_NS.

> It is a bit tricky to tell if the movement of the decrement of
> signal->live is safe.

Agreed, this was my concern. I see nothing wrong at first glance,
but I can easily miss something.

Oleg.



Re: [PATCH V3] exit: trigger panic when global init has exited

2021-03-18 Thread Oleg Nesterov
On 03/18, qianli zhao wrote:
>
> Hi,Oleg
>
> Thank you for your reply.
>
> >> When init sub-threads running on different CPUs exit at the same time,
> >> zap_pid_ns_processe()->BUG() may be happened.
>
> > and why do you think your patch can't prevent this?
>
> > Sorry, I must have missed something. But it seems to me that you are trying
> > to fix the wrong problem. Yes, zap_pid_ns_processes() must not be called in
> > the root namespace, and this has nothing to do with CONFIG_PID_NS.
>
> Yes, i try to fix this exception by test SIGNAL_GROUP_EXIT and call
> panic before setting PF_EXITING to prevent zap_pid_ns_processes()
> being called when init do_exit().

Ah, I didn't notice your patch does atomic_dec_and_test(signal->live)
before exit_signals() which sets PF_EXITING. Thanks for correcting me.

So yes, I was wrong, your patch can prevent this. Although I'd like to
recheck if every do-something-if-group-dead action is correct in the
case we have a non-PF_EXITING thread...

But then I don't understand the SIGNAL_GROUP_EXIT check added by your
patch. Do we really need it if we want to avoid zap_pid_ns_processes()
when the global init exits?

> In addition, the patch also protects the init process state to
> successfully get usable init coredump.

Could you spell please?

Does this connect to SIGNAL_GROUP_EXIT check? Do you mean that you want
to panic earlier, before other init's sub-threads exit?

Thanks,

Oleg.



Re: [PATCH V3] exit: trigger panic when global init has exited

2021-03-17 Thread Oleg Nesterov
On 03/17, Qianli Zhao wrote:
>
> From: Qianli Zhao 
>
> When init sub-threads running on different CPUs exit at the same time,
> zap_pid_ns_processe()->BUG() may be happened.

and why do you think your patch can't prevent this?

Sorry, I must have missed something. But it seems to me that you are trying
to fix the wrong problem. Yes, zap_pid_ns_processes() must not be called in
the root namespace, and this has nothing to do with CONFIG_PID_NS.

> And every thread status is abnormal after exit (PF_EXITING set, task->mm=NULL etc),
> which makes it difficult to parse coredump from fulldump normally.
> In order to fix the above problem, when any one init has been set to SIGNAL_GROUP_EXIT,
> trigger panic immediately, and prevent other init threads from continuing to exit
>
> [   24.705376] Kernel panic - not syncing: Attempted to kill init! exitcode=0x7f00
> [   24.705382] CPU: 4 PID: 552 Comm: init Tainted: G S O 4.14.180-perf-g4483caa8ae80-dirty #1
> [   24.705390] kernel BUG at include/linux/pid_namespace.h:98!
>
> PID: 552   CPU: 4   COMMAND: "init"
> PID: 1 CPU: 7   COMMAND: "init"
> core4   core7
> ... sys_exit_group()
> do_group_exit()
>- sig->flags = SIGNAL_GROUP_EXIT
>- zap_other_threads()
> do_exit() //PF_EXITING is set
> ret_to_user()
> do_notify_resume()
> get_signal()
> - signal_group_exit
> - goto fatal;
> do_group_exit()
> do_exit() //PF_EXITING is set
> - panic("Attempted to kill init! exitcode=0x%08x\n")
> exit_notify()
> find_alive_thread() //no alive sub-threads
> zap_pid_ns_processes() //CONFIG_PID_NS is not set
> BUG()
>
> Signed-off-by: Qianli Zhao 
> ---
> V3:
> - Use group_dead instead of thread_group_empty() to test single init exit.
>
> V2:
> - Changelog update
> - Remove wrong useage of SIGNAL_UNKILLABLE.
> - Add thread_group_empty() test to handle single init thread exit
> ---
>  kernel/exit.c | 20 +++-
>  1 file changed, 11 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 04029e3..32b74e4 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -766,6 +766,17 @@ void __noreturn do_exit(long code)
>
>   validate_creds_for_do_exit(tsk);
>
> + group_dead = atomic_dec_and_test(&tsk->signal->live);
> + /*
> +  * If global init has exited,
> +  * panic immediately to get a useable coredump.
> +  */
> + if (unlikely(is_global_init(tsk) &&
> + (group_dead || (tsk->signal->flags & SIGNAL_GROUP_EXIT)))) {
> + panic("Attempted to kill init! exitcode=0x%08x\n",
> + tsk->signal->group_exit_code ?: (int)code);
> + }
> +
>   /*
>* We're taking recursive faults here in do_exit. Safest is to just
>* leave this task alone and wait for reboot.
> @@ -784,16 +795,7 @@ void __noreturn do_exit(long code)
>   if (tsk->mm)
>   sync_mm_rss(tsk->mm);
>   acct_update_integrals(tsk);
> - group_dead = atomic_dec_and_test(&tsk->signal->live);
>   if (group_dead) {
> - /*
> -  * If the last thread of global init has exited, panic
> -  * immediately to get a useable coredump.
> -  */
> - if (unlikely(is_global_init(tsk)))
> - panic("Attempted to kill init! exitcode=0x%08x\n",
> - tsk->signal->group_exit_code ?: (int)code);
> -
>  #ifdef CONFIG_POSIX_TIMERS
>   hrtimer_cancel(>signal->real_timer);
>   exit_itimers(tsk->signal);
> --
> 1.9.1
>



[tip: x86/urgent] kernel, fs: Introduce and use set_restart_fn() and arch_set_restart_data()

2021-03-16 Thread tip-bot2 for Oleg Nesterov
The following commit has been merged into the x86/urgent branch of tip:

Commit-ID: 5abbe51a526253b9f003e9a0a195638dc882d660
Gitweb: https://git.kernel.org/tip/5abbe51a526253b9f003e9a0a195638dc882d660
Author:Oleg Nesterov 
AuthorDate:Mon, 01 Feb 2021 18:46:41 +01:00
Committer: Thomas Gleixner 
CommitterDate: Tue, 16 Mar 2021 22:13:10 +01:00

kernel, fs: Introduce and use set_restart_fn() and arch_set_restart_data()

Preparation for fixing get_nr_restart_syscall() on X86 for COMPAT.

Add a new helper which sets restart_block->fn and calls a dummy
arch_set_restart_data() helper.

Fixes: 609c19a385c8 ("x86/ptrace: Stop setting TS_COMPAT in ptrace code")
Signed-off-by: Oleg Nesterov 
Signed-off-by: Thomas Gleixner 
Cc: sta...@vger.kernel.org
Link: https://lore.kernel.org/r/20210201174641.ga17...@redhat.com

---
 fs/select.c| 10 --
 include/linux/thread_info.h| 13 +
 kernel/futex.c |  3 +--
 kernel/time/alarmtimer.c   |  2 +-
 kernel/time/hrtimer.c  |  2 +-
 kernel/time/posix-cpu-timers.c |  2 +-
 6 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index 37aaa83..945896d 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -1055,10 +1055,9 @@ static long do_restart_poll(struct restart_block *restart_block)
 
ret = do_sys_poll(ufds, nfds, to);
 
-   if (ret == -ERESTARTNOHAND) {
-   restart_block->fn = do_restart_poll;
-   ret = -ERESTART_RESTARTBLOCK;
-   }
+   if (ret == -ERESTARTNOHAND)
+   ret = set_restart_fn(restart_block, do_restart_poll);
+
return ret;
 }
 
@@ -1080,7 +1079,6 @@ SYSCALL_DEFINE3(poll, struct pollfd __user *, ufds, unsigned int, nfds,
struct restart_block *restart_block;
 
restart_block = &current->restart_block;
-   restart_block->fn = do_restart_poll;
restart_block->poll.ufds = ufds;
restart_block->poll.nfds = nfds;
 
@@ -1091,7 +1089,7 @@ SYSCALL_DEFINE3(poll, struct pollfd __user *, ufds, unsigned int, nfds,
} else
restart_block->poll.has_timeout = 0;
 
-   ret = -ERESTART_RESTARTBLOCK;
+   ret = set_restart_fn(restart_block, do_restart_poll);
}
return ret;
 }
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index 9b2158c..157762d 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -11,6 +11,7 @@
 #include <linux/types.h>
 #include <linux/bug.h>
 #include <linux/restart_block.h>
+#include <linux/errno.h>
 
 #ifdef CONFIG_THREAD_INFO_IN_TASK
 /*
@@ -59,6 +60,18 @@ enum syscall_work_bit {
 
 #ifdef __KERNEL__
 
+#ifndef arch_set_restart_data
+#define arch_set_restart_data(restart) do { } while (0)
+#endif
+
+static inline long set_restart_fn(struct restart_block *restart,
+   long (*fn)(struct restart_block *))
+{
+   restart->fn = fn;
+   arch_set_restart_data(restart);
+   return -ERESTART_RESTARTBLOCK;
+}
+
 #ifndef THREAD_ALIGN
 #define THREAD_ALIGN   THREAD_SIZE
 #endif
diff --git a/kernel/futex.c b/kernel/futex.c
index e68db77..00febd6 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2728,14 +2728,13 @@ retry:
goto out;
 
restart = &current->restart_block;
-   restart->fn = futex_wait_restart;
restart->futex.uaddr = uaddr;
restart->futex.val = val;
restart->futex.time = *abs_time;
restart->futex.bitset = bitset;
restart->futex.flags = flags | FLAGS_HAS_TIMEOUT;
 
-   ret = -ERESTART_RESTARTBLOCK;
+   ret = set_restart_fn(restart, futex_wait_restart);
 
 out:
if (to) {
diff --git a/kernel/time/alarmtimer.c b/kernel/time/alarmtimer.c
index 98d7a15..4d94e2b 100644
--- a/kernel/time/alarmtimer.c
+++ b/kernel/time/alarmtimer.c
@@ -854,9 +854,9 @@ static int alarm_timer_nsleep(const clockid_t which_clock, int flags,
if (flags == TIMER_ABSTIME)
return -ERESTARTNOHAND;
 
-   restart->fn = alarm_timer_nsleep_restart;
restart->nanosleep.clockid = type;
restart->nanosleep.expires = exp;
+   set_restart_fn(restart, alarm_timer_nsleep_restart);
return ret;
 }
 
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 788b9d1..5c9d968 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1957,9 +1957,9 @@ long hrtimer_nanosleep(ktime_t rqtp, const enum hrtimer_mode mode,
}
 
restart = &current->restart_block;
-   restart->fn = hrtimer_nanosleep_restart;
restart->nanosleep.clockid = t.timer.base->clockid;
restart->nanosleep.expires = hrtimer_get_expires_tv64();
+   set_restart_fn(restart, hrtimer_nanosleep_restart);
 out:
destroy_hrtimer_on_stack(&t.timer);
return ret;
diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index a7175

[tip: x86/urgent] x86: Introduce TS_COMPAT_RESTART to fix get_nr_restart_syscall()

2021-03-16 Thread tip-bot2 for Oleg Nesterov
The following commit has been merged into the x86/urgent branch of tip:

Commit-ID: 8c150ba2fb5995c84a7a43848250d444a3329a7d
Gitweb: https://git.kernel.org/tip/8c150ba2fb5995c84a7a43848250d444a3329a7d
Author:Oleg Nesterov 
AuthorDate:Mon, 01 Feb 2021 18:47:09 +01:00
Committer: Thomas Gleixner 
CommitterDate: Tue, 16 Mar 2021 22:13:11 +01:00

x86: Introduce TS_COMPAT_RESTART to fix get_nr_restart_syscall()

The comment in get_nr_restart_syscall() says:

 * The problem is that we can get here when ptrace pokes
 * syscall-like values into regs even if we're not in a syscall
 * at all.

Yes, but if not in a syscall then the

status & (TS_COMPAT|TS_I386_REGS_POKED)

check below can't really help:

- TS_COMPAT can't be set

- TS_I386_REGS_POKED is only set if regs->orig_ax was changed by
  32bit debugger; and even in this case get_nr_restart_syscall()
  is only correct if the tracee is 32bit too.

Suppose that a 64bit debugger plays with a 32bit tracee and

* Tracee calls sleep(2) // TS_COMPAT is set
* User interrupts the tracee by CTRL-C after 1 sec and does
  "(gdb) call func()"
* gdb saves the regs by PTRACE_GETREGS
* does PTRACE_SETREGS to set %rip='func' and %orig_rax=-1
* PTRACE_CONT   // TS_COMPAT is cleared
* func() hits int3.
* Debugger catches SIGTRAP.
* Restore original regs by PTRACE_SETREGS.
* PTRACE_CONT

get_nr_restart_syscall() wrongly returns __NR_restart_syscall==219, the
tracee calls ia32_sys_call_table[219] == sys_madvise.

Add the sticky TS_COMPAT_RESTART flag which survives after return to user
mode. It's going to be removed in the next step again by storing the
information in the restart block. As a further cleanup it might be possible
to remove also TS_I386_REGS_POKED with that.

Test-case:

  $ cvs -d :pserver:anoncvs:anon...@sourceware.org:/cvs/systemtap co ptrace-tests
  $ gcc -o erestartsys-trap-debuggee ptrace-tests/tests/erestartsys-trap-debuggee.c -m32
  $ gcc -o erestartsys-trap-debugger ptrace-tests/tests/erestartsys-trap-debugger.c -lutil
  $ ./erestartsys-trap-debugger
  Unexpected: retval 1, errno 22
  erestartsys-trap-debugger: ptrace-tests/tests/erestartsys-trap-debugger.c:421

Fixes: 609c19a385c8 ("x86/ptrace: Stop setting TS_COMPAT in ptrace code")
Reported-by: Jan Kratochvil 
Signed-off-by: Oleg Nesterov 
Signed-off-by: Thomas Gleixner 
Cc: sta...@vger.kernel.org
Link: https://lore.kernel.org/r/20210201174709.ga17...@redhat.com

---
 arch/x86/include/asm/thread_info.h | 14 +-
 arch/x86/kernel/signal.c   | 24 +---
 2 files changed, 14 insertions(+), 24 deletions(-)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index c2dc29e..30d1d18 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -214,10 +214,22 @@ static inline int arch_within_stack_frames(const void * const stack,
  */
 #define TS_COMPAT  0x0002  /* 32bit syscall active (64BIT)*/
 
+#ifndef __ASSEMBLY__
 #ifdef CONFIG_COMPAT
 #define TS_I386_REGS_POKED 0x0004  /* regs poked by 32-bit ptracer */
+#define TS_COMPAT_RESTART  0x0008
+
+#define arch_set_restart_data  arch_set_restart_data
+
+static inline void arch_set_restart_data(struct restart_block *restart)
+{
+   struct thread_info *ti = current_thread_info();
+   if (ti->status & TS_COMPAT)
+   ti->status |= TS_COMPAT_RESTART;
+   else
+   ti->status &= ~TS_COMPAT_RESTART;
+}
 #endif
-#ifndef __ASSEMBLY__
 
 #ifdef CONFIG_X86_32
 #define in_ia32_syscall() true
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index ea794a0..6c26d2c 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -766,30 +766,8 @@ handle_signal(struct ksignal *ksig, struct pt_regs *regs)
 
 static inline unsigned long get_nr_restart_syscall(const struct pt_regs *regs)
 {
-   /*
-* This function is fundamentally broken as currently
-* implemented.
-*
-* The idea is that we want to trigger a call to the
-* restart_block() syscall and that we want in_ia32_syscall(),
-* in_x32_syscall(), etc. to match whatever they were in the
-* syscall being restarted.  We assume that the syscall
-* instruction at (regs->ip - 2) matches whatever syscall
-* instruction we used to enter in the first place.
-*
-* The problem is that we can get here when ptrace pokes
-* syscall-like values into regs even if we're not in a syscall
-* at all.
-*
-* For now, we maintain historical behavior and guess based on
-* stored state.  We could do better by saving the actual
-* syscall arch in restart_block or (with caveats on x32) by
-* checking 

[tip: x86/urgent] x86: Move TS_COMPAT back to asm/thread_info.h

2021-03-16 Thread tip-bot2 for Oleg Nesterov
The following commit has been merged into the x86/urgent branch of tip:

Commit-ID: 66c1b6d74cd7035e85c426f0af4aede19e805c8a
Gitweb: https://git.kernel.org/tip/66c1b6d74cd7035e85c426f0af4aede19e805c8a
Author:Oleg Nesterov 
AuthorDate:Mon, 01 Feb 2021 18:46:49 +01:00
Committer: Thomas Gleixner 
CommitterDate: Tue, 16 Mar 2021 22:13:11 +01:00

x86: Move TS_COMPAT back to asm/thread_info.h

Move TS_COMPAT back to asm/thread_info.h, close to TS_I386_REGS_POKED.

It was moved to asm/processor.h by b9d989c7218a ("x86/asm: Move the
thread_info::status field to thread_struct"), then later 37a8f7c38339
("x86/asm: Move 'status' from thread_struct to thread_info") moved the
'status' field back but TS_COMPAT was forgotten.

Preparatory patch to fix the COMPAT case for get_nr_restart_syscall()

Fixes: 609c19a385c8 ("x86/ptrace: Stop setting TS_COMPAT in ptrace code")
Signed-off-by: Oleg Nesterov 
Signed-off-by: Thomas Gleixner 
Cc: sta...@vger.kernel.org
Link: https://lore.kernel.org/r/20210201174649.ga17...@redhat.com
---
 arch/x86/include/asm/processor.h   |  9 -
 arch/x86/include/asm/thread_info.h |  9 +
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index dc6d149..f1b9ed5 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -551,15 +551,6 @@ static inline void arch_thread_struct_whitelist(unsigned long *offset,
*size = fpu_kernel_xstate_size;
 }
 
-/*
- * Thread-synchronous status.
- *
- * This is different from the flags in that nobody else
- * ever touches our thread-synchronous status, so we don't
- * have to worry about atomic accesses.
- */
-#define TS_COMPAT  0x0002  /* 32bit syscall active (64BIT)*/
-
 static inline void
 native_load_sp0(unsigned long sp0)
 {
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 0d751d5..c2dc29e 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -205,6 +205,15 @@ static inline int arch_within_stack_frames(const void * const stack,
 
 #endif
 
+/*
+ * Thread-synchronous status.
+ *
+ * This is different from the flags in that nobody else
+ * ever touches our thread-synchronous status, so we don't
+ * have to worry about atomic accesses.
+ */
+#define TS_COMPAT  0x0002  /* 32bit syscall active (64BIT)*/
+
 #ifdef CONFIG_COMPAT
 #define TS_I386_REGS_POKED 0x0004  /* regs poked by 32-bit ptracer */
 #endif


[tip: x86/urgent] x86: Introduce restart_block->arch_data to remove TS_COMPAT_RESTART

2021-03-16 Thread tip-bot2 for Oleg Nesterov
The following commit has been merged into the x86/urgent branch of tip:

Commit-ID: b2e9df850c58c2b36e915e7d3bed3f6107cccba6
Gitweb: https://git.kernel.org/tip/b2e9df850c58c2b36e915e7d3bed3f6107cccba6
Author:Oleg Nesterov 
AuthorDate:Mon, 01 Feb 2021 18:47:16 +01:00
Committer: Thomas Gleixner 
CommitterDate: Tue, 16 Mar 2021 22:13:11 +01:00

x86: Introduce restart_block->arch_data to remove TS_COMPAT_RESTART

Save the current_thread_info()->status of X86 in the new
restart_block->arch_data field so TS_COMPAT_RESTART can be removed again.

Signed-off-by: Oleg Nesterov 
Signed-off-by: Thomas Gleixner 
Link: https://lore.kernel.org/r/20210201174716.ga17...@redhat.com

---
 arch/x86/include/asm/thread_info.h | 12 ++--
 arch/x86/kernel/signal.c   |  2 +-
 include/linux/restart_block.h  |  1 +
 3 files changed, 4 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 30d1d18..06b740b 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -217,18 +217,10 @@ static inline int arch_within_stack_frames(const void * const stack,
 #ifndef __ASSEMBLY__
 #ifdef CONFIG_COMPAT
 #define TS_I386_REGS_POKED 0x0004  /* regs poked by 32-bit ptracer */
-#define TS_COMPAT_RESTART  0x0008
 
-#define arch_set_restart_data  arch_set_restart_data
+#define arch_set_restart_data(restart) \
+   do { restart->arch_data = current_thread_info()->status; } while (0)
 
-static inline void arch_set_restart_data(struct restart_block *restart)
-{
-   struct thread_info *ti = current_thread_info();
-   if (ti->status & TS_COMPAT)
-   ti->status |= TS_COMPAT_RESTART;
-   else
-   ti->status &= ~TS_COMPAT_RESTART;
-}
 #endif
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 6c26d2c..f306e85 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -767,7 +767,7 @@ handle_signal(struct ksignal *ksig, struct pt_regs *regs)
 static inline unsigned long get_nr_restart_syscall(const struct pt_regs *regs)
 {
 #ifdef CONFIG_IA32_EMULATION
-   if (current_thread_info()->status & TS_COMPAT_RESTART)
+   if (current->restart_block.arch_data & TS_COMPAT)
return __NR_ia32_restart_syscall;
 #endif
 #ifdef CONFIG_X86_X32_ABI
diff --git a/include/linux/restart_block.h b/include/linux/restart_block.h
index bba2920..980a655 100644
--- a/include/linux/restart_block.h
+++ b/include/linux/restart_block.h
@@ -23,6 +23,7 @@ enum timespec_type {
  * System call restart block.
  */
 struct restart_block {
+   unsigned long arch_data;
long (*fn)(struct restart_block *);
union {
/* For futex_wait and futex_wait_requeue_pi */


Re: [PATCH RESEND v4 0/4] x86: fix get_nr_restart_syscall()

2021-03-16 Thread Oleg Nesterov
On 02/04, Oleg Nesterov wrote:
>
> It seems that nobody objects,
>
> Andrew, Andy, Thomas, how do you think this series should be routed?

ping...

What can I do to finally get this merged?

Should I resend once again? Whom?


> On 02/01, Oleg Nesterov wrote:
> >
> > Somehow I forgot about this problem. Let me resend the last version
> > based on discussion with Linus. IIRC he was agree with this series.
> >
> > And let me remind why 3/4 temporary adds the "transient" TS_COMPAT_RESTART
> > flag killed by the next patch: to simplify the backporting. 1-3 can fix
> > the problem without breaking the kABI.
> >
> > Oleg.
> > ---
> >  arch/x86/include/asm/processor.h   |  9 -
> >  arch/x86/include/asm/thread_info.h | 15 ++-
> >  arch/x86/kernel/signal.c   | 24 +---
> >  fs/select.c| 10 --
> >  include/linux/restart_block.h  |  1 +
> >  include/linux/thread_info.h| 13 +
> >  kernel/futex.c |  3 +--
> >  kernel/time/alarmtimer.c   |  2 +-
> >  kernel/time/hrtimer.c  |  2 +-
> >  kernel/time/posix-cpu-timers.c |  2 +-
> >  10 files changed, 37 insertions(+), 44 deletions(-)



Re: [PATCH v2] task_work: kasan: record task_work_add() call stack

2021-03-16 Thread Oleg Nesterov
On 03/16, Walter Wu wrote:
>
> --- a/kernel/task_work.c
> +++ b/kernel/task_work.c
> @@ -34,6 +34,9 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
>  {
>   struct callback_head *head;
>
> + /* record the work call stack in order to print it in KASAN reports */
> + kasan_record_aux_stack(work);
> +
>   do {
>   head = READ_ONCE(task->task_works);
>   if (unlikely(head == &work_exited))

Acked-by: Oleg Nesterov 



Re: [PATCH v7] do_wait: make PIDTYPE_PID case O(1) instead of O(n)

2021-03-15 Thread Oleg Nesterov
On 03/14, Jim Newsome wrote:
>
> Since v5:
> * Switched back to explicitly looking up by tgid and then pid.

OK, as I said I won't argue, still looks good to me.

Reviewed-by: Oleg Nesterov 



Re: [PATCH] prctl: fix PR_SET_MM_AUXV kernel stack leak

2021-03-15 Thread Oleg Nesterov
On 03/15, Cyrill Gorcunov wrote:
>
> On Mon, Mar 15, 2021 at 01:08:03PM +0100, Oleg Nesterov wrote:
>
> >
> > And why task_lock(current) ? What does it try to protect?
>
> As far as I remember this was related to reading from procfs
> at time the patch was written for first time. Looks like this
> not relevant anymore and could be dropped.

I think this was never relevant, ->alloc_lock is per-thread, not per mm.

Oleg.



Re: [PATCH] prctl: fix PR_SET_MM_AUXV kernel stack leak

2021-03-15 Thread Oleg Nesterov
On 03/14, Alexey Dobriyan wrote:
>
>   prctl(PR_SET_MM, PR_SET_MM_AUXV, addr, 1);
>
> will copy 1 byte from userspace to (quite big) on-stack array
> and then stash everything to mm->saved_auxv.

I too don't understand, memcpy(mm->saved_auxv, user_auxv, len) will
copy 1 byte...

And why task_lock(current) ? What does it try to protect?

Oleg.



Re: [PATCH v5] do_wait: make PIDTYPE_PID case O(1) instead of O(n)

2021-03-13 Thread Oleg Nesterov
On 03/12, Eric W. Biederman wrote:
>
> I am going to kibitz just a little bit more.
>
> When I looked at this a second time it became apparent that using
> pid_task twice should actually be faster as it removes a dependent load
> caused by thread_group_leader, and replaces it by accessing two adjacent
> pointers in the same cache line.

Heh, I think this code should be optimized for the reader in the first place ;)

But as I said I won't argue, readability is very subjective, I am fine with your
version too.

Oleg.



Re: [patch V2 3/3] signal: Allow tasks to cache one sigqueue struct

2021-03-13 Thread Oleg Nesterov
On 03/12, Thomas Gleixner wrote:
>
> On Fri, Mar 12 2021 at 20:26, Thomas Gleixner wrote:
> > On Fri, Mar 12 2021 at 17:11, Oleg Nesterov wrote:
> >> On 03/11, Thomas Gleixner wrote:
> >>>
> >>> @@ -456,7 +460,12 @@ static void __sigqueue_free(struct sigqueue *q)
> >>>   return;
> >>>   if (atomic_dec_and_test(&q->user->sigpending))
> >>>   free_uid(q->user);
> >>> - kmem_cache_free(sigqueue_cachep, q);
> >>> +
> >>> + /* Cache one sigqueue per task */
> >>> + if (!current->sigqueue_cache)
> >>> + current->sigqueue_cache = q;
> >>> + else
> >>> + kmem_cache_free(sigqueue_cachep, q);
> >>>  }
> >>
> >> This doesn't look right, note that __exit_signal() does
> >> flush_sigqueue(&sig->shared_pending) at the end, after exit_task_sighand()
> >> was already called.
> >>
> >> I'd suggest to not add the new exit_task_sighand() helper and simply free
> >> current->sigqueue_cache at the end of __exit_signal().
> >
> > Ooops. Thanks for spotting this!
>
> Hrm.
>
> The task which is released is obviously not current, so even if there
> are still sigqueues in shared_pending then they wont end up in the
> released tasks sigqueue_cache. They can only ever end up in
> current->sigqueue_cache.

The task which is released can be "current" if this task autoreaps.
For example, if its parent ignores SIGCHLD. In this case exit_notify()
does release_task(current).

> But that brings my memory back why I had cmpxchg() in the original
> version. This code runs without current->sighand->siglock held.

Yes, I was wrong too. This code runs without exiting_task->sighand->siglock
and this is fine in that it can not race with send_signal(exiting_task),
but somehow I missed the fact that it always populates current->sigqueue_cache,
not exiting_task->sigqueue_cache.

Oleg.



Re: [PATCH] ptrace: Allow other threads to access tracee

2021-03-12 Thread Oleg Nesterov
On 03/11, Jim Newsome wrote:
>
> I suppose even if the corruption of the register-values-themselves is
> acceptable, some synchronization may be needed to avoid the possibility
> of corrupting the kernel's data structures?

Yes, the kernel can crash. Just look at the comment above ptrace_freeze_traced().
The kernel assumes that the tracee is frozen, in particular it can't exit.
Say, ptrace_peek_siginfo() can crash if the tracee exits and clears ->sighand,
and this can obviously happen if another thread does PTRACE_CONT + SIGKILL.

> Is it "just" a matter of adding some locking? Would a relatively coarse
> lock on the target task over the duration of the ptrace call

Yes I think needs a mutex in task_struct. But honestly I am not sure
it makes sense I dunno.

> (which I
> believe is always non-blocking?)

Why? It is blocking.

Oleg.



Re: [PATCH v4] do_wait: make PIDTYPE_PID case O(1) instead of O(n)

2021-03-12 Thread Oleg Nesterov
On 03/11, Jim Newsome wrote:
>
> +static bool is_effectively_child(struct wait_opts *wo, bool ptrace,
> +  struct task_struct *target)
> +{
> + struct task_struct *parent =
> + !ptrace ? target->real_parent : target->parent;
> +
> + return current == parent || (!(wo->wo_flags & __WNOTHREAD) &&
> +  same_thread_group(current, parent));
> +}
> +
> +/*
> + * Optimization for waiting on PIDTYPE_PID. No need to iterate through child
> + * and tracee lists to find the target task.
> + */
> +static int do_wait_pid(struct wait_opts *wo)
> +{
> + bool ptrace;
> + struct task_struct *target;
> + int retval;
> +
> + ptrace = false;
> +
> + /* A non-ptrace wait can only be performed on a thread group leader. */
> + target = pid_task(wo->wo_pid, PIDTYPE_TGID);
> +
> + if (target && is_effectively_child(wo, ptrace, target)) {
> + retval = wait_consider_task(wo, ptrace, target);
> + if (retval)
> + return retval;
> + }
> +
> + ptrace = true;
> +
> + /* A ptrace wait can be done on non-thread-group-leaders. */
> + if (!target)
> + target = pid_task(wo->wo_pid, PIDTYPE_PID);
> +
> + if (target && is_effectively_child(wo, ptrace, target)) {
> + retval = wait_consider_task(wo, ptrace, target);

No, this is not right... You need to check target->ptrace != 0.

I know that Eric suggests to not use thread_group_leader() and I won't argue
even if I don't really agree.

Up to you, but to me something like

do_wait_pid()
{
target = pid_task(wo->wo_pid, PIDTYPE_PID);

if (!target)
return 0;

if (thread_group_leader(target) &&
is_effectively_child(wo, 0, target) {
... 
}

if (target->ptrace &&
is_effectively_child(wo, 1, target) {
...
}

return 0;

}

looks more simple/clean.

Oleg.



Re: [PATCH V2] exit: trigger panic when global init has exited

2021-03-12 Thread Oleg Nesterov
On 03/12, Qianli Zhao wrote:
>
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -767,6 +767,17 @@ void __noreturn do_exit(long code)
>   validate_creds_for_do_exit(tsk);
>  
>   /*
> +  * If global init has exited,
> +  * panic immediately to get a useable coredump.
> +  */
> + if (unlikely(is_global_init(tsk) &&
> + (thread_group_empty(tsk) ||
> + (tsk->signal->flags & SIGNAL_GROUP_EXIT)))) {
> + panic("Attempted to kill init! exitcode=0x%08x\n",
> + tsk->signal->group_exit_code ?: (int)code);

See our discussion with Eric, this is not right.
https://lore.kernel.org/lkml/20210310173236.gb8...@redhat.com/

Oleg.



Re: [patch V2 3/3] signal: Allow tasks to cache one sigqueue struct

2021-03-12 Thread Oleg Nesterov
On 03/12, Sebastian Andrzej Siewior wrote:
>
> On 2021-03-11 14:20:39 [+0100], Thomas Gleixner wrote:
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -433,7 +433,11 @@ static struct sigqueue *
> > rcu_read_unlock();
> >
> > if (override_rlimit || likely(sigpending <= task_rlimit(t, RLIMIT_SIGPENDING))) {
> > -   q = kmem_cache_alloc(sigqueue_cachep, gfp_flags);
> > +   /* Preallocation does not hold sighand::siglock */
> > +   if (sigqueue_flags || !t->sigqueue_cache)
> > +   q = kmem_cache_alloc(sigqueue_cachep, gfp_flags);
> > +   else
> > +   q = xchg(&t->sigqueue_cache, NULL);
>
> Could it happen that two tasks saw t->sigqueue_cache != NULL, the first
> one got the pointer via xchg() and the second got NULL via xchg()?

It is called with sighand::siglock held, we don't even need xchg().

Oleg.



Re: [patch V2 3/3] signal: Allow tasks to cache one sigqueue struct

2021-03-12 Thread Oleg Nesterov
On 03/11, Thomas Gleixner wrote:
>
> @@ -456,7 +460,12 @@ static void __sigqueue_free(struct sigqueue *q)
>   return;
>   if (atomic_dec_and_test(&q->user->sigpending))
>   free_uid(q->user);
> - kmem_cache_free(sigqueue_cachep, q);
> +
> + /* Cache one sigqueue per task */
> + if (!current->sigqueue_cache)
> + current->sigqueue_cache = q;
> + else
> + kmem_cache_free(sigqueue_cachep, q);
>  }

This doesn't look right, note that __exit_signal() does
flush_sigqueue(&sig->shared_pending) at the end, after exit_task_sighand()
was already called.

I'd suggest to not add the new exit_task_sighand() helper and simply free
current->sigqueue_cache at the end of __exit_signal().

Oleg.



Re: [PATCH] ptrace: Allow other threads to access tracee

2021-03-11 Thread Oleg Nesterov
On 03/10, Jim Newsome wrote:
>
> @@ -238,7 +238,7 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
>* be changed by us so it's not changing right after this.
>*/
>   read_lock(&tasklist_lock);
> - if (child->ptrace && child->parent == current) {
> + if (child->ptrace && same_thread_group(child->parent, current)) {

Cough... it is not that simple.

Just suppose that 2 threads call ptrace(tracee) at the same time. Say, the 1st
thread does PTRACE_CONT while the 2nd thread tries to change the registers.

Oleg.



Re: [PATCH v3] do_wait: make PIDTYPE_PID case O(1) instead of O(n)

2021-03-11 Thread Oleg Nesterov
On 03/10, Jim Newsome wrote:
>
> On 3/10/21 16:40, Eric W. Biederman wrote:
>
> >> +static int do_wait_pid(struct wait_opts *wo)
> >> +{
> >> +  struct task_struct *target = pid_task(wo->wo_pid, PIDTYPE_PID);
> >  ^
> > This is subtle change in behavior.
> >
> > Today on the task->children list we only place thread group leaders.
>
> Shouldn't we allow waiting on clone children if __WALL or __WCLONE is set?

Don't confuse clone child with child's sub-thread.

Oleg.



Re: [PATCH v3] do_wait: make PIDTYPE_PID case O(1) instead of O(n)

2021-03-11 Thread Oleg Nesterov
On 03/10, Eric W. Biederman wrote:
>
> Jim Newsome  writes:
>
> > +static int do_wait_pid(struct wait_opts *wo)
> > +{
> > +   struct task_struct *target = pid_task(wo->wo_pid, PIDTYPE_PID);
>  ^
> This is subtle change in behavior.
>
> Today on the task->children list we only place thread group leaders.

Aaah, yes, thanks Eric!

> So the code either needs a thread_group_leader filter on target before
> the ptrace=0 case or we need to use "pid_task(wo->wo_pid, PIDTYPE_TGID)"
> and "pid_task(wo->wo_pid, PIDTYPE_PID)" for the "ptrace=1" case.

Agreed,

> I would like to make thread_group_leaders go away

Hmm, why?

Oleg.



Re: [PATCH v2] ptrace: add PTRACE_GET_RSEQ_CONFIGURATION request

2021-03-10 Thread Oleg Nesterov
I don't think I can review this patch, I don't understand the problem space
well enough. But just in case, I see nothing wrong in this simple patch.

Feel free to add

Acked-by: Oleg Nesterov 



On 02/26, Piotr Figiel wrote:
>
> For userspace checkpoint and restore (C/R) a way of getting process state
> containing RSEQ configuration is needed.
> 
> There are two ways this information is going to be used:
>  - to re-enable RSEQ for threads which had it enabled before C/R
>  - to detect if a thread was in a critical section during C/R
> 
> Since C/R preserves TLS memory and addresses RSEQ ABI will be restored
> using the address registered before C/R.
> 
> Detection whether the thread is in a critical section during C/R is needed
> to enforce behavior of RSEQ abort during C/R. Attaching with ptrace()
> before registers are dumped itself doesn't cause RSEQ abort.
> Restoring the instruction pointer within the critical section is
> problematic because rseq_cs may get cleared before the control is passed
> to the migrated application code leading to RSEQ invariants not being
> preserved. C/R code will use RSEQ ABI address to find the abort handler
> to which the instruction pointer needs to be set.
> 
> To achieve above goals expose the RSEQ ABI address and the signature value
> with the new ptrace request PTRACE_GET_RSEQ_CONFIGURATION.
> 
> This new ptrace request can also be used by debuggers so they are aware
> of stops within restartable sequences in progress.
> 
> Signed-off-by: Piotr Figiel 
> Reviewed-by: Michal Miroslaw 
> 
> ---
> v2:
> Applied review comments:
>  - changed return value from the ptrace request to the size of the
>configuration structure
>  - expanded configuration structure with the flags field and
>the rseq abi structure size
> 
> v1:
>  https://lore.kernel.org/lkml/20210222100443.4155938-1-fig...@google.com/
> 
> ---
>  include/uapi/linux/ptrace.h | 10 ++
>  kernel/ptrace.c | 25 +
>  2 files changed, 35 insertions(+)
> 
> diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
> index 83ee45fa634b..3747bf816f9a 100644
> --- a/include/uapi/linux/ptrace.h
> +++ b/include/uapi/linux/ptrace.h
> @@ -102,6 +102,16 @@ struct ptrace_syscall_info {
>   };
>  };
>  
> +#define PTRACE_GET_RSEQ_CONFIGURATION0x420f
> +
> +struct ptrace_rseq_configuration {
> + __u64 rseq_abi_pointer;
> + __u32 rseq_abi_size;
> + __u32 signature;
> + __u32 flags;
> + __u32 pad;
> +};
> +
>  /*
>   * These values are stored in task->ptrace_message
>   * by tracehook_report_syscall_* to describe the current syscall-stop.
> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> index 61db50f7ca86..76f09456ec4b 100644
> --- a/kernel/ptrace.c
> +++ b/kernel/ptrace.c
> @@ -31,6 +31,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include  /* for syscall_get_* */
>  
> @@ -779,6 +780,24 @@ static int ptrace_peek_siginfo(struct task_struct *child,
>   return ret;
>  }
>  
> +#ifdef CONFIG_RSEQ
> +static long ptrace_get_rseq_configuration(struct task_struct *task,
> +   unsigned long size, void __user *data)
> +{
> + struct ptrace_rseq_configuration conf = {
> + .rseq_abi_pointer = (u64)(uintptr_t)task->rseq,
> + .rseq_abi_size = sizeof(*task->rseq),
> + .signature = task->rseq_sig,
> + .flags = 0,
> + };
> +
> + size = min_t(unsigned long, size, sizeof(conf));
> + if (copy_to_user(data, &conf, size))
> + return -EFAULT;
> + return sizeof(conf);
> +}
> +#endif
> +
>  #ifdef PTRACE_SINGLESTEP
>  #define is_singlestep(request)   ((request) == PTRACE_SINGLESTEP)
>  #else
> @@ -1222,6 +1241,12 @@ int ptrace_request(struct task_struct *child, long 
> request,
>   ret = seccomp_get_metadata(child, addr, datavp);
>   break;
>  
> +#ifdef CONFIG_RSEQ
> + case PTRACE_GET_RSEQ_CONFIGURATION:
> + ret = ptrace_get_rseq_configuration(child, addr, datavp);
> + break;
> +#endif
> +
>   default:
>   break;
>   }
> -- 
> 2.30.1.766.gb4fecdf3b7-goog
> 



Re: [PATCH] exit: trigger panic when init process is set to SIGNAL_GROUP_EXIT

2021-03-10 Thread Oleg Nesterov
On 03/10, Eric W. Biederman wrote:
>
>   /* If global init has exited,
>  * panic immediately to get a useable coredump.
>  */
>   if (unlikely(is_global_init(tsk) &&
>   (thread_group_empty(tsk) ||
> 	     (tsk->signal->flags & SIGNAL_GROUP_EXIT)))) {
>   panic("Attempted to kill init!  exitcode=0x%08x\n",
>   tsk->signal->group_exit_code ?: (int)code);
>   }
>
> The thread_group_empty test is needed to handle single threaded
> inits.

But we can't rely on thread_group_empty(). Just suppose that the main
thread exit first, then the 2nd (last) thread exits too.

Oleg.



Re: [PATCH v3] do_wait: make PIDTYPE_PID case O(1) instead of O(n)

2021-03-10 Thread Oleg Nesterov
On 03/09, Jim Newsome wrote:
>
> do_wait is an internal function used to implement waitpid, waitid,
> wait4, etc. To handle the general case, it does an O(n) linear scan of
> the thread group's children and tracees.
>
> This patch adds a special-case when waiting on a pid to skip these scans
> and instead do an O(1) lookup. This improves performance when waiting on
> a pid from a thread group with many children and/or tracees.
>
> Signed-off-by: James Newsome 

Reviewed-by: Oleg Nesterov 



Re: [PATCH] exit: trigger panic when init process is set to SIGNAL_GROUP_EXIT

2021-03-09 Thread Oleg Nesterov
On 03/09, Qianli Zhao wrote:
>
> From: Qianli Zhao 
>
> Once any init thread finds SIGNAL_GROUP_EXIT, trigger panic immediately
> instead of last thread of global init has exited, and do not allow other
> init threads to exit, protect task/memory state of all sub-threads for
> get reliable init coredump

To be honest, I don't understand the changelog. It seems that you want
to uglify the kernel to simplify the debugging of buggy init? Or what?

Nor can I understand the patch. I fail to understand the games with
SIGNAL_UNKILLABLE and ->siglock.

And iiuc with this patch the kernel will crash if init's sub-thread execs,
signal_group_exit() returns T in this case.

Oleg.

> [   24.705376] Kernel panic - not syncing: Attempted to kill init! 
> exitcode=0x7f00
> [   24.705382] CPU: 4 PID: 552 Comm: init Tainted: G S O
> 4.14.180-perf-g4483caa8ae80-dirty #1
> [   24.705390] kernel BUG at include/linux/pid_namespace.h:98!
> 
> PID: 552   CPU: 4   COMMAND: "init"
> PID: 1 CPU: 7   COMMAND: "init"
> core4 core7
> ...   sys_exit_group()
>   do_group_exit()
>   - sig->flags = SIGNAL_GROUP_EXIT
>   - zap_other_threads()
>   do_exit()
>   - PF_EXITING is set
> ret_to_user()
> do_notify_resume()
> get_signal()
> - signal_group_exit
> - goto fatal;
> do_group_exit()
> do_exit()
> - PF_EXITING is set
> - panic("Attempted to kill init! exitcode=0x%08x\n")
>   exit_notify()
>   find_alive_thread() //no alive sub-threads
>   zap_pid_ns_processes()//CONFIG_PID_NS is not set
>   BUG()
> 
> Signed-off-by: Qianli Zhao 
> ---
> We got an init crash issue, but we can't get init coredump from fulldump, we
> also see BUG() triggered which calling in zap_pid_ns_processes().
> 
> From crash dump we can get the following information:
> 1. "Attempted to kill init", init process is killed.
> - Kernel panic - not syncing: Attempted to kill init! exitcode=0x7f00
> 2. At the same time as init crash, a BUG() triggered in other core.
> - [   24.705390] kernel BUG at include/linux/pid_namespace.h:98!
> 3. When init thread calls exit_mm, the corresponding thread task->mm will be
> empty, which is not conducive to extracting coredump
> 
> To fix the issue and save complete coredump, once find init thread is set to
> SIGNAL_GROUP_EXIT, trigger panic immediately, and other child threads are not
> allowed to exit, just wait for reboot
> 
> PID: 1  TASK: ffc973126900  CPU: 7   COMMAND: "init"
>  #0 [ff800805ba60] perf_trace_kernel_panic_late at ff99ac0bcbcc
>  #1 [ff800805bac0] die at ff99ac08dc64
>  #2 [ff800805bb10] bug_handler at ff99ac08e398
>  #3 [ff800805bbc0] brk_handler at ff99ac08529c
>  #4 [ff800805bc80] do_debug_exception at ff99ac0814e4
>  #5 [ff800805bdf0] el1_dbg at ff99ac083298
> ->Exception
> 
> /home/work/courbet-r-stable-build/kernel/msm-4.14/include/linux/pid_namespace.h:
>  98
>  #6 [ff800805be20] do_exit at ff99ac0c22e8
>  #7 [ff800805be80] do_group_exit at ff99ac0c2658
>  #8 [ff800805beb0] sys_exit_group at ff99ac0c266c
>  #9 [ff800805bff0] el0_svc_naked at ff99ac083cf
> ->SYSCALLNO: 5e (__NR_exit_group) 
> 
> PID: 552TASK: ffc9613c8f00  CPU: 4   COMMAND: "init"
>  #0 [ff801455b870] __delay at ff99ad32cc14
>  #1 [ff801455b8b0] __const_udelay at ff99ad32cd10
>  #2 [ff801455b8c0] msm_trigger_wdog_bite at ff99ac5d5be0
>  #3 [ff801455b980] do_msm_restart at ff99a3f8
>  #4 [ff801455b9b0] machine_restart at ff99ac085dd0
>  #5 [ff801455b9d0] emergency_restart at ff99ac0eb6dc
>  #6 [ff801455baf0] panic at ff99ac0bd008
>  #7 [ff801455bb70] do_exit at ff99ac0c257c
> /home/work/courbet-r-stable-build/kernel/msm-4.14/kernel/exit.c: 842
>  #8 [ff801455bbd0] do_group_exit at ff99ac0c2644
>  #9 [ff801455bcc0] get_signal at ff99ac0d1384
> #10 [ff801455be60] do_notify_resume at ff99ac08b2a8
> #11 [ff801455bff0] work_pending at ff99ac083b8c
> 
> ---
>  kernel/exit.c | 29 +
>  1 file changed, 21 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/exit.c b/kernel/exit.c
> index ef2fb929..6b2da22 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -758,6 +758,27 @@ void __noreturn do_exit(long code)
>   validate_creds_for_do_exit(tsk);
>  
>   /*
> +  * Once init group is marked for death,
> +  * panic immediately to get a useable coredump
> +  */
> + if (unlikely(is_global_init(tsk) &&
> + signal_group_exit(tsk->signal))) {
> + spin_lock_irq(&tsk->sighand->siglock);
> + if (!(tsk->signal->flags & 

Re: [PATCH v2] do_wait: make PIDTYPE_PID case O(1) instead of O(n)

2021-03-09 Thread Oleg Nesterov
Jim,

Thanks, the patch looks good to me. Yet I think you need to send V3 even
if I personally do not care ;) Please consider ./scripts/checkpatch.pl,
it reports all the coding-style problems I was going to mention.

I too have a couple of cosmetic nits, but feel free to ignore, this is
subjective.

On 03/09, Jim Newsome wrote:
>
> do_wait is an internal function used to implement waitpid, waitid,
> wait4, etc. To handle the general case, it does an O(n) linear scan of
> the thread group's children and tracees.
> 
> This patch adds a special-case when waiting on a pid to skip these scans
> and instead do an O(1) lookup. This improves performance when waiting on
> a pid from a thread group with many children and/or tracees.
> 
> Signed-off-by: James Newsome
> ---
>  kernel/exit.c | 54 ++-
>  1 file changed, 45 insertions(+), 9 deletions(-)
> 
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 04029e35e69a..312c4dfc9555 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -1439,6 +1439,33 @@ void __wake_up_parent(struct task_struct *p, struct task_struct *parent)
>  TASK_INTERRUPTIBLE, p);
>  }
>  
> +// Optimization for waiting on PIDTYPE_PID. No need to iterate through child
> +// and tracee lists to find the target task.
> +static int do_wait_pid(struct wait_opts *wo, struct task_struct *tsk)
> +{
> + struct task_struct *target = pid_task(wo->wo_pid, PIDTYPE_PID);
> + if (!target) {
> + return 0;
> + }
> + if (tsk == target->real_parent ||
> + (!(wo->wo_flags & __WNOTHREAD) &&
> +  same_thread_group(tsk, target->real_parent))) {
> + int retval = wait_consider_task(wo, /* ptrace= */ 0, target);
> + if (retval) {
> + return retval;
> + }
> + }
> + if (target->ptrace && (tsk == target->parent ||
> +(!(wo->wo_flags & __WNOTHREAD) &&
> + same_thread_group(tsk, target->parent)))) {
> + int retval = wait_consider_task(wo, /* ptrace= */ 1, target);
> + if (retval) {
> + return retval;
> + }
> + }

Both if's use "int retval", to me it would be better to declare this variable
at the start of do_wait_pid(). But again, I won't insist this is up to you.

I am wondering if something like

	static inline bool is_parent(struct task_struct *tsk,
				     struct task_struct *p, int flags)
	{
		return (tsk == p || !(flags & __WNOTHREAD)) &&
			same_thread_group(tsk, p);
	}

makes any sense to make do_wait_pid() more clear... probably not.

Oleg.



Re: patch: do_wait: make PIDTYPE_PID case O(1) instead of O(n)

2021-03-09 Thread Oleg Nesterov
Ah, and you forgot to CC lkml ;) let me resend my email.


Hi Jim,

Please do not use the attachments, just send the patch as plain text.
See Documentation/process/submitting-patches.rst

On 03/08, Jim Newsome wrote:
>
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -1462,8 +1462,61 @@ static long do_wait(struct wait_opts *wo)
>   goto notask;
>
>   set_current_state(TASK_INTERRUPTIBLE);
> +
>   read_lock(&tasklist_lock);
>   tsk = current;
> +
> + if (wo->wo_type == PIDTYPE_PID) {
> + // Optimization for PIDTYPE_PID. No need to iterate through 
> child and
> + // tracee lists to find the target task.

I'd suggest to put this PIDTYPE_PID code into the new function.

> +
> + struct task_struct *real_parent = NULL;
> + struct task_struct *target = NULL;
> + bool do_regular_wait, do_ptrace_wait;
> +
> + // XXX: Do we need this? Or is the tasklist_lock sufficient?
> + rcu_read_lock();

No, you don't need rcu lock, tasklist_lock is sufficient

> + target = pid_task(wo->wo_pid, PIDTYPE_PID);
> + if (!target) {
> + rcu_read_unlock();
> + goto notask;

This is wrong, you forgot to drop tasklist_lock.


> + real_parent = !target->real_parent ? target->parent :
> +  target->real_parent;

Hmm, I don't understand the line above... perhaps it connects to the
question below.

> + if (!real_parent) {
> + // XXX: Is it a kernel bug to get here? Or would this be
> + // true of the init process?

Afaics, parent/real_parent can't be NULL if pid_task() succeeds.

> + do_regular_wait = tsk == real_parent ||
> +   (!(wo->wo_flags & __WNOTHREAD) &&
> +same_thread_group(tsk, real_parent));
> + do_ptrace_wait = target->ptrace &&
> +  (tsk == target->parent ||
> +   (!(wo->wo_flags & __WNOTHREAD) &&
> +same_thread_group(tsk, target->parent)));
> + rcu_read_unlock();
> +
> + if (do_regular_wait) {
> + retval =
> + wait_consider_task(wo, /* ptrace= */ 0, target);
> + if (retval) {
> + goto end;
> + }
> + }
> + if (do_ptrace_wait) {
> + retval =
> + wait_consider_task(wo, /* ptrace= */ 1, target);
> + if (retval) {
> + goto end;
> + }
> + }
> + read_unlock(_lock);
> + goto notask;

This part looks correct at first glance...

Please redo and send V2 ;)

Oleg.



Re: [PATCH] signal: Allow RT tasks to cache one sigqueue struct

2021-03-05 Thread Oleg Nesterov
On 03/04, Thomas Gleixner wrote:
>
> On Wed, Mar 03 2021 at 16:37, Oleg Nesterov wrote:
> > On 03/03, Sebastian Andrzej Siewior wrote:
> >>
> >> +static struct sigqueue *sigqueue_from_cache(struct task_struct *t)
> >> +{
> >> +  struct sigqueue *q = t->sigqueue_cache;
> >> +
> >> +  if (q && cmpxchg(&t->sigqueue_cache, q, NULL) == q)
> >> +  return q;
> >> +  return NULL;
> >> +}
> >> +
> >> +static bool sigqueue_add_cache(struct task_struct *t, struct sigqueue *q)
> >> +{
> >> +  if (!t->sigqueue_cache && cmpxchg(&t->sigqueue_cache, NULL, q) == NULL)
> >> +  return true;
> >> +  return false;
> >> +}
> >
> > Do we really need cmpxchg? It seems they are always called with
> > spinlock held.
>
> With which spinlock held?
>
> __send_signal() <- sighand::siglock held
>   __sigqueue_alloc()
>
> alloc_posix_timer()
>   sigqueue_alloc()  <- No lock held
> __sigqueue_alloc()

In the last case "fromslab" is true, sigqueue_from_cache() won't be called.

> and on the free side we have a bunch of callers which do not hold
> sighand::siglock either.

Where?

Oleg.



Re: [PATCH] signal: Allow RT tasks to cache one sigqueue struct

2021-03-03 Thread Oleg Nesterov
On 03/03, Sebastian Andrzej Siewior wrote:
>
> +static struct sigqueue *sigqueue_from_cache(struct task_struct *t)
> +{
> + struct sigqueue *q = t->sigqueue_cache;
> +
> + if (q && cmpxchg(&t->sigqueue_cache, q, NULL) == q)
> + return q;
> + return NULL;
> +}
> +
> +static bool sigqueue_add_cache(struct task_struct *t, struct sigqueue *q)
> +{
> + if (!t->sigqueue_cache && cmpxchg(&t->sigqueue_cache, NULL, q) == NULL)
> + return true;
> + return false;
> +}

Do we really need cmpxchg? It seems they are always called with spinlock held.

>  static struct sigqueue *
> -__sigqueue_alloc(int sig, struct task_struct *t, gfp_t flags, int 
> override_rlimit)
> +__sigqueue_do_alloc(int sig, struct task_struct *t, gfp_t flags,
> + int override_rlimit, bool fromslab)
>  {
>   struct sigqueue *q = NULL;
>   struct user_struct *user;
> @@ -432,7 +450,10 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t flags, int override_rlimit)
>   rcu_read_unlock();
>
> - if (override_rlimit || likely(sigpending <= task_rlimit(t, RLIMIT_SIGPENDING))) {
> - q = kmem_cache_alloc(sigqueue_cachep, flags);
> + if (!fromslab)
> + q = sigqueue_from_cache(t);
> + if (!q)
> + q = kmem_cache_alloc(sigqueue_cachep, flags);

I won't insist but afaics you can avoid the new arg/function and simplify this
patch. __sigqueue_alloc() can simply check "sig > 0" or valid_signal(sig) rather
than "!fromslab".

> +static void __sigqueue_cache_or_free(struct sigqueue *q)
> +{
> + struct user_struct *up;
> +
> + if (q->flags & SIGQUEUE_PREALLOC)
> + return;
> +
> + up = q->user;
> + if (atomic_dec_and_test(&up->sigpending))
> + free_uid(up);
> + if (!task_is_realtime(current) || !sigqueue_add_cache(current, q))
> + kmem_cache_free(sigqueue_cachep, q);
> +}

Well, this duplicates __sigqueue_free... Do we really need the new helper?
What if we simply change __sigqueue_free() to do sigqueue_add_cache() if
task_is_realtime() && !PF_EXITING ? This too can simplify the patch...

Oleg.



Re: [PATCH] ia64: fix ptrace(PTRACE_SYSCALL_INFO_EXIT) sign

2021-03-03 Thread Oleg Nesterov
On 03/02, Sergei Trofimovich wrote:
>
> > --- a/arch/ia64/include/asm/syscall.h
> > +++ b/arch/ia64/include/asm/syscall.h
> > @@ -32,7 +32,7 @@ static inline void syscall_rollback(struct task_struct *task,
> >  static inline long syscall_get_error(struct task_struct *task,
> >  struct pt_regs *regs)
> >  {
> > -   return regs->r10 == -1 ? regs->r8:0;
> > +   return regs->r10 == -1 ? -regs->r8:0;
> >  }
> >
> >  static inline long syscall_get_return_value(struct task_struct *task,
> > --
> > 2.30.1
> >
>
> Andrew, would it be fine to pass it through misc tree?
> Or should it go through Oleg as it's mostly about ptrace?

We usually route ptrace fixes via mm tree.

But this fix and another patch from you "ia64: fix
ia64_syscall_get_set_arguments() for break-based syscalls" look very much
ia64 specific. I don't think it's actually about ptrace, and I didn't even
try to review these patches because I do not understand this low level
ia64 code.

Can it be routed via ia64 tree? Add Tony and Fenghua...

Oleg.



Re: Why do kprobes and uprobes singlestep?

2021-03-03 Thread Oleg Nesterov
On 03/02, Alexei Starovoitov wrote:
>
> Especially if such tightening will come with performance boost for
> uprobe on a nop and unprobe at the start (which is typically push or
> alu on %sp).
> That would be a great step forward.

Just in case, nop and push are emulated without additional overhead.

Oleg.



Re: Why do kprobes and uprobes singlestep?

2021-03-02 Thread Oleg Nesterov
On 03/02, Masami Hiramatsu wrote:
>
> > Not sure I understand you correctly, I know almost nothing about low-level
> > x86  magic.
>
> x86 has normal interrupt and NMI. When an NMI occurs the CPU masks NMI
> (the mask itself is hidden status) and IRET releases the mask. The problem
> is that if an INT3 is hit in the NMI handler and does a single-stepping,
> it has to use IRET for atomically setting TF and return.

Ah, thanks a lot,

Oleg.



Re: Why do kprobes and uprobes singlestep?

2021-03-02 Thread Oleg Nesterov
forgot to add Srikar, sorry for resend...

On 03/01, Andy Lutomirski wrote:
>
> On Mon, Mar 1, 2021 at 8:51 AM Oleg Nesterov  wrote:
> >
> > But I guess this has nothing to do with uprobes, they do not single-step
> > in kernel mode, right?
>
> They single-step user code, though, and the code that makes this work
> is quite ugly.  Single-stepping on x86 is a mess.

But this doesn't really differ from, say, gdb doing si ? OK, except uprobes
have to hook DIE_DEBUG. Nevermind...

> > > Uprobes seem to single-step user code for no discernable reason.
> > > (They want to trap after executing an out of line instruction, AFAICT.
> > > Surely INT3 or even CALL after the out-of-line insn would work as well
> > > or better.)
> >
> > Uprobes use single-step from the very beginning, probably because this
> > is the most simple and "standard" way to implement xol.
> >
> > And please note that CALL/JMP/etc emulation was added much later to fix the
> > problems with non-canonical addresses, and this emulation is still
> > incomplete.
>
> Is there something like a uprobe test suite?

Afaik, no.

> How maintained /

Add Srikar who sent the initial implementation. I can only say that I am
glad that ./scripts/get_maintainer.pl no longer mentions me ;) I did some
changes (including emulation) but a) this was long ago and b) only because
I was forced^W asked to fix the numerous bugs in this code.

> actively used is uprobe?

I have no idea, sorry ;)

Oleg.



Re: Why do kprobes and uprobes singlestep?

2021-03-01 Thread Oleg Nesterov
Hi Andy,

sorry for delay.

On 02/23, Andy Lutomirski wrote:
>
> A while back, I let myself be convinced that kprobes genuinely need to
> single-step the kernel on occasion, and I decided that this sucked but
> I could live with it.  it would, however, be Really Really Nice (tm)
> if we could have a rule that anyone running x86 Linux who single-steps
> the kernel (e.g. kgdb and nothing else) gets to keep all the pieces
> when the system falls apart around them.  Specifically, if we don't
> allow kernel single-stepping and if we suitably limit kernel
> instruction breakpoints (the latter isn't actually a major problem),
> then we don't really really need to use IRET to return to the kernel,
> and that means we can avoid some massive NMI nastiness.

Not sure I understand you correctly, I know almost nothing about low-level
x86  magic.

But I guess this has nothing to do with uprobes, they do not single-step
in kernel mode, right?

> Uprobes seem to single-step user code for no discernable reason.
> (They want to trap after executing an out of line instruction, AFAICT.
> Surely INT3 or even CALL after the out-of-line insn would work as well
> or better.)

Uprobes use single-step from the very beginning, probably because this
is the most simple and "standard" way to implement xol.

And please note that CALL/JMP/etc emulation was added much later to fix the
problems with non-canonical addresses, and this emulation is still incomplete.

Oleg.



Re: [PATCH v2] powerpc/uprobes: Validation for prefixed instruction

2021-02-06 Thread Oleg Nesterov
On 02/04, Ravi Bangoria wrote:
>
> +static int get_instr(struct mm_struct *mm, unsigned long addr, u32 *instr)
> +{
> + struct page *page;
> + struct vm_area_struct *vma;
> + void *kaddr;
> + unsigned int gup_flags = FOLL_FORCE | FOLL_SPLIT_PMD;
> +
> + if (get_user_pages_remote(mm, addr, 1, gup_flags, &page, &vma, NULL) <= 0)
> + return -EINVAL;

"vma" is not used, and I don't think you need FOLL_SPLIT_PMD.

Otherwise I can't really comment this ppc-specific change.

To be honest, I don't even understand why do we need this fix. Sure, the
breakpoint in the middle of 64-bit insn won't work, why do we care? The
user should know what does he do.

Not to mention we can't really trust get_user_pages() in that this page
can be modified by mm owner or debugger...

But I won't argue.

Oleg.



Re: [PATCH RESEND v4 0/4] x86: fix get_nr_restart_syscall()

2021-02-03 Thread Oleg Nesterov
It seems that nobody objects,

Andrew, Andy, Thomas, how do you think this series should be routed?

On 02/01, Oleg Nesterov wrote:
>
> Somehow I forgot about this problem. Let me resend the last version
> based on discussion with Linus. IIRC he was agree with this series.
>
> And let me remind why 3/4 temporary adds the "transient" TS_COMPAT_RESTART
> flag killed by the next patch: to simplify the backporting. 1-3 can fix
> the problem without breaking the kABI.
>
> Oleg.
> ---
>  arch/x86/include/asm/processor.h   |  9 -
>  arch/x86/include/asm/thread_info.h | 15 ++-
>  arch/x86/kernel/signal.c   | 24 +---
>  fs/select.c| 10 --
>  include/linux/restart_block.h  |  1 +
>  include/linux/thread_info.h| 13 +
>  kernel/futex.c |  3 +--
>  kernel/time/alarmtimer.c   |  2 +-
>  kernel/time/hrtimer.c  |  2 +-
>  kernel/time/posix-cpu-timers.c |  2 +-
>  10 files changed, 37 insertions(+), 44 deletions(-)



Re: [PATCH RESEND v4 0/4] x86: fix get_nr_restart_syscall()

2021-02-02 Thread Oleg Nesterov
On 02/02, Linus Torvalds wrote:
>
> On Tue, Feb 2, 2021 at 7:56 AM Oleg Nesterov  wrote:
> >
> > There is the "erestartsys-trap-debugger" test in ptrace-tests suite.
> > Do you mean you want another test in tools/testing/selftests/ptrace ?
>
> No, I guess it's fine if it's caught by the ptrace test suite - we'll
> hopefully get fairly timely "guys, you broke it" reports.
>
> Is that ptrace erestartsys-trap-debugger.c test new, or has it just
> always failed? Or is the problem that it is so expected to fail that
> we wouldn't get reports of it anyway (this clearly fell off your radar
> for a long time)?

No, this test-case is very old. It used to work when this
get_nr_restart_syscall() logic was based on the test_thread_flag(TIF_IA32)
check.

Then it started to fail somewhere 2-3 years ago, and to be honest I didn't
even try to find which exactly patch broke this test. Because this logic
was always wrong anyway, even if this test-case used to work.

I sent v1 soon after this bug was reported, but every time I was too lazy,
that is why this (minor) problem is still not fixed. So, in short, this is
my fault in any case.

Oleg.



Re: [PATCH RESEND v4 0/4] x86: fix get_nr_restart_syscall()

2021-02-02 Thread Oleg Nesterov
On 02/01, Linus Torvalds wrote:
>
> On Mon, Feb 1, 2021 at 10:18 AM Linus Torvalds
>  wrote:
> >
> > Yeah, looks sane to me.
>
> Oh, and in a perfect world we'd have a test for this condition too, no?

There is the "erestartsys-trap-debugger" test in ptrace-tests suite.
Do you mean you want another test in tools/testing/selftests/ptrace ?

Oleg.



Re: [PATCH v4 3/4] x86: introduce TS_COMPAT_RESTART to fix

2021-02-02 Thread Oleg Nesterov
On 02/01, Andy Lutomirski wrote:
>
> On Mon, Feb 1, 2021 at 9:47 AM Oleg Nesterov  wrote:
> >
> > The comment in get_nr_restart_syscall() says:
> >
> >  * The problem is that we can get here when ptrace pokes
> >  * syscall-like values into regs even if we're not in a syscall
> >  * at all.
> >
> > Yes. but if we are not in syscall then the
> >
> > status & (TS_COMPAT|TS_I386_REGS_POKED)
> >
> > check below can't really help:
> >
> > - TS_COMPAT can't be set
> >
> > - TS_I386_REGS_POKED is only set if regs->orig_ax was changed by
> >   32bit debugger; and even in this case get_nr_restart_syscall()
> >   is only correct if the tracee is 32bit too.
> >
> > Suppose that 64bit debugger plays with 32bit tracee and
>
> At the risk of asking an obnoxious question here:
>
> >
> > * Tracee calls sleep(2) // TS_COMPAT is set
> > * User interrupts the tracee by CTRL-C after 1 sec and does
> >   "(gdb) call func()"
> > * gdb saves the regs by PTRACE_GETREGS
>
> It seems to me that a better solution may be for gdb to see the
> post-restart-setup state.  In other words, shouldn't the GETREGS
> return with the ax pointing to the restart syscall already?

and ip = regs->ip - 2? And hide ERESTART_BLOCK from debugger? Perhaps
I misunderstood, but this doesn't look like a better solution to me.
Not to mention this would be the serious user-visible change... And
even the necessary changes in getreg() do not look good to me.

Plus I do not understand how this could work. OK, suppose that the
tracee reports a signal with ax = ERESTART_BLOCK.

Debugger simply does GETREGS + SETREGS + PTRACE_CONT(signr). In this
case handle_signal() should set ax = -EINTR, but syscall_get_error()
will report __NR_ia32_restart_syscall?

Probably I greatly misunderstood you...

Oleg.



[PATCH v4 4/4] x86: introduce restart_block->arch_data to kill

2021-02-01 Thread Oleg Nesterov
With this patch x86 just saves current_thread_info()->status in the
new restart_block->arch_data field, TS_COMPAT_RESTART can be removed.

Rather than saving "status" we could shift the code from
get_nr_restart_syscall() to arch_set_restart_data() and save the syscall
number in ->arch_data.

Signed-off-by: Oleg Nesterov 
---
 arch/x86/include/asm/thread_info.h | 12 ++--
 arch/x86/kernel/signal.c   |  2 +-
 include/linux/restart_block.h  |  1 +
 3 files changed, 4 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 30d1d187019f..06b740bae431 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -217,18 +217,10 @@ static inline int arch_within_stack_frames(const void * const stack,
 #ifndef __ASSEMBLY__
 #ifdef CONFIG_COMPAT
 #define TS_I386_REGS_POKED 0x0004  /* regs poked by 32-bit ptracer */
-#define TS_COMPAT_RESTART  0x0008
 
-#define arch_set_restart_data  arch_set_restart_data
+#define arch_set_restart_data(restart) \
+   do { restart->arch_data = current_thread_info()->status; } while (0)
 
-static inline void arch_set_restart_data(struct restart_block *restart)
-{
-   struct thread_info *ti = current_thread_info();
-   if (ti->status & TS_COMPAT)
-   ti->status |= TS_COMPAT_RESTART;
-   else
-   ti->status &= ~TS_COMPAT_RESTART;
-}
 #endif
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 6c26d2c3a2e4..f306e85a08a6 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -767,7 +767,7 @@ handle_signal(struct ksignal *ksig, struct pt_regs *regs)
 static inline unsigned long get_nr_restart_syscall(const struct pt_regs *regs)
 {
 #ifdef CONFIG_IA32_EMULATION
-   if (current_thread_info()->status & TS_COMPAT_RESTART)
+   if (current->restart_block.arch_data & TS_COMPAT)
return __NR_ia32_restart_syscall;
 #endif
 #ifdef CONFIG_X86_X32_ABI
diff --git a/include/linux/restart_block.h b/include/linux/restart_block.h
index bba2920e9c05..980a65594412 100644
--- a/include/linux/restart_block.h
+++ b/include/linux/restart_block.h
@@ -23,6 +23,7 @@ enum timespec_type {
  * System call restart block.
  */
 struct restart_block {
+   unsigned long arch_data;
long (*fn)(struct restart_block *);
union {
/* For futex_wait and futex_wait_requeue_pi */
-- 
2.25.1.362.g51ebf55



[PATCH v4 2/4] x86: mv TS_COMPAT from asm/processor.h to

2021-02-01 Thread Oleg Nesterov
Move TS_COMPAT back to asm/thread_info.h, close to TS_I386_REGS_POKED.

It was moved to asm/processor.h by b9d989c7218a ("x86/asm: Move the
thread_info::status field to thread_struct"), then later 37a8f7c38339
("x86/asm: Move 'status' from thread_struct to thread_info") moved the
'status' field back but TS_COMPAT was forgotten.

Signed-off-by: Oleg Nesterov 
---
 arch/x86/include/asm/processor.h   | 9 -
 arch/x86/include/asm/thread_info.h | 9 +
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index c20a52b5534b..c66df6368909 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -552,15 +552,6 @@ static inline void arch_thread_struct_whitelist(unsigned long *offset,
*size = fpu_kernel_xstate_size;
 }
 
-/*
- * Thread-synchronous status.
- *
- * This is different from the flags in that nobody else
- * ever touches our thread-synchronous status, so we don't
- * have to worry about atomic accesses.
- */
-#define TS_COMPAT  0x0002  /* 32bit syscall active (64BIT)*/
-
 static inline void
 native_load_sp0(unsigned long sp0)
 {
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 0d751d5da702..c2dc29e215ea 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -205,6 +205,15 @@ static inline int arch_within_stack_frames(const void * const stack,
 
 #endif
 
+/*
+ * Thread-synchronous status.
+ *
+ * This is different from the flags in that nobody else
+ * ever touches our thread-synchronous status, so we don't
+ * have to worry about atomic accesses.
+ */
+#define TS_COMPAT  0x0002  /* 32bit syscall active (64BIT)*/
+
 #ifdef CONFIG_COMPAT
 #define TS_I386_REGS_POKED 0x0004  /* regs poked by 32-bit ptracer */
 #endif
-- 
2.25.1.362.g51ebf55



[PATCH v4 3/4] x86: introduce TS_COMPAT_RESTART to fix

2021-02-01 Thread Oleg Nesterov
The comment in get_nr_restart_syscall() says:

 * The problem is that we can get here when ptrace pokes
 * syscall-like values into regs even if we're not in a syscall
 * at all.

Yes, but if we are not in a syscall then the

status & (TS_COMPAT|TS_I386_REGS_POKED)

check below can't really help:

- TS_COMPAT can't be set

- TS_I386_REGS_POKED is only set if regs->orig_ax was changed by
  32bit debugger; and even in this case get_nr_restart_syscall()
  is only correct if the tracee is 32bit too.

Suppose that 64bit debugger plays with 32bit tracee and

* Tracee calls sleep(2) // TS_COMPAT is set
* User interrupts the tracee by CTRL-C after 1 sec and does
  "(gdb) call func()"
* gdb saves the regs by PTRACE_GETREGS
* does PTRACE_SETREGS to set %rip='func' and %orig_rax=-1
* PTRACE_CONT   // TS_COMPAT is cleared
* func() hits int3.
* Debugger catches SIGTRAP.
* Restore original regs by PTRACE_SETREGS.
* PTRACE_CONT

get_nr_restart_syscall() wrongly returns __NR_restart_syscall==219, the
tracee calls ia32_sys_call_table[219] == sys_madvise.

This patch adds the sticky TS_COMPAT_RESTART flag which survives after
return to user mode, hopefully it allows us to kill TS_I386_REGS_POKED
but this needs another patch.

Test-case:

  $ cvs -d :pserver:anoncvs:anon...@sourceware.org:/cvs/systemtap co ptrace-tests
  $ gcc -o erestartsys-trap-debuggee ptrace-tests/tests/erestartsys-trap-debuggee.c --m32
  $ gcc -o erestartsys-trap-debugger ptrace-tests/tests/erestartsys-trap-debugger.c -lutil
  $ ./erestartsys-trap-debugger
  Unexpected: retval 1, errno 22
  erestartsys-trap-debugger: ptrace-tests/tests/erestartsys-trap-debugger.c:421

Reported-by: Jan Kratochvil 
Signed-off-by: Oleg Nesterov 
---
 arch/x86/include/asm/thread_info.h | 14 +-
 arch/x86/kernel/signal.c   | 24 +---
 2 files changed, 14 insertions(+), 24 deletions(-)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index c2dc29e215ea..30d1d187019f 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -214,10 +214,22 @@ static inline int arch_within_stack_frames(const void * const stack,
  */
 #define TS_COMPAT  0x0002  /* 32bit syscall active (64BIT)*/
 
+#ifndef __ASSEMBLY__
 #ifdef CONFIG_COMPAT
 #define TS_I386_REGS_POKED 0x0004  /* regs poked by 32-bit ptracer */
+#define TS_COMPAT_RESTART  0x0008
+
+#define arch_set_restart_data  arch_set_restart_data
+
+static inline void arch_set_restart_data(struct restart_block *restart)
+{
+   struct thread_info *ti = current_thread_info();
+   if (ti->status & TS_COMPAT)
+   ti->status |= TS_COMPAT_RESTART;
+   else
+   ti->status &= ~TS_COMPAT_RESTART;
+}
 #endif
-#ifndef __ASSEMBLY__
 
 #ifdef CONFIG_X86_32
 #define in_ia32_syscall() true
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index ea794a083c44..6c26d2c3a2e4 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -766,30 +766,8 @@ handle_signal(struct ksignal *ksig, struct pt_regs *regs)
 
 static inline unsigned long get_nr_restart_syscall(const struct pt_regs *regs)
 {
-   /*
-* This function is fundamentally broken as currently
-* implemented.
-*
-* The idea is that we want to trigger a call to the
-* restart_block() syscall and that we want in_ia32_syscall(),
-* in_x32_syscall(), etc. to match whatever they were in the
-* syscall being restarted.  We assume that the syscall
-* instruction at (regs->ip - 2) matches whatever syscall
-* instruction we used to enter in the first place.
-*
-* The problem is that we can get here when ptrace pokes
-* syscall-like values into regs even if we're not in a syscall
-* at all.
-*
-* For now, we maintain historical behavior and guess based on
-* stored state.  We could do better by saving the actual
-* syscall arch in restart_block or (with caveats on x32) by
-* checking if regs->ip points to 'int $0x80'.  The current
-* behavior is incorrect if a tracer has a different bitness
-* than the tracee.
-*/
 #ifdef CONFIG_IA32_EMULATION
-   if (current_thread_info()->status & (TS_COMPAT|TS_I386_REGS_POKED))
+   if (current_thread_info()->status & TS_COMPAT_RESTART)
return __NR_ia32_restart_syscall;
 #endif
 #ifdef CONFIG_X86_X32_ABI
-- 
2.25.1.362.g51ebf55



[PATCH v4 1/4] introduce set_restart_fn() and arch_set_restart_data()

2021-02-01 Thread Oleg Nesterov
Preparation. Add the new helper which sets restart_block->fn and calls
the dummy arch_set_restart_data() helper.

Signed-off-by: Oleg Nesterov 
---
 fs/select.c| 10 --
 include/linux/thread_info.h| 13 +
 kernel/futex.c |  3 +--
 kernel/time/alarmtimer.c   |  2 +-
 kernel/time/hrtimer.c  |  2 +-
 kernel/time/posix-cpu-timers.c |  2 +-
 6 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index 37aaa8317f3a..945896d0ac9e 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -1055,10 +1055,9 @@ static long do_restart_poll(struct restart_block *restart_block)
 
ret = do_sys_poll(ufds, nfds, to);
 
-   if (ret == -ERESTARTNOHAND) {
-   restart_block->fn = do_restart_poll;
-   ret = -ERESTART_RESTARTBLOCK;
-   }
+   if (ret == -ERESTARTNOHAND)
+   ret = set_restart_fn(restart_block, do_restart_poll);
+
return ret;
 }
 
@@ -1080,7 +1079,6 @@ SYSCALL_DEFINE3(poll, struct pollfd __user *, ufds, unsigned int, nfds,
struct restart_block *restart_block;
 
	restart_block = &current->restart_block;
-   restart_block->fn = do_restart_poll;
restart_block->poll.ufds = ufds;
restart_block->poll.nfds = nfds;
 
@@ -1091,7 +1089,7 @@ SYSCALL_DEFINE3(poll, struct pollfd __user *, ufds, unsigned int, nfds,
} else
restart_block->poll.has_timeout = 0;
 
-   ret = -ERESTART_RESTARTBLOCK;
+   ret = set_restart_fn(restart_block, do_restart_poll);
}
return ret;
 }
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index c8a974cead73..495fb6a94e58 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -11,6 +11,7 @@
 #include <linux/types.h>
 #include <linux/bug.h>
 #include <linux/restart_block.h>
+#include <linux/errno.h>
 
 #ifdef CONFIG_THREAD_INFO_IN_TASK
 /*
@@ -57,6 +58,18 @@ enum syscall_work_bit {
 
 #ifdef __KERNEL__
 
+#ifndef arch_set_restart_data
+#define arch_set_restart_data(restart) do { } while (0)
+#endif
+
+static inline long set_restart_fn(struct restart_block *restart,
+   long (*fn)(struct restart_block *))
+{
+   restart->fn = fn;
+   arch_set_restart_data(restart);
+   return -ERESTART_RESTARTBLOCK;
+}
+
 #ifndef THREAD_ALIGN
 #define THREAD_ALIGN   THREAD_SIZE
 #endif
diff --git a/kernel/futex.c b/kernel/futex.c
index c47d1015d759..1ec50f9681a6 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2724,14 +2724,13 @@ static int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
goto out;
 
	restart = &current->restart_block;
-   restart->fn = futex_wait_restart;
restart->futex.uaddr = uaddr;
restart->futex.val = val;
restart->futex.time = *abs_time;
restart->futex.bitset = bitset;
restart->futex.flags = flags | FLAGS_HAS_TIMEOUT;
 
-   ret = -ERESTART_RESTARTBLOCK;
+   ret = set_restart_fn(restart, futex_wait_restart);
 
 out:
if (to) {
diff --git a/kernel/time/alarmtimer.c b/kernel/time/alarmtimer.c
index f4ace1bf8382..daeaa7140d0a 100644
--- a/kernel/time/alarmtimer.c
+++ b/kernel/time/alarmtimer.c
@@ -848,9 +848,9 @@ static int alarm_timer_nsleep(const clockid_t which_clock, int flags,
if (flags == TIMER_ABSTIME)
return -ERESTARTNOHAND;
 
-   restart->fn = alarm_timer_nsleep_restart;
restart->nanosleep.clockid = type;
restart->nanosleep.expires = exp;
+   set_restart_fn(restart, alarm_timer_nsleep_restart);
return ret;
 }
 
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 743c852e10f2..30f936431c56 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1939,9 +1939,9 @@ long hrtimer_nanosleep(ktime_t rqtp, const enum hrtimer_mode mode,
}
 
	restart = &current->restart_block;
-   restart->fn = hrtimer_nanosleep_restart;
restart->nanosleep.clockid = t.timer.base->clockid;
restart->nanosleep.expires = hrtimer_get_expires_tv64();
+   set_restart_fn(restart, hrtimer_nanosleep_restart);
 out:
	destroy_hrtimer_on_stack(&t.timer);
return ret;
diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index a71758e34e45..9abe15255bc4 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -1480,8 +1480,8 @@ static int posix_cpu_nsleep(const clockid_t which_clock, int flags,
if (flags & TIMER_ABSTIME)
return -ERESTARTNOHAND;
 
-   restart_block->fn = posix_cpu_nsleep_restart;
restart_block->nanosleep.clockid = which_clock;
+   set_restart_fn(restart_block, posix_cpu_nsleep_restart);
}
return error;
 }
-- 
2.25.1.362.g51ebf55



[PATCH RESEND v4 0/4] x86: fix get_nr_restart_syscall()

2021-02-01 Thread Oleg Nesterov
Somehow I forgot about this problem. Let me resend the last version
based on the discussion with Linus. IIRC he agreed with this series.

And let me remind why 3/4 temporarily adds the "transient" TS_COMPAT_RESTART
flag killed by the next patch: to simplify the backporting. Patches 1-3 can
fix the problem without breaking the kABI.

Oleg.
---
 arch/x86/include/asm/processor.h   |  9 -
 arch/x86/include/asm/thread_info.h | 15 ++-
 arch/x86/kernel/signal.c   | 24 +---
 fs/select.c| 10 --
 include/linux/restart_block.h  |  1 +
 include/linux/thread_info.h| 13 +
 kernel/futex.c |  3 +--
 kernel/time/alarmtimer.c   |  2 +-
 kernel/time/hrtimer.c  |  2 +-
 kernel/time/posix-cpu-timers.c |  2 +-
 10 files changed, 37 insertions(+), 44 deletions(-)



Re: [PATCH v3] tracing: precise log info for kretprobe addr err

2021-01-26 Thread Oleg Nesterov
On 01/26, Steven Rostedt wrote:
>
> On Tue, 26 Jan 2021 21:20:59 +0100
> Oleg Nesterov  wrote:
>
> > > No, not wrong. Even offset != 0, if the symbol exists in the kernel,
> > > kprobe_on_func_entry() will check it.
> >
> > Yes, but unless I am totally confused... if kprobe_on_func_entry() returns 
> > false,
> > then trace_kprobe_create() should fail with BAD_RETPROBE even if offset == 
> > 0 ?
>
> From what I understand. kprobe_on_func_entry() can return false if you pass
> in: "MOD:not_yet_loaded_module_func", but this is OK, because when the
> module is loaded, and the "not_yet_loaded_module_func" exists, the
> kretprobe will then be added.
>
> The strchr(symbol,":") check is to see if "MOD:" (or some other ":" command)
> is in the name, and we don't want it to fail if it is. Which is why we
> should have that commented.

Agreed, this matches my understanding.

But just in case... not sure I read this code correctly, but I think that
module_kallsyms_lookup_name("not_yet_loaded_module_func") should work even
without the "MOD:" prefix.

IOW, kprobe_on_func_entry("not_yet_loaded_module_func") can fail, and then
later succeed if you load the module which provides this symbol.

But even if I am right, I agree with the strchr(symbol,":") check.

Oleg.


