Re: [PATCH RFC tip/core/rcu 09/16] rcu-tasks: Add an RCU-tasks rude variant

2020-05-11 Thread Steven Rostedt
On Sun, 10 May 2020 17:59:27 +0800
Lai Jiangshan  wrote:

> Hello
> 
> I think adding a small number of instructions to preempt_schedule_irq()
> is sufficient to create the needed protected region between the start
> of a function and the trampoline body:
> 
> preempt_schedule_irq() {
> +  if (unlikely(is_trampoline_page(page_of(interrupted_ip)))) {
> +  return; // don't do preempt schedule
> +  }
>   preempt_schedule_irq() original body
> }
> 
> // generated on trampoline pages
> trace_trampoline() {
>    preempt_disable();
>    trace_trampoline body
>    jmp preempt_enable_traced(clobbers)
> }
> 
> asm(kernel text):
> preempt_enable_traced:
>    preempt_enable_notrace();
>    restore clobbers
>    return(the return ip on the stack is traced_function_start_code)
> 
> If the number of instructions added in preempt_schedule_irq() and the
> complexity of making the trampoline ip detectable (is_trampoline_page()
> or is_trampoline_range()) are small, and tasks_rcu is thereby rendered
> unnecessary, I think it will be a win-win.

To make this even more complex, with ftrace direct callers (used by BPF to
define its own non-ftrace trampolines), if a direct call is registered at
the same location as an ftrace caller, we have something like this:

ftrace_caller:
save_regs
call ftrace_ops_list_func
cmp ORIG_RAX
jnz do_direct
restore_regs
ret

do_direct:
mov ORIG_RAX to return
restore_regs
ret

What the above basically does is: ftrace_ops_list_func() will call the
ftrace callbacks, but also a special callback that handles the direct call
registered at that same location. The direct callback places the address
of the direct trampoline into ORIG_RAX. Then, on return from
ftrace_ops_list_func(), the code above jumps straight to the direct
trampoline.
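
A rough sketch of what such a direct-handling callback might look like
(illustrative only; find_direct_trampoline() is a hypothetical lookup
helper, not the actual kernel API):

static void direct_list_func(unsigned long ip, unsigned long parent_ip,
                             struct ftrace_ops *ops, struct pt_regs *regs)
{
        /* Stash the direct trampoline's address in the saved ORIG_RAX
         * slot, so the "cmp ORIG_RAX / jnz do_direct" sequence above
         * jumps to it on return from ftrace_ops_list_func(). */
        unsigned long addr = find_direct_trampoline(ip);

        if (addr)
                regs->orig_ax = addr;   /* x86-64: saved ORIG_RAX */
}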

To implement what you are proposing, you have to have a way to keep
preemption off between the setting of ORIG_RAX and the jump to the direct
trampoline (which would require its own preempt_disable() section). If the
task is preempted between the two, the direct trampoline may disappear in
the meantime, and this code will then jump to memory that is no longer
there.

-- Steve


Re: [PATCH RFC tip/core/rcu 09/16] rcu-tasks: Add an RCU-tasks rude variant

2020-05-11 Thread Steven Rostedt
On Sun, 10 May 2020 17:59:27 +0800
Lai Jiangshan  wrote:

> I think adding a small number of instructions to preempt_schedule_irq()
> is sufficient to create the needed protected region between the start
> of a function and the trampoline body.
> 
> preempt_schedule_irq() {
> +  if (unlikely(is_trampoline_page(page_of(interrupted_ip)))) {
> +  return; // don't do preempt schedule
> +  }
>   preempt_schedule_irq() original body
> }

First, this would never be accepted due to the overhead it would add.
Next, the interrupted instruction pointer could be the call to the
trampoline itself; it would be hard to guarantee that we are not already
on our way into the trampoline in question.

-- Steve


Re: [PATCH RFC tip/core/rcu 09/16] rcu-tasks: Add an RCU-tasks rude variant

2020-05-10 Thread Paul E. McKenney
On Mon, May 11, 2020 at 08:06:29AM +0800, Lai Jiangshan wrote:
> On Sun, May 10, 2020 at 11:49 PM Paul E. McKenney  wrote:
> >
> > On Sun, May 10, 2020 at 05:59:27PM +0800, Lai Jiangshan wrote:
> > > On Tue, Mar 17, 2020 at 6:03 AM Steven Rostedt  wrote:
> > > >
> > > > On Mon, 16 Mar 2020 17:45:40 -0400
> > > > Joel Fernandes  wrote:
> > > >
> > > > > >
> > > > > > Same for the function side (if not even more so). This would
> > > > > > require adding a srcu_read_lock() to all functions that can be
> > > > > > traced! That would be a huge kill in performance. Probably to the
> > > > > > point no one would bother even using the function tracer.
> > > > >
> > > > > Point well taken! Thanks,
> > > >
> > > > Actually, it's worse than that. (We talked about this on IRC, but I
> > > > wanted it documented here too.)
> > > >
> > > > You can't use any type of locking, unless you insert it around all
> > > > the callers of the nops (which is unreasonable).
> > > >
> > > > That is, we have gcc -pg -mfentry that creates at the start of all
> > > > traced functions:
> > > >
> > > > <function>:
> > > > call __fentry__
> > > > [code for function here]
> > > >
> > > > At boot up (or even by the compiler itself) we convert that to:
> > > >
> > > > <function>:
> > > > nop
> > > > [code for function here]
> > > >
> > > > When we want to trace this function we use text_poke (with current
> > > > kernels) and convert it to this:
> > > >
> > > > <function>:
> > > > call trace_trampoline
> > > > [code for function here]
> > > >
> > > > That trace_trampoline can be allocated, which means that when it's
> > > > no longer needed, it must be freed. But when do we know it's safe to
> > > > free it? Here's the issue.
> > > >
> > > > <function>:
> > > > call trace_trampoline  <- interrupt happens just after the call
> > > > [code for function here]
> > > >
> > > > Now the task has just executed the call to the trace_trampoline,
> > > > which means the instruction pointer is set to the start of the
> > > > trampoline, but it has not yet executed any of the trampoline itself.
> > > >
> > > > Now suppose the task is preempted, and a real-time hog keeps it from
> > > > running for minutes at a time (which is possible!). In the meantime,
> > > > we are done with that trampoline and free it. What happens when that
> > > > task is scheduled back? There's no more trampoline to execute, even
> > > > though its instruction pointer points at the trampoline's first
> > > > instruction!
> > > >
> > > > I used the analogy of jumping off a cliff expecting a magic carpet
> > > > to be there to catch you, and just before you land, it disappears.
> > > > That would be a very bad day indeed!
> > > >
> > > > We have no way to add a grace period between the start of a function
> > > > (which can be *any* function) and the start of the trampoline.
> > >
> > > Hello
> > >
> > > I think adding a small number of instructions to preempt_schedule_irq()
> > > is sufficient to create the needed protected region between the start
> > > of a function and the trampoline body:
> > >
> > > preempt_schedule_irq() {
> > > +  if (unlikely(is_trampoline_page(page_of(interrupted_ip)))) {
> > > +  return; // don't do preempt schedule
> > > +  }
> > >   preempt_schedule_irq() original body
> > > }
> > >
> > > // generated on trampoline pages
> > > trace_trampoline() {
> > >    preempt_disable();
> > >    trace_trampoline body
> > >    jmp preempt_enable_traced(clobbers)
> > > }
> > >
> > > asm(kernel text):
> > > preempt_enable_traced:
> > >    preempt_enable_notrace();
> > >    restore clobbers
> > >    return(the return ip on the stack is traced_function_start_code)
> > >
> > > If the number of instructions added in preempt_schedule_irq() and the
> > > complexity of making the trampoline ip detectable (is_trampoline_page()
> > > or is_trampoline_range()) are small, and tasks_rcu is thereby rendered
> > > unnecessary, I think it will be a win-win.
> >
> > It certainly would provide a nice reduction in code size!
> >
> > This would provide a zero-instructions preempt_disable() at the beginning
> > of the trampoline and a zero-instructions preempt_enable_no_resched() at
> > the end, correct?  If so, wouldn't this create a potentially long (though
> > "weak") preempt-disable region extending to the next preempt_enable(),
> > local_bh_enable(), schedule(), interrupt, transition to userspace,
> > or similar?  This could be quite some time.  Note that cond_resched()
> > wouldn't help, given that this is only in PREEMPT=y kernels.
> >
> > The "weak" refers to the fact that if a second resched IPI arrived in the
> > meantime, preemption would then happen.  But without that second IPI,
> > the request for preemption could be ignored for quite some time.
> >
> > Or am I missing something here?
> 
> Hello,
> 
> I'm sorry to note that preempt_enable_traced() is in *kernel text*; it
> is *not* in the trampoline-protected region.

Re: [PATCH RFC tip/core/rcu 09/16] rcu-tasks: Add an RCU-tasks rude variant

2020-05-10 Thread Masami Hiramatsu
On Sun, 10 May 2020 17:59:27 +0800
Lai Jiangshan  wrote:

> On Tue, Mar 17, 2020 at 6:03 AM Steven Rostedt  wrote:
> >
> > On Mon, 16 Mar 2020 17:45:40 -0400
> > Joel Fernandes  wrote:
> >
> > > >
> > > > Same for the function side (if not even more so). This would require
> > > > adding a srcu_read_lock() to all functions that can be traced! That
> > > > would be a huge kill in performance. Probably to the point no one
> > > > would bother even using the function tracer.
> > >
> > > Point well taken! Thanks,
> >
> > Actually, it's worse than that. (We talked about this on IRC, but I
> > wanted it documented here too.)
> >
> > You can't use any type of locking, unless you insert it around all the
> > callers of the nops (which is unreasonable).
> >
> > That is, we have gcc -pg -mfentry that creates at the start of all
> > traced functions:
> >
> > <function>:
> > call __fentry__
> > [code for function here]
> >
> > At boot up (or even by the compiler itself) we convert that to:
> >
> > <function>:
> > nop
> > [code for function here]
> >
> > When we want to trace this function we use text_poke (with current
> > kernels) and convert it to this:
> >
> > <function>:
> > call trace_trampoline
> > [code for function here]
> >
> > That trace_trampoline can be allocated, which means that when it's no
> > longer needed, it must be freed. But when do we know it's safe to free
> > it? Here's the issue.
> >
> > <function>:
> > call trace_trampoline  <- interrupt happens just after the call
> > [code for function here]
> >
> > Now the task has just executed the call to the trace_trampoline, which
> > means the instruction pointer is set to the start of the trampoline,
> > but it has not yet executed any of the trampoline itself.
> >
> > Now suppose the task is preempted, and a real-time hog keeps it from
> > running for minutes at a time (which is possible!). In the meantime,
> > we are done with that trampoline and free it. What happens when that
> > task is scheduled back? There's no more trampoline to execute, even
> > though its instruction pointer points at the trampoline's first
> > instruction!
> >
> > I used the analogy of jumping off a cliff expecting a magic carpet to
> > be there to catch you, and just before you land, it disappears. That
> > would be a very bad day indeed!
> >
> > We have no way to add a grace period between the start of a function
> > (which can be *any* function) and the start of the trampoline.
> 
> Hello
> 
> I think adding a small number of instructions to preempt_schedule_irq()
> is sufficient to create the needed protected region between the start
> of a function and the trampoline body:
> 
> preempt_schedule_irq() {
> +  if (unlikely(is_trampoline_page(page_of(interrupted_ip)))) {
> +  return; // don't do preempt schedule
> +  }
>   preempt_schedule_irq() original body
> }
> 
> // generated on trampoline pages
> trace_trampoline() {
>    preempt_disable();
>    trace_trampoline body
>    jmp preempt_enable_traced(clobbers)
> }
> 
> asm(kernel text):
> preempt_enable_traced:
>    preempt_enable_notrace();
>    restore clobbers
>    return(the return ip on the stack is traced_function_start_code)
> 
> If the number of instructions added in preempt_schedule_irq() and the
> complexity of making the trampoline ip detectable (is_trampoline_page()
> or is_trampoline_range()) are small, and tasks_rcu is thereby rendered
> unnecessary, I think it will be a win-win.

Good idea, but is_trampoline_page() will not be so easy to implement,
since those pages are scattered. !is_kernel_text() works, but it doesn't
cover modules. I think we should make a new subsystem to allocate
trampoline pages instead of using module_alloc().
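
Concretely, such a subsystem could register each allocated range, making
the check a cheap lookup. A minimal sketch, assuming a dedicated
allocator; every name here is hypothetical:

struct tramp_range {
        struct list_head list;
        unsigned long start, end;
};

static LIST_HEAD(tramp_ranges);  /* write side serialized by the allocator */

/* Walk the (short) list of registered trampoline ranges.  Intended to
 * be cheap enough to call from preempt_schedule_irq(). */
static bool is_trampoline_range(unsigned long ip)
{
        struct tramp_range *r;
        bool ret = false;

        rcu_read_lock();
        list_for_each_entry_rcu(r, &tramp_ranges, list) {
                if (ip >= r->start && ip < r->end) {
                        ret = true;
                        break;
                }
        }
        rcu_read_unlock();
        return ret;
}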

Thank you,

-- 
Masami Hiramatsu 


Re: [PATCH RFC tip/core/rcu 09/16] rcu-tasks: Add an RCU-tasks rude variant

2020-05-10 Thread Lai Jiangshan
On Sun, May 10, 2020 at 11:49 PM Paul E. McKenney  wrote:
>
> On Sun, May 10, 2020 at 05:59:27PM +0800, Lai Jiangshan wrote:
> > On Tue, Mar 17, 2020 at 6:03 AM Steven Rostedt  wrote:
> > >
> > > On Mon, 16 Mar 2020 17:45:40 -0400
> > > Joel Fernandes  wrote:
> > >
> > > > >
> > > > > Same for the function side (if not even more so). This would
> > > > > require adding a srcu_read_lock() to all functions that can be
> > > > > traced! That would be a huge kill in performance. Probably to the
> > > > > point no one would bother even using the function tracer.
> > > >
> > > > Point well taken! Thanks,
> > >
> > > Actually, it's worse than that. (We talked about this on IRC, but I
> > > wanted it documented here too.)
> > >
> > > You can't use any type of locking, unless you insert it around all
> > > the callers of the nops (which is unreasonable).
> > >
> > > That is, we have gcc -pg -mfentry that creates at the start of all
> > > traced functions:
> > >
> > > <function>:
> > > call __fentry__
> > > [code for function here]
> > >
> > > At boot up (or even by the compiler itself) we convert that to:
> > >
> > > <function>:
> > > nop
> > > [code for function here]
> > >
> > > When we want to trace this function we use text_poke (with current
> > > kernels) and convert it to this:
> > >
> > > <function>:
> > > call trace_trampoline
> > > [code for function here]
> > >
> > > That trace_trampoline can be allocated, which means that when it's no
> > > longer needed, it must be freed. But when do we know it's safe to
> > > free it? Here's the issue.
> > >
> > > <function>:
> > > call trace_trampoline  <- interrupt happens just after the call
> > > [code for function here]
> > >
> > > Now the task has just executed the call to the trace_trampoline,
> > > which means the instruction pointer is set to the start of the
> > > trampoline, but it has not yet executed any of the trampoline itself.
> > >
> > > Now suppose the task is preempted, and a real-time hog keeps it from
> > > running for minutes at a time (which is possible!). In the meantime,
> > > we are done with that trampoline and free it. What happens when that
> > > task is scheduled back? There's no more trampoline to execute, even
> > > though its instruction pointer points at the trampoline's first
> > > instruction!
> > >
> > > I used the analogy of jumping off a cliff expecting a magic carpet to
> > > be there to catch you, and just before you land, it disappears. That
> > > would be a very bad day indeed!
> > >
> > > We have no way to add a grace period between the start of a function
> > > (which can be *any* function) and the start of the trampoline.
> >
> > Hello
> >
> > I think adding a small number of instructions to preempt_schedule_irq()
> > is sufficient to create the needed protected region between the start
> > of a function and the trampoline body:
> >
> > preempt_schedule_irq() {
> > +  if (unlikely(is_trampoline_page(page_of(interrupted_ip)))) {
> > +  return; // don't do preempt schedule
> > +  }
> >   preempt_schedule_irq() original body
> > }
> >
> > // generated on trampoline pages
> > trace_trampoline() {
> >    preempt_disable();
> >    trace_trampoline body
> >    jmp preempt_enable_traced(clobbers)
> > }
> >
> > asm(kernel text):
> > preempt_enable_traced:
> >    preempt_enable_notrace();
> >    restore clobbers
> >    return(the return ip on the stack is traced_function_start_code)
> >
> > If the number of instructions added in preempt_schedule_irq() and the
> > complexity of making the trampoline ip detectable (is_trampoline_page()
> > or is_trampoline_range()) are small, and tasks_rcu is thereby rendered
> > unnecessary, I think it will be a win-win.
>
> It certainly would provide a nice reduction in code size!
>
> This would provide a zero-instructions preempt_disable() at the beginning
> of the trampoline and a zero-instructions preempt_enable_no_resched() at
> the end, correct?  If so, wouldn't this create a potentially long (though
> "weak") preempt-disable region extending to the next preempt_enable(),
> local_bh_enable(), schedule(), interrupt, transition to userspace,
> or similar?  This could be quite some time.  Note that cond_resched()
> wouldn't help, given that this is only in PREEMPT=y kernels.
>
> The "weak" refers to the fact that if a second resched IPI arrived in the
> meantime, preemption would then happen.  But without that second IPI,
> the request for preemption could be ignored for quite some time.
>
> Or am I missing something here?

Hello,

I'm sorry to note that preempt_enable_traced() is in *kernel text*; it
is *not* in the trampoline-protected region, so preempt_enable_traced()
itself can be preempted. But the preempt_enable_notrace() in it checks
for any resched requested while the trampoline ran, so no resched
request is lost.
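
For reference, on PREEMPT=y kernels preempt_enable_notrace() expands to
roughly the following (simplified from include/linux/preempt.h), which is
why a resched request raised while the trampoline ran is acted on here
rather than lost:

#define preempt_enable_notrace() \
do { \
        barrier(); \
        if (unlikely(preempt_count_dec_and_test())) \
                __preempt_schedule_notrace(); \
} while (0)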

The idea is to "semi-automatically" preempt-disable-protect the
trampoline. "Semi" means the trampoline still needs
preempt_disable() and the

Re: [PATCH RFC tip/core/rcu 09/16] rcu-tasks: Add an RCU-tasks rude variant

2020-05-10 Thread Paul E. McKenney
On Sun, May 10, 2020 at 05:59:27PM +0800, Lai Jiangshan wrote:
> On Tue, Mar 17, 2020 at 6:03 AM Steven Rostedt  wrote:
> >
> > On Mon, 16 Mar 2020 17:45:40 -0400
> > Joel Fernandes  wrote:
> >
> > > >
> > > > Same for the function side (if not even more so). This would require
> > > > adding a srcu_read_lock() to all functions that can be traced! That
> > > > would be a huge kill in performance. Probably to the point no one
> > > > would bother even using the function tracer.
> > >
> > > Point well taken! Thanks,
> >
> > Actually, it's worse than that. (We talked about this on IRC, but I
> > wanted it documented here too.)
> >
> > You can't use any type of locking, unless you insert it around all the
> > callers of the nops (which is unreasonable).
> >
> > That is, we have gcc -pg -mfentry that creates at the start of all
> > traced functions:
> >
> > <function>:
> > call __fentry__
> > [code for function here]
> >
> > At boot up (or even by the compiler itself) we convert that to:
> >
> > <function>:
> > nop
> > [code for function here]
> >
> > When we want to trace this function we use text_poke (with current
> > kernels) and convert it to this:
> >
> > <function>:
> > call trace_trampoline
> > [code for function here]
> >
> > That trace_trampoline can be allocated, which means that when it's no
> > longer needed, it must be freed. But when do we know it's safe to free
> > it? Here's the issue.
> >
> > <function>:
> > call trace_trampoline  <- interrupt happens just after the call
> > [code for function here]
> >
> > Now the task has just executed the call to the trace_trampoline, which
> > means the instruction pointer is set to the start of the trampoline,
> > but it has not yet executed any of the trampoline itself.
> >
> > Now suppose the task is preempted, and a real-time hog keeps it from
> > running for minutes at a time (which is possible!). In the meantime,
> > we are done with that trampoline and free it. What happens when that
> > task is scheduled back? There's no more trampoline to execute, even
> > though its instruction pointer points at the trampoline's first
> > instruction!
> >
> > I used the analogy of jumping off a cliff expecting a magic carpet to
> > be there to catch you, and just before you land, it disappears. That
> > would be a very bad day indeed!
> >
> > We have no way to add a grace period between the start of a function
> > (which can be *any* function) and the start of the trampoline.
> 
> Hello
> 
> I think adding a small number of instructions to preempt_schedule_irq()
> is sufficient to create the needed protected region between the start
> of a function and the trampoline body:
> 
> preempt_schedule_irq() {
> +  if (unlikely(is_trampoline_page(page_of(interrupted_ip)))) {
> +  return; // don't do preempt schedule
> +  }
>   preempt_schedule_irq() original body
> }
> 
> // generated on trampoline pages
> trace_trampoline() {
>    preempt_disable();
>    trace_trampoline body
>    jmp preempt_enable_traced(clobbers)
> }
> 
> asm(kernel text):
> preempt_enable_traced:
>    preempt_enable_notrace();
>    restore clobbers
>    return(the return ip on the stack is traced_function_start_code)
> 
> If the number of instructions added in preempt_schedule_irq() and the
> complexity of making the trampoline ip detectable (is_trampoline_page()
> or is_trampoline_range()) are small, and tasks_rcu is thereby rendered
> unnecessary, I think it will be a win-win.

It certainly would provide a nice reduction in code size!

This would provide a zero-instructions preempt_disable() at the beginning
of the trampoline and a zero-instructions preempt_enable_no_resched() at
the end, correct?  If so, wouldn't this create a potentially long (though
"weak") preempt-disable region extending to the next preempt_enable(),
local_bh_enable(), schedule(), interrupt, transition to userspace,
or similar?  This could be quite some time.  Note that cond_resched()
wouldn't help, given that this is only in PREEMPT=y kernels.

The "weak" refers to the fact that if a second resched IPI arrived in the
meantime, preemption would then happen.  But without that second IPI,
the request for preemption could be ignored for quite some time.

Or am I missing something here?

Thanx, Paul

> Thanks
> 
> Lai
> 
> > Since the problem is that the task was involuntarily preempted before
> > it could execute the trampoline, and trampolines are not allowed
> > (suppose) to call schedule, we have our quiescent state to track:
> > voluntary scheduling. When all tasks have either voluntarily scheduled
> > or entered user space after a trampoline was disconnected from a
> > function, we know that it is safe to free the trampoline.
> >
> > -- Steve


Re: [PATCH RFC tip/core/rcu 09/16] rcu-tasks: Add an RCU-tasks rude variant

2020-05-10 Thread Lai Jiangshan
On Tue, Mar 17, 2020 at 6:03 AM Steven Rostedt  wrote:
>
> On Mon, 16 Mar 2020 17:45:40 -0400
> Joel Fernandes  wrote:
>
> > >
> > > Same for the function side (if not even more so). This would require
> > > adding a srcu_read_lock() to all functions that can be traced! That
> > > would be a huge kill in performance. Probably to the point no one
> > > would bother even using the function tracer.
> >
> > Point well taken! Thanks,
>
> Actually, it's worse than that. (We talked about this on IRC, but I
> wanted it documented here too.)
>
> You can't use any type of locking, unless you insert it around all the
> callers of the nops (which is unreasonable).
>
> That is, we have gcc -pg -mfentry that creates at the start of all
> traced functions:
>
> <function>:
> call __fentry__
> [code for function here]
>
> At boot up (or even by the compiler itself) we convert that to:
>
> <function>:
> nop
> [code for function here]
>
> When we want to trace this function we use text_poke (with current
> kernels) and convert it to this:
>
> <function>:
> call trace_trampoline
> [code for function here]
>
> That trace_trampoline can be allocated, which means that when it's no
> longer needed, it must be freed. But when do we know it's safe to free
> it? Here's the issue.
>
> <function>:
> call trace_trampoline  <- interrupt happens just after the call
> [code for function here]
>
> Now the task has just executed the call to the trace_trampoline, which
> means the instruction pointer is set to the start of the trampoline,
> but it has not yet executed any of the trampoline itself.
>
> Now suppose the task is preempted, and a real-time hog keeps it from
> running for minutes at a time (which is possible!). In the meantime,
> we are done with that trampoline and free it. What happens when that
> task is scheduled back? There's no more trampoline to execute, even
> though its instruction pointer points at the trampoline's first
> instruction!
>
> I used the analogy of jumping off a cliff expecting a magic carpet to
> be there to catch you, and just before you land, it disappears. That
> would be a very bad day indeed!
>
> We have no way to add a grace period between the start of a function
> (which can be *any* function) and the start of the trampoline.

Hello

I think adding a small number of instructions to preempt_schedule_irq()
is sufficient to create the needed protected region between the start
of a function and the trampoline body:

preempt_schedule_irq() {
+  if (unlikely(is_trampoline_page(page_of(interrupted_ip)))) {
+  return; // don't do preempt schedule
+  }
  preempt_schedule_irq() original body
}

// generated on trampoline pages
trace_trampoline() {
   preempt_disable();
   trace_trampoline body
   jmp preempt_enable_traced(clobbers)
}

asm(kernel text):
preempt_enable_traced:
   preempt_enable_notrace();
   restore clobbers
   return(the return ip on the stack is traced_function_start_code)

If the number of instructions added in preempt_schedule_irq() and the
complexity of making the trampoline ip detectable (is_trampoline_page()
or is_trampoline_range()) are small, and tasks_rcu is thereby rendered
unnecessary, I think it will be a win-win.
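
In concrete terms, the added check might look something like this (a
sketch only: it assumes the interrupted pt_regs are available at this
point, which would itself need plumbing, and that is_trampoline_range()
exists):

        /* Top of preempt_schedule_irq(): refuse to preempt while the
         * interrupted ip is inside a trampoline region; the resched
         * request stays pending for a later preemption point. */
        if (unlikely(is_trampoline_range(instruction_pointer(regs))))
                return;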

Thanks

Lai

> Since the problem is that the task was involuntarily preempted before
> it could execute the trampoline, and trampolines are not allowed
> (suppose) to call schedule, we have our quiescent state to track:
> voluntary scheduling. When all tasks have either voluntarily scheduled
> or entered user space after a trampoline was disconnected from a
> function, we know that it is safe to free the trampoline.
>
> -- Steve
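
The quiescent-state approach Steve describes here is what RCU Tasks
(synchronize_rcu_tasks()) provides. A minimal sketch of how a
trampoline-freeing path might use it; the functions around the
synchronize_rcu_tasks() call are hypothetical:

/* Hypothetical teardown path for a dynamically allocated trampoline. */
static void release_trampoline(struct trampoline *tramp)
{
        /* First make sure no new calls into the trampoline can begin. */
        unpatch_all_call_sites(tramp);          /* hypothetical */

        /* Wait until every task has passed through a voluntary context
         * switch or a return to user space, after which no task can
         * still be preempted with its ip inside the trampoline. */
        synchronize_rcu_tasks();

        free_trampoline_pages(tramp);           /* hypothetical */
}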