Re: [PATCH RFC tip/core/rcu 09/16] rcu-tasks: Add an RCU-tasks rude variant
On Sun, 10 May 2020 17:59:27 +0800 Lai Jiangshan wrote:

> Hello
>
> I think adding a small number of instructions to preempt_schedule_irq()
> is sufficient to create the needed protected region between the start
> of a function and the trampoline body.
>
> preempt_schedule_irq() {
> +	if (unlikely(is_trampoline_page(page_of(interrupted_ip)))) {
> +		return; // don't do preempt schedule
> +	}
>	preempt_schedule_irq() original body
> }
>
> // generated on trampoline pages
> trace_trampoline() {
>	preempt_disable();
>	trace_trampoline body
>	jmp preempt_enable_traced(clobbers)
> }
>
> asm(kernel text):
> preempt_enable_traced:
>	preempt_enable_notrace();
>	restore clobbers
>	return (the return ip on the stack is traced_function_start_code)
>
> If the number of instructions added in preempt_schedule_irq() and
> the complexity to make the trampoline ip detectable (is_trampoline_page()
> or is_trampoline_range()) are small, and tasks_rcu is rendered useless,
> I think it will be a win-win.

To make this even more complex, with ftrace direct callers (used by bpf
to define their own non-ftrace trampolines), if a direct call is on the
same location as an ftrace caller, we have something like this:

 ftrace_caller:
	save_regs
	call ftrace_ops_list_func
	cmp ORIG_RAX
	jnz do_direct
	restore_regs
	ret

 do_direct:
	mov ORIG_RAX to return
	restore_regs
	ret

What the above is basically doing is that ftrace_ops_list_func() will
call the ftrace callbacks, but also a special callback to handle the
direct caller that is also registered to that same location. The direct
callback will place the address of the direct trampoline into ORIG_RAX.
Then, on return from ftrace_ops_list_func(), it will jump directly to
the direct trampoline.

To implement what you are proposing, you would have to have a way to keep
preemption off between the setting of ORIG_RAX and the jump to the direct
trampoline (which would require its own preempt_disable() section).
But if we preempt between the two, the direct trampoline may be freed in
the meantime, and this code will still jump to it.

-- Steve
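For reference, the ordering that tasks RCU makes possible can be sketched in kernel-style C: patch the call site out first, then wait for every task to pass through a voluntary-scheduling quiescent state before freeing. synchronize_rcu_tasks() is the primitive this patch series is about; the surrounding helpers (unpatch_call_site(), free_trampoline_pages(), struct trampoline) are hypothetical names for illustration, not real kernel APIs:

```
/* Schematic kernel-style sketch -- not a complete implementation. */
void release_trampoline(struct trampoline *tramp)
{
	/*
	 * 1. Patch the fentry site back to a nop (or to the list func)
	 *    so that no NEW task can start executing the trampoline.
	 */
	unpatch_call_site(tramp);		/* hypothetical helper */

	/*
	 * 2. Wait until every task has either voluntarily scheduled or
	 *    entered userspace.  A task preempted right on the call to
	 *    the trampoline, or on its first instruction, has done
	 *    neither, so this waits for it too.
	 */
	synchronize_rcu_tasks();

	/* 3. Only now is it safe to free the trampoline pages. */
	free_trampoline_pages(tramp);		/* hypothetical helper */
}
```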
Re: [PATCH RFC tip/core/rcu 09/16] rcu-tasks: Add an RCU-tasks rude variant
On Sun, 10 May 2020 17:59:27 +0800 Lai Jiangshan wrote:

> I think adding a small number of instructions to preempt_schedule_irq()
> is sufficient to create the needed protected region between the start
> of a function and the trampoline body.
>
> preempt_schedule_irq() {
> +	if (unlikely(is_trampoline_page(page_of(interrupted_ip)))) {
> +		return; // don't do preempt schedule
> +	}
>	preempt_schedule_irq() original body
> }

First, this would never be accepted due to the overhead it would cause.
Next, the interrupted instruction pointer could be the call to the
trampoline itself. It would be hard to guarantee that we are not on the
way to the trampoline in question.

-- Steve
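The second objection is easiest to see at the instruction level. A schematic (not real disassembly) of the hole in the proposed check:

```
<traced_function>:
 ==> call trace_trampoline   ; interrupt lands with the IP still ON the
                             ; call instruction: interrupted_ip is in
                             ; kernel text, is_trampoline_page() is false,
                             ; so preempt_schedule_irq() preempts anyway
     [code for function here]

; If the trampoline is freed while this task is preempted, resuming the
; task executes the call into freed memory -- the page check never had a
; chance to see a trampoline address.
```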
Re: [PATCH RFC tip/core/rcu 09/16] rcu-tasks: Add an RCU-tasks rude variant
On Mon, May 11, 2020 at 08:06:29AM +0800, Lai Jiangshan wrote: > On Sun, May 10, 2020 at 11:49 PM Paul E. McKenney wrote: > > > > On Sun, May 10, 2020 at 05:59:27PM +0800, Lai Jiangshan wrote: > > > On Tue, Mar 17, 2020 at 6:03 AM Steven Rostedt > > > wrote: > > > > > > > > On Mon, 16 Mar 2020 17:45:40 -0400 > > > > Joel Fernandes wrote: > > > > > > > > > > > > > > > > Same for the function side (if not even more so). This would > > > > > > require adding > > > > > > a srcu_read_lock() to all functions that can be traced! That would > > > > > > be a huge > > > > > > kill in performance. Probably to the point no one would bother even > > > > > > using > > > > > > function tracer. > > > > > > > > > > Point well taken! Thanks, > > > > > > > > Actually, it's worse than that. (We talked about this on IRC but I > > > > wanted > > > > it documented here too). > > > > > > > > You can't use any type of locking, unless you insert it around all the > > > > callers of the nops (which is unreasonable). > > > > > > > > That is, we have gcc -pg -mfentry that creates at the start of all > > > > traced > > > > functions: > > > > > > > > : > > > > call __fentry__ > > > > [code for function here] > > > > > > > > At boot up (or even by the compiler itself) we convert that to: > > > > > > > > : > > > > nop > > > > [code for function here] > > > > > > > > > > > > When we want to trace this function we use text_poke (with current > > > > kernels) > > > > and convert it to this: > > > > > > > > : > > > > call trace_trampoline > > > > [code for function here] > > > > > > > > > > > > That trace_trampoline can be allocated, which means when its no longer > > > > needed, it must be freed. But when do we know it's safe to free it? > > > > Here's > > > > the issue. > > > > > > > > > > > > : > > > > call trace_trampoline <- interrupt happens just after the jump > > > > [code for function here] > > > > > > > > Now the task has just executed the call to the trace_trampoline. 
Which > > > > means the instruction pointer is set to the start of the trampoline. > > > > But it > > > > has yet executed that trampoline. > > > > > > > > Now if the task is preempted, and a real time hog is keeping it from > > > > running for minutes at a time (which is possible!). And in the mean > > > > time, > > > > we are done with that trampoline and free it. What happens when that > > > > task > > > > is scheduled back? There's no more trampoline to execute even though its > > > > instruction pointer is to execute the first operand on the trampoline! > > > > > > > > I used the analogy of jumping off the cliff expecting a magic carpet to > > > > be > > > > there to catch you, and just before you land, it disappears. That would > > > > be > > > > a very bad day indeed! > > > > > > > > We have no way to add a grace period between the start of a function > > > > (can > > > > be *any* function) and the start of the trampoline. > > > > > > Hello > > > > > > I think adding a small number of instructions to preempt_schedule_irq() > > > is sufficient to create the needed protected region between the start > > > of a function and the trampoline body. 
> > > preempt_schedule_irq() {
> > > +	if (unlikely(is_trampoline_page(page_of(interrupted_ip)))) {
> > > +		return; // don't do preempt schedule
> > > +	}
> > >	preempt_schedule_irq() original body
> > > }
> > >
> > > // generated on trampoline pages
> > > trace_trampoline() {
> > >	preempt_disable();
> > >	trace_trampoline body
> > >	jmp preempt_enable_traced(clobbers)
> > > }
> > >
> > > asm(kernel text):
> > > preempt_enable_traced:
> > >	preempt_enable_notrace();
> > >	restore clobbers
> > >	return (the return ip on the stack is traced_function_start_code)
> > >
> > > If the number of instructions added in preempt_schedule_irq() and
> > > the complexity to make the trampoline ip detectable (is_trampoline_page()
> > > or is_trampoline_range()) are small, and tasks_rcu is rendered useless,
> > > I think it will be a win-win.
> >
> > It certainly would provide a nice reduction in code size!
> >
> > This would provide a zero-instructions preempt_disable() at the beginning
> > of the trampoline and a zero-instructions preempt_enable_no_resched() at
> > the end, correct? If so, wouldn't this create a potentially long (though
> > "weak") preempt-disable region extending to the next preempt_enable(),
> > local_bh_enable(), schedule(), interrupt, transition to userspace,
> > or similar? This could be quite some time. Note that cond_resched()
> > wouldn't help, given that this is only in PREEMPT=y kernels.
> >
> > The "weak" refers to the fact that if a second resched IPI arrived in the
> > meantime, preemption would then happen. But without that second IPI,
> > the request for preemption could be ignored for quite some time.
> >
> > Or am I missing something here?
>
> Hello,
>
> I'm sorry to note that preempt_enable_traced() is
Re: [PATCH RFC tip/core/rcu 09/16] rcu-tasks: Add an RCU-tasks rude variant
On Sun, 10 May 2020 17:59:27 +0800 Lai Jiangshan wrote: > On Tue, Mar 17, 2020 at 6:03 AM Steven Rostedt wrote: > > > > On Mon, 16 Mar 2020 17:45:40 -0400 > > Joel Fernandes wrote: > > > > > > > > > > Same for the function side (if not even more so). This would require > > > > adding > > > > a srcu_read_lock() to all functions that can be traced! That would be a > > > > huge > > > > kill in performance. Probably to the point no one would bother even > > > > using > > > > function tracer. > > > > > > Point well taken! Thanks, > > > > Actually, it's worse than that. (We talked about this on IRC but I wanted > > it documented here too). > > > > You can't use any type of locking, unless you insert it around all the > > callers of the nops (which is unreasonable). > > > > That is, we have gcc -pg -mfentry that creates at the start of all traced > > functions: > > > > : > > call __fentry__ > > [code for function here] > > > > At boot up (or even by the compiler itself) we convert that to: > > > > : > > nop > > [code for function here] > > > > > > When we want to trace this function we use text_poke (with current kernels) > > and convert it to this: > > > > : > > call trace_trampoline > > [code for function here] > > > > > > That trace_trampoline can be allocated, which means when its no longer > > needed, it must be freed. But when do we know it's safe to free it? Here's > > the issue. > > > > > > : > > call trace_trampoline <- interrupt happens just after the jump > > [code for function here] > > > > Now the task has just executed the call to the trace_trampoline. Which > > means the instruction pointer is set to the start of the trampoline. But it > > has yet executed that trampoline. > > > > Now if the task is preempted, and a real time hog is keeping it from > > running for minutes at a time (which is possible!). And in the mean time, > > we are done with that trampoline and free it. What happens when that task > > is scheduled back? 
> > There's no more trampoline to execute even though its
> > instruction pointer is to execute the first operand on the trampoline!
> >
> > I used the analogy of jumping off the cliff expecting a magic carpet to be
> > there to catch you, and just before you land, it disappears. That would be
> > a very bad day indeed!
> >
> > We have no way to add a grace period between the start of a function (can
> > be *any* function) and the start of the trampoline.
>
> Hello
>
> I think adding a small number of instructions to preempt_schedule_irq()
> is sufficient to create the needed protected region between the start
> of a function and the trampoline body.
>
> preempt_schedule_irq() {
> +	if (unlikely(is_trampoline_page(page_of(interrupted_ip)))) {
> +		return; // don't do preempt schedule
> +	}
>	preempt_schedule_irq() original body
> }
>
> // generated on trampoline pages
> trace_trampoline() {
>	preempt_disable();
>	trace_trampoline body
>	jmp preempt_enable_traced(clobbers)
> }
>
> asm(kernel text):
> preempt_enable_traced:
>	preempt_enable_notrace();
>	restore clobbers
>	return (the return ip on the stack is traced_function_start_code)
>
> If the number of instructions added in preempt_schedule_irq() and
> the complexity to make the trampoline ip detectable (is_trampoline_page()
> or is_trampoline_range()) are small, and tasks_rcu is rendered useless,
> I think it will be a win-win.

Good idea, but is_trampoline_page() will not be so easy, since those
pages are scattered. !is_kernel_text() works, but it doesn't cover
modules. I think we should make a new subsystem to allocate trampoline
pages instead of using module_alloc().

Thank you,

-- Masami Hiramatsu
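One hedged way around the scattered-pages problem, short of a dedicated allocator, would be a registry of trampoline address ranges that the scheduler check can consult. A minimal user-space C sketch of the idea (all names here are hypothetical, not existing kernel APIs; a real implementation would need locking and a faster lookup):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical registry of [start, end) trampoline address ranges. */
struct tramp_range { uintptr_t start, end; };

#define MAX_TRAMP_RANGES 64
static struct tramp_range tramp_ranges[MAX_TRAMP_RANGES];
static size_t nr_tramp_ranges;

/* Record a range when a trampoline is allocated. */
static bool register_trampoline_range(uintptr_t start, uintptr_t end)
{
	if (nr_tramp_ranges >= MAX_TRAMP_RANGES || start >= end)
		return false;
	tramp_ranges[nr_tramp_ranges++] =
		(struct tramp_range){ .start = start, .end = end };
	return true;
}

/* The check preempt_schedule_irq() would perform on the interrupted ip:
 * linear scan here for clarity; a real one would want O(log n) or a
 * dedicated region so a single bounds check suffices. */
static bool is_trampoline_range(uintptr_t ip)
{
	for (size_t i = 0; i < nr_tramp_ranges; i++)
		if (ip >= tramp_ranges[i].start && ip < tramp_ranges[i].end)
			return true;
	return false;
}
```

Masami's suggested subsystem would make this degenerate to one contiguous-region bounds check, which is why it is attractive.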
Re: [PATCH RFC tip/core/rcu 09/16] rcu-tasks: Add an RCU-tasks rude variant
On Sun, May 10, 2020 at 11:49 PM Paul E. McKenney wrote: > > On Sun, May 10, 2020 at 05:59:27PM +0800, Lai Jiangshan wrote: > > On Tue, Mar 17, 2020 at 6:03 AM Steven Rostedt wrote: > > > > > > On Mon, 16 Mar 2020 17:45:40 -0400 > > > Joel Fernandes wrote: > > > > > > > > > > > > > Same for the function side (if not even more so). This would require > > > > > adding > > > > > a srcu_read_lock() to all functions that can be traced! That would be > > > > > a huge > > > > > kill in performance. Probably to the point no one would bother even > > > > > using > > > > > function tracer. > > > > > > > > Point well taken! Thanks, > > > > > > Actually, it's worse than that. (We talked about this on IRC but I wanted > > > it documented here too). > > > > > > You can't use any type of locking, unless you insert it around all the > > > callers of the nops (which is unreasonable). > > > > > > That is, we have gcc -pg -mfentry that creates at the start of all traced > > > functions: > > > > > > : > > > call __fentry__ > > > [code for function here] > > > > > > At boot up (or even by the compiler itself) we convert that to: > > > > > > : > > > nop > > > [code for function here] > > > > > > > > > When we want to trace this function we use text_poke (with current > > > kernels) > > > and convert it to this: > > > > > > : > > > call trace_trampoline > > > [code for function here] > > > > > > > > > That trace_trampoline can be allocated, which means when its no longer > > > needed, it must be freed. But when do we know it's safe to free it? Here's > > > the issue. > > > > > > > > > : > > > call trace_trampoline <- interrupt happens just after the jump > > > [code for function here] > > > > > > Now the task has just executed the call to the trace_trampoline. Which > > > means the instruction pointer is set to the start of the trampoline. But > > > it > > > has yet executed that trampoline. 
> > > > > > Now if the task is preempted, and a real time hog is keeping it from > > > running for minutes at a time (which is possible!). And in the mean time, > > > we are done with that trampoline and free it. What happens when that task > > > is scheduled back? There's no more trampoline to execute even though its > > > instruction pointer is to execute the first operand on the trampoline! > > > > > > I used the analogy of jumping off the cliff expecting a magic carpet to be > > > there to catch you, and just before you land, it disappears. That would be > > > a very bad day indeed! > > > > > > We have no way to add a grace period between the start of a function (can > > > be *any* function) and the start of the trampoline. > > > > Hello > > > > I think adding a small number of instructions to preempt_schedule_irq() > > is sufficient to create the needed protected region between the start > > of a function and the trampoline body. > > > > preempt_schedule_irq() { > > + if (unlikely(is_trampoline_page(page_of(interrupted_ip { > > + return; // don't do preempt schedule > > + > > + } > > preempt_schedule_irq() original body > > } > > > > // generated on trampoline pages > > trace_trampoline() { > >preempt_disable(); > >trace_trampoline body > >jmp preempt_enable_traced(clobbers) > > } > > > > asm(kernel text): > > preempt_enable_traced: > >preempt_enable_notrace(); > >restore cobblers > >return(the return ip on the stack is traced_function_start_code) > > > > > > If the number of instructions added in preempt_schedule_irq() and > > the complexity to make trampoline ip detectable(is_trampoline_page(), > > or is_trampoline_range()) are small, and tasks_rcu is rendered useless, > > I think it will be win-win. > > It certainly would provide a nice reduction in code size! > > This would provide a zero-instructions preempt_disable() at the beginning > of the trampoline and a zero-instructions preempt_enable_no_resched() at > the end, correct? 
> If so, wouldn't this create a potentially long (though
> "weak") preempt-disable region extending to the next preempt_enable(),
> local_bh_enable(), schedule(), interrupt, transition to userspace,
> or similar? This could be quite some time. Note that cond_resched()
> wouldn't help, given that this is only in PREEMPT=y kernels.
>
> The "weak" refers to the fact that if a second resched IPI arrived in the
> meantime, preemption would then happen. But without that second IPI,
> the request for preemption could be ignored for quite some time.
>
> Or am I missing something here?

Hello,

I'm sorry to note that preempt_enable_traced() is in *kernel text*; it is
*not* in the trace_trampoline protected region. So preempt_enable_traced()
can be preempted. And the preempt_enable_notrace() in it checks for any
resched requested during the trampoline, so no resched request is lost.

The idea is "semi-automatically" preempt-disable-protecting the
trampoline. "Semi" means the trampoline still needs preempt_disable() and the
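Lai's claim that no resched request is lost rests on the usual preempt_count bookkeeping: a request raised while the count is elevated is serviced when the count drops back to zero. A toy user-space C model of that interaction (preempt_disable, need_resched, and friends here are stand-ins, not the kernel's implementations):

```c
#include <stdbool.h>

/* Toy model of the preempt_count / need_resched interaction. */
static int preempt_count;
static bool need_resched;
static int scheduled;	/* counts entries into the "scheduler" */

static void preempt_disable(void) { preempt_count++; }

/* e.g. a resched IPI arriving while the trampoline body runs */
static void resched_request(void) { need_resched = true; }

/* preempt_enable_notrace(): drop the count and honor a pending request. */
static void preempt_enable_notrace(void)
{
	if (--preempt_count == 0 && need_resched) {
		need_resched = false;
		scheduled++;	/* stand-in for __preempt_schedule() */
	}
}

/* The trampoline runs with preemption off; a request raised meanwhile
 * is serviced at preempt_enable_notrace(), not dropped. */
static void trace_trampoline(void)
{
	preempt_disable();
	resched_request();	/* interrupt during the trampoline body */
	preempt_enable_notrace();
}
```

Paul's "weak" concern is about the other direction: nothing in this model bounds how long the count can stay elevated before that enable runs.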
Re: [PATCH RFC tip/core/rcu 09/16] rcu-tasks: Add an RCU-tasks rude variant
On Sun, May 10, 2020 at 05:59:27PM +0800, Lai Jiangshan wrote: > On Tue, Mar 17, 2020 at 6:03 AM Steven Rostedt wrote: > > > > On Mon, 16 Mar 2020 17:45:40 -0400 > > Joel Fernandes wrote: > > > > > > > > > > Same for the function side (if not even more so). This would require > > > > adding > > > > a srcu_read_lock() to all functions that can be traced! That would be a > > > > huge > > > > kill in performance. Probably to the point no one would bother even > > > > using > > > > function tracer. > > > > > > Point well taken! Thanks, > > > > Actually, it's worse than that. (We talked about this on IRC but I wanted > > it documented here too). > > > > You can't use any type of locking, unless you insert it around all the > > callers of the nops (which is unreasonable). > > > > That is, we have gcc -pg -mfentry that creates at the start of all traced > > functions: > > > > : > > call __fentry__ > > [code for function here] > > > > At boot up (or even by the compiler itself) we convert that to: > > > > : > > nop > > [code for function here] > > > > > > When we want to trace this function we use text_poke (with current kernels) > > and convert it to this: > > > > : > > call trace_trampoline > > [code for function here] > > > > > > That trace_trampoline can be allocated, which means when its no longer > > needed, it must be freed. But when do we know it's safe to free it? Here's > > the issue. > > > > > > : > > call trace_trampoline <- interrupt happens just after the jump > > [code for function here] > > > > Now the task has just executed the call to the trace_trampoline. Which > > means the instruction pointer is set to the start of the trampoline. But it > > has yet executed that trampoline. > > > > Now if the task is preempted, and a real time hog is keeping it from > > running for minutes at a time (which is possible!). And in the mean time, > > we are done with that trampoline and free it. What happens when that task > > is scheduled back? 
There's no more trampoline to execute even though its > > instruction pointer is to execute the first operand on the trampoline! > > > > I used the analogy of jumping off the cliff expecting a magic carpet to be > > there to catch you, and just before you land, it disappears. That would be > > a very bad day indeed! > > > > We have no way to add a grace period between the start of a function (can > > be *any* function) and the start of the trampoline. > > Hello > > I think adding a small number of instructions to preempt_schedule_irq() > is sufficient to create the needed protected region between the start > of a function and the trampoline body. > > preempt_schedule_irq() { > + if (unlikely(is_trampoline_page(page_of(interrupted_ip { > + return; // don't do preempt schedule > + > + } > preempt_schedule_irq() original body > } > > // generated on trampoline pages > trace_trampoline() { >preempt_disable(); >trace_trampoline body >jmp preempt_enable_traced(clobbers) > } > > asm(kernel text): > preempt_enable_traced: >preempt_enable_notrace(); >restore cobblers >return(the return ip on the stack is traced_function_start_code) > > > If the number of instructions added in preempt_schedule_irq() and > the complexity to make trampoline ip detectable(is_trampoline_page(), > or is_trampoline_range()) are small, and tasks_rcu is rendered useless, > I think it will be win-win. It certainly would provide a nice reduction in code size! This would provide a zero-instructions preempt_disable() at the beginning of the trampoline and a zero-instructions preempt_enable_no_resched() at the end, correct? If so, wouldn't this create a potentially long (though "weak") preempt-disable region extending to the next preempt_enable(), local_bh_enable(), schedule(), interrupt, transition to userspace, or similar? This could be quite some time. Note that cond_resched() wouldn't help, given that this is only in PREEMPT=y kernels. 
The "weak" refers to the fact that if a second resched IPI arrived in the meantime, preemption would then happen. But without that second IPI, the request for preemption could be ignored for quite some time. Or am I missing something here? Thanx, Paul > Thanks > > Lai > > > Since the problem is > > that the task was non-voluntarily preempted before it could execute the > > trampoline, and that trampolines are not allowed (suppose) to call > > schedule, then we have our quiescent state to track (voluntary scheduling). > > When all tasks have either voluntarily scheduled, or entered user space > > after disconnecting a trampoline from a function, we know that it is safe to > > free the trampoline. > > > > -- Steve
Re: [PATCH RFC tip/core/rcu 09/16] rcu-tasks: Add an RCU-tasks rude variant
On Tue, Mar 17, 2020 at 6:03 AM Steven Rostedt wrote: > > On Mon, 16 Mar 2020 17:45:40 -0400 > Joel Fernandes wrote: > > > > > > > Same for the function side (if not even more so). This would require > > > adding > > > a srcu_read_lock() to all functions that can be traced! That would be a > > > huge > > > kill in performance. Probably to the point no one would bother even using > > > function tracer. > > > > Point well taken! Thanks, > > Actually, it's worse than that. (We talked about this on IRC but I wanted > it documented here too). > > You can't use any type of locking, unless you insert it around all the > callers of the nops (which is unreasonable). > > That is, we have gcc -pg -mfentry that creates at the start of all traced > functions: > > : > call __fentry__ > [code for function here] > > At boot up (or even by the compiler itself) we convert that to: > > : > nop > [code for function here] > > > When we want to trace this function we use text_poke (with current kernels) > and convert it to this: > > : > call trace_trampoline > [code for function here] > > > That trace_trampoline can be allocated, which means when its no longer > needed, it must be freed. But when do we know it's safe to free it? Here's > the issue. > > > : > call trace_trampoline <- interrupt happens just after the jump > [code for function here] > > Now the task has just executed the call to the trace_trampoline. Which > means the instruction pointer is set to the start of the trampoline. But it > has yet executed that trampoline. > > Now if the task is preempted, and a real time hog is keeping it from > running for minutes at a time (which is possible!). And in the mean time, > we are done with that trampoline and free it. What happens when that task > is scheduled back? There's no more trampoline to execute even though its > instruction pointer is to execute the first operand on the trampoline! 
>
> I used the analogy of jumping off the cliff expecting a magic carpet to be
> there to catch you, and just before you land, it disappears. That would be
> a very bad day indeed!
>
> We have no way to add a grace period between the start of a function (can
> be *any* function) and the start of the trampoline.

Hello

I think adding a small number of instructions to preempt_schedule_irq()
is sufficient to create the needed protected region between the start
of a function and the trampoline body.

preempt_schedule_irq() {
+	if (unlikely(is_trampoline_page(page_of(interrupted_ip)))) {
+		return; // don't do preempt schedule
+	}
	preempt_schedule_irq() original body
}

// generated on trampoline pages
trace_trampoline() {
	preempt_disable();
	trace_trampoline body
	jmp preempt_enable_traced(clobbers)
}

asm(kernel text):
preempt_enable_traced:
	preempt_enable_notrace();
	restore clobbers
	return (the return ip on the stack is traced_function_start_code)

If the number of instructions added in preempt_schedule_irq() and
the complexity to make the trampoline ip detectable (is_trampoline_page()
or is_trampoline_range()) are small, and tasks_rcu is rendered useless,
I think it will be a win-win.

Thanks

Lai

> Since the problem is
> that the task was non-voluntarily preempted before it could execute the
> trampoline, and that trampolines are not allowed (suppose) to call
> schedule, then we have our quiescent state to track (voluntary scheduling).
> When all tasks have either voluntarily scheduled, or entered user space
> after disconnecting a trampoline from a function, we know that it is safe to
> free the trampoline.
>
> -- Steve
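Pulling the thread together, the protection the proposal would give, and the gaps the replies point at, can be summarized as a schematic timeline of where the interrupted ip can be when a resched request arrives:

```
interrupted ip location        check outcome                  result
-----------------------------  -----------------------------  ---------------------
on the call instruction in     ip is in kernel text, so       task can be preempted
the traced function            is_trampoline_page() is false  anyway (Steve's hole)

on a trampoline page, before   is_trampoline_page() is true   preemption skipped:
preempt_disable() executes                                    the "zero-instruction"
                                                              preempt_disable()

inside the trampoline body     preempt_count > 0              ordinary notrace
                                                              preempt protection

in preempt_enable_traced       kernel text again; pending     request serviced by
(kernel text)                  request still set              preempt_enable_notrace()
```

The first row is why the thread falls back to the voluntary-scheduling quiescent state (tasks RCU) rather than an ip-based check.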