Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

2018-11-29 Thread Peter Zijlstra
On Fri, Oct 19, 2018 at 07:29:45AM -0700, Andy Lutomirski wrote:
> > On Oct 19, 2018, at 1:33 AM, Peter Zijlstra  wrote:
> > 
> >> On Fri, Oct 19, 2018 at 01:08:23AM +, Nadav Amit wrote:
> >> Consider for example do_int3(), and see my inlined comments:
> >> 
> >> dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
> >> {
> >>    ...
> >>    ist_enter(regs);                // => preempt_disable()
> >>    cond_local_irq_enable(regs);    // => assume it enables IRQs
> >>
> >>    ...
> >>    // resched irq can be delivered here. It will not cause rescheduling
> >>    // since preemption is disabled
> >>
> >>    cond_local_irq_disable(regs);   // => assume it disables IRQs
> >>    ist_exit(regs);                 // => preempt_enable_no_resched()
> >> }
> >> 
> >> At this point resched will not happen for unbounded length of time (unless
> >> there is another point when exiting the trap handler that checks if
> >> preemption should take place).
> >> 
> >> Another example is __BPF_PROG_RUN_ARRAY(), which also uses
> >> preempt_enable_no_resched().
> >> 
> >> Am I missing something?
> > 
> > Would not the interrupt return then check for TIF_NEED_RESCHED and call
> > schedule() ?
> 
> The paranoid exit path doesn’t check TIF_NEED_RESCHED because it’s
> fundamentally atomic — it’s running on a percpu stack and it can’t
> schedule. In theory we could do some evil stack switching, but we
> don’t.
> 
> How does NMI handle this?  If an NMI that hit interruptible kernel
> code overflows a perf counter, how does the wake up work?

NMIs should never set NEED_RESCHED. What perf does is send a self-IPI
(irq_work) and do the wakeup from there.
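
A minimal sketch of the mechanism Peter describes, assuming a hypothetical
perf-counter overflow handler (irq_work_queue() is the real, NMI-safe kernel
API; the surrounding names are invented for illustration):

#include <linux/irq_work.h>
#include <linux/sched.h>

static struct task_struct *waiter;      /* hypothetical: the task to wake */

/* Runs in hard-IRQ context once the self-IPI fires, where a wakeup
 * (and hence setting NEED_RESCHED) is legitimate. */
static void overflow_wakeup(struct irq_work *work)
{
        wake_up_process(waiter);
}

static struct irq_work overflow_work = { .func = overflow_wakeup };

/* Hypothetical hook called from the NMI handler: queueing is NMI-safe,
 * it only pushes onto a lockless list and raises the self-IPI. */
static void counter_overflow_nmi(void)
{
        irq_work_queue(&overflow_work);
}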


Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

2018-10-19 Thread Masami Hiramatsu
On Fri, 19 Oct 2018 04:44:33 +
Nadav Amit  wrote:

> at 9:29 PM, Andy Lutomirski  wrote:
> 
> >> On Oct 18, 2018, at 6:08 PM, Nadav Amit  wrote:
> >> 
> >> at 10:00 AM, Andy Lutomirski  wrote:
> >> 
>  On Oct 18, 2018, at 9:47 AM, Nadav Amit  wrote:
>  
>  at 8:51 PM, Andy Lutomirski  wrote:
>  
> >> On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit  wrote:
> >> at 6:22 PM, Andy Lutomirski  wrote:
> >> 
>  On Oct 17, 2018, at 5:54 PM, Nadav Amit  wrote:
>  
>  It is sometimes beneficial to prevent preemption for very few
 instructions, or prevent preemption for some instructions that precede
 a branch (this latter case will be introduced in the next patches).
>  
>  To provide such functionality on x86-64, we use an empty REX-prefix
>  (opcode 0x40) as an indication that preemption is disabled for the
>  following instruction.
> >>> 
> >>> Nifty!
> >>> 
> >>> That being said, I think you have a few bugs. First, you can’t just
> >>> ignore a rescheduling interrupt, as you introduce unbounded latency when
> >>> this happens — you’re effectively emulating preempt_enable_no_resched(),
> >>> which is not a drop-in replacement for preempt_enable(). To fix this,
> >>> you may need to jump to a slow-path trampoline that calls schedule() at
> >>> the end or consider rewinding one instruction instead. Or use TF, which
> >>> is only a little bit terrifying…
> >> 
> >> Yes, I didn’t pay enough attention here. For my use-case, I think that
> >> the easiest solution would be to make synchronize_sched() ignore
> >> preemptions that happen while the prefix is detected. It would slightly
> >> change the meaning of the prefix.
>  
>  So thinking about it further, rewinding the instruction seems the easiest
>  and most robust solution. I’ll do it.
>  
> >>> You also aren’t accounting for the case where you get an exception
> >>> that is, in turn, preempted.
> >>
> >> Hmm.. Can you give me an example for such an exception in my use-case?
> >> I cannot think of an exception that might be preempted (assuming #BP, #MC
> >> cannot be preempted).
> > 
> > Look for cond_local_irq_enable().
>  
>  I looked at it. Yet, I still don’t see how exceptions might happen in my
>  use-case, but having said that - this can be fixed too.
> >>> 
> >>> I’m not totally certain there’s a case that matters. But it’s worth checking
> >> 
> >> I am still checking. But I wanted to ask you whether the existing code is
> >> correct, since it seems to me that others make the same mistake I did,
> >> unless I don’t understand the code.
> >> 
> >> Consider for example do_int3(), and see my inlined comments:
> >> 
> >> dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
> >> {
> >>   ...
> >>   ist_enter(regs);               // => preempt_disable()
> >>   cond_local_irq_enable(regs);   // => assume it enables IRQs
> >>
> >>   ...
> >>   // resched irq can be delivered here. It will not cause rescheduling
> >>   // since preemption is disabled
> >>
> >>   cond_local_irq_disable(regs);  // => assume it disables IRQs
> >>   ist_exit(regs);                // => preempt_enable_no_resched()
> >> }
> >> 
> >> At this point resched will not happen for unbounded length of time (unless
> >> there is another point when exiting the trap handler that checks if
> >> preemption should take place).
> > 
> > I think it's only a bug in the cases where someone uses extable to fix
> > up an int3 (which would be nuts) or that we oops.  But I should still
> > fix it.  In the normal case where int3 was in user code, we'll miss
> > the reschedule in do_trap(), but we'll reschedule in
> > prepare_exit_to_usermode() -> exit_to_usermode_loop().
> 
> Thanks for your quick response, and sorry for bothering instead of dealing
> with it. Note that do_debug() does something similar to do_int3().
> 
> And then there is optimized_callback() that also uses
> preempt_enable_no_resched(). I think the original use was correct, but then
> a19b2e3d7839 ("kprobes/x86: Remove IRQ disabling from ftrace-based/optimized
> kprobes") removed the IRQ disabling, while leaving
> preempt_enable_no_resched(). No?

Ah, good catch!
Indeed, we don't need to stick with no_resched anymore.

Thanks!


-- 
Masami Hiramatsu 
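
For reference, the shape of the fix agreed on here, sketched from memory of
the 4.19-era optimized_callback() in arch/x86/kernel/kprobes/opt.c
(abbreviated, not the verbatim commit):

static void
optimized_callback(struct optimized_kprobe *op, struct pt_regs *regs)
{
        /* This is possible if op is under delayed unoptimizing */
        if (kprobe_disabled(&op->kp))
                return;

        preempt_disable();
        if (kprobe_running()) {
                kprobes_inc_nmissed_count(&op->kp);
        } else {
                /* ... set up regs and invoke the pre_handler ... */
        }
        preempt_enable();       /* was: preempt_enable_no_resched() */
}

With IRQs no longer disabled here after a19b2e3d7839, a reschedule that
became pending while the handler ran can simply be honoured by a plain
preempt_enable().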


Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

2018-10-19 Thread Alexei Starovoitov
On Fri, Oct 19, 2018 at 1:22 AM Peter Zijlstra  wrote:
>
> On Thu, Oct 18, 2018 at 10:00:53PM -0700, Alexei Starovoitov wrote:
> > >
> > > >
> > > > Another example is __BPF_PROG_RUN_ARRAY(), which also uses
> > > > preempt_enable_no_resched().
> > >
> > > Alexei, I think this code is just wrong.
> >
> > why 'just wrong' ?
>
> Because you lost a preemption point, this is a no-no.
>
> >
> > > Do you know why it uses
> > > preempt_enable_no_resched()?
> >
> > don't recall precisely.
> > we could be preemptable at the point where the macro is called.
> > I think the goal of no_resched was to avoid adding scheduling points
> > where they didn't exist before just because a prog ran for a few nsec.
> > Maybe Daniel or Roman remember.
>
> No, you did the exact opposite, where there previously was a preemption,
> you just ate it. The band saw didn't get stopped in time, you lose your
> hand, etc.

Let me do a few experiments then.
We will fix it up.
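
For context, a heavily abbreviated sketch of the macro under discussion, from
memory of the 4.19-era include/linux/bpf.h (details elided; only the preempt
bracket matters here):

#define __BPF_PROG_RUN_ARRAY(array, ctx, func)                 \
        ({                                                      \
                u32 _ret = 1;                                   \
                preempt_disable();                              \
                rcu_read_lock();                                \
                /* ... run each prog in the array ... */        \
                rcu_read_unlock();                              \
                preempt_enable_no_resched();    /* the bug */   \
                _ret;                                           \
        })

The drop-in fix Peter implies is to end the bracket with a plain
preempt_enable(), so that a reschedule which arrived while the programs ran
is acted upon.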


Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

2018-10-19 Thread Andy Lutomirski



> On Oct 19, 2018, at 1:33 AM, Peter Zijlstra  wrote:
> 
>> On Fri, Oct 19, 2018 at 01:08:23AM +, Nadav Amit wrote:
>> Consider for example do_int3(), and see my inlined comments:
>> 
>> dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
>> {
>>    ...
>>    ist_enter(regs);                // => preempt_disable()
>>    cond_local_irq_enable(regs);    // => assume it enables IRQs
>>
>>    ...
>>    // resched irq can be delivered here. It will not cause rescheduling
>>    // since preemption is disabled
>>
>>    cond_local_irq_disable(regs);   // => assume it disables IRQs
>>    ist_exit(regs);                 // => preempt_enable_no_resched()
>> }
>> 
>> At this point resched will not happen for unbounded length of time (unless
>> there is another point when exiting the trap handler that checks if
>> preemption should take place).
>> 
>> Another example is __BPF_PROG_RUN_ARRAY(), which also uses
>> preempt_enable_no_resched().
>> 
>> Am I missing something?
> 
> Would not the interrupt return then check for TIF_NEED_RESCHED and call
> schedule() ?

The paranoid exit path doesn’t check TIF_NEED_RESCHED because it’s 
fundamentally atomic — it’s running on a percpu stack and it can’t schedule. In 
theory we could do some evil stack switching, but we don’t.

How does NMI handle this?  If an NMI that hit interruptible kernel code 
overflows a perf counter, how does the wake up work?

(do_int3() is special because it’s not actually IST.  But it can hit in odd 
places due to kprobes, and I’m nervous about recursing incorrectly into RCU and 
context tracking code if we were to use exception_enter().)

> 
> I think (and this certainly wants a comment) that the ist_exit()
> thing hard relies on the interrupt-return path doing the reschedule.


Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

2018-10-19 Thread Oleg Nesterov
On 10/18, Andy Lutomirski wrote:
>
> Oleg, the code in kernel/signal.c:
>
> preempt_disable();
> read_unlock(&tasklist_lock);
> preempt_enable_no_resched();
> freezable_schedule();
>
> looks bogus.  I don't get what it's trying to achieve with
> preempt_disable(), and I also don't see why no_resched does anything.
> Sure, it prevents a reschedule triggered during read_unlock() from
> causing a reschedule,

Yes. Let's suppose we remove preempt_disable/enable.

The debugger was already woken up; if it runs on the same CPU it will quite
possibly preempt the tracee. After that the debugger will spin in
wait_task_inactive(), until it is in turn preempted or calls
schedule_timeout(1), so that the tracee (current) can finally call
__schedule(preempt = F) and call deactivate_task() to become inactive.

> but it doesn't prevent an interrupt immediately
> after the preempt_enable_no_resched() call from scheduling.

Yes, but this is less likely.

Oleg.



Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

2018-10-19 Thread Peter Zijlstra
On Fri, Oct 19, 2018 at 01:08:23AM +, Nadav Amit wrote:
> Consider for example do_int3(), and see my inlined comments:
> 
> dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
> {
>   ...
>   ist_enter(regs);                // => preempt_disable()
>   cond_local_irq_enable(regs);    // => assume it enables IRQs
>
>   ...
>   // resched irq can be delivered here. It will not cause rescheduling
>   // since preemption is disabled
>
>   cond_local_irq_disable(regs);   // => assume it disables IRQs
>   ist_exit(regs);                 // => preempt_enable_no_resched()
> }
> 
> At this point resched will not happen for unbounded length of time (unless
> there is another point when exiting the trap handler that checks if
> preemption should take place).
> 
> Another example is __BPF_PROG_RUN_ARRAY(), which also uses
> preempt_enable_no_resched().
> 
> Am I missing something?

Would not the interrupt return then check for TIF_NEED_RESCHED and call
schedule() ?

I think (and this certainly wants a comment) that the ist_exit()
thing hard relies on the interrupt-return path doing the reschedule.
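
For reference, the 4.19-era ist_exit() in question (arch/x86/kernel/traps.c,
quoted from memory, comment added):

void ist_exit(struct pt_regs *regs)
{
        preempt_enable_no_resched();    /* drops the count, never schedules */

        if (!user_mode(regs))
                rcu_nmi_exit();
}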


Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

2018-10-19 Thread Peter Zijlstra
On Thu, Oct 18, 2018 at 10:00:53PM -0700, Alexei Starovoitov wrote:
> > 
> > >
> > > Another example is __BPF_PROG_RUN_ARRAY(), which also uses
> > > preempt_enable_no_resched().
> > 
> > Alexei, I think this code is just wrong.
> 
> why 'just wrong' ?

Because you lost a preemption point, this is a no-no.

> 
> > Do you know why it uses
> > preempt_enable_no_resched()?
> 
> don't recall precisely.
> we could be preemptable at the point where the macro is called.
> I think the goal of no_resched was to avoid adding scheduling points
> where they didn't exist before just because a prog ran for a few nsec.
> Maybe Daniel or Roman remember.

No, you did the exact opposite, where there previously was a preemption,
you just ate it. The band saw didn't get stopped in time, you lose your
hand, etc.


Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

2018-10-19 Thread Peter Zijlstra
On Thu, Oct 18, 2018 at 09:29:39PM -0700, Andy Lutomirski wrote:

> > Another example is __BPF_PROG_RUN_ARRAY(), which also uses
> > preempt_enable_no_resched().
> 
> Alexei, I think this code is just wrong. Do you know why it uses
> preempt_enable_no_resched()?

Yes, that's a straight up bug.

It looks like I need to go fix up abuse again :/

> Oleg, the code in kernel/signal.c:
> 
> preempt_disable();
> read_unlock(&tasklist_lock);
> preempt_enable_no_resched();
> freezable_schedule();
> 

The purpose here is to avoid back-to-back schedule() calls, and this
pattern is one of the few correct uses of preempt_enable_no_resched().

Suppose we got a preemption while holding the read_lock(); then when
we did read_unlock(), we would drop preempt_count to 0 and reschedule, and
when we got back we would instantly call into schedule() _again_.

What this code does is increment preempt_count such that
read_unlock() doesn't hit 0 and doesn't call schedule(); then we lower it
to 0 without a call to schedule() and call schedule() explicitly.
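
Annotated, with preempt_count values assuming the tracee already holds the
read_lock() (count 1) and CONFIG_PREEMPT:

        preempt_disable();              /* count 1 -> 2 */
        read_unlock(&tasklist_lock);    /* its internal preempt_enable()
                                         * drops 2 -> 1: never hits 0, so
                                         * the unlock cannot schedule */
        preempt_enable_no_resched();    /* count 1 -> 0, deliberately
                                         * without calling schedule() */
        freezable_schedule();           /* the one explicit schedule() */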



Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

2018-10-18 Thread Alexei Starovoitov
> 
> >
> > Another example is __BPF_PROG_RUN_ARRAY(), which also uses
> > preempt_enable_no_resched().
> 
> Alexei, I think this code is just wrong.

why 'just wrong' ?

> Do you know why it uses
> preempt_enable_no_resched()?

don't recall precisely.
we could be preemptable at the point where the macro is called.
I think the goal of no_resched was to avoid adding scheduling points
where they didn't exist before just because a prog ran for a few nsec.
Maybe Daniel or Roman remember.



Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

2018-10-18 Thread Nadav Amit
at 9:29 PM, Andy Lutomirski  wrote:

>> On Oct 18, 2018, at 6:08 PM, Nadav Amit  wrote:
>> 
>> at 10:00 AM, Andy Lutomirski  wrote:
>> 
 On Oct 18, 2018, at 9:47 AM, Nadav Amit  wrote:
 
 at 8:51 PM, Andy Lutomirski  wrote:
 
>> On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit  wrote:
>> at 6:22 PM, Andy Lutomirski  wrote:
>> 
 On Oct 17, 2018, at 5:54 PM, Nadav Amit  wrote:
 
 It is sometimes beneficial to prevent preemption for very few
 instructions, or prevent preemption for some instructions that precede
 a branch (this latter case will be introduced in the next patches).
 
 To provide such functionality on x86-64, we use an empty REX-prefix
 (opcode 0x40) as an indication that preemption is disabled for the
 following instruction.
>>> 
>>> Nifty!
>>> 
>>> That being said, I think you have a few bugs. First, you can’t just
>>> ignore a rescheduling interrupt, as you introduce unbounded latency when
>>> this happens — you’re effectively emulating preempt_enable_no_resched(),
>>> which is not a drop-in replacement for preempt_enable(). To fix this,
>>> you may need to jump to a slow-path trampoline that calls schedule() at
>>> the end or consider rewinding one instruction instead. Or use TF, which
>>> is only a little bit terrifying…
>> 
>> Yes, I didn’t pay enough attention here. For my use-case, I think that
>> the easiest solution would be to make synchronize_sched() ignore
>> preemptions that happen while the prefix is detected. It would slightly
>> change the meaning of the prefix.
 
 So thinking about it further, rewinding the instruction seems the easiest
 and most robust solution. I’ll do it.
 
>>> You also aren’t accounting for the case where you get an exception that
>>> is, in turn, preempted.
>> 
>> Hmm.. Can you give me an example for such an exception in my use-case? I
>> cannot think of an exception that might be preempted (assuming #BP, #MC
>> cannot be preempted).
> 
> Look for cond_local_irq_enable().
 
 I looked at it. Yet, I still don’t see how exceptions might happen in my
 use-case, but having said that - this can be fixed too.
>>> 
>>> I’m not totally certain there’s a case that matters. But it’s worth checking
>> 
>> I am still checking. But I wanted to ask you whether the existing code is
>> correct, since it seems to me that others make the same mistake I did,
>> unless I don’t understand the code.
>> 
>> Consider for example do_int3(), and see my inlined comments:
>> 
>> dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
>> {
>>   ...
>>   ist_enter(regs);               // => preempt_disable()
>>   cond_local_irq_enable(regs);   // => assume it enables IRQs
>>
>>   ...
>>   // resched irq can be delivered here. It will not cause rescheduling
>>   // since preemption is disabled
>>
>>   cond_local_irq_disable(regs);  // => assume it disables IRQs
>>   ist_exit(regs);                // => preempt_enable_no_resched()
>> }
>> 
>> At this point resched will not happen for unbounded length of time (unless
>> there is another point when exiting the trap handler that checks if
>> preemption should take place).
> 
> I think it's only a bug in the cases where someone uses extable to fix
> up an int3 (which would be nuts) or that we oops.  But I should still
> fix it.  In the normal case where int3 was in user code, we'll miss
> the reschedule in do_trap(), but we'll reschedule in
> prepare_exit_to_usermode() -> exit_to_usermode_loop().

Thanks for your quick response, and sorry for bothering instead of dealing
with it. Note that do_debug() does something similar to do_int3().

And then there is optimized_callback() that also uses
preempt_enable_no_resched(). I think the original use was correct, but then
a19b2e3d7839 ("kprobes/x86: Remove IRQ disabling from ftrace-based/optimized
kprobes") removed the IRQ disabling, while leaving
preempt_enable_no_resched(). No?



Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

2018-10-18 Thread Andy Lutomirski
> On Oct 18, 2018, at 6:08 PM, Nadav Amit  wrote:
>
> at 10:00 AM, Andy Lutomirski  wrote:
>
>>
>>
>>> On Oct 18, 2018, at 9:47 AM, Nadav Amit  wrote:
>>>
>>> at 8:51 PM, Andy Lutomirski  wrote:
>>>
> On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit  wrote:
> at 6:22 PM, Andy Lutomirski  wrote:
>
>>> On Oct 17, 2018, at 5:54 PM, Nadav Amit  wrote:
>>>
>>> It is sometimes beneficial to prevent preemption for very few
>>> instructions, or prevent preemption for some instructions that precede
>>> a branch (this latter case will be introduced in the next patches).
>>>
>>> To provide such functionality on x86-64, we use an empty REX-prefix
>>> (opcode 0x40) as an indication that preemption is disabled for the
>>> following instruction.
>>
>> Nifty!
>>
>> That being said, I think you have a few bugs. First, you can’t just
>> ignore a rescheduling interrupt, as you introduce unbounded latency when
>> this happens — you’re effectively emulating preempt_enable_no_resched(),
>> which is not a drop-in replacement for preempt_enable(). To fix this, you
>> may need to jump to a slow-path trampoline that calls schedule() at the
>> end or consider rewinding one instruction instead. Or use TF, which is
>> only a little bit terrifying…
>
> Yes, I didn’t pay enough attention here. For my use-case, I think that the
> easiest solution would be to make synchronize_sched() ignore preemptions
> that happen while the prefix is detected. It would slightly change the
> meaning of the prefix.
>>>
>>> So thinking about it further, rewinding the instruction seems the easiest
>>> and most robust solution. I’ll do it.
>>>
>> You also aren’t accounting for the case where you get an exception that
>> is, in turn, preempted.
>
> Hmm.. Can you give me an example for such an exception in my use-case? I
> cannot think of an exception that might be preempted (assuming #BP, #MC
> cannot be preempted).

 Look for cond_local_irq_enable().
>>>
>>> I looked at it. Yet, I still don’t see how exceptions might happen in my
>>> use-case, but having said that - this can be fixed too.
>>
>> I’m not totally certain there’s a case that matters.  But it’s worth checking
>
> I am still checking. But I wanted to ask you whether the existing code is
> correct, since it seems to me that others make the same mistake I did,
> unless I don’t understand the code.
>
> Consider for example do_int3(), and see my inlined comments:
>
> dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
> {
>    ...
>    ist_enter(regs);                // => preempt_disable()
>    cond_local_irq_enable(regs);    // => assume it enables IRQs
>
>    ...
>    // resched irq can be delivered here. It will not cause rescheduling
>    // since preemption is disabled
>
>    cond_local_irq_disable(regs);   // => assume it disables IRQs
>    ist_exit(regs);                 // => preempt_enable_no_resched()
> }
>
> At this point resched will not happen for unbounded length of time (unless
> there is another point when exiting the trap handler that checks if
> preemption should take place).

I think it's only a bug in the cases where someone uses extable to fix
up an int3 (which would be nuts) or that we oops.  But I should still
fix it.  In the normal case where int3 was in user code, we'll miss
the reschedule in do_trap(), but we'll reschedule in
prepare_exit_to_usermode() -> exit_to_usermode_loop().

>
> Another example is __BPF_PROG_RUN_ARRAY(), which also uses
> preempt_enable_no_resched().

Alexei, I think this code is just wrong. Do you know why it uses
preempt_enable_no_resched()?

Oleg, the code in kernel/signal.c:

preempt_disable();
read_unlock(&tasklist_lock);
preempt_enable_no_resched();
freezable_schedule();

looks bogus.  I don't get what it's trying to achieve with
preempt_disable(), and I also don't see why no_resched does anything.
Sure, it prevents a reschedule triggered during read_unlock() from
causing a reschedule, but it doesn't prevent an interrupt immediately
after the preempt_enable_no_resched() call from scheduling.

--Andy

>
> Am I missing something?
>
> Thanks,
> Nadav


Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

2018-10-18 Thread Nadav Amit
at 10:00 AM, Andy Lutomirski  wrote:

> 
> 
>> On Oct 18, 2018, at 9:47 AM, Nadav Amit  wrote:
>> 
>> at 8:51 PM, Andy Lutomirski  wrote:
>> 
 On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit  wrote:
 at 6:22 PM, Andy Lutomirski  wrote:
 
>> On Oct 17, 2018, at 5:54 PM, Nadav Amit  wrote:
>> 
>> It is sometimes beneficial to prevent preemption for very few
>> instructions, or prevent preemption for some instructions that precede
>> a branch (this latter case will be introduced in the next patches).
>> 
>> To provide such functionality on x86-64, we use an empty REX-prefix
>> (opcode 0x40) as an indication that preemption is disabled for the
>> following instruction.
> 
> Nifty!
> 
> That being said, I think you have a few bugs. First, you can’t just ignore
> a rescheduling interrupt, as you introduce unbounded latency when this
> happens — you’re effectively emulating preempt_enable_no_resched(), which
> is not a drop-in replacement for preempt_enable(). To fix this, you may
> need to jump to a slow-path trampoline that calls schedule() at the end or
> consider rewinding one instruction instead. Or use TF, which is only a
> little bit terrifying…
 
 Yes, I didn’t pay enough attention here. For my use-case, I think that the
 easiest solution would be to make synchronize_sched() ignore preemptions
 that happen while the prefix is detected. It would slightly change the
 meaning of the prefix.
>> 
>> So thinking about it further, rewinding the instruction seems the easiest
>> and most robust solution. I’ll do it.
>> 
> You also aren’t accounting for the case where you get an exception that
> is, in turn, preempted.
 
 Hmm.. Can you give me an example for such an exception in my use-case? I
 cannot think of an exception that might be preempted (assuming #BP, #MC
 cannot be preempted).
>>> 
>>> Look for cond_local_irq_enable().
>> 
>> I looked at it. Yet, I still don’t see how exceptions might happen in my
>> use-case, but having said that - this can be fixed too.
> 
> I’m not totally certain there’s a case that matters.  But it’s worth checking 

I am still checking. But I wanted to ask you whether the existing code is
correct, since it seems to me that others make the same mistake I did, unless
I don’t understand the code.

Consider for example do_int3(), and see my inlined comments:

dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
{
        ...
        ist_enter(regs);                // => preempt_disable()
        cond_local_irq_enable(regs);    // => assume it enables IRQs

        ...
        // resched irq can be delivered here. It will not cause rescheduling
        // since preemption is disabled

        cond_local_irq_disable(regs);   // => assume it disables IRQs
        ist_exit(regs);                 // => preempt_enable_no_resched()
}

At this point resched will not happen for unbounded length of time (unless
there is another point when exiting the trap handler that checks if
preemption should take place).

Another example is __BPF_PROG_RUN_ARRAY(), which also uses
preempt_enable_no_resched().

Am I missing something?

Thanks,
Nadav
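
For readers following along, the distinction the question hinges on, sketched
from the 4.19-era include/linux/preempt.h (CONFIG_PREEMPT case, simplified):

#define preempt_enable()                                        \
do {                                                            \
        barrier();                                              \
        if (unlikely(preempt_count_dec_and_test()))             \
                __preempt_schedule();   /* honours NEED_RESCHED */ \
} while (0)

#define preempt_enable_no_resched()                             \
do {                                                            \
        barrier();                                              \
        preempt_count_dec();    /* never reschedules by itself */ \
} while (0)

An IRQ that sets NEED_RESCHED while the count is held is acted on by the
first form; with the second, the flag stays set until something else (such
as the next interrupt return to a preemptible context) notices it.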

Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

2018-10-18 Thread Nadav Amit
at 12:54 AM, Peter Zijlstra  wrote:

> On Wed, Oct 17, 2018 at 06:22:48PM -0700, Andy Lutomirski wrote:
>>> On Oct 17, 2018, at 5:54 PM, Nadav Amit  wrote:
>>> 
>>> It is sometimes beneficial to prevent preemption for very few
>>> instructions, or prevent preemption for some instructions that precede
>>> a branch (this latter case will be introduced in the next patches).
>>> 
>>> To provide such functionality on x86-64, we use an empty REX-prefix
>>> (opcode 0x40) as an indication that preemption is disabled for the
>>> following instruction.
>> 
>> Nifty!
>> 
>> That being said, I think you have a few bugs.
> 
>> First, you can’t just ignore a rescheduling interrupt, as you
>> introduce unbounded latency when this happens — you’re effectively
>> emulating preempt_enable_no_resched(), which is not a drop-in
>> replacement for preempt_enable().
> 
>> To fix this, you may need to jump to a slow-path trampoline that calls
>> schedule() at the end or consider rewinding one instruction instead.
>> Or use TF, which is only a little bit terrifying...
> 
> At which point we're very close to in-kernel rseq.

Interesting. I didn’t know about this feature. I’ll see if I can draw some
ideas from there.



Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

2018-10-18 Thread Nadav Amit
at 10:29 AM, Andy Lutomirski  wrote:

> On Thu, Oct 18, 2018 at 10:25 AM Nadav Amit  wrote:
>> at 10:00 AM, Andy Lutomirski  wrote:
>> 
 On Oct 18, 2018, at 9:47 AM, Nadav Amit  wrote:
 
 at 8:51 PM, Andy Lutomirski  wrote:
 
>> On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit  wrote:
>> at 6:22 PM, Andy Lutomirski  wrote:
>> 
 On Oct 17, 2018, at 5:54 PM, Nadav Amit  wrote:
 
 It is sometimes beneficial to prevent preemption for very few
 instructions, or prevent preemption for some instructions that precede
 a branch (this latter case will be introduced in the next patches).
 
 To provide such functionality on x86-64, we use an empty REX-prefix
 (opcode 0x40) as an indication that preemption is disabled for the
 following instruction.
>>> 
>>> Nifty!
>>> 
>>> That being said, I think you have a few bugs. First, you can’t just
>>> ignore a rescheduling interrupt, as you introduce unbounded latency when
>>> this happens — you’re effectively emulating preempt_enable_no_resched(),
>>> which is not a drop-in replacement for preempt_enable(). To fix this,
>>> you may need to jump to a slow-path trampoline that calls schedule() at
>>> the end or consider rewinding one instruction instead. Or use TF, which
>>> is only a little bit terrifying…
>> 
>> Yes, I didn’t pay enough attention here. For my use-case, I think that
>> the easiest solution would be to make synchronize_sched() ignore
>> preemptions that happen while the prefix is detected. It would slightly
>> change the meaning of the prefix.
 
 So thinking about it further, rewinding the instruction seems the easiest
 and most robust solution. I’ll do it.
 
>>> You also aren’t accounting for the case where you get an exception that
>>> is, in turn, preempted.
>> 
>> Hmm.. Can you give me an example for such an exception in my use-case? I
>> cannot think of an exception that might be preempted (assuming #BP, #MC
>> cannot be preempted).
> 
> Look for cond_local_irq_enable().
 
 I looked at it. Yet, I still don’t see how exceptions might happen in my
 use-case, but having said that - this can be fixed too.
>>> 
>>> I’m not totally certain there’s a case that matters. But it’s worth checking
>>> 
 To be frank, I paid relatively little attention to this subject. Any
 feedback about the other parts and especially on the high-level approach?
 Is modifying the retpolines in the proposed manner (assembly macros)
 acceptable?
>>> 
>>> It’s certainly a neat idea, and it could be a real speedup.
>> 
>> Great. So I’ll try to shape things up, and I still wait for other comments
>> (from others).
>> 
>> I’ll just mention two more patches I need to clean up (I know I still owe
>> you some work, so obviously it will be done later):
>> 
>> 1. Seccomp trampolines. On my Ubuntu, when I run Redis, systemd installs 17
>> BPF filters on the Redis server process that are invoked on each
>> system-call. Invoking each one requires an indirect branch. The patch keeps
>> a per-process kernel code-page that holds trampolines for these functions.
> 
> I wonder how many levels of branches are needed before the branches
> involved exceed the retpoline cost.

In this case there is no hierarchy, but a list of trampolines that are
called one after the other, as the seccomp filter order is predefined. It
does not work if different threads of the same process have different
filters.

>> 2. Binary-search for system-calls. Use the per-process kernel code-page also
>> to hold multiple trampolines for the 16 common system calls of a certain
>> process. The patch uses an indirection table and a binary-search to find the
>> proper trampoline.
> 
> Same comment applies here.

Branch misprediction wastes ~7 cycles and a retpoline takes at least 30. So,
assuming the branch predictor is not completely stupid, 3-4 levels should not
be too much.
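
Making the arithmetic concrete: 16 entries cost log2(16) = 4 comparison
levels, and even 4 mispredictions at ~7 cycles (~28 cycles) undercut one
~30-cycle retpoline. A hypothetical 4-entry version of such a generated
dispatcher (the syscall numbers are x86-64's read/write/mmap/ioctl; all
helper names are invented for illustration, not taken from the patch):

/* Binary search over 4 sorted syscall numbers {0, 1, 9, 16}: two
 * comparison levels replace one retpoline-protected indirect call. */
static long dispatch_common(unsigned int nr, struct pt_regs *regs)
{
        if (nr <= 1) {
                if (nr == 0)
                        return do_sys_read(regs);       /* direct call */
                return do_sys_write(regs);
        }
        if (nr == 9)
                return do_sys_mmap(regs);
        if (nr == 16)
                return do_sys_ioctl(regs);
        return do_syscall_slow(nr, regs);       /* retpoline fallback */
}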



Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

2018-10-18 Thread Andy Lutomirski
On Thu, Oct 18, 2018 at 10:25 AM Nadav Amit  wrote:
>
> at 10:00 AM, Andy Lutomirski  wrote:
>
> >
> >
> >> On Oct 18, 2018, at 9:47 AM, Nadav Amit  wrote:
> >>
> >> at 8:51 PM, Andy Lutomirski  wrote:
> >>
>  On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit  wrote:
>  at 6:22 PM, Andy Lutomirski  wrote:
> 
> >> On Oct 17, 2018, at 5:54 PM, Nadav Amit  wrote:
> >>
> >> It is sometimes beneficial to prevent preemption for very few
> >> instructions, or prevent preemption for some instructions that precede
> >> a branch (this latter case will be introduced in the next patches).
> >>
> >> To provide such functionality on x86-64, we use an empty REX-prefix
> >> (opcode 0x40) as an indication that preemption is disabled for the
> >> following instruction.
> >
> > Nifty!
> >
> > That being said, I think you have a few bugs. First, you can’t just 
> > ignore
> > a rescheduling interrupt, as you introduce unbounded latency when this
> > happens — you’re effectively emulating preempt_enable_no_resched(), 
> > which
> > is not a drop-in replacement for preempt_enable(). To fix this, you may
> > need to jump to a slow-path trampoline that calls schedule() at the end 
> > or
> > consider rewinding one instruction instead. Or use TF, which is only a
> > little bit terrifying…
> 
>  Yes, I didn’t pay enough attention here. For my use-case, I think that 
>  the
>  easiest solution would be to make synchronize_sched() ignore preemptions
>  that happen while the prefix is detected. It would slightly change the
>  meaning of the prefix.
> >>
> >> So thinking about it further, rewinding the instruction seems the easiest
> >> and most robust solution. I’ll do it.
> >>
> > You also aren’t accounting for the case where you get an exception that
> > is, in turn, preempted.
> 
>  Hmm.. Can you give me an example for such an exception in my use-case? I
>  cannot think of an exception that might be preempted (assuming #BP, #MC
>  cannot be preempted).
> >>>
> >>> Look for cond_local_irq_enable().
> >>
> >> I looked at it. Yet, I still don’t see how exceptions might happen in my
> >> use-case, but having said that - this can be fixed too.
> >
> > I’m not totally certain there’s a case that matters.  But it’s worth 
> > checking
> >
> >> To be frank, I paid relatively little attention to this subject. Any
> >> feedback about the other parts and especially on the high-level approach? 
> >> Is
> >> modifying the retpolines in the proposed manner (assembly macros)
> >> acceptable?
> >
> > It’s certainly a neat idea, and it could be a real speedup.
>
> Great. So I’ll try to shape things up, and I still wait for other comments
> (from others).
>
> I’ll just mention two more patches I need to cleanup (I know I still owe you 
> some
> work, so obviously it will be done later):
>
> 1. Seccomp trampolines. On my Ubuntu, when I run Redis, systemd installs 17
> BPF filters on the Redis server process that are invoked on each
> system-call. Invoking each one requires an indirect branch. The patch keeps
> a per-process kernel code-page that holds trampolines for these functions.

I wonder how many levels of branches are needed before the branches
involved exceed the retpoline cost.

>
> 2. Binary-search for system-calls. Use the per-process kernel code-page also
> to hold multiple trampolines for the 16 common system calls of a certain
> process. The patch uses an indirection table and a binary-search to find the
> proper trampoline.

Same comment applies here.

>
> Thanks again,
> Nadav



-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

2018-10-18 Thread Nadav Amit
at 10:00 AM, Andy Lutomirski  wrote:

> 
> 
>> On Oct 18, 2018, at 9:47 AM, Nadav Amit  wrote:
>> 
>> at 8:51 PM, Andy Lutomirski  wrote:
>> 
 On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit  wrote:
 at 6:22 PM, Andy Lutomirski  wrote:
 
>> On Oct 17, 2018, at 5:54 PM, Nadav Amit  wrote:
>> 
>> It is sometimes beneficial to prevent preemption for very few
>> instructions, or prevent preemption for some instructions that precede
>> a branch (this latter case will be introduced in the next patches).
>> 
>> To provide such functionality on x86-64, we use an empty REX-prefix
>> (opcode 0x40) as an indication that preemption is disabled for the
>> following instruction.
> 
> Nifty!
> 
> That being said, I think you have a few bugs. First, you can’t just ignore
> a rescheduling interrupt, as you introduce unbounded latency when this
> happens — you’re effectively emulating preempt_enable_no_resched(), which
> is not a drop-in replacement for preempt_enable(). To fix this, you may
> need to jump to a slow-path trampoline that calls schedule() at the end or
> consider rewinding one instruction instead. Or use TF, which is only a
> little bit terrifying…
 
 Yes, I didn’t pay enough attention here. For my use-case, I think that the
 easiest solution would be to make synchronize_sched() ignore preemptions
 that happen while the prefix is detected. It would slightly change the
 meaning of the prefix.
>> 
>> So thinking about it further, rewinding the instruction seems the easiest
>> and most robust solution. I’ll do it.
>> 
> You also aren’t accounting for the case where you get an exception that
> is, in turn, preempted.
 
 Hmm.. Can you give me an example for such an exception in my use-case? I
 cannot think of an exception that might be preempted (assuming #BP, #MC
 cannot be preempted).
>>> 
>>> Look for cond_local_irq_enable().
>> 
>> I looked at it. Yet, I still don’t see how exceptions might happen in my
>> use-case, but having said that - this can be fixed too.
> 
> I’m not totally certain there’s a case that matters.  But it’s worth checking 
> 
>> To be frank, I paid relatively little attention to this subject. Any
>> feedback about the other parts and especially on the high-level approach? Is
>> modifying the retpolines in the proposed manner (assembly macros)
>> acceptable?
> 
> It’s certainly a neat idea, and it could be a real speedup.

Great. So I’ll try to shape things up, and I still wait for other comments
(from others).

I’ll just mention two more patches I need to cleanup (I know I still owe you 
some
work, so obviously it will be done later):

1. Seccomp trampolines. On my Ubuntu, when I run Redis, systemd installs 17
BPF filters on the Redis server process that are invoked on each
system-call. Invoking each one requires an indirect branch. The patch keeps
a per-process kernel code-page that holds trampolines for these functions.

2. Binary-search for system-calls. Use the per-process kernel code-page also
to hold multiple trampolines for the 16 common system calls of a certain
process. The patch uses an indirection table and a binary-search to find the
proper trampoline.

Thanks again,
Nadav

Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

2018-10-18 Thread Andy Lutomirski



> On Oct 18, 2018, at 9:47 AM, Nadav Amit  wrote:
> 
> at 8:51 PM, Andy Lutomirski  wrote:
> 
>>> On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit  wrote:
>>> at 6:22 PM, Andy Lutomirski  wrote:
>>> 
> On Oct 17, 2018, at 5:54 PM, Nadav Amit  wrote:
> 
> It is sometimes beneficial to prevent preemption for very few
> instructions, or prevent preemption for some instructions that precede
> a branch (this latter case will be introduced in the next patches).
> 
> To provide such functionality on x86-64, we use an empty REX-prefix
> (opcode 0x40) as an indication that preemption is disabled for the
> following instruction.
 
 Nifty!
 
 That being said, I think you have a few bugs. First, you can’t just ignore
 a rescheduling interrupt, as you introduce unbounded latency when this
 happens — you’re effectively emulating preempt_enable_no_resched(), which
 is not a drop-in replacement for preempt_enable(). To fix this, you may
 need to jump to a slow-path trampoline that calls schedule() at the end or
 consider rewinding one instruction instead. Or use TF, which is only a
 little bit terrifying…
>>> 
>>> Yes, I didn’t pay enough attention here. For my use-case, I think that the
>>> easiest solution would be to make synchronize_sched() ignore preemptions
>>> that happen while the prefix is detected. It would slightly change the
>>> meaning of the prefix.
> 
> So thinking about it further, rewinding the instruction seems the easiest
> and most robust solution. I’ll do it.
> 
 You also aren’t accounting for the case where you get an exception that
 is, in turn, preempted.
>>> 
>>> Hmm.. Can you give me an example for such an exception in my use-case? I
>>> cannot think of an exception that might be preempted (assuming #BP, #MC
>>> cannot be preempted).
>> 
>> Look for cond_local_irq_enable().
> 
> I looked at it. Yet, I still don’t see how exceptions might happen in my
> use-case, but having said that - this can be fixed too.

I’m not totally certain there’s a case that matters.  But it’s worth checking 

> 
> To be frank, I paid relatively little attention to this subject. Any
> feedback about the other parts and especially on the high-level approach? Is
> modifying the retpolines in the proposed manner (assembly macros)
> acceptable?
> 

It’s certainly a neat idea, and it could be a real speedup.

> Thanks,
> Nadav


Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

2018-10-18 Thread Nadav Amit
at 8:51 PM, Andy Lutomirski  wrote:

> On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit  wrote:
>> at 6:22 PM, Andy Lutomirski  wrote:
>> 
 On Oct 17, 2018, at 5:54 PM, Nadav Amit  wrote:
 
 It is sometimes beneficial to prevent preemption for very few
 instructions, or prevent preemption for some instructions that precede
 a branch (this latter case will be introduced in the next patches).
 
 To provide such functionality on x86-64, we use an empty REX-prefix
 (opcode 0x40) as an indication that preemption is disabled for the
 following instruction.
>>> 
>>> Nifty!
>>> 
>>> That being said, I think you have a few bugs. First, you can’t just ignore
>>> a rescheduling interrupt, as you introduce unbounded latency when this
>>> happens — you’re effectively emulating preempt_enable_no_resched(), which
>>> is not a drop-in replacement for preempt_enable(). To fix this, you may
>>> need to jump to a slow-path trampoline that calls schedule() at the end or
>>> consider rewinding one instruction instead. Or use TF, which is only a
>>> little bit terrifying…
>> 
>> Yes, I didn’t pay enough attention here. For my use-case, I think that the
>> easiest solution would be to make synchronize_sched() ignore preemptions
>> that happen while the prefix is detected. It would slightly change the
>> meaning of the prefix.

So thinking about it further, rewinding the instruction seems the easiest
and most robust solution. I’ll do it.

>>> You also aren’t accounting for the case where you get an exception that
>>> is, in turn, preempted.
>> 
>> Hmm.. Can you give me an example for such an exception in my use-case? I
>> cannot think of an exception that might be preempted (assuming #BP, #MC
>> cannot be preempted).
> 
> Look for cond_local_irq_enable().

I looked at it. Yet, I still don’t see how exceptions might happen in my
use-case, but having said that - this can be fixed too.

To be frank, I paid relatively little attention to this subject. Any
feedback about the other parts and especially on the high-level approach? Is
modifying the retpolines in the proposed manner (assembly macros)
acceptable?

Thanks,
Nadav

Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

2018-10-18 Thread Peter Zijlstra
On Wed, Oct 17, 2018 at 06:22:48PM -0700, Andy Lutomirski wrote:
> 
> > On Oct 17, 2018, at 5:54 PM, Nadav Amit  wrote:
> > 
> > It is sometimes beneficial to prevent preemption for very few
> > instructions, or prevent preemption for some instructions that precede
> > a branch (this latter case will be introduced in the next patches).
> > 
> > To provide such functionality on x86-64, we use an empty REX-prefix
> > (opcode 0x40) as an indication that preemption is disabled for the
> > following instruction.
> 
> Nifty!
> 
> That being said, I think you have a few bugs.

> First, you can’t just ignore a rescheduling interrupt, as you
> introduce unbounded latency when this happens — you’re effectively
> emulating preempt_enable_no_resched(), which is not a drop-in
> replacement for preempt_enable().

> To fix this, you may need to jump to a slow-path trampoline that calls
> schedule() at the end or consider rewinding one instruction instead.
> Or use TF, which is only a little bit terrifying...

At which point we're very close to in-kernel rseq.


Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

2018-10-17 Thread Andy Lutomirski
On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit  wrote:
>
> at 6:22 PM, Andy Lutomirski  wrote:
>
> >
> >> On Oct 17, 2018, at 5:54 PM, Nadav Amit  wrote:
> >>
> >> It is sometimes beneficial to prevent preemption for very few
> >> instructions, or prevent preemption for some instructions that precede
> >> a branch (this latter case will be introduced in the next patches).
> >>
> >> To provide such functionality on x86-64, we use an empty REX-prefix
> >> (opcode 0x40) as an indication that preemption is disabled for the
> >> following instruction.
> >
> > Nifty!
> >
> > That being said, I think you have a few bugs. First, you can’t just ignore
> > a rescheduling interrupt, as you introduce unbounded latency when this
> > happens — you’re effectively emulating preempt_enable_no_resched(), which
> > is not a drop-in replacement for preempt_enable(). To fix this, you may
> > need to jump to a slow-path trampoline that calls schedule() at the end or
> > consider rewinding one instruction instead. Or use TF, which is only a
> > little bit terrifying…
>
> Yes, I didn’t pay enough attention here. For my use-case, I think that the
> easiest solution would be to make synchronize_sched() ignore preemptions
> that happen while the prefix is detected. It would slightly change the
> meaning of the prefix.
>
> > You also aren’t accounting for the case where you get an exception that
> > is, in turn, preempted.
>
> Hmm.. Can you give me an example for such an exception in my use-case? I
> cannot think of an exception that might be preempted (assuming #BP, #MC
> cannot be preempted).
>

Look for cond_local_irq_enable().

--Andy
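
For reference, the helper Andy points at lives in arch/x86/kernel/traps.c and is essentially the following: an exception handler re-enables interrupts only if the interrupted context had them enabled, and that re-enabled window is where the handler itself can take a rescheduling interrupt.

/* Abridged from arch/x86/kernel/traps.c of that era. */
static void cond_local_irq_enable(struct pt_regs *regs)
{
	if (regs->flags & X86_EFLAGS_IF)
		local_irq_enable();
}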


Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

2018-10-17 Thread Nadav Amit
at 8:11 PM, Nadav Amit  wrote:

> at 6:22 PM, Andy Lutomirski  wrote:
> 
>>> On Oct 17, 2018, at 5:54 PM, Nadav Amit  wrote:
>>> 
>>> It is sometimes beneficial to prevent preemption for very few
>>> instructions, or prevent preemption for some instructions that precede
>>> a branch (this latter case will be introduced in the next patches).
>>> 
>>> To provide such functionality on x86-64, we use an empty REX-prefix
>>> (opcode 0x40) as an indication that preemption is disabled for the
>>> following instruction.
>> 
>> Nifty!
>> 
>> That being said, I think you have a few bugs. First, you can’t just ignore
>> a rescheduling interrupt, as you introduce unbounded latency when this
>> happens — you’re effectively emulating preempt_enable_no_resched(), which
>> is not a drop-in replacement for preempt_enable(). To fix this, you may
>> need to jump to a slow-path trampoline that calls schedule() at the end or
>> consider rewinding one instruction instead. Or use TF, which is only a
>> little bit terrifying…
> 
> Yes, I didn’t pay enough attention here. For my use-case, I think that the
> easiest solution would be to make synchronize_sched() ignore preemptions
> that happen while the prefix is detected. It would slightly change the
> meaning of the prefix.

Ignore this nonsense that I wrote. I’ll try to come up with a decent
solution.

>> You also aren’t accounting for the case where you get an exception that
>> is, in turn, preempted.
> 
> Hmm.. Can you give me an example for such an exception in my use-case? I
> cannot think of an exception that might be preempted (assuming #BP, #MC
> cannot be preempted).
> 
> I agree that for super-general case this might be inappropriate.




Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

2018-10-17 Thread Nadav Amit
at 6:22 PM, Andy Lutomirski  wrote:

> 
>> On Oct 17, 2018, at 5:54 PM, Nadav Amit  wrote:
>> 
>> It is sometimes beneficial to prevent preemption for very few
>> instructions, or prevent preemption for some instructions that precede
>> a branch (this latter case will be introduced in the next patches).
>> 
>> To provide such functionality on x86-64, we use an empty REX-prefix
>> (opcode 0x40) as an indication that preemption is disabled for the
>> following instruction.
> 
> Nifty!
> 
> That being said, I think you have a few bugs. First, you can’t just ignore
> a rescheduling interrupt, as you introduce unbounded latency when this
> happens — you’re effectively emulating preempt_enable_no_resched(), which
> is not a drop-in replacement for preempt_enable(). To fix this, you may
> need to jump to a slow-path trampoline that calls schedule() at the end or
> consider rewinding one instruction instead. Or use TF, which is only a
> little bit terrifying…

Yes, I didn’t pay enough attention here. For my use-case, I think that the
easiest solution would be to make synchronize_sched() ignore preemptions
that happen while the prefix is detected. It would slightly change the
meaning of the prefix.

> You also aren’t accounting for the case where you get an exception that
> is, in turn, preempted.

Hmm.. Can you give me an example for such an exception in my use-case? I
cannot think of an exception that might be preempted (assuming #BP, #MC
cannot be preempted).

I agree that for super-general case this might be inappropriate.



Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

2018-10-17 Thread Andy Lutomirski


> On Oct 17, 2018, at 5:54 PM, Nadav Amit  wrote:
> 
> It is sometimes beneficial to prevent preemption for very few
> instructions, or prevent preemption for some instructions that precede
> a branch (this latter case will be introduced in the next patches).
> 
> To provide such functionality on x86-64, we use an empty REX-prefix
> (opcode 0x40) as an indication that preemption is disabled for the
> following instruction.

Nifty!

That being said, I think you have a few bugs.  First, you can’t just ignore a 
rescheduling interrupt, as you introduce unbounded latency when this happens — 
you’re effectively emulating preempt_enable_no_resched(), which is not a 
drop-in replacement for preempt_enable(). To fix this, you may need to jump to 
a slow-path trampoline that calls schedule() at the end or consider rewinding 
one instruction instead. Or use TF, which is only a little bit terrifying...
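
For context, the distinction being drawn is roughly the following, simplified from include/linux/preempt.h with CONFIG_PREEMPT=y:

/* preempt_enable() re-checks for a pending reschedule on its way out... */
#define preempt_enable()					\
do {								\
	barrier();						\
	if (unlikely(preempt_count_dec_and_test()))		\
		__preempt_schedule();				\
} while (0)

/* ...while the no_resched variant only drops the count, so a
 * need-resched request raised meanwhile is not acted upon. */
#define preempt_enable_no_resched()				\
do {								\
	barrier();						\
	preempt_count_dec();					\
} while (0)

Skipping preempt_schedule_irq() when the prefix is detected behaves like the second variant: TIF_NEED_RESCHED stays set, but nothing acts on it until some unrelated event happens to re-check it.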

You also aren’t accounting for the case where you get an exception that is, in 
turn, preempted.



> 
> It is expected that this opcode is not in common use.
> 
> Signed-off-by: Nadav Amit 
> ---
> arch/x86/entry/entry_64.S            | 10 ++++++++++
> arch/x86/include/asm/nospec-branch.h | 12 ++++++++++++
> 2 files changed, 22 insertions(+)
> 
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index cb8a5893fd33..31d59aad496e 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -643,6 +643,16 @@ retint_kernel:
>  	jnc	1f
>  0:	cmpl	$0, PER_CPU_VAR(__preempt_count)
>  	jnz	1f
> +
> +	/*
> +	 * Allow to use hint to prevent preemption on a certain instruction.
> +	 * Consider an instruction with the first byte having REX prefix
> +	 * without any bits set as an indication for preemption disabled.
> +	 */
> +	movq	RIP(%rsp), %rax
> +	cmpb	$PREEMPT_DISABLE_PREFIX, (%rax)
> +	jz	1f
> +
>  	call	preempt_schedule_irq
>  	jmp	0b
>  1:
> diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
> index 80dc14422495..0267611eb247 100644
> --- a/arch/x86/include/asm/nospec-branch.h
> +++ b/arch/x86/include/asm/nospec-branch.h
> @@ -52,6 +52,12 @@
>  	jnz	771b;				\
>  	add	$(BITS_PER_LONG/8) * nr, sp;
>  
> +/*
> + * An empty REX-prefix is an indication that preemption should not take
> + * place on this instruction.
> + */
> +#define PREEMPT_DISABLE_PREFIX (0x40)
> +
>  #ifdef __ASSEMBLY__
>  
>  /*
> @@ -148,6 +154,12 @@
>  #endif
>  .endm
>  
> +.macro preempt_disable_prefix
> +#ifdef CONFIG_PREEMPT
> +	.byte	PREEMPT_DISABLE_PREFIX
> +#endif
> +.endm
> +
>  #else /* __ASSEMBLY__ */
>  
>  #define ANNOTATE_NOSPEC_ALTERNATIVE		\
> -- 
> 2.17.1
> 
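
In C terms, the check added to retint_kernel above amounts to the following sketch (the real test runs in assembly on the interrupt-return path; regs->ip is the saved RIP examined by the movq/cmpb pair, and the function name is invented):

/*
 * Sketch only: peek at the first byte of the interrupted instruction
 * and skip preempt_schedule_irq() if it carries the empty-REX
 * "no preempt" marker.
 */
static bool preemption_hinted_off(struct pt_regs *regs)
{
	return *(u8 *)regs->ip == PREEMPT_DISABLE_PREFIX;	/* 0x40 */
}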

