Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs

2015-03-08 Thread Andy Lutomirski
On Sun, Mar 8, 2015 at 6:59 AM, Andy Lutomirski  wrote:
> There's another problem, though:  We don't have a real stack pointer
> just after syscall and just before sysexit, and therefore we *must*
> use IST for anything that can happen while we still have
> user-controlled rsp.  That includes #DB, #NM, and #MC.

I think I faked myself out here.  Why do #DB and #BP use IST?  We
could remove a decent (large?) amount of crud if we changed that.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs

2015-03-08 Thread Ingo Molnar

* Ingo Molnar  wrote:

> Doing that would give us four (theoretical) performance advantages:
> 
>   - No implicit irq disabling overhead when the syscall instruction is
> executed: we could change MSR_SYSCALL_MASK from 0xc084 to
> 0xc284, which removes the implicit CLI on syscall entry.
> 
>   - No explicit irq enabling overhead via ENABLE_INTERRUPTS() [STI] in
> system_call.
> 
>   - No explicit irq disabling overhead in the ret_from_sys_call fast 
> path, i.e. no DISABLE_INTERRUPTS() [CLI].
> 
>   - No implicit irq enabling overhead in ret_from_sys_call's 
> USERGS_SYSRET64: the SYSRETQ instruction would not have to 
> re-enable irqs as the user-space IF in R11 would match that of the 
> current IF.
> 
> Whether that's an actual performance win in practice as well needs 
> to be measured, but I'd be (very!) shocked if it wasn't in the 20+ 
> cycles range: which is absolutely huge in terms of system_call 
> optimizations.

So just to quantify the potential 64-bit system call entry fast path 
performance savings a bit, I tried to simulate the effects in 
user-space via a 'best case' simulation, where we do a PUSHFQ+CLI+STI 
... CLI+POPFQ simulated syscall sequence (beginning and end 
sufficiently far from each other to not be interacting), on Intel 
family 6 model 62 CPUs (slightly dated but still relevant):

with irq disabling/enabling:

  new best speed: 2710739 loops (158 cycles per iteration).

fully preemptible:

  new best speed: 3389503 loops (113 cycles per iteration).

Now that's about a 40-cycle difference, but admittedly the cost very 
much depends on the way we save and restore flags, and on how 
intelligently the CPU can hide the irq disabling and the restoration 
amongst the other processing it has to do on entry/exit, which it can 
do pretty well in a number of important cases.

I don't think I can simulate the real thing in user-space:

  - The hardest bit to simulate is SYSRET: POPFQ is expensive, but 
    SYSRET might be able to 'cheat' on the enabling side.

  - I _think_ it cannot cheat, because user-space might have come in 
    with irqs disabled itself (we still have iopl(3)), so it's a 
    POPFQ-equivalent instruction.

  - OTOH the CPU might be able to hide the latency of the POPFQ 
amongst other SYSRET return work (which is significant) - so this 
is really hard to estimate.

So "we'll have to try it to see it" :-/ [and maybe Intel knows.]

But even if just half of the suspected savings can be realized: a 20 
cycles speedup is very tempting IMHO, given that our 64-bit system 
calls cost around 110 cycles these days.

Yes, it's scary, crazy, potentially fragile, might not even work, etc. 
- but it's also very tempting nevertheless ...

So I'll try to write a prototype of this, just to be able to get some 
numbers - but shoot me down if you think I'm being stupid and if the 
concept is an absolute non-starter to begin with!

Thanks,

Ingo


Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs

2015-03-08 Thread Ingo Molnar

* Linus Torvalds  wrote:

> On Sat, Mar 7, 2015 at 2:36 AM, Ingo Molnar  wrote:
> >
> > We could save the same 10 cycles from page fault overhead as well,
> > AFAICS.
> 
> Are trap gates actually noticeably faster? Or is it just the
> "conditional_sti()" you're worried about?

( I'll talk about the CR2 complication later, please ignore that 
  problem for a moment. )

So I base my thinking on the following hierarchy of fast paths. In a 
typical x86 system there are 4 main types of 'context switches', in 
order of decreasing frequency:

   - syscall    'context switch': entry/exit from syscall
   - trap/fault 'context switch': entry/exit from trap/fault
   - irq        'context switch': entry/exit from irqs
   - task       'context switch': scheduler context-switch

Where each successive level is about an order of magnitude less 
frequently executed on a typical system than the previous one, so to 
optimize a level we are willing to push overhead to the next one(s).

So the primary payoff in executing much of the entry code with irqs 
enabled would be to allow 64-bit *syscalls* to be made without irqs 
disabled: the first, most important level of context entries.

Doing that would give us four (theoretical) performance advantages:

  - No implicit irq disabling overhead when the syscall instruction is
executed: we could change MSR_SYSCALL_MASK from 0xc084 to
0xc284, which removes the implicit CLI on syscall entry.

  - No explicit irq enabling overhead via ENABLE_INTERRUPTS() [STI] in
system_call.

  - No explicit irq disabling overhead in the ret_from_sys_call fast 
path, i.e. no DISABLE_INTERRUPTS() [CLI].

  - No implicit irq enabling overhead in ret_from_sys_call's 
USERGS_SYSRET64: the SYSRETQ instruction would not have to 
re-enable irqs as the user-space IF in R11 would match that of the 
current IF.

Whether that's an actual performance win in practice as well needs to 
be measured, but I'd be (very!) shocked if it wasn't in the 20+ cycles 
range: which is absolutely huge in terms of system_call optimizations.

Now do I think this could be done realistically? I think yes (by 
re-using the NMI code's paranoid entry codepaths for the irq code as 
well, essentially fixing up the effects of any partial entries in an 
irq entry slow path), but I could be wrong about that.

My main worry isn't even feasibility but maintainability and general 
robustness: all these asm cleanups we are doing now could enable such 
a trick to be pulled off robustly.

But I think it could be done technically, because the NMI code already 
has to be careful about 'preempting' partial entries, so we have the 
logic.

Complications:

 - We now enable double partial entries: partial syscall interrupted
   by an irq interrupted by an NMI context. I think it should all work
   out fine but it's a new scenario.

 - We'd enable interruptible return from system call, which caused
   trouble in the past. Solvable IMHO, by being careful in the irq 
   entry code.

 - We'd now have to be extra careful about the possibility of 
   exceptions triggered in the entry/exit code itself, triggered by 
   any sort of unusual register content or MMU fault.

Simplifications:

 - I'd ruthlessly disable IRQs for any sort of non-fast-path: for 
   example 32-bit compat entries, ptrace or any other slowpath - at 
   least initially, until we map out the long-term effects of this 
   optimization.

Does this scare me? Yes, I think it should scare any sane person, but 
I don't think it's all that bad: all the NMI paranoid-entry work has 
already blazed the trail, and most of the races will be readily 
triggerable via regular irq loads, so it's not like we'll leave subtle 
bugs in there.

Being able to do the same with certain traps, because irq entry is 
careful about partial entry state, would just be a secondary bonus.

Regarding the CR2 value on page faults:

> Anyway, for page faulting, we traditionally actually wanted an 
> interrupt gate, because of how we wanted to avoid interrupts coming 
> in and possibly messing up %cr2 due to vmalloc faults, but more 
> importantly for preemption. [...]

Here too I think we could take a page from the NMI code: save cr2 in 
the page fault asm code, recognize from the irq code when we are 
interrupting that, and drop into a slowpath that saves cr2 right 
there. Potentially task-context-switching will be safe after that.

Saving cr2 in the early page fault code should be much less of an 
overhead than what the IRQ disabling/enabling costs, so this should be 
a win. The potential win could be similar to that of system calls:

  - Removal of an implicit 'CLI' in irq gates

  - Removal of the explicit 'STI' in conditional_sti in your proposed 
code

  - Removal of an explicit 'CLI' (DISABLE_INTERRUPTS) in 
error_exit/retint_swapgs.

  - Removal of an implicit 'STI' when SYSRET enables interrupts from R11

and the same savings would apply to FPU traps as well.


Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs

2015-03-07 Thread Linus Torvalds
On Sat, Mar 7, 2015 at 2:36 AM, Ingo Molnar  wrote:
>
> We could save the same 10 cycles from page fault overhead as well,
> AFAICS.

Are trap gates actually noticeably faster? Or is it just the
"conditional_sti()" you're worried about?

Anyway, for page faulting, we traditionally actually wanted an
interrupt gate, because of how we wanted to avoid interrupts coming in
and possibly messing up %cr2 due to vmalloc faults, but more
importantly for preemption. vmalloc faults are "harmless" because
we'll notice that it's already done, return, and then re-take the real
fault. But a preemption event before we read %cr2 can cause bad things
to happen:

 - page fault pushes error code on stack, address in %cr2

 - we don't have interrupts disabled, and some interrupt comes in and
causes preemption

 - some other process runs, takes another page fault. %cr2 now is the
wrong address

 - we go back to the original thread (perhaps on another cpu), which
now reads %cr2 for the wrong address

 - we send the process a SIGSEGV because we think it's accessing
memory that it has no place touching

So the page fault code actually *needs* interrupts disabled until we
read %cr2. Stupid x86 trap semantics where the error code is on the
thread-safe stack, but %cr2 is not.

Maybe there is some trick I'm missing, but on the whole I think
"interrupt gate + conditional_sti()" does have things going for it.
Yes, it still leaves NMI as being special, but NMI really *is*
special.

Linus


Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs

2015-03-07 Thread Ingo Molnar

* Andy Lutomirski  wrote:

> On Fri, Mar 6, 2015 at 2:00 PM, Linus Torvalds
>  wrote:
> > On Fri, Mar 6, 2015 at 11:23 AM, Andy Lutomirski  
> > wrote:
> >>
> >> Please don't.  IMO it's really nice that we don't use trap gates at
> >> all on x86_64, and I find the conditional_sti thing much nicer than
> >> having to audit all of the entry code to see whether it's safe to run
> >> it with IRQs on.
> >
> > So I'm not sure I see much difference, but I'd certainly be ok with
> > just moving the "conditional_sti()" up unconditionally to be the first
> > thing in do_device_not_available().
> 
> I'd be fine with that.  The important difference is that it's after swapgs.

The thing is, we have to be careful about NMI contexts anyway. So how 
about being careful in irq contexts as well?

We could shave a good 10 cycles from the FPU trap overhead, maybe 
more?

We could save the same 10 cycles from page fault overhead as well, 
AFAICS.

Hm?

Thanks,

Ingo


Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs

2015-03-07 Thread Ingo Molnar

* Linus Torvalds  wrote:

> On Thu, Mar 5, 2015 at 11:58 PM, Ingo Molnar  wrote:
> >
> > math_state_restore() was historically called with irqs disabled,
> > because that's how the hardware generates the trap, and also because
> > back in the days it was possible for it to be an asynchronous
> > interrupt and interrupt handlers run with irqs off.
> >
> > These days it's always an instruction trap, and furthermore it does
> > inevitably complex things such as memory allocation and signal
> > processing, which is not done with irqs disabled.
> >
> > So keep irqs enabled.
> 
> I agree with the "keep irqs enabled".
> 
> However, I do *not* agree with the actual patch, which doesn't do that at all.
> > @@ -844,8 +844,9 @@ void math_state_restore(void)
> >  {
> > struct task_struct *tsk = current;
> >
> > +   local_irq_enable();
> > +
> 
> There's a big difference between "keep interrupts enabled" (ok) and
> "explicitly enable interrupts in random contexts" (*NOT* ok).

Agreed, so I thought that we already kind of did that:

   if (!tsk_used_math(tsk)) {
   local_irq_enable();

But yeah, my patch brought that to a whole new level by always doing 
it, without starting with adding a warning first.

> 
> So get rid of the "local_irq_enable()" entirely, and replace it with a
>
>    WARN_ON_ONCE(irqs_disabled());

Yeah, agreed absolutely - sorry about scaring (or annoying) you with a 
Signed-off-by patch, that was silly from me.

> and let's just fix the cases where this actually gets called with 
> interrupts off. [...]

Yes. I was a bit blinded by the 'easy to backport' aspect, so I 
concentrated on that, but it's more important to not break stuff.

> @@ -959,7 +949,7 @@ void __init trap_init(void)
> >   set_system_intr_gate(X86_TRAP_OF, &overflow);
>   set_intr_gate(X86_TRAP_BR, bounds);
>   set_intr_gate(X86_TRAP_UD, invalid_op);
> - set_intr_gate(X86_TRAP_NM, device_not_available);
> + set_trap_gate(X86_TRAP_NM, device_not_available);

So I wasn't this brave.

Historically modern x86 entry code ran with irqs off, because that's 
what the hardware gave us on most entry types. I'm not 100% sure we 
are ready to allow preemption of sensitive entry code on 
CONFIG_PREEMPT=y kernels. But we could try.

Thanks,

Ingo


Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs

2015-03-07 Thread Linus Torvalds
On Sat, Mar 7, 2015 at 2:36 AM, Ingo Molnar <mi...@kernel.org> wrote:
>
> We could save the same 10 cycles from page fault overhead as well,
> AFAICS.

Are trap gates actually noticeably faster? Or is it just the
"conditional_sti()" you're worried about?

Anyway, for page faulting, we traditionally actually wanted an
interrupt gate, because of how we wanted to avoid interrupts coming in
and possibly messing up %cr2 due to vmalloc faults, but more
importantly for preemption. vmalloc faults are harmless because
we'll notice that it's already done, return, and then re-take the real
fault. But a preemption event before we read %cr2 can cause bad things
to happen:

 - page fault pushes error code on stack, address in %cr2

 - we don't have interrupts disabled, and some interrupt comes in and
causes preemption

 - some other process runs, take another page fault. %cr2 now is the
wrong address

 - we go back to the original thread (perhaps on another cpu), which
now reads %cr2 for the wrong address

 - we send the process a SIGSEGV because we think it's accessing
memory that it has no place touching

So the page fault code actually *needs* interrupts disabled until we
read %cr2. Stupid x86 trap semantics where the error code is on the
thread-safe stack, but %cr2 is not.
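The race Linus walks through can be reduced to a toy model in plain C (a sketch only: `fake_cr2` and the two `fault_*` functions are invented names standing in for the %cr2 register and the two possible handler orderings, not actual kernel code):

```c
/* Toy model of the %cr2 race: one global "register" shared by all
 * page faults, exactly like the real %cr2. */
static unsigned long fake_cr2;

/* Buggy ordering: the handler runs with interrupts enabled, so a
 * preempting task can fault and clobber %cr2 before we read it. */
static unsigned long fault_irqs_on(unsigned long addr, int preempted)
{
	fake_cr2 = addr;		/* hardware sets %cr2 */
	if (preempted)
		fake_cr2 = 0xbad;	/* some other task's page fault */
	return fake_cr2;		/* handler reads the WRONG address */
}

/* Correct ordering: capture %cr2 before anything can preempt us;
 * it then no longer matters what later faults do to the register. */
static unsigned long fault_irqs_off(unsigned long addr, int preempted)
{
	unsigned long address;

	fake_cr2 = addr;		/* hardware sets %cr2 */
	address = fake_cr2;		/* read it with irqs still off */
	if (preempted)
		fake_cr2 = 0xbad;	/* harmless now */
	return address;
}
```

With the buggy ordering the handler ends up delivering SIGSEGV for an address the task never touched; with the correct ordering the clobber is harmless.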

Maybe there is some trick I'm missing, but on the whole I think
interrupt gate + conditional_sti() does have things going for it.
Yes, it still leaves NMI as being special, but NMI really *is*
special.

Linus


Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs

2015-03-06 Thread Andy Lutomirski
On Fri, Mar 6, 2015 at 2:00 PM, Linus Torvalds
 wrote:
> On Fri, Mar 6, 2015 at 11:23 AM, Andy Lutomirski  wrote:
>>
>> Please don't.  IMO it's really nice that we don't use trap gates at
>> all on x86_64, and I find the conditional_sti thing much nicer than
>> having to audit all of the entry code to see whether it's safe to run
>> it with IRQs on.
>
> So I'm not sure I see much difference, but I'd certainly be ok with
> just moving the "conditional_sti()" up unconditionally to be the first
> thing in do_device_not_available().

I'd be fine with that.  The important difference is that it's after swapgs.

--Andy

>
> The point being that we should still *not* just randomly enable interrupts
> because we decide that the callers are wrong.
>
>Linus



-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs

2015-03-06 Thread Linus Torvalds
On Fri, Mar 6, 2015 at 11:23 AM, Andy Lutomirski  wrote:
>
> Please don't.  IMO it's really nice that we don't use trap gates at
> all on x86_64, and I find the conditional_sti thing much nicer than
> having to audit all of the entry code to see whether it's safe to run
> it with IRQs on.

So I'm not sure I see much difference, but I'd certainly be ok with
just moving the "conditional_sti()" up unconditionally to be the first
thing in do_device_not_available().

The point being that we should still *not* just randomly enable interrupts
because we decide that the callers are wrong.

   Linus


Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs

2015-03-06 Thread Andy Lutomirski
On Fri, Mar 6, 2015 at 9:33 AM, Linus Torvalds
 wrote:
> On Thu, Mar 5, 2015 at 11:58 PM, Ingo Molnar  wrote:
>>
>> math_state_restore() was historically called with irqs disabled,
>> because that's how the hardware generates the trap, and also because
>> back in the days it was possible for it to be an asynchronous
>> interrupt and interrupt handlers run with irqs off.
>>
>> These days it's always an instruction trap, and furthermore it does
>> inevitably complex things such as memory allocation and signal
>> processing, which is not done with irqs disabled.
>>
>> So keep irqs enabled.
>
> I agree with the "keep irqs enabled".
>
> However, I do *not* agree with the actual patch, which doesn't do that at all.
>> @@ -844,8 +844,9 @@ void math_state_restore(void)
>>  {
>> struct task_struct *tsk = current;
>>
>> +   local_irq_enable();
>> +
>
> There's a big difference between "keep interrupts enabled" (ok) and
> "explicitly enable interrupts in random contexts" (*NOT* ok).
>
> So get rid of the "local_irq_enable()" entirely, and replace it with a
>
>WARN_ON_ONCE(irqs_disabled());

I like this a lot better.  Ingo's patch scares me because in-kernel
FPU users are fundamentally atomic: if we're using kernel FPU *and*
current has important FPU state, we can't schedule safely because
there's nowhere to save the in-kernel FPU state.  I don't see how we
would end up in this situation with CR0.TS set, but if we somehow do,
then we'll corrupt something with your patch.
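Andy's "nowhere to save the in-kernel FPU state" argument can be illustrated with a toy model (hypothetical names only — `task_model`, `fpu_regs` and the `*_model()` helpers stand in for the per-task save area, the live FPU register file, and kernel_fpu_begin()/__switch_to()/kernel_fpu_end(); this is not the kernel's actual API):

```c
/* Toy model of in-kernel FPU atomicity: each task has exactly ONE
 * slot for saved FPU state, so a context switch in the middle of a
 * kernel-FPU section saves the kernel's scratch state over the
 * user's registers, and the user state is lost. */
struct task_model { unsigned long saved_fpu; };

static unsigned long fpu_regs;			/* live FPU register file */

static void kernel_fpu_begin_model(struct task_model *t)
{
	t->saved_fpu = fpu_regs;		/* stash user FPU state */
	fpu_regs = 0xfeed;			/* kernel clobbers the FPU */
}

static void preempt_switch_model(struct task_model *t)
{
	t->saved_fpu = fpu_regs;		/* __switch_to() saves the LIVE
						 * (kernel) state into the same,
						 * only slot */
}

static unsigned long kernel_fpu_end_model(struct task_model *t)
{
	return fpu_regs = t->saved_fpu;		/* restore whatever the slot
						 * now holds */
}
```

Within this model the only fix is to keep the section non-preemptible, which is why kernel_fpu_begin() has to run with preemption disabled.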

>
> and let's just fix the cases where this actually gets called with
> interrupts off. In particular, let's just make the
> device_not_available thing use a trap gate, not an interrupt gate. And
> then remove the "conditional_sti()" stuff.

Please don't.  IMO it's really nice that we don't use trap gates at
all on x86_64, and I find the conditional_sti thing much nicer than
having to audit all of the entry code to see whether it's safe to run
it with IRQs on.

This isn't to say that using trap gates would be a terrible idea in
general.  I think I'd be okay with saying that non-IST
synchronous-only never-from-random-places entries should be trap gates
in general.  We could then audit the entry code and convert various
entries.

Thinks that maybe shouldn't be trap gates:

 - Anything asynchronous.
 - Page faults.  The entry code can cause page faults recursively due
to lazy vmap page table fills.  Maybe this doesn't matter.
 - Anything using IST.  That way lies madness.

FWIW, I just started auditing the entry code a bit, and it's currently
unsafe to use trap gates.  The IRQ entry code in the "interrupt" macro
does this:

testl $3, CS-RBP(%rsp)
je 1f
SWAPGS

That will go boom quite nicely if it happens very early in device_not_available.

I suspect that fixing that will be slow and unpleasant and will
significantly outweigh any benefit.

--Andy


Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs

2015-03-06 Thread Oleg Nesterov
On 03/06, Linus Torvalds wrote:
>
> On Thu, Mar 5, 2015 at 11:58 PM, Ingo Molnar  wrote:
> >
> > math_state_restore() was historically called with irqs disabled,
> > because that's how the hardware generates the trap, and also because
> > back in the days it was possible for it to be an asynchronous
> > interrupt and interrupt handlers run with irqs off.
> >
> > These days it's always an instruction trap, and furthermore it does
> > inevitably complex things such as memory allocation and signal
> > processing, which is not done with irqs disabled.
> >
> > So keep irqs enabled.
>
> I agree with the "keep irqs enabled".

Me too, but not for stable. This patch is wrong without other changes.

> IOW, I think the starting point should be something like the attached
> (which doesn't do the WARN_ON_ONCE() - it should be added for
> debugging).

Yes, agreed.

And. Even if we forget about stable, we need some minor changes before
this one. At least we need to add preempt_disable() into kernel_fpu_disable().


So I still think that the horrible hack I sent makes sense for -stable.
Just we need to cleanup (kill) it "immediately".

Oleg.



Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs

2015-03-06 Thread Linus Torvalds
On Thu, Mar 5, 2015 at 11:58 PM, Ingo Molnar  wrote:
>
> math_state_restore() was historically called with irqs disabled,
> because that's how the hardware generates the trap, and also because
> back in the days it was possible for it to be an asynchronous
> interrupt and interrupt handlers run with irqs off.
>
> These days it's always an instruction trap, and furthermore it does
> inevitably complex things such as memory allocation and signal
> processing, which is not done with irqs disabled.
>
> So keep irqs enabled.

I agree with the "keep irqs enabled".

However, I do *not* agree with the actual patch, which doesn't do that at all.
> @@ -844,8 +844,9 @@ void math_state_restore(void)
>  {
> struct task_struct *tsk = current;
>
> +   local_irq_enable();
> +

There's a big difference between "keep interrupts enabled" (ok) and
"explicitly enable interrupts in random contexts" (*NOT* ok).

So get rid of the "local_irq_enable()" entirely, and replace it with a

   WARN_ON_ONCE(irqs_disabled());

and let's just fix the cases where this actually gets called with
interrupts off. In particular, let's just make the
device_not_available thing use a trap gate, not an interrupt gate. And
then remove the "conditional_sti()" stuff.

IOW, I think the starting point should be something like the attached
(which doesn't do the WARN_ON_ONCE() - it should be added for
debugging).

*NOT* some kind of "let's just enable interrupts blindly" approach.

This is completely untested, of course. But I don't see why we should
use an interrupt gate for this. Is there anything in
"exception_enter()" that requires it?

   Linus
 arch/x86/kernel/traps.c | 14 ++------------
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 9d2073e2ecc9..f045ac026ff1 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -836,16 +836,12 @@ asmlinkage __visible void __attribute__((weak)) smp_threshold_interrupt(void)
  *
  * Careful.. There are problems with IBM-designed IRQ13 behaviour.
  * Don't touch unless you *really* know how it works.
- *
- * Must be called with kernel preemption disabled (eg with local
- * local interrupts as in the case of do_device_not_available).
  */
 void math_state_restore(void)
 {
struct task_struct *tsk = current;
 
if (!tsk_used_math(tsk)) {
-   local_irq_enable();
/*
 * does a slab alloc which can sleep
 */
@@ -856,7 +852,6 @@ void math_state_restore(void)
do_group_exit(SIGKILL);
return;
}
-   local_irq_disable();
}
 
/* Avoid __kernel_fpu_begin() right after __thread_fpu_begin() */
@@ -884,18 +879,13 @@ do_device_not_available(struct pt_regs *regs, long error_code)
if (read_cr0() & X86_CR0_EM) {
struct math_emu_info info = { };
 
-   conditional_sti(regs);
-
info.regs = regs;
math_emulate();
exception_exit(prev_state);
return;
}
 #endif
-   math_state_restore(); /* interrupts still off */
-#ifdef CONFIG_X86_32
-   conditional_sti(regs);
-#endif
+   math_state_restore();
exception_exit(prev_state);
 }
 NOKPROBE_SYMBOL(do_device_not_available);
@@ -959,7 +949,7 @@ void __init trap_init(void)
	set_system_intr_gate(X86_TRAP_OF, &overflow);
set_intr_gate(X86_TRAP_BR, bounds);
set_intr_gate(X86_TRAP_UD, invalid_op);
-   set_intr_gate(X86_TRAP_NM, device_not_available);
+   set_trap_gate(X86_TRAP_NM, device_not_available);
 #ifdef CONFIG_X86_32
set_task_gate(X86_TRAP_DF, GDT_ENTRY_DOUBLEFAULT_TSS);
 #else


Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs

2015-03-06 Thread Oleg Nesterov
On 03/06, David Vrabel wrote:
>
> On 06/03/15 15:36, Oleg Nesterov wrote:
> >
> > This needs more discussion, but in short so far I think that fpu_alloc()
> > from #NM exception is fine if user_mode(regs) == T.
>
> I think a memory allocation here, where the only behaviour on a failure
> is to kill the task, is (and has always been) a crazy idea.

Well, I do not agree. But lets discuss this later. This code should be
rewritten in any case. It has more problems.

> Additionally, in a Xen PV guest the #NM handler is called with TS
> already cleared by the hypervisor so the handler must not enable
> interrupts (and thus potentially schedule another task) until after the
> current task's fpu state has been restored.  If a task was scheduled
> before restoring the FPU state, TS would be clear and that task will use
> fpu state from a previous task.

I can be easily wrong (especially because I know nothing about Xen ;), but
I do not think this is true.

Yes sure, we need to avoid preemption, but we need this in any case, even
without Xen.

Again, lets discuss this a bit later?

Oleg.



Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs

2015-03-06 Thread David Vrabel
On 06/03/15 15:36, Oleg Nesterov wrote:
> On 03/06, David Vrabel wrote:
>>
>> On 06/03/15 14:01, Oleg Nesterov wrote:
>>
>>> Not sure I understand it correctly after the first quick look, but
>>>
>>> 1. It conflicts with the recent changes in tip/x86/fpu
>>>
>>> 2. fpu_init() initializes current->thread.fpu.state. This looks unneeded,
>>>the kernel threads no longer have FPU context and do not abuse CPU.
>>>
>>> 3. I can be easily wrong, but it looks buggy... Note that
>>>arch_dup_task_struct() doesn't allocate child->fpu.state if
>>>!tsk_used_math(parent).
>>
>> ...yes. It's bit-rotted a bit.
>>
>>> No, I do not think this patch is a good idea. Perhaps I am wrong, but I
>>> think we need other changes. And they should start from init_fpu().
>>
>> But the general principle of avoiding the allocation in the #NM handler
>> and hence avoiding the need to re-enable IRQs is still a good idea, yes?
> 
> This needs more discussion, but in short so far I think that fpu_alloc()
> from #NM exception is fine if user_mode(regs) == T.

I think a memory allocation here, where the only behaviour on a failure
is to kill the task, is (and has always been) a crazy idea.

Additionally, in a Xen PV guest the #NM handler is called with TS
already cleared by the hypervisor so the handler must not enable
interrupts (and thus potentially schedule another task) until after the
current task's fpu state has been restored.  If a task was scheduled
before restoring the FPU state, TS would be clear and that task will use
fpu state from a previous task.

> Just do_device_not_available() should simply do conditional_sti(), I think.
> Perhaps it can even enable irqs unconditionally, but we need to verify that
> this is 100% correct.
> 
> And I agree that "if (!tsk_used_math(tsk))" code in math_state_restore()
> should be removed. But not to avoid the allocation, and this needs other
> changes.
> 
> Oleg.
> 



Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs

2015-03-06 Thread Oleg Nesterov
On 03/06, David Vrabel wrote:
>
> On 06/03/15 14:01, Oleg Nesterov wrote:
>
> > Not sure I understand it correctly after the first quick look, but
> >
> > 1. It conflicts with the recent changes in tip/x86/fpu
> >
> > 2. fpu_init() initializes current->thread.fpu.state. This looks unneeded,
> >the kernel threads no longer have FPU context and do not abuse CPU.
> >
> > 3. I can be easily wrong, but it looks buggy... Note that
> >arch_dup_task_struct() doesn't allocate child->fpu.state if
> >!tsk_used_math(parent).
>
> ...yes. It's bit-rotted a bit.
>
> > No, I do not think this patch is a good idea. Perhaps I am wrong, but I
> > think we need other changes. And they should start from init_fpu().
>
> But the general principle of avoiding the allocation in the #NM handler
> and hence avoiding the need to re-enable IRQs is still a good idea, yes?

This needs more discussion, but in short so far I think that fpu_alloc()
from #NM exception is fine if user_mode(regs) == T.

Just do_device_not_available() should simply do conditional_sti(), I think.
Perhaps it can even enable irqs unconditionally, but we need to verify that
this is 100% correct.

And I agree that "if (!tsk_used_math(tsk))" code in math_state_restore()
should be removed. But not to avoid the allocation, and this needs other
changes.

Oleg.



Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs

2015-03-06 Thread David Vrabel
On 06/03/15 14:01, Oleg Nesterov wrote:
> On 03/06, Ingo Molnar wrote:
>>
>> * Oleg Nesterov  wrote:
>>
>>> On 03/06, Ingo Molnar wrote:

 * Oleg Nesterov  wrote:

> [...] The patch above looks "obviously safe", but perhaps I am
> paranoid too much...

 IMHO your hack above isn't really acceptable, even for a backport.
 So lets test the patch below (assuming it's the right thing to do)
 and move forward?
>>>
>>> OK, but please note that this patch is not backportable. If you think
>>> that -stable doesn't need this fix, then I agree.
>>>
>>> If the caller is do_device_not_available(), then we can not enable
>>> irqs before __thread_fpu_begin() + restore_fpu_checking().
>>>
>>> 1. Preemption in between can destroy ->fpu.state initialized by
>>>fpu_finit(), __switch_to() will save the live (wrong) FPU state
>>>again.
>>>
>>> 2. kernel_fpu_begin() from irq right after __thread_fpu_begin() is
>>>not nice too. It will do __save_init_fpu() and this overwrites
>>>->fpu.state too.
>>>
>>> Starting from v4.0 it does kernel_fpu_disable(), but the older kernels
>>> do not.
>>>
>>> Ingo, this code is really horrible and fragile. We need to cleanup it
>>> step-by-step, imho.
>>
>> How about the patch from David Vrabel? That seems to solve the
>> irq-disable problem too, right?
> 
> I wasn't cc'ed, I guess you mean
> 
>   [PATCHv4] x86, fpu: remove the logic of non-eager fpu mem allocation at 
> the first usage
>   http://marc.info/?l=linux-kernel&m=142564237705311&w=2

This patch is from Suresh, and was originally against 3.10, so...

> Not sure I understand it correctly after the first quick look, but
> 
> 1. It conflicts with the recent changes in tip/x86/fpu
> 
> 2. fpu_init() initializes current->thread.fpu.state. This looks unneeded,
>the kernel threads no longer have FPU context and do not abuse CPU.
> 
> 3. I can be easily wrong, but it looks buggy... Note that
>arch_dup_task_struct() doesn't allocate child->fpu.state if
>!tsk_used_math(parent).

...yes. It's bit-rotted a bit.

> No, I do not think this patch is a good idea. Perhaps I am wrong, but I
> think we need other changes. And they should start from init_fpu().

But the general principle of avoiding the allocation in the #NM handler
and hence avoiding the need to re-enable IRQs is still a good idea, yes?

David


Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs

2015-03-06 Thread Oleg Nesterov
On 03/06, Oleg Nesterov wrote:
>
> On 03/06, Ingo Molnar wrote:
> >
> > How about the patch from David Vrabel? That seems to solve the
> > irq-disable problem too, right?
>
> I wasn't cc'ed, I guess you mean
>
>   [PATCHv4] x86, fpu: remove the logic of non-eager fpu mem allocation at 
> the first usage
>   http://marc.info/?l=linux-kernel&m=142564237705311&w=2
>
> Not sure I understand it correctly after the first quick look, but
>
> 1. It conflicts with the recent changes in tip/x86/fpu
>
> 2. fpu_init() initializes current->thread.fpu.state. This looks unneeded,
>the kernel threads no longer have FPU context and do not abuse CPU.
>
> 3. I can be easily wrong, but it looks buggy... Note that
>arch_dup_task_struct() doesn't allocate child->fpu.state if
>!tsk_used_math(parent).
>
> Add David...
>
> No, I do not think this patch is a good idea. Perhaps I am wrong, but I
> think we need other changes. And they should start from init_fpu().

But the change in eager_fpu_init_bp() looks good (on top of tip/x86/fpu),
at least I was going to do the same ;)

In any case, I do not think this patch can target -stable.

Oleg.



Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs

2015-03-06 Thread Oleg Nesterov
On 03/06, Ingo Molnar wrote:
>
> * Oleg Nesterov  wrote:
>
> > On 03/06, Ingo Molnar wrote:
> > >
> > > * Oleg Nesterov  wrote:
> > >
> > > > [...] The patch above looks "obviously safe", but perhaps I am
> > > > paranoid too much...
> > >
> > > IMHO your hack above isn't really acceptable, even for a backport.
> > > So lets test the patch below (assuming it's the right thing to do)
> > > and move forward?
> >
> > OK, but please note that this patch is not backportable. If you think
> > that -stable doesn't need this fix, then I agree.
> >
> > If the caller is do_device_not_available(), then we can not enable
> > irqs before __thread_fpu_begin() + restore_fpu_checking().
> >
> > 1. Preemption in between can destroy ->fpu.state initialized by
> >fpu_finit(), __switch_to() will save the live (wrong) FPU state
> >again.
> >
> > 2. kernel_fpu_begin() from irq right after __thread_fpu_begin() is
> >not nice too. It will do __save_init_fpu() and this overwrites
> >->fpu.state too.
> >
> > Starting from v4.0 it does kernel_fpu_disable(), but the older kernels
> > do not.
> >
> > Ingo, this code is really horrible and fragile. We need to cleanup it
> > step-by-step, imho.
>
> How about the patch from David Vrabel? That seems to solve the
> irq-disable problem too, right?

I wasn't cc'ed, I guess you mean

[PATCHv4] x86, fpu: remove the logic of non-eager fpu mem allocation at 
the first usage
	http://marc.info/?l=linux-kernel&m=142564237705311&w=2

Not sure I understand it correctly after the first quick look, but

1. It conflicts with the recent changes in tip/x86/fpu

2. fpu_init() initializes current->thread.fpu.state. This looks unneeded,
   the kernel threads no longer have FPU context and do not abuse CPU.

3. I can be easily wrong, but it looks buggy... Note that
   arch_dup_task_struct() doesn't allocate child->fpu.state if
   !tsk_used_math(parent).

Add David...

No, I do not think this patch is a good idea. Perhaps I am wrong, but I
think we need other changes. And they should start from init_fpu().

Oleg.



Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs

2015-03-06 Thread Ingo Molnar

* Oleg Nesterov  wrote:

> On 03/06, Ingo Molnar wrote:
> >
> > * Oleg Nesterov  wrote:
> >
> > > [...] The patch above looks "obviously safe", but perhaps I am
> > > paranoid too much...
> >
> > IMHO your hack above isn't really acceptable, even for a backport.
> > So lets test the patch below (assuming it's the right thing to do)
> > and move forward?
> 
> OK, but please note that this patch is not backportable. If you think
> that -stable doesn't need this fix, then I agree.
> 
> If the caller is do_device_not_available(), then we can not enable
> irqs before __thread_fpu_begin() + restore_fpu_checking().
> 
> 1. Preemption in between can destroy ->fpu.state initialized by
>fpu_finit(), __switch_to() will save the live (wrong) FPU state
>again.
> 
> 2. kernel_fpu_begin() from irq right after __thread_fpu_begin() is
>not nice too. It will do __save_init_fpu() and this overwrites
>->fpu.state too.
> 
> Starting from v4.0 it does kernel_fpu_disable(), but the older kernels
> do not.
> 
> Ingo, this code is really horrible and fragile. We need to cleanup it
> step-by-step, imho.

How about the patch from David Vrabel? That seems to solve the 
irq-disable problem too, right?

Thanks,

Ingo


Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs

2015-03-06 Thread Oleg Nesterov
On 03/06, Oleg Nesterov wrote:
>
> OK, but please note that this patch is not backportable. If you think
> that -stable doesn't need this fix, then I agree.
>
> If the caller is do_device_not_available(), then we can not enable
> irqs before __thread_fpu_begin() + restore_fpu_checking().
>
> 1. Preemption in between can destroy ->fpu.state initialized by
>fpu_finit(), __switch_to() will save the live (wrong) FPU state
>again.
>
> 2. kernel_fpu_begin() from irq right after __thread_fpu_begin() is
>not nice too. It will do __save_init_fpu() and this overwrites
>->fpu.state too.
>
> Starting from v4.0 it does kernel_fpu_disable(), but the older kernels
> do not.
>
> Ingo, this code is really horrible and fragile. We need to clean it up
> step-by-step, imho.

Forgot to mention...

And, otoh, if we are not going to backport this change, then I think this
irq_enable() should be called by do_device_not_available().

>
> > ==>
> > From: Ingo Molnar 
> > Date: Fri, 6 Mar 2015 08:37:57 +0100
> > Subject: [PATCH] x86/fpu: Don't disable irqs in math_state_restore()
> >
> > math_state_restore() was historically called with irqs disabled,
> > because that's how the hardware generates the trap, and also because
> > back in the days it was possible for it to be an asynchronous
> > interrupt and interrupt handlers run with irqs off.
> >
> > These days it's always an instruction trap, and furthermore it does
> > inevitably complex things such as memory allocation and signal
> > processing, which is not done with irqs disabled.
> >
> > So keep irqs enabled.
> >
> > This might surprise in-kernel FPU users that somehow relied on
> > interrupts being disabled across FPU usage - but that's
> > fundamentally fragile anyway due to the inatomicity of FPU state
> > restores. The trap return will restore interrupts to its previous
> > state, but if FPU ops trigger math_state_restore() there's no
> > guarantee of atomicity anymore.
> >
> > To warn about in-kernel irqs-off users of FPU state we might want to
> > pass 'struct pt_regs' to math_state_restore() and check the trapped
> > state for irqs disabled (flags has IF cleared) and kernel context -
> > but that's for a later patch.
> >
> > Cc: Andy Lutomirski 
> > Cc: Borislav Petkov 
> > Cc: Fenghua Yu 
> > Cc: H. Peter Anvin 
> > Cc: Linus Torvalds 
> > Cc: Oleg Nesterov 
> > Cc: Quentin Casasnovas 
> > Cc: Thomas Gleixner 
> > Signed-off-by: Ingo Molnar 
> > ---
> >  arch/x86/kernel/traps.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> > index 950815a138e1..52f9e4057cee 100644
> > --- a/arch/x86/kernel/traps.c
> > +++ b/arch/x86/kernel/traps.c
> > @@ -844,8 +844,9 @@ void math_state_restore(void)
> >  {
> > struct task_struct *tsk = current;
> >
> > +   local_irq_enable();
> > +
> > if (!tsk_used_math(tsk)) {
> > -   local_irq_enable();
> > /*
> >  * does a slab alloc which can sleep
> >  */
> > @@ -856,7 +857,6 @@ void math_state_restore(void)
> > do_group_exit(SIGKILL);
> > return;
> > }
> > -   local_irq_disable();
> > }
> >
> > /* Avoid __kernel_fpu_begin() right after __thread_fpu_begin() */



Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs

2015-03-06 Thread Oleg Nesterov
On 03/06, Ingo Molnar wrote:
>
> * Oleg Nesterov  wrote:
>
> > [...] The patch above looks "obviously safe", but perhaps I am
> > paranoid too much...
>
> IMHO your hack above isn't really acceptable, even for a backport.
> So let's test the patch below (assuming it's the right thing to do)
> and move forward?

OK, but please note that this patch is not backportable. If you think
that -stable doesn't need this fix, then I agree.

If the caller is do_device_not_available(), then we can not enable
irqs before __thread_fpu_begin() + restore_fpu_checking().

1. Preemption in between can destroy ->fpu.state initialized by
   fpu_finit(), __switch_to() will save the live (wrong) FPU state
   again.

2. kernel_fpu_begin() from irq right after __thread_fpu_begin() is
   not nice too. It will do __save_init_fpu() and this overwrites
   ->fpu.state too.

Starting from v4.0 it does kernel_fpu_disable(), but the older kernels
do not.

Ingo, this code is really horrible and fragile. We need to clean it up
step-by-step, imho.

> ==>
> From: Ingo Molnar 
> Date: Fri, 6 Mar 2015 08:37:57 +0100
> Subject: [PATCH] x86/fpu: Don't disable irqs in math_state_restore()
>
> math_state_restore() was historically called with irqs disabled,
> because that's how the hardware generates the trap, and also because
> back in the days it was possible for it to be an asynchronous
> interrupt and interrupt handlers run with irqs off.
>
> These days it's always an instruction trap, and furthermore it does
> inevitably complex things such as memory allocation and signal
> processing, which is not done with irqs disabled.
>
> So keep irqs enabled.
>
> This might surprise in-kernel FPU users that somehow relied on
> interrupts being disabled across FPU usage - but that's
> fundamentally fragile anyway due to the inatomicity of FPU state
> restores. The trap return will restore interrupts to its previous
> state, but if FPU ops trigger math_state_restore() there's no
> guarantee of atomicity anymore.
>
> To warn about in-kernel irqs-off users of FPU state we might want to
> pass 'struct pt_regs' to math_state_restore() and check the trapped
> state for irqs disabled (flags has IF cleared) and kernel context -
> but that's for a later patch.
>
> Cc: Andy Lutomirski 
> Cc: Borislav Petkov 
> Cc: Fenghua Yu 
> Cc: H. Peter Anvin 
> Cc: Linus Torvalds 
> Cc: Oleg Nesterov 
> Cc: Quentin Casasnovas 
> Cc: Thomas Gleixner 
> Signed-off-by: Ingo Molnar 
> ---
>  arch/x86/kernel/traps.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 950815a138e1..52f9e4057cee 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -844,8 +844,9 @@ void math_state_restore(void)
>  {
>   struct task_struct *tsk = current;
>
> + local_irq_enable();
> +
>   if (!tsk_used_math(tsk)) {
> - local_irq_enable();
>   /*
>* does a slab alloc which can sleep
>*/
> @@ -856,7 +857,6 @@ void math_state_restore(void)
>   do_group_exit(SIGKILL);
>   return;
>   }
> - local_irq_disable();
>   }
>
>   /* Avoid __kernel_fpu_begin() right after __thread_fpu_begin() */



Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs

2015-03-06 Thread Oleg Nesterov
On 03/06, Oleg Nesterov wrote:
>
> On 03/06, Ingo Molnar wrote:
> >
> > How about the patch from David Vrabel? That seems to solve the
> > irq-disable problem too, right?
>
> I wasn't cc'ed, I guess you mean
>
>   [PATCHv4] x86, fpu: remove the logic of non-eager fpu mem allocation at
>   the first usage
>   http://marc.info/?l=linux-kernel&m=142564237705311&w=2
>
> Not sure I understand it correctly after the first quick look, but
>
> 1. It conflicts with the recent changes in tip/x86/fpu
>
> 2. fpu_ini() initializes current->thread.fpu.state. This looks unneeded,
>    the kernel threads no longer have FPU context and do not abuse CPU.
>
> 3. I can be easily wrong, but it looks buggy... Note that
>    arch_dup_task_struct() doesn't allocate child->fpu.state if
>    !tsk_used_math(parent).
>
> Add David...
>
> No, I do not think this patch is a good idea. Perhaps I am wrong, but I
> think we need other changes. And they should start from init_fpu().

But the change in eager_fpu_init_bp() looks good (on top of tip/x86/fpu),
at least I was going to do the same ;)

In any case, I do not think this patch can target -stable.

Oleg.



Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs

2015-03-06 Thread David Vrabel
On 06/03/15 14:01, Oleg Nesterov wrote:
> On 03/06, Ingo Molnar wrote:
>>
>> * Oleg Nesterov <o...@redhat.com> wrote:
>>
>>> On 03/06, Ingo Molnar wrote:
>>>>
>>>> * Oleg Nesterov <o...@redhat.com> wrote:
>>>>
>>>>> [...] The patch above looks "obviously safe", but perhaps I am
>>>>> paranoid too much...
>>>>
>>>> IMHO your hack above isn't really acceptable, even for a backport.
>>>> So let's test the patch below (assuming it's the right thing to do)
>>>> and move forward?
>>>
>>> OK, but please note that this patch is not backportable. If you think
>>> that -stable doesn't need this fix, then I agree.
>>>
>>> If the caller is do_device_not_available(), then we can not enable
>>> irqs before __thread_fpu_begin() + restore_fpu_checking().
>>>
>>> 1. Preemption in between can destroy ->fpu.state initialized by
>>>    fpu_finit(), __switch_to() will save the live (wrong) FPU state
>>>    again.
>>>
>>> 2. kernel_fpu_begin() from irq right after __thread_fpu_begin() is
>>>    not nice too. It will do __save_init_fpu() and this overwrites
>>>    ->fpu.state too.
>>>
>>> Starting from v4.0 it does kernel_fpu_disable(), but the older kernels
>>> do not.
>>>
>>> Ingo, this code is really horrible and fragile. We need to clean it up
>>> step-by-step, imho.
>>
>> How about the patch from David Vrabel? That seems to solve the
>> irq-disable problem too, right?
>
> I wasn't cc'ed, I guess you mean
>
>   [PATCHv4] x86, fpu: remove the logic of non-eager fpu mem allocation at
>   the first usage
>   http://marc.info/?l=linux-kernel&m=142564237705311&w=2

This patch is from Suresh, and was originally against 3.10, so...

> Not sure I understand it correctly after the first quick look, but
>
> 1. It conflicts with the recent changes in tip/x86/fpu
>
> 2. fpu_ini() initializes current->thread.fpu.state. This looks unneeded,
>    the kernel threads no longer have FPU context and do not abuse CPU.
>
> 3. I can be easily wrong, but it looks buggy... Note that
>    arch_dup_task_struct() doesn't allocate child->fpu.state if
>    !tsk_used_math(parent).

...yes. It's bit-rotted a bit.

> No, I do not think this patch is a good idea. Perhaps I am wrong, but I
> think we need other changes. And they should start from init_fpu().

But the general principle of avoiding the allocation in the #NM handler
and hence avoiding the need to re-enable IRQs is still a good idea, yes?

David


Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs

2015-03-06 Thread Oleg Nesterov
On 03/06, David Vrabel wrote:
>
> On 06/03/15 14:01, Oleg Nesterov wrote:
> >
> > Not sure I understand it correctly after the first quick look, but
> >
> > 1. It conflicts with the recent changes in tip/x86/fpu
> >
> > 2. fpu_ini() initializes current->thread.fpu.state. This looks unneeded,
> >    the kernel threads no longer have FPU context and do not abuse CPU.
> >
> > 3. I can be easily wrong, but it looks buggy... Note that
> >    arch_dup_task_struct() doesn't allocate child->fpu.state if
> >    !tsk_used_math(parent).
>
> ...yes. It's bit-rotted a bit.
>
> > No, I do not think this patch is a good idea. Perhaps I am wrong, but I
> > think we need other changes. And they should start from init_fpu().
>
> But the general principle of avoiding the allocation in the #NM handler
> and hence avoiding the need to re-enable IRQs is still a good idea, yes?

This needs more discussion, but in short so far I think that fpu_alloc()
from #NM exception is fine if user_mode(regs) == T.

Just do_device_not_available() should simply do conditional_sti(), I think.
Perhaps it can even enable irqs unconditionally, but we need to verify that
this is 100% correct.

And I agree that the "if (!tsk_used_math(tsk))" code in math_state_restore()
should be removed. But not to avoid the allocation, and this needs other
changes.

Oleg.



Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs

2015-03-06 Thread Oleg Nesterov
On 03/06, Ingo Molnar wrote:
>
> * Oleg Nesterov <o...@redhat.com> wrote:
>
> > On 03/06, Ingo Molnar wrote:
> > >
> > > * Oleg Nesterov <o...@redhat.com> wrote:
> > >
> > > > [...] The patch above looks "obviously safe", but perhaps I am
> > > > paranoid too much...
> > >
> > > IMHO your hack above isn't really acceptable, even for a backport.
> > > So let's test the patch below (assuming it's the right thing to do)
> > > and move forward?
> >
> > OK, but please note that this patch is not backportable. If you think
> > that -stable doesn't need this fix, then I agree.
> >
> > If the caller is do_device_not_available(), then we can not enable
> > irqs before __thread_fpu_begin() + restore_fpu_checking().
> >
> > 1. Preemption in between can destroy ->fpu.state initialized by
> >    fpu_finit(), __switch_to() will save the live (wrong) FPU state
> >    again.
> >
> > 2. kernel_fpu_begin() from irq right after __thread_fpu_begin() is
> >    not nice too. It will do __save_init_fpu() and this overwrites
> >    ->fpu.state too.
> >
> > Starting from v4.0 it does kernel_fpu_disable(), but the older kernels
> > do not.
> >
> > Ingo, this code is really horrible and fragile. We need to clean it up
> > step-by-step, imho.
>
> How about the patch from David Vrabel? That seems to solve the
> irq-disable problem too, right?

I wasn't cc'ed, I guess you mean

  [PATCHv4] x86, fpu: remove the logic of non-eager fpu mem allocation at
  the first usage
  http://marc.info/?l=linux-kernel&m=142564237705311&w=2

Not sure I understand it correctly after the first quick look, but

1. It conflicts with the recent changes in tip/x86/fpu

2. fpu_ini() initializes current->thread.fpu.state. This looks unneeded,
   the kernel threads no longer have FPU context and do not abuse CPU.

3. I can be easily wrong, but it looks buggy... Note that
   arch_dup_task_struct() doesn't allocate child->fpu.state if
   !tsk_used_math(parent).

Add David...

No, I do not think this patch is a good idea. Perhaps I am wrong, but I
think we need other changes. And they should start from init_fpu().

Oleg.


Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs

2015-03-06 Thread David Vrabel
On 06/03/15 15:36, Oleg Nesterov wrote:
> On 03/06, David Vrabel wrote:
>>
>> On 06/03/15 14:01, Oleg Nesterov wrote:
>>>
>>> Not sure I understand it correctly after the first quick look, but
>>>
>>> 1. It conflicts with the recent changes in tip/x86/fpu
>>>
>>> 2. fpu_ini() initializes current->thread.fpu.state. This looks unneeded,
>>>    the kernel threads no longer have FPU context and do not abuse CPU.
>>>
>>> 3. I can be easily wrong, but it looks buggy... Note that
>>>    arch_dup_task_struct() doesn't allocate child->fpu.state if
>>>    !tsk_used_math(parent).
>>
>> ...yes. It's bit-rotted a bit.
>>
>>> No, I do not think this patch is a good idea. Perhaps I am wrong, but I
>>> think we need other changes. And they should start from init_fpu().
>>
>> But the general principle of avoiding the allocation in the #NM handler
>> and hence avoiding the need to re-enable IRQs is still a good idea, yes?
>
> This needs more discussion, but in short so far I think that fpu_alloc()
> from #NM exception is fine if user_mode(regs) == T.

I think a memory allocation here, where the only behaviour on a failure
is to kill the task, is (and has always been) a crazy idea.

Additionally, in a Xen PV guest the #NM handler is called with TS
already cleared by the hypervisor so the handler must not enable
interrupts (and thus potentially schedule another task) until after the
current task's fpu state has been restored.  If a task was scheduled
before restoring the FPU state, TS would be clear and that task will use
fpu state from a previous task.

> Just do_device_not_available() should simply do conditional_sti(), I think.
> Perhaps it can even enable irqs unconditionally, but we need to verify that
> this is 100% correct.
>
> And I agree that the "if (!tsk_used_math(tsk))" code in math_state_restore()
> should be removed. But not to avoid the allocation, and this needs other
> changes.
>
> Oleg.



Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs

2015-03-06 Thread Oleg Nesterov
On 03/06, David Vrabel wrote:
>
> On 06/03/15 15:36, Oleg Nesterov wrote:
> >
> > This needs more discussion, but in short so far I think that fpu_alloc()
> > from #NM exception is fine if user_mode(regs) == T.
>
> I think a memory allocation here, where the only behaviour on a failure
> is to kill the task, is (and has always been) a crazy idea.

Well, I do not agree. But let's discuss this later. This code should be
rewritten in any case. It has more problems.

> Additionally, in a Xen PV guest the #NM handler is called with TS
> already cleared by the hypervisor so the handler must not enable
> interrupts (and thus potentially schedule another task) until after the
> current task's fpu state has been restored.  If a task was scheduled
> before restoring the FPU state, TS would be clear and that task will use
> fpu state from a previous task.

I can be easily wrong (especially because I know nothing about Xen ;), but
I do not think this is true.

Yes sure, we need to avoid preemption, but we need this in any case, even
without Xen.

Again, let's discuss this a bit later?

Oleg.



Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs

2015-03-06 Thread Linus Torvalds
On Thu, Mar 5, 2015 at 11:58 PM, Ingo Molnar <mi...@kernel.org> wrote:
>
> math_state_restore() was historically called with irqs disabled,
> because that's how the hardware generates the trap, and also because
> back in the days it was possible for it to be an asynchronous
> interrupt and interrupt handlers run with irqs off.
>
> These days it's always an instruction trap, and furthermore it does
> inevitably complex things such as memory allocation and signal
> processing, which is not done with irqs disabled.
>
> So keep irqs enabled.

I agree with the "keep irqs enabled".

However, I do *not* agree with the actual patch, which doesn't do that at all.

> @@ -844,8 +844,9 @@ void math_state_restore(void)
>  {
>  	struct task_struct *tsk = current;
>
> +	local_irq_enable();
> +

There's a big difference between "keep interrupts enabled" (ok) and
"explicitly enable interrupts in random contexts" (*NOT* ok).

So get rid of the local_irq_enable() entirely, and replace it with a

   WARN_ON_ONCE(irqs_disabled());

and let's just fix the cases where this actually gets called with
interrupts off. In particular, let's just make the
device_not_available thing use a trap gate, not an interrupt gate. And
then remove the conditional_sti() stuff.

IOW, I think the starting point should be something like the attached
(which doesn't do the WARN_ON_ONCE() - it should be added for
debugging).

*NOT* some kind of "let's just enable interrupts blindly" approach.

This is completely untested, of course. But I don't see why we should
use an interrupt gate for this. Is there anything in
exception_enter() that requires it?

              Linus
 arch/x86/kernel/traps.c | 14 ++------------
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 9d2073e2ecc9..f045ac026ff1 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -836,16 +836,12 @@ asmlinkage __visible void __attribute__((weak)) smp_threshold_interrupt(void)
  *
  * Careful.. There are problems with IBM-designed IRQ13 behaviour.
  * Don't touch unless you *really* know how it works.
- *
- * Must be called with kernel preemption disabled (eg with local
- * local interrupts as in the case of do_device_not_available).
  */
 void math_state_restore(void)
 {
struct task_struct *tsk = current;
 
if (!tsk_used_math(tsk)) {
-   local_irq_enable();
/*
 * does a slab alloc which can sleep
 */
@@ -856,7 +852,6 @@ void math_state_restore(void)
do_group_exit(SIGKILL);
return;
}
-   local_irq_disable();
}
 
/* Avoid __kernel_fpu_begin() right after __thread_fpu_begin() */
@@ -884,18 +879,13 @@ do_device_not_available(struct pt_regs *regs, long error_code)
	if (read_cr0() & X86_CR0_EM) {
struct math_emu_info info = { };
 
-   conditional_sti(regs);
-
info.regs = regs;
math_emulate(info);
exception_exit(prev_state);
return;
}
 #endif
-   math_state_restore(); /* interrupts still off */
-#ifdef CONFIG_X86_32
-   conditional_sti(regs);
-#endif
+   math_state_restore();
exception_exit(prev_state);
 }
 NOKPROBE_SYMBOL(do_device_not_available);
@@ -959,7 +949,7 @@ void __init trap_init(void)
set_system_intr_gate(X86_TRAP_OF, overflow);
set_intr_gate(X86_TRAP_BR, bounds);
set_intr_gate(X86_TRAP_UD, invalid_op);
-   set_intr_gate(X86_TRAP_NM, device_not_available);
+   set_trap_gate(X86_TRAP_NM, device_not_available);
 #ifdef CONFIG_X86_32
set_task_gate(X86_TRAP_DF, GDT_ENTRY_DOUBLEFAULT_TSS);
 #else


Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs

2015-03-06 Thread Oleg Nesterov
On 03/06, Linus Torvalds wrote:
>
> On Thu, Mar 5, 2015 at 11:58 PM, Ingo Molnar <mi...@kernel.org> wrote:
> >
> > math_state_restore() was historically called with irqs disabled,
> > because that's how the hardware generates the trap, and also because
> > back in the days it was possible for it to be an asynchronous
> > interrupt and interrupt handlers run with irqs off.
> >
> > These days it's always an instruction trap, and furthermore it does
> > inevitably complex things such as memory allocation and signal
> > processing, which is not done with irqs disabled.
> >
> > So keep irqs enabled.
>
> I agree with the "keep irqs enabled".

Me too, but not for -stable. This patch is wrong without other changes.

> IOW, I think the starting point should be something like the attached
> (which doesn't do the WARN_ON_ONCE() - it should be added for
> debugging).

Yes, agreed.

And even if we forget about stable, we need some minor changes before
this one. At least we need to add preempt_disable() into kernel_fpu_disable().

So I still think that the horrible hack I sent makes sense for -stable.
We just need to clean it up (kill it) immediately.

Oleg.



Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs

2015-03-06 Thread Andy Lutomirski
On Fri, Mar 6, 2015 at 9:33 AM, Linus Torvalds
torva...@linux-foundation.org wrote:
> On Thu, Mar 5, 2015 at 11:58 PM, Ingo Molnar mi...@kernel.org wrote:
> >
> > math_state_restore() was historically called with irqs disabled,
> > because that's how the hardware generates the trap, and also because
> > back in the days it was possible for it to be an asynchronous
> > interrupt and interrupt handlers run with irqs off.
> >
> > These days it's always an instruction trap, and furthermore it does
> > inevitably complex things such as memory allocation and signal
> > processing, which is not done with irqs disabled.
> >
> > So keep irqs enabled.
>
> I agree with the "keep irqs enabled".
>
> However, I do *not* agree with the actual patch, which doesn't do that at all.
> > @@ -844,8 +844,9 @@ void math_state_restore(void)
> >  {
> >     struct task_struct *tsk = current;
> >
> > +   local_irq_enable();
> > +
>
> There's a big difference between "keep interrupts enabled" (ok) and
> "explicitly enable interrupts in random contexts" (*NOT* ok).
>
> So get rid of the local_irq_enable() entirely, and replace it with a
>
>     WARN_ON_ONCE(irqs_disabled());

I like this a lot better.  Ingo's patch scares me because in-kernel
FPU users are fundamentally atomic: if we're using kernel FPU *and*
current has important FPU state, we can't schedule safely because
there's nowhere to save the in-kernel FPU state.  I don't see how we
would end up in this situation with CR0.TS set, but if we somehow do,
then we'll corrupt something with your patch.


> and let's just fix the cases where this actually gets called with
> interrupts off. In particular, let's just make the
> device_not_available thing use a trap gate, not an interrupt gate. And
> then remove the conditional_sti() stuff.

Please don't.  IMO it's really nice that we don't use trap gates at
all on x86_64, and I find the conditional_sti thing much nicer than
having to audit all of the entry code to see whether it's safe to run
it with IRQs on.

This isn't to say that using trap gates would be a terrible idea in
general.  I think I'd be okay with saying that non-IST
synchronous-only never-from-random-places entries should be trap gates
in general.  We could then audit the entry code and convert various
entries.

Things that maybe shouldn't be trap gates:

 - Anything asynchronous.
 - Page faults.  The entry code can cause page faults recursively due
to lazy vmap page table fills.  Maybe this doesn't matter.
 - Anything using IST.  That way lies madness.

FWIW, I just started auditing the entry code a bit, and it's currently
unsafe to use trap gates.  The IRQ entry code in the interrupt macro
does this:

testl $3, CS-RBP(%rsp)
je 1f
SWAPGS

That will go boom quite nicely if it happens very early in device_not_available.

I suspect that fixing that will be slow and unpleasant and will
significantly outweigh any benefit.
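For reference, the `testl $3` in that snippet checks the privilege-level bits of the saved CS selector; the predicate can be rendered in C like this (illustrative model only, not entry code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The low two bits of a segment selector hold the privilege level it
 * was loaded at: 0 for the kernel, 3 for user mode.  "testl $3,
 * CS-RBP(%rsp); je 1f; SWAPGS" therefore swaps GS only when the
 * interrupt arrived from user mode. */

static bool came_from_user(uint64_t saved_cs)
{
	return (saved_cs & 3) != 0;   /* "je 1f" skips SWAPGS when zero */
}
```

An entry path that runs before the frame is set up correctly will read garbage at CS-RBP(%rsp), which is exactly the "go boom" scenario above.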

--Andy


Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs

2015-03-06 Thread Linus Torvalds
On Fri, Mar 6, 2015 at 11:23 AM, Andy Lutomirski l...@amacapital.net wrote:

> Please don't.  IMO it's really nice that we don't use trap gates at
> all on x86_64, and I find the conditional_sti thing much nicer than
> having to audit all of the entry code to see whether it's safe to run
> it with IRQs on.

So I'm not sure I see much difference, but I'd certainly be ok with
just moving the conditional_sti() up unconditionally to be the first
thing in do_device_not_available().

The point being that we still do *not* just randomly enable interrupts
because we decide that the callers are wrong.

   Linus


Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs

2015-03-06 Thread Andy Lutomirski
On Fri, Mar 6, 2015 at 2:00 PM, Linus Torvalds
torva...@linux-foundation.org wrote:
> On Fri, Mar 6, 2015 at 11:23 AM, Andy Lutomirski l...@amacapital.net wrote:
> >
> > Please don't.  IMO it's really nice that we don't use trap gates at
> > all on x86_64, and I find the conditional_sti thing much nicer than
> > having to audit all of the entry code to see whether it's safe to run
> > it with IRQs on.
>
> So I'm not sure I see much difference, but I'd certainly be ok with
> just moving the conditional_sti() up unconditionally to be the first
> thing in do_device_not_available().

I'd be fine with that.  The important difference is that it's after swapgs.

--Andy


> The point being that we still do *not* just randomly enable interrupts
> because we decide that the callers are wrong.
>
>     Linus



-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs

2015-03-05 Thread Ingo Molnar

* Oleg Nesterov  wrote:

> On 03/05, Ingo Molnar wrote:
> >
> > * Oleg Nesterov  wrote:
> >
> > > --- a/arch/x86/kernel/traps.c
> > > +++ b/arch/x86/kernel/traps.c
> > > @@ -774,7 +774,10 @@ void math_state_restore(void)
> > >   struct task_struct *tsk = current;
> > >
> > >   if (!tsk_used_math(tsk)) {
> > > - local_irq_enable();
> > > + bool disabled = irqs_disabled();
> > > +
> > > + if (disabled)
> > > + local_irq_enable();
> > >   /*
> > >* does a slab alloc which can sleep
> > >*/
> > > @@ -785,7 +788,9 @@ void math_state_restore(void)
> > >   do_group_exit(SIGKILL);
> > >   return;
> > >   }
> > > - local_irq_disable();
> > > +
> > > + if (disabled)
> > > + local_irq_disable();
> > >   }
> >
> > Yuck!
> >
> > Is there a fundamental reason why we cannot simply enable irqs and
> > leave them enabled? Math state restore is not atomic and cannot really
> > be atomic.
> 
> You know, I didn't even try to verify ;) but see below.

So I'm thinking about the attached patch.

> Most probably we can simply enable irqs, yes. But what about older 
> kernels, how can we check?
>
> And let me repeat, I strongly believe that this !tsk_used_math() 
> case in math_state_restore() must die. And unlazy_fpu() in 
> init_fpu(). And both __restore_xstate_sig() and flush_thread() 
> should not use math_state_restore() at all. At least in its current 
> form.

Agreed.

> But this is obviously not -stable material.
> 
> That said, I'll try to look into git history tomorrow.

So I think the reasons are:

 - historic: because math_state_restore() started out as an interrupt 
   routine (from the IRQ13 days)

 - hardware imposed: the handler is executed with irqs off

 - it's probably the fastest implementation: we just run with the 
   natural irqs-off state the handler executes with.

So there's nothing outright wrong about executing with irqs off in a 
trap handler.

> [...] The patch above looks "obviously safe", but perhaps I am 
> paranoid too much...

IMHO your hack above isn't really acceptable, even for a backport.
So lets test the patch below (assuming it's the right thing to do)
and move forward?

Thanks,

Ingo

==>
From: Ingo Molnar 
Date: Fri, 6 Mar 2015 08:37:57 +0100
Subject: [PATCH] x86/fpu: Don't disable irqs in math_state_restore()

math_state_restore() was historically called with irqs disabled, 
because that's how the hardware generates the trap, and also because 
back in the days it was possible for it to be an asynchronous 
interrupt and interrupt handlers run with irqs off.

These days it's always an instruction trap, and furthermore it does 
inevitably complex things such as memory allocation and signal 
processing, which is not done with irqs disabled.

So keep irqs enabled.

This might surprise in-kernel FPU users that somehow relied on
interrupts being disabled across FPU usage - but that's
fundamentally fragile anyway due to the inatomicity of FPU state
restores. The trap return will restore interrupts to their previous 
state, but if FPU ops trigger math_state_restore() there's no
guarantee of atomicity anymore.

To warn about in-kernel irqs-off users of FPU state we might want to 
pass 'struct pt_regs' to math_state_restore() and check the trapped 
state for irqs disabled (flags has IF cleared) and kernel context - 
but that's for a later patch.
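The deferred check could look roughly like this (the helper name is hypothetical; X86_EFLAGS_IF really is bit 9 of the flags word):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the warning the commit message defers to a later patch:
 * given the flags saved in pt_regs at trap time, detect whether the
 * trapped context was running with interrupts disabled.  The helper
 * name is made up for illustration. */

#define X86_EFLAGS_IF (1UL << 9)

static bool trapped_with_irqs_off(uint64_t saved_flags)
{
	return !(saved_flags & X86_EFLAGS_IF);
}
```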

Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Fenghua Yu 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Oleg Nesterov 
Cc: Quentin Casasnovas 
Cc: Thomas Gleixner 
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/traps.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 950815a138e1..52f9e4057cee 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -844,8 +844,9 @@ void math_state_restore(void)
 {
struct task_struct *tsk = current;
 
+   local_irq_enable();
+
if (!tsk_used_math(tsk)) {
-   local_irq_enable();
/*
 * does a slab alloc which can sleep
 */
@@ -856,7 +857,6 @@ void math_state_restore(void)
do_group_exit(SIGKILL);
return;
}
-   local_irq_disable();
}
 
/* Avoid __kernel_fpu_begin() right after __thread_fpu_begin() */
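The semantics of the patch above can be modeled in plain userspace C: with the unconditional local_irq_enable() and no re-disable, the function always returns with irqs on regardless of the entry state, and the hardware trap return restores the interrupted context's real flags. Mock flag only; this is not kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/* Mock irq flag standing in for the CPU's IF bit. */
static bool irqs_off;

static void local_irq_enable(void)  { irqs_off = false; }
static void local_irq_disable(void) { irqs_off = true; }

/* Model of the patched math_state_restore(): enable irqs once, up
 * front, and never turn them back off. */
static void math_state_restore_model(void)
{
	local_irq_enable();
	/* ... init_fpu() can sleep here; irqs stay enabled ... */
}
```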


Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs

2015-03-05 Thread Oleg Nesterov
On 03/05, Ingo Molnar wrote:
>
> * Oleg Nesterov  wrote:
>
> > --- a/arch/x86/kernel/traps.c
> > +++ b/arch/x86/kernel/traps.c
> > @@ -774,7 +774,10 @@ void math_state_restore(void)
> > struct task_struct *tsk = current;
> >
> > if (!tsk_used_math(tsk)) {
> > -   local_irq_enable();
> > +   bool disabled = irqs_disabled();
> > +
> > +   if (disabled)
> > +   local_irq_enable();
> > /*
> >  * does a slab alloc which can sleep
> >  */
> > @@ -785,7 +788,9 @@ void math_state_restore(void)
> > do_group_exit(SIGKILL);
> > return;
> > }
> > -   local_irq_disable();
> > +
> > +   if (disabled)
> > +   local_irq_disable();
> > }
>
> Yuck!
>
> Is there a fundamental reason why we cannot simply enable irqs and
> leave them enabled? Math state restore is not atomic and cannot really
> be atomic.

You know, I didn't even try to verify ;) but see below.

Most probably we can simply enable irqs, yes. But what about older kernels,
how can we check?

And let me repeat, I strongly believe that this !tsk_used_math() case in
math_state_restore() must die. And unlazy_fpu() in init_fpu(). And both
__restore_xstate_sig() and flush_thread() should not use math_state_restore()
at all. At least in its current form.

But this is obviously not -stable material.

That said, I'll try to look into git history tomorrow. The patch above
looks "obviously safe", but perhaps I am paranoid too much...

Oleg.



Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs

2015-03-05 Thread Ingo Molnar

* Oleg Nesterov  wrote:

> math_state_restore() assumes it is called with irqs disabled, but
> this is not true if the caller is __restore_xstate_sig().
> 
> This means that if ia32_fxstate == T and __copy_from_user() fails
> __restore_xstate_sig() returns with irqs disabled too. This triggers
> 
>   BUG: sleeping function called from invalid context at 
> kernel/locking/rwsem.c:41
>   [] dump_stack+0x59/0xa0
>   [] ___might_sleep+0x105/0x110
>   [] ? _raw_spin_unlock_irqrestore+0x3d/0x70
>   [] __might_sleep+0x7d/0xb0
>   [] down_read+0x26/0xa0
>   [] ? _raw_spin_unlock_irqrestore+0x5a/0x70
>   [] print_vma_addr+0x58/0x130
>   [] signal_fault+0xbe/0xf0
>   [] sys32_rt_sigreturn+0xba/0xd0
> 
> Change math_state_restore() to check irqs_disabled().
> 
> Note: this is the minimal fix for -stable, it is horrible but simple.
> We need to rewrite math_state_restore(), init_fpu(), and cleanup their
> users.
> 
> Signed-off-by: Oleg Nesterov 
> Cc: 
> ---
>  arch/x86/kernel/traps.c |9 +++--
>  1 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 51c4658..7310e0e 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -774,7 +774,10 @@ void math_state_restore(void)
>   struct task_struct *tsk = current;
>  
>   if (!tsk_used_math(tsk)) {
> - local_irq_enable();
> + bool disabled = irqs_disabled();
> +
> + if (disabled)
> + local_irq_enable();
>   /*
>* does a slab alloc which can sleep
>*/
> @@ -785,7 +788,9 @@ void math_state_restore(void)
>   do_group_exit(SIGKILL);
>   return;
>   }
> - local_irq_disable();
> +
> + if (disabled)
> + local_irq_disable();
>   }

Yuck!

Is there a fundamental reason why we cannot simply enable irqs and 
leave them enabled? Math state restore is not atomic and cannot really 
be atomic.
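The conditional pattern in the patch quoted above can be modeled in plain userspace C: it only toggles the irq flag if it was actually disabled on entry, so the caller sees its own irq state preserved either way. Mock flag only; this is not kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/* Mock irq flag standing in for the CPU's IF bit. */
static bool irqs_off;

static bool irqs_disabled(void)     { return irqs_off; }
static void local_irq_enable(void)  { irqs_off = false; }
static void local_irq_disable(void) { irqs_off = true; }

/* Model of the !tsk_used_math() slow path in the -stable patch. */
static void math_state_restore_slowpath(void)
{
	bool disabled = irqs_disabled();

	if (disabled)
		local_irq_enable();
	/* init_fpu(): does a slab alloc which can sleep, so irqs
	 * must be on in this region */
	assert(!irqs_disabled());
	if (disabled)
		local_irq_disable();
}
```

The model makes the trade-off concrete: the sleeping region always runs with irqs on, yet the caller's irq state is restored on return, which is what makes the hack safe for -stable but ugly.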

[ A potential worry would be kernel code using vector instructions in
  irqs-off regions - but that's totally broken anyway so not a big
  worry IMO, we might even want to warn about it. ]

But I might be missing something?

Thanks,

Ingo