Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs
On Sun, Mar 8, 2015 at 6:59 AM, Andy Lutomirski wrote:
> There's another problem, though: We don't have a real stack pointer
> just after syscall and just before sysexit, and therefore we *must*
> use IST for anything that can happen while we still have
> user-controlled rsp. That includes #DB, #NM, and #MC.

I think I faked myself out here. Why do #DB and #BP use IST? We could
remove a decent (large?) amount of crud if we changed that.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs
On Mar 8, 2015 4:55 AM, "Ingo Molnar" wrote:
>
> * Linus Torvalds wrote:
>
> > On Sat, Mar 7, 2015 at 2:36 AM, Ingo Molnar wrote:
> > >
> > > We could save the same 10 cycles from page fault overhead as well,
> > > AFAICS.
> >
> > Are trap gates actually noticeably faster? Or is it just the
> > "conditional_sti()" you're worried about?
>
> ( I'll talk about the CR2 complication later, please ignore that
>   problem for a moment. )
>
> So I base my thinking on the following hierarchy of fast paths. In a
> typical x86 system there are 4 main types of 'context switches', in
> order of decreasing frequency:
>
>  - syscall 'context switch':    entry/exit from syscall
>  - trap/fault 'context switch': entry/exit from trap/fault
>  - irq 'context switch':        entry/exit from irqs
>  - task 'context switch':       scheduler context-switch
>
> Where each successive level is about an order of magnitude less
> frequently executed on a typical system than the previous one, so to
> optimize a level we are willing to push overhead to the next one(s).
>
> So the primary payoff in executing much of the entry code with irqs
> enabled would be to allow 64-bit *syscalls* to be made without irqs
> disabled: the first, most important level of context entries.
>
> Doing that would give us four (theoretical) performance advantages:
>
>  - No implicit irq disabling overhead when the syscall instruction is
>    executed: we could change MSR_SYSCALL_MASK from 0xc084 to 0xc284,
>    which removes the implicit CLI on syscall entry.
>
>  - No explicit irq enabling overhead via ENABLE_INTERRUPTS() [STI] in
>    system_call.
>
>  - No explicit irq disabling overhead in the ret_from_sys_call fast
>    path, i.e. no DISABLE_INTERRUPTS() [CLI].
>
>  - No implicit irq enabling overhead in ret_from_sys_call's
>    USERGS_SYSRET64: the SYSRETQ instruction would not have to
>    re-enable irqs as the user-space IF in R11 would match that of the
>    current IF.
>
> Whether that's an actual performance win in practice as well needs to
> be measured, but I'd be (very!) shocked if it wasn't in the 20+ cycles
> range: which is absolutely huge in terms of system_call optimizations.
>
> Now do I think this could be done realistically? I think yes (by
> re-using the NMI code's paranoid entry codepaths for the irq code as
> well, essentially fixing up the effects of any partial entries in an
> irq entry slow path), but I could be wrong about that.
>
> My main worry isn't even feasibility but maintainability and general
> robustness: all these asm cleanups we are doing now could enable such
> a trick to be pulled off robustly.
>
> But I think it could be done technically, because the NMI code already
> has to be careful about 'preempting' partial entries, so we have the
> logic.
>
> Complications:
>
>  - We now enable double partial entries: partial syscall interrupted
>    by an irq interrupted by an NMI context. I think it should all work
>    out fine but it's a new scenario.
>
>  - We'd enable interruptible return from system call, which caused
>    trouble in the past. Solvable IMHO, by being careful in the irq
>    entry code.
>
>  - We'd now have to be extra careful about the possibility of
>    exceptions triggered in the entry/exit code itself, triggered by
>    any sort of unusual register content or MMU fault.
>
> Simplifications:
>
>  - I'd ruthlessly disable IRQs for any sort of non-fast-path: for
>    example 32-bit compat entries, ptrace or any other slowpath - at
>    least initially until we map out the long term effects of this
>    optimization.
>
> Does this scare me? Yes, I think it should scare any sane person, but
> I don't think it's all that bad: all the NMI paranoidentry work has
> already blazed the trail, and most of the races will be readily
> triggerable via regular irq loads, so it's not like we'll leave subtle
> bugs in there.
>
> Being able to do the same with certain traps, because irq entry is
> careful about partial entry state, would just be a secondary bonus.
>
> Regarding the CR2 value on page faults:
>
> > Anyway, for page faulting, we traditionally actually wanted an
> > interrupt gate, because of how we wanted to avoid interrupts coming
> > in and possibly messing up %cr2 due to vmalloc faults, but more
> > importantly for preemption. [...]
>
> Here too I think we could take a page from the NMI code: save cr2 in
> the page fault asm code, recognize from the irq code when we are
> interrupting that and drop into a slowpath that saves cr2 right
> there. Potentially task-context-switching will be safe after that.
>
> Saving cr2 in the early page fault code should be much less of an
> overhead than what the IRQ disabling/enabling costs, so this should be
> a win. The potential win could be similar to that of system calls:
>
>  - Removal of an implicit 'CLI' in irq gates
>
>  - Removal of the explicit 'STI' in conditional_sti in your proposed
>    code
>
>  - Removal of an
Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs
* Ingo Molnar wrote:

> Doing that would give us four (theoretical) performance advantages:
>
>  - No implicit irq disabling overhead when the syscall instruction is
>    executed: we could change MSR_SYSCALL_MASK from 0xc084 to 0xc284,
>    which removes the implicit CLI on syscall entry.
>
>  - No explicit irq enabling overhead via ENABLE_INTERRUPTS() [STI] in
>    system_call.
>
>  - No explicit irq disabling overhead in the ret_from_sys_call fast
>    path, i.e. no DISABLE_INTERRUPTS() [CLI].
>
>  - No implicit irq enabling overhead in ret_from_sys_call's
>    USERGS_SYSRET64: the SYSRETQ instruction would not have to
>    re-enable irqs as the user-space IF in R11 would match that of the
>    current IF.
>
> Whether that's an actual performance win in practice as well needs
> to be measured, but I'd be (very!) shocked if it wasn't in the 20+
> cycles range: which is absolutely huge in terms of system_call
> optimizations.

So just to quantify the potential 64-bit system call entry fast path
performance savings a bit, I tried to simulate the effects in
user-space via a 'best case' simulation, where we do a
PUSHFQ+CLI+STI ... CLI+POPFQ simulated syscall sequence (beginning and
end sufficiently far from each other to not be interacting), on Intel
family 6 model 62 CPUs (slightly dated but still relevant):

  with irq disabling/enabling:

    new best speed: 2710739 loops (158 cycles per iteration).

  fully preemptible:

    new best speed: 3389503 loops (113 cycles per iteration).

Now that's an about 40 cycles difference, but admittedly the cost very
much depends on the way we save flags and on the way we restore flags,
and depends on how intelligently the CPU can hide the irq disabling
and the restoration amongst other processing it has to do on
entry/exit, which it can do pretty well in a number of important
cases.
I don't think I can simulate the real thing in user-space:

 - The hardest bit to simulate is SYSRET: POPFQ is expensive, but
   SYSRET might be able to 'cheat' on the enabling side.

 - I _think_ it cannot cheat because user-space might have come in
   with irqs disabled itself (we still have iopl(3)), so it's a POPFQ
   equivalent instruction.

 - OTOH the CPU might be able to hide the latency of the POPFQ amongst
   other SYSRET return work (which is significant) - so this is really
   hard to estimate.

So "we'll have to try it to see it" :-/ [and maybe Intel knows.]

But even if just half of the suspected savings can be realized: a 20
cycles speedup is very tempting IMHO, given that our 64-bit system
calls cost around 110 cycles these days.

Yes, it's scary, crazy, potentially fragile, might not even work, etc.
- but it's also very tempting nevertheless ...

So I'll try to write a prototype of this, just to be able to get some
numbers - but shoot me down if you think I'm being stupid and if the
concept is an absolute non-starter to begin with!

Thanks,

	Ingo
Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs
* Linus Torvalds wrote:

> On Sat, Mar 7, 2015 at 2:36 AM, Ingo Molnar wrote:
> >
> > We could save the same 10 cycles from page fault overhead as well,
> > AFAICS.
>
> Are trap gates actually noticeably faster? Or is it just the
> "conditional_sti()" you're worried about?

( I'll talk about the CR2 complication later, please ignore that
  problem for a moment. )

So I base my thinking on the following hierarchy of fast paths. In a
typical x86 system there are 4 main types of 'context switches', in
order of decreasing frequency:

 - syscall 'context switch':    entry/exit from syscall
 - trap/fault 'context switch': entry/exit from trap/fault
 - irq 'context switch':        entry/exit from irqs
 - task 'context switch':       scheduler context-switch

Where each successive level is about an order of magnitude less
frequently executed on a typical system than the previous one, so to
optimize a level we are willing to push overhead to the next one(s).

So the primary payoff in executing much of the entry code with irqs
enabled would be to allow 64-bit *syscalls* to be made without irqs
disabled: the first, most important level of context entries.

Doing that would give us four (theoretical) performance advantages:

 - No implicit irq disabling overhead when the syscall instruction is
   executed: we could change MSR_SYSCALL_MASK from 0xc084 to 0xc284,
   which removes the implicit CLI on syscall entry.

 - No explicit irq enabling overhead via ENABLE_INTERRUPTS() [STI] in
   system_call.

 - No explicit irq disabling overhead in the ret_from_sys_call fast
   path, i.e. no DISABLE_INTERRUPTS() [CLI].

 - No implicit irq enabling overhead in ret_from_sys_call's
   USERGS_SYSRET64: the SYSRETQ instruction would not have to
   re-enable irqs as the user-space IF in R11 would match that of the
   current IF.

Whether that's an actual performance win in practice as well needs to
be measured, but I'd be (very!) shocked if it wasn't in the 20+ cycles
range: which is absolutely huge in terms of system_call optimizations.
Now do I think this could be done realistically? I think yes (by
re-using the NMI code's paranoid entry codepaths for the irq code as
well, essentially fixing up the effects of any partial entries in an
irq entry slow path), but I could be wrong about that.

My main worry isn't even feasibility but maintainability and general
robustness: all these asm cleanups we are doing now could enable such
a trick to be pulled off robustly.

But I think it could be done technically, because the NMI code already
has to be careful about 'preempting' partial entries, so we have the
logic.

Complications:

 - We now enable double partial entries: partial syscall interrupted
   by an irq interrupted by an NMI context. I think it should all work
   out fine but it's a new scenario.

 - We'd enable interruptible return from system call, which caused
   trouble in the past. Solvable IMHO, by being careful in the irq
   entry code.

 - We'd now have to be extra careful about the possibility of
   exceptions triggered in the entry/exit code itself, triggered by
   any sort of unusual register content or MMU fault.

Simplifications:

 - I'd ruthlessly disable IRQs for any sort of non-fast-path: for
   example 32-bit compat entries, ptrace or any other slowpath - at
   least initially until we map out the long term effects of this
   optimization.

Does this scare me? Yes, I think it should scare any sane person, but
I don't think it's all that bad: all the NMI paranoidentry work has
already blazed the trail, and most of the races will be readily
triggerable via regular irq loads, so it's not like we'll leave subtle
bugs in there.

Being able to do the same with certain traps, because irq entry is
careful about partial entry state, would just be a secondary bonus.
Regarding the CR2 value on page faults:

> Anyway, for page faulting, we traditionally actually wanted an
> interrupt gate, because of how we wanted to avoid interrupts coming
> in and possibly messing up %cr2 due to vmalloc faults, but more
> importantly for preemption. [...]

Here too I think we could take a page from the NMI code: save cr2 in
the page fault asm code, recognize from the irq code when we are
interrupting that and drop into a slowpath that saves cr2 right
there. Potentially task-context-switching will be safe after that.

Saving cr2 in the early page fault code should be much less of an
overhead than what the IRQ disabling/enabling costs, so this should be
a win. The potential win could be similar to that of system calls:

 - Removal of an implicit 'CLI' in irq gates

 - Removal of the explicit 'STI' in conditional_sti in your proposed
   code

 - Removal of an explicit 'CLI' (DISABLE_INTERRUPTS) in
   error_exit/retint_swapgs.

 - Removal of an implicit 'STI' when SYSRET enables interrupts from
   R11

and the same savings would apply to FPU traps as well. I'd leave
Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs
On Sat, Mar 7, 2015 at 2:36 AM, Ingo Molnar wrote:
>
> We could save the same 10 cycles from page fault overhead as well,
> AFAICS.

Are trap gates actually noticeably faster? Or is it just the
"conditional_sti()" you're worried about?

Anyway, for page faulting, we traditionally actually wanted an
interrupt gate, because of how we wanted to avoid interrupts coming in
and possibly messing up %cr2 due to vmalloc faults, but more
importantly for preemption.

vmalloc faults are "harmless" because we'll notice that it's already
done, return, and then re-take the real fault. But a preemption event
before we read %cr2 can cause bad things to happen:

 - page fault pushes error code on stack, address in %cr2

 - we don't have interrupts disabled, and some interrupt comes in and
   causes preemption

 - some other process runs, takes another page fault. %cr2 now is the
   wrong address

 - we go back to the original thread (perhaps on another cpu), which
   now reads %cr2 for the wrong address

 - we send the process a SIGSEGV because we think it's accessing
   memory that it has no place touching

So the page fault code actually *needs* interrupts disabled until we
read %cr2. Stupid x86 trap semantics where the error code is on the
thread-safe stack, but %cr2 is not.

Maybe there is some trick I'm missing, but on the whole I think
"interrupt gate + conditional_sti()" does have things going for it.
Yes, it still leaves NMI as being special, but NMI really *is*
special.

                 Linus
Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs
* Andy Lutomirski wrote:

> On Fri, Mar 6, 2015 at 2:00 PM, Linus Torvalds wrote:
> > On Fri, Mar 6, 2015 at 11:23 AM, Andy Lutomirski wrote:
> > >
> > > Please don't. IMO it's really nice that we don't use trap gates
> > > at all on x86_64, and I find the conditional_sti thing much nicer
> > > than having to audit all of the entry code to see whether it's
> > > safe to run it with IRQs on.
> >
> > So I'm not sure I see much difference, but I'd certainly be ok with
> > just moving the "conditional_sti()" up unconditionally to be the
> > first thing in do_device_not_available().
>
> I'd be fine with that. The important difference is that it's after
> swapgs.

The thing is, we have to be careful about NMI contexts anyway. So how
about being careful in irq contexts as well?

We could shave a good 10 cycles from the FPU trap overhead, maybe
more?

We could save the same 10 cycles from page fault overhead as well,
AFAICS.

Hm?

Thanks,

	Ingo
Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs
* Linus Torvalds wrote: > On Thu, Mar 5, 2015 at 11:58 PM, Ingo Molnar wrote: > > > > math_state_restore() was historically called with irqs disabled, > > because that's how the hardware generates the trap, and also because > > back in the days it was possible for it to be an asynchronous > > interrupt and interrupt handlers run with irqs off. > > > > These days it's always an instruction trap, and furthermore it does > > inevitably complex things such as memory allocation and signal > > processing, which is not done with irqs disabled. > > > > So keep irqs enabled. > > I agree with the "keep irqs enabled". > > However, I do *not* agree with the actual patch, which doesn't do that at all. > > @@ -844,8 +844,9 @@ void math_state_restore(void) > > { > > struct task_struct *tsk = current; > > > > + local_irq_enable(); > > + > > There's a big difference between "keep interrupts enabled" (ok) and > "explicitly enable interrupts in random contexts" (*NOT* ok). Agreed, so I thought that we already kind of did that: if (!tsk_used_math(tsk)) { local_irq_enable(); But yeah, my patch brought that to a whole new level by always doing it, without starting with adding a warning first. > > So get rid of the "local_irq_enable()" entirely, and replace it with a > >WARN_ON_ONCE(irqs_disabled()); Yeah, agreed absolutely - sorry about scaring (or annoying) you with a Signed-off-by patch, that was silly from me. > and let's just fix the cases where this actually gets called with > interrupts off. [...] Yes. I was a bit blinded by the 'easy to backport' aspect, so I concentrated on that, but it's more important to not break stuff. > @@ -959,7 +949,7 @@ void __init trap_init(void) > set_system_intr_gate(X86_TRAP_OF, ); > set_intr_gate(X86_TRAP_BR, bounds); > set_intr_gate(X86_TRAP_UD, invalid_op); > - set_intr_gate(X86_TRAP_NM, device_not_available); > + set_trap_gate(X86_TRAP_NM, device_not_available); So I wasn't this brave. 
Historically, modern x86 entry code ran with irqs off, because that's what the hardware gave us on most entry types. I'm not 100% sure we are ready to allow preemption of sensitive entry code on CONFIG_PREEMPT=y kernels. But we could try.

Thanks,

	Ingo
Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs
* Andy Lutomirski <l...@amacapital.net> wrote:

> On Fri, Mar 6, 2015 at 2:00 PM, Linus Torvalds
> <torva...@linux-foundation.org> wrote:
> > On Fri, Mar 6, 2015 at 11:23 AM, Andy Lutomirski <l...@amacapital.net> wrote:
> > >
> > > Please don't. IMO it's really nice that we don't use trap gates at
> > > all on x86_64, and I find the conditional_sti thing much nicer than
> > > having to audit all of the entry code to see whether it's safe to
> > > run it with IRQs on.
> >
> > So I'm not sure I see much difference, but I'd certainly be ok with
> > just moving the "conditional_sti()" up unconditionally to be the first
> > thing in do_device_not_available().
>
> I'd be fine with that. The important difference is that it's after
> swapgs.

The thing is, we have to be careful about NMI contexts anyway. So how
about being careful in irq contexts as well?

We could shave a good 10 cycles from the FPU trap overhead, maybe
more? We could save the same 10 cycles from page fault overhead as
well, AFAICS.

Hm?

Thanks,

	Ingo
Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs
On Sat, Mar 7, 2015 at 2:36 AM, Ingo Molnar <mi...@kernel.org> wrote:
>
> We could save the same 10 cycles from page fault overhead as well,
> AFAICS.

Are trap gates actually noticeably faster? Or is it just the
"conditional_sti()" you're worried about?

Anyway, for page faulting, we traditionally actually wanted an
interrupt gate, because of how we wanted to avoid interrupts coming in
and possibly messing up %cr2 due to vmalloc faults, but more
importantly for preemption. vmalloc faults are harmless because we'll
notice that it's already done, return, and then re-take the real
fault. But a preemption event before we read %cr2 can cause bad things
to happen:

 - page fault pushes error code on stack, address in %cr2

 - we don't have interrupts disabled, and some interrupt comes in and
   causes preemption

 - some other process runs, takes another page fault. %cr2 now is the
   wrong address

 - we go back to the original thread (perhaps on another cpu), which
   now reads %cr2 for the wrong address

 - we send the process a SIGSEGV because we think it's accessing
   memory that it has no place touching

So the page fault code actually *needs* interrupts disabled until we
read %cr2. Stupid x86 trap semantics where the error code is on the
thread-safe stack, but %cr2 is not.

Maybe there is some trick I'm missing, but on the whole I think
interrupt gate + conditional_sti() does have things going for it. Yes,
it still leaves NMI as being special, but NMI really *is* special.

	Linus
Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs
On Fri, Mar 6, 2015 at 2:00 PM, Linus Torvalds wrote: > On Fri, Mar 6, 2015 at 11:23 AM, Andy Lutomirski wrote: >> >> Please don't. IMO it's really nice that we don't use trap gates at >> all on x86_64, and I find the conditional_sti thing much nicer than >> having to audit all of the entry code to see whether it's safe to run >> it with IRQs on. > > So I'm not sure I see much difference, but I'd certainly be ok with > just moving the "conditional_sti()" up unconditionally to be the first > thing in do_device_not_available(). I'd be fine with that. The important difference is that it's after swapgs. --Andy > > The point being that we still *not* just randomly enable interrupts > because we decide that the callers are wrong. > >Linus -- Andy Lutomirski AMA Capital Management, LLC
Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs
On Fri, Mar 6, 2015 at 11:23 AM, Andy Lutomirski wrote: > > Please don't. IMO it's really nice that we don't use trap gates at > all on x86_64, and I find the conditional_sti thing much nicer than > having to audit all of the entry code to see whether it's safe to run > it with IRQs on. So I'm not sure I see much difference, but I'd certainly be ok with just moving the "conditional_sti()" up unconditionally to be the first thing in do_device_not_available(). The point being that we still do *not* just randomly enable interrupts because we decide that the callers are wrong. Linus
Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs
On Fri, Mar 6, 2015 at 9:33 AM, Linus Torvalds wrote: > On Thu, Mar 5, 2015 at 11:58 PM, Ingo Molnar wrote: >> >> math_state_restore() was historically called with irqs disabled, >> because that's how the hardware generates the trap, and also because >> back in the days it was possible for it to be an asynchronous >> interrupt and interrupt handlers run with irqs off. >> >> These days it's always an instruction trap, and furthermore it does >> inevitably complex things such as memory allocation and signal >> processing, which is not done with irqs disabled. >> >> So keep irqs enabled. > > I agree with the "keep irqs enabled". > > However, I do *not* agree with the actual patch, which doesn't do that at all. >> @@ -844,8 +844,9 @@ void math_state_restore(void) >> { >> struct task_struct *tsk = current; >> >> + local_irq_enable(); >> + > > There's a big difference between "keep interrupts enabled" (ok) and > "explicitly enable interrupts in random contexts" (*NOT* ok). > > So get rid of the "local_irq_enable()" entirely, and replace it with a > >WARN_ON_ONCE(irqs_disabled()); I like this a lot better. Ingo's patch scares me because in-kernel FPU users are fundamentally atomic: if we're using kernel FPU *and* current has important FPU state, we can't schedule safely because there's nowhere to save the in-kernel FPU state. I don't see how we would end up in this situation with CR0.TS set, but if we somehow do, then we'll corrupt something with your patch. > > and let's just fix the cases where this actually gets called with > interrupts off. In particular, let's just make the > device_not_available thing use a trap gate, not an interrupt gate. And > then remove the "conditional_sti()" stuff. Please don't. IMO it's really nice that we don't use trap gates at all on x86_64, and I find the conditional_sti thing much nicer than having to audit all of the entry code to see whether it's safe to run it with IRQs on. 
This isn't to say that using trap gates would be a terrible idea in
general. I think I'd be okay with saying that non-IST synchronous-only
never-from-random-places entries should be trap gates in general. We
could then audit the entry code and convert various entries.

Things that maybe shouldn't be trap gates:

 - Anything asynchronous.

 - Page faults. The entry code can cause page faults recursively due
   to lazy vmap page table fills. Maybe this doesn't matter.

 - Anything using IST. That way lies madness.

FWIW, I just started auditing the entry code a bit, and it's currently
unsafe to use trap gates. The IRQ entry code in the "interrupt" macro
does this:

	testl $3, CS-RBP(%rsp)
	je 1f
	SWAPGS

That will go boom quite nicely if it happens very early in
device_not_available. I suspect that fixing that will be slow and
unpleasant and will significantly outweigh any benefit.

--Andy
Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs
On 03/06, Linus Torvalds wrote: > > On Thu, Mar 5, 2015 at 11:58 PM, Ingo Molnar wrote: > > > > math_state_restore() was historically called with irqs disabled, > > because that's how the hardware generates the trap, and also because > > back in the days it was possible for it to be an asynchronous > > interrupt and interrupt handlers run with irqs off. > > > > These days it's always an instruction trap, and furthermore it does > > inevitably complex things such as memory allocation and signal > > processing, which is not done with irqs disabled. > > > > So keep irqs enabled. > > I agree with the "keep irqs enabled". Me too, but not for stable. This patch is wrong without other changes. > IOW, I think the starting point should be something like the attached > (which doesn't do the WARN_ON_ONCE() - it should be added for > debugging). Yes, agreed. And. Even if we forget about stable, we need some minor changes before this one. At least we need to add preempt_disable() into kernel_fpu_disable(). So I still think that the horrible hack I sent makes sense for -stable. Just we need to cleanup (kill) it "immediately". Oleg.
Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs
On Thu, Mar 5, 2015 at 11:58 PM, Ingo Molnar wrote:
>
> math_state_restore() was historically called with irqs disabled,
> because that's how the hardware generates the trap, and also because
> back in the days it was possible for it to be an asynchronous
> interrupt and interrupt handlers run with irqs off.
>
> These days it's always an instruction trap, and furthermore it does
> inevitably complex things such as memory allocation and signal
> processing, which is not done with irqs disabled.
>
> So keep irqs enabled.

I agree with the "keep irqs enabled".

However, I do *not* agree with the actual patch, which doesn't do that at all.

> @@ -844,8 +844,9 @@ void math_state_restore(void)
>  {
>  	struct task_struct *tsk = current;
>
> +	local_irq_enable();
> +

There's a big difference between "keep interrupts enabled" (ok) and
"explicitly enable interrupts in random contexts" (*NOT* ok).

So get rid of the "local_irq_enable()" entirely, and replace it with a

	WARN_ON_ONCE(irqs_disabled());

and let's just fix the cases where this actually gets called with
interrupts off. In particular, let's just make the
device_not_available thing use a trap gate, not an interrupt gate. And
then remove the "conditional_sti()" stuff.

IOW, I think the starting point should be something like the attached
(which doesn't do the WARN_ON_ONCE() - it should be added for
debugging). *NOT* some kind of "let's just enable interrupts blindly"
approach.

This is completely untested, of course. But I don't see why we should
use an interrupt gate for this. Is there anything in
"exception_enter()" that requires it?

	Linus

 arch/x86/kernel/traps.c | 14 ++------------
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 9d2073e2ecc9..f045ac026ff1 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -836,16 +836,12 @@ asmlinkage __visible void __attribute__((weak)) smp_threshold_interrupt(void)
  *
  * Careful.. There are problems with IBM-designed IRQ13 behaviour.
  * Don't touch unless you *really* know how it works.
- *
- * Must be called with kernel preemption disabled (eg with local
- * local interrupts as in the case of do_device_not_available).
  */
 void math_state_restore(void)
 {
 	struct task_struct *tsk = current;
 
 	if (!tsk_used_math(tsk)) {
-		local_irq_enable();
 		/*
 		 * does a slab alloc which can sleep
 		 */
@@ -856,7 +852,6 @@ void math_state_restore(void)
 			do_group_exit(SIGKILL);
 			return;
 		}
-		local_irq_disable();
 	}
 
 	/* Avoid __kernel_fpu_begin() right after __thread_fpu_begin() */
@@ -884,18 +879,13 @@ do_device_not_available(struct pt_regs *regs, long error_code)
 	if (read_cr0() & X86_CR0_EM) {
 		struct math_emu_info info = { };
 
-		conditional_sti(regs);
-
 		info.regs = regs;
 		math_emulate();
 		exception_exit(prev_state);
 		return;
 	}
 #endif
 
-	math_state_restore(); /* interrupts still off */
-#ifdef CONFIG_X86_32
-	conditional_sti(regs);
-#endif
+	math_state_restore();
 	exception_exit(prev_state);
 }
 NOKPROBE_SYMBOL(do_device_not_available);
@@ -959,7 +949,7 @@ void __init trap_init(void)
 	set_system_intr_gate(X86_TRAP_OF, );
 	set_intr_gate(X86_TRAP_BR, bounds);
 	set_intr_gate(X86_TRAP_UD, invalid_op);
-	set_intr_gate(X86_TRAP_NM, device_not_available);
+	set_trap_gate(X86_TRAP_NM, device_not_available);
 #ifdef CONFIG_X86_32
 	set_task_gate(X86_TRAP_DF, GDT_ENTRY_DOUBLEFAULT_TSS);
 #else
Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs
On 03/06, David Vrabel wrote: > > On 06/03/15 15:36, Oleg Nesterov wrote: > > > > This needs more discussion, but in short so far I think that fpu_alloc() > > from #NM exception is fine if user_mode(regs) == T. > > I think a memory allocation here, where the only behaviour on a failure > is to kill the task, is (and has always been) a crazy idea. Well, I do not agree. But lets discuss this later. This code should be rewritten in any case. It has more problems. > Additionally, in a Xen PV guest the #NM handler is called with TS > already cleared by the hypervisor so the handler must not enable > interrupts (and thus potentially schedule another task) until after the > current task's fpu state has been restored. If a task was scheduled > before restoring the FPU state, TS would be clear and that task will use > fpu state from a previous task. I can be easily wrong (especially because I know nothing about Xen ;), but I do not think this is true. Yes sure, we need to avoid preemption, but we need this in any case, even without Xen. Again, lets discuss this a bit later? Oleg.
Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs
On 06/03/15 15:36, Oleg Nesterov wrote: > On 03/06, David Vrabel wrote: >> >> On 06/03/15 14:01, Oleg Nesterov wrote: >> >>> Not sure I understand it correctly after the first quick look, but >>> >>> 1. It conflicts with the recent changes in tip/x86/fpu >>> >>> 2. fpu_ini() initializes current->thread.fpu.state. This looks unneeded, >>>the kernel threads no longer have FPU context and do not abuse CPU. >>> >>> 3. I can be easily wrong, but it looks buggy... Note that >>>arch_dup_task_struct() doesn't allocate child->fpu.state if >>>!tsk_used_math(parent). >> >> ...yes. It's bit-rotted a bit. >> >>> No, I do not think this patch is a good idea. Perhaps I am wrong, but I >>> think we need other changes. And they should start from init_fpu(). >> >> But the general principle of avoiding the allocation in the #NM handler >> and hence avoiding the need to re-enable IRQs is still a good idea, yes? > > This needs more discussion, but in short so far I think that fpu_alloc() > from #NM exception is fine if user_mode(regs) == T. I think a memory allocation here, where the only behaviour on a failure is to kill the task, is (and has always been) a crazy idea. Additionally, in a Xen PV guest the #NM handler is called with TS already cleared by the hypervisor so the handler must not enable interrupts (and thus potentially schedule another task) until after the current task's fpu state has been restored. If a task was scheduled before restoring the FPU state, TS would be clear and that task will use fpu state from a previous task. > Just do_device_not_available() should simply do conditional_sti(), I think. > Perhaps it can even enable irqs unconditionally, but we need to verify that > this is 100% correct. > > And I agree that "if (!tsk_used_math(tsk))" code in math_state_restore() > should be removed. But not to avoid the allocation, and this needs other > changes. > > Oleg. 
Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs
On 03/06, David Vrabel wrote: > > On 06/03/15 14:01, Oleg Nesterov wrote: > > > Not sure I understand it correctly after the first quick look, but > > > > 1. It conflicts with the recent changes in tip/x86/fpu > > > > 2. fpu_ini() initializes current->thread.fpu.state. This looks unneeded, > >the kernel threads no longer have FPU context and do not abuse CPU. > > > > 3. I can be easily wrong, but it looks buggy... Note that > >arch_dup_task_struct() doesn't allocate child->fpu.state if > >!tsk_used_math(parent). > > ...yes. It's bit-rotted a bit. > > > No, I do not think this patch is a good idea. Perhaps I am wrong, but I > > think we need other changes. And they should start from init_fpu(). > > But the general principle of avoiding the allocation in the #NM handler > and hence avoiding the need to re-enable IRQs is still a good idea, yes? This needs more discussion, but in short so far I think that fpu_alloc() from #NM exception is fine if user_mode(regs) == T. Just do_device_not_available() should simply do conditional_sti(), I think. Perhaps it can even enable irqs unconditionally, but we need to verify that this is 100% correct. And I agree that "if (!tsk_used_math(tsk))" code in math_state_restore() should be removed. But not to avoid the allocation, and this needs other changes. Oleg.
Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs
On 06/03/15 14:01, Oleg Nesterov wrote: > On 03/06, Ingo Molnar wrote: >> >> * Oleg Nesterov wrote: >> >>> On 03/06, Ingo Molnar wrote: * Oleg Nesterov wrote: > [...] The patch above looks "obviously safe", but perhaps I am > paranoid too much... IMHO your hack above isn't really acceptable, even for a backport. So lets test the patch below (assuming it's the right thing to do) and move forward? >>> >>> OK, but please note that this patch is not beckportable. If you think >>> that -stable doesn't need this fix, then I agree. >>> >>> If the caller is do_device_not_available(), then we can not enable >>> irqs before __thread_fpu_begin() + restore_fpu_checking(). >>> >>> 1. Preemption in between can destroy ->fpu.state initialized by >>>fpu_finit(), __switch_to() will save the live (wrong) FPU state >>>again. >>> >>> 2. kernel_fpu_begin() from irq right after __thread_fpu_begin() is >>>not nice too. It will do __save_init_fpu() and this overwrites >>>->fpu.state too. >>> >>> Starting from v4.0 it does kernel_fpu_disable(), but the older kernels >>> do not. >>> >>> Ingo, this code is really horrible and fragile. We need to cleanup it >>> step-by-step, imho. >> >> How about the patch from David Vrabel? That seems to solve the >> irq-disable problem too, right? > > I wasn't cc'ed, I guess you mean > > [PATCHv4] x86, fpu: remove the logic of non-eager fpu mem allocation at > the first usage > http://marc.info/?l=linux-kernel=142564237705311=2 This patch is from Suresh, and was originally against 3.10, so... > Not sure I understand it correctly after the first quick look, but > > 1. It conflicts with the recent changes in tip/x86/fpu > > 2. fpu_ini() initializes current->thread.fpu.state. This looks unneeded, >the kernel threads no longer have FPU context and do not abuse CPU. > > 3. I can be easily wrong, but it looks buggy... Note that >arch_dup_task_struct() doesn't allocate child->fpu.state if >!tsk_used_math(parent). ...yes. It's bit-rotted a bit. 
> No, I do not think this patch is a good idea. Perhaps I am wrong, but I > think we need other changes. And they should start from init_fpu(). But the general principle of avoiding the allocation in the #NM handler and hence avoiding the need to re-enable IRQs is still a good idea, yes? David
Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs
On 03/06, Oleg Nesterov wrote: > > On 03/06, Ingo Molnar wrote: > > > > How about the patch from David Vrabel? That seems to solve the > > irq-disable problem too, right? > > I wasn't cc'ed, I guess you mean > > [PATCHv4] x86, fpu: remove the logic of non-eager fpu mem allocation at > the first usage > http://marc.info/?l=linux-kernel=142564237705311=2 > > Not sure I understand it correctly after the first quick look, but > > 1. It conflicts with the recent changes in tip/x86/fpu > > 2. fpu_ini() initializes current->thread.fpu.state. This looks unneeded, >the kernel threads no longer have FPU context and do not abuse CPU. > > 3. I can be easily wrong, but it looks buggy... Note that >arch_dup_task_struct() doesn't allocate child->fpu.state if >!tsk_used_math(parent). > > Add David... > > No, I do not think this patch is a good idea. Perhaps I am wrong, but I > think we need other changes. And they should start from init_fpu(). But the change in eager_fpu_init_bp() looks good (on top of tip/x86/fpu), at least I was going to do the same ;) In any case, I do not think this patch can target -stable. Oleg.
Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs
On 03/06, Ingo Molnar wrote: > > * Oleg Nesterov wrote: > > > On 03/06, Ingo Molnar wrote: > > > > > > * Oleg Nesterov wrote: > > > > > > > [...] The patch above looks "obviously safe", but perhaps I am > > > > paranoid too much... > > > > > > IMHO your hack above isn't really acceptable, even for a backport. > > > So lets test the patch below (assuming it's the right thing to do) > > > and move forward? > > > > OK, but please note that this patch is not beckportable. If you think > > that -stable doesn't need this fix, then I agree. > > > > If the caller is do_device_not_available(), then we can not enable > > irqs before __thread_fpu_begin() + restore_fpu_checking(). > > > > 1. Preemption in between can destroy ->fpu.state initialized by > >fpu_finit(), __switch_to() will save the live (wrong) FPU state > >again. > > > > 2. kernel_fpu_begin() from irq right after __thread_fpu_begin() is > >not nice too. It will do __save_init_fpu() and this overwrites > >->fpu.state too. > > > > Starting from v4.0 it does kernel_fpu_disable(), but the older kernels > > do not. > > > > Ingo, this code is really horrible and fragile. We need to cleanup it > > step-by-step, imho. > > How about the patch from David Vrabel? That seems to solve the > irq-disable problem too, right? I wasn't cc'ed, I guess you mean [PATCHv4] x86, fpu: remove the logic of non-eager fpu mem allocation at the first usage http://marc.info/?l=linux-kernel=142564237705311=2 Not sure I understand it correctly after the first quick look, but 1. It conflicts with the recent changes in tip/x86/fpu 2. fpu_ini() initializes current->thread.fpu.state. This looks unneeded, the kernel threads no longer have FPU context and do not abuse CPU. 3. I can be easily wrong, but it looks buggy... Note that arch_dup_task_struct() doesn't allocate child->fpu.state if !tsk_used_math(parent). Add David... No, I do not think this patch is a good idea. Perhaps I am wrong, but I think we need other changes. 
And they should start from init_fpu().

Oleg.
Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs
* Oleg Nesterov wrote: > On 03/06, Ingo Molnar wrote: > > > > * Oleg Nesterov wrote: > > > > > [...] The patch above looks "obviously safe", but perhaps I am > > > paranoid too much... > > > > IMHO your hack above isn't really acceptable, even for a backport. > > So lets test the patch below (assuming it's the right thing to do) > > and move forward? > > OK, but please note that this patch is not beckportable. If you think > that -stable doesn't need this fix, then I agree. > > If the caller is do_device_not_available(), then we can not enable > irqs before __thread_fpu_begin() + restore_fpu_checking(). > > 1. Preemption in between can destroy ->fpu.state initialized by >fpu_finit(), __switch_to() will save the live (wrong) FPU state >again. > > 2. kernel_fpu_begin() from irq right after __thread_fpu_begin() is >not nice too. It will do __save_init_fpu() and this overwrites >->fpu.state too. > > Starting from v4.0 it does kernel_fpu_disable(), but the older kernels > do not. > > Ingo, this code is really horrible and fragile. We need to cleanup it > step-by-step, imho. How about the patch from David Vrabel? That seems to solve the irq-disable problem too, right? Thanks, Ingo
Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs
On 03/06, Oleg Nesterov wrote: > > OK, but please note that this patch is not beckportable. If you think > that -stable doesn't need this fix, then I agree. > > If the caller is do_device_not_available(), then we can not enable > irqs before __thread_fpu_begin() + restore_fpu_checking(). > > 1. Preemption in between can destroy ->fpu.state initialized by >fpu_finit(), __switch_to() will save the live (wrong) FPU state >again. > > 2. kernel_fpu_begin() from irq right after __thread_fpu_begin() is >not nice too. It will do __save_init_fpu() and this overwrites >->fpu.state too. > > Starting from v4.0 it does kernel_fpu_disable(), but the older kernels > do not. > > Ingo, this code is really horrible and fragile. We need to cleanup it > step-by-step, imho. Forgot to mention... And, otoh, if we are not going to backport this change, then I think this irq_enable() should be called by do_device_not_available(). > > > ==> > > From: Ingo Molnar > > Date: Fri, 6 Mar 2015 08:37:57 +0100 > > Subject: [PATCH] x86/fpu: Don't disable irqs in math_state_restore() > > > > math_state_restore() was historically called with irqs disabled, > > because that's how the hardware generates the trap, and also because > > back in the days it was possible for it to be an asynchronous > > interrupt and interrupt handlers run with irqs off. > > > > These days it's always an instruction trap, and furthermore it does > > inevitably complex things such as memory allocation and signal > > processing, which is not done with irqs disabled. > > > > So keep irqs enabled. > > > > This might surprise in-kernel FPU users that somehow relied on > > interrupts being disabled across FPU usage - but that's > > fundamentally fragile anyway due to the inatomicity of FPU state > > restores. The trap return will restore interrupts to its previous > > state, but if FPU ops trigger math_state_restore() there's no > > guarantee of atomicity anymore. 
> >
> > To warn about in-kernel irqs-off users of FPU state we might want to
> > pass 'struct pt_regs' to math_state_restore() and check the trapped
> > state for irqs disabled (flags has IF cleared) and kernel context -
> > but that's for a later patch.
> >
> > Cc: Andy Lutomirski
> > Cc: Borislav Petkov
> > Cc: Fenghua Yu
> > Cc: H. Peter Anvin
> > Cc: Linus Torvalds
> > Cc: Oleg Nesterov
> > Cc: Quentin Casasnovas
> > Cc: Thomas Gleixner
> > Signed-off-by: Ingo Molnar
> > ---
> >  arch/x86/kernel/traps.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> > index 950815a138e1..52f9e4057cee 100644
> > --- a/arch/x86/kernel/traps.c
> > +++ b/arch/x86/kernel/traps.c
> > @@ -844,8 +844,9 @@ void math_state_restore(void)
> >  {
> >  	struct task_struct *tsk = current;
> >
> > +	local_irq_enable();
> > +
> >  	if (!tsk_used_math(tsk)) {
> > -		local_irq_enable();
> >  		/*
> >  		 * does a slab alloc which can sleep
> >  		 */
> > @@ -856,7 +857,6 @@ void math_state_restore(void)
> >  			do_group_exit(SIGKILL);
> >  			return;
> >  		}
> > -		local_irq_disable();
> >  	}
> >
> >  	/* Avoid __kernel_fpu_begin() right after __thread_fpu_begin() */
Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs
On 03/06, Ingo Molnar wrote: > > * Oleg Nesterov wrote: > > > [...] The patch above looks "obviously safe", but perhaps I am > > paranoid too much... > > IMHO your hack above isn't really acceptable, even for a backport. > So lets test the patch below (assuming it's the right thing to do) > and move forward? OK, but please note that this patch is not beckportable. If you think that -stable doesn't need this fix, then I agree. If the caller is do_device_not_available(), then we can not enable irqs before __thread_fpu_begin() + restore_fpu_checking(). 1. Preemption in between can destroy ->fpu.state initialized by fpu_finit(), __switch_to() will save the live (wrong) FPU state again. 2. kernel_fpu_begin() from irq right after __thread_fpu_begin() is not nice too. It will do __save_init_fpu() and this overwrites ->fpu.state too. Starting from v4.0 it does kernel_fpu_disable(), but the older kernels do not. Ingo, this code is really horrible and fragile. We need to cleanup it step-by-step, imho. > ==> > From: Ingo Molnar > Date: Fri, 6 Mar 2015 08:37:57 +0100 > Subject: [PATCH] x86/fpu: Don't disable irqs in math_state_restore() > > math_state_restore() was historically called with irqs disabled, > because that's how the hardware generates the trap, and also because > back in the days it was possible for it to be an asynchronous > interrupt and interrupt handlers run with irqs off. > > These days it's always an instruction trap, and furthermore it does > inevitably complex things such as memory allocation and signal > processing, which is not done with irqs disabled. > > So keep irqs enabled. > > This might surprise in-kernel FPU users that somehow relied on > interrupts being disabled across FPU usage - but that's > fundamentally fragile anyway due to the inatomicity of FPU state > restores. The trap return will restore interrupts to its previous > state, but if FPU ops trigger math_state_restore() there's no > guarantee of atomicity anymore. 
> > To warn about in-kernel irqs-off users of FPU state we might want to > pass 'struct pt_regs' to math_state_restore() and check the trapped > state for irqs disabled (flags has IF cleared) and kernel context - > but that's for a later patch. > > Cc: Andy Lutomirski > Cc: Borislav Petkov > Cc: Fenghua Yu > Cc: H. Peter Anvin > Cc: Linus Torvalds > Cc: Oleg Nesterov > Cc: Quentin Casasnovas > Cc: Thomas Gleixner > Signed-off-by: Ingo Molnar > --- > arch/x86/kernel/traps.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c > index 950815a138e1..52f9e4057cee 100644 > --- a/arch/x86/kernel/traps.c > +++ b/arch/x86/kernel/traps.c > @@ -844,8 +844,9 @@ void math_state_restore(void) > { > struct task_struct *tsk = current; > > + local_irq_enable(); > + > if (!tsk_used_math(tsk)) { > - local_irq_enable(); > /* >* does a slab alloc which can sleep >*/ > @@ -856,7 +857,6 @@ void math_state_restore(void) > do_group_exit(SIGKILL); > return; > } > - local_irq_disable(); > } > > /* Avoid __kernel_fpu_begin() right after __thread_fpu_begin() */ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs
On 03/06, Oleg Nesterov wrote: On 03/06, Ingo Molnar wrote: How about the patch from David Vrabel? That seems to solve the irq-disable problem too, right? I wasn't cc'ed, I guess you mean [PATCHv4] x86, fpu: remove the logic of non-eager fpu mem allocation at the first usage http://marc.info/?l=linux-kernel&m=142564237705311&w=2 Not sure I understand it correctly after the first quick look, but 1. It conflicts with the recent changes in tip/x86/fpu 2. fpu_init() initializes current->thread.fpu.state. This looks unneeded, the kernel threads no longer have FPU context and do not abuse CPU. 3. I can be easily wrong, but it looks buggy... Note that arch_dup_task_struct() doesn't allocate child->fpu.state if !tsk_used_math(parent). Add David... No, I do not think this patch is a good idea. Perhaps I am wrong, but I think we need other changes. And they should start from init_fpu(). But the change in eager_fpu_init_bp() looks good (on top of tip/x86/fpu), at least I was going to do the same ;) In any case, I do not think this patch can target -stable. Oleg.
Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs
On 06/03/15 14:01, Oleg Nesterov wrote: On 03/06, Ingo Molnar wrote: * Oleg Nesterov o...@redhat.com wrote: On 03/06, Ingo Molnar wrote: * Oleg Nesterov o...@redhat.com wrote: [...] The patch above looks obviously safe, but perhaps I am paranoid too much... IMHO your hack above isn't really acceptable, even for a backport. So lets test the patch below (assuming it's the right thing to do) and move forward? OK, but please note that this patch is not backportable. If you think that -stable doesn't need this fix, then I agree. If the caller is do_device_not_available(), then we can not enable irqs before __thread_fpu_begin() + restore_fpu_checking(). 1. Preemption in between can destroy ->fpu.state initialized by fpu_finit(), __switch_to() will save the live (wrong) FPU state again. 2. kernel_fpu_begin() from irq right after __thread_fpu_begin() is not nice too. It will do __save_init_fpu() and this overwrites ->fpu.state too. Starting from v4.0 it does kernel_fpu_disable(), but the older kernels do not. Ingo, this code is really horrible and fragile. We need to cleanup it step-by-step, imho. How about the patch from David Vrabel? That seems to solve the irq-disable problem too, right? I wasn't cc'ed, I guess you mean [PATCHv4] x86, fpu: remove the logic of non-eager fpu mem allocation at the first usage http://marc.info/?l=linux-kernel&m=142564237705311&w=2 This patch is from Suresh, and was originally against 3.10, so... Not sure I understand it correctly after the first quick look, but 1. It conflicts with the recent changes in tip/x86/fpu 2. fpu_init() initializes current->thread.fpu.state. This looks unneeded, the kernel threads no longer have FPU context and do not abuse CPU. 3. I can be easily wrong, but it looks buggy... Note that arch_dup_task_struct() doesn't allocate child->fpu.state if !tsk_used_math(parent). ...yes. It's bit-rotted a bit. No, I do not think this patch is a good idea. Perhaps I am wrong, but I think we need other changes. 
And they should start from init_fpu(). But the general principle of avoiding the allocation in the #NM handler and hence avoiding the need to re-enable IRQs is still a good idea, yes? David
Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs
On 03/06, David Vrabel wrote: On 06/03/15 14:01, Oleg Nesterov wrote: Not sure I understand it correctly after the first quick look, but 1. It conflicts with the recent changes in tip/x86/fpu 2. fpu_init() initializes current->thread.fpu.state. This looks unneeded, the kernel threads no longer have FPU context and do not abuse CPU. 3. I can be easily wrong, but it looks buggy... Note that arch_dup_task_struct() doesn't allocate child->fpu.state if !tsk_used_math(parent). ...yes. It's bit-rotted a bit. No, I do not think this patch is a good idea. Perhaps I am wrong, but I think we need other changes. And they should start from init_fpu(). But the general principle of avoiding the allocation in the #NM handler and hence avoiding the need to re-enable IRQs is still a good idea, yes? This needs more discussion, but in short so far I think that fpu_alloc() from #NM exception is fine if user_mode(regs) == T. Just do_device_not_available() should simply do conditional_sti(), I think. Perhaps it can even enable irqs unconditionally, but we need to verify that this is 100% correct. And I agree that if (!tsk_used_math(tsk)) code in math_state_restore() should be removed. But not to avoid the allocation, and this needs other changes. Oleg.
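A minimal sketch of what Oleg suggests here, a conditional_sti() in the handler rather than an unconditional irq enable in math_state_restore() (hedged: modeled on the 3.19-era arch/x86/kernel/traps.c structure, untested):

```c
dotraplinkage void
do_device_not_available(struct pt_regs *regs, long error_code)
{
	enum ctx_state prev_state = exception_enter();

	/* Re-enable irqs only if the trapped context had IF set,
	 * mirroring what the other trap handlers already do: */
	conditional_sti(regs);

	math_state_restore();	/* would no longer toggle irqs itself */
	exception_exit(prev_state);
}
```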
Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs
On 03/06, Ingo Molnar wrote: * Oleg Nesterov o...@redhat.com wrote: On 03/06, Ingo Molnar wrote: * Oleg Nesterov o...@redhat.com wrote: [...] The patch above looks obviously safe, but perhaps I am paranoid too much... IMHO your hack above isn't really acceptable, even for a backport. So lets test the patch below (assuming it's the right thing to do) and move forward? OK, but please note that this patch is not backportable. If you think that -stable doesn't need this fix, then I agree. If the caller is do_device_not_available(), then we can not enable irqs before __thread_fpu_begin() + restore_fpu_checking(). 1. Preemption in between can destroy ->fpu.state initialized by fpu_finit(), __switch_to() will save the live (wrong) FPU state again. 2. kernel_fpu_begin() from irq right after __thread_fpu_begin() is not nice too. It will do __save_init_fpu() and this overwrites ->fpu.state too. Starting from v4.0 it does kernel_fpu_disable(), but the older kernels do not. Ingo, this code is really horrible and fragile. We need to cleanup it step-by-step, imho. How about the patch from David Vrabel? That seems to solve the irq-disable problem too, right? I wasn't cc'ed, I guess you mean [PATCHv4] x86, fpu: remove the logic of non-eager fpu mem allocation at the first usage http://marc.info/?l=linux-kernel&m=142564237705311&w=2 Not sure I understand it correctly after the first quick look, but 1. It conflicts with the recent changes in tip/x86/fpu 2. fpu_init() initializes current->thread.fpu.state. This looks unneeded, the kernel threads no longer have FPU context and do not abuse CPU. 3. I can be easily wrong, but it looks buggy... Note that arch_dup_task_struct() doesn't allocate child->fpu.state if !tsk_used_math(parent). Add David... No, I do not think this patch is a good idea. Perhaps I am wrong, but I think we need other changes. And they should start from init_fpu(). Oleg. 
Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs
* Oleg Nesterov o...@redhat.com wrote: On 03/06, Ingo Molnar wrote: * Oleg Nesterov o...@redhat.com wrote: [...] The patch above looks obviously safe, but perhaps I am paranoid too much... IMHO your hack above isn't really acceptable, even for a backport. So lets test the patch below (assuming it's the right thing to do) and move forward? OK, but please note that this patch is not backportable. If you think that -stable doesn't need this fix, then I agree. If the caller is do_device_not_available(), then we can not enable irqs before __thread_fpu_begin() + restore_fpu_checking(). 1. Preemption in between can destroy ->fpu.state initialized by fpu_finit(), __switch_to() will save the live (wrong) FPU state again. 2. kernel_fpu_begin() from irq right after __thread_fpu_begin() is not nice too. It will do __save_init_fpu() and this overwrites ->fpu.state too. Starting from v4.0 it does kernel_fpu_disable(), but the older kernels do not. Ingo, this code is really horrible and fragile. We need to cleanup it step-by-step, imho. How about the patch from David Vrabel? That seems to solve the irq-disable problem too, right? Thanks, Ingo
Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs
On 06/03/15 15:36, Oleg Nesterov wrote: On 03/06, David Vrabel wrote: On 06/03/15 14:01, Oleg Nesterov wrote: Not sure I understand it correctly after the first quick look, but 1. It conflicts with the recent changes in tip/x86/fpu 2. fpu_init() initializes current->thread.fpu.state. This looks unneeded, the kernel threads no longer have FPU context and do not abuse CPU. 3. I can be easily wrong, but it looks buggy... Note that arch_dup_task_struct() doesn't allocate child->fpu.state if !tsk_used_math(parent). ...yes. It's bit-rotted a bit. No, I do not think this patch is a good idea. Perhaps I am wrong, but I think we need other changes. And they should start from init_fpu(). But the general principle of avoiding the allocation in the #NM handler and hence avoiding the need to re-enable IRQs is still a good idea, yes? This needs more discussion, but in short so far I think that fpu_alloc() from #NM exception is fine if user_mode(regs) == T. I think a memory allocation here, where the only behaviour on a failure is to kill the task, is (and has always been) a crazy idea. Additionally, in a Xen PV guest the #NM handler is called with TS already cleared by the hypervisor so the handler must not enable interrupts (and thus potentially schedule another task) until after the current task's fpu state has been restored. If a task was scheduled before restoring the FPU state, TS would be clear and that task will use fpu state from a previous task. Just do_device_not_available() should simply do conditional_sti(), I think. Perhaps it can even enable irqs unconditionally, but we need to verify that this is 100% correct. And I agree that if (!tsk_used_math(tsk)) code in math_state_restore() should be removed. But not to avoid the allocation, and this needs other changes. Oleg. 
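The Xen PV race David describes, as a timeline sketch (assuming, per his description, that the hypervisor clears CR0.TS before the guest #NM handler runs):

```
task A traps with #NM              /* Xen has already cleared CR0.TS */
  handler enables irqs             /* before A's FPU state is loaded */
  irq arrives -> schedule() -> task B runs
task B executes FPU instructions   /* TS is clear, so no #NM fires   */
  -> B silently computes with whatever state was left in the
     FPU registers by the previous task
```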
Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs
On 03/06, David Vrabel wrote: On 06/03/15 15:36, Oleg Nesterov wrote: This needs more discussion, but in short so far I think that fpu_alloc() from #NM exception is fine if user_mode(regs) == T. I think a memory allocation here, where the only behaviour on a failure is to kill the task, is (and has always been) a crazy idea. Well, I do not agree. But lets discuss this later. This code should be rewritten in any case. It has more problems. Additionally, in a Xen PV guest the #NM handler is called with TS already cleared by the hypervisor so the handler must not enable interrupts (and thus potentially schedule another task) until after the current task's fpu state has been restored. If a task was scheduled before restoring the FPU state, TS would be clear and that task will use fpu state from a previous task. I can be easily wrong (especially because I know nothing about Xen ;), but I do not think this is true. Yes sure, we need to avoid preemption, but we need this in any case, even without Xen. Again, lets discuss this a bit later? Oleg.
Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs
On Thu, Mar 5, 2015 at 11:58 PM, Ingo Molnar mi...@kernel.org wrote: math_state_restore() was historically called with irqs disabled, because that's how the hardware generates the trap, and also because back in the days it was possible for it to be an asynchronous interrupt and interrupt handlers run with irqs off. These days it's always an instruction trap, and furthermore it does inevitably complex things such as memory allocation and signal processing, which is not done with irqs disabled. So keep irqs enabled. I agree with the keep irqs enabled. However, I do *not* agree with the actual patch, which doesn't do that at all. @@ -844,8 +844,9 @@ void math_state_restore(void) { struct task_struct *tsk = current; + local_irq_enable(); + There's a big difference between keep interrupts enabled (ok) and explicitly enable interrupts in random contexts (*NOT* ok). So get rid of the local_irq_enable() entirely, and replace it with a WARN_ON_ONCE(irqs_disabled()); and let's just fix the cases where this actually gets called with interrupts off. In particular, let's just make the device_not_available thing use a trap gate, not an interrupt gate. And then remove the conditional_sti() stuff. IOW, I think the starting point should be something like the attached (which doesn't do the WARN_ON_ONCE() - it should be added for debugging). *NOT* some kind of let's just enable interrupts blindly approach. This is completely untested, of course. But I don't see why we should use an interrupt gate for this. Is there anything in exception_enter() that requires it? Linus arch/x86/kernel/traps.c | 14 ++ 1 file changed, 2 insertions(+), 12 deletions(-) diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index 9d2073e2ecc9..f045ac026ff1 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -836,16 +836,12 @@ asmlinkage __visible void __attribute__((weak)) smp_threshold_interrupt(void) * * Careful.. There are problems with IBM-designed IRQ13 behaviour. 
* Don't touch unless you *really* know how it works. - * - * Must be called with kernel preemption disabled (eg with local - * local interrupts as in the case of do_device_not_available). */ void math_state_restore(void) { struct task_struct *tsk = current; if (!tsk_used_math(tsk)) { - local_irq_enable(); /* * does a slab alloc which can sleep */ @@ -856,7 +852,6 @@ void math_state_restore(void) do_group_exit(SIGKILL); return; } - local_irq_disable(); } /* Avoid __kernel_fpu_begin() right after __thread_fpu_begin() */ @@ -884,18 +879,13 @@ do_device_not_available(struct pt_regs *regs, long error_code) if (read_cr0() X86_CR0_EM) { struct math_emu_info info = { }; - conditional_sti(regs); - info.regs = regs; math_emulate(info); exception_exit(prev_state); return; } #endif - math_state_restore(); /* interrupts still off */ -#ifdef CONFIG_X86_32 - conditional_sti(regs); -#endif + math_state_restore(); exception_exit(prev_state); } NOKPROBE_SYMBOL(do_device_not_available); @@ -959,7 +949,7 @@ void __init trap_init(void) set_system_intr_gate(X86_TRAP_OF, overflow); set_intr_gate(X86_TRAP_BR, bounds); set_intr_gate(X86_TRAP_UD, invalid_op); - set_intr_gate(X86_TRAP_NM, device_not_available); + set_trap_gate(X86_TRAP_NM, device_not_available); #ifdef CONFIG_X86_32 set_task_gate(X86_TRAP_DF, GDT_ENTRY_DOUBLEFAULT_TSS); #else
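The WARN_ON_ONCE() Linus mentions but deliberately leaves out of the patch would slot in roughly like this (an untested debugging sketch, not part of the posted diff):

```c
void math_state_restore(void)
{
	struct task_struct *tsk = current;

	/* Flag any remaining caller that arrives with irqs off,
	 * instead of silently enabling interrupts for it: */
	WARN_ON_ONCE(irqs_disabled());

	if (!tsk_used_math(tsk)) {
		/* ... unchanged: allocate and init the FPU state,
		 *     SIGKILL on failure ... */
	}
	/* ... unchanged: restore the FPU registers ... */
}
```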
Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs
On 03/06, Linus Torvalds wrote: On Thu, Mar 5, 2015 at 11:58 PM, Ingo Molnar mi...@kernel.org wrote: math_state_restore() was historically called with irqs disabled, because that's how the hardware generates the trap, and also because back in the days it was possible for it to be an asynchronous interrupt and interrupt handlers run with irqs off. These days it's always an instruction trap, and furthermore it does inevitably complex things such as memory allocation and signal processing, which is not done with irqs disabled. So keep irqs enabled. I agree with the keep irqs enabled. Me too, but not for stable. This patch is wrong without other changes. IOW, I think the starting point should be something like the attached (which doesn't do the WARN_ON_ONCE() - it should be added for debugging). Yes, agreed. And. Even if we forget about stable, we need some minor changes before this one. At least we need to add preempt_disable() into kernel_fpu_disable(). So I still think that the horrible hack I sent makes sense for -stable. Just we need to cleanup (kill) it immediately. Oleg.
Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs
On Fri, Mar 6, 2015 at 9:33 AM, Linus Torvalds torva...@linux-foundation.org wrote: On Thu, Mar 5, 2015 at 11:58 PM, Ingo Molnar mi...@kernel.org wrote: math_state_restore() was historically called with irqs disabled, because that's how the hardware generates the trap, and also because back in the days it was possible for it to be an asynchronous interrupt and interrupt handlers run with irqs off. These days it's always an instruction trap, and furthermore it does inevitably complex things such as memory allocation and signal processing, which is not done with irqs disabled. So keep irqs enabled. I agree with the keep irqs enabled. However, I do *not* agree with the actual patch, which doesn't do that at all. @@ -844,8 +844,9 @@ void math_state_restore(void) { struct task_struct *tsk = current; + local_irq_enable(); + There's a big difference between keep interrupts enabled (ok) and explicitly enable interrupts in random contexts (*NOT* ok). So get rid of the local_irq_enable() entirely, and replace it with a WARN_ON_ONCE(irqs_disabled()); I like this a lot better. Ingo's patch scares me because in-kernel FPU users are fundamentally atomic: if we're using kernel FPU *and* current has important FPU state, we can't schedule safely because there's nowhere to save the in-kernel FPU state. I don't see how we would end up in this situation with CR0.TS set, but if we somehow do, then we'll corrupt something with your patch. and let's just fix the cases where this actually gets called with interrupts off. In particular, let's just make the device_not_available thing use a trap gate, not an interrupt gate. And then remove the conditional_sti() stuff. Please don't. IMO it's really nice that we don't use trap gates at all on x86_64, and I find the conditional_sti thing much nicer than having to audit all of the entry code to see whether it's safe to run it with IRQs on. This isn't to say that using trap gates would be a terrible idea in general. 
I think I'd be okay with saying that non-IST synchronous-only never-from-random-places entries should be trap gates in general. We could then audit the entry code and convert various entries. Things that maybe shouldn't be trap gates: - Anything asynchronous. - Page faults. The entry code can cause page faults recursively due to lazy vmap page table fills. Maybe this doesn't matter. - Anything using IST. That way lies madness. FWIW, I just started auditing the entry code a bit, and it's currently unsafe to use trap gates. The IRQ entry code in the interrupt macro does this: testl $3, CS-RBP(%rsp) je 1f SWAPGS That will go boom quite nicely if it happens very early in device_not_available. I suspect that fixing that will be slow and unpleasant and will significantly outweigh any benefit. --Andy
Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs
On Fri, Mar 6, 2015 at 11:23 AM, Andy Lutomirski l...@amacapital.net wrote: Please don't. IMO it's really nice that we don't use trap gates at all on x86_64, and I find the conditional_sti thing much nicer than having to audit all of the entry code to see whether it's safe to run it with IRQs on. So I'm not sure I see much difference, but I'd certainly be ok with just moving the conditional_sti() up unconditionally to be the first thing in do_device_not_available(). The point being that we still do *not* just randomly enable interrupts because we decide that the callers are wrong. Linus
Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs
On Fri, Mar 6, 2015 at 2:00 PM, Linus Torvalds torva...@linux-foundation.org wrote: On Fri, Mar 6, 2015 at 11:23 AM, Andy Lutomirski l...@amacapital.net wrote: Please don't. IMO it's really nice that we don't use trap gates at all on x86_64, and I find the conditional_sti thing much nicer than having to audit all of the entry code to see whether it's safe to run it with IRQs on. So I'm not sure I see much difference, but I'd certainly be ok with just moving the conditional_sti() up unconditionally to be the first thing in do_device_not_available(). I'd be fine with that. The important difference is that it's after swapgs. --Andy The point being that we still *not* just randomly enable interrupts because we decide that the callers are wrong. Linus -- Andy Lutomirski AMA Capital Management, LLC
Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs
* Oleg Nesterov <o...@redhat.com> wrote:

> On 03/05, Ingo Molnar wrote:
> >
> > * Oleg Nesterov wrote:
> >
> > > --- a/arch/x86/kernel/traps.c
> > > +++ b/arch/x86/kernel/traps.c
> > > @@ -774,7 +774,10 @@ void math_state_restore(void)
> > >  	struct task_struct *tsk = current;
> > >
> > >  	if (!tsk_used_math(tsk)) {
> > > -		local_irq_enable();
> > > +		bool disabled = irqs_disabled();
> > > +
> > > +		if (disabled)
> > > +			local_irq_enable();
> > >  		/*
> > >  		 * does a slab alloc which can sleep
> > >  		 */
> > > @@ -785,7 +788,9 @@ void math_state_restore(void)
> > >  			do_group_exit(SIGKILL);
> > >  			return;
> > >  		}
> > > -		local_irq_disable();
> > > +
> > > +		if (disabled)
> > > +			local_irq_disable();
> > >  	}
> >
> > Yuck!
> >
> > Is there a fundamental reason why we cannot simply enable irqs and
> > leave them enabled? Math state restore is not atomic and cannot really
> > be atomic.
>
> You know, I didn't even try to verify ;) but see below.

So I'm thinking about the attached patch.

> Most probably we can simply enable irqs, yes. But what about older
> kernels, how can we check?
>
> And let me repeat, I strongly believe that this !tsk_used_math()
> case in math_state_restore() must die. And unlazy_fpu() in
> init_fpu(). And both __restore_xstate_sig() and flush_thread()
> should not use math_state_restore() at all. At least in its current
> form.

Agreed.

> But this is obviously not -stable material.
>
> That said, I'll try to look into git history tomorrow.

So I think the reasons are:

 - historic: because math_state_restore() started out as an interrupt
   routine (from the IRQ13 days)

 - hardware imposed: the handler is executed with irqs off

 - it's probably the fastest implementation: we just run with the
   natural irqs-off state the handler executes with.

So there's nothing outright wrong about executing with irqs off in a
trap handler.

> [...] The patch above looks "obviously safe", but perhaps I am
> paranoid too much...

IMHO your hack above isn't really acceptable, even for a backport.

So let's test the patch below (assuming it's the right thing to do)
and move forward?

Thanks,

	Ingo

==>
From: Ingo Molnar <mi...@kernel.org>
Date: Fri, 6 Mar 2015 08:37:57 +0100
Subject: [PATCH] x86/fpu: Don't disable irqs in math_state_restore()

math_state_restore() was historically called with irqs disabled,
because that's how the hardware generates the trap, and also because
back in the days it was possible for it to be an asynchronous
interrupt and interrupt handlers run with irqs off.

These days it's always an instruction trap, and furthermore it does
inevitably complex things such as memory allocation and signal
processing, which is not done with irqs disabled.

So keep irqs enabled.

This might surprise in-kernel FPU users that somehow relied on
interrupts being disabled across FPU usage - but that's fundamentally
fragile anyway due to the inatomicity of FPU state restores. The trap
return will restore interrupts to its previous state, but if FPU ops
trigger math_state_restore() there's no guarantee of atomicity
anymore.

To warn about in-kernel irqs-off users of FPU state we might want to
pass 'struct pt_regs' to math_state_restore() and check the trapped
state for irqs disabled (flags has IF cleared) and kernel context -
but that's for a later patch.

Cc: Andy Lutomirski <l...@amacapital.net>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Fenghua Yu <fenghua...@intel.com>
Cc: H. Peter Anvin <h...@zytor.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Oleg Nesterov <o...@redhat.com>
Cc: Quentin Casasnovas <quentin.casasno...@oracle.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
 arch/x86/kernel/traps.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 950815a138e1..52f9e4057cee 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -844,8 +844,9 @@ void math_state_restore(void)
 {
 	struct task_struct *tsk = current;
 
+	local_irq_enable();
+
 	if (!tsk_used_math(tsk)) {
-		local_irq_enable();
 		/*
 		 * does a slab alloc which can sleep
 		 */
@@ -856,7 +857,6 @@ void math_state_restore(void)
 		do_group_exit(SIGKILL);
 		return;
 	}
-	local_irq_disable();
 }
 
 /* Avoid __kernel_fpu_begin() right after __thread_fpu_begin() */
Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs
On 03/05, Ingo Molnar wrote:
>
> * Oleg Nesterov wrote:
>
> > --- a/arch/x86/kernel/traps.c
> > +++ b/arch/x86/kernel/traps.c
> > @@ -774,7 +774,10 @@ void math_state_restore(void)
> >  	struct task_struct *tsk = current;
> >
> >  	if (!tsk_used_math(tsk)) {
> > -		local_irq_enable();
> > +		bool disabled = irqs_disabled();
> > +
> > +		if (disabled)
> > +			local_irq_enable();
> >  		/*
> >  		 * does a slab alloc which can sleep
> >  		 */
> > @@ -785,7 +788,9 @@ void math_state_restore(void)
> >  			do_group_exit(SIGKILL);
> >  			return;
> >  		}
> > -		local_irq_disable();
> > +
> > +		if (disabled)
> > +			local_irq_disable();
> >  	}
>
> Yuck!
>
> Is there a fundamental reason why we cannot simply enable irqs and
> leave them enabled? Math state restore is not atomic and cannot really
> be atomic.

You know, I didn't even try to verify ;) but see below.

Most probably we can simply enable irqs, yes. But what about older
kernels, how can we check?

And let me repeat, I strongly believe that this !tsk_used_math()
case in math_state_restore() must die. And unlazy_fpu() in
init_fpu(). And both __restore_xstate_sig() and flush_thread()
should not use math_state_restore() at all. At least in its current
form.

But this is obviously not -stable material.

That said, I'll try to look into git history tomorrow.

The patch above looks "obviously safe", but perhaps I am
paranoid too much...

Oleg.
Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs
* Oleg Nesterov <o...@redhat.com> wrote:

> math_state_restore() assumes it is called with irqs disabled, but
> this is not true if the caller is __restore_xstate_sig().
>
> This means that if ia32_fxstate == T and __copy_from_user() fails
> __restore_xstate_sig() returns with irqs disabled too. This triggers
>
> 	BUG: sleeping function called from invalid context at
> 	kernel/locking/rwsem.c:41
> 	 [<81381499>] dump_stack+0x59/0xa0
> 	 [<8106bd05>] ___might_sleep+0x105/0x110
> 	 [<8138786d>] ? _raw_spin_unlock_irqrestore+0x3d/0x70
> 	 [<8106bd8d>] __might_sleep+0x7d/0xb0
> 	 [<81385426>] down_read+0x26/0xa0
> 	 [<8138788a>] ? _raw_spin_unlock_irqrestore+0x5a/0x70
> 	 [<81136038>] print_vma_addr+0x58/0x130
> 	 [<8100239e>] signal_fault+0xbe/0xf0
> 	 [<810419aa>] sys32_rt_sigreturn+0xba/0xd0
>
> Change math_state_restore() to check irqs_disabled().
>
> Note: this is the minimal fix for -stable, it is horrible but simple.
> We need to rewrite math_state_restore(), init_fpu(), and cleanup their
> users.
>
> Signed-off-by: Oleg Nesterov <o...@redhat.com>
> Cc: <sta...@vger.kernel.org>
> ---
>  arch/x86/kernel/traps.c |    9 +++++++--
>  1 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 51c4658..7310e0e 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -774,7 +774,10 @@ void math_state_restore(void)
>  	struct task_struct *tsk = current;
>
>  	if (!tsk_used_math(tsk)) {
> -		local_irq_enable();
> +		bool disabled = irqs_disabled();
> +
> +		if (disabled)
> +			local_irq_enable();
>  		/*
>  		 * does a slab alloc which can sleep
>  		 */
> @@ -785,7 +788,9 @@ void math_state_restore(void)
>  			do_group_exit(SIGKILL);
>  			return;
>  		}
> -		local_irq_disable();
> +
> +		if (disabled)
> +			local_irq_disable();
>  	}

Yuck!

Is there a fundamental reason why we cannot simply enable irqs and
leave them enabled? Math state restore is not atomic and cannot really
be atomic.

[ A potential worry would be kernel code using vector instructions in
  irqs-off regions - but that's totally broken anyway so not a big
  worry IMO, we might even want to warn about it. ]

But I might be missing something?

Thanks,

	Ingo