Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())
* Jiri Kosina wrote:

> On Sat, 21 Feb 2015, Ingo Molnar wrote:
>
> > (It does have some other requirements, such as making all syscalls
> > interruptible to a 'special' signalling method that only live
> > patching triggers - even syscalls that are under the normal ABI
> > uninterruptible, such as sys_sync().)
>
> BTW I didn't really understand this -- could you please elaborate what
> exactly you propose to do here in your "simplified" patching method
> (i.e. serializing everybody at the kernel boundary) for
> TASK_UNINTERRUPTIBLE processes?

So I'd try to separate out the two main categories of uninterruptible
sleepers:

 - those who just serialize with other local CPUs/tasks relatively
   quickly

 - those who are waiting for some potentially very long and open-ended
   request [such as IO, potentially network IO].

I'd only touch the latter: a prominent example would be sys_sync(). I'd
leave alone the myriad other uninterruptible sleepers.

> But I didn't understand your claims regarding uninterruptible sleeps
> in your paragraph above. sys_sync() is one thing, that's just waiting
> uninterruptibly for completion. But how about all the mutex waiters in
> TASK_UNINTERRUPTIBLE, for example?

I'd not touch those - unless they are waiting for something that will
not be done by the time we park all tasks: for example NFS might have
uninterruptible sleeps, and sys_sync() will potentially do IO for
minutes.

I think it would be the exception, not the rule - but it would give us
an approach that allows us to touch 'any' kernel code if its wait times
are unreasonably long or open ended.

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())
On Sat, 21 Feb 2015, Ingo Molnar wrote:

> (It does have some other requirements, such as making all syscalls
> interruptible to a 'special' signalling method that only live patching
> triggers - even syscalls that are under the normal ABI
> uninterruptible, such as sys_sync().)

BTW I didn't really understand this -- could you please elaborate what
exactly you propose to do here in your "simplified" patching method
(i.e. serializing everybody at the kernel boundary) for
TASK_UNINTERRUPTIBLE processes?

That actually seems to be the most crucial problem to me in this
respect. Other things are rather implementation details; no matter
whether we are sending normal SIGCONT or SIGPATCHING with the special
semantics you have described above, at the end of the day we end up
calling kick_process() for the task in question, and that makes both
interruptible sleepers and CPU hogs go through the "checkpoint".
SIGPATCHING would then be "just" an improvement of this, making sure
that EINTR doesn't spuriously get leaked to userspace.

But I didn't understand your claims regarding uninterruptible sleeps in
your paragraph above. sys_sync() is one thing, that's just waiting
uninterruptibly for completion. But how about all the mutex waiters in
TASK_UNINTERRUPTIBLE, for example?

Thanks a lot,

--
Jiri Kosina
SUSE Labs
Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())
[ live-patching@ ML added to CC here as well ]

On Sun, 22 Feb 2015, Ingo Molnar wrote:

> > BTW how exactly do you envision this will work? Do I understand your
> > proposal correctly that EINTR will be "handled" somewhere in the
> > "live patching special signal handler" and then have the interrupted
> > syscall restarted?
>
> If you want to think about it in signal handling terms then it's a new
> automatic in-kernel handler, which does not actually return back to
> user-mode at all.
>
> We can do it via the signals machinery (mainly to reuse all the
> existing signal_pending() code in various syscalls), or via new TIF
> flags like the user work machinery: the principle is the same:
> interrupt out of syscall functions into a central place and restart
> them, and return to user-space later on as if a single call had been
> performed.
>
> This necessarily means some changes to syscalls, but not
> insurmountable ones - and checkpoint/restore support would want to
> have similar changes in any case so we can hit two birds with the same
> stone.

Understood. I think this by itself really would be a huge improvement on
how to force tasks to converge to the new universe without confusing the
userspace via artificial signals (i.e. even with the lazy migration
approaches, without a full serialization checkpoint).

As I said yesterday, I am pretty sure we'll experiment with this in
kgraft.

Thanks for bringing this idea up,

--
Jiri Kosina
SUSE Labs
Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())
* Jiri Kosina wrote:

> On Sat, 21 Feb 2015, Ingo Molnar wrote:
>
> > > Plus a lot of processes would see EINTR, causing more havoc.
> >
> > Parking threads safely in user mode does not require the propagation
> > of syscall interruption to user-space.
>
> BTW how exactly do you envision this will work? Do I understand your
> proposal correctly that EINTR will be "handled" somewhere in the "live
> patching special signal handler" and then have the interrupted syscall
> restarted?

If you want to think about it in signal handling terms then it's a new
automatic in-kernel handler, which does not actually return back to
user-mode at all.

We can do it via the signals machinery (mainly to reuse all the existing
signal_pending() code in various syscalls), or via new TIF flags like
the user work machinery: the principle is the same: interrupt out of
syscall functions into a central place and restart them, and return to
user-space later on as if a single call had been performed.

This necessarily means some changes to syscalls, but not insurmountable
ones - and checkpoint/restore support would want to have similar changes
in any case, so we can hit two birds with the same stone.

> Even without EINTR propagation to userspace, this would make a lot of
> new syscall restarts that were not there before, [...]

That's only a problem if you do system call restarts by restarting them
via a user-space system call restart handler - I'm not proposing that.

I'm suggesting a completely user-space transparent way to execute long
lasting system calls in a smarter way. I.e. it would not be observable
via strace either.

Thanks,

	Ingo
Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())
On Sun, 22 Feb 2015, Ingo Molnar wrote:

> I am making specific technical arguments, but you attempted to
> redirect my very specific arguments towards 'differences in
> philosophy' and 'where to draw the line'. Let's discuss the merits,
> not brush them aside as 'philosophical differences' or a made-up
> category of 'consistency models'.
>
> Anyway, let me try to reboot this discussion back to technological
> details by summing up my arguments in another mail.

Sounds good, thanks. Don't get me wrong -- I am not really opposing your
solution (because it's of course quite close to what kgraft is doing :p
), but I'd like to make sure we understand each other well.

I still have to think a little bit more about how to handle kthreads
properly even in your proposed solution (i.e. how exactly it is superior
to what kgraft is currently doing in this respect). We'd have to go
through all of them anyway, and make them parkable, right?

Thanks,

--
Jiri Kosina
SUSE Labs
Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())
On Sat, 21 Feb 2015, Ingo Molnar wrote:

> > Plus a lot of processes would see EINTR, causing more havoc.
>
> Parking threads safely in user mode does not require the propagation
> of syscall interruption to user-space.

BTW how exactly do you envision this will work? Do I understand your
proposal correctly that EINTR will be "handled" somewhere in the "live
patching special signal handler" and then have the interrupted syscall
restarted?

Even without EINTR propagation to userspace, this would make for a lot
of new syscall restarts that were not there before, and I have yet to be
convinced that we are not going to cause a lot of new user-visible
breakage with this.

Yes, the breakage would be caused by kernel bugs (I mostly envision
drivers to be problematic in this area) that would be nice to have
fixed, but the user experience that will come out of it will be just
completely horrible.

Or did I misunderstand what you are really proposing?

Thanks,

--
Jiri Kosina
SUSE Labs
Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())
* Jiri Kosina wrote:

> > > Or think of kernel that has some 3rd party vendor module loaded,
> > > and this module spawning a kthread that is not capable of parking
> > > itself.
> >
> > The kernel will refuse to live patch until the module is fixed. It
> > is a win by any measure.
>
> Depends on your point of view. If you are an administrator of the
> vulnerable system, you have no to very little clue why patching your
> system is not possible (and what you should do to have it fixed in the
> coming days, before your system is taken down by attackers, for
> example), and would be really sad to be forced to run with an
> unpatched system.

I don't see what precise argument you are making here. That we are
unable to fix possible bugs in binary only modules? News at 11.

Or are you making the argument that we should design our core kernel
capabilities to be deficient, just to accommodate hypothetical scenarios
with buggy third party modules that create unparkable kernel threads?
That would be a crazy proposition.

> > Why would __notrace__ functions be a problem in the 'simple' method?
> > Live patching with this method will work even if ftrace is not built
> > in, we can just patch out the function in the simplest possible
> > fashion, because we do it atomically and don't have to worry about
> > per task 'transition state' - like kGraft does.
> >
> > It's a massive simplification and there's no need to rely on
> > ftrace's mcount trick. (Sorry Steve!)
>
> This would indeed be possible iff we take the "global synchronizing
> point" approach you are proposing. [...]

Yes.

> [...] Still technicalities to be solved (what happens if you are
> patching a function that already has ftrace on it, etc), but probably
> nothing principal.

Correct.

> > > So it's not black and white, it's really a rather philosophical
> > > question where to draw the line (and make sure that everyone is
> > > aware of where the line is and what it means).
> >
> > Out of the three examples you mentioned, two are actually an
> > advantage to begin with - so I'd suggest you stop condescending me
> > ...
>
> Ugh, if you feel my e-mails are attempts to condescend to you, I am
> really surprised; I thought we were discussing technical details. If
> you feel otherwise, we should probably just stop discussing then.

I am making specific technical arguments, but you attempted to redirect
my very specific arguments towards 'differences in philosophy' and
'where to draw the line'. Let's discuss the merits, not brush them aside
as 'philosophical differences' or a made-up category of 'consistency
models'.

Anyway, let me try to reboot this discussion back to technological
details by summing up my arguments in another mail.

Thanks,

	Ingo
Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())
To make sure that this thread doesn't conclude in a void, here's my take
on it:

- what's currently already there is the simplest-of-simple methods; it
  allows you to apply context-less patches (such as adding bounds
  checking to the beginning of a syscall, etc.), which turns out to
  cover a vast portion of applicable CVEs

- it can always be made more clever; the patch author always has to
  know the version of the kernel he's preparing the patch for anyway
  (the live patch and the kernel are closely tied together)

- the proposal to force sleeping or CPU-hogging tasks through a defined
  safe checkpoint using a fake sort-of signal without any other side
  effects might be useful even for kGraft and also for other proposed
  approaches. I think we'll try to implement this as an optimization
  for kGraft and we'll see how (a) fast (b) non-intrusive we would be
  able to make it. If it turns out to be successful, we can then just
  reuse it in the upstream solution (whatever that would be)

Thanks,

--
Jiri Kosina
SUSE Labs
Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())
On Sat, 21 Feb 2015, Ingo Molnar wrote:

> > I see the difference, but I am afraid you are simplifying the
> > situation a little bit too much.
> >
> > There will always be properties of patches that will make them
> > inapplicable in a "live patching" way by design. Think of data
> > structure layout changes (*).
>
> Yes.

The (*) note described that it's actually in theory possible, but the
price to pay is very high. So it's possible to write a patch that does
this, and have it use a different consistency model, which would take
care of the data structure layout change (and the fact that this
particular consistency model is very expensive would be a clear fact).
Your "simple one-method-to-rule-them-all approach" is not sufficient for
this. Even such a stretched case doesn't justify the existence of
consistency models for you?

> > Or think of kernel that has some 3rd party vendor module loaded, and
> > this module spawning a kthread that is not capable of parking
> > itself.
>
> The kernel will refuse to live patch until the module is fixed. It is
> a win by any measure.

Depends on your point of view. If you are an administrator of the
vulnerable system, you have no to very little clue why patching your
system is not possible (and what you should do to have it fixed in the
coming days, before your system is taken down by attackers, for
example), and would be really sad to be forced to run with an unpatched
system.

I see your point that this would be a good lever we'd have over 3rd
party vendors, but OTOH we'd be taking OS users as hostages in some
sense.

> > Or think of patching __notrace__ functions.
>
> Why would __notrace__ functions be a problem in the 'simple' method?
> Live patching with this method will work even if ftrace is not built
> in, we can just patch out the function in the simplest possible
> fashion, because we do it atomically and don't have to worry about per
> task 'transition state' - like kGraft does.
>
> It's a massive simplification and there's no need to rely on ftrace's
> mcount trick. (Sorry Steve!)

This would indeed be possible iff we take the "global synchronizing
point" approach you are proposing. Still technicalities to be solved
(what happens if you are patching a function that already has ftrace on
it, etc.), but probably nothing principal.

> > So it's not black and white, it's really a rather philosophical
> > question where to draw the line (and make sure that everyone is
> > aware of where the line is and what it means).
>
> Out of the three examples you mentioned, two are actually an advantage
> to begin with - so I'd suggest you stop condescending me ...

Ugh, if you feel my e-mails are attempts to condescend to you, I am
really surprised; I thought we were discussing technical details. If you
feel otherwise, we should probably just stop discussing then.

Thanks,

--
Jiri Kosina
SUSE Labs
Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())
* Jiri Kosina wrote: > On Sat, 21 Feb 2015, Ingo Molnar wrote: > > > > But admittedly, if we reserve a special sort-of > > > signal for making the tasks pass through a safe > > > checkpoint (and make them queue there (your solution) > > > or make them just pass through it and continue > > > (current kGraft)), it might reduce the time this > > > effort needs considerably. > > > > Well, I think the 'simple' method has another > > advantage: it can only work if all those problems > > (kthreads, parking machinery) are solved, because the > > patching will occur only everything is quiescent. > > > > So no shortcuts are allowed, by design. It starts from > > a fundamentally safe, robust base, while all the other > > approaches I've seen were developed in a 'lets get the > > patching to work, then iteratively try to make it > > safer' which really puts the cart before the horse. > > > > So to put it more bluntly: I don't subscribe to the > > whole 'consistency model' nonsense: that's just crazy > > talk IMHO. > > > > Make it fundamentally safe from the very beginning, the > > 'simple method' I suggested _won't live patch the > > kernel_ if the mechanism has a bug and some kthread or > > task does not park. See the difference? > > I see the difference, but I am afraid you are simplifying > the situation a litle bit too much. > > There will always be properties of patches that will make > them unapplicable in a "live patching" way by design. > Think of data structure layout changes (*). Yes. > Or think of kernel that has some 3rd party vendor module > loaded, and this module spawning a ktrehad that is not > capable of parking itself. The kernel will refuse to live patch until the module is fixed. It is a win by any measure. > Or think of patching __notrace__ functions. Why would __notrace__ functions be a problem in the 'simple' method? 
Live patching with this method will work even if ftrace is not built in: we can just patch out the function in the simplest possible fashion, because we do it atomically and don't have to worry about per-task 'transition state' - like kGraft does. It's a massive simplification and there's no need to rely on ftrace's mcount trick. (Sorry Steve!) > Etc. > > So it's not black and white, it's really a rather > philosophical question where to draw the line (and make > sure that everyone is aware of where the line is and what > it means). Out of the three examples you mentioned, two are actually an advantage to begin with - so I'd suggest you stop condescending to me ... > This is exactly why we came up with consistency models -- > they allow you to draw the line at well-defined places. To still be blunt: they are snake oil, a bit like 'security modules': allowing upstream merges by consensus between competing pieces of crap, instead of coming up with a single good design that we can call the Linux kernel live patching method ... Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())
On Sat, 21 Feb 2015, Ingo Molnar wrote: > > But admittedly, if we reserve a special sort-of signal > > for making the tasks pass through a safe checkpoint (and > > make them queue there (your solution) or make them just > > pass through it and continue (current kGraft)), it might > > reduce the time this effort needs considerably. > > Well, I think the 'simple' method has another advantage: it can only > work if all those problems (kthreads, parking machinery) are solved, > because the patching will occur only once everything is quiescent. > > So no shortcuts are allowed, by design. It starts from a fundamentally > safe, robust base, while all the other approaches I've seen were > developed in a 'let's get the patching to work, then iteratively try to > make it safer' way, which really puts the cart before the horse. > > So to put it more bluntly: I don't subscribe to the whole 'consistency > model' nonsense: that's just crazy talk IMHO. > > Make it fundamentally safe from the very beginning: the 'simple method' > I suggested _won't live patch the kernel_ if the mechanism has a bug and > some kthread or task does not park. See the difference? I see the difference, but I am afraid you are simplifying the situation a little bit too much. There will always be properties of patches that will make them inapplicable in a "live patching" way by design. Think of data structure layout changes (*). Or think of a kernel that has some 3rd party vendor module loaded, and this module spawning a kthread that is not capable of parking itself. Or think of patching __notrace__ functions. Etc. So it's not black and white, it's really a rather philosophical question where to draw the line (and make sure that everyone is aware of where the line is and what it means). This is exactly why we came up with consistency models -- they allow you to draw the line at well-defined places. 
(*) there are some rather crazy ideas about how to make this work, but the price you pay is basically always an unacceptable slowdown Thanks, -- Jiri Kosina SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())
* Jiri Kosina wrote: > On Sat, 21 Feb 2015, Ingo Molnar wrote: > > > > This means that each and every sleeping task in the > > > system has to be woken up in some way (sending a > > > signal ...) to exit from a syscall it is sleeping in. > > > Same for CPU hogs. All kernel threads need to be > > > parked. > > > > Yes - although I'd not use signals for this, signals > > have side effects - but yes, something functionally > > equivalent. > > This is similar to a proposal I came up with not too > long ago: a fake signal (analogous to, but not > exactly the same as, what the freezer uses) that will just > make tasks cycle through the userspace/kernelspace boundary > without other side effects. Yeah. > > > This is exactly what you need to do for kGraft to > > > complete patching. > > > > My understanding of kGraft is that by default you allow > > tasks to continue 'in the new universe' after they are > > patched. Has this changed or have I misunderstood the > > concept? > > What Vojtech meant here, I believe, is that the effort > you have to make to force all tasks to queue themselves > to park them in a safe place and then restart their > execution is exactly the same as the effort you have to > make to make kGraft converge and succeed. Yes - with the difference that in the 'simple' method I suggested we'd have kpatch's patching robustness (all-or-nothing atomic patching - no intermediate patching state, no reliance on mcount entries, no doubt about which version of the function is working - sans kpatch's stack trace logic), combined with kGraft's task parking robustness. > But admittedly, if we reserve a special sort-of signal > for making the tasks pass through a safe checkpoint (and > make them queue there (your solution) or make them just > pass through it and continue (current kGraft)), it might > reduce the time this effort needs considerably. 
Well, I think the 'simple' method has another advantage: it can only work if all those problems (kthreads, parking machinery) are solved, because the patching will occur only once everything is quiescent. So no shortcuts are allowed, by design. It starts from a fundamentally safe, robust base, while all the other approaches I've seen were developed in a 'let's get the patching to work, then iteratively try to make it safer' way, which really puts the cart before the horse. So to put it more bluntly: I don't subscribe to the whole 'consistency model' nonsense: that's just crazy talk IMHO. Make it fundamentally safe from the very beginning: the 'simple method' I suggested _won't live patch the kernel_ if the mechanism has a bug and some kthread or task does not park. See the difference? Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())
On Sat, 21 Feb 2015, Ingo Molnar wrote: > > This means that each and every sleeping task in the system has to be > > woken up in some way (sending a signal ...) to exit from a syscall it > > is sleeping in. Same for CPU hogs. All kernel threads need to be > > parked. > > Yes - although I'd not use signals for this, signals have > side effects - but yes, something functionally equivalent. This is similar to a proposal I came up with not too long ago: a fake signal (analogous to, but not exactly the same as, what the freezer uses) that will just make tasks cycle through the userspace/kernelspace boundary without other side effects. > > This is exactly what you need to do for kGraft to complete patching. > > My understanding of kGraft is that by default you allow tasks to > continue 'in the new universe' after they are patched. Has this changed > or have I misunderstood the concept? What Vojtech meant here, I believe, is that the effort you have to make to force all tasks to queue themselves to park them in a safe place and then restart their execution is exactly the same as the effort you have to make to make kGraft converge and succeed. But admittedly, if we reserve a special sort-of signal for making the tasks pass through a safe checkpoint (and make them queue there (your solution) or make them just pass through it and continue (current kGraft)), it might reduce the time this effort needs considerably. -- Jiri Kosina SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())
* Josh Poimboeuf wrote: > On Fri, Feb 20, 2015 at 10:46:13PM +0100, Vojtech Pavlik wrote: > > On Fri, Feb 20, 2015 at 08:49:01PM +0100, Ingo Molnar wrote: > > > > > I.e. it's in essence the strong stop-all atomic > > > patching model of 'kpatch', combined with the > > > reliable avoidance of kernel stacks that 'kgraft' > > > uses. > > > > > That should be the starting point, because it's the > > > most reliable method. > > > > In the consistency models discussion, this was marked > > the "LEAVE_KERNEL+SWITCH_KERNEL" model. It's indeed the > > strongest model of all, but also comes at the highest > > cost in terms of impact on running tasks. It's so high > > (the interruption may be seconds or more) that it was > > deemed not worth implementing. > > Yeah, this is way too disruptive to the user. > > Even the comparatively tiny latency caused by kpatch's > use of stop_machine() was considered unacceptable by > some. Unreliable, unrobust patching is even more disruptive... What I think makes it long term fragile is that we combine two unrobust, unlikely mechanisms: the chance that a task just happens to execute a patched function, with the chance that debug information is unreliable. For example tracing patching got debugged to a fair degree because we rely on the patching for actual tracing functionality. Even with that relatively robust usage model we had our crises ... I just don't see how a stack backtrace based live patching method can become robust in the long run. > Plus a lot of processes would see EINTR, causing more > havoc. Parking threads safely in user mode does not require the propagation of syscall interruption to user-space. (It does have some other requirements, such as making all syscalls interruptible to a 'special' signalling method that only live patching triggers - even syscalls that are under the normal ABI uninterruptible, such as sys_sync().) 
On the other hand, if it's too slow, people will work on improving signal propagation latencies: making syscalls more readily interruptible and more seamlessly restartable has various other advantages beyond live kernel patching. I.e. it's a win-win scenario and will improve various areas of the kernel in terms of syscall interruptibility latencies. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())
* Vojtech Pavlik wrote: > On Fri, Feb 20, 2015 at 08:49:01PM +0100, Ingo Molnar wrote: > > > > ... the choice the sysadmins have here is either have > > > the system running in an intermediate state, or have > > > the system completely dead for the *same time*. > > > Because to finish the transition successfully, all > > > the tasks have to be woken up in any case. > > > > That statement is false: an 'intermediate state' system > > where 'new' tasks are still running is still running > > and will interfere with the resolution of 'old' tasks. > > Can you suggest a way how they would interfere? The > transition happens on entering or returning from a > syscall, there is no influence between individual tasks. Well, a 'new' task does not stop executing after returning from the syscall, right? If it's stopped (until all patching is totally complete) then you are right and I concede your point. If it's allowed to continue its workload then my point stands: subsequent execution of 'new' tasks can interfere with, slow down, interact with 'old' tasks trying to get patched. > > I think you misunderstood: the 'simple' method I > > outlined does not just 'synchronize', it actually > > executes the live patching atomically, once all tasks > > are gathered and we know they are _all_ in a safe > > state. > > The 'simple' method has to catch and freeze all tasks one > by one in syscall entry/exit, at the kernel/userspace > boundary, until all are frozen and then patch the system > atomically. Correct. > This means that each and every sleeping task in the > system has to be woken up in some way (sending a signal > ...) to exit from a syscall it is sleeping in. Same for > CPU hogs. All kernel threads need to be parked. Yes - although I'd not use signals for this, signals have side effects - but yes, something functionally equivalent. > This is exactly what you need to do for kGraft to > complete patching. 
My understanding of kGraft is that by default you allow tasks to continue 'in the new universe' after they are patched. Has this changed or have I misunderstood the concept? Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())
To make sure that this thread doesn't conclude in a void, here's my take on it: - what's currently already there is the simplest-of-simple methods; it allows you to apply context-less patches (such as adding bounds checking to the beginning of a syscall, etc.), which turns out to cover a vast portion of applicable CVEs - it can always be made more clever; the patch author always has to know the version of the kernel he's preparing the patch for anyway (the live patch and the kernel are closely tied together) - the proposal to force sleeping or CPU-hogging tasks through a defined safe checkpoint using a fake sort-of signal without any other side effects might be useful even for kGraft and also for other proposed approaches. I think we'll try to implement this as an optimization for kGraft and we'll see how (a) fast and (b) non-intrusive we are able to make it. If it turns out to be successful, we can then just reuse it in the upstream solution (whatever that would be) Thanks, -- Jiri Kosina SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())
On Sat, 21 Feb 2015, Ingo Molnar wrote: > > I see the difference, but I am afraid you are simplifying the > > situation a little bit too much. > > > > There will always be properties of patches that will make them > > inapplicable in a "live patching" way by design. Think of data > > structure layout changes (*). > Yes. The (*) note described that it's actually in theory possible, but the price to pay is very high. So it's possible to write a patch that does this, and have it use a different consistency model, which would take care of the data structure layout change (and the fact that this particular consistency model is very expensive would be a clear fact). Your simple one-method-to-rule-them-all approach is not sufficient for this. Doesn't even such a stretched case justify the existence of consistency models for you? > > Or think of a kernel that has some 3rd party vendor module loaded, > > and this module spawning a kthread that is not capable of parking > > itself. > The kernel will refuse to live patch until the module is fixed. It is > a win by any measure. Depends on your point of view. If you are an administrator of the vulnerable system, you have no to very little clue why patching your system is not possible (and what you should do to have it fixed in the coming days, before your system is taken down by attackers, for example), and would be really sad to be forced to run with an unpatched system. I see your point that this would be a good lever we'd have on 3rd party vendors, but OTOH we'd be taking OS users hostage in some sense. > > Or think of patching __notrace__ functions. > Why would __notrace__ functions be a problem in the 'simple' method? > Live patching with this method will work even if ftrace is not built > in: we can just patch out the function in the simplest possible > fashion, because we do it atomically and don't have to worry about > per-task 'transition state' - like kGraft does. This would indeed be possible iff we take the "global synchronizing point" approach you are proposing. There are still technicalities to be solved (what happens if you are patching a function that already has ftrace on it, etc.), but probably nothing fundamental. > > So it's not black and white, it's really a rather philosophical > > question where to draw the line (and make sure that everyone is aware > > of where the line is and what it means). > Out of the three examples you mentioned, two are actually an advantage > to begin with - so I'd suggest you stop condescending to me ... Ugh, if you feel my e-mails are attempts to condescend to you, I am really surprised; I thought we were discussing technical details. If you feel otherwise, we should probably just stop discussing. Thanks, -- Jiri Kosina SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())
On Fri, Feb 20, 2015 at 10:46:13PM +0100, Vojtech Pavlik wrote: > On Fri, Feb 20, 2015 at 08:49:01PM +0100, Ingo Molnar wrote: > > I.e. it's in essence the strong stop-all atomic patching > > model of 'kpatch', combined with the reliable avoidance of > > kernel stacks that 'kgraft' uses. > > > That should be the starting point, because it's the most > > reliable method. > > In the consistency models discussion, this was marked the > "LEAVE_KERNEL+SWITCH_KERNEL" model. It's indeed the strongest model of > all, but also comes at the highest cost in terms of impact on running > tasks. It's so high (the interruption may be seconds or more) that it > was deemed not worth implementing. Yeah, this is way too disruptive to the user. Even the comparatively tiny latency caused by kpatch's use of stop_machine() was considered unacceptable by some. Plus a lot of processes would see EINTR, causing more havoc. -- Josh
Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())
On Fri, Feb 20, 2015 at 08:49:01PM +0100, Ingo Molnar wrote: > > ... the choice the sysadmins have here is either have the > > system running in an intermediate state, or have the > > system completely dead for the *same time*. Because to > > finish the transition successfully, all the tasks have to > > be woken up in any case. > > That statement is false: an 'intermediate state' system > where 'new' tasks are still running is still running and > will interfere with the resolution of 'old' tasks. Can you suggest a way how they would interfere? The transition happens on entering or returning from a syscall, there is no influence between individual tasks. If you mean that the patch author has to consider the fact that both old and new code will be running simultaneously, then yes, they have to. > > But I do get your point; what you are basically saying is > > that your preference is what kgraft is doing, and option > > to allow for a global synchronization point should be > > added in parallel to the gradual lazy migration. > > I think you misunderstood: the 'simple' method I outlined > does not just 'synchronize', it actually executes the live > patching atomically, once all tasks are gathered and we > know they are _all_ in a safe state. The 'simple' method has to catch and freeze all tasks one by one in syscall entry/exit, at the kernel/userspace boundary, until all are frozen and then patch the system atomically. This means that each and every sleeping task in the system has to be woken up in some way (sending a signal ...) to exit from a syscall it is sleeping in. Same for CPU hogs. All kernel threads need to be parked. This is exactly what you need to do for kGraft to complete patching. This may take a significant amount of time to achieve and you won't be able to use a userspace script to send the signals, because it'd be frozen immediately. > I.e. 
it's in essence the strong stop-all atomic patching > model of 'kpatch', combined with the reliable avoidance of > kernel stacks that 'kgraft' uses. > That should be the starting point, because it's the most > reliable method. In the consistency models discussion, this was marked the "LEAVE_KERNEL+SWITCH_KERNEL" model. It's indeed the strongest model of all, but also comes at the highest cost in terms of impact on running tasks. It's so high (the interruption may be seconds or more) that it was deemed not worth implementing. -- Vojtech Pavlik Director SUSE Labs
Re: [PATCH 1/3] sched: add sched_task_call()
On Fri, Feb 20, 2015 at 09:08:49PM +0100, Ingo Molnar wrote: > * Josh Poimboeuf wrote: > > On Fri, Feb 20, 2015 at 10:50:03AM +0100, Ingo Molnar wrote: > > > * Jiri Kosina wrote: > > > > > > > Alright, so to sum it up: > > > > > > > > - current stack dumping (even looking at /proc//stack) is not > > > > guaranteed to yield "correct" results in case the task is running at > > > > the > > > > time the stack is being examined > > > > > > Don't even _think_ about trying to base something as > > > dangerous as live patching the kernel image on the concept > > > of: > > > > > > 'We can make all stack backtraces reliably correct all > > >the time, with no false positives, with no false > > >negatives, 100% of the time, and quickly discover and > > >fix bugs in that'. > > > > > > It's not going to happen: > > > > > > - The correctness of stacktraces partially depends on > > >tooling and we don't control those. > > > > What tooling are you referring to? > > > > Sure, we rely on the compiler to produce the correct > > assembly for putting frame pointers on the stack. [...] > > We also rely on all assembly code to have valid frames all > the time - and this is far from a given, we've had many > bugs in that area. Are you talking about kernel bugs or compiler bugs? Either way, would those bugs not be caught by the stack unwinding validation improvements I proposed? > > [...] But that's pretty straightforward. The kernel > > already relies on the compiler to do plenty of things > > which are much more complex than that. > > So this claim scares me because it's such nonsense: > > We rely on the compiler to create _functional code_ for us. > If the compiler messes up, then functionality - actual code - > breaks. > > 'Valid stack frame' is not functional code, it's in essence > debug info. Only because we _treat_ it like debug info. I'm proposing that we treat it like functional code. So that yes, if the compiler messes up frame pointers, actual code breaks. 
(But we could detect that before we do the patch, so it doesn't destabilize the system.) If gcc can't do frame pointers right, it's a *major* gcc bug. It's nothing more than: push %rbp mov %rsp,%rbp at the beginning of the function, and: pop %rbp right before the function returns. If you really think the compiler could possibly mess that up, I can probably write a simple script, executed by the Makefile, which validates that every C function in the kernel has the above pattern. > If a backtrace is wrong in some rare, racy case > then that won't break the kernel, it just gives slightly > misleading debug info in most cases. > > Now you want to turn 'debug info' to have functional > relevance, for something as intrusive as patching the > kernel code. > > You really need to accept this distinction! Yes, I already realize that this would be the first time the kernel relies on stack traces being 100% correct. Peter and I already had some discussion about it elsewhere in this thread. Just because we haven't relied on its correctness in the past doesn't mean we can't rely on it in the future. > > > - More importantly, there's no strong force that ensures > > >we can rely on stack backtraces: correcting bad > > >stack traces depends on people hitting those > > >functions and situations that generate them, seeing > > >a bad stack trace, noticing that it's weird and > > >correcting whatever code or tooling quirk causes the > > >stack entry to be incorrect. > > > > > > Essentially unlike other kernel code which breaks stuff > > > if it's incorrect, there's no _functional_ dependence > > > on stack traces, so live patching would be the first > > > (and pretty much only) thing that breaks on bad stack > > > traces ... 
> > > > First, there are several things we do to avoid anomalies: > > > > - we don't patch asm functions > > the problem with asm functions isn't that we might want to > patch them (although sometimes we might want to: especially > the more interesting security fixes tend to be in assembly > code), it's that asm functions can (easily) mess up debug > info, without that being apparent in any functional way. Again, that's why I proposed the stack unwinding validation improvements. > > - we only walk the stack of sleeping tasks > > sleeping tasks are affected by debug info bugs too. > > > Second, currently the stack unwinding code is rather crude and > > doesn't do much in the way of error handling. > > That's a big red flag in my book ... I wasn't talking about my patch set. I was talking about the existing stack dumping code in the kernel. -- Josh
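[Editorial note: as a rough illustration of the Makefile-driven validation script Josh floats above, the sketch below scans `objdump -d`-style disassembly and flags every function that does not open with the standard `push %rbp; mov %rsp,%rbp` frame-pointer prologue. It is a hypothetical sketch only: the function name, the input handling, and the lack of exceptions for leaf functions or hand-written assembly are assumptions, not anything proposed in this thread.]

```python
import re

# The gcc frame-pointer prologue on x86-64:
#     push %rbp
#     mov  %rsp,%rbp
PROLOGUE = ["push %rbp", "mov %rsp,%rbp"]

def check_frame_pointers(disasm):
    """Scan `objdump -d`-style output and return the names of functions
    whose first two instructions are not the standard prologue."""
    offenders = []
    name, insns = None, []

    def finish():
        if name is not None and insns[:2] != PROLOGUE:
            offenders.append(name)

    for line in disasm.splitlines():
        m = re.match(r'^[0-9a-f]+ <(\S+)>:$', line)
        if m:
            finish()                  # close out the previous function
            name, insns = m.group(1), []
        elif name is not None and ':' in line:
            # "  addr:\topcode bytes\tmnemonic operands" -> "mnemonic operands"
            insns.append(' '.join(line.split('\t')[-1].split()))
    finish()
    return offenders
```

A real version would need whitelists (leaf functions, .cold sections, asm routines), which is exactly the kind of exception list the thread worries about.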
Re: [PATCH 1/3] sched: add sched_task_call()
* Josh Poimboeuf wrote: > On Fri, Feb 20, 2015 at 10:50:03AM +0100, Ingo Molnar wrote: > > > > * Jiri Kosina wrote: > > > > > Alright, so to sum it up: > > > > > > - current stack dumping (even looking at /proc//stack) is not > > > guaranteed to yield "correct" results in case the task is running at > > > the > > > time the stack is being examined > > > > Don't even _think_ about trying to base something as > > dangerous as live patching the kernel image on the concept > > of: > > > > 'We can make all stack backtraces reliably correct all > >the time, with no false positives, with no false > >negatives, 100% of the time, and quickly discover and > >fix bugs in that'. > > > > It's not going to happen: > > > > - The correctness of stacktraces partially depends on > >tooling and we don't control those. > > What tooling are you referring to? > > Sure, we rely on the compiler to produce the correct > assembly for putting frame pointers on the stack. [...] We also rely on all assembly code to have valid frames all the time - and this is far from a given, we've had many bugs in that area. > [...] But that's pretty straightforward. The kernel > already relies on the compiler to do plenty of things > which are much more complex than that. So this claim scares me because it's such nonsense: We rely on the compiler to create _functional code_ for us. If the compiler messes up, then functionality - actual code - breaks. 'Valid stack frame' is not functional code, it's in essence debug info. If a backtrace is wrong in some rare, racy case then that won't break the kernel, it just gives slightly misleading debug info in most cases. Now you want to turn 'debug info' to have functional relevance, for something as intrusive as patching the kernel code. You really need to accept this distinction! 
> > - More importantly, there's no strong force that ensures > >we can rely on stack backtraces: correcting bad > >stack traces depends on people hitting those > >functions and situations that generate them, seeing > >a bad stack trace, noticing that it's weird and > >correcting whatever code or tooling quirk causes the > >stack entry to be incorrect. > > > > Essentially unlike other kernel code which breaks stuff > > if it's incorrect, there's no _functional_ dependence > > on stack traces, so live patching would be the first > > (and pretty much only) thing that breaks on bad stack > > traces ... > > First, there are several things we do to avoid anomalies: > > - we don't patch asm functions the problem with asm functions isn't that we might want to patch them (although sometimes we might want to: especially the more interesting security fixes tend to be in assembly code), it's that asm functions can (easily) mess up debug info, without that being apparent in any functional way. > - we only walk the stack of sleeping tasks sleeping tasks are affected by debug info bugs too. > Second, currently the stack unwinding code is rather > crude and doesn't do much in the way of error handling. That's a big red flag in my book ... Thanks, Ingo
Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())
* Jiri Kosina wrote: > > All fundamental pieces of the simple method are > > necessary to get guaranteed time transition from the > > complicated method: task tracking and transparent > > catching of them, handling kthreads, etc. > > > > My argument is that the simple method should be > > implemented first and foremost. > > > > Then people can do add-on features to possibly spread > > out the new function versions in a more complicated way > > if they want to avoid the stop-all-tasks transition - > > although I'm not convinced about it: I'm sure many > > sysadmins would like the bug patching to be over with > > quickly and not have their systems in an intermediate > > state like kgraft does it. > > ... the choice the sysadmins have here is either have the > system running in an intermediate state, or have the > system completely dead for the *same time*. Because to > finish the transition successfully, all the tasks have to > be woken up in any case. That statement is false: an 'intermediate state' system where 'new' tasks are still running is still running and will interfere with the resolution of 'old' tasks. > But I do get your point; what you are basically saying is > that your preference is what kgraft is doing, and option > to allow for a global synchronization point should be > added in parallel to the gradual lazy migration. I think you misunderstood: the 'simple' method I outlined does not just 'synchronize', it actually executes the live patching atomically, once all tasks are gathered and we know they are _all_ in a safe state. I.e. it's in essence the strong stop-all atomic patching model of 'kpatch', combined with the reliable avoidance of kernel stacks that 'kgraft' uses. That should be the starting point, because it's the most reliable method. 
Thanks, Ingo
Re: [PATCH 1/3] sched: add sched_task_call()
On Fri, Feb 20, 2015 at 09:49:32AM +0100, Jiri Kosina wrote: > Alright, so to sum it up: > > - current stack dumping (even looking at /proc//stack) is not > guaranteed to yield "correct" results in case the task is running at the > time the stack is being examined > > - the only fool-proof way is to send IPI-NMI to all CPUs, and synchronize > the handlers between each other (to make sure that reschedule doesn't > happen in between on some CPU and other task doesn't start running in > the interim). > The NMI handler dumps its current stack in case it's running in context > of the process whose stack is to be dumped. Otherwise, one of the NMI > handlers looks up the required task_struct, and dumps it if it's not > running on any CPU > > - For live patching use-case, the stack has to be analyzed (and decision > on what to do based on the analysis) in the NMI handler itself, > otherwise it gets racy again > > Converting /proc//stack to this mechanism seems like a correct thing > to do in any case, as it's slow path anyway. > > The original intent seemed to have been to make this fast path for the > live patching case, but that's probably not possible, so it seems like the > price that will have to be paid for being able to finish live-patching of > CPU-bound processes is the cost of IPI-NMI broadcast. Hm, syncing IPI's among CPUs sounds pretty disruptive. This is really two different issues, so I'll separate them: 1. /proc/pid/stack for running tasks I haven't heard anybody demanding that /proc//stack should actually print the stack for running tasks. My suggestion was just that we avoid the possibility of printing garbage. Today's behavior for a running task is (usually): # cat /proc/802/stack [] 0x How about, when we detect a running task, just always show that? That would give us today's behavior, except without occasionally printing garbage, while avoiding all the overhead of syncing IPI's. 2. 
live patching of running tasks I don't see why we would need to sync IPI's to patch CPU-bound processes. Why not use context tracking or the TIF_USERSPACE flag like I mentioned before? -- Josh
Re: [PATCH 1/3] sched: add sched_task_call()
On Fri, Feb 20, 2015 at 10:50:03AM +0100, Ingo Molnar wrote: > > * Jiri Kosina wrote: > > > Alright, so to sum it up: > > > > - current stack dumping (even looking at /proc//stack) is not > > guaranteed to yield "correct" results in case the task is running at the > > time the stack is being examined > > Don't even _think_ about trying to base something as > dangerous as live patching the kernel image on the concept > of: > > 'We can make all stack backtraces reliably correct all >the time, with no false positives, with no false >negatives, 100% of the time, and quickly discover and >fix bugs in that'. > > It's not going to happen: > > - The correctness of stacktraces partially depends on >tooling and we don't control those. What tooling are you referring to? Sure, we rely on the compiler to produce the correct assembly for putting frame pointers on the stack. But that's pretty straightforward. The kernel already relies on the compiler to do plenty of things which are much more complex than that. > - More importantly, there's no strong force that ensures >we can rely on stack backtraces: correcting bad stack >traces depends on people hitting those functions and >situations that generate them, seeing a bad stack trace, >noticing that it's weird and correcting whatever code or >tooling quirk causes the stack entry to be incorrect. > > Essentially unlike other kernel code which breaks stuff if > it's incorrect, there's no _functional_ dependence on stack > traces, so live patching would be the first (and pretty > much only) thing that breaks on bad stack traces ... First, there are several things we do to avoid anomalies: - we don't patch asm functions - we only walk the stack of sleeping tasks Second, currently the stack unwinding code is rather crude and doesn't do much in the way of error handling. 
There are several error conditions we could easily check for programmatically: - make sure it starts from a __schedule() call at the top (I've only proposed walking the stacks of sleeping tasks) - make sure we actually reach the bottom of the stack - make sure each stack frame's return address is immediately after a call instruction If any of these checks fail, we can either delay the patching of the task until later or we can cancel the patching process, with no harm done. Either way we can WARN the user so that we get reports of these anomalies. > If you think you can make something like dwarf annotations > work reliably to base kernel live patching on that, > reconsider. No, we would rely on CONFIG_FRAME_POINTER. > Even with frame pointers, backtraces can go bad sometimes, I > wouldn't base live patching even on _that_, Other than hand-coded assembly, can you provide more details about how frame pointer stack traces can go bad? > and that's a very simple concept with a performance cost that most > distros don't want to pay. Hm, CONFIG_FRAME_POINTER is enabled on all the distros I use. > So if your design is based on being able to discover 'live' > functions in the kernel stack dump of all tasks in the > system, I think you need a serious reboot of the whole > approach and get rid of that fragility before any of that > functionality gets upstream! Just to clarify a couple of things: 1. Despite the email subject, this discussion is really about another RFC patch set [1]. It hasn't been merged, and there's still a lot of ongoing discussion about it. 2. We don't dump the stack of _all_ tasks. Only sleeping ones. [1] https://lkml.org/lkml/2015/2/9/475 -- Josh
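[Editorial note: the error checks Josh lists can be illustrated with a toy, userspace model of a validated frame-pointer walk. This is a hypothetical sketch, not code from the RFC patch set: the stack is modeled as a dict of saved-rbp/return-address pairs, and `ret_after_call` stands in for the "return address is immediately after a call instruction" test; the __schedule()-at-the-top check would be a separate test on the first entry.]

```python
# The stack is modeled as a dict: frame address -> (saved rbp, return
# address), the way `push %rbp` + `call` leave them in memory.

def validate_stack(stack, rbp, stack_base, ret_after_call, max_depth=128):
    """Walk the frame-pointer chain from `rbp` and apply the sanity
    checks: the walk must terminate exactly at `stack_base`, and every
    return address must be a known just-after-a-call location.
    Returns the backtrace, or None if any anomaly is detected."""
    trace = []
    for _ in range(max_depth):
        if rbp == stack_base:
            return trace              # reached the bottom of the stack
        if rbp not in stack:
            return None               # frame pointer points nowhere sane
        saved_rbp, ret = stack[rbp]
        if ret not in ret_after_call:
            return None               # return address not after a call
        trace.append(ret)
        rbp = saved_rbp
    return None                       # chain too deep or cyclic: anomaly
```

The point of the model is the failure mode: a corrupt frame yields None (so patching of that task is delayed or cancelled), never a plausible-looking but wrong backtrace.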
Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())
On Fri, 20 Feb 2015, Ingo Molnar wrote: > - the complicated method spread out over time: uses the > same essential mechanism plus the ftrace patching > machinery to detect whether all tasks have transitioned > through a version flip. [this is what kgraft does in > part.] The only difference of this to what kgraft does is that alive-enough tasks are not put in this kind of "out of kernel 'freezer'", but keep running. Modifying kgraft to (optionally) add the synchronization barrier and then flip the switch should be a rather trivial task, and can indeed be added as a simple option to the patch author / sysadmin. However ... > All fundamental pieces of the simple method are necessary to get > guaranteed time transition from the complicated method: task tracking > and transparent catching of them, handling kthreads, etc. > > My argument is that the simple method should be implemented first and > foremost. > > Then people can do add-on features to possibly spread out the new > function versions in a more complicated way if they want to avoid the > stop-all-tasks transition - although I'm not convinced about it: I'm > sure many sysadmins would like the bug patching to be over with > quickly and not have their systems in an intermediate state like kgraft > does it. ... the choice the sysadmins have here is either have the system running in an intermediate state, or have the system completely dead for the *same time*. Because to finish the transition successfully, all the tasks have to be woken up in any case. (please note that this is different to suspend/resume task freezing, because we don't care about sleeping tasks there). But I do get your point; what you are basically saying is that your preference is what kgraft is doing, and option to allow for a global synchronization point should be added in parallel to the gradual lazy migration. 
> In any case, as per my arguments above, examining the kernel stack is > superfluous (so we won't be exposed to the fragility of it either): > there's no need to examine it and writing such patches is misguided... > > Thanks, > > Ingo > -- Jiri Kosina SUSE Labs
live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())
* Jiri Kosina wrote: > On Fri, 20 Feb 2015, Ingo Molnar wrote: > > > So if your design is based on being able to discover > > > 'live' functions in the kernel stack dump of all tasks > > in the system, I think you need a serious reboot of the > > whole approach and get rid of that fragility before any > > of that functionality gets upstream! > > So let me repeat again, just to make sure that no more > confusion is being spread around -- there are aproaches > which do rely on stack contents, and aproaches which > don't. kpatch (the Red Hat solution) and ksplice (the > Oracle solution) contains stack analysis as a conceptual > design step, kgraft (the SUSE solution) doesn't. So just to make my position really clear: any talk about looking at the kernel stack for backtraces is just crazy talk, considering how stack backtrace technology stands today and in the reasonable near future! With that out of the way, the only safe mechanism to live patch the kernel (for sufficiently simple sets of changes to singular functions) I'm aware of at the moment is: - forcing all user space tasks out of kernel mode and intercepting them in a safe state. I.e. making sure that no kernel code is executed, no kernel stack state is used (modulo code closely related to the live patching mechanism and kernel threads in safe state, lets ignore them for this argument) There's two variants of this concept, which deals with the timing of how user-space tasks are forced into user mode: - the simple method: force all user-space tasks out of kernel mode, stop the machine for a brief moment and be done with the patching safely and essentially atomically. - the complicated method spread out over time: uses the same essential mechanism plus the ftrace patching machinery to detect whether all tasks have transitioned through a version flip. [this is what kgraft does in part.] 
All fundamental pieces of the simple method are necessary to get guaranteed time transition from the complicated method: task tracking and transparent catching of them, handling kthreads, etc. My argument is that the simple method should be implemented first and foremost. Then people can do add-on features to possibly spread out the new function versions in a more complicated way if they want to avoid the stop-all-tasks transition - although I'm not convinced about it: I'm sure many sysadmins would like the bug patching to be over with quickly and not have their systems in an intermediate state like kgraft does it. In any case, as per my arguments above, examining the kernel stack is superfluous (so we won't be exposed to the fragility of it either): there's no need to examine it and writing such patches is misguided... Thanks, Ingo
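[Editorial note: the control flow of the 'simple' method Ingo describes can be sketched as a toy userspace model. Hypothetical code, not kernel code: `park()` stands in for forcing a task out of kernel mode and intercepting it at the user/kernel boundary, and a False return models e.g. an unparkable vendor kthread.]

```python
def try_live_patch(tasks, apply_patch):
    """Apply the patch only once *every* task is parked in a safe
    state; otherwise refuse, undo the parking, and leave the kernel
    untouched."""
    parked = []
    for task in tasks:
        if task.park():               # force the task out of kernel mode
            parked.append(task)
        else:
            for t in parked:          # fail safe: no partial patching
                t.unpark()
            return False
    apply_patch()                     # everyone quiescent: atomic flip
    for t in parked:
        t.unpark()
    return True
```

The design property Ingo emphasizes falls out directly: a bug that prevents some task from parking makes the patch *fail*, rather than making it unsafe - there is never an intermediate state with old and new code running side by side.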
Re: [PATCH 1/3] sched: add sched_task_call()
On Fri, 20 Feb 2015, Ingo Molnar wrote: > So if your design is based on being able to discover 'live' functions in > the kernel stack dump of all tasks in the system, I think you need a > serious reboot of the whole approach and get rid of that fragility > before any of that functionality gets upstream! So let me repeat again, just to make sure that no more confusion is being spread around -- there are approaches which do rely on stack contents, and approaches which don't. kpatch (the Red Hat solution) and ksplice (the Oracle solution) contain stack analysis as a conceptual design step, kgraft (the SUSE solution) doesn't. Also the limited (in features -- consistency models) common infrastructure that is currently upstream doesn't care about stack contents. We are now figuring out which consistency models make sense for upstream, and how they can be achieved. Josh's patchset is one of the approaches that is currently being discussed, but it's not the only available option. Thanks, -- Jiri Kosina SUSE Labs
Re: [PATCH 1/3] sched: add sched_task_call()
* Jiri Kosina wrote: > Alright, so to sum it up: > > - current stack dumping (even looking at /proc//stack) is not > guaranteed to yield "correct" results in case the task is running at the > time the stack is being examined Don't even _think_ about trying to base something as dangerous as live patching the kernel image on the concept of: 'We can make all stack backtraces reliably correct all the time, with no false positives, with no false negatives, 100% of the time, and quickly discover and fix bugs in that'. It's not going to happen: - The correctness of stacktraces partially depends on tooling and we don't control those. - More importantly, there's no strong force that ensures we can rely on stack backtraces: correcting bad stack traces depends on people hitting those functions and situations that generate them, seeing a bad stack trace, noticing that it's weird and correcting whatever code or tooling quirk causes the stack entry to be incorrect. Essentially unlike other kernel code which breaks stuff if it's incorrect, there's no _functional_ dependence on stack traces, so live patching would be the first (and pretty much only) thing that breaks on bad stack traces ... If you think you can make something like dwarf annotations work reliably to base kernel live patching on that, reconsider. Even with frame pointers, backtraces can go bad sometimes, I wouldn't base live patching even on _that_, and that's a very simple concept with a performance cost that most distros don't want to pay. So if your design is based on being able to discover 'live' functions in the kernel stack dump of all tasks in the system, I think you need a serious reboot of the whole approach and get rid of that fragility before any of that functionality gets upstream! 
> - For live patching use-case, the stack has to be > analyzed (and decision on what to do based on the > analysis) in the NMI handler itself, otherwise it gets > racy again You simply cannot reliably determine from the kernel stack whether a function is used by a task or not, and actually modify the kernel image, from a stack backtrace, as things stand today. Full stop. Thanks, Ingo
Re: [PATCH 1/3] sched: add sched_task_call()
Alright, so to sum it up: - current stack dumping (even looking at /proc//stack) is not guaranteed to yield "correct" results in case the task is running at the time the stack is being examined - the only fool-proof way is to send IPI-NMI to all CPUs, and synchronize the handlers between each other (to make sure that reschedule doesn't happen in between on some CPU and other task doesn't start running in the interim). The NMI handler dumps its current stack in case it's running in context of the process whose stack is to be dumped. Otherwise, one of the NMI handlers looks up the required task_struct, and dumps it if it's not running on any CPU - For live patching use-case, the stack has to be analyzed (and decision on what to do based on the analysis) in the NMI handler itself, otherwise it gets racy again Converting /proc//stack to this mechanism seems like a correct thing to do in any case, as it's slow path anyway. The original intent seemed to have been to make this fast path for the live patching case, but that's probably not possible, so it seems like the price that will have to be paid for being able to finish live-patching of CPU-bound processes is the cost of IPI-NMI broadcast. Agreed? Thanks, -- Jiri Kosina SUSE Labs
live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())
* Jiri Kosina jkos...@suse.cz wrote: On Fri, 20 Feb 2015, Ingo Molnar wrote: So if your design is based on being able to discover 'live' functions in the kernel stack dump of all tasks in the system, I think you need a serious reboot of the whole approach and get rid of that fragility before any of that functionality gets upstream! So let me repeat again, just to make sure that no more confusion is being spread around -- there are aproaches which do rely on stack contents, and aproaches which don't. kpatch (the Red Hat solution) and ksplice (the Oracle solution) contains stack analysis as a conceptual design step, kgraft (the SUSE solution) doesn't. So just to make my position really clear: any talk about looking at the kernel stack for backtraces is just crazy talk, considering how stack backtrace technology stands today and in the reasonable near future! With that out of the way, the only safe mechanism to live patch the kernel (for sufficiently simple sets of changes to singular functions) I'm aware of at the moment is: - forcing all user space tasks out of kernel mode and intercepting them in a safe state. I.e. making sure that no kernel code is executed, no kernel stack state is used (modulo code closely related to the live patching mechanism and kernel threads in safe state, lets ignore them for this argument) There's two variants of this concept, which deals with the timing of how user-space tasks are forced into user mode: - the simple method: force all user-space tasks out of kernel mode, stop the machine for a brief moment and be done with the patching safely and essentially atomically. - the complicated method spread out over time: uses the same essential mechanism plus the ftrace patching machinery to detect whether all tasks have transitioned through a version flip. [this is what kgraft does in part.] 
All fundamental pieces of the simple method are necessary to get a guaranteed-time transition from the complicated method: task tracking and transparent catching of them, handling kthreads, etc.

My argument is that the simple method should be implemented first and foremost.

Then people can add on features to possibly spread out the new function versions in a more complicated way, if they want to avoid the stop-all-tasks transition - although I'm not convinced about it: I'm sure many sysadmins would like the bug patching to be over with quickly and not have their systems in an intermediate state like kgraft does it.

In any case, as per my arguments above, examining the kernel stack is superfluous (so we won't be exposed to the fragility of it either): there's no need to examine it and writing such patches is misguided...

Thanks,

	Ingo
Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())
On Fri, 20 Feb 2015, Ingo Molnar wrote:

> - the complicated method spread out over time: uses the same essential mechanism plus the ftrace patching machinery to detect whether all tasks have transitioned through a version flip. [this is what kgraft does in part.]

The only difference between this and what kgraft does is that alive-enough tasks are not put into this kind of out-of-kernel 'freezer', but keep running.

Modifying kgraft to (optionally) add the synchronization barrier and then flip the switch should be a rather trivial task, and can indeed be added as a simple option for the patch author / sysadmin. However ...

> All fundamental pieces of the simple method are necessary to get a guaranteed-time transition from the complicated method: task tracking and transparent catching of them, handling kthreads, etc. My argument is that the simple method should be implemented first and foremost.
>
> Then people can add on features to possibly spread out the new function versions in a more complicated way, if they want to avoid the stop-all-tasks transition - although I'm not convinced about it: I'm sure many sysadmins would like the bug patching to be over with quickly and not have their systems in an intermediate state like kgraft does it.

... the choice the sysadmins have here is to either have the system running in an intermediate state, or have the system completely dead for the *same time*. Because to finish the transition successfully, all the tasks have to be woken up in any case.

(Please note that this is different from suspend/resume task freezing, because we don't care about sleeping tasks there.)

But I do get your point; what you are basically saying is that your preference is what kgraft is doing, and an option to allow for a global synchronization point should be added in parallel to the gradual lazy migration.
> In any case, as per my arguments above, examining the kernel stack is superfluous (so we won't be exposed to the fragility of it either): there's no need to examine it and writing such patches is misguided...
>
> Thanks,
>
> 	Ingo

--
Jiri Kosina
SUSE Labs
Re: [PATCH 1/3] sched: add sched_task_call()
On Fri, 20 Feb 2015, Ingo Molnar wrote:

> So if your design is based on being able to discover 'live' functions in the kernel stack dump of all tasks in the system, I think you need a serious reboot of the whole approach and get rid of that fragility before any of that functionality gets upstream!

So let me repeat again, just to make sure that no more confusion is being spread around -- there are approaches which do rely on stack contents, and approaches which don't.

kpatch (the Red Hat solution) and ksplice (the Oracle solution) contain stack analysis as a conceptual design step, kgraft (the SUSE solution) doesn't.

Also, the limited (in features -- consistency models) common infrastructure that is currently upstream doesn't care about stack contents. We are now figuring out which consistency models make sense for upstream, and how they can be achieved. Josh's patchset is one of the approaches that is currently being discussed, but it's not the only available option.

Thanks,

--
Jiri Kosina
SUSE Labs
Re: [PATCH 1/3] sched: add sched_task_call()
* Jiri Kosina <jkos...@suse.cz> wrote:

> Alright, so to sum it up:
>
> - current stack dumping (even looking at /proc/<pid>/stack) is not guaranteed to yield correct results in case the task is running at the time the stack is being examined

Don't even _think_ about trying to base something as dangerous as live patching the kernel image on the concept of:

  'We can make all stack backtraces reliably correct all the time, with no false positives, with no false negatives, 100% of the time, and quickly discover and fix bugs in that.'

It's not going to happen:

- The correctness of stacktraces partially depends on tooling and we don't control those.

- More importantly, there's no strong force that ensures we can rely on stack backtraces: correcting bad stack traces depends on people hitting those functions and situations that generate them, seeing a bad stack trace, noticing that it's weird and correcting whatever code or tooling quirk causes the stack entry to be incorrect.

Essentially, unlike other kernel code which breaks stuff if it's incorrect, there's no _functional_ dependence on stack traces, so live patching would be the first (and pretty much only) thing that breaks on bad stack traces ...

If you think you can make something like dwarf annotations work reliably to base kernel live patching on that, reconsider.

Even with frame pointers, backtraces can go bad sometimes; I wouldn't base live patching even on _that_, and that's a very simple concept with a performance cost that most distros don't want to pay.

So if your design is based on being able to discover 'live' functions in the kernel stack dump of all tasks in the system, I think you need a serious reboot of the whole approach and get rid of that fragility before any of that functionality gets upstream!
> - for the live patching use-case, the stack has to be analyzed (and the decision on what to do based on that analysis made) in the NMI handler itself, otherwise it gets racy again

You simply cannot reliably determine from the kernel stack whether a function is used by a task or not, and actually modify the kernel image based on a stack backtrace, as things stand today. Full stop.

Thanks,

	Ingo
Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())
On Fri, Feb 20, 2015 at 08:49:01PM +0100, Ingo Molnar wrote:

> > ... the choice the sysadmins have here is to either have the system running in an intermediate state, or have the system completely dead for the *same time*. Because to finish the transition successfully, all the tasks have to be woken up in any case.
>
> That statement is false: an 'intermediate state' system where 'new' tasks are still running is still running and will interfere with the resolution of 'old' tasks.

Can you suggest a way in which they would interfere? The transition happens on entering or returning from a syscall; there is no influence between individual tasks.

If you mean that the patch author has to consider the fact that both old and new code will be running simultaneously, then yes, they have to.

> > But I do get your point; what you are basically saying is that your preference is what kgraft is doing, and an option to allow for a global synchronization point should be added in parallel to the gradual lazy migration.
>
> I think you misunderstood: the 'simple' method I outlined does not just 'synchronize', it actually executes the live patching atomically, once all tasks are gathered and we know they are _all_ in a safe state.

The 'simple' method has to catch and freeze all tasks one by one in syscall entry/exit, at the kernel/userspace boundary, until all are frozen, and then patch the system atomically.

This means that each and every sleeping task in the system has to be woken up in some way (sending a signal ...) to exit from the syscall it is sleeping in. Same for CPU hogs. All kernel threads need to be parked.

This is exactly what you need to do for kGraft to complete patching.

This may take a significant amount of time to achieve, and you won't be able to use a userspace script to send the signals, because it'd be frozen immediately.

> I.e. it's in essence the strong stop-all atomic patching model of 'kpatch', combined with the reliable avoidance of kernel stacks that 'kgraft' uses.
> That should be the starting point, because it's the most reliable method.

In the consistency models discussion, this was marked the LEAVE_KERNEL+SWITCH_KERNEL model. It's indeed the strongest model of all, but it also comes at the highest cost in terms of impact on running tasks. It's so high (the interruption may be seconds or more) that it was deemed not worth implementing.

--
Vojtech Pavlik
Director SUSE Labs
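The trade-off being argued in the two messages above (tasks frozen at the boundary versus a mixed old/new state) can be illustrated with a small toy model. This is plain Python, nothing kernel-specific, and every name in it is invented for illustration:

```python
# Toy model (not kernel code) of the trade-off discussed above. Each task
# reaches the kernel/userspace boundary at some time; in both schemes the
# transition is complete only when the *last* task has crossed it.
#
# - "simple" (LEAVE_KERNEL+SWITCH_KERNEL): tasks freeze at the boundary
#   and stall until the last straggler arrives, then all resume at once.
# - "lazy" (kGraft-style): each task flips at its own boundary crossing;
#   nobody stalls, but old and new code run mixed until the last flip.

def simple_method(arrivals):
    """Per-task stall times when everyone waits for the stragglers."""
    patch_time = max(arrivals)          # last task reaches the boundary
    return [patch_time - t for t in arrivals]

def lazy_method(arrivals):
    """No stalls; returns how long the mixed old/new state lasts."""
    return max(arrivals)                # transition completes here too

arrivals = [1.0, 3.0, 2.0]              # boundary-crossing times

# Both schemes finish at the same instant (Jiri's "same time" point);
# they differ only in whether tasks sit frozen or keep running meanwhile.
assert lazy_method(arrivals) == max(arrivals)
assert simple_method(arrivals) == [2.0, 0.0, 1.0]
```

Whether a stretch of frozen tasks (simple) is preferable to a window of mixed execution (lazy) is exactly the judgment call the thread is debating.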
Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())
On Fri, Feb 20, 2015 at 10:46:13PM +0100, Vojtech Pavlik wrote:

> On Fri, Feb 20, 2015 at 08:49:01PM +0100, Ingo Molnar wrote:
>
> > I.e. it's in essence the strong stop-all atomic patching model of 'kpatch', combined with the reliable avoidance of kernel stacks that 'kgraft' uses.
> >
> > That should be the starting point, because it's the most reliable method.
>
> In the consistency models discussion, this was marked the LEAVE_KERNEL+SWITCH_KERNEL model. It's indeed the strongest model of all, but it also comes at the highest cost in terms of impact on running tasks. It's so high (the interruption may be seconds or more) that it was deemed not worth implementing.

Yeah, this is way too disruptive to the user. Even the comparatively tiny latency caused by kpatch's use of stop_machine() was considered unacceptable by some.

Plus a lot of processes would see EINTR, causing more havoc.

--
Josh
Re: [PATCH 1/3] sched: add sched_task_call()
On Fri, Feb 20, 2015 at 09:08:49PM +0100, Ingo Molnar wrote:

> * Josh Poimboeuf <jpoim...@redhat.com> wrote:
>
> > On Fri, Feb 20, 2015 at 10:50:03AM +0100, Ingo Molnar wrote:
> >
> > > * Jiri Kosina <jkos...@suse.cz> wrote:
> > >
> > > > Alright, so to sum it up:
> > > >
> > > > - current stack dumping (even looking at /proc/<pid>/stack) is not guaranteed to yield correct results in case the task is running at the time the stack is being examined
> > >
> > > Don't even _think_ about trying to base something as dangerous as live patching the kernel image on the concept of:
> > >
> > >   'We can make all stack backtraces reliably correct all the time, with no false positives, with no false negatives, 100% of the time, and quickly discover and fix bugs in that.'
> > >
> > > It's not going to happen:
> > >
> > > - The correctness of stacktraces partially depends on tooling and we don't control those.
> >
> > What tooling are you referring to? Sure, we rely on the compiler to produce the correct assembly for putting frame pointers on the stack. [...]
>
> We also rely on all assembly code to have valid frames all the time - and this is far from a given, we've had many bugs in that area.

Are you talking about kernel bugs or compiler bugs? Either way, would those bugs not be caught by the stack unwinding validation improvements I proposed?

> > [...] But that's pretty straightforward. The kernel already relies on the compiler to do plenty of things which are much more complex than that.
>
> So this claim scares me because it's such nonsense:
>
> We rely on the compiler to create _functional code_ for us. If the compiler messes up, then functionality - actual code - breaks.
>
> 'Valid stack frame' is not functional code, it's in essence debug info.

Only because we _treat_ it like debug info. I'm proposing that we treat it like functional code. So that yes, if the compiler messes up frame pointers, actual code breaks.

(But we could detect that before we do the patch, so it doesn't destabilize the system.)

If gcc can't do frame pointers right, it's a *major* gcc bug.
It's nothing more than:

	push %rbp
	mov  %rsp,%rbp

at the beginning of the function, and:

	pop  %rbp

right before the function returns.

If you really think the compiler could possibly mess that up, I can probably write a simple script, executed by the Makefile, which validates that every C function in the kernel has the above pattern.

> If a backtrace is wrong in some rare, racy case then that won't break the kernel, it just gives slightly misleading debug info in most cases.
>
> Now you want to turn 'debug info' to have functional relevance, for something as intrusive as patching the kernel code.
>
> You really need to accept this distinction!

Yes, I already realize that this would be the first time the kernel relies on stack traces being 100% correct. Peter and I already had some discussion about it elsewhere in this thread.

Just because we haven't relied on its correctness in the past doesn't mean we can't rely on it in the future.

> > > - More importantly, there's no strong force that ensures we can rely on stack backtraces: correcting bad stack traces depends on people hitting those functions and situations that generate them, seeing a bad stack trace, noticing that it's weird and correcting whatever code or tooling quirk causes the stack entry to be incorrect.
> > >
> > > Essentially, unlike other kernel code which breaks stuff if it's incorrect, there's no _functional_ dependence on stack traces, so live patching would be the first (and pretty much only) thing that breaks on bad stack traces ...
> >
> > First, there are several things we do to avoid anomalies:
> >
> > - we don't patch asm functions
>
> the problem with asm functions isn't that we might want to patch them (although sometimes we might want to: especially the more interesting security fixes tend to be in assembly code), it's that asm functions can (easily) mess up debug info, without that being apparent in any functional way.

Again, that's why I proposed the stack unwinding validation improvements.
> > - we only walk the stack of sleeping tasks
>
> sleeping tasks are affected by debug info bugs too.

> > Second, currently the stack unwinding code is rather crude and doesn't do much in the way of error handling.
>
> That's a big red flag in my book ...

I wasn't talking about my patch set. I was talking about the existing stack dumping code in the kernel.

--
Josh
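The "simple script, executed by the Makefile" that Josh floats above could plausibly take a shape like the following sketch. This is hypothetical (no such script was posted in the thread); it assumes objdump-style disassembly text as input and checks only the first instruction of each function, whereas a real check would also have to handle the `mov %rsp,%rbp` line, hand-written assembly, and so on:

```python
# Hypothetical sketch of a Makefile-driven prologue check: scan an
# objdump-style disassembly and flag functions whose first instruction
# is not the frame-pointer prologue "push %rbp".
import re

# Matches objdump function headers like "0000000000000010 <bad_func>:"
FUNC_RE = re.compile(r'^[0-9a-f]+ <(?P<name>[^>]+)>:$')

def functions_missing_prologue(disasm):
    """Return names of functions whose first insn isn't 'push %rbp'."""
    bad, current, first_insn_seen = [], None, False
    for line in disasm.splitlines():
        m = FUNC_RE.match(line.strip())
        if m:
            current, first_insn_seen = m.group('name'), False
            continue
        # Instruction lines in objdump output are tab-separated.
        if current and not first_insn_seen and '\t' in line:
            first_insn_seen = True
            if 'push' not in line or '%rbp' not in line:
                bad.append(current)
    return bad

SAMPLE = """\
0000000000000000 <good_func>:
   0:\t55\t\tpush   %rbp
   1:\t48 89 e5\tmov    %rsp,%rbp
0000000000000010 <bad_func>:
  10:\t48 83 ec 08\tsub    $0x8,%rsp
"""

print(functions_missing_prologue(SAMPLE))  # prints ['bad_func']
```

In a build integration, the input would come from `objdump -d` on each object file, and any non-empty result would fail the build or at least WARN.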
Re: [PATCH 1/3] sched: add sched_task_call()
On Fri, Feb 20, 2015 at 10:50:03AM +0100, Ingo Molnar wrote:

> * Jiri Kosina <jkos...@suse.cz> wrote:
>
> > Alright, so to sum it up:
> >
> > - current stack dumping (even looking at /proc/<pid>/stack) is not guaranteed to yield correct results in case the task is running at the time the stack is being examined
>
> Don't even _think_ about trying to base something as dangerous as live patching the kernel image on the concept of:
>
>   'We can make all stack backtraces reliably correct all the time, with no false positives, with no false negatives, 100% of the time, and quickly discover and fix bugs in that.'
>
> It's not going to happen:
>
> - The correctness of stacktraces partially depends on tooling and we don't control those.

What tooling are you referring to? Sure, we rely on the compiler to produce the correct assembly for putting frame pointers on the stack. But that's pretty straightforward. The kernel already relies on the compiler to do plenty of things which are much more complex than that.

> - More importantly, there's no strong force that ensures we can rely on stack backtraces: correcting bad stack traces depends on people hitting those functions and situations that generate them, seeing a bad stack trace, noticing that it's weird and correcting whatever code or tooling quirk causes the stack entry to be incorrect.
>
> Essentially, unlike other kernel code which breaks stuff if it's incorrect, there's no _functional_ dependence on stack traces, so live patching would be the first (and pretty much only) thing that breaks on bad stack traces ...

First, there are several things we do to avoid anomalies:

- we don't patch asm functions
- we only walk the stack of sleeping tasks

Second, currently the stack unwinding code is rather crude and doesn't do much in the way of error handling.
There are several error conditions we could easily check for programmatically:

- make sure it starts from a __schedule() call at the top (I've only proposed walking the stacks of sleeping tasks)
- make sure we actually reach the bottom of the stack
- make sure each stack frame's return address is immediately after a call instruction

If any of these checks fail, we can either delay the patching of the task until later or we can cancel the patching process, with no harm done. Either way we can WARN the user so that we get reports of these anomalies.

> If you think you can make something like dwarf annotations work reliably to base kernel live patching on that, reconsider.

No, we would rely on CONFIG_FRAME_POINTER.

> Even with frame pointers, backtraces can go bad sometimes; I wouldn't base live patching even on _that_,

Other than hand-coded assembly, can you provide more details about how frame pointer stack traces can go bad?

> and that's a very simple concept with a performance cost that most distros don't want to pay.

Hm, CONFIG_FRAME_POINTER is enabled on all the distros I use.

> So if your design is based on being able to discover 'live' functions in the kernel stack dump of all tasks in the system, I think you need a serious reboot of the whole approach and get rid of that fragility before any of that functionality gets upstream!

Just to clarify a couple of things:

1. Despite the email subject, this discussion is really about another RFC patch set [1]. It hasn't been merged, and there's still a lot of ongoing discussion about it.

2. We don't dump the stack of _all_ tasks. Only sleeping ones.

[1] https://lkml.org/lkml/2015/2/9/475

--
Josh
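The three sanity checks listed above can be sketched as a toy unwinder over a simulated frame-pointer chain. This is purely illustrative; the function and table names are invented here, and a real implementation would walk raw stack memory and kernel symbol data rather than Python tuples:

```python
# Toy model of the proposed validation: walk a simulated frame-pointer
# chain and verify (1) the trace starts in __schedule(), (2) the walk
# reaches the bottom of the stack, and (3) every saved return address
# points just past a call site. All names here are illustrative.

CALL_SITES = {0x100, 0x200, 0x300}      # addresses of call instructions
FUNC_AT = {0x105: "__schedule", 0x205: "mutex_lock", 0x305: "vfs_read"}

def returns_after_call(addr):
    # Real check: is the instruction immediately before `addr` a call?
    return (addr - 5) in CALL_SITES     # pretend all calls are 5 bytes

def walk_and_validate(frames):
    """Return the backtrace, or None if any sanity check fails.

    Each frame is (return_address, next_frame_index), with None as the
    next index when we've reached the bottom of the stack.
    """
    if not frames or FUNC_AT.get(frames[0][0]) != "__schedule":
        return None                     # check 1: top must be __schedule
    trace, i = [], 0
    while True:
        addr, nxt = frames[i]
        if not returns_after_call(addr):
            return None                 # check 3: bogus return address
        trace.append(FUNC_AT.get(addr, "?"))
        if nxt is None:
            return trace                # check 2: reached stack bottom
        i = nxt

good = [(0x105, 1), (0x205, 2), (0x305, None)]
bad  = [(0x105, 1), (0x207, 2), (0x305, None)]  # 0x207 isn't after a call

print(walk_and_validate(good))  # prints ['__schedule', 'mutex_lock', 'vfs_read']
print(walk_and_validate(bad))   # prints None
```

The point of the model is the failure mode: a trace that fails any check is simply rejected, and per the text above the task's patching is delayed or the whole operation cancelled, with no harm done.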
Re: [PATCH 1/3] sched: add sched_task_call()
On Fri, Feb 20, 2015 at 09:49:32AM +0100, Jiri Kosina wrote:

> Alright, so to sum it up:
>
> - current stack dumping (even looking at /proc/<pid>/stack) is not guaranteed to yield correct results in case the task is running at the time the stack is being examined
>
> - the only fool-proof way is to send an IPI-NMI to all CPUs, and synchronize the handlers between each other (to make sure that a reschedule doesn't happen in between on some CPU and another task doesn't start running in the interim). The NMI handler dumps its current stack in case it's running in the context of the process whose stack is to be dumped. Otherwise, one of the NMI handlers looks up the required task_struct, and dumps it if it's not running on any CPU
>
> - for the live patching use-case, the stack has to be analyzed (and the decision on what to do based on that analysis made) in the NMI handler itself, otherwise it gets racy again
>
> Converting /proc/<pid>/stack to this mechanism seems like a correct thing to do in any case, as it's a slow path anyway. The original intent seems to have been to make this a fast path for the live patching case, but that's probably not possible, so it seems like the price that will have to be paid for being able to finish live-patching of CPU-bound processes is the cost of an IPI-NMI broadcast.

Hm, syncing IPIs among CPUs sounds pretty disruptive. These are really two different issues, so I'll separate them:

1. /proc/<pid>/stack for running tasks

I haven't heard anybody demanding that /proc/<pid>/stack should actually print the stack for running tasks. My suggestion was just that we avoid the possibility of printing garbage. Today's behavior for a running task is (usually):

  # cat /proc/802/stack
  [<ffffffffffffffff>] 0xffffffffffffffff

How about, when we detect a running task, just always show that? That would give us today's behavior, except without occasionally printing garbage, while avoiding all the overhead of syncing IPIs.

2. live patching of running tasks

I don't see why we would need to sync IPIs to patch CPU-bound processes.
Why not use context tracking or the TIF_USERSPACE flag like I mentioned before?

--
Josh
Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())
* Jiri Kosina <jkos...@suse.cz> wrote:

> > All fundamental pieces of the simple method are necessary to get a guaranteed-time transition from the complicated method: task tracking and transparent catching of them, handling kthreads, etc. My argument is that the simple method should be implemented first and foremost.
> >
> > Then people can add on features to possibly spread out the new function versions in a more complicated way, if they want to avoid the stop-all-tasks transition - although I'm not convinced about it: I'm sure many sysadmins would like the bug patching to be over with quickly and not have their systems in an intermediate state like kgraft does it.
>
> ... the choice the sysadmins have here is to either have the system running in an intermediate state, or have the system completely dead for the *same time*. Because to finish the transition successfully, all the tasks have to be woken up in any case.

That statement is false: an 'intermediate state' system where 'new' tasks are still running is still running and will interfere with the resolution of 'old' tasks.

> But I do get your point; what you are basically saying is that your preference is what kgraft is doing, and an option to allow for a global synchronization point should be added in parallel to the gradual lazy migration.

I think you misunderstood: the 'simple' method I outlined does not just 'synchronize', it actually executes the live patching atomically, once all tasks are gathered and we know they are _all_ in a safe state.

I.e. it's in essence the strong stop-all atomic patching model of 'kpatch', combined with the reliable avoidance of kernel stacks that 'kgraft' uses.

That should be the starting point, because it's the most reliable method.

Thanks,

	Ingo
Re: [PATCH 1/3] sched: add sched_task_call()
* Josh Poimboeuf <jpoim...@redhat.com> wrote:

> On Fri, Feb 20, 2015 at 10:50:03AM +0100, Ingo Molnar wrote:
>
> > * Jiri Kosina <jkos...@suse.cz> wrote:
> >
> > > Alright, so to sum it up:
> > >
> > > - current stack dumping (even looking at /proc/<pid>/stack) is not guaranteed to yield correct results in case the task is running at the time the stack is being examined
> >
> > Don't even _think_ about trying to base something as dangerous as live patching the kernel image on the concept of:
> >
> >   'We can make all stack backtraces reliably correct all the time, with no false positives, with no false negatives, 100% of the time, and quickly discover and fix bugs in that.'
> >
> > It's not going to happen:
> >
> > - The correctness of stacktraces partially depends on tooling and we don't control those.
>
> What tooling are you referring to? Sure, we rely on the compiler to produce the correct assembly for putting frame pointers on the stack. [...]

We also rely on all assembly code to have valid frames all the time - and this is far from a given, we've had many bugs in that area.

> [...] But that's pretty straightforward. The kernel already relies on the compiler to do plenty of things which are much more complex than that.

So this claim scares me because it's such nonsense:

We rely on the compiler to create _functional code_ for us. If the compiler messes up, then functionality - actual code - breaks.

'Valid stack frame' is not functional code, it's in essence debug info. If a backtrace is wrong in some rare, racy case then that won't break the kernel, it just gives slightly misleading debug info in most cases.

Now you want to turn 'debug info' to have functional relevance, for something as intrusive as patching the kernel code.

You really need to accept this distinction!
> > - More importantly, there's no strong force that ensures we can rely on stack backtraces: correcting bad stack traces depends on people hitting those functions and situations that generate them, seeing a bad stack trace, noticing that it's weird and correcting whatever code or tooling quirk causes the stack entry to be incorrect.
> >
> > Essentially, unlike other kernel code which breaks stuff if it's incorrect, there's no _functional_ dependence on stack traces, so live patching would be the first (and pretty much only) thing that breaks on bad stack traces ...
>
> First, there are several things we do to avoid anomalies:
>
> - we don't patch asm functions

the problem with asm functions isn't that we might want to patch them (although sometimes we might want to: especially the more interesting security fixes tend to be in assembly code), it's that asm functions can (easily) mess up debug info, without that being apparent in any functional way.

> - we only walk the stack of sleeping tasks

sleeping tasks are affected by debug info bugs too.

> Second, currently the stack unwinding code is rather crude and doesn't do much in the way of error handling.

That's a big red flag in my book ...

Thanks,

	Ingo
Re: [PATCH 1/3] sched: add sched_task_call()
On Thu, 19 Feb 2015, Josh Poimboeuf wrote:

> So I've looked at kgr_needs_lazy_migration(), but I still have no idea how it works.
>
> First of all, I think reading the stack while it's being written to could give you some garbage values, and a completely wrong nr_entries value from save_stack_trace_tsk().

I believe we've already been discussing this some time ago ...

I agree that this is a very crude optimization that should probably be either removed (which would only cause slower convergence in the presence of CPU-bound tasks), or rewritten to perform IPI-based stack dumping (probably on a voluntarily-configurable basis).

Reading garbage values could only happen if the task were running in kernelspace; nr_entries would then be at least 2. But I agree that relying on this very specific behavior is not really safe in general, in case someone changes the stack dumping implementation in the future in an unpredictable way.

> But also, how would you walk a stack without knowing its stack pointer? That function relies on the saved stack pointer in task_struct.thread.sp, which, AFAICT, was last saved during the last call to schedule(). Since then, the stack could have been completely rewritten, with different size stack frames, before the task exited the kernel.

The same argument holds here as well, I believe.

Thanks,

--
Jiri Kosina
SUSE Labs
Re: [PATCH 1/3] sched: add sched_task_call()
On Thu, Feb 19, 2015 at 10:26:09PM +0100, Jiri Kosina wrote:

> On Thu, 19 Feb 2015, Josh Poimboeuf wrote:
>
> > How about with a TIF_IN_USERSPACE thread flag? It could be cleared/set right at the border. Then for running tasks it's as simple as:
> >
> > 	if (test_tsk_thread_flag(task, TIF_IN_USERSPACE))
> > 		klp_switch_task_universe(task);
>
> That's in principle what CONTEXT_TRACKING is doing, i.e. the condition we'd be interested in would be
>
> 	__this_cpu_read(context_tracking.state) == IN_USER
>
> But it has overhead.

Yeah, that does seem to pretty much do what we want. Unfortunately it has a much higher overhead than just setting a thread flag.

And from the Kconfig description for CONTEXT_TRACKING_FORCE, which would enable it on all CPUs during boot, "this option brings an overhead that you don't want in production."

Maybe that code needs a rewrite to rely on a thread flag instead. Then we could use it too.

--
Josh
Re: [PATCH 1/3] sched: add sched_task_call()
On Thu, Feb 19, 2015 at 09:40:36PM +0100, Vojtech Pavlik wrote:

> On Thu, Feb 19, 2015 at 11:32:55AM -0600, Josh Poimboeuf wrote:
>
> > On Thu, Feb 19, 2015 at 06:19:29PM +0100, Vojtech Pavlik wrote:
> >
> > > On Thu, Feb 19, 2015 at 11:03:53AM -0600, Josh Poimboeuf wrote:
> > >
> > > > On Thu, Feb 19, 2015 at 05:33:59PM +0100, Vojtech Pavlik wrote:
> > > >
> > > > > On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote:
> > > > >
> > > > > > > No, these tasks will _never_ make syscalls. So you need to guarantee they don't accidentally enter the kernel while you flip them. Something like so should do.
> > > > > > >
> > > > > > > You set TIF_ENTER_WAIT on them, check they're still in userspace, flip them then clear TIF_ENTER_WAIT.
> > > > > >
> > > > > > Ah, that's a good idea. But how do we check if they're in user space?
> > > > >
> > > > > I don't see the benefit in holding them in a loop - you can just as well flip them from the syscall code as kGraft does.
> > > >
> > > > But we were talking specifically about HPC tasks which never make syscalls.
> > >
> > > Yes. I'm saying that rather than guaranteeing they don't enter the kernel (by having them spin) you can flip them in case they try to do that instead. That solves the race condition just as well.
> >
> > Ok, gotcha.
> >
> > We'd still need a safe way to check if they're in user space though.
>
> Having a safe way would be very nice and actually quite useful in other cases, too.
>
> For this specific purpose, however, we don't need a very safe way. We don't require atomicity in any way, we don't mind even if it creates false negatives; only false positives would be bad.
>
> kGraft looks at the stacktrace of CPU hogs and if it finds no kernel addresses there, it assumes userspace. Not very nice, but it does the job.
So I've looked at kgr_needs_lazy_migration(), but I still have no idea how it works. First of all, I think reading the stack while it's being written to could give you some garbage values, and a completely wrong nr_entries value from save_stack_trace_tsk(). But also, how would you walk a stack without knowing its stack pointer? That function relies on the saved stack pointer in task_struct.thread.sp, which, AFAICT, was last saved during the last call to schedule(). Since then, the stack could have been completely rewritten, with different size stack frames, before the task exited the kernel. Am I missing something? -- Josh
Re: [PATCH 1/3] sched: add sched_task_call()
On Thu, 19 Feb 2015, Jiri Kosina wrote: > > How about with a TIF_IN_USERSPACE thread flag? It could be cleared/set > > right at the border. Then for running tasks it's as simple as: > > > > if (test_tsk_thread_flag(task, TIF_IN_USERSPACE)) > > klp_switch_task_universe(task); > > That's in principle what CONTEXT_TRACKING is doing, i.e. the condition > we'd be interested in would be > > __this_cpu_read(context_tracking.state) == IN_USER Well, more precisely per_cpu(context_tracking.state, cpu) == IN_USER of course. -- Jiri Kosina SUSE Labs
Re: [PATCH 1/3] sched: add sched_task_call()
On Thu, 19 Feb 2015, Josh Poimboeuf wrote: > How about with a TIF_IN_USERSPACE thread flag? It could be cleared/set > right at the border. Then for running tasks it's as simple as: > > if (test_tsk_thread_flag(task, TIF_IN_USERSPACE)) > klp_switch_task_universe(task); That's in principle what CONTEXT_TRACKING is doing, i.e. the condition we'd be interested in would be __this_cpu_read(context_tracking.state) == IN_USER But it has overhead. -- Jiri Kosina SUSE Labs
Re: [PATCH 1/3] sched: add sched_task_call()
On Thu, Feb 19, 2015 at 11:32:55AM -0600, Josh Poimboeuf wrote: > On Thu, Feb 19, 2015 at 06:19:29PM +0100, Vojtech Pavlik wrote: > > On Thu, Feb 19, 2015 at 11:03:53AM -0600, Josh Poimboeuf wrote: > > > On Thu, Feb 19, 2015 at 05:33:59PM +0100, Vojtech Pavlik wrote: > > > > On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote: > > > > > > > > > > No, these tasks will _never_ make syscalls. So you need to guarantee > > > > > > they don't accidentally enter the kernel while you flip them. > > > > > > Something > > > > > > like so should do. > > > > > > > > > > > > You set TIF_ENTER_WAIT on them, check they're still in userspace, > > > > > > flip > > > > > > them then clear TIF_ENTER_WAIT. > > > > > > > > > > Ah, that's a good idea. But how do we check if they're in user space? > > > > > > > > I don't see the benefit in holding them in a loop - you can just as well > > > > flip them from the syscall code as kGraft does. > > > > > > But we were talking specifically about HPC tasks which never make > > > syscalls. > > > > Yes. I'm saying that rather than guaranteeing they don't enter the > > kernel (by having them spin) you can flip them in case they try to do > > that instead. That solves the race condition just as well. > > Ok, gotcha. > > We'd still need a safe way to check if they're in user space though. Having a safe way would be very nice and actually quite useful in other cases, too. For this specific purpose, however, we don't need a very safe way, though. We don't require atomicity in any way, we don't mind even if it creates false negatives, only false positives would be bad. kGraft looks at the stacktrace of CPU hogs and if it finds no kernel addresses there, it assumes userspace. Not very nice, but does the job. > How about with a TIF_IN_USERSPACE thread flag? It could be cleared/set > right at the border. 
Then for running tasks it's as simple as: > > if (test_tsk_thread_flag(task, TIF_IN_USERSPACE)) > klp_switch_task_universe(task); -- Vojtech Pavlik Director SUSE Labs
Re: [PATCH 1/3] sched: add sched_task_call()
On February 19, 2015 6:32:55 PM CET, Josh Poimboeuf wrote: >> Yes. I'm saying that rather than guaranteeing they don't enter the >> kernel (by having them spin) you can flip them in case they try to do >> that instead. That solves the race condition just as well. > >Ok, gotcha. > >We'd still need a safe way to check if they're in user space though. > >How about with a TIF_IN_USERSPACE thread flag? It could be cleared/set >right at the border. Then for running tasks it's as simple as: > >if (test_tsk_thread_flag(task, TIF_IN_USERSPACE)) > klp_switch_task_universe(task); The s390x arch used to have a TIF_SYSCALL, which was doing exactly that (well, negated). I think it would work well, but it isn't entirely for free: two instructions per syscall and an extra TIF bit, which are (near) exhausted on some archs. -- Vojtech Pavlik Director SuSE Labs
Re: [PATCH 1/3] sched: add sched_task_call()
On Thu, Feb 19, 2015 at 06:19:29PM +0100, Vojtech Pavlik wrote: > On Thu, Feb 19, 2015 at 11:03:53AM -0600, Josh Poimboeuf wrote: > > On Thu, Feb 19, 2015 at 05:33:59PM +0100, Vojtech Pavlik wrote: > > > On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote: > > > > > > > > No, these tasks will _never_ make syscalls. So you need to guarantee > > > > > they don't accidentally enter the kernel while you flip them. > > > > > Something > > > > > like so should do. > > > > > > > > > > You set TIF_ENTER_WAIT on them, check they're still in userspace, flip > > > > > them then clear TIF_ENTER_WAIT. > > > > > > > > Ah, that's a good idea. But how do we check if they're in user space? > > > > > > I don't see the benefit in holding them in a loop - you can just as well > > > flip them from the syscall code as kGraft does. > > > > But we were talking specifically about HPC tasks which never make > > syscalls. > > Yes. I'm saying that rather than guaranteeing they don't enter the > kernel (by having them spin) you can flip them in case they try to do > that instead. That solves the race condition just as well. Ok, gotcha. We'd still need a safe way to check if they're in user space though. How about with a TIF_IN_USERSPACE thread flag? It could be cleared/set right at the border. Then for running tasks it's as simple as: if (test_tsk_thread_flag(task, TIF_IN_USERSPACE)) klp_switch_task_universe(task); -- Josh
Re: [PATCH 1/3] sched: add sched_task_call()
On Thu, Feb 19, 2015 at 11:03:53AM -0600, Josh Poimboeuf wrote: > On Thu, Feb 19, 2015 at 05:33:59PM +0100, Vojtech Pavlik wrote: > > On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote: > > > > > > No, these tasks will _never_ make syscalls. So you need to guarantee > > > > they don't accidentally enter the kernel while you flip them. Something > > > > like so should do. > > > > > > > > You set TIF_ENTER_WAIT on them, check they're still in userspace, flip > > > > them then clear TIF_ENTER_WAIT. > > > > > > Ah, that's a good idea. But how do we check if they're in user space? > > > > I don't see the benefit in holding them in a loop - you can just as well > > flip them from the syscall code as kGraft does. > > But we were talking specifically about HPC tasks which never make > syscalls. Yes. I'm saying that rather than guaranteeing they don't enter the kernel (by having them spin) you can flip them in case they try to do that instead. That solves the race condition just as well. -- Vojtech Pavlik Director SUSE Labs
Re: [PATCH 1/3] sched: add sched_task_call()
On Thu, 19 Feb 2015, Josh Poimboeuf wrote: > > > > No, these tasks will _never_ make syscalls. So you need to guarantee > > > > they don't accidentally enter the kernel while you flip them. Something > > > > like so should do. > > > > > > > > You set TIF_ENTER_WAIT on them, check they're still in userspace, flip > > > > them then clear TIF_ENTER_WAIT. > > > > > > Ah, that's a good idea. But how do we check if they're in user space? > > > > I don't see the benefit in holding them in a loop - you can just as well > > flip them from the syscall code as kGraft does. > > But we were talking specifically about HPC tasks which never make > syscalls. Yeah. And getting an answer to "is this task_struct currently running in userspace?" is not easy in a non-disruptive way. Piggy-backing on CONFIG_CONTEXT_TRACKING is one option, but I guess the number of systems that have this option turned on will be rather limited (especially in HPC space). Sending IPIs around to check whether the task is running in user-space or kernel-space would work, but it's disruptive to the running task, which might be unwanted as well. -- Jiri Kosina SUSE Labs
Re: [PATCH 1/3] sched: add sched_task_call()
On Thu, Feb 19, 2015 at 05:33:59PM +0100, Vojtech Pavlik wrote: > On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote: > > > > No, these tasks will _never_ make syscalls. So you need to guarantee > > > they don't accidentally enter the kernel while you flip them. Something > > > like so should do. > > > > > > You set TIF_ENTER_WAIT on them, check they're still in userspace, flip > > > them then clear TIF_ENTER_WAIT. > > > > Ah, that's a good idea. But how do we check if they're in user space? > > I don't see the benefit in holding them in a loop - you can just as well > flip them from the syscall code as kGraft does. But we were talking specifically about HPC tasks which never make syscalls. -- Josh
Re: [PATCH 1/3] sched: add sched_task_call()
On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote: > > No, these tasks will _never_ make syscalls. So you need to guarantee > > they don't accidentally enter the kernel while you flip them. Something > > like so should do. > > > > You set TIF_ENTER_WAIT on them, check they're still in userspace, flip > > them then clear TIF_ENTER_WAIT. > > Ah, that's a good idea. But how do we check if they're in user space? I don't see the benefit in holding them in a loop - you can just as well flip them from the syscall code as kGraft does. > > I still absolutely hate you need to disturb userspace like that. Signals > > are quite visible and perturb userspace state. > > Yeah, SIGSTOP on a sleeping task can be intrusive to user space if it > results in EINTR being returned from a system call. It's not ideal, but > it's much less intrusive than something like suspend. > > But anyway we leave it up to the user to decide whether they want to > take that risk, or wait for the task to wake up on its own, or cancel > the patch. > > > Also, you cannot SIGCONT a task that was SIGSTOP'ed by userspace for > > what they thought was a good reason. You'd wreck their state. > > Hm, that's a good point. Maybe use the freezer instead of signals? > > (Again this would only be for those user tasks which are sleeping on a > patched function) > > > > But now I'm thinking that kthreads will almost never be a problem. Most > > > kthreads are basically this: > > > > You guys are way too optimistic; maybe its because I've worked on > > realtime stuff too much, but I'm always looking at worst cases. If you > > can't handle those, I feel you might as well not bother :-) > > Well, I think we're already resigned to the fact that live patching > won't work for every patch, every time. And that the patch author must > know what they're doing, and must do it carefully. 
> > Our target worst case is that the patching fails gracefully and the user > can't patch their system with that particular patch. Or that the system > remains in a partially patched state forever, if the user is ok with > that. > > > > Patching thread_fn wouldn't be possible unless we killed the thread. > > > > It is, see kthread_park(). > > When the kthread returns from kthread_parkme(), it'll still be running > the old thread_fn code, regardless of whether we flipped the task's > patch state. We could instrument kthread_parkme() to be able to return to a different function, but it'd be a bit crazy indeed. -- Vojtech Pavlik Director SUSE Labs
Re: [PATCH 1/3] sched: add sched_task_call()
On Thu, Feb 19, 2015 at 11:16:07AM +0100, Peter Zijlstra wrote: > On Wed, Feb 18, 2015 at 10:17:53PM -0600, Josh Poimboeuf wrote: > > On Thu, Feb 19, 2015 at 01:20:58AM +0100, Peter Zijlstra wrote: > > > On Wed, Feb 18, 2015 at 11:12:56AM -0600, Josh Poimboeuf wrote: > > > > > The next line of attack is patching tasks when exiting the kernel to > > > > user space (system calls, interrupts, signals), to catch all CPU-bound > > > > and some I/O-bound tasks. That's done in patch 9 [1] of the consistency > > > > model patch set. > > > > > > So the HPC people are really into userspace that does for (;;) ; and > > > isolate that on CPUs and have the tick interrupt stopped and all that. > > > > > > You'll not catch those threads on the sysexit path. > > > > > > And I'm fairly sure they'll not want to SIGSTOP/CONT their stuff either. > > > > > > Now its fairly easy to also handle this; just mark those tasks with a > > > _TIF_WORK_SYSCALL_ENTRY flag, have that slowpath wait for the flag to > > > go-away, then flip their state and clear the flag. > > > > I guess you mean patch the task when it makes a syscall? I'm doing that > > already on syscall exit with a bit in _TIF_ALLWORK_MASK and > > _TIF_DO_NOTIFY_MASK. > > No, these tasks will _never_ make syscalls. So you need to guarantee > they don't accidentally enter the kernel while you flip them. Something > like so should do. > > You set TIF_ENTER_WAIT on them, check they're still in userspace, flip > them then clear TIF_ENTER_WAIT. Ah, that's a good idea. But how do we check if they're in user space? > > > > As a last resort, if there are still any tasks which are sleeping on a > > > > to-be-patched function, the user can send them SIGSTOP and SIGCONT to > > > > force them to be patched. > > > > > > You typically cannot SIGSTOP/SIGCONT kernel threads. Also > > > TASK_UNINTERRUPTIBLE sleeps are unaffected by signals. > > > > > > Bit pesky that.. needs pondering. 
> > I still absolutely hate you need to disturb userspace like that. Signals > are quite visible and perturb userspace state. Yeah, SIGSTOP on a sleeping task can be intrusive to user space if it results in EINTR being returned from a system call. It's not ideal, but it's much less intrusive than something like suspend. But anyway we leave it up to the user to decide whether they want to take that risk, or wait for the task to wake up on its own, or cancel the patch. > Also, you cannot SIGCONT a task that was SIGSTOP'ed by userspace for > what they thought was a good reason. You'd wreck their state. Hm, that's a good point. Maybe use the freezer instead of signals? (Again this would only be for those user tasks which are sleeping on a patched function) > > But now I'm thinking that kthreads will almost never be a problem. Most > > kthreads are basically this: > > You guys are way too optimistic; maybe its because I've worked on > realtime stuff too much, but I'm always looking at worst cases. If you > can't handle those, I feel you might as well not bother :-) Well, I think we're already resigned to the fact that live patching won't work for every patch, every time. And that the patch author must know what they're doing, and must do it carefully. Our target worst case is that the patching fails gracefully and the user can't patch their system with that particular patch. Or that the system remains in a partially patched state forever, if the user is ok with that. > > Patching thread_fn wouldn't be possible unless we killed the thread. > > It is, see kthread_park(). When the kthread returns from kthread_parkme(), it'll still be running the old thread_fn code, regardless of whether we flipped the task's patch state. -- Josh
Re: [PATCH 1/3] sched: add sched_task_call()
On Wed, Feb 18, 2015 at 10:17:53PM -0600, Josh Poimboeuf wrote: > On Thu, Feb 19, 2015 at 01:20:58AM +0100, Peter Zijlstra wrote: > > On Wed, Feb 18, 2015 at 11:12:56AM -0600, Josh Poimboeuf wrote: > > > The next line of attack is patching tasks when exiting the kernel to > > > user space (system calls, interrupts, signals), to catch all CPU-bound > > > and some I/O-bound tasks. That's done in patch 9 [1] of the consistency > > > model patch set. > > > > So the HPC people are really into userspace that does for (;;) ; and > > isolate that on CPUs and have the tick interrupt stopped and all that. > > > > You'll not catch those threads on the sysexit path. > > > > And I'm fairly sure they'll not want to SIGSTOP/CONT their stuff either. > > > > Now its fairly easy to also handle this; just mark those tasks with a > > _TIF_WORK_SYSCALL_ENTRY flag, have that slowpath wait for the flag to > > go-away, then flip their state and clear the flag. > > I guess you mean patch the task when it makes a syscall? I'm doing that > already on syscall exit with a bit in _TIF_ALLWORK_MASK and > _TIF_DO_NOTIFY_MASK. No, these tasks will _never_ make syscalls. So you need to guarantee they don't accidentally enter the kernel while you flip them. Something like so should do. You set TIF_ENTER_WAIT on them, check they're still in userspace, flip them then clear TIF_ENTER_WAIT. 
---
 arch/x86/include/asm/thread_info.h | 4 +++-
 arch/x86/kernel/entry_64.S         | 2 ++
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index e82e95abc92b..baa836f13536 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -90,6 +90,7 @@ struct thread_info {
 #define TIF_SYSCALL_TRACEPOINT	28	/* syscall tracepoint instrumentation */
 #define TIF_ADDR32		29	/* 32-bit address space on 64 bits */
 #define TIF_X32			30	/* 32-bit native x86-64 binary */
+#define TIF_ENTER_WAIT		31

 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
@@ -113,12 +114,13 @@ struct thread_info {
 #define _TIF_SYSCALL_TRACEPOINT	(1 << TIF_SYSCALL_TRACEPOINT)
 #define _TIF_ADDR32		(1 << TIF_ADDR32)
 #define _TIF_X32		(1 << TIF_X32)
+#define _TIF_ENTER_WAIT		(1 << TIF_ENTER_WAIT)

 /* work to do in syscall_trace_enter() */
 #define _TIF_WORK_SYSCALL_ENTRY	\
 	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT |	\
 	 _TIF_SECCOMP | _TIF_SINGLESTEP | _TIF_SYSCALL_TRACEPOINT |	\
-	 _TIF_NOHZ)
+	 _TIF_NOHZ | _TIF_ENTER_WAIT)

 /* work to do in syscall_trace_leave() */
 #define _TIF_WORK_SYSCALL_EXIT	\
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index db13655c3a2a..735566b35903 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -387,6 +387,8 @@ GLOBAL(system_call_after_swapgs)
 	/* Do syscall tracing */
 tracesys:
+	andl $_TIF_ENTER_WAIT,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
+	jnz tracesys;
 	leaq -REST_SKIP(%rsp), %rdi
 	movq $AUDIT_ARCH_X86_64, %rsi
 	call syscall_trace_enter_phase1

> > > As a last resort, if there are still any tasks which are sleeping on a > > > to-be-patched function, the user can send them SIGSTOP and SIGCONT to > > > force them to be patched. > > > > You typically cannot SIGSTOP/SIGCONT kernel threads. Also > > TASK_UNINTERRUPTIBLE sleeps are unaffected by signals. > > > > Bit pesky that..
needs pondering. I still absolutely hate you need to disturb userspace like that. Signals are quite visible and perturb userspace state. Also, you cannot SIGCONT a task that was SIGSTOP'ed by userspace for what they thought was a good reason. You'd wreck their state. > But now I'm thinking that kthreads will almost never be a problem. Most > kthreads are basically this: You guys are way too optimistic; maybe it's because I've worked on realtime stuff too much, but I'm always looking at worst cases. If you can't handle those, I feel you might as well not bother :-) > Patching thread_fn wouldn't be possible unless we killed the thread. It is, see kthread_park().
Re: [PATCH 1/3] sched: add sched_task_call()
On Thu, Feb 19, 2015 at 11:32:55AM -0600, Josh Poimboeuf wrote: On Thu, Feb 19, 2015 at 06:19:29PM +0100, Vojtech Pavlik wrote: On Thu, Feb 19, 2015 at 11:03:53AM -0600, Josh Poimboeuf wrote: On Thu, Feb 19, 2015 at 05:33:59PM +0100, Vojtech Pavlik wrote: On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote: No, these tasks will _never_ make syscalls. So you need to guarantee they don't accidentally enter the kernel while you flip them. Something like so should do. You set TIF_ENTER_WAIT on them, check they're still in userspace, flip them then clear TIF_ENTER_WAIT. Ah, that's a good idea. But how do we check if they're in user space? I don't see the benefit in holding them in a loop - you can just as well flip them from the syscall code as kGraft does. But we were talking specifically about HPC tasks which never make syscalls. Yes. I'm saying that rather than guaranteeing they don't enter the kernel (by having them spin) you can flip them in case they try to do that instead. That solves the race condition just as well. Ok, gotcha. We'd still need a safe way to check if they're in user space though. Having a safe way would be very nice and actually quite useful in other cases, too. For this specific purpose, however, we don't need a very safe way, though. We don't require atomicity in any way, we don't mind even if it creates false negatives, only false positives would be bad. kGraft looks at the stacktrace of CPU hogs and if it finds no kernel addresses there, it assumes userspace. Not very nice, but does the job. How about with a TIF_IN_USERSPACE thread flag? It could be cleared/set right at the border. 
Then for running tasks it's as simple as: if (test_tsk_thread_flag(task, TIF_IN_USERSPACE)) klp_switch_task_universe(task); -- Vojtech Pavlik Director SUSE Labs -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/3] sched: add sched_task_call()
On Wed, Feb 18, 2015 at 10:17:53PM -0600, Josh Poimboeuf wrote: On Thu, Feb 19, 2015 at 01:20:58AM +0100, Peter Zijlstra wrote: On Wed, Feb 18, 2015 at 11:12:56AM -0600, Josh Poimboeuf wrote: The next line of attack is patching tasks when exiting the kernel to user space (system calls, interrupts, signals), to catch all CPU-bound and some I/O-bound tasks. That's done in patch 9 [1] of the consistency model patch set. So the HPC people are really into userspace that does for (;;) ; and isolate that on CPUs and have the tick interrupt stopped and all that. You'll not catch those threads on the sysexit path. And I'm fairly sure they'll not want to SIGSTOP/CONT their stuff either. Now its fairly easy to also handle this; just mark those tasks with a _TIF_WORK_SYSCALL_ENTRY flag, have that slowpath wait for the flag to go-away, then flip their state and clear the flag. I guess you mean patch the task when it makes a syscall? I'm doing that already on syscall exit with a bit in _TIF_ALLWORK_MASK and _TIF_DO_NOTIFY_MASK. No, these tasks will _never_ make syscalls. So you need to guarantee they don't accidentally enter the kernel while you flip them. Something like so should do. You set TIF_ENTER_WAIT on them, check they're still in userspace, flip them then clear TIF_ENTER_WAIT. 
--- arch/x86/include/asm/thread_info.h | 4 +++- arch/x86/kernel/entry_64.S | 2 ++ 2 files changed, 5 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h index e82e95abc92b..baa836f13536 100644 --- a/arch/x86/include/asm/thread_info.h +++ b/arch/x86/include/asm/thread_info.h @@ -90,6 +90,7 @@ struct thread_info { #define TIF_SYSCALL_TRACEPOINT 28 /* syscall tracepoint instrumentation */ #define TIF_ADDR32 29 /* 32-bit address space on 64 bits */ #define TIF_X3230 /* 32-bit native x86-64 binary */ +#define TIF_ENTER_WAIT 31 #define _TIF_SYSCALL_TRACE (1 TIF_SYSCALL_TRACE) #define _TIF_NOTIFY_RESUME (1 TIF_NOTIFY_RESUME) @@ -113,12 +114,13 @@ struct thread_info { #define _TIF_SYSCALL_TRACEPOINT(1 TIF_SYSCALL_TRACEPOINT) #define _TIF_ADDR32(1 TIF_ADDR32) #define _TIF_X32 (1 TIF_X32) +#define _TIF_ENTER_WAIT(1 TIF_ENTER_WAIT) /* work to do in syscall_trace_enter() */ #define _TIF_WORK_SYSCALL_ENTRY\ (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT | \ _TIF_SECCOMP | _TIF_SINGLESTEP | _TIF_SYSCALL_TRACEPOINT | \ -_TIF_NOHZ) +_TIF_NOHZ | _TIF_ENTER_WAIT) /* work to do in syscall_trace_leave() */ #define _TIF_WORK_SYSCALL_EXIT \ diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S index db13655c3a2a..735566b35903 100644 --- a/arch/x86/kernel/entry_64.S +++ b/arch/x86/kernel/entry_64.S @@ -387,6 +387,8 @@ GLOBAL(system_call_after_swapgs) /* Do syscall tracing */ tracesys: + andl $_TIF_ENTER_WAIT,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET) + jnz tracesys; leaq -REST_SKIP(%rsp), %rdi movq $AUDIT_ARCH_X86_64, %rsi call syscall_trace_enter_phase1 As a last resort, if there are still any tasks which are sleeping on a to-be-patched function, the user can send them SIGSTOP and SIGCONT to force them to be patched. You typically cannot SIGSTOP/SIGCONT kernel threads. Also TASK_UNINTERRUPTIBLE sleeps are unaffected by signals. Bit pesky that.. needs pondering. 
I still absolutely hate you need to disturb userspace like that. Signals are quite visible and perturb userspace state. Also, you cannot SIGCONT a task that was SIGSTOP'ed by userspace for what they thought was a good reason. You'd wreck their state. But now I'm thinking that kthreads will almost never be a problem. Most kthreads are basically this: You guys are way too optimistic; maybe its because I've worked on realtime stuff too much, but I'm always looking at worst cases. If you can't handle those, I feel you might as well not bother :-) Patching thread_fn wouldn't be possible unless we killed the thread. It is, see kthread_park(). -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/3] sched: add sched_task_call()
On Thu, 19 Feb 2015, Josh Poimboeuf wrote: How about with a TIF_IN_USERSPACE thread flag? It could be cleared/set right at the border. Then for running tasks it's as simple as: if (test_tsk_thread_flag(task, TIF_IN_USERSPACE)) klp_switch_task_universe(task); That's in principle what CONTEXT_TRACKING is doing, i.e. the condition we'd be interested in would be __this_cpu_read(context_tracking.state) == IN_USER But it has overhead. -- Jiri Kosina SUSE Labs -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/3] sched: add sched_task_call()
On Thu, 19 Feb 2015, Jiri Kosina wrote: How about with a TIF_IN_USERSPACE thread flag? It could be cleared/set right at the border. Then for running tasks it's as simple as: if (test_tsk_thread_flag(task, TIF_IN_USERSPACE)) klp_switch_task_universe(task); That's in principle what CONTEXT_TRACKING is doing, i.e. the condition we'd be interested in would be __this_cpu_read(context_tracking.state) == IN_USER Well, more precisely per_cpu(context_tracking.state, cpu) == IN_USER of course. -- Jiri Kosina SUSE Labs -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/3] sched: add sched_task_call()
On Thu, Feb 19, 2015 at 09:40:36PM +0100, Vojtech Pavlik wrote: On Thu, Feb 19, 2015 at 11:32:55AM -0600, Josh Poimboeuf wrote: On Thu, Feb 19, 2015 at 06:19:29PM +0100, Vojtech Pavlik wrote: On Thu, Feb 19, 2015 at 11:03:53AM -0600, Josh Poimboeuf wrote: On Thu, Feb 19, 2015 at 05:33:59PM +0100, Vojtech Pavlik wrote: On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote: No, these tasks will _never_ make syscalls. So you need to guarantee they don't accidentally enter the kernel while you flip them. Something like so should do. You set TIF_ENTER_WAIT on them, check they're still in userspace, flip them then clear TIF_ENTER_WAIT. Ah, that's a good idea. But how do we check if they're in user space? I don't see the benefit in holding them in a loop - you can just as well flip them from the syscall code as kGraft does. But we were talking specifically about HPC tasks which never make syscalls. Yes. I'm saying that rather than guaranteeing they don't enter the kernel (by having them spin) you can flip them in case they try to do that instead. That solves the race condition just as well. Ok, gotcha. We'd still need a safe way to check if they're in user space though. Having a safe way would be very nice and actually quite useful in other cases, too. For this specific purpose, however, we don't need a very safe way, though. We don't require atomicity in any way, we don't mind even if it creates false negatives, only false positives would be bad. kGraft looks at the stacktrace of CPU hogs and if it finds no kernel addresses there, it assumes userspace. Not very nice, but does the job. So I've looked at kgr_needs_lazy_migration(), but I still have no idea how it works. First of all, I think reading the stack while its being written to could give you some garbage values, and a completely wrong nr_entries value from save_stack_trace_tsk(). But also, how would you walk a stack without knowing its stack pointer? 
That function relies on the saved stack pointer in task_struct.thread.sp, which, AFAICT, was last saved during the last call to schedule(). Since then, the stack could have been completely rewritten, with different-sized stack frames, before the task exited the kernel.

Am I missing something?

-- 
Josh
Re: [PATCH 1/3] sched: add sched_task_call()
On Thu, 19 Feb 2015, Josh Poimboeuf wrote:

> So I've looked at kgr_needs_lazy_migration(), but I still have no idea
> how it works. First of all, I think reading the stack while it's being
> written to could give you some garbage values, and a completely wrong
> nr_entries value from save_stack_trace_tsk().

I believe we've already been discussing this some time ago ... I agree that this is a very crude optimization that should probably be either removed (which would only cause slower convergence in the presence of CPU-bound tasks), or rewritten to perform IPI-based stack dumping (probably on a voluntarily-configurable basis).

Reading garbage values could only happen if the task were running in kernelspace; nr_entries would then be at least 2. But I agree that relying on this very specific behavior is not really safe in general, in case someone changes the stack dumping implementation in the future in an unpredictable way.

> But also, how would you walk a stack without knowing its stack pointer?
> That function relies on the saved stack pointer in
> task_struct.thread.sp, which, AFAICT, was last saved during the last
> call to schedule(). Since then, the stack could have been completely
> rewritten, with different size stack frames, before the task exited
> the kernel.

Same argument holds here as well, I believe.

Thanks,

-- 
Jiri Kosina
SUSE Labs
Re: [PATCH 1/3] sched: add sched_task_call()
On Thu, Feb 19, 2015 at 11:16:07AM +0100, Peter Zijlstra wrote: On Wed, Feb 18, 2015 at 10:17:53PM -0600, Josh Poimboeuf wrote: On Thu, Feb 19, 2015 at 01:20:58AM +0100, Peter Zijlstra wrote: On Wed, Feb 18, 2015 at 11:12:56AM -0600, Josh Poimboeuf wrote: The next line of attack is patching tasks when exiting the kernel to user space (system calls, interrupts, signals), to catch all CPU-bound and some I/O-bound tasks. That's done in patch 9 [1] of the consistency model patch set. So the HPC people are really into userspace that does for (;;) ; and isolate that on CPUs and have the tick interrupt stopped and all that. You'll not catch those threads on the sysexit path. And I'm fairly sure they'll not want to SIGSTOP/CONT their stuff either. Now its fairly easy to also handle this; just mark those tasks with a _TIF_WORK_SYSCALL_ENTRY flag, have that slowpath wait for the flag to go-away, then flip their state and clear the flag. I guess you mean patch the task when it makes a syscall? I'm doing that already on syscall exit with a bit in _TIF_ALLWORK_MASK and _TIF_DO_NOTIFY_MASK. No, these tasks will _never_ make syscalls. So you need to guarantee they don't accidentally enter the kernel while you flip them. Something like so should do. You set TIF_ENTER_WAIT on them, check they're still in userspace, flip them then clear TIF_ENTER_WAIT. Ah, that's a good idea. But how do we check if they're in user space? As a last resort, if there are still any tasks which are sleeping on a to-be-patched function, the user can send them SIGSTOP and SIGCONT to force them to be patched. You typically cannot SIGSTOP/SIGCONT kernel threads. Also TASK_UNINTERRUPTIBLE sleeps are unaffected by signals. Bit pesky that.. needs pondering. I still absolutely hate you need to disturb userspace like that. Signals are quite visible and perturb userspace state. Yeah, SIGSTOP on a sleeping task can be intrusive to user space if it results in EINTR being returned from a system call. 
It's not ideal, but it's much less intrusive than something like suspend. But anyway we leave it up to the user to decide whether they want to take that risk, or wait for the task to wake up on its own, or cancel the patch.

Also, you cannot SIGCONT a task that was SIGSTOP'ed by userspace for what they thought was a good reason. You'd wreck their state.

Hm, that's a good point. Maybe use the freezer instead of signals? (Again this would only be for those user tasks which are sleeping on a patched function)

But now I'm thinking that kthreads will almost never be a problem. Most kthreads are basically this:

You guys are way too optimistic; maybe it's because I've worked on realtime stuff too much, but I'm always looking at worst cases. If you can't handle those, I feel you might as well not bother :-)

Well, I think we're already resigned to the fact that live patching won't work for every patch, every time. And that the patch author must know what they're doing, and must do it carefully. Our target worst case is that the patching fails gracefully and the user can't patch their system with that particular patch. Or that the system remains in a partially patched state forever, if the user is ok with that.

Patching thread_fn wouldn't be possible unless we killed the thread.

It is, see kthread_park().

When the kthread returns from kthread_parkme(), it'll still be running the old thread_fn code, regardless of whether we flipped the task's patch state.

-- 
Josh
Re: [PATCH 1/3] sched: add sched_task_call()
On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote: No, these tasks will _never_ make syscalls. So you need to guarantee they don't accidentally enter the kernel while you flip them. Something like so should do. You set TIF_ENTER_WAIT on them, check they're still in userspace, flip them then clear TIF_ENTER_WAIT. Ah, that's a good idea. But how do we check if they're in user space? I don't see the benefit in holding them in a loop - you can just as well flip them from the syscall code as kGraft does. I still absolutely hate you need to disturb userspace like that. Signals are quite visible and perturb userspace state. Yeah, SIGSTOP on a sleeping task can be intrusive to user space if it results in EINTR being returned from a system call. It's not ideal, but it's much less intrusive than something like suspend. But anyway we leave it up to the user to decide whether they want to take that risk, or wait for the task to wake up on its own, or cancel the patch. Also, you cannot SIGCONT a task that was SIGSTOP'ed by userspace for what they thought was a good reason. You'd wreck their state. Hm, that's a good point. Maybe use the freezer instead of signals? (Again this would only be for those user tasks which are sleeping on a patched function) But now I'm thinking that kthreads will almost never be a problem. Most kthreads are basically this: You guys are way too optimistic; maybe its because I've worked on realtime stuff too much, but I'm always looking at worst cases. If you can't handle those, I feel you might as well not bother :-) Well, I think we're already resigned to the fact that live patching won't work for every patch, every time. And that the patch author must know what they're doing, and must do it carefully. Our target worst case is that the patching fails gracefully and the user can't patch their system with that particular patch. Or that the system remains in a partially patched state forever, if the user is ok with that. 
Patching thread_fn wouldn't be possible unless we killed the thread.

It is, see kthread_park().

When the kthread returns from kthread_parkme(), it'll still be running the old thread_fn code, regardless of whether we flipped the task's patch state.

We could instrument kthread_parkme() to be able to return to a different function, but it'd be a bit crazy indeed.

-- 
Vojtech Pavlik
Director SUSE Labs
Re: [PATCH 1/3] sched: add sched_task_call()
On Thu, 19 Feb 2015, Josh Poimboeuf wrote: No, these tasks will _never_ make syscalls. So you need to guarantee they don't accidentally enter the kernel while you flip them. Something like so should do. You set TIF_ENTER_WAIT on them, check they're still in userspace, flip them then clear TIF_ENTER_WAIT. Ah, that's a good idea. But how do we check if they're in user space? I don't see the benefit in holding them in a loop - you can just as well flip them from the syscall code as kGraft does. But we were talking specifically about HPC tasks which never make syscalls.

Yeah. And getting an answer to "is this task_struct currently running in userspace?" is not easy in a non-disruptive way.

Piggy-backing on CONFIG_CONTEXT_TRACKING is one option, but I guess the number of systems that have this option turned on will be rather limited (especially in HPC space).

Sending IPIs around to check whether the task is running in user-space or kernel-space would work, but it's disruptive to the running task, which might be unwanted as well.

-- 
Jiri Kosina
SUSE Labs
Re: [PATCH 1/3] sched: add sched_task_call()
On Thu, Feb 19, 2015 at 11:03:53AM -0600, Josh Poimboeuf wrote: On Thu, Feb 19, 2015 at 05:33:59PM +0100, Vojtech Pavlik wrote: On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote: No, these tasks will _never_ make syscalls. So you need to guarantee they don't accidentally enter the kernel while you flip them. Something like so should do. You set TIF_ENTER_WAIT on them, check they're still in userspace, flip them then clear TIF_ENTER_WAIT. Ah, that's a good idea. But how do we check if they're in user space? I don't see the benefit in holding them in a loop - you can just as well flip them from the syscall code as kGraft does. But we were talking specifically about HPC tasks which never make syscalls.

Yes. I'm saying that rather than guaranteeing they don't enter the kernel (by having them spin) you can flip them in case they try to do that instead. That solves the race condition just as well.

-- 
Vojtech Pavlik
Director SUSE Labs
Re: [PATCH 1/3] sched: add sched_task_call()
On Thu, Feb 19, 2015 at 06:19:29PM +0100, Vojtech Pavlik wrote: On Thu, Feb 19, 2015 at 11:03:53AM -0600, Josh Poimboeuf wrote: On Thu, Feb 19, 2015 at 05:33:59PM +0100, Vojtech Pavlik wrote: On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote: No, these tasks will _never_ make syscalls. So you need to guarantee they don't accidentally enter the kernel while you flip them. Something like so should do. You set TIF_ENTER_WAIT on them, check they're still in userspace, flip them then clear TIF_ENTER_WAIT. Ah, that's a good idea. But how do we check if they're in user space? I don't see the benefit in holding them in a loop - you can just as well flip them from the syscall code as kGraft does. But we were talking specifically about HPC tasks which never make syscalls. Yes. I'm saying that rather than guaranteeing they don't enter the kernel (by having them spin) you can flip them in case they try to do that instead. That solves the race condition just as well.

Ok, gotcha. We'd still need a safe way to check if they're in user space though.

How about with a TIF_IN_USERSPACE thread flag? It could be cleared/set right at the border. Then for running tasks it's as simple as:

	if (test_tsk_thread_flag(task, TIF_IN_USERSPACE))
		klp_switch_task_universe(task);

-- 
Josh
Re: [PATCH 1/3] sched: add sched_task_call()
On Thu, Feb 19, 2015 at 05:33:59PM +0100, Vojtech Pavlik wrote: On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote: No, these tasks will _never_ make syscalls. So you need to guarantee they don't accidentally enter the kernel while you flip them. Something like so should do. You set TIF_ENTER_WAIT on them, check they're still in userspace, flip them then clear TIF_ENTER_WAIT. Ah, that's a good idea. But how do we check if they're in user space? I don't see the benefit in holding them in a loop - you can just as well flip them from the syscall code as kGraft does.

But we were talking specifically about HPC tasks which never make syscalls.

-- 
Josh
Re: [PATCH 1/3] sched: add sched_task_call()
On February 19, 2015 6:32:55 PM CET, Josh Poimboeuf jpoim...@redhat.com wrote: Yes. I'm saying that rather than guaranteeing they don't enter the kernel (by having them spin) you can flip them in case they try to do that instead. That solves the race condition just as well. Ok, gotcha. We'd still need a safe way to check if they're in user space though. How about with a TIF_IN_USERSPACE thread flag? It could be cleared/set right at the border. Then for running tasks it's as simple as:

	if (test_tsk_thread_flag(task, TIF_IN_USERSPACE))
		klp_switch_task_universe(task);

The s390x arch used to have a TIF_SYSCALL, which was doing exactly that (well, negated). I think it would work well, but it isn't entirely for free: two instructions per syscall and an extra TIF bit, which are (near) exhausted on some archs.

-- 
Vojtech Pavlik
Director SuSE Labs
Re: [PATCH 1/3] sched: add sched_task_call()
On Thu, Feb 19, 2015 at 10:26:09PM +0100, Jiri Kosina wrote: On Thu, 19 Feb 2015, Josh Poimboeuf wrote: How about with a TIF_IN_USERSPACE thread flag? It could be cleared/set right at the border. Then for running tasks it's as simple as:

	if (test_tsk_thread_flag(task, TIF_IN_USERSPACE))
		klp_switch_task_universe(task);

That's in principle what CONTEXT_TRACKING is doing, i.e. the condition we'd be interested in would be

	__this_cpu_read(context_tracking.state) == IN_USER

But it has overhead.

Yeah, that does seem to pretty much do what we want. Unfortunately it has a much higher overhead than just setting a thread flag. And from the Kconfig description for CONTEXT_TRACKING_FORCE, which would enable it on all CPUs during boot, "this option brings an overhead that you don't want in production".

Maybe that code needs a rewrite to rely on a thread flag instead. Then we could use it too.

-- 
Josh
Re: [PATCH 1/3] sched: add sched_task_call()
On Thu, Feb 19, 2015 at 01:20:58AM +0100, Peter Zijlstra wrote: > On Wed, Feb 18, 2015 at 11:12:56AM -0600, Josh Poimboeuf wrote: > > > So uhm, what happens if your target task is running? When will you > > > retry? The problem I see is that if you do a sample approach you might > > > never hit an opportune moment. > > > > We attack it from multiple angles. > > > > First we check the stack of all sleeping tasks. That patches the > > majority of tasks immediately. If necessary, we also do that > > periodically in a workqueue to catch any stragglers. > > So not only do you need an 'atomic' stack save, you need to analyze and > flip its state in the same atomic region. The task cannot start running > again after the save and start using old functions while you analyze the > stack and flip it. Yes, exactly. > > The next line of attack is patching tasks when exiting the kernel to > > user space (system calls, interrupts, signals), to catch all CPU-bound > > and some I/O-bound tasks. That's done in patch 9 [1] of the consistency > > model patch set. > > So the HPC people are really into userspace that does for (;;) ; and > isolate that on CPUs and have the tick interrupt stopped and all that. > > You'll not catch those threads on the sysexit path. > > And I'm fairly sure they'll not want to SIGSTOP/CONT their stuff either. > > Now its fairly easy to also handle this; just mark those tasks with a > _TIF_WORK_SYSCALL_ENTRY flag, have that slowpath wait for the flag to > go-away, then flip their state and clear the flag. I guess you mean patch the task when it makes a syscall? I'm doing that already on syscall exit with a bit in _TIF_ALLWORK_MASK and _TIF_DO_NOTIFY_MASK. > > As a last resort, if there are still any tasks which are sleeping on a > > to-be-patched function, the user can send them SIGSTOP and SIGCONT to > > force them to be patched. > > You typically cannot SIGSTOP/SIGCONT kernel threads. Also > TASK_UNINTERRUPTIBLE sleeps are unaffected by signals. 
> > Bit pesky that.. needs pondering.

I did have a scheme for patching kthreads which are sleeping on to-be-patched functions. But now I'm thinking that kthreads will almost never be a problem. Most kthreads are basically this:

	void thread_fn()
	{
		while (1) {
			/* do some stuff */
			schedule();
			/* do other stuff */
		}
	}

So a kthread would typically only fail the stack check if we're trying to patch either schedule() or the top-level thread_fn. Patching thread_fn wouldn't be possible unless we killed the thread. And I'd guess we can probably live without being able to patch schedule() for now :-)

-- 
Josh
Re: [PATCH 1/3] sched: add sched_task_call()
On Wed, Feb 18, 2015 at 11:12:56AM -0600, Josh Poimboeuf wrote: > > > a) spend the time to ensure the unwinding code is correct and resilient > > >to errors; > > > > > > b) leave the consistency model compiled code out if !FRAME_POINTER and > > >allow users to patch without one (similar to the livepatch code > > >that's already in the livepatch tree today); or > > > > Which has a much more limited set of patches it can do, right? > > If we're talking security patches, roughly 9 out of 10 of them don't > change function prototypes, data, or data semantics. If the patch > author is careful, he or she doesn't need a consistency model in those > cases. > > So it's not _overly_ limited. If somebody's not happy about those > limitations, they can put in the work for option a :-) OK. So all this is really only about really funny bits. > > So uhm, what happens if your target task is running? When will you > > retry? The problem I see is that if you do a sample approach you might > > never hit an opportune moment. > > We attack it from multiple angles. > > First we check the stack of all sleeping tasks. That patches the > majority of tasks immediately. If necessary, we also do that > periodically in a workqueue to catch any stragglers. So not only do you need an 'atomic' stack save, you need to analyze and flip its state in the same atomic region. The task cannot start running again after the save and start using old functions while you analyze the stack and flip it. > The next line of attack is patching tasks when exiting the kernel to > user space (system calls, interrupts, signals), to catch all CPU-bound > and some I/O-bound tasks. That's done in patch 9 [1] of the consistency > model patch set. So the HPC people are really into userspace that does for (;;) ; and isolate that on CPUs and have the tick interrupt stopped and all that. You'll not catch those threads on the sysexit path. And I'm fairly sure they'll not want to SIGSTOP/CONT their stuff either. 
Now it's fairly easy to also handle this; just mark those tasks with a _TIF_WORK_SYSCALL_ENTRY flag, have that slowpath wait for the flag to go away, then flip their state and clear the flag.

> As a last resort, if there are still any tasks which are sleeping on a
> to-be-patched function, the user can send them SIGSTOP and SIGCONT to
> force them to be patched.

You typically cannot SIGSTOP/SIGCONT kernel threads. Also TASK_UNINTERRUPTIBLE sleeps are unaffected by signals.

Bit pesky that.. needs pondering.

-- 
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/3] sched: add sched_task_call()
On Wed, Feb 18, 2015 at 04:21:00PM +0100, Peter Zijlstra wrote: > On Tue, Feb 17, 2015 at 03:25:32PM -0600, Josh Poimboeuf wrote: > > > And I'm assuming you're hard relying on CONFIG_FRAMEPOINTER here, > > > because otherwise x86 stacks are a mess too. > > > > Yeah, it'll rely on CONFIG_FRAME_POINTER. IIUC, the arches we care > > about now (x86, power, s390, arm64) all have frame pointer support, and > > the stack formats are all sane, AFAICT. > > > > If we ever port livepatch to a more obscure arch for which walking the > > stack is more difficult, we'll have several options: > > > > a) spend the time to ensure the unwinding code is correct and resilient > >to errors; > > > > b) leave the consistency model compiled code out if !FRAME_POINTER and > >allow users to patch without one (similar to the livepatch code > >that's already in the livepatch tree today); or > > Which has a much more limited set of patches it can do, right? If we're talking security patches, roughly 9 out of 10 of them don't change function prototypes, data, or data semantics. If the patch author is careful, he or she doesn't need a consistency model in those cases. So it's not _overly_ limited. If somebody's not happy about those limitations, they can put in the work for option a :-) > > > And then hope you can get a better trace next time around? Or will you > > > fall-back to an alternative method of patching? > > > > Yeah, on second thought, we wouldn't have to cancel the patch. We could > > defer to check the task's stack again at a later time. If it's stuck > > there, the user can try sending it a signal to unstick it, or cancel the > > patching process. Those mechanisms are already in place with my > > consistency model patch set. > > > > I'd also do a WARN_ON_ONCE and a dump of the full stack data, since I'm > > guessing it would either indicate an error in the unwinding code or > > point us to an unexpected stack condition. > > So uhm, what happens if your target task is running? 
When will you > retry? The problem I see is that if you do a sample approach you might > never hit an opportune moment. We attack it from multiple angles. First we check the stack of all sleeping tasks. That patches the majority of tasks immediately. If necessary, we also do that periodically in a workqueue to catch any stragglers. The next line of attack is patching tasks when exiting the kernel to user space (system calls, interrupts, signals), to catch all CPU-bound and some I/O-bound tasks. That's done in patch 9 [1] of the consistency model patch set. As a last resort, if there are still any tasks which are sleeping on a to-be-patched function, the user can send them SIGSTOP and SIGCONT to force them to be patched. If none of those work, the user has the option of either canceling the patch or just waiting to see if the patching process eventually completes. > > > I'm further thinking we don't actually need 2 (or more) different means > > > of live patching in the kernel. So you all had better sit down (again) > > > and come up with something you all agree on. > > > > Yeah, I also _really_ want to avoid multiple consistency models. > > > > In fact, that's a big motivation behind my consistency model patch set. > > It's heavily inspired by a suggestion from Vojtech [1]. It combines > > kpatch (backtrace checking) with kGraft (per-thread consistency). It > > has several advantages over either of the individual approaches. See > > http://lwn.net/Articles/632582/ where I describe its pros over both > > kpatch and kGraft. > > > > Jiri, would you and Vojtech agree that the proposed consistency model is > > all we need? Or do you still want to do the multiple consistency model > > thing? > > Skimmed that thread; you all mostly seem to agree that one would be good > but not quite agree on which one. Well, I think we at least agreed that LEAVE_PATCHED_SET + SWITCH_THREAD is the most interesting combination. That's what my patches implement. 
Let's wait to see what Jiri and Vojtech say.

> And I note, not all seem to require this stack walking stuff.

True. But without stack walking, you have bigger problems:

1. You have to signal _all_ sleeping tasks to force them to be patched. That's much more disruptive to the system.

2. There's no way to convert kthreads without either:

   a) hacking up all the various kthread loops to call klp_update_task_universe(); or

   b) converting all kthreads' code to be freezable. Then we could freeze all tasks to patch them. Not only would it be a lot of work to make all kthreads freezable, but it would encode within the kernel the dangerous and non-obvious assumption that the freeze point is a "safe" place for patching.

[1] https://lkml.org/lkml/2015/2/9/478

-- 
Josh
Re: [PATCH 1/3] sched: add sched_task_call()
On Tue, Feb 17, 2015 at 03:25:32PM -0600, Josh Poimboeuf wrote: > > And I'm assuming you're hard relying on CONFIG_FRAMEPOINTER here, > > because otherwise x86 stacks are a mess too. > > Yeah, it'll rely on CONFIG_FRAME_POINTER. IIUC, the arches we care > about now (x86, power, s390, arm64) all have frame pointer support, and > the stack formats are all sane, AFAICT. > > If we ever port livepatch to a more obscure arch for which walking the > stack is more difficult, we'll have several options: > > a) spend the time to ensure the unwinding code is correct and resilient >to errors; > > b) leave the consistency model compiled code out if !FRAME_POINTER and >allow users to patch without one (similar to the livepatch code >that's already in the livepatch tree today); or Which has a much more limited set of patches it can do, right? > c) not support that arch. Which would be sad of course. > > And then hope you can get a better trace next time around? Or will you > > fall-back to an alternative method of patching? > > Yeah, on second thought, we wouldn't have to cancel the patch. We could > defer to check the task's stack again at a later time. If it's stuck > there, the user can try sending it a signal to unstick it, or cancel the > patching process. Those mechanisms are already in place with my > consistency model patch set. > > I'd also do a WARN_ON_ONCE and a dump of the full stack data, since I'm > guessing it would either indicate an error in the unwinding code or > point us to an unexpected stack condition. So uhm, what happens if your target task is running? When will you retry? The problem I see is that if you do a sample approach you might never hit an opportune moment. > > I'm further thinking we don't actually need 2 (or more) different means > > of live patching in the kernel. So you all had better sit down (again) > > and come up with something you all agree on. > > Yeah, I also _really_ want to avoid multiple consistency models. 
> > In fact, that's a big motivation behind my consistency model patch set.
> > It's heavily inspired by a suggestion from Vojtech [1]. It combines
> > kpatch (backtrace checking) with kGraft (per-thread consistency). It
> > has several advantages over either of the individual approaches. See
> > http://lwn.net/Articles/632582/ where I describe its pros over both
> > kpatch and kGraft.
> >
> > Jiri, would you and Vojtech agree that the proposed consistency model is
> > all we need? Or do you still want to do the multiple consistency model
> > thing?

Skimmed that thread; you all mostly seem to agree that one would be good but not quite agree on which one.

And I note, not all seem to require this stack walking stuff.
Re: [PATCH 1/3] sched: add sched_task_call()
On Thu, Feb 19, 2015 at 01:20:58AM +0100, Peter Zijlstra wrote:
> On Wed, Feb 18, 2015 at 11:12:56AM -0600, Josh Poimboeuf wrote:
> > > So uhm, what happens if your target task is running? When will you
> > > retry? The problem I see is that if you do a sample approach you
> > > might never hit an opportune moment.
> >
> > We attack it from multiple angles.  First we check the stack of all
> > sleeping tasks.  That patches the majority of tasks immediately.  If
> > necessary, we also do that periodically in a workqueue to catch any
> > stragglers.
>
> So not only do you need an 'atomic' stack save, you need to analyze
> and flip its state in the same atomic region. The task cannot start
> running again after the save and start using old functions while you
> analyze the stack and flip it.

Yes, exactly.

> > The next line of attack is patching tasks when exiting the kernel to
> > user space (system calls, interrupts, signals), to catch all
> > CPU-bound and some I/O-bound tasks.  That's done in patch 9 [1] of
> > the consistency model patch set.
>
> So the HPC people are really into userspace that does for (;;) ; and
> isolate that on CPUs and have the tick interrupt stopped and all that.
> You'll not catch those threads on the sysexit path.
>
> And I'm fairly sure they'll not want to SIGSTOP/CONT their stuff
> either.
>
> Now it's fairly easy to also handle this; just mark those tasks with a
> _TIF_WORK_SYSCALL_ENTRY flag, have that slowpath wait for the flag to
> go away, then flip their state and clear the flag.

I guess you mean patch the task when it makes a syscall?  I'm doing
that already on syscall exit with a bit in _TIF_ALLWORK_MASK and
_TIF_DO_NOTIFY_MASK.

> > As a last resort, if there are still any tasks which are sleeping on
> > a to-be-patched function, the user can send them SIGSTOP and SIGCONT
> > to force them to be patched.
>
> You typically cannot SIGSTOP/SIGCONT kernel threads. Also
> TASK_UNINTERRUPTIBLE sleeps are unaffected by signals.
>
> Bit pesky that.. needs pondering.

I did have a scheme for patching kthreads which are sleeping on
to-be-patched functions.  But now I'm thinking that kthreads will
almost never be a problem.  Most kthreads are basically this:

	void thread_fn()
	{
		while (1) {
			/* do some stuff */
			schedule();
			/* do other stuff */
		}
	}

So a kthread would typically only fail the stack check if we're trying
to patch either schedule() or the top-level thread_fn.

Patching thread_fn wouldn't be possible unless we killed the thread.

And I'd guess we can probably live without being able to patch
schedule() for now :-)

--
Josh
Re: [PATCH 1/3] sched: add sched_task_call()
On Tue, Feb 17, 2015 at 03:25:32PM -0600, Josh Poimboeuf wrote:
> > And I'm assuming you're hard relying on CONFIG_FRAMEPOINTER here,
> > because otherwise x86 stacks are a mess too.
>
> Yeah, it'll rely on CONFIG_FRAME_POINTER.  IIUC, the arches we care
> about now (x86, power, s390, arm64) all have frame pointer support,
> and the stack formats are all sane, AFAICT.
>
> If we ever port livepatch to a more obscure arch for which walking the
> stack is more difficult, we'll have several options:
>
> a) spend the time to ensure the unwinding code is correct and
>    resilient to errors;
>
> b) leave the consistency model compiled code out if !FRAME_POINTER and
>    allow users to patch without one (similar to the livepatch code
>    that's already in the livepatch tree today); or

Which has a much more limited set of patches it can do, right?

> c) not support that arch.

Which would be sad of course.

> > And then hope you can get a better trace next time around? Or will
> > you fall-back to an alternative method of patching?
>
> Yeah, on second thought, we wouldn't have to cancel the patch.  We
> could defer to check the task's stack again at a later time.  If it's
> stuck there, the user can try sending it a signal to unstick it, or
> cancel the patching process.  Those mechanisms are already in place
> with my consistency model patch set.
>
> I'd also do a WARN_ON_ONCE and a dump of the full stack data, since
> I'm guessing it would either indicate an error in the unwinding code
> or point us to an unexpected stack condition.

So uhm, what happens if your target task is running? When will you
retry? The problem I see is that if you do a sample approach you might
never hit an opportune moment.

> > I'm further thinking we don't actually need 2 (or more) different
> > means of live patching in the kernel. So you all had better sit down
> > (again) and come up with something you all agree on.
>
> Yeah, I also _really_ want to avoid multiple consistency models.  In
> fact, that's a big motivation behind my consistency model patch set.
> It's heavily inspired by a suggestion from Vojtech [1].  It combines
> kpatch (backtrace checking) with kGraft (per-thread consistency).  It
> has several advantages over either of the individual approaches.  See
> http://lwn.net/Articles/632582/ where I describe its pros over both
> kpatch and kGraft.
>
> Jiri, would you and Vojtech agree that the proposed consistency model
> is all we need?  Or do you still want to do the multiple consistency
> model thing?

Skimmed that thread; you all mostly seem to agree that one would be
good but not quite agree on which one.

And I note, not all seem to require this stack walking stuff.
Re: [PATCH 1/3] sched: add sched_task_call()
On Wed, Feb 18, 2015 at 11:12:56AM -0600, Josh Poimboeuf wrote:
> > > a) spend the time to ensure the unwinding code is correct and
> > >    resilient to errors;
> > >
> > > b) leave the consistency model compiled code out if !FRAME_POINTER
> > >    and allow users to patch without one (similar to the livepatch
> > >    code that's already in the livepatch tree today); or
> >
> > Which has a much more limited set of patches it can do, right?
>
> If we're talking security patches, roughly 9 out of 10 of them don't
> change function prototypes, data, or data semantics.  If the patch
> author is careful, he or she doesn't need a consistency model in those
> cases.  So it's not _overly_ limited.
>
> If somebody's not happy about those limitations, they can put in the
> work for option a :-)

OK. So all this is really only about really funny bits.

> > So uhm, what happens if your target task is running? When will you
> > retry? The problem I see is that if you do a sample approach you
> > might never hit an opportune moment.
>
> We attack it from multiple angles.  First we check the stack of all
> sleeping tasks.  That patches the majority of tasks immediately.  If
> necessary, we also do that periodically in a workqueue to catch any
> stragglers.

So not only do you need an 'atomic' stack save, you need to analyze and
flip its state in the same atomic region. The task cannot start running
again after the save and start using old functions while you analyze
the stack and flip it.

> The next line of attack is patching tasks when exiting the kernel to
> user space (system calls, interrupts, signals), to catch all CPU-bound
> and some I/O-bound tasks.  That's done in patch 9 [1] of the
> consistency model patch set.

So the HPC people are really into userspace that does for (;;) ; and
isolate that on CPUs and have the tick interrupt stopped and all that.
You'll not catch those threads on the sysexit path.

And I'm fairly sure they'll not want to SIGSTOP/CONT their stuff
either.

Now it's fairly easy to also handle this; just mark those tasks with a
_TIF_WORK_SYSCALL_ENTRY flag, have that slowpath wait for the flag to
go away, then flip their state and clear the flag.

> As a last resort, if there are still any tasks which are sleeping on a
> to-be-patched function, the user can send them SIGSTOP and SIGCONT to
> force them to be patched.

You typically cannot SIGSTOP/SIGCONT kernel threads. Also
TASK_UNINTERRUPTIBLE sleeps are unaffected by signals.

Bit pesky that.. needs pondering.
Re: [PATCH 1/3] sched: add sched_task_call()
On Tue, Feb 17, 2015 at 07:15:41PM +0100, Peter Zijlstra wrote:
> On Tue, Feb 17, 2015 at 08:12:11AM -0600, Josh Poimboeuf wrote:
> > On Tue, Feb 17, 2015 at 10:24:50AM +0100, Peter Zijlstra wrote:
> > > So far stack unwinding has basically been a best effort debug
> > > output kind of thing, you're wanting to make the integrity of the
> > > kernel depend on it.
> > >
> > > You require an absolute 100% correctness of the stack unwinder --
> > > where today it is; as stated above; a best effort debug output
> > > thing.
> > >
> > > That is a _big_ change.
> >
> > True, this does seem to be the first attempt to rely on the
> > correctness of the stack walking code.
> >
> > > Has this been properly considered; has all the asm of the relevant
> > > architectures been audited? Are you planning on maintaining that
> > > level of audit for all new patches?
> >
> > I agree, the unwinder needs to be 100% correct.  Currently we only
> > support x86_64.  Luckily, in general, stacks are very well defined
> > and walking the stack of a sleeping task should be straightforward.
> > I don't think it would be too hard to ensure the stack unwinder is
> > right for other architectures as we port them.
>
> I would not be too sure about that, the x86 framepointer thing is not
> universal. IIRC some have to have some form of in-kernel dwarf
> unwinding.
>
> And I'm assuming you're hard relying on CONFIG_FRAMEPOINTER here,
> because otherwise x86 stacks are a mess too.

Yeah, it'll rely on CONFIG_FRAME_POINTER.  IIUC, the arches we care
about now (x86, power, s390, arm64) all have frame pointer support, and
the stack formats are all sane, AFAICT.

If we ever port livepatch to a more obscure arch for which walking the
stack is more difficult, we'll have several options:

a) spend the time to ensure the unwinding code is correct and resilient
   to errors;

b) leave the consistency model compiled code out if !FRAME_POINTER and
   allow users to patch without one (similar to the livepatch code
   that's already in the livepatch tree today); or

c) not support that arch.

> So even with CONFIG_FRAMEPOINTER, things like copy_to/from_user, which
> are implemented in asm, don't honour that. So if one of those faults
> and the fault handler sleeps, you'll miss say your
> 'copy_user_enhanced_fast_string' entry.
>
> Then again, those asm functions don't have function trace bits either,
> so you can't patch them to begin with I suppose.
>
> Here's to hoping you don't have to..

Right.  We can't patch asm functions because we rely on ftrace, which
needs the -mfentry prologue.

> > > Because the way you propose to do things, we'll end up with silent
> > > but deadly fail if the unwinder is less than 100% correct. No way
> > > to easily debug that, no warns, just silent corruption.
> >
> > That's a good point.  There's definitely room for additional error
> > checking in the x86 stack walking code.  A couple of ideas:
> >
> > - make sure it starts from a __schedule() call at the top
> > - make sure we actually reach the bottom of the stack
> > - make sure each stack frame's return address is immediately after a
> >   call instruction
> >
> > If we hit any of those errors, we can bail out, unregister with
> > ftrace and restore the system to its original state.
>
> And then hope you can get a better trace next time around? Or will you
> fall-back to an alternative method of patching?

Yeah, on second thought, we wouldn't have to cancel the patch.  We
could defer to check the task's stack again at a later time.  If it's
stuck there, the user can try sending it a signal to unstick it, or
cancel the patching process.  Those mechanisms are already in place
with my consistency model patch set.

I'd also do a WARN_ON_ONCE and a dump of the full stack data, since I'm
guessing it would either indicate an error in the unwinding code or
point us to an unexpected stack condition.

> > > Are you really really sure you want to go do this?
> >
> > Basically, yes.  We've had a lot of conversations about many
> > different variations of how to do this over the past year, and this
> > is by far the best approach we've come up with.
>
> For some reason I'm thinking jikos is going to disagree with you on
> that :-)
>
> I'm further thinking we don't actually need 2 (or more) different
> means of live patching in the kernel. So you all had better sit down
> (again) and come up with something you all agree on.

Yeah, I also _really_ want to avoid multiple consistency models.  In
fact, that's a big motivation behind my consistency model patch set.
It's heavily inspired by a suggestion from Vojtech [1].  It combines
kpatch (backtrace checking) with kGraft (per-thread consistency).  It
has several advantages over either of the individual approaches.  See
http://lwn.net/Articles/632582/ where I describe its pros over both
kpatch and kGraft.

Jiri, would you and Vojtech agree that the proposed consistency model
is all we need?  Or do you still want to do the multiple consistency
model thing?
Re: [PATCH 1/3] sched: add sched_task_call()
On Tue, Feb 17, 2015 at 08:12:11AM -0600, Josh Poimboeuf wrote:
> On Tue, Feb 17, 2015 at 10:24:50AM +0100, Peter Zijlstra wrote:
> > So far stack unwinding has basically been a best effort debug output
> > kind of thing, you're wanting to make the integrity of the kernel
> > depend on it.
> >
> > You require an absolute 100% correctness of the stack unwinder --
> > where today it is; as stated above; a best effort debug output
> > thing.
> >
> > That is a _big_ change.
>
> True, this does seem to be the first attempt to rely on the
> correctness of the stack walking code.
>
> > Has this been properly considered; has all the asm of the relevant
> > architectures been audited? Are you planning on maintaining that
> > level of audit for all new patches?
>
> I agree, the unwinder needs to be 100% correct.  Currently we only
> support x86_64.  Luckily, in general, stacks are very well defined and
> walking the stack of a sleeping task should be straightforward.  I
> don't think it would be too hard to ensure the stack unwinder is right
> for other architectures as we port them.

I would not be too sure about that, the x86 framepointer thing is not
universal. IIRC some have to have some form of in-kernel dwarf
unwinding.

And I'm assuming you're hard relying on CONFIG_FRAMEPOINTER here,
because otherwise x86 stacks are a mess too.

So even with CONFIG_FRAMEPOINTER, things like copy_to/from_user, which
are implemented in asm, don't honour that. So if one of those faults
and the fault handler sleeps, you'll miss say your
'copy_user_enhanced_fast_string' entry.

Then again, those asm functions don't have function trace bits either,
so you can't patch them to begin with I suppose.

Here's to hoping you don't have to..

> > Because the way you propose to do things, we'll end up with silent
> > but deadly fail if the unwinder is less than 100% correct. No way to
> > easily debug that, no warns, just silent corruption.
>
> That's a good point.  There's definitely room for additional error
> checking in the x86 stack walking code.  A couple of ideas:
>
> - make sure it starts from a __schedule() call at the top
> - make sure we actually reach the bottom of the stack
> - make sure each stack frame's return address is immediately after a
>   call instruction
>
> If we hit any of those errors, we can bail out, unregister with ftrace
> and restore the system to its original state.

And then hope you can get a better trace next time around? Or will you
fall-back to an alternative method of patching?

> > Are you really really sure you want to go do this?
>
> Basically, yes.  We've had a lot of conversations about many different
> variations of how to do this over the past year, and this is by far
> the best approach we've come up with.

For some reason I'm thinking jikos is going to disagree with you on
that :-)

I'm further thinking we don't actually need 2 (or more) different means
of live patching in the kernel. So you all had better sit down (again)
and come up with something you all agree on.

> FWIW, we've been doing something similar with kpatch and
> stop_machine() over the last 1+ years, and have run into zero problems
> with that approach.

Could be you've just been lucky...
Re: [PATCH 1/3] sched: add sched_task_call()
On Tue, Feb 17, 2015 at 10:24:50AM +0100, Peter Zijlstra wrote:
> On Mon, Feb 16, 2015 at 04:05:05PM -0600, Josh Poimboeuf wrote:
> > Yeah, I can understand that.  I definitely want to avoid touching
> > the scheduler code.  Basically I'm trying to find a way to
> > atomically do the following:
> >
> > if (task is sleeping) {
> >         walk the stack
> >         if (certain set of functions isn't on the stack)
> >                 set (or clear) a thread flag for the task
> > }
> >
> > Any ideas on how I can achieve that?  So far my ideas are:
>
> So far stack unwinding has basically been a best effort debug output
> kind of thing, you're wanting to make the integrity of the kernel
> depend on it.
>
> You require an absolute 100% correctness of the stack unwinder --
> where today it is; as stated above; a best effort debug output thing.
>
> That is a _big_ change.

True, this does seem to be the first attempt to rely on the correctness
of the stack walking code.

> Has this been properly considered; has all the asm of the relevant
> architectures been audited? Are you planning on maintaining that level
> of audit for all new patches?

I agree, the unwinder needs to be 100% correct.  Currently we only
support x86_64.  Luckily, in general, stacks are very well defined and
walking the stack of a sleeping task should be straightforward.  I
don't think it would be too hard to ensure the stack unwinder is right
for other architectures as we port them.

> Because the way you propose to do things, we'll end up with silent but
> deadly fail if the unwinder is less than 100% correct. No way to
> easily debug that, no warns, just silent corruption.

That's a good point.  There's definitely room for additional error
checking in the x86 stack walking code.  A couple of ideas:

- make sure it starts from a __schedule() call at the top
- make sure we actually reach the bottom of the stack
- make sure each stack frame's return address is immediately after a
  call instruction

If we hit any of those errors, we can bail out, unregister with ftrace
and restore the system to its original state.

> Are you really really sure you want to go do this?

Basically, yes.  We've had a lot of conversations about many different
variations of how to do this over the past year, and this is by far the
best approach we've come up with.

FWIW, we've been doing something similar with kpatch and stop_machine()
over the last 1+ years, and have run into zero problems with that
approach.

--
Josh
Re: [PATCH 1/3] sched: add sched_task_call()
On Mon, Feb 16, 2015 at 04:05:05PM -0600, Josh Poimboeuf wrote:
> Yeah, I can understand that.  I definitely want to avoid touching the
> scheduler code.  Basically I'm trying to find a way to atomically do
> the following:
>
> if (task is sleeping) {
>         walk the stack
>         if (certain set of functions isn't on the stack)
>                 set (or clear) a thread flag for the task
> }
>
> Any ideas on how I can achieve that?  So far my ideas are:

So far stack unwinding has basically been a best effort debug output
kind of thing, you're wanting to make the integrity of the kernel
depend on it.

You require an absolute 100% correctness of the stack unwinder -- where
today it is; as stated above; a best effort debug output thing.

That is a _big_ change.

Has this been properly considered; has all the asm of the relevant
architectures been audited? Are you planning on maintaining that level
of audit for all new patches?

Because the way you propose to do things, we'll end up with silent but
deadly fail if the unwinder is less than 100% correct. No way to easily
debug that, no warns, just silent corruption.

Are you really really sure you want to go do this?
Re: [PATCH 1/3] sched: add sched_task_call()
On Tue, Feb 17, 2015 at 10:24:50AM +0100, Peter Zijlstra wrote: On Mon, Feb 16, 2015 at 04:05:05PM -0600, Josh Poimboeuf wrote: Yeah, I can understand that. I definitely want to avoid touching the scheduler code. Basically I'm trying to find a way to atomically do the following: if (task is sleeping) { walk the stack if (certain set of functions isn't on the stack) set (or clear) a thread flag for the task } Any ideas on how I can achieve that? So far my ideas are: So far stack unwinding has basically been a best effort debug output kind of thing, you're wanting to make the integrity of the kernel depend on it. You require an absolute 100% correctness of the stack unwinder -- where today it is; as stated above; a best effort debug output thing. That is a _big_ change. True, this does seem to be the first attempt to rely on the correctness of the stack walking code. Has this been properly considered; has all the asm of the relevant architectures been audited? Are you planning on maintaining that level of audit for all new patches? I agree, the unwinder needs to be 100% correct. Currently we only support x86_64. Luckily, in general, stacks are very well defined and walking the stack of a sleeping task should be straightforward. I don't think it would be too hard to ensure the stack unwinder is right for other architectures as we port them. Because the way you propose to do things, we'll end up with silent but deadly fail if the unwinder is less than 100% correct. No way to easily debug that, no warns, just silent corruption. That's a good point. There's definitely room for additional error checking in the x86 stack walking code. A couple of ideas: - make sure it starts from a __schedule() call at the top - make sure we actually reach the bottom of the stack - make sure each stack frame's return address is immediately after a call instruction If we hit any of those errors, we can bail out, unregister with ftrace and restore the system to its original state. 
Are you really really sure you want to go do this? Basically, yes. We've had a lot of conversations about many different variations of how to do this over the past year, and this is by far the best approach we've come up with. FWIW, we've been doing something similar with kpatch and stop_machine() over the last 1+ years, and have run into zero problems with that approach. -- Josh -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/3] sched: add sched_task_call()
On Tue, Feb 17, 2015 at 08:12:11AM -0600, Josh Poimboeuf wrote: On Tue, Feb 17, 2015 at 10:24:50AM +0100, Peter Zijlstra wrote: So far stack unwinding has basically been a best effort debug output kind of thing, you're wanting to make the integrity of the kernel depend on it. You require an absolute 100% correctness of the stack unwinder -- where today it is; as stated above; a best effort debug output thing. That is a _big_ change. True, this does seem to be the first attempt to rely on the correctness of the stack walking code. Has this been properly considered; has all the asm of the relevant architectures been audited? Are you planning on maintaining that level of audit for all new patches? I agree, the unwinder needs to be 100% correct. Currently we only support x86_64. Luckily, in general, stacks are very well defined and walking the stack of a sleeping task should be straightforward. I don't think it would be too hard to ensure the stack unwinder is right for other architectures as we port them. I would not be too sure about that, the x86 framepointer thing is not universal. IIRC some have to have some form of in-kernel dwarf unwinding. And I'm assuming you're hard relying on CONFIG_FRAMEPOINTER here, because otherwise x86 stacks are a mess too. So even with CONFIG_FRAMEPOINTER, things like copy_to/from_user, which are implemented in asm, don't honour that. So if one of those faults and the fault handler sleeps, you'll miss say your 'copy_user_enhanced_fast_string' entry. Then again, those asm functions don't have function trace bits either, so you can't patch them to begin with I suppose. Here's to hoping you don't have to.. Because the way you propose to do things, we'll end up with silent but deadly fail if the unwinder is less than 100% correct. No way to easily debug that, no warns, just silent corruption. That's a good point. There's definitely room for additional error checking in the x86 stack walking code. 
A couple of ideas: - make sure it starts from a __schedule() call at the top - make sure we actually reach the bottom of the stack - make sure each stack frame's return address is immediately after a call instruction If we hit any of those errors, we can bail out, unregister with ftrace and restore the system to its original state. And then hope you can get a better trace next time around? Or will you fall-back to an alternative method of patching? Are you really really sure you want to go do this? Basically, yes. We've had a lot of conversations about many different variations of how to do this over the past year, and this is by far the best approach we've come up with. For some reason I'm thinking jikos is going to disagree with you on that :-) I'm further thinking we don't actually need 2 (or more) different means of live patching in the kernel. So you all had better sit down (again) and come up with something you all agree on. FWIW, we've been doing something similar with kpatch and stop_machine() over the last 1+ years, and have run into zero problems with that approach. Could be you've just been lucky... -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/3] sched: add sched_task_call()
On Tue, Feb 17, 2015 at 07:15:41PM +0100, Peter Zijlstra wrote: On Tue, Feb 17, 2015 at 08:12:11AM -0600, Josh Poimboeuf wrote: On Tue, Feb 17, 2015 at 10:24:50AM +0100, Peter Zijlstra wrote: So far stack unwinding has basically been a best effort debug output kind of thing, you're wanting to make the integrity of the kernel depend on it. You require an absolute 100% correctness of the stack unwinder -- where today it is; as stated above; a best effort debug output thing. That is a _big_ change. True, this does seem to be the first attempt to rely on the correctness of the stack walking code. Has this been properly considered; has all the asm of the relevant architectures been audited? Are you planning on maintaining that level of audit for all new patches? I agree, the unwinder needs to be 100% correct. Currently we only support x86_64. Luckily, in general, stacks are very well defined and walking the stack of a sleeping task should be straightforward. I don't think it would be too hard to ensure the stack unwinder is right for other architectures as we port them. I would not be too sure about that, the x86 framepointer thing is not universal. IIRC some have to have some form of in-kernel dwarf unwinding. And I'm assuming you're hard relying on CONFIG_FRAMEPOINTER here, because otherwise x86 stacks are a mess too. Yeah, it'll rely on CONFIG_FRAME_POINTER. IIUC, the arches we care about now (x86, power, s390, arm64) all have frame pointer support, and the stack formats are all sane, AFAICT. If we ever port livepatch to a more obscure arch for which walking the stack is more difficult, we'll have several options: a) spend the time to ensure the unwinding code is correct and resilient to errors; b) leave the consistency model compiled code out if !FRAME_POINTER and allow users to patch without one (similar to the livepatch code that's already in the livepatch tree today); or c) not support that arch. 
So even with CONFIG_FRAMEPOINTER, things like copy_to/from_user, which are implemented in asm, don't honour that. So if one of those faults and the fault handler sleeps, you'll miss say your 'copy_user_enhanced_fast_string' entry. Then again, those asm functions don't have function trace bits either, so you can't patch them to begin with I suppose. Here's to hoping you don't have to.. Right. We can't patch asm functions because we rely on ftrace, which needs the -mfentry prologue. Because the way you propose to do things, we'll end up with silent but deadly fail if the unwinder is less than 100% correct. No way to easily debug that, no warns, just silent corruption. That's a good point. There's definitely room for additional error checking in the x86 stack walking code. A couple of ideas: - make sure it starts from a __schedule() call at the top - make sure we actually reach the bottom of the stack - make sure each stack frame's return address is immediately after a call instruction If we hit any of those errors, we can bail out, unregister with ftrace and restore the system to its original state. And then hope you can get a better trace next time around? Or will you fall-back to an alternative method of patching? Yeah, on second thought, we wouldn't have to cancel the patch. We could defer to check the task's stack again at a later time. If it's stuck there, the user can try sending it a signal to unstick it, or cancel the patching process. Those mechanisms are already in place with my consistency model patch set. I'd also do a WARN_ON_ONCE and a dump of the full stack data, since I'm guessing it would either indicate an error in the unwinding code or point us to an unexpected stack condition. Are you really really sure you want to go do this? Basically, yes. We've had a lot of conversations about many different variations of how to do this over the past year, and this is by far the best approach we've come up with. 
For some reason I'm thinking jikos is going to disagree with you on that :-) I'm further thinking we don't actually need 2 (or more) different means of live patching in the kernel. So you all had better sit down (again) and come up with something you all agree on. Yeah, I also _really_ want to avoid multiple consistency models. In fact, that's a big motivation behind my consistency model patch set. It's heavily inspired by a suggestion from Vojtech [1]. It combines kpatch (backtrace checking) with kGraft (per-thread consistency). It has several advantages over either of the individual approaches. See http://lwn.net/Articles/632582/ where I describe its pros over both kpatch and kGraft. Jiri, would you and Vojtech agree that the proposed consistency model is all we need? Or do you still want to do the multiple consistency model thing? FWIW, we've been doing something similar with kpatch and stop_machine() over the
Re: [PATCH 1/3] sched: add sched_task_call()
On Mon, Feb 16, 2015 at 04:05:05PM -0600, Josh Poimboeuf wrote:
> Yeah, I can understand that. I definitely want to avoid touching the
> scheduler code. Basically I'm trying to find a way to atomically do the
> following:
>
>   if (task is sleeping) {
>       walk the stack
>       if (certain set of functions isn't on the stack)
>           set (or clear) a thread flag for the task
>   }
>
> Any ideas on how I can achieve that? So far my ideas are:

So far stack unwinding has basically been a best-effort debug output kind of thing; you're wanting to make the integrity of the kernel depend on it.

You require absolute 100% correctness of the stack unwinder -- where today it is, as stated above, a best-effort debug output thing.

That is a _big_ change. Has this been properly considered; has all the asm of the relevant architectures been audited? Are you planning on maintaining that level of audit for all new patches?

Because the way you propose to do things, we'll end up with silent but deadly fail if the unwinder is less than 100% correct. No way to easily debug that, no warns, just silent corruption.

Are you really really sure you want to go do this?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/3] sched: add sched_task_call()
On Mon, Feb 16, 2015 at 09:44:36PM +0100, Peter Zijlstra wrote:
> On Mon, Feb 16, 2015 at 12:52:34PM -0600, Josh Poimboeuf wrote:
> > +++ b/kernel/sched/core.c
> > @@ -1338,6 +1338,23 @@ void kick_process(struct task_struct *p)
> >  EXPORT_SYMBOL_GPL(kick_process);
> >  #endif /* CONFIG_SMP */
> >
> > +/***
> > + * sched_task_call - call a function with a task's state locked
> > + *
> > + * The task is guaranteed to remain either active or inactive during the
> > + * function call.
> > + */
> > +void sched_task_call(sched_task_call_func_t func, struct task_struct *p,
> > +		     void *data)
> > +{
> > +	unsigned long flags;
> > +	struct rq *rq;
> > +
> > +	rq = task_rq_lock(p, &flags);
> > +	func(p, data);
> > +	task_rq_unlock(rq, p, &flags);
> > +}
>
> Yeah, I think not. We're so not going to allow running random code under
> rq->lock and p->pi_lock.

Yeah, I can understand that. I definitely want to avoid touching the scheduler code. Basically I'm trying to find a way to atomically do the following:

  if (task is sleeping) {
      walk the stack
      if (certain set of functions isn't on the stack)
          set (or clear) a thread flag for the task
  }

Any ideas on how I can achieve that? So far my ideas are:

1. Use task_rq_lock() -- but rq_lock is internal to sched code.

2. Use wait_task_inactive(). I could call it twice, with the stack checking in between, and use ncsw to ensure that it didn't reschedule in the meantime. But this still seems racy: I think the task could start running after the second call to wait_task_inactive() returns but before setting the thread flag. Not sure if that's a realistic race condition or not.

3. Use set_cpus_allowed() to temporarily pin the task to its current CPU, and then call smp_call_function_single() to run the above critical section on that CPU. I'm not sure if there's a race-free way to do it, but it's a lot more disruptive than I'd like...

Any ideas or guidance would be greatly appreciated!
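Idea 2 above is essentially an optimistic-concurrency check: sample the task's context-switch count (ncsw), walk the stack, sample again, and only trust the walk if the count never moved. A minimal userspace analogue of that pattern, with `task_ncsw` standing in for the task's nvcsw/nivcsw counters (all names hypothetical):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for the task's context-switch count; in the kernel this
 * would come via wait_task_inactive()'s ncsw handling. */
static atomic_long task_ncsw;

static long ncsw_sample(void)
{
    return atomic_load(&task_ncsw);
}

/* Called after the stack walk: the result only counts if the task did
 * not run (i.e. ncsw did not change) while we were looking at it. */
static bool stack_check_valid(long ncsw_before, bool stack_ok)
{
    return ncsw_sample() == ncsw_before && stack_ok;
}
```

Note this demonstrates only the double-sample part; as the email says, the remaining race (the task resuming after the second sample but before the thread flag is set) is exactly what this pattern cannot close on its own.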
--
Josh
Re: [PATCH 1/3] sched: add sched_task_call()
On Mon, Feb 16, 2015 at 12:52:34PM -0600, Josh Poimboeuf wrote:
> +++ b/kernel/sched/core.c
> @@ -1338,6 +1338,23 @@ void kick_process(struct task_struct *p)
>  EXPORT_SYMBOL_GPL(kick_process);
>  #endif /* CONFIG_SMP */
>
> +/***
> + * sched_task_call - call a function with a task's state locked
> + *
> + * The task is guaranteed to remain either active or inactive during the
> + * function call.
> + */
> +void sched_task_call(sched_task_call_func_t func, struct task_struct *p,
> +		     void *data)
> +{
> +	unsigned long flags;
> +	struct rq *rq;
> +
> +	rq = task_rq_lock(p, &flags);
> +	func(p, data);
> +	task_rq_unlock(rq, p, &flags);
> +}

Yeah, I think not. We're so not going to allow running random code under rq->lock and p->pi_lock.