Re: live kernel upgrades (was: live kernel patching design)
* Jiri Slaby wrote:

> On 02/24/2015, 10:16 AM, Ingo Molnar wrote:
>
> > and we don't design the Linux kernel for weird, extreme cases, we
> > design for the common, sane case that has the broadest appeal, and
> > we hope that the feature garners enough interest to be
> > maintainable.
>
> Hello,
>
> oh, so why do we have NR_CPUS up to 8192, then? [...]

Because:

 - More CPUs is not some weird dead end, but a natural direction of
   hardware development.

 - Furthermore, we've gained a lot of scalability and other
   improvements all around the kernel just by virtue of big iron
   running into those problems first.

 - In the typical case there's no friction between 8192 CPUs and the
   kernel's design. Where there was friction (and it happened), we
   pushed back.

Such benefits add up, and 8K CPUs support is a success story today.

That positive, symbiotic, multi-discipline relationship between 8K
CPUs support design goals and 'regular Linux' design goals stands in
stark contrast with the single-issue approach with which live kernel
patching is designing itself into a dead end so early on ...

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: live kernel upgrades (was: live kernel patching design)
On Tue, Feb 24, 2015 at 11:23:29AM +0100, Ingo Molnar wrote:

> > Your upgrade proposal is an *enormous* disruption to the
> > system:
> >
> > - a latency of "well below 10" seconds is completely
> >   unacceptable to most users who want to patch the kernel
> >   of a production system _while_ it's in production.
>
> I think this statement is false for the following reasons.

The statement is very true.

> - I'd say the majority of system operators of production
>   systems can live with a couple of seconds of delay at a
>   well defined moment of the day or week - with gradual,
>   pretty much open ended improvements in that latency
>   down the line.

In the most usual corporate setting any noticeable outage, even out
of business hours, requires advance notice and the agreement of all
stakeholders - the teams that depend on the system.

If a live patching technology introduces an outage, it isn't "live",
and because of these bureaucratic reasons it will not be used; a
regular reboot will be scheduled instead.

> - I think your argument ignores the fact that live
>   upgrades would extend the scope of 'users willing to
>   patch the kernel of a production system' _enormously_.
>
>   For example, I have a production system with this much
>   uptime:
>
>     10:50:09 up 153 days, 3:58, 34 users, load average: 0.00, 0.02, 0.05
>
>   While currently I'm reluctant to reboot the system to
>   upgrade the kernel (due to a reboot's intrusiveness),
>   and that is why it has achieved a relatively high
>   uptime, I'd definitely allow the kernel to upgrade
>   at 0:00am just fine. (I'd even give it up to a few
>   minutes, as long as TCP connections don't time out.)
>
>   And I don't think my usecase is special.

I agree that this is useful. But it is a different problem that only
partially overlaps with what we're trying to achieve with live
patching.
If you can make full kernel upgrades work this way, which I doubt is
achievable in the next 10 years due to all the research and
infrastructure needed, then you certainly gain an additional group of
users. And a great tool. A large portion of those that ask for live
patching won't use it, though.

But honestly, I prefer a solution that works for small patches now
over a solution for unlimited patches sometime in the next decade.

> What gradual improvements in live upgrade latency am I
> talking about?
>
> - For example the majority of pure user-space process
>   pages in RAM could be saved from the old kernel over
>   into the new kernel - i.e. they'd stay in place in RAM,
>   but they'd be re-hashed for the new data structures.
>   This avoids a big chunk of checkpointing overhead.

I'd have hoped this would be a given. If you can't preserve memory
contents and have to re-load from disk, you might just as well reboot
entirely; the time needed will not be much more.

> - Likewise, most of the page cache could be saved from an
>   old kernel to a new kernel as well - further reducing
>   checkpointing overhead.
>
> - The PROT_NONE mechanism of the current NUMA balancing
>   code could be used to transparently mark user-space
>   pages as 'checkpointed'. This would reduce system
>   interruption as only 'newly modified' pages would have
>   to be checkpointed when the upgrade happens.
>
> - Hardware devices could be marked as 'already in well
>   defined state', skipping the more expensive steps of
>   driver initialization.
>
> - Possibly full user-space page tables could be preserved
>   over an upgrade: this way user-space execution would be
>   unaffected even at the micro level: cache layout, TLB
>   patterns, etc.
>
> There's lots of gradual speedups possible with such a model
> IMO.

Yes, as I say above, guaranteeing decades of employment.
;)

> With live kernel patching we run into a brick wall of
> complexity straight away: we have to analyze the nature of
> the kernel modification, in the context of live patching,
> and that only works for the simplest of kernel
> modifications.

But you're able to _use_ it.

> With live kernel upgrades no such brick wall exists, just
> about any transition between kernel versions is possible.

The brick wall you run into is "I need to implement full kernel state
serialization before I can do anything at all."

That's something where it isn't even clear _how_ to do it.
Particularly with the Linux kernel's development model, where internal
ABIs and structures are always in flux, it may not even be realistic.

> Granted, with live kernel upgrades it's much more complex
> to get the 'simple' case into an even rudimentarily working
> fashion (full userspace state has to be enumerated, saved
> and restored), but once we are there, it's a whole new
> category of goodness and it probably covers 90%+ of the
> live kernel patching usecases on day 1 already ...

Feel free to start working on it. I'll stick with live patching.

--
	Vojtech
Re: live kernel upgrades (was: live kernel patching design)
On 02/24/2015, 10:16 AM, Ingo Molnar wrote:

> and we don't design the Linux kernel for weird, extreme
> cases, we design for the common, sane case that has the
> broadest appeal, and we hope that the feature garners
> enough interest to be maintainable.

Hello,

oh, so why do we have NR_CPUS up to 8192, then? I haven't met a
machine with more than 16 cores yet. You did. But you haven't met a
guy thankful for live patching being so easy to implement, yet fast.
I did. What some call extreme, others accept as standard. That is, I
believe, why you signed off on support for up to 8192 CPUs.

We develop Linux to be scalable, i.e. usable in *whatever* scenario
you can imagine, in any world. Be it large/small machines,
lowmem/highmem, numa/uma, whatever. If you don't like something, you
are free to disable it. Democracy.

> This is not a problem in general: the weird case can take
> care of itself just fine - 'specialized and weird' usually
> means there's enough money to throw at special hardware and
> human solutions or it goes extinct quickly ...

Live patching is not a random idea which is about to die. It is
months of negotiations with customers, management and developers,
establishing teams and really thinking the idea through. The
decisions were discussed at many conferences too. I am trying to shed
some light on why we are not trying to improve criu or any other
already existing project. We studied papers, code and
implementations, kSplice and such, and decided to lean towards what
we have implemented, presented and merged.

thanks,
--
js
suse labs
Re: live kernel upgrades (was: live kernel patching design)
On Tue, Feb 24, 2015 at 11:53:28AM +0100, Ingo Molnar wrote:

> * Jiri Kosina wrote:
>
> > [...] We could optimize the kernel the craziest way we
> > can, but hardware takes its time to reinitialize. And in
> > most cases, you'd really need to reinitialize it; [...]
>
> If we want to reinitialize a device, most of the longer
> initialization latencies during bootup these days involve
> things like: 'poke hardware, see if there's any response'.
> Those are mostly going away quickly with modern,
> well-enumerated hardware interfaces.
>
> Just try a modprobe of a random hardware driver - most
> initialization sequences are very fast. (That's how people
> are able to do cold bootups in less than 1 second.)

Have you ever tried to boot a system with a large (> 100) number of
drives connected over FC? That takes time to discover, and you have
to do the discovery, as the configuration could have changed while
you were not looking.

Or a machine with terabytes of memory? Just initializing the memory
takes minutes.

Or a desktop with USB? You have to reinitialize the USB bus and the
state of all the USB devices, because an application might be
accessing files on a USB drive.

> In theory this could also be optimized: we could avoid the
> reinitialization step through an upgrade via relatively
> simple means, for example if drivers define their own
> version and the new kernel's driver checks whether the
> previous state is from a compatible driver. Then the new
> driver could do a shorter initialization sequence.

There you're clearly getting into "so complex to maintain that it'll
never work reliably" territory.

> But I'd only do it in special cases, where for some
> reason the initialization sequence takes a longer time and
> it makes sense to share hardware discovery information
> between two versions of the driver. I'm not convinced such
> a mechanism is necessary in the general case.
--
Vojtech Pavlik
Director SUSE Labs
Re: live kernel upgrades (was: live kernel patching design)
On Tue, Feb 24, 2015 at 10:44:05AM +0100, Ingo Molnar wrote:

> > This is the most common argument that's raised when live
> > patching is discussed. "Why do we need live patching when
> > we have redundancy?"
>
> My argument is that if we start off with a latency of 10
> seconds and improve that gradually, it will be good for
> everyone - with a clear, actionable route for even those
> who cannot take a 10 seconds delay today.

Sure, we can do it that way. Or do it in the other direction.

Today we have a tool (livepatch) in the kernel that can apply trivial
single-function fixes without a measurable disruption to
applications. And we can improve it gradually to expand the range of
fixes it can apply.

Dependent functions can be handled by kGraft's lazy migration.

Limited data structure changes can be handled by shadowing.

Major data structure and/or locking changes require stopping the
kernel, and trapping all tasks at the kernel/userspace boundary is
clearly the cleanest way to do that. It comes at a steep latency
cost, though.

Full code replacement without change scope consideration requires
full serialization and deserialization of hardware and userspace
interface state, which is something we don't have today and would
require work on every single driver. Possible, but probably a decade
of effort.

With this approach you have something useful at every point, and
every piece of effort put in gives you a reward.

> Let's see the use cases:
>
> > [...] Examples would be legacy applications which can't
> > run in an active-active cluster and need to be restarted
> > on failover.
>
> Most clusters (say web frontends) can take a stoppage of a
> couple of seconds.

It's easy to find examples of workloads that can be stopped. It
doesn't rule out a significant set of those where stopping them is
very expensive.

> > Another usecase is large HPC clusters, where all nodes
> > have to run carefully synchronized.
> > Once one gets behind
> > in a calculation cycle, others have to wait for the
> > results and the efficiency of the whole cluster goes
> > down. [...]
>
> I think calculation nodes on large HPC clusters qualify as
> the specialized case that I mentioned, where the update
> latency could be brought down into the 1 second range.
>
> But I don't think calculation nodes are patched in the
> typical case: you might want to patch Internet facing
> frontend systems, the rest is left as undisturbed as
> possible. So I'm not even sure this is a typical usecase.

They're not patched for security bugs, but stability bugs are an
important issue for multi-month calculations.

> In any case, there's no hard limit on how fast such a
> kernel upgrade can get in principle, and the folks who care
> about that latency will sure help out optimizing it and
> many HPC projects are well funded.

So far, unless you come up with an effective solution, if you're
catching all tasks at the kernel/userspace boundary (the "Kragle"
approach), the service interruption is effectively unbounded due to
tasks in D state.

> > The value of live patching is in near zero disruption.
>
> Latency is a good attribute of a kernel upgrade mechanism,
> but it's by far not the only attribute and we should
> definitely not design limitations into the approach and
> hurt all the other attributes, just to optimize that single
> attribute.

It's an attribute I'm not willing to give up. On the other hand, I
definitely wouldn't argue against having modes of operation where the
latency is higher and the tool is more powerful.

> I.e. don't make it a single-issue project.

There is no need to worry about that.

--
Vojtech Pavlik
Director SUSE Labs
Re: live kernel upgrades (was: live kernel patching design)
On 02/22/2015, 10:46 AM, Ingo Molnar wrote:

> Arbitrary live kernel upgrades could be achieved by
> starting with the 'simple method' I outlined in earlier
> mails, using some of the methods that kpatch and kGraft are
> both utilizing or planning to utilize:
>
> - implement user task and kthread parking to get the
>   kernel into quiescent state.
>
> - implement (optional, thus ABI-compatible)
>   system call interruptability and restartability
>   support.
>
> - implement task state and (limited) device state
>   snapshotting support
>
> - implement live kernel upgrades by:
>
>   - snapshotting all system state transparently
>
>   - fast-rebooting into the new kernel image without
>     shutting down and rebooting user-space, i.e. _much_
>     faster than a regular reboot.
>
>   - restoring system state transparently within the new
>     kernel image and resuming system workloads where
>     they were left.
>
> Even complex external state like TCP socket state and
> graphics state can be preserved over an upgrade. As far as
> the user is concerned, nothing happened but a brief pause -
> and he's now running a v3.21 kernel, not v3.20.
>
> Obviously one of the simplest utilizations of live kernel
> upgrades would be to apply simple security fixes to
> production systems. But that's just a very simple
> application of a much broader capability.
>
> Note that if done right, then the time to perform a live
> kernel upgrade on a typical system could be brought to well
> below 10 seconds system stoppage time: adequate to the vast
> majority of installations.
>
> For special installations or well optimized hardware the
> latency could possibly be brought below 1 second stoppage
> time.

Hello,

IMNSHO, you cannot. The criu-based approach you have just described
is already alive as an external project in Parallels. It is of course
a perfect solution for some use cases. But its use case is a
distinctive one. It is not our competitor, it is our complementer. I
will try to explain why.
It is highly dependent on HW. Kexec is not (nor would any other
arbitrary kernel-exchange mechanism be) supported by all hardware or
all drivers. There is not even a way to implement snapshotting for
some devices, which is a real issue, obviously.

Downtime is highly dependent on the scenario. If you have plenty of
dirty memory, you have to flush it first. This might take minutes,
especially when using a network FS. Or you need not, but a failure to
replace the kernel is then lethal. If you have a heap of open FDs,
restore time will take ages. You cannot fool any of those - it's pure
I/O. You cannot estimate the downtime, and that is a real downside.

Even if you can get the criu time under one second, this is still
unacceptable for live patching. Live patching should be three orders
of magnitude faster than that, otherwise it makes no sense. If you
can afford a second, you probably already have a large enough window
or failure handling to perform a full and mainly safer reboot/kexec
anyway.

You cannot restore everything:

 * TCP is one of the pure beasts in this. And there is indeed plenty
   of theoretical papers behind this, explaining what can or cannot
   be done.
 * NFS is another one.
 * Xorg. Today, we cannot even fluently switch between discrete and
   native GFX chips. No go.
 * There indeed are situations where NP-hard problems need to be
   solved upon restoration. No way, if you want to restore yet in
   this century.

While you cannot live-patch everything using KLP, that is
patch-dependent. Failure of restoration is condition-dependent, and
the condition is really fuzzy. That is a huge difference. Although
you present the criu-based approach as provably safe and correct, it
is not in many cases, and cannot be by definition.

That said, we are not going to start moving that way, except for the
many good points which emerged during the discussion (fake signals,
to pick one).
> This 'live kernel upgrades' approach would have various
> advantages:
>
> - it brings together various principles working towards
>   shared goals:
>
>   - the boot time reduction folks
>   - the checkpoint/restore folks
>   - the hibernation folks
>   - the suspend/resume and power management folks
>   - the live patching folks (you)
>   - the syscall latency reduction folks
>
>   if so many disciplines are working together then maybe
>   something really good and long term maintainable can
>   crystalize out of that effort.

I must admit, whenever I implemented something in the kernel, nobody
did any work for me. So the above will only result in the live
patching teams doing all the work. I am not saying we do not want to
do the work. I am only pointing out that there is nothing like "work
together with other teams" (unless we are paying their bills).

> - it ignores the security theater that treats security
>   fixes as a separate, disproportionally more important
Re: live kernel upgrades (was: live kernel patching design)
On Tue 2015-02-24 11:23:29, Ingo Molnar wrote:

> What gradual improvements in live upgrade latency am I
> talking about?
>
> - For example the majority of pure user-space process
>   pages in RAM could be saved from the old kernel over
>   into the new kernel - i.e. they'd stay in place in RAM,
>   but they'd be re-hashed for the new data structures.

I wonder how many structures we would need to rehash when we update
the whole kernel. I think that it is not only about memory but also
about every other subsystem: networking, the scheduler, ...

> - Hardware devices could be marked as 'already in well
>   defined state', skipping the more expensive steps of
>   driver initialization.

This is another point that might easily go wrong. We know that the
quality of many drivers is not good. Yes, we want to make it better.
But we also know that system suspend has not worked well on many
systems for years, even with huge effort.

> - Possibly full user-space page tables could be preserved
>   over an upgrade: this way user-space execution would be
>   unaffected even at the micro level: cache layout, TLB
>   patterns, etc.
>
> There's lots of gradual speedups possible with such a model
> IMO.
>
> With live kernel patching we run into a brick wall of
> complexity straight away: we have to analyze the nature of
> the kernel modification, in the context of live patching,
> and that only works for the simplest of kernel
> modifications.
>
> With live kernel upgrades no such brick wall exists, just
> about any transition between kernel versions is possible.

I see a big difference in complexity here. If verifying patches is
considered complex, then verifying that a whole kernel upgrade is
safe - and that all states will be properly preserved and reused - is
much more complicated still.

Otherwise, I don't think that live patching will ever be for any Joe
User. The people producing patches will need to investigate the
changes anyway.
They will not blindly take a patch from the internet and convert it
to a live patch. I think that this is true for many other kernel
features.

> Granted, with live kernel upgrades it's much more complex
> to get the 'simple' case into an even rudimentarily working
> fashion (full userspace state has to be enumerated, saved
> and restored), but once we are there, it's a whole new
> category of goodness and it probably covers 90%+ of the
> live kernel patching usecases on day 1 already ...

I like the idea, and I see the benefit for other tasks: system
suspend, migration of systems to other hardware, ... But I also
think that it is another level of functionality. IMHO, live patching
is somewhere on the way to the full kernel update and will help as
well. For example, we will need to somehow solve the transition of
kthreads and thus fix their parking.

I think that live patching deserves its own solution. I consider it
much less risky but still valuable. I am sure that it will have its
users. Also, it will not block improving things for the full update
in the future.

Best Regards,
Petr
Re: live kernel upgrades (was: live kernel patching design)
* Jiri Kosina wrote:

> [...] We could optimize the kernel the craziest way we
> can, but hardware takes its time to reinitialize. And in
> most cases, you'd really need to reinitialize it; [...]

If we want to reinitialize a device, most of the longer
initialization latencies during bootup these days involve things
like: 'poke hardware, see if there's any response'. Those are mostly
going away quickly with modern, well-enumerated hardware interfaces.

Just try a modprobe of a random hardware driver - most initialization
sequences are very fast. (That's how people are able to do cold
bootups in less than 1 second.)

In theory this could also be optimized: we could avoid the
reinitialization step through an upgrade via relatively simple means,
for example if drivers define their own version and the new kernel's
driver checks whether the previous state is from a compatible driver.
Then the new driver could do a shorter initialization sequence.

But I'd only do it in special cases, where for some reason the
initialization sequence takes a longer time and it makes sense to
share hardware discovery information between two versions of the
driver. I'm not convinced such a mechanism is necessary in the
general case.

Thanks,

	Ingo
Re: live kernel upgrades (was: live kernel patching design)
* Pavel Machek wrote:

> > More importantly, both kGraft and kpatch are pretty limited
> > in what kinds of updates they allow, and neither kGraft nor
> > kpatch has any clear path towards applying more complex
> > fixes to kernel images that I can see: kGraft can only
> > apply the simplest of fixes where both versions of a
> > function are interchangeable, and kpatch is only marginally
> > better at that - and that's pretty fundamental to both
> > projects!
> >
> > I think all of these problems could be resolved by shooting
> > for the moon instead:
> >
> >   - work towards allowing arbitrary live kernel upgrades!
> >
> > not just 'live kernel patches'.
>
> Note that live kernel upgrade would have interesting
> implications outside the kernel:
>
> 1) glibc does "what kernel version is this?", caches the
>    result and alters behaviour accordingly.

That should be OK, as a new kernel will be ABI compatible with an old
kernel. A later optimization could update the glibc cache on an
upgrade - fortunately both projects are open source.

> 2) apps will do recently_introduced_syscall(), get an error
>    and not attempt it again.

That should be fine too.

Thanks,

	Ingo
Re: live kernel upgrades (was: live kernel patching design)
* Josh Poimboeuf wrote:

> Your upgrade proposal is an *enormous* disruption to the
> system:
>
> - a latency of "well below 10" seconds is completely
>   unacceptable to most users who want to patch the kernel
>   of a production system _while_ it's in production.

I think this statement is false, for the following reasons.

 - I'd say the majority of system operators of production systems can
   live with a couple of seconds of delay at a well defined moment of
   the day or week - with gradual, pretty much open ended
   improvements in that latency down the line.

 - I think your argument ignores the fact that live upgrades would
   extend the scope of 'users willing to patch the kernel of a
   production system' _enormously_.

   For example, I have a production system with this much uptime:

     10:50:09 up 153 days, 3:58, 34 users, load average: 0.00, 0.02, 0.05

   While currently I'm reluctant to reboot the system to upgrade the
   kernel (due to a reboot's intrusiveness), and that is why it has
   achieved a relatively high uptime, I'd definitely allow the kernel
   to upgrade at 0:00am just fine. (I'd even give it up to a few
   minutes, as long as TCP connections don't time out.)

   And I don't think my usecase is special.

What gradual improvements in live upgrade latency am I talking about?

 - For example the majority of pure user-space process pages in RAM
   could be saved from the old kernel over into the new kernel -
   i.e. they'd stay in place in RAM, but they'd be re-hashed for the
   new data structures. This avoids a big chunk of checkpointing
   overhead.

 - Likewise, most of the page cache could be saved from an old kernel
   to a new kernel as well - further reducing checkpointing overhead.

 - The PROT_NONE mechanism of the current NUMA balancing code could
   be used to transparently mark user-space pages as 'checkpointed'.
   This would reduce system interruption, as only 'newly modified'
   pages would have to be checkpointed when the upgrade happens.
 - Hardware devices could be marked as 'already in well defined
   state', skipping the more expensive steps of driver
   initialization.

 - Possibly full user-space page tables could be preserved over an
   upgrade: this way user-space execution would be unaffected even at
   the micro level: cache layout, TLB patterns, etc.

There's lots of gradual speedups possible with such a model IMO.

With live kernel patching we run into a brick wall of complexity
straight away: we have to analyze the nature of the kernel
modification, in the context of live patching, and that only works
for the simplest of kernel modifications.

With live kernel upgrades no such brick wall exists; just about any
transition between kernel versions is possible.

Granted, with live kernel upgrades it's much more complex to get the
'simple' case into an even rudimentarily working fashion (full
userspace state has to be enumerated, saved and restored), but once
we are there, it's a whole new category of goodness and it probably
covers 90%+ of the live kernel patching usecases on day 1 already ...

Thanks,

	Ingo
Re: live kernel upgrades (was: live kernel patching design)
* Vojtech Pavlik wrote:

> On Sun, Feb 22, 2015 at 03:01:48PM -0800, Andrew Morton wrote:
>
> > On Sun, 22 Feb 2015 20:13:28 +0100 (CET) Jiri Kosina wrote:
> >
> > > But if you ask the folks who are hungry for live bug
> > > patching, they wouldn't care.
> > >
> > > You mentioned "10 seconds", that's more or less equal
> > > to infinity to them.
> >
> > "A 10 seconds outage is unacceptable, but we're running
> > our service on a single machine with no failover." Who
> > is doing this??
>
> This is the most common argument that's raised when live
> patching is discussed. "Why do we need live patching when
> we have redundancy?"

My argument is that if we start off with a latency of 10 seconds and
improve that gradually, it will be good for everyone - with a clear,
actionable route for even those who cannot take a 10 seconds delay
today.

Let's see the use cases:

> [...] Examples would be legacy applications which can't
> run in an active-active cluster and need to be restarted
> on failover.

Most clusters (say web frontends) can take a stoppage of a couple of
seconds.

> [...] Or trading systems, where the calculations must be
> strictly serialized and response times are counted in
> tens of microseconds.

All trading systems I'm aware of have daily maintenance time periods
that can afford at minimum a couple of seconds of optional
maintenance latency: stock trading systems can be maintained when
there's no trading session (which is many hours), while aftermarket
or global trading systems can be maintained in a predetermined low
activity period when the daily rollover interest is calculated.

> Another usecase is large HPC clusters, where all nodes
> have to run carefully synchronized. Once one gets behind
> in a calculation cycle, others have to wait for the
> results and the efficiency of the whole cluster goes
> down. [...]
I think calculation nodes on large HPC clusters qualify as the specialized case that I mentioned, where the update latency could be brought down into the 1 second range.

But I don't think calculation nodes are patched in the typical case: you might want to patch Internet-facing frontend systems, while the rest is left as undisturbed as possible. So I'm not even sure this is a typical usecase.

In any case, there's no hard limit on how fast such a kernel upgrade can get in principle; the folks who care about that latency will surely help out optimizing it, and many HPC projects are well funded.

> The value of live patching is in near zero disruption.

Latency is a good attribute of a kernel upgrade mechanism, but it's by far not the only attribute, and we should definitely not design limitations into the approach and hurt all the other attributes just to optimize that single attribute.

I.e. don't make it a single-issue project.

Thanks,

	Ingo
Re: live kernel upgrades (was: live kernel patching design)
* Arjan van de Ven wrote:

> I think 10 seconds is Ingo being a bit exaggerating,
> since you can boot a full system in a lot less time than
> that, and more so if you know more about the system (e.g.
> don't need to spin down and then discover and spin up
> disks). If you're talking about inside a VM it's even
> more extreme than that.

Correct, I mentioned 10 seconds latency to be on the safe side - but in general I suspect it can be reduced to below 1 second, which should be enough for everyone but the most specialized cases: even specialized HA servers will update their systems in low-activity maintenance windows.

And we don't design the Linux kernel for weird, extreme cases, we design for the common, sane case that has the broadest appeal, and we hope that the feature garners enough interest to be maintainable.

This is not a problem in general: the weird case can take care of itself just fine - 'specialized and weird' usually means there's enough money to throw at special hardware and human solutions, or it goes extinct quickly ...

Thanks,

	Ingo
Re: live kernel upgrades (was: live kernel patching design)
* Pavel Machek wrote:

> > More importantly, both kGraft and kpatch are pretty
> > limited in what kinds of updates they allow, and neither
> > kGraft nor kpatch has any clear path towards applying
> > more complex fixes to kernel images that I can see:
> > kGraft can only apply the simplest of fixes where both
> > versions of a function are interchangeable, and kpatch
> > is only marginally better at that - and that's pretty
> > fundamental to both projects!
> >
> > I think all of these problems could be resolved by
> > shooting for the moon instead:
> >
> >  - work towards allowing arbitrary live kernel upgrades!
> >
> >    not just 'live kernel patches'.
>
> Note that live kernel upgrade would have interesting
> implications outside the kernel:
>
> 1) glibc does "what kernel version is this?", caches the
>    result and alters behaviour accordingly.

That should be OK, as a new kernel will be ABI-compatible with an old kernel. A later optimization could update the glibc cache on an upgrade - fortunately both projects are open source.

> 2) apps will do recently_introduced_syscall(), get an
>    error and not attempt it again.

That should be fine too.

Thanks,

	Ingo
Re: live kernel upgrades (was: live kernel patching design)
* Josh Poimboeuf wrote:

> Your upgrade proposal is an *enormous* disruption to the
> system:
>
>  - a latency of well below 10 seconds is completely
>    unacceptable to most users who want to patch the kernel
>    of a production system _while_ it's in production.

I think this statement is false, for the following reasons.

 - I'd say the majority of system operators of production systems can live with a couple of seconds of delay at a well defined moment of the day or week - with gradual, pretty much open ended improvements in that latency down the line.

 - I think your argument ignores the fact that live upgrades would extend the scope of 'users willing to patch the kernel of a production system' _enormously_.

For example, I have a production system with this much uptime:

   10:50:09 up 153 days,  3:58, 34 users,  load average: 0.00, 0.02, 0.05

Currently I'm reluctant to reboot the system to upgrade the kernel (due to a reboot's intrusiveness), which is why it has achieved a relatively high uptime - but I'd definitely allow the kernel to upgrade itself at 0:00am just fine. (I'd even give it up to a few minutes, as long as TCP connections don't time out.) And I don't think my usecase is special.

What gradual improvements in live upgrade latency am I talking about?

 - For example the majority of pure user-space process pages in RAM could be saved from the old kernel over into the new kernel - i.e. they'd stay in place in RAM, but they'd be re-hashed for the new data structures. This avoids a big chunk of checkpointing overhead.

 - Likewise, most of the page cache could be saved from an old kernel to a new kernel as well - further reducing checkpointing overhead.

 - The PROT_NONE mechanism of the current NUMA balancing code could be used to transparently mark user-space pages as 'checkpointed'. This would reduce system interruption, as only 'newly modified' pages would have to be checkpointed when the upgrade happens.
 - Hardware devices could be marked as 'already in a well defined state', skipping the more expensive steps of driver initialization.

 - Possibly full user-space page tables could be preserved over an upgrade: this way user-space execution would be unaffected even at the micro level: cache layout, TLB patterns, etc.

There's lots of gradual speedups possible with such a model IMO.

With live kernel patching we run into a brick wall of complexity straight away: we have to analyze the nature of the kernel modification, in the context of live patching, and that only works for the simplest of kernel modifications.

With live kernel upgrades no such brick wall exists, just about any transition between kernel versions is possible.

Granted, with live kernel upgrades it's much more complex to get the 'simple' case into even a rudimentarily working fashion (full userspace state has to be enumerated, saved and restored), but once we are there, it's a whole new category of goodness, and it probably covers 90%+ of the live kernel patching usecases on day 1 already ...

Thanks,

	Ingo
Re: live kernel upgrades (was: live kernel patching design)
* Jiri Kosina wrote:

> [...] We could optimize the kernel the craziest way we
> can, but hardware takes its time to reinitialize. And in
> most cases, you'd really need to reinitialize it; [...]

If we want to reinitialize a device, most of the longer initialization latencies during bootup these days involve things like: 'poke hardware, see if there's any response'. Those are mostly going away quickly with modern, well-enumerated hardware interfaces.

Just try a modprobe of a random hardware driver - most initialization sequences are very fast. (That's how people are able to do cold bootups in less than 1 second.)

In theory this could also be optimized: we could avoid the reinitialization step during an upgrade via relatively simple means, for example if drivers define their own state version and the new kernel's driver checks whether the previous state is from a compatible driver. Then the new driver could do a shorter initialization sequence.

But I'd do it only in special cases, where for some reason the initialization sequence takes a longer time and it makes sense to share hardware discovery information between two versions of the driver. I'm not convinced such a mechanism is necessary in the general case.

Thanks,

	Ingo
Re: live kernel upgrades (was: live kernel patching design)
On 02/22/2015, 10:46 AM, Ingo Molnar wrote:
> Arbitrary live kernel upgrades could be achieved by
> starting with the 'simple method' I outlined in earlier
> mails, using some of the methods that kpatch and kGraft
> are both utilizing or planning to utilize:
>
>  - implement user task and kthread parking to get the
>    kernel into quiescent state.
>
>  - implement (optional, thus ABI-compatible) system call
>    interruptability and restartability support.
>
>  - implement task state and (limited) device state
>    snapshotting support
>
>  - implement live kernel upgrades by:
>
>     - snapshotting all system state transparently
>
>     - fast-rebooting into the new kernel image without
>       shutting down and rebooting user-space, i.e. _much_
>       faster than a regular reboot.
>
>     - restoring system state transparently within the new
>       kernel image and resuming system workloads where
>       they were left.
>
> Even complex external state like TCP socket state and
> graphics state can be preserved over an upgrade. As far
> as the user is concerned, nothing happened but a brief
> pause - and he's now running a v3.21 kernel, not v3.20.
>
> Obviously one of the simplest utilizations of live kernel
> upgrades would be to apply simple security fixes to
> production systems. But that's just a very simple
> application of a much broader capability.
>
> Note that if done right, then the time to perform a live
> kernel upgrade on a typical system could be brought to
> well below 10 seconds system stoppage time: adequate to
> the vast majority of installations.
>
> For special installations or well optimized hardware the
> latency could possibly be brought below 1 second stoppage
> time.

Hello,

IMNSHO, you cannot. The criu-based approach you have just described is already alive as an external project in Parallels. It is of course a perfect solution for some use cases, but its use case is a distinctive one. It is not our competitor, it is our complementer. I will try to explain why.

It is highly dependent on HW.
Kexec is not (nor would any other arbitrary kernel-exchange mechanism be) supported by all HW, nor by all drivers. For some devices there is not even a way to implement snapshotting, which is a real issue, obviously.

Downtime is highly dependent on the scenario. If you have plenty of dirty memory, you have to flush it first. This might take minutes, especially when using a network FS. Or you need not, but a failure to replace the kernel is then lethal. If you have a heap of open FDs, restore time will take ages. You cannot fool any of those: it's pure I/O. You cannot estimate the downtime, and that is a real downside.

Even if you can get the criu time under one second, this is still unacceptable for live patching. Live patching shall be three orders of magnitude faster than that, otherwise it makes no sense. If you can afford a second, you probably already have a large enough window or failure handling to perform a full and mainly safer reboot/kexec anyway.

You cannot restore everything:

* TCP is one of the pure beasts in this. And there is indeed plenty of theoretical work behind this, explaining what can or cannot be done.
* NFS is another one.
* Xorg. Today, we cannot even fluently switch between discrete and native GFX chips. No go.
* There indeed are situations where NP-hard problems need to be solved upon restoration. No way, if you want to restore yet in this century.

While you cannot live-patch everything using KLP, that is patch-dependent. Failure of restoration is condition-dependent, and the condition is really fuzzy. That is a huge difference. Although you present the criu-based approach as provably safe and correct, it is not in many cases, and cannot be by definition.

That said, we are not going to start moving that way, except for the many good points which emerged during the discussion (fake signals, to pick one).
> This 'live kernel upgrades' approach would have various
> advantages:
>
>   - it brings together various principles working towards
>     shared goals:
>
>       - the boot time reduction folks
>       - the checkpoint/restore folks
>       - the hibernation folks
>       - the suspend/resume and power management folks
>       - the live patching folks (you)
>       - the syscall latency reduction folks
>
>     if so many disciplines are working together then maybe
>     something really good and long term maintainable can
>     crystalize out of that effort.

I must admit, whenever I implemented something in the kernel, nobody did any work for me. So the above will only result in the live patching teams doing all the work. I am not saying we do not want to do the work. I am only pointing out that there is nothing like "working together with other teams" (unless we are sending them their pay-bills).

>   - it ignores the security theater that treats security
>     fixes as a separate, disproportionally more important
>     class of fixes and instead allows arbitrary complex
Re: live kernel upgrades (was: live kernel patching design)
On Tue, Feb 24, 2015 at 10:44:05AM +0100, Ingo Molnar wrote:

> > This is the most common argument that's raised when live
> > patching is discussed: "Why do we need live patching when
> > we have redundancy?"
>
> My argument is that if we start off with a latency of 10
> seconds and improve that gradually, it will be good for
> everyone with a clear, actionable route for even those
> who cannot take a 10 seconds delay today.

Sure, we can do it that way. Or do it in the other direction.

Today we have a tool (livepatch) in the kernel that can apply trivial single-function fixes without a measurable disruption to applications. And we can improve it gradually to expand the range of fixes it can apply.

Dependent functions can be done by kGraft's lazy migration.

Limited data structure changes can be handled by shadowing.

Major data structure and/or locking changes require stopping the kernel, and trapping all tasks at the kernel/userspace boundary is clearly the cleanest way to do that. It comes at a steep latency cost, though.

Full code replacement without change scope consideration requires full serialization and deserialization of hardware and userspace interface state, which is something we don't have today and which would require work on every single driver. Possible, but probably a decade of effort.

With this approach you have something useful at every point, and every piece of effort put in gives you a reward.

> Let's see the use cases:
>
> > [...] Examples would be legacy applications which can't
> > run in an active-active cluster and need to be restarted
> > on failover.
>
> Most clusters (say web frontends) can take a stoppage of
> a couple of seconds.

It's easy to find examples of workloads that can be stopped. It doesn't rule out a significant set of those where stopping them is very expensive.

> > Another usecase is large HPC clusters, where all nodes
> > have to run carefully synchronized. Once one gets behind
> > in a calculation cycle, others have to wait for the
> > results and the efficiency of the whole cluster goes
> > down.
> [...]
>
> I think calculation nodes on large HPC clusters qualify
> as the specialized case that I mentioned, where the
> update latency could be brought down into the 1 second
> range.
>
> But I don't think calculation nodes are patched in the
> typical case: you might want to patch Internet facing
> frontend systems, the rest is left as undisturbed as
> possible. So I'm not even sure this is a typical usecase.

They're not patched for security bugs, but stability bugs are an important issue for multi-month calculations.

> In any case, there's no hard limit on how fast such a
> kernel upgrade can get in principle, and the folks who
> care about that latency will sure help out optimizing it
> and many HPC projects are well funded.

So far, unless you come up with an effective solution, if you're catching all tasks at the kernel/userspace boundary (the Kragle approach), the service interruption is effectively unbounded due to tasks in D state.

> > The value of live patching is in near zero disruption.
>
> Latency is a good attribute of a kernel upgrade
> mechanism, but it's by far not the only attribute and we
> should definitely not design limitations into the
> approach and hurt all the other attributes, just to
> optimize that single attribute.

It's an attribute I'm not willing to give up. On the other hand, I definitely wouldn't argue against having modes of operation where the latency is higher and the tool is more powerful.

> I.e. don't make it a single-issue project.

There is no need to worry about that.

-- 
Vojtech Pavlik
Director SUSE Labs
Re: live kernel upgrades (was: live kernel patching design)
On Tue, Feb 24, 2015 at 11:23:29AM +0100, Ingo Molnar wrote:

> > Your upgrade proposal is an *enormous* disruption to the
> > system:
> >
> >  - a latency of well below 10 seconds is completely
> >    unacceptable to most users who want to patch the
> >    kernel of a production system _while_ it's in
> >    production.
>
> I think this statement is false for the following reasons.

The statement is very true.

>  - I'd say the majority of system operators of production
>    systems can live with a couple of seconds of delay at a
>    well defined moment of the day or week - with gradual,
>    pretty much open ended improvements in that latency
>    down the line.

In the most usual corporate setting, any noticeable outage, even out of business hours, requires advance notice and the agreement of all stakeholders - teams that depend on the system. If a "live" patching technology introduces an outage, it's not live, and for these bureaucratic reasons it will not be used; a regular reboot will be scheduled instead.

>  - I think your argument ignores the fact that live
>    upgrades would extend the scope of 'users willing to
>    patch the kernel of a production system' _enormously_.
>
> For example, I have a production system with this much
> uptime:
>
>   10:50:09 up 153 days,  3:58, 34 users,  load average: 0.00, 0.02, 0.05
>
> While currently I'm reluctant to reboot the system to
> upgrade the kernel (due to a reboot's intrusiveness), and
> that is why it has achieved a relatively high uptime, I'd
> definitely allow the kernel to upgrade at 0:00am just
> fine. (I'd even give it up to a few minutes, as long as
> TCP connections don't time out.) And I don't think my
> usecase is special.

I agree that this is useful. But it is a different problem that only partially overlaps with what we're trying to achieve with live patching.

If you can make full kernel upgrades work this way - which I doubt is achievable in the next 10 years due to all the research and infrastructure needed - then you certainly gain an additional group of users. And a great tool.
A large portion of those that ask for live patching won't use it, though. But honestly, I prefer a solution that works for small patches now over a solution for unlimited patches sometime in the next decade.

> What gradual improvements in live upgrade latency am I
> talking about?
>
>  - For example the majority of pure user-space process
>    pages in RAM could be saved from the old kernel over
>    into the new kernel - i.e. they'd stay in place in RAM,
>    but they'd be re-hashed for the new data structures.
>    This avoids a big chunk of checkpointing overhead.

I'd have hoped this would be a given. If you can't preserve memory contents and have to re-load from disk, you can just as well reboot entirely; the time needed will not be much more.

>  - Likewise, most of the page cache could be saved from an
>    old kernel to a new kernel as well - further reducing
>    checkpointing overhead.
>
>  - The PROT_NONE mechanism of the current NUMA balancing
>    code could be used to transparently mark user-space
>    pages as 'checkpointed'. This would reduce system
>    interruption as only 'newly modified' pages would have
>    to be checkpointed when the upgrade happens.
>
>  - Hardware devices could be marked as 'already in well
>    defined state', skipping the more expensive steps of
>    driver initialization.
>
>  - Possibly full user-space page tables could be preserved
>    over an upgrade: this way user-space execution would be
>    unaffected even in the micro level: cache layout, TLB
>    patterns, etc.
>
> There's lots of gradual speedups possible with such a
> model IMO.

Yes, as I say above, guaranteeing decades of employment. ;)

> With live kernel patching we run into a brick wall of
> complexity straight away: we have to analyze the nature
> of the kernel modification, in the context of live
> patching, and that only works for the simplest of kernel
> modifications.

But you're able to _use_ it.

> With live kernel upgrades no such brick wall exists, just
> about any transition between kernel versions is possible.
The brick wall you run into is "I need to implement full kernel state serialization before I can do anything at all." It isn't even clear _how_ to do that. Particularly with the Linux kernel's development model, where internal ABIs and structures are always in flux, it may not even be realistic.

> Granted, with live kernel upgrades it's much more complex
> to get the 'simple' case into an even rudimentarily
> working fashion (full userspace state has to be
> enumerated, saved and restored), but once we are there,
> it's a whole new category of goodness and it probably
> covers 90%+ of the live kernel patching usecases on day 1
> already ...

Feel free to start working on it. I'll stick with live patching.

-- 
Vojtech Pavlik
Director SUSE Labs
Re: live kernel upgrades (was: live kernel patching design)
On Tue 2015-02-24 11:23:29, Ingo Molnar wrote:

> What gradual improvements in live upgrade latency am I
> talking about?
>
>  - For example the majority of pure user-space process
>    pages in RAM could be saved from the old kernel over
>    into the new kernel - i.e. they'd stay in place in RAM,
>    but they'd be re-hashed for the new data structures.

I wonder how many structures we would need to rehash when we update the whole kernel. I think that it is not only about memory but also about any other subsystem: networking, scheduler, ...

>  - Hardware devices could be marked as 'already in well
>    defined state', skipping the more expensive steps of
>    driver initialization.

This is another point that might easily go wrong. We know that the quality of many drivers is not good. Yes, we want to make it better. But we also know that system suspend has not worked well on many systems for years, even with the huge effort put into it.

>  - Possibly full user-space page tables could be preserved
>    over an upgrade: this way user-space execution would be
>    unaffected even in the micro level: cache layout, TLB
>    patterns, etc.
>
> There's lots of gradual speedups possible with such a
> model IMO.
>
> With live kernel patching we run into a brick wall of
> complexity straight away: we have to analyze the nature
> of the kernel modification, in the context of live
> patching, and that only works for the simplest of kernel
> modifications.
>
> With live kernel upgrades no such brick wall exists, just
> about any transition between kernel versions is possible.

I see a big difference in complexity here. If verifying patches is considered complex, then I think that it is much, much more complicated to verify that a whole kernel upgrade is safe and that all states will be properly preserved and reused.

Otherwise, I don't think that live patching will be for any Joe User. The people producing patches will need to investigate the changes anyway. They will not blindly take a patch from the internet and convert it to a live patch.
I think that this is true for many other kernel features.

> Granted, with live kernel upgrades it's much more complex
> to get the 'simple' case into an even rudimentarily
> working fashion (full userspace state has to be
> enumerated, saved and restored), but once we are there,
> it's a whole new category of goodness and it probably
> covers 90%+ of the live kernel patching usecases on day 1
> already ...

I like the idea and I see the benefit for other tasks: system suspend, migration of systems to other hardware, ... But I also think that it is another level of functionality. IMHO, live patching is somewhere on the way to the full kernel update, and it will help as well. For example, we will need to somehow solve the transition of kthreads, and thus fix their parking.

I think that live patching deserves its separate solution. I consider it much less risky but still valuable. I am sure that it will have its users. Also, it will not block improving things for the full update in the future.

Best Regards,
Petr
Re: live kernel upgrades (was: live kernel patching design)
On 02/24/2015, 10:16 AM, Ingo Molnar wrote:
> and we don't design the Linux kernel for weird, extreme
> cases, we design for the common, sane case that has the
> broadest appeal, and we hope that the feature garners
> enough interest to be maintainable.

Hello,

oh, so why do we have NR_CPUS up to 8192, then? I haven't met a machine with more than 16 cores yet. You did. But you haven't met a guy thankful for live patching being so easy to deploy, yet fast. I did. What some call extreme, others accept as standard. That is, I believe, why you signed off on the support for up to 8192 CPUs.

We develop Linux to be scalable, i.e. usable in *whatever* scenario you can imagine in any world. Be it large/small machines, lowmem/highmem, NUMA/UMA, whatever. If you don't like something, you are free to disable it. Democracy.

> This is not a problem in general: the weird case can take
> care of itself just fine - 'specialized and weird'
> usually means there's enough money to throw at special
> hardware and human solutions or it goes extinct quickly ...

Live patching is not a random idea which is about to die. It is months of negotiations with customers, management and developers, of establishing teams, and of really thinking the idea through. The decisions were discussed at many conferences too.

I am trying to shed some light on why we are not trying to improve criu or any other already existing project. We studied papers, code and implementations, kSplice and such, and decided to incline to what we have implemented, presented and merged.

thanks,
-- 
js
suse labs
Re: live kernel upgrades (was: live kernel patching design)
On Tue, Feb 24, 2015 at 11:53:28AM +0100, Ingo Molnar wrote:

> * Jiri Kosina wrote:
>
> > [...] We could optimize the kernel the craziest way we
> > can, but hardware takes its time to reinitialize. And in
> > most cases, you'd really need to reinitialize it; [...]
>
> If we want to reinitialize a device, most of the longer
> initialization latencies during bootup these days involve
> things like: 'poke hardware, see if there's any response'.
> Those are mostly going away quickly with modern,
> well-enumerated hardware interfaces.
>
> Just try a modprobe of a random hardware driver - most
> initialization sequences are very fast. (That's how
> people are able to do cold bootups in less than 1
> second.)

Have you ever tried to boot a system with a large (> 100) number of drives connected over FC? That takes time to discover, and you have to do the discovery as the configuration could have changed while you were not looking.

Or a machine with terabytes of memory? Just initializing the memory takes minutes.

Or a desktop with USB? You have to reinitialize the USB bus and the state of all the USB devices, because an application might be accessing files on a USB drive.

> In theory this could also be optimized: we could avoid
> the reinitialization step through an upgrade via
> relatively simple means, for example if drivers define
> their own version and the new kernel's driver checks
> whether the previous state is from a compatible driver.
> Then the new driver could do a shorter initialization
> sequence.

There you're clearly getting into "so complex to maintain that it'll never work reliably" territory.

> But I'd do it only in special cases, where for some
> reason the initialization sequence takes a longer time
> and it makes sense to share hardware discovery
> information between two versions of the driver. I'm not
> convinced such a mechanism is necessary in the general
> case.
--
Vojtech Pavlik
Director SUSE Labs
Re: live kernel upgrades (was: live kernel patching design)
> kernel update as step one, maybe we want this on a kernel module
> level: Hot-swap of kernel modules, where a kernel module makes
> itself go quiet and serializes its state ("suspend" pretty much),
> then gets swapped out (hot) by its replacement, which then
> unserializes the state and continues.

Hmm. So Linux 5.0 will be a micro-kernel? :-)

Pavel
Re: live kernel upgrades (was: live kernel patching design)
> More importantly, both kGraft and kpatch are pretty limited
> in what kinds of updates they allow, and neither kGraft nor
> kpatch has any clear path towards applying more complex
> fixes to kernel images that I can see: kGraft can only
> apply the simplest of fixes where both versions of a
> function are interchangeable, and kpatch is only marginally
> better at that - and that's pretty fundamental to both
> projects!
>
> I think all of these problems could be resolved by shooting
> for the moon instead:
>
> - work towards allowing arbitrary live kernel upgrades!
>
>   not just 'live kernel patches'.

Note that live kernel upgrade would have interesting implications
outside the kernel:

1) glibc does "what kernel version is this?", caches the result and
   alters behaviour accordingly.

2) apps will do recently_introduced_syscall(), get an error and not
   attempt it again.

Pavel
Re: live kernel upgrades (was: live kernel patching design)
On Mon, Feb 23, 2015 at 11:42:17AM +0100, Richard Weinberger wrote:
> > Of course, if you are random Joe User, you can do whatever you
> > want, i.e. also compile your own home-brew patches and apply them
> > randomly and brick your system that way. But that's in no way
> > different from what you as Joe User can do today; there is
> > nothing that will prevent you from shooting yourself in the foot
> > if you are creative.
>
> Sorry if I ask something that got already discussed, I did not
> follow the whole live-patching discussion.
>
> How much of the userspace tooling will be publicly available?
> With live patching mainline, the kernel offers the mechanism, but
> random Joe User still needs the tools to create good live patches.

All the tools for kGraft and kpatch are available in public git
repositories. Also, while kGraft has tools to automate the generation
of patches, these are generally not required to create a patch.

--
Vojtech Pavlik
Director SUSE Labs
Re: live kernel upgrades (was: live kernel patching design)
On Mon, Feb 23, 2015 at 9:17 AM, Jiri Kosina wrote:
> On Sun, 22 Feb 2015, Arjan van de Ven wrote:
>
> Of course, if you are random Joe User, you can do whatever you want,
> i.e. also compile your own home-brew patches and apply them randomly
> and brick your system that way. But that's in no way different from
> what you as Joe User can do today; there is nothing that will
> prevent you from shooting yourself in the foot if you are creative.

Sorry if I ask something that got already discussed, I did not follow
the whole live-patching discussion.

How much of the userspace tooling will be publicly available? With
live patching mainline, the kernel offers the mechanism, but random
Joe User still needs the tools to create good live patches.

--
Thanks,
//richard
Re: live kernel upgrades (was: live kernel patching design)
On Sun, 22 Feb 2015, Arjan van de Ven wrote:
> There's a lot of logistical issues (can you patch a patched
> system... if live patching is a first class citizen you end up with
> dozens and dozens of live patches applied, some out of sequence etc
> etc).

I can't speak on behalf of others, but I definitely can speak on
behalf of SUSE, as we are already basing a product on this.

Yes, you can patch a patched system, you can patch one function
multiple times, and you can revert a patch. It's all tracked by
dependencies.

Of course, if you are random Joe User, you can do whatever you want,
i.e. also compile your own home-brew patches and apply them randomly
and brick your system that way. But that's in no way different from
what you as Joe User can do today; there is nothing that will prevent
you from shooting yourself in the foot if you are creative.

Regarding "out of sequence", it is up to the vendor
providing/packaging the patches to make sure this is guaranteed not
to happen. SUSE, for example, always provides an "all-in-one" patch
for each and every released and supported kernel codestream in a
cumulative manner, which takes care of the ordering issue completely.
It's not really too different from shipping external kernel modules
and making sure they have proper dependencies that need to be
satisfied before the module can be loaded.

> There's the "which patches do I have, and if the first patch for a
> security hole was not complete, how do I cope by applying number
> two". There's the "which of my 50.000 servers have which patch
> applied" logistics.

Yes. That's easy if distro/patch vendors build reasonable userspace
and distribution infrastructure around this.

Thanks,
--
Jiri Kosina
SUSE Labs
Re: live kernel upgrades (was: live kernel patching design)
On Sun, Feb 22, 2015 at 03:01:48PM -0800, Andrew Morton wrote:
> On Sun, 22 Feb 2015 20:13:28 +0100 (CET) Jiri Kosina wrote:
>
> > But if you ask the folks who are hungry for live bug patching,
> > they wouldn't care.
> >
> > You mentioned "10 seconds", that's more or less equal to infinity
> > to them.
>
> "10 seconds outage is unacceptable, but we're running our service
> on a single machine with no failover." Who is doing this??

This is the most common argument that's raised when live patching is
discussed: "Why do we need live patching when we have redundancy?"

People who are asking for live patching typically do have failover in
place, but prefer not to have to use it when they don't have to. In
many cases, the failover just can't be made transparent to the
outside world and there is a short outage. Examples would be legacy
applications which can't run in an active-active cluster and need to
be restarted on failover. Or trading systems, where the calculations
must be strictly serialized and response times are counted in tens of
microseconds.

Another use case is large HPC clusters, where all nodes have to run
carefully synchronized. Once one gets behind in a calculation cycle,
others have to wait for the results and the efficiency of the whole
cluster goes down. There are people who run realtime on them for that
reason. Dumping all data and restarting the HPC cluster takes a lot
of time, and many nodes (out of tens of thousands) may not come back
up, making the restore from media difficult. Doing a rolling upgrade
stalls the nodes one by one by 10+ seconds, which times 10k is a long
time, too.

And even the case where you have a perfect setup with everything
redundant and with instant failover does benefit from live patching.
Since you have to plan for failure, you have to plan for failure
while patching, too. With live patching you need 2 servers minimum
(or N+1); without it you need 3 (or N+2), as one will be offline
during the upgrade process.
10 seconds of outage may be acceptable in a disaster scenario, but
not necessarily in a regular update scenario. The value of live
patching is in near-zero disruption.

--
Vojtech Pavlik
Director SUSE Labs
Re: live kernel upgrades (was: live kernel patching design)
There's failover, there's running the core services in VMs (which can
migrate)...

I think "10 seconds" is Ingo exaggerating a bit, since you can boot a
full system in a lot less time than that, and more so if you know
more about the system (e.g. you don't need to spin down and then
discover and spin up disks). If you're talking about inside a VM,
it's even more extreme than that.

Now, live patching sounds great as an ideal, but it may end up being
(mostly) like hardware hotplug: everyone wants it, but nobody wants
to use it (and just waits for a maintenance window instead). In the
hotplug case, while people say they want it, they're also aware that
hardware hotplug is fundamentally messy, and then nobody wants to do
it on that mission-critical piece of hardware outside the maintenance
window. (Hotswap drives seem to have been the exception to this; that
seems to have been worked out well enough, but that's
replace-with-the-same.)

I would be very afraid that hot kernel patching ends up in the same
space: the super-mission-critical folks are whom it's aimed at, while
those are the exact same folks that would rather wait for the
maintenance window.

There's a lot of logistical issues (can you patch a patched system...
if live patching is a first class citizen you end up with dozens and
dozens of live patches applied, some out of sequence etc etc).
There's the "which patches do I have, and if the first patch for a
security hole was not complete, how do I cope by applying number
two". There's the "which of my 50.000 servers have which patch
applied" logistics.

And Ingo is absolutely right: the scope is very fuzzy. Today's bugfix
is tomorrow's "oh oops, it turns out exploitable".
I will throw a different hat in the ring: maybe we don't want a full
kernel update as step one, maybe we want this on a kernel module
level: hot-swap of kernel modules, where a kernel module makes itself
go quiet and serializes its state ("suspend" pretty much), then gets
swapped out (hot) by its replacement, which then unserializes the
state and continues.

If we can do this on a module level, then the next step is treating
more components of the kernel as modules, which is a fundamental
modularity thing.

On Sun, Feb 22, 2015 at 4:18 PM, Dave Airlie wrote:
> On 23 February 2015 at 09:01, Andrew Morton wrote:
>> On Sun, 22 Feb 2015 20:13:28 +0100 (CET) Jiri Kosina wrote:
>>
>>> But if you ask the folks who are hungry for live bug patching,
>>> they wouldn't care.
>>>
>>> You mentioned "10 seconds", that's more or less equal to infinity
>>> to them.
>>
>> 10 seconds outage is unacceptable, but we're running our service
>> on a single machine with no failover. Who is doing this??
>
> if I had to guess, telcos generally, you've only got one wire
> between a phone and the exchange and if the switch on the end needs
> patching it better be fast.
>
> Dave.
Re: live kernel upgrades (was: live kernel patching design)
On 23 February 2015 at 09:01, Andrew Morton wrote:
> On Sun, 22 Feb 2015 20:13:28 +0100 (CET) Jiri Kosina wrote:
>
>> But if you ask the folks who are hungry for live bug patching,
>> they wouldn't care.
>>
>> You mentioned "10 seconds", that's more or less equal to infinity
>> to them.
>
> "10 seconds outage is unacceptable, but we're running our service
> on a single machine with no failover." Who is doing this??

If I had to guess: telcos, generally. You've only got one wire
between a phone and the exchange, and if the switch on the end needs
patching, it had better be fast.

Dave.
Re: live kernel upgrades (was: live kernel patching design)
On Sun, 22 Feb 2015 20:13:28 +0100 (CET) Jiri Kosina wrote:
> But if you ask the folks who are hungry for live bug patching, they
> wouldn't care.
>
> You mentioned "10 seconds", that's more or less equal to infinity
> to them.

"10 seconds outage is unacceptable, but we're running our service on
a single machine with no failover." Who is doing this??
Re: live kernel upgrades (was: live kernel patching design)
[ added live-patching@ ML as well, in consistency with Josh ]

On Sun, 22 Feb 2015, Ingo Molnar wrote:
> It's all still tons of work to pull off a 'live kernel upgrade' on
> native hardware, but IMHO it's tons of very useful work that helps
> a dozen non-competing projects, literally.

Yes, I agree, it might be a nice-to-have feature. The only issue with
that is that it's solving a completely different problem than live
patching.

Guys working on criu have already made quite a few steps in that
direction, of course; modulo bugs and current implementation
limitations, you should be able to checkpoint your userspace, kexec
to a new kernel, and restart your userspace.

But if you ask the folks who are hungry for live bug patching, they
wouldn't care. You mentioned "10 seconds", that's more or less equal
to infinity to them.

And frankly, even "10 seconds" is something we can't really
guarantee. We could optimize the kernel the craziest way we can, but
hardware takes its time to reinitialize. And in most cases, you'd
really need to reinitialize it; I don't see a way how you could
safely suspend it somehow in the old kernel and resume it in a new
one, because the driver suspending the device might be completely
different from the driver resuming the device. How are you able to
provide hard guarantees that this is going to work?

So all in all, if you ask me -- yes, live kernel upgrades from v3.20
to v3.21: pretty cool feature. Is it related to the problem we are
after with live bug patching? I very much don't think so.

Thanks,
--
Jiri Kosina
SUSE Labs
Re: live kernel upgrades (was: live kernel patching design)
On Sun, 22 Feb 2015, Josh Poimboeuf wrote:
> Yes, there have been some suggestions that we should support
> multiple consistency models, but I haven't heard any good reasons
> that would justify the added complexity.

I tend to agree; consistency models were just a temporary idea that
is likely to become unnecessary given all the ideas on the unified
solution that have been presented so far.

(Well, with a small exception to this -- I still think we should be
able to "fire and forget" for patches where it's guaranteed that no
housekeeping is necessary -- my favorite example is again fixing an
out-of-bounds access in a certain syscall entry ... i.e. the
"super-simple" consistency model.)

--
Jiri Kosina
SUSE Labs
Re: live kernel upgrades (was: live kernel patching design)
On Sun, Feb 22, 2015 at 08:37:58AM -0600, Josh Poimboeuf wrote:
> On Sun, Feb 22, 2015 at 10:46:39AM +0100, Ingo Molnar wrote:
> > - the whole 'consistency model' talk both projects employ
> >   reminds me of how we grew 'security modules': where
> >   people running various mediocre projects would in the
> >   end not seek to create a superior upstream project, but
> >   would seek the 'consensus' in the form of cross-acking
> >   each others' patches as long as their own code got
> >   upstream as well ...
>
> That's just not the case. The consistency models were used to
> describe the features and the pros and cons of the different
> approaches.
>
> The RFC is not a compromise to get "cross-acks". IMO it's an
> improvement on both kpatch and kGraft. See the RFC cover letter [1]
> and the original consistency model discussion [2] for more details.

BTW, I proposed in my RFC that we only need a _single_ consistency
model. Yes, there have been some suggestions that we should support
multiple consistency models, but I haven't heard any good reasons
that would justify the added complexity.

--
Josh
Re: live kernel upgrades (was: live kernel patching design)
[ adding live-patching mailing list to CC ]

On Sun, Feb 22, 2015 at 10:46:39AM +0100, Ingo Molnar wrote:
> * Ingo Molnar mi...@kernel.org wrote:
>
> > Anyway, let me try to reboot this discussion back to
> > technological details by summing up my arguments in
> > another mail.
>
> So here's how I see the kGraft and kpatch series. To not
> put too fine a point on it, I think they are fundamentally
> misguided in both implementation and in design, which turns
> them into an (unwilling) extended arm of the security
> theater:
>
> - kGraft creates a 'mixed' state where old kernel
>   functions and new kernel functions are allowed to
>   co-exist,

Yes, some tasks may be running old functions and some tasks may be
running new functions. This would only cause a problem if there are
changes to global data semantics. We have guidelines the patch author
can follow to ensure that this isn't a problem.

>   attempting to get the patching done within a bound
>   amount of time.

Don't forget about my RFC [1] which converges the system to a patched
state within a few seconds. If the system isn't patched by then, the
user space tool can trigger a safe patch revert.

> - kpatch uses kernel stack backtraces to determine whether
>   a task is executing a function or not - which IMO is
>   fundamentally fragile as kernel stack backtraces are
>   'debug info' and are maintained and created as such:
>   we've had long lasting stack backtrace bugs which would
>   now be turned into 'potentially patching a live
>   function' type of functional (and hard to debug) bugs.
>   I didn't see much effort that tries to turn this
>   equation around and makes kernel stacktraces more
>   robust.

Again, I proposed several stack unwinding validation improvements
which would make this a non-issue IMO.
> - the whole 'consistency model' talk both projects employ
>   reminds me of how we grew 'security modules': where
>   people running various mediocre projects would in the
>   end not seek to create a superior upstream project, but
>   would seek the 'consensus' in the form of cross-acking
>   each others' patches as long as their own code got
>   upstream as well ...

That's just not the case. The consistency models were used to
describe the features and the pros and cons of the different
approaches.

The RFC is not a compromise to get "cross-acks". IMO it's an
improvement on both kpatch and kGraft. See the RFC cover letter [1]
and the original consistency model discussion [2] for more details.

>   I'm not blaming Linus for giving in to allowing security
>   modules: they might be the right model for such a hard
>   to define and in good part psychological discipline as
>   'security', but I sure don't see the necessity of doing
>   that for 'live kernel patching'.
>
> More importantly, both kGraft and kpatch are pretty limited
> in what kinds of updates they allow, and neither kGraft nor
> kpatch has any clear path towards applying more complex
> fixes to kernel images that I can see: kGraft can only
> apply the simplest of fixes where both versions of a
> function are interchangeable, and kpatch is only marginally
> better at that - and that's pretty fundamental to both
> projects!

Sorry, but that is just not true. We can apply complex patches,
including "non-interchangeable functions" and data
structures/semantics. The catch is that it requires the patch author
to put in the work to modify the patch to make it compatible with
live patching. But that's an acceptable tradeoff for distros who want
to support live patching.

> I think all of these problems could be resolved by shooting
> for the moon instead:
>
> - work towards allowing arbitrary live kernel upgrades!
>
>   not just 'live kernel patches'.
> Work towards the goal of full live kernel upgrades between
> any two versions of a kernel that supports live kernel
> upgrades (and that doesn't have fatal bugs in the kernel
> upgrade support code requiring a hard system restart).
>
> Arbitrary live kernel upgrades could be achieved by
> starting with the 'simple method' I outlined in earlier
> mails, using some of the methods that kpatch and kGraft are
> both utilizing or planning to utilize:
>
> - implement user task and kthread parking to get the
>   kernel into quiescent state.
>
> - implement (optional, thus ABI-compatible)
>   system call interruptability and restartability
>   support.
>
> - implement task state and (limited) device state
>   snapshotting support
>
> - implement live kernel upgrades by:
>
>   - snapshotting all system state transparently
>
>   - fast-rebooting into the new kernel image without
>     shutting down and rebooting user-space, i.e. _much_
>     faster than a regular reboot.
>
>   - restoring system state transparently within the new
>     kernel image and resuming system workloads where
>     they were left.
>
> Even complex external state like TCP socket state and
> graphics state can be preserved over an upgrade.
Re: live kernel upgrades (was: live kernel patching design)
* Ingo Molnar wrote:
> We have many of the building blocks in place and have
> them available:
>
> - the freezer code already attempts at parking/unparking
>   threads transparently, that could be fixed/extended.
>
> - hibernation, regular suspend/resume and in general
>   power management has in essence already implemented
>   most building blocks needed to enumerate and
>   checkpoint/restore device state that otherwise gets
>   lost in a shutdown/reboot cycle.
>
> - c/r patches started user state enumeration and
>   checkpoint/restore logic

I forgot to mention:

- kexec allows the loading and execution of a new kernel image.

It's all still tons of work to pull off a 'live kernel upgrade' on
native hardware, but IMHO it's tons of very useful work that helps a
dozen non-competing projects, literally.

Thanks,
Ingo
Re: live kernel upgrades (was: live kernel patching design)
* Ingo Molnar wrote:
> - implement live kernel upgrades by:
>
>   - snapshotting all system state transparently

Note that this step can be sped up further in the end, because most
of this work can be performed asynchronously and transparently prior
to the live kernel upgrade step itself.

So if we split the snapshotting+parking preparatory step into two
parts:

- do opportunistic snapshotting of sleeping/inactive user tasks
  while allowing snapshotted tasks to continue to run

- once that is completed, do snapshotting+parking of all user tasks,
  even running ones

The first step is largely asynchronous, can be done with lower
priority and does not park/stop any tasks on the system.

Only the second step counts as 'system stoppage time': and only those
tasks have to be snapshotted again which executed any code since the
first snapshotting run was performed.

Note that even this stoppage time can be reduced further: if a system
is running critical services/users that need as little interruption
as possible, they could be prioritized/ordered to be
snapshotted/parked closest to the live kernel upgrade step.

> - fast-rebooting into the new kernel image without
>   shutting down and rebooting user-space, i.e. _much_
>   faster than a regular reboot.
>
> - restoring system state transparently within the new
>   kernel image and resuming system workloads where
>   they were left.
>
> Even complex external state like TCP socket state and
> graphics state can be preserved over an upgrade. As far
> as the user is concerned, nothing happened but a brief
> pause - and he's now running a v3.21 kernel, not v3.20.

So all this would allow 'live, rolling kernel upgrades' in the end.

Thanks,
Ingo
Re: live kernel upgrades (was: live kernel patching design)
* Ingo Molnar mi...@kernel.org wrote: - implement live kernel upgrades by: - snapshotting all system state transparently Note that this step can be sped up further in the end, because most of this work can be performed asynchronously and transparently prior to the live kernel upgrade step itself. So if we split the snapshotting+parking preparatory step into two parts: - do opportunistic snapshotting of sleeping/inactive user tasks while allowing snapshotted tasks to continue to run - once that is completed, do snapshotting+parking of all user tasks, even running ones The first step is largely asynchronous, can be done with lower priority and does not park/stop any tasks on the system. Only the second step counts as 'system stoppage time': and only those tasks have to be snapshotted again which executed any code since the first snapshotting run was performed. Note that even this stoppage time can be reduced further: if a system is running critical services/users that need as little interruption as possible, they could be prioritized/ordered to be snapshotted/parked closest to the live kernel upgrade step. - fast-rebooting into the new kernel image without shutting down and rebooting user-space, i.e. _much_ faster than a regular reboot. - restoring system state transparently within the new kernel image and resuming system workloads where they were left. Even complex external state like TCP socket state and graphics state can be preserved over an upgrade. As far as the user is concerned, nothing happened but a brief pause - and he's now running a v3.21 kernel, not v3.20. So all this would allow 'live, rolling kernel upgrades' in the end. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: live kernel upgrades (was: live kernel patching design)
* Ingo Molnar mi...@kernel.org wrote:

We have many of the building blocks in place and have them available:

- the freezer code already attempts at parking/unparking threads transparently, that could be fixed/extended.

- hibernation, regular suspend/resume and in general power management has in essence already implemented most building blocks needed to enumerate and checkpoint/restore device state that otherwise gets lost in a shutdown/reboot cycle.

- c/r patches started user state enumeration and checkpoint/restore logic

I forgot to mention:

- kexec allows the loading and execution of a new kernel image.

It's all still tons of work to pull off a 'live kernel upgrade' on native hardware, but IMHO it's tons of very useful work that helps a dozen non-competing projects, literally.

Thanks,

	Ingo
Re: live kernel upgrades (was: live kernel patching design)
[ adding live-patching mailing list to CC ]

On Sun, Feb 22, 2015 at 10:46:39AM +0100, Ingo Molnar wrote:
> * Ingo Molnar mi...@kernel.org wrote:
> > Anyway, let me try to reboot this discussion back to technological details by summing up my arguments in another mail.
>
> So here's how I see the kGraft and kpatch series. To not put too fine a point on it, I think they are fundamentally misguided in both implementation and in design, which turns them into an (unwilling) extended arm of the security theater:
>
> - kGraft creates a 'mixed' state where old kernel functions and new kernel functions are allowed to co-exist,

Yes, some tasks may be running old functions and some tasks may be running new functions. This would only cause a problem if there are changes to global data semantics. We have guidelines the patch author can follow to ensure that this isn't a problem.

> attempting to get the patching done within a bound amount of time.

Don't forget about my RFC [1] which converges the system to a patched state within a few seconds. If the system isn't patched by then, the user space tool can trigger a safe patch revert.

> - kpatch uses kernel stack backtraces to determine whether a task is executing a function or not - which IMO is fundamentally fragile as kernel stack backtraces are 'debug info' and are maintained and created as such: we've had long lasting stack backtrace bugs which would now be turned into 'potentially patching a live function' type of functional (and hard to debug) bugs. I didn't see much effort that tries to turn this equation around and makes kernel stacktraces more robust.

Again, I proposed several stack unwinding validation improvements which would make this a non-issue IMO.
> - the whole 'consistency model' talk both projects employ reminds me of how we grew 'security modules': where people running various mediocre projects would in the end not seek to create a superior upstream project, but would seek the 'consensus' in the form of cross-acking each others' patches as long as their own code got upstream as well ...

That's just not the case. The consistency models were used to describe the features and the pros and cons of the different approaches. The RFC is not a compromise to get cross-acks. IMO it's an improvement on both kpatch and kGraft. See the RFC cover letter [1] and the original consistency model discussion [2] for more details.

> I'm not blaming Linus for giving in to allowing security modules: they might be the right model for such a hard to define and in good part psychological discipline as 'security', but I sure don't see the necessity of doing that for 'live kernel patching'.
>
> More importantly, both kGraft and kpatch are pretty limited in what kinds of updates they allow, and neither kGraft nor kpatch has any clear path towards applying more complex fixes to kernel images that I can see: kGraft can only apply the simplest of fixes where both versions of a function are interchangeable - and kpatch is only marginally better at that - and that's pretty fundamental to both projects!

Sorry, but that is just not true. We can apply complex patches, including non-interchangeable functions and data structures/semantics. The catch is that it requires the patch author to put in the work to modify the patch to make it compatible with live patching. But that's an acceptable tradeoff for distros who want to support live patching.

> I think all of these problems could be resolved by shooting for the moon instead:
>
> - work towards allowing arbitrary live kernel upgrades! not just 'live kernel patches'.
> Work towards the goal of full live kernel upgrades between any two versions of a kernel that supports live kernel upgrades (and that doesn't have fatal bugs in the kernel upgrade support code requiring a hard system restart).
>
> Arbitrary live kernel upgrades could be achieved by starting with the 'simple method' I outlined in earlier mails, using some of the methods that kpatch and kGraft are both utilizing or planning to utilize:
>
> - implement user task and kthread parking to get the kernel into quiescent state.
>
> - implement (optional, thus ABI-compatible) system call interruptability and restartability support.
>
> - implement task state and (limited) device state snapshotting support
>
> - implement live kernel upgrades by:
>
>   - snapshotting all system state transparently
>
>   - fast-rebooting into the new kernel image without shutting down and rebooting user-space, i.e. _much_ faster than a regular reboot.
>
>   - restoring system state transparently within the new kernel image and resuming system workloads where they were left.
>
> Even complex external state like TCP socket state and graphics state can be preserved over an upgrade.
Re: live kernel upgrades (was: live kernel patching design)
[ added live-patching@ ML as well, in consistency with Josh ]

On Sun, 22 Feb 2015, Ingo Molnar wrote:

> It's all still tons of work to pull off a 'live kernel upgrade' on native hardware, but IMHO it's tons of very useful work that helps a dozen non-competing projects, literally.

Yes, I agree, it might be a nice-to-have feature. The only issue with that is that it's solving a completely different problem than live patching.

Guys working on criu have already made quite a few steps in that direction, of course; modulo bugs and current implementation limitations, you should be able to checkpoint your userspace, kexec to a new kernel, and restart your userspace.

But if you ask the folks who are hungry for live bug patching, they wouldn't care. You mentioned 10 seconds, that's more or less equal to infinity to them.

And frankly, even 10 seconds is something we can't really guarantee. We could optimize the kernel the craziest way we can, but hardware takes its time to reinitialize. And in most cases, you'd really need to reinitialize it; I don't see a way how you could safely suspend it somehow in the old kernel and resume it in a new one, because the driver suspending the device might be completely different than the driver resuming the device. How are you able to provide hard guarantees that this is going to work?

So all in all, if you ask me -- yes, live kernel upgrades from v3.20 to v3.21, pretty cool feature. Is it related to the problem we are after with live bug patching? I very much don't think so.

Thanks,

--
Jiri Kosina
SUSE Labs
Re: live kernel upgrades (was: live kernel patching design)
On Sun, Feb 22, 2015 at 08:37:58AM -0600, Josh Poimboeuf wrote:
> On Sun, Feb 22, 2015 at 10:46:39AM +0100, Ingo Molnar wrote:
> > - the whole 'consistency model' talk both projects employ reminds me of how we grew 'security modules': where people running various mediocre projects would in the end not seek to create a superior upstream project, but would seek the 'consensus' in the form of cross-acking each others' patches as long as their own code got upstream as well ...
>
> That's just not the case. The consistency models were used to describe the features and the pros and cons of the different approaches. The RFC is not a compromise to get cross-acks. IMO it's an improvement on both kpatch and kGraft. See the RFC cover letter [1] and the original consistency model discussion [2] for more details.

BTW, I proposed that with my RFC we only need a _single_ consistency model. Yes, there have been some suggestions that we should support multiple consistency models, but I haven't heard any good reasons that would justify the added complexity.

--
Josh
Re: live kernel upgrades (was: live kernel patching design)
On Sun, 22 Feb 2015, Josh Poimboeuf wrote:

> Yes, there have been some suggestions that we should support multiple consistency models, but I haven't heard any good reasons that would justify the added complexity.

I tend to agree, consistency models were just a temporary idea that seems likely to become unnecessary given all the ideas on the unified solution that have been presented so far.

(Well, with a small exception to this -- I still think we should be able to fire and forget for patches where it's guaranteed that no housekeeping is necessary -- my favorite example is again fixing an out-of-bounds access in a certain syscall entry ... i.e. the super-simple consistency model).

--
Jiri Kosina
SUSE Labs
Re: live kernel upgrades (was: live kernel patching design)
On Sun, 22 Feb 2015 20:13:28 +0100 (CET) Jiri Kosina jkos...@suse.cz wrote:

> But if you ask the folks who are hungry for live bug patching, they wouldn't care. You mentioned 10 seconds, that's more or less equal to infinity to them.

"10 seconds outage is unacceptable, but we're running our service on a single machine with no failover". Who is doing this??
Re: live kernel upgrades (was: live kernel patching design)
On Sun, Feb 22, 2015 at 03:01:48PM -0800, Andrew Morton wrote:
> On Sun, 22 Feb 2015 20:13:28 +0100 (CET) Jiri Kosina jkos...@suse.cz wrote:
> > But if you ask the folks who are hungry for live bug patching, they wouldn't care. You mentioned 10 seconds, that's more or less equal to infinity to them.
>
> "10 seconds outage is unacceptable, but we're running our service on a single machine with no failover". Who is doing this??

This is the most common argument that's raised when live patching is discussed: why do we need live patching when we have redundancy? People who are asking for live patching typically do have failover in place, but prefer not to have to use it when they don't have to.

In many cases, the failover just can't be made transparent to the outside world and there is a short outage. Examples would be legacy applications which can't run in an active-active cluster and need to be restarted on failover. Or trading systems, where the calculations must be strictly serialized and response times are counted in tens of microseconds.

Another usecase is large HPC clusters, where all nodes have to run carefully synchronized. Once one gets behind in a calculation cycle, others have to wait for the results and the efficiency of the whole cluster goes down. There are people who run realtime on them for that reason. Dumping all data and restarting the HPC cluster takes a lot of time and many nodes (out of tens of thousands) may not come back up, making the restore from media difficult. Doing a rolling upgrade causes the nodes to stall one by one for 10+ seconds, which times 10k is a long time, too.

And even the case where you have a perfect setup with everything redundant and with instant failover does benefit from live patching. Since you have to plan for failure, you have to plan for failure while patching, too. With live patching you need 2 servers minimum (or N+1); without it you need 3 (or N+2), as one will be offline during the upgrade process.
10 seconds of outage may be acceptable in a disaster scenario. Not necessarily for a regular update scenario. The value of live patching is in near zero disruption.

--
Vojtech Pavlik
Director SUSE Labs
Re: live kernel upgrades (was: live kernel patching design)
On 23 February 2015 at 09:01, Andrew Morton a...@linux-foundation.org wrote:
> On Sun, 22 Feb 2015 20:13:28 +0100 (CET) Jiri Kosina jkos...@suse.cz wrote:
> > But if you ask the folks who are hungry for live bug patching, they wouldn't care. You mentioned 10 seconds, that's more or less equal to infinity to them.
>
> "10 seconds outage is unacceptable, but we're running our service on a single machine with no failover". Who is doing this??

If I had to guess: telcos, generally. You've only got one wire between a phone and the exchange, and if the switch on the end needs patching it better be fast.

Dave.
Re: live kernel upgrades (was: live kernel patching design)
There's failover, there's running the core services in VMs (which can migrate)...

I think 10 seconds is Ingo exaggerating a bit, since you can boot a full system in a lot less time than that, and more so if you know more about the system (e.g. don't need to spin down and then discover and spin up disks). If you're talking about inside a VM it's even more extreme than that.

Now, live patching sounds great as an ideal, but it may end up being (mostly) like hardware hotplug: everyone wants it, but nobody wants to use it (and just waits for a maintenance window instead). In the hotplug case, while people say they want it, they're also aware that hardware hotplug is fundamentally messy, and then nobody wants to do it on that mission critical piece of hardware outside the maintenance window. (Hotswap drives seem to have been the exception to this, that seems to have been worked out well enough, but that's replace-with-the-same.)

I would be very afraid that hot kernel patching ends up in the same space: the super-mission-critical folks are who it's aimed at, while those are the exact same folks that would rather wait for the maintenance window.

There's a lot of logistical issues (can you patch a patched system... if live patching is a first class citizen you end up with dozens and dozens of live patches applied, some out of sequence etc etc). There's the question of which patches I have, and if the first patch for a security hole was not complete, how do I cope by applying number two. There's the which-of-my-50,000-servers-have-which-patch-applied logistics.

And Ingo is absolutely right: the scope is very fuzzy. Today's bugfix is tomorrow's "oops, it turns out exploitable".
I will throw a different hat in the ring: maybe we don't want a full kernel update as step one, maybe we want this on a kernel module level: hot-swap of kernel modules, where a kernel module makes itself go quiet and serializes its state (suspend, pretty much), then gets swapped out (hot) by its replacement, which then unserializes the state and continues.

If we can do this on a module level, then the next step is treating more components of the kernel as modules, which is a fundamental modularity thing.

On Sun, Feb 22, 2015 at 4:18 PM, Dave Airlie airl...@gmail.com wrote:
> On 23 February 2015 at 09:01, Andrew Morton a...@linux-foundation.org wrote:
> > On Sun, 22 Feb 2015 20:13:28 +0100 (CET) Jiri Kosina jkos...@suse.cz wrote:
> > > But if you ask the folks who are hungry for live bug patching, they wouldn't care. You mentioned 10 seconds, that's more or less equal to infinity to them.
> >
> > "10 seconds outage is unacceptable, but we're running our service on a single machine with no failover". Who is doing this??
>
> if I had to guess, telcos generally, you've only got one wire between a phone and the exchange and if the switch on the end needs patching it better be fast.
>
> Dave.