Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
> > Maybe hooking into genapic is the right way to mop up all the uses of
> > send_IPI and its variants.

It is. More hooks in this area wouldn't be appreciated.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
* Jeremy Fitzhardinge ([EMAIL PROTECTED]) wrote:
> Maybe hooking into genapic is the right way to mop up all the uses of
> send_IPI and its variants. But from a quick grep it doesn't look like
> they get called from too many places... Most of the callers seem to be
> in arch/i386/kernel/smp.c, so they should be pretty easy to isolate.

Yeah, we'll see once we are crashing and debugging some code ;-)
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
Chris Wright wrote:
> * Jeremy Fitzhardinge ([EMAIL PROTECTED]) wrote:
>> Maybe hooking into genapic is the right way to mop up all the uses of
>> send_IPI and its variants. But from a quick grep it doesn't look like
>> they get called from too many places... Most of the callers seem to be
>> in arch/i386/kernel/smp.c, so they should be pretty easy to isolate.
>
> Yeah, we'll see once we are crashing and debugging some code ;-)

It's the Linux way (tm).

J
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
* Jeremy Fitzhardinge ([EMAIL PROTECTED]) wrote:
> I guess by "rest of the kernel" you mean other stuff in arch/i386. Yes,
> that's a concern, but maybe we can tease it apart in a sensible way.

Yes, that's exactly what I'm saying. Same with above (the native stuff), since
we don't want a bunch of apic_read type of pv_ops (oh, wait... ;-) Of course,
dom0 will be another can of worms, but one at a time.
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
Chris Wright wrote:
> * Jeremy Fitzhardinge ([EMAIL PROTECTED]) wrote:
>> I guess by "rest of the kernel" you mean other stuff in arch/i386. Yes,
>> that's a concern, but maybe we can tease it apart in a sensible way.
>
> Yes, that's exactly what I'm saying. Same with above (the native stuff), since
> we don't want a bunch of apic_read type of pv_ops (oh, wait... ;-) Of course,
> dom0 will be another can of worms, but one at a time.

Yeah, well we're already talking about a two-level model to accommodate VMI,
since it wants the mostly native SMP stuff except for the actual apic
operations.

Maybe hooking into genapic is the right way to mop up all the uses of
send_IPI and its variants. But from a quick grep it doesn't look like
they get called from too many places... Most of the callers seem to be
in arch/i386/kernel/smp.c, so they should be pretty easy to isolate.

J
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
Chris Wright wrote:
> * Jeremy Fitzhardinge ([EMAIL PROTECTED]) wrote:
>> Chris Wright wrote:
>>> I agree with that, but I think that's esp. for things like create and launch
>>> new vcpu. The IPI bit I'm not as clear on, nor running this all on native
>>> as well.
>>
>> Well, native would fall back to using the existing arch/i386 versions of
>> those functions, so that's reasonably straightforward.
>
> It's the fact that we need to leave code in the kernel to run on native,
> but also do something dynamically with that same code when running
> paravirt that I'm referring to.

Why would it be any different to all the other code we've got behind
native pvops? The ideal simplified case is that we rename
smp_send_stop/send_reschedule/prepare_cpus/etc to native_* versions. In
the !PARAVIRT case we just call the native_* version directly; in
PARAVIRT we call via the native pv_ops structure. Under Xen, all these
would be implemented independently from the native versions.

> No, it's not the IPI itself, it's the way it's often accessed by the rest of
> the kernel (which is intertwined with genapic). I'm happy to avoid apic
> altogether since it's effectively worthless for Xen other than
> integrating into the existing infrastructure.

I guess by "rest of the kernel" you mean other stuff in arch/i386. Yes,
that's a concern, but maybe we can tease it apart in a sensible way.

J
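The renaming scheme described in the message above could look roughly like
the following userspace sketch. It is only an illustration of the pattern:
the structure smp_ops_sketch, the pointer pv_smp_ops, the helper
select_backend() and the SKETCH_NO_PARAVIRT switch are all hypothetical
names, and the function bodies are stand-ins rather than kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model of the pattern; not the kernel's real structures. */

/* Records which backend handled the last call: 1 = native, 2 = Xen. */
static int last_backend;

/* The existing arch/i386 functions, renamed to native_* versions. */
static void native_smp_send_stop(void) { last_backend = 1; }
static void native_smp_send_reschedule(int cpu) { (void)cpu; last_backend = 1; }

/* Xen versions, implemented independently of the native ones. */
static void xen_smp_send_stop(void) { last_backend = 2; }
static void xen_smp_send_reschedule(int cpu) { (void)cpu; last_backend = 2; }

/* High-level operations table (hypothetical layout). */
struct smp_ops_sketch {
    void (*send_stop)(void);
    void (*send_reschedule)(int cpu);
};

static const struct smp_ops_sketch native_ops = {
    native_smp_send_stop, native_smp_send_reschedule
};
static const struct smp_ops_sketch xen_ops = {
    xen_smp_send_stop, xen_smp_send_reschedule
};

#ifdef SKETCH_NO_PARAVIRT
/* !PARAVIRT: the generic name calls the native_* version directly. */
static inline void smp_send_stop(void) { native_smp_send_stop(); }
static inline void smp_send_reschedule(int cpu) { native_smp_send_reschedule(cpu); }
#else
/* PARAVIRT: the generic name calls through the table, which starts out native. */
static const struct smp_ops_sketch *pv_smp_ops = &native_ops;

static inline void smp_send_stop(void) { pv_smp_ops->send_stop(); }
static inline void smp_send_reschedule(int cpu) { pv_smp_ops->send_reschedule(cpu); }

/* A hypervisor backend would install its own table early in boot. */
static void select_backend(const struct smp_ops_sketch *ops)
{
    assert(ops != NULL && ops->send_stop != NULL && ops->send_reschedule != NULL);
    pv_smp_ops = ops;
}
#endif
```

With the default (paravirt) build, callers keep using smp_send_stop() and
smp_send_reschedule() unchanged; only the table selected via
select_backend(&xen_ops) decides which implementation runs.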
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
* Jeremy Fitzhardinge ([EMAIL PROTECTED]) wrote:
> Chris Wright wrote:
> > * Daniel Arai ([EMAIL PROTECTED]) wrote:
> >> There's no good way to override __send_IPI_shortcut. I suppose we could add
> >> paravirt ops for __send_IPI_shortcut and every other op that touches the
> >> APIC.
> >
> > While that's basically what we did in Xen, it would make more sense to
> > build it into genapic which would give us one common abstraction to base
> > from. We should avoid adding pv_ops when existing infrastructure exists.
>
> I was looking at cutting in at a much higher level. The interface in
> is a good match for Xen, so I was going to investigate
> making pv_ops at that level and see how it falls out.

I agree with that, but I think that's esp. for things like create and launch
new vcpu. The IPI bit I'm not as clear on, nor running this all on native
as well.

thanks,
-chris
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
* Jeremy Fitzhardinge ([EMAIL PROTECTED]) wrote:
> Chris Wright wrote:
> > I agree with that, but I think that's esp. for things like create and launch
> > new vcpu. The IPI bit I'm not as clear on, nor running this all on native
> > as well.
>
> Well, native would fall back to using the existing arch/i386 versions of
> those functions, so that's reasonably straightforward.

It's the fact that we need to leave code in the kernel to run on native,
but also do something dynamically with that same code when running
paravirt that I'm referring to. Xen punts on this right now by #ifdef'ing
away as happy as can be.

> There'll need to
> be a bit of internal rearrangement so that the Xen code can call in to
> do things like set up the pda/gdt and other bits of CPU state.
>
> I don't think IPI is especially interesting in itself, is it? It's a
> necessary mechanism to implement smp_call_function(), but Xen can do IPI
> without having to invoke any of the existing apic-based IPI code. The
> other main user of IPI is cross-cpu tlb shootdown, but Xen has much more
> efficient mechanisms than IPI for that (so we'll need to make the tlb
> pv_ops interface a little wider to pass down a cpuset).

No, it's not the IPI itself, it's the way it's often accessed by the rest of
the kernel (which is intertwined with genapic). I'm happy to avoid apic
altogether since it's effectively worthless for Xen other than
integrating into the existing infrastructure.
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
* Jeremy Fitzhardinge <[EMAIL PROTECTED]> wrote:
> > While that's basically what we did in Xen, it would make more sense
> > to build it into genapic which would give us one common abstraction
> > to base from. We should avoid adding pv_ops when existing
> > infrastructure exists.
>
> I was looking at cutting in at a much higher level. The interface in
> is a good match for Xen, so I was going to investigate
> making pv_ops at that level and see how it falls out.

yes, yes, yes. Finally someone with a clue about APIs ;-)

Basically, we want to think about the hypercall API more like a system
call API, not like a hardware API! There will probably still be lowlevel
details like ptes for a long time - but even those are not quite
necessary.

And the reason is really fundamental: those system-call-like APIs are
going to be the /most stable ones/ over time! 'Send stuff from A to B'
or 'notify X about event Y' is /A LOT/ more stable across hardware
variations than 'IDTs, vectors, apics or ptes'. And that is so precisely
because these are fundamental actions that physical matter can do, and
those do not get changed when new silicon comes out. In that sense Xen's
hypervisor API is saner than VMI.

the most highlevel API is what UML uses today (and it clearly overdoes
abstraction), still i was able to get basic UML performance close to
native performance, via extending a few Linux system calls to enable the
management of multiple sets of pagetables (each represented by a
separate fd) via a single hypervisor-level process, and feeding back raw
pagefault events to the hypervisor. (that was UML's SKAS concept
combined with sys_remap_file_pages_prot() and sys_vcpu())

Now the practical problem with UML is that nobody has tried to make a
UML native+guest 'shared kernel image', and hence it's unusable for
distros. But there is no conceptual problem with UML's virtualization
model.

Ingo
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
Chris Wright wrote:
> I agree with that, but I think that's esp. for things like create and launch
> new vcpu. The IPI bit I'm not as clear on, nor running this all on native
> as well.

Well, native would fall back to using the existing arch/i386 versions of
those functions, so that's reasonably straightforward. There'll need to
be a bit of internal rearrangement so that the Xen code can call in to
do things like set up the pda/gdt and other bits of CPU state.

I don't think IPI is especially interesting in itself, is it? It's a
necessary mechanism to implement smp_call_function(), but Xen can do IPI
without having to invoke any of the existing apic-based IPI code. The
other main user of IPI is cross-cpu tlb shootdown, but Xen has much more
efficient mechanisms than IPI for that (so we'll need to make the tlb
pv_ops interface a little wider to pass down a cpuset).

J
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
Chris Wright wrote:
> * Daniel Arai ([EMAIL PROTECTED]) wrote:
>> There's no good way to override __send_IPI_shortcut. I suppose we could add
>> paravirt ops for __send_IPI_shortcut and every other op that touches the
>> APIC.
>
> While that's basically what we did in Xen, it would make more sense to
> build it into genapic which would give us one common abstraction to base
> from. We should avoid adding pv_ops when existing infrastructure exists.

I was looking at cutting in at a much higher level. The interface in
is a good match for Xen, so I was going to investigate
making pv_ops at that level and see how it falls out.

J
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
* Daniel Arai ([EMAIL PROTECTED]) wrote:
> Chris, would you like to work together on this? I don't know what Xen's
> requirements are for the APIC interface. Do you think we could come up
> with something that would fit both of our needs, and maybe also be usable
> for some of the subarch-specific code?

Sure, we just have a pretty small genapic_xen, and then enough (hackery,
this should be sorted out) to use that genapic and have an effective
override for __send_IPI_shortcut.

thanks,
-chris
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
* Chris Wright <[EMAIL PROTECTED]> wrote:
> > Chris, would you like to work together on this? I don't know what
> > Xen's requirements are for the APIC interface. Do you think we
> > could come up with something that would fit both of our needs, and
> > maybe also be usable for some of the subarch-specific code?
>
> Sure, we just have a pretty small genapic_xen, and then enough
> (hackery, this should be sorted out) to use that genapic and have an
> effective override for __send_IPI_shortcut.

genapic is still too lowlevel: as Thomas mentioned what we want is a
virtual interrupt controller used by /all/ hypervisors (and mapped to
their respective hypervisor ABIs via the backend).

Ingo
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
Chris Wright wrote:
> * Daniel Arai ([EMAIL PROTECTED]) wrote:
>> There's no good way to override __send_IPI_shortcut. I suppose we could add
>> paravirt ops for __send_IPI_shortcut and every other op that touches the
>> APIC.
>
> While that's basically what we did in Xen, it would make more sense to
> build it into genapic which would give us one common abstraction to base
> from. We should avoid adding pv_ops when existing infrastructure exists.

I agree with this.

Chris, would you like to work together on this? I don't know what Xen's
requirements are for the APIC interface. Do you think we could come up
with something that would fit both of our needs, and maybe also be usable
for some of the subarch-specific code?

Dan.
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
* Zachary Amsden ([EMAIL PROTECTED]) wrote:
> s/do/will (smpboot.c)

Well the current Xen mechanism rather dodges all of that (for bits like
IPI apicid).

thanks,
-chris
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
* Daniel Arai ([EMAIL PROTECTED]) wrote:
> There's no good way to override __send_IPI_shortcut. I suppose we could add
> paravirt ops for __send_IPI_shortcut and every other op that touches the
> APIC.

While that's basically what we did in Xen, it would make more sense to
build it into genapic which would give us one common abstraction to base
from. We should avoid adding pv_ops when existing infrastructure exists.

thanks,
-chris
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On Thu, 2007-03-08 at 09:01 +0100, Ingo Molnar wrote:
> * Jeremy Fitzhardinge <[EMAIL PROTECTED]> wrote:
>
> > > Your implementation is almost the perfect prototype, if you move the
> > > 128 bit hackery into the hypervisor and hide it away from the kernel
> > > :)
> >
> > The point is to use the tsc to avoid making any hypercalls, so dealing
> > with the tsc->ns conversion has to happen on the guest side somehow.
>
> you are obsessed with avoiding a hypercall, but why? Granted it's slow
> especially on things like SVM/VMX, but it's not fundamentally slow. We
> definitely do not want to design our whole APIs and abstractions around
> the temporary notion that 'hypercalls are slow'. I'd expect hypercalls
> to be put into silicon just as much as SYSENTER was put into silicon.

Indeed, I expect them to fall somewhere between system calls and context
switches. Perhaps not slow, but definitely worth minimising.

> Anyway, in terms of guest time code, a /big/ amount of design junk can
> be avoided by not trying to do sillynesses like 'virtual time'. The TSC
> is awfully unreliable.

You mean stolen time?

I find this whole discussion really irritating, to be honest. I just
want Thomas to implement the timer code for lguest, because that code
scares me...

I look forward to your patch 8)
Rusty.
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
Ingo Molnar wrote:
> you are obsessed with avoiding a hypercall, but why? Granted it's slow
> especially on things like SVM/VMX, but it's not fundamentally slow. We
> definitely do not want to design our whole APIs and abstractions around
> the temporary notion that 'hypercalls are slow'.

Sure. But the specific case we're talking about here is a 300 line clock
driver. Nothing about its implementation has any effect on the kernel's
APIs or abstractions.

> I'd expect hypercalls
> to be put into silicon just as much as SYSENTER was put into silicon.

Sysenter is marginally faster than int $80, but not massively so. I
guess Xen could use sysenter now for hypercalls, since it's only useful
for getting into ring 0.

> Anyway, in terms of guest time code, a /big/ amount of design junk can
> be avoided by not trying to do sillynesses like 'virtual time'.

Well, if you have a hypervisor scheduler multiplexing vcpus onto a real
cpu at 100hz and a kernel scheduler multiplexing processes onto a vcpu
at 100hz, then you're going to get a lot of disappointed processes who
nominally got their 10ms real-time slice, but it was all spent on some
other vcpu. It's important that the kernel's scheduler know how much vcpu
time each process really got, rather than basing its scheduling on the
amount of real time that passed.

> The TSC
> is awfully unreliable.

Sure.

> /THIS/ is the kind of junk we are trying to protect Linux against.

What? That Xen happens to use the tsc as part of its hypervisor
interface? A fact that's completely isolated from the rest of the kernel
behind the clock subsystem?

J
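As a rough illustration of the accounting point made above, the sketch
below charges a time slice only for the part of real time during which
the vcpu actually ran. It assumes the hypervisor exposes some running
total of time taken away from the vcpu; the structure vcpu_clock_sketch
and its fields are invented for this example and do not describe any
real hypervisor interface.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-vcpu snapshot; field names are assumptions for illustration. */
struct vcpu_clock_sketch {
    uint64_t wall_ns;   /* real time elapsed since boot */
    uint64_t stolen_ns; /* total time this vcpu was not running on a real cpu */
};

/*
 * Time the vcpu really ran between two snapshots: real time that passed,
 * minus the part that was spent on some other vcpu.
 */
static uint64_t vcpu_ran_ns(const struct vcpu_clock_sketch *start,
                            const struct vcpu_clock_sketch *end)
{
    assert(end->wall_ns >= start->wall_ns);
    assert(end->stolen_ns >= start->stolen_ns);

    uint64_t wall = end->wall_ns - start->wall_ns;
    uint64_t stolen = end->stolen_ns - start->stolen_ns;

    /* Guard against inconsistent snapshots rather than underflowing. */
    return stolen > wall ? 0 : wall - stolen;
}
```

A process whose nominal 10ms slice was spent entirely on another vcpu
would be charged 0ns here instead of 10ms.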
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On 8/3/07 08:01, "Ingo Molnar" <[EMAIL PROTECTED]> wrote:
> you are obsessed with avoiding a hypercall, but why? Granted it's slow
> especially on things like SVM/VMX, but it's not fundamentally slow. We
> definitely do not want to design our whole APIs and abstractions around
> the temporary notion that 'hypercalls are slow'. I'd expect hypercalls
> to be put into silicon just as much as SYSENTER was put into silicon.

If syscalls are already so fast, why does Linux have vgettimeofday()?

 -- Keir
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
* Jeremy Fitzhardinge <[EMAIL PROTECTED]> wrote:
> > Your implementation is almost the perfect prototype, if you move the
> > 128 bit hackery into the hypervisor and hide it away from the kernel
> > :)
>
> The point is to use the tsc to avoid making any hypercalls, so dealing
> with the tsc->ns conversion has to happen on the guest side somehow.

you are obsessed with avoiding a hypercall, but why? Granted it's slow
especially on things like SVM/VMX, but it's not fundamentally slow. We
definitely do not want to design our whole APIs and abstractions around
the temporary notion that 'hypercalls are slow'. I'd expect hypercalls
to be put into silicon just as much as SYSENTER was put into silicon.

Anyway, in terms of guest time code, a /big/ amount of design junk can
be avoided by not trying to do sillynesses like 'virtual time'. The TSC
is awfully unreliable.

really, it's a bit as if Linus looked at his 386DX CPU when he bought it
16 years ago and decided that: "this CPU executes 16-bit code much
faster than 32-bit code, so let's base this new toy OS on 16-bit code.
Sure, it's a bit of a pain to use, compared to 32-bit code, but users
demand performance!".

/THIS/ is the kind of junk we are trying to protect Linux against.
Basically hypervisors are a way to prolong hardware legacies, and
because unlike real hardware software ABIs don't actually burn out with
time, and people are stubborn about using them, their effects are a lot
worse and a lot longer than that of legacy hardware.

Ingo
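For readers wondering what a guest-side tsc->ns conversion with a wide
intermediate might involve, here is a generic fixed-point sketch. It is
not taken from the Xen clock code under discussion: the structure
tsc_scale_sketch, its fields, and the idea that the hypervisor publishes
them are assumptions made for illustration, and unsigned __int128 is a
GCC/Clang extension.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scaling parameters; names and layout are assumptions. */
struct tsc_scale_sketch {
    uint64_t base_tsc; /* TSC value at the reference point */
    uint64_t base_ns;  /* nanoseconds at the reference point */
    uint32_t mul;      /* fixed-point factor: (ns per TSC tick) * 2^shift */
    uint32_t shift;    /* number of fractional bits in mul */
};

/* Convert a raw TSC reading to nanoseconds without leaving the guest. */
static uint64_t tsc_to_ns(const struct tsc_scale_sketch *s, uint64_t tsc)
{
    assert(s->shift < 64);

    uint64_t delta = tsc - s->base_tsc;

    /*
     * A 64-bit delta times a 32-bit factor can need up to 96 bits, so the
     * product is formed in a 128-bit intermediate before shifting down.
     */
    unsigned __int128 product = (unsigned __int128)delta * s->mul;

    return s->base_ns + (uint64_t)(product >> s->shift);
}
```

For example, a 2GHz TSC means 0.5ns per tick, which with shift = 32
gives mul = 2^31; the reference point would be refreshed by whoever owns
the parameters.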
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
Thomas Gleixner wrote:
> On Wed, 2007-03-07 at 17:01 -0800, Daniel Arai wrote:
>> But more importantly, we want a kernel that can run both on native
>> hardware and in a paravirtualized environment. Linux doesn't really
>> provide abstractions for replacing the appropriate code. We tried to
>> hook into the source code at a level that seemed possible.
>
> Again. You just refuse to change your implementation and you want to keep
> it by arguing how hard it is because there are no abstractions.

It is no longer possible to change our _hypervisor_ implementation. The
Linux side of our code is entirely flexible, and we are trying to change
it, but it hasn't always been clear what you want us to do.

> Your prayer wheel argument of missing abstractions and easiness of
> emulating things is annoying. If you think it is better to emulate APIC,
> please emulate it without paravirt ops. If you want the speed
> improvement, work with us to create the interfaces and abstractions
> which are necessary to have a sane, maintainable and useful for all
> hypervisors implementation.

That's what we are doing. Our prayer wheel would be more easily appeased
if you actually told us which parts of the VMI timer you objected to. As
I understand it now:

1) We should not call into external functions in other time sources; any
   common code should be merged up
2) We should not be using global_clock_event; it is a horrible hack
   which you want to remove
3) We should not use the smp_apic_timer_interrupt assembly code which
   calls up to the lapic timer handlers
4) We should not add our own assembly code to call out to a local timer
   handler (from Ingo)

These last two points create a conflict which is a little tricky to
solve. We can't add our own custom timer handler, and we can't re-use
the APIC timer handler. But there is no timer handler available on i386
that works, since the handlers will fall back to either PIC or IO-APIC
edge handling.

Using either of those for the local timer interrupt on SMP does not work
because they assume traditional IRQ semantics - an IRQ raised from the
bus should be serviced by one processor. Re-raises of the same IRQ on
remote processors are locked out by the handler, and dropped. Thus
simultaneous local timers firing on multiple CPUs cause only one to be
serviced. This does not work for local timer interrupts in NO_HZ mode,
because they must always be serviced so that they can reschedule the
next local timer.

I have a proposed solution to this issue, but it fails to work when the
IO-APIC assumes control of all IRQs based on ACPI results (which we
control, but can't change because of compatibility issues with other
operating systems).

My proposal is to keep IRQ-0 as the timer interrupt, on all CPUs, but
fire it from the LAPIC after local apic timers get initialized. We would
do this by converting the irq handler using set_irq_handler(0,
handle_percpu_irq). The only problem is the IO-APIC code will want to
take over IRQ0 and convert it to an edge triggered IO-APIC interrupt.
But for the local irq handlers to work, we have to keep them using the
handle_percpu_irq handler, and can't let the IO-APIC steal these
vectors. There is no way to do this conditionally for just a specific
set of IRQs in tree today, so we would need to add a special case to
io_apic.c to allow early boot code to reserve specific vectors so they
are not subsumed by the IO-APIC. This seems reasonable, but is a special
case.

If, on the other hand, we are allowed to use our own assembly code to
call out to our local timer handler (dropping constraint #4 above), we
can simply rewire LOCAL_TIMER_VECTOR to point to this code, but now we
must emulate the semantics of irq_enter / leave / etc inside our code,
which is also not the cleanest solution. We used to do this, and it
caught flak I believe from Ingo.

The basic problem is that a local IRQ doesn't behave like a global IRQ,
and the i386 backend is unaware of how to set up any local IRQs except
in the case of local APIC, but you have told us we should not re-use the
APIC handlers by overloading global_clock_event. The patches we sent out
recently did just this, but seemed to meet even more violence than our
previous way of doing things.

So the question is, which approach do you prefer?

Zach
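The difference between global and per-cpu interrupt semantics described
in the message above can be modelled with a small userspace simulation.
It follows the behaviour as stated in the message (a re-raise while the
handler is busy is dropped) and is not a model of the actual genirq
code; irq_sim, NCPUS_SKETCH and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NCPUS_SKETCH 4

/* Hypothetical simulation state; not a kernel structure. */
struct irq_sim {
    bool in_progress;           /* shared by all cpus, as for a global IRQ */
    int serviced[NCPUS_SKETCH]; /* timer interrupts each cpu really handled */
};

/* Global-IRQ semantics: while one cpu is in the handler, other raises are dropped. */
static bool raise_global(struct irq_sim *s, int cpu)
{
    assert(cpu >= 0 && cpu < NCPUS_SKETCH);
    if (s->in_progress)
        return false;
    s->in_progress = true;
    s->serviced[cpu]++;
    return true;
}

/* Per-cpu semantics: no shared lockout, every cpu services its own interrupt. */
static bool raise_percpu(struct irq_sim *s, int cpu)
{
    assert(cpu >= 0 && cpu < NCPUS_SKETCH);
    s->serviced[cpu]++;
    return true;
}

/*
 * Fire the local timer on every cpu "simultaneously", meaning before any
 * handler has completed. Returns how many cpus got their timer serviced.
 */
static int fire_on_all_cpus(struct irq_sim *s, bool percpu)
{
    int handled = 0;

    for (int cpu = 0; cpu < NCPUS_SKETCH; cpu++)
        handled += percpu ? raise_percpu(s, cpu) : raise_global(s, cpu);

    s->in_progress = false; /* the first handler finally completes */
    return handled;
}
```

With global semantics only the first cpu is serviced and the remaining
cpus never get to reprogram their next local timer, which is the NO_HZ
problem the message describes; with per-cpu semantics all of them are
serviced.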
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On Wed, 2007-03-07 at 17:01 -0800, Daniel Arai wrote:
> Thomas Gleixner wrote:
>
> > You managed to avoid the usage of other code (i.e. PIT / HPET) already,
> > so why is it sooo desirable to emulate apics instead of substituting it
> > by a small and sane replacement ? Just because you happen to have an
> > LAPIC emulator ? That's no reason to wire yourself into the kernel code
> > and make it harder to change and maintain.
>
> There are several reasons why it's desirable to emulate the APIC. As you
> mentioned, we already have APIC emulation, and APIC emulation isn't a huge
> bottleneck on most workloads. Our code works, the Linux code works, and
> replacing both pieces of code with something "small and sane" isn't going to
> improve performance very much, so why bother? Any hypervisor implementation is
> going to be a tradeoff between what's easy to implement in the hypervisor,
> what's easy to implement in the guest operating system, and what's performance
> critical.

It is not about performance. It is about maintainability.

> Secondly, not all (para-)virtualized operating systems will want to use
> abstracted devices. Some virtual operating systems will be given direct access
> to hardware devices, and will need to run the actual driver for that device and
> not some abstracted device driver. So I don't buy your argument that every
> piece of the kernel that interacts with a paravirtualized driver should have a
> "small and sane replacement."

Err. We talk about paravirtualized Linux and not about what you have to
emulate to get Windows running. I don't care at all.

Do you really expect that we have to accept your design decisions, just
because they allow you to make your life easy ?

This is exactly what you are using paravirt ops for: a backdoor to throw
your hackery at the kernel and leave us with the mess of hardwired crap.

> But more importantly, we want a kernel that can run both on native hardware and
> in a paravirtualized environment. Linux doesn't really provide abstractions for
> replacing the appropriate code. We tried to hook into the source code at a
> level that seemed possible.

Again. You just refuse to change your implementation and you want to
keep it by arguing how hard it is because there are no abstractions.

I went through the business of creating abstractions into hardwired
hairballs twice. I know exactly what I'm talking about. It _IS_ hard
work, but at the end it makes the code better and more maintainable. You
do nothing for that, but expect that we live with your addons to the
hairball.

> There's no good way to override __send_IPI_shortcut. I suppose we could add
> paravirt ops for __send_IPI_shortcut and every other op that touches the APIC.
> But there are dozens of functions in apic.c that would need to be included in
> paravirt ops. And for our implementation, we really just want to override
> apic_read and apic_write, since we can make these faster when done through
> hypercalls than through memory accesses. If we were to make these paravirt ops,
> their implementations would be the same, except with a different apic_read and
> apic_write. This is a whole lot of useless code duplication.

No it is not.

#include is an abstraction and __send_IPI ... is the i386 low level
implementation.

You insist to hook yourself into the low level code instead of hooking
into the high level code, because it is _YOUR_ implementation and we
have to accept it as is.

This is the completely wrong way. We get the same crap and discussion
for every other architecture we are going to support with paravirt ops.
And probably for every other hypervisor implementation, which has a
different way of doing things.

> Most of the interrupt system is not written in such a way that multiple APICs
> implementations can be selected from at boot time. This is an absolute
> requirement so that the same kernel can boot on native and in a paravirtualized
> environment. While this could be implemented, it seems like a waste of time,
> since we can just emulate something similar to a real interrupt system and not
> change things very much.

Waste of your precious time. I'm working on low level code and
abstractions and from now on I have also to take care not to break
_YOUR_ implementation. You are going to waste _MY_ time and I'm going to
fight that forever.

Your prayer wheel argument of missing abstractions and easiness of
emulating things is annoying. If you think it is better to emulate APIC,
please emulate it without paravirt ops. If you want the speed
improvement, work with us to create the interfaces and abstractions
which are necessary to have a sane, maintainable and useful for all
hypervisors implementation.

	tglx
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On Wed, 2007-03-07 at 17:23 -0800, Jeremy Fitzhardinge wrote:
> Daniel Arai wrote:
> > But more importantly, we want a kernel that can run both on native hardware and in a paravirtualized environment. Linux doesn't really provide abstractions for replacing the appropriate code. We tried to hook into the source code at a level that seemed possible.
>
> Xen doesn't support any kind of apic emulation, so we'll need to hook anything which relies on an apic. The ipi code you quote below will probably be one of those.
>
> My opinion is that pv_ops shouldn't have raw apic operations, but instead have appropriate high-level interfaces to achieve the same ends. Zach's counter-argument was basically yours: that the VMI code will use a lot of the native code except for the actual apic operations.
>
> I can live with VMI emulating apics if it wants, so long as it does it in private and doesn't make a big scene about it. We'll need the high-level interfaces regardless.

I can't, because it reaches out into non-private parts of the low level implementation and does not help to disentangle things and make the overall code better. No, it forces its own view of the world on us without giving us anything back.

        tglx
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
Daniel Arai wrote:
> But more importantly, we want a kernel that can run both on native hardware and in a paravirtualized environment. Linux doesn't really provide abstractions for replacing the appropriate code. We tried to hook into the source code at a level that seemed possible.

Xen doesn't support any kind of apic emulation, so we'll need to hook anything which relies on an apic. The ipi code you quote below will probably be one of those.

My opinion is that pv_ops shouldn't have raw apic operations, but instead have appropriate high-level interfaces to achieve the same ends. Zach's counter-argument was basically yours: that the VMI code will use a lot of the native code except for the actual apic operations.

I can live with VMI emulating apics if it wants, so long as it does it in private and doesn't make a big scene about it. We'll need the high-level interfaces regardless.

    J
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
Thomas Gleixner wrote:
> You managed to avoid the usage of other code (i.e. PIT / HPET) already, so why is it sooo desirable to emulate apics instead of substituting it by a small and sane replacement? Just because you happen to have an LAPIC emulator? That's no reason to wire yourself into the kernel code and make it harder to change and maintain.

There are several reasons why it's desirable to emulate the APIC. As you mentioned, we already have APIC emulation, and APIC emulation isn't a huge bottleneck on most workloads. Our code works, the Linux code works, and replacing both pieces of code with something "small and sane" isn't going to improve performance very much, so why bother? Any hypervisor implementation is going to be a tradeoff between what's easy to implement in the hypervisor, what's easy to implement in the guest operating system, and what's performance critical.

Secondly, not all (para-)virtualized operating systems will want to use abstracted devices. Some virtual operating systems will be given direct access to hardware devices, and will need to run the actual driver for that device and not some abstracted device driver. So I don't buy your argument that every piece of the kernel that interacts with a paravirtualized driver should have a "small and sane replacement."

But more importantly, we want a kernel that can run both on native hardware and in a paravirtualized environment. Linux doesn't really provide abstractions for replacing the appropriate code. We tried to hook into the source code at a level that seemed possible.

For example, take smp_call_function(). What this essentially does is call send_IPI_allbutself().

void fastcall send_IPI_self(int vector)
{
        __send_IPI_shortcut(APIC_DEST_SELF, vector);
}

void __send_IPI_shortcut(unsigned int shortcut, int vector)
{
        /*
         * Subtle. In the case of the 'never do double writes' workaround
         * we have to lock out interrupts to be safe. As we don't care
         * of the value read we use an atomic rmw access to avoid costly
         * cli/sti. Otherwise we use an even cheaper single atomic write
         * to the APIC.
         */
        unsigned int cfg;

        /*
         * Wait for idle.
         */
        apic_wait_icr_idle();

        /*
         * No need to touch the target chip field
         */
        cfg = __prepare_ICR(shortcut, vector);

        /*
         * Send the IPI. The write to APIC_ICR fires this off.
         */
        apic_write_around(APIC_ICR, cfg);
}

There's no good way to override __send_IPI_shortcut. I suppose we could add paravirt ops for __send_IPI_shortcut and every other op that touches the APIC. But there are dozens of functions in apic.c that would need to be included in paravirt ops. And for our implementation, we really just want to override apic_read and apic_write, since we can make these faster when done through hypercalls than through memory accesses. If we were to make these paravirt ops, their implementations would be the same, except with a different apic_read and apic_write. This is a whole lot of useless code duplication.

Most of the interrupt system is not written in such a way that multiple APIC implementations can be selected from at boot time. This is an absolute requirement so that the same kernel can boot on native and in a paravirtualized environment. While this could be implemented, it seems like a waste of time, since we can just emulate something similar to a real interrupt system and not change things very much.

Dan Arai
VMware, Inc.
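The alternative argued for elsewhere in this thread, hooking at the level of "send an IPI" instead of overriding apic_read/apic_write, can be illustrated with a small user-space sketch. Everything in it is an assumption made for illustration: struct ipi_ops, select_ipi_ops(), the hv_* backend and the last_path/last_vector recorders are hypothetical names, not the kernel's genapic or paravirt_ops layout, and the function bodies only record which path was taken rather than touching any hardware or hypervisor.

```c
#include <assert.h>
#include <stddef.h>

/* Which backend handled the last IPI; recorded only to make the sketch observable. */
enum ipi_path { IPI_PATH_NONE, IPI_PATH_APIC_WRITE, IPI_PATH_HYPERCALL };

static enum ipi_path last_path = IPI_PATH_NONE;
static int last_vector = -1;

/* Hypothetical high-level hook table (illustrative, not a real kernel structure). */
struct ipi_ops {
        void (*send_ipi_self)(int vector);
        void (*send_ipi_allbutself)(int vector);
};

/* Native backend: stands in for the ICR write done by __send_IPI_shortcut(). */
static void native_send_ipi_self(int vector)
{
        last_path = IPI_PATH_APIC_WRITE;
        last_vector = vector;
}

static void native_send_ipi_allbutself(int vector)
{
        last_path = IPI_PATH_APIC_WRITE;
        last_vector = vector;
}

/* Hypervisor backend: stands in for one hypercall, with no APIC emulation involved. */
static void hv_send_ipi_self(int vector)
{
        last_path = IPI_PATH_HYPERCALL;
        last_vector = vector;
}

static void hv_send_ipi_allbutself(int vector)
{
        last_path = IPI_PATH_HYPERCALL;
        last_vector = vector;
}

static const struct ipi_ops native_ipi_ops = {
        .send_ipi_self = native_send_ipi_self,
        .send_ipi_allbutself = native_send_ipi_allbutself,
};

static const struct ipi_ops hv_ipi_ops = {
        .send_ipi_self = hv_send_ipi_self,
        .send_ipi_allbutself = hv_send_ipi_allbutself,
};

/* Chosen once at boot, so the same image can run native or paravirtualized. */
static const struct ipi_ops *ipi_ops = &native_ipi_ops;

static void select_ipi_ops(int on_hypervisor)
{
        ipi_ops = on_hypervisor ? &hv_ipi_ops : &native_ipi_ops;
}

/* Callers would use only these wrappers and never see APIC registers. */
static void send_IPI_self(int vector)
{
        assert(ipi_ops != NULL && ipi_ops->send_ipi_self != NULL);
        ipi_ops->send_ipi_self(vector);
}

static void send_IPI_allbutself(int vector)
{
        assert(ipi_ops != NULL && ipi_ops->send_ipi_allbutself != NULL);
        ipi_ops->send_ipi_allbutself(vector);
}
```

In this shape the duplication Dan Arai worries about is limited to the few backend functions in the table; whether that holds for the real apic.c is exactly what the thread is disputing, and the sketch does not settle it.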
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
Thomas Gleixner wrote:
> Sigh. The cut zero hairball is already in mainline. :(

Yes, there were a couple of unfortunate patches in that series, but they got fast-tracked in with the promise they would get fixed asap.

> Sure. If the clockevent API is changed, then the users get fixed. This is not my main concern. The "oh we reuse the PIT interrupt" reachout is what makes life hard. VMI already does this extensively and I'm frightened by it.

Well, I think they know what's expected of them now.

    J
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On Wed, 2007-03-07 at 15:33 -0800, Jeremy Fitzhardinge wrote:
> > On the other hand we yet see things like:
> >
> > /* We use normal irq0 handler on cpu0. */
> > time_init_hook();
> >
> > Which is just reaching into the kernel code directly and does not handle the clock event interrupt self contained. clockevents is not bound to IRQ0 and this kind of hackery is exactly what we need to avoid in order to get this maintainable.
>
> Yes, I'm definitely not arguing with you about this. I think the first cut vmi time code was pretty questionable, but I have confidence they'll fix it up before submission.

Sigh. The cut zero hairball is already in mainline. :(

> The point is that when you put the xen and vmi implementations next to each other you find that 1) in each case there's a pretty small abstraction distance between the clock interface and the hypercall interface, and 2) there's very little code which can be shared between the two. Which means that adding another layer of abstraction to protect the clock code from paravirtualized time devices is just going to add fat without much benefit.

Fair enough.

> > Yes, if they are used in a sane and self contained way without reaching all over the place and expecting that those functions, which are not part of the paravirt interfaces will work for ever.
>
> 100% agree. If the interfaces change, then we'll change the code using them like any other kernel code would. If the new interfaces are hard to make work then that's a problem, but one would hope that would get shaken out as part of the normal kernel development process.

Sure. If the clockevent API is changed, then the users get fixed. This is not my main concern. The "oh we reuse the PIT interrupt" reachout is what makes life hard. VMI already does this extensively and I'm frightened by it.

> The point is that this code under and around the paravirt_ops interface is just normal Linux code, and we expect to participate in the normal kernel development process, with all the usual discussions/arguments/negotiations over interface changes. If the code loses all its maintainers and becomes orphaned, unresponsive to interface changes, then it's like any other dead driver: mark it CONFIG_BROKEN and wait for someone to fix it. But for now and the foreseeable future these are going to be actively supported and maintained pieces of code.

Ack.

> > You are not increasing the entanglement with the rest of the system, when you use a self contained device on top of an existing core kernel infrastructure, which has a paravirt backend. Quite the contrary, you have one piece of virtual hardware which is connected to the kernel and interacts with the various incarnations on the other side, which can as well live inside the kernel code. Granted it is another level of indirection, but I'd be happy to have only to deal with one of those beasts.
>
> Right. But at that point the interface doesn't really have much of a technical basis. It's really a political border at which you can hand off responsibility and make it ours. I quite understand your motivation, but I think you're solving a problem that hasn't happened yet, and one that we'd all like to avoid.

Granted.

> I know the vmi time code has coloured your view here, but I surely hope it can be got into a better state before posting. I'm biased of course, but I would rather hope that all these drivers we're talking about will be as stylistically clean as the Xen time code (which has room for improvement, of course).
>
> There is, however, a median solution which keeps the number of clock drivers down but also doesn't involve extending pv_ops. We can just create paravirt_clocksource/paravirt_clockevent helper wrappers, with their own internal interfaces to act as a facade for the hypervisor-specific code. I don't think there's much point in doing this now, but maybe it will become appealing once we start dealing with things like stolen time.

We'll see.

        tglx
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
Dan Hecht wrote:
> On 03/07/2007 03:33 PM, Jeremy Fitzhardinge wrote:
>> I know the vmi time code has coloured your view here, but I surely hope it can be got into a better state before posting. I'm biased of course, but I would rather hope that all these drivers we're talking about will be as stylistically clean as the Xen time code (which has room for improvement, of course).
>
> Could you send us comments on where you feel the style needs some fixing up?

I think Thomas has covered this in quite a bit of detail already. But the fact that the code mentions "apic" or "pit" at all seems unfortunate, but I guess that's what you have to work with.

> VMI encapsulates all the implementation details away from the kernel, whereas the Xen time code puts it all out there in the kernel[...]

This is not an exercise in "my hypervisor is better than yours", it's a matter of getting clean implementations within the constraints of each hypervisor interface. The Xen code may be more verbose than the corresponding VMI code, but it's self-contained and doesn't make any demands on the rest of the kernel.

The concern is that the vmi code reaches out and does things like set global_clock_event, call time_init_hook and so on - basically complicating the already ugly lapic/pic legacy time mess, and therefore making yourself part of the tangle if anyone wants to go in there and change it.

The question is whether you can make the vmi clock implementation free-standing, in that it has no dependencies other than well defined interfaces like the clock api itself, the normal (non-legacy) interrupt api and, of course, the underlying VMI interface. But no reach-arounds into the lapic/pit code.

    J
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On Wed, 2007-03-07 at 15:25 -0800, Zachary Amsden wrote:
> > Looking at vmitimer.c and the number of hardcoded assumptions are telling me, that we are heading in exactly the opposite direction.
>
> No, VMI timer is unique because for SMP, it is based on the APIC. On i386, SMP is hardwired to depend on the APIC, and so we simply re-use the pieces of it which are there, with the same assumptions about irqs, and hardware behavior, good or bad. We just have a different way of telling the LAPIC when to deliver interrupts.

This is exactly the point. There is no benefit in reusing 3 lines of lapic interrupt handler code and therefore reaching into it. clockevents are not connected to lapic on SMP by any means. They are designed to be self contained and so please use them as designed.

> The alternative is to pretty much completely copy apic.c into vmi.c or vmitimer.c, which seems a rather bad idea, since now two copies of nearly identical code need to be maintained.

You managed to avoid the usage of other code (i.e. PIT / HPET) already, so why is it sooo desirable to emulate apics instead of substituting it by a small and sane replacement? Just because you happen to have an LAPIC emulator? That's no reason to wire yourself into the kernel code and make it harder to change and maintain.

> > Yes, if they are used in a sane and self contained way without reaching all over the place and expecting that those functions, which are not part of the paravirt interfaces will work for ever.
>
> But we definitely need pieces of the core APIC dependent code. Xen needs pieces of it too, but very select pieces for SMP boot. The ugliness you point out is there, but the reason it is there is not because the paravirt code is cluttered, it is because the i386 code is so hardwired to use the APIC model that there is pain separating from it.
>
> The correct solution here is to properly separate the APIC, SMP, and timer code so the logic of it which we want to reuse is separated from the hardware dependence. Clock events and clocksources take care of most of the timer issues, but there is still ugliness from SMP timer events depending on having part of the APIC infrastructure for wiring the interrupt gates.

Again: clockevents do not require APIC and do not depend on any APIC wiring. Your hypervisor is working that way.

> > No it's not an absolute blocker, as long as we can take care, that the number of incarnations is
> >
> > - designed to be shareable between hypervisors which have the same time model
> > - common code like the 128 bit math is in a shared library
> > - self contained and not reaching out into core kernel code for no good reason
> >
> > Same goes for clock events, interrupts and other core facilities.
>
> I think that is what everyone wants. This is an iterative process. We certainly don't want to reach out into core kernel code unless there is a good reason to do so, and with every development of clock events, sources, and interrupts, we have less of a reason to do so, and the code gets cleaner and more maintainable.

We have to avoid this reachout in the first place. It just adds more hardwires into the hairball and makes it harder to disentangle. If you want the virtualization support in the kernel, then please understand that "we hardwire now and we'll fix it up once the core kernel developers serve us the solution on a silver platter" is not going to work.

Please work with us on a proper solution upfront instead of throwing random hackery with the lame excuse "for a good reason" at us. You knew exactly, that clockevents & co are on the way to mainline and there was enough time to work with us on a proper solution. No, you decided to ignore it, even after people pointed it out to you way before the 2.6.21 merge window.

Now we have the hardwire in place and we can wait for you to fix it whenever it seems to fit into the vmware business plan.

I'm not going to accept any further reachout unless there is an urgent bugfix in the release cycle, which does not allow a proper solution. But be sure, that the backout patch will hit -mm immediately.

        tglx
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On 03/07/2007 03:33 PM, Jeremy Fitzhardinge wrote:
> I know the vmi time code has coloured your view here, but I surely hope it can be got into a better state before posting. I'm biased of course, but I would rather hope that all these drivers we're talking about will be as stylistically clean as the Xen time code (which has room for improvement, of course).

Could you send us comments on where you feel the style needs some fixing up?

VMI encapsulates all the implementation details away from the kernel, whereas the Xen time code puts it all out there in the kernel (see snippet below). What happens when Xen wants to change the way it implements "system time"? It loses compatibility with all existing kernels.

In VMI terms, the code to read "system time" from the hypervisor is this one-liner (it can be written in any "style" you want; the fact is, it's just an interface call to the VMI-layer):

vmi_timer_ops.get_cycle_counter(VMI_CYCLES_REAL);

In Xen terms, the same code to accomplish that is:

/*
 * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
 * yielding a 64-bit result.
 */
static inline u64 scale_delta(u64 delta, u32 mul_frac, int shift)
{
        u64 product;
#ifdef __i386__
        u32 tmp1, tmp2;
#endif

        if (shift < 0)
                delta >>= -shift;
        else
                delta <<= shift;

#ifdef __i386__
        __asm__ (
                "mul %5 ; "
                "mov %4,%%eax ; "
                "mov %%edx,%4 ; "
                "mul %5 ; "
                "xor %5,%5; "
                "add %4,%%eax ; "
                "adc %5,%%edx ; "
                : "=A" (product), "=r" (tmp1), "=r" (tmp2)
                : "a" ((u32)delta), "1" ((u32)(delta >> 32)), "2" (mul_frac) );
#elif __x86_64__
        __asm__ (
                "mul %%rdx ; shrd $32,%%rdx,%%rax"
                : "=a" (product) : "0" (delta), "d" ((u64)mul_frac) );
#else
#error implement me!
#endif

        return product;
}

static u64 get_nsec_offset(struct shadow_time_info *shadow)
{
        u64 now, delta;

        rdtscll(now);
        delta = now - shadow->tsc_timestamp;
        return scale_delta(delta, shadow->tsc_to_nsec_mul, shadow->tsc_shift);
}

static cycle_t xen_clocksource_read(void)
{
        struct shadow_time_info *shadow = &get_cpu_var(shadow_time);
        cycle_t ret;

        get_time_values_from_xen();

        ret = shadow->system_timestamp + get_nsec_offset(shadow);

        put_cpu_var(shadow_time);

        return ret;
}
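As a reading aid for the scale_delta() snippet quoted above, the same arithmetic can be sketched in portable C without inline assembly. This is an illustration only, assuming the intended result is the low 64 bits of (delta * mul_frac) >> 32 after the shift, which is what both asm variants appear to compute; scale_delta_portable is a hypothetical name, and this is not code from the Xen or VMI patches.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Shift delta, then multiply it by the 32-bit binary fraction
 * mul_frac / 2^32, keeping the low 64 bits of the result.
 */
static uint64_t scale_delta_portable(uint64_t delta, uint32_t mul_frac, int shift)
{
        uint64_t lo, hi;

        /* Shifting a 64-bit value by 64 or more is undefined in C. */
        assert(shift > -64 && shift < 64);

        if (shift < 0)
                delta >>= -shift;
        else
                delta <<= shift;

        lo = (uint32_t)delta;   /* low 32 bits of delta */
        hi = delta >> 32;       /* high 32 bits of delta */

        /*
         * (hi * 2^32 + lo) * mul_frac / 2^32
         *   = hi * mul_frac + ((lo * mul_frac) >> 32)
         * Each partial product fits in 64 bits because both factors
         * are below 2^32.
         */
        return hi * mul_frac + ((lo * mul_frac) >> 32);
}
```

For example, with mul_frac = 0x80000000 (one half) and shift = 0, a delta of 1000 scales to 500. A helper of this kind is the sort of thing the "128 bit math in a shared library" remark elsewhere in the thread seems to have in mind, though that connection is an interpretation and not something the posters state.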
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
> Yep, the tsc has myriad problems; for Xen its the best of a bad lot. Unfortunately in 10 years no clearly better alternative has appeared; maybe in 10 years there will be one. It might even be the tsc.

TSC is essentially unusable for any kind of time related work. And I'd disagree about the alternatives - the HPET and ACPI timers are not bad, the CMOS timer can be used as an interrupting timer source, and there is the old PC timer chip. All are superior to the TSC.

Finally for performance management work you've got cycle counters in the debug side (with interrupt on overflow) which allow you to do management of resources by cpu ticks or by memory bandwidth utilisation (Sun btw have a fascinating paper somewhere on the latter).

Alan
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
Jeremy Fitzhardinge wrote:
> Zachary Amsden wrote:
>> Xen needs pieces of it too, but very select pieces for SMP boot.
>
> We do? Send the SMP Xen code over, because I don't have it here.

s/do/will (smpboot.c)
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
Zachary Amsden wrote:
> Xen needs pieces of it too, but very select pieces for SMP boot.

We do? Send the SMP Xen code over, because I don't have it here.

Thanks,
    J
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
Thomas Gleixner wrote:
> Still there is a difference between using existing kernel interfaces and abusing them in a way which makes modifications to the core kernel code hard and unmaintainable. See below.

I completely agree. "Using the kernel interfaces" doesn't mean "this random hack happens to work", it means "use the interface as intended as a fully-fledged client". If the interface doesn't work for our use, then we can negotiate with the appropriate people on how to extend it properly.

> On the other hand we yet see things like:
>
> /* We use normal irq0 handler on cpu0. */
> time_init_hook();
>
> Which is just reaching into the kernel code directly and does not handle the clock event interrupt self contained. clockevents is not bound to IRQ0 and this kind of hackery is exactly what we need to avoid in order to get this maintainable.

Yes, I'm definitely not arguing with you about this. I think the first cut vmi time code was pretty questionable, but I have confidence they'll fix it up before submission.

The point is that when you put the xen and vmi implementations next to each other you find that 1) in each case there's a pretty small abstraction distance between the clock interface and the hypercall interface, and 2) there's very little code which can be shared between the two. Which means that adding another layer of abstraction to protect the clock code from paravirtualized time devices is just going to add fat without much benefit.

> Yes, if they are used in a sane and self contained way without reaching all over the place and expecting that those functions, which are not part of the paravirt interfaces will work for ever.

100% agree. If the interfaces change, then we'll change the code using them like any other kernel code would. If the new interfaces are hard to make work then that's a problem, but one would hope that would get shaken out as part of the normal kernel development process.

The point is that this code under and around the paravirt_ops interface is just normal Linux code, and we expect to participate in the normal kernel development process, with all the usual discussions/arguments/negotiations over interface changes. If the code loses all its maintainers and becomes orphaned, unresponsive to interface changes, then it's like any other dead driver: mark it CONFIG_BROKEN and wait for someone to fix it. But for now and the foreseeable future these are going to be actively supported and maintained pieces of code.

> You are not increasing the entanglement with the rest of the system, when you use a self contained device on top of an existing core kernel infrastructure, which has a paravirt backend. Quite the contrary, you have one piece of virtual hardware which is connected to the kernel and interacts with the various incarnations on the other side, which can as well live inside the kernel code. Granted it is another level of indirection, but I'd be happy to have only to deal with one of those beasts.

Right. But at that point the interface doesn't really have much of a technical basis. It's really a political border at which you can hand off responsibility and make it ours. I quite understand your motivation, but I think you're solving a problem that hasn't happened yet, and one that we'd all like to avoid.

I know the vmi time code has coloured your view here, but I surely hope it can be got into a better state before posting. I'm biased of course, but I would rather hope that all these drivers we're talking about will be as stylistically clean as the Xen time code (which has room for improvement, of course).

There is, however, a median solution which keeps the number of clock drivers down but also doesn't involve extending pv_ops. We can just create paravirt_clocksource/paravirt_clockevent helper wrappers, with their own internal interfaces to act as a facade for the hypervisor-specific code. I don't think there's much point in doing this now, but maybe it will become appealing once we start dealing with things like stolen time.

> No it's not an absolute blocker, as long as we can take care, that the number of incarnations is
>
> - designed to be shareable between hypervisors which have the same time model
> - common code like the 128 bit math is in a shared library
> - self contained and not reaching out into core kernel code for no good reason

Yep.

    J
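The paravirt_clocksource facade mentioned in the message above is only named there, not specified. One possible shape, sketched in user-space C under loud assumptions: struct pv_clock_backend, pv_clocksource_register(), pv_clocksource_read() and the two fake backends are all hypothetical names invented for this illustration, and the real clocksource API is not used at all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical internal interface a hypervisor backend would fill in. */
struct pv_clock_backend {
        const char *name;
        uint64_t (*read_ns)(void);      /* current system time in ns */
};

/* The one backend currently behind the facade. */
static const struct pv_clock_backend *pv_clock_backend;

/* Returns 0 on success, -1 if the backend is unusable. */
static int pv_clocksource_register(const struct pv_clock_backend *backend)
{
        if (backend == NULL || backend->read_ns == NULL)
                return -1;
        pv_clock_backend = backend;
        return 0;
}

/* The only entry point the generic clock code would see. */
static uint64_t pv_clocksource_read(void)
{
        assert(pv_clock_backend != NULL);
        return pv_clock_backend->read_ns();
}

/* Two fake backends returning fixed values, standing in for hypercalls. */
static uint64_t fake_xen_read_ns(void)
{
        return 1000;
}

static uint64_t fake_vmi_read_ns(void)
{
        return 2000;
}

static const struct pv_clock_backend fake_xen_clock = {
        .name = "fake-xen",
        .read_ns = fake_xen_read_ns,
};

static const struct pv_clock_backend fake_vmi_clock = {
        .name = "fake-vmi",
        .read_ns = fake_vmi_read_ns,
};
```

The design point is that the core time code would see one clock driver while each hypervisor supplies only a small backend; whether that is worth the extra indirection is left open in the thread.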
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
Thomas Gleixner wrote:
> On the other hand we yet see things like:
>
> /* We use normal irq0 handler on cpu0. */
> time_init_hook();
>
> Which is just reaching into the kernel code directly and does not handle the clock event interrupt self contained. clockevents is not bound to IRQ0 and this kind of hackery is exactly what we need to avoid in order to get this maintainable. Once this is used by paravirt implementations a change to the mach-default implementation will break stuff left and right.

We've fixed that already. Thanks for pointing it out. We were just trying to re-use code.

> Also the whole LAPIC business is so horrible, that it hurts. The generic interrupt layer is there since almost a year and we still see the crude emulation of hardware and assumptions of irq0 setup all over the place.
>
> We carefully need to define, which existing kernel interfaces are used / hooked in which way. If the paravirt implementations actually use the already available abstractions in the way in which those abstractions are designed, then we get into a maintainable design. If there are shortcomings on those abstractions we need to fix them in a sane way or provide a _common_ workaround (e.g. 128 bit math back and forth library) without impacting the main kernel code.
>
> Looking at vmitimer.c and the number of hardcoded assumptions are telling me, that we are heading in exactly the opposite direction.

No, VMI timer is unique because for SMP, it is based on the APIC. On i386, SMP is hardwired to depend on the APIC, and so we simply re-use the pieces of it which are there, with the same assumptions about irqs, and hardware behavior, good or bad. We just have a different way of telling the LAPIC when to deliver interrupts.

The alternative is to pretty much completely copy apic.c into vmi.c or vmitimer.c, which seems a rather bad idea, since now two copies of nearly identical code need to be maintained.

> Yes, if they are used in a sane and self contained way without reaching all over the place and expecting that those functions, which are not part of the paravirt interfaces will work for ever.

But we definitely need pieces of the core APIC dependent code. Xen needs pieces of it too, but very select pieces for SMP boot. The ugliness you point out is there, but the reason it is there is not because the paravirt code is cluttered, it is because the i386 code is so hardwired to use the APIC model that there is pain separating from it.

The correct solution here is to properly separate the APIC, SMP, and timer code so the logic of it which we want to reuse is separated from the hardware dependence. Clock events and clocksources take care of most of the timer issues, but there is still ugliness from SMP timer events depending on having part of the APIC infrastructure for wiring the interrupt gates.

> No it's not an absolute blocker, as long as we can take care, that the number of incarnations is
>
> - designed to be shareable between hypervisors which have the same time model
> - common code like the 128 bit math is in a shared library
> - self contained and not reaching out into core kernel code for no good reason
>
> Same goes for clock events, interrupts and other core facilities.

I think that is what everyone wants. This is an iterative process. We certainly don't want to reach out into core kernel code unless there is a good reason to do so, and with every development of clock events, sources, and interrupts, we have less of a reason to do so, and the code gets cleaner and more maintainable.

Zach
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On Wed, 2007-03-07 at 14:05 -0800, Jeremy Fitzhardinge wrote: > Thomas Gleixner wrote: > > This is tinkering of the best. My understanding of the paravirt > > discussion at Kernel Summit was, that paravirt ops are exactly there to > > prevent the above random hackery in the kernel and to allow _ALL_ > > hypervisors to interact via a sane interface inside of the kernel. > > > > No, I don't think that was ever the intent. The idea was to create a > new interface for things which don't currently have an interface in the > kernel, such as how to run the CPU in ring 1 and manage pagetable > updates. But an important and explicit intent of the project was to use > existing kernel interfaces where possible, rather than try to make > pv_ops an monster all-encompassing interface. Maybe I missunderstood. Still there is a difference between using existing kernel interfaces and abusing them in a way which makes modifications to the core kernel code hard and unmaintainable. See below. > Using the new time infrastructure was an explicit example of that. We > anticipated that different hypervisors would have different ways of > doing time, but all would be easily accommodated by the > clocksource/events infrastructure, and so each would have its own > implementation for these interfaces. From the kernel's perspective, > they're just another time device, and we manage to avoid making any core > kernel changes, or bloating the pv_ops interface. It seems like a > natural use of the clock subsystem's design. On the other hand we yet see things like: /* We use normal irq0 handler on cpu0. */ time_init_hook(); Which is just reaching into the kernel code directly and does not handle the clock event interrupt self contained. clockevents is not bound to IRQ0 and this kind of hackery is exactly what we need to avoid in order to get this maintainable. Once this is used by paravirt implementations a change to the mach-default implementation will break stuff left and right. 
Also the whole LAPIC business is so horrible that it hurts. The generic interrupt layer has been there for almost a year and we still see the crude emulation of hardware and assumptions of irq0 setup all over the place.

We need to define carefully which existing kernel interfaces are used / hooked in which way. If the paravirt implementations actually use the already available abstractions in the way in which those abstractions are designed, then we get a maintainable design. If there are shortcomings in those abstractions, we need to fix them in a sane way or provide a _common_ workaround (e.g. a 128 bit math back and forth library) without impacting the main kernel code.

Looking at vmitimer.c, the number of hardcoded assumptions tells me that we are heading in exactly the opposite direction.

> > You are just perverting the whole idea of a standardized paravirtualization interface.
> >
> > These things can be done for clocksources, clockevents, interrupts (the generic irq code allows this) and probably for a whole bunch of other stuff.
>
> Yes, exactly. The entirety of the Xen support consists of not only an implementation of the paravirt_ops interface, but also the Xen clocksource and clockevents and the Xen irqchip. My hope and intent is that we can shrink the paravirt_ops interface in favour of using existing generally useful kernel interfaces.

Yes, if they are used in a sane and self-contained way, without reaching all over the place and expecting that those functions which are not part of the paravirt interfaces will work forever.

> > The current paravirt interface is completely insane and will explode into an unmaintainable nightmare within no time, if we keep accepting that crap further.
>
> No, that's exactly what we've been trying to avoid.
>
> If we start patching in new paravirt_ops to deal with time, interrupts, or whatever piece of functionality which already has a perfectly good kernel interface, then we're just increasing the size of the pv_ops interface, its entanglement with the rest of the system and the amount of potential legacy stuff which gets dragged around as the interface evolves.

You are not increasing the entanglement with the rest of the system when you use a self-contained device on top of an existing core kernel infrastructure which has a paravirt backend. Quite the contrary: you have one piece of virtual hardware which is connected to the kernel and interacts with the various incarnations on the other side, which can just as well live inside the kernel code. Granted, it is another level of indirection, but I'd be happy to have to deal with only one of those beasts.

> As hardware gets better at supporting virtualization directly, we're going to see more hybrid para- and fully-virtualized hypervisor interfaces. The result will be that more and more of paravirt_ops will be implemented by the "native" versions of the functions
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On 03/07/2007 02:31 PM, Thomas Gleixner wrote:
> Please make these things self contained and not relying on whatever time_init_hook() contains.

Fixing up the code to do this now.

thanks,
Dan
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On Wed, 2007-03-07 at 14:17 -0800, Zachary Amsden wrote:
> Thomas Gleixner wrote:
> > Simply because you _ABUSE_ time_init_hook() to set it up. Keep it self contained and do not impose restrictions on the kernel core code, which we have to maintain.
>
> But time_init_hook is supposed to be abused. That is its purpose - to be a hook for different time devices on SGI Visual Workstation and Voyager. And we don't actually abuse it anymore, we just bypass it because the default timer init path wants to set up the PIT or the HPET, neither of which should be used in paravirt.

It is there for those hardware platforms, but using it inside your clock event device is _JUST_ wrong. Please make these things self-contained and do not rely on whatever time_init_hook() contains.

tglx
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
Thomas Gleixner wrote:
> Simply because you _ABUSE_ time_init_hook() to set it up. Keep it self contained and do not impose restrictions on the kernel core code, which we have to maintain.

But time_init_hook is supposed to be abused. That is its purpose - to be a hook for different time devices on SGI Visual Workstation and Voyager. And we don't actually abuse it anymore, we just bypass it because the default timer init path wants to set up the PIT or the HPET, neither of which should be used in paravirt.

Zach
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On Wed, 2007-03-07 at 13:34 -0800, Dan Hecht wrote:
> On 03/07/2007 01:40 PM, Thomas Gleixner wrote:
> > On Wed, 2007-03-07 at 13:07 -0800, Jeremy Fitzhardinge wrote:
> >> That would certainly be ideal. We'll look at the xen, vmi, lguest and kvm paravirtualized time models and see how much they really have in common. I'm a bit curious about how vmi's time events make their way back into the system.
> >
> > By the crude mechanism I'm fighting.
>
> Hmm? They make their way back via interrupts. How is that crude?

Simply because you _ABUSE_ time_init_hook() to set it up. Keep it self-contained and do not impose restrictions on the kernel core code, which we have to maintain.

tglx
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
Thomas Gleixner wrote:
> This is tinkering of the best. My understanding of the paravirt discussion at Kernel Summit was, that paravirt ops are exactly there to prevent the above random hackery in the kernel and to allow _ALL_ hypervisors to interact via a sane interface inside of the kernel.

No, I don't think that was ever the intent. The idea was to create a new interface for things which don't currently have an interface in the kernel, such as how to run the CPU in ring 1 and manage pagetable updates. But an important and explicit intent of the project was to use existing kernel interfaces where possible, rather than try to make pv_ops a monster all-encompassing interface.

Using the new time infrastructure was an explicit example of that. We anticipated that different hypervisors would have different ways of doing time, but all would be easily accommodated by the clocksource/events infrastructure, and so each would have its own implementation for these interfaces. From the kernel's perspective, they're just another time device, and we manage to avoid making any core kernel changes, or bloating the pv_ops interface. It seems like a natural use of the clock subsystem's design.

> You are just perverting the whole idea of a standardized paravirtualization interface.
>
> These things can be done for clocksources, clockevents, interrupts (the generic irq code allows this) and probably for a whole bunch of other stuff.

Yes, exactly. The entirety of the Xen support consists of not only an implementation of the paravirt_ops interface, but also the Xen clocksource and clockevents and the Xen irqchip. My hope and intent is that we can shrink the paravirt_ops interface in favour of using existing generally useful kernel interfaces.

> The current paravirt interface is completely insane and will explode into an unmaintainable nightmare within no time, if we keep accepting that crap further.

No, that's exactly what we've been trying to avoid.
If we start patching in new paravirt_ops to deal with time, interrupts, or whatever piece of functionality which already has a perfectly good kernel interface, then we're just increasing the size of the pv_ops interface, its entanglement with the rest of the system and the amount of potential legacy stuff which gets dragged around as the interface evolves.

As hardware gets better at supporting virtualization directly, we're going to see more hybrid para- and fully-virtualized hypervisor interfaces. The result will be that more and more of paravirt_ops will be implemented by the "native" versions of the functions; maybe at some point the whole thing will evaporate away.

It's not a huge reach to expect the hardware vendors to get a clue about time hardware (scratch that, of course it is, but we can always hope) and come up with something that is directly usable from either an OS running natively or from within a virtual machine. In that case, I'm sure you'd agree it would warrant a real clocksource/event implementation. In the scheme I'm proposing, that's no big deal; you just register the hardware driver, and that's that. But what you're proposing leaves this vestigial interface sitting in pv_ops, doing nothing other than being redundant.

My principal goal here is to get the Xen code into the kernel, and I'm being pragmatic about it. If you think having a xen_clocksource is an absolute blocker to merging this stuff, then I'll add the interface to pv_ops, and we'll work out how to wire all the hypervisors up underneath that interface. But I think it's precisely the wrong way to go from an overall kernel perspective.

    J
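The "just another time device" model argued for in this message can be illustrated with a small userspace sketch. It is not kernel code: `toy_clocksource`, `toy_register`, `toy_select_best` and the rating numbers are hypothetical, chosen only for illustration. The one assumption carried over from the discussion is that every time source registers itself with a rating and the core picks the highest-rated one, so a hypervisor clock needs no special hook.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a clocksource descriptor. */
struct toy_clocksource {
	const char *name;
	int rating;                    /* higher means preferred */
	uint64_t (*read)(void);        /* returns the current cycle count */
	struct toy_clocksource *next;  /* singly linked registration list */
};

static struct toy_clocksource *toy_sources;

/* Registration is the only step a new time device has to perform. */
static void toy_register(struct toy_clocksource *cs)
{
	cs->next = toy_sources;
	toy_sources = cs;
}

/* The core picks the best registered source; it knows nothing about Xen or VMI. */
static struct toy_clocksource *toy_select_best(void)
{
	struct toy_clocksource *cs, *best = NULL;

	for (cs = toy_sources; cs; cs = cs->next)
		if (!best || cs->rating > best->rating)
			best = cs;
	return best;
}

/* Dummy counters so the two devices can be told apart. */
static uint64_t toy_read_pit(void) { return 1; }
static uint64_t toy_read_hv(void)  { return 2; }

static struct toy_clocksource toy_pit = { "pit", 110, toy_read_pit, NULL };
static struct toy_clocksource toy_hv  = { "hypervisor", 400, toy_read_hv, NULL };

/* Registers both devices and returns the name of the one the core selects. */
const char *toy_demo(void)
{
	toy_sources = NULL;            /* start from an empty registry */
	toy_register(&toy_pit);
	toy_register(&toy_hv);
	return toy_select_best()->name;
}
```

Calling `toy_demo()` selects the higher-rated "hypervisor" entry. The point of the sketch is only that the selection loop never mentions any hypervisor by name.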
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On Wed, 2007-03-07 at 13:42 -0800, Dan Hecht wrote:
> On 03/07/2007 12:40 PM, Thomas Gleixner wrote:
> > Real hardware copes well with relative deltas for the events, even when it is match register based. I thought long about the support for absolute expiry values in cycles and decided against them to avoid that math hackery, which you folks now demand.
>
> First of all, I'm not "demanding" anything. I'm just trying to have a technical discussion about the issues. If it comes out that absolute expiry can't be done cleanly, and the cost outweighs the benefit, then so be it. But, what's so wrong about having the discussion?
>
> When you do have match register (or count and compare, whatever you want to call it) based timers in real hardware, the relative expiry interface in software is a bit suboptimal. You still have no idea how much time has already gone by between the time you calculated the delta and when you set up the hardware (you have a pretty good estimate, but can't know for sure unless you disable caches and all other sources of non-determinate latencies). So, you will always be a little late in your timer firing. You may argue that no client of clockevents cares about this little bit of lateness. But, it does exist, and can be solved with a software interface that talks in terms of absolute expiries.

With sane hardware, yes. But there is no sane hardware. You need a (<=) match machinery instead of the available (==) ones, which introduce extra latencies and incorrectness. See arch/i386/kernel/hpet.c. We can end up both returning -ETIME and getting an interrupt, as we have no control over SMM code and such crap at all.

For such devices the delta-based expiry is actually faster, as it avoids the calculation of wraps and the possible 128 bit math in the reprogramming path. This correctness discussion is purely hypothetical on current real world hardware.
> Perhaps we can't get around the 128-bit math problem, or maybe we can think of a clever solution. If we can't, then maybe fixing the lateness is not worth the cost of 128-bit math. But, maybe there is a clean way around the 128-bit math and we just need to approach it from another angle.

Please put the clever solution inside of the clockevent. I can provide the absolute time in nanoseconds without making you touch the clockevent->next_event variable.

tglx
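The (==) versus (<=) point above can be made concrete with a small simulation. This is a userspace sketch, not the code in arch/i386/kernel/hpet.c: `sim_timer`, `sim_next_event`, `sim_program` and the `latency` parameter are hypothetical names used to model a comparator that fires only on exact equality. If the counter has already run past the programmed value by the time the write lands, the event is missed until the counter wraps, so the programming routine has to detect that and report -ETIME.

```c
#include <errno.h>
#include <stdint.h>

#ifndef ETIME
#define ETIME 62   /* value used on Linux; fallback for platforms lacking it */
#endif

/* Simulated timer with an equality-only comparator (hypothetical model). */
struct sim_timer {
	uint32_t counter;     /* free-running, wraps at 2^32 */
	uint32_t comparator;  /* fires only when counter == comparator */
};

/*
 * Program an event `delta` cycles from now. `latency` models cycles that
 * elapse while the comparator is being written (SMM, preemption, ...).
 */
static int sim_next_event(struct sim_timer *t, uint32_t delta, uint32_t latency)
{
	uint32_t match = t->counter + delta;

	t->comparator = match;
	t->counter += latency;

	/*
	 * The signed difference copes with counter wraparound. A positive
	 * value means the counter has already passed the match value, so an
	 * equality-only comparator will not fire until the next wrap.
	 */
	return ((int32_t)(t->counter - match) > 0) ? -ETIME : 0;
}

/* Convenience wrapper: build a timer at `start` and program one event. */
int sim_program(uint32_t start, uint32_t delta, uint32_t latency)
{
	struct sim_timer t = { start, 0 };

	return sim_next_event(&t, delta, latency);
}
```

With a latency smaller than the delta the call succeeds, including across a counter wrap; with a latency larger than the delta it returns -ETIME and the caller has to retry with a new delta.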
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On 03/07/2007 12:40 PM, Thomas Gleixner wrote:
> Real hardware copes well with relative deltas for the events, even when it is match register based. I thought long about the support for absolute expiry values in cycles and decided against them to avoid that math hackery, which you folks now demand.

First of all, I'm not "demanding" anything. I'm just trying to have a technical discussion about the issues. If it comes out that absolute expiry can't be done cleanly, and the cost outweighs the benefit, then so be it. But, what's so wrong about having the discussion?

When you do have match register (or count and compare, whatever you want to call it) based timers in real hardware, the relative expiry interface in software is a bit suboptimal. You still have no idea how much time has already gone by between the time you calculated the delta and when you set up the hardware (you have a pretty good estimate, but can't know for sure unless you disable caches and all other sources of non-determinate latencies). So, you will always be a little late in your timer firing. You may argue that no client of clockevents cares about this little bit of lateness. But, it does exist, and can be solved with a software interface that talks in terms of absolute expiries.

Perhaps we can't get around the 128-bit math problem, or maybe we can think of a clever solution. If we can't, then maybe fixing the lateness is not worth the cost of 128-bit math. But, maybe there is a clean way around the 128-bit math and we just need to approach it from another angle.

Dan
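The lateness described here is simple arithmetic, and a small simulation shows it. The sketch below is a userspace model under stated assumptions: `sim_clock`, `sim_set_relative`, `sim_set_absolute` and `sim_lateness` are hypothetical names, and `stolen` stands for whatever cycles pass between computing the delta and programming the device (cache misses, SMM, or a hypervisor descheduling the vcpu).

```c
#include <stdint.h>

/* Hypothetical simulated device; not a kernel or VMI interface. */
struct sim_clock {
	uint64_t counter;  /* free-running cycle counter */
	uint64_t match;    /* cycle value at which the event fires */
};

/* Relative interface: the expiry is derived from "now" at programming time. */
static void sim_set_relative(struct sim_clock *c, uint64_t delta)
{
	c->match = c->counter + delta;
}

/* Absolute interface: the caller names the expiry directly. */
static void sim_set_absolute(struct sim_clock *c, uint64_t expiry)
{
	c->match = expiry;
}

/*
 * Returns how many cycles after `wanted` the event fires when `stolen`
 * cycles elapse between computing the delta and programming the device.
 * Assumes wanted >= now + stolen, i.e. the target is still in the future.
 */
uint64_t sim_lateness(uint64_t now, uint64_t wanted, uint64_t stolen,
		      int use_absolute)
{
	struct sim_clock c = { now, 0 };
	uint64_t delta = wanted - c.counter;  /* computed by the caller */

	c.counter += stolen;                  /* latency before programming */
	if (use_absolute)
		sim_set_absolute(&c, wanted);
	else
		sim_set_relative(&c, delta);
	return c.match - wanted;
}
```

In this model the relative interface fires late by exactly the stolen cycles, while the absolute interface fires on target. It deliberately ignores the cost side of the argument, namely the conversion math needed to produce an absolute value in device cycles.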
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On 03/07/2007 01:40 PM, Thomas Gleixner wrote:
> On Wed, 2007-03-07 at 13:07 -0800, Jeremy Fitzhardinge wrote:
>> That would certainly be ideal. We'll look at the xen, vmi, lguest and kvm paravirtualized time models and see how much they really have in common. I'm a bit curious about how vmi's time events make their way back into the system.
>
> By the crude mechanism I'm fighting.

Hmm? They make their way back via interrupts. How is that crude?
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On 03/07/2007 01:21 PM, Thomas Gleixner wrote: On Wed, 2007-03-07 at 11:49 -0800, Dan Hecht wrote: Jeremy, I saw you sent out the Xen version earlier, thanks. Here's ours for reference (please excuse any formating issues); it's also lean. We'll send out a proper patch later after some more testing: Ah. Bitching loud enough speeds things up. :) We've always planned to do this. We just didn't want to create the dependency between paravirt_ops and clockevents too early such that they would depend on each other to merge to main line. Now that they are both there, we are all for it. /** vmi clockevent */ static struct clock_event_device vmi_global_clockevent; static inline u32 vmi_alarm_wiring(struct clock_event_device *evt) { return (evt == &vmi_global_clockevent) ? VMI_ALARM_WIRED_IRQ0 : VMI_ALARM_WIRED_LVTT; } static void vmi_timer_set_mode(enum clock_event_mode mode, struct clock_event_device *evt) { u32 wiring; cycle_t now, cycles_per_hz; BUG_ON(!irqs_disabled()); wiring = vmi_alarm_wiring(evt); if (wiring == VMI_ALARM_WIRED_LVTT) /* Route the interrupt to the correct vector */ apic_write_around(APIC_LVTT, LOCAL_TIMER_VECTOR); Wire that in the hypervisor. switch (mode) { case CLOCK_EVT_MODE_ONESHOT: break; case CLOCK_EVT_MODE_PERIODIC: cycles_per_hz = vmi_timer_ops.get_cycle_frequency(); (void)do_div(cycles_per_hz, HZ); now = vmi_timer_ops.get_cycle_counter(vmi_counter(VMI_PERIODIC)); vmi_timer_ops.set_alarm(wiring | VMI_PERIODIC, now, cycles_per_hz); paravirt_ops->paravirt_clockevent->set_periodic(vcpu, period); Huh? paravirt_ops isn't a hypervisor interface, it's just a linux code abstraction. The code on both sides of paravirt_ops is *linux* code, any way you cut it. clockevents is already a linux code abstraction. why introduce the redundancy? break; case CLOCK_EVT_MODE_UNUSED: case CLOCK_EVT_MODE_SHUTDOWN: paravirt_ops->paravirt_clockevent->stop_event(vcpu, mode); You would be introducing the same redundancy. 
switch (evt->mode) { case CLOCK_EVT_MODE_ONESHOT: vmi_timer_ops.cancel_alarm(VMI_ONESHOT); break; case CLOCK_EVT_MODE_PERIODIC: vmi_timer_ops.cancel_alarm(VMI_PERIODIC); break; default: break; } break; default: break; } } This whole vmi_timer_ops thing is horrible. All hypervisors can share paravirt_ops->paravirt_clockevent and retrieve the methods on boot. vmi_timer_ops.whatever is where the kernel <-> hypervisor boundary is crossed for VMI. static int vmi_timer_next_event(unsigned long delta, struct clock_event_device *evt) { /* Unfortunately, set_next_event interface only passes relative * expiry, but we want absolute expiry. It'd be better if were * were passed an aboslute expiry, since a bunch of time may * have been stolen between the time the delta is computed and * when we set the alarm below. */ cycle_t now = vmi_timer_ops.get_cycle_counter(vmi_counter(VMI_ONESHOT)); BUG_ON(evt->mode != CLOCK_EVT_MODE_ONESHOT); vmi_timer_ops.set_alarm(vmi_alarm_wiring(evt) | VMI_ONESHOT, now + delta, 0); return 0; } Great. Now we have: s64 event = startup_offset + ktime_to_ns(evt->next_event); if (HYPERVISOR_set_timer_op(event) < 0) BUG(); and vmi_timer_ops.set_alarm(vmi_alarm_wiring(evt) | VMI_ONESHOT,now + delta, 0); How will the next implementations look like ? lguest_program_timer(delta + lguest_current_time(), LGUEST_TIMER_SHOOT_ONCE); virt_nextgen_ops.set_timer_event(delta, NO_WE_NEED_NO_FLAGS); ... This is tinkering of the best. My understanding of the paravirt discussion at Kernel Summit was, that paravirt ops are exactly there to prevent the above random hackery in the kernel and to allow _ALL_ hypervisors to interact via a sane interface inside of the kernel. No, that was not the point of paravirt_ops. It is actually the complete opposite of the intention of paravirt_ops. paravirt_ops' intent is exactly to allow for *multiple* hypervisor ABIs to exist in the kernel. 
At kernel summit, paravirt_ops was proposed to allow for multiple hypervisor ABI's to be targeted by the kernel. The code on both sides of paravirt_ops is *linux* code. You are just perverting the whole idea of a standartized paravirtualization interface. This things can be done f
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On Wed, 2007-03-07 at 13:07 -0800, Jeremy Fitzhardinge wrote: > Thomas Gleixner wrote: > > I tend to disagree. The clockevents infrastructure was designed to cope > > with the existing mess of real hardware. The discussion over the last > > days exposed me to even more exotic designs than the hardware vendors > > were able to deliver until now. > > > > It's a different but related problem domain. It's also an increasingly > common execution environment for a kernel to find itself in. Dealing > with proper paravirtualized timer devices is a big improvement over > trying to reliably deal with fully virtualized hardware timers, which > simply can't make the same guarantees that real hardware can make - such > as "you will definitely get N ns of CPU time between doing the > delta->absolute computation and programming the match register". That's exactly the reason why we want only _ONE_ proper virtualized timer device instead of 10 new variants of broken hardware. > > I know exactly where you are heading: > > > > Offload the handling of hypervisor design decisions to the kernel and > > let us deal with that. So we need to implement 128 bit math to convert > > back and forth and I expect more interesting things to creep up. > > > > I wouldn't put it that way. We've been getting a lot of pressure to > keep the pv_ops interface as small as possible. Reusing existing kernel > interfaces rather than making up new ones is a good way to do that. The > clock infrastructure certainly cleans things up; earlier Xen patches > made a complete copy of the old kernel/time.c and hacked it around, > which isn't what anyone wants to do. All you need is exactly ONE paravirt clockevent device and ONE paravirt clocksource for _ALL_ hypervisors. Cast that into stone with a paravirt_ops->clockwahtever interface and we are all happy. > > All this is of _NO_ use and benefit for the kernel itself. > > > > Lots of people want to run Linux in virtual machines. 
If we can make > sane kernel changes to help those users, then that is of use an benefit > to the kernel. The above will give a real benefit as it is a well defined interface, which can be verified on both ends. > > Real hardware copes well with relative deltas for the events, even when > > it is match register based. I thought long about the support for > > absolute expiry values in cycles and decided against them to avoid that > > math hackery, which you folks now demand. > > > > Not really. Xen and VMI interfaces both use absolute monotonic time for > timeouts, which is certainly a common case for such interfaces > (pthread_cond_timedwait, for example). Converting delta to absolute is > clearly simple, but it does introduce an added bit of non-determinism if > your CPU can be preempted from outside at any time. I presume SMM or > similar interrupts can cause the same problem on real hardware. As I said before: I have no objection against expanding / changing the clockevents interface to deliver absolute expiry time, which we have already handy. I just refuse for a good reason to convert it from ktime_t (nanoseconds) to an absolute cycle value. This can be done on the hypervisor side of the paravirt clock event device. Same applies for clocksources. The ones which need nanosecond from/to whatever conversion can do it _IN_ the hypervisor and not in 10 different grades of madness in the kernel code. > > We can optimize this by skipping the conversion via a feature flag. > The clocksource needed the shift for ntp warping. Does the clockevent > need a shift at all? Could I just set mult/shift to 1/0? Yes. > > Your implementation is almost the perfect prototype, if you move the > > 128 bit hackery into the hypervisor and hide it away from the kernel :) > > > The point is to use the tsc to avoid making any hypercalls, so dealing > with the tsc->ns conversion has to happen on the guest side somehow. 
I understand that you want to make this as fast as possible, but TSC is broken in more than one way and it just makes me barf, when we have yet another way of dealing with it in the kernel. Please keep the paravirt interface abstract and treat it in the same way we treat the kernel - userspace API. The kernel hides all this hardware crap away from the user space and the same applies for a sane paravirt interface. This is also a benefit in terms of portability. For devices, which already live on top of an abstraction layer in the kernel, e.g. clocksources, clockevents, interrupts, we can share one implementation accross multiple platforms. > > One of these is perfectly fine for _ALL_ of the hypervisor folks. > > Anything else is just a backwards decision for the kernel. > > > That would certainly be ideal. We'll look at the xen, vmi, lguest and > kvm paravirtualized time models and see how much they really have in > common. I'm a bit curious about how vmi's time events make their way > back into the system. By the crude mechanism I'm figh
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On 03/07/2007 01:19 PM, Thomas Gleixner wrote:
> On Wed, 2007-03-07 at 13:02 -0800, Dan Hecht wrote:
>> On 03/07/2007 12:57 PM, Thomas Gleixner wrote:
>>> On Wed, 2007-03-07 at 12:11 -0800, Jeremy Fitzhardinge wrote:
>>>> Dan Hecht wrote:
>>>>> Jeremy, I saw you sent out the Xen version earlier, thanks. Here's ours for reference (please excuse any formatting issues); it's also lean. We'll send out a proper patch later after some more testing:
>>>> So the interrupt side of the clockevent comes through the virtual apic? Where does evt->handle_event get called?
>>>
>>>> /* We use normal irq0 handler on cpu0. */
>>>> time_init_hook();
>>>
>>> That's exactly the thing I ranted about before. We keep the historic view of emulated hardware and just wrap it into enough glue code instead of doing an abstract design, which just gets rid of those hardware assumptions at all. That's the big advantage of paravirtualization, but the current way on paravirt ops is just ignoring this.
>>
>> Are you saying you would prefer we create our own irq handler something like this rather than using the standard i386 handlers?
>>
>> irqreturn_t vmi_timer_interrupt(int irq, void *dev_id)
>> {
>>         local_event->event_handler(local_event);
>>         return IRQ_HANDLED;
>> }
>>
>> ??
>
> That's fine with me. I prefer _ONE_ generic abstract implementation of a clock event, which can be used by all hypervisors. Please keep all your wiring and ideas of how to best emulate an i386 system away from the kernel as far as you can.
>
> Please sit down with the other hypervisor folks and define the five functions you need to interact between clockevents and the particular hypervisor and implement it once. Then you can change and evolve your idea of how to handle them best in your hypervisor code, where it belongs.

Okay, I guess we are essentially back to the "XEN & VMI" thread. Let's just keep that discussion in one place.
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On Wed, 2007-03-07 at 11:49 -0800, Dan Hecht wrote: > Jeremy, I saw you sent out the Xen version earlier, thanks. Here's ours > for reference (please excuse any formating issues); it's also lean. > We'll send out a proper patch later after some more testing: Ah. Bitching loud enough speeds things up. :) > /** vmi clockevent */ > > static struct clock_event_device vmi_global_clockevent; > > static inline u32 vmi_alarm_wiring(struct clock_event_device *evt) > { > return (evt == &vmi_global_clockevent) ? > VMI_ALARM_WIRED_IRQ0 : VMI_ALARM_WIRED_LVTT; > } > > static void vmi_timer_set_mode(enum clock_event_mode mode, > struct clock_event_device *evt) > { > u32 wiring; > cycle_t now, cycles_per_hz; > BUG_ON(!irqs_disabled()); > > wiring = vmi_alarm_wiring(evt); > if (wiring == VMI_ALARM_WIRED_LVTT) > /* Route the interrupt to the correct vector */ > apic_write_around(APIC_LVTT, LOCAL_TIMER_VECTOR); Wire that in the hypervisor. > switch (mode) { > case CLOCK_EVT_MODE_ONESHOT: > break; > case CLOCK_EVT_MODE_PERIODIC: > cycles_per_hz = vmi_timer_ops.get_cycle_frequency(); > (void)do_div(cycles_per_hz, HZ); > now = > vmi_timer_ops.get_cycle_counter(vmi_counter(VMI_PERIODIC)); > vmi_timer_ops.set_alarm(wiring | VMI_PERIODIC, > now, cycles_per_hz); paravirt_ops->paravirt_clockevent->set_periodic(vcpu, period); > break; > case CLOCK_EVT_MODE_UNUSED: > case CLOCK_EVT_MODE_SHUTDOWN: paravirt_ops->paravirt_clockevent->stop_event(vcpu, mode); > switch (evt->mode) { > case CLOCK_EVT_MODE_ONESHOT: > vmi_timer_ops.cancel_alarm(VMI_ONESHOT); > break; > case CLOCK_EVT_MODE_PERIODIC: > vmi_timer_ops.cancel_alarm(VMI_PERIODIC); > break; > default: > break; > } > break; > default: > break; > } > } This whole vmi_timer_ops thing is horrible. All hypervisors can share paravirt_ops->paravirt_clockevent and retrieve the methods on boot. 
> static int vmi_timer_next_event(unsigned long delta, > struct clock_event_device *evt) > { > /* Unfortunately, set_next_event interface only passes relative >* expiry, but we want absolute expiry. It'd be better if were >* were passed an aboslute expiry, since a bunch of time may >* have been stolen between the time the delta is computed and >* when we set the alarm below. */ > cycle_t now = vmi_timer_ops.get_cycle_counter(vmi_counter(VMI_ONESHOT)); > > BUG_ON(evt->mode != CLOCK_EVT_MODE_ONESHOT); > vmi_timer_ops.set_alarm(vmi_alarm_wiring(evt) | VMI_ONESHOT, > now + delta, 0); > return 0; > } Great. Now we have: s64 event = startup_offset + ktime_to_ns(evt->next_event); if (HYPERVISOR_set_timer_op(event) < 0) BUG(); and vmi_timer_ops.set_alarm(vmi_alarm_wiring(evt) | VMI_ONESHOT,now + delta, 0); How will the next implementations look like ? lguest_program_timer(delta + lguest_current_time(), LGUEST_TIMER_SHOOT_ONCE); virt_nextgen_ops.set_timer_event(delta, NO_WE_NEED_NO_FLAGS); ... This is tinkering of the best. My understanding of the paravirt discussion at Kernel Summit was, that paravirt ops are exactly there to prevent the above random hackery in the kernel and to allow _ALL_ hypervisors to interact via a sane interface inside of the kernel. You are just perverting the whole idea of a standartized paravirtualization interface. This things can be done for clocksources, clockevents, interrupts (the generic irq code allows this) and probaly for a whole bunch of other stuff. The current paravirt interface is completely insane and will explode into an unmaintainable nightmare within no time, if we keep accepting that crap further. No thanks. > #ifdef CONFIG_X86_LOCAL_APIC > > /* Replacement for lapic timer local clock event. 
> * paravirt_ops.setup_boot_clock = vmi_nop > * (continue using global_clock_event on cpu0) > * paravirt_ops.setup_secondary_clock = vmi_timer_setup_local_alarm > */ > void __devinit vmi_timer_setup_local_alarm(void) > { > struct clock_event_device *evt = &__get_cpu_var(local_clock_events); > > /* Then, start it back up as a local clockevent device. */ > memcpy(evt, &vmi_clockevent, sizeof(*evt)); > evt->cpumask = cpumask_of_cpu(smp_processor_id()); > > printk(KERN_WARNING "vmi: registering clock event %s. mult=%lu > shift=%u\n", > evt->name, evt->mult, evt->shift); > clockevents_register_device(e
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On Wed, 2007-03-07 at 13:02 -0800, Dan Hecht wrote:
> On 03/07/2007 12:57 PM, Thomas Gleixner wrote:
> > On Wed, 2007-03-07 at 12:11 -0800, Jeremy Fitzhardinge wrote:
> >> Dan Hecht wrote:
> >>> Jeremy, I saw you sent out the Xen version earlier, thanks. Here's ours for reference (please excuse any formatting issues); it's also lean. We'll send out a proper patch later after some more testing:
> >> So the interrupt side of the clockevent comes through the virtual apic? Where does evt->handle_event get called?
> >
> >> /* We use normal irq0 handler on cpu0. */
> >> time_init_hook();
> >
> > That's exactly the thing I ranted about before. We keep the historic view of emulated hardware and just wrap it into enough glue code instead of doing an abstract design, which just gets rid of those hardware assumptions at all. That's the big advantage of paravirtualization, but the current way on paravirt ops is just ignoring this.
>
> Are you saying you would prefer we create our own irq handler something like this rather than using the standard i386 handlers?
>
> irqreturn_t vmi_timer_interrupt(int irq, void *dev_id)
> {
>         local_event->event_handler(local_event);
>         return IRQ_HANDLED;
> }
>
> ??

That's fine with me. I prefer _ONE_ generic abstract implementation of a clock event, which can be used by all hypervisors. Please keep all your wiring and ideas of how to best emulate an i386 system as far away from the kernel as you can.

Please sit down with the other hypervisor folks and define the five functions you need to interact between clockevents and the particular hypervisor and implement it once. Then you can change and evolve your idea of how to handle them best in your hypervisor code, where it belongs.
tglx - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On Wed, 2007-03-07 at 12:49 -0800, Dan Hecht wrote:
> On 03/07/2007 12:11 PM, Jeremy Fitzhardinge wrote:
> > Dan Hecht wrote:
> >> Jeremy, I saw you sent out the Xen version earlier, thanks. Here's
> >> ours for reference (please excuse any formating issues); it's also
> >> lean. We'll send out a proper patch later after some more testing:
> >
> > So the interrupt side of the clockevent comes through the virtual apic?
> > Where does evt->handle_event get called?
> >
>
> Yeah, we use the same interrupt handlers as normal i386: timer_interrupt
> and smp_apic_timer_interrupt. That way we don't need to duplicate the
> interrupt handler code.

Oh well. Here we are again. 2 hypervisors - 4 different views on how to
inject events into the kernel.

This is the completely wrong approach. Paravirtualization should not abuse
existing hardware drivers. It should just provide its own sane
abstract implementation.

Please stop this _NOW_

tglx
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
Thomas Gleixner wrote:
> I tend to disagree. The clockevents infrastructure was designed to cope
> with the existing mess of real hardware. The discussion over the last
> days exposed me to even more exotic designs than the hardware vendors
> were able to deliver until now.
>

It's a different but related problem domain. It's also an increasingly
common execution environment for a kernel to find itself in. Dealing
with proper paravirtualized timer devices is a big improvement over
trying to reliably deal with fully virtualized hardware timers, which
simply can't make the same guarantees that real hardware can make - such
as "you will definitely get N ns of CPU time between doing the
delta->absolute computation and programming the match register".

> I know exactly where you are heading:
>
> Offload the handling of hypervisor design decisions to the kernel and
> let us deal with that. So we need to implement 128 bit math to convert
> back and forth and I expect more interesting things to creep up.
>

I wouldn't put it that way. We've been getting a lot of pressure to
keep the pv_ops interface as small as possible. Reusing existing kernel
interfaces rather than making up new ones is a good way to do that. The
clock infrastructure certainly cleans things up; earlier Xen patches
made a complete copy of the old kernel/time.c and hacked it around,
which isn't what anyone wants to do.

> All this is of _NO_ use and benefit for the kernel itself.
>

Lots of people want to run Linux in virtual machines. If we can make
sane kernel changes to help those users, then that is of use and benefit
to the kernel.

> Real hardware copes well with relative deltas for the events, even when
> it is match register based. I thought long about the support for
> absolute expiry values in cycles and decided against them to avoid that
> math hackery, which you folks now demand.
>

Not really. Xen and VMI interfaces both use absolute monotonic time for
timeouts, which is certainly a common case for such interfaces
(pthread_cond_timedwait, for example). Converting delta to absolute is
clearly simple, but it does introduce an added bit of non-determinism if
your CPU can be preempted from outside at any time. I presume SMM or
similar interrupts can cause the same problem on real hardware. I guess
the worst case for real hardware is an absolute-time match register
which only compares for match==now rather than match<=now, since you
could completely lose the time event if you miss the deadline.

>> static const struct clock_event_device xen_clockevent = {
>> 	.name = "xen",
>> 	.features = CLOCK_EVT_FEAT_ONESHOT,
>>
>> 	.max_delta_ns = 0x7fff,
>> 	.min_delta_ns = 100,	/* ? */
>>
>> 	.mult = 1<<XEN_SHIFT,
>> 	.shift = XEN_SHIFT,
>>
>
> We can optimize this by skipping the conversion via a feature flag.
>

The clocksource needed the shift for ntp warping. Does the clockevent
need a shift at all? Could I just set mult/shift to 1/0?

> Your implementation is almost the perfect prototype, if you move the
> 128 bit hackery into the hypervisor and hide it away from the kernel :)
>

The point is to use the tsc to avoid making any hypercalls, so dealing
with the tsc->ns conversion has to happen on the guest side somehow.

> One of these is perfectly fine for _ALL_ of the hypervisor folks.
> Anything else is just a backwards decision for the kernel.
>

That would certainly be ideal. We'll look at the xen, vmi, lguest and
kvm paravirtualized time models and see how much they really have in
common. I'm a bit curious about how vmi's time events make their way
back into the system.

J
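The delta->absolute concern raised in this message can be made concrete with a small userspace sketch. It is not taken from the Xen or VMI code; every function name and number below is hypothetical, chosen only to show why an equality-only match register can lose an event when the vcpu is descheduled between computing the deadline and programming it, while a "reached or passed" comparison cannot.

```c
#include <stdbool.h>
#include <stdint.h>

/* Convert a relative timeout into an absolute deadline (hypothetical helper). */
static uint64_t delta_to_absolute(uint64_t now_ns, uint64_t delta_ns)
{
	return now_ns + delta_ns;
}

/* Comparator that fires only when the counter equals the match value. */
static bool fires_on_equal(uint64_t match_ns, uint64_t now_ns)
{
	return now_ns == match_ns;
}

/* Comparator that fires once the counter has reached or passed the match value. */
static bool fires_on_reached(uint64_t match_ns, uint64_t now_ns)
{
	return now_ns >= match_ns;
}

/*
 * Model a monotonic counter that is first observed at start_ns (for example
 * after the vcpu was preempted) and then advances one unit at a time for
 * 'ticks' steps. Report whether the given comparator ever fires.
 */
static bool event_delivered(bool (*cmp)(uint64_t, uint64_t),
			    uint64_t match_ns, uint64_t start_ns,
			    unsigned int ticks)
{
	for (uint64_t t = start_ns; t < start_ns + ticks; t++)
		if (cmp(match_ns, t))
			return true;
	return false;
}
```

With a deadline of 1100 computed at time 1000, an equality comparator still fires if the counter is observed from 1000 onward, but never fires if the guest only gets to run again at 1250; the "reached" comparator fires immediately in that case.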
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
Dan Hecht wrote:
> Are you saying you would prefer we create our own irq handler
> something like this rather than using the standard i386 handlers?
>
> irqreturn_t vmi_timer_interrupt(int irq, void *dev_id)
> {
> 	local_event->event_handler(local_event);
> 	return IRQ_HANDLED;
> }
>
> ?? That's fine with me.

It does make the code self-contained.

J
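For readers who want to see the shape of such a self-contained handler outside the kernel, here is a minimal userspace sketch. The types are stand-ins that only mimic the kernel's `irqreturn_t` and the `event_handler` callback of `struct clock_event_device`; `demo_clockevent`, `demo_event_handler`, `demo_timer_interrupt` and the `events_seen` counter are hypothetical names, and passing the device through `dev_id` is an assumption of this sketch rather than what the quoted snippet does.

```c
#include <stddef.h>

/* Userspace stand-ins for the kernel types; illustrative only. */
typedef enum { IRQ_NONE = 0, IRQ_HANDLED = 1 } irqreturn_t;

struct clock_event_device {
	const char *name;
	void (*event_handler)(struct clock_event_device *evt);
	unsigned long events_seen;	/* hypothetical counter for the demo */
};

/* Stand-in for the handler the clockevents core would install. */
static void demo_event_handler(struct clock_event_device *evt)
{
	evt->events_seen++;
}

static struct clock_event_device demo_clockevent = {
	.name = "demo",
	.event_handler = demo_event_handler,
};

/* Dedicated timer interrupt handler: dispatch straight to the clockevent. */
static irqreturn_t demo_timer_interrupt(int irq, void *dev_id)
{
	struct clock_event_device *evt = dev_id;

	(void)irq;
	if (evt == NULL || evt->event_handler == NULL)
		return IRQ_NONE;

	evt->event_handler(evt);
	return IRQ_HANDLED;
}
```

The handler has no dependency on the i386 `timer_interrupt` path; everything it needs arrives through the device pointer.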
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On 03/07/2007 12:57 PM, Thomas Gleixner wrote:
> On Wed, 2007-03-07 at 12:11 -0800, Jeremy Fitzhardinge wrote:
>> Dan Hecht wrote:
>>> Jeremy, I saw you sent out the Xen version earlier, thanks. Here's
>>> ours for reference (please excuse any formating issues); it's also
>>> lean. We'll send out a proper patch later after some more testing:
>> So the interrupt side of the clockevent comes through the virtual apic?
>> Where does evt->handle_event get called?
>
>> /* We use normal irq0 handler on cpu0. */
>> time_init_hook();
>
> That's exactly the thing I ranted about before. We keep the historic
> view of emulated hardware and just wrap it into enough glue code instead
> of doing an abstract design, which just gets rid of those hardware
> assumptions at all. That's the big advantage of paravirtualization, but
> the current way on paravirt ops is just ignoring this.

Are you saying you would prefer we create our own irq handler something
like this rather than using the standard i386 handlers?

irqreturn_t vmi_timer_interrupt(int irq, void *dev_id)
{
	local_event->event_handler(local_event);
	return IRQ_HANDLED;
}

?? That's fine with me.
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On 03/07/2007 12:11 PM, Jeremy Fitzhardinge wrote:
> Dan Hecht wrote:
>> Jeremy, I saw you sent out the Xen version earlier, thanks. Here's
>> ours for reference (please excuse any formating issues); it's also
>> lean. We'll send out a proper patch later after some more testing:
>
> So the interrupt side of the clockevent comes through the virtual apic?
> Where does evt->handle_event get called?

Yeah, we use the same interrupt handlers as normal i386: timer_interrupt
and smp_apic_timer_interrupt. That way we don't need to duplicate the
interrupt handler code.
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On Wed, 2007-03-07 at 12:11 -0800, Jeremy Fitzhardinge wrote:
> Dan Hecht wrote:
> > Jeremy, I saw you sent out the Xen version earlier, thanks. Here's
> > ours for reference (please excuse any formating issues); it's also
> > lean. We'll send out a proper patch later after some more testing:
>
> So the interrupt side of the clockevent comes through the virtual apic?
> Where does evt->handle_event get called?

> /* We use normal irq0 handler on cpu0. */
> time_init_hook();

That's exactly the thing I ranted about before. We keep the historic
view of emulated hardware and just wrap it into enough glue code instead
of doing an abstract design, which just gets rid of those hardware
assumptions at all. That's the big advantage of paravirtualization, but
the current way on paravirt ops is just ignoring this.

tglx
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On Wed, 2007-03-07 at 09:41 -0800, Jeremy Fitzhardinge wrote: > Other hypervisors may take other approaches, depending on what the real > underlying hardware is and the real requirements. One could imagine a > hypervisor exposing an hpet mapping, for example, or just having some > kind of completely synthetic time source. > > The point is that if we were to build an abstraction layer over all of > these just so that we could have a single clocksource/event > implementation, it would be pretty much equivalent to the existing clock > infrastructure, and would add no value. I tend to disagree. The clockevents infrastructure was designed to cope with the existing mess of real hardware. The discussion over the last days exposed me to even more exotic designs than the hardware vendors were able to deliver until now. > I was very pleased when I saw the clocksource/event mechanisms go into > the kernel because it means different hypervisors can have a clock* > implementation to match their own particular time model/interface > without having to clutter up the pv_ops interface, and still have a > well-defined interface to the rest of the kernel's time infrastructure. I know exactly where you are heading: Offload the handling of hypervisor design decisions to the kernel and let us deal with that. So we need to implement 128 bit math to convert back and forth and I expect more interesting things to creep up. All this is of _NO_ use and benefit for the kernel itself. Real hardware copes well with relative deltas for the events, even when it is match register based. I thought long about the support for absolute expiry values in cycles and decided against them to avoid that math hackery, which you folks now demand. > I don't think having a clock implementation for each hypervisor is such > a big deal. The Xen one, for example, is 300 lines of straightforward code. > > > Abstractions for the abstractions sake are braindead. 
There is no real > > reason to implement 128 bit math into that path just to make the virtual > > clockevent device look like real hardware. > > > > The abstraction of clockevents helps you to get rid of hardwired > > hardware assumptions, but you insist on creating them artificially for > > reasons which are beyond my grasp. > > > The hypervisor may present abstracted time hardware, but there is real > time hardware under there somewhere, and there are benefits to making > the abstraction as thin as possible. Yeah, it's much faster to do the conversion in the kernel and not in the hypervisor thin layer. See also below. > Xen chooses to express its time > interfaces in ns and so is a good direct match for the Linux time > infrastructure, but it still has to the 128-bit cycles<->ns conversion > *somewhere*, because the underlying hardware is still using cycles. It > sounds like the VMWare folks have chosen to directly use cycles in order > to avoid that conversion altogether. Neither the host OS nor the hypervisors use cycles as the main unit for their own time related code. They all have the required conversion code already available. The historical design of hypervisors was based on emulating the hardware 1:1. So the TSC needs to be a TSC and the LAPIC a LAPIC. Paravitualized guests can use smarter virtual hardware which is exposed to the kernel. Using paravirtualization only to speed up the emulation of legacy crap without thinking about the overall possible enhancements is just backwards. Paravirtualization is a technique that presents a software interface to virtual machines that is similar but not identical to that of the underlying hardware. clockevents allow you to do that easy and simple, but you insist on a 1:1 conversion of your current design and offload the legacy burden of your historical hardware usage to the kernel developers. No thanks. 
Also let's compare the code flow for a Linux guest on a Linux host: cylces based: program_next_event() convert to a virtual cycle value call into the emulated clock event device call into the hypervisor convert to nanoseconds arm a hrtimer convert to real hardware cycles nanosecond based: program_next_event() call into the emulated clock event device call into the hypervisor arm a hrtimer convert to real hardware cycles > > Jeremy spent a couple of hours to get NO_HZ running for Xen yesterday > > instead of writing up lengthy excuses, why it is s hard and takes > > sooo much time and the current interface is sooo insufficient. > > > > Yep, it worked out well. The only warty thing in there is the asm > 128-bit math needed in scale_delta() to convert tsc cycles to ns. John > Stultz had suggested (on a much earlier incarnation of this code) that > it could be generally useful and could be hoisted to somewhere more > common. I've included the
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
Dan Hecht wrote:
> Jeremy, I saw you sent out the Xen version earlier, thanks. Here's
> ours for reference (please excuse any formating issues); it's also
> lean. We'll send out a proper patch later after some more testing:

So the interrupt side of the clockevent comes through the virtual apic?
Where does evt->handle_event get called?

J
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On 03/07/2007 11:05 AM, Jeremy Fitzhardinge wrote: James Morris wrote: It seems to me that it could be useful to have a library of common virtual time code (entirely separate from pv_ops), to avoid re-implementing some apparently common requirements, such as: handling TSC frequency changes, stolen time accounting, synthetic programmable clockevent etc. Well, lets put our clock* implementations next to each other and see how much common code there is to be factored out. The Xen time code is pretty lean. There's not much difference in abstraction between the clocksource/event interface and the hypervisor interface, so there's just not very much code there. Jeremy, I saw you sent out the Xen version earlier, thanks. Here's ours for reference (please excuse any formating issues); it's also lean. We'll send out a proper patch later after some more testing: --- /* * VMI paravirtual timer support routines. * * Copyright (C) 2007, VMware, Inc. * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, but * WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or * NON INFRINGEMENT. See the GNU General Public License for more * details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. 
* */ #include #include #include #include #include #include #include #include #include #include #define VMI_ONESHOT (VMI_ALARM_IS_ONESHOT | VMI_CYCLES_REAL) #define VMI_PERIODIC (VMI_ALARM_IS_PERIODIC | VMI_CYCLES_REAL) static inline u32 vmi_counter(u32 flags) { /* Given VMI_ONESHOT or VMI_PERIODIC, return the corresponding * cycle counter. */ return flags & VMI_ALARM_COUNTER_MASK; } /* paravirt_ops.get_wallclock = vmi_get_wallclock */ unsigned long vmi_get_wallclock(void) { unsigned long long wallclock; wallclock = vmi_timer_ops.get_wallclock(); // nsec (void)do_div(wallclock, 10); // sec return wallclock; } /* paravirt_ops.set_wallclock = vmi_set_wallclock */ int vmi_set_wallclock(unsigned long now) { return 0; } /* paravirt_ops.get_scheduled_cycles = vmi_get_sched_cycles */ unsigned long long vmi_get_sched_cycles(void) { return vmi_timer_ops.get_cycle_counter(VMI_CYCLES_AVAILABLE); } /* paravirt_ops.get_cpu_khz = vmi_cpu_khz */ unsigned long vmi_cpu_khz(void) { unsigned long long khz; khz = vmi_timer_ops.get_cycle_frequency(); (void)do_div(khz, 1000); return khz; } /** vmi clockevent */ static struct clock_event_device vmi_global_clockevent; static inline u32 vmi_alarm_wiring(struct clock_event_device *evt) { return (evt == &vmi_global_clockevent) ? 
VMI_ALARM_WIRED_IRQ0 : VMI_ALARM_WIRED_LVTT; } static void vmi_timer_set_mode(enum clock_event_mode mode, struct clock_event_device *evt) { u32 wiring; cycle_t now, cycles_per_hz; BUG_ON(!irqs_disabled()); wiring = vmi_alarm_wiring(evt); if (wiring == VMI_ALARM_WIRED_LVTT) /* Route the interrupt to the correct vector */ apic_write_around(APIC_LVTT, LOCAL_TIMER_VECTOR); switch (mode) { case CLOCK_EVT_MODE_ONESHOT: break; case CLOCK_EVT_MODE_PERIODIC: cycles_per_hz = vmi_timer_ops.get_cycle_frequency(); (void)do_div(cycles_per_hz, HZ); now = vmi_timer_ops.get_cycle_counter(vmi_counter(VMI_PERIODIC)); vmi_timer_ops.set_alarm(wiring | VMI_PERIODIC, now, cycles_per_hz); break; case CLOCK_EVT_MODE_UNUSED: case CLOCK_EVT_MODE_SHUTDOWN: switch (evt->mode) { case CLOCK_EVT_MODE_ONESHOT: vmi_timer_ops.cancel_alarm(VMI_ONESHOT); break; case CLOCK_EVT_MODE_PERIODIC: vmi_timer_ops.cancel_alarm(VMI_PERIODIC); break; default: break; } break; default: break; } } static int vmi_timer_next_event(unsigned long delta, struct clock_event_device *evt) { /* Unfortunately, set_next_event interface only passes relative * expiry, but we want absolute expiry. It'd be better if were * were passed an aboslute expiry, since a bunch of time may * have been st
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
James Morris wrote:
> It seems to me that it could be useful to have a library of common virtual
> time code (entirely separate from pv_ops), to avoid re-implementing some
> apparently common requirements, such as: handling TSC frequency changes,
> stolen time accounting, synthetic programmable clockevent etc.
>

Well, let's put our clock* implementations next to each other and see how
much common code there is to be factored out. The Xen time code is
pretty lean. There's not much difference in abstraction between the
clocksource/event interface and the hypervisor interface, so there's
just not very much code there.

One immediate candidate is the scale_delta() function which does the
necessary tsc cycles->ns conversion. I think that will be generally
useful and should be put somewhere common rather than copied.

I think stolen time is a bit more core, and in principle applies to
non-virtualized systems as well (such as time stolen by SMM and
discontinuities caused by suspend/resume). The key piece is a monotonic
clock which advances while a vcpu is actually running on a real cpu,
since that should be used to determine how much time each process has
been running for. Maybe it will just fall out if we start moving to a
state-transition process time accounting rather than the current
sample-based one. Is there an actual plan to do that, or is it at the
handwaving stage?

J
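The body of scale_delta() is not visible in this part of the thread, so the following is only a portable sketch of what such a helper might look like. It assumes the scheme implied by the `tsc_to_nsec_mul` and `tsc_shift` fields of the Xen shadow time structure posted elsewhere in the thread: shift the cycle delta by a power of two, then multiply by a 32.32 fixed-point fraction. Where the posted code is described as using asm for the wide multiply, this version splits the operand into 32-bit halves instead.

```c
#include <stdint.h>

/*
 * Scale a TSC cycle delta to nanoseconds (sketch, not the Xen source):
 * apply a power-of-two shift, then compute (delta * mul_frac) >> 32
 * where mul_frac is a 32.32 fixed-point multiplier.
 */
static uint64_t scale_delta(uint64_t delta, uint32_t mul_frac, int shift)
{
	uint64_t hi, lo;

	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;

	/*
	 * Split the 64x32 multiply so that no intermediate exceeds 64 bits:
	 * (hi32 * 2^32 + lo32) * mul >> 32 == hi32 * mul + ((lo32 * mul) >> 32)
	 */
	hi = (delta >> 32) * mul_frac;
	lo = (delta & 0xffffffffULL) * mul_frac;

	return hi + (lo >> 32);
}
```

As a usage example, a multiplier of 0x80000000 represents 0.5 ns per cycle (a 2 GHz counter), so a delta of 1000 cycles scales to 500 ns.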
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On Wed, 2007-03-07 at 13:11 -0500, James Morris wrote:
> On Wed, 7 Mar 2007, Jeremy Fitzhardinge wrote:
>
> > I was very pleased when I saw the clocksource/event mechanisms go into
> > the kernel because it means different hypervisors can have a clock*
> > implementation to match their own particular time model/interface
> > without having to clutter up the pv_ops interface, and still have a
> > well-defined interface to the rest of the kernel's time infrastructure.
>
> It seems to me that it could be useful to have a library of common virtual
> time code (entirely separate from pv_ops), to avoid re-implementing some
> apparently common requirements, such as: handling TSC frequency changes,
> stolen time accounting, synthetic programmable clockevent etc.

Yes please. Expose sane emulated silicon to the kernel core and maintain
your hypervisor decisions behind that silicon instead of exposing us to
10 different silicon versions with 20 bugs each.

tglx
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On Wed, 2007-03-07 at 10:28 -0800, Jeremy Fitzhardinge wrote:
> Ingo Molnar wrote:
> > /For you/ it's certainly no big deal, you dont have to fix it up and you
> > dont have to keep it flexible ;)
> >
>
> How flexible does it need to be? Its a simple time source and event
> driver. How flexible does the pit driver need to be? It's just a small
> leaf node hanging off a large existing piece of kernel infrastructure.
>
> > and really, i'm not expecting miracles, i've never seen any hardware
> > vendor argue /against/ support for their own hardware =B-)
> >
>
> And since when has it been kernel policy to argue against including a
> well written, self-contained, vendor-provided driver for a piece of
> hardware?

The difference is that we have not much influence on the design
decisions of silicon vendors. We usually see them when the shit already
has been morphed into solid silicon.

Software emulated silicon _IS_ actually under our control. And we want
to have it as sane as possible.

tglx
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
Ingo Molnar wrote:
> ugh. Please take it from me: i've watched the Linux time code walk its
> long, rocky 10+ years road. One of the first mistakes was when we made
> the TSC the center of the i386-time universe. (incidentally, it was me
> who did the first steps of that, as a rookie kernel hacker) We got cured
> out of that in v2.6.19, v2.6.20 and v2.6.21. Granted, Xen is only at the
> beginning of that same road. Meet in another 10 years? ;)

Yep, the tsc has myriad problems; for Xen it's the best of a bad lot.
Unfortunately in 10 years no clearly better alternative has appeared;
maybe in 10 years there will be one. It might even be the tsc.

J
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
Ingo Molnar wrote:
> /For you/ it's certainly no big deal, you dont have to fix it up and you
> dont have to keep it flexible ;)
>

How flexible does it need to be? It's a simple time source and event
driver. How flexible does the pit driver need to be? It's just a small
leaf node hanging off a large existing piece of kernel infrastructure.

> and really, i'm not expecting miracles, i've never seen any hardware
> vendor argue /against/ support for their own hardware =B-)
>

And since when has it been kernel policy to argue against including a
well written, self-contained, vendor-provided driver for a piece of
hardware?

J
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On Wed, 7 Mar 2007, Jeremy Fitzhardinge wrote:

> I was very pleased when I saw the clocksource/event mechanisms go into
> the kernel because it means different hypervisors can have a clock*
> implementation to match their own particular time model/interface
> without having to clutter up the pv_ops interface, and still have a
> well-defined interface to the rest of the kernel's time infrastructure.

It seems to me that it could be useful to have a library of common virtual
time code (entirely separate from pv_ops), to avoid re-implementing some
apparently common requirements, such as: handling TSC frequency changes,
stolen time accounting, synthetic programmable clockevent etc.

- James
--
James Morris
<[EMAIL PROTECTED]>
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On Wed, 7 Mar 2007, Ingo Molnar wrote:

>
> * Jeremy Fitzhardinge <[EMAIL PROTECTED]> wrote:
>
> > Xen, for example, uses the tsc as the principle timebase in the
> > hypervisor interface. [...]
>
> ugh. Please take it from me: i've watched the Linux time code walk its
> long, rocky 10+ years road. One of the first mistakes was when we made
> the TSC the center of the i386-time universe. (incidentally, it was me
> who did the first steps of that, as a rookie kernel hacker) We got cured
> out of that in v2.6.19, v2.6.20 and v2.6.21. Granted, Xen is only at the
> beginning of that same road. Meet in another 10 years? ;)

What do you suggest instead ?

(Digging into this for lguest now...)

- James
--
James Morris
<[EMAIL PROTECTED]>
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
* Jeremy Fitzhardinge <[EMAIL PROTECTED]> wrote:

> I don't think having a clock implementation for each hypervisor is
> such a big deal. The Xen one, for example, is 300 lines of
> straightforward code.

/For you/ it's certainly no big deal, you dont have to fix it up and you
dont have to keep it flexible ;)

and really, i'm not expecting miracles, i've never seen any hardware
vendor argue /against/ support for their own hardware =B-)

Ingo
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
* Jeremy Fitzhardinge <[EMAIL PROTECTED]> wrote:

> Xen, for example, uses the tsc as the principle timebase in the
> hypervisor interface. [...]

ugh. Please take it from me: i've watched the Linux time code walk its
long, rocky 10+ years road. One of the first mistakes was when we made
the TSC the center of the i386-time universe. (incidentally, it was me
who did the first steps of that, as a rookie kernel hacker) We got cured
out of that in v2.6.19, v2.6.20 and v2.6.21. Granted, Xen is only at the
beginning of that same road. Meet in another 10 years? ;)

Ingo
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
Thomas Gleixner wrote: > That's a pure academic exercise. When we are at the point where > nanoseconds are to coarse - sometimes after we both retired - the > internal resolution will be femtoseconds or whatever fits. > > Again: paravirt should use a common infrastructure for this. Virtual > clocksource and virtual clockevent devices, which operate on ktime_t and > not on some artificial clock chip emulation frequency. The backend > implementation will be still per hypervisor, but we have _ONE_ device > emulation model, which is exposed to the kernel instead of five. > Different hypervisors have different time interfaces for good reasons - mostly because the real hardware is such a mess, and there's no clear "good" answer. In other words, for the same reason that the new clock infrastructure exists. Xen, for example, uses the tsc as the principle timebase in the hypervisor interface. A shared memory region is updated from time to time with the tsc frequency and other parameters, and the guest is expected to compute the current time in ns by extrapolating using the current tsc value. This only works because the hypervisor goes to some effort to synchronize the tsc between the (real) cpus, but its otherwise much the same as using the raw tsc. Other hypervisors may take other approaches, depending on what the real underlying hardware is and the real requirements. One could imagine a hypervisor exposing an hpet mapping, for example, or just having some kind of completely synthetic time source. The point is that if we were to build an abstraction layer over all of these just so that we could have a single clocksource/event implementation, it would be pretty much equivalent to the existing clock infrastructure, and would add no value. 
I was very pleased when I saw the clocksource/event mechanisms go into the kernel because it means different hypervisors can have a clock* implementation to match their own particular time model/interface without having to clutter up the pv_ops interface, and still have a well-defined interface to the rest of the kernel's time infrastructure. I don't think having a clock implementation for each hypervisor is such a big deal. The Xen one, for example, is 300 lines of straightforward code. > Abstractions for the abstractions sake are braindead. There is no real > reason to implement 128 bit math into that path just to make the virtual > clockevent device look like real hardware. > > The abstraction of clockevents helps you to get rid of hardwired > hardware assumptions, but you insist on creating them artificially for > reasons which are beyond my grasp. > The hypervisor may present abstracted time hardware, but there is real time hardware under there somewhere, and there are benefits to making the abstraction as thin as possible. Xen chooses to express its time interfaces in ns and so is a good direct match for the Linux time infrastructure, but it still has to the 128-bit cycles<->ns conversion *somewhere*, because the underlying hardware is still using cycles. It sounds like the VMWare folks have chosen to directly use cycles in order to avoid that conversion altogether. > Jeremy spent a couple of hours to get NO_HZ running for Xen yesterday > instead of writing up lengthy excuses, why it is s hard and takes > sooo much time and the current interface is sooo insufficient. > Yep, it worked out well. The only warty thing in there is the asm 128-bit math needed in scale_delta() to convert tsc cycles to ns. John Stultz had suggested (on a much earlier incarnation of this code) that it could be generally useful and could be hoisted to somewhere more common. I've included the whole thing below. 
    J

--
#include
#include
#include
#include
#include
#include
#include
#include

#include "xen-ops.h"

#define XEN_SHIFT 22

/* These are periodically updated in shared_info, and then copied here. */
struct shadow_time_info {
	u64 tsc_timestamp;     /* TSC at last update of time vals.  */
	u64 system_timestamp;  /* Time, in nanosecs, since boot.    */
	u32 tsc_to_nsec_mul;
	int tsc_shift;
	u32 version;
};

static DEFINE_PER_CPU(struct shadow_time_info, shadow_time);

/* Xen time at startup */
static s64 startup_offset;

unsigned long xen_cpu_khz(void)
{
	/* 10^6 ns per ms, scaled by 2^32 to match the fixed-point multiplier */
	u64 cpu_khz = 1000000ULL << 32;
	const struct vcpu_time_info *info =
		&HYPERVISOR_shared_info->vcpu_info[0].time;

	do_div(cpu_khz, info->tsc_to_system_mul);
	if (info->tsc_shift < 0)
		cpu_khz <<= -info->tsc_shift;
	else
		cpu_khz >>= info->tsc_shift;

	return cpu_khz;
}

/*
 * Reads a consistent set of time-base values from Xen, into a shadow data
 * area.
 */
static void get_time_values_from_xen(void)
{
	struct vcpu_time_info   *src;
	struct shadow_time_info *dst;

	src = &read_pda(xen.vcpu)->time;
	dst = &get_cpu_var(shadow_time);

	do {
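The listing breaks off in the archive before it reaches scale_delta(). As a rough illustration of the extrapolation the message describes (current time in ns computed from the last published timestamps plus a scaled tsc delta), here is a minimal user-space sketch in portable C. It is not the posted code: struct time_scale, cycles_to_ns() and current_ns() are hypothetical names, and the multiply is split into 32-bit halves instead of using the asm 128-bit math the message mentions.

```c
#include <stdint.h>

/* Hypothetical stand-in for the tsc_to_nsec_mul / tsc_shift pair. */
struct time_scale {
	uint32_t mul_frac; /* ns per (pre-shifted) cycle, times 2^32 */
	int      shift;    /* power-of-two pre-scale for the cycle delta */
};

/* Convert a cycle delta to ns: ((delta << shift) * mul_frac) >> 32. */
static uint64_t cycles_to_ns(uint64_t delta, const struct time_scale *ts)
{
	if (ts->shift < 0)
		delta >>= -ts->shift;
	else
		delta <<= ts->shift;

	/*
	 * Split the 64x32 multiply so that no intermediate needs more
	 * than 64 bits: delta = hi * 2^32 + lo.
	 */
	uint64_t hi = (delta >> 32) * ts->mul_frac;
	uint64_t lo = (delta & 0xffffffffULL) * ts->mul_frac;

	return hi + (lo >> 32);
}

/* Extrapolate "now" from the last snapshot and the current tsc value. */
static uint64_t current_ns(uint64_t tsc_now, uint64_t tsc_timestamp,
			   uint64_t system_timestamp,
			   const struct time_scale *ts)
{
	return system_timestamp + cycles_to_ns(tsc_now - tsc_timestamp, ts);
}
```

With mul_frac = 0x80000000 (0.5 ns per cycle, i.e. a 2 GHz tsc) and shift = 0, a delta of 1000 cycles comes out as 500 ns.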
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On Tue, 2007-03-06 at 18:08 -0800, Dan Hecht wrote:
>> IMO the paravirt interfaces should use nanoseconds anyway for both
>> readout and next event programming. That way the conversion is done in
>> the hypervisor once and the clocksources and clockevents are simple and
>> unified (except for the underlying hypervisor calls).
>>
>
> I disagree. The clocksource/clockevents layer is always going to have
> to convert nanoseconds to/from hardware units, so why not use it? And,
> some guests (say, a future version of Linux that does trace-based
> process accounting) may want higher resolution than nanoseconds for
> certain uses.

That's a pure academic exercise. When we are at the point where
nanoseconds are too coarse - sometime after we both retired - the
internal resolution will be femtoseconds or whatever fits.

Again: paravirt should use a common infrastructure for this. Virtual
clocksource and virtual clockevent devices, which operate on ktime_t and
not on some artificial clock chip emulation frequency. The backend
implementation will be still per hypervisor, but we have _ONE_ device
emulation model, which is exposed to the kernel instead of five.

On a Linux based host, you probably end up with a hrtimer on the host
side to schedule the next event on the guest. So why do we need to
convert ktime_t to some virtual frequency in the guest so we can convert
it back into ktime_t on the host ?

Abstractions for abstraction's sake are braindead. There is no real
reason to implement 128 bit math into that path just to make the virtual
clockevent device look like real hardware.

The abstraction of clockevents helps you to get rid of hardwired
hardware assumptions, but you insist on creating them artificially for
reasons which are beyond my grasp.

> In any case, this is beside the point; I'd prefer to
> stick to using the clockevents interface in the way it was intended
> rather than reaching into ->next_event.

Sigh. The gain is that you still have a good reason why you can't move
to the clockevents interface.

Jeremy spent a couple of hours to get NO_HZ running for Xen yesterday
instead of writing up lengthy excuses, why it is so hard and takes
sooo much time and the current interface is sooo insufficient.

	tglx
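To make the round-trip objection concrete, here is a small user-space sketch of a guest converting a ns delta into cycles of an emulated clock chip, and a host converting those cycles back to ns. All names (fake_clock_chip, guest_ns_to_cycles, host_cycles_to_ns) are hypothetical; the mult/shift scaling is only meant to resemble the kind of conversion a clockevents-style layer performs, and the 1193182 Hz frequency is the classic PIT rate chosen as an example.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical emulated clock chip with a fixed frequency. */
struct fake_clock_chip {
	uint64_t freq_hz;
	uint32_t mult;  /* cycles per ns, scaled by 2^shift */
	uint32_t shift;
};

static struct fake_clock_chip make_chip(uint64_t freq_hz, uint32_t shift)
{
	struct fake_clock_chip c = {
		freq_hz,
		(uint32_t)((freq_hz << shift) / NSEC_PER_SEC),
		shift,
	};
	return c;
}

/* Guest side: ns delta -> device cycles (rounds down). */
static uint64_t guest_ns_to_cycles(const struct fake_clock_chip *c,
				   uint64_t ns)
{
	return (ns * c->mult) >> c->shift;
}

/* Host side: device cycles -> ns, e.g. to arm a host timer. */
static uint64_t host_cycles_to_ns(const struct fake_clock_chip *c,
				  uint64_t cycles)
{
	return cycles * NSEC_PER_SEC / c->freq_hz;
}
```

For a 1 ms delta the guest programs 1193 cycles, and the host recovers slightly less than 1 ms: the two conversions cost arithmetic on both sides and lose up to one device cycle of precision, which a ns-based interface would avoid.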
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On Tue, 2007-03-06 at 17:44 -0800, Dan Hecht wrote:
>>>> 2) As I said above. The time accounting for virtualization needs to be
>>>> fixed in a generic way.
>>>>
>>>> I'm not going to accept some weird hackery for virtualization, which is
>>>> of exactly ZERO value for the kernel itself. Quite the contrary it will
>>>> make the cleanup harder and introduce another hard to remove thing,
>>>> which will in the worst case last forever.
>>>>
>>> Okay, to confirm I'm on the same page as you, you want to move process
>>> time accounting from being periodic, sample-based to being trace-based?
>>> i.e. at the system-call/interrupt boundaries, read the clocksource and
>>> compute directly the amount of system/user/process time?
>>
>> At least for the paravirt guests this is the correct approach. Once the
>> CPU vendors come up with a sane solution for a reliable and fast clock
>> source we might use that on real hardware as well.
>>
>
> I thought your preference was to not do things differently from real
> hardware? I guess this case you are okay with since you'd like to see
> the real hardware case follow eventually?

Real hardware _IS_ broken and slow. If we add the facilities for
virtualization we want it in a way which is usable by real hardware as
well.

>> Yes, with today's hardware it is simply a PITA. PowerPC has some basic
>> support for this though, IIRC.
>>
>
> I think S390 maybe too.

One more reason to make it a generic solution rather than some extra
hackery.

	tglx
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
Thomas Gleixner wrote:
> Ooops. I completely forgot that you get the absolute expiry time
> already in ktime_t format (nanoseconds) when dev->set_next_event() is
> called.
>
> 	dev->next_event = expires;
>
> is done right before the call.
>
> So it's already there for free.
>

OK, but a trap for young players (ie, me): the absolute time is in ns
since kernel boot, but the hypervisor wants an absolute time in ns since
system boot. Everything works reasonably well for the first guest
started early (the two boot times are then nearly the same), so be sure
to take a snapshot of hypervisor time early in order to get the
correction...

    J
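The correction mentioned above amounts to recording, once and early, the difference between the hypervisor's clock and the guest kernel's clock, and adding it to every absolute expiry handed to the hypervisor. A minimal sketch, with hypothetical names (time_base, snapshot_time_base, kernel_to_hypervisor_ns) and plain integers standing in for ktime_t:

```c
#include <stdint.h>

/* Offset between hypervisor time and guest-kernel time, in ns. */
struct time_base {
	int64_t startup_offset_ns;
};

/* Take the snapshot as early as possible during guest boot. */
static void snapshot_time_base(struct time_base *tb,
			       int64_t hypervisor_now_ns,
			       int64_t kernel_now_ns)
{
	tb->startup_offset_ns = hypervisor_now_ns - kernel_now_ns;
}

/* Translate a kernel-relative absolute expiry for the hypervisor. */
static int64_t kernel_to_hypervisor_ns(const struct time_base *tb,
				       int64_t kernel_abs_ns)
{
	return kernel_abs_ns + tb->startup_offset_ns;
}
```

For a guest started 5 s after the host, an expiry at kernel time 2000 ns must be passed to the hypervisor as roughly 5 s + 2000 ns; without the offset the hypervisor would see a timeout far in the past.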
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On 03/06/2007 05:18 PM, Thomas Gleixner wrote:
> On Tue, 2007-03-06 at 16:53 -0800, Dan Hecht wrote:
>>> Ooops. I completely forgot that you get the absolute expiry time
>>> already in ktime_t format (nanoseconds) when dev->set_next_event() is
>>> called.
>>>
>>> 	dev->next_event = expires;
>>>
>>> is done right before the call.
>>>
>>> So it's already there for free.
>>
>> Okay. I noticed that but didn't think it was okay to use since it
>> didn't seem like it was set up for the clock_event_device code's use, so
>> it seemed like a conceptual interface violation to go digging around in
>> there.
>
> Yes it is. I just wanted to point out that you can use it until I'm
> awake enough to implement it proper.

Well, we'll probably just live with using the relative expiry for the
first pass, and then revisit this later once that is working, rather
than resort to hacking it out by reading ->next_event.

>> Also, wasn't one of the points of clockevents to prevent the device code
>> from doing conversions between nanoseconds and clicks themselves? Don't
>> we really want the clockevents generic layer to do this conversion
>> from monotonic nanoseconds to absolute device clicks and then give
>> the device code that value, so the device layer doesn't perform any
>> conversions?
>
> Right. But this applies only to deltas, as the conversion of absolute
> time values gets ugly, i.e. 128-bit math.

Yeah, hopefully we can come up with a clean way to do this. But, like I
said earlier, until we do, we'll stick with the relative expiry.

> IMO the paravirt interfaces should use nanoseconds anyway for both
> readout and next event programming. That way the conversion is done in
> the hypervisor once and the clocksources and clockevents are simple and
> unified (except for the underlying hypervisor calls).

I disagree. The clocksource/clockevents layer is always going to have
to convert nanoseconds to/from hardware units, so why not use it? And,
some guests (say, a future version of Linux that does trace-based
process accounting) may want higher resolution than nanoseconds for
certain uses. In any case, this is beside the point; I'd prefer to
stick to using the clockevents interface in the way it was intended
rather than reaching into ->next_event.

thanks,
Dan
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On 03/06/2007 05:22 PM, Thomas Gleixner wrote:
> On Tue, 2007-03-06 at 16:42 -0800, Dan Hecht wrote:
>>>> accounting would be wrong. Instead, we should allow the
>>>> tick_sched_timer in cases (c) and (d) to have a runtime-configurable
>>>> period, and then scale the time value accordingly before passing it
>>>> to account_system_time. This is probably something the Xen folks will
>>>> want also, since I think Xen itself only gets a 100 Hz hard timer,
>>>> and so it can implement at best a oneshot virtual timer with 100 Hz
>>>> resolution. Any objections to us doing something like this?
>>>
>>> Yes. It's gross hackery.
>>>
>>> 1) We want to have a cleanup of the tick assumptions _all_ over the
>>> place and this is going to be real hard work.
>>>
>>> 2) As I said above. The time accounting for virtualization needs to be
>>> fixed in a generic way.
>>>
>>> I'm not going to accept some weird hackery for virtualization, which is
>>> of exactly ZERO value for the kernel itself. Quite the contrary it will
>>> make the cleanup harder and introduce another hard to remove thing,
>>> which will in the worst case last forever.
>>
>> Okay, to confirm I'm on the same page as you, you want to move process
>> time accounting from being periodic, sample-based to being trace-based?
>> i.e. at the system-call/interrupt boundaries, read the clocksource and
>> compute directly the amount of system/user/process time?
>
> At least for the paravirt guests this is the correct approach. Once the
> CPU vendors come up with a sane solution for a reliable and fast clock
> source we might use that on real hardware as well.

I thought your preference was to not do things differently from real
hardware? I guess this case you are okay with since you'd like to see
the real hardware case follow eventually?

In any case, in paravirt the costs of reading timers and doing system
call transitions are a bit different than on native, so we'll need to
figure out what makes sense given those costs.

>> Do you know if anyone has explored this? I thought there was a
>> discussion about this a while back but it was rejected due to the
>> sample-based approach having much lower overheads on high system call
>> rate workloads.
>
> Yes, with today's hardware it is simply a PITA. PowerPC has some basic
> support for this though, IIRC.

I think S390 maybe too.
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On Tue, 2007-03-06 at 16:42 -0800, Dan Hecht wrote:
>>> accounting would be wrong. Instead, we should allow the
>>> tick_sched_timer in cases (c) and (d) to have a runtime-configurable
>>> period, and then scale the time value accordingly before passing it
>>> to account_system_time. This is probably something the Xen folks will
>>> want also, since I think Xen itself only gets a 100 Hz hard timer,
>>> and so it can implement at best a oneshot virtual timer with 100 Hz
>>> resolution. Any objections to us doing something like this?
>>
>> Yes. It's gross hackery.
>>
>> 1) We want to have a cleanup of the tick assumptions _all_ over the
>> place and this is going to be real hard work.
>>
>> 2) As I said above. The time accounting for virtualization needs to be
>> fixed in a generic way.
>>
>> I'm not going to accept some weird hackery for virtualization, which is
>> of exactly ZERO value for the kernel itself. Quite the contrary it will
>> make the cleanup harder and introduce another hard to remove thing,
>> which will in the worst case last forever.
>>
>
> Okay, to confirm I'm on the same page as you, you want to move process
> time accounting from being periodic, sample-based to being trace-based?
> i.e. at the system-call/interrupt boundaries, read the clocksource and
> compute directly the amount of system/user/process time?

At least for the paravirt guests this is the correct approach. Once the
CPU vendors come up with a sane solution for a reliable and fast clock
source we might use that on real hardware as well.

> Do you know if anyone has explored this? I thought there was a
> discussion about this a while back but it was rejected due to the
> sample-based approach having much lower overheads on high system call
> rate workloads.

Yes, with today's hardware it is simply a PITA. PowerPC has some basic
support for this though, IIRC.

	tglx
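For readers unfamiliar with the distinction drawn above, trace-based accounting means charging elapsed time to the context being left at every user/system/interrupt boundary, instead of charging a whole tick to whatever happens to be running when the timer fires. A minimal sketch of the idea, in which every name (cpu_context, task_times, account_transition) is hypothetical and the timestamps are assumed to come from some clocksource read:

```c
#include <stdint.h>

enum cpu_context { CTX_USER, CTX_SYSTEM, CTX_IRQ, CTX_NR };

struct task_times {
	uint64_t ns[CTX_NR];       /* accumulated time per context */
	uint64_t last_stamp;       /* clocksource value at last boundary */
	enum cpu_context current;  /* context being charged right now */
};

static void accounting_start(struct task_times *t, enum cpu_context ctx,
			     uint64_t now_ns)
{
	for (int i = 0; i < CTX_NR; i++)
		t->ns[i] = 0;
	t->last_stamp = now_ns;
	t->current = ctx;
}

/*
 * Called at each boundary (system call entry/exit, interrupt
 * entry/exit): charge the elapsed time to the context being left.
 */
static void account_transition(struct task_times *t, enum cpu_context next,
			       uint64_t now_ns)
{
	t->ns[t->current] += now_ns - t->last_stamp;
	t->last_stamp = now_ns;
	t->current = next;
}
```

The cost concern raised in the thread is visible here: every transition requires one clocksource read, which is why the approach depends on a fast and reliable time source.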
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On Tue, 2007-03-06 at 16:53 -0800, Dan Hecht wrote:
>> Ooops. I completely forgot that you get the absolute expiry time
>> already in ktime_t format (nanoseconds) when dev->set_next_event() is
>> called.
>>
>> 	dev->next_event = expires;
>>
>> is done right before the call.
>>
>> So it's already there for free.
>>
>
> Okay. I noticed that but didn't think it was okay to use since it
> didn't seem like it was set up for the clock_event_device code's use, so
> it seemed like a conceptual interface violation to go digging around in
> there.

Yes it is. I just wanted to point out that you can use it until I'm
awake enough to implement it proper.

> Also, wasn't one of the points of clockevents to prevent the device code
> from doing conversions between nanoseconds and clicks themselves? Don't
> we really want the clockevents generic layer to do this conversion
> from monotonic nanoseconds to absolute device clicks and then give
> the device code that value, so the device layer doesn't perform any
> conversions?

Right. But this applies only to deltas, as the conversion of absolute
time values gets ugly, i.e. 128-bit math.

IMO the paravirt interfaces should use nanoseconds anyway for both
readout and next event programming. That way the conversion is done in
the hypervisor once and the clocksources and clockevents are simple and
unified (except for the underlying hypervisor calls).

> On an unrelated note, can you explain what the difference between
> CLOCK_EVT_MODE_UNUSED and CLOCK_EVT_MODE_SHUTDOWN modes is and what the
> legal state transitions are? (or point me to a document describing
> this). At least on i386, all clock event devices treat them the same;
> do we really need both?

UNUSED: The device is registered, but not used by any clockevents
client.

SHUTDOWN: The device is registered, claimed by a clockevents client, but
momentarily not active.

The clock events device can treat UNUSED and SHUTDOWN basically in the
same way.

	tglx
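A device-side mode handler that follows the explanation above might look roughly like the sketch below. The enum and struct are hypothetical user-space stand-ins (EVT_MODE_* instead of the kernel's CLOCK_EVT_MODE_*), and the PERIODIC and ONESHOT cases are included only as an assumption about what the other modes do; the message itself only discusses UNUSED and SHUTDOWN.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the CLOCK_EVT_MODE_* values. */
enum evt_mode {
	EVT_MODE_UNUSED,
	EVT_MODE_SHUTDOWN,
	EVT_MODE_PERIODIC,
	EVT_MODE_ONESHOT,
};

/* Hypothetical virtual timer state kept by the device backend. */
struct virt_timer {
	bool armed;
	bool periodic;
};

static void virt_timer_set_mode(struct virt_timer *t, enum evt_mode mode)
{
	switch (mode) {
	case EVT_MODE_UNUSED:   /* registered, no client */
	case EVT_MODE_SHUTDOWN: /* claimed, momentarily inactive */
		/* Either way the device just has to stop firing. */
		t->armed = false;
		t->periodic = false;
		break;
	case EVT_MODE_PERIODIC:
		t->armed = true;
		t->periodic = true;
		break;
	case EVT_MODE_ONESHOT:
		/* Stays idle until the next expiry is programmed. */
		t->armed = false;
		t->periodic = false;
		break;
	}
}
```

The distinction between the two idle modes matters to the generic layer (whether a client owns the device), not to the backend, which is why the two cases can share one branch.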
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On 03/06/2007 04:49 PM, Thomas Gleixner wrote:
> On Tue, 2007-03-06 at 16:35 -0800, Dan Hecht wrote:
>>>> There is no problem for realtime uses, as the reprogramming path is
>>>> running with local interrupts disabled. I can see the point for paravirt
>>>> and I'm not opposed to changing / expanding the interface for that. It
>>>> might be done by an extra clockevents feature flag, which requests
>>>> absolute time instead of relative time.
>>>
>>> I'm not sure how much difference it makes overall. It's true that
>>> absolute time would be a more useful interface, but because the guest
>>> vcpu can be preempted at any time, we could miss the timeout
>>> regardless. In Xen if you set a timeout for the past you get an
>>> immediate interrupt; I presume the clockevent code can deal with that?
>>
>> That's the problem though, you won't know to set it for the past since
>> the expiry is relative. When the vcpu starts running again, it will set
>> the timer to expire X ns from now, not X ns from when the timer was
>> requested.
>
> Ooops. I completely forgot that you get the absolute expiry time
> already in ktime_t format (nanoseconds) when dev->set_next_event() is
> called.
>
> 	dev->next_event = expires;
>
> is done right before the call.
>
> So it's already there for free.

Okay. I noticed that but didn't think it was okay to use since it
didn't seem like it was set up for the clock_event_device code's use, so
it seemed like a conceptual interface violation to go digging around in
there.

Also, wasn't one of the points of clockevents to prevent the device code
from doing conversions between nanoseconds and clicks themselves? Don't
we really want the clockevents generic layer to do this conversion
from monotonic nanoseconds to absolute device clicks and then give
the device code that value, so the device layer doesn't perform any
conversions?

On an unrelated note, can you explain what the difference between
CLOCK_EVT_MODE_UNUSED and CLOCK_EVT_MODE_SHUTDOWN modes is and what the
legal state transitions are? (or point me to a document describing
this). At least on i386, all clock event devices treat them the same;
do we really need both?
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On Tue, 2007-03-06 at 16:35 -0800, Dan Hecht wrote:
>>> There is no problem for realtime uses, as the reprogramming path is
>>> running with local interrupts disabled. I can see the point for paravirt
>>> and I'm not opposed to changing / expanding the interface for that. It
>>> might be done by an extra clockevents feature flag, which requests
>>> absolute time instead of relative time.
>>>
>>
>> I'm not sure how much difference it makes overall. It's true that
>> absolute time would be a more useful interface, but because the guest
>> vcpu can be preempted at any time, we could miss the timeout
>> regardless. In Xen if you set a timeout for the past you get an
>> immediate interrupt; I presume the clockevent code can deal with that?
>>
>
> That's the problem though, you won't know to set it for the past since
> the expiry is relative. When the vcpu starts running again, it will set
> the timer to expire X ns from now, not X ns from when the timer was
> requested.

Ooops. I completely forgot that you get the absolute expiry time
already in ktime_t format (nanoseconds) when dev->set_next_event() is
called.

	dev->next_event = expires;

is done right before the call.

So it's already there for free.

	tglx
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On 03/06/2007 03:53 PM, Thomas Gleixner wrote:
>> 2) Virtual interrupts have a relatively high overhead as compared with
>> native interrupts. So, in vmitime, we wanted to be able to lower the
>> timer interrupt rate at runtime, even if HZ is a compile time constant
>> (and set to something high, like 1000hz). While we could hack this in
>> by using evt->min_delta_ns, it wouldn't really work since process time
>> accounting would be wrong. Instead, we should allow the
>> tick_sched_timer in cases (c) and (d) to have a runtime-configurable
>> period, and then scale the time value accordingly before passing it
>> to account_system_time. This is probably something the Xen folks will
>> want also, since I think Xen itself only gets a 100 Hz hard timer,
>> and so it can implement at best a oneshot virtual timer with 100 Hz
>> resolution. Any objections to us doing something like this?
>
> Yes. It's gross hackery.
>
> 1) We want to have a cleanup of the tick assumptions _all_ over the
> place and this is going to be real hard work.
>
> 2) As I said above. The time accounting for virtualization needs to be
> fixed in a generic way.
>
> I'm not going to accept some weird hackery for virtualization, which is
> of exactly ZERO value for the kernel itself. Quite the contrary it will
> make the cleanup harder and introduce another hard to remove thing,
> which will in the worst case last forever.

Okay, to confirm I'm on the same page as you, you want to move process
time accounting from being periodic, sample-based to being trace-based?
i.e. at the system-call/interrupt boundaries, read the clocksource and
compute directly the amount of system/user/process time?

Do you know if anyone has explored this? I thought there was a
discussion about this a while back but it was rejected due to the
sample-based approach having much lower overheads on high system call
rate workloads.
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On 03/06/2007 04:24 PM, Jeremy Fitzhardinge wrote:
> Thomas Gleixner wrote:
>>> 3) clockevent set_next_event interface is suboptimal for paravirt (and
>>> probably realtime-ish uses). The problem is that the expiry is passed
>>> as a relative time. On paravirt, an arbitrary amount of (stolen) time
>>> may have passed between when the delta was computed and when the timer
>>> device is programmed, causing that next interrupt to be too far out in
>>> the future. It seems a better interface for set_next_event would be to
>>> pass the current time and the absolute expiry. Actually, I sent email
>>> to Thomas and Ingo about this (and some other clockevents/hrtimer
>>> feedback) in July 2006, but never heard back. Thoughts?
>>
>> There is no problem for realtime uses, as the reprogramming path is
>> running with local interrupts disabled. I can see the point for paravirt
>> and I'm not opposed to changing / expanding the interface for that. It
>> might be done by an extra clockevents feature flag, which requests
>> absolute time instead of relative time.
>
> I'm not sure how much difference it makes overall. It's true that
> absolute time would be a more useful interface, but because the guest
> vcpu can be preempted at any time, we could miss the timeout
> regardless. In Xen if you set a timeout for the past you get an
> immediate interrupt; I presume the clockevent code can deal with that?

That's the problem though, you won't know to set it for the past since
the expiry is relative. When the vcpu starts running again, it will set
the timer to expire X ns from now, not X ns from when the timer was
requested.
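The difference between the two interfaces can be shown with a few lines of arithmetic. The sketch below is a toy model, not kernel code: the function names are hypothetical, and "stolen" stands for the time during which the vcpu was descheduled between computing the expiry and programming the timer. The absolute variant assumes, as described for Xen in the quoted text, that a timeout set in the past fires immediately.

```c
#include <stdint.h>

/*
 * Relative programming: the delta is computed at computed_at, but the
 * device only starts counting once the vcpu runs again.
 */
static uint64_t fire_time_relative(uint64_t computed_at, uint64_t expiry,
				   uint64_t stolen)
{
	uint64_t delta = expiry - computed_at;
	uint64_t programmed_at = computed_at + stolen;

	return programmed_at + delta;
}

/*
 * Absolute programming: the device is told the expiry itself, so a
 * timeout that is already in the past fires as soon as it is set.
 */
static uint64_t fire_time_absolute(uint64_t computed_at, uint64_t expiry,
				   uint64_t stolen)
{
	uint64_t programmed_at = computed_at + stolen;

	return expiry > programmed_at ? expiry : programmed_at;
}
```

With an expiry 500 ns ahead and 2000 ns of stolen time, the relative timer fires 2000 ns late (the full stolen interval is added again), while the absolute timer fires as soon as the vcpu resumes.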
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On Tue, 2007-03-06 at 16:24 -0800, Jeremy Fitzhardinge wrote:
>>> 3) clockevent set_next_event interface is suboptimal for paravirt (and
>>> probably realtime-ish uses). The problem is that the expiry is passed
>>> as a relative time. On paravirt, an arbitrary amount of (stolen) time
>>> may have passed between when the delta was computed and when the timer
>>> device is programmed, causing that next interrupt to be too far out in
>>> the future. It seems a better interface for set_next_event would be to
>>> pass the current time and the absolute expiry. Actually, I sent email
>>> to Thomas and Ingo about this (and some other clockevents/hrtimer
>>> feedback) in July 2006, but never heard back. Thoughts?
>>>
>>
>> There is no problem for realtime uses, as the reprogramming path is
>> running with local interrupts disabled. I can see the point for paravirt
>> and I'm not opposed to changing / expanding the interface for that. It
>> might be done by an extra clockevents feature flag, which requests
>> absolute time instead of relative time.
>>
>
> I'm not sure how much difference it makes overall. It's true that
> absolute time would be a more useful interface, but because the guest
> vcpu can be preempted at any time, we could miss the timeout
> regardless. In Xen if you set a timeout for the past you get an
> immediate interrupt; I presume the clockevent code can deal with that?

Yep. You also can return -ETIME so it just works w/o an interrupt.

	tglx
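As a sketch of the -ETIME suggestion: a backend that is handed an absolute expiry can detect that the deadline has already passed and report that to its caller instead of arming the device and waiting for an interrupt. The structure and function below are hypothetical user-space stand-ins, not the real clockevents callback signature.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical one-shot timer backend state. */
struct oneshot_timer {
	uint64_t armed_for_ns; /* absolute expiry currently programmed */
	int      armed;
};

/*
 * Program the next event. If the expiry is not in the future, do not
 * arm anything and return -ETIME so the caller can handle the expired
 * event directly.
 */
static int oneshot_set_next_event(struct oneshot_timer *t, uint64_t now_ns,
				  uint64_t expires_ns)
{
	if (expires_ns <= now_ns) {
		t->armed = 0;
		return -ETIME;
	}

	t->armed_for_ns = expires_ns;
	t->armed = 1;
	return 0;
}
```

Returning an error for a past deadline avoids a round trip through the hypervisor just to receive an interrupt that is already due.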
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
Thomas Gleixner wrote:
> All paravirt users probably want to have NO_HZ, so PARAVIRT might simply
> depend on NO_HZ. Of course I might be wrong :)
>

Xen can deal either way, but tickless is certainly preferred.

> OTOH the stolen time accounting should be fixed in general and not rely
> on "it happens to work now" assumptions. And it should be done for _ALL_
> hypervisors in the same way, i.e. in the generic code.
>

Yep. We'll need to come up with a common story for that.

>> This is probably something the Xen folks will want
>> also, since I think Xen itself only gets a 100 Hz hard timer, and so it
>> can implement at best a oneshot virtual timer with 100 Hz resolution.
>> Any objections to us doing something like this?
>>

Xen has a nanosecond resolution one-shot timer which I'm using for
this. There's also a 100Hz tick which gets in the way a bit (it will
appear as a stream of spurious timeouts), but we'll turn that off soon.

>> 3) clockevent set_next_event interface is suboptimal for paravirt (and
>> probably realtime-ish uses). The problem is that the expiry is passed
>> as a relative time. On paravirt, an arbitrary amount of (stolen) time
>> may have passed between when the delta was computed and when the timer
>> device is programmed, causing that next interrupt to be too far out in
>> the future. It seems a better interface for set_next_event would be to
>> pass the current time and the absolute expiry. Actually, I sent email
>> to Thomas and Ingo about this (and some other clockevents/hrtimer
>> feedback) in July 2006, but never heard back. Thoughts?
>>
>
> There is no problem for realtime uses, as the reprogramming path is
> running with local interrupts disabled. I can see the point for paravirt
> and I'm not opposed to changing / expanding the interface for that. It
> might be done by an extra clockevents feature flag, which requests
> absolute time instead of relative time.
>

I'm not sure how much difference it makes overall. It's true that
absolute time would be a more useful interface, but because the guest
vcpu can be preempted at any time, we could miss the timeout
regardless. In Xen if you set a timeout for the past you get an
immediate interrupt; I presume the clockevent code can deal with that?

    J
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
Dan,

On Tue, 2007-03-06 at 13:07 -0800, Dan Hecht wrote:
>> Why is this so non-trivial ? All you have to do is _NOT_ register
>> PIT/HPET/APIC timers and register a per CPU hyper-CE-device instead,
>> which uses the hypervisor timer emulation instead of real hardware.
>>
>> clockevents breaks the hardwired assumptions of the old timer code and
>> allows you to remove _ALL_ the hardwired hackery in vmitimer.c, i.e.
>> stuff like
>>
>> 	/* Disable PIT. */
>> 	outb_p(0x3a, PIT_MODE); /* binary, mode 5, LSB/MSB, ch 0 */
>>
>
> Hmm, I think that the (virtual) bios still will set up the PIT ch 0, and
> we still need to stop it.

I guess you have access to the source code of this virtual BIOS. So this
is a real cute technical solution.

ROTFL. The number of lame excuses in this whole virtualization
discussion is amazing.

> In any case, clockevents doesn't really make it easier or harder as far
> as init goes. In the pre-clockevent days, we replaced setup_pit_timer,
> setup_boot_clock, setup_secondary_clock. With clockevents, I think the
> hook points are the same. Mostly just need to allow the per-cpu
> lapic_event to be generalized to local_clock_events that can be set to
> whatever device we want. The other thing on i386 is just some minor
> annoyances due to initially setting up only the PIT on cpu0 on irq 0 and
> then later setting up per-cpu timer on lvtt, and making this all play
> nice with paravirt timers. But these are just details and just require
> some minor changes and will be working, but it just takes some massaging.

Nothing forces you to follow that low level hardware scheme. That's
_WHY_ clockevents are there. Create a per cpu clock event source, which
uses whatever interrupt you want (you just need to be able to pin it to
the cpu).

> So, that is not the real reason to move over to clockevents.

It is partially, because clockevents remove the hardcoded hardware
assumptions.

> The real
> reason is to use the generic interrupt handlers. We understand that,
> and will get to that point. In the mean time, we are harming no one.
> Our code has zero effect when you are booting natively or on a non-VMI
> hypervisor.

The "we are harming no one" argument is a great excuse to push random
hackery into the kernel. Once it is there, there is no rush to fix it
because it works (for you).

That's exactly the point which is discussed in the "Xen & VMI" thread.
We open up a can of worms and within no time we have 5 or more different
solutions for the same problem. If we do not look carefully at this, we
have no way to do any changes in the core code w/o breaking one of those
hypervisor interfaces. The in tree / FOSS hypervisor interfaces might be
fixable, but those which throw a binary blob to the kernel are not.

I completely agree with Ingo, that this whole paravirt business starts
to crawl across the kernel spreading paralysis all over the place. We
have already enough trouble with real hardware, so we want to carefully
avoid that we get broken virtual hardware as an extra workload via
paravirt ops.

>>> We worked around this by keeping NO_IDLE_HZ support, which now
>>> you deprecated. So now we are using NO_HZ without a hyper-CE device,
>>> and it is working fine. We understand the benefits of moving to the CE
>>> model - but it cannot be done overnight.
>>
>> This is ugly as hell. NO_HZ enables the dyntick functions in idle(),
>> irq_enter() and irq_exit() so the clockevents code is actually invoked.
>> I have not looked close enough why this does work at all.
>>
>
> I believe this was just a quick fix in response to Ingo breaking the VMI
> build yesterday by disabling NO_IDLE_HZ on us. There is no technical
> reason why NO_IDLE_HZ=y can't coexist with NO_HZ.
>
> (The two work okay together because when using NO_IDLE_HZ, the hooks are
> deeper in a custom safe_halt routine which isn't registered when using
> nohz mode at runtime, and conversely, the nohz code is guarded at
> runtime by the ts->nohz_mode. So, the two really can co-exist at
> compile time).

It is guarded by the fact that you are not registering clockevent
devices. It's not guarded by design. It happens to work.

> Again, no one is arguing that we shouldn't move to clockevents, it's
> just a matter of time (sorry, no pun intended).

clockevents have been around for quite a time - pun intended :). They
did not surface surprisingly with 2.6.21-rc1.

> The vmi-time code was introduced to solve some shortcomings of the old
> (pre-clocksource/clockevents/hrtimer/NO_HZ) i386 timer code that was
> especially painful for virtualization. Certainly,
> clocksource/clockevents/NO_HZ solves many of the problems (basically,
> moving away from counting interrupts to using time sources). e.g. xtime
> updating is no longer a worry with the new timeofday/clocksource stuff.
> But there are some that may not quite be solved, listed below. (I
> know I'm not telling you anything
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On 03/06/2007 02:21 PM, Andi Kleen wrote:
>> I believe this was just a quick fix in response to Ingo breaking the VMI build yesterday by disabling NO_IDLE_HZ on us. There is no technical reason why NO_IDLE_HZ=y can't coexist with NO_HZ.
> Well it's nasty that you force NO_IDLE_HZ on all of paravirt ops users.
The only thing NO_IDLE_HZ=y "forces" on other users is some extra code (which you are going to get no matter what with CONFIG_PARAVIRT). It doesn't force them to use this code. It just provides a few extra routines that a paravirt_ops backend might want to call back into (I think both vmi and xen backends use these routines and that is why it became associated with CONFIG_PARAVIRT rather than CONFIG_VMI).
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
> I believe this was just a quick fix in response to Ingo breaking the VMI > build yesterday by disabling NO_IDLE_HZ on us. There is no technical > reason why NO_IDLE_HZ=y can't coexist with NO_HZ. Well it's nasty that you force NO_IDLE_HZ on all of paravirt ops users. I think the right solution is to make VMI depend on (not select) NO_IDLE_HZ until you can fix your code to work with dynticks properly. -Andi
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On 03/06/2007 02:59 AM, Thomas Gleixner wrote:
> On Tue, 2007-03-06 at 00:55 -0800, Zachary Amsden wrote:
>>> a proper CE device also has the added bonus of making high-res timers guests work automatically. It should be simple: just pass it through to your hypervisor, a hyper-CE-device, like a hyper-clocksource device has essentially no guest-side complexity.
>> It is not so simple. In theory it works great. In reality, the i386 implementation is completely hardwired to work the way hardware works, and breaking the clockevent code out of the deep ties to the APIC is extremely non-trivial. We tried, and could not accomplish it for 2.6.21 because the hrtimers integration was complex, and introduced many bugs for us.
> Why is this so non-trivial ? All you have to do is _NOT_ register PIT/HPET/APIC timers and register a per CPU hyper-CE-device instead, which uses the hypervisor timer emulation instead of real hardware. clockevents breaks the hardwired assumptions of the old timer code and allows you to remove _ALL_ the hardwired hackery in vmitimer.c, i.e. stuff like
> /* Disable PIT. */
> outb_p(0x3a, PIT_MODE); /* binary, mode 5, LSB/MSB, ch 0 */
Hmm, I think that the (virtual) bios still will set up the PIT ch 0, and we still need to stop it. In any case, clockevents doesn't really make it easier nor harder as far as init goes. In the pre-clockevent days, we replace setup_pit_timer, setup_boot_clock, setup_secondary_clock. With clockevents, I think the hook points are the same. Mostly just need to allow the per-cpu lapic_event to be generalized to local_clock_events that can be set to whatever device we want. The other thing on i386 is just some minor annoyances due to initially setting up only the PIT on cpu0 on irq 0 and then later setting up per-cpu timer on lvtt, and making this all play nice with paravirt timers. But these are just details and just require some minor changes and will be working, but it just takes some massaging.
So, that is not the real reason to move over to clockevents. The real reason is to use the generic interrupt handlers. We understand that, and will get to that point. In the mean time, we are harming no one. Our code has zero effect when you are booting natively or on a non-VMI hypervisor.
>> We worked around this by keeping NO_IDLE_HZ support, which now you deprecated. So now we are using NO_HZ without a hyper-CE device, and it is working fine. We understand the benefits of moving to the CE model - but it cannot be done overnight.
> This is ugly as hell. NO_HZ enables the dyntick functions in idle(), irq_enter() and irq_exit() so the clockevents code is actually invoked. I have not looked close enough why this does work at all.
I believe this was just a quick fix in response to Ingo breaking the VMI build yesterday by disabling NO_IDLE_HZ on us. There is no technical reason why NO_IDLE_HZ=y can't coexist with NO_HZ. (The two work okay together because when using NO_IDLE_HZ, the hooks are deeper in a custom safe_halt routine which isn't registered when using nohz mode at runtime, and conversely, the nohz code is guarded at runtime by the ts->nohz_mode. So, the two really can co-exist at compile time).
Again, no one is arguing that we shouldn't move to clockevents, it's just a matter of time (sorry, no pun intended).
The vmi-time code was introduced to solve some shortcomings of the old (pre-clocksource/clockevents/hrtimer/NO_HZ) i386 timer code that was especially painful for virtualization. Certainly, clocksource/clockevents/NO_HZ solves many of the problems (basically, moving away from counting interrupts to using time sources). e.g. xtime updating is no longer a worry with the new timeofday/clocksource stuff. But there are some that may not quite be solved, listed below.
(I know I'm not telling you anything new, but I might as well flesh it out for the other paravirt folks while the code is fresh in my mind):
1) Stolen time (virtual cpu is ready to run but not running): this is handled inconsistently between the various clockevent handlers / CLOCK_EVT_MODE_ONESHOT combinations:
  a) tick_handle_periodic / CLOCK_EVT_MODE_PERIODIC: depends on how you define "periodic" timer in a paravirtual world. If you do something like Xen-style where you send periodic events only to running vcpus, then this handler suffers from some of the same problems as the old i386 timer handler:
    - jiffies updated according to the number of interrupts you get, so falls behind monotonic time. generally, counting timer interrupts is bad for paravirt.
    - process time updated according to the number of interrupts, so falls behind monotonic time. This is probably okay though, since it is essentially tracking (mono - stolen) time. I.e. the missing time is stolen.
    - jiffies updated only by boot cpu, which is a problem for paravirt since the boot vcpu can be descheduled while the other vcpus are scheduled.
    -
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On Tue, 2007-03-06 at 00:55 -0800, Zachary Amsden wrote: > > a proper CE device also has the added bonus of making high-res timers > > guests work automatically. It should be simple: just pass it through to > > your hypervisor, a hyper-CE-device, like a hyper-clocksource device has > > essentially no guest-side complexity. > > > > It is not so simple. In theory it works great. In reality, the i386 > implementation is completely hardwired to work the way hardware works, > and breaking the clockevent code out of the deep ties to the APIC is > extremely non-trivial. We tried, and could not accomplish it for 2.6.21 > because the hrtimers integration was complex, and introduced many bugs > for us. Why is this so non-trivial? All you have to do is _NOT_ register PIT/HPET/APIC timers and register a per CPU hyper-CE-device instead, which uses the hypervisor timer emulation instead of real hardware. clockevents breaks the hardwired assumptions of the old timer code and allows you to remove _ALL_ the hardwired hackery in vmitimer.c, i.e. stuff like /* Disable PIT. */ outb_p(0x3a, PIT_MODE); /* binary, mode 5, LSB/MSB, ch 0 */ > We worked around this by keeping NO_IDLE_HZ support, which now > you deprecated. So now we are using NO_HZ without a hyper-CE device, > and it is working fine. We understand the benefits of moving to the CE > model - but it cannot be done overnight. This is ugly as hell. NO_HZ enables the dyntick functions in idle(), irq_enter() and irq_exit() so the clockevents code is actually invoked. I have not looked close enough why this does work at all. I have the feeling that "working fine" means something like "does not explode". We really want to fix this now instead of pushing some don't-know-why-it-works hack into the kernel. 
tglx