Re: [kvm-devel] KVM in-kernel APIC update
Gregory Haskins wrote:
> My current thoughts are that we at least move the IOAPIC into the kernel as
> well. That will give sufficient control to generate ISA bus interrupts for
> guests that understand APICs. If we want to be able to generate ISA
> interrupts for legacy guests which talk to the 8259s, that will prove to be
> insufficient. The good news is that moving the 8259s down as well is
> probably not a huge deal either, especially since I have already prepped the
> usermode side. Thoughts?

I would avoid moving down anything that's not strictly necessary. If we want to keep the PIC in qemu, for example, just export the APIC-PIC interface to qemu.

I still don't have an opinion as to whether it is necessary; I'll need to study the details. Xen pushes most of the platform into the hypervisor, but then Xen's communication path to qemu is much more expensive (involving the scheduler and a potential cpu switch) than kvm's. We'd need to balance possible performance improvements (I'd expect negligible) and interface simplification (possibly non-negligible) against further diverging from qemu.

> So here's a question for you guys out there. What is the expected use of the
> in-kernel APIC? My interests lie in the ability to send IPIs for SMP, as
> well as being able to inject asynchronous hypercall interrupts. I assume
> there are other reasons too, such as PV device interrupts, etc., and I would
> like to make sure I am seeing the big picture before making any bad design
> decisions. My question is, how do we expect the PV devices to look from a
> bus perspective?
>
> The current Bochs/QEMU system model paints a fairly simple ISA architecture
> utilizing a single IOAPIC + dual 8259 setup. Do we expect in-kernel injected
> IRQs to follow the ISA model (e.g. either legacy or PCI interrupts only,
> limited to IRQ0-15) or do we want to expand on this? The PCI hypercall
> device introduced a while back would be an example of something ISA based.
> Alternatives would be to utilize unused "pins" (such as IRQ16-23) on IOAPIC
> #0, or introducing an entirely new bus/IOAPICs just for KVM, etc.

There are two extreme models, which I think are both needed. On one end, support for closed OSes (e.g. Windows) requires fairly strict conformance to the PCI model, which means going through the IOAPIC or PIC or however the interrupt lines are wired in qemu. This seems to indicate that an in-kernel IOAPIC is needed. On the other end (Linux), a legacy-free and emulation-free device can just inject interrupts directly and use shared memory to ack interrupts and indicate their source (a sketch of such a device follows this message).

> If the latter, we also need to decide what the resource conveyance model and
> vector allocation policy should be. For instance, do we publish said
> resources formally in the MP/ACPI tables in Bochs? Doing so would allow
> MP/ACPI compliant OSes like Linux to naturally route the IRQ. Conversely, do
> we do something more direct, just like we do for KVM discovery via wrmsr?

I think we can go the direct route for cooperative guests. I also suggest doing the work in stages and measuring; that is, first push the local apic, determine what the remaining bottlenecks are, and tackle them. I'm pretty sure that Linux guests would only require the local apic, but Windows (and older Linux kernels) might require more.
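A minimal sketch of the shared-memory ack model described above, assuming an invented page layout and names (kvm_pv_intr_page, dispatch_pv_source); this is illustrative only, not an existing KVM interface:

/* Hypothetical shared page for a legacy-free PV interrupt path.
 * All names here are invented for illustration. */
#include <linux/types.h>

struct kvm_pv_intr_page {
	__u64 pending[4];	/* one bit per PV interrupt source */
};

extern void dispatch_pv_source(int src);	/* hypothetical guest handler */

/* Guest-side handler for the single directly-injected vector: fan out
 * to the sources flagged in shared memory.  Clearing a bit acks the
 * source, so no emulated APIC EOI round trip is needed. */
static void pv_intr_handler(struct kvm_pv_intr_page *page)
{
	int word;

	for (word = 0; word < 4; word++) {
		/* atomically fetch-and-clear the pending word (an xchg) */
		__u64 mask = __sync_lock_test_and_set(&page->pending[word], 0);

		while (mask) {
			int bit = __builtin_ctzll(mask);

			mask &= mask - 1;
			dispatch_pv_source(word * 64 + bit);
		}
	}
}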
> struct kvm_vcpu;
>
> +struct kvm_irqinfo {
> +	int vector;
> +	int nmi;
> +};
> +
> +#define KVM_IRQFLAGS_NMI  (1 << 0)
> +#define KVM_IRQFLAGS_PEEK (1 << 1)
> +
> +struct kvm_irqdevice {
> +	int (*pending)(struct kvm_irqdevice *this, int flags);
> +	int (*read)(struct kvm_irqdevice *this, int flags,
> +		    struct kvm_irqinfo *info);

Aren't pending() and read() + PEEK overlapping?

> +	int (*inject)(struct kvm_irqdevice *this, int irq, int flags);
> +	int (*summary)(struct kvm_irqdevice *this, void *data);
> +	void (*destructor)(struct kvm_irqdevice *this);
> +
> +	void *private;
> +};

Consider using container_of() to simulate C++ inheritance. Messier but fewer indirections (see the sketch after this message). Also consider a kvm_irqdevice_operations structure.

> +
> +#define MAX_APIC_INT_VECTOR 256
> +
> +struct kvm_apic {
> +	u32 status;
> +	u32 vcpu_id;
> +	spinlock_t lock;
> +	u32 pcpu_lock_owner;

Isn't this vcpu->cpu?

> +	atomic_t timer_pending;
> +	u64 apic_base_msr;
> +	unsigned long base_address;
> +	u32 timer_divide_count;
> +	struct hrtimer apic_timer;
> +	int intr_pending_count[MAX_APIC_INT_VECTOR];
> +	ktime_t timer_last_update;
> +	struct {
> +		int deliver_mode;
> +		int source[6];
> +	} direct_intr;
> +	u32 err_sta
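For reference, the container_of() idiom suggested here would look roughly as follows; the ops split and the apic_inject() body are illustrative, not the actual patch:

#include <linux/kernel.h>	/* container_of() */
#include <linux/types.h>

struct kvm_irqdevice;

struct kvm_irqdevice_operations {
	int (*inject)(struct kvm_irqdevice *this, int irq, int flags);
	/* ... pending/read/summary/destructor ... */
};

struct kvm_irqdevice {
	struct kvm_irqdevice_operations *ops;	/* no void *private needed */
};

struct kvm_apic {
	struct kvm_irqdevice dev;	/* embedded base "class" */
	u32 vcpu_id;
	/* ... */
};

extern int deliver_to_apic(struct kvm_apic *apic, int irq, int flags); /* hypothetical */

static int apic_inject(struct kvm_irqdevice *this, int irq, int flags)
{
	/* recover the enclosing kvm_apic from the embedded member */
	struct kvm_apic *apic = container_of(this, struct kvm_apic, dev);

	return deliver_to_apic(apic, irq, flags);
}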
[kvm-devel] memory hotplug for guests?
Does KVM allow something like "memory hotplug" for its guests?

For example, let's say you are running several guests, and would like to start yet another one for a while - but have no free memory left. Obviously, your guests are so important that you don't want to stop them - so you simply "hotplug remove" memory from a guest that has a lot of free memory left - and start a new guest.

When that new guest is no longer needed, and is stopped, you can "hotplug add" memory to the guest it was previously removed from. The guest's kernel would of course need to support memory hotplugging, too.

Is it possible with KVM? If not, is such a feature planned? I noticed it was only mentioned once or twice on the list: http://article.gmane.org/gmane.comp.emulators.kvm.devel/712/ "Well, the _interface_ supports removing, the implementation does not :) Everything was written in mind to allow memory hotplug."

--
Tomasz Chmielewski
http://wpkg.org
Re: [kvm-devel] memory hotplug for guests?
> Does KVM allow something like "memory hotplug" for its guests?

It does not support that.

> For example, let's say you are running several guests, and would like to
> start yet another one for a while - but have no free memory left.

We have another solution for it that will soon be pushed into the kernel: the balloon driver. Each guest runs a balloon driver; when the host needs to free up memory, a daemon with a certain policy asks some of the guests to inflate their balloon, KVM frees their ballooned pages, and the host's free memory increases. When the memory pressure is relieved, the balloons get a deflate command (a rough sketch follows this message).

> Obviously, your guests are so important that you don't want to stop them
> - so you simply "hotplug remove" memory from a guest that has a lot of
> free memory left - and start a new guest.
>
> When that new guest is no longer needed, and is stopped, you can
> "hotplug add" memory to the guest it was previously removed from.
>
> Guest's kernel would of course need to support memory hotplugging, too.
>
> Is it possible with KVM? If not, is such a feature planned?
>
> I noticed it was only mentioned once or twice on the list:
> http://article.gmane.org/gmane.comp.emulators.kvm.devel/712/
> "Well, the _interface_ supports removing, the implementation does not :)
> Everything was written in mind to allow memory hotplug."
>
> --
> Tomasz Chmielewski
> http://wpkg.org
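For illustration, the guest half of the balloon scheme described above might look like the sketch below; kvm_balloon_tell_host() is an invented stand-in for whatever guest-to-host channel the real driver uses:

/* Guest-side balloon sketch (kernel context, illustrative only). */
#include <linux/list.h>
#include <linux/gfp.h>
#include <linux/mm.h>

extern void kvm_balloon_tell_host(unsigned long pfn, int inflate); /* hypothetical */

static LIST_HEAD(ballooned_pages);

/* Inflate: pin guest pages and hand their frames back to the host.
 * From the guest's point of view the memory is allocated; the host is
 * free to reuse the backing frames. */
static int balloon_inflate(unsigned int npages)
{
	unsigned int i;

	for (i = 0; i < npages; i++) {
		struct page *page = alloc_page(GFP_HIGHUSER | __GFP_NORETRY);

		if (!page)
			return -ENOMEM;
		list_add(&page->lru, &ballooned_pages);
		kvm_balloon_tell_host(page_to_pfn(page), 1);
	}
	return 0;
}

/* Deflate: reclaim frames from the host and release the pages. */
static void balloon_deflate(unsigned int npages)
{
	while (npages-- && !list_empty(&ballooned_pages)) {
		struct page *page = list_first_entry(&ballooned_pages,
						     struct page, lru);

		list_del(&page->lru);
		kvm_balloon_tell_host(page_to_pfn(page), 0);
		__free_page(page);
	}
}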
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
Avi Kivity wrote:
> Gregory Haskins wrote:
>> Hi Dor,
>> Please find a patch attached for your review which adds support for
>> dynamic substitution of the PIC/APIC code to QEMU. This will allow us
>> to selectively choose the KVM in-kernel apic emulation vs the QEMU
>> user-space apic emulation. Support for both is key to allow
>> "--no-kvm" type operation to continue working even after the
>> in-kernel code is deployed.
>>
>> Note that this is only the part that allows indirection. The code
>> that actually fills in the "slim apic" in QEMU, as well as the kernel
>> side changes applied to git, are not included here. Note also that
>> this patch can stand alone. I have confirmed that I can boot a guest
>> with no discernible difference in behavior/performance both before
>> and after this patch. YMMV as my test cases are limited.
>>
>> Note that this patch only touches the KVM specific portions of QEMU
>> (namely, x86/i8259/apic support). If we decide that this should be
>> pushed to QEMU upstream, there is still work to be done to convert
>> the other PIC implementations (e.g. arm_pic, etc.) to use the new
>> interface.
>
> While the approach is technically better than #ifdefing things away,
> this patch would really cause us to drift too far from the qemu
> codebase, making the eventual merge back (and any intervening merges
> to kvm) much more difficult. We also have no guarantee that upstream
> qemu will accept this change.
>
> If we could get this accepted into upstream qemu, and then merge
> qemu-devel, that would resolve these issues.

The devices are already written to take a set_irq function. Instead of hijacking the emulated PIC device, I think it would be better if, in pc.c, we just conditionally created our PIC device that reflected to the hypervisor and passed the appropriate function to the emulated hardware. Otherwise, to support all the other architectures, there's going to be a lot of modifications.

Then again, are we really positive that we have to move the APIC into the kernel? A lot of things will get much more complicated.

Regards,

Anthony Liguori

> Anthony, you're our qemu expert. What's your opinion?
>
>> Thanks!
>>
>> -Greg
>>
>> Index: kvm/qemu/vl.h
>> ===================================================================
>> --- kvm.orig/qemu/vl.h
>> +++ kvm/qemu/vl.h
>> @@ -1040,16 +1040,11 @@ ParallelState *parallel_init(int base, i
>>
>>  /* i8259.c */
>>
>> -typedef struct PicState2 PicState2;
>> -extern PicState2 *isa_pic;
>> -void pic_set_irq(int irq, int level);
>> -void pic_set_irq_new(void *opaque, int irq, int level);
>> -PicState2 *pic_init(IRQRequestFunc *irq_request, void *irq_request_opaque);
>> -void pic_set_alt_irq_func(PicState2 *s, SetIRQFunc *alt_irq_func,
>> -                          void *alt_irq_opaque);
>> -int pic_read_irq(PicState2 *s);
>> -void pic_update_irq(PicState2 *s);
>> -uint32_t pic_intack_read(PicState2 *s);
>> +#include "pic.h"
>> +
>> +PIC *pic_init(IRQRequestFunc *irq_request, void *irq_request_opaque);
>> +void pic_set_alt_irq_func(PIC *pic, SetIRQFunc *alt_irq_func,
>> +                          void *alt_irq_opaque);
>>  void pic_info(void);
>>  void irq_info(void);
>>
>> @@ -1057,7 +1052,6 @@ void irq_info(void);
>>  typedef struct IOAPICState IOAPICState;
>>
>>  int apic_init(CPUState *env);
>> -int apic_get_interrupt(CPUState *env);
>>  IOAPICState *ioapic_init(void);
>>  void ioapic_set_irq(void *opaque, int vector, int level);
>>
>> Index: kvm/qemu/hw/i8259.c
>> ===================================================================
>> --- kvm.orig/qemu/hw/i8259.c
>> +++ kvm/qemu/hw/i8259.c
>> @@ -29,6 +29,8 @@
>>  //#define DEBUG_IRQ_LATENCY
>>  //#define DEBUG_IRQ_COUNT
>>
>> +typedef struct PicState2 PicState2;
>> +
>>  typedef struct PicState {
>>      uint8_t last_irr; /* edge detection */
>>      uint8_t irr; /* interrupt request register */
>> @@ -58,6 +60,7 @@ struct PicState2 {
>>      /* IOAPIC callback support */
>>      SetIRQFunc *alt_irq_func;
>>      void *alt_irq_opaque;
>> +    PIC *base;
>>  };
>>
>>  #if defined(DEBUG_PIC) || defined (DEBUG_IRQ_COUNT)
>> @@ -133,8 +136,9 @@ static int pic_get_irq(PicState *s)
>>  /* raise irq to CPU if necessary. must be called every time the active
>>     irq may change */
>>  /* XXX: should not export it, but it is needed for an APIC kludge */
>> -void pic_update_irq(PicState2 *s)
>> +static void i8259_update_irq(PIC *pic)
>>  {
>> +    PicState2 *s = (PicState2*)pic->private;
>>      int irq2, irq;
>>
>>      /* first look at slave pic */
>> @@ -174,9 +178,9 @@ void pic_update_irq(PicState2 *s)
>>  int64_t irq_time[16];
>>  #endif
>>
>> -void pic_set_irq_new(void *opaque, int irq, int level)
>> +static void i8259_set_irq(PIC *pic, int irq, int level)
>>  {
>> -    PicState2 *s = opaque;
>> +    PicState2 *s = (PicState2*)pic->private;
>>
>>  #if defi
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
Anthony Liguori wrote:
>
> Then again, are we really positive that we have to move the APIC into
> the kernel? A lot of things will get much more complicated.

The following arguments are in favor:
- allow in-kernel paravirt drivers to interrupt the guest without going through qemu (which involves a signal and some complexity)
- same for guest SMP IPI
- reduced overhead for a much-loved hardware component (especially on Windows, where one regularly sees 100K apic updates a second)

The strength of these arguments increases as vmexit overhead decreases with improving hardware.

Of course, pushing such a piece of misery into the kernel where it can cause much pain should not be done lightly.

-- error compiling committee.c: too many arguments to function
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
Avi Kivity wrote:
> Anthony Liguori wrote:
>> Then again, are we really positive that we have to move the APIC into
>> the kernel? A lot of things will get much more complicated.
>
> The following arguments are in favor:
> - allow in-kernel paravirt drivers to interrupt the guest without
>   going through qemu (which involves a signal and some complexity)
> - same for guest SMP IPI
> - reduced overhead for a much-loved hardware component (especially on
>   Windows, where one regularly sees 100K apic updates a second)

This is for the TPR, right? VT has special logic to handle TPR virtualization, doesn't it? I thought SVM did too...

> The strength of these arguments increases as vmexit overhead decreases
> with improving hardware.

Do you mean, the strength of these arguments decreases? :-)

> Of course, pushing such a piece of misery into the kernel where it can
> cause much pain should not be done lightly.

Right, so some arguments against:
- The cost to go to userspace is small compared to the cost of the exit itself
- Maintaining hardware emulation in two places is going to suck
- The need to have in-kernel backends for paravirt drivers should be carefully considered.

An in-kernel backend is certainly not necessary for disks. If we can't drive network traffic fast enough from userspace, perhaps we should consider improving the userspace interfaces. Originating disk IO from userspace is useful for supporting interesting storage formats. Network IO from userspace is also interesting to support things like slirp.

Regards,

Anthony Liguori
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
Anthony Liguori wrote:
> Avi Kivity wrote:
>> Anthony Liguori wrote:
>>> Then again, are we really positive that we have to move the APIC
>>> into the kernel? A lot of things will get much more complicated.
>>
>> The following arguments are in favor:
>> - allow in-kernel paravirt drivers to interrupt the guest without
>>   going through qemu (which involves a signal and some complexity)
>> - same for guest SMP IPI
>> - reduced overhead for a much-loved hardware component (especially on
>>   Windows, where one regularly sees 100K apic updates a second)
>
> This is for the TPR, right? VT has special logic to handle TPR
> virtualization, doesn't it? I thought SVM did too...

Yes, the TPR. Both VT and SVM virtualize CR8 in 64-bit mode. SVM also supports CR8 in 32-bit mode through a new instruction encoding, but nobody uses that to my knowledge. Maybe some brave soul can hack kvm to patch the new instruction in place of the mmio instruction Windows uses to bang on the tpr.

>> The strength of these arguments increases as vmexit overhead decreases
>> with improving hardware.
>
> Do you mean, the strength of these arguments decreases? :-)

Nope, increases. Currently vmexit time dominates the privilege switch time to user mode. If vmexit time is reduced to be on par with syscall time, you get a 50% saving (assuming tpr emulation is near-free, which it can easily be made to be).

>> Of course, pushing such a piece of misery into the kernel where it
>> can cause much pain should not be done lightly.
>
> Right, so some arguments against:
> - The cost to go to userspace is small compared to the cost of the
>   exit itself

True now. False tomorrow (for some value of tomorrow).

> - Maintaining hardware emulation in two places is going to suck

True now. Even truer tomorrow.

> - The need to have in-kernel backends for paravirt drivers should be
>   carefully considered.
>
> An in-kernel backend is certainly not necessary for disks.

Agree.

> If we can't drive network traffic fast enough from userspace, perhaps
> we should consider improving the userspace interfaces.

I'm open to suggestions. With an in-kernel backend, we can have copyless transmit and single-copy receive. With the current APIs, we can have single-copy transmit and 3-copy receive. I doubt we could improve it beyond (1, 2) without major hacking on non-kvm kernel interfaces.

> Originating disk IO from userspace is useful for supporting interesting
> storage formats. Network IO from userspace is also interesting to
> support things like slirp.

Right. Nobody's suggesting removing it, however.

-- error compiling committee.c: too many arguments to function
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
Avi Kivity wrote:
> Anthony Liguori wrote:
>> Avi Kivity wrote:
>>> Anthony Liguori wrote:
>>>> Then again, are we really positive that we have to move the APIC
>>>> into the kernel? A lot of things will get much more complicated.
>>>
>>> The following arguments are in favor:
>>> - allow in-kernel paravirt drivers to interrupt the guest without
>>>   going through qemu (which involves a signal and some complexity)
>>> - same for guest SMP IPI
>>> - reduced overhead for a much-loved hardware component (especially
>>>   on Windows, where one regularly sees 100K apic updates a second)
>>
>> This is for the TPR, right? VT has special logic to handle TPR
>> virtualization, doesn't it? I thought SVM did too...
>
> Yes, the TPR. Both VT and SVM virtualize CR8 in 64-bit mode. SVM
> also supports CR8 in 32-bit mode through a new instruction encoding,
> but nobody uses that to my knowledge. Maybe some brave soul can hack
> kvm to patch the new instruction in place of the mmio instruction
> Windows uses to bang on the tpr.

Actually VT has virtual TPR support that does not require CR8. We submitted a patch for Xen. Please see http://lists.xensource.com/archives/html/xen-devel/2007-03/msg00993.html The spec should be available soon. We are working on a patch for KVM.

Jun
---
Intel Open Source Technology Center
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
>>> This is for the TPR, right? VT has special logic to handle TPR
>>> virtualization, doesn't it? I thought SVM did too...
>>
>> Yes, the TPR. Both VT and SVM virtualize CR8 in 64-bit mode. SVM
>> also supports CR8 in 32-bit mode through a new instruction encoding,
>> but nobody uses that to my knowledge. Maybe some brave soul can hack
>> kvm to patch the new instruction in place of the mmio instruction
>> Windows uses to bang on the tpr.
>
> Actually VT has virtual TPR support that does not require CR8. We
> submitted a patch for Xen. Please see
> http://lists.xensource.com/archives/html/xen-devel/2007-03/msg00993.html
> The spec should be available soon. We are working on a patch for KVM.
>
> Jun

That's superb! Windows gives us a really bad time with all of those TPR accesses. Because of that we currently prefer using the non-acpi HAL.

On what version of the chip is this feature supported?

This pushes towards an in-kernel apic too. Can't see how we avoid it.
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
Avi Kivity wrote:
> Anthony Liguori wrote:
>> Avi Kivity wrote:
>>> Anthony Liguori wrote:
>>>> Then again, are we really positive that we have to move the APIC
>>>> into the kernel? A lot of things will get much more complicated.
>>>
>>> The following arguments are in favor:
>>> - allow in-kernel paravirt drivers to interrupt the guest without
>>>   going through qemu (which involves a signal and some complexity)
>>> - same for guest SMP IPI
>>> - reduced overhead for a much-loved hardware component (especially
>>>   on Windows, where one regularly sees 100K apic updates a second)
>>
>> This is for the TPR, right? VT has special logic to handle TPR
>> virtualization, doesn't it? I thought SVM did too...
>
> Yes, the TPR. Both VT and SVM virtualize CR8 in 64-bit mode. SVM
> also supports CR8 in 32-bit mode through a new instruction encoding,
> but nobody uses that to my knowledge. Maybe some brave soul can hack
> kvm to patch the new instruction in place of the mmio instruction
> Windows uses to bang on the tpr.

It seems like that shouldn't be too hard, assuming that the MMIO instructions are no longer than the new CR8 instruction. It would require knowing where the TPR is mapped into memory, of course.

If we do this, then we can probably just handle the TPR as a special case anyway and not bother returning to userspace when the TPR is updated through MMIO. That saves the round trip without adding emulation complexity.

>> If we can't drive network traffic fast enough from userspace, perhaps
>> we should consider improving the userspace interfaces.
>
> I'm open to suggestions. With an in-kernel backend, we can have
> copyless transmit and single-copy receive. With the current APIs, we
> can have single-copy transmit and 3-copy receive. I doubt we could
> improve it beyond (1, 2) without major hacking on non-kvm kernel
> interfaces.

I don't know enough about networking to speak intelligently here, so I'll just defer to the experts :-)

Regards,

Anthony Liguori

>> Originating disk IO from userspace is useful for supporting interesting
>> storage formats. Network IO from userspace is also interesting to
>> support things like slirp.
>
> Right. Nobody's suggesting removing it, however.
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
Dor Laor wrote:
>>>> This is for the TPR, right? VT has special logic to handle TPR
>>>> virtualization, doesn't it? I thought SVM did too...
>>>
>>> Yes, the TPR. Both VT and SVM virtualize CR8 in 64-bit mode. SVM
>>> also supports CR8 in 32-bit mode through a new instruction encoding,
>>> but nobody uses that to my knowledge. Maybe some brave soul can hack
>>> kvm to patch the new instruction in place of the mmio instruction
>>> Windows uses to bang on the tpr.
>>
>> Actually VT has virtual TPR support that does not require CR8. We
>> submitted a patch for Xen. Please see
>> http://lists.xensource.com/archives/html/xen-devel/2007-03/msg00993.html
>> The spec should be available soon. We are working on a patch for KVM.
>>
>> Jun
>
> That's superb! Windows gives us a really bad time with all of those TPR
> accesses. Because of that we currently prefer using the non-acpi HAL.
>
> On what version of the chip is this feature supported?
>
> This pushes towards an in-kernel apic too. Can't see how we avoid it.

Does it really? IIUC, we would avoid TPR traps entirely and would just need to synchronize the TPR whenever we go down to userspace.

Regards,

Anthony Liguori
Re: [kvm-devel] kvm-devel Digest, Vol 6, Issue 61
Casey,

On Tue, Apr 03, 2007 at 10:46:38PM -0400, Casey Jeffery wrote:
> Stephane,
>
> I'm glad you found this; I thought I was going to have to repost while
> actually remembering to change the subject line.

Someone else pointed me to your message. The title was indeed misleading.

> > On Wed, Mar 28, 2007 at 01:02:47PM -0400, Casey Jeffery wrote:
> >> I was messing around with using the perf counters a couple weeks ago
> >> as a way to get deterministic exits in the instruction stream of the
> >> guest. I used the h/w msr save/restore area to disable the counters
> >> and save the values on guest exit and restore them on entry. I also
> >> set up the LVT to deliver NMIs on overflow.
> >
> > You mean the host APIC LVT vector?
>
> Yes, FEE0 0340 in the host.

Yes, that is the one.

> >> This basically worked as expected, but I never got around the problem
> >> of inconsistent NMI delivery. A large majority of the time the NMI
> >> would be delivered in non-root mode and a vmexit would occur, as
> >> expected. Occasionally, though, the NMI is delivered in root mode. It
> >> seems if the overflow occurs near the time a vmexit occurs for some
> >> other reason, the NMI takes long enough to propagate that it's
> >> delivered in root mode.
> >
> > In my tests, I have set up perfmon to use a regular interrupt vector
> > (0xee). I have not yet played with NMI. This is in general more
> > difficult to handle, although, from Avi's comments, it looks like I
> > could have caught the interrupt more easily in KVM.
>
> I haven't tried anything other than NMIs and wasn't aware of the
> ack-on-exit bit, either. I've just put the call to the handler in the
> handle_exception() function in vmx.c. This is where I expect it to
> end up if the NMI occurs while in non-root mode.

With a regular interrupt, you end up in handle_external_interrupt() but only AFTER the host kernel has serviced the interrupt.

> > There may be some propagation delay, yet you, supposedly, do not
> > suffer from masked interrupt windows. Also something to watch out for
> > is that when you restore, you must make sure that the msrs' upper bits
> > are set to 1. Otherwise you may trigger involuntary interrupts.
>
> I'm not sure which msrs you're referring to. The only perfmon msr
> reserved bits I see in the documentation that need to be set to 1 are
> bits 16 and 17 of CCCR, but I've actually only been considering Core
> and Core 2 processors with the simplified perfmon design that has just
> IA32_PERFEVTSEL and IA32_PMC. These are the only two msrs I'm saving
> and restoring. I save them to the VM-exit guest-state area and clear
> them for root mode and then restore them from the guest-state area on
> resume.

Yes, I am talking about IA32_PMC0/1. You need to ensure that the upper bits (32-63) are always set to 1 if you want to get an interrupt. If I recall, I also had to fix up (force upper bits to 1) the VT-saved PMC0/PMC1 to avoid getting a spurious interrupt on VM-entry.

> Even with this, I get a fairly large number of NMIs that occur in
> root mode. Theoretically, the counter should be disabled when the exit
> occurs and overflows shouldn't occur. I think if the implementation is
> really that poor, it would be very difficult to use them for assessing
> guest performance. I'm guessing I'm just missing something and they do
> work better than that.

The counter is disabled only if you clear it. The hardware does not do this by default for you. For that you can use the VM-entry load MSR bitmap. In my current test code, I do just that.
You save PERFEVTSEL0 on VM-exit, load zero into it on exit, and restore it on VM-entry.

--
-Stephane
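A sketch of the arrangement just described, using the VMX MSR load/store areas; the vmcs_write*() helpers and VMCS field names follow kvm's vmx code of this era, but the exact layout here is an assumption, not tested code:

/* Save PERFEVTSEL0 on VM-exit, zero it for root mode, restore it on
 * VM-entry, all done by the hardware (kernel-context sketch only). */
#define MSR_P6_EVNTSEL0 0x186

struct vmx_msr_entry {
	u32 index;
	u32 reserved;
	u64 data;
};

static struct vmx_msr_entry exit_store[1], exit_load[1], entry_load[1];

static void setup_perfmon_msr_areas(u64 guest_evtsel)
{
	exit_store[0].index = MSR_P6_EVNTSEL0;	/* guest value saved here */
	exit_load[0].index  = MSR_P6_EVNTSEL0;
	exit_load[0].data   = 0;		/* stop counting in root mode */
	entry_load[0].index = MSR_P6_EVNTSEL0;
	entry_load[0].data  = guest_evtsel;	/* re-arm on VM-entry */

	vmcs_write64(VM_EXIT_MSR_STORE_ADDR, __pa(exit_store));
	vmcs_write32(VM_EXIT_MSR_STORE_COUNT, 1);
	vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(exit_load));
	vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, 1);
	vmcs_write64(VM_ENTRY_MSR_LOAD_ADDR, __pa(entry_load));
	vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, 1);
}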
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
Anthony Liguori wrote:
>> Maybe some brave soul can hack kvm to patch the new instruction in
>> place of the mmio instruction Windows uses to bang on the tpr.
>
> It seems like that shouldn't be too hard, assuming that the MMIO
> instructions are no longer than the new CR8 instruction. It would
> require knowing where the TPR is mapped into memory, of course.

Well, we know the physical address (some msr) and the virtual mapping. But we must be sure that the instruction is only used for setting the tpr, and not other registers.

Er, thinking a bit more, cr8 is just 4 bits (and no, not the least significant) out of the 8-bit tpr, so it doesn't work without serious hackery.

> If we do this, then we can probably just handle the TPR as a special
> case anyway and not bother returning to userspace when the TPR is
> updated through MMIO. That saves the round trip without adding
> emulation complexity.

That means the emulation is split among user space and kernel. Not nice. One of the advantages of moving the entire thing is that it is at least clearly defined.

-- error compiling committee.c: too many arguments to function
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
Anthony Liguori wrote:
>> This pushes towards an in-kernel apic too. Can't see how we avoid it.
>
> Does it really? IIUC, we would avoid TPR traps entirely and would
> just need to synchronize the TPR whenever we go down to userspace.

It's a bit more complex than that, as userspace would need to tell the kernel the highest priority pending interrupt so that it can program the hardware to exit when an interrupt is ready. However, I agree with you that in principle we could split the apic emulation between kvm and qemu, even with this featurette.

-- error compiling committee.c: too many arguments to function
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
Avi Kivity wrote:
> Anthony Liguori wrote:
>>> Maybe some brave soul can hack kvm to patch the new instruction in
>>> place of the mmio instruction Windows uses to bang on the tpr.
>>
>> It seems like that shouldn't be too hard, assuming that the MMIO
>> instructions are no longer than the new CR8 instruction. It would
>> require knowing where the TPR is mapped into memory, of course.
>
> Well, we know the physical address (some msr) and the virtual
> mapping. But we must be sure that the instruction is only used for
> setting the tpr, and not other registers.
>
> Er, thinking a bit more, cr8 is just 4 bits (and no, not the least
> significant) out of the 8-bit tpr, so it doesn't work without serious
> hackery.

Hrm, so this is not nearly as straightforward as I initially hoped :-/

>> If we do this, then we can probably just handle the TPR as a special
>> case anyway and not bother returning to userspace when the TPR is
>> updated through MMIO. That saves the round trip without adding
>> emulation complexity.
>
> That means the emulation is split among user space and kernel. Not
> nice. One of the advantages of moving the entire thing is that it is
> at least clearly defined.

It still exists in userspace. Having the code duplication (especially when it's not the same code base) is unfortunate. Plus, it complicates save/restore/migration since now some device state is in the kernel. It further complicates things if you want to make sure that KVM saved images are loadable in QEMU (you have to make sure that the device state is identical for the kernel and userspace). Special casing the TPR emulation seems like the lesser of two evils to me.

Regards,

Anthony Liguori
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
Anthony Liguori wrote:
>>> If we do this, then we can probably just handle the TPR as a special
>>> case anyway and not bother returning to userspace when the TPR is
>>> updated through MMIO. That saves the round trip without adding
>>> emulation complexity.
>>
>> That means the emulation is split among user space and kernel. Not
>> nice. One of the advantages of moving the entire thing is that it is
>> at least clearly defined.
>
> It still exists in userspace. Having the code duplication (especially
> when it's not the same code base) is unfortunate.

This remains true.

> Plus, it complicates save/restore/migration since now some device
> state is in the kernel. It further complicates things if you want to
> make sure that KVM saved images are loadable in QEMU (you have to make
> sure that the device state is identical for the kernel and userspace).

You'd just load the kernel state into qemu state, like we do with the registers, and use the regular qemu save. You could turn kernel apic emulation on and off during runtime :)

> Special casing the TPR emulation seems like the lesser of two evils to
> me.

It's not just the tpr.

-- error compiling committee.c: too many arguments to function
Re: [kvm-devel] kvm-devel Digest, Vol 6, Issue 61
Stephane,

> > > There may be some propagation delay, yet you, supposedly, do not
> > > suffer from masked interrupt windows. Also something to watch out
> > > for is that when you restore, you must make sure that the msrs'
> > > upper bits are set to 1. Otherwise you may trigger involuntary
> > > interrupts.
> >
> > I'm not sure which msrs you're referring to. The only perfmon msr
> > reserved bits I see in the documentation that need to be set to 1 are
> > bits 16 and 17 of CCCR, but I've actually only been considering Core
> > and Core 2 processors with the simplified perfmon design that has just
> > IA32_PERFEVTSEL and IA32_PMC. These are the only two msrs I'm saving
> > and restoring. I save them to the VM-exit guest-state area and clear
> > them for root mode and then restore them from the guest-state area on
> > resume.
>
> Yes, I am talking about IA32_PMC0/1. You need to ensure that the upper
> bits (32-63) are always set to 1 if you want to get an interrupt.
> If I recall, I also had to fix up (force upper bits to 1) the VT-saved
> PMC0/PMC1 to avoid getting a spurious interrupt on VM-entry.

I assume you mean bits 40-63 since the counters are 40-bit, although I'm still not sure about forcing them to 1. The steps I take are to set a 64-bit variable to a negative value of one greater than I want to trigger on. For example, if I want to trigger on 100 events, I set a 64-bit variable to -99. Before I do a VM resume, I set the VM-entry load bitmap to that value (as well as program the PERFEVTSEL entry load bitmap to enable counting and PMI). I also have the exit store bitmap clear the PERFEVTSEL to stop counting on exit.

If I then look at the PMC value from the save area after the next exit, however, I find the top 24 bits are cleared (even if it didn't overflow yet). Based on this, I assumed it was because the counter is 40-bit, as it was in the Netburst architecture. I believe if I forced the top bits back to one, they would just be cleared again when the h/w copied the value into the actual 40-bit msr (see the sketch after this message).

I'm interested to find out if you are seeing the same thing with regards to NMIs in root mode, though. If you already have all this working, all you should have to do is program the LVT to deliver an NMI and stop the KVM handler from making the INT2 call in its handler. Then, if you are seeing NMIs in root mode, you'll get lots of "Dazed and confused" messages from the kernel.

> > Even with this, I get a fairly large number of NMIs that occur in
> > root mode. Theoretically, the counter should be disabled when the exit
> > occurs and overflows shouldn't occur. I think if the implementation is
> > really that poor, it would be very difficult to use them for assessing
> > guest performance. I'm guessing I'm just missing something and they do
> > work better than that.
>
> The counter is disabled only if you clear it. The hardware does not
> do this by default for you. For that you can use the VM-entry load MSR
> bitmap. In my current test code, I do just that. You save on VM-exit,
> load zero into PERFEVTSEL0 on VM-exit and on VM-entry you restore
> PERFEVTSEL0.
>
> --
> -Stephane

Thanks,
Casey
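To make the 40-bit arithmetic in this exchange concrete, a small self-contained sketch; the exact arming convention (-N vs. -(N-1)) follows Casey's description, and whether the sign-extension fixup is needed is precisely what is being debated:

#include <stdint.h>

#define PMC_WIDTH 40
#define PMC_MASK  ((1ULL << PMC_WIDTH) - 1)

/* Value to load so the counter overflows (raising a PMI) after roughly
 * 'period' events: a negative count, truncated to the 40-bit width. */
static inline uint64_t pmc_arm_value(uint64_t period)
{
	return (uint64_t)(-(int64_t)period) & PMC_MASK;
}

/* A counter value read back from the save area has bits 40-63 clear.
 * Forcing them back to 1 (sign-extending) before restoring it is the
 * fixup Stephane describes to avoid a spurious interrupt on VM-entry. */
static inline uint64_t pmc_sign_extend(uint64_t raw)
{
	if (raw & (1ULL << (PMC_WIDTH - 1)))
		raw |= ~PMC_MASK;
	return raw;
}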
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
Avi Kivity wrote:
> Anthony Liguori wrote:
>>> This pushes towards an in-kernel apic too. Can't see how we avoid it.
>>
>> Does it really? IIUC, we would avoid TPR traps entirely and would
>> just need to synchronize the TPR whenever we go down to userspace.
>
> It's a bit more complex than that, as userspace would need to tell the
> kernel the highest priority pending interrupt so that it can program
> the hardware to exit when an interrupt is ready. However, I agree
> with you that in principle we could split the apic emulation between
> kvm and qemu, even with this featurette.

Most H/W-virtualization-capable processors out there don't support that feature today. I think the decision (kvm or qemu) should be made based on performance data. I'm not worried about maintenance issues; the APIC code is not expected to change frequently. I'm a bit worried about the extra complexity caused by such a split, though.

BTW, I see the CPU utilization of qemu is almost always 99% in the top command when I run a kernel build in an x86-64 Linux guest.

Jun
---
Intel Open Source Technology Center
Re: [kvm-devel] KVM in-kernel APIC update
>>> On Wed, Apr 4, 2007 at 3:40 AM, in message <[EMAIL PROTECTED]>, Avi Kivity <[EMAIL PROTECTED]> wrote:
>
> I would avoid moving down anything that's not strictly necessary.

Agreed.

> I still don't have an opinion as to whether it is necessary; I'll need
> to study the details. Xen pushes most of the platform into the
> hypervisor, but then Xen's communication path to qemu is much more
> expensive (involving the scheduler and a potential cpu switch) than
> kvm. We'd need to balance possible performance improvements (I'd expect
> negligible) and interface simplification (possibly non-negligible)
> against further diverging from qemu.

I am not going to advocate one way or the other here, but I will just lay out how I saw this "full kernel emulation" happening so we are on the same page.

What I found was that the entire QEMU interrupt system converges on "pic_send_irq()" regardless of whether it's a PCI or legacy ISA device. This function also acts as a pin-forwarding mechanism, sending the IRQ event to both the 8259s and IOAPIC #0. The irq that is dispatched by pic_send_irq() is still a raw "pin IRQ" reference, and must be translated into an actual x86 vector by the 8259/IOAPIC (which is governed by how the BIOS/OS programs them). The action of raising the interrupt ends up synchronously calling cpu_interrupt(), which currently seems to be a no-op for KVM. Later, in the pre_run stage, the system retrieves a pending vector (this is when the irq/vector translation occurs) and injects it to the kernel with the INTERRUPT ioctl.

What I was planning on doing was using that QEMU patch I provided to intercept all pic_send_irq() calls and forward them directly to the kernel via a new ioctl() (see the sketch at the end of this message). This ioctl would be directed at the VM fd, not the VCPU, since it's a pure ISA global pin reference and won't know the targeted vcpu until the 8259/IOAPIC perform their translation.

So that being said, I think the interface between userspace and kernel would be no more complex than it is today, i.e. we would just be passing a single int via an ioctl. The danger, as you have pointed out, is accepting the QEMU patch that I submitted, which potentially diverges the pic code from QEMU upstream. What this buys us, however, is that any code (kernel or userspace) would be able to inject an ISA interrupt. Using an ISA interrupt has the advantage of already being represented in the ACPI/MP tables presented by Bochs, and thus any compliant OS will automatically route the IRQ.

Where things do get potentially complicated with the interface is if we go with a hybrid solution. Leaving the LAPIC in the kernel, but the IOAPIC/8259 in userspace, requires a much wider interface so the IOAPIC can deliver APIC style messages to the kernel. Also, IOAPIC EOI for level sensitive interrupts becomes more difficult and complex. Putting LAPIC + IOAPIC#0 in the kernel and leaving the 8259 outside might actually work fairly well, but in-kernel ISA interrupts will only work with OSes which enable the APIC. This may or may not be an issue and may be an acceptable tradeoff.

> There are two extreme models, which I think are both needed. On one
> end, support for closed OSes (e.g. Windows) requires fairly strict
> conformance to the PCI model, which means going through the IOAPIC or
> PIC or however the interrupt lines are wired in qemu. This seems to
> indicate that an in-kernel IOAPIC is needed.
> On the other end (Linux), a legacy-free and emulation-free device can
> just inject interrupts directly and use shared memory to ack
> interrupts and indicate their source.

Well, sort of. The problem as I see it in both cases is still IRQ routing. If you knew of a free vector to take (regardless of OS) you could forgo the ACPI/MP declarations altogether and register the vector with your "device" (e.g. via a hypercall, etc.) and have the device issue direct LAPIC messages with the proper vector. I think this would work assuming the device used edge triggered interrupts (which don't require IOAPIC IRR registration or EOI).

>> If the latter, we also need to decide what the resource conveyance
>> model and vector allocation policy should be. For instance, do we
>> publish said resources formally in the MP/ACPI tables in Bochs? Doing
>> so would allow MP/ACPI compliant OSes like Linux to naturally route
>> the IRQ. Conversely, do we do something more direct, just like we do
>> for KVM discovery via wrmsr?
>
> I think we can go the direct route for cooperative guests.

As long as there is a way for the guest code to know about a free vector to use, I think this should work just fine.

> I also suggest doing the work in stages and measuring; that is, first
> push the local apic, determine what the remaining bottlenecks are and
> tackle them. I'm pretty sure that Linux guests would only require the
> local apic but Windows (and older Linux kernels) might require more.
>
>> struct
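The pin-forwarding ioctl sketched in this message could look something like the following on the QEMU side; KVM_ISA_IRQ_LINE and struct kvm_isa_irq are invented names for illustration, not an existing interface:

/* Userspace (QEMU) side of the hypothetical pin-forwarding ioctl. */
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>		/* KVMIO */
#include <linux/types.h>

struct kvm_isa_irq {
	__u32 irq;		/* raw ISA/IOAPIC pin number, 0-23 */
	__u32 level;		/* 1 = assert, 0 = deassert */
};

#define KVM_ISA_IRQ_LINE _IOW(KVMIO, 0x60, struct kvm_isa_irq)	/* invented */

static void kvm_pic_send_irq(int vm_fd, int irq, int level)
{
	struct kvm_isa_irq data = { .irq = irq, .level = level };

	/* Issued against the VM fd, not a VCPU fd: the in-kernel
	 * 8259/IOAPIC decide which vcpu ends up taking the vector. */
	if (ioctl(vm_fd, KVM_ISA_IRQ_LINE, &data) < 0)
		perror("KVM_ISA_IRQ_LINE");
}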
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
>>>> If we do this, then we can probably just handle the TPR as a special
>>>> case anyway and not bother returning to userspace when the TPR is
>>>> updated through MMIO. That saves the round trip without adding
>>>> emulation complexity.
>>>
>>> That means the emulation is split among user space and kernel. Not
>>> nice. One of the advantages of moving the entire thing is that it is
>>> at least clearly defined.
>>
>> It still exists in userspace. Having the code duplication (especially
>> when it's not the same code base) is unfortunate. Plus, it complicates
>> save/restore/migration since now some device state is in the kernel. It
>> further complicates things if you want to make sure that KVM saved
>> images are loadable in QEMU (you have to make sure that the device
>> state is identical for the kernel and userspace). Special casing the
>> TPR emulation seems like the lesser of two evils to me.

There are still some nasty issues caused by running the apic in qemu:
- If you have a PV driver that issued an irq and now it needs another one, you cannot really be sure if the first one was injected. You can hold extra state and follow irq injection from qemu, but this is ugly.
- You have to keep the tpr and irq injection in sync: suppose the apic needs to inject an irq; it pops it from the irr and tries to inject it. If this is done while the VM's interrupt window is closed or irqs are disabled, it will not happen instantly. When the irq is really injected, the tpr might be different, causing a Windows irq-not-less-or-equal BSOD. This is especially true for injecting several irqs at once.

Keeping the apic in the kernel simplifies this, with the cost of maintaining an apic/pic implementation.

Do you know why the Xen guys chose to implement it in Xen? Why didn't they rip the Qemu implementation?

> Regards,
>
> Anthony Liguori
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
>>>> This pushes towards an in-kernel apic too. Can't see how we avoid it.
>>>
>>> Does it really? IIUC, we would avoid TPR traps entirely and would
>>> just need to synchronize the TPR whenever we go down to userspace.
>>
>> It's a bit more complex than that, as userspace would need to tell the
>> kernel the highest priority pending interrupt so that it can program
>> the hardware to exit when an interrupt is ready. However, I agree
>> with you that in principle we could split the apic emulation between
>> kvm and qemu, even with this featurette.
>
> Most H/W-virtualization-capable processors out there don't support
> that feature today. I think the decision (kvm or qemu) should be made
> based on performance data. I'm not worried about maintenance issues;
> the APIC code is not expected to change frequently. I'm a bit worried
> about the extra complexity caused by such a split, though.

I had an in-kernel-apic implementation for KVM and the performance improvement was insignificant. It is mainly a software engineering thing. PV drivers can benefit from an APIC in the kernel, but again just a mere improvement.

> BTW, I see the CPU utilization of qemu is almost always 99% in the top
> command when I run a kernel build in an x86-64 Linux guest.

What does qemu do? An idle guest hardly consumes cpu, especially KVM powered.

> Jun
> ---
> Intel Open Source Technology Center
Re: [kvm-devel] kvm-18 breaks Cisco VPN on WinXP SP1
Leslie Mann wrote:
>> I'll prepare the first patch. Can you ensure that your upgraded setup
>> still works with kvm-17.
>
> It does, as I use it daily in order to run a Win app that I need.

Please test the attached patch, against kvm-17. This is subversion revision 4546 and git commit c01571ed56754dfea458cc37d553c360082411a1.

-- error compiling committee.c: too many arguments to function

diff -ur kvm-17/kernel/include/linux/kvm.h kvm/kernel/include/linux/kvm.h
--- kvm-17/kernel/include/linux/kvm.h	2007-03-20 15:12:42.0 +0200
+++ kvm/kernel/include/linux/kvm.h	2007-04-04 19:19:44.0 +0300
@@ -11,7 +11,7 @@
 #include <asm/types.h>
 #include <linux/ioctl.h>
 
-#define KVM_API_VERSION 4
+#define KVM_API_VERSION 9
 
 /*
  * Architectural interrupt line count, and the size of the bitmap needed
@@ -34,36 +34,33 @@
 #define KVM_MEM_LOG_DIRTY_PAGES 1UL
 
-#define KVM_EXIT_TYPE_FAIL_ENTRY 1
-#define KVM_EXIT_TYPE_VM_EXIT    2
-
 enum kvm_exit_reason {
 	KVM_EXIT_UNKNOWN          = 0,
 	KVM_EXIT_EXCEPTION        = 1,
 	KVM_EXIT_IO               = 2,
-	KVM_EXIT_CPUID            = 3,
+	KVM_EXIT_HYPERCALL        = 3,
 	KVM_EXIT_DEBUG            = 4,
 	KVM_EXIT_HLT              = 5,
 	KVM_EXIT_MMIO             = 6,
 	KVM_EXIT_IRQ_WINDOW_OPEN  = 7,
 	KVM_EXIT_SHUTDOWN         = 8,
+	KVM_EXIT_FAIL_ENTRY       = 9,
+	KVM_EXIT_INTR             = 10,
 };
 
-/* for KVM_RUN */
+/* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
 struct kvm_run {
 	/* in */
-	__u32 emulated;       /* skip current instruction */
-	__u32 mmio_completed; /* mmio request completed */
+	__u32 io_completed;   /* mmio/pio request completed */
 	__u8 request_interrupt_window;
-	__u8 padding1[7];
+	__u8 padding1[3];
 
 	/* out */
-	__u32 exit_type;
 	__u32 exit_reason;
 	__u32 instruction_length;
 	__u8 ready_for_interrupt_injection;
 	__u8 if_flag;
-	__u16 padding2;
+	__u8 padding2[6];
 
 	/* in (pre_kvm_run), out (post_kvm_run) */
 	__u64 cr8;
@@ -72,29 +69,26 @@
 	union {
 		/* KVM_EXIT_UNKNOWN */
 		struct {
-			__u32 hardware_exit_reason;
+			__u64 hardware_exit_reason;
 		} hw;
+		/* KVM_EXIT_FAIL_ENTRY */
+		struct {
+			__u64 hardware_entry_failure_reason;
+		} fail_entry;
 		/* KVM_EXIT_EXCEPTION */
 		struct {
 			__u32 exception;
 			__u32 error_code;
 		} ex;
 		/* KVM_EXIT_IO */
-		struct {
+		struct kvm_io {
 #define KVM_EXIT_IO_IN  0
 #define KVM_EXIT_IO_OUT 1
 			__u8 direction;
 			__u8 size; /* bytes */
-			__u8 string;
-			__u8 string_down;
-			__u8 rep;
-			__u8 pad;
 			__u16 port;
-			__u64 count;
-			union {
-				__u64 address;
-				__u32 value;
-			};
+			__u32 count;
+			__u64 data_offset; /* relative to kvm_run start */
 		} io;
 		struct {
 		} debug;
@@ -105,6 +99,13 @@
 			__u32 len;
 			__u8  is_write;
 		} mmio;
+		/* KVM_EXIT_HYPERCALL */
+		struct {
+			__u64 args[6];
+			__u64 ret;
+			__u32 longmode;
+			__u32 pad;
+		} hypercall;
 	};
 };
@@ -210,39 +211,72 @@
 	};
 };
 
+struct kvm_cpuid_entry {
+	__u32 function;
+	__u32 eax;
+	__u32 ebx;
+	__u32 ecx;
+	__u32 edx;
+	__u32 padding;
+};
+
+/* for KVM_SET_CPUID */
+struct kvm_cpuid {
+	__u32 nent;
+	__u32 padding;
+	struct kvm_cpuid_entry entries[0];
+};
+
+/* for KVM_SET_SIGNAL_MASK */
+struct kvm_signal_mask {
+	__u32 len;
+	__u8  sigset[0];
+};
+
 #define KVMIO 0xAE
 
 /*
  * ioctls for /dev/kvm fds:
  */
-#define KVM_GET_API_VERSION    _IO(KVMIO, 1)
-#define KVM_CREATE_VM          _IO(KVMIO, 2) /* returns a VM fd */
-#define KVM_GET_MSR_INDEX_LIST _IOWR(KVMIO, 15, struct kvm_msr_list)
+#define KVM_GET_API_VERSION    _IO(KVMIO, 0x00)
+#define KVM_CREATE_VM          _IO(KVMIO, 0x01) /* returns a VM fd */
+#define KVM_GET_MSR_INDEX_LIST _IOWR(KVMIO, 0x02, struct kvm_msr_list)
+/*
+ * Check if a kvm extension is available. Argument is extension number,
+ * return is 1 (yes) or 0 (no, sorry).
+ */
+#define KVM_CHECK_EXTENSION    _IO(KVMIO, 0x03)
+/*
+ * Get size for mmap(vcpu_fd)
+ */
+#define KVM_GET_VCPU_MMAP_SIZE _IO(KVMIO, 0x04) /* in bytes */
 
 /*
  * ioctls for VM fds
  */
-#define KVM_SET_MEMORY_REGION  _IOW(KVMIO, 10, struct kvm_memory_region)
+#define KVM_SET_MEMORY_REGION  _IOW(KVMIO, 0x40, struct kvm_memory_region)
 /*
  * KVM_CREATE_VCPU receives as a parameter the vcpu slot, and returns
  * a vcpu fd.
 */
-#define KVM_CREATE_VCPU        _IOW(KVMIO, 11, int)
-#define KVM_GET_DIRTY_LOG      _IOW(KVMIO, 12, struct kvm_dirty_log)
-#define KVM_GET_MEM_MAP        _IOW(KVMIO, 16, struct kvm_dirty_log)
+#define KVM_CREATE_VCPU        _IO(KVMIO, 0x41)
+#define KVM_GET_DIRTY_LOG      _IOW(KVMIO, 0x42, struct kvm_dirty_log)
+#define KVM_GET_MEM_MAP        _IOW(KVMIO, 0x43, struct kvm_dirty_log)
 
 /*
  * ioctls for vcpu fds
 */
-#define KVM_RUN                _IOWR(KVMIO, 2, struct kvm_run)
-#define KVM_GET_REGS           _IOR(KVMIO, 3, struct kvm_regs)
-#define KVM_SET_REGS           _IOW(KVMIO, 4, struct kvm_regs)
-#define KVM_GET_SREGS          _IOR(KVMIO, 5, struct kvm_sregs)
-#define KVM_SET_SREG
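As a usage note, the data_offset convention introduced in the patch above is how userspace consumes PIO exits from the mmap'ed kvm_run area; a rough loop (error handling elided, handle_pio() hypothetical):

#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

extern void handle_pio(int dir, int port, int size, int count,
		       __u8 *data);	/* hypothetical */

void run_vcpu(int kvm_fd, int vcpu_fd)
{
	long mmap_size = ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE, 0);
	struct kvm_run *run = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
				   MAP_SHARED, vcpu_fd, 0);

	for (;;) {
		ioctl(vcpu_fd, KVM_RUN, 0);
		switch (run->exit_reason) {
		case KVM_EXIT_IO: {
			/* the data lives inside the shared area, at
			 * io.data_offset from the start of kvm_run */
			__u8 *data = (__u8 *)run + run->io.data_offset;

			handle_pio(run->io.direction, run->io.port,
				   run->io.size, run->io.count, data);
			break;
		}
		case KVM_EXIT_HLT:
			return;
		}
	}
}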
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
Nakajima, Jun wrote:
> Most H/W-virtualization-capable processors out there don't support
> that feature today. I think the decision (kvm or qemu) should be made
> based on performance data. I'm not worried about maintenance issues;
> the APIC code is not expected to change frequently. I'm a bit worried
> about the extra complexity caused by such a split, though.

In principle we could measure the performance cost today with the pv-net driver; however, it still does a lot of copies which could be eliminated.

> BTW, I see the CPU utilization of qemu is almost always 99% in the top
> command when I run a kernel build in an x86-64 Linux guest.

Isn't that expected? If your guest image is mostly cached in the host, the guest would have nothing to block on.

-- error compiling committee.c: too many arguments to function
[kvm-devel] Recursive virtualization
I swear this has been brought up before in this forum, but I can't find it. I'm curious what the virtualization gurus in this forum think of the possibilities for recursive virtualization. I know vbox claims to support it, but I haven't come across many details on how they do it, and I don't think they really use the hvm hardware. Is it something that should be possible without an "enlightened" guest hypervisor and by basically just virtualizing the VMCS/VMCB structures?
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
Dor Laor wrote:
>>>>> This pushes towards in kernel apic too. Can't see how we avoid it.
>>>>
>>>> Does it really? IIUC, we would avoid TPR traps entirely and would just
>>>> need to synchronize the TPR whenever we go down to userspace.
>>> It's a bit more complex than that, as userspace would need to tell the
>>> kernel the highest priority pending interrupt so that it can program
>>> the hardware to exit when an interrupt is ready. However I agree
>>> with you that in principle we could split the apic emulation between
>>> kvm and qemu, even with this featurette.
>> Most of H/W-virtualization capable processors out there don't support
>> that feature today. I think the decision (kvm or qemu) should be done
>> based on performance data. I'm not worried about maintenance issues; the
>> APIC code is not expected to change frequently. I'm a bit worried about
>> extra complexity caused by such a split, though.
>
> I had an in-kernel-apic implementation for KVM and the performance
> improvement was insignificant. It is mainly a software engineering
> thing. PV drivers can benefit from an APIC in the kernel, but again just
> a mere improvement.
>
>> BTW, I see CPU utilization of qemu is almost always 99% in the top
>> command when I run a kernel build in an x86-64 Linux guest.
>
> What does qemu do? An idle guest hardly consumes cpu, especially KVM
> powered.

qemu would be 99% even if all the time is being spent in the guest context.

If the user time is high, an oprofile run would be pretty useful. I've found that the VGA drawing routines can be pretty expensive.

Regards,

Anthony Liguori

>> Jun
>> ---
>> Intel Open Source Technology Center
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
Dor Laor wrote:
>>>> If we do this, then we can probably just handle the TPR as a special
>>>> case anyway and not bother returning to userspace when the TPR is
>>>> updated through MMIO. That saves the round trip without adding
>>>> emulation complexity.
>>> That means the emulation is split among user space and kernel. Not
>>> nice. One of the advantages of moving the entire thing is that it is
>>> at least clearly defined.
>> It still exists in userspace. Having the code duplication (especially
>> when it's not the same code base) is unfortunate. Plus, it complicates
>> save/restore/migration since now some device state is in the kernel. It
>> further complicates things if you want to make sure that KVM saved
>> images are loadable in QEMU (you have to make sure that the device
>> state is identical for the kernel and userspace). Special casing the
>> TPR emulation seems like the lesser of two evils to me.
>
> There are still some nasty issues caused by running the apic in qemu:
> - If you have a PV driver that issued an irq and now it needs another
>   one, you cannot really be sure if the first one was injected. You can
>   hold extra state and follow irq injection from qemu, but this is ugly.
> - You have to keep the tpr and irq injection in sync: suppose the apic
>   needs to inject an irq; it pops it from the irr and tries to inject
>   it. If this is done while the VM has the interrupt window closed or
>   irqs disabled, it will not happen instantly. When the irq is really
>   injected, the tpr might be different, causing a Windows
>   IRQL_NOT_LESS_OR_EQUAL BSOD. This is especially true for injecting
>   several irqs at once.
>
> Keeping the apic in the kernel simplifies this, at the cost of
> maintaining an apic/pic implementation.

Hrm, this is definitely starting to sound like a PITA to deal with. Maybe in-kernel platform devices are unavoidable :-/

> Do you know why the Xen guys chose to implement it in Xen? Why didn't
> they just reuse the Qemu implementation?

I believe it's based on the QEMU implementation although it's evolved quite a bit. Jun can probably provide a better answer.

Regards,

Anthony Liguori
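A self-contained toy model of the ordering problem Dor describes: the IRR pop and the injection have to be one decision, made against the current TPR at the moment the interrupt window is open. All names are invented for the example.

/*
 * Toy sketch of the race above.  Popping the IRR early and injecting
 * later risks delivering a vector the guest has since masked by raising
 * its task priority (Windows reports that as IRQL_NOT_LESS_OR_EQUAL).
 */
struct toy_apic {
	unsigned int irr;   /* one bit per pending priority class (toy-sized) */
	unsigned int tpr;   /* current task priority class */
	int window_open;    /* guest can accept an interrupt right now */
};

static int try_inject(struct toy_apic *apic, void (*deliver)(int vec))
{
	if (!apic->irr || !apic->window_open)
		return 0;                          /* retry on the next exit */

	int vec = 31 - __builtin_clz(apic->irr);   /* highest pending */
	if ((unsigned int)vec <= apic->tpr)
		return 0;                          /* masked by current TPR */

	apic->irr &= ~(1u << vec);                 /* pop only at injection */
	deliver(vec);
	return 1;
}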
Re: [kvm-devel] Recursive virtualization
> I swear this has been brought up before in this forum, but I can't
> find it. I'm curious what the virtualization gurus in this forum think
> of the possibilities for recursive virtualization. I know vbox claims
> to support it, but I haven't come across many details on how they do
> it and I don't think they really use the hvm hardware. Is it something
> that should be possible without an "enlightened" guest hypervisor and
> by basically just virtualizing the VMCS/VMCB structures?

We have an open todo task for it. It is 'just' emulating the VMCS/VMCB structures and commands.
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
Avi Kivity wrote:
> Nakajima, Jun wrote:
>> Most of H/W-virtualization capable processors out there don't support
>> that feature today. I think the decision (kvm or qemu) should be done
>> based on performance data. I'm not worried about maintenance issues;
>> the APIC code is not expected to change frequently. I'm a bit
>> worried about extra complexity caused by such a split, though.
>
> In principle we could measure the performance cost today with the
> pv-net driver; however it still does a lot of copies which could be
> eliminated.
>
>> BTW, I see CPU utilization of qemu is almost always 99% in the top
>> command when I run a kernel build in an x86-64 Linux guest.
>
> Isn't that expected? If your guest image is mostly cached in the host,
> the guest would have nothing to block on.

I compared the performance on Xen and KVM for a kernel build using the same guest image. Looks like KVM (kvm-17) was three times slower as far as we tested, and that high load of qemu was one of the symptoms. We are looking at the shadow code, but the load of qemu looks very high. I remember we had similar problems in Xen before, but those were fixed. Someone should take a look at the qemu side.

Jun
---
Intel Open Source Technology Center
Re: [kvm-devel] KVM in-kernel APIC update
Gregory Haskins wrote:
> What I was planning on doing was using that QEMU patch I provided to
> intercept all pic_send_irq() calls and forward them directly to the
> kernel via a new ioctl(). This ioctl would be directed at the VM fd, not
> the VCPU, since it's a pure ISA global pin reference and won't know the
> targeted vcpu until the 8259/IOAPIC perform their translation.

Hmm. If the ioapic is in the kernel, then it's a platform-wide resource and you would need a vm ioctl. If ioapic emulation is in userspace, then the ioapic logic will have decided which cpu is targeted and you would issue a vcpu ioctl.

> So that being said, I think the interface between userspace and kernel
> would be no more complex than it is today. I.e. we would just be passing
> a single int via an ioctl.

I think the interface should mirror the hardware interface at the point of the "cut". For example, if we keep the ioapic in userspace, the interface is ioapic/apic bus messages. If we push the ioapic into the kernel, the interface describes the various ioapic pins and how the ioapics are connected to each other and to the processors (i.e. the topology).

> The danger, as you have pointed out, is accepting the QEMU patch that I
> submitted, which potentially diverges the pic code from QEMU upstream.
> What this buys us, however, is that any code (kernel or userspace) would
> be able to inject an ISA interrupt. Using an ISA interrupt has the
> advantage of already being represented in the ACPI/MP tables presented
> by Bochs, and thus any compliant OS will automatically route the IRQ.
>
> Where things do get potentially complicated with the interface is if we
> go with a hybrid solution. Leaving the LAPIC in the kernel, but the
> IOAPIC/8259 in userspace, requires a much wider interface so the IOAPIC
> can deliver APIC style messages to the kernel. Also, IOAPIC EOI for
> level sensitive interrupts becomes more difficult and complex. Putting
> LAPIC + IOAPIC#0 in the kernel and leaving the 8259 outside might
> actually work fairly well, but in-kernel ISA interrupts will only work
> with OSes which enable the APIC. This may or may not be an issue and may
> be an acceptable tradeoff.

Everything should keep working, that is a must. We just need the interfaces to follow the hardware faithfully. The issue with the ioapic eoi is worrying me performance wise, though; it looks like we need to push the ioapic too if we are to have no-compromise performance on unmodified OSes.

>> There are two extreme models, which I think are both needed. On one
>> end, support for closed OSes (e.g. Windows) requires fairly strict
>> conformance to the PCI model, which means going through the IOAPIC or
>> PIC or however the interrupt lines are wired in qemu. This seems to
>> indicate that an in-kernel IOAPIC is needed. On the other end (Linux),
>> a legacy-free and emulation-free device can just inject interrupts
>> directly and use shared memory to ack interrupts and indicate their
>> source.
>
> Well, sort of. The problem as I see it in both cases is still IRQ
> routing. If you knew of a free vector to take (regardless of OS) you
> could forgo the ACPI/MP declarations altogether and register the vector
> with your "device" (e.g. via a hypercall, etc) and have the device issue
> direct LAPIC messages with the proper vector. I think this would work
> assuming the device used edge triggered interrupts (which don't require
> IOAPIC IRR registration or EOI).

For unmodified guests, use the existing pci irq routing. I certainly wouldn't want to debug anything else. For modified guests, there's no real problem.

>> Since you need locking anyway, best to use the unlocked versions
>> (__set_bit()).
>
> Ack. Yes, locking needs to be added. Votes for appropriate mechanism?
> (e.g. spinlock, mutex, etc?)

spin_lock_irq(), as this lock will frequently have to be taken in host irq handlers. Need to be extra careful with the locking in {vmx,svm}_vcpu_run().

>>> @@ -108,20 +109,12 @@ static unsigned get_addr_size(struct kvm_vcpu *vcpu)
>>>
>>>  static inline u8 pop_irq(struct kvm_vcpu *vcpu)
>>>  {
>>> -	int word_index = __ffs(vcpu->irq_summary);
>>> -	int bit_index = __ffs(vcpu->irq_pending[word_index]);
>>> -	int irq = word_index * BITS_PER_LONG + bit_index;
>>> -
>>> -	clear_bit(bit_index, &vcpu->irq_pending[word_index]);
>>> -	if (!vcpu->irq_pending[word_index])
>>> -		clear_bit(word_index, &vcpu->irq_summary);
>>> -	return irq;
>>> +	return kvm_vcpu_irq_read(vcpu, 0, NULL);
>>>  }
>>>
>>>  static inline void push_irq(struct kvm_vcpu *vcpu, u8 irq)
>>>  {
>>> -	set_bit(irq, vcpu->irq_pending);
>>> -	set_bit(irq / BITS_PER_LONG, &vcpu->irq_summary);
>>> +	kvm_vcpu_irq_inject(vcpu, irq, 0);
>>>  }

It would be helpful to unify the vmx and svm irq code first (I can merge something
Re: [kvm-devel] Recursive virtualization
It seems from cursory inspection that this is possible in theory, even on HVM hardware. My thoughts are as follows (Intel oriented, which I know better):

*) The hypervisor sets itself up to trap on VMX-type operations (VMXON/VMXOFF/VMLAUNCH/VMRESUME, etc.) and provides emulation of them as follows:

*) When a VMXON instruction is encountered, mark the guest as a nested hypervisor and set up any necessary structures for tracking.

*) When a VMLAUNCH/VMRESUME instruction is encountered, launch the new guest as a subordinate guest of the owning hypervisor guest. When a VM exit occurs, VMRESUME the hypervisor guest.

etc., etc.

This in theory could work to any degree of nesting, as long as the VMX operations are emulated. Note that each subordinate hypervisor would think it was emulating the VMX operations too, even though it is really only the bottom hypervisor which does so. This is kind of mind bending ;)

-Greg

>>> On Wed, Apr 4, 2007 at 12:36 PM, in message <[EMAIL PROTECTED]>, "Casey Jeffery" <[EMAIL PROTECTED]> wrote:
> I swear this has been brought up before in this forum, but I can't
> find it. I'm curious what the virtualization gurus in this forum think
> of the possibilities for recursive virtualization. I know vbox claims
> to support it, but I haven't come across many details on how they do
> it and I don't think they really use the hvm hardware. Is it something
> that should be possible without an "enlightened" guest hypervisor and
> by basically just virtualizing the VMCS/VMCB structures?
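A toy dispatch for the scheme Gregory outlines; purely illustrative, with invented names, and glossing over the real work of shadowing individual VMCS fields and the long list of corner cases.

/*
 * Trap the guest's VMX instructions and emulate them, so a guest
 * hypervisor believes it owns the hardware.  All names are invented.
 */
struct vmcs_shadow;
static struct vmcs_shadow *alloc_vmcs_shadow(void);
static void free_vmcs_shadow(struct vmcs_shadow *s);
static void run_subordinate(struct vmcs_shadow *s);

enum vmx_op { OP_VMXON, OP_VMXOFF, OP_VMLAUNCH, OP_VMRESUME };

struct nested_guest {
	int is_hypervisor;         /* set once we see VMXON */
	struct vmcs_shadow *vmcs;  /* software copy of the guest's VMCS */
};

static void emulate_vmx_op(struct nested_guest *g, enum vmx_op op)
{
	switch (op) {
	case OP_VMXON:
		g->is_hypervisor = 1;   /* start tracking nested state */
		g->vmcs = alloc_vmcs_shadow();
		break;
	case OP_VMLAUNCH:
	case OP_VMRESUME:
		/*
		 * Run the subordinate guest on the real hardware using a
		 * VMCS built from the shadow; when it VM-exits, reflect
		 * the exit back by VMRESUME-ing the guest hypervisor.
		 */
		run_subordinate(g->vmcs);
		break;
	case OP_VMXOFF:
		g->is_hypervisor = 0;
		free_vmcs_shadow(g->vmcs);
		break;
	}
}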
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
Anthony Liguori wrote:
>>> BTW, I see CPU utilization of qemu is almost always 99% in the top
>>> command when I run a kernel build in an x86-64 Linux guest.
>
> qemu would be 99% even if all the time is being spent in the guest
> context.
>
> If the user time is high, an oprofile run would be pretty useful.
> I've found that the VGA drawing routines can be pretty expensive.

Ah, Jun's used to seeing Xen's qemu-dm idling when the guest is not issuing I/O. With kvm, guest cpu time is accounted as system time to the qemu process.

Thanks for clearing that up, I didn't understand the question till now.

--
error compiling committee.c: too many arguments to function
Re: [kvm-devel] Recursive virtualization
Dor,

Thanks, I realize there will certainly be a lot of work in virtualizing them. Maybe Intel can help out with VVT-x to give a root-root mode. ;)

Any idea at a high level how vbox does it? I will post in their forum, but I assume somebody here has a good idea.

Thanks.

On 4/4/07, Dor Laor <[EMAIL PROTECTED]> wrote:
> We have an open todo task for it.
> It is 'just' emulating the VMCS/VMCB structures and commands.
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
Nakajima, Jun wrote:
> I compared the performance on Xen and KVM for a kernel build using the
> same guest image. Looks like KVM (kvm-17) was three times slower as far
> as we tested, and that high load of qemu was one of the symptoms. We are
> looking at the shadow code, but the load of qemu looks very high. I
> remember we had similar problems in Xen before, but those were fixed.
> Someone should take a look at the qemu side.

I'd expect the following issues to dominate:

- the shadow cache is quite small at 256 pages. Increasing it may
  increase performance.

- we haven't yet taught the scheduler that migrating vcpus is expensive
  due to the IPI needed to fetch the vmcs. Maybe running with
  'taskset 1' would help

- shadow eviction policy is FIFO, not LRU, which probably causes many
  page faults.

Running kvm_stat can help show what's going on.

--
error compiling committee.c: too many arguments to function
Re: [kvm-devel] KVM in-kernel APIC update
>>> On Wed, Apr 4, 2007 at 12:49 PM, in message <[EMAIL PROTECTED]>, Avi Kivity <[EMAIL PROTECTED]> wrote:
> Gregory Haskins wrote:
>
> Hmm. If the ioapic is in the kernel, then it's a platform-wide resource
> and you would need a vm ioctl. If ioapic emulation is in userspace,
> then the ioapic logic will have decided which cpu is targeted and you
> would issue a vcpu ioctl.

That's exactly in line with my thinking.

>> So that being said, I think the interface between userspace and kernel
>> would be no more complex than it is today. I.e. we would just be
>> passing a single int via an ioctl.
>
> I think the interface should mirror the hardware interface at the point
> of the "cut". For example, if we keep the ioapic in userspace, the
> interface is ioapic/apic bus messages. If we push the ioapic into the
> kernel, the interface describes the various ioapic pins and how the
> ioapics are connected to each other and to the processors (i.e. the
> topology).

Agreed. I was thinking that the interface for the "IOAPIC in kernel" model would look something like the way the pic_send_irq() function looks, except it would also convey the BUS/IOAPIC id.

So:

	kvm_inject_interrupt(int bus, int pin, int value);

and the "kvmpic" driver would currently translate as bus = 0 (giving us IRQ0-23). E.g.

	kvmpic_send_irq(int irq, int value)
	{
		kvm_inject_interrupt(0, irq, value);
	}

In the future, if we have more than one IOAPIC, the system can map to the right unit according to the topology.

> Everything should keep working, that is a must. We just need the
> interfaces to follow the hardware faithfully. The issue with the ioapic
> eoi is worrying me performance wise, though; it looks like we need to
> push the ioapic too if we are to have no-compromise performance on
> unmodified OSes.

Not sure if this will make you feel better, but it appears as though both the QEMU and the in-kernel model that I inherited don't accurately support IOAPIC EOI already. From that you can infer that there must not be any level-sensitive interrupts in use today. Since the system seems to only have the legacy ISA model (which dictates edge triggers, IIRC), this makes sense. However, if we want to support level triggers in the future, we will have to address this. I would be uncomfortable designing something that doesn't take the IOAPIC EOI into account, even if it's not immediately used.

> For unmodified guests, use the existing pci irq routing. I certainly
> wouldn't want to debug anything else. For modified guests, there's no
> real problem.

I'll take your word for it ;) I spent a few minutes looking through the linux kernel trying to figure out how to get a hint about a free vector without assigning one statically in the header files, and came up empty handed. I'm sure there is a way (or maybe static is ok?).

>>> Since you need locking anyway, best to use the unlocked versions
>>> (__set_bit()).
>>
>> Ack. Yes, locking needs to be added. Votes for appropriate mechanism?
>> (e.g. spinlock, mutex, etc?)
>
> spin_lock_irq(), as this lock will frequently have to be taken in host
> irq handlers. Need to be extra careful with the locking in
> {vmx,svm}_vcpu_run().

Ack

>> Well, I could just change any occurrence of "pop_irq" to
>> kvm_vcpu_irq_read() and any occurrence of push_irq() to
>> kvm_vcpu_irq_inject(), but I don't get the impression that this is what
>> you were referring to. Could you elaborate?
>
> I meant having a cleanup patch before that pushes irq handling into
> common code, then the apic patch could modify that to call the kernel
> apic code if necessary.

Ack

-Greg
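The ioctl plumbing for Gregory's proposed kvm_inject_interrupt() could be as thin as the sketch below. The struct layout and the ioctl number are invented for illustration; only the bus/pin/value triple comes from the proposal above.

/* Hypothetical plumbing for the interface sketched above. */
#include <sys/ioctl.h>
#include <linux/types.h>

#define KVMIO 0xAE

struct kvm_irq_pin {
	__u32 bus;    /* which ioapic/pic; 0 = the legacy ISA bus */
	__u32 pin;    /* pin on that bus, e.g. 0..23 on ioapic 0 */
	__u32 value;  /* level: assert/deassert; ignored for edge */
};

/* 0x44: invented here, picked as the next free slot among the VM ioctls */
#define KVM_INJECT_INTERRUPT _IOW(KVMIO, 0x44, struct kvm_irq_pin)

/*
 * qemu side: issued against the VM fd, since a pin is a platform-wide
 * resource and the target vcpu is not yet known.
 */
static void kvm_inject_interrupt(int vm_fd, int bus, int pin, int value)
{
	struct kvm_irq_pin p = { .bus = bus, .pin = pin, .value = value };

	ioctl(vm_fd, KVM_INJECT_INTERRUPT, &p);
}

static void kvmpic_send_irq(int vm_fd, int irq, int value)
{
	kvm_inject_interrupt(vm_fd, 0, irq, value); /* bus 0 = ISA IRQ0-23 */
}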
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
Avi Kivity wrote:
> Nakajima, Jun wrote:
>> I compared the performance on Xen and KVM for a kernel build using the
>> same guest image. Looks like KVM (kvm-17) was three times slower as
>> far as we tested, and that high load of qemu was one of the
>> symptoms. We are looking at the shadow code, but the load of qemu
>> looks very high. I remember we had similar problems in Xen before,
>> but those were fixed. Someone should take a look at the qemu side.
>
> I'd expect the following issues to dominate:
>
> - the shadow cache is quite small at 256 pages. Increasing it may
>   increase performance.

Yes, we are aware of this.

> - we haven't yet taught the scheduler that migrating vcpus is
>   expensive due to the IPI needed to fetch the vmcs. Maybe running
>   with 'taskset 1' would help
>
> - shadow eviction policy is FIFO, not LRU, which probably causes many
>   page faults.

This may explain why the performance gets worse as we repeat the kernel build; at least, the second run is slower than the first one.

> Running kvm_stat can help show what's going on.

Thanks for the good insights. We'll come back with some analysis.

Jun
---
Intel Open Source Technology Center
Re: [kvm-devel] KVM in-kernel APIC update
Gregory Haskins wrote:
> Agreed. I was thinking that the interface for the "IOAPIC in kernel"
> model would look something like the way the pic_send_irq() function
> looks, except it would also convey the BUS/IOAPIC id.
>
> so: kvm_inject_interrupt(int bus, int pin, int value);
>
> and the "kvmpic" driver would currently translate as bus = 0 (giving us
> IRQ0-23). E.g.
>
> kvmpic_send_irq(int irq, int value)
> {
>	kvm_inject_interrupt(0, irq, value);
> }

With appropriate modeling of edge vs level triggered, and translation of pin to vector, yes.

>> Everything should keep working, that is a must. We just need the
>> interfaces to follow the hardware faithfully. The issue with the
>> ioapic eoi is worrying me performance wise, though; it looks like we
>> need to push the ioapic too if we are to have no-compromise
>> performance on unmodified OSes.
>
> Not sure if this will make you feel better, but it appears as though
> both the QEMU and the in-kernel model that I inherited don't accurately
> support IOAPIC EOI already. From that you can infer that there must not
> be any level-sensitive interrupts in use today. Since the system seems
> to only have the legacy ISA model (which dictates edge triggers, IIRC),
> this makes sense. However, if we want to support level triggers in the
> future, we will have to address this. I would be uncomfortable designing
> something that doesn't take the IOAPIC EOI into account, even if it's
> not immediately used.

pci is level triggered, so maybe the guests just handle the inaccuracy.

>> For unmodified guests, use the existing pci irq routing. I certainly
>> wouldn't want to debug anything else. For modified guests, there's no
>> real problem.
>
> I'll take your word for it ;) I spent a few minutes looking through the
> linux kernel trying to figure out how to get a hint about a free vector
> without assigning one statically in the header files, and came up empty
> handed. I'm sure there is a way (or maybe static is ok?).

No idea really, but the paravirt_ops stuff probably provides the right hooks.

--
error compiling committee.c: too many arguments to function
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
>>> On Wed, Apr 4, 2007 at 10:20 AM, in message <[EMAIL PROTECTED]>, Anthony Liguori <[EMAIL PROTECTED]> wrote:
>
> The devices are already written to take a set_irq function. Instead of
> hijacking the emulated PIC device, I think it would be better if, in
> pc.c, we just conditionally created our PIC device that reflected to the
> hypervisor and passed the appropriate function to the emulated hardware.

When I first started looking at the code, I was hoping to do exactly that. But unfortunately it appears as though many devices call pic_set_irq() directly, which is why I used the approach I did. Take a look at qemu/hw/rtl8139.c for an example.

> Otherwise, to support all the other architectures, there's going to be a
> lot of modifications.

Fully agree, which is why I waited for review before putting in all that work ;)

-Greg
Re: [kvm-devel] KVM in-kernel APIC update
>>> On Wed, Apr 4, 2007 at 1:43 PM, in message <[EMAIL PROTECTED]>, Avi Kivity <[EMAIL PROTECTED]> wrote:
> Gregory Haskins wrote:
>> Agreed. I was thinking that the interface for the "IOAPIC in kernel"
>> model would look something like the way the pic_send_irq() function
>> looks, except it would also convey the BUS/IOAPIC id.
>>
>> so: kvm_inject_interrupt(int bus, int pin, int value);
>>
>> and the "kvmpic" driver would currently translate as bus = 0 (giving
>> us IRQ0-23). E.g.
>>
>> kvmpic_send_irq(int irq, int value)
>> {
>>	kvm_inject_interrupt(0, irq, value);
>> }
>
> With appropriate modeling of edge vs level triggered, and translation
> of pin to vector, yes.

I believe we would be ok here. The current code uses a similar model for edge/level, where edge simply ignores *value*, and level uses a non-zero value as "assert" and 0 as "deassert". As far as the pin/vector mapping goes, that would be taken care of by the programming of the IOAPIC by the BIOS/OS.

> pci is level triggered, so maybe the guests just handle the inaccuracy.

Good point. I'm not sure how this works today. Perhaps we just get lucky: nothing checks the IRR in the IOAPIC, coupled with a bug in the IOAPIC model whereby an APIC message is sent out even if the interrupt is not acknowledged. It would explain why it works today, anyway. Either way, I would like to get this modeled "right" this go-round, so the point is moot.

-Greg
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
Nakajima, Jun wrote:
> I compared the performance on Xen and KVM for a kernel build using the
> same guest image. Looks like KVM (kvm-17) was three times slower as far
> as we tested, and that high load of qemu was one of the symptoms. We
> are looking at the shadow code, but the load of qemu looks very high.

The KVM IDE emulation is much slower than Xen's. Part of that is the new AIO framework. It may be worth modifying the qemu_aio_init() function in block-raw.c to increase the AIO thread count, and trying with a SCSI disk. The SCSI disk with > 1 AIO thread is pretty good.

Regards,

Anthony Liguori

> I remember we had similar problems in Xen before, but those were fixed.
> Someone should take a look at the qemu side.
>
> Jun
> ---
> Intel Open Source Technology Center
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
Gregory Haskins wrote:
>>> On Wed, Apr 4, 2007 at 10:20 AM, in message <[EMAIL PROTECTED]>, Anthony Liguori <[EMAIL PROTECTED]> wrote:
>>
>> The devices are already written to take a set_irq function. Instead of
>> hijacking the emulated PIC device, I think it would be better if, in
>> pc.c, we just conditionally created our PIC device that reflected to
>> the hypervisor and passed the appropriate function to the emulated
>> hardware.
>
> When I first started looking at the code, I was hoping to do exactly
> that. But unfortunately it appears as though many devices call
> pic_set_irq() directly, which is why I used the approach I did. Take a
> look at qemu/hw/rtl8139.c for an example.

Yeah, I was thinking of serial.c specifically. I just looked again and things are not quite as clean as I had thought. PPC has a set_irq hook, but it also has its own pic_set_irq function.

I suspect the cleanest thing to do would be to register an irq handler that takes a callback/opaque. Probably switch pic_set_irq calls to qemu_set_irq or something. But this is definitely a topic to bring up on qemu-devel as it touches all architectures.

Regards,

Anthony Liguori

>> Otherwise, to support all the other architectures, there's going to be
>> a lot of modifications.
>
> Fully agree, which is why I waited for review before putting in all that
> work ;)
>
> -Greg
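As a rough sketch of the callback/opaque indirection Anthony suggests (the names mirror his suggestion, but the actual API would be settled on qemu-devel):

/* Sketch only: route all device irq calls through one indirection. */
typedef void (*set_irq_fn)(void *opaque, int irq, int level);

struct qemu_irq_sink {
	set_irq_fn set_irq;
	void *opaque;
};

static struct qemu_irq_sink system_pic;  /* chosen once at startup */

/* devices call this instead of calling pic_set_irq() directly */
void qemu_set_irq(int irq, int level)
{
	system_pic.set_irq(system_pic.opaque, irq, level);
}

/* pc.c then wires the sink either to the emulated i8259 ... */
static void i8259_set_irq(void *opaque, int irq, int level) { /* ... */ }

/* ... or, under kvm, to the in-kernel PIC via the injection ioctl */
static void kvm_set_irq(void *opaque, int irq, int level) { /* ... */ }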
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
* Avi Kivity <[EMAIL PROTECTED]> wrote:

>> It still exists in userspace. Having the code duplication
>> (especially when it's not the same code base) is unfortunate.
>
> This remains true.

But it's the wrong argument. Of course there's duplicate functionality, and that's _good_ because it represents choice. KVM _itself_ is duplicate functionality of qemu, in a way. So why move the lapic/PIC handling to the kernel? Because it's a lot cleaner to do device emulation there, and PV drivers get significantly easier to do. The lapic/PIC code should also be available in Qemu for OSes that don't have KVM-alike support in the kernel.

And while today most of the performance advantages of moving the PIC into the kernel are masked by the high cost of VM exits, in the future the effect will be more marked, as the relative cost of piggybacking out to qemu increases.

I can see the value in doing certain things in Qemu, but I cannot see _at all_ the value of handling, say, the PIT in Qemu. Just look at the PIT/timers code quality in Qemu for a change ... it's a huge ugly mess of lots of #ifdefs, ineffective handling of /dev/rtc, linear list walking, signal overhead, etc., etc. All of that resulting in 10-15% of 'idle' overhead of KVM+qemu when it runs a Linux guest. On the other side, in the kernel it's most natural to do timers and to emulate hardware, because the kernel has _precise_ knowledge of the platform's capabilities.

	Ingo
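To make the contrast concrete, here is a hedged sketch of what PIT channel 0 emulation could look like on top of the kernel's hrtimer API. The kvm_inject_pin() helper is hypothetical, and a real device model would also carry the full i8254 register state, modes, and readback.

/* Sketch: a periodic PIT channel 0 driven by an in-kernel hrtimer. */
#include <linux/hrtimer.h>
#include <linux/kernel.h>
#include <linux/ktime.h>

struct kvm;                                     /* target VM */
static void kvm_inject_pin(struct kvm *kvm, int pin); /* hypothetical */

struct kvm_pit {
	struct hrtimer timer;
	ktime_t period;    /* derived from the programmed i8254 count */
	struct kvm *kvm;
};

static enum hrtimer_restart pit_timer_fn(struct hrtimer *t)
{
	struct kvm_pit *pit = container_of(t, struct kvm_pit, timer);

	kvm_inject_pin(pit->kvm, 0);        /* raise ISA IRQ 0 */
	hrtimer_forward_now(t, pit->period);
	return HRTIMER_RESTART;             /* keep the channel periodic */
}

static void pit_start(struct kvm_pit *pit, unsigned long hz)
{
	pit->period = ktime_set(0, NSEC_PER_SEC / hz);
	hrtimer_init(&pit->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
	pit->timer.function = pit_timer_fn;
	hrtimer_start(&pit->timer, pit->period, HRTIMER_MODE_REL);
}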
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
* Anthony Liguori <[EMAIL PROTECTED]> wrote:

>> Keeping the apic in the kernel simplifies this with the cost of
>> maintaining an apic/pic implementation.
>
> Hrm, this is definitely starting to sound like a PITA to deal with.
> Maybe in-kernel platform devices are unavoidable :-/

Yes, very much so. Not only are they unavoidable, they largely simplify many aspects of KVM. Did anyone here ever have unreliable keyboard emulation in qemu because /dev/rtc didn't allow 1024 Hz? Such basic issues are just not present when timer emulation is done by the kernel.

The kernel abstracts away the actual hardware, and Qemu then uses these abstract interfaces to create something that is specific. By moving platform device emulation into KVM, much of that (unnecessary) indirection goes away. Furthermore, scheduling and guest-interrupt handling is something that is most naturally done in the host kernel - any indirection is unnecessary fat that distracts from the core purpose of building a first-class hypervisor subsystem.

	Ingo
Re: [kvm-devel] KVM in-kernel APIC update
* Avi Kivity <[EMAIL PROTECTED]> wrote:

>> My current thoughts are that we at least move the IOAPIC into the
>> kernel as well. That will give sufficient control to generate ISA
>> bus interrupts for guests that understand APICs. If we want to be
>> able to generate ISA interrupts for legacy guests which talk to the
>> 8259s that will prove to be insufficient. The good news is that
>> moving the 8259s down as well is probably not a huge deal either,
>> especially since I have already prepped the usermode side.
>> Thoughts?
>
> I would avoid moving down anything that's not strictly necessary. If
> we want to keep the PIC in qemu, for example, just export the APIC-PIC
> interface to qemu.

We should move all the PICs into KVM proper - and that includes the i8259A PIC too. Qemu-space drivers are then wired to pins on these PICs, but nothing in Qemu does vector generation or vector prioritization - that task is purely up to KVM. There are mixed i8259A+lapic models possible too, and the simplest model is to have all vector handling in KVM.

Any 'cut' of the interface that allows both qemu and KVM to generate vectors is unnecessary (and harmful) complexity. The interface cut should be at the 'pin' level, with Qemu raising a signal on a pin and lowering a signal on a pin, but otherwise not dealing with IRQ routing and IRQ vectors.

	Ingo
Re: [kvm-devel] KVM in-kernel APIC update
* Gregory Haskins <[EMAIL PROTECTED]> wrote:

>> pci is level triggered, so maybe the guests just handle the
>> inaccuracy.
>
> Good point. I'm not sure how this works today. Perhaps we just get
> lucky that nothing checks the IRR in the IOAPIC, coupled with a bug in
> the IOAPIC model whereby an APIC message is sent out even if the
> interrupt is not acknowledged. It would explain why it works today,
> anyway. Either way, I would like to get this modeled "right" this
> go-round, so the point is moot.

On real hardware, some devices produce edges, some devices produce level signals that need to be deasserted. The basic model of a PIC is that it has 'pins' which convert the signal arriving on those pins into interrupts. Qemu itself should only know about the pin enumeration, and should only be able to raise/lower the (virtual) 'signal' on such a pin. PCI can be level and edge triggered too, and IO-APICs can be programmed on a per-pin basis to detect edge-high, edge-low, level-high, level-low signals.

There is a remote possibility that some OSes depend on certain devices being level-triggered: for example, if you get an IRQ from a level-triggered device and _don't_ deassert that signal from the IRQ handler (intentionally so), then the semantics of current hardware will cause a second interrupt to be sent by the PIC, after the APIC message has been EOI-ed in the local APIC. While such "repeat interrupts" would be pure madness to rely on, I think, I'm not sure it's not being done. Note that if the same IO-APIC pin is set up to detect edges, then not deasserting the signal would not cause a 'repeat interrupt'. Whether such accurate emulation of signalling is needed depends on the hardware semantics of the devices we emulate.

	Ingo
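A small illustrative model of the pin abstraction Ingo describes, including the 'repeat interrupt' case for level-triggered pins (all names are invented for the example):

/* Toy pin model: the device model only drives the signal; the PIC
 * converts signals to interrupt requests per the pin's trigger mode. */
enum trigger_mode { TRIG_EDGE, TRIG_LEVEL };

struct pic_pin {
	enum trigger_mode mode;
	int signal;  /* current signal as driven by the device model */
	int irr;     /* interrupt request latched for delivery */
};

static void pin_set_signal(struct pic_pin *pin, int level)
{
	int old = pin->signal;

	pin->signal = level;
	if (pin->mode == TRIG_EDGE) {
		if (!old && level)
			pin->irr = 1;   /* latch on the rising edge only */
	} else {
		pin->irr = level;       /* request follows the line level */
	}
}

/*
 * After EOI of a level-triggered pin whose line is still asserted, the
 * request stays pending and a "repeat interrupt" is delivered - the
 * behaviour some guests may rely on, per the discussion above.  An edge
 * pin with an undeasserted line does not repeat.
 */
static int pin_pending_after_eoi(struct pic_pin *pin)
{
	return pin->mode == TRIG_LEVEL && pin->signal;
}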
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
Ingo Molnar wrote:
> * Avi Kivity <[EMAIL PROTECTED]> wrote:
>
>>> It still exists in userspace. Having the code duplication
>>> (especially when it's not the same code base) is unfortunate.
>>
>> This remains true.
>
> But it's the wrong argument. Of course there's duplicate functionality,
> and that's _good_ because it represents choice. KVM _itself_ is
> duplicate functionality of qemu, in a way. So why move the lapic/PIC
> handling to the kernel? Because it's a lot cleaner to do device
> emulation there, and PV drivers get significantly easier to do.

But why is it a good thing to do PV drivers in the kernel? You lose flexibility and functionality to gain performance. Really, it's more about there not being good enough userspace interfaces to do network IO.

> The lapic/PIC code should also be available in Qemu for OSes that don't
> have KVM-alike support in the kernel.
>
> And while today most of the performance advantages of moving the PIC
> into the kernel are masked by the high cost of VM exits, in the future
> the effect will be more marked, as the relative cost of piggybacking
> out to qemu increases.
>
> I can see the value in doing certain things in Qemu, but I cannot see
> _at all_ the value of handling, say, the PIT in Qemu. Just look at the
> PIT/timers code quality in Qemu for a change ... it's a huge ugly mess
> of lots of #ifdefs, ineffective handling of /dev/rtc, linear list
> walking, signal overhead, etc., etc. All of that resulting in 10-15% of
> 'idle' overhead of KVM+qemu when it runs a Linux guest. On the other
> side, in the kernel it's most natural to do timers and to emulate
> hardware, because the kernel has _precise_ knowledge of the platform's
> capabilities.

Yeah, I think this is a good point. If we're going to push the APIC into the kernel, we might as well put the PIT there too. The timing stuff is an absolute mess in QEMU, since it wants a fast high-res clock but isn't aware of things like CPU migration. I'm pretty sure that if you are on an SMP host, some bad things can happen with the QEMU timer code since it relies on the rdtsc.

Regards,

Anthony Liguori
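The rdtsc hazard Anthony mentions is easy to see in a small sketch: the TSC is a per-core counter, so a thread that migrates between cores with unsynchronized TSCs can observe time jumps, while a host-kernel clock stays monotonic:

#include <stdint.h>
#include <time.h>

/* Per-core cycle counter: values are not comparable across cores with
 * unsynchronized TSCs, so a migrated thread may see time go backwards. */
static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;

	__asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
	return ((uint64_t)hi << 32) | lo;
}

/* Host-kernel clock: globally monotonic regardless of which core the
 * qemu thread happens to run on. */
static inline uint64_t monotonic_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}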
Re: [kvm-devel] KVM in-kernel APIC update
* Gregory Haskins <[EMAIL PROTECTED]> wrote:

> Hi all,
>
> Attached is a snapshot of my current efforts on the kernel side for
> the in-kernel APIC work. Feedback welcome.

Good work and nice patch! :)

> My current thoughts are that we at least move the IOAPIC into the
> kernel as well. [...]

Yes. And then do the final 10% move of handling the i8259A in KVM too. That would mean that PV drivers could inject IRQ events without having to schedule back to qemu. Especially if the IRQ is masked, this avoids a context switch. (With any PIC component in qemu we have no choice but to always wake up qemu and let it handle the event - even if it results in a 'no vector generated' decision.) This is a plus because most PV drivers will do IO completion from irq context and possibly on other CPUs.

> The current Bochs/QEMU system model paints a fairly simple ISA
> architecture utilizing a single IOAPIC + dual 8259 setup. Do we
> expect in-kernel injected IRQs to follow the ISA model (e.g. either
> legacy or PCI interrupts only limited to IRQ0-15) or do we want to
> expand on this? [...]

Yes, we should probably expand them - it's not hard. PV/accel drivers would most likely hook up to free pins to unshare the interrupts.

> If the latter, we also need to decide what the resource conveyance
> model and vector allocation policy should be. For instance, do we
> publish said resources formally in the MP/ACPI tables in Bochs? Doing
> so would allow MP/ACPI compliant OSes like linux to naturally route the
> IRQ. Conversely, do we do something more direct just like we do for
> KVM discovery via wrmsr?

For PV/accel drivers we don't need any extra ACPI enumeration - the hypercall API is good enough to connect to the hypervisor, and I suspect all guest OSes we care about allow drivers to allocate an IRQ vector for a new device, without having that device enumerated in ACPI.

	Ingo
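The guest side of what Ingo sketches could be as small as the following. request_irq() is the stock Linux API; the hypercall wrapper and the hypercall number are assumptions made up for the example, not an existing interface:

/* Sketch of a PV driver obtaining its interrupt via hypercall rather
 * than ACPI enumeration.  KVM_HC_ALLOC_IRQ and kvm_hypercall0() are
 * assumptions for illustration. */
#include <linux/interrupt.h>

#define KVM_HC_ALLOC_IRQ 42   /* invented hypercall number */

static irqreturn_t pvdev_interrupt(int irq, void *dev_id)
{
	/* ack via shared memory, kick the request queue, etc. */
	return IRQ_HANDLED;
}

static int pvdev_setup_irq(void *dev)
{
	/* ask the hypervisor which guest irq it will raise for us */
	int irq = kvm_hypercall0(KVM_HC_ALLOC_IRQ);

	if (irq < 0)
		return irq;
	return request_irq(irq, pvdev_interrupt, 0, "kvm-pv", dev);
}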
[kvm-devel] [PATCH] Support for in-kernel mmio handlers
The MMIO registration code has been broken out as a new patch from the in-kernel APIC work, with the following changes per Avi's request:

1) Supports dynamic registration
2) Uses gpa_t addresses
3) Explicit per-cpu mappings

In addition, I have added the concept of distinct VCPU and VM level registrations, where VCPU devices will eclipse competing VM registrations (if any). This will be key down the road, where LAPICs should use VCPU registration but IOAPICs should use VM level.

Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
---

 drivers/kvm/kvm.h      |   50 +
 drivers/kvm/kvm_main.c |   53 +++
 2 files changed, 94 insertions(+), 9 deletions(-)

diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h
index fceeb84..3334730 100644
--- a/drivers/kvm/kvm.h
+++ b/drivers/kvm/kvm.h
@@ -236,6 +236,54 @@ struct kvm_pio_request {
 	int rep;
 };
 
+struct kvm_io_device {
+	unsigned long (*read)(struct kvm_io_device *this,
+			      gpa_t addr,
+			      unsigned long length);
+	void (*write)(struct kvm_io_device *this,
+		      gpa_t addr,
+		      unsigned long length,
+		      unsigned long val);
+	int (*in_range)(struct kvm_io_device *this, gpa_t addr);
+
+	void *private;
+	struct list_head link;
+};
+
+/*
+ * It would be nice to use something smarter than a linear search, TBD...
+ * Thankfully we don't expect many devices to register (famous last
+ * words :), so until then it will suffice.  At least it's abstracted so
+ * we can change it in one place.
+ */
+struct kvm_io_bus {
+	struct list_head list;
+};
+
+static inline void
+kvm_io_bus_init(struct kvm_io_bus *bus)
+{
+	INIT_LIST_HEAD(&bus->list);
+}
+
+static inline struct kvm_io_device *
+kvm_io_bus_find_dev(struct kvm_io_bus *bus, gpa_t addr)
+{
+	struct kvm_io_device *pos = NULL;
+
+	list_for_each_entry(pos, &bus->list, link) {
+		if (pos->in_range(pos, addr))
+			return pos;
+	}
+
+	return NULL;
+}
+
+static inline void
+kvm_io_bus_register_dev(struct kvm_io_bus *bus, struct kvm_io_device *dev)
+{
+	list_add_tail(&dev->link, &bus->list);
+}
+
 struct kvm_vcpu {
 	struct kvm *kvm;
 	union {
@@ -294,6 +342,7 @@ struct kvm_vcpu {
 	gpa_t mmio_phys_addr;
 	struct kvm_pio_request pio;
 	void *pio_data;
+	struct kvm_io_bus mmio_bus;
 
 	int sigset_active;
 	sigset_t sigset;
@@ -345,6 +394,7 @@ struct kvm {
 	unsigned long rmap_overflow;
 	struct list_head vm_list;
 	struct file *filp;
+	struct kvm_io_bus mmio_bus;
 };
 
 struct kvm_stat {
diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index 4473174..da119c0 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -294,6 +294,7 @@ static struct kvm *kvm_create_vm(void)
 
 	spin_lock_init(&kvm->lock);
 	INIT_LIST_HEAD(&kvm->active_mmu_pages);
+	kvm_io_bus_init(&kvm->mmio_bus);
 	for (i = 0; i < KVM_MAX_VCPUS; ++i) {
 		struct kvm_vcpu *vcpu = &kvm->vcpus[i];
 
@@ -302,6 +303,7 @@ static struct kvm *kvm_create_vm(void)
 		vcpu->kvm = kvm;
 		vcpu->mmu.root_hpa = INVALID_PAGE;
 		INIT_LIST_HEAD(&vcpu->free_pages);
+		kvm_io_bus_init(&vcpu->mmio_bus);
 		spin_lock(&kvm_lock);
 		list_add(&kvm->vm_list, &vm_list);
 		spin_unlock(&kvm_lock);
@@ -1015,12 +1017,30 @@ static int emulator_write_std(unsigned long addr,
 	return X86EMUL_UNHANDLEABLE;
 }
 
+static struct kvm_io_device *vcpu_find_mmio_dev(struct kvm_vcpu *vcpu,
+						gpa_t addr)
+{
+	struct kvm_io_device *mmio_dev;
+
+	/* First check the local CPU addresses */
+	mmio_dev = kvm_io_bus_find_dev(&vcpu->mmio_bus, addr);
+	if (!mmio_dev) {
+		/* Then check the entire VM */
+		mmio_dev = kvm_io_bus_find_dev(&vcpu->kvm->mmio_bus, addr);
+	}
+
+	return mmio_dev;
+}
+
 static int emulator_read_emulated(unsigned long addr,
 				  unsigned long *val,
 				  unsigned int bytes,
 				  struct x86_emulate_ctxt *ctxt)
 {
 	struct kvm_vcpu *vcpu = ctxt->vcpu;
+	gpa_t gpa;
+	int i;
+	struct kvm_io_device *mmio_dev;
 
 	if (vcpu->mmio_read_completed) {
 		memcpy(val, vcpu->mmio_data, bytes);
@@ -1029,18 +1049,24 @@ static int emulator_read_emulated(unsigned long addr,
 	} else if (emulator_read_std(addr, val, bytes, ctxt)
 		   == X86EMUL_CONTINUE)
 		return X86EMUL_CONTINUE;
-	else {
-		gpa_t gpa = vcpu->mmu.gva_to_gpa(vcpu, addr);
-		if (gpa == UNMAPPED_GVA)
-
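To illustrate the registration flow the patch enables, a device (say, an in-kernel local APIC) would implement the three callbacks and register on the per-vcpu bus, where it eclipses any VM-wide device covering the same range. The lapic_* bodies below are placeholders, and a real LAPIC would of course need per-vcpu state rather than a single shared device:

/* Illustrative use of the kvm_io_bus interface added above. */
static unsigned long lapic_read(struct kvm_io_device *this,
				gpa_t addr, unsigned long length)
{
	/* decode and return the register at addr - APIC_DEFAULT_PHYS_BASE */
	return 0;
}

static void lapic_write(struct kvm_io_device *this, gpa_t addr,
			unsigned long length, unsigned long val)
{
	/* update register state; possibly send an IPI */
}

static int lapic_in_range(struct kvm_io_device *this, gpa_t addr)
{
	/* claim the usual 4K APIC page at 0xfee00000 */
	return addr >= APIC_DEFAULT_PHYS_BASE &&
	       addr <  APIC_DEFAULT_PHYS_BASE + PAGE_SIZE;
}

static struct kvm_io_device lapic_dev = {
	.read     = lapic_read,
	.write    = lapic_write,
	.in_range = lapic_in_range,
};

/* per-vcpu registration, so it eclipses any VM-wide device in range */
static void lapic_attach(struct kvm_vcpu *vcpu)
{
	kvm_io_bus_register_dev(&vcpu->mmio_bus, &lapic_dev);
}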
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
* Anthony Liguori <[EMAIL PROTECTED]> wrote:

> But why is it a good thing to do PV drivers in the kernel? You lose
> flexibility and functionality to gain performance. [...]

In Linux a kernel-space network driver can still be tunneled over user-space code, and hence you can add arbitrary add-on functionality (and thus have flexibility) without slowing down the common case (which would be to tunnel the guest's network traffic into the firewall rules of the kernel; no need to touch user-space for any of that).

If performance didn't matter and it were all about flexibility, then Bochs/Qemu could have migrated most Windows users to Linux 10 years ago. The reality is, performance and precision of emulation very much matter.

	Ingo
Re: [kvm-devel] KVM in-kernel APIC update
>>> On Wed, Apr 4, 2007 at 4:32 PM, in message <[EMAIL PROTECTED]>, Ingo Molnar <[EMAIL PROTECTED]> wrote:
>
>> My current thoughts are that we at least move the IOAPIC into the
>> kernel as well. [...]
>
> Yes. And then do the final 10% move of handling the i8259A in KVM too.

Hi Ingo,

We are in full agreement on this point, and it has been my preferred model from the beginning. The only issue with this approach is that it requires a fairly disruptive patch to QEMU's "pic_set_irq()" feature, which many people have taken exception to so far. (In case you weren't following from the beginning, it's the "QEMU PIC indirection patch" thread.)

If we don't care about supporting "--no-kvm" anymore, this problem becomes trivially easy: we can just link a different pic module into QEMU and be done with it. The problem as I see it is that we really have a lot of value in being able to switch between kvm and pure qemu mode via --no-kvm, especially for debugging. Therefore, IMHO we need to be able to dynamically switch between PIC emulation code.

If we *do* want to go with this model, *and* we decide that the approach I have taken with QEMU is a reasonable way to do it, then I would suggest we go about it by getting the patch accepted in QEMU upstream. I would gladly take on this duty if we all agree this is the right approach.

> For PV/accel drivers we don't need any extra ACPI enumeration - the
> hypercall API is good enough to connect to the hypervisor, and I
> suspect all guest OSes we care about allow drivers to allocate an IRQ
> vector for a new device, without having that device enumerated in ACPI.

If you know how to do this in Linux, please share! I was looking for this earlier and came up empty handed. All I could find were the places where the PCI/MP/ACPI type things assigned vectors to devices they knew about. It's probably "operator-ignorance" ;)

Regards,
-Greg
Re: [kvm-devel] Recursive virtualization
> Dor,
>
> Thanks, I realize there will certainly be a lot of work in
> virtualizing them. Maybe Intel can help out with VVT-x to give a
> root-root mode. ;)
>
> Any idea at a high level how vbox does it? I will post in their forum,
> but I assume somebody here has a good idea.

Vbox branched out from qemu. It is another emulator that is supposed to run code natively, in a similar fashion to kqemu. They run the user code and non-dangerous stuff natively; the rest is fully emulated, and thus it can support nested levels.
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
> Avi Kivity wrote:
>> Nakajima, Jun wrote:
>>> I compared the performance on Xen and KVM for a kernel build using
>>> the same guest image. Looks like KVM (kvm-17) was three times slower
>>> as far as we tested, and the high load of qemu was one of the
>>> symptoms. We are looking at the shadow code, but the load of qemu
>>> looks very high. I remember we had similar problems in Xen before,
>>> but those were fixed. Someone should take a look at the qemu side.
>>
>> I'd expect the following issues to dominate:
>>
>> - the shadow cache is quite small at 256 pages. Increasing it may
>>   increase performance.
>
> Yes, we are aware of this.
>
>> - we haven't yet taught the scheduler that migrating vcpus is
>>   expensive due to the IPI needed to fetch the vmcs. Maybe running
>>   with 'taskset 1' would help.
>>
>> - shadow eviction policy is FIFO, not LRU, which probably causes many
>>   page faults.
>
> This may explain why the performance gets worse as we repeat the kernel
> build; at least, the second run is slower than the first one.
>
>> Running kvm_stat can help show what's going on.
>
> Thanks for the good insights. We'll come back with some analysis.

Maybe you can come up with a patch ;) If you touch those areas anyway
and run several benchmarks, writing a small piece of code is a
negligible extra effort.

> Jun
> ---
> Intel Open Source Technology Center
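The 'taskset 1' suggestion can also be done programmatically; a minimal
sketch of pinning the qemu process to host cpu 0 using the standard
glibc sched_setaffinity() API (nothing kvm-specific here):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    /* Pin the calling process to host cpu 0, mimicking "taskset 1".
     * This avoids vcpu migration and the cross-cpu IPI needed to
     * fetch the VMCS. */
    int pin_to_cpu0(void)
    {
        cpu_set_t mask;

        CPU_ZERO(&mask);
        CPU_SET(0, &mask);
        if (sched_setaffinity(0 /* self */, sizeof(mask), &mask) < 0) {
            perror("sched_setaffinity");
            return -1;
        }
        return 0;
    }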
Re: [kvm-devel] KVM in-kernel APIC update
Gregory Haskins wrote:
>>>> On Wed, Apr 4, 2007 at 4:32 PM, in message <[EMAIL PROTECTED]>,
>>>> Ingo Molnar <[EMAIL PROTECTED]> wrote:
>>> My current thoughts are that we at least move the IOAPIC into the
>>> kernel as well. [...]
>>
>> yes. And then do the final 10% move of handling the i8259A in KVM too.
>
> Hi Ingo,
> We are in full agreement on this point, and this has been my preferred
> model from the beginning. The only issue with this approach is that it
> requires a fairly disruptive patch to QEMU's "pic_set_irq()" feature,
> which many people have taken exception to so far. (In case you weren't
> following from the beginning, it's the "QEMU PIC indirection patch"
> thread.)
>
> If we don't care about supporting "--no-kvm" anymore, this problem
> becomes trivially easy.

No, this would be a big mistake. We'll just end up with another qemu-dm.

> We can just link a different PIC module into QEMU and be done with it.
> The problem as I see it is that we get a lot of value from being able
> to switch between kvm and pure qemu mode via --no-kvm, especially for
> debugging. Therefore, IMHO, we need to be able to switch between PIC
> emulation implementations dynamically.
>
> If we *do* want to go with this model, *and* we decide that the
> approach I have taken with QEMU is a reasonable way to do it, then I
> would suggest we go about it by getting the patch accepted in QEMU
> upstream. I would gladly take on this duty if we all agree this is the
> right approach.

I think you should post the patch on qemu-devel, if for nothing else
than to begin the discussion.

Regards,

Anthony Liguori

>> for PV/accel drivers we dont need any extra ACPI enumeration - the
>> hypercall API is good enough to connect to the hypervisor, and i suspect
>> all guest OSs we care about allow drivers to allocate an IRQ vector for
>> a new device, without having that device enumerated in ACPI.
>
> If you know how to do this in Linux, please share! I was looking for
> this earlier and came up empty handed. All I could find were the places
> where the PCI/MP/ACPI type things assigned vectors to devices they knew
> about. It's probably "operator-ignorance" ;)
>
> Regards,
> -Greg
Re: [kvm-devel] KVM in-kernel APIC update
> we should move all the PICs into KVM proper - and that includes the
> i8259A PIC too. Qemu-space drivers are then wired to pins on these PICs,
> but nothing in Qemu does vector generation or vector prioritization -
> that task is purely up to KVM. There are mixed i8259A+lapic models
> possible too and the simplest model is to have all vector handling in
> KVM.
>
> any 'cut' of the interface to allow both qemu and KVM to generate
> vectors is unnecessary (and harmful) complexity. The interface cut
> should be at the 'pin' level, with Qemu raising a signal on a pin and
> lowering a signal on a pin, but otherwise not dealing with IRQ routing
> and IRQ vectors.
>
> 	Ingo

Actually that's the best cut, and not a cut chosen because of someone's
implementation status; it is the most logical way of doing things.
Either all of the interrupt components live in user space, or they all
live in kernel space. Since PV drivers and APIC acceleration (vt/svm)
are strong arguments in favor of a kernel implementation, we should aim
towards that. It is also a very clean cut, and predicting the exact
interface is easy.
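A minimal sketch of what such a pin-level userspace interface might look
like (the ioctl name, number and struct here are hypothetical, offered
only to make the "raise/lower a pin" idea concrete):

    #include <stdint.h>
    #include <sys/ioctl.h>

    /* Hypothetical pin-level interface: qemu only reports wire state;
     * all routing, vector generation and prioritization happen in KVM. */
    struct kvm_irq_line {
        uint32_t pin;    /* pin number on the in-kernel PIC/IOAPIC complex */
        uint32_t level;  /* 1 = assert, 0 = deassert */
    };

    #define KVM_SET_IRQ_LINE _IOW('k', 0x80, struct kvm_irq_line) /* hypothetical */

    /* A qemu device model would then do nothing more than: */
    static void pic_set_irq_pin(int vm_fd, int pin, int level)
    {
        struct kvm_irq_line line = { .pin = pin, .level = level };

        ioctl(vm_fd, KVM_SET_IRQ_LINE, &line);
    }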
Re: [kvm-devel] KVM in-kernel APIC update
>> Gregory Haskins wrote:
>>
>> Hmm. If the ioapic is in the kernel, then it's a platform-wide resource
>> and you would need a vm ioctl. If ioapic emulation is in userspace,
>> then the ioapic logic will have decided which cpu is targeted and you
>> would issue a vcpu ioctl.
>
> That's exactly in line with my thinking.

This is one more point in favor of the in-kernel apic, since the
physical ioapic is not bound to a specific cpu.

> So that being said, I think the interface between userspace and kernel
> would be no more complex than it is today. I.e. we would just be
> passing a single int via an ioctl.

>> I think the interface should mirror the hardware interface at the point
>> of the "cut". For example, if we keep the ioapic in userspace, the
>> interface is ioapic/apic bus messages. If we push the ioapic into the
>> kernel, the interface describes the various ioapic pins and how the
>> ioapics are connected to each other and to the processors (i.e. the
>> topology).
>
> Agreed. I was thinking that the interface for the "IOAPIC in kernel"
> model would look something like the way the pic_send_irq() function
> looks, except it would also convey the BUS/IOAPIC id.

Keeping the ioapic in qemu was just a temporary solution; the full-blown
architecture should implement everything within the kernel.
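For reference, the "single int via an ioctl" shape already exists for
vcpu-level injection; a sketch of what userspace injection looks like
with the KVM_INTERRUPT vcpu ioctl as I understand the current API (error
handling trimmed):

    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Inject interrupt vector 'vector' into one vcpu from userspace.
     * KVM_INTERRUPT is a vcpu ioctl precisely because the lapic is
     * per-cpu; an in-kernel ioapic would instead want a vm-level ioctl. */
    static int inject_vector(int vcpu_fd, int vector)
    {
        struct kvm_interrupt irq = { .irq = vector };

        if (ioctl(vcpu_fd, KVM_INTERRUPT, &irq) < 0) {
            perror("KVM_INTERRUPT");
            return -1;
        }
        return 0;
    }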
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
> But why is it a good thing to do PV drivers in the kernel? You lose
> flexibility and functionality to gain performance. Really, it's more
> about there not being good enough userspace interfaces to do network IO.
>
>> The lapic/PIC code should also be available in Qemu for OSs that dont
>> have KVM-alike support in the kernel.
>>
>> and while today most of the performance advantages of moving the PIC
>> into the kernel are masked by the high cost of VM exits, in the future
>> the effect will be more marked, as the relative cost of piggybacking
>> out to qemu increases.
>>
>> I can see the value in doing certain things in Qemu, but i cannot see
>> _at all_ the value of handling say the PIT in Qemu. Just look at the
>> Qemu PIT/timers code quality for a change ... it's a huge ugly mess of
>> lots of #ifdefs, ineffective handling of /dev/rtc, linear list
>> walking, signal overhead, etc., etc. All of that resulting in 10-15%
>> of 'idle' overhead of KVM+qemu when it runs a Linux guest. On the
>> other side, in the kernel it's most natural to do timers and to
>> emulate hardware, because the kernel has _precise_ knowledge about the
>> platform's capabilities.

I think the PIT just needs a clean, ifdef-less implementation; I can't
believe the current one adds a big performance penalty. It is actually
quite simple to implement dyn-tick in qemu. The question is, after
moving the pic, apic and ioapic into the kernel, whether it is nicer to
re-implement the pit in the kernel or in qemu. I suggest we start with
the first three nominees...

> Yeah, I think this is a good point. If we're going to push the APIC
> into the kernel, we might as well put the PIT there too. The timing
> stuff is an absolute mess in QEMU since it wants to get a fast high res
> clock but isn't aware of things like CPU migration. I'm pretty sure
> that if you are on an SMP host, some bad things can happen with the
> QEMU timer code since it relies on the rdtsc.

Btw: KVM supports guest rdtsc across physical cpus; I guess that wasn't
enough, and the qemu PIT side needs to do the same.

> Regards,
>
> Anthony Liguori
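A minimal sketch of the dyn-tick idea under discussion: instead of a
fixed periodic SIGALRM, compute the delay to the nearest deadline in the
timer list and arm a one-shot POSIX timer for exactly that point. This
is illustrative code, not qemu's actual timer infrastructure; the
next_deadline_delta_ns() helper is assumed to exist and scan the active
timer list.

    #include <time.h>

    extern long long next_deadline_delta_ns(void); /* assumed helper */

    /* Created once at startup with timer_create(CLOCK_MONOTONIC, ...). */
    static timer_t dyn_timer;

    /* Arm a one-shot timer for the earliest pending deadline instead of
     * ticking periodically; re-armed after each expiry. */
    static void arm_next_tick(void)
    {
        long long ns = next_deadline_delta_ns(); /* relative delay */
        struct itimerspec its = {
            .it_value    = { .tv_sec  = ns / 1000000000LL,
                             .tv_nsec = ns % 1000000000LL },
            .it_interval = { 0, 0 },  /* one-shot: no periodic reload */
        };

        timer_settime(dyn_timer, 0, &its, NULL);
    }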
Re: [kvm-devel] [PATCH] Support for in-kernel mmio handlers
* Gregory Haskins ([EMAIL PROTECTED]) wrote:
> The MMIO registration code has been broken out as a new patch from the
> in-kernel APIC work with the following changes per Avi's request:
>
> 1) Supports dynamic registration
> 2) Uses gpa_t addresses
> 3) Explicit per-cpu mappings
>
> In addition, I have added the concept of distinct VCPU and VM level
> registrations (where VCPU devices will eclipse competing VM
> registrations, if any). This will be key down the road where LAPICs
> should use VCPU registration, but IOAPICs should use VM level.

hmm, i'm surprised it makes a difference.

> Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
>
> ---
>  drivers/kvm/kvm.h      |   50 +
>  drivers/kvm/kvm_main.c |   53 +++
>  2 files changed, 94 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h
> index fceeb84..3334730 100644
> --- a/drivers/kvm/kvm.h
> +++ b/drivers/kvm/kvm.h
> @@ -236,6 +236,54 @@ struct kvm_pio_request {
>  	int rep;
>  };
>
> +struct kvm_io_device {
> +	unsigned long (*read)(struct kvm_io_device *this,
> +			      gpa_t addr,
> +			      unsigned long length);
> +	void (*write)(struct kvm_io_device *this,
> +		      gpa_t addr,
> +		      unsigned long length,
> +		      unsigned long val);
> +	int (*in_range)(struct kvm_io_device *this, gpa_t addr);
> +
> +	void *private;

This looks unused, what is it meant for?

> +	struct list_head link;
> +};
> +
> +/* It would be nice to use something smarter than a linear search, TBD...
> +   Thankfully we dont expect many devices to register (famous last words :),
> +   so until then it will suffice. At least its abstracted so we can change
> +   in one place.
> + */
> +struct kvm_io_bus {
> +	struct list_head list;
> +};
> +
> +static inline void
> +kvm_io_bus_init(struct kvm_io_bus *bus)
> +{
> +	INIT_LIST_HEAD(&bus->list);
> +}
> +
> +static inline struct kvm_io_device*
> +kvm_io_bus_find_dev(struct kvm_io_bus *bus, gpa_t addr)
> +{
> +	struct kvm_io_device *pos = NULL;
> +
> +	list_for_each_entry(pos, &bus->list, link) {
> +		if(pos->in_range(pos, addr))

linux style nit, missing space after if --> if (pos->in_range(pos, addr))

> +			return pos;
> +	}
> +
> +	return NULL;
> +}
> +
> +static inline void
> +kvm_io_bus_register_dev(struct kvm_io_bus *bus, struct kvm_io_device *dev)
> +{
> +	list_add_tail(&dev->link, &bus->list);
> +}
> +
>  struct kvm_vcpu {
>  	struct kvm *kvm;
>  	union {
> @@ -294,6 +342,7 @@ struct kvm_vcpu {
>  	gpa_t mmio_phys_addr;
>  	struct kvm_pio_request pio;
>  	void *pio_data;
> +	struct kvm_io_bus mmio_bus;
>
>  	int sigset_active;
>  	sigset_t sigset;
> @@ -345,6 +394,7 @@ struct kvm {
>  	unsigned long rmap_overflow;
>  	struct list_head vm_list;
>  	struct file *filp;
> +	struct kvm_io_bus mmio_bus;
>  };
>
>  struct kvm_stat {
> diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
> index 4473174..da119c0 100644
> --- a/drivers/kvm/kvm_main.c
> +++ b/drivers/kvm/kvm_main.c
> @@ -294,6 +294,7 @@ static struct kvm *kvm_create_vm(void)
>
>  	spin_lock_init(&kvm->lock);
>  	INIT_LIST_HEAD(&kvm->active_mmu_pages);
> +	kvm_io_bus_init(&kvm->mmio_bus);

I'd just do INIT_LIST_HEAD, unless you had bigger plans for this wrapper?

>  	for (i = 0; i < KVM_MAX_VCPUS; ++i) {
>  		struct kvm_vcpu *vcpu = &kvm->vcpus[i];
>
> @@ -302,6 +303,7 @@ static struct kvm *kvm_create_vm(void)
>  		vcpu->kvm = kvm;
>  		vcpu->mmu.root_hpa = INVALID_PAGE;
>  		INIT_LIST_HEAD(&vcpu->free_pages);
> +		kvm_io_bus_init(&vcpu->mmio_bus);

ditto

>  	spin_lock(&kvm_lock);
>  	list_add(&kvm->vm_list, &vm_list);
>  	spin_unlock(&kvm_lock);
> @@ -1015,12 +1017,30 @@ static int emulator_write_std(unsigned long addr,
>  	return X86EMUL_UNHANDLEABLE;
>  }
>
> +static struct kvm_io_device* vcpu_find_mmio_dev(struct kvm_vcpu *vcpu,
> +						gpa_t addr)
> +{
> +	struct kvm_io_device *mmio_dev;
> +
> +	/* First check the local CPU addresses */
> +	mmio_dev = kvm_io_bus_find_dev(&vcpu->mmio_bus, addr);
> +	if(!mmio_dev) {

same style nit. and why do you have local vs global check (or dynamic
registration for that matter)?

> +		/* Then check the entire VM */
> +		mmio_dev = kvm_io_bus_find_dev(&vcpu->kvm->mmio_bus, addr);
> +	}
> +
> +	return mmio_dev;
> +}
> +
>  static int emulator_read_emulated(unsigned long addr,
>  				  unsigned long *val,
>  				  unsigned int bytes,
>  				  struct x86_emulate_ctxt *ctxt)
>  {
Re: [kvm-devel] [PATCH] Support for in-kernel mmio handlers
Hi Chris,

Thanks for the feedback. I've answered inline below.

>>> On Wed, Apr 4, 2007 at 6:48 PM, in message <[EMAIL PROTECTED]>,
>>> Chris Wright <[EMAIL PROTECTED]> wrote:
> * Gregory Haskins ([EMAIL PROTECTED]) wrote:
>> The MMIO registration code has been broken out as a new patch from the
>> in-kernel APIC work with the following changes per Avi's request:
>>
>> 1) Supports dynamic registration
>> 2) Uses gpa_t addresses
>> 3) Explicit per-cpu mappings
>>
>> In addition, I have added the concept of distinct VCPU and VM level
>> registrations (where VCPU devices will eclipse competing VM
>> registrations, if any). This will be key down the road where LAPICs
>> should use VCPU registration, but IOAPICs should use VM level.
>
> hmm, i'm surprised it makes a difference.

LAPICs can be remapped on a per-cpu basis via an MSR, whereas something
like an IOAPIC is a system-wide resource.

>> Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
>>
>> ---
>>  drivers/kvm/kvm.h      |   50 +
>>  drivers/kvm/kvm_main.c |   53 +++
>>  2 files changed, 94 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h
>> index fceeb84..3334730 100644
>> --- a/drivers/kvm/kvm.h
>> +++ b/drivers/kvm/kvm.h
>> @@ -236,6 +236,54 @@ struct kvm_pio_request {
>>  	int rep;
>>  };
>>
>> +struct kvm_io_device {
>> +	unsigned long (*read)(struct kvm_io_device *this,
>> +			      gpa_t addr,
>> +			      unsigned long length);
>> +	void (*write)(struct kvm_io_device *this,
>> +		      gpa_t addr,
>> +		      unsigned long length,
>> +		      unsigned long val);
>> +	int (*in_range)(struct kvm_io_device *this, gpa_t addr);
>> +
>> +	void *private;
>
> This looks unused, what is it meant for?

It's unused in this patch because the primary consumer is a follow-on
patch that is not yet released. The original patch had this logic plus
the logic that used it all together, and it was requested that they be
broken apart.

>> +	struct list_head link;
>> +};
>> +
>> +/* It would be nice to use something smarter than a linear search, TBD...
>> +   Thankfully we dont expect many devices to register (famous last words :),
>> +   so until then it will suffice. At least its abstracted so we can change
>> +   in one place.
>> + */
>> +struct kvm_io_bus {
>> +	struct list_head list;
>> +};
>> +
>> +static inline void
>> +kvm_io_bus_init(struct kvm_io_bus *bus)
>> +{
>> +	INIT_LIST_HEAD(&bus->list);
>> +}
>> +
>> +static inline struct kvm_io_device*
>> +kvm_io_bus_find_dev(struct kvm_io_bus *bus, gpa_t addr)
>> +{
>> +	struct kvm_io_device *pos = NULL;
>> +
>> +	list_for_each_entry(pos, &bus->list, link) {
>> +		if(pos->in_range(pos, addr))
>
> linux style nit, missing space after if --> if (pos->in_range(pos, addr))

Yeah, old habits die hard ;) I will fix all of these.

>> +			return pos;
>> +	}
>> +
>> +	return NULL;
>> +}
>> +
>> +static inline void
>> +kvm_io_bus_register_dev(struct kvm_io_bus *bus, struct kvm_io_device *dev)
>> +{
>> +	list_add_tail(&dev->link, &bus->list);
>> +}
>> +
>>  struct kvm_vcpu {
>>  	struct kvm *kvm;
>>  	union {
>> @@ -294,6 +342,7 @@ struct kvm_vcpu {
>>  	gpa_t mmio_phys_addr;
>>  	struct kvm_pio_request pio;
>>  	void *pio_data;
>> +	struct kvm_io_bus mmio_bus;
>>
>>  	int sigset_active;
>>  	sigset_t sigset;
>> @@ -345,6 +394,7 @@ struct kvm {
>>  	unsigned long rmap_overflow;
>>  	struct list_head vm_list;
>>  	struct file *filp;
>> +	struct kvm_io_bus mmio_bus;
>>  };
>>
>>  struct kvm_stat {
>> diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
>> index 4473174..da119c0 100644
>> --- a/drivers/kvm/kvm_main.c
>> +++ b/drivers/kvm/kvm_main.c
>> @@ -294,6 +294,7 @@ static struct kvm *kvm_create_vm(void)
>>
>>  	spin_lock_init(&kvm->lock);
>>  	INIT_LIST_HEAD(&kvm->active_mmu_pages);
>> +	kvm_io_bus_init(&kvm->mmio_bus);
>
> I'd just do INIT_LIST_HEAD, unless you had bigger plans for this wrapper?

The motivation for wrapping the init is that I want to abstract away the
fact that it's a list. This means I can update the mechanism to do
something more intelligent with address lookup (e.g. a b-tree, etc.)
without changing code all over the place. Right now there are only two
consumers, but I envision there will be some more. For instance, I would
like to get PIOs using this mechanism at some point (so I can snarf
accesses to the 8259s at 0x20/0xa0).

>>  	for (i = 0; i < KVM_MAX_VCPUS; ++i) {
>>  		struct kvm_vcpu *vcpu = &kvm->vcpus[i];
>>
>> @@ -302,6 +303,7 @@ static struct kvm *kvm_create_vm(void)
>>  		vcpu->kvm = kvm;
>>  		vcpu->mmu.root_hpa = INVALID_PAGE;
>>  		INIT_LIST_HEAD(&vcpu->free_pages);
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
On Wed, 2007-04-04 at 23:21 +0200, Ingo Molnar wrote:
> * Anthony Liguori <[EMAIL PROTECTED]> wrote:
>
>> But why is it a good thing to do PV drivers in the kernel? You lose
>> flexibility and functionality to gain performance. [...]
>
> in Linux a kernel-space network driver can still be tunneled over
> user-space code, and hence you can add arbitrary add-on functionality
> (and thus have flexibility), without slowing down the common case
> (which would be to tunnel the guest's network traffic into the
> firewall rules of the kernel. No need to touch user-space for any of
> that).

You didn't quote Anthony's point about "it's more about there not being
good enough userspace interfaces to do network IO."

It's easier to write a kernel-space network driver, but it's not
obviously the right thing to do until we can show that an efficient
packet-level userspace interface isn't possible. I don't think that's
been done, and it would be interesting to try.

Cheers,
Rusty.
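For context, the packet-level userspace interface qemu uses today is the
tun/tap character device; a minimal sketch of opening a tap interface,
where each read() then returns one guest-bound ethernet frame (standard
Linux API; error handling trimmed):

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <net/if.h>
    #include <linux/if_tun.h>

    /* Open a tap interface. The question in this thread is whether an
     * interface like this can be made efficient enough to keep PV
     * network drivers in userspace. */
    static int tap_open(const char *name)
    {
        struct ifreq ifr;
        int fd = open("/dev/net/tun", O_RDWR);

        if (fd < 0)
            return -1;

        memset(&ifr, 0, sizeof(ifr));
        ifr.ifr_flags = IFF_TAP | IFF_NO_PI;  /* raw frames, no header */
        strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);

        if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }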
Re: [kvm-devel] [PATCH] Support for in-kernel mmio handlers
The attachment contains fixes based on the feedback from Chris. Thanks
Chris!

Regards,
-Greg

diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h
index fceeb84..0e6eb04 100644
--- a/drivers/kvm/kvm.h
+++ b/drivers/kvm/kvm.h
@@ -236,6 +236,54 @@ struct kvm_pio_request {
 	int rep;
 };
 
+struct kvm_io_device {
+	unsigned long (*read)(struct kvm_io_device *this,
+			      gpa_t addr,
+			      unsigned long length);
+	void (*write)(struct kvm_io_device *this,
+		      gpa_t addr,
+		      unsigned long length,
+		      unsigned long val);
+	int (*in_range)(struct kvm_io_device *this, gpa_t addr);
+
+	void *private;
+	struct list_head link;
+};
+
+/* It would be nice to use something smarter than a linear search, TBD...
+ * Thankfully we dont expect many devices to register (famous last words :),
+ * so until then it will suffice. At least its abstracted so we can change
+ * in one place.
+ */
+struct kvm_io_bus {
+	struct list_head list;
+};
+
+static inline void
+kvm_io_bus_init(struct kvm_io_bus *bus)
+{
+	INIT_LIST_HEAD(&bus->list);
+}
+
+static inline struct kvm_io_device*
+kvm_io_bus_find_dev(struct kvm_io_bus *bus, gpa_t addr)
+{
+	struct kvm_io_device *pos = NULL;
+
+	list_for_each_entry(pos, &bus->list, link) {
+		if (pos->in_range(pos, addr))
+			return pos;
+	}
+
+	return NULL;
+}
+
+static inline void
+kvm_io_bus_register_dev(struct kvm_io_bus *bus, struct kvm_io_device *dev)
+{
+	list_add_tail(&dev->link, &bus->list);
+}
+
 struct kvm_vcpu {
 	struct kvm *kvm;
 	union {
@@ -294,6 +342,7 @@ struct kvm_vcpu {
 	gpa_t mmio_phys_addr;
 	struct kvm_pio_request pio;
 	void *pio_data;
+	struct kvm_io_bus mmio_bus;
 
 	int sigset_active;
 	sigset_t sigset;
@@ -345,6 +394,7 @@ struct kvm {
 	unsigned long rmap_overflow;
 	struct list_head vm_list;
 	struct file *filp;
+	struct kvm_io_bus mmio_bus;
 };
 
 struct kvm_stat {
diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index 4473174..c8109b7 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -294,6 +294,7 @@ static struct kvm *kvm_create_vm(void)
 
 	spin_lock_init(&kvm->lock);
 	INIT_LIST_HEAD(&kvm->active_mmu_pages);
+	kvm_io_bus_init(&kvm->mmio_bus);
 	for (i = 0; i < KVM_MAX_VCPUS; ++i) {
 		struct kvm_vcpu *vcpu = &kvm->vcpus[i];
 
@@ -302,6 +303,7 @@ static struct kvm *kvm_create_vm(void)
 		vcpu->kvm = kvm;
 		vcpu->mmu.root_hpa = INVALID_PAGE;
 		INIT_LIST_HEAD(&vcpu->free_pages);
+		kvm_io_bus_init(&vcpu->mmio_bus);
 	spin_lock(&kvm_lock);
 	list_add(&kvm->vm_list, &vm_list);
 	spin_unlock(&kvm_lock);
@@ -1015,12 +1017,30 @@ static int emulator_write_std(unsigned long addr,
 	return X86EMUL_UNHANDLEABLE;
 }
 
+static struct kvm_io_device* vcpu_find_mmio_dev(struct kvm_vcpu *vcpu,
+						gpa_t addr)
+{
+	struct kvm_io_device *mmio_dev;
+
+	/* First check the local CPU addresses */
+	mmio_dev = kvm_io_bus_find_dev(&vcpu->mmio_bus, addr);
+	if (!mmio_dev) {
+		/* Then check the entire VM */
+		mmio_dev = kvm_io_bus_find_dev(&vcpu->kvm->mmio_bus, addr);
+	}
+
+	return mmio_dev;
+}
+
 static int emulator_read_emulated(unsigned long addr,
 				  unsigned long *val,
 				  unsigned int bytes,
 				  struct x86_emulate_ctxt *ctxt)
 {
 	struct kvm_vcpu *vcpu = ctxt->vcpu;
+	gpa_t gpa;
+	int i;
+	struct kvm_io_device *mmio_dev;
 
 	if (vcpu->mmio_read_completed) {
 		memcpy(val, vcpu->mmio_data, bytes);
@@ -1029,18 +1049,24 @@ static int emulator_read_emulated(unsigned long addr,
 	} else if (emulator_read_std(addr, val, bytes, ctxt)
 		   == X86EMUL_CONTINUE)
 		return X86EMUL_CONTINUE;
-	else {
-		gpa_t gpa = vcpu->mmu.gva_to_gpa(vcpu, addr);
-		if (gpa == UNMAPPED_GVA)
-			return X86EMUL_PROPAGATE_FAULT;
-		vcpu->mmio_needed = 1;
-		vcpu->mmio_phys_addr = gpa;
-		vcpu->mmio_size = bytes;
-		vcpu->mmio_is_write = 0;
+	gpa = vcpu->mmu.gva_to_gpa(vcpu, addr);
+	if (gpa == UNMAPPED_GVA)
+		return vcpu_printf(vcpu, "not present\n"), X86EMUL_PROPAGATE_FAULT;
-		return X86EMUL_UNHANDLEABLE;
+	/* Is this MMIO handled locally? */
+	mmio_dev = vcpu_find_mmio_dev(vcpu, gpa);
+	if (mmio_dev) {
+		*val = mmio_dev->read(mmio_dev, gpa, bytes);
+		return X86EMUL_CONTINUE;
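To illustrate how a consumer of this bus would look, a sketch of
registering a trivial VM-wide device against the interface above. Only
kvm_io_device/kvm_io_bus come from the patch; the scratch device, its
address and its helpers are hypothetical.

/* Hypothetical example device built on the kvm_io_bus API from the
 * patch above: a 4-byte scratch register at guest-physical 0xfee10000. */
#define SCRATCH_BASE 0xfee10000UL
#define SCRATCH_SIZE 4

static unsigned long scratch_val;

static int scratch_in_range(struct kvm_io_device *this, gpa_t addr)
{
	return addr >= SCRATCH_BASE && addr < SCRATCH_BASE + SCRATCH_SIZE;
}

static unsigned long scratch_read(struct kvm_io_device *this,
				  gpa_t addr, unsigned long length)
{
	return scratch_val;
}

static void scratch_write(struct kvm_io_device *this, gpa_t addr,
			  unsigned long length, unsigned long val)
{
	scratch_val = val;
}

static struct kvm_io_device scratch_dev = {
	.read     = scratch_read,
	.write    = scratch_write,
	.in_range = scratch_in_range,
};

/* VM-wide registration; a per-vcpu LAPIC would instead register on
 * vcpu->mmio_bus and eclipse this entry. */
static void scratch_register(struct kvm *kvm)
{
	kvm_io_bus_register_dev(&kvm->mmio_bus, &scratch_dev);
}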
Re: [kvm-devel] [PATCH] Support for in-kernel mmio handlers
* Gregory Haskins ([EMAIL PROTECTED]) wrote:
> LAPICs can be remapped on a per-cpu basis via an MSR, whereas something
> like an IOAPIC is a system-wide resource.

Yes, I see now; there is no vcpu in the kvm_io_device callbacks' context
(admittedly, I'm used to the Xen implementation ;-)

>>> +struct kvm_io_device {
>>> +	unsigned long (*read)(struct kvm_io_device *this,
>>> +			      gpa_t addr,
>>> +			      unsigned long length);
>>> +	void (*write)(struct kvm_io_device *this,
>>> +		      gpa_t addr,
>>> +		      unsigned long length,
>>> +		      unsigned long val);
>>> +	int (*in_range)(struct kvm_io_device *this, gpa_t addr);
>>> +
>>> +	void *private;
>>
>> This looks unused, what is it meant for?
>
> It's unused in this patch because the primary consumer is a follow-on
> patch that is not yet released. The original patch had this logic plus
> the logic that used it all together, and it was requested that they be
> broken apart.

Makes sense, I'll wait to see a user to understand how it's used.

>>> +++ b/drivers/kvm/kvm_main.c
>>> @@ -294,6 +294,7 @@ static struct kvm *kvm_create_vm(void)
>>>
>>>  	spin_lock_init(&kvm->lock);
>>>  	INIT_LIST_HEAD(&kvm->active_mmu_pages);
>>> +	kvm_io_bus_init(&kvm->mmio_bus);
>>
>> I'd just do INIT_LIST_HEAD, unless you had bigger plans for this wrapper?
>
> The motivation for wrapping the init is that I want to abstract away
> the fact that it's a list. This means I can update the mechanism to do
> something more intelligent with address lookup (e.g. a b-tree, etc.)
> without changing code all over the place. Right now there are only two
> consumers, but I envision there will be some more. For instance, I
> would like to get PIOs using this mechanism at some point (so I can
> snarf accesses to the 8259s at 0x20/0xa0).

Right, you even alluded to that in your comments. I just didn't expect a
list to really become so long that it needed a more complex data
structure.

thanks,
-chris
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
Ingo Molnar wrote:
> * Avi Kivity <[EMAIL PROTECTED]> wrote:
>
>>> It still exists in userspace. Having the code duplication
>>> (especially when it's not the same code base) is unfortunate.
>>
>> This remains true.
>
> but it's the wrong argument. Of course there's duplicate functionality,
> and that's _good_ because it represents choice. KVM _itself_ is
> duplicate functionality of qemu in a way. So why move the lapic/PIC
> handling to the kernel? Because it's a lot cleaner to do device
> emulation there and PV drivers get significantly easier to do. The
> lapic/PIC code should also be available in Qemu for OSs that don't have
> KVM-alike support in the kernel.

Duplicating code is never good, and duplicating code into the kernel
(where maintenance cost is much higher) is bad. There has to be a very
good reason for it. For the core kvm code, it is the 10x or more
performance increase over qemu. If we are to add *pic/pit to kvm, we
need to find an advantage that offsets the disadvantages. This can be a
combination of simpler interfaces and better performance.

> and while today most of the performance advantages of moving the PIC
> into the kernel are masked by the high cost of VM exits, in the future
> the effect will be more marked, as the relative cost of piggybacking
> out to qemu increases.

That is correct, but we need to quantify it. Assuming 3us of overhead
per exit to qemu, 30,000 events per second per core cost 30,000 x 3us =
90ms out of every second, i.e. roughly a 10% overall overhead. That's
>100K events/sec for an entry-level server.

> I can see the value in doing certain things in Qemu, but i cannot see
> _at all_ the value of handling say the PIT in Qemu. Just look at the
> Qemu PIT/timers code quality for a change ... it's a huge ugly mess of
> lots of #ifdefs, ineffective handling of /dev/rtc, linear list walking,
> signal overhead, etc., etc.

Bad userspace code should be fixed, not rewritten as kernel code.

> All of that resulting in 10-15% of 'idle' overhead of KVM+qemu when it
> runs a Linux guest.

On my machines it's 0% overhead on idle (runlevel 3).

> On the other side, in the kernel it's most natural to do timers and to
> emulate hardware, because the kernel has _precise_ knowledge about the
> platform's capabilities.

That comes back to the kernel not exporting proper interfaces. I think
that's fixed now, with hrtimers tied to the userspace APIs?

-- 
Do not meddle in the internals of kernels, for they are subtle and quick
to panic.
Re: [kvm-devel] KVM in-kernel APIC update
Ingo Molnar wrote:
> we should move all the PICs into KVM proper - and that includes the
> i8259A PIC too. Qemu-space drivers are then wired to pins on these PICs,
> but nothing in Qemu does vector generation or vector prioritization -
> that task is purely up to KVM. There are mixed i8259A+lapic models
> possible too and the simplest model is to have all vector handling in
> KVM.
>
> any 'cut' of the interface to allow both qemu and KVM to generate
> vectors is unnecessary (and harmful) complexity. The interface cut
> should be at the 'pin' level, with Qemu raising a signal on a pin and
> lowering a signal on a pin, but otherwise not dealing with IRQ routing
> and IRQ vectors.

Following is my view of the possible cuts:

- everything in userspace: worst performance, but needed for
  compatibility

- tpr in kernel: minimal effort, but a badly defined interface

- lapic in kernel: well defined (all on-chip stuff in kernel, off-chip
  in userspace), fixes the main Windows problems; interrupts still need
  userspace. The interface is the processor's LINT pins and the APIC
  bus.

- *pic in kernel: most effort, easiest irq synchronization. The
  interface is pic/ioapic pin level, plus a topology description that
  userspace uses to tell the kernel how the pins are connected
  (logically required, but practically not).

-- 
Do not meddle in the internals of kernels, for they are subtle and quick
to panic.
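A purely hypothetical sketch of what the last option's topology
description might look like; nothing like this exists yet, and the
structures below are illustrative only:

    #include <stdint.h>

    /* Illustrative-only: userspace describes, once at setup time, how
     * device pins are wired into the in-kernel pic/ioapic complex. */
    enum kvm_irqchip_kind {
        KVM_CHIP_PIC_MASTER,
        KVM_CHIP_PIC_SLAVE,
        KVM_CHIP_IOAPIC,
    };

    struct kvm_pin_route {
        uint32_t src_pin;            /* global pin number qemu asserts */
        enum kvm_irqchip_kind chip;  /* which chip the pin lands on */
        uint32_t chip_input;         /* input number on that chip */
    };

    struct kvm_irq_topology {
        uint32_t nroutes;
        struct kvm_pin_route routes[64]; /* e.g. ISA IRQ0 -> PIC master
                                            input 0 and IOAPIC input 2
                                            would be two entries with
                                            the same src_pin */
    };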
Re: [kvm-devel] KVM in-kernel APIC update
Ingo Molnar wrote:
> there is a remote possibility that some OSs depend on certain devices
> being level-triggered: for example if you get an IRQ from a
> level-triggered device and _dont_ deassert that signal from the IRQ
> handler (intentionally so), then the semantics of current hardware will
> cause a second interrupt to be sent by the PIC, after the APIC message
> has been EOI-ed in the local APIC. While such "repeat interrupts" would
> be pure madness to rely on i think, i'm not sure it's not being done.
> Note that if the same IO-APIC pin is set up to detect edges then not
> deasserting the signal would not cause a 'repeat interrupt'. Whether
> such accurate emulation of signalling is needed depends on the hardware
> semantics of the devices we emulate.

Don't all OSes heavily depend on level-triggered interrupts for irq line
sharing?

-- 
Do not meddle in the internals of kernels, for they are subtle and quick
to panic.
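The "repeat interrupt" semantics Ingo describes are easy to capture in
an emulated pin model; a minimal sketch (illustrative only, not KVM
code): for a level-triggered pin, EOI re-raises the pending bit whenever
the line is still asserted, while an edge-triggered pin only latches on
a 0->1 transition.

    enum trigger_mode { EDGE, LEVEL };

    struct irq_pin {
        enum trigger_mode mode;
        int line;   /* current wire state, set by the device model */
        int irr;    /* latched "interrupt pending" bit */
    };

    static void pin_set_level(struct irq_pin *pin, int level)
    {
        if (pin->mode == EDGE && !pin->line && level)
            pin->irr = 1;       /* latch the rising edge only */
        if (pin->mode == LEVEL)
            pin->irr = level;   /* irr follows the wire */
        pin->line = level;
    }

    /* The cpu accepts the interrupt; returns whether one was pending. */
    static int pin_ack(struct irq_pin *pin)
    {
        int was_pending = pin->irr;

        pin->irr = 0;
        return was_pending;
    }

    static void pin_eoi(struct irq_pin *pin)
    {
        if (pin->mode == LEVEL && pin->line)
            pin->irr = 1;       /* never deasserted: repeat interrupt */
    }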
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
Anthony Liguori wrote:
> Yeah, I think this is a good point. If we're going to push the APIC
> into the kernel, we might as well put the PIT there too. The timing
> stuff is an absolute mess in QEMU since it wants to get a fast high
> res clock but isn't aware of things like CPU migration.

qemu timing could indeed use an overhaul, for example not relying on a
periodic tick but instead calculating the next wakeup event (for which
it has all the infrastructure already; it's just unused).

> I'm pretty sure that if you are on an SMP host, some bad things can
> happen with the QEMU timer code since it relies on the rdtsc.

It seems to compensate for it.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick
to panic.