Re: Time and KVM - best practices
On 03/21/2010 01:29 PM, Thomas Løcke wrote:
> Hey, What is considered best practice when running a KVM host with a mixture of Linux and Windows guests? Currently I have ntpd running on the host, and I start my guests using -rtc base=localhost,clock=host, with an extra -tdf added for Windows guests, just to keep their clock from drifting madly under load. But with this setup, all my guests are constantly 1-2 seconds behind the host.

Is it just during boot time? If you run ntpdate inside the guest after boot, is the time 100% in sync with the host from that moment on? Glauber once analyzed it and blamed the hwclock call in rc.sysinit.

> I can live with that for the Windows guests, as they are not running anything that depends heavily on the time being set perfectly, but for some of the Linux guests it's an issue. Would I be better off using ntpd and -rtc base=localhost,clock=vm for all the Linux guests, or is there some other magic way of ensuring that the clock is perfectly in sync with the host? Perhaps there is some kernel configuration I can do to optimize the host for KVM?

Jan is the expert here, but last I checked clock=vm is not appropriate, since this is virtual time and not host time - if qemu is stopped/migrated you won't notice it within the guest, but the drift will grow.

> I'm currently using QEMU PC emulator version 0.12.50 (qemu-kvm-devel) because version 0.12.30 did not work well at all with Windows guests, and the kernel in both host and Linux guests is 2.6.33.1 :o)
> /Thomas

--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: About KVM Forum 2010
On 03/17/2010 07:37 AM, kazushi takahashi wrote:
> Hi all, Does anybody know the exact important dates, such as the paper deadline, for KVM Forum 2010? I can find this blog (http://www.linux-kvm.com/content/kvm-forum-2010-scheduled-august-9-10-2010), but it only mentions the dates of the conference.
> Regards, Kazushi Takahashi

It's not yet official and Chris Wright will publish the dates, but last we talked it was about asking for pretty simple abstracts (a paragraph or two, ~100-150 words) due by April 15, with notification by May 7th. Again, not official, probably because of the admin work needed to set up a site for the paper submission. So Chris will update us all officially; in the meantime, all can start working on their proposals.

hth, Dor
Re: Timedrift in KVM guests after livemigration.
On 04/18/2010 02:21 AM, Espen Berg wrote:
> Den 17.04.2010 22:17, skrev Michael Tokarev:
>>> We have three KVM hosts that support live-migration between them, but one of our problems is time drifting. The three frontends have different CPU frequencies, and the KVM guests adopt the frequency from the host machine where they were first started.
>> What do you mean by "adopts"? Note that the CPU frequency means nothing for all modern operating systems, at least since the days of common usage of MS-DOS, which relied on the CPU frequency for its time functions. All the interesting things are now done using timers instead, and timers (which don't depend on CPU frequency) usually work quite well.
> The assumption that the frequency of the ticks was calculated from the host's MHz was based on the fact that greater clock-frequency differences caused higher time drift. A 60 MHz difference caused about 24 min of drift; a 332 MHz difference caused about 2h25min of drift.
>> What complicates things is that the cheapest and accurate-enough time source is the TSC (time stamp counter register in the CPU), but it will definitely be different on each machine. For that, 0.12.3 kvm and the 2.6.32 kernel (I think) introduced a compensation. See for example the -tdf kvm option.
> Ah, nice to know. :)

There are two different things here: the issue that Espen is reporting is that the hosts have different frequencies, and guests that rely on the tsc as a clock source will notice that post-migration. This is indeed a problem that -tdf does not solve; -tdf only adds compensation for the RTC clock emulation. What's the guest type and what's the guest's clock source? Using the tsc directly as a clock source is not recommended because of this migration issue (which is not solvable until we trap every rdtsc by the guest). Using pv kvmclock in Linux mitigates this issue, since it exposes both the tsc and the host clock so guests can adjust themselves.
Several months ago a pvclock migration fix was added to pass the pvclock MSR readings to the destination: 1a03675db146dfc760b3b48b3448075189f142cc

>>> Since this is a cluster in production, I'm not able to try the latest version either.
>> Well, that's a difficult one, no? It either works or not. If you can't try anything else, why ask? :)
> What I tried to say was that there are many important virtual servers running on this cluster at the moment, so trial and error was not an option. The last time we tried 0.12.x (during the initial tests of the cluster) there were a lot of stability issues, crashes during migration etc.
> Regards, Espen
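The rough proportionality between CPU-frequency mismatch and drift that the thread reports (60 MHz vs. 332 MHz of difference) can be sanity-checked with a small calculation. This is only a sketch: it assumes the guest's timekeeping was calibrated against one host's TSC frequency and then runs against another, and the 2 GHz base frequency and 24-hour interval below are hypothetical, not taken from the thread.

```python
# Sketch: if a guest calibrates its clock against a TSC of freq_calibrated
# but then actually runs on a host whose TSC ticks at freq_actual, its
# clock runs fast or slow by the ratio of the two frequencies.

def drift_seconds(freq_calibrated_mhz, freq_actual_mhz, elapsed_hours):
    """Seconds of drift accumulated over elapsed_hours of wall-clock time."""
    ratio = freq_actual_mhz / freq_calibrated_mhz
    return abs(ratio - 1.0) * elapsed_hours * 3600.0

# Hypothetical 2 GHz host with a 60 MHz mismatch, over 24 hours:
print(drift_seconds(2000.0, 2060.0, 24.0))  # ~2592 s, i.e. ~43 min/day
```

The point is only that drift grows linearly with the frequency mismatch, which matches the observation in the thread that a larger MHz difference produced proportionally more drift.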
Re: Timedrift in KVM guests after livemigration.
On 04/19/2010 12:29 PM, Gleb Natapov wrote:
> On Mon, Apr 19, 2010 at 11:21:47AM +0200, Espen Berg wrote:
>> Den 18.04.2010 11:56, skrev Gleb Natapov:
>>> That's two different things here: the issue that Espen is reporting is that the hosts have different frequencies, and guests that rely on the tsc as a clock source will notice that post-migration. This is indeed a problem that -tdf does not solve; -tdf only adds compensation for the RTC clock emulation.
> It's -rtc-td-hack. -tdf does pit compensation, but since the kernel pit is usually used it does nothing.
>> So this hack will not solve our problem?

As I also stated, in the past the kvmclock MSRs were not synced upon live migration; it was fixed in 1a03675db146dfc760b3b48b3448075189f142cc - better check with the code.

> If your guest uses the RTC for time keeping it may help. Otherwise it does nothing.
> --
> Gleb.
Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1
On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
> Hi all, We have been implementing the prototype of Kemari for KVM, and we're sending this message to share what we have now and our TODO lists. Hopefully, we would like to get early feedback to keep us in the right direction. Although the advanced approaches in the TODO lists are fascinating, we would like to run this project step by step while absorbing comments from the community. The current code is based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27. For those who are new to Kemari for KVM, please take a look at the following RFC which we posted last year. http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html
> The transmission/transaction protocol and most of the control logic is implemented in QEMU. However, we needed a hack in KVM to prevent the rip from proceeding before synchronizing VMs. It may also need some plumbing on the kernel side to guarantee replayability of certain events and instructions, to integrate the RAS capabilities of newer x86 hardware with the HA stack, and for optimization purposes, for example.
> [snip]
> The rest of this message describes the TODO lists grouped by topic.
>
> === event tapping ===
> Event tapping is the core component of Kemari, and it decides on which event the primary should synchronize with the secondary. The basic assumption here is that outgoing I/O operations are idempotent, which is usually true for disk I/O and reliable network protocols such as TCP.

IMO any type of network event should be stalled too. What if the VM runs a non-TCP protocol and the packet that the master node sent reached some remote client, and before the sync to the slave the master failed?

> [snip]
> === clock ===
> Since synchronizing the virtual machines every time the TSC is accessed would be prohibitive, the transmission of the TSC will be done lazily, which means delaying it until a non-TSC synchronization point arrives.

Why do you specifically care about the tsc sync?
When you sync the whole I/O model on a snapshot it also synchronizes the tsc. In general, can you please explain the 'algorithm' for continuous snapshots (is that what you'd like to do?). A trivial one would be to:
- do X online snapshots/sec
- stall all I/O (disk/block) from the guest to the outside world until the previous snapshot reaches the slave
- snapshots are made of:
  - a diff of the dirty pages from the last snapshot
  - the Qemu device model (+kvm's) diff from the last one

You can do 'light' snapshots in between to send dirty pages to reduce snapshot time. I wrote the above to serve as a reference for your comments so it will map into my mind.

Thanks, dor

> TODO:
> - Synchronization of clock sources (need to intercept TSC reads, etc).
>
> === usability ===
> These are items that define how users interact with Kemari.
> TODO:
> - Kemarid daemon that takes care of the cluster management/monitoring side of things.
> - Some device emulators might need minor modifications to work well with Kemari. Use white(black)-listing to take the burden of choosing the right device model off the users.
>
> === optimizations ===
> Although the big picture can be realized by completing the TODO list above, we need some optimizations/enhancements to make Kemari useful in the real world, and these are the items that need to be done for that.
> TODO:
> - SMP (for the sake of performance we might need to implement a synchronization protocol that can maintain two or more synchronization points active at any given moment)
> - VGA (leverage VNC's sub-tiling mechanism to identify fb pages that are really dirty).
>
> Any comments/suggestions would be greatly appreciated.
> Thanks, Yoshi

--
Kemari starts synchronizing VMs when QEMU handles I/O requests. Without this patch, the VCPU state has already proceeded before synchronization, and after failover to the VM on the receiver it hangs because of this.
Signed-off-by: Yoshiaki Tamura <tamura.yoshi...@lab.ntt.co.jp>
---
 arch/x86/include/asm/kvm_host.h |    1 +
 arch/x86/kvm/svm.c              |   11 ---
 arch/x86/kvm/vmx.c              |   11 ---
 arch/x86/kvm/x86.c              |    4
 4 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 26c629a..7b8f514 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -227,6 +227,7 @@ struct kvm_pio_request {
 	int in;
 	int port;
 	int size;
+	bool lazy_skip;
 };

 /*
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index d04c7ad..e373245 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1495,7 +1495,7 @@ static int io_interception(struct vcpu_svm *svm)
 {
 	struct kvm_vcpu *vcpu = &svm->vcpu;
 	u32 io_info = svm->vmcb->control.exit_info_1; /* address size bug? */
-	int size, in, string;
+	int size, in, string, ret;
 	unsigned port;

 	++svm->vcpu.stat.io_exits;
@@
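The trivial continuous-snapshot algorithm outlined in the thread above (stall outbound I/O until the previous snapshot reaches the slave, then ship the dirty-page diff plus the device-model state) can be modeled in a few lines. Everything here is illustrative - the class and method names are invented and the in-memory "slave" stands in for the network transfer; this is not Kemari's actual implementation:

```python
# Toy model of one Kemari-style synchronization round: the primary buffers
# outbound I/O, ships a snapshot (dirty pages + device state) to the slave,
# and only releases the buffered I/O once the slave has the snapshot.

class ToyGuest:
    def __init__(self):
        self.pending_io = []   # outbound packets not yet released
        self.dirty = set()     # pages dirtied since the last snapshot
        self.released = []     # packets actually sent to the outside world

    def write_page(self, page):
        self.dirty.add(page)

    def queue_io(self, pkt):
        self.pending_io.append(pkt)

    def drain_outbound_io(self):
        out, self.pending_io = self.pending_io, []
        return out

    def collect_dirty_pages(self):
        d, self.dirty = self.dirty, set()
        return d

    def release_outbound_io(self, out):
        self.released.extend(out)

class ToySlave:
    def __init__(self):
        self.snapshots = []
    def receive(self, snap):           # in reality this blocks until acked
        self.snapshots.append(snap)

def sync_point(guest, slave):
    """One round: stall I/O, ship the diff, then release the I/O."""
    outbound = guest.drain_outbound_io()                 # stall outbound I/O
    snap = {"dirty_pages": guest.collect_dirty_pages(),  # diff since last round
            "device_state": "full-copy"}                 # prototype sends a full copy
    slave.receive(snap)                                  # slave now has this point
    guest.release_outbound_io(outbound)                  # safe: slave can replay here

g, s = ToyGuest(), ToySlave()
g.write_page(1); g.write_page(2); g.queue_io("tcp-ack")
sync_point(g, s)
print(s.snapshots[0]["dirty_pages"], g.released)  # {1, 2} ['tcp-ack']
```

The ordering is the important part: the outside world only ever observes I/O for which the slave already holds a snapshot, which is what makes failover transparent.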
Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1
On 04/22/2010 01:35 PM, Yoshiaki Tamura wrote:
> Dor Laor wrote:
>> On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
>>> Hi all, We have been implementing the prototype of Kemari for KVM, and we're sending this message to share what we have now and our TODO lists. [snip]
>>> === event tapping ===
>>> Event tapping is the core component of Kemari, and it decides on which event the primary should synchronize with the secondary. The basic assumption here is that outgoing I/O operations are idempotent, which is usually true for disk I/O and reliable network protocols such as TCP.
>> IMO any type of network event should be stalled too. What if the VM runs a non-TCP protocol and the packet that the master node sent reached some remote client, and before the sync to the slave the master failed?
> In the current implementation, it is actually stalling any type of network that goes through virtio-net. However, if the application was using unreliable protocols, it should have its own recovery mechanism, or it should be completely stateless.
Why do you treat tcp differently? You can damage the entire VM this way - think of a dhcp request that was dropped at the moment you switched between the master and the slave.

>>> [snip]
>>> === clock ===
>>> Since synchronizing the virtual machines every time the TSC is accessed would be prohibitive, the transmission of the TSC will be done lazily, which means delaying it until a non-TSC synchronization point arrives.
>> Why do you specifically care about the tsc sync? When you sync the whole I/O model on a snapshot it also synchronizes the tsc.
> So, do you agree that an extra clock synchronization is not needed, since it is done anyway as part of the live migration state sync?
>> In general, can you please explain the 'algorithm' for continuous snapshots (is that what you'd like to do?):
> Yes, of course. Sorry for being less informative.
>> A trivial one would be to:
>> - do X online snapshots/sec
> I currently don't have good numbers that I can share right now. Snapshots/sec depends on what kind of workload is running; if the guest is almost idle, there will be no snapshots in 5 sec. On the other hand, if the guest is running I/O intensive workloads (netperf, iozone for example), there will be about 50 snapshots/sec.
>> - Stall all I/O (disk/block) from the guest to the outside world until the previous snapshot reaches the slave.
> Yes, it does.
>> - Snapshots are made of
> Full device model + diff of dirty pages from the last snapshot.
>> - diff of dirty pages from last snapshot
> This also depends on the workload. In case of I/O intensive workloads, dirty pages are usually less than 100.

The hardest would be memory-intensive loads. So 100 snap/sec means a latency of 10 msec, right? (not that it's not ok; with faster hw and IB you'll be able to get much more)

>> - Qemu device model (+kvm's) diff from last.
> We're currently sending a full copy because we're completely reusing this part of the existing live migration framework. Last time we measured, it was about 13KB, but it varies by which QEMU version is used.
>> You can do 'light' snapshots in between to send dirty pages to reduce snapshot time.
> I agree. That's one of the advanced topics we would like to try too.
>> I wrote the above to serve as a reference for your comments so it will map into my mind. Thanks, dor
> Thank you for the guidance. I hope this answers your question. At the same time, I would also be happy if we could discuss how to implement it too. In fact, we needed a hack to prevent the rip from proceeding in KVM, which turned out not to be the best workaround.

There are brute force solutions like:
- stop the guest until you send all of the snapshot to the remote (like standard live migration)
- Stop + fork + cont the father
Or mark the recent dirty pages that were not sent to the remote as write-protected and copy them if touched.

> Thanks, Yoshi
> TODO:
> - Synchronization of clock sources (need to intercept TSC reads, etc).
> === usability
Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1
On 04/22/2010 04:16 PM, Yoshiaki Tamura wrote:
> 2010/4/22 Dor Laor <dl...@redhat.com>:
>> On 04/22/2010 01:35 PM, Yoshiaki Tamura wrote:
>>> Dor Laor wrote:
>>>> On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
>>>>> Hi all, We have been implementing the prototype of Kemari for KVM, and we're sending this message to share what we have now and our TODO lists. [snip]
>>>>> === event tapping ===
>>>>> Event tapping is the core component of Kemari, and it decides on which event the primary should synchronize with the secondary. The basic assumption here is that outgoing I/O operations are idempotent, which is usually true for disk I/O and reliable network protocols such as TCP.
>>>> IMO any type of network event should be stalled too. What if the VM runs a non-TCP protocol and the packet that the master node sent reached some remote client, and before the sync to the slave the master failed?
>>> In the current implementation, it is actually stalling any type of network that goes through virtio-net.
>>> However, if the application was using unreliable protocols, it should have its own recovery mechanism, or it should be completely stateless.
>> Why do you treat tcp differently? You can damage the entire VM this way - think of a dhcp request that was dropped at the moment you switched between the master and the slave?
> I'm not trying to say that we should treat tcp differently, just that it's more severe. In the case of a dhcp request, the client would have a chance to retry after failover, correct?

But until it times out it won't have networking.

> BTW, in the current implementation, it's synchronizing before the dhcp ack is sent. But in the case of tcp, once you send an ack to the client before the sync, there is no way to recover.

What if the guest is running a dhcp server? If we provide an IP to a client and then fail over to the secondary, it will run without knowing the master allocated this IP.

> [snip]
>>>>> === clock ===
>>>>> Since synchronizing the virtual machines every time the TSC is accessed would be prohibitive, the transmission of the TSC will be done lazily, which means delaying it until a non-TSC synchronization point arrives.
>>>> Why do you specifically care about the tsc sync? When you sync the whole I/O model on a snapshot it also synchronizes the tsc.
>>> So, do you agree that an extra clock synchronization is not needed, since it is done anyway as part of the live migration state sync?

I agree that it's sent as part of the live migration. What I wanted to say here is that this is not something for real-time applications. I usually get questions like "can this guarantee fault tolerance for real-time applications?". First, the huge cost of snapshots won't match any real-time app. Second, even if that wasn't the case, the tsc delta and kvmclock are synchronized as part of the VM state, so there is no use in trapping it in the middle.

>>>> In general, can you please explain the 'algorithm' for continuous snapshots (is that what you'd like to do?):
>>> Yes, of course. Sorry for being less informative.
>>>> A trivial one would be to:
>>>> - do X online snapshots/sec
>>> I currently don't have good numbers that I can share right now. Snapshots/sec depends on what kind of workload is running; if the guest is almost idle, there will be no snapshots in 5 sec. On the other hand, if the guest is running I/O intensive workloads (netperf, iozone for example), there will be about 50 snapshots/sec.
>>>> - Stall all I/O (disk/block) from the guest to the outside world until the previous snapshot reaches the slave.
>>> Yes, it does.
>>>> - Snapshots are made of
>>> Full device model + diff of dirty pages from the last snapshot.
>>>> - diff of dirty pages from last snapshot
>>> This also depends on the workload. In case of I/O intensive workloads, dirty pages are usually less than 100.
>> The hardest would be memory-intensive loads. So 100 snap/sec means a latency of 10 msec, right? (not that it's not ok; with faster hw and IB you'll be able to get much more)
> Doesn't 100 snap/sec mean the interval of snap is 10msec? IIUC, to get the latency
Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1
On 04/23/2010 10:36 AM, Fernando Luis Vázquez Cao wrote:
> On 04/23/2010 02:17 PM, Yoshiaki Tamura wrote:
>> Dor Laor wrote:
>> [...]
>>> Second, even if that wasn't the case, the tsc delta and kvmclock are synchronized as part of the VM state, so there is no use in trapping it in the middle.
>> I should study the clock in KVM, but won't the tsc get updated by the HW after migration? I was wondering about the following case, for example:
>> 1. The application on the guest calls rdtsc on host A.
>> 2. The application uses the rdtsc value for something.
>> 3. Failover to host B.
>> 4. The application on the guest replays the rdtsc call on host B.
>> 5. If the rdtsc value is different between A and B, the application may get into trouble because of it.
> Regarding the TSC, we need to guarantee that the guest sees a monotonic TSC after migration, which can be achieved by adjusting the TSC offset properly. Besides, we also need a trapping TSC, so that we can tackle the case where the primary node and the standby node have different TSC frequencies.

You're right, but this is already taken care of by the normal save/restore process. Check the kvm_load_tsc(CPUState *env) function.
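The monotonicity requirement discussed above - the guest must never observe its TSC jump backwards across a migration - comes down to programming a per-guest TSC offset on the destination, since the guest-visible TSC is the host TSC plus that offset. A sketch of the arithmetic with hypothetical values (the function name is invented; KVM's actual handling is in the save/restore path around kvm_load_tsc and the hardware TSC-offset field):

```python
# Guest-visible TSC = host TSC + tsc_offset. To keep the guest's TSC
# monotonic across migration, the destination picks an offset such that
# the guest resumes at the last value it saw on the source.

def destination_tsc_offset(guest_tsc_at_save, dest_host_tsc_at_load):
    """Offset the destination must program so the guest TSC continues
    from guest_tsc_at_save on the new host."""
    return guest_tsc_at_save - dest_host_tsc_at_load

# Hypothetical numbers:
guest_tsc = 5_000_000_000   # last guest-visible TSC on the source
dest_host = 9_000_000_000   # destination host TSC when the guest resumes
off = destination_tsc_offset(guest_tsc, dest_host)
assert dest_host + off == guest_tsc   # guest continues where it left off
print(off)  # -4000000000
```

Note this only fixes the starting value: a frequency mismatch between the hosts still makes the guest TSC tick at a different rate afterwards, which is why the thread also discusses trapping rdtsc or using pv kvmclock.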
Re: KVM call agenda for Apr 27
On 04/27/2010 11:14 AM, Avi Kivity wrote:
> On 04/27/2010 01:36 AM, Anthony Liguori wrote:
>> A few comments:
>> 1) The problem was not the block watermark itself but generating a notification on the watermark threshold. It's a heuristic and should be implemented based on polling block stats.
> Polling for an event that never happens is bad engineering. What frequency do you poll? You're forcing the user to make a lose-lose tradeoff.
>> Otherwise, we'll be adding tons of events to qemu that we'll struggle to maintain.
> That's not a valid reason to reject a user requirement. We may argue the requirement is bogus, or that the suggested implementation is wrong and point in a different direction, but saying that we may have to add more code in the future due to other requirements is ... well, I can't find a word for it.
>> 2) A block plugin doesn't solve the problem if it's just at the BlockDriverState level, because it can't interact with qcow2.
> Why not? We have a layered model: guest -> qcow2 -> plugin (sends event) -> raw-posix. Just need to insert the plugin at the appropriate layer.
>> 3) For general block plugins, it's probably better to tackle userspace block devices. We have CUSE and FUSE already; a BUSE is a logical conclusion. We also have an nbd client.
> Here's another option: an nbd-like protocol that remotes all BlockDriver operations except read and write over a unix domain socket. The open operation returns an fd (SCM_RIGHTS strikes again) that is used for read and write. This can be used to implement snapshots over LVM, for example.

Why w/o reads/writes? The watermark code needs them too (as info, not the actual buffer).

IMHO the whole thing is way over-engineered:
a) Having another channel into qemu complicates management software. Shouldn't the monitor be the channel? Otherwise we'll need to create another QMP (or nbd-like, as Avi suggests) for these actions.
It's extra work for mgmt, and they will have a hard time understanding the interleaving of events across the various channels.
b) How are the plugins defined? Scripts? Binaries? Do they open their own sockets?

So I suggest either sticking with qmp, or having a new block layer but letting qmp pass events from it - this is actually the nbd-like approach but with the qmp socket.

Thanks, Dor
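The suggestion above - surface block-layer plugin notifications as ordinary QMP async events rather than through a second channel - would mean emitting messages in the standard QMP event shape ("event", "data", "timestamp" members on the same socket as command responses). A hypothetical example; the event name BLOCK_WATERMARK and its data fields are invented for illustration and are not part of QMP:

```python
import json

# QMP async events are JSON objects with "event", "data" and "timestamp"
# members, delivered on the same socket as command responses.
def make_qmp_event(name, data, secs, usecs):
    return json.dumps({
        "event": name,
        "data": data,
        "timestamp": {"seconds": secs, "microseconds": usecs},
    }, sort_keys=True)

# Hypothetical watermark event a block plugin might emit:
evt = make_qmp_event("BLOCK_WATERMARK",
                     {"device": "virtio0", "high-water": 10737418240},
                     1272355200, 0)
print(evt)
```

A management app already parsing the QMP stream would pick such events up with no new connection logic, which is the point of the single-channel argument.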
Re: KVM call agenda for Apr 27
On 04/27/2010 11:56 AM, Avi Kivity wrote:
> On 04/27/2010 11:48 AM, Dor Laor wrote:
>>> Here's another option: an nbd-like protocol that remotes all BlockDriver operations except read and write over a unix domain socket. The open operation returns an fd (SCM_RIGHTS strikes again) that is used for read and write. This can be used to implement snapshots over LVM, for example.
>> Why w/o read/writes?
> To avoid the copying.

Of course, just pass the offset+len on read/write too.

>> The watermark code needs them too (as info, not the actual buffer).
> Yeah. It works for lvm snapshots, not for watermarks.
>> IMHO the whole thing is way over-engineered:
>> a) Having another channel into qemu complicates management software. Shouldn't the monitor be the channel? Otherwise we'll need to create another QMP (or nbd-like, as Avi suggests) for these actions. It's extra work for mgmt, and they will have a hard time understanding the interleaving of events across the various channels.
> Block layer plugins allow intercepting all interesting block layer events, not just write-past-a-watermark, and allow actions based on those events. It's a more general solution.

No problem there, as long as we do try to use the single existing QMP with the plugins. Otherwise we'll create QMP2 for the block events a year from now.

>> b) How are the plugins defined? Scripts? Binaries? Do they open their own sockets?
> Shared objects.
Re: KVM call agenda for Apr 27
On 04/27/2010 12:22 PM, Avi Kivity wrote:
> On 04/27/2010 12:08 PM, Dor Laor wrote:
>> On 04/27/2010 11:56 AM, Avi Kivity wrote:
>>> On 04/27/2010 11:48 AM, Dor Laor wrote:
>>>>> Here's another option: an nbd-like protocol that remotes all BlockDriver operations except read and write over a unix domain socket. The open operation returns an fd (SCM_RIGHTS strikes again) that is used for read and write. This can be used to implement snapshots over LVM, for example.
>>>> Why w/o read/writes?
>>> To avoid the copying.
>> Of course, just pass the offset+len on read/write too.
> There will be a large performance impact.
>>>> IMHO the whole thing is way over-engineered: a) Having another channel into qemu complicates management software. Shouldn't the monitor be the channel? Otherwise we'll need to create another QMP (or nbd-like, as Avi suggests) for these actions. It's extra work for mgmt, and they will have a hard time understanding the interleaving of events across the various channels.
>>> Block layer plugins allow intercepting all interesting block layer events, not just write-past-a-watermark, and allow actions based on those events. It's a more general solution.
>> No problem there, as long as we do try to use the single existing QMP with the plugins. Otherwise we'll create QMP2 for the block events a year from now.
> I don't see how we can interleave messages from the plugin into the qmp stream without causing confusion.

Those are QMP async events. Since Kevin suggested adding even more events (or was it cynicism?), maybe we can use optional opaque QMP block events that the plugin issues; they would travel over the standard QMP connection as async events to the interested mgmt app. Once stabilized, each event can go into the official QMP protocol.
Re: [PATCH RFC] virtio: put last seen used index into ring itself
On 05/05/2010 11:58 PM, Michael S. Tsirkin wrote:
> Generally, the Host end of the virtio ring doesn't need to see where the Guest is up to in consuming the ring. However, to completely understand what's going on from the outside, this information must be exposed. For example, the host can reduce the number of interrupts by detecting that the guest is currently handling previous buffers.
> Fortunately, we have room to expand: the ring is always a whole number of pages and there's hundreds of bytes of padding after the avail ring and the used ring, whatever the number of descriptors (which must be a power of 2).
> We add a feature bit so the guest can tell the host that it's writing out the current value there, if it wants to use that.
> This is based on a patch by Rusty Russell, with the main difference being that we dedicate a feature bit for the guest to tell the host it is writing the used index. This way we don't need to force the host to publish the last available index until we have a use for it.
> Signed-off-by: Rusty Russell <ru...@rustcorp.com.au>
> Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
> ---
> Rusty, this is a simplified form of a patch you posted in the past. I have a vhost patch that, using this feature, shows external-to-host bandwidth grow from 5 to 7 GB/s, by avoiding

You mean external-to-guest, I guess. We have a similar issue with virtio-blk - when using very fast multi-spindle storage on the host side there are too many irq injection events. This patch should probably reduce them a lot. The principle exactly matches the Xen ring.

> an interrupt in the window after the previous interrupt was sent and before interrupts were disabled for the vq. With vhost under some external-to-host loads I see this window being hit about 30% of the time. I'm finalizing the host bits and plan to send the final version for inclusion when all's ready, but I'd like to hear comments meanwhile.
 drivers/virtio/virtio_ring.c |   28 +---
 include/linux/virtio_ring.h  |   14 +-
 2 files changed, 30 insertions(+), 12 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 1ca8890..7729aba 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -89,9 +89,6 @@ struct vring_virtqueue
 	/* Number we've added since last sync. */
 	unsigned int num_added;

-	/* Last used index we've seen. */
-	u16 last_used_idx;
-
 	/* How to notify other side. FIXME: commonalize hcalls! */
 	void (*notify)(struct virtqueue *vq);

@@ -285,12 +282,13 @@ static void detach_buf(struct vring_virtqueue *vq, unsigned int head)
 static inline bool more_used(const struct vring_virtqueue *vq)
 {
-	return vq->last_used_idx != vq->vring.used->idx;
+	return *vq->vring.last_used_idx != vq->vring.used->idx;
 }

 void *virtqueue_get_buf(struct virtqueue *_vq, unsigned int *len)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
+	struct vring_used_elem *u;
 	void *ret;
 	unsigned int i;

@@ -307,12 +305,13 @@ void *virtqueue_get_buf(struct virtqueue *_vq, unsigned int *len)
 		return NULL;
 	}

-	/* Only get used array entries after they have been exposed by host. */
-	virtio_rmb();
-
-	i = vq->vring.used->ring[vq->last_used_idx % vq->vring.num].id;
-	*len = vq->vring.used->ring[vq->last_used_idx % vq->vring.num].len;
+	/* Only get used array entries after they have been exposed by host.
+	 * Need mb(), not just rmb() because we write last_used_idx below. */
+	virtio_mb();
+	u = &vq->vring.used->ring[*vq->vring.last_used_idx % vq->vring.num];
+	i = u->id;
+	*len = u->len;

 	if (unlikely(i >= vq->vring.num)) {
 		BAD_RING(vq, "id %u out of range\n", i);
 		return NULL;
 	}
@@ -325,7 +324,8 @@ void *virtqueue_get_buf(struct virtqueue *_vq, unsigned int *len)
 	/* detach_buf clears data, so grab it now.
*/ ret = vq-data[i]; detach_buf(vq, i); - vq-last_used_idx++; + (*vq-vring.last_used_idx)++; + END_USE(vq); return ret; } @@ -431,7 +431,7 @@ struct virtqueue *vring_new_virtqueue(unsigned int num, vq-vq.name = name; vq-notify = notify; vq-broken = false; - vq-last_used_idx = 0; + *vq-vring.last_used_idx = 0; vq-num_added = 0; list_add_tail(vq-vq.list,vdev-vqs); #ifdef DEBUG @@ -440,6 +440,10 @@ struct virtqueue *vring_new_virtqueue(unsigned int num, vq-indirect = virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC); + /* We publish used index whether Host offers it or not: if not, it's +* junk space anyway. But calling this acknowledges the feature. */ + virtio_has_feature(vdev, VIRTIO_RING_F_PUBLISH_USED); + /* No callback? Tell other side not to bother us. */ if (!callback) vq-vring.avail-flags |= VRING_AVAIL_F_NO_INTERRUPT; @@ -473,6 +477,8 @@ void
Re: Copy and paste feature across guest and host
On 05/27/2010 12:17 PM, Tomasz Chmielewski wrote: Just installed Fedora 13 as a guest on KVM. However, there is no cross-platform copy and paste feature. I trust I have set up this feature on another guest sometime before; unfortunately I can't find the relevant document. Could you please shed some light? A pointer would be appreciated. TIA Did you try: # modprobe virtio-copypaste ? Seriously, qemu does not make it easy (well, its GUI does not make most things easy) and you'll need a tool which synchronizes the clipboard between two machines (google for qemu copy paste?). There is no cut & paste at the moment. The plan is to enable it through virtio-serial and have spice/vnc use it. Cannot guarantee a date but it shouldn't be too long. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: qemu-kvm still unable to load option rom extboot.bin
On 09/09/2009 04:47 PM, Lucas Meneghel Rodrigues wrote: Hi folks, seems like we are still facing a build problem on qemu-kvm: The option rom is failing to boot: 09/04 11:12:08 DEBUG|kvm_vm:0384| Running qemu command: /usr/local/autotest/tests/kvm/qemu -name 'vm1' -monitor unix:/tmp/monitor-20090904-111208-9nyy,server,nowait -drive file=/usr/local/autotest/tests/kvm/images/fc9-32.qcow2,if=ide,boot=on -net nic,vlan=0 -net user,vlan=0 -m 512 -cdrom /usr/local/autotest/tests/kvm/isos/linux/Fedora-9-i386-DVD.iso -redir tcp:5000::22 -vnc :0 09/04 11:12:08 DEBUG| kvm_utils:0858| (qemu) Could not load option rom 'extboot.bin' So qemu is still not able to locate roms when it needs them. The test could work around this, as pointed out by Marcelo, by copying the roms to the right directory, but that's not desirable; it should be fixed in the build system appropriately. I will do my best to always watch the results of daily git testing closely and report problems. Avi just committed it: [COMMIT master] qemu-kvm: Install built option roms Thanks! Lucas
Re: kvm network latency, higher with virtio ?
On 09/16/2009 10:27 AM, Michael S. Tsirkin wrote: On Tue, Sep 15, 2009 at 05:15:09PM +0200, Luca Bigliardi wrote: Hi, I'm running some tests between two linux instances bridged together. If I try to ping 10 times I obtain the following results: -net nic,model=virtio -net tap : rtt min/avg/max/mdev = 0.756/0.967/2.115/0.389 ms -net nic,model=rtl8139 -net tap : rtt min/avg/max/mdev = 0.301/0.449/1.173/0.248 ms So it seems with virtio the latency is higher. Is it normal? Yes, the main reason is the TX timer it uses for interrupt/vm exit mitigation. Originally we used the tx mitigation timer in order to provide better throughput at the expense of latency. Measurements of older versions of virtio proved that we can cancel this timer and achieve better latency while not hurting throughput. Vhost wouldn't use it. For the time being, until we get vhost, we should probably remove it from qemu. The results I'm reporting were obtained with - host qemu-kvm 0.11-rc2 kvm-kmod-2.6.30.1 kernel: 2.6.30.5 (HIGH_RES_TIMERS=y as suggested in http://www.linux-kvm.org/page/Virtio ) - guest kernel: 2.6.31 but I also tested older versions, always obtaining latency values at least two times higher than rtl8139/e1000. Thank you, Luca
Re: [KVM-AUTOTEST PATCH 1/2] Add KSM test
On 09/15/2009 09:58 PM, Jiri Zupka wrote: After a quick review I have the following questions: 1. Why did you implement the guest tool in C and not in python? Python is much simpler and you can share some code with the server. This 'test protocol' would also be easier to understand this way. We need speed and precise control over memory allocation at page granularity. 2. IMHO there is no need to use select, you can do a blocking read. We replaced socket communication with interactive program communication via ssh/telnet. 3. Also you can use plain malloc without the more complex (a bit) mmap. We need to address the exact memory pages; we can't allow the data to shift in memory. You can use the tmpfs+dd idea instead of the specific program as I detailed before. Maybe some other binary can be used. My intention is to simplify the test/environment as much as possible.
Re: [KVM-AUTOTEST PATCH 1/2] Add KSM test
On 09/16/2009 04:09 PM, Jiri Zupka wrote: - Dor Laor <dl...@redhat.com> wrote: On 09/15/2009 09:58 PM, Jiri Zupka wrote: After a quick review I have the following questions: 1. Why did you implement the guest tool in C and not in python? Python is much simpler and you can share some code with the server. This 'test protocol' would also be easier to understand this way. We need speed and precise control over memory allocation at page granularity. 2. IMHO there is no need to use select, you can do a blocking read. We replaced socket communication with interactive program communication via ssh/telnet. 3. Also you can use plain malloc without the more complex (a bit) mmap. We need to address the exact memory pages; we can't allow the data to shift in memory. You can use the tmpfs+dd idea instead of the specific program as I detailed before. Maybe some other binary can be used. My intention is to simplify the test/environment as much as possible. We need compatibility with other systems, like Windows. We want to add support for other systems in the next version. KSM is a host feature and should be agnostic to the guest. Also I don't think your code will compile on Windows...
Re: Binary Windows guest drivers are released
On 09/24/2009 11:59 PM, Javier Guerra wrote: On Thu, Sep 24, 2009 at 3:38 PM, Kenni Lund <ke...@kelu.dk> wrote: I've done some benchmarking with the drivers on Windows XP SP3 32bit, but it seems like using the VirtIO drivers is slower than the IDE drivers in (almost) all cases. Perhaps I've missed something or does the driver still need optimization? very interesting! it seems that IDE wins on all the performance numbers, but VirtIO always has lower CPU utilization. i guess this is guest CPU %, right? it would also be interesting to compare the CPU usage from the host point of view, since a lower 'off-guest' CPU usage is very important for scaling to many guests doing I/O. Can you re-try it with the host io scheduler set to deadline? The virtio backend (thread pool) is sensitive to it. These drivers are mainly tweaked for win2k3 and win2k8. We once had queue depth settings in the driver, not sure we still have it; Vadim, can you add more info? Also virtio should provide IO parallelism as opposed to IDE; I don't think your test tests it. Virtio can provide more virtual drives than the max 4 that IDE offers. Dor
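For reference, the io scheduler can be switched per disk at runtime through sysfs (the device name below is an assumption; pick the host disk that backs the guest images, and note this needs root):

```shell
# The active scheduler is shown in brackets, e.g. "noop anticipatory deadline [cfq]"
cat /sys/block/sda/queue/scheduler
echo deadline > /sys/block/sda/queue/scheduler
cat /sys/block/sda/queue/scheduler    # now "... [deadline] ..."
```

Unlike the `elevator=` boot parameter, this takes effect immediately and only for the chosen disk.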
Re: [KVM-AUTOTEST PATCH 1/2] Add KSM test
On 09/29/2009 05:50 PM, Lucas Meneghel Rodrigues wrote: On Fri, 2009-09-25 at 05:22 -0400, Jiri Zupka wrote: - Dor Laor <dl...@redhat.com> wrote: On 09/16/2009 04:09 PM, Jiri Zupka wrote: - Dor Laor <dl...@redhat.com> wrote: On 09/15/2009 09:58 PM, Jiri Zupka wrote: After a quick review I have the following questions: 1. Why did you implement the guest tool in C and not in python? Python is much simpler and you can share some code with the server. This 'test protocol' would also be easier to understand this way. We need speed and precise control over memory allocation at page granularity. 2. IMHO there is no need to use select, you can do a blocking read. We replaced socket communication with interactive program communication via ssh/telnet. 3. Also you can use plain malloc without the more complex (a bit) mmap. We need to address the exact memory pages; we can't allow the data to shift in memory. You can use the tmpfs+dd idea instead of the specific program as I detailed before. Maybe some other binary can be used. My intention is to simplify the test/environment as much as possible. We need compatibility with other systems, like Windows. We want to add support for other systems in the next version. KSM is a host feature and should be agnostic to the guest. Also I don't think your code will compile on Windows... Yes, I think you are right. First of all, sorry, I am doing the best I can to review carefully all the patch queue, and as KSM is a more involved feature that I am not very familiar with, I need a bit more time to review it! But because we need to generate special data in the pages in memory, we need to use a script on the guest side of the test, because communication over ssh is too slow to transfer lots of GB of special data to the guests. We can use an optimized C program which is 10x or more faster than a python script on a native system. Heavy load on a virtual guest can cause performance problems.
About code compiling under windows, I guess making a native windows C or C++ program is an option. I generally agree with your reasoning; this case seems to be better covered with a C program. Will get into it in more detail ASAP... We can use tmpfs but with a python script to generate special data. We can't use dd with random data because we need to test some special cases (change only the last 96B of a page, etc.). What do you think about it? I think it can be done with some simple scripting and it will be fast enough and, more importantly, easier to understand and to change in the future. Here is a short example for creating lots of identical pages that contain '0' apart from the last two bytes. If you run it in a single guest you should expect to save lots of memory. Then you can change the last bytes to a random value and see the memory consumption grow: [Remember to cancel the guest swap to keep it in the guest ram]

dd if=/dev/zero of=template count=1 bs=4094
echo '1' >> template
cp template large_file
for ((i=0; i<10; i++)); do dd if=large_file of=large_file conv=notrunc oflag=append > /dev/null 2>&1; done

It creates a 4k*2^10 file with identical pages (since it's on tmpfs with no swap). Can you try it? It should be far simpler than the original option. Thanks, Dor
Re: [Autotest] [PATCH] Test 802.1Q vlan of nic
On 10/15/2009 11:48 AM, Amos Kong wrote: Test 802.1Q vlan of nic, config it by vconfig command. 1) Create two VMs 2) Setup guests in different vlan by vconfig and test communication by ping using hard-coded ip address 3) Setup guests in same vlan and test communication by ping 4) Recover the vlan config Signed-off-by: Amos Kong <ak...@redhat.com>
---
 client/tests/kvm/kvm_tests.cfg.sample |    6 +++
 client/tests/kvm/tests/vlan_tag.py    |   73 +++++++++++++
 2 files changed, 79 insertions(+), 0 deletions(-)
 mode change 100644 => 100755 client/tests/kvm/scripts/qemu-ifup

In general the above should come as an independent patch.

 create mode 100644 client/tests/kvm/tests/vlan_tag.py

diff --git a/client/tests/kvm/kvm_tests.cfg.sample b/client/tests/kvm/kvm_tests.cfg.sample
index 9ccc9b5..4e47767 100644
--- a/client/tests/kvm/kvm_tests.cfg.sample
+++ b/client/tests/kvm/kvm_tests.cfg.sample
@@ -166,6 +166,12 @@ variants:
         used_cpus = 5
         used_mem = 2560
 
+    - vlan_tag: install setup
+        type = vlan_tag
+        subnet2 = 192.168.123
+        vlans = 10 20

If we want to be fanatic and safe we should dynamically choose subnet and vlan numbers that are not used on the host instead of hard-coding them.

+        nic_mode = tap
+        nic_model = e1000

Why only e1000? Let's test virtio and rtl8139 as well. Can't you inherit the nic model from the config?

     - autoit: install setup
         type = autoit

diff --git a/client/tests/kvm/scripts/qemu-ifup b/client/tests/kvm/scripts/qemu-ifup
old mode 100644
new mode 100755

diff --git a/client/tests/kvm/tests/vlan_tag.py b/client/tests/kvm/tests/vlan_tag.py
new file mode 100644
index 000..15e763f
--- /dev/null
+++ b/client/tests/kvm/tests/vlan_tag.py
@@ -0,0 +1,73 @@
+import logging, time
+from autotest_lib.client.common_lib import error
+import kvm_subprocess, kvm_test_utils, kvm_utils
+
+def run_vlan_tag(test, params, env):
+    """
+    Test 802.1Q vlan of nic, config it by vconfig command.
+
+    1) Create two VMs
+    2) Setup guests in different vlan by vconfig and test communication by
+       ping using hard-coded ip address
+    3) Setup guests in same vlan and test communication by ping
+    4) Recover the vlan config
+
+    @param test: Kvm test object
+    @param params: Dictionary with the test parameters.
+    @param env: Dictionary with test environment.
+    """
+    vm = []
+    session = []
+    subnet2 = params.get("subnet2")
+    vlans = params.get("vlans").split()
+
+    vm.append(kvm_test_utils.get_living_vm(env, "%s" % params.get("main_vm")))
+
+    params_vm2 = params.copy()
+    params_vm2['image_snapshot'] = "yes"
+    params_vm2['kill_vm_gracefully'] = "no"
+    params_vm2["address_index"] = int(params.get("address_index", 0)) + 1
+    vm.append(vm[0].clone("vm2", params_vm2))
+    kvm_utils.env_register_vm(env, "vm2", vm[1])
+    if not vm[1].create():
+        raise error.TestError("VM 1 create failed")

The whole 7-8 lines above should be grouped as a function to clone an existing VM. It should be part of the kvm autotest infrastructure. Besides that, it looks good.

+
+    for i in range(2):
+        session.append(kvm_test_utils.wait_for_login(vm[i]))
+
+    try:
+        vconfig_cmd = "vconfig add eth0 %s;ifconfig eth0.%s %s.%s"
+        # Attempt to configure IPs for the VMs and record the results in
+        # boolean variables
+        # Make vm1 and vm2 in the different vlan
+
+        ip_config_vm1_ok = (session[0].get_command_status(vconfig_cmd
+                            % (vlans[0], vlans[0], subnet2, 11)) == 0)
+        ip_config_vm2_ok = (session[1].get_command_status(vconfig_cmd
+                            % (vlans[1], vlans[1], subnet2, 12)) == 0)
+        if not ip_config_vm1_ok or not ip_config_vm2_ok:
+            raise error.TestError("Fail to config VMs ip address")
+        ping_diff_vlan_ok = (session[0].get_command_status(
+                             "ping -c 2 %s.12" % subnet2) == 0)
+
+        if ping_diff_vlan_ok:
+            raise error.TestFail("VM 2 is unexpectedly pingable in different"
+                                 " vlan")
+        # Make vm2 in the same vlan with vm1
+        vlan_config_vm2_ok = (session[1].get_command_status(
+                              "vconfig rem eth0.%s;vconfig add eth0 %s;"
+                              "ifconfig eth0.%s %s.12" %
+                              (vlans[1], vlans[0], vlans[0], subnet2)) == 0)
+        if not vlan_config_vm2_ok:
+            raise error.TestError("Fail to config ip address of VM 2")
+
+        ping_same_vlan_ok = (session[0].get_command_status(
+                             "ping -c 2 %s.12" % subnet2) == 0)
+        if not ping_same_vlan_ok:
+            raise error.TestFail("Fail to ping the guest in same vlan")
+    finally:
+        # Clean the vlan config
+        for i in range(2):
+            session[i].sendline("vconfig rem eth0.%s" % vlans[0])
Re: Do I set up separate bridges for each guest?
On 10/20/2009 04:37 AM, Neil Aggarwal wrote: Hello: I am installing KVM on top of CentOS 5.4 so I can have two guests running on my host. I would like to have the host and guests accessible from my network. Do I set up separate bridges for each guest or would they somehow be shared? If I set up separate bridges, I think I need to do in /etc/sysconfig/network-scripts on the host machine: 1. Set up ifcfg-eth0 with the ip information of the host (For example 192.168.2.200) 2. Set up ifcfg-eth0:1 for the first guest. It will have BRIDGE=br1 3. Create ifcfg-br1 with the IP info for the first guest (For example 192.168.2.201) 4. Set up ifcfg-eth0:2 for the second guest. It will have BRIDGE=br2 5. Create ifcfg-br2 with the IP info for the second guest (For example 192.168.2.202) Is this correct or did I miss something? The simplest thing is to use a single bridge for all - the physical nic should be part of it and supply the outside world connection. The physical nic doesn't need an IP; the bridge should own it. All vms can use this bridge.

# cat /etc/sysconfig/network-scripts/ifcfg-br0
DEVICE=br0
TYPE=Bridge
ONBOOT=yes
GATEWAYDEV=''
BOOTPROTO=dhcp
DELAY=0
HWADDR=00:14:5E:17:D0:04

# cat /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=none
HWADDR=00:14:5E:17:D0:04
BRIDGE=br0

Thanks, Neil -- Neil Aggarwal, (281)846-8957, www.JAMMConsulting.com Will your e-commerce site go offline if you have a DB server failure, fiber cut, flood, fire, or other disaster? If so, ask about our geographically redundant database system.
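For clarity, the runtime equivalent of those ifcfg files looks roughly like this (interface names are assumptions, and this needs root):

```shell
brctl addbr br0              # create the bridge
brctl addif br0 eth0         # enslave the physical nic; it keeps no IP
ifconfig eth0 0.0.0.0 up
dhclient br0                 # the bridge owns the address
# qemu's ifup script then attaches each guest's tap device:
brctl addif br0 tap0
```

Each additional guest just gets another tap interface added to the same br0; no per-guest bridge is needed.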
Re: [Autotest] [PATCH] Test 802.1Q vlan of nic
On 10/21/2009 03:46 PM, Uri Lublin wrote: On 10/21/2009 12:37 PM, Amos Kong wrote: On Tue, Oct 20, 2009 at 09:19:50AM -0400, Michael Goldish wrote: - Dor Laor <dl...@redhat.com> wrote: On 10/15/2009 11:48 AM, Amos Kong wrote: For the sake of safety maybe we should start both VMs with -snapshot. Dor, what do you think? Is it safe to start 2 VMs with the same disk image when only one of them uses -snapshot? Setting up the second VM with -snapshot is enough; the image can only be R/W by the 1st VM. Actually, I agree with Michael. If both VMs use the same disk image, it is safer to set up both VMs with -snapshot. When the first VM writes to the disk image the second VM may be affected. That's a must. If only one VM uses -snapshot, its base will get written to and the snapshot will become stale.
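Concretely, the safe setup discussed above looks like this (image path and display numbers are made up for the example):

```shell
# Both guests share one base image; -snapshot redirects each VM's writes
# to a temporary overlay, so neither VM ever modifies the shared base.
qemu-kvm -hda /images/base.qcow2 -snapshot -m 512 -vnc :1 &
qemu-kvm -hda /images/base.qcow2 -snapshot -m 512 -vnc :2 &
```

If either VM opened the base read-write without -snapshot, its writes would invalidate the other VM's view of the disk.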
Re: KSM and HugePages
On 10/23/2009 08:21 PM, David Martin wrote: Does KSM support HugePages? Reading the Fedora 12 feature list I notice this: Using huge pages for guest memory does have a downside, however - you can no longer swap nor balloon guest memory. However it is unclear to me if that includes KSM. KSM merges only standard 4k pages. If I use 1GB HugePages and KSM (assuming this is possible), does that mean the entire 1GB page has to match another for them to merge? Are there any other downsides to using them other than swapping and ballooning? Huge page memory needs to be available at VM creation time. Also the TLB for huge pages is smaller, although huge pages still bring better results than 4k pages.
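Since huge pages must be reserved before the VM is created, the setup looks roughly like this (counts, paths, and options are illustrative, not from this thread; needs root):

```shell
# Reserve 2MB huge pages on the host and back the guest RAM with them.
echo 2048 > /proc/sys/vm/nr_hugepages        # 2048 x 2MB = 4GB reserved
mkdir -p /dev/hugepages
mount -t hugetlbfs hugetlbfs /dev/hugepages
qemu-kvm -m 4096 -mem-path /dev/hugepages -hda guest.img
```

Memory backed this way is not swappable, not balloonable, and, since KSM works on 4k pages, not a candidate for KSM merging.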
Re: [Autotest] [KVM-AUTOTEST PATCH 3/7] KVM test: new test timedrift_with_migration
On 10/12/2009 05:28 PM, Lucas Meneghel Rodrigues wrote: Hi Michael, I am reviewing your patchset and have just a minor remark to make here: On Wed, Oct 7, 2009 at 2:54 PM, Michael Goldish <mgold...@redhat.com> wrote: This patch adds a new test that checks the timedrift introduced by migrations. It uses the same parameters used by the timedrift test to get the guest time. In addition, the number of migrations the test performs is controlled by the parameter 'migration_iterations'. Signed-off-by: Michael Goldish <mgold...@redhat.com> --- client/tests/kvm/kvm_tests.cfg.sample | 33 --- client/tests/kvm/tests/timedrift_with_migration.py | 95 2 files changed, 115 insertions(+), 13 deletions(-) create mode 100644 client/tests/kvm/tests/timedrift_with_migration.py diff --git a/client/tests/kvm/kvm_tests.cfg.sample b/client/tests/kvm/kvm_tests.cfg.sample index 540d0a2..618c21e 100644 --- a/client/tests/kvm/kvm_tests.cfg.sample +++ b/client/tests/kvm/kvm_tests.cfg.sample @@ -100,19 +100,26 @@ variants: type = linux_s3 - timedrift:install setup -type = timedrift extra_params += -rtc-td-hack -# Pin the VM and host load to CPU #0 -cpu_mask = 0x1 -# Set the load and rest durations -load_duration = 20 -rest_duration = 20 -# Fail if the drift after load is higher than 50% -drift_threshold = 50 -# Fail if the drift after the rest period is higher than 10% -drift_threshold_after_rest = 10 -# For now, make sure this test is executed alone -used_cpus = 100 +variants: +- with_load: +type = timedrift +# Pin the VM and host load to CPU #0 +cpu_mask = 0x1 Let's use -smp 2 always. btw: we should not run the load test in parallel with the standard tests. +# Set the load and rest durations +load_duration = 20 +rest_duration = 20 Even the default duration seems way too brief here; is there any reason why 20s was chosen instead of, let's say, 1800s? I am under the impression that 20s of load won't be enough to cause any noticeable drift...
+# Fail if the drift after load is higher than 50% +drift_threshold = 50 +# Fail if the drift after the rest period is higher than 10% +drift_threshold_after_rest = 10 I am also curious about those thresholds and the reasoning behind them. Is there any official agreement on what we consider to be an unreasonable drift? Another thing that struck me is the drift calculation: in the original timedrift test, the guest drift is normalized against the host drift: drift = 100.0 * (host_delta - guest_delta) / host_delta While in the new drift tests, we consider only the guest drift. I believe it is better to normalize all tests on one drift calculation criterion, and those values should be reviewed; at least a certain level of agreement in our development community should be reached. I think we don't need to calculate a drift ratio. We should define a threshold in seconds, let's say 2 seconds. Beyond that, there should not be any drift. Do we support migration to a different host? We should, especially in this test. The destination host reading should also be used. Apart from that, good patchset, and good thing you refactored some of the code to shared utils. Other than this concern that came to my mind, the new tests look good and work fine here. I had to do a slight rebase in one of the patches, very minor stuff. The default values and the drift calculation can be changed at a later time. Thanks!
Re: [RFC] KVM Fault Tolerance: Kemari for KVM
On 11/09/2009 05:53 AM, Fernando Luis Vázquez Cao wrote: Hi all, It has been a while coming, but we have finally started work on Kemari's port to KVM. For those not familiar with it, Kemari provides the basic building block to create a virtualization-based fault tolerant machine: a virtual machine synchronization mechanism. Traditional high availability solutions can be classified in two groups: fault tolerant servers, and software clustering. Broadly speaking, fault tolerant servers protect us against hardware failures and, generally, rely on redundant hardware (often proprietary), and hardware failure detection to trigger fail-over. On the other hand, software clustering, as its name indicates, takes care of software failures and usually requires a standby server whose software configuration for the part we are trying to make fault tolerant must be identical to that of the active server. Both solutions may be applied to virtualized environments. Indeed, the current incarnation of Kemari (Xen-based) brings fault tolerant server-like capabilities to virtual machines and integration with existing HA stacks (Heartbeat, RHCS, etc) is under consideration. After some time in the drawing board we completed the basic design of Kemari for KVM, so we are sending an RFC at this point to get early feedback and, hopefully, get things right from the start. Those already familiar with Kemari and/or fault tolerance may want to skip the Background and go directly to the design and implementation bits. This is a pretty long write-up, but please bear with me. == Background == We started to play around with continuous virtual synchronization technology about 3 years ago. As development progressed and, most importantly, we got the first Xen-based working prototypes it became clear that we needed a proper name for our toy: Kemari. 
The goal of Kemari is to provide a fault tolerant platform for virtualization environments, so that in the event of a hardware failure the virtual machine fails over from compromised to properly operating hardware (a physical machine) in a way that is completely transparent to the guest operating system. Although hardware based fault tolerant servers and HA servers (software clustering) have been around for a (long) while, they typically require specifically designed hardware and/or modifications to applications. In contrast, by abstracting hardware using virtualization, Kemari can be used on off-the-shelf hardware and no application modifications are needed. After a period of in-house development the first version of Kemari for Xen was released in Nov 2008 as open source. However, by then it was already pretty clear that a KVM port would have several advantages. First, KVM is integrated into the Linux kernel, which means one gets support for a wide variety of hardware for free. Second, and in the same vein, KVM can also benefit from Linux' low latency networking capabilities including RDMA, which is of paramount importance for an extremely latency-sensitive functionality like Kemari. Last, but not least, KVM and its community is growing rapidly, and there is increasing demand for Kemari-like functionality for KVM. Although the basic design principles will remain the same, our plan is to write Kemari for KVM from scratch, since there does not seem to be much opportunity for sharing between Xen and KVM. == Design outline == The basic premise of fault tolerant servers is that when things go awry with the hardware the running system should transparently continue execution on an alternate physical host. For this to be possible the state of the fallback host has to be identical to that of the primary.
Kemari runs paired virtual machines in an active-passive configuration and achieves whole-system replication by continuously copying the state of the system (dirty pages and the state of the virtual devices) from the active node to the passive node. An interesting implication of this is that during normal operation only the active node is actually executing code. Another possible approach is to run a pair of systems in lock-step (à la VMware FT). Since both the primary and fallback virtual machines are active keeping them synchronized is a complex task, which usually involves carefully injecting external events into both virtual machines so that they result in identical states. The latter approach is extremely architecture specific and not SMP friendly. This spurred us to try the design that became Kemari, which we believe lends itself to further optimizations. == Implementation == The first step is to encapsulate the machine to be protected within a virtual machine. Then the live migration functionality is leveraged to keep the virtual machines synchronized. Whereas during live migration dirty pages can be sent asynchronously from the primary to the fallback server until the ratio of dirty pages is low enough to guarantee very short downtimes, when it comes to fault tolerance
Re: virtio disk slower than IDE?
On 11/14/2009 04:23 PM, Gordan Bobic wrote: I just tried paravirtualized virtio block devices, and my tests show that they are approximately 30% slower than emulated IDE devices. I'm guessing this isn't normal. Is this a known issue or am I likely to have misconfigured something? I'm using 64-bit RHEL/CentOS 5 (both host and guest). Please try to change the io scheduler on the host to the deadline io scheduler; it should boost your performance back. Thanks. Gordan
Re: virtio disk slower than IDE?
On 11/15/2009 02:00 PM, Gordan Bobic wrote: Dor Laor wrote: On 11/14/2009 04:23 PM, Gordan Bobic wrote: I just tried paravirtualized virtio block devices, and my tests show that they are approximately 30% slower than emulated IDE devices. I'm guessing this isn't normal. Is this a known issue or am I likely to have misconfigured something? I'm using 64-bit RHEL/CentOS 5 (both host and guest). Please try to change the io scheduler on the host to the deadline io scheduler; it should boost your performance back. I presume you mean the deadline io scheduler. I tried that (kernel parameter elevator=deadline) and it made no measurable difference compared to the cfq scheduler. What version of kvm do you use? Is it rhel5.4? Can you post the qemu cmdline and the perf test in the guest? Lastly, do you use cache=wb on qemu? It's just a fun mode; we use cache=off only. Gordan
Re: [RFC] KVM Fault Tolerance: Kemari for KVM
On 11/13/2009 01:48 PM, Yoshiaki Tamura wrote: Hi, Thanks for your comments! Dor Laor wrote: On 11/09/2009 05:53 AM, Fernando Luis Vázquez Cao wrote: Hi all, It has been a while coming, but we have finally started work on Kemari's port to KVM. For those not familiar with it, Kemari provides the basic building block to create a virtualization-based fault tolerant machine: a virtual machine synchronization mechanism. Traditional high availability solutions can be classified in two groups: fault tolerant servers, and software clustering. Broadly speaking, fault tolerant servers protect us against hardware failures and, generally, rely on redundant hardware (often proprietary), and hardware failure detection to trigger fail-over. On the other hand, software clustering, as its name indicates, takes care of software failures and usually requires a standby server whose software configuration for the part we are trying to make fault tolerant must be identical to that of the active server. Both solutions may be applied to virtualized environments. Indeed, the current incarnation of Kemari (Xen-based) brings fault tolerant server-like capabilities to virtual machines and integration with existing HA stacks (Heartbeat, RHCS, etc) is under consideration. After some time on the drawing board we completed the basic design of Kemari for KVM, so we are sending an RFC at this point to get early feedback and, hopefully, get things right from the start. Those already familiar with Kemari and/or fault tolerance may want to skip the Background and go directly to the design and implementation bits. This is a pretty long write-up, but please bear with me. == Background == We started to play around with continuous virtual synchronization technology about 3 years ago. As development progressed and, most importantly, we got the first Xen-based working prototypes it became clear that we needed a proper name for our toy: Kemari. 
The goal of Kemari is to provide a fault tolerant platform for virtualization environments, so that in the event of a hardware failure the virtual machine fails over from compromised to properly operating hardware (a physical machine) in a way that is completely transparent to the guest operating system. Although hardware based fault tolerant servers and HA servers (software clustering) have been around for a (long) while, they typically require specifically designed hardware and/or modifications to applications. In contrast, by abstracting hardware using virtualization, Kemari can be used on off-the-shelf hardware and no application modifications are needed. After a period of in-house development the first version of Kemari for Xen was released in Nov 2008 as open source. However, by then it was already pretty clear that a KVM port would have several advantages. First, KVM is integrated into the Linux kernel, which means one gets support for a wide variety of hardware for free. Second, and in the same vein, KVM can also benefit from Linux' low latency networking capabilities including RDMA, which is of paramount importance for an extremely latency-sensitive functionality like Kemari. Last, but not least, KVM and its community are growing rapidly, and there is increasing demand for Kemari-like functionality for KVM. Although the basic design principles will remain the same, our plan is to write Kemari for KVM from scratch, since there does not seem to be much opportunity for sharing between Xen and KVM. == Design outline == The basic premise of fault tolerant servers is that when things go awry with the hardware the running system should transparently continue execution on an alternate physical host. For this to be possible the state of the fallback host has to be identical to that of the primary. 
Kemari runs paired virtual machines in an active-passive configuration and achieves whole-system replication by continuously copying the state of the system (dirty pages and the state of the virtual devices) from the active node to the passive node. An interesting implication of this is that during normal operation only the active node is actually executing code. Another possible approach is to run a pair of systems in lock-step (à la VMware FT). Since both the primary and fallback virtual machines are active keeping them synchronized is a complex task, which usually involves carefully injecting external events into both virtual machines so that they result in identical states. The latter approach is extremely architecture specific and not SMP friendly. This spurred us to try the design that became Kemari, which we believe lends itself to further optimizations. == Implementation == The first step is to encapsulate the machine to be protected within a virtual machine. Then the live migration functionality is leveraged to keep the virtual machines synchronized. Whereas during live migration dirty pages can be sent asynchronously from the primary to the fallback server until the ratio
Re: [Autotest] [KVM-AUTOTEST PATCH 3/7] KVM test: new test timedrift_with_migration
On 10/28/2009 08:54 AM, Michael Goldish wrote: - Dor Laordl...@redhat.com wrote: On 10/12/2009 05:28 PM, Lucas Meneghel Rodrigues wrote: Hi Michael, I am reviewing your patchset and have just a minor remark to make here: On Wed, Oct 7, 2009 at 2:54 PM, Michael Goldishmgold...@redhat.com wrote: This patch adds a new test that checks the timedrift introduced by migrations. It uses the same parameters used by the timedrift test to get the guest time. In addition, the number of migrations the test performs is controlled by the parameter 'migration_iterations'. Signed-off-by: Michael Goldishmgold...@redhat.com --- client/tests/kvm/kvm_tests.cfg.sample | 33 --- client/tests/kvm/tests/timedrift_with_migration.py | 95 2 files changed, 115 insertions(+), 13 deletions(-) create mode 100644 client/tests/kvm/tests/timedrift_with_migration.py diff --git a/client/tests/kvm/kvm_tests.cfg.sample b/client/tests/kvm/kvm_tests.cfg.sample index 540d0a2..618c21e 100644 --- a/client/tests/kvm/kvm_tests.cfg.sample +++ b/client/tests/kvm/kvm_tests.cfg.sample @@ -100,19 +100,26 @@ variants: type = linux_s3 - timedrift:install setup -type = timedrift extra_params += -rtc-td-hack -# Pin the VM and host load to CPU #0 -cpu_mask = 0x1 -# Set the load and rest durations -load_duration = 20 -rest_duration = 20 -# Fail if the drift after load is higher than 50% -drift_threshold = 50 -# Fail if the drift after the rest period is higher than 10% -drift_threshold_after_rest = 10 -# For now, make sure this test is executed alone -used_cpus = 100 +variants: +- with_load: +type = timedrift +# Pin the VM and host load to CPU #0 +cpu_mask = 0x1 Let's use -smp 2 always. We can also just make -smp 2 the default for all tests. Does that sound good? Yes. btw: we need not run the load test in parallel with standard tests. We already don't, because the load test has used_cpus = 100 which forces it to run alone. 
Soon I'll have 100 on my laptop :), better change it to -1 or MAX_INT +# Set the load and rest durations +load_duration = 20 +rest_duration = 20 Even the default duration here seems way too brief, is there any reason why 20s was chosen instead of, let's say, 1800s? I am under the impression that 20s of load won't be enough to cause any noticeable drift... +# Fail if the drift after load is higher than 50% +drift_threshold = 50 +# Fail if the drift after the rest period is higher than 10% +drift_threshold_after_rest = 10 I am also curious about those thresholds and the reasoning behind them. Is there any official agreement on what we consider to be an unreasonable drift? Another thing that struck me is drift calculation: On the original timedrift test, the guest drift is normalized against the host drift: drift = 100.0 * (host_delta - guest_delta) / host_delta While in the new drift tests, we consider only the guest drift. I believe it's better to normalize all tests based on one drift calculation criterion, and those values should be reviewed, and at least a certain level of agreement in our development community should be reached. I think we don't need to calculate drift ratio. We should define a threshold in seconds, let's say 2 seconds. Beyond that, there should not be any drift. Are you talking about the timedrift with load or timedrift with migration or reboot tests? I was told that when running the load test for e.g. 60 secs, the drift should be given in % of that duration. In the case of migration and reboot, absolute durations are used (in seconds, no %). Should we do that in the load test too? Yes, but: during extreme load, we do predict that a guest *without* pv clock will drift and won't be able to catch up until the load stops and only then it will catch up. So my recommendation is to do the following: - pvclock guest - can check with 'cat /sys/devices/system/clocksource/clocksource0/current_clocksource ' don't allow drift during huge loads. 
Exists (+safe) for rhel5.4 guests and ~2.6.29 (from 2.6.27). - non-pv clock - run the load, stop the load, wait 5 seconds, measure time For both, use absolute times. Do we support migration to a different host? We should, especially in this test too. The destination host reading should also be used. Apart from that, good patchset, and good thing you refactored some of the code to shared utils. We don't, and it would be very messy to implement with the framework right now. We should probably do that as some sort of server side test, but we don't have server side tests right now, so doing it may take a little time and effort. I got the
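The pvclock detection and the two drift policies recommended in this thread can be sketched as a small helper. The 2-second threshold mirrors the number suggested earlier; the function names are hypothetical, and in a real guest the clocksource string would come from `/sys/devices/system/clocksource/clocksource0/current_clocksource`:

```python
def uses_pvclock(clocksource):
    # Contents of current_clocksource in a pvclock guest is "kvm-clock".
    return clocksource.strip() == "kvm-clock"

def allowed_drift(clocksource, during_load):
    # Policy sketched in the thread (threshold values are illustrative):
    # pvclock guests must not drift even under load; non-pv guests are
    # only measured after the load stops and a short catch-up pause.
    if uses_pvclock(clocksource):
        return 2.0                          # seconds, absolute, even during load
    return None if during_load else 2.0     # don't judge non-pv guests mid-load
```

Usage: read the sysfs file inside the guest, then compare the measured absolute drift against `allowed_drift(...)`, skipping the check entirely when it returns `None`.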
Re: virtio disk slower than IDE?
On 11/16/2009 08:11 PM, Charles Duffy wrote: Gordan Bobic wrote: Lastly, do you use cache=wb on qemu? it's just a fun mode, we use cache=off only. I don't see the option being set in the logs, so I'd guess it's whatever qemu-kvm defaults to. You can set this through libvirt by putting an element such as the following within your disk element: <driver name='qemu' type='qcow2' cache='none'/> It's not needed on rhel5.4 qemu - we have cache=none as a default (Setting the type is preferred to avoid security issues wherein a guest writes an arbitrary qcow2 header to the beginning of a raw disk, reboots and allows qemu's autodetection to decide that this formerly-raw disk should now be treated as a delta against a file they otherwise might not have access to read; as such, it's particularly important if you intend that the type be raw). -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
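When generating the libvirt disk element programmatically, an explicit type plus cache='none' covers both points above: no image-format autodetection and no default-cache guessing. A sketch (the path and device names are illustrative):

```python
import xml.etree.ElementTree as ET

def disk_xml(source_file, image_format):
    """Build a libvirt <disk> element with an explicit format and
    cache='none', so qemu never probes the image type."""
    disk = ET.Element("disk", type="file", device="disk")
    ET.SubElement(disk, "driver", name="qemu", type=image_format, cache="none")
    ET.SubElement(disk, "source", file=source_file)
    ET.SubElement(disk, "target", dev="vda", bus="virtio")
    return ET.tostring(disk).decode()

print(disk_xml("/var/lib/libvirt/images/guest.img", "raw"))
```

The generated fragment goes inside the domain's `<devices>` section; for a qcow2 image, pass `"qcow2"` as the format instead of `"raw"`.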
Re: [Autotest] [KVM-AUTOTEST] KSM-overcommit test v.2 (python version)
On 11/17/2009 04:49 PM, Jiri Zupka wrote: Hi, We found a little mistake in the ending of allocator.py. Because I sent this patch today, I resend the whole repaired patch again. It sure is a big improvement from the previous one. There is still much refactoring to be done to make it more readable. Comments embedded. - Original Message - From: Jiri Zupkajzu...@redhat.com To: autotestautot...@test.kernel.org, kvmkvm@vger.kernel.org Cc:u...@redhat.com Sent: Tuesday, November 17, 2009 12:52:28 AM GMT +01:00 Amsterdam / Berlin / Bern / Rome / Stockholm / Vienna Subject: [Autotest] [KVM-AUTOTEST] KSM-overcommit test v.2 (python version) Hi, based on your requirements we have created a new version of the KSM-overcommit patch (submitted in September). Description: It tests KSM (kernel shared memory) with overcommit of memory. Changelog: 1) Based only on python (remove C code) 2) Add new test (check last 96B) 3) Separate test to (serial,parallel,both) 4) Improve log and documentation 5) Add perf constant to change time limit for waiting. (slow computer problem) Functionality: The KSM test starts guests, connects to them over ssh, and copies and runs allocator.py on the guests. The host can run any python command over the Allocator.py loop on the client side. Start run_ksm_overcommit. Defines host and guest reserve variables (host_reserver,guest_reserver). Calculates the number of virtual machines and their memory based on the variables host_mem and overcommit. Checks KSM status. Creates and starts virtual guests. 
Test : a] serial 1) initialize, merge all mem to single page 2) separate first guest mem 3) separate rest of guests up to fill all mem 4) kill all guests except for the last 5) check if mem of last guest is ok 6) kill guest b] parallel 1) initialize, merge all mem to single page 2) separate mem of guest 3) verification of guest mem 4) merge mem to one block 5) verification of guests mem 6) separate mem of guests by 96B 7) check if mem is all right 8) kill guest allocator.py (client side script) After start, it waits for commands which it executes on the client side. The mem_fill class implements commands to fill and check mem and return errors to the host. We need a client side script because we need to generate lots of GB of special data. Future plans: We want to add information about time spent in tasks to the log. We want to use information from the log to automatically compute the perf constant. And add new tests. ___ Autotest mailing list autot...@test.kernel.org http://test.kernel.org/cgi-bin/mailman/listinfo/autotest ksm_overcommit.patch diff --git a/client/tests/kvm/kvm_tests.cfg.sample b/client/tests/kvm/kvm_tests.cfg.sample index ac9ef66..90f62bb 100644 --- a/client/tests/kvm/kvm_tests.cfg.sample +++ b/client/tests/kvm/kvm_tests.cfg.sample @@ -118,6 +118,23 @@ variants: test_name = npb test_control_file = npb.control +- ksm_overcommit: +# Don't preprocess any vms as we need to change it's params +vms = '' +image_snapshot = yes +kill_vm_gracefully = no +type = ksm_overcommit +ksm_swap = yes # yes | no +no hugepages +# Overcommit of host memmory +ksm_overcommit_ratio = 3 +# Max paralel runs machine +ksm_paralel_ratio = 4 +variants: +- serial +ksm_test_size = serial +- paralel +ksm_test_size = paralel - linux_s3: install setup unattended_install type = linux_s3 diff --git a/client/tests/kvm/tests/ksm_overcommit.py b/client/tests/kvm/tests/ksm_overcommit.py new file mode 100644 index 000..408e711 --- /dev/null +++ b/client/tests/kvm/tests/ksm_overcommit.py @@ -0,0 +1,605 @@ +import logging, time +from 
autotest_lib.client.common_lib import error +import kvm_subprocess, kvm_test_utils, kvm_utils +import kvm_preprocessing +import random, string, math, os + +def run_ksm_overcommit(test, params, env): + +Test how KSM (Kernel Shared Memory) acts when more than physical memory is +used. In the second part it is also tested how KVM can handle the situation, +when the host runs out of memory (expected is to pause the guest system, +wait until some process returns the memory and bring the guest back to life) + +@param test: kvm test object. +@param params: Dictionary with test parameters. +@param env: Dictionary with the test environment. + + +def parse_meminfo(rowName): + +Function gets data from file /proc/meminfo + +@param rowName: Name of line in meminfo + +for line in open('/proc/meminfo').readlines(): +if line.startswith(rowName + ":"): +name, amt, unit = line.split() +return name, amt, unit + +def parse_meminfo_value(rowName): +
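The truncated parse_meminfo above boils down to splitting the matching /proc/meminfo row. A self-contained version that works on the file's text (the sample values below are made up):

```python
def parse_meminfo(text, row_name):
    """Pure version of the patch's parse_meminfo: given the contents of
    /proc/meminfo, return (name, amount, unit) for the requested row."""
    for line in text.splitlines():
        if line.startswith(row_name + ":"):
            name, amount, unit = line.split()
            return name, int(amount), unit
    return None

sample = "MemTotal:  2048000 kB\nMemFree:    512000 kB\n"
print(parse_meminfo(sample, "MemFree"))  # ('MemFree:', 512000, 'kB')
```

In the test itself the text would come from `open('/proc/meminfo').read()`; keeping the parser pure makes it trivially unit-testable.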
Re: [Autotest] [KVM-AUTOTEST] KSM-overcommit test v.2 (python version)
On 11/26/2009 12:11 PM, Lukáš Doktor wrote: Hello Dor, Thank you for your review. I have a few questions about your comments: --- snip --- + stat += "Guests memsh = {" + for vm in lvms: + if vm.is_dead(): + logging.info("Trying to get informations of death VM: %s" + % vm.name) + continue You can fail the entire test. Afterwards it will be hard to find the issue. Well if it's what the community wants, we can change it. We just didn't want to lose information about the rest of the systems. Perhaps we can set some DIE flag and after collecting all statistics raise an Error. I don't think we need to continue testing if something as basic as a VM died upon us. --- snip --- + def get_true_pid(vm): + pid = vm.process.get_pid() + for i in range(1,10): + pid = pid + 1 What are you trying to do here? It seems like a nasty hack that might fail on load. qemu has a -pidfile option. It works fine. Yes and I'm really sorry for this ugly hack. The qemu command has changed since the first patch was made. Nowadays vm.pid returns the PID of the command itself, not the actual qemu process. We need to have the PID of the actual qemu process, which is executed by the command with PID vm.pid. That's why I first try finding the qemu process as the PID following vm.pid. I haven't found another solution yet (in case we don't want to change the qemu command back in the framework). We have tested this solution under heavy process load and either the first or second part always finds the right value. --- snip --- + if (params['ksm_test_size'] == "paralel") : + vmsc = 1 + overcommit = 1 + mem = host_mem + # 32bit system adjustment + if not params['image_name'].endswith("64"): + logging.debug("Probably i386 guest architecture, \ + max allocator mem = 2G") Better not to rely on the guest name. You can test percentage of the guest mem. What do you mean by percentage of the guest mem? This adjustment is made because the maximum memory for 1 process in a 32 bit OS is 2GB. 
Testing of the 'image_name' proved to be the most reliable method we found. It's not that important but it should be a convention of kvm autotest. If that's acceptable, fine, otherwise, each VM will define it in the config file --- snip --- + # Guest can have more than 2G but kvm mem + 1MB (allocator itself) + # can't + if (host_mem > 2048): + mem = 2047 + + + if os.popen("uname -i").readline().startswith("i386"): + logging.debug("Host is i386 architecture, max guest mem is 2G") There are bigger 32 bit guests. What do you mean by this note? We are testing whether the host machine is 32 bit. If so, the maximum process allocation is 2GB (similar case to a 32 bit guest) but this time the whole qemu process (2GB qemu machine + 64 MB qemu overhead) can't exceed 2GB. Still the maximum memory used in the test is the same (as we increase the VM count - host_mem = quest_mem * vm_count; quest_mem is decreased, vm_count is increased) i386 guests with PAE mode (additional 4 bits) can have up to 64G ram in theory. --- snip --- + + # Copy the allocator.c into guests .py yes indeed. --- snip --- + # Let ksmd work (until shared mem reaches expected value) + shm = 0 + i = 0 + cmd = "cat /proc/%d/statm" % get_true_pid(vm) + while shm < ksm_size: + if i > 64: + logging.info(get_stat(lvms)) + raise error.TestError("SHM didn't merge the memory until \ + the DL on guest: %s" % (vm.name)) + logging.debug("Sleep(%d)" % (ksm_size / 200 * perf_ratio)) + time.sleep(ksm_size / 200 * perf_ratio) + try: + shm = int(os.popen(cmd).readline().split()[2]) + shm = shm * 4 / 1024 + i = i + 1 Either you have a nice statistics calculation function or not. I vote for the first case. Yes, we are using the statistics function for the output. But in this case we just need to know the shm value, not to log anything. 
If this is a big problem even for others, we can split the statistics function into 2: int = _get_stat(vm) - returns shm value string = get_stat(vm) - Uses _get_stats and creates a nice log output --- snip --- + Check if memory in max loading guest is allright + logging.info("Starting phase 3b") + + Kill rest of machine We should have a function for it for all kvm autotest you think lsessions[i].close() instead of (status,data) = lsessions[i].get_command_status_output("exit;", 20)? Yes, it would be better. + for i in range(last_vm+1, vmsc): + (status,data) = lsessions[i].get_command_status_output("exit;", 20) + if i == (vmsc-1): + logging.info(get_stat([lvms[i]])) + lvms[i].destroy(gracefully = False) --- snip --- + def phase_paralel(): + Paralel page spliting + logging.info("Phase 1: Paralel page spliting") + # We have to wait until allocator is finished (it waits 5 seconds to + # clean the socket + The whole function is very similar to phase_separate_first_guest please refactor them. Yes, those functions are a bit similar. On the other hand there are
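The statm-based shm measurement discussed in this review reads the third field of /proc/&lt;pid&gt;/statm (shared pages) and converts pages to megabytes. A pure-function sketch of that arithmetic (the sample statm line is made up):

```python
def shm_mb(statm_line, page_kb=4):
    """/proc/<pid>/statm fields are size, resident, shared, ... in pages;
    convert the 'shared' field to megabytes (4 kB pages assumed)."""
    shared_pages = int(statm_line.split()[2])
    return shared_pages * page_kb / 1024.0

# e.g. a qemu process sharing 524288 pages (2 GB with 4 kB pages):
print(shm_mb("800000 600000 524288 1200 0 300000 0"))  # 2048.0
```

This is the same `shm * 4 / 1024` conversion the patch performs inline; factoring it out would let both `_get_stat` and the log-producing `get_stat` share it.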
Re: [Autotest] [KVM-autotest][RFC] 32/32 PAE bit guest system definition
On 12/15/2009 09:04 PM, Lucas Meneghel Rodrigues wrote: On Fri, Dec 11, 2009 at 2:34 PM, Jiri Zupkajzu...@redhat.com wrote: Hello, we are writing the KSM_overcommit test. If we calculate memory for a guest we need to know which architecture the guest is. If it is a 32b or 32b with PAE or 64b system. Because with a 32b guest we can allocate only 3100M +-. Currently we use the name of the disk's image file. The image file name ends with 64 or 32. Is there a way we can detect if the guest machine runs with PAE etc.? Do you think that kvm_autotest can define a parameter in kvm_tests.cfg which determines if the guest is 32b or 32b with PAE or 64b. Hi Jiri, sorry for taking long to answer you, I am reviewing the overcommit test. About your question, I'd combine your approach of picking if the host is 32/64 bit from the image name with looking at /proc/cpuinfo for PAE support. We might keep it KISS for the time being since 99% of host installations are 64 bit only and many times only the guest can turn on PAE to practically use it. So I'll go with naming only. Let's go with this approach for the final version of the test, OK? Thanks and congrats for the test, it's a great piece of work! More comments soon, -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Benchmarking on CentOS 5
On Mon, 2008-06-02 at 14:35 +0530, Amit Shah wrote: On Friday 30 May 2008 23:00:41 Farkas Levente wrote: this is our production server at the development department (10-15) people using it so actually if i tell them that i'll stop the host and all guests for max an hour it's acceptable, but more not really. it's run it type programs. from my experience in the last 6-12 months is that kvm is not production ready. as you can read from this list there are far too many changes day-by-day which are very core. and this comes from the current state of kvm. which indicates that rh can't include in there You'll find the most stable version of kvm in the kernel that your distribution ships. Linux-2.6.x (where x > 20) should also be stable. The development on kvm will continue to proceed at a fast pace, so you'll see several kvm releases and this, as a result, is bound to bring in a few new bugs in each iteration. imho the biggest problem with the current development of kvm is that there are no stable releases which is somewhat related to the current release number. eg kvm-0.5.x kvm-0.6.x would be better. but currently So the short answer is: if you're looking for a stable version of kvm, look at a kernel.org kernel or the kvm version provided to you by your distribution. kvm development is so fast that keeping 2-3 parallel branches where there is a development and stable release seems to be too much work. so to answer your question i don't know:-( The stable branch of kvm is the one in the most-recently available Linux kernel from kernel.org. kvm.git is the development version. In the near future we'll publish a stable branch. There are actually 2 repositories: a kernel repo, based on the latest kernel - 2.6.26 - and a userspace repository that will be based on kvm-68. The idea is to maintain the above repos together and only apply bug fixes. New features will come with every next kernel release. We're in the process of creating an automatic test framework for kvm. 
It will be an open source framework based on autotest and similar to Anthony's kvmtest. It will help stabilizing both the 'stable' branch and the master. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/4] KVM: Report hardware virtualization features
On Sun, 2008-06-22 at 09:49 +0300, Avi Kivity wrote: Yang, Sheng wrote: From f02d2ccf01e8671d2da517f14a908d1df1cc42ad Mon Sep 17 00:00:00 2001 From: Sheng Yang [EMAIL PROTECTED] Date: Thu, 19 Jun 2008 18:41:26 +0800 Subject: [PATCH] KVM: Report hardware virtualization features The hardware virtualization technology evolves very fast. But currently it's hard to tell if your CPU supports a certain kind of HW technology without digging into the source code. The patch introduces a virtual file called kvm_hw_feature_report under /sys/devices/system/kvm/kvm0 to show the most important current hardware virtualization features, then it's pretty easy to tell if your CPU supports some advanced virtualization technology now. Yes, this is definitely helpful. However, I think that users will expect cpu flags under /proc/cpuinfo. Perhaps we should add a new line 'virt flags' to /proc/cpuinfo? I think all the features are reported using msrs, so it can be done from arch/x86/kernel/cpu/proc.c without involving kvm at all. While I agree with Avi, it would be nice though to see them on older kernels. At least sprinkle a printk message. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] Fix time drift problem under high load when PIT is in use.
On Sun, 2008-06-29 at 16:59 +0300, Gleb Natapov wrote: Count the number of interrupts that were lost due to interrupt coalescing and re-inject them back when possible. This fixes the time drift problem when the pit is used as a time source. Signed-off-by: Gleb Natapov [EMAIL PROTECTED] --- hw/i8254.c | 20 +++- 1 files changed, 19 insertions(+), 1 deletions(-) diff --git a/hw/i8254.c b/hw/i8254.c index 4813b03..c4f0f46 100644 --- a/hw/i8254.c +++ b/hw/i8254.c @@ -61,6 +61,8 @@ static PITState pit_state; static void pit_irq_timer_update(PITChannelState *s, int64_t current_time); +static uint32_t pit_irq_coalesced; The pit has 3 channels, it should be a channel field. Also every time the pit frequency changes the above field should be compensated with * (new_freq/old_freq). For example, if the guest was running with a 1000hz clock and the pit_irq_coalesced value is currently 100, a frequency change to 100hz should reduce pit_irq_coalesced to 10. Except that, it's high time we stop drifting :) + static int pit_get_count(PITChannelState *s) { uint64_t d; @@ -369,12 +371,28 @@ static void pit_irq_timer_update(PITChannelState *s, int64_t current_time) return; expire_time = pit_get_next_transition_time(s, current_time); irq_level = pit_get_out1(s, current_time); -qemu_set_irq(s->irq, irq_level); +if(irq_level) { +if(!qemu_irq_raise(s->irq)) +pit_irq_coalesced++; +} else { +qemu_irq_lower(s->irq); +if(pit_irq_coalesced > 0) { +if(qemu_irq_raise(s->irq)) +pit_irq_coalesced--; +qemu_irq_lower(s->irq); +} +} + #ifdef DEBUG_PIT printf("irq_level=%d next_delay=%f\n", irq_level, (double)(expire_time - current_time) / ticks_per_sec); #endif +if(pit_irq_coalesced && expire_time != -1) { +uint32_t div = ((pit_irq_coalesced >> 10) & 0x7f) + 2; +expire_time -= ((expire_time - current_time) / div); +} + s->next_transition_time = expire_time; if (expire_time != -1) qemu_mod_timer(s->irq_timer, expire_time); -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] 
More majordomo info at http://vger.kernel.org/majordomo-info.html
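The re-injection back-off at the end of the patch shortens the next timer expiry by a divisor derived from the coalesced-interrupt backlog, so lost ticks get re-injected faster. The arithmetic can be checked in isolation (this is a direct transcription of the patch's expression into Python, not qemu code):

```python
def reinjection_expire(expire_time, current_time, coalesced):
    """Mirror of the patch's back-off arithmetic: shorten the next PIT
    expiry so coalesced ticks can be re-injected sooner. The divisor
    grows slowly with the backlog: ((coalesced >> 10) & 0x7f) + 2."""
    if coalesced and expire_time != -1:
        div = ((coalesced >> 10) & 0x7f) + 2
        expire_time -= (expire_time - current_time) // div
    return expire_time

# With a small backlog the interval is halved (div == 2):
print(reinjection_expire(1000, 0, 1))     # 500
# A backlog of 2048 bumps the divisor to 4, a gentler speed-up:
print(reinjection_expire(1000, 0, 2048))  # 750
```

Note how the `>> 10` means the divisor only starts growing once more than 1024 interrupts have been coalesced, capping the speed-up for small backlogs at 2x.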
[PATCH] Fix block mode during halt emulation
From d85feaae019bc0abc98a2524369e04d521a78aa8 Mon Sep 17 00:00:00 2001 From: Dor Laor [EMAIL PROTECTED] Date: Mon, 30 Jun 2008 18:22:44 -0400 Subject: [PATCH] Fix block mode during halt emulation There is no need to check for pending pit/apic timer, nor pending virq, since all of them check KVM_MP_STATE_RUNNABLE and wake up the waitqueue. It fixes 100% cpu when a windows guest is shut down (non acpi HAL) Signed-off-by: Dor Laor [EMAIL PROTECTED] --- virt/kvm/kvm_main.c |4 1 files changed, 0 insertions(+), 4 deletions(-) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index b90da0b..faa0778 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -816,10 +816,6 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu) for (;;) { prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE); - if (kvm_cpu_has_interrupt(vcpu)) - break; - if (kvm_cpu_has_pending_timer(vcpu)) - break; if (kvm_arch_vcpu_runnable(vcpu)) break; if (signal_pending(current)) -- 1.5.4 0001-Fix-block-mode-during-halt-emulation.patch Description: application/mbox
Re: Sharing variables/memory between host and guest ?
Arn wrote: How can one share memory (a few variables, not necessarily a page) between host/hypervisor and guest VM? Since the guest is just a process within the host, there should be existing ways to do this. It's not that straightforward since the host has its pfn (page frame number) while the guest has gfn (guest frame number) and also uses virtual memory. What about using something like debugfs or sysfs, is it possible to share variables this way? Note, I want a system that is fast, i.e. changes to shared variable/memory should be visible instantly. A paravirtualized driver can take care of that with a driver in the guest and the device side in qemu/host kernel. You can use the 9p virtio solution in Linux that implements a shared file system. I searched the kvm-devel archives and found emails referring to kshmem but a search on the kvm-70 code turns up nothing. There are also some emails on sharing a page but no final outcome or what exactly to do. Thanks Arn -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: rtl8139 stop working under high load
Farkas Levente wrote: hi, i just switched to using the rtl8139 network emulator in kvm-70 for the guests, but under high load it simply stops working. a reboot or even a service network restart solves the problem, but imho there has to be some bug in qemu's rtl8139 code. and there is not any kind of error in any log (neither the host's nor the guest's). this did not happen with e1000. but with e1000 the network sometimes seems to be breathing (sometimes slows down and then speeds up again). which is currently the preferred network device in qemu/kvm? thanks. I think rtl8139 is the most stable, and afterwards virtio and e1000 (which both perform much better too). Maybe it's an irq problem. Can you run the same test with -no-kvm-irqchip? -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ANNOUNCE] kvm-autotest
It's definitely worth looking at the autotest server code/samples. There exists code in-tree already to build and deploy kvm via autotest server mode, in which a single machine can drive the building, installing, and creation of guests on N clients, directing each guest image to run various autotest client tests, and collecting all of the results. See autotest/server/samples/*kvm* A proper server setup is a little involved[1] but much more streamlined these days. Let's think of a guest-installation test. Would you implement it on the server or on the client? What do you plan for non-linux guests? We'll try this little exercise of writing a kvm-test on the server side and on the client side and compare complexity. Thanks, Uri. IMHO we need a mixture: - kvm/environment setup autoserve tests/deploy - Internal guest tests Implemented as client tests, executed from the server. Composed of benchmarks, standard functionality, applications, unit tests, etc. - guest installation, guest boot client tests that execute on the kvm host Regards, Dor -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: best practices for kvm setup?
Rik Theys wrote: Hi, I'm looking into virtualizing some of our servers onto two (or more) physical nodes with either KVM or Xen. What are the 'best practices' for running virtual _servers_ with KVM? Any good/bad experiences with running KVM for virtual servers that have to run for months on end? I've installed Ubuntu 8.04 because it ships KVM as the default virtualization tool and is the only 'enterprise' distribution with KVM right now. I used one host to act as an iSCSI target and installed Ubuntu with KVM on two other nodes. I can create a virtual server with virt-manager, but it seems live migration is not (yet) supported by libvirt/virsh? So how are other people running their KVM virtual servers? Do you create a script for each virtual server and invoke kvm directly? How do you do the live migration then? Launch the script with an 'incoming' parameter on the target host, and run the migrate command manually? If libvirt does not support migration then you'll need to automate it yourself; we use a daemon to exec/migrate VMs. AFAIK, except for libvirt* there is no other free tool for it. Or is there another (automated) way? I once tried live migration on a test host and, if I recall correctly, the kvm process kept on running on the source host even after the server was migrated to the target? Is that the expected behaviour? This works as designed: the idea is that a third-party management tool gets the result of the migration process and closes one of the source/destination. Without that third party, the destination cannot continue until the source got the end-of-migration message, and the opposite holds on failure. What type of shared storage is best used with KVM (or Xen for that matter)? Our physical servers will be connected to a SAN. Should I create volumes on my SAN and export them to my physical servers, where I can then use them as /dev/by-id/xxx disks in my KVM configs?
Or should I configure my two servers into a GFS cluster and use files as backend for my KVM virtual machines? What are you using as shared storage? We use NFS and it works pretty well; your proposals are also valid options. Just make sure an image is not accessed in parallel by two hosts. Regards, Rik Disclaimer: http://www.kuleuven.be/cwis/email_disclaimer.htm
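The "make sure an image is not accessed in parallel by two hosts" advice can be enforced mechanically. A minimal sketch, assuming util-linux flock(1) is available; flock is advisory only, and over NFS is only meaningful with a working lock daemon:

```shell
# Guard a disk image with an exclusive lock so two qemu instances
# cannot open it at once (image path is a stand-in, created here so
# the sketch is self-contained and safe to run).
img=$(mktemp)
exec 9>"$img.lock"
if flock -n 9; then
    holder=yes      # we own the image; this is where qemu would be exec'd
else
    holder=no
fi
# A second taker (e.g. the other host) must now be refused:
if flock -n "$img.lock" -c true; then
    second=granted
else
    second=refused
fi
echo "holder=$holder second=$second"
rm -f "$img" "$img.lock"
```

A wrapper like this around the qemu launch script gives a cheap safety net until proper management tooling arrives.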
Re: kvm: Unknown error 524, Fail to handle apic access vmexit
Martin Michlmayr wrote: I installed a Windows XP SP2 guest on a Debian x86_64 host. The installation itself went fine, but kvm aborts when XP starts during Windows XP Setup. XP mentions something about intelppm.sys (see the attached screenshot) and kvm says: kvm_run: Unknown error 524 kvm_run returned -524 It's a FlexPriority bug; while it should be solved, you can disable the feature using a kvm-intel module parameter. In dmesg, I see: [ 8891.352876] Fail to handle apic access vmexit! Offset is 0xf0 This happens with kvm 70, and kernels 2.6.25 and 2.6.26-rc9. Someone else reported a similar problem before but there was no response: http://www.mail-archive.com/[EMAIL PROTECTED]/msg12111.html
Re: kvm: Unknown error 524, Fail to handle apic access vmexit
Yang, Sheng wrote: On Tuesday 15 July 2008 23:19:07 Dor Laor wrote: Martin Michlmayr wrote: I installed a Windows XP SP2 guest on a Debian x86_64 host. The installation itself went fine, but kvm aborts when XP starts during Windows XP Setup. XP mentions something about intelppm.sys (see the attached screenshot) and kvm says: kvm_run: Unknown error 524 kvm_run returned -524 It's a FlexPriority bug; while it should be solved, you can disable it using a kvm-intel module parameter. Dor, are you sure it's a FlexPriority bug? Well, I'm not sure it's FlexPriority's fault; it's just that when it is disabled this does not happen, and I saw the apic access. It could be mis-emulation too. It happened to me on ~kvm-69. If you look at where the complaint is, you will find it is the result of emulate_instruction(), and you will find a clear emulation failure (mmio) rip 7cb3d000 ff ff 8d 85 in the bug tracker Martin mentioned above. The Fail to handle apic access vmexit! Offset is 0xf0 refers to the Spurious Interrupt Vector Register. I don't think ff ff 8d 85 is a valid opcode for that case. Maybe it's a regression? The last report was long ago... Hi Martin, can you show more dmesg output here? And can it be reproduced reliably? Thanks.
Re: kvm guest loops_per_jiffy miscalibration under host load
Marcelo Tosatti wrote: On Tue, Jul 22, 2008 at 10:22:00AM +0200, Jan Kiszka wrote: The in-kernel PIT rearms relative to the host clock, so the frequency is more reliable (next_expiration = prev_expiration + count). The same happens under plain QEMU: static void pit_irq_timer_update(PITChannelState *s, int64_t current_time); static void pit_irq_timer(void *opaque) { PITChannelState *s = opaque; pit_irq_timer_update(s, s->next_transition_time); } True. I misread current_time. In my experience QEMU's PIT suffers from lost ticks under load (when some delay gets larger than 2*period). Yes, with clock=pit on RHEL4 it's quite noticeable, even with -tdf. Note that -tdf works only when you use the userspace irqchip too; then it should work. The in-kernel timer seems immune to that under the load I was testing. In the long run we should try to remove the in-kernel PIT. Currently it does handle the PIT irq coalescing problem that leads to time drift. The problem is that it's not yet 100% production level, migration with it has some issues, and basically we should try not to duplicate userspace code unless there is a good reason (like performance). There are floating patches by Gleb Natapov for the PIT and virtual RTC to prevent time drifts. Hope they'll get accepted by qemu. I recently played a bit with QEMU's new icount feature. That one tracks the guest's progress based on a virtual instruction pointer and derives QEMU's virtual clock from it, but also tries to keep that clock in sync with the host by periodically adjusting its scaling factor (a kind of virtual CPU frequency tuning to keep the TSC in sync with real time). It works quite nicely, but my feeling is that the adjustment is not 100% stable yet. Maybe such a pattern could be applied to kvm as well, with tsc_vmexit - tsc_vmentry serving as the guest progress counter (instead of icount, which depends on QEMU's code translator). I see. Do you have patches around?
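The lost-tick effect discussed above is easy to put numbers on. A back-of-the-envelope model with made-up load figures (a 1000 Hz guest PIT and assumed iothread stalls; none of these numbers come from the thread): a one-shot userspace PIT timer that is delayed by stall_ms fires only once, so all but one of the ticks due in that window are lost.

```shell
hz=1000               # assumed guest PIT frequency
stall_ms=250          # assumed scheduling delay under host load
stalls_per_min=8      # assumed frequency of such stalls

# one tick is delivered late, the rest of the window's ticks are lost
lost_per_stall=$((stall_ms - 1000 / hz))
lost_per_min=$((lost_per_stall * stalls_per_min))
drift_ms=$((lost_per_min * 1000 / hz))
echo "ticks lost per minute: $lost_per_min"
echo "guest clock drift: $drift_ms ms/min"
```

With these invented figures the guest falls behind by roughly two seconds a minute, which matches the scale of drift people report under load.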
Re: [PATCH 2/2] Remove -tdf
Anthony Liguori wrote: The last time I posted the KVM patch series to qemu-devel, the -tdf patch met with some opposition. Since today we implement timer catch-up in the in-kernel PIT and the in-kernel PIT is used by default, it doesn't seem all that valuable to have timer catch-up in userspace too. Removing it will reduce our divergence from QEMU. IMHO the in-kernel PIT should go away; there is no reason to keep it except that the userspace PIT drifts. Currently both the in-kernel PIT and even the in-kernel irqchips are not 100% bullet proof. Of course this code is a hack; Gleb Natapov has sent a better fix for PIT/RTC to the qemu list. Can you look into them: http://www.mail-archive.com/kvm@vger.kernel.org/msg01181.html Thanks, Dor Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/hw/i8254.c b/qemu/hw/i8254.c
index 69eb889..d0394c0 100644
--- a/qemu/hw/i8254.c
+++ b/qemu/hw/i8254.c
@@ -332,11 +332,6 @@ static uint32_t pit_ioport_read(void *opaque, uint32_t addr)
     return ret;
 }

-/* global counters for time-drift fix */
-int64_t timer_acks=0, timer_interrupts=0, timer_ints_to_push=0;
-
-extern int time_drift_fix;
-
 static void pit_irq_timer_update(PITChannelState *s, int64_t current_time)
 {
     int64_t expire_time;
@@ -347,24 +342,6 @@ static void pit_irq_timer_update(PITChannelState *s, int64_t current_time)
     expire_time = pit_get_next_transition_time(s, current_time);
     irq_level = pit_get_out1(s, current_time);
     qemu_set_irq(s->irq, irq_level);
-    if (time_drift_fix && irq_level==1) {
-        /* FIXME: fine tune timer_max_fix (max fix per tick).
-         *        Should it be 1 (double time), 2 , 4, 10 ?
-         *        Currently setting it to 5% of PIT-ticks-per-second (per PIT-tick)
-         */
-        const long pit_ticks_per_sec = (s->count > 0) ? (PIT_FREQ/s->count) : 0;
-        const long timer_max_fix = pit_ticks_per_sec/20;
-        const long delta = timer_interrupts - timer_acks;
-        const long max_delta = pit_ticks_per_sec * 60; /* one minute */
-        if ((delta > max_delta) && (pit_ticks_per_sec > 0)) {
-            printf("time drift is too long, %ld seconds were lost\n", delta/pit_ticks_per_sec);
-            timer_acks = timer_interrupts;
-            timer_ints_to_push = 0;
-        } else if (delta > 0) {
-            timer_ints_to_push = MIN(delta, timer_max_fix);
-        }
-        timer_interrupts++;
-    }
 #ifdef DEBUG_PIT
     printf("irq_level=%d next_delay=%f\n", irq_level,
diff --git a/qemu/hw/i8259.c b/qemu/hw/i8259.c
index b266119..1707434 100644
--- a/qemu/hw/i8259.c
+++ b/qemu/hw/i8259.c
@@ -221,35 +221,18 @@ static inline void pic_intack(PicState *s, int irq)
     } else {
         s->isr |= (1 << irq);
     }
-
     /* We don't clear a level sensitive interrupt here */
     if (!(s->elcr & (1 << irq)))
         s->irr &= ~(1 << irq);
-
 }

-extern int time_drift_fix;
-
 int pic_read_irq(PicState2 *s)
 {
     int irq, irq2, intno;

     irq = pic_get_irq(&s->pics[0]);
     if (irq >= 0) {
-        pic_intack(&s->pics[0], irq);
-#ifndef TARGET_IA64
-        if (time_drift_fix && irq == 0) {
-            extern int64_t timer_acks, timer_ints_to_push;
-            timer_acks++;
-            if (timer_ints_to_push > 0) {
-                timer_ints_to_push--;
-                /* simulate an edge irq0, like the one generated by i8254 */
-                pic_set_irq1(&s->pics[0], 0, 0);
-                pic_set_irq1(&s->pics[0], 0, 1);
-            }
-        }
-#endif
         if (irq == 2) {
             irq2 = pic_get_irq(&s->pics[1]);
             if (irq2 >= 0) {
diff --git a/qemu/vl.c b/qemu/vl.c
index 19c8bbf..d6877cd 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -229,7 +229,6 @@
 const char *option_rom[MAX_OPTION_ROMS];
 int nb_option_roms;
 int semihosting_enabled = 0;
 int autostart = 1;
-int time_drift_fix = 0;
 unsigned int kvm_shadow_memory = 0;
 const char *mem_path = NULL;
 int hpagesize = 0;
@@ -7968,7 +7967,6 @@ static void help(int exitcode)
 #ifndef _WIN32
            "-daemonize          daemonize QEMU after initializing\n"
 #endif
-           "-tdf                inject timer interrupts that got lost\n"
            "-kvm-shadow-memory megs  set the amount of shadow pages to be allocated\n"
            "-mem-path           set the path to hugetlbfs/tmpfs mounted directory, also
                                 enables allocation of guest memory with huge pages\n"
            "-option-rom rom     load a file, rom, into the option ROM space\n"
@@ -8089,7 +8087,6 @@ enum {
     QEMU_OPTION_tb_size,
     QEMU_OPTION_icount,
     QEMU_OPTION_incoming,
-    QEMU_OPTION_tdf,
     QEMU_OPTION_kvm_shadow_memory,
     QEMU_OPTION_mempath,
 };
@@ -8202,7 +8199,6 @@ const QEMUOption qemu_options[] = {
 #if defined(TARGET_ARM) || defined(TARGET_M68K)
     { "semihosting", 0, QEMU_OPTION_semihosting },
 #endif
-    { "tdf", 0, QEMU_OPTION_tdf }, /* enable time drift fix */
     { "kvm-shadow-memory", HAS_ARG, QEMU_OPTION_kvm_shadow_memory },
     { "name", HAS_ARG, QEMU_OPTION_name },
 #if
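For reference, the bookkeeping the removed -tdf code performed (counting raised vs. acked timer irqs, and re-injecting at most 5% of the tick rate per tick) can be modelled in a few lines. The numbers below are illustrative; only the variable names follow the patch:

```shell
ticks_per_sec=1000                       # assumed PIT programming
timer_max_fix=$((ticks_per_sec / 20))    # 5% cap per tick, as in the patch
timer_interrupts=1000                    # irqs the PIT raised
timer_acks=870                           # irqs the guest actually acked

ints_to_push=0
delta=$((timer_interrupts - timer_acks)) # backlog of coalesced irqs
if [ "$delta" -gt 0 ]; then
    ints_to_push=$delta
    [ "$ints_to_push" -gt "$timer_max_fix" ] && ints_to_push=$timer_max_fix
fi
echo "backlog=$delta, re-inject this tick: $ints_to_push"
```

So a guest that fell 130 ticks behind is caught up 50 ticks at a time; the cap keeps the catch-up from itself overwhelming the guest.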
Re: [PATCH 2/2] Remove -tdf
Anthony Liguori wrote: Gleb Natapov wrote: On Tue, Jul 22, 2008 at 08:20:41PM -0500, Anthony Liguori wrote: Currently both the in-kernel PIT and even the in-kernel irqchips are not 100% bullet proof. Of course this code is a hack; Gleb Natapov has sent a better fix for PIT/RTC to the qemu list. Can you look into them: http://www.mail-archive.com/kvm@vger.kernel.org/msg01181.html Paul Brook's initial feedback is still valid. It causes quite a lot of churn and may not jive well with a virtual time base. An advantage of the current -tdf patch is that it's more contained. I don't think either approach is going to get past Paul in its current form. Yes, my patch causes a lot of churn because it changes a widely used API. Indeed. But the time drift fix itself is contained to the PIT/RTC code only. The last patch series I sent disables the time drift fix if the virtual time base is enabled, as Paul requested. There was no further feedback from him. I think there's a healthy amount of scepticism about whether tdf really is worth it. This is why I suggested that we need to better quantify exactly how much this patch set helps things. For instance, a time drift test for kvm-autotest would be perfect. tdf is ugly and deviates from how hardware works; a compelling case is needed to justify it. We'll add time drift tests to autotest the minute it starts to run enough interesting tests/loads. In our private test platform we use a simple scenario to test it: 1. Use a Windows guest and play a movie (this changes the timer frequency to 1000 Hz, on the RTC for ACPI Windows and on the PIT for -no-acpi Windows). 2. Pin the guest to a physical cpu and load the same cpu. 3. Measure a minute in real life vs. in the guest. Actually the movie seems to be smoother without the time drift fix; when fixing irqs, the player sometimes needs to cope with too-rapid changes. Anyway, the main focus is time accuracy, not smoother movies.
The in-kernel PIT does a relatively good job for Windows guests; the problem is that it's not yet 100% stable, we could also do it in userspace, and the RTC needs a solution too. As Jan Kiszka wrote in one of his mails, maybe Paul's virtual time base can be adapted to work with KVM too. BTW, how does the virtual time base handle SMP guests? I really don't know; I haven't looked too deeply at the virtual time base. Keep in mind, though, that QEMU SMP is not true SMP: all VCPUs run in lock-step. Regards, Anthony Liguori Also, it's important that this is reproducible in upstream QEMU and not just in KVM. If we can make a compelling case for the importance of this, we can possibly work out a compromise. I developed and tested my patch with upstream QEMU. -- Gleb.
Re: Live Migration, DRBD
Kent Borg wrote: I am very happy to discover that KVM does live migration. Now I am figuring out whether it will work for me. What I have in mind is to use DRBD for the file system image. The problem is that during the migration I want to shift the file system access at the moment when the VM has quit running on the host it is leaving but before it starts running on the host where it is arriving. Is there a hook to let me do stuff at this point? This is what I want to do: On the departing machine... - VM has stopped here - umount the volume with the VM file system image - mark the volume in DRBD as secondary On the arriving machine... - mark the volume in DRBD as primary - mount the volume with the VM file system image - VM can now start here Is there a way? No, but one can add such a hook pretty easily. The whole migration code is in one file, qemu/migration.c. You can add a parameter to the qemu migration command to specify a script that should be called on the migration-end event (similar to the tap script). Thanks, -kb
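A sketch of what such a migration-end hook script could look like for the DRBD hand-off described above. The role argument, resource name, and mount point are all hypothetical, and the script only prints the commands it would run, so it is safe to try anywhere:

```shell
# Hypothetical hook invoked by qemu at end of migration (dry-run: echoes
# the drbdadm/mount commands instead of executing them).
RES=r0            # DRBD resource backing the guest image (assumed name)
MNT=/srv/guest    # where the image's filesystem is mounted (assumed path)

migration_end() { # $1 = "source" or "destination"
    case "$1" in
      source)
        echo "umount $MNT"
        echo "drbdadm secondary $RES"
        ;;
      destination)
        echo "drbdadm primary $RES"
        echo "mount /dev/drbd0 $MNT"
        ;;
    esac
}

migration_end source
migration_end destination
```

Dropping the echoes turns it into the real hook; the ordering (demote on the source before promote on the destination) is exactly the sequencing Kent asks for.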
Re: scsi broken >4GB RAM
Martin Maurer wrote: Using an IDE boot disk, no problem. Win2008 (64-bit) works without any problems with 6 GB RAM in the guest. After successfully booting from IDE, I added a second disk using SCSI: Windows sees the disk but cannot initialize it. So SCSI looks quite unusable if you run a Windows guest (win2003 sp2 also stops during install) - or should we load a SCSI driver during setup? Win2008 uses the LSI Logic 8953U PCI SCSI Adapter, 53C895A Device (LSI Logic Driver 4.16.6.0, signed). Any other experiences running SCSI on Windows? You're right, it's broken right now :( At least IDE is stable. Best Regards, Martin -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Martin Maurer Sent: Donnerstag, 24. Juli 2008 11:46 To: kvm@vger.kernel.org Subject: RE: scsi broken >4GB RAM Sorry, just returned to the installer - it also stopped with the same error code, using just 2 GB RAM. Best Regards, Martin Maurer [EMAIL PROTECTED] http://www.proxmox.com Proxmox Server Solutions GmbH Kohlgasse 51/10, 1050 Vienna, Austria Phone: +43 1 545 4497 11 Fax: +43 1 545 4497 22 Commercial register no.: FN 258879 f Registration office: Handelsgericht Wien -Original Message- From: Martin Maurer Sent: Donnerstag, 24. Juli 2008 11:44 To: kvm@vger.kernel.org Subject: RE: scsi broken >4GB RAM Hi, I tried Windows Server 2008 (64-bit) on Proxmox VE 0.9beta2 (KVM 71, see http://pve.proxmox.com): Some details: --memory 6144 --cdrom en_windows_server_2008_datacenter_enterprise_standard_x64_dvd_X14- 26714.iso --name win2008-6gb-scsi --smp 1 --bootdisk scsi0 --scsi0 80 The installer shows an 80 GB harddisk but freezes for a minute after clicking next, then: Windows could not create a partition on disk 0. The error occurred while preparing the computer's system volume. Error code: 0x8004245F. I also got installer problems if I just use scsi as the boot disk (no high memory) on several Windows versions, including win2003 and xp. So I decided to use IDE, which works without any issue on Windows.
But: I reduced the memory to 2048 and the installer continues to work! Best Regards, Martin Maurer [EMAIL PROTECTED] http://www.proxmox.com Proxmox Server Solutions GmbH Kohlgasse 51/10, 1050 Vienna, Austria Phone: +43 1 545 4497 11 Fax: +43 1 545 4497 22 Commercial register no.: FN 258879 f Registration office: Handelsgericht Wien -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Henrik Holst Sent: Mittwoch, 23. Juli 2008 23:09 To: kvm@vger.kernel.org Subject: scsi broken >4GB RAM I do not know if this is a bug in qemu or in the Linux kernel sym53c8xx module (I haven't had the opportunity to test with anything other than Linux at the moment), but if one starts a qemu instance with -m 4096 or larger, the emulated SCSI disk fails in the Linux guest. If booting any install cd, /dev/sda is seen as only 512 B in size, and if booting an ubuntu 8.04-amd64 with the secondary drive as SCSI it is seen with the correct size but one can neither read nor write the partition table. Is there anyone out there who could test, say, a Windows image on SCSI with 4 GB or more of RAM and see if it works or not? If so, it could be the Linux driver that is faulty. /Henrik Holst
Re: [PATCH 8/9] kvm: qemu: Drop the mutex while reading from tapfd
Mark McLoughlin wrote: The idea here is that with GSO, packets are much larger and we can allow the vcpu threads to e.g. process irq acks during the window where we're reading these packets from the tapfd. Signed-off-by: Mark McLoughlin [EMAIL PROTECTED] --- qemu/vl.c | 2 ++ 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/qemu/vl.c b/qemu/vl.c
index efdaafd..de92848 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -4281,7 +4281,9 @@ static void tap_send(void *opaque)
     sbuf.buf = s->buf;
     s->size = getmsg(s->fd, NULL, &sbuf, &f) >= 0 ? sbuf.len : -1;
 #else

Maybe do it only when GSO is actually used by the guest/tap. Otherwise it can cause some context thrashing, right?

+    kvm_sleep_begin();
     s->size = read(s->fd, s->buf, sizeof(s->buf));
+    kvm_sleep_end();
 #endif
     if (s->size == -1 && errno == EINTR)
Re: [PATCH 3/9] kvm: qemu: Remove virtio_net tx ring-full heuristic
Mark McLoughlin wrote: virtio_net tries to guess, when it has received a tx notification from the guest, whether it indicates that the guest has no more room in the tx ring and it should immediately flush the queued buffers. The heuristic is based on the fact that there are 128 buffer entries in the ring and each packet uses 2 buffers (i.e. the virtio_net_hdr and the packet's linear data). Using GSO or increasing the size of the rings will break that heuristic, so let's remove it and assume that any notification from the guest after we've disabled notifications indicates that we should flush our buffers. Signed-off-by: Mark McLoughlin [EMAIL PROTECTED] --- qemu/hw/virtio-net.c | 3 +-- 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index 31867f1..4adfa42 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -175,8 +175,7 @@ static void virtio_net_handle_tx(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIONet *n = to_virtio_net(vdev);

-    if (n->tx_timer_active &&
-        (vq->vring.avail->idx - vq->last_avail_idx) == 64) {
+    if (n->tx_timer_active) {
         vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY;
         qemu_del_timer(n->tx_timer);
         n->tx_timer_active = 0;

Actually we can improve latency a bit more by using this timer only for high-throughput scenarios. For example, if during the previous timer period no or few packets were accumulated, we can set the flag off and not issue a new timer. This way we'll get notified immediately, without timer latency. When lots of packets are transmitted, we'll go back to this batch mode again. Cheers, Dor
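Dor's adaptive-batching suggestion can be sketched as a toy simulation: stay in timer (batch) mode while tx windows are busy, and fall back to immediate notification when a window sees few packets. The threshold and per-window packet counts are invented:

```shell
threshold=32      # assumed "busy window" cutoff
mode=timer
imm_windows=0
for pkts in 100 80 5 2 90; do   # packets seen in successive timer windows
    if [ "$pkts" -lt "$threshold" ]; then
        mode=immediate           # re-enable guest notification, drop timer
        imm_windows=$((imm_windows + 1))
    else
        mode=timer               # keep batching under the timer to save exits
    fi
    echo "window=$pkts pkts -> $mode notification"
done
```

Under bulk transfer the device keeps batching (fewer vmexits); as soon as traffic goes quiet, the next packet is flushed immediately instead of waiting out the timer, which is exactly the latency win being proposed.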
Re: [patch 0/3] fix PIT injection
Marcelo Tosatti wrote: The in-kernel PIT emulation can either inject too many or too few interrupts. While it's an improvement, the in-kernel PIT is still not perfect. For example, on PIT frequency changes the pending count should be recalculated and matched to the new frequency. I also stumbled on a live migration problem, and there is your guest SMP fix. IMHO we need to switch back to the userspace PIT. [Actually I did consider the in-kernel PIT myself in the past.] The reasons: 1. There is no performance advantage to doing this in the kernel; it just potentially reduces host stability and duplicates code. 2. There are floating patches to fix PIT/RTC injection in the same way the irq ack is done here, so the first 2 patches are relevant. 3. Will we do the same for the RTC? Why duplicate userspace code in the kernel? We won't have SMP issues since we have qemu_mutex, and it will be simpler too. If you agree, please help merge the qemu patches. Otherwise, argue against the above :) Cheers, Dor
Re: [PATCH] reserved-ram for pci-passthrough without VT-d capable hardware
Andrea Arcangeli wrote: On Wed, Jul 30, 2008 at 11:50:43AM +0530, Amit Shah wrote: * On Tuesday 29 July 2008 18:47:35 Andi Kleen wrote: I'm not so interested in going there right now, because while this code is useful right now, since the majority of systems out there lack VT-d/iommu, I suspect this code could be nuked in the long run when all systems ship with that, which is why I kept it all Actually, at least on Intel platforms, and if you exclude the lowest end, VT-d has been shipping universally for quite some time now. If you buy an Intel box today, or bought one in the last year, the chances are pretty high that it has VT-d support. I think you mean VT-x, which is the virtualization extension for the x86 architecture; VT-d is the virtualization extension for devices (IOMMU). I think Andi understood VT-d right, but even if he were right that every reader of this email buying a new VT-x system today is almost guaranteed to get a VT-d motherboard (which I disagree with, unless you buy some really expensive toy), there are current large installations of VT-x systems that lack VT-d; with recent dual/quad-core cpus they are very fast, will be used for the next couple of years, and will not have just their motherboards upgraded to use pci-passthrough. In addition, KVM is used in embedded too, where things are slower; we know of a specific (production) use case that demands 1:1 mapping and can't use VT-d.
Re: Issues while Debugging Windows Kernel running on KVM
Can you try http://kvm.qumranet.com/kvmwiki/WindowsGuestDebug ? You can use a Windows host as a VM too. Since (in the past) there was a problem with virtual serial polling, you can use -no-kvm and the qemu patch, as described in the wiki. Good luck, Dor. Muppana, Bhaskar wrote: Hi, I am facing issues while trying to debug a Windows XP kernel running on top of Linux KVM. I have to debug a Windows XP kernel running in a VM. I have dedicated ttyS0 on the host to the guest. I am using the following command to bring up the Windows VM. /usr/local/kvm/bin/qemu-system-x86_64 \ -hda /opt/vdisk.img \ -boot c \ -m 512 \ -net nic,model=rtl8139,macaddr=52:54:00:12:34:56 \ -net tap,ifname=qtap0,script=no \ -smp 1 \ -usb \ -usbdevice tablet \ -localtime \ -serial /dev/ttyS0 I have another machine, running Windows XP, connected to the Linux host through a serial cable. [Setup diagram: the Windows VM (debug target) runs on the Linux host with KVM, which is connected over the serial cable to the Windows host running the debugger.] I am able to send messages between the Windows host and the target through the serial ports (tested using Windows PowerShell). But I am not able to use WinDbg (the kernel debugger) on the host to connect to the target; the target gets stuck while booting. Debug-enabled Windows entry in boot.ini: multi(0)disk(0)rdisk(0)partition(1)\WINDOWS=Microsoft Windows XP Professional /fastdetect /debugport=COM1 /baudrate=115200 Can someone please help me with this? Thanks, Bhaskar
Re: Reserving CPU resources for a KVM guest
Yuksel Gunal wrote: Hi, I have been playing with KVM and was wondering about the following question: is there a resource configuration setting that would enforce a fraction of a CPU to be guaranteed for a KVM guest? What I have in mind is something similar to the reservation setting in VMware (it used to be called minimum CPU), which guarantees a number of CPU cycles to a VM. Also, is there any configuration setting similar to VMware's CPU/Memory Shares, which kicks in under contention for resources? A VM is like any other process in Linux; you can use the cpu controller, cgroups, or any other scheduling option for your VMs.
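As an illustration (not from the thread) of the cgroups approach: with the v1 cpu controller, a qemu process can be given a larger scheduling weight, which behaves much like VMware's shares under contention. Paths assume cgroupfs is mounted at /sys/fs/cgroup/cpu and $QEMU_PID holds the guest's qemu pid:

```shell
# Give one VM roughly twice the default CPU weight (default is 1024).
# Requires root; paths and pid variable are assumptions for this sketch.
mkdir /sys/fs/cgroup/cpu/vm1
echo 2048 > /sys/fs/cgroup/cpu/vm1/cpu.shares
echo "$QEMU_PID" > /sys/fs/cgroup/cpu/vm1/tasks
```

Shares only matter when CPUs are contended; an idle host still gives every VM all the cycles it asks for, which matches the "kick in under contention" semantics the question describes.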
Re: paravirtualized windows net driver stop after some days on XP guest
Can you please try an updated version of the Windows drivers? I also added a dummy installer you can use: http://kvm.qumranet.com/kvmwiki/VirtioWindowsDrivers Regards, Dor Yann Dupont wrote: Hello. I'm using kvm with great success for various OSes. Very good job. In June I started using the paravirtualized drivers. Since then we have encountered sporadic loss of connectivity on some XP guests after some days of uptime. This was with KVM 70. I upgraded to KVM 73 four days ago, and this morning one of my XP guests had no connectivity. Putting the interface off then on via the control panel revives the network instantly. Seems like a bug on the Windows driver side. This is occurring on 2 XP guests; they have moderate to low network load. They have 1 CPU, and the HAL is the non-ACPI one (because they were installed in the KVM-23 timeframe). I also have 2003 guests, and so far I haven't encountered the problem. I also have Linux guests, with net AND disk virtio, AND high load, without problems. Best Regards,
Re: paravirtualized windows net driver for vista does not work on windows 2008 (64-bit)
Sorry for that; it seems some instructions were missing. Since we have not signed the drivers yet (we will soon), you need to install a certificate workaround manually. There are 2 things to do on 64-bit before installation: 1. Install the certificate using installcertificate.bat 2. If Test mode does not appear on the screen, run bcdedit /set testsigning on and reboot The system diagnostic related to installation on 2008 is in %windir%\inf\setupapi.dev.log Please compress the file and send it if both things were done but the install still does not work. Regards, Dor Martin Maurer wrote: Hi all, I tried to use the Vista virtio driver on win2008 (64-bit) but the install failed; I got this in the Windows event log: I am working on a Debian Etch 64-bit host, kernel 2.6.24 with KVM 74 (internal testing kernel of http://pve.proxmox.com). I used the following driver: http://people.qumranet.com/dor/Drivers-0-3107.iso ___ Log Name: Security Source: Microsoft-Windows-Security-Auditing Date: 09.09.2008 17:06:20 Event ID: 5038 Task Category: System Integrity Level: Information Keywords: Audit Failure User: N/A Computer: WIN-0Z71CK0XVXP Description: Code integrity determined that the image hash of a file is not valid. The file could be corrupt due to unauthorized modification, or the invalid hash could indicate a potential disk device error.
File Name: \Device\HarddiskVolume1\Windows\System32\drivers\kvmnet6.sys
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-Security-Auditing" Guid="{54849625-5478-4994-a5ba-3e3b0328c30d}" />
    <EventID>5038</EventID>
    <Version>0</Version>
    <Level>0</Level>
    <Task>12290</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8010</Keywords>
    <TimeCreated SystemTime="2008-09-09T15:06:20.562Z" />
    <EventRecordID>364</EventRecordID>
    <Correlation />
    <Execution ProcessID="4" ThreadID="88" />
    <Channel>Security</Channel>
    <Computer>WIN-0Z71CK0XVXP</Computer>
    <Security />
  </System>
  <EventData>
    <Data Name="param1">\Device\HarddiskVolume1\Windows\System32\drivers\kvmnet6.sys</Data>
  </EventData>
</Event>
Best Regards, Martin Maurer [EMAIL PROTECTED] http://pve.proxmox.com
Re: paravirtualized windows net driver for vista does not work on windows 2008 (64-bit)
Maurer] YES, working! Testing again (I already have KVM 75 now, but I assume this does not make any difference here). I followed your instructions, and the driver installed without any warning, as expected after installing the certificate. The only issue: the connection shows only 100mbit - after changing this via the windows device manager, the 1 Gbit is up. Default should be 1 Gbit, is this possible? I assume a lot of people forget about changing this and then they get bad performance due to 100mbit. The 100mb is not an actual bandwidth limitation. Nevertheless it should change. The only worry is that in order to certify (Microsoft sign) the drivers, I was told that a 1Gb device needs to support 802.1q. It might require some simple qemu virtio changes like vlan filtering and tag on/off options. Regards, Dor
Re: kvmnet.sys BSOD w/ WinXP...
Daniel J Blueman wrote: When using Windows XP 32 installed with TCP/IP and microsoft client networking, I can reproduce an intermittent BSOD [1] with kvmnet.sys 1.0.0 and 1.2.0, by aborting a large data transfer in an application. Since this reproduces with 1.0.0 kvmnet.sys, it looks unrelated to the locking changes that went into 1.2.0, but something relating to when sockets are closed, flushed or data discarded. Perhaps the offset into the driver at 0xF761A5A9 - 0xF7618000 may tell us what is needed to reproduce and hint at what area the fix is needed in? Many thanks, Daniel --- [1] DRIVER_IRQL_NOT_LESS_OR_EQUAL *** STOP: 0x00D1 (0x001C,0x0002,0x,0xF761A5A9) *** kvmnet.sys - Address F761A5A9 base at F7618000, DateStamp 47dd531c Can you try this: http://people.qumranet.com/dor/Drivers-0-3107.iso ? Also please provide the specific way of producing the load. Along with it, please note the kernel version, kvm version, and qemu command line. Regards, Dor
Re: kvm 76 - open /dev/kvm: No such device or address
Matias Aguirre wrote: Hi all, I'm using a 2.6.26.5 kernel and the slackware-current distribution. I compiled the latest kvm-76, and when I run kvm it returns this error: open /dev/kvm: No such device or address Could not initialize KVM, will disable KVM support The module is already loaded: # lsmod Module Size Used by kvm_intel 33984 0 kvm 116156 1 kvm_intel nvidia 6886800 26 And my CPU has VM support. # cat /proc/cpuinfo | grep vmx flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts pni monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr lahf_lm flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts pni monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr lahf_lm And the file permissions: # dir /dev/kvm crw-rwxr-- 1 root kvm 250, 0 2008-10-07 18:22 /dev/kvm Any help? Thanks chmod a+wx /dev/kvm will do the trick Regards, Dor
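A gentler alternative to chmod a+wx (which exposes /dev/kvm to every local user) is to rely on the kvm group the device node already carries in the dir output above. A minimal sketch; the usermod line and username are illustrative, and the runnable part only demonstrates the intended permission bits on a scratch file, since changing a real device node needs root:

```shell
# Real fix (as root): add the user to the device's existing group, re-login:
#   usermod -a -G kvm matias        # 'matias' is a placeholder username
# Demo of the intended mode on a scratch file standing in for /dev/kvm:
f=$(mktemp)
chmod 0660 "$f"          # rw for owner and group, nothing for others
stat -c '%a' "$f"        # prints 660
rm -f "$f"
```

With 0660 plus group membership, qemu can open /dev/kvm without the device being world-writable.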
Re: [Qemu-devel] [RFC] Disk integrity in QEMU
Avi Kivity wrote: Chris Wright wrote: I think it's safe to say the perf folks are concerned w/ data integrity first, stable/reproducible results second, and raw performance third. So seeing data cached in the host was simply not what they expected. I think write through is sufficient. However I think that uncached vs. wt will show up on the radar under reproducible results (need to tune based on cache size). And in most overcommit scenarios memory is typically more precious than cpu; it's unclear to me if the extra buffering is anything other than memory overhead. As long as it's configurable then it's comparable, and benchmarking and best practices can dictate the best choice. Getting good performance because we have a huge amount of free memory in the host is not a good benchmark. Under most circumstances, the free memory will be used either for more guests, or will be given to the existing guests, which can utilize it more efficiently than the host. I can see two cases where this is not true: - using older, 32-bit guests which cannot utilize all of the cache. I think Windows XP is limited to 512MB of cache, and usually doesn't utilize even that. So if you have an application running on 32-bit Windows (or on 32-bit Linux with pae disabled), and a huge host, you will see a significant boost from cache=writethrough. This is a case where performance can exceed native, simply because native cannot exploit all the resources of the host. - if cache requirements vary in time across the different guests, and if some smart ballooning is not in place, having free memory on the host means we utilize it for whichever guest has the greatest need, so overall performance improves. Another justification for O_DIRECT is that many production systems will use base images for their VMs. It's mainly true for desktop virtualization, but probably also for some server virtualization deployments.
In this type of scenario, we can have the entire base image chain opened with caching by default for read-only access, while the leaf images are opened with cache=off. Since there is an ongoing effort (both by IT and developers) to keep the base images as big as possible, this guarantees that their data is best suited for caching in the host, while the private leaf images stay uncached. This way we provide good performance and caching for the shared parent images while also guaranteeing correctness. Actually this is what happens on mainline qemu with cache=off. Cheers, Dor
Re: How can I tell KVM is actually using AMD-V virtualization extensions?
Veiko Kukk wrote: Hi! My desktop machine is HP dc5750 SFF, CPU is AMD Athlon(tm) 64 X2 Dual Core Processor 4600+, /proc/cpuinfo lists the svm flag. I'm using a 2.6.27 kernel on FC9, qemu-system-x86_64 info version 0.9.1. How can I be absolutely sure that my kvm virtual machines are using AMD-V? You can run /sbin/lsmod | grep kvm_amd and check for a ref count > 0. You can also use dmesg to check kvm messages. Alternatively, check the kvm_stat tool or run /usr/sbin/lsof -p `pgrep qemu` | grep /dev/kvm Dor
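The ref-count check above can be scripted. A sketch only: the lsmod line here is hardcoded sample output, since running the real command needs a host with the module loaded; on such a host you would pipe `lsmod | grep kvm_amd` into the same awk:

```shell
# The third column of lsmod is the module use count; a value > 0 means
# running guests currently hold kvm_amd. Sample line stands in for real output.
echo "kvm_amd 55563 3" | awk '{ if ($3 > 0) print "in use"; else print "loaded but idle" }'
# prints: in use
```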
Re: kvm XP P2V required ACPI-Standard PC HAL change, keep or revert to ACPI?
Jeff Kowalczyk wrote: I'm running a physical-to-virtual Windows XP Dell OEM instance on Ubuntu 8.04.1 kvm-62 with kvm-intel and bridged networking. After early BSOD difficulty with the output of VMWare Converter 3.0.3, I did manage to get the XP P2V instance ready to run under kvm after changing from the Windows XP ACPI HAL to Standard PC in device manager under VMWare Player. After a complete redetection of system hardware and resources (perhaps this was the true reason it started to work), the instance must now be activated again. It works very well, but must be shut down manually at the 'You may now turn off the PC' prompt. This is a headless kvm server for a few straggler Windows apps, and the kvm instance will seldom be rebooted. Should I activate as Standard PC, or attempt to convert the HAL back to ACPI? Basically it should work. Maybe newer kvm will encounter fewer problems. Is there still any performance penalty for ACPI with kvm-62? Since we have the tpr optimization it should be fine. Nevertheless, we did measure about a 10%-20% performance penalty on windows acpi. What is the kvm shutdown behavior with an ACPI HAL? It should be fine and turn off the process completely. btw: you can install the APM module on the standard HAL too, and it powers down the VM so the process exits completely as well. Thanks, Jeff
Re: MTU on a virtio-net device?
Michael Tokarev wrote: Right now (2.6.27), there's no way to change the MTU of a virtio-net interface, since the mtu-changing method is not provided. Is there a simple way to add such a beast? It should be a nice easy patch for mtu < 4k. You can just implement a 'change_mtu' handler like:

static int virtio_change_mtu(struct net_device *netdev, int new_mtu)
{
	if (new_mtu < ETH_ZLEN || new_mtu > PAGE_SIZE)
		return -EINVAL;
	netdev->mtu = new_mtu;
	return 0;
}

I'm asking because I'm not familiar with the internals, and because, I think, increasing the MTU (so that the resulting skb still fits in a single page) will increase performance significantly, at least on an internal/virtual network -- currently there are just way too many context switches and the like while copying data from one guest to another or between guest and host. Thanks! /mjt
Re: MTU on a virtio-net device?
Michael Tokarev wrote: Dor Laor wrote: Michael Tokarev wrote: Dor Laor wrote: Michael Tokarev wrote: Right now (2.6.27), there's no way to change the MTU of a virtio-net interface, since the mtu-changing method is not provided. Is there a simple way to add such a beast? It should be a nice easy patch for mtu < 4k. You can just implement a 'change_mtu' handler like: [] Well, this isn't enough I think. That is, new_mtu's upper cap should be less than PAGE_SIZE due to various additional data structures. But it is enough to start playing. The virtio header is in a separate ring entry so no prob. The virtio header is one thing. The Ethernet frame is another. And so on. From the last experiment (sending 2000-byte-payload pings resulting in 2008 bytes total, and 528 bytes missing with the original mtu=1500), it seems like the necessary upper cap is PAGE_SIZE-28. Or something similar. Also see the receive_skb() routine:

receive_skb(struct net_device *dev, struct sk_buff *skb, unsigned len)
{
	if (unlikely(len < sizeof(struct virtio_net_hdr) + ETH_HLEN)) {
		/* drop */
	}
	len -= sizeof(struct virtio_net_hdr);
	if (len >= MAX_PACKET_LEN) {
	...

So it seems that virtio_net_hdr is in here, just like the ethernet header. [] So something else has to be changed for this to work, it seems. You're right, this needs to be changed too:

/* FIXME: MTU in config. */
#define MAX_PACKET_LEN (ETH_HLEN + ETH_DATA_LEN)

You can change it to PAGE_SIZE or use the current mtu. so s/MAX_PACKET_LEN/dev->mtu/g for the whole driver, it seems. Plus/minus sizeof(virtio_net_hdr) - checking this now. This constant is used in 3 places: receive_skb(): if (len >= MAX_PACKET_LEN) { (this one seems to be wrong, but again I don't know much about the internals of all this stuff) here, dev->mtu is what we want. try_fill_recv(): skb = netdev_alloc_skb(vi->dev, MAX_PACKET_LEN); here, we don't have dev, but have vi->dev, should be ok too. try_fill_recv(): skb_put(skb, MAX_PACKET_LEN); ditto I was too lazy to write a complete patch.
And by the way, what is big_packets here? It's a bit harder here; IIRC qemu also has a 4k limit. Not sure it can be done in a short period. Anyway you can use GSO and achieve similar performance. Ok, so I changed MAX_PACKET_LEN to be PAGE_SIZE (the current MTU seems to be more appropriate, but PAGE_SIZE is enough for testing anyway). It seems to be working, and network speed increased significantly with MTU=3500 compared with the former 1500 - it seems it's about 2 times faster (which is quite expected, since there are 2x fewer context switches, transmissions and the like). I'm asking because I'm not familiar with the internals, Still... ;) Thanks! /mjt You seem to be a fast learner :)
Re: can we hope a stable version in the near future?
Farkas Levente wrote: Avi Kivity wrote: Farkas Levente wrote: There is the maint/ series on git.kernel.org. It doesn't have formal releases though. do you plan any formal release? and it'd be nice to see the relationship between the current devel tree and the stable tree, eg. last stable 0.5, current devel 0.78. The key to a formal release is a formal test suite. We've been building one (for a long while) but it isn't in production yet. The plan is for it to be open so people can add their favorite guests, to ensure they will not regress. the question is not when, but what happened with those bugs which cause tests to fail? the problem currently is not that we don't know the problems, but that there are many known bugs where the reason and the solution are simply not known. so a test suite can't help too much here (maybe find more bugs). The test suite will help, since its job is to run regression tests each night or even on each commit. Once a new regression is introduced it will be caught and reverted immediately. Today, when we have only a very poor, old regression suite, that does not happen, so regressions are detected by users weeks after being committed. We'll publish the test suite (based on autotest) next week. The more users use it, the better. Anyway our maintainer will run it each night. on the other hand the real question is: do you plan to somehow stabilize any of the following releases in the near future? in the last 1.5 years we have waited for this. or do you currently not recommend and not plan to use kvm in production? it's also an option but it would be useful to know. in this case we (and probably many others) will switch to xen, virtualbox, vmware or anything else as a virtualization platform. kvm is used in production in several products. Just not the kvm-nn releases I make. The production versions of kvm are backed by testing, which makes all the difference. Slapping a 'stable' label on a release doesn't make it so.
there are many open source projects which have stable and devel versions:-) actually almost all projects have a stable release along with the devel version. but kvm has not had any in the last few years; that's why i think it's high time to stabilize 'a' version, ie. freeze the feature list and fix all known bugs. You're right about the need for a stable release; that's the idea of the 'maint' branches. maint/2.6.26 for both kernel and userspace is stable (using the userspace irqchip). Now we'll stabilize another user/kernel pair based on 2.6.28 Thanks, Dor
Re: can we hope a stable version in the near future?
Farkas Levente wrote: Dor Laor wrote: on the other hand the real question is: do you plan to somehow stabilize any of the following releases in the near future? in the last 1.5 years we have waited for this. or do you currently not recommend and not plan to use kvm in production? it's also an option but it would be useful to know. in this case we (and probably many others) will switch to xen, virtualbox, vmware or anything else as a virtualization platform. kvm is used in production in several products. Just not the kvm-nn releases I make. The production versions of kvm are backed by testing, which makes all the difference. Slapping a 'stable' label on a release doesn't make it so. there are many open source projects which have stable and devel versions:-) actually almost all projects have a stable release along with the devel version. but kvm has not had any in the last few years; that's why i think it's high time to stabilize 'a' version, ie. freeze the feature list and fix all known bugs. You're right about the need for a stable release; that's the idea of the 'maint' branches. maint/2.6.26 for both kernel and userspace is stable (using the userspace irqchip). Now we'll stabilize another user/kernel pair based on 2.6.28 that's good news:-) but does this mean there will be a new kvm-x.y.z release, and can i build the userspace from it _and_ build a kmod for eg. the latest rhel-5 kernel-2.6.18-92.1.18.el5? ie. will i be able to install kvm and kvm-kmod on rhel-5 and it'll work? or will it just run on the not-yet-released 2.6.28 kernel? and what is the relationship between the maint release, kvm-nn and the next stable release? is there a tarball for the current maint release? and the same question here: can i build a kmod and userspace from that for rhel-5? As always, you'll have the option of using kvm as a kernel module. So even if the stable branch is based on 2.6.28, you can always take the kvm bits through 'make -C kernel sync LINUX=PATH' in the userspace.
Re: 1-1 mapping of devices without VT-d
Passera, Pablo R wrote: Hi everyone, I want to assign a PCI device directly to a VM (PCI passthrough) in a machine that does not have VT-d. I found something related to this in a presentation done at the 2008 KVM Forum, called 1-1 mapping, and a patch for this at http://thread.gmane.org/gmane.comp.emulators.kvm.devel/18722/focus=18753. I am wondering if this is included, or are there plans to include it in the latest KVM version? Although it has worked for us out of tree, there is no immediate need to pursue it. If anyone would like to nurture these patches they are more than welcome. ps: you also have the pv-dma option for Linux guests (same status though). As time goes by most hosts will have either vt-d or amd iommu. Regards, Dor Thanks in advance, Pablo Pássera
Re: STOP error with virtio on KVM-79/2.6.18/Win2k3 x64 guest
Adrian Schmitz wrote: Sorry for the repost.. I forgot the subject line! Hi, I'm having problems with STOP errors (0x00d1) under KVM-79/2.6.18 whenever I try to use the virtio drivers. This post (http://marc.info/?l=kvm&m=121089259211638&w=2) describes the issue exactly, except that I'm using a Win2k3 x64 guest with the x64 paravirtual drivers instead of a 32-bit guest/drivers. I am able to reproduce the problem reliably using iperf, the same as in the above post. When I disable virtio, the guest is very stable. Any suggestions are greatly appreciated. What driver version are you using? Version 2 is obsolete. I posted ver 3 a few months ago; Avi, can you please upload it to sourceforge? My old public space was blocked, so I'll send you a private attachment to test. Dor. -Adrian
Re: 1-1 mapping of devices without VT-d
Michael Tokarev wrote: Dor Laor wrote: [] Although it had worked for us out of tree, there is no immediate need to pursue it. If anyone would like to nurture these patches he is more than welcome. ps: you also have the pv-dma option for Linux guests (same status though). As time goes by most hosts will have either vt-d or amd iommu. Hmm. Well, as time goes by, most hosts will be 64 bit or more. But it does not mean that there's no need to maintain the 32bit arch anymore... i hope anyway :) But of course. Are you saying that PCI passthrough without hardware support will not be available in (standard) kvm, even if patches exist for that? No, it just might take some time to get to mainline. The patches need further polishing and we also need wider demand for them. Actually pvdma can help vt-d so we won't have to make all the guest memory unswappable. /mjt
Re: Virtio network performance problem
Adrian Schmitz wrote: On Wed, Dec 03, 2008 at 11:20:08AM -0800, Chris Wedgwood wrote: TSC instability? Is this an SMP guest? Ok, I tried pinning the kvm process to two cores (0,2) on a single socket, but that didn't seem to make any difference for my virtio network performance. I also tried pinning the process to a single core, which also didn't seem to have any effect. I think it is an unsynced tsc problem. First, make sure you pin all of the process threads: there is a thread per vcpu, plus an I/O thread and more that are not relevant here. You can do it by adding taskset before the cmdline. Second, you said that you use an SMP guest, so windows also sees the unsynced tsc. So either test with a UP guest or learn how to pin the windows receiving ISR, DPC and the user app. Well, testing on Intel or newer AMD is another option. I tested it again now on Intel with a UP guest and there is no such problem. Hope to test it next week on an AMD SMP guest. Regards, Dor Someone on IRC suggested that it sounded like a clocking issue, since some of my ping times are negative. He suggested trying a different clock source. I tried it with dynticks, rtc, and unix. None of them seem better, although all of them seem different in terms of patterns in the ping times. Sorry if this makes it a long post, but I don't know how to describe it other than to paste an example (below). Not sure if this indicates that it is clock-related or if it is meaningless. In any event, I'm not sure where to go from here. Another suggestion from IRC was that it was due to the age of my host kernel (2.6.18) and the fact that it doesn't support high-res timers. If I can avoid replacing the distro kernel, I'd like to, but I'll do what I have to, I suppose. With dynticks (these are all with -net user, as I had some trouble with my tap interface last night while testing this.
The results are roughly the same as when I was using tap before, though):
Reply from 10.0.2.2: bytes=32 time=1ms TTL=255
Reply from 10.0.2.2: bytes=32 time=1ms TTL=255
Reply from 10.0.2.2: bytes=32 time=1ms TTL=255
Reply from 10.0.2.2: bytes=32 time=143ms TTL=255
Reply from 10.0.2.2: bytes=32 time=143ms TTL=255
Reply from 10.0.2.2: bytes=32 time=1ms TTL=255
Reply from 10.0.2.2: bytes=32 time=143ms TTL=255
Reply from 10.0.2.2: bytes=32 time=1ms TTL=255
Reply from 10.0.2.2: bytes=32 time=1ms TTL=255
Reply from 10.0.2.2: bytes=32 time=1ms TTL=255
Reply from 10.0.2.2: bytes=32 time=-139ms TTL=255
Reply from 10.0.2.2: bytes=32 time=-141ms TTL=255
Reply from 10.0.2.2: bytes=32 time=-133ms TTL=255
Reply from 10.0.2.2: bytes=32 time=1ms TTL=255
Reply from 10.0.2.2: bytes=32 time=143ms TTL=255
Reply from 10.0.2.2: bytes=32 time=1ms TTL=255
With rtc:
Reply from 10.0.2.2: bytes=32 time=-224ms TTL=255
Reply from 10.0.2.2: bytes=32 time=-223ms TTL=255
Reply from 10.0.2.2: bytes=32 time=4ms TTL=255
Reply from 10.0.2.2: bytes=32 time<1ms TTL=255
Reply from 10.0.2.2: bytes=32 time<1ms TTL=255
Reply from 10.0.2.2: bytes=32 time<1ms TTL=255
Reply from 10.0.2.2: bytes=32 time=225ms TTL=255
Reply from 10.0.2.2: bytes=32 time=-223ms TTL=255
Reply from 10.0.2.2: bytes=32 time=-224ms TTL=255
Reply from 10.0.2.2: bytes=32 time<1ms TTL=255
Reply from 10.0.2.2: bytes=32 time<1ms TTL=255
Reply from 10.0.2.2: bytes=32 time=225ms TTL=255
Reply from 10.0.2.2: bytes=32 time<1ms TTL=255
Reply from 10.0.2.2: bytes=32 time<1ms TTL=255
Reply from 10.0.2.2: bytes=32 time<1ms TTL=255
Reply from 10.0.2.2: bytes=32 time=225ms TTL=255
Reply from 10.0.2.2: bytes=32 time=225ms TTL=255
With unix:
Reply from 10.0.2.2: bytes=32 time=-191ms TTL=255
Reply from 10.0.2.2: bytes=32 time<1ms TTL=255
Reply from 10.0.2.2: bytes=32 time<1ms TTL=255
Reply from 10.0.2.2: bytes=32 time=-191ms TTL=255
Reply from 10.0.2.2: bytes=32 time<1ms TTL=255
Reply from 10.0.2.2: bytes=32 time<1ms TTL=255
Reply from 10.0.2.2: bytes=32 time<1ms TTL=255
Reply from 10.0.2.2: bytes=32 time<1ms TTL=255
Reply from 10.0.2.2: bytes=32 time=-190ms TTL=255
Reply from 10.0.2.2: bytes=32 time=-191ms TTL=255
Reply from 10.0.2.2: bytes=32 time=1ms TTL=255
Reply from 10.0.2.2: bytes=32 time=192ms TTL=255
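The thread-pinning advice above (one thread per vcpu plus an I/O thread) can be sketched in shell. pin_all and the CPU list 0,2 are illustrative, not a kvm interface; the demo at the end degrades gracefully where taskset is unavailable:

```shell
# Launch-time pinning is simplest:  taskset -c 0,2 qemu-system-x86_64 ...
# Pinning an already-running qemu, thread by thread (sketch, needs the pid):
pin_all() {                        # $1 = qemu pid, $2 = cpu list, e.g. 0,2
    for t in /proc/"$1"/task/*; do
        taskset -cp "$2" "${t##*/}"
    done
}
# Demo line (falls back to a plain echo if taskset is absent here):
if command -v taskset >/dev/null 2>&1; then
    taskset -c 0 sh -c 'echo pinned'
else
    echo pinned
fi
# prints: pinned
```

Pinning every thread to the same socket keeps them on cores sharing a TSC, which is the point of the advice.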
Re: Using signals to communicate two Qemu processes
Passera, Pablo R wrote: Hi all, I am trying to communicate between two VMs using a virtio driver. Once data is moved to the driver I want to notify the other Qemu process that there is new data available in the buffer. I was thinking about using linux signals to synchronize both processes, but when I register my SIGUSR1 handler in Qemu I see a strange behavior. After starting the VM and Linux gets loaded, Qemu receives SIGUSR2 at a regular time period. Looking a little bit at the code I realized that signals are being used for other purposes in Qemu; however, SIGUSR1 is not used. Is it possible to use signals to synchronize these processes, or should I think about using a different mechanism? SIGUSR2 is used as the aio completion signal. You can use SIGUSR1 but you need to know what you're doing (some threads block signals). A better fit would be a pipe. The vcpu Thanks, Pablo Pássera Intel - Software Innovation Pathfinding Group Cordoba - Argentina Phone: +54 351 526 5611
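The pipe suggestion can be sketched with two shell processes standing in for the two qemu sides; the fifo path and the "data-ready" token are illustrative only, not anything qemu defines:

```shell
# Producer writes a token when data is ready; the consumer blocks on the
# read end of the fifo until that happens - no signal handler needed, and
# nothing is lost to another subsystem's signal usage.
fifo=$(mktemp -u)
mkfifo "$fifo"
( echo "data-ready" > "$fifo" ) &   # producer side: notify
read -r msg < "$fifo"               # consumer side: block until notified
echo "$msg"                         # prints: data-ready
rm -f "$fifo"
wait
```

In qemu itself the same idea means watching the pipe's read end from the main loop instead of installing a SIGUSR1 handler.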
Re: [PATCH] AF_VMCHANNEL address family for guest-host communication.
Evgeniy Polyakov wrote: On Tue, Dec 16, 2008 at 08:57:27AM +0200, Gleb Natapov (g...@redhat.com) wrote: Another approach is to implement that virtio backend with a netlink-based userspace interface (like using connector or genetlink). This does not differ too much from what you have with a special socket family, but at least it does not duplicate existing functionality of userspace-kernelspace communications. I implemented vmchannel using connector initially (the downside is that messages can be dropped). Is this more acceptable for upstream? The implementation was 300 lines of code. Hard to tell, it depends on the implementation. But if things are good, I have no objections as connector maintainer :) Messages in connector in particular and netlink in general are only dropped when the receiving buffer is full (or when there is no memory); you can tune the buffer size to match the virtual queue size or vice versa. Gleb was aware of that and it's not a problem, since all of the anticipated usages may drop messages (guest statistics, cut-and-paste, mouse movements, single sign-on commands, etc). A service that needs reliability could use basic acks.
Re: gettimeofday slow in RHEL4 guests
Avi Kivity wrote: Marcelo Tosatti wrote: The tsc clock on older Linux 2.6 kernels compensates for lost ticks. The algorithm uses the PIT count (latched) to measure the delay between interrupt generation and handling, and sums that value, on the next interrupt, to the TSC delta. Sheng investigated this problem in the discussions before in-kernel PIT was merged: http://www.mail-archive.com/kvm-de...@lists.sourceforge.net/msg13873.html The algorithm overcompensates for lost ticks and the guest time runs faster than the host's. There are two issues: 1) A bug in the in-kernel PIT which miscalculates the count value. 2) For the case where more than one interrupt is lost, and later reinjected, the value read from the PIT count is meaningless for the purpose of the tsc algorithm. The count is interpreted as the delay until the next interrupt, which is not the case with reinjection. As Sheng mentioned in the thread above, Xen pulls back the TSC value when reinjecting interrupts. VMWare ESX has a notion of virtual TSC, which I believe is similar in this context. For KVM I believe the best immediate solution (for now) is to provide an option to disable reinjection, behaving similarly to real hardware. The advantage is simplicity compared to virtualizing the time sources. The QEMU PIT emulation has a limit on the rate of interrupt reinjection; perhaps something similar should be investigated in the future. The following patch (which contains the bugfix for 1, and disables reinjection) fixes the severe time drift on RHEL4 with clock=tsc. What I'm proposing is to condition reinjection on an option (-kvm-pit-no-reinject or something). Comments or better ideas?
diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index e665d1c..608af7b 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -201,13 +201,16 @@ static int __pit_timer_fn(struct kvm_kpit_state *ps)
 	if (!atomic_inc_and_test(&pt->pending))
 		set_bit(KVM_REQ_PENDING_TIMER, &vcpu0->requests);
+	if (atomic_read(&pt->pending) > 1)
+		atomic_set(&pt->pending, 1);
+
Replace the atomic_inc() with atomic_set(, 1) instead? One less test, and more importantly, the logic is scattered less around the source. But having only a pending bit instead of a counter will cause kvm to drop pit irqs in rare high-load situations. The disable-reinjection option is better.
 	if (vcpu0 && waitqueue_active(&vcpu0->wq))
 		wake_up_interruptible(&vcpu0->wq);
 	hrtimer_add_expires_ns(&pt->timer, pt->period);
 	pt->scheduled = hrtimer_get_expires_ns(&pt->timer);
 	if (pt->period)
-		ps->channels[0].count_load_time = hrtimer_get_expires(&pt->timer);
+		ps->channels[0].count_load_time = ktime_get();
 	return (pt->period == 0 ? 0 : 1);
 }
I don't like the idea of punting to the user, but it looks like we don't have a choice. Hopefully vendors will port kvmclock to these kernels and release them as updates -- time simply doesn't work well with virtualization, especially Linux guests. Except for these 'tsc compensate' guests, what are the occasions where the guest writes its tsc? If this is the only case, we can disable reinjection once we trap tsc writes.
Re: [PATCH 3/3] KVM: Reset PIT irq injection logic when the PIT IRQ is unmasked
Avi Kivity wrote: Marcelo Tosatti wrote: I'm worried about: - boot guest using local apic timer - reset - boot with pit timer - a zillion interrupts So at the very least, we need a limiter. Or have a new notifier on kvm_pic_reset, instead of simply acking one pending irq? That seems the appropriate place to zero the counter. Clearing the counter on reset is good, but it doesn't solve the underlying problem, which is that there are two separate cases that appear to the host as the same thing: - guest masks irqs, does a lot of work, unmasks irqs - host deschedules guest, does a lot of work, reschedules guest Right now we assume any missed interrupts are due to host load. In the reboot case, that's clearly wrong, but that is only an example. Maybe we can use preempt notifiers to detect whether the timer tick happened while the guest was scheduled or not. It might get too complex. It can be done inside the vcpu_run function too: an irq needs reinjection if the irq window was not open from the timer tick till the next timer tick minus the deschedule time. You also need to know the right vcpu that the pit irq is routed to. Since scenarios like guests masking their pit and doing a lot of work are rare and bad guest behaviour anyway, I don't think we should special case them. So the pit reset hook is enough.
Re: KVM, Entropy and Windows
On 02/17/2011 12:09 PM, Vadim Rozenfeld wrote: On Thu, 2011-02-17 at 11:11 +0200, Avi Kivity wrote: On 02/16/2011 09:54 PM, --[ UxBoD ]-- wrote: Hello all, I believe I am hitting a problem on one of our Windows 2003 KVM guests where I believe it is running out of entropy and causing SSL issues. I see that there is a module called virtio-rng which I believe passes the HW entropy source through to the guest, but does this work on Windows as well? AFAIK there is no Windows driver for virtio-rng. Seems like a good idea. Vadim? A virtio-rng driver for windows is not a big deal. IMO, the real problem will be to force Windows to use it for CryptoAPI. What's the implication of it? Good or bad? Do you know what hyper-v is doing for it? If it doesn't, any ideas on how I can increase the amount of entropy being generated on a headless system? Or even monitor entropy on a Windows system? No idea. Maybe you could ask Windows to collect entropy from packet timings.
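There is no answer here for monitoring entropy inside Windows, but on the Linux host side the kernel pool level can at least be watched via procfs. A small sketch, assuming a standard Linux /proc layout (the helper name is made up):

```python
def host_entropy_avail(path="/proc/sys/kernel/random/entropy_avail"):
    """Bits of entropy currently available in the host kernel pool.

    Returns None where the proc file does not exist (non-Linux hosts).
    """
    try:
        with open(path) as f:
            return int(f.read().strip())
    except FileNotFoundError:
        return None
```

Polling this while the guest workload runs shows whether the host itself is starved, which would make any passthrough scheme like virtio-rng moot.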
Re: KSM For All Via LD_PRELOAD?
On 06/08/2010 09:43 PM, Gordan Bobic wrote: Is this plausible? I'm trying to work out if it's even worth considering this approach to enable all memory used in a system to be open to KSM page merging, rather than only memory used by specific programs aware of it (e.g. kvm/qemu). Something like this would address the fact that container based virtualization (OpenVZ, VServer, LXC) cannot benefit from KSM. What I'm thinking about is somehow intercepting malloc() and wrapping it so that all malloc()-ed memory gets madvise()-d as well. Has this been done? Or is this too crazy an idea? It should work. Note that the malloced memory should be aligned in order to get better sharing. Gordan
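What the interposed allocation path would boil down to, sketched with Python's mmap module standing in for the C malloc wrapper (the function name is made up; mmap.madvise needs Python 3.8+, and MADV_MERGEABLE is Linux-only, hence the hasattr guard):

```python
import mmap


def alloc_mergeable(nbytes):
    """Return an anonymous, page-aligned mapping marked for KSM.

    An LD_PRELOAD malloc wrapper would do the same for each large
    allocation: mmap gives page alignment for free, and
    madvise(MADV_MERGEABLE) lets ksmd scan and merge the pages.
    """
    size = -(-nbytes // mmap.PAGESIZE) * mmap.PAGESIZE  # round up to pages
    buf = mmap.mmap(-1, size)  # anonymous private mapping
    if hasattr(mmap, "MADV_MERGEABLE"):  # Linux (2.6.32+) only
        buf.madvise(mmap.MADV_MERGEABLE)
    return buf
```

The page rounding is exactly the alignment point made above: KSM merges whole pages, so unaligned sub-page allocations cannot share.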
Re: KSM For All Via LD_PRELOAD?
On 06/09/2010 01:31 PM, Gordan Bobic wrote: On 06/09/2010 09:56 AM, Paolo Bonzini wrote: Or is this too crazy an idea? It should work. Note that the malloced memory should be aligned in order to get better sharing. Within glibc malloc large blocks are mmaped, so they are automatically aligned. Effective sharing of small blocks would take too much luck or too much wasted memory, so probably madvising brk memory is not too useful. Of course there are exceptions. Bitmaps are very much sharable, but not big. And some programs have their own allocator, using mmap in all likelihood and slicing the resulting block. Typically these will be virtual machines for garbage collected languages (but also GCC for example does this). They will store a lot of pointers in there too, so in this case KSM would likely work a lot for little benefit. So if you really want to apply it to _all_ processes, it comes to mind to wrap both mmap and malloc so that you can set a flag only for mmap-within-malloc... It will take some experimentation and heuristics to actually not degrade performance (and of course it will depend on the workload), but it should work. Arguably, the way QEMU KVM does it for the VM's entire memory block doesn't seem to distinguish the types of memory allocation inside the VM, so simply covering all mmap()/brk() calls would probably do no worse in terms of performance. Or am I missing something? There won't be a drastic effect for qemu-kvm since the non guest ram areas are minimal. I thought you were trying to trap mmap/brk/malloc for other general applications regardless of virt. Gordan
Re: [PATCH] KVM test: Disable HPET on windows timedrift tests
On 07/01/2010 07:05 PM, Lucas Meneghel Rodrigues wrote: On Thu, 2010-07-01 at 17:42 +0300, Avi Kivity wrote: On 06/30/2010 06:39 PM, Lucas Meneghel Rodrigues wrote: By default, HPET is enabled on qemu and no time drift mitigation is being made for it. So, add -no-hpet if qemu supports it, during windows timedrift tests. Hm, you're compensating for a qemu bug by not testing it. Can we have an XFAIL for this test instead? Certainly we can. In actuality, that's what's being done on our internal autotest server - this particular test is linked to the upstream bug https://bugs.launchpad.net/qemu/+bug/599958 We discussed this issue this morning; it boils down to the way people are more comfortable with handling it. My first thought was to disable HPET until someone comes up with a time drift mitigation strategy for it. But your approach makes more sense; unless someone has something else to say about it, I'll drop the patch from autotest shortly. Actually we should do both - XFAIL when hpet is used and in addition (and even more importantly) test other clock sources by disabling hpet. Lucas
Re: KVM Processor cache size
On 08/03/2010 02:36 AM, Anthony Liguori wrote: On 08/02/2010 05:42 PM, Andre Przywara wrote: Anthony Liguori wrote: On 08/02/2010 08:49 AM, Ulrich Drepper wrote: glibc uses the cache size information returned by cpuid to perform optimizations. For instance, copy operations which would pollute too much of the cache because they are large will use non-temporal instructions. There are real performance benefits. I imagine that there would be real performance problems from doing live migration with -cpu host too if we don't guarantee these values remain stable across migration... Again, -cpu host is not meant to be migrated. Then it needs to prevent migration from happening. Otherwise, it's a bug waiting to happen. There are other virtualization use cases than cloud-like server virtualization. Sometimes users don't care about migration (or even the live version), but want full CPU exposure for performance reasons (think of virtualizing Windows on a Linux desktop). I agree that -cpu host and migration should be addressed, but only to a certain degree. And missing migration support should not be a road blocker for -cpu host. When we can reasonably prevent it, we should prevent users from shooting themselves in the foot. Honestly, I think -cpu host is exactly what you would want to use in a cloud. A lot of private clouds and even public clouds are largely based on homogeneous hardware. There are two good solutions for that: a. keep adding newer -cpu definitions like Penryn, Nehalem, Opteron_gx, so newer models will be abstracted similarly to their physical counterparts b. use a strict flag with -cpu host and pass the info with the live migration protocol. Our live migration protocol can do a better job of validating the cmdline and the current set of devices/hw on the src/dst, and fail migration if there is a diff. Today we rely on libvirt for that; another mechanism will surely help, especially for -cpu host.
The nice part is that there won't be a need to wait for the non-live migration part, and more cpu cycles will be saved. I actually think the case where you want to migrate between heterogeneous hardware is grossly overstated. Regards, Anthony Liguori Regards, Andre.
Re: bad O_DIRECT read and write performance with small block sizes with virtio
On 08/02/2010 11:50 PM, Stefan Hajnoczi wrote: On Mon, Aug 2, 2010 at 6:46 PM, Anthony Liguorianth...@codemonkey.ws wrote: On 08/02/2010 12:15 PM, John Leach wrote: Hi, I've come across a problem with read and write disk IO performance when using O_DIRECT from within a kvm guest. With O_DIRECT, reads and writes are much slower with smaller block sizes. Depending on the block size used, I've seen 10 times slower. For example, with an 8k block size, reading directly from /dev/vdb without O_DIRECT I see 750 MB/s, but with O_DIRECT I see 79 MB/s. As a comparison, reading in O_DIRECT mode in 8k blocks directly from the backend device on the host gives 2.3 GB/s. Reading in O_DIRECT mode from a xen guest on the same hardware manages 263 MB/s. Stefan has a few fixes for this behavior that help a lot. One of them (avoiding memset) is already upstream but not in 0.12.x. The other two are not done yet but should be on the ML in the next couple weeks. They involve using ioeventfd for notification and unlocking the block queue lock while doing a kick notification. Thanks for mentioning those patches. The ioeventfd patch will be sent this week, I'm checking that migration works correctly and then need to check that vhost-net still works. Writing is affected in the same way, and exhibits the same behaviour with O_SYNC too. Watching with vmstat on the host, I see the same number of blocks being read, but about 14 times the number of context switches in O_DIRECT mode (4500 cs vs. 63000 cs) and a little more cpu usage. 
The device I'm writing to is a device-mapper zero device that generates zeros on read and throws away writes; you can set it up at /dev/mapper/zero like this: echo 0 21474836480 zero | dmsetup create zero My libvirt config for the disk is:

<disk type='block' device='disk'>
  <driver cache='none'/>
  <source dev='/dev/mapper/zero'/>
  <target dev='vdb' bus='virtio'/>
  <address type='pci' domain='0x' bus='0x00' slot='0x06' function='0x0'/>
</disk>

which translates to the kvm arg: -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -drive file=/dev/mapper/zero,if=none,id=drive-virtio-disk1,cache=none,aio=native and changing the io scheduler on the host to deadline should help as well. I'm testing with dd: dd if=/dev/vdb of=/dev/null bs=8k iflag=direct As a side note, as you increase the block size, read performance in O_DIRECT mode starts to overtake non O_DIRECT mode reads (from about a 150k block size). By a 550k block size I'm seeing 1 GB/s reads with O_DIRECT and 770 MB/s without. Can you take QEMU out of the picture and run the same test on the host: dd if=/dev/vdb of=/dev/null bs=8k iflag=direct vs dd if=/dev/vdb of=/dev/null bs=8k This isn't quite the same because QEMU will use a helper thread doing preadv. I'm not sure what syscall dd will use. It should be close enough to determine whether QEMU and device emulation are involved at all though, or whether these differences are due to the host kernel code path down to the device mapper zero device being different for normal vs O_DIRECT. Stefan
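The non-obvious part of reproducing this outside dd is the alignment rule: O_DIRECT requires the buffer, length and file offset to be aligned to the device block size. A minimal sketch of such a read (Linux-only, hypothetical helper; a page-aligned anonymous mmap buffer satisfies the constraint):

```python
import mmap
import os


def read_direct(path, nbytes):
    """Read up to nbytes with O_DIRECT, rounding the buffer up to a
    whole number of pages to satisfy the alignment requirement."""
    size = -(-nbytes // mmap.PAGESIZE) * mmap.PAGESIZE
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        buf = mmap.mmap(-1, size)  # page-aligned anonymous buffer
        n = os.readv(fd, [buf])    # bypasses the page cache
        return bytes(buf[:n])
    finally:
        os.close(fd)
```

Note that some filesystems (tmpfs, for one) reject O_DIRECT outright with EINVAL, which is worth ruling out before comparing numbers.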
Re: RHEL 4.5 guest virtual network performace
On 08/16/2010 10:00 PM, Alex Rixhardson wrote: Hi guys, I have the following configuration: 1. host is RHEL 5.5, 64bit with KVM (version that comes out of the box with RHEL 5.5) 2. two guests: 2a: RHEL 5.5, 32bit, 2b: RHEL 4.5, 64bit If I run iperf between host RHEL 5.5 and guest RHEL 5.5 inside the virtual network subnet I get great results (> 4Gbit/sec). But if I run iperf between guest RHEL 4.5 and either of the two RHELs 5.5 I get bad network performance (around 140Mbit/sec). Please try netperf; iperf is known to be buggy and might consume cpu w/o real justification. The configuration was made thru the virtual-manager utility, nothing special. I just added a virtual network device to both guests. Could you guys give me some tips on what should I check? Regards, Alex
Re: RHEL 4.5 guest virtual network performace
On 08/17/2010 12:22 AM, Alex Rixhardson wrote: Thanks for the suggestion. I tried with netperf. I ran netserver on the host and netperf on the RHEL 5.5 and RHEL 4.5 guests. These are the results of 60 second long tests: RHEL 4.5 guest: Throughput (10^6bits/sec) = 145.80 At least it bought you another 5Mb/s over iperf ... It might be time related; 5.5 has kvmclock but rhel4 does not. If it's a 64 bit guest add this to the 4.5 guest cmdline: 'notsc divider=10'. If it's 32 bit use 'clock=pmtmr divider=10'. The divider is probably new and is in rhel4.8 only; it's ok w/o it too. What's the host load for the 4.5 guest? RHEL 5.5 guest: Throughput (10^6bits/sec) = 3760.24 The results are really bad on the RHEL 4.5 guest. What could be wrong? Regards, Alex On Mon, Aug 16, 2010 at 9:49 PM, Dor Laor <dl...@redhat.com> wrote: On 08/16/2010 10:00 PM, Alex Rixhardson wrote: Hi guys, I have the following configuration: 1. host is RHEL 5.5, 64bit with KVM (version that comes out of the box with RHEL 5.5) 2. two guests: 2a: RHEL 5.5, 32bit, 2b: RHEL 4.5, 64bit If I run iperf between host RHEL 5.5 and guest RHEL 5.5 inside the virtual network subnet I get great results (> 4Gbit/sec). But if I run iperf between guest RHEL 4.5 and either of the two RHELs 5.5 I get bad network performance (around 140Mbit/sec). Please try netperf; iperf is known to be buggy and might consume cpu w/o real justification. The configuration was made thru the virtual-manager utility, nothing special. I just added a virtual network device to both guests. Could you guys give me some tips on what should I check?
Regards, Alex
Re: RHEL 4.5 guest virtual network performace
On 08/17/2010 12:51 AM, Alex Rixhardson wrote: I tried with 'notsc divider=10' (since it's 64 bit guest), but the results are the still same :-(. The guest is idle at the time of testing. It has 2 CPU and 1024 MB RAM available. Hmm, are you using e1000 or virtio for the 4.5 guest? e1000 should be slow since it's less suitable for virtualization (3 mmio/packet) On Mon, Aug 16, 2010 at 11:35 PM, Dor Laordl...@redhat.com wrote: On 08/17/2010 12:22 AM, Alex Rixhardson wrote: Thanks for the suggestion. I tried with the netperf. I ran netserver on host and netperf on RHEL 5.5 and RHEL 4.5 guests. This are the results of 60 seconds long tests: RHEL 4.5 guest: Throughput (10^6bits/sec) = 145.80 At least it bought you another 5Mb/s over iperf ... It might be time related, 5.5 has kvmclock but rhel4 does not. If it's 64 bit guest add this to the 4.5 guest cmdline 'notsc divider=10'. If it's 32 use 'clock=pmtmr divider=10'. The divider is probably new and is in rhel4.8 only, it's ok w/o it too. What's the host load for the 4.5 guest? RHEL 5.5 guest: Throughput (10^6bits/sec) = 3760.24 The results are really bad on RHEL 4.5 guest. What could be wrong? Regards, Alex On Mon, Aug 16, 2010 at 9:49 PM, Dor Laordl...@redhat.comwrote: On 08/16/2010 10:00 PM, Alex Rixhardson wrote: Hi guys, I have the following configuration: 1. host is RHEL 5.5, 64bit with KVM (version that comes out of the box with RHEL 5.5) 2. two guests: 2a: RHEL 5.5, 32bit, 2b: RHEL 4.5, 64bit If I run iperf between host RHEL 5.5 and guest RHEL 5.5 inside the virtual network subnet I get great results ( 4Gbit/sec). But if I run iperf between guest RHEL 4.5 and either of the two RHELs 5.5 I get bad network performance (around 140Mbit/sec). Please try netperf, iperf known to be buggy and might consume cpu w/o real justification The configuration was made thru virtual-manager utility, nothing special. I just added virtual network device to both guests. Could you guys give me some tips on what should I check? 
Regards, Alex
Re: The HPET issue on Linux
On 01/06/2010 12:09 PM, Gleb Natapov wrote: On Wed, Jan 06, 2010 at 05:48:52PM +0800, Sheng Yang wrote: Hi Beth, I still found the emulated HPET would result in some boot failures. For example, on my 2.6.30, with HPET enabled, the kernel would fail check_timer(), especially in timer_irq_works(). The test in timer_irq_works() is to let 10 ticks pass (using mdelay()), and confirm that the clock source has advanced jiffies by at least 5 ticks. I've checked that, on my machine, it would mostly get only 4 ticks with HPET enabled, and so fail the test. On the other hand, if I use PIT, it would get more than 10 ticks (maybe understandable if some compensatory ticks are there). Of course, extending the tick count/mdelay() time can work. I think it's a major issue of HPET. And it may just be due to a too long userspace path for interrupt injection... If that's true, I think it's not easy to deal with. PIT ticks are reinjected automatically; HPET should probably do the same, although it may just create another set of problems. Older Linux kernels do automatic adjustment for lost ticks, so automatic reinjection causes time to run too fast. This is why we added the -no-kvm-pit-reinject flag... It took lots of time for pit/rtc to stabilize; in order to seriously consider the hpet emulation, lots of testing should be done. -- Gleb.
Re: [Qemu-devel] cpuid problem in upstream qemu with kvm
On 01/06/2010 05:16 PM, Anthony Liguori wrote: On 01/06/2010 08:48 AM, Dor Laor wrote: On 01/06/2010 04:32 PM, Avi Kivity wrote: On 01/06/2010 04:22 PM, Michael S. Tsirkin wrote: We can probably default -enable-kvm to -cpu host, as long as we explain very carefully that if users wish to preserve cpu features across upgrades, they can't depend on the default. Hardware upgrades or software upgrades? Yes. I just want to remind all that the main motivation for using -cpu realModelThatWasOnceShipped is to provide correct cpu emulation for the guest. Using a random qemu|kvm64+flag1-flag2 might really cause trouble for the guest OS or guest apps. On top of -cpu nehalem we can always add fancy features like x2apic, etc. I think it boils down to: how are people going to use this? For individuals, code names like Nehalem are too obscure. From my own personal experience, even power users often have no clue whether their processor is a Nehalem or not. For management tools, Nehalem is a somewhat imprecise target because it covers a wide range of potential processors. In general, I think what we really need to do is simplify the process of going from here's the output of /proc/cpuinfo for 100 nodes to what do I need to pass to qemu so that migration always works for these systems. I don't think -cpu nehalem really helps with that problem. -cpu none helps a bit, but I hope we can find something nicer. We can debate the exact name/model to represent the Nehalem family, I don't have an issue with that, and actually Intel and AMD should define it. There are two main motivations behind the above approach: 1. Sound guest cpu definition. Using a predefined model should automatically set all the relevant vendor/stepping/cpuid flags/cache sizes/etc. We just can't let every management application deal with it. It breaks guest OS/apps. For instance, there is MSI support in Windows guests that relies on the stepping. 2. Simplifying end user and mgmt tools.
qemu/kvm have the best knowledge about these low levels. If we push it up in the stack, eventually it reaches the user. The end user, not a 'qemu-devel user' who is actually far better than the average user. This means that such users will have to know what popcount is and whether or not to limit migration on one host by adding sse4.2 or not. This is exactly what vmware are doing: - Intel CPUs : http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1991 - AMD CPUs : http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1992 Why should we reinvent the wheel (qemu64..)? Let's learn from their experience. This is the test description of the original patch by John:

# Intel
# -
# Management layers remove pentium3 by default.
# It primarily remains here for testing of 32-bit migration.
# [0:Pentium 3 Intel :vmx :pentium3;]
# Core 2, 65nm
# possible option sets: (+nx,+cx16), (+nx,+cx16,+ssse3)
# 1:Merom :vmx,sse2 :qemu64,-nx,+sse2;
# Core2 45nm
# 2:Penryn :vmx,sse2,nx,cx16,ssse3,sse4_1 :qemu64,+sse2,+cx16,+ssse3,+sse4_1;
# Core i7 45/32nm
# 3:Nehalem :vmx,sse2,nx,cx16,ssse3,sse4_1,sse4_2,popcnt :qemu64,+sse2,+cx16,+ssse3,+sse4_1,+sse4_2,+popcnt;
# AMD
# ---
# Management layers remove pentium3 by default.
# It primarily remains here for testing of 32-bit migration.
# [0:Pentium 3 AMD :svm :pentium3;]
# Opteron 90nm stepping E1/E4/E6
# possible option sets: (-nx) for 130nm
# 1:Opteron G1 :svm,sse2,nx :qemu64,+sse2;
# Opteron 90nm stepping F2/F3
# 2:Opteron G2 :svm,sse2,nx,cx16,rdtscp :qemu64,+sse2,+cx16,+rdtscp;
# Opteron 65/45nm
# 3:Opteron G3 :svm,sse2,nx,cx16,sse4a,misalignsse,popcnt,abm :qemu64,+sse2,+cx16,+sse4a,+misalignsse,+popcnt,+abm;

Regards, Anthony Liguori
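The "100 nodes" question Anthony raises is, mechanically, a set intersection over the flags lines. A hypothetical management-layer helper (not an existing qemu or libvirt interface) could reduce a pool to its common denominator like this:

```python
def common_cpu_flags(cpuinfo_texts):
    """Intersect the 'flags' lines of several /proc/cpuinfo dumps.

    The result is the feature set every node supports, i.e. the safe
    baseline a management tool could map onto a -cpu model choice so
    that migration works across the whole pool.
    """
    sets = []
    for text in cpuinfo_texts:
        for line in text.splitlines():
            if line.startswith("flags"):
                sets.append(set(line.split(":", 1)[1].split()))
                break  # one flags line per dump is enough
    return sorted(set.intersection(*sets)) if sets else []
```

Picking the widest named model whose feature list is a subset of that intersection is then a policy decision, which is exactly the part the thread argues should not be pushed onto end users.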
Re: [Qemu-devel] cpuid problem in upstream qemu with kvm
On 01/07/2010 10:18 AM, Avi Kivity wrote: On 01/07/2010 10:03 AM, Dor Laor wrote: We can debate the exact name/model to represent the Nehalem family, I don't have an issue with that, and actually Intel and AMD should define it. AMD and Intel already defined their names (in cat /proc/cpuinfo). They don't define families; the whole idea is to segment the market. The idea here is to minimize the number of models; we should have the following range for Intel, for example: pentium3 - merom - penryn - Nehalem - host - kvm/qemu64 So we're supplying a wide range of cpus: p3 for maximum flexibility and migration, nehalem for performance and migration, host for maximum performance and qemu/kvm64 for custom made. There are two main motivations behind the above approach: 1. Sound guest cpu definition. Using a predefined model should automatically set all the relevant vendor/stepping/cpuid flags/cache sizes/etc. We just can't let every management application deal with it. It breaks guest OS/apps. For instance, there is MSI support in Windows guests that relies on the stepping. 2. Simplifying end user and mgmt tools. qemu/kvm have the best knowledge about these low levels. If we push it up in the stack, eventually it reaches the user. The end user, not a 'qemu-devel user' who is actually far better than the average user. This means that such users will have to know what popcount is and whether or not to limit migration on one host by adding sse4.2 or not. This is exactly what vmware are doing: - Intel CPUs : http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1991 - AMD CPUs : http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1992 They don't have to deal with different qemu and kvm versions. Both our customers - the end users. It's not their problem. IMO what's missing today is a safe and sound cpu emulation that is simple and friendly to represent. qemu64,+popcount is not simple for the end user.
There is no reason to throw it onto higher level mgmt.
Re: [Qemu-devel] cpuid problem in upstream qemu with kvm
On 01/07/2010 11:24 AM, Avi Kivity wrote: On 01/07/2010 11:11 AM, Dor Laor wrote: On 01/07/2010 10:18 AM, Avi Kivity wrote: On 01/07/2010 10:03 AM, Dor Laor wrote: We can debate the exact name/model to represent the Nehalem family, I don't have an issue with that, and actually Intel and AMD should define it. AMD and Intel already defined their names (in cat /proc/cpuinfo). They don't define families; the whole idea is to segment the market. The idea here is to minimize the number of models; we should have the following range for Intel, for example: pentium3 - merom - penryn - Nehalem - host - kvm/qemu64 So we're supplying a wide range of cpus: p3 for maximum flexibility and migration, nehalem for performance and migration, host for maximum performance and qemu/kvm64 for custom made. There's no such thing as Nehalem. Intel were ok with it. Again, you can name it corei7 or xeon34234234234, I don't care; the principle remains the same. This is exactly what vmware are doing: - Intel CPUs : http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1991 - AMD CPUs : http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1992 They don't have to deal with different qemu and kvm versions. Both our customers - the end users. It's not their problem. IMO what's missing today is a safe and sound cpu emulation that is simple and friendly to represent. qemu64,+popcount is not simple for the end user. There is no reason to throw it onto higher level mgmt. There's no simple solution except to restrict features to what was available on the first processors. What's not simple about the above 4 options? What's a better alternative (that ensures users understand it and use it, and that guest msi and even the skype application are happy about it)?
Re: [Qemu-devel] cpuid problem in upstream qemu with kvm
On 01/07/2010 01:39 PM, Anthony Liguori wrote: On 01/07/2010 03:40 AM, Dor Laor wrote: There's no simple solution except to restrict features to what was available on the first processors. What's not simple about the above 4 options? What's a better alternative (that ensures users understand it and use it, and that guest msi and even the skype application are happy about it)? Even if you have -cpu Nehalem, different versions of the KVM kernel module may additionally filter cpuid flags. So if you had a 2.6.18 kernel and a 2.6.33 kernel, it may be necessary to say: (2.6.33) qemu -cpu Nehalem,-syscall (2.6.18) qemu -cpu Nehalem Or let qemu do it automatically for you, in order to be compatible. Regards, Anthony Liguori
Re: [Qemu-devel] cpuid problem in upstream qemu with kvm
On 01/07/2010 02:00 PM, Avi Kivity wrote: On 01/07/2010 01:44 PM, Dor Laor wrote: So if you had a 2.6.18 kernel and a 2.6.33 kernel, it may be necessary to say: (2.6.33) qemu -cpu Nehalem,-syscall (2.6.18) qemu -cpu Nehalem Or let qemu do it automatically for you. qemu on 2.6.33 doesn't know that you're running qemu on 2.6.18 on another node. We can live with it: either have qemu derive the kernel version from another existing feature, or query uname. Alternatively, the matching libvirt package can be the one adding or removing it in the right distribution.
Re: [Qemu-devel] cpuid problem in upstream qemu with kvm
On 01/07/2010 03:14 PM, Anthony Liguori wrote: On 01/07/2010 06:40 AM, Avi Kivity wrote: On 01/07/2010 02:33 PM, Anthony Liguori wrote: There's another option. Make cpuid information part of the live migration protocol, and then support something like -cpu Xeon-3550. We would remember the exact cpuid mask we present to the guest and then we could validate that we can obtain the same mask on the destination. It solves controlling the destination qemu execution all right, but does not change the initial spawning of the original guest - to know whether ,-syscall is needed or not. Anyway, I'm in favor of it too. Currently, our policy is to only migrate dynamic (from the guest's point of view) state, and specify static state on the command line [1]. I think your suggestion makes a lot of sense, but I'd like to expand it to move all guest state, whether dynamic or static. So '-m 1G' would be migrated as well (but not -mem-path). Similarly, in -drive file=...,if=ide,index=1, everything but file=... would be migrated. Yes, I agree with this and it should be in the form of an fdt. This means we need full qdev conversion. But I think cpuid is somewhere in the middle with respect to static vs. dynamic. For instance, -cpu host is very dynamic in that you get very different results on different systems. Likewise, because of kvm filtering, even -cpu qemu64 can be dynamic. So if we didn't have filtering and -cpu host, I'd agree that it's totally static, but I think in the current state, it's dynamic. This has an advantage wrt hotplug: since qemu is responsible for migrating all guest visible information, the migrator is no longer responsible for replaying hotplug events in the exact sequence they happened. Yup, 100% in agreement as a long term goal. In short, I think we should apply your suggestion as broadly as possible. [1] cpuid state is actually dynamic; repeated cpuid instruction execution with the same operands can return different results.
kvm supports querying and setting this state. Yes, and we save some cpuid state in cpu. We just don't save all of it. Regards, Anthony Liguori -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
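The destination-side validation sketched above (remember the exact cpuid mask presented to the guest, refuse the migration if the destination cannot reproduce it) can be pictured with a minimal feature-bitmap check. This is purely illustrative Python, not qemu/kvm code; all names and bit positions are hypothetical:

```python
# Sketch of the migration-destination check discussed above: the source sends
# the exact CPUID feature mask it exposed to the guest, and the destination
# accepts only if it can present the same mask.
# Bit positions below are illustrative, not real CPUID layout.

def can_accept_migration(source_mask: int, dest_host_mask: int) -> bool:
    """The destination can present `source_mask` to the guest only if every
    feature bit in it is also available on the destination host."""
    return source_mask & ~dest_host_mask == 0

SSE3    = 1 << 0
SYSCALL = 1 << 11
NX      = 1 << 20

guest_mask = SSE3 | SYSCALL
assert can_accept_migration(guest_mask, SSE3 | SYSCALL | NX)      # superset: accept
assert not can_accept_migration(guest_mask | NX, SSE3 | SYSCALL)  # NX missing: refuse
```

The same check, run at guest-spawn time against a named model such as Xeon-3550, is what would answer the ",-syscall needed or not" question before the guest ever starts.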
Re: [Qemu-devel] [PATCH] Add definitions for current cpu models..
On 01/21/2010 05:05 PM, Anthony Liguori wrote: On 01/20/2010 07:18 PM, john cooper wrote: Chris Wright wrote: * Daniel P. Berrange (berra...@redhat.com) wrote: To be honest all possible naming schemes for '-cpuname' are just as unfriendly as each other. The only user friendly option is '-cpu host'. IMHO, we should just pick a concise naming scheme and document it. Given they are all equally unfriendly, the one that has consistency with vmware naming seems like a mild winner. Heh, I completely agree, and was just saying the same thing to John earlier today. May as well be -cpu {foo,bar,baz} since the meaning for those command line options must be well-documented in the man page. I can appreciate the concern of wanting to get this as correct as possible. This is the root of the trouble. At the qemu layer, we try to focus on being correct. Management tools are typically the layer that deals with being correct. A good compromise is making things user tunable, which means that a downstream can make correctness decisions without forcing those decisions on upstream. In this case, the idea would be to introduce a new option, say something like -cpu-def. The syntax would be:

-cpu-def name=coreduo,level=10,family=6,model=14,stepping=8,features=+vme+mtrr+clflush+mca+sse3+monitor,xlevel=0x8008,model_id=Genuine Intel(R) CPU T2600 @ 2.16GHz

Which is not that exciting since it just lets you do -cpu coreduo in a much more complex way. However, if we take advantage of the current config support, you can have:

[cpu-def]
   name=coreduo
   level=10
   family=6
   model=14
   stepping=8
   features=+vme+mtrr+clflush+mca+sse3..
   model_id=Genuine Intel...

And that can be stored in a config file. We should then parse /etc/qemu/target-targetname.conf by default. We'll move the current x86_defs table into this config file and then downstreams/users can define whatever compatibility classes they want.
With this feature, I'd be inclined to take correct compatibility classes like Nehalem as part of the default qemurc that we install, because it's easily overridden by a user. It then becomes just a suggestion on our part versus a guarantee. It should just be a matter of adding qemu_cpudefs_opts to qemu-config.[ch], taking a new command line that parses the argument via QemuOpts, then passing the parsed options to a target-specific function that builds the table of supported cpus. Won't the outcome of John's patches and these configs be exactly the same? Since these cpu models won't ever change, there is no reason not to hard-code them. Adding configs or command lines is a good idea, but it is friendlier to have basic support for the common cpus. This is why qemu today offers:

-cpu ?
x86 qemu64
x86 phenom
x86 core2duo
x86 kvm64
x86 qemu32
x86 coreduo
x86 486
x86 pentium
x86 pentium2
x86 pentium3
x86 athlon
x86 n270

So bottom line, my point is to have John's base + your configs. We also need to keep the check verb and the migration support for sending those. btw: IMO we should deal with this complexity ourselves and save 99.9% of the users the need to define such models; don't ask this from a Java programmer, he is running on a JVM :-) Regards, Anthony Liguori -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
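The -cpu-def option sketched above is essentially key=value parsing of the kind QemuOpts does, feeding a table of named models. A rough illustration of that mapping, in Python rather than QEMU's C (the function and field names are assumptions for the sketch; note that model_id is omitted because its embedded commas and spaces would need the escaping QemuOpts provides):

```python
def parse_cpu_def(optstr: str) -> dict:
    """Parse a '-cpu-def name=...,level=...,features=+a+b,...' style string
    into a model dict, splitting the '+'-joined features list.
    Loosely mimics what QemuOpts plus a cpudef registration step would do."""
    model = {}
    for pair in optstr.split(","):
        key, _, value = pair.partition("=")
        if key == "features":
            model[key] = [f for f in value.split("+") if f]
        elif key in ("level", "family", "model", "stepping"):
            model[key] = int(value, 0)
        else:
            model[key] = value  # e.g. name, xlevel kept as strings here
    return model

coreduo = parse_cpu_def(
    "name=coreduo,level=10,family=6,model=14,stepping=8,"
    "features=+vme+mtrr+clflush+mca+sse3,xlevel=0x8008")
assert coreduo["name"] == "coreduo"
assert coreduo["family"] == 6
assert "sse3" in coreduo["features"]
```

Whether such a table is built into the binary (John's patches) or read from /etc/qemu at startup (the config proposal), the parsed result is the same shape, which is the point of the "base + configs" compromise.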
Re: [Qemu-devel] [PATCH] Add definitions for current cpu models..
On 01/25/2010 04:21 PM, Anthony Liguori wrote: On 01/25/2010 03:08 AM, Dor Laor wrote: It should just be a matter of adding qemu_cpudefs_opts to qemu-config.[ch], taking a new command line that parses the argument via QemuOpts, then passing the parsed options to a target-specific function that builds the table of supported cpus. Won't the outcome of John's patches and these configs be exactly the same? Since these cpu models won't ever change, there is no reason not to hard-code them. Adding configs or command lines is a good idea, but it is friendlier to have basic support for the common cpus. This is why qemu today offers:

-cpu ?
x86 qemu64
x86 phenom
x86 core2duo
x86 kvm64
x86 qemu32
x86 coreduo
x86 486
x86 pentium
x86 pentium2
x86 pentium3
x86 athlon
x86 n270

So bottom line, my point is to have John's base + your configs. We also need to keep the check verb and the migration support for sending those. btw: IMO we should deal with this complexity ourselves and save 99.9% of the users the need to define such models; don't ask this from a Java programmer, he is running on a JVM :-) I'm suggesting John's base should be implemented as a default config that gets installed by default in QEMU. The point is that a smart user (or a downstream) can modify this to suit their needs more appropriately. Another way to look at this is that implementing a somewhat arbitrary policy within QEMU's .c files is something we should try to avoid. Implementing arbitrary policy in our default config file is a fine thing to do. Default configs are suggested configurations that are modifiable by a user. Something baked into QEMU is something that ought to work for everyone in all circumstances. If we get the models right, users and mgmt stacks won't need to define them. It seems like an almost impossible task for us; mgmt stacks/users won't do a better job, the opposite I'd guess.
The configs are great, I have no argument against them; my case is that if we can pin down some definitions, it's better that they live in the code, like the above models. It might even help to get the same cpus across the various vendors, otherwise we might end up with IBM's core2duo, RH's core2duo, Suse's,.. Regards, Anthony Liguori -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] [RFC] KVM test: Control files automatic generation to save memory
On 02/14/2010 07:07 PM, Michael Goldish wrote: - Lucas Meneghel Rodriguesl...@redhat.com wrote: As our configuration system generates a list of dicts with test parameters, and that list might be potentially *very* large, keeping all this information in memory might be a problem for smaller virtualization hosts due to the memory pressure created. Tests made on my 4GB laptop show that most of the memory is being used during a typical kvm autotest session. So, instead of keeping all this information in memory, let's take a different approach and unfold all the tests generated by the config system and generate a control file: job.run_test('kvm', params={param1, param2, ...}, tag='foo', ...) job.run_test('kvm', params={param1, param2, ...}, tag='bar', ...) By dumping all the dicts that were before in the memory to a control file, the memory usage of a typical kvm autotest session is drastically reduced making it easier to run in smaller virt hosts. The advantages of taking this new approach are: * You can see what tests are going to run and the dependencies between them by looking at the generated control file * The control file is all ready to use, you can for example paste it on the web interface and profit * As mentioned, a lot less memory consumption, avoiding memory pressure on virtualization hosts. This is a crude 1st pass at implementing this approach, so please provide comments. Signed-off-by: Lucas Meneghel Rodriguesl...@redhat.com --- Interesting idea! - Personally I don't like the renaming of kvm_config.py to generate_control.py, and prefer to keep them separate, so that generate_control.py has the create_control() function and kvm_config.py has everything else. It's just a matter of naming; kvm_config.py deals mostly with config files, not with control files, and it can be used for other purposes than generating control files. - I wonder why so much memory is used by the test list. 
Our daily test sets aren't very big, so although the parser should use a huge amount of memory while parsing, nearly all of that memory should be freed by the time the parser is done, because the final 'only' statement reduces the number of tests to a small fraction of the total number in a full set. What test set did you try with that 4 GB machine, and how much memory was used by the test list? If a ridiculous amount of memory was used, this might indicate a bug in kvm_config.py (maybe it keeps references to deleted tests, forcing them to stay in memory). I agree, it's worth getting to the bottom of it - I wonder how many objects are created on kvm unstable set. It should be a huge number. Besides that, one can always call the python garbage collection interface in order to free unreferenced memory immediately. - I don't think this approach will work for control.parallel, because the tests have to be assigned dynamically to available queues, and AFAIK this can't be done by a simple static control file. - Whether or not this is a good idea probably depends on the users. On one hand, users will be required to run generate_control.py before autotest.py, and the generated control files will be very big and ugly; on the other hand, maybe they won't care. I probably haven't given this enough thought so I might have missed a few things. 
 client/tests/kvm/control             |  64 ----
 client/tests/kvm/generate_control.py | 586 ++++++
 client/tests/kvm/kvm_config.py       | 524 ------
 3 files changed, 586 insertions(+), 588 deletions(-)
 delete mode 100644 client/tests/kvm/control
 create mode 100755 client/tests/kvm/generate_control.py
 delete mode 100755 client/tests/kvm/kvm_config.py

diff --git a/client/tests/kvm/control b/client/tests/kvm/control
deleted file mode 100644
index 163286e..000
--- a/client/tests/kvm/control
+++ /dev/null
@@ -1,64 +0,0 @@
-AUTHOR =
-u...@redhat.com (Uri Lublin)
-dru...@redhat.com (Dror Russo)
-mgold...@redhat.com (Michael Goldish)
-dh...@redhat.com (David Huff)
-aerom...@redhat.com (Alexey Eromenko)
-mbu...@redhat.com (Mike Burns)
-
-TIME = 'MEDIUM'
-NAME = 'KVM test'
-TEST_TYPE = 'client'
-TEST_CLASS = 'Virtualization'
-TEST_CATEGORY = 'Functional'
-
-DOC =
-Executes the KVM test framework on a given host. This module is separated in
-minor functions, that execute different tests for doing Quality Assurance on
-KVM (both kernelspace and userspace) code.
-
-For online docs, please refer to http://www.linux-kvm.org/page/KVM-Autotest
-
-
-import sys, os, logging
-# Add the KVM tests dir to the python path
-kvm_test_dir = os.path.join(os.environ['AUTODIR'],'tests/kvm')
-sys.path.append(kvm_test_dir)
-# Now we can import modules inside the KVM tests dir
-import kvm_utils, kvm_config
-
-# set English environment (command output might be localized, need to be safe)
-os.environ['LANG'] = 'en_US.UTF-8'
-
-build_cfg_path = os.path.join(kvm_test_dir, build.cfg)
-build_cfg =
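The unfolding Lucas describes, turning each params dict produced by the config system into one job.run_test() line so the whole list never has to sit in memory, can be sketched as below. This is only the shape of the idea; the real generate_control.py differs, and the field names here are illustrative:

```python
# Sketch of control-file generation: stream each test's params dict out as a
# job.run_test() call instead of keeping every dict in memory at once.
# 'shortname' as the tag source is an assumption for this example.

def write_control(tests, path):
    """Write one job.run_test() call per test dict."""
    with open(path, "w") as control:
        for params in tests:
            control.write(
                "job.run_test('kvm', params=%r, tag=%r)\n"
                % (params, params.get("shortname", "kvm")))

tests = [
    {"shortname": "boot", "image": "winxp.qcow2"},
    {"shortname": "reboot", "image": "winxp.qcow2"},
]
write_control(tests, "control.generated")
with open("control.generated") as f:
    print(f.read())  # one job.run_test() line per test
```

Because each dict can be written and discarded as the parser emits it, peak memory tracks a single test rather than the whole unfolded set, which is the point of the patch; the readable output is also what makes "paste it on the web interface" work.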
Re: Recommended network driver for a windows KVM guest
On 02/17/2010 12:51 PM, carlopmart wrote: Hi all, I need to install several Windows KVM guests (rhel5.4 host fully updated) for iSCSI boot. The iSCSI servers are Solaris/OpenSolaris storage servers and I need to boot the Windows guests (2008R2 and Win7) using gpxe. Can I use the virtio net driver during the Windows install, or the e1000 driver? rhel5.4 does not have gpxe so it won't work. rhel5.5 will have it, but I don't recall anyone testing iSCSI with kvm+gpxe on upstream either; worth testing. Anyway, virtio performs better than e1000 and is potentially more stable. Many thanks. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Make QEmu HPET disabled by default for KVM?
On 03/14/2010 09:10 AM, Gleb Natapov wrote: On Sun, Mar 14, 2010 at 09:05:50AM +0200, Avi Kivity wrote: On 03/11/2010 09:08 PM, Marcelo Tosatti wrote: I have kept --no-hpet in my setup for months... Any details about the problems? HPET is important to some guests. As Gleb mentioned in the other thread, reinjection will introduce another set of problems. Ideally all these timer related problems should be fixed by correlating timer interrupts and time source reads. This still needs reinjection (or slewing of the timer frequency). Correlation doesn't fix drift. But only when all time sources are synchronised and correlated with interrupts can we slew the time frequency without the guest noticing (and only if the guest disables NTP). In the meantime we should definitely disable hpet by default. Besides this, we need to fully virtualize the tsc, fix the win7 64bit rtc time drift and some pvclock potential issues. Before we add a new timer, better to fix the existing ones. What about creating a pv timekeeping device that will be aware of lost ticks and host wall clock time? It's similar to hyper-v enlightenment virt timers. Since one already has to use special timer parameters (-rtc-td-hack, -no-kvm-pit-reinjection), using -no-hpet for problematic Linux guests seems fine? Depends on how common the problematic ones are. If they're common, better to have a generic fix. -- error compiling committee.c: too many arguments to function -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
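The reinjection being debated can be pictured with a toy model: the host counts ticks it failed to deliver (the vcpu wasn't scheduled) and injects them later, so a guest that keeps time by counting interrupts catches back up. This is purely illustrative Python, not kvm's actual RTC/PIT reinjection code:

```python
class ReinjectingTimer:
    """Toy model of tick reinjection: ticks that could not be delivered are
    accumulated and injected one extra tick at a time once the guest runs
    again, so the guest's interrupt count converges on real elapsed ticks."""
    def __init__(self):
        self.pending = 0      # ticks owed to the guest
        self.guest_ticks = 0  # interrupts the guest actually saw

    def tick(self, guest_can_run: bool):
        if guest_can_run:
            self.guest_ticks += 1
            if self.pending:          # pay back one owed tick per real tick
                self.pending -= 1
                self.guest_ticks += 1
        else:
            self.pending += 1         # vcpu descheduled: tick coalesced

timer = ReinjectingTimer()
# guest descheduled for 3 ticks, then runs for 5
for running in [False, False, False, True, True, True, True, True]:
    timer.tick(running)
assert timer.guest_ticks == 8 and timer.pending == 0  # fully caught up
```

The thread's objection is visible in the model too: reinjected ticks arrive faster than real time, which confuses guests (and NTP) that also read a correlated time source, which is why slewing the tick frequency is the suggested alternative.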
Re: Make QEmu HPET disabled by default for KVM?
On 03/14/2010 12:27 PM, Avi Kivity wrote: On 03/14/2010 12:23 PM, Dor Laor wrote: On 03/14/2010 09:10 AM, Gleb Natapov wrote: On Sun, Mar 14, 2010 at 09:05:50AM +0200, Avi Kivity wrote: On 03/11/2010 09:08 PM, Marcelo Tosatti wrote: I have kept --no-hpet in my setup for months... Any details about the problems? HPET is important to some guests. As Gleb mentioned in the other thread, reinjection will introduce another set of problems. Ideally all these timer related problems should be fixed by correlating timer interrupts and time source reads. This still needs reinjection (or slewing of the timer frequency). Correlation doesn't fix drift. But only when all time sources are synchronised and correlated with interrupts can we slew the time frequency without the guest noticing (and only if the guest disables NTP). In the meantime we should definitely disable hpet by default. Definitely not. Windows needs it. Some pre-kvmclock Linux may also work with it. Without hpet, there is no fast high resolution timer in the system. It all depends on how hard it would be to re-inject to a Windows guest. We still need to fix win2k3 64 bit and win2k8 64 bit (and not win7 as I said initially) since the irq is broadcast to all the vcpus and we do not track who acknowledged the irq. Besides this, we need to fully virtualize the tsc, fix the win7 64bit rtc time drift and some pvclock potential issues. Before we add a new timer, better to fix the existing ones. What about creating a pv timekeeping device that will be aware of lost ticks and host wall clock time? It's similar to hyper-v enlightenment virt timers. That's kvmclock. I meant a device that can be used to generate timeouts. Today we use pit/rtc along with the kvmclock time source, but it's not perfect, and probably the same goes for hpet. This is why I thought that a pv device would be beneficial.
-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Fwd: [PATCH]: An implementation of HyperV KVP functionality
FYI. Long ago we discussed a key/value approach on top of virtio-serial.

Original Message
Subject: [PATCH]: An implementation of HyperV KVP functionality
Date: Thu, 11 Nov 2010 13:03:10 -0700
From: Ky Srinivasan ksriniva...@novell.com
To: de...@driverdev.osuosl.org, virtualizat...@lists.osdl.org
CC: Haiyang Zhang haiya...@microsoft.com, Greg KH gre...@suse.de

I am enclosing a patch that implements the KVP (Key Value Pair) functionality for Linux guests on HyperV. This functionality allows the Microsoft management stack to query information from the guest. This functionality is implemented in two parts: (a) a kernel component that communicates with the host and (b) a user level daemon that implements data gathering. The attached patch (kvp.patch) implements the kernel component. I am also attaching the code for the user-level daemon (kvp_daemon.c) for reference. Regards, K. Y

From: K. Y. Srinivasan ksriniva...@novell.com
Subject: An implementation of key/value pair feature (KVP) for Linux on HyperV.

Signed-off-by: K. Y. Srinivasan ksriniva...@novell.com

Index: linux.trees.git/drivers/staging/hv/kvp.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.0 +
+++ linux.trees.git/drivers/staging/hv/kvp.c	2010-11-11 13:45:17.0 -0500
@@ -0,0 +1,404 @@
+/*
+ * An implementation of key value pair (KVP) functionality for Linux.
+ *
+ *
+ * Copyright (C) 2010, Novell, Inc.
+ * Author : K. Y. Srinivasan ksriniva...@novell.com
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
+ * NON INFRINGEMENT. See the GNU General Public License for more
+ * details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
+ *
+ */
+
+
+#include <linux/net.h>
+#include <linux/nls.h>
+#include <linux/connector.h>
+
+#include "logging.h"
+#include "osd.h"
+#include "vmbus.h"
+#include "vmbus_packet_format.h"
+#include "vmbus_channel_interface.h"
+#include "version_info.h"
+#include "channel.h"
+#include "vmbus_private.h"
+#include "vmbus_api.h"
+#include "utils.h"
+#include "kvp.h"
+
+
+/*
+ * The following definitions are shared with the user-mode component; do not
+ * change any of this without making the corresponding changes in
+ * the KVP user-mode component.
+ */
+
+#define CN_KVP_VAL 0x1 /* This supports queries from the kernel */
+#define CN_KVP_USER_VAL 0x2 /* This supports queries from the user */
+
+
+/*
+ * KVP protocol: The user mode component first registers with the
+ * kernel component. Subsequently, the kernel component requests data
+ * for the specified keys. In response to this message the user mode component
+ * fills in the value corresponding to the specified key. We overload the
+ * sequence field in the cn_msg header to define our KVP message types.
+ *
+ * XXXKYS: Have a shared header file between the user and kernel (TODO)
+ */
+
+enum kvp_op {
+	KVP_REGISTER = 0, /* Register the user mode component */
+	KVP_KERNEL_GET, /* Kernel is requesting the value for the specified key */
+	KVP_KERNEL_SET, /* Kernel is providing the value for the specified key */
+	KVP_USER_GET, /* User is requesting the value for the specified key */
+	KVP_USER_SET /* User is providing the value for the specified key */
+};
+
+
+#define KVP_KEY_SIZE	512
+#define KVP_VALUE_SIZE	2048
+
+
+typedef struct kvp_msg {
+	__u32 kvp_key; /* Key */
+	__u8 kvp_value[0]; /* Corresponding value */
+} kvp_msg_t;
+
+/*
+ * End of shared definitions.
+ */
+
+/*
+ * Registry value types.
+ */
+
+#define REG_SZ 1
+
+/*
+ * Array of keys we support in Linux.
+ *
+ */
+#define KVP_MAX_KEY	10
+#define KVP_LIC_VERSION	1
+
+
+static char *kvp_keys[KVP_MAX_KEY] = {"FullyQualifiedDomainName",
+			"IntegrationServicesVersion",
+			"NetworkAddressIPv4",
+			"NetworkAddressIPv6",
+			"OSBuildNumber",
+			"OSName",
+			"OSMajorVersion",
+			"OSMinorVersion",
+			"OSVersion",
+			"ProcessorArchitecture",
+			};
+
+/*
+ * Global state maintained for transaction that is being processed.
+ * Note that only one transaction can be active at any point in time.
+ *
+ * This state is set when we receive a request from the host; we
+ * cleanup this state when the
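The user-level daemon's half of the protocol is data gathering: when the kernel forwards a KVP_KERNEL_GET for one of the keys in kvp_keys[], the daemon fills in a value from the running guest. A hedged Python sketch of just that key-to-value step (kvp_daemon.c does this in C over the connector/netlink socket; the handler choices below are assumptions, not what the real daemon reports):

```python
import platform

# Sketch of the lookup the user-mode KVP daemon performs: map the key names
# the host may ask for (same names as kvp_keys[] in the patch) to values
# gathered from the guest. Only the key->value step is shown, not the
# netlink registration or message framing.

def kvp_get_value(key: str) -> str:
    handlers = {
        "FullyQualifiedDomainName": platform.node,
        "OSName": platform.system,
        "OSVersion": platform.release,
        "ProcessorArchitecture": platform.machine,
    }
    handler = handlers.get(key)
    return handler() if handler else ""  # unknown key: empty value

assert kvp_get_value("OSName") != ""
assert kvp_get_value("NoSuchKey") == ""
```

The fixed KVP_KEY_SIZE/KVP_VALUE_SIZE limits in the shared definitions mean the real daemon must also truncate or NUL-terminate whatever it gathers before copying it into the kvp_msg buffer.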
Re: [Qemu-devel] [PATCH] qemu-kvm: introduce cpu_start/cpu_stop commands
On 11/23/2010 08:41 AM, Avi Kivity wrote: On 11/23/2010 01:00 AM, Anthony Liguori wrote: qemu-kvm vcpu threads don't respond to SIGSTOP/SIGCONT. Instead of teaching them to respond to these signals, introduce monitor commands that stop and start individual vcpus. The purpose of these commands is to implement CPU hard limits using an external tool that watches the CPU consumption and stops the CPU as appropriate. Why not use cgroup for that? The monitor commands provide a more elegant solution than signals because they ensure that a stopped vcpu isn't holding the qemu_mutex. From signal(7): The signals SIGKILL and SIGSTOP cannot be caught, blocked, or ignored. Perhaps this is a bug in kvm? If we could catch SIGSTOP, then it would be easy to unblock it only while running in guest context. It would then stop on exit to userspace. Using monitor commands is fairly heavyweight for something as high frequency as this. What control period do you see people using? Maybe we should define USR1 for vcpu start/stop. What happens if one vcpu is stopped while another is running? Spin loops and synchronous IPIs will take forever. Maybe we need to stop the entire process. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
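Avi's cgroup suggestion amounts to letting the scheduler enforce the hard limit instead of an external stop/start loop. A sketch using the cpu controller's CFS bandwidth knobs (these were merged into mainline after this thread; the mount point and group name below are assumptions, and vary by distro):

```shell
#!/bin/sh
# Cap a qemu-kvm process at 50% of one CPU via the CFS bandwidth controller,
# instead of stop/start monitor commands or SIGSTOP/SIGCONT.
# Assumes the cgroup v1 'cpu' controller mounted at /sys/fs/cgroup/cpu.

PERIOD_US=100000                              # accounting period: 100ms
LIMIT_PCT=50                                  # hard limit, percent of one cpu
QUOTA_US=$((PERIOD_US * LIMIT_PCT / 100))     # runtime allowed per period

echo "quota=${QUOTA_US}us per ${PERIOD_US}us period"

# The actual knob writes (shown as comments, not executed here;
# QEMU_PID is the qemu-kvm process to confine):
#   mkdir /sys/fs/cgroup/cpu/kvm-guest
#   echo $PERIOD_US > /sys/fs/cgroup/cpu/kvm-guest/cpu.cfs_period_us
#   echo $QUOTA_US  > /sys/fs/cgroup/cpu/kvm-guest/cpu.cfs_quota_us
#   echo $QEMU_PID  > /sys/fs/cgroup/cpu/kvm-guest/tasks
```

This also answers the spin-loop objection in part: throttling the whole process (or a per-vcpu thread group) at the scheduler level never leaves one vcpu parked while holding a lock the way an unsynchronized per-vcpu stop can.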
Re: [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2
On 11/29/2010 06:23 PM, Stefan Hajnoczi wrote: On Mon, Nov 29, 2010 at 3:00 PM, Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp wrote: 2010/11/29 Paul Brookp...@codesourcery.com: If devices incorrectly claim support for live migration, then that should also be fixed, either by removing the broken code or by making it work. I totally agree with you. AFAICT your current proposal is just feeding back the results of some fairly specific QA testing. I'd rather not get into that game. The correct response in the context of upstream development is to file a bug and/or fix the code. We already have config files that allow third party packagers to remove devices they don't want to support. Sorry, I didn't get what you're trying to tell me. My plan would be to initially start from a subset of devices, and gradually grow the number of devices that Kemari works with. During this process, it'll include what you said above: file a bug and/or fix the code. Am I missing what you're saying? My point is that the whitelist shouldn't exist at all. Devices either support migration or they don't. Having some sort of separate whitelist is the wrong way to determine which devices support migration. Alright! Then if a user encounters a problem with Kemari, we'll fix Kemari or the devices or both. Correct? Is this a fair summary: any device that supports live migration works under Kemari? It might be a fair summary, but practically we barely have live migration working w/o Kemari. In addition, last I checked Kemari needs additional hooks and it will be too hard to keep that out of tree until all devices get it. (If such a device does not work under Kemari then this is a bug that needs to be fixed in live migration, Kemari, or the device.) Stefan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Freezing Windows 2008 x64bit guest
On 12/13/2010 09:42 PM, Manfred Heubach wrote: Gleb Natapov (gleb at redhat.com) writes: On Wed, Jul 28, 2010 at 12:53:02AM +0300, Harri Olin wrote: Gleb Natapov wrote: On Wed, Jul 21, 2010 at 09:25:31AM +0300, Harri Olin wrote: Gleb Natapov kirjoitti: On Mon, Jul 19, 2010 at 10:17:02AM +0300, Harri Olin wrote: Gleb Natapov kirjoitti: On Thu, Jul 15, 2010 at 03:19:44PM +0200, Christoph Adomeit wrote: But one Windows 2008 64 Bit Server Standard is freezing regularly. This happens sometimes 3 times a day, sometimes it takes 2 days until it freezes. The Windows machine is a clean fresh install. I think I have seen the same problem occur on my Windows 2008 SBS SP2 64bit system, but a bit less often, only about once a week. Now I haven't seen crashes but only freezes, with qemu at 100% and the virtual system unresponsive. Does sendkey from the monitor work? qemu-kvm-0.11.1 is very old and this is not a total freeze, which is even harder to debug. I don't see anything extraordinary in your logs. 4643 interrupts per second for 4 cpus is normal if Windows runs multimedia or another app that needs hi-res timers. Is your host swapping? Is there any chance that you can try upstream qemu-kvm? I tried running qemu-kvm from git but it exhibited the same problem as 12.x that I tried before, BSODing once in a while, running kernel 2.6.34.1. That should be a pretty stable config, although it would be nice if you could try running qemu-kvm.git head. Sample BSOD failure details: These two with the Realtek nic and the qemu cpu: 0x0019 (0x0020, 0xf88007e65970, 0xf88007e65990, 0x0502040f) 0x0019 (0x0020, 0xf88007a414c0, 0xf88007a414e0, 0x0502044c) These are with e1000 and -cpu host: 0x003b (0xc005, 0xf80001c5d842, 0xfa60093ddb70, 0x) 0x003b (0xc005, 0xf80001cb8842, 0xfa600c94ab70, 0x) 0x000a (0x0080, 0x000c, 0x0001, 0xf80001cadefd) Can you attach screenshots of the BSODs? Have you reinstalled your guests or are you running the same images you ran on 11.x? I'll see if I can analyze the minidumps later.
In addition to these there have been as many reboots that have only been logged as a 'disruptive shutdown'. Right now I'm running the problematic guest under Xen 3.2.1-something from Debian to see if it works better. -- Harri. Hello, is there a solution for that problem? I'm experiencing the same problems ever since I installed SBS 2008 on KVM. I was running the host with Ubuntu 10.04 but upgraded to 10.10 - mainly because of performance problems, which were solved by the upgrade. After the upgrade the system became extremely unstable. It was crashing as soon as disk io and network io load was growing; 100% reproducible with Windows Server Backup to an iSCSI volume. I had virtio drivers for storage and network installed (redhat/fedora 1.1.11). Which fedora/rhel release is that? What's the Windows virtio driver version? Have you tried using virt-manager/virsh instead of the raw cmdline? About e1000: some Windows versions come with a buggy driver, and an updated e1000 driver from Intel fixes some issues. At each BSOD I had the following line in the log of the guest: virtio_ioport_write: unexpected address 0x13 value 0x1 I changed the network interface back to e1000. What I experience now (and I had that at the very beginning, before I switched to virtio network) are freezes. The guest doesn't respond anymore (doesn't answer pings and doesn't interact via mouse/keyboard anymore). Host CPU usage of the kvm process is 100% on as many cores as there are virtual cpus (in this case 4). I'm a bit frustrated about this. I have 2 Windows 2003 32bit, 1 Windows XP and 3 Linux guests (2x 32bit, 1x 64bit). They are all running without any problems (except that the Windows XP guest cannot boot without an ntldr cd image). Only the SBS2008 guest regularly freezes. The host system has 2 Intel Xeon 5504, the Intel 5500 chipset, an Adaptec RAID 5805, and 24 GB DDR3 RAM. I know there is a lack of detailed information right now. I first need to know if anybody is working on this or has similar problems.
I can deliver minidumps and any debugging information you need. I don't want to give up now. We will switch to Hyper-V if we cannot solve this, because we need a stable virtualization platform for Windows guests. I would like to use KVM; it is so much more flexible. Best regards Manfred -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html