Re: [RFC PATCH 00/17] virtual-bus
On Wednesday 01 April 2009 05:12:47 Gregory Haskins wrote:
> Bare metal: tput = 4078Mb/s, round-trip = 25593pps (39us rtt)
> Virtio-net: tput = 4003Mb/s, round-trip = 320pps (3125us rtt)
> Venet: tput = 4050Mb/s, round-trip = 15255 (65us rtt)

That rtt time is awful. I know the notification suppression heuristic in qemu sucks. I could dig through the code, but I'll ask directly: what heuristic do you use for notification prevention in your venet_tap driver?

As you point out, 350-450 is possible, which is still bad, and it's at least partially caused by the exit to userspace and two system calls. If virtio_net had a backend in the kernel, we'd be able to compare numbers properly.

> Bare metal: tput = 9717Mb/s, round-trip = 30396pps (33us rtt)
> Virtio-net: tput = 4578Mb/s, round-trip = 249pps (4016us rtt)
> Venet: tput = 5802Mb/s, round-trip = 15127 (66us rtt)
>
> Note that even the throughput was slightly better in this test for venet, though neither venet nor virtio-net could achieve line-rate. I suspect some tuning may allow these numbers to improve, TBD.

At some point, the copying will hurt you. This is fairly easy to avoid on xmit tho.

Cheers,
Rusty.
Re: strange guest slowness after some time
David S. Ahern schrieb:

Could you add an (unused) e1000 interface to your virtio guests? As this issue happens rarely for me, maybe you could help to reproduce it as well (i.e. if the network gets slow on the virtio interface, give e1000 an IP address, and try whether the network is also slow on e1000 on the very same guest).

Will do and report.

BTW, what CPU do you have?

One dual core Opteron 2212. Note: I will upgrade to two Shanghai Quad-Cores in 2 weeks and test with those as well.

I have this slowness on an Intel CPU as well, after about 10 days of guest uptime (using virtio net):

processor   : 1
vendor_id   : GenuineIntel
cpu family  : 6
model       : 15
model name  : Intel(R) Xeon(R) CPU 3050 @ 2.13GHz

For the Intel server, is the guest using the e1000 NIC or virtio or other? I have a few DL320G5s with this processor; I have not hit this problem running rhel3 and rhel4 guests using e1000/scsi devices.

As I mentioned, it was using virtio net. Guests running with e1000 (and virtio_blk) don't have this problem.

--
Tomasz Chmielewski
http://wpkg.org
Re: [PATCH 4/4] Fix task switching.
Gleb Natapov wrote:
On Tue, Mar 31, 2009 at 05:21:16PM +0200, Kohl, Bernhard (NSN - DE/Munich) wrote:
Bernhard Kohl wrote:
Jan Kiszka wrote:
Gleb Natapov wrote:

The patch fixes two problems with task switching.
1. Back link is written to a wrong TSS.
2. Instruction emulation is not needed if the reason for the task switch is a task gate in the IDT and access to it is caused by an external event.

2 is currently solved only for VMX since there is no reliable way to skip an instruction in SVM. We should emulate it instead.

Does this series fix all issues Bernhard, Thomas and Julian stumbled over?

Jan

I will try this today. Thanks.

Yes, it works for us (Thomas + Bernhard).

Great. Thanks for testing.

Same here: No obvious regressions found while running various NMI/IRQ tests.

Jan
[PATCH 1/2] KVM: VMX: Clean up Flex Priority related
And clean paranthes on returns. Signed-off-by: Sheng Yang sh...@linux.intel.com --- arch/x86/kvm/vmx.c | 47 ++- 1 files changed, 30 insertions(+), 17 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index aba41ae..1caa1fc 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -216,61 +216,69 @@ static inline int is_external_interrupt(u32 intr_info) static inline int cpu_has_vmx_msr_bitmap(void) { - return (vmcs_config.cpu_based_exec_ctrl CPU_BASED_USE_MSR_BITMAPS); + return vmcs_config.cpu_based_exec_ctrl CPU_BASED_USE_MSR_BITMAPS; } static inline int cpu_has_vmx_tpr_shadow(void) { - return (vmcs_config.cpu_based_exec_ctrl CPU_BASED_TPR_SHADOW); + return vmcs_config.cpu_based_exec_ctrl CPU_BASED_TPR_SHADOW; } static inline int vm_need_tpr_shadow(struct kvm *kvm) { - return ((cpu_has_vmx_tpr_shadow()) (irqchip_in_kernel(kvm))); + return (cpu_has_vmx_tpr_shadow()) (irqchip_in_kernel(kvm)); } static inline int cpu_has_secondary_exec_ctrls(void) { - return (vmcs_config.cpu_based_exec_ctrl - CPU_BASED_ACTIVATE_SECONDARY_CONTROLS); + return vmcs_config.cpu_based_exec_ctrl + CPU_BASED_ACTIVATE_SECONDARY_CONTROLS; } static inline bool cpu_has_vmx_virtualize_apic_accesses(void) { - return flexpriority_enabled; + return vmcs_config.cpu_based_2nd_exec_ctrl + SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES; +} + +static inline bool cpu_has_vmx_flexpriority(void) +{ + return cpu_has_vmx_tpr_shadow() + cpu_has_vmx_virtualize_apic_accesses(); } static inline int cpu_has_vmx_invept_individual_addr(void) { - return (!!(vmx_capability.ept VMX_EPT_EXTENT_INDIVIDUAL_BIT)); + return !!(vmx_capability.ept VMX_EPT_EXTENT_INDIVIDUAL_BIT); } static inline int cpu_has_vmx_invept_context(void) { - return (!!(vmx_capability.ept VMX_EPT_EXTENT_CONTEXT_BIT)); + return !!(vmx_capability.ept VMX_EPT_EXTENT_CONTEXT_BIT); } static inline int cpu_has_vmx_invept_global(void) { - return (!!(vmx_capability.ept VMX_EPT_EXTENT_GLOBAL_BIT)); + return !!(vmx_capability.ept VMX_EPT_EXTENT_GLOBAL_BIT); } static inline int cpu_has_vmx_ept(void) { - return (vmcs_config.cpu_based_2nd_exec_ctrl - SECONDARY_EXEC_ENABLE_EPT); + return vmcs_config.cpu_based_2nd_exec_ctrl + SECONDARY_EXEC_ENABLE_EPT; } static inline int vm_need_virtualize_apic_accesses(struct kvm *kvm) { - return ((cpu_has_vmx_virtualize_apic_accesses()) - (irqchip_in_kernel(kvm))); + return flexpriority_enabled + (cpu_has_vmx_virtualize_apic_accesses()) + (irqchip_in_kernel(kvm)); } static inline int cpu_has_vmx_vpid(void) { - return (vmcs_config.cpu_based_2nd_exec_ctrl - SECONDARY_EXEC_ENABLE_VPID); + return vmcs_config.cpu_based_2nd_exec_ctrl + SECONDARY_EXEC_ENABLE_VPID; } static inline int cpu_has_virtual_nmis(void) @@ -278,6 +286,11 @@ static inline int cpu_has_virtual_nmis(void) return vmcs_config.pin_based_exec_ctrl PIN_BASED_VIRTUAL_NMIS; } +static inline bool report_flexpriority(void) +{ + return flexpriority_enabled; +} + static int __find_msr_index(struct vcpu_vmx *vmx, u32 msr) { int i; @@ -1201,7 +1214,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) if (!cpu_has_vmx_ept()) enable_ept = 0; - if (!(vmcs_config.cpu_based_2nd_exec_ctrl SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)) + if (!cpu_has_vmx_flexpriority()) flexpriority_enabled = 0; min = 0; @@ -3655,7 +3668,7 @@ static struct kvm_x86_ops vmx_x86_ops = { .check_processor_compatibility = vmx_check_processor_compat, .hardware_enable = hardware_enable, .hardware_disable = hardware_disable, - .cpu_has_accelerated_tpr = cpu_has_vmx_virtualize_apic_accesses, + 
.cpu_has_accelerated_tpr = report_flexpriority, .vcpu_create = vmx_create_vcpu, .vcpu_free = vmx_free_vcpu, -- 1.5.4.5 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/2] KVM: VMX: Fix feature testing
The testing of feature is too early now, before vmcs_config complete initialization. Signed-off-by: Sheng Yang sh...@linux.intel.com --- arch/x86/kvm/vmx.c | 18 +- 1 files changed, 9 insertions(+), 9 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 1caa1fc..7d7b0d6 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -1208,15 +1208,6 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) vmx_capability.ept, vmx_capability.vpid); } - if (!cpu_has_vmx_vpid()) - enable_vpid = 0; - - if (!cpu_has_vmx_ept()) - enable_ept = 0; - - if (!cpu_has_vmx_flexpriority()) - flexpriority_enabled = 0; - min = 0; #ifdef CONFIG_X86_64 min |= VM_EXIT_HOST_ADDR_SPACE_SIZE; @@ -1320,6 +1311,15 @@ static __init int hardware_setup(void) if (boot_cpu_has(X86_FEATURE_NX)) kvm_enable_efer_bits(EFER_NX); + if (!cpu_has_vmx_vpid()) + enable_vpid = 0; + + if (!cpu_has_vmx_ept()) + enable_ept = 0; + + if (!cpu_has_vmx_flexpriority()) + flexpriority_enabled = 0; + return alloc_kvm_area(); } -- 1.5.4.5 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] KVM: VMX: Clean up Flex Priority related
Sheng Yang wrote:
> And clean parentheses on returns.

Applied, thanks. Bad bugs on my part :(

--
I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: Use rsvd_bits_mask in load_pdptrs for cleanup and considering EXB bit
Dong, Eddie wrote:
>> Looks good, but doesn't apply; please check if you are working against the latest version.
>
> Rebased on top of a317a1e496b22d1520218ecf16a02498b99645e2 + previous rsvd bits violation check patch.

Applied, thanks.

--
I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: mmu_pages_next() question
Marcelo Tosatti wrote:
On Sun, Mar 29, 2009 at 03:24:08PM +0300, Avi Kivity wrote:

    static int mmu_pages_next(struct kvm_mmu_pages *pvec,
                              struct mmu_page_path *parents,
                              int i)
    {
            int n;

            for (n = i+1; n < pvec->nr; n++) {
                    struct kvm_mmu_page *sp = pvec->page[n].sp;

                    if (sp->role.level == PT_PAGE_TABLE_LEVEL) {
                            parents->idx[0] = pvec->page[n].idx;
                            return n;
                    }

                    parents->parent[sp->role.level-2] = sp;
                    parents->idx[sp->role.level-1] = pvec->page[n].idx;
            }

            return n;
    }

Do we need to break out of the loop if we switch parents during the loop (since that will give us a different mmu_page_path)? Or are callers careful to only pass pvecs which belong to the same shadow page?

This function builds mmu_page_path for a number of pagetable (leaf) pages. Whenever the path changes, mmu_page_path will be rebuilt. The pages in the pvec must be organized as follows: level4, level3, level2, level1, level1, level1, ..., level3, level2, level1, level1, ... So you don't have to repeat higher levels for a number of leaf pages.

I'm still missing something. That if () tests for level == PT_PAGE_TABLE_LEVEL. So it looks like we'll have batch sizes of 4, 1, 1, 1, ..., 3, 1, 1, 1, ...?

--
I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [PATCH v2 1/5] Fix handling of a fault during NMI unblocked due to IRET
Gleb Natapov wrote:
> Bit 12 is undefined in any of the following cases:
> - If the VM exit sets the valid bit in the IDT-vectoring information field.
> - If the VM exit is due to a double fault.

Applied the entire series, thanks.

--
I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [PATCH 0/4] fix header-sync using --with-patched-kernel
Mark McLoughlin wrote: Hi Avi, Here are a few fairly trivial build patches - they fix building kvm.git using --with-patched-kernel alongside an unconfigured kvm.git tree. Applied all, thanks. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] kvm: qemu: check device assignment command
Han, Weidong wrote:
> pci_parse_devaddr parses [[domain:]bus:]slot, so it is valid even when only a slot is entered, whereas it must be bus:slot.func in the device assignment command (-pcidevice host=bus:slot.func). So I implemented a dedicated function to parse the device BDF in the device assignment command, rather than mixing the two parsing functions together.

Applied, thanks.

--
I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: Can't download kvmctl scripts
Brent A Nelson wrote: URL: http://www.linux-kvm.org/page/HowToConfigScript The kvmctl scripts in the HowTo pages can't be downloaded, as the download links are actually uploads. Copying smintz... -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] KVM: Qemu: Do not use log dirty in IA64
Avi Kivity wrote:
> Zhang, Yang wrote:
>> hi please checkin it to kvm85, thanks!
>> IA64 does not support log dirty. We should not use it in IA64, or it will have some problem.
>
> Applied, thanks.
>
> When are you planning to add support for log dirty on ia64?

We had the patch at hand, but still there are other issues which block upstream, so we haven't tested it yet.

Xiantao
Re: [RFC PATCH 00/17] virtual-bus
Rusty Russell wrote: On Wednesday 01 April 2009 05:12:47 Gregory Haskins wrote: Bare metal: tput = 4078Mb/s, round-trip = 25593pps (39us rtt) Virtio-net: tput = 4003Mb/s, round-trip = 320pps (3125us rtt) Venet: tput = 4050Mb/s, round-trip = 15255 (65us rtt) That rtt time is awful. I know the notification suppression heuristic in qemu sucks. I could dig through the code, but I'll ask directly: what heuristic do you use for notification prevention in your venet_tap driver? I am not 100% sure I know what you mean with notification prevention, but let me take a stab at it. So like most of these kinds of constructs, I have two rings (rx + tx on the guest is reversed to tx + rx on the host), each of which can signal in either direction for a total of 4 events, 2 on each side of the connection. I utilize what I call bidirectional napi so that only the first packet submitted needs to signal across the guest/host boundary. E.g. first ingress packet injects an interrupt, and then does a napi_schedule and masks future irqs. Likewise, first egress packet does a hypercall, and then does a napi_schedule (I dont actually use napi in this path, but its conceptually identical) and masks future hypercalls. So thats is my first form of what I would call notification prevention. The second form occurs on the tx-complete path (that is guest-host tx). I only signal back to the guest to reclaim its skbs every 10 packets, or if I drain the queue, whichever comes first (note to self: make this # configurable). The nice part about this scheme is it significantly reduces the amount of guest/host transitions, while still providing the lowest latency response for single packets possible. e.g. Send one packet, and you get one hypercall, and one tx-complete interrupt as soon as it queues on the hardware. Send 100 packets, and you get one hypercall and 10 tx-complete interrupts as frequently as every tenth packet queues on the hardware. There is no timer governing the flow, etc. Is that what you were asking? As you point out, 350-450 is possible, which is still bad, and it's at least partially caused by the exit to userspace and two system calls. If virtio_net had a backend in the kernel, we'd be able to compare numbers properly. :) But that is the whole point, isnt it? I created vbus specifically as a framework for putting things in the kernel, and that *is* one of the major reasons it is faster than virtio-net...its not the difference in, say, IOQs vs virtio-ring (though note I also think some of the innovations we have added such as bi-dir napi are helping too, but these are not in-kernel specific kinds of features and could probably help the userspace version too). I would be entirely happy if you guys accepted the general concept and framework of vbus, and then worked with me to actually convert what I have as venet-tap into essentially an in-kernel virtio-net. I am not specifically interested in creating a competing pv-net driver...I just needed something to showcase the concepts and I didnt want to hack the virtio-net infrastructure to do it until I had everyone's blessing. Note to maintainers: I *am* perfectly willing to maintain the venet drivers if, for some reason, we decide that we want to keep them as is. Its just an ideal for me to collapse virtio-net and venet-tap together, and I suspect our community would prefer this as well. -Greg signature.asc Description: OpenPGP digital signature
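To make the tx-complete batching described above concrete, here is a minimal host-side sketch. All of the names and the helper-style layout are invented for illustration; only the rule itself (signal every ten packets or on queue drain, whichever comes first) is taken from the text, so treat this as a sketch rather than the actual venet-tap code.

    /*
     * Hypothetical sketch of the tx-complete batching described above:
     * reclaim completed guest skbs, but only signal the guest every
     * TX_COMPLETE_BATCH packets or when the queue drains.
     */
    #define TX_COMPLETE_BATCH 10    /* "note to self: make this # configurable" */

    struct tx_reclaim_state {
            unsigned int unsignaled;        /* completions since the last signal */
    };

    static void tap_tx_complete(struct tx_reclaim_state *st,
                                bool queue_drained,
                                void (*signal_guest)(void))
    {
            st->unsignaled++;

            if (st->unsignaled >= TX_COMPLETE_BATCH || queue_drained) {
                    signal_guest();         /* inject the tx-complete interrupt */
                    st->unsignaled = 0;
            }
    }

With this rule, a single packet still gets an immediate completion (the queue drains right away), while a burst of 100 packets produces roughly ten completion interrupts, matching the behaviour described in the message above.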
Re: [RFC PATCH 00/17] virtual-bus
Andi Kleen wrote: Gregory Haskins ghask...@novell.com writes: What might be useful is if you could expand a bit more on what the high level use cases for this. Questions that come to mind and that would be good to answer: This seems to be aimed at having multiple VMs talk to each other, but not talk to the rest of the world, correct? Is that a common use case? Actually we didn't design specifically for either type of environment. I think it would, in fact, be well suited to either type of communication model, even concurrently (e.g. an intra-vm ipc channel resource could live right on the same bus as a virtio-net and a virtio-disk resource) Wouldn't they typically have a default route anyways and be able to talk to each other this way? And why can't any such isolation be done with standard firewalling? (it's known that current iptables has some scalability issues, but there's work going on right now to fix that). vbus itself, and even some of the higher level constructs we apply on top of it (like venet) are at a different scope than I think what you are getting at above. Yes, I suppose you could create a private network using the existing virtio-net + iptables. But you could also do the same using virtio-net and a private bridge devices as well. That is not what we are trying to address. What we *are* trying to address is making an easy way to declare virtual resources directly in the kernel so that they can be accessed more efficiently. Contrast that to the way its done today, where the models live in, say, qemu userspace. So instead of having guest-host-qemu::virtio-net-tap-[iptables|bridge], you simply have guest-host-[iptables|bridge]. How you make your private network (if that is what you want to do) is orthogonal...its the path to get there that we changed. What would be the use cases for non networking devices? How would the interfaces to the user look like? I am not sure if you are asking about the guests perspective or the host-administators perspective. First now lets look at the low-level device interface from the guests perspective. We can cover the admin perspective in a separate doc, if need be. Each device in vbus supports two basic verbs: CALL, and SHM int (*call)(struct vbus_device_proxy *dev, u32 func, void *data, size_t len, int flags); int (*shm)(struct vbus_device_proxy *dev, int id, int prio, void *ptr, size_t len, struct shm_signal_desc *sigdesc, struct shm_signal **signal, int flags); CALL provides a synchronous method for invoking some verb on the device (defined by func) with some arbitrary data. The namespace for func is part of the ABI for the device in question. It is analogous to an ioctl, with the primary difference being that its remotable (it invokes from the guest driver across to the host device). SHM provides a way to register shared-memory with the device which can be used for asynchronous communication. The memory is always owned by the north (the guest), while the south (the host) simply maps it into its address space. You can optionally establish a shm_signal object on this memory for signaling in either direction, and I anticipate most shm regions will use this feature. Each shm region has an id namespace, which like the func namespace from the CALL method is completely owned by the device ABI. For example, we have might have id's of RX-RING and TX-RING, etc. From there, we can (hopefully) build an arbitrary type of IO service to map on top. 
So for instance, for venet-tap, we have CALL verbs for things like MACQUERY, and LINKUP, and we have SHM ids for RX-QUEUE and TX-QUEUE. We can write a driver that speaks this ABI on the bottom edge, and presents a normal netif interface on the top edge. So the actual consumption of these resources can look just like another other resource of a similar type. -Greg signature.asc Description: OpenPGP digital signature
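As a rough illustration of how a guest driver might string the two verbs together for a venet-style device: the fragment below is purely a sketch. The constant names (VENET_FUNC_MACQUERY, VENET_SHM_RXQUEUE), the ops-structure layout, and the probe flow are assumptions made for the example; only the call()/shm() signatures come from the message above.

    /* Hypothetical probe for a venet-style device built on the two verbs. */
    static int venet_probe(struct vbus_device_proxy *dev)
    {
            u8 mac[6];
            void *rxring;
            struct shm_signal *rxsig;
            int ret;

            /* CALL: synchronous, ioctl-like verb asking the host device for its MAC */
            ret = dev->ops->call(dev, VENET_FUNC_MACQUERY, mac, sizeof(mac), 0);
            if (ret < 0)
                    return ret;

            /* SHM: guest-owned memory registered as the RX ring, with a
             * shm_signal attached for notification in either direction */
            rxring = kzalloc(PAGE_SIZE, GFP_KERNEL);
            if (!rxring)
                    return -ENOMEM;

            ret = dev->ops->shm(dev, VENET_SHM_RXQUEUE, 0, rxring, PAGE_SIZE,
                                NULL, &rxsig, 0);
            return ret;
    }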
Re: [RFC PATCH 01/17] shm-signal: shared-memory signals
Avi Kivity wrote: Gregory Haskins wrote: +struct shm_signal_irq { +__u8 enabled; +__u8 pending; +__u8 dirty; +}; Some ABIs may choose to pad this, suggest explicit padding. Yeah, good idea. What is the official way to do this these days? Are GCC pragmas allowed? I just add a __u8 pad[5] in such cases. Oh, duh. Dumb question. I was getting confused with pack, not pad. :) + +struct shm_signal; + +struct shm_signal_ops { +int (*inject)(struct shm_signal *s); +void (*fault)(struct shm_signal *s, const char *fmt, ...); Eww. Must we involve strings and printf formats? This is still somewhat of a immature part of the design. Its supposed to be used so that by default, its a panic. But on the host side, we can do something like inject a machine-check. That way malicious/broken guests cannot (should not? ;) be able to take down the host. Note today I do not map this to anything other than the default panic, so this needs some love. But given the asynchronous nature of the fault, I want to be sure we have decent accounting to avoid bug reports like silent MCE kills the guest ;) At least this way, we can log the fault string somewhere to get a clue. I see. This raises a point I've been thinking of - the symmetrical nature of the API vs the assymetrical nature of guest/host or user/kernel interfaces. This is most pronounced in -inject(); in the host-guest direction this is async (host can continue processing while the guest is handling the interrupt), whereas in the guest-host direction it is synchronous (the guest is blocked while the host is processing the call, unless the host explicitly hands off work to a different thread). Note that this is exactly what I do (though it is device specific). venet-tap has a ioq_notifier registered on its rx ring (which is the tx-ring for the guest) that simply calls ioq_notify_disable() (which calls shm_signal_disable() under the covers) and it wakes its rx-thread. This all happens in the context of the hypercall, which then returns and allows the vcpu to re-enter guest mode immediately. signature.asc Description: OpenPGP digital signature
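For reference, the explicit padding suggested near the top of this exchange would make the descriptor look roughly like this; the pad field is the only addition, the other fields are as posted.

    struct shm_signal_irq {
            __u8 enabled;
            __u8 pending;
            __u8 dirty;
            __u8 pad[5];    /* explicit padding so every ABI sees the same 8-byte layout */
    };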
Re: [RFC PATCH 01/17] shm-signal: shared-memory signals
Gregory Haskins wrote:
> Note that this is exactly what I do (though it is device specific). venet-tap has a ioq_notifier registered on its rx ring (which is the tx-ring for the guest) that simply calls ioq_notify_disable() (which calls shm_signal_disable() under the covers) and it wakes its rx-thread. This all happens in the context of the hypercall, which then returns and allows the vcpu to re-enter guest mode immediately.

I think this is suboptimal. The ring is likely to be cache hot on the current cpu; waking a thread will introduce scheduling latency + IPI + cache-to-cache transfers.

On a benchmark setup, host resources are likely to exceed guest requirements, so you can throw cpu at the problem and no one notices. But I think the bits/cycle figure will decrease, even if bits/sec increases.

--
I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
one question about virtualization and kvm
Hello! I have two containers with OS Linux. All files in /usr and /bin are identical. Is it possible to mount/bind /usr and /bin into the containers (not copy all files to the containers)?

P.S. Sorry for bad English and maybe a stupid question.

--
Vasiliy Tolstov v.tols...@selfip.ru
Selfip.Ru
Re: [RFC PATCH 00/17] virtual-bus
On Wed, Apr 01, 2009 at 08:03:49AM -0400, Gregory Haskins wrote: Andi Kleen wrote: Gregory Haskins ghask...@novell.com writes: What might be useful is if you could expand a bit more on what the high level use cases for this. Questions that come to mind and that would be good to answer: This seems to be aimed at having multiple VMs talk to each other, but not talk to the rest of the world, correct? Is that a common use case? Actually we didn't design specifically for either type of environment. But surely you must have some specific use case in mind? Something that it does better than the various methods that are available today. Or rather there must be some problem you're trying to solve. I'm just not sure what that problem exactly is. What we *are* trying to address is making an easy way to declare virtual resources directly in the kernel so that they can be accessed more efficiently. Contrast that to the way its done today, where the models live in, say, qemu userspace. So instead of having guest-host-qemu::virtio-net-tap-[iptables|bridge], you simply have guest-host-[iptables|bridge]. How you make your private network (if So is the goal more performance or simplicity or what? What would be the use cases for non networking devices? How would the interfaces to the user look like? I am not sure if you are asking about the guests perspective or the host-administators perspective. I was wondering about the host-administrators perspective. -Andi -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 01/17] shm-signal: shared-memory signals
Avi Kivity wrote: Gregory Haskins wrote: Note that this is exactly what I do (though it is device specific). venet-tap has a ioq_notifier registered on its rx ring (which is the tx-ring for the guest) that simply calls ioq_notify_disable() (which calls shm_signal_disable() under the covers) and it wakes its rx-thread. This all happens in the context of the hypercall, which then returns and allows the vcpu to re-enter guest mode immediately. I think this is suboptimal. Heh, yes I know this is your (well documented) position, but I respectfully disagree. :) CPUs are not getting much faster, but they are rapidly getting more cores. If we want to continue to make software run increasingly faster, we need to actually use those cores IMO. Generally this means split workloads up into as many threads as possible as long as you can keep pipelines filed. The ring is likely to be cache hot on the current cpu, waking a thread will introduce scheduling latency + IPI This part is a valid criticism, though note that Linux is very adept at scheduling so we are talking mere ns/us range here, which is dwarfed by the latency of something like your typical IO device (e.g. 36us for a rtt packet on 10GE baremetal, etc). The benefit, of course, is the potential for increased parallelism which I have plenty of data to show we are very much taking advantage of here (I can saturate two cores almost completely according to LTT traces, one doing vcpu work, and the other running my rx thread which schedules the packet on the hardware) +cache-to-cache transfers. This one I take exception to. While it is perfectly true that splitting the work between two cores has a greater cache impact than staying on one, you cannot look at this one metric alone and say this is bad. Its also a function of how efficiently the second (or more) cores are utilized. There will be a point in the curve where the cost of cache coherence will become marginalized by the efficiency added by the extra compute power. Some workloads will invariably be on the bad end of that curve, and therefore doing the work on one core is better. However, we cant ignore that there will others that are on the good end of this spectrum either. Otherwise, we risk performance stagnation on our effectively uniprocessor box ;). In addition, the task-scheduler will attempt to co-locate tasks that are sharing data according to a best-fit within the cache hierarchy. Therefore, we will still be sharing as much as possible (perhaps only L2, L3, or a local NUMA domain, but this is still better than nothing) The way I have been thinking about these issues is something I have been calling soft-asics. In the early days, we had things like a simple uniprocessor box with a simple dumb ethernet. People figured out that if you put more processing power into the NIC, you could offload that work from the cpu and do more in parallel. So things like checksum computation and segmentation duties were a good fit. More recently, we see even more advanced hardware where you can do L2 or even L4 packet classification right in the hardware, etc. All of these things are effectively parallel computation, and it occurs in a completely foreign cache domain! So a lot of my research has been around the notion of trying to use some of our cpu cores to do work like some of the advanced asic based offload engines do. The cores are often under utilized anyway, and this will bring some of the features of advanced silicon to commodity resources. 
They also have the added flexibility that its just software, so you can change or enhance the system at will. So if you think about it, by using threads like this in venet-tap, I am effectively using other cores to do csum/segmentation (if the physical hardware doesn't support it), layer 2 classification (linux bridging), filtering (iptables in the bridge), queuing, etc as if it was some smart device out on the PCI bus. The guest just queues up packets independently in its own memory, while the device just dma's the data on its own (after the initial kick). The vcpu is keeping the pipeline filled on its side independently. On a benchmark setup, host resources are likely to exceed guest requirements, so you can throw cpu at the problem and no one notices. Sure, but with the type of design I have presented this still sorts itself out naturally even if the host doesn't have the resources. For instance, if there is a large number of threads competing for a small number of cores, we will simply see things like the rx-thread stalling and going to sleep, or the vcpu thread backpressuring and going idle (and therefore sleeping). All of these things are self throttling. If you don't have enough resources to run a workload at a desirable performance level, the system wasn't sized right to begin with. ;) But I think the bits/cycle figure will decrease, even if bits/sec increases. Note that this isn't necessarily a bad thing. I
Re: one question about virtualization and kvm
On Wed, Apr 1, 2009 at 7:27 AM, Vasiliy Tolstov v.tols...@selfip.ru wrote: Hello! I have two containers with os linux. All files in /usr and /bin are identical. Is that possible to mount/bind /usr and /bin to containers? (not copy all files to containers).. ? the problem (and solution) is exactly the same as if they weren't virtual machines, but real machines: use the network. simply share the directories with NFS and mount them in your initrd scripts (preferably read/only). other way would be to set a new image file with a copy of the directories, and mount them on both virtual machines. of course, now you MUST mount them as readonly. and you can't change anything there without ummounting from both VMs. usually it's not worth it, unless you have tens of identical VMs -- Javier -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 00/17] virtual-bus
Andi Kleen wrote: On Wed, Apr 01, 2009 at 08:03:49AM -0400, Gregory Haskins wrote: Andi Kleen wrote: Gregory Haskins ghask...@novell.com writes: What might be useful is if you could expand a bit more on what the high level use cases for this. Questions that come to mind and that would be good to answer: This seems to be aimed at having multiple VMs talk to each other, but not talk to the rest of the world, correct? Is that a common use case? Actually we didn't design specifically for either type of environment. But surely you must have some specific use case in mind? Something that it does better than the various methods that are available today. Or rather there must be some problem you're trying to solve. I'm just not sure what that problem exactly is. Performance. We are trying to create a high performance IO infrastructure. Ideally we would like to see things like virtual-machines have bare-metal performance (or as close as possible) using just pure software on commodity hardware. The data I provided shows that something like KVM with virtio-net does a good job on throughput even on 10GE, but the latency is several orders of magnitude slower than bare-metal. We are addressing this issue and others like it that are a result of the current design of out-of-kernel emulation. What we *are* trying to address is making an easy way to declare virtual resources directly in the kernel so that they can be accessed more efficiently. Contrast that to the way its done today, where the models live in, say, qemu userspace. So instead of having guest-host-qemu::virtio-net-tap-[iptables|bridge], you simply have guest-host-[iptables|bridge]. How you make your private network (if So is the goal more performance or simplicity or what? (Answered above) What would be the use cases for non networking devices? How would the interfaces to the user look like? I am not sure if you are asking about the guests perspective or the host-administators perspective. I was wondering about the host-administrators perspective. Ah, ok. Sorry about that. It was probably good to document that other thing anyway, so no harm. So about the host-administrator interface. The whole thing is driven by configfs, and the basics are already covered in the documentation in patch 2, so I wont repeat it here. Here is a reference to the file for everyone's convenience: http://git.kernel.org/?p=linux/kernel/git/ghaskins/vbus/linux-2.6.git;a=blob;f=Documentation/vbus.txt;h=e8a05dafaca2899d37bd4314fb0c7529c167ee0f;hb=f43949f7c340bf667e68af6e6a29552e62f59033 So a sufficiently privileged user can instantiate a new bus (e.g. container) and devices on that bus via configfs operations. The types of devices available to instantiate are dictated by whatever vbus-device modules you have loaded into your particular kernel. The loaded modules available are enumerated under /sys/vbus/deviceclass. Now presumably the administrator knows what a particular module is and how to configure it before instantiating it. Once they instantiate it, it will present an interface in sysfs with a set of attributes. For example, an instantiated venet-tap looks like this: ghask...@test:~ tree /sys/vbus/devices /sys/vbus/devices `-- foo |-- class - ../../deviceclass/venet-tap |-- client_mac |-- enabled |-- host_mac |-- ifname `-- interfaces `-- 0 - ../../../instances/bar/devices/0 Some of these attributes, like class and interfaces are default attributes that are filled in by the infrastructure. 
Other attributes, like client_mac and enabled are properties defined by the venet-tap module itself. So the administrator can then set these attributes as desired to manipulate the configuration of the instance of the device, on a per device basis. So now imagine we have some kind of disk-io vbus device that is designed to act kind of like a file-loopback device. It might define an attribute allowing you to specify the path to the file/block-dev that you want it to export. (Warning: completely fictitious tree output to follow ;) ghask...@test:~ tree /sys/vbus/devices /sys/vbus/devices `-- foo |-- class - ../../deviceclass/vdisk |-- src_path `-- interfaces `-- 0 - ../../../instances/bar/devices/0 So the admin would instantiate this vdisk device and do: 'echo /path/to/my/exported/disk.dat /sys/vbus/devices/foo/src_path' To point the device to the file on the host that it wants to present as a vdisk. Any guest that has access to the particular bus that contains this device would then see it as a standard vdisk ABI device (as if there where such a thing, yet) and could talk to it using a vdisk specific driver. A property of a vbus is that it is inherited by children. Today, I do not have direct support in qemu for creating/configuring vbus devices. Instead what I do is I set up the vbus and devices
Re: [RFC PATCH 00/17] virtual-bus
Gregory Haskins wrote:
> Andi Kleen wrote:
>> On Wed, Apr 01, 2009 at 08:03:49AM -0400, Gregory Haskins wrote:
>>> Andi Kleen wrote:
>>>> Gregory Haskins ghask...@novell.com writes:
>>>> What might be useful is if you could expand a bit more on what the high level use cases for this. Questions that come to mind and that would be good to answer: This seems to be aimed at having multiple VMs talk to each other, but not talk to the rest of the world, correct? Is that a common use case?
>>> Actually we didn't design specifically for either type of environment.
>> But surely you must have some specific use case in mind? Something that it does better than the various methods that are available today. Or rather there must be some problem you're trying to solve. I'm just not sure what that problem exactly is.
> Performance. We are trying to create a high performance IO infrastructure.

Actually, I should also state that I am interested in enabling some new kinds of features based on having in-kernel devices like this.

For instance (and this is still very theoretical and half-baked), I would like to try to support RT guests. [adding linux-rt-users]

I think one of the things that we need in order to do that is being able to convey vcpu priority state information to the host in an efficient way. I was thinking that a shared page per vcpu could hold something like current and threshold priorities. The guest modifies current while the host modifies threshold. The guest would be allowed to increase its current priority without a hypercall (after all, if it's already running, presumably it is already of sufficient priority for the scheduler). But if the guest wants to drop below the threshold, it needs to hypercall the host to give it an opportunity to schedule() a new task (vcpu or not). The host, on the other hand, could apply a mapping so that the guest's priority of RT1-RT99 might map to RT20-RT30 on the host, or something like that.

We would have to take other considerations into account as well, such as implicit boosting on IRQ injection (e.g. the guest could be in HLT/IDLE when an interrupt is injected... but by virtue of injecting that interrupt we may need to boost it to (guest-relative) RT50).

Like I said, this is all half-baked right now. My primary focus is improving performance, but I did try to lay the groundwork for taking things in new directions too.. RT being an example.

Hope that helps!

-Greg
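Purely as a sketch of the half-baked idea above (none of this exists; the structure, the function, and hypercall_prio_drop() are all invented names): a per-vcpu page shared between guest and host, where raising priority is a plain local write but dropping below the host-supplied threshold forces a hypercall so the host can reschedule.

    struct vcpu_prio_page {
            __u32 current_prio;     /* written by the guest */
            __u32 threshold_prio;   /* written by the host */
    };

    /* guest side: change the vcpu's effective priority */
    static void guest_set_prio(struct vcpu_prio_page *pp, __u32 prio)
    {
            __u32 old = pp->current_prio;

            pp->current_prio = prio;

            /* raising priority needs no exit; dropping below the host's
             * threshold must give the host a chance to schedule() */
            if (prio < pp->threshold_prio && old >= pp->threshold_prio)
                    hypercall_prio_drop();
    }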
Re: Commit 3d28613c225ba94062950dacbb2304b2d2024abc breaks linux boot
Gleb Natapov wrote:
> Commit 3d28613c225ba94062950dacbb2304b2d2024abc breaks linux boot. It hangs after printing:
> SMP alternatives: switching to UP code

Does dropping bit 8 from context->rsvd_bits_mask[0][1] (PT64_ROOT_LEVEL) help?

--
I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Commit 3d28613c225ba94062950dacbb2304b2d2024abc breaks linux boot
Commit 3d28613c225ba94062950dacbb2304b2d2024abc breaks linux boot. It hangs after printing:

SMP alternatives: switching to UP code

--
Gleb.
[PATCH] Add shared memory PCI device that shares a memory object betweens VMs
This patch supports sharing memory between VMs and between the host/VM. It's a first cut and comments are encouraged. The goal is to support simple Inter-VM communication with zero-copy access to shared memory. The patch adds the switch -ivshmem (short for Inter-VM shared memory) that is used as follows -ivshmem file,size. The shared memory object named 'file' will be created/opened and mapped onto a PCI memory device with size 'size'. The PCI device has two BARs, BAR0 for registers and BAR1 for the memory region that maps the file above. The memory region can be mmapped into userspace on the guest (or read and written if you want). The register region will eventually be used to support interrupts which are communicated via unix domain sockets, but I need some tips on how to do this using a qemu character device. Also, feel free to suggest a better name if you have one. Thanks, Cam --- qemu/Makefile.target |2 + qemu/hw/ivshmem.c| 363 ++ qemu/hw/pc.c |6 + qemu/hw/pc.h |3 + qemu/qemu-options.hx | 10 ++ qemu/sysemu.h|7 + qemu/vl.c| 12 ++ 7 files changed, 403 insertions(+), 0 deletions(-) create mode 100644 qemu/hw/ivshmem.c diff --git a/qemu/Makefile.target b/qemu/Makefile.target index 6eed853..167db55 100644 --- a/qemu/Makefile.target +++ b/qemu/Makefile.target @@ -640,6 +640,8 @@ OBJS += e1000.o # Serial mouse OBJS += msmouse.o +# Inter-VM PCI shared memory +OBJS += ivshmem.o ifeq ($(USE_KVM_DEVICE_ASSIGNMENT), 1) OBJS+= device-assignment.o diff --git a/qemu/hw/ivshmem.c b/qemu/hw/ivshmem.c new file mode 100644 index 000..27db95f --- /dev/null +++ b/qemu/hw/ivshmem.c @@ -0,0 +1,363 @@ +/* + * Inter-VM Shared Memory PCI device. + * + * Author: + * Cam Macdonell c...@cs.ualberta.ca + * + * Based On: cirrus_vga.c and rtl8139.c + * + * This code is licensed under the GNU GPL v2. + */ + +#include hw.h +#include console.h +#include pc.h +#include pci.h +#include sysemu.h + +#include qemu-common.h +#include sys/mman.h + +#define PCI_COMMAND_IOACCESS0x0001 +#define PCI_COMMAND_MEMACCESS 0x0002 +#define PCI_COMMAND_BUSMASTER 0x0004 + +//#define DEBUG_IVSHMEM + +#ifdef DEBUG_IVSHMEM +#define IVSHMEM_DPRINTF(fmt, args...)\ +do {printf(IVSHMEM: fmt, ##args); } while (0) +#else +#define IVSHMEM_DPRINTF(fmt, args...) 
+#endif + +typedef struct IVShmemState { +uint16_t intrmask; +uint16_t intrstatus; +uint8_t *ivshmem_ptr; +unsigned long ivshmem_offset; +unsigned int ivshmem_size; +unsigned long bios_offset; +unsigned int bios_size; +target_phys_addr_t base_ctrl; +int it_shift; +PCIDevice *pci_dev; +unsigned long map_addr; +unsigned long map_end; +int ivshmem_mmio_io_addr; +} IVShmemState; + +typedef struct PCI_IVShmemState { +PCIDevice dev; +IVShmemState ivshmem_state; +} PCI_IVShmemState; + +typedef struct IVShmemDesc { +char name[1024]; +int size; +} IVShmemDesc; + + +/* registers for the Inter-VM shared memory device */ +enum ivshmem_registers { +IntrMask = 0, +IntrStatus = 16 +}; + +static int num_ivshmem_devices = 0; +static IVShmemDesc ivshmem_desc; + +static void ivshmem_map(PCIDevice *pci_dev, int region_num, +uint32_t addr, uint32_t size, int type) +{ +PCI_IVShmemState *d = (PCI_IVShmemState *)pci_dev; +IVShmemState *s = d-ivshmem_state; + +IVSHMEM_DPRINTF(addr = %u size = %u\n, addr, size); +cpu_register_physical_memory(addr, s-ivshmem_size, s-ivshmem_offset); + +} + +void ivshmem_init(const char * optarg) { + +char * temp; +int size; + +num_ivshmem_devices++; + +/* currently we only support 1 device */ +if (num_ivshmem_devices MAX_IVSHMEM_DEVICES) { +return; +} + +temp = strdup(optarg); +snprintf(ivshmem_desc.name, 1024, /%s, strsep(temp,,)); +size = atol(temp); +if ( size == -1) { +ivshmem_desc.size = TARGET_PAGE_SIZE; +} else { +ivshmem_desc.size = size*1024*1024; +} +IVSHMEM_DPRINTF(optarg is %s, name is %s, size is %d\n, optarg, +ivshmem_desc.name, +ivshmem_desc.size); +} + +int ivshmem_get_size(void) { +return ivshmem_desc.size; +} + +/* accessing registers - based on rtl8139 */ +static void ivshmem_update_irq(IVShmemState *s) +{ +int isr; +isr = (s-intrstatus s-intrmask) 0x; + +/* don't print ISR resets */ +if (isr) { +IVSHMEM_DPRINTF(Set IRQ to %d (%04x %04x)\n, + isr ? 1 : 0, s-intrstatus, s-intrmask); +} + +qemu_set_irq(s-pci_dev-irq[0], (isr != 0)); +} + +static void ivshmem_mmio_map(PCIDevice *pci_dev, int region_num, + uint32_t addr, uint32_t size, int type) +{ +PCI_IVShmemState *d = (PCI_IVShmemState *)pci_dev; +IVShmemState *s =
[PATCH] Guest device driver for an inter-VM shared memory PCI device that maps a shared file (from /dev/shm) on the devices memory.
This driver corresponds to the shared memory PCI device that maps a host file into shared memory on the device. Accessing the device can be done through creating a device file on the guest num =`cat /proc/devices | grep kvm_ivshmem | awk '{print $1}` mknod --mode=666 /dev/ivshmem $num 0 read, write, lseek and mmap are supported, but mmap is the usual usage to get zero-copy communication. The driver contains the initial interrupt support, but I have not yet added the unix domain socket to support interrupts yet. --- drivers/char/Kconfig |8 + drivers/char/Makefile |2 + drivers/char/kvm_ivshmem.c | 370 3 files changed, 380 insertions(+), 0 deletions(-) create mode 100644 drivers/char/kvm_ivshmem.c diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig index 735bbe2..afa7cb8 100644 --- a/drivers/char/Kconfig +++ b/drivers/char/Kconfig @@ -1099,6 +1099,14 @@ config DEVPORT depends on ISA || PCI default y +config KVM_IVSHMEM +tristate Inter-VM Shared Memory Device +depends on PCI +default m +help + This device maps a region of shared memory between the host OS and any + number of virtual machines. + source drivers/s390/char/Kconfig endmenu diff --git a/drivers/char/Makefile b/drivers/char/Makefile index 9caf5b5..021f06b 100644 --- a/drivers/char/Makefile +++ b/drivers/char/Makefile @@ -111,6 +111,8 @@ obj-$(CONFIG_PS3_FLASH) += ps3flash.o obj-$(CONFIG_JS_RTC) += js-rtc.o js-rtc-y = rtc.o +obj-$(CONFIG_KVM_IVSHMEM) += kvm_ivshmem.o + # Files generated that shall be removed upon make clean clean-files := consolemap_deftbl.c defkeymap.c diff --git a/drivers/char/kvm_ivshmem.c b/drivers/char/kvm_ivshmem.c new file mode 100644 index 000..7d46ac4 --- /dev/null +++ b/drivers/char/kvm_ivshmem.c @@ -0,0 +1,370 @@ +/* + * drivers/char/kvm_ivshmem.c - driver for KVM Inter-VM shared memory PCI device + * + * Copyright 2009 Cam Macdonell c...@cs.ualberta.ca + * + * Based on cirrusfb.c and 8139cp.c: + * Copyright 1999-2001 Jeff Garzik + * Copyright 2001-2004 Jeff Garzik + * + */ + +#include linux/init.h +#include linux/kernel.h +#include linux/module.h +#include linux/pci.h +#include linux/proc_fs.h +#include linux/smp_lock.h +#include asm/uaccess.h +#include linux/interrupt.h + +#define TRUE 1 +#define FALSE 0 +#define KVM_IVSHMEM_DEVICE_MINOR_NUM 0 + +enum { +/* KVM Shmem device register offsets */ +IntrMask= 0x00,/* Interrupt Mask */ +IntrStatus= 0x10,/* Interrupt Status */ + +ShmOK = 1/* Everything is OK */ +}; + +typedef struct kvm_ivshmem_device { +void __iomem * regs; + +void * base_addr; + +unsigned int regaddr; +unsigned int reg_size; + +unsigned int ioaddr; +unsigned int ioaddr_size; +unsigned int irq; + +bool enabled; +spinlock_t dev_spinlock; +} kvm_ivshmem_device; + +static kvm_ivshmem_device kvm_ivshmem_dev; + +static int device_major_nr; + +static int kvm_ivshmem_mmap(struct file *, struct vm_area_struct *); +static int kvm_ivshmem_open(struct inode *, struct file *); +static int kvm_ivshmem_release(struct inode *, struct file *); +static ssize_t kvm_ivshmem_read(struct file *, char *, size_t, loff_t *); +static ssize_t kvm_ivshmem_write(struct file *, const char *, size_t, loff_t *); +static loff_t kvm_ivshmem_lseek(struct file * filp, loff_t offset, int origin); + +static const struct file_operations kvm_ivshmem_ops = { +.owner = THIS_MODULE, +.open= kvm_ivshmem_open, +.mmap= kvm_ivshmem_mmap, +.read= kvm_ivshmem_read, +.write = kvm_ivshmem_write, +.llseek = kvm_ivshmem_lseek, +.release = kvm_ivshmem_release, +}; + +static struct pci_device_id kvm_ivshmem_id_table[] = { +{ 0x1af4, 0x1110, 
PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0 }, +{ 0 }, +}; +MODULE_DEVICE_TABLE (pci, kvm_ivshmem_id_table); + +static void kvm_ivshmem_remove_device(struct pci_dev* pdev); +static int kvm_ivshmem_probe_device (struct pci_dev *pdev, +const struct pci_device_id * ent); + +static struct pci_driver kvm_ivshmem_pci_driver = { +.name= kvm-shmem, +.id_table= kvm_ivshmem_id_table, +.probe= kvm_ivshmem_probe_device, +.remove= kvm_ivshmem_remove_device, +}; + +static ssize_t kvm_ivshmem_read(struct file * filp, char * buffer, size_t len, +loff_t * poffset) +{ + +int bytes_read = 0; +unsigned long offset; + +offset = *poffset; + +printk(KERN_INFO kvm_ivshmem: trying to read\n); +if (!kvm_ivshmem_dev.base_addr) { +printk(KERN_ERR KVM_IVSHMEM: cannot read from ioaddr (NULL)\n); +return 0; +} + +if (len kvm_ivshmem_dev.ioaddr_size - offset) { +len = kvm_ivshmem_dev.ioaddr_size - offset; +} + +printk(KERN_INFO KVM_IVSHMEM: len is %u\n, (unsigned) len); +if
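(The patch text is truncated in this archive.) As a usage illustration for the driver described above, a guest-side program could map the shared region roughly as follows. This is a sketch: the 1 MB size is an assumption and must match the -ivshmem argument given to qemu on the host, and /dev/ivshmem is the node created with the mknod command in the patch description.

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            const size_t len = 1024 * 1024;         /* must match the -ivshmem size */
            int fd = open("/dev/ivshmem", O_RDWR);
            char *shm;

            if (fd < 0)
                    return 1;

            shm = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            if (shm == MAP_FAILED)
                    return 1;

            strcpy(shm, "hello from the guest"); /* visible to the host / other VMs */

            munmap(shm, len);
            close(fd);
            return 0;
    }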
kvm-autotest: weird memory error during stepmaker test
Wondering if anyone else using kvm-autotest stepmaker has ever seen this error: Traceback (most recent call last): File /home/rharper/work/git/build/kvm-autotest/client/tests/kvm_runtest_2/stepmaker.py, line 146, in update self.set_image_from_file(self.screendump_filename) File /home/rharper/work/git/build/kvm-autotest/client/tests/kvm_runtest_2/stepeditor.py, line 499, in set_image_from_file self.set_image(w, h, data) File /home/rharper/work/git/build/kvm-autotest/client/tests/kvm_runtest_2/stepeditor.py, line 485, in set_image w, h, w*3)) MemoryError The guest is still running, but stepmaker isn't recording any more so it's boned at that point. And of course, it's near the end of a guest install so one has lost a decent amount of time... -- Ryan Harper Software Engineer; Linux Technology Center IBM Corp., Austin, Tx ry...@us.ibm.com -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 00/17] virtual-bus
Rusty Russell wrote: On Wednesday 01 April 2009 05:12:47 Gregory Haskins wrote: Bare metal: tput = 4078Mb/s, round-trip = 25593pps (39us rtt) Virtio-net: tput = 4003Mb/s, round-trip = 320pps (3125us rtt) Venet: tput = 4050Mb/s, round-trip = 15255 (65us rtt) That rtt time is awful. I know the notification suppression heuristic in qemu sucks. I could dig through the code, but I'll ask directly: what heuristic do you use for notification prevention in your venet_tap driver? As you point out, 350-450 is possible, which is still bad, and it's at least partially caused by the exit to userspace and two system calls. If virtio_net had a backend in the kernel, we'd be able to compare numbers properly. I doubt the userspace exit is the problem. On a modern system, it takes about 1us to do a light-weight exit and about 2us to do a heavy-weight exit. A transition to userspace is only about ~150ns, the bulk of the additional heavy-weight exit cost is from vcpu_put() within KVM. If you were to switch to another kernel thread, and I'm pretty sure you have to, you're going to still see about a 2us exit cost. Even if you factor in the two syscalls, we're still talking about less than .5us that you're saving. Avi mentioned he had some ideas to allow in-kernel thread switching without taking a heavy-weight exit but suffice to say, we can't do that today. You have no easy way to generate PCI interrupts in the kernel either. You'll most certainly have to drop down to userspace anyway for that. I believe the real issue is that we cannot get enough information today from tun/tap to do proper notification prevention b/c we don't know when the packet processing is completed. Regards, Anthony Liguori -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Add shared memory PCI device that shares a memory object betweens VMs
Hi Cam, Cam Macdonell wrote: This patch supports sharing memory between VMs and between the host/VM. It's a first cut and comments are encouraged. The goal is to support simple Inter-VM communication with zero-copy access to shared memory. Nice work! I would suggest two design changes to make here. The first is that I think you should use virtio. The second is that I think instead of relying on mapping in device memory to the guest, you should have the guest allocate it's own memory to dedicate to sharing. A lot of what you're doing is duplicating functionality in virtio-pci. You can also obtain greater portability by building the drivers with virtio. It may not seem obvious how to make the memory sharing via BAR fit into virtio, but if you follow my second suggestion, it will be a lot easier. Right now, you've got a bit of a hole in your implementation because you only support files that are powers-of-two in size even though that's not documented/enforced. This is a limitation of PCI resource regions. Also, the PCI memory hole is limited in size today which is going to put an upper bound on the amount of memory you could ever map into a guest. Since you're using qemu_ram_alloc() also, it makes hotplug unworkable too since qemu_ram_alloc() is a static allocation from a contiguous heap. If you used virtio, what you could do is provide a ring queue that was used to communicate a series of requests/response. The exchange might look like this: guest: REQ discover memory region host: RSP memory region id: 4 size: 8k guest: REQ map region id: 4 size: 8k: sgl: {(addr=43000, size=4k), (addr=944000,size=4k)} host: RSP mapped region id: 4 guest: REQ notify region id: 4 host: RSP notify region id: 4 guest: REQ poll region id: 4 host: RSP poll region id: 4 And the REQ/RSP order does not have to be in series like this. In general, you need one entry on the queue to poll for new memory regions, one entry for each mapped region to poll for incoming notification, and then the remaining entries can be used to send short-lived requests/responses. It's important that the REQ map takes a scatter/gather list of physical addresses because after running for a while, it's unlikely that you'll be able to allocate any significant size of contiguous memory. From a QEMU perspective, you would do memory sharing by waiting for a map REQ from the guest and then you would complete the request by doing an mmap(MAP_FIXED) with the appropriate parameters into phys_ram_base. Notifications are a topic for discussion I think. A CharDriverState could be used by I think it would be more interesting to do something like a fd passed by SCM_RIGHTS so that eventfd can be used. To simplify things, I'd suggest starting out only supporting one memory region mapping. Regards, Anthony Liguori -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
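One way to picture the queue protocol sketched above is as a small request structure carried on the virtqueue. Nothing below is an existing interface; it is only an illustration of the discover/map/notify/poll verbs, with the map request carrying a scatter/gather list of guest-physical addresses as suggested.

    enum shmem_req_type {
            SHMEM_REQ_DISCOVER,     /* learn of an available region (id, size) */
            SHMEM_REQ_MAP,          /* back region id with guest pages (sg list) */
            SHMEM_REQ_NOTIFY,       /* kick the other side of region id */
            SHMEM_REQ_POLL,         /* completes when the other side kicks us */
    };

    struct shmem_sg_entry {
            __u64 addr;             /* guest-physical address */
            __u64 len;
    };

    struct shmem_req {
            __u32 type;             /* enum shmem_req_type */
            __u32 region_id;
            __u64 size;
            __u32 nr_sg;            /* only used by SHMEM_REQ_MAP */
            struct shmem_sg_entry sg[];
    };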
Re: KVM on Via Nano (Isaiah) CPUs?
Craig Metz wrote:
> Has anyone (esp. the KVM core developers) tried to determine whether KVM works on the new Via Nano CPUs? They claim to support the Intel-style VT-x instruction set extensions and show up in cpuinfo that way. But, according to some Google searching, folks who have tried to use KVM (or Hyper-V) have not been successful. It's not clear if this is a CPU implementation problem and/or something that needs more work in KVM.

Via engineers have contacted me and confirmed that this is a problem in the processor.

--
I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
[ kvm-Bugs-2725367 ] KVM userspace segfaults due to internal VNC server
Bugs item #2725367, was opened at 2009-04-01 19:57 Message generated for change (Tracker Item Submitted) made by technologov You can respond by visiting: https://sourceforge.net/tracker/?func=detailatid=893831aid=2725367group_id=180599 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: qemu Group: None Status: Open Resolution: None Priority: 8 Private: No Submitted By: Technologov (technologov) Assigned to: Nobody/Anonymous (nobody) Summary: KVM userspace segfaults due to internal VNC server Initial Comment: KVM's internal VNC server is unstable. When running KVM (KVM-84 or 85rc2), the userspace segfaults when I try to connect to it with VNC client. Only some VNC clients can trigger it. It happens on both Intel AMD. I used TightVNC 1.3 client for Linux 64-bit. No problems happen with SDL rendering. Host: Intel Core 2 CPU, KVM-85rc2, Fedora 7 x64 Guest: Windows XP SP2 32-bit The Command sent to Qemu/KVM: /usr/local/bin/qemu-system-x86_64 -m 256 -monitor tcp:localhost:4502,server,nowait -cdrom /isos/windows/WindowsXP-sp2-vlk.iso -hda /vm/winxp.qcow2 -name WindowsXP -vnc :1 GDB output: (gdb) c Continuing. Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 46912498463376 (LWP 18803)] 0x00438cfc in vga_draw_line24_32 (s1=value optimized out, d=0x2aaabc822000 Address 0x2aaabc822000 out of bounds, s=0x2aaabb3eeef7 , width=36) at /root/Linstall/kvm-85rc2/qemu/hw/vga_template.h:484 484 ((PIXEL_TYPE *)d)[0] = glue(rgb_to_pixel, PIXEL_NAME)(r, g, b); (gdb) bt #0 0x00438cfc in vga_draw_line24_32 (s1=value optimized out, d=0x2aaabc822000 Address 0x2aaabc822000 out of bounds, s=0x2aaabb3eeef7 , width=36) at /root/Linstall/kvm-85rc2/qemu/hw/vga_template.h:484 #1 0x00437b0d in vga_update_display (opaque=value optimized out) at /root/Linstall/kvm-85rc2/qemu/hw/vga.c:1767 #2 0x00490c45 in vnc_listen_read (opaque=0x2aaabb3eeef7) at vnc.c:2020 #3 0x004093dc in main_loop_wait (timeout=value optimized out) at /root/Linstall/kvm-85rc2/qemu/vl.c:3818 #4 0x0051724a in kvm_main_loop () at /root/Linstall/kvm-85rc2/qemu/qemu-kvm.c:588 #5 0x0040e28a in main (argc=13, argv=0x7fff25e77658, envp=value optimized out) at /root/Linstall/kvm-85rc2/qemu/vl.c:3875 (gdb) c Continuing. Program terminated with signal SIGSEGV, Segmentation fault. The program no longer exists. (gdb) The program is not being run. -Alexey -- You can respond by visiting: https://sourceforge.net/tracker/?func=detailatid=893831aid=2725367group_id=180599 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 00/17] virtual-bus
On Wed, Apr 01, 2009 at 10:19:49AM -0400, Gregory Haskins wrote: But surely you must have some specific use case in mind? Something that it does better than the various methods that are available today. Or rather there must be some problem you're trying to solve. I'm just not sure what that problem exactly is. Performance. We are trying to create a high performance IO infrastructure. Ok. So the goal is to bypass user space qemu completely for better performance. Can you please put this into the initial patch description? So the administrator can then set these attributes as desired to manipulate the configuration of the instance of the device, on a per device basis. How would the guest learn of any changes in there? I think the interesting part would be how e.g. a vnet device would be connected to the outside interfaces. So the admin would instantiate this vdisk device and do: 'echo /path/to/my/exported/disk.dat /sys/vbus/devices/foo/src_path' So it would act like a loop device? Would you reuse the loop device or write something new? How about VFS mount name spaces? -Andi -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] add ksm kernel shared memory driver.
KAMEZAWA Hiroyuki wrote: On Tue, 31 Mar 2009 15:21:53 +0300 Izik Eidus iei...@redhat.com wrote: kpage is actually what is going to be a KsmPage - the shared page... Right now these pages are not swappable...; after ksm is merged we will make these pages swappable as well... sure. If so, please: - show the amount of kpages - allow users to set a limit on the usage of kpages, or preserve kpages at boot or by user's command. kpages actually save memory..., and limiting their number would limit the number of shared pages... Ah, I'm working on the memory control cgroup, and *KSM* will be out of its control. It's ok to make the default limit value INFINITY, but please add knobs. Sure, when I post V2 I will take care of this issue (I will do it after I get a bit more review of ksm.c :-)) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
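For reference, the kind of knob being asked for here could be as simple as a module parameter that caps the number of shared kpages. The names ksm_max_kpages and ksm_kpages_shared below are hypothetical, chosen only to illustrate the idea; they are not from the posted patch.

#include <linux/module.h>

static unsigned long ksm_max_kpages;	/* 0 == unlimited, the INFINITY default */
module_param(ksm_max_kpages, ulong, 0644);
MODULE_PARM_DESC(ksm_max_kpages, "Maximum number of shared kpages (0 = no limit)");

static unsigned long ksm_kpages_shared;	/* would be updated under ksm's existing lock */

/* Checked before promoting a page into a shared kpage. */
static int ksm_may_add_kpage(void)
{
	return !ksm_max_kpages || ksm_kpages_shared < ksm_max_kpages;
}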
Re: [PATCH] Add shared memory PCI device that shares a memory object betweens VMs
Anthony Liguori wrote: Hi Cam, Cam Macdonell wrote: This patch supports sharing memory between VMs and between the host/VM. It's a first cut and comments are encouraged. The goal is to support simple Inter-VM communication with zero-copy access to shared memory. Nice work! I would suggest two design changes to make here. The first is that I think you should use virtio. I disagree with this. While virtio is excellent at exporting guest memory, it isn't so good at importing another guest's memory. The second is that I think instead of relying on mapping in device memory to the guest, you should have the guest allocate it's own memory to dedicate to sharing. That's not what you describe below. You're having the guest allocate parts of its address space that happen to be used by RAM, and overlaying those parts with the shared memory. Right now, you've got a bit of a hole in your implementation because you only support files that are powers-of-two in size even though that's not documented/enforced. This is a limitation of PCI resource regions. While the BAR needs to be a power of two, I don't think the RAM backing it needs to be. Also, the PCI memory hole is limited in size today which is going to put an upper bound on the amount of memory you could ever map into a guest. Today. We could easily lift this restriction by supporting 64-bit BARs. It would probably take only a few lines of code. Since you're using qemu_ram_alloc() also, it makes hotplug unworkable too since qemu_ram_alloc() is a static allocation from a contiguous heap. We need to fix this anyway, for memory hotplug. If you used virtio, what you could do is provide a ring queue that was used to communicate a series of requests/response. The exchange might look like this: guest: REQ discover memory region host: RSP memory region id: 4 size: 8k guest: REQ map region id: 4 size: 8k: sgl: {(addr=43000, size=4k), (addr=944000,size=4k)} host: RSP mapped region id: 4 guest: REQ notify region id: 4 host: RSP notify region id: 4 guest: REQ poll region id: 4 host: RSP poll region id: 4 That looks significantly more complex. And the REQ/RSP order does not have to be in series like this. In general, you need one entry on the queue to poll for new memory regions, one entry for each mapped region to poll for incoming notification, and then the remaining entries can be used to send short-lived requests/responses. It's important that the REQ map takes a scatter/gather list of physical addresses because after running for a while, it's unlikely that you'll be able to allocate any significant size of contiguous memory. From a QEMU perspective, you would do memory sharing by waiting for a map REQ from the guest and then you would complete the request by doing an mmap(MAP_FIXED) with the appropriate parameters into phys_ram_base. That will fragment the vma list. And what do you do when you unmap the region? How does a 256M guest map 1G of shared memory? -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 00/17] virtual-bus
Andi Kleen wrote: On Wed, Apr 01, 2009 at 10:19:49AM -0400, Gregory Haskins wrote: But surely you must have some specific use case in mind? Something that it does better than the various methods that are available today. Or rather there must be some problem you're trying to solve. I'm just not sure what that problem exactly is. Performance. We are trying to create a high performance IO infrastructure. Ok. So the goal is to bypass user space qemu completely for better performance. Can you please put this into the initial patch description? FWIW, there's nothing that prevents in-kernel back ends with virtio so vbus certainly isn't required for in-kernel backends. That said, I don't think we're bound today by the fact that we're in userspace. Rather we're bound by the interfaces we have between the host kernel and userspace to generate IO. I'd rather fix those interfaces than put more stuff in the kernel. Regards, Anthony Liguori -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Add shared memory PCI device that shares a memory object betweens VMs
Avi Kivity wrote: Anthony Liguori wrote: Hi Cam, I would suggest two design changes to make here. The first is that I think you should use virtio. I disagree with this. While virtio is excellent at exporting guest memory, it isn't so good at importing another guest's memory. First we need to separate static memory sharing and dynamic memory sharing. Static memory sharing has to be configured on start up. I think in practice, static memory sharing is not terribly interesting except for maybe embedded environments. Dynamically memory sharing requires bidirectional communication in order to establish mappings and tear down mappings. You'll eventually recreate virtio once you've implemented this communication mechanism. The second is that I think instead of relying on mapping in device memory to the guest, you should have the guest allocate it's own memory to dedicate to sharing. That's not what you describe below. You're having the guest allocate parts of its address space that happen to be used by RAM, and overlaying those parts with the shared memory. But from the guest's perspective, it's RAM is being used for memory sharing. If you're clever, you could start a guest with -mem-path and then use this mechanism to map a portion of one guest's memory into another guest without either guest ever knowing who owns the memory and with exactly the same driver on both. Right now, you've got a bit of a hole in your implementation because you only support files that are powers-of-two in size even though that's not documented/enforced. This is a limitation of PCI resource regions. While the BAR needs to be a power of two, I don't think the RAM backing it needs to be. Then you need a side channel to communicate the information to the guest. Also, the PCI memory hole is limited in size today which is going to put an upper bound on the amount of memory you could ever map into a guest. Today. We could easily lift this restriction by supporting 64-bit BARs. It would probably take only a few lines of code. Since you're using qemu_ram_alloc() also, it makes hotplug unworkable too since qemu_ram_alloc() is a static allocation from a contiguous heap. We need to fix this anyway, for memory hotplug. It's going to be hard to fix with TCG. If you used virtio, what you could do is provide a ring queue that was used to communicate a series of requests/response. The exchange might look like this: guest: REQ discover memory region host: RSP memory region id: 4 size: 8k guest: REQ map region id: 4 size: 8k: sgl: {(addr=43000, size=4k), (addr=944000,size=4k)} host: RSP mapped region id: 4 guest: REQ notify region id: 4 host: RSP notify region id: 4 guest: REQ poll region id: 4 host: RSP poll region id: 4 That looks significantly more complex. It's also supporting dynamic shared memory. If you do use BARs, then perhaps you'd just do PCI hotplug to make things dynamic. And the REQ/RSP order does not have to be in series like this. In general, you need one entry on the queue to poll for new memory regions, one entry for each mapped region to poll for incoming notification, and then the remaining entries can be used to send short-lived requests/responses. It's important that the REQ map takes a scatter/gather list of physical addresses because after running for a while, it's unlikely that you'll be able to allocate any significant size of contiguous memory. 
From a QEMU perspective, you would do memory sharing by waiting for a map REQ from the guest and then you would complete the request by doing an mmap(MAP_FIXED) with the appropriate parameters into phys_ram_base. That will fragment the vma list. And what do you do when you unmap the region? How does a 256M guest map 1G of shared memory? It doesn't but it couldn't today either b/c of the 32-bit BARs. Regards, Anthony Liguori -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
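For clarity, the host-side step being debated (overlaying part of guest RAM with a shared file via mmap(MAP_FIXED)) might look roughly like the sketch below. phys_ram_base is the qemu global of that era, while map_shared_region, ram_offset, shm_fd and region_size are illustrative names, not an existing API.

#include <sys/mman.h>
#include <stdio.h>

/* Overlay [ram_offset, ram_offset + region_size) of guest RAM with shm_fd. */
static int map_shared_region(void *phys_ram_base, size_t ram_offset,
                             int shm_fd, size_t region_size)
{
    void *target = (char *)phys_ram_base + ram_offset;
    void *p = mmap(target, region_size, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_FIXED, shm_fd, 0);

    if (p == MAP_FAILED) {
        perror("mmap");
        return -1;
    }
    return 0;	/* the guest pages in that range now alias the shared file */
}

Each such overlay splits the original guest-RAM mapping into separate vmas, which is the fragmentation concern raised above.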
Re: EPT support breakage on: KVM: VMX: Zero ept module parameter if ept is not present
Sheng Yang wrote: Oops... Thanks very much for reporting! I can't believe we weren't aware of that... Could you please try the attached patch? Thanks! Tested and works great. Thanks! -Andrew

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index aba41ae..8d6465b 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1195,15 +1195,6 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 		      vmx_capability.ept, vmx_capability.vpid);
 	}
 
-	if (!cpu_has_vmx_vpid())
-		enable_vpid = 0;
-
-	if (!cpu_has_vmx_ept())
-		enable_ept = 0;
-
-	if (!(vmcs_config.cpu_based_2nd_exec_ctrl & SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
-		flexpriority_enabled = 0;
-
 	min = 0;
 #ifdef CONFIG_X86_64
 	min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
@@ -1307,6 +1298,15 @@ static __init int hardware_setup(void)
 	if (boot_cpu_has(X86_FEATURE_NX))
 		kvm_enable_efer_bits(EFER_NX);
 
+	if (!cpu_has_vmx_vpid())
+		enable_vpid = 0;
+
+	if (!cpu_has_vmx_ept())
+		enable_ept = 0;
+
+	if (!(vmcs_config.cpu_based_2nd_exec_ctrl & SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
+		flexpriority_enabled = 0;
+
 	return alloc_kvm_area();
 }

-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 00/17] virtual-bus
Andi Kleen wrote: On Wed, Apr 01, 2009 at 10:19:49AM -0400, Gregory Haskins wrote: But surely you must have some specific use case in mind? Something that it does better than the various methods that are available today. Or rather there must be some problem you're trying to solve. I'm just not sure what that problem exactly is. Performance. We are trying to create a high performance IO infrastructure. Ok. So the goal is to bypass user space qemu completely for better performance. Can you please put this into the initial patch description? Yes, good point. I will be sure to be more explicit in the next rev. So the administrator can then set these attributes as desired to manipulate the configuration of the instance of the device, on a per device basis. How would the guest learn of any changes in there? The only events explicitly supported by the infrastructure of this nature would be device-add and device-remove. So when an admin adds or removes a device to a bus, the guest would see driver::probe() and driver::remove() callbacks, respectively. All other events are left (by design) to be handled by the device ABI itself, presumably over the provided shm infrastructure. So for instance, I have on my todo list to add a third shm-ring for events in the venet ABI. One of the event-types I would like to support is LINK_UP and LINK_DOWN. These events would be coupled to the administrative manipulation of the enabled attribute in sysfs. Other event-types could be added as needed/appropriate. I decided to do it this way because I felt it didn't make sense for me to expose the attributes directly, since they are often back-end specific anyway. Therefore I leave it to the device-specific ABI which has all the necessary tools for async events built in. I think the interesting part would be how e.g. a vnet device would be connected to the outside interfaces. Ah, good question. This ties into the statement I made earlier about how presumably the administrative agent would know what a module is and how it works. As part of this, they would also handle any kind of additional work, such as wiring the backend up. Here is a script that I use for testing that demonstrates this:

--
#!/bin/bash

set -e

modprobe venet-tap
mount -t configfs configfs /config

bridge=vbus-br0

brctl addbr $bridge
brctl setfd $bridge 0
ifconfig $bridge up

createtap() {
    mkdir /config/vbus/devices/$1-dev
    echo venet-tap > /config/vbus/devices/$1-dev/type
    mkdir /config/vbus/instances/$1-bus
    ln -s /config/vbus/devices/$1-dev /config/vbus/instances/$1-bus
    echo 1 > /sys/vbus/devices/$1-dev/enabled

    ifname=$(cat /sys/vbus/devices/$1-dev/ifname)
    ifconfig $ifname up
    brctl addif $bridge $ifname
}

createtap client
createtap server

This script creates two buses (client-bus and server-bus), instantiates a single venet-tap on each of them, and then wires them together with a private bridge instance called vbus-br0. To complete the picture here, you would want to launch two kvms, one of each of the client-bus/server-bus instances. You can do this via /proc/$pid/vbus. E.g.

# (echo client-bus > /proc/self/vbus; qemu-kvm -hda client.img)
# (echo server-bus > /proc/self/vbus; qemu-kvm -hda server.img)

(And as noted, someday qemu will be able to do all the setup that the script did, natively. It would wire whatever tap it created to an existing bridge with qemu-ifup, just like we do for tun-taps today) One of the key details is where I do ifname=$(cat /sys/vbus/devices/$1-dev/ifname).
The ifname attribute of the venet-tap is a read-only attribute that reports back the netif interface name that was returned when the device did a register_netdev() (e.g. eth3). This register_netdev() operation occurs as a result of echoing the 1 into the enabled attribute. Deferring the registration until the admin explicitly does an enable gives the admin a chance to change the MAC address of the virtual-adapter before it is registered (note: the current code doesn't support rw on the mac attributes yet.. I need a parser first). So the admin would instantiate this vdisk device and do: 'echo /path/to/my/exported/disk.dat > /sys/vbus/devices/foo/src_path' So it would act like a loop device? Would you reuse the loop device or write something new? Well, keeping in mind that I haven't even looked at writing a block device for this infrastructure yet... my blanket statement would be let's reuse as much as possible ;) If the existing loop infrastructure would work here, great! How about VFS mount name spaces? Yeah, ultimately I would love to be able to support a fairly wide range of the normal userspace/kernel ABI through this mechanism. In fact, one of my original design goals was to somehow expose the syscall ABI directly via some kind of syscall proxy device on the bus. I have
Re: [PATCH] Add shared memory PCI device that shares a memory object betweens VMs
Hi Anthony and Avi, Anthony Liguori wrote: Avi Kivity wrote: Anthony Liguori wrote: Hi Cam, I would suggest two design changes to make here. The first is that I think you should use virtio. I disagree with this. While virtio is excellent at exporting guest memory, it isn't so good at importing another guest's memory. First we need to separate static memory sharing and dynamic memory sharing. Static memory sharing has to be configured on start up. I think in practice, static memory sharing is not terribly interesting except for maybe embedded environments. I think there is value for static memory sharing. It can be used for fast, simple synchronization and communication between guests (and the host) that use need to share data that needs to be updated frequently (such as a simple cache or notification system). It may not be a common task, but I think static sharing has its place and that's what this device is for at this point. Dynamically memory sharing requires bidirectional communication in order to establish mappings and tear down mappings. You'll eventually recreate virtio once you've implemented this communication mechanism. The second is that I think instead of relying on mapping in device memory to the guest, you should have the guest allocate it's own memory to dedicate to sharing. That's not what you describe below. You're having the guest allocate parts of its address space that happen to be used by RAM, and overlaying those parts with the shared memory. But from the guest's perspective, it's RAM is being used for memory sharing. If you're clever, you could start a guest with -mem-path and then use this mechanism to map a portion of one guest's memory into another guest without either guest ever knowing who owns the memory and with exactly the same driver on both. Right now, you've got a bit of a hole in your implementation because you only support files that are powers-of-two in size even though that's not documented/enforced. This is a limitation of PCI resource regions. While the BAR needs to be a power of two, I don't think the RAM backing it needs to be. Then you need a side channel to communicate the information to the guest. Couldn't one of the registers in BAR0 be used to store the actual (non-power-of-two) size? Also, the PCI memory hole is limited in size today which is going to put an upper bound on the amount of memory you could ever map into a guest. Today. We could easily lift this restriction by supporting 64-bit BARs. It would probably take only a few lines of code. Since you're using qemu_ram_alloc() also, it makes hotplug unworkable too since qemu_ram_alloc() is a static allocation from a contiguous heap. We need to fix this anyway, for memory hotplug. It's going to be hard to fix with TCG. If you used virtio, what you could do is provide a ring queue that was used to communicate a series of requests/response. The exchange might look like this: guest: REQ discover memory region host: RSP memory region id: 4 size: 8k guest: REQ map region id: 4 size: 8k: sgl: {(addr=43000, size=4k), (addr=944000,size=4k)} host: RSP mapped region id: 4 guest: REQ notify region id: 4 host: RSP notify region id: 4 guest: REQ poll region id: 4 host: RSP poll region id: 4 That looks significantly more complex. It's also supporting dynamic shared memory. If you do use BARs, then perhaps you'd just do PCI hotplug to make things dynamic. And the REQ/RSP order does not have to be in series like this. 
In general, you need one entry on the queue to poll for new memory regions, one entry for each mapped region to poll for incoming notification, and then the remaining entries can be used to send short-lived requests/responses. It's important that the REQ map takes a scatter/gather list of physical addresses because after running for a while, it's unlikely that you'll be able to allocate any significant size of contiguous memory. From a QEMU perspective, you would do memory sharing by waiting for a map REQ from the guest and then you would complete the request by doing an mmap(MAP_FIXED) with the appropriate parameters into phys_ram_base. That will fragment the vma list. And what do you do when you unmap the region? How does a 256M guest map 1G of shared memory? It doesn't but it couldn't today either b/c of the 32-bit BARs. Cam -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
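One way the BAR0-register idea mentioned above could work from the guest side is sketched below. The register offset SHMEM_REG_SIZE and the assumption that BAR0 holds a small register window while BAR2 holds the (power-of-two rounded) shared region are illustrative only, not part of the posted device.

#include <linux/pci.h>
#include <linux/io.h>

#define SHMEM_REG_SIZE	0x08	/* hypothetical: actual region size in bytes */

static int shmem_probe_size(struct pci_dev *pdev, void __iomem **regs,
			    resource_size_t *real_size)
{
	*regs = pci_iomap(pdev, 0, 0);	/* BAR0: register window */
	if (!*regs)
		return -ENOMEM;

	/* BAR2 is rounded up to a power of two; the register reports the
	 * real backing size, which may be smaller. */
	*real_size = readl(*regs + SHMEM_REG_SIZE);
	if (*real_size > pci_resource_len(pdev, 2))
		return -EINVAL;

	return 0;
}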
Re: OT: No vmx-Flag in Via Nano CPU on Samsung NC-20 Netbooks
On Sat, Mar 28, 2009 at 7:51 AM, Oliver Rath rat...@web.de wrote: I took a look at the new Samsung NC-20 Netbook with Via Nano Processor. Unfortunatly the vmx--bit looks to be disaabled on the Via Nano U2250. Tested with the newest Bios 7MC. Does anyone know more about this missing feature? It seems that VIA processors are not fully compatible with Intel VT specification, quoting Avi: Via engineers have contacted me and confirmed that this is a problem in the processor. Luca -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM on Via Nano (Isaiah) CPUs?
Via engineers have contacted me and confirmed that this is a problem in the processor. I'd like to clarify. Stepping 2 Nano processors do not support VMX. This should have been disabled by the BIOS. Support for VMX was not finished until stepping 3. If you have a stepping 2 processor with this enabled please let me know which platform it is on so we can have the manufacturer release a new BIOS. Jesse Ahrens Systems Engineer Centaur Technology -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 00/17] virtual-bus
* Anthony Liguori (anth...@codemonkey.ws) wrote: Andi Kleen wrote: On Wed, Apr 01, 2009 at 10:19:49AM -0400, Gregory Haskins wrote: Performance. We are trying to create a high performance IO infrastructure. Ok. So the goal is to bypass user space qemu completely for better performance. Can you please put this into the initial patch description? FWIW, there's nothing that prevents in-kernel back ends with virtio so vbus certainly isn't required for in-kernel backends. Indeed. That said, I don't think we're bound today by the fact that we're in userspace. Rather we're bound by the interfaces we have between the host kernel and userspace to generate IO. I'd rather fix those interfaces than put more stuff in the kernel. And more stuff in the kernel can come at the potential cost of weakening protection/isolation. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[ kvm-Bugs-2725669 ] kvm init script breaks network interfaces with multiple IPs
Bugs item #2725669, was opened at 2009-04-01 16:44 Message generated for change (Tracker Item Submitted) made by paulsd You can respond by visiting: https://sourceforge.net/tracker/?func=detailatid=893831aid=2725669group_id=180599 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: None Group: None Status: Open Resolution: None Priority: 5 Private: No Submitted By: Paul Donohue (paulsd) Assigned to: Nobody/Anonymous (nobody) Summary: kvm init script breaks network interfaces with multiple IPs Initial Comment: If multiple IP addresses are assigned to a network interface (Using interface aliases - for example 'ifconfig eth0 10.0.0.1 ; ifconfig eth0:1 10.0.0.2'), then the kvm init script causes the interface to become unresponsive when it creates a bridge using the interface. I haven't yet had a need to use bridging for my VMs, so I haven't yet tried to figure out how to properly configure a bridge when multiple IPs are in use on the host system (I assume the multiple IPs simply need to be configured using aliases of the bridge itself - for example 'ifconfig sw0 10.0.0.1 ; ifconfig sw0:1 10.0.0.2' - but I haven't actually tried it). Therefore, I am not sure at the moment how the kvm init script needs to be updated to fix this problem. Regardless, I do have a number of machines which are using multiple IPs on the host system, and I recently installed kvm on them, then discovered that after the next reboot of each machine, the network interface is unresponsive until I disable the kvm init script and reboot again. So, ideally the kvm init script should be updated to properly handle aliased interfaces, but at the very least, it needs to be updated to detect aliased interfaces and refuse to create a bridge for them, since that seems to completely break the underlying interface. -- You can respond by visiting: https://sourceforge.net/tracker/?func=detailatid=893831aid=2725669group_id=180599 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: OT: No vmx-Flag in Via Nano CPU on Samsung NC-20 Netbooks
On Wed, Apr 1, 2009 at 10:32 PM, Luca Tettamanti kronos...@gmail.com wrote: On Sat, Mar 28, 2009 at 7:51 AM, Oliver Rath rat...@web.de wrote: I took a look at the new Samsung NC-20 Netbook with Via Nano Processor. Unfortunatly the vmx--bit looks to be disaabled on the Via Nano U2250. Tested with the newest Bios 7MC. Does anyone know more about this missing feature? It seems that VIA processors are not fully compatible with Intel VT specification, quoting Avi: Via engineers have contacted me and confirmed that this is a problem in the processor. More info here: http://marc.info/?l=kvmm=123861829901077w=2 Luca -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 00/17] virtual-bus
Anthony Liguori wrote: Andi Kleen wrote: On Wed, Apr 01, 2009 at 10:19:49AM -0400, Gregory Haskins wrote: But surely you must have some specific use case in mind? Something that it does better than the various methods that are available today. Or rather there must be some problem you're trying to solve. I'm just not sure what that problem exactly is. Performance. We are trying to create a high performance IO infrastructure. Ok. So the goal is to bypass user space qemu completely for better performance. Can you please put this into the initial patch description? FWIW, there's nothing that prevents in-kernel back ends with virtio so vbus certainly isn't required for in-kernel backends. I think there is a slight disconnect here. This is *exactly* what I am trying to do. You can of course do this many ways, and I am not denying it could be done a different way than the path I have chosen. One extreme would be to just slam a virtio-net specific chunk of code directly into kvm on the host. Another extreme would be to build a generic framework into Linux for declaring arbitrary IO types, integrating it with kvm (as well as other environments such as lguest, userspace, etc), and building a virtio-net model on top of that. So in case it is not obvious at this point, I have gone with the latter approach. I wanted to make sure it wasn't kvm specific or something like pci specific so it had the broadest applicability to a range of environments. So that is why the design is the way it is. I understand that this approach is technically harder/more-complex than the slam virtio-net into kvm approach, but I've already done that work. All we need to do now is agree on the details ;) That said, I don't think we're bound today by the fact that we're in userspace. You will *always* be bound by the fact that you are in userspace. It's purely a question of how much and does anyone care. Right now, the answer is a lot (roughly 45x slower) and at least Greg's customers do. I have no doubt that this can and will change/improve in the future. But it will always be true that no matter how much userspace improves, the kernel based solution will always be faster. It's simple physics. I'm cutting out the middleman to ultimately reach the same destination as the userspace path, so userspace can never be equal. I agree that the "does anyone care" part of the equation will approach zero as the latency difference shrinks across some threshold (probably the single microsecond range), but I will believe that is even possible when I see it ;) Regards, -Greg signature.asc Description: OpenPGP digital signature
[ kvm-Bugs-2725669 ] kvm init script breaks network interfaces with multiple IPs
Bugs item #2725669, was opened at 2009-04-01 15:44 Message generated for change (Comment added) made by iggy_cav You can respond by visiting: https://sourceforge.net/tracker/?func=detailatid=893831aid=2725669group_id=180599 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: None Group: None Status: Open Resolution: None Priority: 5 Private: No Submitted By: Paul Donohue (paulsd) Assigned to: Nobody/Anonymous (nobody) Summary: kvm init script breaks network interfaces with multiple IPs Initial Comment: If multiple IP addresses are assigned to a network interface (Using interface aliases - for example 'ifconfig eth0 10.0.0.1 ; ifconfig eth0:1 10.0.0.2'), then the kvm init script causes the interface to become unresponsive when it creates a bridge using the interface. I haven't yet had a need to use bridging for my VMs, so I haven't yet tried to figure out how to properly configure a bridge when multiple IPs are in use on the host system (I assume the multiple IPs simply need to be configured using aliases of the bridge itself - for example 'ifconfig sw0 10.0.0.1 ; ifconfig sw0:1 10.0.0.2' - but I haven't actually tried it). Therefore, I am not sure at the moment how the kvm init script needs to be updated to fix this problem. Regardless, I do have a number of machines which are using multiple IPs on the host system, and I recently installed kvm on them, then discovered that after the next reboot of each machine, the network interface is unresponsive until I disable the kvm init script and reboot again. So, ideally the kvm init script should be updated to properly handle aliased interfaces, but at the very least, it needs to be updated to detect aliased interfaces and refuse to create a bridge for them, since that seems to completely break the underlying interface. -- Comment By: Brian Jackson (iggy_cav) Date: 2009-04-01 16:08 Message: KVM doesn't come with an init script in the tarball. This is most likely provided by your distro or some other third party. You should contact them for support. -- You can respond by visiting: https://sourceforge.net/tracker/?func=detailatid=893831aid=2725669group_id=180599 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 00/17] virtual-bus
Chris Wright wrote: And more stuff in the kernel can come at the potential cost of weakening protection/isolation. Note that the design of vbus should prevent any weakening...though if you see a hole, please point it out. (On that front, note that I still have some hardening to do, such as not calling BUG_ON() in venet-tap if the ring is in a funk, etc) Regards, -Greg signature.asc Description: OpenPGP digital signature
Re: [RFC PATCH 00/17] virtual-bus
* Gregory Haskins (ghask...@novell.com) wrote: Note that the design of vbus should prevent any weakening Could you elaborate? -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: OT: No vmx-Flag in Via Nano CPU on Samsung NC-20 Netbooks
Hi Luca! Luca Tettamanti wrote: [..] It seems that VIA processors are not fully compatible with Intel VT specification, quoting Avi: Via engineers have contacted me and confirmed that this is a problem in the processor. More info here: http://marc.info/?l=kvm&m=123861829901077&w=2 Luca -- Thank you so much for this info! Neither Via support nor Samsung support was able (or willing?) to answer this question :-( Maybe we should correct the Wikipedia entry for the Via Nano accordingly? Best regards Oliver -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 00/17] virtual-bus
Chris Wright wrote: * Gregory Haskins (ghask...@novell.com) wrote: Note that the design of vbus should prevent any weakening Could you elaborate? Absolutely. So you said that something in the kernel could weaken the protection/isolation. And I fully agree that whatever we do here has to be done carefully...more carefully than a userspace derived counterpart, naturally. So to address this, I put in various mechanisms to (hopefully? :) ensure we can still maintain proper isolation, as well as protect the host, other guests, and other applications from corruption. Here are some of the highlights: *) As I mentioned, a vbus is a form of a kernel-resource-container. It is designed so that the view of a vbus is a unique namespace of device-ids. Each bus has its own individual namespace that consist solely of the devices that have been placed on that bus. The only way to create a bus, and/or create a device on a bus, is via the administrative interface on the host. *) A task can only associate with, at most, one vbus at a time. This means that a task can only see the device-id namespace of the devices on its associated bus and thats it. This is enforced by the host kernel by placing a reference to the associated vbus on the task-struct itself. Again, the only way to modify this association is via a host based administrative operation. Note that multiple tasks can associate to the same vbus, which would commonly be used by all threads in an app, or all vcpus in a guest, etc. *) the asynchronous nature of the shm/ring interfaces implies we have the potential for asynchronous faults. E.g. crap in the ring might not be discovered at the EIP of the guest vcpu when it actually inserts the crap, but rather later when the host side tries to update the ring. A naive implementation would have the host do a BUG_ON() when it discovers the discrepancy (note that I still have a few of these to fix in the venet-tap code). Instead, what should happen is that we utilize an asynchronous fault mechanism that allows the guest to always be the one punished (via something like a machine-check for guests, or SIGABRT for userspace, etc) *) south-to-north path signaling robustness. Because vbus supports a variety of different environments, I call guest/userspace north', and the host/kernel south. When the north wants to communicate with the kernel, its perfectly ok to stall the north indefinitely if the south is not ready. However, it is not really ok to stall the south when communicating with the north because this is an attack vector. E.g. a malicous/broken guest could just stop servicing its ring to cause threads in the host to jam up. This is bad. :) So what we do is we design all south-to-north signaling paths to be robust against stalling. What they do instead is manage backpressure a little bit more intelligently than simply blocking like they might in the guest. For instance, in venet-tap, a transmit from netif that has to be injected in the south-to-north ring when it is full will result in a netif_stop_queue(). etc. I cant think of more examples right now, but I will update this list if/when I come up with more. I hope that satisfactorily answered your question, though! Regards, -Greg signature.asc Description: OpenPGP digital signature
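As an illustration of the south-to-north backpressure point above, a host-side transmit path that refuses to stall on a non-cooperative guest might look roughly like the following sketch. The ring helpers and the venettap_priv structure are hypothetical stand-ins, not the actual venet-tap code.

#include <linux/netdevice.h>
#include <linux/skbuff.h>

struct shm_ring;			/* hypothetical guest-bound ring */
extern int  ring_full(struct shm_ring *r);
extern void ring_push(struct shm_ring *r, struct sk_buff *skb);
extern void ring_signal(struct shm_ring *r);

struct venettap_priv {
	struct shm_ring *south_to_north;
};

static int venettap_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct venettap_priv *priv = netdev_priv(dev);

	if (ring_full(priv->south_to_north)) {
		/* Never block the host thread on a guest that has stopped
		 * servicing its ring; push the backpressure up the stack. */
		netif_stop_queue(dev);
		return NETDEV_TX_BUSY;
	}

	ring_push(priv->south_to_north, skb);
	ring_signal(priv->south_to_north);	/* doorbell/interrupt to the guest */
	return NETDEV_TX_OK;
}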
Re: [RFC PATCH 00/17] virtual-bus
On Wed, Apr 01, 2009 at 04:29:57PM -0400, Gregory Haskins wrote: description? Yes, good point. I will be sure to be more explicit in the next rev. So the administrator can then set these attributes as desired to manipulate the configuration of the instance of the device, on a per device basis. How would the guest learn of any changes in there? The only events explicitly supported by the infrastructure of this nature would be device-add and device-remove. So when an admin adds or removes a device to a bus, the guest would see driver::probe() and driver::remove() callbacks, respectively. All other events are left (by design) to be handled by the device ABI itself, presumably over the provided shm infrastructure. Ok so you rely on a transaction model where everything is set up before it is somehow comitted to the guest? I hope that is made explicit in the interface somehow. This script creates two buses (client-bus and server-bus), instantiates a single venet-tap on each of them, and then wires them together with a private bridge instance called vbus-br0. To complete the picture here, you would want to launch two kvms, one of each of the client-bus/server-bus instances. You can do this via /proc/$pid/vbus. E.g. # (echo client-bus /proc/self/vbus; qemu-kvm -hda client.img) # (echo server-bus /proc/self/vbus; qemu-kvm -hda server.img) (And as noted, someday qemu will be able to do all the setup that the script did, natively. It would wire whatever tap it created to an existing bridge with qemu-ifup, just like we do for tun-taps today) The usual problem with that is permissions. Just making qemu-ifup suid it not very nice. It would be good if any new design addressed this. the current code doesnt support rw on the mac attributes yet..i need a parser first). parser in kernel space always sounds scary to me. Yeah, ultimately I would love to be able to support a fairly wide range of the normal userspace/kernel ABI through this mechanism. In fact, one of my original design goals was to somehow expose the syscall ABI directly via some kind of syscall proxy device on the bus. I have since That sounds really scary for security. backed away from that idea once I started thinking about things some more and realized that a significant number of system calls are really inappropriate for a guest type environment due to their ability to block. We really dont want a vcpu to block.however, the AIO type Not only because of blocking, but also because of security issues. After all one of the usual reasons to run a guest is security isolation. In general the more powerful the guest API the more risky it is, so some self moderation is probably a good thing. -Andi -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] add ksm kernel shared memory driver.
Anthony Liguori wrote: Andrea Arcangeli wrote: On Tue, Mar 31, 2009 at 10:54:57AM -0500, Anthony Liguori wrote: You can still disable ksm and simply return ENOSYS for the MADV_ flag. You Anthony, the biggest problem with madvise() is that it is a real system call API; I wouldn't want, at this stage of the ksm submission, to commit to API changes in Linux... The ioctl itself is restricting; madvise is much more so... Can we defer this issue until after ksm is merged, and after all the big new features that we want to add to ksm are merged (then the API would be much more stable, and we will be able to ask people on the list about changing the API, but for a new driver that is yet to be merged, it is kind of overkill to add an API to Linux)? What do you think? -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 00/17] virtual-bus
Andi Kleen wrote: On Wed, Apr 01, 2009 at 04:29:57PM -0400, Gregory Haskins wrote: description? Yes, good point. I will be sure to be more explicit in the next rev. So the administrator can then set these attributes as desired to manipulate the configuration of the instance of the device, on a per device basis. How would the guest learn of any changes in there? The only events explicitly supported by the infrastructure of this nature would be device-add and device-remove. So when an admin adds or removes a device to a bus, the guest would see driver::probe() and driver::remove() callbacks, respectively. All other events are left (by design) to be handled by the device ABI itself, presumably over the provided shm infrastructure. Ok so you rely on a transaction model where everything is set up before it is somehow comitted to the guest? I hope that is made explicit in the interface somehow. Well, its not an explicit transaction model, but I guess you could think of it that way. Generally you set the device up before you launch the guest. By the time the guest loads and tries to scan the bus for the initial discovery, all the devices would be ready to go. This does bring up the question of hotswap. Today we fully support hotswap in and out, but leaving this enabled transaction to the individual device means that the device-id would be visible in the bus namespace before the device may want to actually communicate. Hmmm Perhaps I need to build this in as a more explicit enabled feature...and the guest will not see the driver::probe() until this happens. This script creates two buses (client-bus and server-bus), instantiates a single venet-tap on each of them, and then wires them together with a private bridge instance called vbus-br0. To complete the picture here, you would want to launch two kvms, one of each of the client-bus/server-bus instances. You can do this via /proc/$pid/vbus. E.g. # (echo client-bus /proc/self/vbus; qemu-kvm -hda client.img) # (echo server-bus /proc/self/vbus; qemu-kvm -hda server.img) (And as noted, someday qemu will be able to do all the setup that the script did, natively. It would wire whatever tap it created to an existing bridge with qemu-ifup, just like we do for tun-taps today) The usual problem with that is permissions. Just making qemu-ifup suid it not very nice. It would be good if any new design addressed this. Well, its kind of out of my control. venet-tap ultimately creates a simple netif interface which we must do something with. Once its created, wiring it up to something like a linux-bridge is no different than something like a tun-tap, so the qemu-ifup requirement doesn't change. The one thing I can think of is it would be possible to build a venet-switch module, and this could be done without using brctl or qemu-ifup...but then I would lose all the benefits of re-using that infrastructure. I do not recommend we actually do this, but it would technically be a way to address your concern. the current code doesnt support rw on the mac attributes yet..i need a parser first). parser in kernel space always sounds scary to me. Heh..why do you think I keep procrastinating ;) Yeah, ultimately I would love to be able to support a fairly wide range of the normal userspace/kernel ABI through this mechanism. In fact, one of my original design goals was to somehow expose the syscall ABI directly via some kind of syscall proxy device on the bus. I have since That sounds really scary for security. 
backed away from that idea once I started thinking about things some more and realized that a significant number of system calls are really inappropriate for a guest type environment due to their ability to block. We really don't want a vcpu to block. However, the AIO type Not only because of blocking, but also because of security issues. After all one of the usual reasons to run a guest is security isolation. Oh yeah, totally agreed. Not that I am advocating this, because I have abandoned the idea. But back when I was thinking of this, I would have addressed the security with the vbus and syscall-proxy-device objects themselves. E.g. if you don't instantiate a syscall-proxy-device on the bus, the guest wouldn't have access to syscalls at all. And you could put filters into the module to limit what syscalls were allowed, which UID to make the guest appear as, etc. In general the more powerful the guest API the more risky it is, so some self-moderation is probably a good thing. :) -Greg signature.asc Description: OpenPGP digital signature
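For readers following the enabled/ifname discussion earlier in this thread, a minimal sketch of that attribute pattern (deferring register_netdev() until the admin writes 1 to enabled, and exposing the resulting interface name read-only) is below. It uses plain driver-model sysfs attributes for illustration and is not the actual vbus attribute code.

#include <linux/device.h>
#include <linux/netdevice.h>

struct venettap {
	struct device      dev;
	struct net_device *netdev;
	int                enabled;
};

static ssize_t enabled_store(struct device *dev, struct device_attribute *attr,
			     const char *buf, size_t count)
{
	struct venettap *priv = container_of(dev, struct venettap, dev);
	int ret;

	if (buf[0] == '1' && !priv->enabled) {
		ret = register_netdev(priv->netdev);	/* netif appears, e.g. eth3 */
		if (ret)
			return ret;
		priv->enabled = 1;
	}
	return count;
}

static ssize_t ifname_show(struct device *dev, struct device_attribute *attr,
			   char *buf)
{
	struct venettap *priv = container_of(dev, struct venettap, dev);

	return sprintf(buf, "%s\n", priv->enabled ? priv->netdev->name : "");
}

static DEVICE_ATTR(enabled, 0644, NULL, enabled_store);
static DEVICE_ATTR(ifname, 0444, ifname_show, NULL);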
[ kvm-Bugs-2725669 ] kvm init script breaks network interfaces with multiple IPs
Bugs item #2725669, was opened at 2009-04-01 16:44 Message generated for change (Comment added) made by paulsd You can respond by visiting: https://sourceforge.net/tracker/?func=detailatid=893831aid=2725669group_id=180599 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: None Group: None Status: Open Resolution: None Priority: 5 Private: No Submitted By: Paul Donohue (paulsd) Assigned to: Nobody/Anonymous (nobody) Summary: kvm init script breaks network interfaces with multiple IPs Initial Comment: If multiple IP addresses are assigned to a network interface (Using interface aliases - for example 'ifconfig eth0 10.0.0.1 ; ifconfig eth0:1 10.0.0.2'), then the kvm init script causes the interface to become unresponsive when it creates a bridge using the interface. I haven't yet had a need to use bridging for my VMs, so I haven't yet tried to figure out how to properly configure a bridge when multiple IPs are in use on the host system (I assume the multiple IPs simply need to be configured using aliases of the bridge itself - for example 'ifconfig sw0 10.0.0.1 ; ifconfig sw0:1 10.0.0.2' - but I haven't actually tried it). Therefore, I am not sure at the moment how the kvm init script needs to be updated to fix this problem. Regardless, I do have a number of machines which are using multiple IPs on the host system, and I recently installed kvm on them, then discovered that after the next reboot of each machine, the network interface is unresponsive until I disable the kvm init script and reboot again. So, ideally the kvm init script should be updated to properly handle aliased interfaces, but at the very least, it needs to be updated to detect aliased interfaces and refuse to create a bridge for them, since that seems to completely break the underlying interface. -- Comment By: Paul Donohue (paulsd) Date: 2009-04-01 19:48 Message: Yes, it does, in the userspace tree, under the scripts subdirectory: http://git.kernel.org/?p=virt/kvm/kvm-userspace.git;a=blob;f=scripts/kvm;h=cddc931fd3b289f3c325e23b55f261e996328bd6;hb=HEAD -- Comment By: Brian Jackson (iggy_cav) Date: 2009-04-01 17:08 Message: KVM doesn't come with an init script in the tarball. This is most likely provided by your distro or some other third party. You should contact them for support. -- You can respond by visiting: https://sourceforge.net/tracker/?func=detailatid=893831aid=2725669group_id=180599 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 00/17] virtual-bus
Gregory Haskins wrote: Anthony Liguori wrote: I think there is a slight disconnect here. This is *exactly* what I am trying to do. If it were exactly what you were trying to do, you would have posted a virtio-net in-kernel backend implementation instead of a whole new paravirtual IO framework ;-) That said, I don't think we're bound today by the fact that we're in userspace. You will *always* be bound by the fact that you are in userspace. Again, let's talk numbers. A heavy-weight exit is 1us slower than a light weight exit. Ideally, you're taking 1 exit per packet because you're batching notifications. If your ping latency on bare metal compared to vbus is 39us to 65us, then all other things being equal, the cost imposed by doing what you're doing in userspace would make the latency be 66us, taking your latency from 166% of native to 169% of native. That's not a huge difference and I'm sure you'll agree there are a lot of opportunities to improve that even further. And you didn't mention whether your latency tests are based on ping or something more sophisticated, as ping will be a pathological case that doesn't allow any notification batching. I agree that the "does anyone care" part of the equation will approach zero as the latency difference shrinks across some threshold (probably the single microsecond range), but I will believe that is even possible when I see it ;) Note that the other hat we have to wear is not just virtualization developer but Linux developer. If there are bad userspace interfaces for IO that impose artificial restrictions, then we need to identify those and fix them. Regards, Anthony Liguori Regards, -Greg -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] add ksm kernel shared memory driver.
Izik Eidus wrote: Anthony, the biggest problem about madvice() is that it is a real system call api, i wouldnt want in that stage of ksm commit into api changes of linux... The ioctl itself is restricting, madvice is much more..., Can we draft this issue to after ksm is merged, and after all the big new fetures that we want to add to ksm will be merge (then the api would be much more stable, and we will be able to ask ppl in the list about changing of api, but for new driver that it yet to be merged, it is kind of overkill to add api to linux) What do you think? You can't change ABIs after something is merged or you break userspace. So you need to figure out the right ABI first. Regards, Anthony Liguori -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] add ksm kernel shared memory driver.
* Anthony Liguori (anth...@codemonkey.ws) wrote: You can't change ABIs after something is merged or you break userspace. So you need to figure out the right ABI first. Absolutely. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] add ksm kernel shared memory driver.
* Anthony Liguori (anth...@codemonkey.ws) wrote: The ioctl() interface is quite bad for what you're doing. You're telling the kernel extra information about a VA range in userspace. That's what madvise is for. You're tweaking simple read/write values of kernel infrastructure. That's what sysfs is for. I agree re: sysfs (brought it up myself before). As far as madvise vs. ioctl, the one thing that comes from the ioctl is fops->release to automagically unregister memory on exit. This needs to be handled anyway if some -p pid is added to add a process after it's running, so less weight there. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
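To make the contrast concrete, the madvise-style registration being argued for would look roughly like this from userspace. MADV_MERGEABLE and its value are assumptions for this sketch only; the posted patches do not define such a flag.

#include <sys/mman.h>
#include <stdio.h>

#ifndef MADV_MERGEABLE
#define MADV_MERGEABLE 12	/* assumed value, purely for illustration */
#endif

/* Ask the kernel to consider this VA range for page merging. */
static int register_region_for_ksm(void *addr, size_t len)
{
	if (madvise(addr, len, MADV_MERGEABLE) < 0) {
		perror("madvise");	/* e.g. ENOSYS if the kernel lacks ksm */
		return -1;
	}
	return 0;
}

Note that with madvise there is no file descriptor, so the automatic cleanup that fops->release provides would instead have to hang off the mm itself when the process exits.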
Re: [RFC PATCH 00/17] virtual-bus
On Wednesday 01 April 2009 22:05:39 Gregory Haskins wrote: Rusty Russell wrote: I could dig through the code, but I'll ask directly: what heuristic do you use for notification prevention in your venet_tap driver? I am not 100% sure I know what you mean with notification prevention, but let me take a stab at it. Good stab :) I only signal back to the guest to reclaim its skbs every 10 packets, or if I drain the queue, whichever comes first (note to self: make this # configurable). Good stab, though I was referring to guest-host signals (I'll assume you use a similar scheme there). You use a number of packets, qemu uses a timer (150usec), lguest uses a variable timer (starting at 500usec, dropping by 1 every time but increasing by 10 every time we get fewer packets than last time). So, if the guest sends two packets and stops, you'll hang indefinitely? That's why we use a timer, otherwise any mitigation scheme has this issue. Thanks, Rusty. The nice part about this scheme is it significantly reduces the amount of guest/host transitions, while still providing the lowest latency response for single packets possible. e.g. Send one packet, and you get one hypercall, and one tx-complete interrupt as soon as it queues on the hardware. Send 100 packets, and you get one hypercall and 10 tx-complete interrupts as frequently as every tenth packet queues on the hardware. There is no timer governing the flow, etc. Is that what you were asking? As you point out, 350-450 is possible, which is still bad, and it's at least partially caused by the exit to userspace and two system calls. If virtio_net had a backend in the kernel, we'd be able to compare numbers properly. :) But that is the whole point, isnt it? I created vbus specifically as a framework for putting things in the kernel, and that *is* one of the major reasons it is faster than virtio-net...its not the difference in, say, IOQs vs virtio-ring (though note I also think some of the innovations we have added such as bi-dir napi are helping too, but these are not in-kernel specific kinds of features and could probably help the userspace version too). I would be entirely happy if you guys accepted the general concept and framework of vbus, and then worked with me to actually convert what I have as venet-tap into essentially an in-kernel virtio-net. I am not specifically interested in creating a competing pv-net driver...I just needed something to showcase the concepts and I didnt want to hack the virtio-net infrastructure to do it until I had everyone's blessing. Note to maintainers: I *am* perfectly willing to maintain the venet drivers if, for some reason, we decide that we want to keep them as is. Its just an ideal for me to collapse virtio-net and venet-tap together, and I suspect our community would prefer this as well. -Greg -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
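A sketch of the hybrid mitigation scheme being discussed here (batch completions by count, but arm a fallback timer so a guest that sends two packets and then stops still gets its completion signal) could look like the following. The signal_guest_tx_complete() doorbell and the constants are hypothetical.

#include <linux/timer.h>
#include <linux/jiffies.h>

extern void signal_guest_tx_complete(void);	/* hypothetical doorbell */

#define MITIGATE_COUNT		10			/* signal every N completions */
#define MITIGATE_TIMEOUT	usecs_to_jiffies(150)	/* qemu-style fallback */

struct mitigation {
	struct timer_list timer;
	unsigned int	  pending;
};

static void mitigation_flush(struct mitigation *m)
{
	del_timer(&m->timer);
	m->pending = 0;
	signal_guest_tx_complete();
}

static void mitigation_timeout(unsigned long data)
{
	mitigation_flush((struct mitigation *)data);
}

static void mitigation_init(struct mitigation *m)
{
	m->pending = 0;
	setup_timer(&m->timer, mitigation_timeout, (unsigned long)m);
}

/* Called for each completed packet. */
static void mitigation_add(struct mitigation *m)
{
	if (++m->pending >= MITIGATE_COUNT) {
		mitigation_flush(m);
		return;
	}
	/* If the guest goes quiet, the timer delivers the pending signal. */
	mod_timer(&m->timer, jiffies + MITIGATE_TIMEOUT);
}

In practice a 150us timeout is below one jiffy on most configurations, so a real implementation would likely want an hrtimer rather than a timer_list.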
[PATCH] kvm : qemu : fix compilation error in kvm-userspace for ia64
When using make in kernel, it cannot find msidef.h. This patch fixes this. Signed-off-by: Yang Zhang yang.zh...@intel.com
diff --git a/kernel/include-compat/asm-ia64/msidef.h b/kernel/include-compat/asm-ia64/msidef.h
new file mode 100644
index 000..592c104
--- /dev/null
+++ b/kernel/include-compat/asm-ia64/msidef.h
@@ -0,0 +1,42 @@
+#ifndef _IA64_MSI_DEF_H
+#define _IA64_MSI_DEF_H
+
+/*
+ * Shifts for APIC-based data
+ */
+
+#define MSI_DATA_VECTOR_SHIFT           0
+#define MSI_DATA_VECTOR(v)              (((u8)v) << MSI_DATA_VECTOR_SHIFT)
+#define MSI_DATA_VECTOR_MASK            0xff00
+
+#define MSI_DATA_DELIVERY_MODE_SHIFT    8
+#define MSI_DATA_DELIVERY_FIXED         (0 << MSI_DATA_DELIVERY_MODE_SHIFT)
+#define MSI_DATA_DELIVERY_LOWPRI        (1 << MSI_DATA_DELIVERY_MODE_SHIFT)
+
+#define MSI_DATA_LEVEL_SHIFT            14
+#define MSI_DATA_LEVEL_DEASSERT         (0 << MSI_DATA_LEVEL_SHIFT)
+#define MSI_DATA_LEVEL_ASSERT           (1 << MSI_DATA_LEVEL_SHIFT)
+
+#define MSI_DATA_TRIGGER_SHIFT          15
+#define MSI_DATA_TRIGGER_EDGE           (0 << MSI_DATA_TRIGGER_SHIFT)
+#define MSI_DATA_TRIGGER_LEVEL          (1 << MSI_DATA_TRIGGER_SHIFT)
+
+/*
+ * Shift/mask fields for APIC-based bus address
+ */
+
+#define MSI_ADDR_DEST_ID_SHIFT          4
+#define MSI_ADDR_HEADER                 0xfee0
+
+#define MSI_ADDR_DEST_ID_MASK           0xffff
+#define MSI_ADDR_DEST_ID_CPU(cpu)       ((cpu) << MSI_ADDR_DEST_ID_SHIFT)
+
+#define MSI_ADDR_DEST_MODE_SHIFT        2
+#define MSI_ADDR_DEST_MODE_PHYS         (0 << MSI_ADDR_DEST_MODE_SHIFT)
+#define MSI_ADDR_DEST_MODE_LOGIC        (1 << MSI_ADDR_DEST_MODE_SHIFT)
+
+#define MSI_ADDR_REDIRECTION_SHIFT      3
+#define MSI_ADDR_REDIRECTION_CPU        (0 << MSI_ADDR_REDIRECTION_SHIFT)
+#define MSI_ADDR_REDIRECTION_LOWPRI     (1 << MSI_ADDR_REDIRECTION_SHIFT)
+
+#endif  /* _IA64_MSI_DEF_H */
--
1.6.0.rc1
-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] KVM: Qemu: Flush i-cache after ide-dma operation in IA64
The data from DMA may include instructions. In order to execute the right instructions, we should flush the i-cache to ensure that data can be seen by the CPU. Signed-off-by: Xiantao Zhang xiantao.zh...@intel.com Signed-off-by: Yang Zhang yang.zh...@intel.com
---
diff --git a/qemu/cache-utils.h b/qemu/cache-utils.h
index b45fde4..5e11d12 100644
--- a/qemu/cache-utils.h
+++ b/qemu/cache-utils.h
@@ -33,8 +33,22 @@ static inline void flush_icache_range(unsigned long start, unsigned long stop)
     asm volatile ("sync" : : : "memory");
     asm volatile ("isync" : : : "memory");
 }
+#define qemu_sync_idcache flush_icache_range
+#else
+#ifdef __ia64__
+static inline void qemu_sync_idcache(unsigned long start, unsigned long stop)
+{
+    while (start < stop) {
+        asm volatile ("fc %0" :: "r"(start));
+        start += 32;
+    }
+    asm volatile (";;sync.i;;srlz.i;;");
+}
 #else
+static inline void qemu_sync_idcache(unsigned long start, unsigned long stop)
+#endif
+
 #define qemu_cache_utils_init(envp) do { (void) (envp); } while (0)
 #endif
diff --git a/qemu/cutils.c b/qemu/cutils.c
index 5b36cc6..7b57173 100644
--- a/qemu/cutils.c
+++ b/qemu/cutils.c
@@ -23,6 +23,7 @@
  */
 #include "qemu-common.h"
 #include "host-utils.h"
+#include "cache-utils.h"
 #include <assert.h>

 void pstrcpy(char *buf, int buf_size, const char *str)
@@ -215,6 +216,8 @@ void qemu_iovec_from_buffer(QEMUIOVector *qiov, const void *buf, size_t count)
         if (copy > qiov->iov[i].iov_len)
             copy = qiov->iov[i].iov_len;
         memcpy(qiov->iov[i].iov_base, p, copy);
+        qemu_sync_idcache((unsigned long)qiov->iov[i].iov_base,
+                          (unsigned long)(qiov->iov[i].iov_base + copy));
         p += copy;
         count -= copy;
     }
--
1.6.0.rc1
0001-KVM-Qemu-Flush-icache-after-ide-dma-operation-in-IA64.patch Description: 0001-KVM-Qemu-Flush-icache-after-ide-dma-operation-in-IA64.patch
Re: [RFC PATCH 00/17] virtual-bus
Rusty Russell wrote: On Wednesday 01 April 2009 22:05:39 Gregory Haskins wrote: Rusty Russell wrote: I could dig through the code, but I'll ask directly: what heuristic do you use for notification prevention in your venet_tap driver? I am not 100% sure I know what you mean with notification prevention, but let me take a stab at it. Good stab :) I only signal back to the guest to reclaim its skbs every 10 packets, or if I drain the queue, whichever comes first (note to self: make this # configurable). Good stab, though I was referring to guest-host signals (I'll assume you use a similar scheme there). Oh, actually no. The guest-host path only uses the bidir napi thing I mentioned. So first packet hypercalls the host immediately with no delay, schedules my host-side rx thread, disables subsequent hypercalls, and returns to the guest. If the guest tries to send another packet before the time it takes the host to drain all queued skbs (in this case, 1), it will simply queue it to the ring with no additional hypercalls.Like typical napi ingress processing, the host will leave hypercalls disabled until it finds the ring empty, so this process can continue indefinitely until the host catches up. Once fully drained, the host will re-enable the hypercall channel and subsequent transmissions will repeat the original process. In summary, infrequent transmissions will tend to have one hypercall per packet. Bursty transmissions will have one hypercall per burst (starting immediately with the first packet). In both cases, we minimize the latency to get the first packet out the door. So really the only place I am using a funky heuristic is the modulus 10 operation for tx-complete going host-guest. The rest are kind of standard napi event mitigation techniques. You use a number of packets, qemu uses a timer (150usec), lguest uses a variable timer (starting at 500usec, dropping by 1 every time but increasing by 10 every time we get fewer packets than last time). So, if the guest sends two packets and stops, you'll hang indefinitely? Shouldn't, no. The host will send tx-complete interrupts at *max* every 10 packets, but if it drains the queue before the modulus 10 expires, it will send a tx-complete immediately, right before it re-enables hypercalls. So there is no hang, and there is no delay. For reference, here is the modulus 10 signaling (./drivers/vbus/devices/venet-tap.c, line 584): http://git.kernel.org/?p=linux/kernel/git/ghaskins/vbus/linux-2.6.git;a=blob;f=drivers/vbus/devices/venet-tap.c;h=0ccb7ed94a1a8edd0cca269488f940f40fce20df;hb=master#l584 Here is the one that happens after the queue is fully drained (line 593) http://git.kernel.org/?p=linux/kernel/git/ghaskins/vbus/linux-2.6.git;a=blob;f=drivers/vbus/devices/venet-tap.c;h=0ccb7ed94a1a8edd0cca269488f940f40fce20df;hb=master#l593 and finally, here is where I re-enable hypercalls (or system calls if the driver is in userspace, etc) http://git.kernel.org/?p=linux/kernel/git/ghaskins/vbus/linux-2.6.git;a=blob;f=drivers/vbus/devices/venet-tap.c;h=0ccb7ed94a1a8edd0cca269488f940f40fce20df;hb=master#l600 That's why we use a timer, otherwise any mitigation scheme has this issue. I'm not sure I follow. I don't think I need a timer at all using this scheme, but perhaps I am missing something? Thanks Rusty! -Greg signature.asc Description: OpenPGP digital signature
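To make the flow Greg describes easier to follow, here is a hedged pseudocode sketch of the guest->host mitigation (illustrative names only, not the actual venet/venet-tap code):

/* Guest transmit path (sketch): only the first packet of a burst traps
 * to the host; later packets just ride the ring until the host re-arms
 * notifications.
 */
static void guest_xmit(struct ring *r, struct sk_buff *skb)
{
    ring_push(r, skb);
    if (r->notify_enabled) {        /* host wants a kick */
        r->notify_enabled = false;  /* host re-arms after draining */
        hypercall_kick_host();
    }
}

/* Host rx-thread (sketch): NAPI-style polling, tx-complete back to the
 * guest every 10 packets or on drain, then re-enable hypercalls.
 */
static void host_drain(struct ring *r)
{
    int done = 0;

    while (!ring_empty(r)) {
        forward_to_netdev(ring_pop(r));
        if (++done % 10 == 0)
            interrupt_guest_tx_complete(r);   /* reclaim every 10 packets */
    }
    if (done % 10)
        interrupt_guest_tx_complete(r);       /* ...or when fully drained */
    r->notify_enabled = true;                 /* re-enable hypercalls */
}

A real implementation has to recheck the ring once more after re-arming to close the race against a packet queued just before notify_enabled flips; the sketch omits that for brevity.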
[PATCH] KVM: Discard reserved bits checking on PDE bit 7-8
1. It's related to a Linux kernel bug which was fixed by Ingo in 07a66d7c53a538e1a9759954a82bb6c07365eff9. The original code had existed for quite a long time, and it would convert a PDE for a large page into a normal PDE. But it failed to fit the normal PDE well. With the code before Ingo's fix, the kernel would fail reserved bit checking on bit 8 - the remaining global bit of the PTE. So the kernel would receive a double fault. 2. After discussion, we decided to discard PDE bit 7-8 reserved checking for now. These bits are marked as reserved in the SDM, but are not in fact checked by the processor... Signed-off-by: Sheng Yang sh...@linux.intel.com
---
 arch/x86/kvm/mmu.c |    7 ---
 1 files changed, 4 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index e0f63b6..a0b130d 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2196,7 +2196,7 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu, int level)
 		break;
 	case PT32E_ROOT_LEVEL:
 		context->rsvd_bits_mask[0][1] = exb_bit_rsvd |
-			rsvd_bits(maxphyaddr, 62); /* PDE */
+			rsvd_bits(maxphyaddr, 62); /* PDE */
 		context->rsvd_bits_mask[0][0] = exb_bit_rsvd |
 			rsvd_bits(maxphyaddr, 62); /* PTE */
 		context->rsvd_bits_mask[1][1] = exb_bit_rsvd |
@@ -2210,13 +2210,14 @@
 		context->rsvd_bits_mask[0][2] = exb_bit_rsvd |
 			rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 8);
 		context->rsvd_bits_mask[0][1] = exb_bit_rsvd |
-			rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 8);
+			rsvd_bits(maxphyaddr, 51);
 		context->rsvd_bits_mask[0][0] = exb_bit_rsvd |
 			rsvd_bits(maxphyaddr, 51);
 		context->rsvd_bits_mask[1][3] = context->rsvd_bits_mask[0][3];
 		context->rsvd_bits_mask[1][2] = context->rsvd_bits_mask[0][2];
 		context->rsvd_bits_mask[1][1] = exb_bit_rsvd |
-			rsvd_bits(maxphyaddr, 51) | rsvd_bits(13, 20);
+			rsvd_bits(maxphyaddr, 51) |
+			rsvd_bits(13, 20); /* large page */
 		context->rsvd_bits_mask[1][0] = ~0ull;
 		break;
 	}
--
1.5.4.5
-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
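For readers unfamiliar with the helper being tweaked here: rsvd_bits(s, e) in KVM's mmu.c just builds a mask covering bits s..e inclusive, roughly as sketched below (simplified from memory; check the tree for the exact definition):

/* Mask with bits s..e set, e.g. rsvd_bits(7, 8) == 0x180, the PDE bits
 * this patch stops treating as reserved, and rsvd_bits(13, 20) covers
 * the PAE large-page PFN bits that must still be zero.
 */
static inline u64 rsvd_bits(int s, int e)
{
    return ((1ULL << (e - s + 1)) - 1) << s;
}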
Re: [PATCH 4/4] add ksm kernel shared memory driver.
Chris Wright wrote: * Anthony Liguori (anth...@codemonkey.ws) wrote: The ioctl() interface is quite bad for what you're doing. You're telling the kernel extra information about a VA range in userspace. That's what madvise is for. You're tweaking simple read/write values of kernel infrastructure. That's what sysfs is for. I agree re: sysfs (brought it up myself before). As far as madvise vs. ioctl, the one thing that comes from the ioctl is fops->release to automagically unregister memory on exit. This is precisely why ioctl() is a bad interface. fops->release isn't tied to the process but rather tied to the open file. The file can stay open long after the process exits either by a fork()'d child inheriting the file descriptor or through something more sinister like SCM_RIGHTS. In fact, a common mistake is to leak file descriptors by not closing them when exec()'ing a process. Instead of just delaying a close, if you rely on this behavior to unregister memory regions, you could potentially have badness happen in the kernel if ksm attempted to access an invalid memory region. So you absolutely have to automatically unregister regions in something other than the fops->release handler based on something that's tied to the pid's life cycle. Using an interface like madvise() would force the issue to be dealt with properly from the start :-) I'm often afraid of what sort of bugs we'd uncover in kvm if we passed the fds around via SCM_RIGHTS and started poking around :-/ Regards, Anthony Liguori This needs to be handled anyway if some -p pid is added to add a process after it's running, so less weight there. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
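Anthony's point that release is tied to the open file rather than to the process is easy to demonstrate with a few lines of userspace; a minimal sketch using the /dev/ksm fd from the patch under discussion (the region-registration ioctl is elided):

/* The parent opens the fd and exits, but the fork()'d child keeps the
 * file open, so the driver's fops->release does not run at parent exit.
 * Any state the driver keyed to the parent's mm is now stale.
 */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/ksm", O_RDWR);
    if (fd < 0)
        exit(1);
    /* ... ioctl() to register a memory region would go here ... */
    if (fork() == 0) {
        sleep(60);   /* child inherits fd; ->release waits for it */
        _exit(0);
    }
    return 0;        /* parent exits; its VMAs are gone, the fd is not */
}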
Check valid bit of VM_EXIT_INTR_INFO
Thx, eddie
commit ad4a9829c8d5b30995f008e32774bd5f555b7e9f
Author: root r...@eddie-wb.localdomain
Date: Thu Apr 2 11:16:03 2009 +0800

Check valid bit of VM_EXIT_INTR_INFO before unblock nmi.

Signed-off-by: Eddie Dong eddie.d...@intel.com

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index aba41ae..689523a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3268,16 +3268,18 @@ static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
 	exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
 	if (cpu_has_virtual_nmis()) {
-		unblock_nmi = (exit_intr_info & INTR_INFO_UNBLOCK_NMI) != 0;
-		vector = exit_intr_info & INTR_INFO_VECTOR_MASK;
 		/*
 		 * SDM 3: 25.7.1.2
 		 * Re-set bit block by NMI before VM entry if vmexit caused by
 		 * a guest IRET fault.
 		 */
-		if (unblock_nmi && vector != DF_VECTOR)
-			vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO,
+		if (exit_intr_info & INTR_INFO_VALID_MASK) {
+			unblock_nmi = !!(exit_intr_info & INTR_INFO_UNBLOCK_NMI);
+			vector = exit_intr_info & INTR_INFO_VECTOR_MASK;
+			if (unblock_nmi && vector != DF_VECTOR)
+				vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO,
 				GUEST_INTR_STATE_NMI);
+		}
 	} else if (unlikely(vmx->soft_vnmi_blocked))
 		vmx->vnmi_blocked_time +=
 			ktime_to_ns(ktime_sub(ktime_get(), vmx->entry_time));
nmi_valid.patch Description: nmi_valid.patch
Re: [RFC PATCH 00/17] virtual-bus
Anthony Liguori wrote: Gregory Haskins wrote: Anthony Liguori wrote: I think there is a slight disconnect here. This is *exactly* what I am trying to do. If it were exactly what you were trying to do, you would have posted a virtio-net in-kernel backend implementation instead of a whole new paravirtual IO framework ;-) semantics, semantics ;) but ok, fair enough. That said, I don't think we're bound today by the fact that we're in userspace. You will *always* be bound by the fact that you are in userspace. Again, let's talk numbers. A heavy-weight exit is 1us slower than a light weight exit. Ideally, you're taking 1 exit per packet because you're batching notifications. If you're ping latency on bare metal compared to vbus is 39us to 65us, then all other things being equally, the cost imposed by doing what your doing in userspace would make the latency be 66us taking your latency from 166% of native to 169% of native. That's not a huge difference and I'm sure you'll agree there are a lot of opportunities to improve that even further. Ok, so lets see it happen. Consider the gauntlet thrown :) Your challenge, should you chose to accept it, is to take todays 4000us and hit a 65us latency target while maintaining 10GE line-rate (at least 1500 mtu line-rate). I personally don't want to even stop at 65. I want to hit that 36us! In case you think that is crazy, my first prototype of venet was hitting about 140us, and I shaved 10us here, 10us there, eventually getting down to the 65us we have today. The low hanging fruit is all but harvested at this point, but I am not done searching for additional sources of latency. I just needed to take a breather to get the code out there for review. :) And you didn't mention whether your latency tests are based on ping or something more sophisticated Well, the numbers posted were actually from netperf -t UDP_RR. This generates a pps from a continuous (but non-bursted) RTT measurement. So I invert the pps result of this test to get the average rtt time. I have also confirmed that ping jives with these results (e.g. virtio-net results were about 4ms, and venet were about 0.065ms as reported by ping). as ping will be a pathological case Ah, but this is not really pathological IMO. There are plenty of workloads that exhibit request-reply patterns (e.g. RPC), and this is a direct measurement of the systems ability to support these efficiently. And even unidirectional flows can be hampered by poor latency (think PTP clock sync, etc). Massive throughput with poor latency is like Andrew Tanenbaum's station-wagon full of backup tapes ;) I think I have proven we can actually get both with a little creative use of resources. that doesn't allow any notification batching. Well, if we can take anything away from all this: I think I have demonstrated that you don't need notification batching to get good throughput. And batching on the head-end of the queue adds directly to your latency overhead, so I don't think its a good technique in general (though I realize that not everyone cares about latency, per se, so maybe most are satisfied with the status-quo). I agree that the does anyone care part of the equation will approach zero as the latency difference shrinks across some threshold (probably the single microsecond range), but I will believe that is even possible when I see it ;) Note the other hat we have to where is not just virtualization developer but Linux developer. 
If there are bad userspace interfaces for IO that impose artificial restrictions, then we need to identify those and fix them. Fair enough, and I would love to take that on but alas my development/debug bandwidth is rather finite these days ;) -Greg signature.asc Description: OpenPGP digital signature
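Since the thread keeps converting between packets per second and round-trip time, a quick note on how the netperf UDP_RR numbers map to latency; a trivial sketch of the arithmetic:

/* netperf -t UDP_RR keeps a single request/response outstanding, so it
 * reports transactions per second and the average RTT is simply the
 * inverse: ~15000 tps -> ~65 us, ~250 tps -> ~4 ms, consistent with the
 * ping figures quoted above.
 */
static double rtt_usec(double transactions_per_sec)
{
    return 1e6 / transactions_per_sec;
}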
Re: [RFC PATCH 00/17] virtual-bus
Rusty Russell ru...@rustcorp.com.au wrote: As you point out, 350-450 is possible, which is still bad, and it's at least partially caused by the exit to userspace and two system calls. If virtio_net had a backend in the kernel, we'd be able to compare numbers properly. FWIW I don't really care whether we go with this or a kernel virtio_net backend. Either way should be good. However the status quo where we're stuck with a user-space backend really sucks! Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 00/17] virtual-bus
Anthony Liguori anth...@codemonkey.ws wrote: That said, I don't think we're bound today by the fact that we're in userspace. Rather we're bound by the interfaces we have between the host kernel and userspace to generate IO. I'd rather fix those interfaces than put more stuff in the kernel. I'm sorry but I totally disagree with that. By having our IO infrastructure in user-space we've basically given up the main advantage of kvm, which is that the physical drivers operate in the same environment as the hypervisor. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 00/17] virtual-bus
Chris Wright chr...@sous-sol.org wrote: That said, I don't think we're bound today by the fact that we're in userspace. Rather we're bound by the interfaces we have between the host kernel and userspace to generate IO. I'd rather fix those interfaces than put more stuff in the kernel. And more stuff in the kernel can come at the potential cost of weakening protection/isolation. Protection/isolation always comes at a cost. Not everyone wants to pay that, just like health insurance :) We should enable the users to choose which model they want, based on their needs. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
The errors appear when compiling kvm-guest-drivers on kernel-2.6.29
Hi, I have recently set up a guest network in KVM. When I tried to compile the kvm guest driver - virtio - on kernel-2.6.29, an issue appeared.
(1) the version of kvm guest driver
[r...@fedora9 kvm-guest-drivers-linux {master}]$ git describe
kvm-guest-drivers-linux-1-13-gae1ae62
(2) the output of make
[r...@fedora9 kvm-guest-drivers-linux {master}]$ make
make -C /lib/modules/2.6.29/build M=`pwd` $@
make[1]: Entering directory `/usr/src/linux-2.6.29'
  CC [M]  /home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.o
/home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.c: In function 'xmit_skb':
/home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.c:544: error: 'CHECKSUM_HW' undeclared (first use in this function)
/home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.c:544: error: (Each undeclared identifier is reported only once
/home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.c:544: error: for each function it appears in.)
/home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.c:550: error: 'struct sk_buff' has no member named 'h'
make[2]: *** [/home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.o] Error 1
make[1]: *** [_module_/home/zwu/study/virt/kvm-guest-drivers-linux] Error 2
make[1]: Leaving directory `/usr/src/linux-2.6.29'
make: *** [all] Error 2
Cheers, Zhiyong Wu -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: The errors appear when compiling kvm-guest-drivers on kernel-2.6.29
In virtio_net.c,
#ifdef COMPAT_csum_offset
	if (skb->ip_summed == CHECKSUM_HW) {
#else
	if (skb->ip_summed == CHECKSUM_PARTIAL) {
It seems that CHECKSUM_HW is not declared. In skbuff.h, only the macros below are defined.
/* Don't change this without changing skb_csum_unnecessary! */
#define CHECKSUM_NONE		0
#define CHECKSUM_UNNECESSARY	1
#define CHECKSUM_COMPLETE	2
#define CHECKSUM_PARTIAL	3
Cheers, Zhiyong Wu On Thu, Apr 2, 2009 at 12:10 PM, Zhiyong Wu zwu.ker...@gmail.com wrote: Hi, I have recently set up a guest network in KVM. When I tried to compile the kvm guest driver - virtio - on kernel-2.6.29, an issue appeared. (1) the version of kvm guest driver [r...@fedora9 kvm-guest-drivers-linux {master}]$ git describe kvm-guest-drivers-linux-1-13-gae1ae62 (2) the output of make [r...@fedora9 kvm-guest-drivers-linux {master}]$ make make -C /lib/modules/2.6.29/build M=`pwd` $@ make[1]: Entering directory `/usr/src/linux-2.6.29' CC [M] /home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.o /home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.c: In function 'xmit_skb': /home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.c:544: error: 'CHECKSUM_HW' undeclared (first use in this function) /home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.c:544: error: (Each undeclared identifier is reported only once /home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.c:544: error: for each function it appears in.) /home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.c:550: error: 'struct sk_buff' has no member named 'h' make[2]: *** [/home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.o] Error 1 make[1]: *** [_module_/home/zwu/study/virt/kvm-guest-drivers-linux] Error 2 make[1]: Leaving directory `/usr/src/linux-2.6.29' make: *** [all] Error 2 Cheers, Zhiyong Wu -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
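For what it's worth, the usual way out-of-tree drivers cope with this is a small compat shim rather than editing the driver; an untested sketch, not a patch against the kvm-guest-drivers tree:

/* CHECKSUM_HW was split into CHECKSUM_PARTIAL (tx) and CHECKSUM_COMPLETE
 * (rx) around 2.6.19, and the skb->h union later went away in favour of
 * the skb_transport_header() helpers. For the tx path the old name can
 * simply be mapped to the new one on recent kernels:
 */
#include <linux/skbuff.h>

#ifndef CHECKSUM_HW
#define CHECKSUM_HW CHECKSUM_PARTIAL   /* old name, tx usage only */
#endif

Though, as the follow-up below points out, a 2.6.29 guest can simply use the in-tree virtio drivers and skip the compat tree entirely.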
Re: The errors appear when compiling kvm-guest-drivers on kernel-2.6.29
I don't think the kvm-guest-drivers are still well maintained (they haven't been touched in 5 months). If you are using kernel 2.6.29, it already has virtio drivers and you don't need the kvm-guest-drivers tree at all. --Brian Jackson On Wednesday 01 April 2009 23:10:43 Zhiyong Wu wrote: HI, I have recently setup a guest network in KVM, when i tried to compile kvm guest driver - virtio on kernel-2.6.29, an issue has appeared (1) the version of kvm guest driver [r...@fedora9 kvm-guest-drivers-linux {master}]$ git describe kvm-guest-drivers-linux-1-13-gae1ae62 (2) the output of make [r...@fedora9 kvm-guest-drivers-linux {master}]$ make make -C /lib/modules/2.6.29/build M=`pwd` $@ make[1]: Entering directory `/usr/src/linux-2.6.29' CC [M] /home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.o /home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.c: In function \u2018xmit_skb\u2019: /home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.c:544: error: \u2018CHECKSUM_HW\u2019 undeclared (first use in this function) /home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.c:544: error: (Each undeclared identifier is reported only once /home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.c:544: error: for each function it appears in.) /home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.c:550: error: \u2018struct sk_buff\u2019 has no member named \u2018h\u2019 make[2]: *** [/home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.o] Error 1 make[1]: *** [_module_/home/zwu/study/virt/kvm-guest-drivers-linux] Error 2 make[1]: Leaving directory `/usr/src/linux-2.6.29' make: *** [all] Error 2 Cheers, Zhiyong Wu -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/4] update ksm userspace interfaces
* Anthony Liguori (anth...@codemonkey.ws) wrote: Using an interface like madvise() would force the issue to be dealt with properly from the start :-) Yeah, I'm not at all opposed to it. This updates to madvise for register and sysfs for control. madvise issues: - MADV_SHAREABLE - register only ATM, can add MADV_UNSHAREABLE to allow an app to proactively unregister, but need a cleanup when -mm goes away via exit/exec - will register a region per vma, should probably push the whole thing into vma rather than keep [mm,addr,len] tuple in ksm sysfs issues: - none really, i added a reporting mechanism for number of pages shared, doesn't decrement on COW - could use some extra sanity checks It compiles! Diff output is hard to read, I can send a 4/4 w/ this patch rolled in for easier review. Signed-off-by: Chris Wright chr...@redhat.com --- include/asm-generic/mman.h |1 + include/linux/ksm.h| 63 + mm/ksm.c | 352 mm/madvise.c | 18 +++ 4 files changed, 149 insertions(+), 285 deletions(-) diff --git a/include/asm-generic/mman.h b/include/asm-generic/mman.h index 5e3dde2..a1c1d5c 100644 --- a/include/asm-generic/mman.h +++ b/include/asm-generic/mman.h @@ -34,6 +34,7 @@ #define MADV_REMOVE9 /* remove these pages resources */ #define MADV_DONTFORK 10 /* don't inherit across fork */ #define MADV_DOFORK11 /* do inherit across fork */ +#define MADV_SHAREABLE 12 /* can share identical pages */ /* compatibility flags */ #define MAP_FILE 0 diff --git a/include/linux/ksm.h b/include/linux/ksm.h index 5776dce..e032f6f 100644 --- a/include/linux/ksm.h +++ b/include/linux/ksm.h @@ -1,69 +1,8 @@ #ifndef __LINUX_KSM_H #define __LINUX_KSM_H -/* - * Userspace interface for /dev/ksm - kvm shared memory - */ - -#include linux/types.h -#include linux/ioctl.h - -#include asm/types.h - -#define KSM_API_VERSION 1 - #define ksm_control_flags_run 1 -/* for KSM_REGISTER_MEMORY_REGION */ -struct ksm_memory_region { - __u32 npages; /* number of pages to share */ - __u32 pad; - __u64 addr; /* the begining of the virtual address */ -__u64 reserved_bits; -}; - -struct ksm_kthread_info { - __u32 sleep; /* number of microsecoends to sleep */ - __u32 pages_to_scan; /* number of pages to scan */ - __u32 flags; /* control flags */ -__u32 pad; -__u64 reserved_bits; -}; - -#define KSMIO 0xAB - -/* ioctls for /dev/ksm */ - -#define KSM_GET_API_VERSION _IO(KSMIO, 0x00) -/* - * KSM_CREATE_SHARED_MEMORY_AREA - create the shared memory reagion fd - */ -#define KSM_CREATE_SHARED_MEMORY_AREA_IO(KSMIO, 0x01) /* return SMA fd */ -/* - * KSM_START_STOP_KTHREAD - control the kernel thread scanning speed - * (can stop the kernel thread from working by setting running = 0) - */ -#define KSM_START_STOP_KTHREAD _IOW(KSMIO, 0x02,\ - struct ksm_kthread_info) -/* - * KSM_GET_INFO_KTHREAD - return information about the kernel thread - * scanning speed. - */ -#define KSM_GET_INFO_KTHREAD_IOW(KSMIO, 0x03,\ - struct ksm_kthread_info) - - -/* ioctls for SMA fds */ - -/* - * KSM_REGISTER_MEMORY_REGION - register virtual address memory area to be - * scanned by kvm. - */ -#define KSM_REGISTER_MEMORY_REGION _IOW(KSMIO, 0x20,\ - struct ksm_memory_region) -/* - * KSM_REMOVE_MEMORY_REGION - remove virtual address memory area from ksm. 
- */ -#define KSM_REMOVE_MEMORY_REGION _IO(KSMIO, 0x21) +long ksm_register_memory(struct vm_area_struct *, unsigned long, unsigned long); #endif diff --git a/mm/ksm.c b/mm/ksm.c index eba4c09..fcbf76e 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -17,7 +17,6 @@ #include linux/errno.h #include linux/mm.h #include linux/fs.h -#include linux/miscdevice.h #include linux/vmalloc.h #include linux/file.h #include linux/mman.h @@ -38,6 +37,7 @@ #include linux/rbtree.h #include linux/anon_inodes.h #include linux/ksm.h +#include linux/kobject.h #include asm/tlbflush.h @@ -55,20 +55,11 @@ MODULE_PARM_DESC(rmap_hash_size, Hash table size for the reverse mapping); */ struct ksm_mem_slot { struct list_head link; - struct list_head sma_link; struct mm_struct *mm; unsigned long addr; /* the begining of the virtual address */ unsigned npages;/* number of pages to share */ }; -/* - * ksm_sma - shared memory area, each process have its own sma that contain the - * information about the slots that it own - */ -struct ksm_sma { - struct list_head sma_slots; -}; - /** * struct ksm_scan - cursor for scanning * @slot_index: the current slot we are scanning @@ -190,6 +181,7 @@ static struct kmem_cache *rmap_item_cache; static int
[PATCH 4/4 alternative userspace] add ksm kernel shared memory driver
Here's ksm w/ a user interface built around madvise for registering and sysfs for controlling (should just drop config tristate and make it bool, CONFIG_KSM= y or n).

#include Izik's changelog

Ksm is a driver that allows merging identical pages between one or more applications, in a way invisible to the applications that use it. Pages that are merged are marked as readonly and are COWed when any application tries to change them. Ksm is used for cases where using fork() is not suitable; one of these cases is where the pages of the application keep changing dynamically and the application cannot know in advance what pages are going to be identical. Ksm works by walking over the memory pages of the applications it scans in order to find identical pages. It uses two sorted data structures, called the stable and unstable trees, to find identical pages in an effective way. When ksm finds two identical pages, it marks them as readonly and merges them into a single page. After the pages are marked as readonly and merged into one page, Linux will treat these pages as normal copy_on_write pages and will copy them when a write access happens to them. Ksm scans just the memory areas that were registered to be scanned by it.

Ksm api (for users to register region):
Register a memory region as shareable: madvise(void *addr, size_t len, MADV_SHAREABLE)
Unregister a shareable memory region (not currently implemented): madvise(void *addr, size_t len, MADV_UNSHAREABLE)

Ksm api (for users to control ksm scanning daemon):
/sys/kernel/mm/ksm
|-- pages_shared  -- RO, attribute showing number of pages shared
|-- pages_to_scan -- RW, number of pages to scan per scan loop
|-- run           -- RW, whether scanning daemon should scan
`-- sleep         -- RW, number of usecs to sleep between scan loops

Signed-off-by: Izik Eidus iei...@redhat.com
Signed-off-by: Chris Wright chr...@redhat.com
---
 include/asm-generic/mman.h |    1 +
 include/linux/ksm.h        |    8 +
 mm/Kconfig                 |    6 +
 mm/Makefile                |    1 +
 mm/ksm.c                   | 1337
 mm/madvise.c               |   18 +
 6 files changed, 1371 insertions(+), 0 deletions(-)
diff --git a/include/asm-generic/mman.h b/include/asm-generic/mman.h
index 5e3dde2..a1c1d5c 100644
--- a/include/asm-generic/mman.h
+++ b/include/asm-generic/mman.h
@@ -34,6 +34,7 @@
 #define MADV_REMOVE	9	/* remove these pages & resources */
 #define MADV_DONTFORK	10	/* don't inherit across fork */
 #define MADV_DOFORK	11	/* do inherit across fork */
+#define MADV_SHAREABLE	12	/* can share identical pages */

 /* compatibility flags */
 #define MAP_FILE	0
diff --git a/include/linux/ksm.h b/include/linux/ksm.h
new file mode 100644
index 000..e032f6f
--- /dev/null
+++ b/include/linux/ksm.h
@@ -0,0 +1,8 @@
+#ifndef __LINUX_KSM_H
+#define __LINUX_KSM_H
+
+#define ksm_control_flags_run 1
+
+long ksm_register_memory(struct vm_area_struct *, unsigned long, unsigned long);
+
+#endif
diff --git a/mm/Kconfig b/mm/Kconfig
index b53427a..3f3fd04 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -223,3 +223,9 @@ config HAVE_MLOCKED_PAGE_BIT

 config MMU_NOTIFIER
 	bool
+
+config KSM
+	tristate "Enable KSM for page sharing"
+	help
+	  Enable the KSM kernel module to allow page sharing of equal pages
+	  among different tasks.
diff --git a/mm/Makefile b/mm/Makefile index ec73c68..b885513 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -24,6 +24,7 @@ obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o obj-$(CONFIG_TMPFS_POSIX_ACL) += shmem_acl.o obj-$(CONFIG_SLOB) += slob.o obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o +obj-$(CONFIG_KSM) += ksm.o obj-$(CONFIG_PAGE_POISONING) += debug-pagealloc.o obj-$(CONFIG_SLAB) += slab.o obj-$(CONFIG_SLUB) += slub.o diff --git a/mm/ksm.c b/mm/ksm.c new file mode 100644 index 000..fcbf76e --- /dev/null +++ b/mm/ksm.c @@ -0,0 +1,1337 @@ +/* + * Memory merging driver for Linux + * + * This module enables dynamic sharing of identical pages found in different + * memory areas, even if they are not shared by fork() + * + * Copyright (C) 2008 Red Hat, Inc. + * Authors: + * Izik Eidus + * Andrea Arcangeli + * Chris Wright + * + * This work is licensed under the terms of the GNU GPL, version 2. + */ + +#include linux/module.h +#include linux/errno.h +#include linux/mm.h +#include linux/fs.h +#include linux/vmalloc.h +#include linux/file.h +#include linux/mman.h +#include linux/sched.h +#include linux/rwsem.h +#include linux/pagemap.h +#include linux/sched.h +#include linux/rmap.h +#include linux/spinlock.h +#include linux/jhash.h +#include linux/delay.h +#include linux/kthread.h +#include linux/wait.h +#include linux/scatterlist.h +#include linux/random.h +#include linux/slab.h +#include linux/swap.h +#include linux/rbtree.h +#include linux/anon_inodes.h
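To make the proposed interface concrete, here is a minimal userspace sketch of what a consumer of this patch would do; the MADV_SHAREABLE value and the sysfs paths are taken from the patch above, the program itself is hypothetical and untested:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_SHAREABLE
#define MADV_SHAREABLE 12              /* value proposed in the patch above */
#endif

int main(void)
{
    size_t len = 64 << 20;             /* 64 MB of guest-style memory */
    void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return 1;
    memset(mem, 0, len);               /* lots of identical zero pages */

    /* Tell ksm this range may be merged with identical pages. */
    if (madvise(mem, len, MADV_SHAREABLE))
        perror("madvise(MADV_SHAREABLE)");

    /* Scanning itself is controlled via sysfs, e.g.:
     *   echo 100 > /sys/kernel/mm/ksm/pages_to_scan
     *   echo 1   > /sys/kernel/mm/ksm/run
     * and progress shows up in /sys/kernel/mm/ksm/pages_shared.
     */
    pause();
    return 0;
}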
Re: [PATCH 4/4 alternative userspace] add ksm kernel shared memory driver
On Thu, Apr 2, 2009 at 07:48, Chris Wright chr...@redhat.com wrote: Ksm api (for users to register region): Register a memory region as shareable: madvise(void *addr, size_t len, MADV_SHAREABLE) Unregister a shareable memory region (not currently implemented): madvise(void *addr, size_t len, MADV_UNSHAREABLE) I can't find a definition for MADV_UNSHAREABLE! Bert -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html