[Qemu-devel] [PATCH] build: set up capabilities on qemu-bridge-helper
Out-of-the-box, 'make install' sets up an unusable qemu-bridge-helper
since it doesn't have the required capabilities.  Fix by adding them.

Note: this may break installing as non-root.  This is actually the
right thing to do, since not setting up the capability would result in
a broken setup.  Perhaps we need a configure flag to disable helpers.

Signed-off-by: Avi Kivity <a...@cloudius-systems.com>
---
 Makefile  | 3 +++
 rules.mak | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/Makefile b/Makefile
index b15003f..af7748c 100644
--- a/Makefile
+++ b/Makefile
@@ -188,6 +188,7 @@ qemu-img$(EXESUF): qemu-img.o $(block-obj-y) libqemuutil.a libqemustub.a
 qemu-nbd$(EXESUF): qemu-nbd.o $(block-obj-y) libqemuutil.a libqemustub.a
 qemu-io$(EXESUF): qemu-io.o $(block-obj-y) libqemuutil.a libqemustub.a

+qemu-bridge-helper$(EXESUF).capabilities = cap_net_admin
 qemu-bridge-helper$(EXESUF): qemu-bridge-helper.o

 fsdev/virtfs-proxy-helper$(EXESUF): fsdev/virtfs-proxy-helper.o fsdev/virtio-9p-marshal.o libqemuutil.a libqemustub.a
@@ -345,6 +346,8 @@ endif
 ifneq ($(HELPERS-y),)
 	$(INSTALL_DIR) $(DESTDIR)$(libexecdir)
 	$(INSTALL_PROG) $(STRIP_OPT) $(HELPERS-y) $(DESTDIR)$(libexecdir)
+	$(SETCAP) $(foreach helper, $(HELPERS-y), \
+		$($(helper).capabilities)+pe $(DESTDIR)$(libexecdir)/$(helper))
 endif
 ifneq ($(BLOBS),)
 	set -e; for x in $(BLOBS); do \
diff --git a/rules.mak b/rules.mak
index 49edb9b..9194c79 100644
--- a/rules.mak
+++ b/rules.mak
@@ -177,3 +177,5 @@ $(shell mkdir -p $(sort $(foreach var,$(nested-vars),$(dir $($(var))
 $(foreach var,$(nested-vars), $(eval \
 -include $(addsuffix *.d, $(sort $(dir $($(var)))
 endef
+
+SETCAP = setcap
\ No newline at end of file
--
1.8.3.1
Re: [Qemu-devel] [PATCH] build: set up capabilities on qemu-bridge-helper
On Nov 12, 2013 9:23 PM, Eric Blake <ebl...@redhat.com> wrote:
> On 11/12/2013 04:10 AM, Avi Kivity wrote:
>> Out-of-the-box, 'make install' sets up an unusable qemu-bridge-helper
>> since it doesn't have the required capabilities.  Fix by adding them.
>>
>> Note: this may break installing as non-root.  This is actually the
>> right thing to do, since not setting up the capability would result
>> in a broken setup.  Perhaps we need a configure flag to disable
>> helpers.
>>
>> [patch snipped]
>>
>> +++ b/rules.mak
>> @@ -177,3 +177,5 @@ [context snipped]
>> +
>> +SETCAP = setcap
>> \ No newline at end of file
>
> Is this intentional?

No.

> --
> Eric Blake   eblake redhat com   +1-919-301-3266
> Libvirt virtualization library http://libvirt.org
Re: [Qemu-devel] [PATCH v2 5/7] exec: memory radix tree page level compression
Michael S. Tsirkin <mst at redhat.com> writes:
> At the moment, memory radix tree is already variable width, but it
> can only skip the low bits of address.
>
> This is efficient if we have huge memory regions but inefficient if
> we are only using a tiny portion of the address space.
>
> After we have built up the map, detect configurations where a single
> L2 entry is valid.  We then speed up the lookup by skipping one or
> more levels.
>
> In case any levels were skipped, we might end up in a valid section
> instead of erroring out.  We handle this by checking that the address
> is in range of the resulting section.

I think this is overkill.  It can be done in a simpler way as follows:

    phys_page_find(RadixTree *rt, hwaddr index, ...)
    {
        if (index & rt->invalid_index_mask) {
            // not found
        }
        lp = rt->root;
        for (i = rt->nb_levels - 1; i >= 0 && !lp.is_leaf; --i) {
            ...

This exploits the fact that the lower portion of the address space is
always filled, at least in the cases that matter to us.
Re: [Qemu-devel] [PATCH] build: set up capabilities on qemu-bridge-helper
On 11/14/2013 03:29 PM, Stefan Hajnoczi wrote:
> On Tue, Nov 12, 2013 at 01:10:24PM +0200, Avi Kivity wrote:
>> Out-of-the-box, 'make install' sets up an unusable qemu-bridge-helper
>> since it doesn't have the required capabilities.  Fix by adding them.
>
> Up until now, downstreams had to make the bridge helper executable
> setuid, add the cap_net_admin capability, or they did nothing and it
> was broken ;-).
>
> CCing downstream package maintainers in case they have any comments
> on this patch.

And it was, indeed, broken.

>> Note: this may break installing as non-root.  This is actually the
>> right thing to do, since not setting up the capability would result
>> in a broken setup.  Perhaps we need a configure flag to disable
>> helpers.
>
> Users who have been successfully installing QEMU would be upset if it
> suddenly starts failing after this patch.  The bridge helper is a
> niche feature that shouldn't cause a regression for the majority of
> users who don't care about it.
>
> If we're installing non-root then the bridge helper simply shouldn't
> be installed.

Or maybe installed without the capabilities?  This way if the user
invokes qemu with sudo, it still works as before.
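One possible shape for that variant, sketched as a minimal tweak to the rule from the patch (hypothetical, not what the posted patch does): prefixing the recipe line with `-` makes make ignore a setcap failure, so an unprivileged install completes with the helper uncapped, and it still works when QEMU is invoked via sudo.

```make
# Hypothetical non-fatal variant of the install rule: attempt setcap,
# but do not fail the install when running unprivileged.  The leading
# '-' tells make to ignore the command's exit status.
ifneq ($(HELPERS-y),)
	$(INSTALL_DIR) $(DESTDIR)$(libexecdir)
	$(INSTALL_PROG) $(STRIP_OPT) $(HELPERS-y) $(DESTDIR)$(libexecdir)
	-$(SETCAP) $(foreach helper, $(HELPERS-y), \
		$($(helper).capabilities)+pe $(DESTDIR)$(libexecdir)/$(helper))
endif
```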
Re: [Qemu-devel] [PATCH v2 5/7] exec: memory radix tree page level compression
On 11/14/2013 04:40 PM, Michael S. Tsirkin wrote:
> On Thu, Nov 14, 2013 at 08:54:11AM +0000, Avi Kivity wrote:
>> [patch description and phys_page_find sketch snipped]
>> This exploits the fact that the lower portion of the address space
>> is always filled, at least in the cases that matter to us.
>
> Basically skip unused high bits?  Sure.  In fact I think both
> optimizations can be combined.

Not much value in combining them -- the variable level tree check will
be dominated by the level skip logic.

IMO however skipping intermediate levels will be too rare to justify
the complexity and the doubling of the page table size -- it can only
happen in iommu setups that place memory in very high addresses.
These ought to be rare.
Re: [Qemu-devel] [PATCH v2 5/7] exec: memory radix tree page level compression
On 11/14/2013 05:37 PM, Michael S. Tsirkin wrote:
> On Thu, Nov 14, 2013 at 04:56:43PM +0200, Avi Kivity wrote:
>> [earlier discussion snipped]
>> Not much value in combining them -- the variable level tree check
>> will be dominated by the level skip logic.
>>
>> IMO however skipping intermediate levels will be too rare to justify
>> the complexity and the doubling of the page table size -- it can
>> only happen in iommu setups that place memory in very high
>> addresses.  These ought to be rare.
>
> Well maybe not very high addresses, but you can have a device with a
> 64 bit BAR, and this will add back levels (though it would not be
> slower than it is currently).  I agree the simplicity is attractive.
>
> However I really would like some logic that can handle > 1 leaf
> somehow.  Specifically, both tricks break if we add a full 64 bit io
> region in order to return -1 for reads on master abort.

That can be achieved by directing a missed lookup to a catch-all
region.  It would be best to do this completely internally to the
memory code, without API changes.
Re: [Qemu-devel] [PATCH v2 5/7] exec: memory radix tree page level compression
On 11/17/2013 05:37 PM, Michael S. Tsirkin wrote:
> On Thu, Nov 14, 2013 at 05:42:26PM +0200, Avi Kivity wrote:
>> [earlier discussion snipped]
>>> However I really would like some logic that can handle > 1 leaf
>>> somehow.  Specifically, both tricks break if we add a full 64 bit
>>> io region in order to return -1 for reads on master abort.
>>
>> That can be achieved by directing a missed lookup to a catch-all
>> region.  It would be best to do this completely internally to the
>> memory code, without API changes.
>
> You mean e.g. add a catch-all region to the address space?

The API already specifies a catch-all region - the container region
catches everything if an access doesn't hit a sub-region (though I
wanted to discourage the practice of using the same region for both
I/O and a container), or you can have an I/O region as a subregion
with the lowest priority, that spans the entire space, so that any
access which misses the other sub-regions will hit the default
subregion.  So we don't need API changes, just to change the
internals.

I think this can be done by looking at the address space, and finding
out what region is mapped to the very last byte in the address space,
if any, and using that as the default if an access misses the tree.
That is, this region will be the target of null pointers in the tree,
or if (addr & addr_invalid_mask) is true.  This will ensure the tree
is of the smallest possible size.

Another option is to simply give up on the tree and use
std::binary_search() on the FlatView.  It will be more cpu intensive,
but on the other hand more cache-friendly.  tcg already caches stuff
in the iotlb, and maybe it makes sense to add a cache for kvm lookups
too (the advantage of the cache is that it only needs to contain
addresses that are actually touched, so it can be very small and
fast, compared to the radix tree which needs to cover all addresses).
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On Tue, Nov 26, 2013 at 2:47 PM, Paolo Bonzini <pbonz...@redhat.com> wrote:
> Il 26/11/2013 13:40, Zhanghaoyu (A) ha scritto:
>> When guest set irq smp_affinity, VMEXIT occurs, then the vcpu thread
>> will IOCTL return to QEMU from hypervisor, then vcpu thread ask the
>> hypervisor to update the irq routing table, in kvm_set_irq_routing,
>> synchronize_rcu is called, current vcpu thread is blocked for so
>> much time to wait RCU grace period, and during this period, this
>> vcpu cannot provide service to VM, so those interrupts delivered to
>> this vcpu cannot be handled in time, and the apps running on this
>> vcpu cannot be serviced too.  It's unacceptable in some real-time
>> scenario, e.g. telecom.
>>
>> So, I want to create a single workqueue for each VM, to
>> asynchronously performing the RCU synchronization for irq routing
>> table, and let the vcpu thread return and VMENTRY to service VM
>> immediately, no more need to blocked to wait RCU grace period.  And,
>> I have implemented a raw patch, took a test in our telecom
>> environment, above problem disappeared.
>
> I don't think a workqueue is even needed.  You just need to use
> call_rcu to free "old" after releasing kvm->irq_lock.
>
> What do you think?

Can this cause an interrupt to be delivered to the wrong (old) vcpu?

The way Linux sets interrupt affinity, it cannot, since changing the
affinity is (IIRC) done in the interrupt handler, so the next
interrupt cannot be in flight and thus pick up the old interrupt
routing table.  However it may be vulnerable in other ways.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On Tue, Nov 26, 2013 at 3:47 PM, Paolo Bonzini <pbonz...@redhat.com> wrote:
> Il 26/11/2013 14:18, Avi Kivity ha scritto:
>>> I don't think a workqueue is even needed.  You just need to use
>>> call_rcu to free "old" after releasing kvm->irq_lock.
>>>
>>> What do you think?
>>
>> Can this cause an interrupt to be delivered to the wrong (old) vcpu?
>
> No, this would be exactly the same code that is running now:
>
>     mutex_lock(&kvm->irq_lock);
>     old = kvm->irq_routing;
>     kvm_irq_routing_update(kvm, new);
>     mutex_unlock(&kvm->irq_lock);
>
>     synchronize_rcu();
>     kfree(old);
>     return 0;
>
> Except that the kfree would run in the call_rcu kernel thread instead
> of the vcpu thread.  But the vcpus already see the new routing table
> after the rcu_assign_pointer that is in kvm_irq_routing_update.

I understood the proposal was also to eliminate the
synchronize_rcu(), so while new interrupts would see the new routing
table, interrupts already in flight could pick up the old one.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/26/2013 04:46 PM, Paolo Bonzini wrote:
> Il 26/11/2013 15:36, Avi Kivity ha scritto:
>> I understood the proposal was also to eliminate the
>> synchronize_rcu(), so while new interrupts would see the new routing
>> table, interrupts already in flight could pick up the old one.
>
> Isn't that always the case with RCU?  (See my answer above: the vcpus
> already see the new routing table after the rcu_assign_pointer that
> is in kvm_irq_routing_update).

With synchronize_rcu(), you have the additional guarantee that any
parallel accesses to the old routing table have completed.  Since we
also trigger the irq from rcu context, you know that after
synchronize_rcu() you won't get any interrupts to the old destination
(see kvm_set_irq_inatomic()).

It's another question whether the hardware provides the same
guarantee.

> If you eliminate the synchronize_rcu, new interrupts would see the
> new routing table, while interrupts already in flight will get a
> dangling pointer.

Sure, if you drop the synchronize_rcu(), you have to add call_rcu().
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/26/2013 05:03 PM, Gleb Natapov wrote:
> On Tue, Nov 26, 2013 at 04:54:44PM +0200, Avi Kivity wrote:
>> [earlier discussion snipped]
>> With synchronize_rcu(), you have the additional guarantee that any
>> parallel accesses to the old routing table have completed.  Since we
>> also trigger the irq from rcu context, you know that after
>> synchronize_rcu() you won't get any interrupts to the old
>> destination (see kvm_set_irq_inatomic()).
>
> We do not have this guarantee for other vcpus that do not call
> synchronize_rcu().  They may still use an outdated routing table
> while a vcpu or iothread that performed the table update sits in
> synchronize_rcu().

Consider this guest code:

    write msi entry, directing the interrupt away from this vcpu
    nop
    memset(idt, 0, sizeof(idt));

Currently, this code will never trigger a triple fault.  With the
change to call_rcu(), it may.

Now it may be that the guest does not expect this to work (PCI writes
are posted; and interrupts can be delayed indefinitely by the pci
fabric), but we don't know if there's a path that guarantees the
guest something that we're taking away with this change.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/26/2013 05:20 PM, Paolo Bonzini wrote:
> Il 26/11/2013 16:03, Gleb Natapov ha scritto:
>> We do not have this guarantee for other vcpus that do not call
>> synchronize_rcu().  They may still use an outdated routing table
>> while a vcpu or iothread that performed the table update sits in
>> synchronize_rcu().
>
> Avi's point is that, after the VCPU resumes execution, you know that
> no interrupt will be sent to the old destination because
> kvm_set_msi_inatomic (and ultimately kvm_irq_delivery_to_apic_fast)
> is also called within the RCU read-side critical section.
>
> Without synchronize_rcu you could have (routing update on the left,
> interrupt delivery on the right):
>
>     VCPU writes to routing table
>                                        e = entry from IRQ routing table
>     kvm_irq_routing_update(kvm, new);
>     VCPU resumes execution
>                                        kvm_set_msi_irq(e, irq);
>                                        kvm_irq_delivery_to_apic_fast();
>
> where the entry is stale but the VCPU has already resumed execution.

If we want to ensure that, we need to use a different mechanism for
synchronization than the global RCU.  QRCU would work; readers are not
wait-free, but only if there is a concurrent synchronize_qrcu, which
should be rare.

An alternative path is to convince ourselves that the hardware does
not provide the guarantees that the current code provides, and so we
can relax them.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/26/2013 05:28 PM, Paolo Bonzini wrote:
> Il 26/11/2013 16:25, Avi Kivity ha scritto:
>> If we want to ensure that, we need to use a different mechanism for
>> synchronization than the global RCU.  QRCU would work; readers are
>> not wait-free, but only if there is a concurrent synchronize_qrcu,
>> which should be rare.
>>
>> An alternative path is to convince ourselves that the hardware does
>> not provide the guarantees that the current code provides, and so we
>> can relax them.
>
> No, I think it's a reasonable guarantee to provide.

Why?
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/26/2013 05:58 PM, Paolo Bonzini wrote:
> Il 26/11/2013 16:35, Avi Kivity ha scritto:
>>>> An alternative path is to convince ourselves that the hardware
>>>> does not provide the guarantees that the current code provides,
>>>> and so we can relax them.
>>>
>>> No, I think it's a reasonable guarantee to provide.
>>
>> Why?
>
> Because IIUC the semantics may depend not just on the interrupt
> controller, but also on the specific PCI device.  It seems safer to
> assume that at least one device/driver pair wants this to work.

It's indeed safe, but I think there's a nice win to be had if we drop
the assumption.

> (BTW, PCI memory writes are posted, but configuration writes are
> not).

MSIs are configured via PCI memory writes.

By itself, that doesn't buy us anything, since the guest could flush
the write via a read.  But I think the fact that the interrupt
messages themselves are posted proves that it is safe.

The fact that Linux does interrupt migration from within the
interrupt handler also shows that someone else believes that it is
the only safe place to do it.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/26/2013 06:11 PM, Michael S. Tsirkin wrote:
> On Tue, Nov 26, 2013 at 06:06:26PM +0200, Avi Kivity wrote:
>> [earlier discussion snipped]
>> It's indeed safe, but I think there's a nice win to be had if we
>> drop the assumption.
>
> I'm not arguing with that, but a minor comment below:
>
>> (BTW, PCI memory writes are posted, but configuration writes are
>> not).  MSIs are configured via PCI memory writes.  By itself, that
>> doesn't buy us anything, since the guest could flush the write via
>> a read.  But I think the fact that the interrupt messages
>> themselves are posted proves that it is safe.
>
> FYI, a PCI read flushes the interrupt itself in, too.

I guess that kills the optimization then.  Maybe you can do qrcu,
whatever that is.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/26/2013 06:24 PM, Gleb Natapov wrote:
> On Tue, Nov 26, 2013 at 04:20:27PM +0100, Paolo Bonzini wrote:
>> [earlier discussion snipped]
>> Without synchronize_rcu you could have
>>
>>     [timeline snipped]
>>
>> where the entry is stale but the VCPU has already resumed execution.
>
> So how is it different from what we have now:
>
>     disable_irq()
>     VCPU writes to routing table
>                                        e = entry from IRQ routing table
>                                        kvm_set_msi_irq(e, irq);
>                                        kvm_irq_delivery_to_apic_fast();
>     kvm_irq_routing_update(kvm, new);
>     synchronize_rcu()
>     VCPU resumes execution
>     enable_irq()
>     receive stale irq

Suppose the guest did not disable_irq() and enable_irq(), but instead
had a pci read where you have the enable_irq().

After the read you cannot have a stale irq (assuming the read flushes
the irq all the way to the APIC).
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/26/2013 06:28 PM, Paolo Bonzini wrote:
> Il 26/11/2013 17:24, Gleb Natapov ha scritto:
>> So how is it different from what we have now:
>>
>>     disable_irq()
>>     [timeline snipped]
>>     enable_irq()
>>     receive stale irq
>
> Adding disable/enable IRQs looks like a relatively big change.  But
> perhaps it's not, for some reason I'm missing.

Those are guest operations, which may not be there at all.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/28/2013 11:19 AM, Gleb Natapov wrote:
> On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote:
>> Il 28/11/2013 07:27, Zhanghaoyu (A) ha scritto:
>>>> Without synchronize_rcu you could have
>>>>
>>>>     VCPU writes to routing table
>>>>     [timeline snipped]
>>>>
>>>> where the entry is stale but the VCPU has already resumed
>>>> execution.
>>>
>>> If we use call_rcu() (not considering the problem that Gleb pointed
>>> out, temporarily) instead of synchronize_rcu(), should we still
>>> ensure this?
>>
>> The problem is that we should ensure this, so using call_rcu is not
>> possible (even not considering the memory allocation problem).
>
> Not changing current behaviour is certainly safer, but I am still not
> 100% convinced we have to ensure this.
>
> Suppose guest does:
>
>  1: change msi interrupt by writing to pci register
>  2: read the pci register to flush the write
>  3: zero idt
>
> I am pretty certain that this code can get an interrupt after step 2
> on real HW, but I cannot tell if the guest can rely on it being
> delivered exactly after the read instruction, or whether it can be
> delayed by a couple of instructions.  Seems to me it would be fragile
> for an OS to depend on this behaviour.  AFAIK Linux does not.

Linux is safe, it does interrupt migration from within the interrupt
handler.  If you do that before the device-specific EOI, you won't
get another interrupt until programming the MSI is complete.

Is virtio safe?  IIRC it can post multiple interrupts without guest
acks.

Using call_rcu() is a better solution than srcu IMO.  Fewer code
changes, consistently faster.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/28/2013 12:11 PM, Gleb Natapov wrote:
> On Thu, Nov 28, 2013 at 11:49:00AM +0200, Avi Kivity wrote:
>> [earlier discussion snipped]
>> Linux is safe, it does interrupt migration from within the interrupt
>> handler.  If you do that before the device-specific EOI, you won't
>> get another interrupt until programming the MSI is complete.
>>
>> Is virtio safe?  IIRC it can post multiple interrupts without guest
>> acks.
>>
>> Using call_rcu() is a better solution than srcu IMO.  Fewer code
>> changes, consistently faster.
>
> Why not fix userspace to use KVM_SIGNAL_MSI instead?

Shouldn't it work with old userspace too?  Maybe I misunderstood your
intent.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/28/2013 11:53 AM, Paolo Bonzini wrote:
> Il 28/11/2013 10:49, Avi Kivity ha scritto:
>> Linux is safe, it does interrupt migration from within the interrupt
>> handler.  If you do that before the device-specific EOI, you won't
>> get another interrupt until programming the MSI is complete.
>>
>> Is virtio safe?  IIRC it can post multiple interrupts without guest
>> acks.
>>
>> Using call_rcu() is a better solution than srcu IMO.  Fewer code
>> changes, consistently faster.
>
> call_rcu() has the problem of rate limiting, too.  It wasn't such a
> great idea, I think.
>
> The QRCU I linked would work great latency-wise (it has roughly the
> same latency as an rwsem, but readers are lock-free).  However, the
> locked operations in the read path would hurt because of cache
> misses, so it's not good either.

I guess srcu would work.  Do you know what's the typical write-side
latency there?  (In terms of what it waits for, not nanoseconds.)
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/28/2013 12:40 PM, Paolo Bonzini wrote:
> Il 28/11/2013 11:16, Avi Kivity ha scritto:
>> The QRCU I linked would work great latency-wise (it has roughly the
>> same latency as an rwsem, but readers are lock-free).  However, the
>> locked operations in the read path would hurt because of cache
>> misses, so it's not good either.
>>
>> I guess srcu would work.  Do you know what's the typical write-side
>> latency there?  (In terms of what it waits for, not nanoseconds.)
>
> If there's no concurrent reader, it's zero if I read the code right.
> Otherwise it depends:
>
> - if there are many callbacks, only 10 of them are processed per
>   millisecond.  But unless there are concurrent synchronize_srcu
>   calls there should not be any callback at all.  If all VCPUs were
>   to furiously change the MSIs, the latency could go up to #vcpu/10
>   milliseconds.
>
> - if there are no callbacks, but there are readers, synchronize_srcu
>   busy-loops for some time checking if the readers complete.  After
>   a while (20 us for synchronize_srcu, 120 us for
>   synchronize_srcu_expedited) it gives up and starts using a
>   workqueue to poll every millisecond.  This should never happen
>   unless
>
> So, given the very restricted usage this SRCU would have, we probably
> can expect synchronize_srcu_expedited to finish its job in the
> busy-looping phase, and 120 us should be the expected maximum
> latency---more likely to be an order of magnitude smaller, and in
> very rare cases higher.

Looks like it's a good fit then.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/28/2013 01:10 PM, Paolo Bonzini wrote: Il 28/11/2013 12:09, Gleb Natapov ha scritto: - if there are no callbacks, but there are readers, synchronize_srcu busy-loops for some time checking if the readers complete. After a while (20 us for synchronize_srcu, 120 us for synchronize_srcu_expedited) it gives up and starts using a workqueue to poll every millisecond. This should never happen unless Unless what ? :) Unless reader is scheduled out? Yes. Or unless my brain is scheduled out in the middle of a sentence. You need a grace period. Or just sleep().
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/28/2013 01:02 PM, Gleb Natapov wrote: On Thu, Nov 28, 2013 at 12:12:55PM +0200, Avi Kivity wrote: On 11/28/2013 12:11 PM, Gleb Natapov wrote: On Thu, Nov 28, 2013 at 11:49:00AM +0200, Avi Kivity wrote: On 11/28/2013 11:19 AM, Gleb Natapov wrote: On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote: Il 28/11/2013 07:27, Zhanghaoyu (A) ha scritto: Without synchronize_rcu you could have

    VCPU                                writes to routing table
    e = entry from IRQ routing table
                                        kvm_irq_routing_update(kvm, new);
    VCPU resumes execution
    kvm_set_msi_irq(e, &irq);
    kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution. If we use call_rcu() (not considering the problem that Gleb pointed out, temporarily) instead of synchronize_rcu(), should we still ensure this? The problem is that we should ensure this, so using call_rcu is not possible (even not considering the memory allocation problem). Not changing current behaviour is certainly safer, but I am still not 100% convinced we have to ensure this. Suppose guest does:

1: change msi interrupt by writing to pci register
2: read the pci register to flush the write
3: zero idt

I am pretty certain that this code can get an interrupt after step 2 on real HW, but I cannot tell if the guest can rely on it being delivered exactly after the read instruction, or whether it can be delayed by a couple of instructions. Seems to me it would be fragile for an OS to depend on this behaviour. AFAIK Linux does not. Linux is safe, it does interrupt migration from within the interrupt handler. If you do that before the device-specific EOI, you won't get another interrupt until programming the MSI is complete. Is virtio safe? IIRC it can post multiple interrupts without guest acks. Using call_rcu() is a better solution than srcu IMO. Less code changes, consistently faster. Why not fix userspace to use KVM_SIGNAL_MSI instead? Shouldn't it work with old userspace too? Maybe I misunderstood your intent.
Zhanghaoyu said that the problem mostly hurts in real-time telecom environment, so I propose how he can fix the problem in his specific environment. It will not fix older userspace obviously, but kernel fix will also require kernel update and usually updating userspace is easier. Isn't the latency due to interrupt migration causing long synchronize_rcu()s? How does KVM_SIGNAL_MSI help? The problem occurs with assigned devices too AFAICT.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/28/2013 01:22 PM, Gleb Natapov wrote: On Thu, Nov 28, 2013 at 01:18:54PM +0200, Avi Kivity wrote: On 11/28/2013 01:02 PM, Gleb Natapov wrote: On Thu, Nov 28, 2013 at 12:12:55PM +0200, Avi Kivity wrote: On 11/28/2013 12:11 PM, Gleb Natapov wrote: On Thu, Nov 28, 2013 at 11:49:00AM +0200, Avi Kivity wrote: On 11/28/2013 11:19 AM, Gleb Natapov wrote: On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote: Il 28/11/2013 07:27, Zhanghaoyu (A) ha scritto: Without synchronize_rcu you could have VCPU writes to routing table e = entry from IRQ routing table kvm_irq_routing_update(kvm, new); VCPU resumes execution kvm_set_msi_irq(e, irq); kvm_irq_delivery_to_apic_fast(); where the entry is stale but the VCPU has already resumed execution. If we use call_rcu()(Not consider the problem that Gleb pointed out temporarily) instead of synchronize_rcu(), should we still ensure this? The problem is that we should ensure this, so using call_rcu is not possible (even not considering the memory allocation problem). Not changing current behaviour is certainly safer, but I am still not 100% convinced we have to ensure this. Suppose guest does: 1: change msi interrupt by writing to pci register 2: read the pci register to flush the write 3: zero idt I am pretty certain that this code can get interrupt after step 2 on real HW, but I cannot tell if guest can rely on it to be delivered exactly after read instruction or it can be delayed by couple of instructions. Seems to me it would be fragile for an OS to depend on this behaviour. AFAIK Linux does not. Linux is safe, it does interrupt migration from within the interrupt handler. If you do that before the device-specific EOI, you won't get another interrupt until programming the MSI is complete. Is virtio safe? IIRC it can post multiple interrupts without guest acks. Using call_rcu() is a better solution than srcu IMO. Less code changes, consistently faster. Why not fix userspace to use KVM_SIGNAL_MSI instead? 
Shouldn't it work with old userspace too? Maybe I misunderstood your intent. Zhanghaoyu said that the problem mostly hurts in real-time telecom environment, so I propose how he can fix the problem in his specific environment. It will not fix older userspace obviously, but a kernel fix will also require a kernel update, and usually updating userspace is easier. Isn't the latency due to interrupt migration causing long synchronize_rcu()s? How does KVM_SIGNAL_MSI help? If MSI is delivered using KVM_SIGNAL_MSI, as opposed to via an entry in the irq routing table, changing the MSI configuration should not cause an update to the irq routing table (not saying this is what happens with current QEMU, but theoretically there is no reason to update the routing table in this case). I see. That pushes the problem to userspace, which uses traditional locking, so the problem disappears until qemu starts using rcu too to manage this. There is also irqfd, however. We could also do a KVM_UPDATE_IRQFD to change the payload it delivers, but that has exactly the same problems.
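For reference, a minimal sketch of what delivering an MSI via KVM_SIGNAL_MSI involves from userspace. The struct below mirrors the kernel's struct kvm_msi from <linux/kvm.h> but is redeclared with a hypothetical name so the sketch is self-contained, and the ioctl call is shown only in a comment since it needs a real /dev/kvm; the address/data encoding helpers are assumptions based on the x86 MSI format, not taken from the thread:

```c
#include <stdint.h>

/* Illustrative mirror of the kernel's struct kvm_msi (<linux/kvm.h>);
 * redeclared here so the sketch compiles standalone. */
struct kvm_msi_example {
    uint32_t address_lo;
    uint32_t address_hi;
    uint32_t data;
    uint32_t flags;
    uint8_t  pad[16];
};

/* x86 MSI address: 0xfee00000 with the destination APIC ID in bits 12-19. */
static uint32_t msi_address_lo(uint8_t dest_apic_id)
{
    return 0xfee00000u | ((uint32_t)dest_apic_id << 12);
}

/* Build the message for a given destination and vector.  On a real
 * system you would then issue:
 *     ioctl(vm_fd, KVM_SIGNAL_MSI, &msi);
 * which injects the MSI directly, with no irq routing table entry to
 * update -- which is the point being made above. */
static struct kvm_msi_example make_msi(uint8_t dest_apic_id, uint8_t vector)
{
    struct kvm_msi_example msi = {0};
    msi.address_lo = msi_address_lo(dest_apic_id);
    msi.data = vector;   /* fixed delivery mode, edge trigger */
    return msi;
}
```

Since no routing entry is involved, reprogramming the MSI is just building a new message in userspace — no synchronize_rcu() on the kernel side.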
Re: [Qemu-devel] outlined TLB lookup on x86
On 11/28/2013 04:12 AM, Richard Henderson wrote: 2. why not use a TLB of bigger size? Currently the TLB has 1<<8 entries. The TLB lookup is 10 x86 instructions, but every miss needs ~450 instructions; I measured this using Intel PIN. So even if the miss rate is low (say 3%), the overall time spent in cpu_x86_handle_mmu_fault is still significant. I'd be interested to experiment with different TLB sizes, to see what effect that has on performance. But I suspect that lack of TLB contexts means that we wind up flushing the TLB more often than real hardware does, and therefore a larger TLB merely takes longer to flush. You could use a generation counter to flush the TLB in O(1) by incrementing the counter. That slows down the fast path though. Maybe you can do that for the larger second level TLB only.
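The generation-counter idea can be sketched like this (the structures and names are invented for illustration, not QEMU's actual softmmu TLB): each entry records the generation it was filled in, a lookup only hits when the generations match, and a flush is a single increment — the cost being one extra compare on the fast path, as noted above:

```c
#include <stdint.h>

#define TLB_SIZE 256

struct tlb_entry {
    uint64_t vaddr;
    uint64_t paddr;
    uint32_t gen;       /* generation this entry was filled in */
};

static struct tlb_entry tlb[TLB_SIZE];
static uint32_t tlb_gen = 1;    /* 0 means "never valid" */

static void tlb_fill_entry(uint64_t vaddr, uint64_t paddr)
{
    struct tlb_entry *e = &tlb[(vaddr >> 12) % TLB_SIZE];
    e->vaddr = vaddr;
    e->paddr = paddr;
    e->gen = tlb_gen;
}

/* Returns nonzero on hit.  The e->gen compare is the fast-path cost
 * that the mail above mentions. */
static int tlb_lookup(uint64_t vaddr, uint64_t *paddr)
{
    struct tlb_entry *e = &tlb[(vaddr >> 12) % TLB_SIZE];
    if (e->gen == tlb_gen && e->vaddr == vaddr) {
        *paddr = e->paddr;
        return 1;
    }
    return 0;
}

/* O(1) flush: every existing entry's generation no longer matches,
 * so all entries become misses without touching the array. */
static void tlb_flush_all(void)
{
    tlb_gen++;
}
```

As suggested, this trade-off fits a large second-level TLB better than the small first-level one, where the extra compare hurts most.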
[Qemu-devel] [PATCH] enhance gdbstub to support x86-64
The following patch adds x86-64 support to gdbstub. Please consider for inclusion.

Index: gdbstub.c
===================================================================
--- gdbstub.c   (revision 2399)
+++ gdbstub.c   (revision 2400)
@@ -175,10 +175,144 @@
     return 0;
 }
 
-#if defined(TARGET_I386)
+#if defined(TARGET_X86_64)
 
 static int cpu_gdb_read_registers(CPUState *env, uint8_t *mem_buf)
 {
+    uint8_t *p = mem_buf;
+    int i, fpus;
+
+#define PUTREG(x) do {                  \
+        target_ulong reg = tswapl(x);   \
+        memcpy(p, &reg, sizeof reg);    \
+        p += sizeof reg;                \
+    } while (0)
+#define PUTREG32(x) do {                \
+        uint32_t reg = tswap32(x);      \
+        memcpy(p, &reg, sizeof reg);    \
+        p += sizeof reg;                \
+    } while (0)
+#define PUTREGF(x) do {                 \
+        memcpy(p, (x), 10);             \
+        p += sizeof (x);                \
+    } while (0)
+
+    PUTREG(env->regs[R_EAX]);
+    PUTREG(env->regs[R_EBX]);
+    PUTREG(env->regs[R_ECX]);
+    PUTREG(env->regs[R_EDX]);
+    PUTREG(env->regs[R_ESI]);
+    PUTREG(env->regs[R_EDI]);
+    PUTREG(env->regs[R_EBP]);
+    PUTREG(env->regs[R_ESP]);
+    PUTREG(env->regs[8]);
+    PUTREG(env->regs[9]);
+    PUTREG(env->regs[10]);
+    PUTREG(env->regs[11]);
+    PUTREG(env->regs[12]);
+    PUTREG(env->regs[13]);
+    PUTREG(env->regs[14]);
+    PUTREG(env->regs[15]);
+
+    PUTREG(env->eip);
+    PUTREG32(env->eflags);
+    PUTREG32(env->segs[R_CS].selector);
+    PUTREG32(env->segs[R_SS].selector);
+    PUTREG32(env->segs[R_DS].selector);
+    PUTREG32(env->segs[R_ES].selector);
+    PUTREG32(env->segs[R_FS].selector);
+    PUTREG32(env->segs[R_GS].selector);
+    /* XXX: convert floats */
+    for(i = 0; i < 8; i++) {
+        PUTREGF(env->fpregs[i]);
+    }
+    PUTREG32(env->fpuc);
+    fpus = (env->fpus & ~0x3800) | (env->fpstt & 0x7) << 11;
+    PUTREG32(fpus);
+    PUTREG32(0); /* XXX: convert tags */
+    PUTREG32(0); /* fiseg */
+    PUTREG32(0); /* fioff */
+    PUTREG32(0); /* foseg */
+    PUTREG32(0); /* fooff */
+    PUTREG32(0); /* fop */
+
+#undef PUTREG
+#undef PUTREG32
+#undef PUTREGF
+
+    return p - mem_buf;
+}
+
+static void cpu_gdb_write_registers(CPUState *env, uint8_t *mem_buf, int size)
+{
+    uint8_t *p = mem_buf;
+    uint32_t junk;
+    int i, fpus;
+
+#define GETREG(x) do {                  \
+        target_ulong reg;               \
+        memcpy(&reg, p, sizeof reg);    \
+        x = tswapl(reg);                \
+        p += sizeof reg;                \
+    } while (0)
+#define GETREG32(x) do {                \
+        uint32_t reg;                   \
+        memcpy(&reg, p, sizeof reg);    \
+        x = tswap32(reg);               \
+        p += sizeof reg;                \
+    } while (0)
+#define GETREGF(x) do {                 \
+        memcpy((x), p, 10);             \
+        p += 10;                        \
+    } while (0)
+
+    GETREG(env->regs[R_EAX]);
+    GETREG(env->regs[R_EBX]);
+    GETREG(env->regs[R_ECX]);
+    GETREG(env->regs[R_EDX]);
+    GETREG(env->regs[R_ESI]);
+    GETREG(env->regs[R_EDI]);
+    GETREG(env->regs[R_EBP]);
+    GETREG(env->regs[R_ESP]);
+    GETREG(env->regs[8]);
+    GETREG(env->regs[9]);
+    GETREG(env->regs[10]);
+    GETREG(env->regs[11]);
+    GETREG(env->regs[12]);
+    GETREG(env->regs[13]);
+    GETREG(env->regs[14]);
+    GETREG(env->regs[15]);
+
+    GETREG(env->eip);
+    GETREG32(env->eflags);
+    GETREG32(env->segs[R_CS].selector);
+    GETREG32(env->segs[R_SS].selector);
+    GETREG32(env->segs[R_DS].selector);
+    GETREG32(env->segs[R_ES].selector);
+    GETREG32(env->segs[R_FS].selector);
+    GETREG32(env->segs[R_GS].selector);
+    /* XXX: convert floats */
+    for(i = 0; i < 8; i++) {
+        GETREGF(env->fpregs[i]);
+    }
+    GETREG32(env->fpuc);
+    GETREG32(fpus); /* XXX: convert fpus */
+    GETREG32(junk); /* XXX: convert tags */
+    GETREG32(junk); /* fiseg */
+    GETREG32(junk); /* fioff */
+    GETREG32(junk); /* foseg */
+    GETREG32(junk); /* fooff */
+    GETREG32(junk); /* fop */
+
+#undef GETREG
+#undef GETREG32
+#undef GETREGF
+}
+
+#elif defined(TARGET_I386)
+
+static int cpu_gdb_read_registers(CPUState *env, uint8_t *mem_buf)
+{
     uint32_t *registers = (uint32_t *)mem_buf;
     int i, fpus;
@@ -545,7 +679,8 @@
     char buf[4096];
     uint8_t mem_buf[2000];
     uint32_t *registers;
-    uint32_t addr, len;
+    target_ulong addr;
+    uint32_t len;
 
 #ifdef DEBUG_GDB
     printf("command='%s'\n", line_buf);

-- error compiling committee.c: too many arguments to function ___ Qemu-devel mailing list Qemu-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/qemu-devel
Re: [Qemu-devel] Wipe patch
Brad Campbell wrote: ZIGLIO, Frediano, VF-IT wrote: Hi, well, this is not a definitive patch but it works. The aim is to be able to wipe the disk without allocating the entire space. When you wipe a disk the program fills the disk with zero bytes, so the disk image grows to allocate all the space. This patch detects null-byte writes and does not write all-zero clusters. This _looks_ like it would severely impact cpu load during a write. Have you done any testing to determine if this is likely to impact a normal usage scenario? Why would it? In most cases, the zero test would terminate quickly, without accessing the entire cluster. -- error compiling committee.c: too many arguments to function ___ Qemu-devel mailing list Qemu-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/qemu-devel
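The early-termination point can be seen in a minimal version of such a zero test (a sketch, not the patch's actual code): on ordinary non-zero data the loop bails out within the first few bytes, so the common-case cost is a handful of loads rather than a full cluster scan:

```c
#include <stddef.h>

/* Return 1 iff the whole buffer contains only zero bytes.
 * Exits at the first non-zero byte, so for typical data the cost is
 * far below scanning an entire cluster; only genuinely zero clusters
 * (the ones the wipe produces) pay for the full walk. */
static int buffer_is_zero(const unsigned char *buf, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++) {
        if (buf[i] != 0)
            return 0;
    }
    return 1;
}
```

A production version would compare word-sized chunks instead of single bytes, but the asymmetric cost argument is the same.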
Re: [Qemu-devel] hosting the forum
Sorry, this was meant to be off-list. My apologies. Avi Kivity wrote: Hetz Ben Hamo wrote: Due to some personal problems (related to financial situations) I'm no longer being able to keep paying the server which hosts the QEMU forum. Therefore, I'm looking for someone who can host the QEMU forum (and, if possible, the nightly snapshots). If anyone would want to host it, please contact me offline. Hi, We may be interested in hosting the forum and the nightlies on our servers here. Can you tell us about the bandwidth required? Also, if you have qemu coding experience we might be able to help with the financial situation as well. Please send your CV to me if interested (we are located in the Poleg area). -- error compiling committee.c: too many arguments to function ___ Qemu-devel mailing list Qemu-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/qemu-devel
Re: [Qemu-devel] qemu vs gcc4
Paul Brook wrote: On Monday 23 October 2006 09:16, Martin Guy wrote: Now, gcc4 can produce code with several return instructions (with no option to turn that off, as far as I understand). You cannot cut them out, and therefore you cannot chain the simple functions. ...unless you also map return instructions within the generated functions into branches to the soon-to-be-dropped final return? Not that I know anything about qemu internals mind u... That's exactly what my gcc4 hacks do. It gets complicated because x86 uses variable length insn encodings so you don't know where insn boundaries are, and a jmp instruction is larger than a ret instruction so it's not always possible to do a straight replacement. How about

    void some_generated_instruction(u32 a1, u32 s2)
    {
        // code
        asm volatile ("");
    }

That will force the code to fall through to the null asm code, avoiding premature returns. If the code uses 'return' explicitly, turn it into a goto just before the 'asm volatile'. -- error compiling committee.c: too many arguments to function ___ Qemu-devel mailing list Qemu-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/qemu-devel
Re: [Qemu-devel] qemu vs gcc4
Paul Brook wrote: That's exactly what my gcc4 hacks do. It gets complicated because x86 uses variable length insn encodings so you don't know where insn boundaries are, and a jmp instruction is larger than a ret instruction so it's not always possible to do a straight replacement. how about

    void some_generated_instruction(u32 a1, u32 s2)
    {
        // code
        asm volatile ("");
    }

that will force the code to fall through to the null asm code, avoiding premature returns. if the code uses 'return' explicitly, turn it into a goto just before the 'asm volatile'. We already do that. It doesn't stop gcc putting the return in the middle of the function. Paul

    void f1();
    void f2();

    void f(int *z, int x, int y)
    {
        if (x) {
            *z = x;
            f1();
        } else {
            *z = y;
            f2();
        }
        asm volatile ("");
    }

works, with gcc -O2 -fno-reorder-blocks. Removing either the asm or the -f flag doesn't. No idea if it's consistent across architectures. (the function calls are there to prevent cmov optimizations) -- error compiling committee.c: too many arguments to function ___ Qemu-devel mailing list Qemu-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/qemu-devel
Re: [Qemu-devel] qemu vs gcc4
Paul Brook wrote: We already do that. It doesn't stop gcc putting the return in the middle of the function. Paul

    void f1();
    void f2();

    void f(int *z, int x, int y)
    {
        if (x) {
            *z = x;
            f1();
        } else {
            *z = y;
            f2();
        }
        asm volatile ("");
    }

works, with gcc -O2 -fno-reorder-blocks. Removing either the asm or the -f flag doesn't. No idea if it's consistent across architectures. It doesn't work reliably though. We already do everything you mention above. Okay. Sorry for pestering :) -- error compiling committee.c: too many arguments to function ___ Qemu-devel mailing list Qemu-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/qemu-devel
Re: [Qemu-devel] [PATCH] Experimental initial patch providing accelerated OpenGL for Linux i386 (2nd attempt to post)
Even Rouault wrote: Apart from portability to other architectures, what would be the other advantages of a solution based on a PCI device ? I can think of a two advantages: - the device can deliver interrupts to the guest when something is completed (I don't know how useful this is to a 3D device) - the guest can autodetect the presence of the device by means of a PCI ID. This means that if the guest driver is eventually included in the OS (which is possible in Linux), then the driver will be loaded automatically by the OS. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ___ Qemu-devel mailing list Qemu-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/qemu-devel
Re: [Qemu-devel] Re: NBD server for QEMU images
Anthony Liguori wrote: Mounting a partition being served on the same host as read-write can cause deadlocks. From nbd-2.9.0 README file: This text is pretty old. Is this still valid? This would imply that things like loop can result in dead locks. I don't see why flushing one device would depend on the completion of another device. Otherwise, if you had two disk adapters, they would always be operating in lock step. As I've said, I've never seen a problem doing-write with nbd on localhost. A deadlock can happen under heavy I/O load:

- write tons of data to nbd device, data ends up in pagecache
- memory gets low, kswapd wakes up, calls nbd device to actually write the data
- nbd issues a request, which ends up on the nbd server on the same machine
- the nbd server allocates memory
- memory allocation hangs waiting for kswapd

I submitted a patch/workaround for this some years ago (for a similar problem with a local mount to a userspace nfs server). -- error compiling committee.c: too many arguments to function ___ Qemu-devel mailing list Qemu-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/qemu-devel
Re: [Qemu-devel] Re: NBD server for QEMU images
Martin Guy wrote:

- write tons of data to nbd device, data ends up in pagecache
- memory gets low, kswapd wakes up, calls nbd device to actually write the data
- nbd issues a request, which ends up on the nbd server on the same machine
- the nbd server allocates memory
- memory allocation hangs waiting for kswapd

In other words, it can deadlock only if you are swapping to an nbd device that is served by nbd-server running on the same machine and kernel. No. It is possible if you issue non-O_SYNC writes. In the case of a qemu system swapping over nbd to a server on the host machine, it is the guest kernel that waits on the host kernel paging the nbd server in from the host's separate swap space, so no deadlock is possible. Practice bears this out; if you wanna stress-test it, here's a program that creates a low memory condition by saturating the VM. It isn't enough to thrash the guest, you need to exhaust host memory as well. You also need to do it faster than kswapd can write it out. Of course, this has nothing to do with the original patch, which just lets nbd-server interpret qemu image files ;) Agreed. But mounting nbd from localhost is dangerous. -- error compiling committee.c: too many arguments to function ___ Qemu-devel mailing list Qemu-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/qemu-devel
Re: [Qemu-devel] Re: NBD server for QEMU images
Salvador Fandino wrote: I have run some tests and found that it's easy to cause a deadlock just untaring a file over an nbd device being served from localhost (using the standard nbd-server or my own, it doesn't matter). Another interesting finding is that when the deadlock happens, qemu-nbds is inside a read() call, waiting for new nbd requests to arrive over the socket, and so, not trying to allocate memory or writing to disk. If you use sysrq-t and a serial console, you will find exactly how it's waiting for memory. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ___ Qemu-devel mailing list Qemu-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/qemu-devel
Re: [Qemu-devel] QEMU/pc and scsi disks
Anthony Liguori wrote: I think most people agree that we need a config file. I haven't seen any comments on my config file patch though. So, any comments on that patch? Any requirements on a format? 1. Any option should be settable either in the config file or command line. In other words, the user should not be forced to use a config file. This is useful for management programs who keep all options in an internal database, and for users who can experiment via a ^P edit edit edit enter. 2. The %(blah)s syntax is a bit ugly. It's one of the few things in python I dislike. I'd prefer ${blah} and ${blah:modifier}. 3. It would be nice to be able to embed a config file in a qcow3 format so that one can ship an image with some default options in one file. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ___ Qemu-devel mailing list Qemu-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/qemu-devel
Re: [Qemu-devel] QEMU/pc and scsi disks
Anthony Liguori wrote: 1. Any option should be settable either in the config file or command line. In other words, the user should not be forced to use a config file. This is useful for management programs who keep all options in an internal database, and for users who can experiment via a ^P edit edit edit enter. I think we should still provide the ability to set the most common options via the command line. I'm also fine with specifying single options on the command line. I suspect though that being able to do -config - is more useful for management tools than building large strings of command line options. Out of curiosity, why? If the options are stored in some database, as is likely, surely it is easier to generate a longish command line than to generate a unique name for a file, remove it if it already exists, write out the data, launch qemu, and clean up the file later? And for migration, you have to regenerate it, since some options may have changed (cdrom media). Of course the config file is very useful for a user maintaining a few virtual machines on their own, but when qemu is part of a longer food chain, I think it's just a complication. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ___ Qemu-devel mailing list Qemu-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/qemu-devel
Re: [Qemu-devel] QEMU/pc and scsi disks
Paul Brook wrote: Out of curiosity, why? If the options are stored in some database, as is likely, surely it is easier to generate a longish command line than to generate a unique name for a file, remove it if it already exists, write out the data, launch qemu, and clean up the file later? And for migration, you have to regenerate it, since some options may have changed (cdrom media). You're going to start hitting commandline length limits fairly rapidly (Windows in particular has a fairly low limit). Well, at least on linux, the limit is sane (1024 or 10240?). -- error compiling committee.c: too many arguments to function ___ Qemu-devel mailing list Qemu-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/qemu-devel
Re: [Qemu-devel] QEMU/pc and scsi disks
Chris Wilson wrote: Hi Avi, On Sun, 4 Mar 2007, Avi Kivity wrote: Anthony Liguori wrote: I think we should still provide the ability to set the most common options via the command line. I'm also fine with specifying single options on the command line. I suspect though that being able to do -config - is more useful for management tools than building large strings of command line options. Out of curiosity, why? If the options are store in some database, as is likely, surely it is easier to generate a longish command line than to generate a unique name for a file, remove it if it already exists, write out the data, launch qemu, and clean up the file later? And for migration, you have to regenerate it, since some options may have changed (cdrom media). I think Anthony was suggesting that -config - would read the config from standard input, avoiding the need for any temporary file, and also avoiding any command-line length limits on any platforms (Win32 is unlikely to accept as much as Linux). Ah, I missed the '-'. Thanks. Indeed it's quite easy to feed a config file from memory into qemu. -- error compiling committee.c: too many arguments to function ___ Qemu-devel mailing list Qemu-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/qemu-devel
Re: [Qemu-devel] please review this scsi patch
Wang Cheng Yeh wrote: thanks If you include a description of what the patch does and why it is necessary, it will probably be reviewed a lot quicker. -- error compiling committee.c: too many arguments to function ___ Qemu-devel mailing list Qemu-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/qemu-devel
Re: [Qemu-devel] [RFC/experimental patch] qemu (x86_64 on x86_64 -no-kqemu) compiles with gcc4 and works
Axel Zeuner wrote: A full featured converter (cvtasm) has a lot of dependencies: it has to support all hosts (M) (with all assembler dialects M') and all targets N, i.e. in the worst case one would end with M'x N variants of it, or M x N if one supports only one assembler dialect per host. It is clear, that the number of variants is one of the biggest disadvantages of such an approach. Perhaps a mixed approach can be made for gradual conversion: for combinations where cvtasm has been written, use that. Where that's still to be done, have dyngen generate call instructions to the ops instead of pasting the ops text directly. -- error compiling committee.c: too many arguments to function ___ Qemu-devel mailing list Qemu-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/qemu-devel
Re: [Qemu-devel] Exception running under KVM
GSMD wrote: When trying to launch with

    kvm -no-acpi -nographic -m 512 -cdrom /media/ubuntu-7.04-beta-alternate-i386.iso -boot d /opt/ubuntu.img

I get

    (qemu) exception 12 (0)

Please report kvm problems to [EMAIL PROTECTED] This looks like the real-mode emulation problem kvm has on Intel processors. It is a well known problem, unfortunately difficult to fix. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [kvm-devel] [Qemu-devel] Fixing ACPI for Vista -- Source code for bios.bin vs. bochs CVS?
Paul Brook wrote: On Monday 30 April 2007, Nakajima, Jun wrote: We found a way to get 32-bit Vista up on KVM by fixing bios.bin. And I think it should be helpful to qemu as well. However, I'm not sure if the binary file has been created simply by applying bios.diff to the latest bochs CVS tree (actually I doubt it). What makes you think this won't work? Have you tried it? Regardless of whether it works or not, it would be nice to have a file in qemu/pc-bios detailing the revision numbers of the sources taken from bochs, synced on each update. Something like the output of

    cvs status | awk '/^File:/ { file = $2 }
                      /^   Working revision:/ { printf("%-20s %s\n", file, $3) }'

-- error compiling committee.c: too many arguments to function
[Qemu-devel] Re: [kvm-devel] qemu/kvm seems to take ALSA all for itself?
David Abrahams wrote: When I have windows XP running under kvm, I get [EMAIL PROTECTED]:/tmp$ aplay /usr/share/sounds/gaim/receive.wav ALSA lib pcm_dmix.c:864:(snd_pcm_dmix_open) unable to open slave aplay: main:550: audio open error: Device or resource busy As soon as I shut down my VM, though, it works perfectly. Is this expected behavior? I think you have to set up a mixer or something. But I'd rather be deaf than have to wade through all the pseudo-documentation in order to find out how. -- error compiling committee.c: too many arguments to function
[Qemu-devel] Re: [kvm-devel] PIIX/IDE: ports disabled in PCI config space?
Luca Tettamanti wrote: Hello, I'm testing the new Fedora7 under KVM. As you may know Fedora has migrated to the new libata drivers. ata_piix is unhappy with the PIIX IDE controller provided by QEmu/KVM: [...] The following patch fixes the problem (i.e. ata_piix finds both the HD and the cdrom):

--- a/hw/ide.c  2007-06-04 19:34:25.0 +0200
+++ b/hw/ide.c  2007-06-04 21:45:28.0 +0200
@@ -2586,6 +2586,8 @@ static void piix3_reset(PCIIDEState *d)
     pci_conf[0x06] = 0x80; /* FBC */
     pci_conf[0x07] = 0x02; // PCI_status_devsel_medium
     pci_conf[0x20] = 0x01; /* BMIBA: 20-23h */
+    pci_conf[0x41] = 0x80; // enable port 0
+    pci_conf[0x43] = 0x80; // enable port 1
 }
 
 void pci_piix_ide_init(PCIBus *bus, BlockDriverState **hd_table, int devfn)

I imagine the reset state in the spec is disabled? If so, then the long-term fix is to enable these bits in the bios. In any case, I applied this to the kvm repo. Thanks. -- error compiling committee.c: too many arguments to function
[Qemu-devel] Re: [kvm-devel] PIIX/IDE: ports disabled in PCI config space?
Luca wrote: Good point. I'm looking at bochs right now: the BIOS doesn't touch those bits. The 2 ports are enabled/disabled - using values from the config file - by the reset function of the emulated controller (iodev/pci_ide.cc), so they're doing the same thing I've done for QEMU. I'd rather not touch the BIOS, it's a bit obscure ;-) No kidding :-) Maybe if you post your findings to bochs-devel they'd make the change. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Detecting Client OS BSOF/Kernel Oops
Clemens Kolbitsch wrote: Hi! I'd like to detect if the client OS crashes... right now, only for linux, but windows systems will become interesting for me as well in the future... Is there an easy way of detecting if a BSOD or a kernel oops happened?? Maybe that'd be possible by checking if the IP is inside a certain range (I could find that location, I think... it does not have to be working generically... for specific client OS would be sufficient)!! You could try checking whether the hlt instruction (on x86) was executed with interrupts disabled. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
[Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
On 10/08/2010 05:48 PM, Stefan Hajnoczi wrote: Signed-off-by: Stefan Hajnoczistefa...@linux.vnet.ibm.com --- docs/specs/qed_spec.txt | 94 +++ 1 files changed, 94 insertions(+), 0 deletions(-) +Feature bits: +* QED_F_BACKING_FILE = 0x01. The image uses a backing file. +* QED_F_NEED_CHECK = 0x02. The image needs a consistency check before use. Great that QED_F_NEED_CHECK is now non-optional. However, suppose we add a new structure (e.g. persistent freelist); it's now impossible to tell whether the structure is updated or not: 1 new qemu opens image 2 writes persistent freelist 3 clears need_check 4 shuts down 5 old qemu opens image 6 doesn't update persistent freelist 7 clears need_check 8 shuts down The image is now inconsistent, but has need_check cleared. We can address this by having a third feature bitmask that is autocleared by guests that don't recognize various bits; so the sequence becomes: 1 new qemu opens image 2 writes persistent freelist 3 clears need_check 4 sets persistent_freelist 5 shuts down 6 old qemu opens image 7 clears persistent_freelist (and any other bits it doesn't recognize) 8 doesn't update persistent freelist 9 clears need_check 10 shuts down The image is now consistent, since the persistent freelist has disappeared. +* QED_CF_BACKING_FORMAT = 0x01. The image has a specific backing file format stored. + It was suggested to have just a bit saying whether the backing format is raw or not. This way you don't need to store the format. -- error compiling committee.c: too many arguments to function
[Qemu-devel] Re: [PATCH v2 6/7] qed: Read/write support
On 10/08/2010 05:48 PM, Stefan Hajnoczi wrote: This patch implements the read/write state machine. Operations are fully asynchronous and multiple operations may be active at any time. Allocating writes lock tables to ensure metadata updates do not interfere with each other. If two allocating writes need to update the same L2 table they will run sequentially. If two allocating writes need to update different L2 tables they will run in parallel. Shouldn't there be a flush between an allocating write and an L2 update? Otherwise a reuse of a cluster can move logical sectors from one place to another, causing a data disclosure. Can be skipped if the new cluster is beyond the physical image size. + +/* + * Table locking works as follows: + * + * Reads and non-allocating writes do not acquire locks because they do not + * modify tables and only see committed L2 cache entries. What about a non-allocating write that follows an allocating write? 1 Guest writes to sector 0 2 Host reads backing image (or supplies zeros), sectors 1-127 3 Host writes sectors 0-127 4 Guest writes sector 1 5 Host writes sector 1 There needs to be a barrier that prevents the host and the disk from reordering operations 3 and 5, or guest operation 4 is lost. As far as the guest is concerned no overlapping writes were issued, so it isn't required to provide any barriers. (based on the comment only, haven't read the code) -- error compiling committee.c: too many arguments to function
[Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
On 10/11/2010 12:09 PM, Stefan Hajnoczi wrote: On Sun, Oct 10, 2010 at 11:20:09AM +0200, Avi Kivity wrote: On 10/08/2010 05:48 PM, Stefan Hajnoczi wrote: Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com --- docs/specs/qed_spec.txt | 94 +++ 1 files changed, 94 insertions(+), 0 deletions(-) +Feature bits: +* QED_F_BACKING_FILE = 0x01. The image uses a backing file. +* QED_F_NEED_CHECK = 0x02. The image needs a consistency check before use. Great that QED_F_NEED_CHECK is now non-optional. However, suppose we add a new structure (e.g. persistent freelist); it's now impossible to tell whether the structure is updated or not: 1 new qemu opens image 2 writes persistent freelist 3 clears need_check 4 shuts down 5 old qemu opens image 6 doesn't update persistent freelist 7 clears need_check 8 shuts down The image is now inconsistent, but has need_check cleared. We can address this by having a third feature bitmask that is autocleared by guests that don't recognize various bits; so the sequence becomes: 1 new qemu opens image 2 writes persistent freelist 3 clears need_check 4 sets persistent_freelist 5 shuts down 6 old qemu opens image 7 clears persistent_freelist (and any other bits it doesn't recognize) 8 doesn't update persistent freelist 9 clears need_check 10 shuts down The image is now consistent, since the persistent freelist has disappeared. It is more complicated than just the feature bit. The freelist requires space in the image file. Clearing the persistent_freelist bit leaks the freelist. We can solve this by using a compat feature bit and an autoclear feature bit. The autoclear bit specifies whether or not the freelist is valid and the compat feature bit specifies whether or not the freelist exists. When the new qemu opens the image again it notices that the autoclear bit is unset but the compat bit is set. This means the freelist is out-of-date and its space can be reclaimed. 
I don't like the idea of doing these feature bit acrobatics because they add complexity. On the other hand your proposal ensures backward compatibility in the case of an optional data structure that needs to stay in sync with the image. I'm just not 100% convinced it's worth it. My scenario ends up with data corruption if we move to an old qemu and then back again, without any aborts. A leak is acceptable (it won't grow; it's just an unused, incorrect freelist), but data corruption is not. -- error compiling committee.c: too many arguments to function
[Qemu-devel] Re: [PATCH v2 6/7] qed: Read/write support
On 10/11/2010 12:37 PM, Stefan Hajnoczi wrote: On Sun, Oct 10, 2010 at 11:10:15AM +0200, Avi Kivity wrote: On 10/08/2010 05:48 PM, Stefan Hajnoczi wrote: This patch implements the read/write state machine. Operations are fully asynchronous and multiple operations may be active at any time. Allocating writes lock tables to ensure metadata updates do not interfere with each other. If two allocating writes need to update the same L2 table they will run sequentially. If two allocating writes need to update different L2 tables they will run in parallel. Shouldn't there be a flush between an allocating write and an L2 update? Otherwise a reuse of a cluster can move logical sectors from one place to another, causing a data disclosure. Can be skipped if the new cluster is beyond the physical image size. Currently clusters are never reused and new clusters are always beyond physical image size. The only exception I can think of is when the image file size is not a multiple of the cluster size and we round down to the start of cluster. In an implementation that supports TRIM or otherwise reuses clusters this is a cost. + +/* + * Table locking works as follows: + * + * Reads and non-allocating writes do not acquire locks because they do not + * modify tables and only see committed L2 cache entries. What about a non-allocating write that follows an allocating write? 1 Guest writes to sector 0 2 Host reads backing image (or supplies zeros), sectors 1-127 3 Host writes sectors 0-127 4 Guest writes sector 1 5 Host writes sector 1 There needs to be a barrier that prevents the host and the disk from reordering operations 3 and 5, or guest operation 4 is lost. As far as the guest is concerned no overlapping writes were issued, so it isn't required to provide any barriers. (based on the comment only, haven't read the code) There is no barrier between operations 3 and 5. However, operation 5 only starts after operation 3 has completed because of table locking. 
It is my understanding that *independent* requests may be reordered but two writes to the *same* sector will not be reordered if write A completes before write B is issued. Imagine a test program that uses pwrite() to rewrite a counter many times on disk. When the program finishes it prints the counter variable's last value. This scenario is like operations 3 and 5 above. If we read the counter back from disk it will be the final value, not some intermediate value. The writes will not be reordered. Yes, all that is needed is a barrier in program order. So, operation 5 is an allocating write as long as 3 hasn't returned? (at which point it becomes a non-allocating write)? -- error compiling committee.c: too many arguments to function
[Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
On 10/11/2010 05:02 PM, Anthony Liguori wrote: On 10/11/2010 08:44 AM, Avi Kivity wrote: On 10/11/2010 03:42 PM, Stefan Hajnoczi wrote: A leak is acceptable (it won't grow; it's just an unused, incorrect freelist), but data corruption is not. The alternative is for the freelist to be a non-compat feature bit. That means older QEMU binaries cannot use a QED image that has enabled the freelist. For this one feature. What about others? A compat feature is one where the feature can be completely ignored (meaning that the QEMU does not have to understand the data format). An example of a compat feature is copy-on-read. It's merely a suggestion and there is no additional metadata. If a QEMU doesn't understand it, it doesn't affect its ability to read the image. An example of a non-compat feature would be zero cluster entries. Zero cluster entries are a special L2 table entry that indicates that a cluster's on-disk data is all zeros. As long as there is at least 1 ZCE in the L2 tables, this feature bit must be set. As soon as all of the ZCE bits are cleared, the feature bit can be unset. An older QEMU will gracefully fail when presented with an image using ZCE bits. An image with no ZCEs will work on older QEMUs. What's the motivation behind ZCE? There is yet a third type of feature, one which is not strictly needed in order to use the image, but if used, must be kept synchronized. An example is the freelist. Another example is a directory index for a filesystem. I can't think of another example which would be relevant to QED -- metadata checksums perhaps? -- we can always declare it a non-compatible feature, but of course, it reduces compatibility. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
On 10/11/2010 04:54 PM, Anthony Liguori wrote: On 10/11/2010 08:04 AM, Avi Kivity wrote: On 10/11/2010 12:09 PM, Stefan Hajnoczi wrote: On Sun, Oct 10, 2010 at 11:20:09AM +0200, Avi Kivity wrote: On 10/08/2010 05:48 PM, Stefan Hajnoczi wrote: Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com --- docs/specs/qed_spec.txt | 94 +++ 1 files changed, 94 insertions(+), 0 deletions(-) +Feature bits: +* QED_F_BACKING_FILE = 0x01. The image uses a backing file. +* QED_F_NEED_CHECK = 0x02. The image needs a consistency check before use. Great that QED_F_NEED_CHECK is now non-optional. However, suppose we add a new structure (e.g. persistent freelist); it's now impossible to tell whether the structure is updated or not: 1 new qemu opens image 2 writes persistent freelist 3 clears need_check 4 shuts down 5 old qemu opens image 6 doesn't update persistent freelist 7 clears need_check 8 shuts down The image is now inconsistent, but has need_check cleared. We can address this by having a third feature bitmask that is autocleared by guests that don't recognize various bits; so the sequence becomes: 1 new qemu opens image 2 writes persistent freelist 3 clears need_check 4 sets persistent_freelist 5 shuts down 6 old qemu opens image 7 clears persistent_freelist (and any other bits it doesn't recognize) 8 doesn't update persistent freelist 9 clears need_check 10 shuts down The image is now consistent, since the persistent freelist has disappeared. It is more complicated than just the feature bit. The freelist requires space in the image file. Clearing the persistent_freelist bit leaks the freelist. We can solve this by using a compat feature bit and an autoclear feature bit. The autoclear bit specifies whether or not the freelist is valid and the compat feature bit specifies whether or not the freelist exists. When the new qemu opens the image again it notices that the autoclear bit is unset but the compat bit is set. 
This means the freelist is out-of-date and its space can be reclaimed. I don't like the idea of doing these feature bit acrobatics because they add complexity. On the other hand your proposal ensures backward compatibility in the case of an optional data structure that needs to stay in sync with the image. I'm just not 100% convinced it's worth it. My scenario ends up with data corruption if we move to an old qemu and then back again, without any aborts. A leak is acceptable (it won't grow; it's just an unused, incorrect freelist), but data corruption is not. A leak is unacceptable. It means an image can grow to an unbounded size. If you are a server provider offering multitenancy, then a malicious guest can potentially grow the image beyond its allotted size causing a Denial of Service attack against another tenant. This particular leak cannot grow, and is not controlled by the guest. A freelist has to be a non-optional feature. When the freelist bit is set, an older QEMU cannot read the image. If the freelist is completely used, the freelist bit can be cleared and the image is then usable by older QEMUs. Once we support TRIM (or detect zeros) we'll never have a clean freelist. -- error compiling committee.c: too many arguments to function
[Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
On 10/11/2010 05:30 PM, Stefan Hajnoczi wrote: It was discussed before, but I don't think we came to a conclusion. Are there any circumstances under which you don't want to set the QED_CF_BACKING_FORMAT flag? I suggest the following: QED_CF_BACKING_FORMAT_RAW = 0x1 When set, the backing file is a raw image and should not be probed for its file format. The default (unset) means that the backing image file format may be probed. Now the backing_fmt_{offset,size} are no longer necessary. Should it not be an incompatible option? If the backing disk starts with a format magic, it will be probed by an older qemu, incorrectly. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
On 10/11/2010 05:49 PM, Anthony Liguori wrote: On 10/11/2010 09:58 AM, Avi Kivity wrote: A leak is unacceptable. It means an image can grow to an unbounded size. If you are a server provider offering multitenancy, then a malicious guest can potentially grow the image beyond its allotted size causing a Denial of Service attack against another tenant. This particular leak cannot grow, and is not controlled by the guest. As the image gets moved from hypervisor to hypervisor, it can keep growing if given a chance to fill up the disk, then trim it all away. In a mixed hypervisor environment, it just becomes a numbers game. I don't see how it can grow. Both the freelist and the clusters it points to consume space, which becomes a leak once you move it to a hypervisor that doesn't understand the freelist. The older hypervisor then allocates new blocks. As soon as it performs a metadata scan (if ever), the freelist is reclaimed. You could only get a growing leak if you moved it to a hypervisor that doesn't perform metadata scans, but then that is independent of the freelist. A freelist has to be a non-optional feature. When the freelist bit is set, an older QEMU cannot read the image. If the freelist is completely used, the freelist bit can be cleared and the image is then usable by older QEMUs. Once we support TRIM (or detect zeros) we'll never have a clean freelist. Zero detection doesn't add to the free list. Why not? If a cluster is zero filled, you may drop it (assuming no backing image). A potential solution here is to treat TRIM a little differently than we've been discussing. When TRIM happens, don't immediately write an unallocated cluster entry for the L2. Leave the L2 entry intact. Don't actually write a UCE to the L2 until you actually allocate the block. This implies a cost because you'll need to do metadata syncs to make this work. However, that eliminates leakage. 
The information is lost on shutdown; and you can have a large number of unallocated-in-waiting clusters (like a TRIM issued by mkfs, or a user expecting a visit from RIAA). A slight twist on your proposal is to have an allocated-but-may-drop bit in a L2 entry. TRIM or zero detection sets the bit (leaving the cluster number intact). A following write to the cluster needs to clear the bit; if we reallocate the cluster we need to replace it with a ZCE. This makes the freelist all L2 entries with the bit set; it may be less efficient than a custom data structure though. -- error compiling committee.c: too many arguments to function
[Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
On 10/11/2010 05:41 PM, Anthony Liguori wrote: On 10/11/2010 10:24 AM, Avi Kivity wrote: On 10/11/2010 05:02 PM, Anthony Liguori wrote: On 10/11/2010 08:44 AM, Avi Kivity wrote: On 10/11/2010 03:42 PM, Stefan Hajnoczi wrote: A leak is acceptable (it won't grow; it's just an unused, incorrect freelist), but data corruption is not. The alternative is for the freelist to be a non-compat feature bit. That means older QEMU binaries cannot use a QED image that has enabled the freelist. For this one feature. What about others? A compat feature is one where the feature can be completely ignored (meaning that the QEMU does not have to understand the data format). An example of a compat feature is copy-on-read. It's merely a suggestion and there is no additional metadata. If a QEMU doesn't understand it, it doesn't affect its ability to read the image. An example of a non-compat feature would be zero cluster entries. Zero cluster entries are a special L2 table entry that indicates that a cluster's on-disk data is all zeros. As long as there is at least 1 ZCE in the L2 tables, this feature bit must be set. As soon as all of the ZCE bits are cleared, the feature bit can be unset. An older QEMU will gracefully fail when presented with an image using ZCE bits. An image with no ZCEs will work on older QEMUs. What's the motivation behind ZCE? It's very useful for Copy-on-Read. If the cluster in the backing file is unallocated, then when you do a copy-on-read, you don't want to write out a zero cluster since you'd expand the image to its maximum size. It's also useful for operations like compaction in the absence of TRIM. The common implementation on platforms like VMware is to open a file and write zeros to it until it fills up the filesystem. You then delete the file. The result is that any unallocated data on the disk is written as zero and combined with zero-detection in the image format, you can compact the image size by marking unallocated blocks as ZCE. Both make sense. 
The latter is also useful with TRIM: if you have a backing image it's better to implement TRIM with ZCE rather than exposing the cluster from the backing file; it saves you a COW when you later reallocate the cluster. There is yet a third type of feature, one which is not strictly needed in order to use the image, but if used, must be kept synchronized. An example is the freelist. Another example is a directory index for a filesystem. I can't think of another example which would be relevant to QED -- metadata checksums perhaps? -- we can always declare it a non-compatible feature, but of course, it reduces compatibility. You're suggesting a feature that is not strictly needed, but that needs to be kept up to date. If it can't be kept up to date, something needs to happen to remove it. Let's call this a transient feature. Most of the transient features can be removed given some bit of code. For instance, ZCE can be removed by writing out zero clusters or writing an unallocated L2 entry if there is no backing file. I think we could add a qemu-img demote command or something like that that attempted to remove features when possible. That doesn't give you instant compatibility but I'm doubtful that you can come up with a generic way to remove a feature from an image without knowing anything about the image. That should work, and in the worst case there is qemu-img convert (which should be taught about format options). -- error compiling committee.c: too many arguments to function
[Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
On 10/11/2010 04:06 PM, Stefan Hajnoczi wrote: On Mon, Oct 11, 2010 at 03:44:38PM +0200, Avi Kivity wrote: On 10/11/2010 03:42 PM, Stefan Hajnoczi wrote: A leak is acceptable (it won't grow; it's just an unused, incorrect freelist), but data corruption is not. The alternative is for the freelist to be a non-compat feature bit. That means older QEMU binaries cannot use a QED image that has enabled the freelist. For this one feature. What about others? Compat features that need to be in sync with the image state will either require specific checks (e.g. checksum or shadow of the state) or they need to be non-compat features and are not backwards compatible. I'm not opposing autoclear feature bits themselves, they are a neat idea. However, they will initially have no users so is this something we really want to carry? Hard to tell in advance. It seems like a simple feature with potential to make our lives easier in the future. Anything we develop in the next few months can be rolled back into the baseline, assuming we declare the format unstable while 0.14 is in development, so this is really about post 0.14 features. -- error compiling committee.c: too many arguments to function
[Qemu-devel] Re: [SeaBIOS] [RFC] Passing boot order from qemu to seabios
On 10/11/2010 05:39 PM, Gleb Natapov wrote: On Mon, Oct 11, 2010 at 05:09:22PM +0200, Avi Kivity wrote: On 10/11/2010 12:18 PM, Gleb Natapov wrote: Currently if VM is started with multiple disks it is almost impossible to guess which one of them will be used as boot device especially if there is a mix of ATA/virtio/SCSI devices. Essentially BIOS decides the order and without looking into the code you can't tell what the order will be (and in qemu-kvm if boot=on is used it brings even more havoc). We should allow fine-grained control of boot order from qemu command line, or as a minimum control what device will be used for booting. To do that along with inventing syntax to specify boot order on qemu command line we need to communicate boot order to seabios via fw_cfg interface. For that we need to have a way to unambiguously specify a disk from qemu to seabios. PCI bus address is not enough since not all devices are PCI (do we care about them?) and since one PCI device may control more than one disk (ATA slave/master, SCSI LUNs). We can do what EDD specification does. Describe disk as: bus type (isa/pci), address on a bus (16 bit base address for isa, b/s/f for pci) device type (ATA/SCSI/VIRTIO) device path (slave/master for ATA, LUN for SCSI, nothing for virtio) Will it cover all use cases? Any other ideas? Any ideas about qemu command line syntax? Maybe somebody wants to implement it? :) Instead of fwcfg, we should store the boot order in the bios. This allows seabios to implement persistent boot selection and control boot order from within the guest. It is not instead of, it is at best in addition to. First of all seabios does not have persistent storage currently and second I much prefer specifying boot device from command line instead of navigating bios menus. That is what we have to do on real HW because there is no other way to do it, but in virtualization we can do better. Ok. 
So fwcfg will have an option do your default thing which the bios can take as a hint to look in cmos memory. -- error compiling committee.c: too many arguments to function
[Qemu-devel] Re: [SeaBIOS] [RFC] Passing boot order from qemu to seabios
On 10/11/2010 12:18 PM, Gleb Natapov wrote: Currently if VM is started with multiple disks it is almost impossible to guess which one of them will be used as boot device especially if there is a mix of ATA/virtio/SCSI devices. Essentially BIOS decides the order and without looking into the code you can't tell what the order will be (and in qemu-kvm if boot=on is used it brings even more havoc). We should allow fine-grained control of boot order from qemu command line, or as a minimum control what device will be used for booting. To do that along with inventing syntax to specify boot order on qemu command line we need to communicate boot order to seabios via fw_cfg interface. For that we need to have a way to unambiguously specify a disk from qemu to seabios. PCI bus address is not enough since not all devices are PCI (do we care about them?) and since one PCI device may control more than one disk (ATA slave/master, SCSI LUNs). We can do what EDD specification does. Describe disk as: bus type (isa/pci), address on a bus (16 bit base address for isa, b/s/f for pci) device type (ATA/SCSI/VIRTIO) device path (slave/master for ATA, LUN for SCSI, nothing for virtio) Will it cover all use cases? Any other ideas? Any ideas about qemu command line syntax? Maybe somebody wants to implement it? :) Instead of fwcfg, we should store the boot order in the bios. This allows seabios to implement persistent boot selection and control boot order from within the guest. -- error compiling committee.c: too many arguments to function
[Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
On 10/11/2010 03:42 PM, Stefan Hajnoczi wrote: A leak is acceptable (it won't grow; it's just an unused, incorrect freelist), but data corruption is not. The alternative is for the freelist to be a non-compat feature bit. That means older QEMU binaries cannot use a QED image that has enabled the freelist. For this one feature. What about others? -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
On 10/11/2010 06:10 PM, Anthony Liguori wrote: On 10/11/2010 11:02 AM, Avi Kivity wrote: On 10/11/2010 05:49 PM, Anthony Liguori wrote: On 10/11/2010 09:58 AM, Avi Kivity wrote: A leak is unacceptable. It means an image can grow to an unbounded size. If you are a server provider offering multitenancy, then a malicious guest can potentially grow the image beyond it's allotted size causing a Denial of Service attack against another tenant. This particular leak cannot grow, and is not controlled by the guest. As the image gets moved from hypervisor to hypervisor, it can keep growing if given a chance to fill up the disk, then trim it all way. In a mixed hypervisor environment, it just becomes a numbers game. I don't see how it can grow. Both the freelist and the clusters it points to consume space, which becomes a leak once you move it to a hypervisor that doesn't understand the freelist. The older hypervisor then allocates new blocks. As soon as it performs a metadata scan (if ever), the freelist is reclaimed. Assume you don't ever do a metadata scan (which is really our design point). What about crashes? If you move to a hypervisor that doesn't support it, then move to a hypervisor that does, you create a brand new freelist and start leaking more space. This isn't a contrived scenario if you have a cloud environment with a mix of hosts. It's only a leak if you don't do a metadata scan. You might not be able to get a ping-pong every time you provision, but with enough effort, you could create serious problems. It's really an issue of correctness. Making correctness trade-offs for the purpose of compatibility is a policy decision and not something we should bake into an image format. If a tool feels strongly that it's a reasonable trade off to make, it can always fudge the feature bits itself. I think the effort here is reasonable, clearing a bit on startup is not that complicated. 
A potential solution here is to treat TRIM a little differently than we've been discussing. When TRIM happens, don't immediately write an unallocated cluster entry for the L2. Leave the L2 entry intact. Don't actually write a UCE to the L2 until you actually allocate the block. This implies a cost because you'll need to do metadata syncs to make this work. However, that eliminates leakage. The information is lost on shutdown; and you can have a large number of unallocated-in-waiting clusters (like a TRIM issued by mkfs, or a user expecting a visit from RIAA). A slight twist on your proposal is to have an allocated-but-may-drop bit in an L2 entry. TRIM or zero detection sets the bit (leaving the cluster number intact). A following write to the cluster needs to clear the bit; if we reallocate the cluster we need to replace it with a ZCE. Yeah, this is sort of what I was thinking. You would still want a free list but it becomes totally optional because if it's lost, no data is leaked (assuming that the older version understands the bit). I was suggesting that we store that bit in the free list though because that lets us support having older QEMUs with absolutely no knowledge still work. It doesn't - on rewrite an old qemu won't clear the bit, so a newer qemu would think it's still free. The autoclear bit solves it nicely - the old qemu automatically drops the allocated-but-may-drop bits, undoing any TRIMs (which is unfortunate) but preserving consistency. -- error compiling committee.c: too many arguments to function
[Qemu-devel] Re: [PATCH v2 6/7] qed: Read/write support
On 10/12/2010 06:16 PM, Anthony Liguori wrote: It's fairly simple to add a sync to this path. It's probably worth checking the prefill/postfill for zeros and avoiding the write/sync if that's the case. That should optimize the common cases of allocating new space within a file. My intuition is that we can avoid the sync entirely but we'll need to think about it further. I don't think so. This isn't a guest initiated write so we can't shift responsibility to the guest. -- error compiling committee.c: too many arguments to function
[Qemu-devel] Re: [PATCH v2 2/7] cutils: Add bytes_to_str() to format byte values
On 10/08/2010 05:48 PM, Stefan Hajnoczi wrote: From: Anthony Liguori aligu...@us.ibm.com This common function converts byte counts to human-readable strings with proper units. Signed-off-by: Anthony Liguori aligu...@us.ibm.com Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com --- cutils.c | 15 +++ qemu-common.h | 1 + 2 files changed, 16 insertions(+), 0 deletions(-) diff --git a/cutils.c b/cutils.c index 6c32198..5041203 100644 --- a/cutils.c +++ b/cutils.c @@ -301,3 +301,18 @@ int get_bits_from_size(size_t size) return __builtin_ctzl(size); #endif } + +void bytes_to_str(char *buffer, size_t buffer_len, uint64_t size) +{ +if (size < (1ULL << 10)) { +snprintf(buffer, buffer_len, "%" PRIu64 " byte(s)", size); +} else if (size < (1ULL << 20)) { +snprintf(buffer, buffer_len, "%" PRIu64 " KB(s)", size >> 10); +} else if (size < (1ULL << 30)) { +snprintf(buffer, buffer_len, "%" PRIu64 " MB(s)", size >> 20); +} else if (size < (1ULL << 40)) { +snprintf(buffer, buffer_len, "%" PRIu64 " GB(s)", size >> 30); +} else { +snprintf(buffer, buffer_len, "%" PRIu64 " TB(s)", size >> 40); +} +} This will show 1.5GB as 1GB. Need either floating point with a couple of digits of precision, or show 1.5GB as 1500MB. It also misuses SI prefixes. 1 GB means 10^9 bytes, not 2^30 bytes (with a common exception for RAM which is usually packaged in 1.125 times some power of two). -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [Qemu-devel] Re: [PATCH v2 6/7] qed: Read/write support
On 10/13/2010 03:24 PM, Anthony Liguori wrote:

On 10/13/2010 08:07 AM, Kevin Wolf wrote:

On 13.10.2010 14:13, Stefan Hajnoczi wrote:

We can avoid it when a backing image is not used. Your idea to check for zeroes in the backing image is neat too, it may well reduce the common case even for backing images.

The additional requirement is that we're extending the file and not reusing an old cluster. (And bdrv_has_zero_init() == true, but QED doesn't work on host_devices anyway)

Yes, that's a good point. BTW, I think we've decided that making it work on host_devices is not that bad. We can add an additional feature called QED_F_PHYSICAL_SIZE. This feature will add another field to the header that contains an offset immediately following the last cluster allocation. During a metadata scan, we can accurately recreate this field so we only need to update this field whenever we clear the header dirty bit (which means during an fsync()).

If you make QED_F_PHYSICAL_SIZE an autoclear bit, you don't need the header dirty bit.

That means we can maintain the physical size without introducing additional fsync()s in the allocation path. Since we're already writing out the header anyway, the write operation is basically free too.

I don't see how it is free. It's an extra write.

The good news is that it's very easy to amortize.

-- 
I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [Qemu-devel] Re: [PATCH v2 6/7] qed: Read/write support
On 10/13/2010 04:11 PM, Anthony Liguori wrote: Why would you ever update the header, apart from relocating L1 for some reason? To update the L1/L2 tables clean bit. That's what prevents a check in the normal case where you have a clean shutdown. I see - so you wouldn't update it every allocation, only when the disk has been quiet for a while. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [Qemu-devel] Re: [PATCH v2 6/7] qed: Read/write support
On 10/13/2010 04:07 PM, Stefan Hajnoczi wrote:

On Wed, Oct 13, 2010 at 03:50:00PM +0200, Avi Kivity wrote:

On 10/13/2010 03:24 PM, Anthony Liguori wrote:

On 10/13/2010 08:07 AM, Kevin Wolf wrote:

On 13.10.2010 14:13, Stefan Hajnoczi wrote:

We can avoid it when a backing image is not used. Your idea to check for zeroes in the backing image is neat too, it may well reduce the common case even for backing images.

The additional requirement is that we're extending the file and not reusing an old cluster. (And bdrv_has_zero_init() == true, but QED doesn't work on host_devices anyway)

Yes, that's a good point. BTW, I think we've decided that making it work on host_devices is not that bad. We can add an additional feature called QED_F_PHYSICAL_SIZE. This feature will add another field to the header that contains an offset immediately following the last cluster allocation. During a metadata scan, we can accurately recreate this field so we only need to update this field whenever we clear the header dirty bit (which means during an fsync()).

If you make QED_F_PHYSICAL_SIZE an autoclear bit, you don't need the header dirty bit.

Do you mean we just need to check the physical size header field against the actual file size? If the two are different, then a consistency check is forced.

I thought you'd only use a header size field when you don't have a real file size. Why do you need both?

That means we can maintain the physical size without introducing additional fsync()s in the allocation path. Since we're already writing out the header anyway, the write operation is basically free too.

I don't see how it is free. It's an extra write.

The good news is that it's very easy to amortize. We only need to update the header field on disk when we're already updating the header, so it's not even an extra write operation.

Why would you ever update the header, apart from relocating L1 for some reason?
-- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [Qemu-devel] Re: [PATCH v2 6/7] qed: Read/write support
On 10/13/2010 04:53 PM, Anthony Liguori wrote:

On 10/13/2010 09:16 AM, Avi Kivity wrote:

On 10/13/2010 04:11 PM, Anthony Liguori wrote:

Why would you ever update the header, apart from relocating L1 for some reason?

To update the L1/L2 tables clean bit. That's what prevents a check in the normal case where you have a clean shutdown.

I see - so you wouldn't update it every allocation, only when the disk has been quiet for a while.

Right, the current plan is to flush the header dirty bit on shutdown or whenever there is an explicit flush of the device. Currently that is caused by either a guest-initiated flush or an L1 update.

That does add an extra write (and a new write+flush later to mark the header dirty again when you start allocating). I'd drop it and only use the timer.

In fact, it adds an extra flush too. The sequence

1. L1 update
2. mark clean
3. flush

is unsafe since you can crash between 2 and 3, and only 2 makes it to disk. So I'd do something like:

1. opportunistic flush (for whatever reason)
2. set timer
3. no intervening metadata changes
4. mark clean
5. no intervening metadata changes
6. mark dirty
7. flush
8. metadata changes

-- 
I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
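The ordering argument above can be made concrete with a toy model (not QED code; the operation names are illustrative). The unsafe order writes the clean bit before the flush that makes the metadata durable; the safe order flushes first, and re-marks dirty (plus flush) before the next metadata change.

```c
/* Toy model of the write ordering: we just record operations in a
 * log so the order can be inspected.  A real implementation would
 * issue disk writes and fsync() at these points. */
enum { MAX_OPS = 16 };
static const char *oplog[MAX_OPS];
static int nops;

static void op(const char *what) { oplog[nops++] = what; }

/* Safe: metadata is durable *before* the header claims it is clean. */
void mark_clean_safe(void)
{
    op("metadata update");
    op("flush");
    op("mark clean");
}

/* Before the next allocation: dirty bit durable before new metadata. */
void next_allocation(void)
{
    op("mark dirty");
    op("flush");
    op("metadata update");
}
```

A crash at any point in this ordering leaves either a dirty header (forcing a harmless consistency check) or a clean header over fully flushed metadata, never a clean header over unflushed metadata.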
[Qemu-devel] Re: [patch 0/8] port qemu-kvm's MCE support (v3)
On 10/11/2010 08:31 PM, Marcelo Tosatti wrote: Port qemu-kvm's KVM MCE (Machine Check Exception) handling to qemu. It allows qemu to propagate MCEs to the guest. v2: - rename do_qemu_ram_addr_from_host. - fix kvm_on_sigbus/kvm_on_sigbus_vcpu naming. - fix bank register restoration (Dean Nelson). v3: - condition MCE generation on MCE_SEG_P bit (Huang Ying). I only see patches 1 and 4 from v2, and this cover letter from v3. Please repost. Also, if the patchset ends up with qemu-kvm master being different from uq/master in this area, please post the corresponding qemu-kvm master patches. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Hitting 29 NIC limit
On 10/14/2010 12:54 AM, Anthony Liguori wrote:

On 10/13/2010 05:32 PM, Anjali Kulkarni wrote:

Hi,

Using the legacy way of starting up NICs, I am hitting a limitation after 29 NICs, i.e. no more than 29 are detected (that's because of the 32 PCI slot limit on a single bus - 3 are already taken up). I had initially increased MAX_NICS to 48, just on my tree, to get more, but of course that won't work. Is there any way to go beyond 29 NICs the legacy way? What is the maximum that can be supported by the qdev method?

I got up to 104 without trying very hard using the following script:

args=""
for slot in 5 6 7 8 9 10 11 12 13 14 15 16 17; do
    for fn in 0 1 2 3 4 5 6 7; do
        args="$args -netdev user,id=eth${slot}_${fn}"
        args="$args -device virtio-net-pci,addr=${slot}.${fn},netdev=eth${slot}_${fn},multifunction=on,romfile="
    done
done
x86_64-softmmu/qemu-system-x86_64 -hda ~/images/linux.img ${args} -enable-kvm

The key is to make the virtio-net devices multifunction and to fill out all 8 functions for each slot.

This is unlikely to work right wrt pci hotplug. If we want to support a large number of interfaces, we need true multiport cards. What's the motivation for such a huge number of interfaces?

-- 
error compiling committee.c: too many arguments to function
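For reference, the `addr=${slot}.${fn}` notation in the script maps onto PCI's 8-bit devfn encoding: five bits of slot (0-31) and three bits of function (0-7). That is why filling all eight functions per slot multiplies the per-bus device count by eight.

```c
#include <stdint.h>

/* PCI devfn encoding: devfn = slot << 3 | function. */
static inline uint8_t pci_devfn(uint8_t slot, uint8_t fn)
{
    return (uint8_t)((slot << 3) | (fn & 7));
}

static inline uint8_t pci_slot(uint8_t devfn) { return devfn >> 3; }
static inline uint8_t pci_fn(uint8_t devfn)   { return devfn & 7; }
```

So `addr=5.3` corresponds to devfn 43, and a bus can address at most 32 * 8 = 256 functions.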
Re: [Qemu-devel] Hitting 29 NIC limit
On 10/14/2010 02:54 PM, Anthony Liguori wrote:

The key is to make the virtio-net devices multifunction and to fill out all 8 functions for each slot.

This is unlikely to work right wrt pci hotplug.

Yes. Our hotplug design is based on devices. This is wrong; it should be based on bus-level concepts (like PCI slots).

If we want to support a large number of interfaces, we need true multiport cards.

This magic here creates a multiport virtio-net card, so I'm not really sure what you're suggesting. It would certainly be nice to make this all more user friendly (and make hotplug work).

The big issue is to fix hotplug. I don't see how we can make it user friendly without making the ordinary case even more unfriendly. Looks like we need yet another level of indirection here.

-- 
error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Hitting 29 NIC limit
On 10/14/2010 04:11 PM, Anthony Liguori wrote: On 10/14/2010 08:23 AM, Avi Kivity wrote: On 10/14/2010 02:54 PM, Anthony Liguori wrote: The key is to make the virtio-net devices multifunction and to fill out all 8 functions for each slot. This is unlikely to work right wrt pci hotplug. Yes. Our hotplug design is based on devices.. This is wrong, it should be based on bus-level concepts (like PCI slots). If we want to support a large number of interfaces, we need true multiport cards. This magic here creates a multiport virtio-net card so I'm not really sure what you're suggesting. It would certainly be nice to make this all more user friendly (and make hotplug work). The big issue is to fix hotplug. Yes, but this is entirely independent of multifunction devices. Today we shoe-horn hot remove into device_del. Instead, we should have explicit bus-level interfaces for hot remove. I'm not saying multiplug is not the right way to approach this (it is). The only concern is to get hotplug right. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH 1/3] Introduce threadlets
On 10/14/2010 11:15 AM, Stefan Hajnoczi wrote: I forgot to add that the semantics of cancellation make it difficult to write correct user code. Every cancellation user needs to add extra synchronization after the cancel call to handle the case where the work is currently executing. This seems tricky to me and I suspect code using this interface will be buggy. How about the following? 1. Add a return value indicating that the work is currently executing (this still requires the caller to add extra synchronization but is at least explicit) versus work is no longer on the list. 2. Add a flag to block until the work has been cancelled or completed. This is useful to callers who are allowed to block. Blocking is somewhat against the spirit of the thing, no? While I agree that the current cancel API is hard to use correctly, blocking defeats the purpose of the API. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH 1/3] Introduce threadlets
On 10/14/2010 11:32 PM, Venkateswararao Jujjuri (JV) wrote: Blocking is somewhat against the spirit of the thing, no? While I agree that the current cancel API is hard to use correctly, blocking defeats the purpose of the API. Are you proposing to add additional state in the return (canceled/running/not-canceled) and leave the synchronization part to the user? i.e not to provide any additional interface for the user to wait for the scheduled work to finish? Just trying to understand. I wasn't proposing anything since I don't have a good proposal. Adding a callback makes the whole thing an asynchronous design which threads are trying to avoid. Blocking is bad. Leaving it to the caller is hard to use correctly. Perhaps we can have a threadlet with barrier semantics. You queue a piece of work which is guaranteed to execute after all previously submitted work (against the same queue) and before any consequently submitted work. -- error compiling committee.c: too many arguments to function
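One way to model the queued-barrier idea Avi floats (a toy dispatcher, not the proposed threadlet API): a barrier is just an item on the same queue, and the dispatcher refuses to hand out work past an incomplete barrier, or the barrier itself while earlier items are in flight. The submitter never blocks.

```c
#include <stdbool.h>

enum { QMAX = 32 };

typedef struct {
    bool barrier;
    bool dispatched;
    bool done;
} Item;

typedef struct {
    Item items[QMAX];
    int  count;
} Queue;

int q_submit(Queue *q, bool barrier)
{
    q->items[q->count] = (Item){ .barrier = barrier };
    return q->count++;
}

/* Index of the next dispatchable item, or -1 if nothing may run yet. */
int q_next(Queue *q)
{
    int inflight = 0;
    for (int i = 0; i < q->count; i++) {
        Item *it = &q->items[i];
        if (it->barrier && !it->done) {
            /* a barrier runs alone, after all earlier items finish */
            if (!it->dispatched && inflight == 0)
                return i;
            return -1;          /* nothing runs past an incomplete barrier */
        }
        if (!it->dispatched)
            return i;
        if (!it->done)
            inflight++;
    }
    return -1;
}
```

This gives the "after everything before it, before everything after it" guarantee without any caller-side synchronization: completion of the barrier item is itself the signal that all prior work finished.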
[Qemu-devel] Re: [patch 0/8] port qemu-kvm's MCE support (v3 resend)
On 10/11/2010 08:31 PM, Marcelo Tosatti wrote: Port qemu-kvm's KVM MCE (Machine Check Exception) handling to qemu. It allows qemu to propagate MCEs to the guest. v2: - rename do_qemu_ram_addr_from_host. - fix kvm_on_sigbus/kvm_on_sigbus_vcpu naming. - fix bank register restoration (Dean Nelson). v3: - condition MCE generation on MCE_SEG_P bit (Huang Ying). Thanks, applied to uq/master. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH 1/3] Introduce threadlets
On 10/18/2010 12:47 PM, Arun R Bharadwaj wrote: * Avi Kivitya...@redhat.com [2010-10-17 10:57:23]: On 10/14/2010 11:32 PM, Venkateswararao Jujjuri (JV) wrote: Blocking is somewhat against the spirit of the thing, no? While I agree that the current cancel API is hard to use correctly, blocking defeats the purpose of the API. Are you proposing to add additional state in the return (canceled/running/not-canceled) and leave the synchronization part to the user? i.e not to provide any additional interface for the user to wait for the scheduled work to finish? Just trying to understand. I wasn't proposing anything since I don't have a good proposal. Adding a callback makes the whole thing an asynchronous design which threads are trying to avoid. Blocking is bad. Leaving it to the caller is hard to use correctly. Perhaps we can have a threadlet with barrier semantics. You queue a piece of work which is guaranteed to execute after all previously submitted work (against the same queue) and before any consequently submitted work. -- error compiling committee.c: too many arguments to function I would suggest that we have 2 APIs - cancel_threadletwork (current cancel implementation) and cancel_threadletwork_sync (waits for work to complete). As of now there is no known user for cancel_threadletwork_sync. So we can keep this as a TODO for later. I can provide the APIs for both these so that when we have a user for cancel_threadletwork_sync, we can go ahead and implement it. I agree it's best not to implement c_t_s() now. Using it implies a stall so we should discourage it. -- error compiling committee.c: too many arguments to function
[Qemu-devel] Re: [PATCH] Add support for async page fault to qemu
On 10/18/2010 05:48 PM, Juan Quintela wrote:

Gleb Natapov <g...@redhat.com> wrote: Add save/restore of MSR for migration and cpuid bit.

Is there a way to test if async page faults are in use?

Yes, msr != 0 - need a subsection.

Good idea.

If so, we can add a subsection instead of changing the cpu version. I think that at some point we are going to need a bitmap that indicates what MSRs have been used or something like that. What do you think?

We just need to check if an MSR is different from its default value (which we can get by reading MSRs immediately after the initial reset). Currently the reset code assumes the MSR reset value is zero; that's wrong.

-- 
error compiling committee.c: too many arguments to function
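The default-value check Avi describes could look roughly like this (hypothetical helper names, not qemu code): snapshot each MSR right after reset, then emit a migration subsection only for MSRs that have diverged from that snapshot.

```c
#include <stdint.h>
#include <stdbool.h>

enum { NMSRS = 4 };

typedef struct {
    uint64_t val[NMSRS];        /* current values, as read from KVM */
    uint64_t reset_val[NMSRS];  /* captured immediately after reset */
} MsrState;

/* Call right after the initial reset, before the guest runs. */
void msrs_capture_defaults(MsrState *s)
{
    for (int i = 0; i < NMSRS; i++)
        s->reset_val[i] = s->val[i];
}

/* A subsection is needed only if some MSR left its reset value. */
bool msrs_need_subsection(const MsrState *s)
{
    for (int i = 0; i < NMSRS; i++)
        if (s->val[i] != s->reset_val[i])
            return true;
    return false;
}
```

This avoids hard-coding zero as the assumed reset value, which is exactly the bug Avi points out in the current reset code.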
[Qemu-devel] Re: [PATCH] Implement a virtio GPU transport
On 10/19/2010 12:31 PM, Ian Molton wrote:

and virtualization@ - many virtio developers live there.

you mean virtualizat...@lists.osdl.org ?

Yes.

2. should start with a patch to the virtio-pci spec to document what you're doing

Where can I find that spec?

http://ozlabs.org/~rusty/virtio-spec/

+    /* Transfer data */
+    if (virtqueue_add_buf(vq, sg_list, o_page, i_page, (void *)1) >= 0) {
+        virtqueue_kick(vq);
+        /* Chill out until it's done with the buffer. */
+        while (!virtqueue_get_buf(vq, &count))
+            cpu_relax();
+    }
+

This is pretty gross, and will burn lots of cpu if the hypervisor processes the queue asynchronously.

It doesn't, at present... It could be changed fairly easily without breaking anything if that happens though.

The hypervisor and the guest can be changed independently. The driver should be coded so that it doesn't depend on hypervisor implementation details.

-- 
error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Re: KVM call agenda for Oct 19
On 10/19/2010 02:48 PM, Dor Laor wrote:

On 10/19/2010 04:11 AM, Chris Wright wrote:

* Juan Quintela (quint...@redhat.com) wrote: Please send in any agenda items you are interested in covering.

- 0.13.X -stable handoff
- 0.14 planning
- threadlet work
- virtfs proposals

- Live snapshots - We were asked to add this feature for external qcow2 images. Will a simple approach of fsync + tracking each requested backing file (it can be per vDisk) and re-opening the new image be accepted?

- Integration with FS freeze for consistent guest app snapshots. Many apps do not sync their ram state to disk correctly or frequently enough. Physical-world backup software calls fs freeze on xfs and VSS for windows to make the backup consistent. In order to integrate this with live snapshots we need a guest agent to trigger the guest fs freeze. We can either have qemu communicate with the agent directly through virtio-serial, or have a mgmt daemon use virtio-serial to communicate with the guest in addition to QMP messages about the live snapshot state. Preferences? The first solution complicates qemu while the second complicates mgmt.

Third option: make the freeze path management -> qemu -> virtio-blk -> guest kernel -> file systems. The advantage is that it's easy to associate file systems with a block device this way.

-- 
error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Re: KVM call agenda for Oct 19
On 10/19/2010 03:22 PM, Anthony Liguori wrote: I had assumed that this would involve: qemu -hda windows.img (qemu) snapshot ide0-disk0 snap0.img 1) create snap0.img internally by doing the equivalent of `qemu-img create -f qcow2 -b windows.img snap0.img' 2) bdrv_flush('ide0-disk0') 3) bdrv_open(snap0.img) 4) bdrv_close(windows.img) 5) rename('windows.img', 'windows.img.tmp') 6) rename('snap0.img', 'windows.img') 7) rename('windows.img.tmp', 'snap0.img') Looks reasonable. Would be interesting to look at this as a use case for the threading work. We should eventually be able to create a snapshot without stalling vcpus (stalling I/O of course allowed). -- error compiling committee.c: too many arguments to function
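Steps 5-7 of the plan above - the rename dance that leaves the live image under its original name - can be sketched as follows (illustrative, not the eventual qemu implementation; within one filesystem each rename(2) is atomic, though the three-step swap as a whole is not).

```c
#include <stdio.h>
#include <string.h>
#include <assert.h>

/* Swap the freshly created snapshot into place:
 *   live -> tmp, snap -> live, tmp -> snap
 * Returns 0 on success, -1 on the first failed rename. */
int swap_images(const char *live, const char *snap, const char *tmp)
{
    if (rename(live, tmp))  return -1;  /* windows.img -> windows.img.tmp */
    if (rename(snap, live)) return -1;  /* snap0.img   -> windows.img     */
    if (rename(tmp, snap))  return -1;  /* .tmp        -> snap0.img       */
    return 0;
}
```

After the swap, the guest keeps writing to the original filename (now the qcow2 overlay), while the old contents sit untouched under the snapshot name as its backing file.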
Re: [Qemu-devel] Re: KVM call agenda for Oct 19
On 10/19/2010 02:58 PM, Dor Laor wrote:

On 10/19/2010 02:55 PM, Avi Kivity wrote:

On 10/19/2010 02:48 PM, Dor Laor wrote:

On 10/19/2010 04:11 AM, Chris Wright wrote:

* Juan Quintela (quint...@redhat.com) wrote: Please send in any agenda items you are interested in covering.

- 0.13.X -stable handoff
- 0.14 planning
- threadlet work
- virtfs proposals
- Live snapshots - We were asked to add this feature for external qcow2 images. Will a simple approach of fsync + tracking each requested backing file (it can be per vDisk) and re-opening the new image be accepted?
- Integration with FS freeze for consistent guest app snapshots. Many apps do not sync their ram state to disk correctly or frequently enough. Physical-world backup software calls fs freeze on xfs and VSS for windows to make the backup consistent. In order to integrate this with live snapshots we need a guest agent to trigger the guest fs freeze. We can either have qemu communicate with the agent directly through virtio-serial, or have a mgmt daemon use virtio-serial to communicate with the guest in addition to QMP messages about the live snapshot state. Preferences? The first solution complicates qemu while the second complicates mgmt.

Third option: make the freeze path management -> qemu -> virtio-blk -> guest kernel -> file systems. The advantage is that it's easy to associate file systems with a block device this way.

OTOH the userspace freeze path already exists and now you create another path.

I guess we would still have a userspace daemon; instead of talking to virtio-serial it talks to virtio-blk. So: management -> qemu -> virtio-blk -> guest driver -> kernel fs resolver -> daemon -> apps. Yuck.

What about FS that span over LVM with multiple drives? IDE/SCSI?

Good points.

-- 
error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Re: KVM call agenda for Oct 19
On 10/19/2010 03:38 PM, Stefan Hajnoczi wrote: bdrv_aio_freeze() or any mechanism to deal with pending requests in the generic block code would be a good step for future live support of other operations like truncate. + logical disk grow, etc. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] KVM call minutes for Oct 19
On 10/20/2010 10:21 AM, Alexander Graf wrote: On 19.10.2010, at 17:14, Chris Wright wrote: 0.13.X -stable - Anthony will send note to qemu-devel on this - move 0.13.X -stable to a separate tree - driven independently of main qemu tree - challenge is always in the porting and testing of backported fixes - looking for volunteers 0.14 - would like to do this before end of the year - 0.13 forked off a while back (~July), - 0.14 features - QMP stabilized - 0.13.0 - 0.14 QMP - hard attempt not to break compatibility - new commands, rework, async, human monitor passthrough - goal getting to libvirt not needing human monitor at all - QMP KVM autotest test suite submitted - in-kernel apic, tpr patching still outstanding - QED coroutine concurrency Would it be realistic to declare deprecating the qemu-kvm fork for 0.14 as goal? For general use perhaps, device assignment might need another cycle. -- error compiling committee.c: too many arguments to function
[Qemu-devel] Re: [SeaBIOS] [PATCH] mark irq9 active high in DSDT
On 10/21/2010 04:00 AM, Kevin O'Connor wrote:

On Wed, Oct 20, 2010 at 11:34:41AM +0200, Gleb Natapov wrote:

In PIIX4 SCI (irq9) is active high. Seabios marks it so in the interrupt override table, but some OSes (FreeBSD) require the same information to be present in the DSDT too. Make it so.

Signed-off-by: Gleb Natapov <g...@redhat.com>

Thanks.

How do we manage the stable series wrt this issue? qemu-kvm-0.12.5 has a regression within the stable series that this patch fixes. qemu 0.12.5 does not, but only because it does not emulate polarity in the I/O APIC correctly. There are several paths we could take:

- do nothing, bug is fixed in mainline
- release a seabios 0.x.1 for qemu 0.13.1 with this patch
- same, plus a seabios 0.y.1 for qemu 0.12.6 with this patch
- skip qemu (which is not truly affected), patch qemu-kvm's copy of seabios for both 0.12.z and 0.13.z

The third option is the most correct from a release engineering point of view, but involves more work for everyone. The fourth is quick pain relief but is a little forky.

-- 
error compiling committee.c: too many arguments to function
Re: [SeaBIOS] [Qemu-devel] [PATCH 1/2] pci: Automatically patch PCI device id in PCI ROM
On 10/18/2010 08:39 PM, Anthony Liguori wrote: SeaBIOS rejects them when loaded from the rom bar and doesn't reject them when loaded via fw_cfg. What I meant was, rombar=0 in qdev. Sometimes my fingers don't work the same way my brain does :-) Are you using qmp or the human monitor protocol? -- error compiling committee.c: too many arguments to function
Re: [SeaBIOS] [Qemu-devel] [PATCH 1/2] pci: Automatically patch PCI device id in PCI ROM
On 10/21/2010 03:14 PM, Anthony Liguori wrote: On 10/21/2010 05:09 AM, Avi Kivity wrote: On 10/18/2010 08:39 PM, Anthony Liguori wrote: SeaBIOS rejects them when loaded from the rom bar and doesn't reject them when loaded via fw_cfg. What I meant was, rombar=0 in qdev. Sometimes my fingers don't work the same way my brain does :-) Are you using qmp or the human monitor protocol? I'm still running on a DCE/RPC implementation from the early 80s. Well, I hope you keep it up to date. I wouldn't want a vulnerability inserted into qemu by an attacker controlling a maintainer's hands. -- error compiling committee.c: too many arguments to function
[Qemu-devel] [PATCH 0/2] Type-safe ioport callbacks
A recent qemu -> qemu-kvm merge broke cpu hotplug without the compiler complaining, because of the type-unsafeness of the ioport callbacks. This patchset adds a type-safe variant of ioport callbacks and converts a sample ioport.

Converting the other 300-odd registrations is left as an exercise to the community.

Avi Kivity (2):
  Type-safe ioport callbacks
  piix4 acpi: convert io BAR to type-safe ioport callbacks

 hw/acpi_piix4.c | 30 +++--
 ioport.c        | 64 +++
 ioport.h        | 16 +
 3 files changed, 98 insertions(+), 12 deletions(-)

-- 
1.7.3.1
[Qemu-devel] [PATCH 1/2] Type-safe ioport callbacks
The current ioport callbacks are not type-safe, in that they accept an opaque pointer as an argument whose type must match the argument to the registration function; this is not checked by the compiler.

This patch adds an alternative that is type-safe. Instead of an opaque argument, both registration and the callback use a new IOPort type. The callback then uses container_of() to access its main structures.

Currently the old and new methods exist side by side; once the old way is gone, we can also save a bunch of memory, since the new method requires one pointer per ioport instead of 6.

Signed-off-by: Avi Kivity <a...@redhat.com>
---
 ioport.c | 64 ++
 ioport.h | 16 +++
 2 files changed, 80 insertions(+), 0 deletions(-)

diff --git a/ioport.c b/ioport.c
index ec3dc65..47eafc3 100644
--- a/ioport.c
+++ b/ioport.c
@@ -174,6 +174,70 @@ int register_ioport_write(pio_addr_t start, int length, int size,
     return 0;
 }
 
+static uint32_t ioport_readb_thunk(void *opaque, uint32_t addr)
+{
+    IOPort *ioport = opaque;
+
+    return ioport->ops->readb(ioport, addr);
+}
+
+static uint32_t ioport_readw_thunk(void *opaque, uint32_t addr)
+{
+    IOPort *ioport = opaque;
+
+    return ioport->ops->readw(ioport, addr);
+}
+
+static uint32_t ioport_readl_thunk(void *opaque, uint32_t addr)
+{
+    IOPort *ioport = opaque;
+
+    return ioport->ops->readl(ioport, addr);
+}
+
+static void ioport_writeb_thunk(void *opaque, uint32_t addr, uint32_t data)
+{
+    IOPort *ioport = opaque;
+
+    return ioport->ops->writeb(ioport, addr, data);
+}
+
+static void ioport_writew_thunk(void *opaque, uint32_t addr, uint32_t data)
+{
+    IOPort *ioport = opaque;
+
+    return ioport->ops->writew(ioport, addr, data);
+}
+
+static void ioport_writel_thunk(void *opaque, uint32_t addr, uint32_t data)
+{
+    IOPort *ioport = opaque;
+
+    return ioport->ops->writel(ioport, addr, data);
+}
+
+void ioport_register(IOPort *ioport, pio_addr_t start, pio_addr_t length)
+{
+    if (ioport->ops->readb) {
+        register_ioport_read(start, length, 1, ioport_readb_thunk, ioport);
+    }
+    if (ioport->ops->readw) {
+        register_ioport_read(start, length, 2, ioport_readw_thunk, ioport);
+    }
+    if (ioport->ops->readl) {
+        register_ioport_read(start, length, 4, ioport_readl_thunk, ioport);
+    }
+    if (ioport->ops->writeb) {
+        register_ioport_write(start, length, 1, ioport_writeb_thunk, ioport);
+    }
+    if (ioport->ops->writew) {
+        register_ioport_write(start, length, 2, ioport_writew_thunk, ioport);
+    }
+    if (ioport->ops->writel) {
+        register_ioport_write(start, length, 4, ioport_writel_thunk, ioport);
+    }
+}
+
 void isa_unassign_ioport(pio_addr_t start, int length)
 {
     int i;
diff --git a/ioport.h b/ioport.h
index 3d3c8a3..8036e59 100644
--- a/ioport.h
+++ b/ioport.h
@@ -36,6 +36,22 @@ typedef uint32_t pio_addr_t;
 typedef void (IOPortWriteFunc)(void *opaque, uint32_t address, uint32_t data);
 typedef uint32_t (IOPortReadFunc)(void *opaque, uint32_t address);
 
+struct IOPort;
+
+typedef struct IOPortOps {
+    uint32_t (*readb)(struct IOPort *ioport, pio_addr_t addr);
+    uint32_t (*readw)(struct IOPort *ioport, pio_addr_t addr);
+    uint32_t (*readl)(struct IOPort *ioport, pio_addr_t addr);
+    void (*writeb)(struct IOPort *ioport, pio_addr_t addr, uint32_t data);
+    void (*writew)(struct IOPort *ioport, pio_addr_t addr, uint32_t data);
+    void (*writel)(struct IOPort *ioport, pio_addr_t addr, uint32_t data);
+} IOPortOps;
+
+typedef struct IOPort {
+    IOPortOps *ops;
+} IOPort;
+
+void ioport_register(IOPort *ioport, pio_addr_t start, pio_addr_t length);
 int register_ioport_read(pio_addr_t start, int length, int size,
                          IOPortReadFunc *func, void *opaque);
 int register_ioport_write(pio_addr_t start, int length, int size,
-- 
1.7.3.1
[Qemu-devel] [PATCH 2/2] piix4 acpi: convert io BAR to type-safe ioport callbacks
Signed-off-by: Avi Kivity <a...@redhat.com>
---
 hw/acpi_piix4.c | 30 ++
 1 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c
index f74f34c..5772667 100644
--- a/hw/acpi_piix4.c
+++ b/hw/acpi_piix4.c
@@ -52,6 +52,7 @@ struct pci_status {
 
 typedef struct PIIX4PMState {
     PCIDevice dev;
+    IOPort ioport;
     uint16_t pmsts;
     uint16_t pmen;
     uint16_t pmcntrl;
@@ -128,9 +129,9 @@ static void pm_tmr_timer(void *opaque)
     pm_update_sci(s);
 }
 
-static void pm_ioport_writew(void *opaque, uint32_t addr, uint32_t val)
+static void pm_ioport_writew(IOPort *ioport, pio_addr_t addr, uint32_t val)
 {
-    PIIX4PMState *s = opaque;
+    PIIX4PMState *s = container_of(ioport, PIIX4PMState, ioport);
     addr &= 0x3f;
     switch(addr) {
     case 0x00:
@@ -184,9 +185,9 @@ static void pm_ioport_writew(void *opaque, uint32_t addr, uint32_t val)
     PIIX4_DPRINTF("PM writew port=0x%04x val=0x%04x\n", addr, val);
 }
 
-static uint32_t pm_ioport_readw(void *opaque, uint32_t addr)
+static uint32_t pm_ioport_readw(IOPort *ioport, pio_addr_t addr)
 {
-    PIIX4PMState *s = opaque;
+    PIIX4PMState *s = container_of(ioport, PIIX4PMState, ioport);
     uint32_t val;
 
     addr &= 0x3f;
@@ -208,15 +209,15 @@ static uint32_t pm_ioport_readw(void *opaque, uint32_t addr)
     return val;
 }
 
-static void pm_ioport_writel(void *opaque, uint32_t addr, uint32_t val)
+static void pm_ioport_writel(IOPort *ioport, pio_addr_t addr, uint32_t val)
 {
-    //PIIX4PMState *s = opaque;
+    //PIIX4PMState *s = container_of(ioport, PIIX4PMState, ioport);
     PIIX4_DPRINTF("PM writel port=0x%04x val=0x%08x\n", addr & 0x3f, val);
 }
 
-static uint32_t pm_ioport_readl(void *opaque, uint32_t addr)
+static uint32_t pm_ioport_readl(IOPort *ioport, uint32_t addr)
 {
-    PIIX4PMState *s = opaque;
+    PIIX4PMState *s = container_of(ioport, PIIX4PMState, ioport);
     uint32_t val;
 
     addr &= 0x3f;
@@ -265,10 +266,7 @@ static void pm_io_space_update(PIIX4PMState *s)
         /* XXX: need to improve memory and ioport allocation */
         PIIX4_DPRINTF("PM: mapping to 0x%x\n", pm_io_base);
-        register_ioport_write(pm_io_base, 64, 2, pm_ioport_writew, s);
-        register_ioport_read(pm_io_base, 64, 2, pm_ioport_readw, s);
-        register_ioport_write(pm_io_base, 64, 4, pm_ioport_writel, s);
-        register_ioport_read(pm_io_base, 64, 4, pm_ioport_readl, s);
+        ioport_register(&s->ioport, pm_io_base, 64);
     }
 }
 
@@ -413,6 +411,13 @@ static int piix4_pm_initfn(PCIDevice *dev)
     return 0;
 }
 
+static IOPortOps pm_ioport_ops = {
+    .readw  = pm_ioport_readw,
+    .readl  = pm_ioport_readl,
+    .writew = pm_ioport_writew,
+    .writel = pm_ioport_writel,
+};
+
 i2c_bus *piix4_pm_init(PCIBus *bus, int devfn, uint32_t smb_io_base,
                        qemu_irq sci_irq, qemu_irq cmos_s3, qemu_irq smi_irq,
                        int kvm_enabled)
@@ -424,6 +429,7 @@ i2c_bus *piix4_pm_init(PCIBus *bus, int devfn, uint32_t smb_io_base,
     qdev_prop_set_uint32(&dev->qdev, "smb_io_base", smb_io_base);
 
     s = DO_UPCAST(PIIX4PMState, dev, dev);
+    s->ioport.ops = &pm_ioport_ops;
     s->irq = sci_irq;
     s->cmos_s3 = cmos_s3;
     s->smi_irq = smi_irq;
-- 
1.7.3.1
[Qemu-devel] Re: [PATCH 1/2] Type-safe ioport callbacks
On 10/24/2010 05:34 PM, Avi Kivity wrote: Currently the old and new methods exist side by side; once the old way is gone, we can also save a bunch of memory since the new method requires one pointer per ioport instead of 6. Actually, 1:7, we replace 3 read callbacks, 3 write callbacks, and 1 opaque pointer with a single IOPort pointer. -- error compiling committee.c: too many arguments to function
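For readers unfamiliar with the pattern, here is a standalone illustration of how container_of() lets the callback recover the device state from the embedded IOPort without a void* opaque (toy types, not the actual QEMU structures).

```c
#include <stddef.h>
#include <stdint.h>

/* container_of: recover the enclosing struct from a pointer to one
 * of its members.  QEMU/Linux define this the same way. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

typedef struct IOPort IOPort;

typedef struct IOPortOps {
    uint32_t (*readw)(IOPort *ioport, uint32_t addr);
} IOPortOps;

struct IOPort {
    const IOPortOps *ops;
};

typedef struct {
    IOPort   ioport;   /* embedded, like PIIX4PMState.ioport */
    uint32_t pmsts;
} DevState;

/* The callback gets the embedded IOPort and walks back to DevState;
 * a type mismatch here is a compile error, not a runtime surprise. */
static uint32_t dev_readw(IOPort *ioport, uint32_t addr)
{
    DevState *s = container_of(ioport, DevState, ioport);
    (void)addr;
    return s->pmsts;
}

static const IOPortOps dev_ops = { .readw = dev_readw };
```

Compared with the opaque-pointer scheme, the registration side only stores one IOPort pointer, and the compiler checks that the ops functions really take an IOPort - exactly the bug class the cpu-hotplug breakage came from.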
[Qemu-devel] Re: [PATCH 0/2] Type-safe ioport callbacks
On 10/24/2010 07:35 PM, Paolo Bonzini wrote:

On 10/24/2010 05:34 PM, Avi Kivity wrote:

A recent qemu -> qemu-kvm merge broke cpu hotplug without the compiler complaining, because of the type-unsafeness of the ioport callbacks. This patchset adds a type-safe variant of ioport callbacks and converts a sample ioport. Converting the other 300-odd registrations is left as an exercise to the community.

Should we create a Documentation/ file with incomplete transitions and the commit(s) that introduced them, for volunteers who wish to do some dirty work or to learn Coccinelle?

If we have a TODO, we could add a janitor section there.

-- 
error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH 1/2] Type-safe ioport callbacks
On 10/24/2010 08:14 PM, Blue Swirl wrote:

On Sun, Oct 24, 2010 at 3:34 PM, Avi Kivity <a...@redhat.com> wrote:

The current ioport callbacks are not type-safe, in that they accept an opaque pointer as an argument whose type must match the argument to the registration function; this is not checked by the compiler. This patch adds an alternative that is type-safe. Instead of an opaque argument, both registration and the callback use a new IOPort type. The callback then uses container_of() to access its main structures. Currently the old and new methods exist side by side; once the old way is gone, we can also save a bunch of memory, since the new method requires one pointer per ioport instead of 6.

Signed-off-by: Avi Kivity <a...@redhat.com>

If we are going to change the interface, let's do it so that it's useful for other uses too: http://article.gmane.org/gmane.comp.emulators.qemu/76984

I don't really see why we need registration; cpu_register_io() takes function pointers, a size, and an opaque, and gives an integer handle in return. With the IOPort object approach, you set up the IOPort with function pointers, size is implied, and the opaque is derived using container_of(); the handle is simply the address of the object.

+typedef struct IOPort {
+    IOPortOps *ops;

const

Yup, will fix. Will repost once we agree on the approach.

-- 
error compiling committee.c: too many arguments to function
[Qemu-devel] Re: [SeaBIOS] [PATCH] mark irq9 active high in DSDT
On 10/23/2010 04:12 PM, Kevin O'Connor wrote: On Thu, Oct 21, 2010 at 12:07:17PM +0200, Avi Kivity wrote: How do we manage the stable series wrt this issue? qemu-kvm-0.12.5 has a regression within the stable series that this patch fixes. qemu 0.12.5 does not, but only because it does not emulate polarity in the I/O APIC correctly. There are several paths we could take: - do nothing, bug is fixed in mainline - release a seabios 0.x.1 for qemu 0.13.1 with this patch - same, plus seabios 0.y.1 for qemu 0.12.6 with this patch - skip qemu (which is not truly affected), patch qemu-kvm's copy of seabios for both 0.12.z and 0.13.z The third option is the most correct from a release engineering point of view, but involves more work for everyone. I'm okay with making tags and branches of seabios for bug fixes. So far qemu/kvm has just grabbed various builds of seabios - is it worthwhile to branch off of the seabios-0.6.1 version - which would mean qemu/kvm would pull in additional changes beyond the bug fix above? qemu 0.12 is based on 0.5.1-stable, appears to be an untagged commit qemu 0.13 is based on 17d3e46511, doesn't appear to be a part of a branch or a tag? git-wise, tags are more important than branches. You can always retrofit a branch to a tag (and you can always retrofit a tag to a commit hash). For the qemu git repositories, neither matter so much since the commit is recorded in git; but the distro people really like nice stable tags with lots of digits and dots in them. -- error compiling committee.c: too many arguments to function
[Qemu-devel] Re: [PATCH 1/2] Type-safe ioport callbacks
On 10/25/2010 02:54 PM, Juan Quintela wrote: Avi Kivity <a...@redhat.com> wrote:

+static void ioport_writeb_thunk(void *opaque, uint32_t addr, uint32_t data)
+{
+    IOPort *ioport = opaque;
+
+    return ioport->ops->writeb(ioport, addr, data);
+}
+
+static void ioport_writew_thunk(void *opaque, uint32_t addr, uint32_t data)
+{
+    IOPort *ioport = opaque;
+
+    return ioport->ops->writew(ioport, addr, data);
+}
+
+static void ioport_writel_thunk(void *opaque, uint32_t addr, uint32_t data)
+{
+    IOPort *ioport = opaque;
+
+    return ioport->ops->writel(ioport, addr, data);
+}

return is not needed in these three functions.

Fixed. -- error compiling committee.c: too many arguments to function
[Qemu-devel] [PATCH v2 0/2] Type-safe ioport callbacks
A recent qemu -> qemu-kvm merge broke cpu hotplug without the compiler complaining, because of the type-unsafeness of the ioport callbacks. This patchset adds a type-safe variant of ioport callbacks and converts a sample ioport. Converting the other 300-odd registrations is left as an exercise to the community.

v2:
 - const correctness
 - avoid return void

Avi Kivity (2):
  Type-safe ioport callbacks
  piix4 acpi: convert io BAR to type-safe ioport callbacks

 hw/acpi_piix4.c | 30 ++++++++++++++++++------------
 ioport.c        | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 ioport.h        | 16 ++++++++++++++++
 3 files changed, 98 insertions(+), 12 deletions(-)

-- 
1.7.3.1
[Qemu-devel] [PATCH v2 1/2] Type-safe ioport callbacks
The current ioport callbacks are not type-safe, in that they accept an opaque pointer as an argument whose type must match the argument to the registration function; this is not checked by the compiler. This patch adds an alternative that is type-safe. Instead of an opaque argument, both registration and the callback use a new IOPort type. The callback then uses container_of() to access its main structures. Currently the old and new methods exist side by side; once the old way is gone, we can also save a bunch of memory, since the new method requires one pointer per ioport instead of 6.

Acked-by: Anthony Liguori <aligu...@us.ibm.com>
Signed-off-by: Avi Kivity <a...@redhat.com>
---
 ioport.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 ioport.h | 16 ++++++++++++++++
 2 files changed, 80 insertions(+), 0 deletions(-)

diff --git a/ioport.c b/ioport.c
index ec3dc65..b9e74b8 100644
--- a/ioport.c
+++ b/ioport.c
@@ -174,6 +174,70 @@ int register_ioport_write(pio_addr_t start, int length, int size,
     return 0;
 }
 
+static uint32_t ioport_readb_thunk(void *opaque, uint32_t addr)
+{
+    IOPort *ioport = opaque;
+
+    return ioport->ops->readb(ioport, addr);
+}
+
+static uint32_t ioport_readw_thunk(void *opaque, uint32_t addr)
+{
+    IOPort *ioport = opaque;
+
+    return ioport->ops->readw(ioport, addr);
+}
+
+static uint32_t ioport_readl_thunk(void *opaque, uint32_t addr)
+{
+    IOPort *ioport = opaque;
+
+    return ioport->ops->readl(ioport, addr);
+}
+
+static void ioport_writeb_thunk(void *opaque, uint32_t addr, uint32_t data)
+{
+    IOPort *ioport = opaque;
+
+    ioport->ops->writeb(ioport, addr, data);
+}
+
+static void ioport_writew_thunk(void *opaque, uint32_t addr, uint32_t data)
+{
+    IOPort *ioport = opaque;
+
+    ioport->ops->writew(ioport, addr, data);
+}
+
+static void ioport_writel_thunk(void *opaque, uint32_t addr, uint32_t data)
+{
+    IOPort *ioport = opaque;
+
+    ioport->ops->writel(ioport, addr, data);
+}
+
+void ioport_register(IOPort *ioport, pio_addr_t start, pio_addr_t length)
+{
+    if (ioport->ops->readb) {
+        register_ioport_read(start, length, 1, ioport_readb_thunk, ioport);
+    }
+    if (ioport->ops->readw) {
+        register_ioport_read(start, length, 2, ioport_readw_thunk, ioport);
+    }
+    if (ioport->ops->readl) {
+        register_ioport_read(start, length, 4, ioport_readl_thunk, ioport);
+    }
+    if (ioport->ops->writeb) {
+        register_ioport_write(start, length, 1, ioport_writeb_thunk, ioport);
+    }
+    if (ioport->ops->writew) {
+        register_ioport_write(start, length, 2, ioport_writew_thunk, ioport);
+    }
+    if (ioport->ops->writel) {
+        register_ioport_write(start, length, 4, ioport_writel_thunk, ioport);
+    }
+}
+
 void isa_unassign_ioport(pio_addr_t start, int length)
 {
     int i;
diff --git a/ioport.h b/ioport.h
index 3d3c8a3..6a7742b 100644
--- a/ioport.h
+++ b/ioport.h
@@ -36,6 +36,22 @@ typedef uint32_t pio_addr_t;
 typedef void (IOPortWriteFunc)(void *opaque, uint32_t address, uint32_t data);
 typedef uint32_t (IOPortReadFunc)(void *opaque, uint32_t address);
 
+struct IOPort;
+
+typedef struct IOPortOps {
+    uint32_t (*readb)(struct IOPort *ioport, pio_addr_t addr);
+    uint32_t (*readw)(struct IOPort *ioport, pio_addr_t addr);
+    uint32_t (*readl)(struct IOPort *ioport, pio_addr_t addr);
+    void (*writeb)(struct IOPort *ioport, pio_addr_t addr, uint32_t data);
+    void (*writew)(struct IOPort *ioport, pio_addr_t addr, uint32_t data);
+    void (*writel)(struct IOPort *ioport, pio_addr_t addr, uint32_t data);
+} IOPortOps;
+
+typedef struct IOPort {
+    const IOPortOps *ops;
+} IOPort;
+
+void ioport_register(IOPort *ioport, pio_addr_t start, pio_addr_t length);
 int register_ioport_read(pio_addr_t start, int length, int size,
                          IOPortReadFunc *func, void *opaque);
 int register_ioport_write(pio_addr_t start, int length, int size,
-- 
1.7.3.1
[Qemu-devel] [PATCH v2 2/2] piix4 acpi: convert io BAR to type-safe ioport callbacks
Acked-by: Anthony Liguori <aligu...@us.ibm.com>
Signed-off-by: Avi Kivity <a...@redhat.com>
---
 hw/acpi_piix4.c | 30 ++++++++++++++++++------------
 1 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c
index f74f34c..c01c2ee 100644
--- a/hw/acpi_piix4.c
+++ b/hw/acpi_piix4.c
@@ -52,6 +52,7 @@ struct pci_status {
 
 typedef struct PIIX4PMState {
     PCIDevice dev;
+    IOPort ioport;
     uint16_t pmsts;
     uint16_t pmen;
     uint16_t pmcntrl;
@@ -128,9 +129,9 @@ static void pm_tmr_timer(void *opaque)
     pm_update_sci(s);
 }
 
-static void pm_ioport_writew(void *opaque, uint32_t addr, uint32_t val)
+static void pm_ioport_writew(IOPort *ioport, pio_addr_t addr, uint32_t val)
 {
-    PIIX4PMState *s = opaque;
+    PIIX4PMState *s = container_of(ioport, PIIX4PMState, ioport);
     addr &= 0x3f;
     switch(addr) {
     case 0x00:
@@ -184,9 +185,9 @@ static void pm_ioport_writew(void *opaque, uint32_t addr, uint32_t val)
     PIIX4_DPRINTF("PM writew port=0x%04x val=0x%04x\n", addr, val);
 }
 
-static uint32_t pm_ioport_readw(void *opaque, uint32_t addr)
+static uint32_t pm_ioport_readw(IOPort *ioport, pio_addr_t addr)
 {
-    PIIX4PMState *s = opaque;
+    PIIX4PMState *s = container_of(ioport, PIIX4PMState, ioport);
     uint32_t val;
 
     addr &= 0x3f;
@@ -208,15 +209,15 @@ static uint32_t pm_ioport_readw(void *opaque, uint32_t addr)
     return val;
 }
 
-static void pm_ioport_writel(void *opaque, uint32_t addr, uint32_t val)
+static void pm_ioport_writel(IOPort *ioport, pio_addr_t addr, uint32_t val)
 {
-    //PIIX4PMState *s = opaque;
+    //PIIX4PMState *s = container_of(ioport, PIIX4PMState, ioport);
     PIIX4_DPRINTF("PM writel port=0x%04x val=0x%08x\n", addr & 0x3f, val);
 }
 
-static uint32_t pm_ioport_readl(void *opaque, uint32_t addr)
+static uint32_t pm_ioport_readl(IOPort *ioport, uint32_t addr)
 {
-    PIIX4PMState *s = opaque;
+    PIIX4PMState *s = container_of(ioport, PIIX4PMState, ioport);
     uint32_t val;
 
     addr &= 0x3f;
@@ -265,10 +266,7 @@ static void pm_io_space_update(PIIX4PMState *s)
         /* XXX: need to improve memory and ioport allocation */
         PIIX4_DPRINTF("PM: mapping to 0x%x\n", pm_io_base);
-        register_ioport_write(pm_io_base, 64, 2, pm_ioport_writew, s);
-        register_ioport_read(pm_io_base, 64, 2, pm_ioport_readw, s);
-        register_ioport_write(pm_io_base, 64, 4, pm_ioport_writel, s);
-        register_ioport_read(pm_io_base, 64, 4, pm_ioport_readl, s);
+        ioport_register(&s->ioport, pm_io_base, 64);
     }
 }
@@ -413,6 +411,13 @@ static int piix4_pm_initfn(PCIDevice *dev)
     return 0;
 }
 
+static const IOPortOps pm_ioport_ops = {
+    .readw = pm_ioport_readw,
+    .readl = pm_ioport_readl,
+    .writew = pm_ioport_writew,
+    .writel = pm_ioport_writel,
+};
+
 i2c_bus *piix4_pm_init(PCIBus *bus, int devfn, uint32_t smb_io_base,
                        qemu_irq sci_irq, qemu_irq cmos_s3, qemu_irq smi_irq,
                        int kvm_enabled)
@@ -424,6 +429,7 @@ i2c_bus *piix4_pm_init(PCIBus *bus, int devfn, uint32_t smb_io_base,
     qdev_prop_set_uint32(&dev->qdev, "smb_io_base", smb_io_base);
     s = DO_UPCAST(PIIX4PMState, dev, dev);
 
+    s->ioport.ops = &pm_ioport_ops;
     s->irq = sci_irq;
     s->cmos_s3 = cmos_s3;
     s->smi_irq = smi_irq;
-- 
1.7.3.1
Re: [Qemu-devel] [PATCH 1/2] Type-safe ioport callbacks
On 10/25/2010 08:38 PM, Blue Swirl wrote: I don't really see why we need registration; cpu_register_io() takes function pointers, a size, and an opaque, and gives an integer handle in return. With the IOPort object approach, you set up the IOPort with function pointers, size is implied, and the opaque is derived using container_of(); the handle is simply the address of the object. With the handle, we can separate setting up the structures at device level, and mapping the object using only the handle at bus or other higher level. Can this be done with the object approach? I believe so. The handle is simply an indirect pointer, no? The purpose of that patch series was to perform the separation for PCI BARs. I wasn't so happy with the series, so I never pushed. In fact I think an IOPort is even more suitable; if we need additional attributes we can use a derived object: struct PCIIOPort { IOPort ioport; /* additional fields */ }; -- error compiling committee.c: too many arguments to function
[Qemu-devel] Re: KVM call agenda for Oct 26
On 10/26/2010 02:49 PM, Anthony Liguori wrote: On 10/26/2010 03:20 AM, Avi Kivity wrote: On 10/25/2010 09:51 PM, Juan Quintela wrote: Please send in any agenda items you are interested in covering. 0.13.1 At least two fatal issues: - vmmouse is broken - assertion failure in block layer There's actually a fair bit. I'm hoping to also get the hotplug fixes because those are pretty important IMHO. I don't want to wait too long though. Also, a new seabios for the SCI level-triggered fix. What's the assertion in the block layer? https://bugzilla.redhat.com/show_bug.cgi?id=646538 This is qemu-img, so maybe not so critical. I confused this with a runtime assertion failure in the block layer, for which a fix was merged post rc3. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH 1/2] Type-safe ioport callbacks
On 10/26/2010 05:09 PM, Blue Swirl wrote: On Tue, Oct 26, 2010 at 8:05 AM, Avi Kivity <a...@redhat.com> wrote: On 10/25/2010 08:38 PM, Blue Swirl wrote: I don't really see why we need registration; cpu_register_io() takes function pointers, a size, and an opaque, and gives an integer handle in return. With the IOPort object approach, you set up the IOPort with function pointers, size is implied, and the opaque is derived using container_of(); the handle is simply the address of the object. With the handle, we can separate setting up the structures at device level, and mapping the object using only the handle at bus or other higher level. Can this be done with the object approach? I believe so. The handle is simply an indirect pointer, no? Yes, but then the object should also contain size information. That should not be needed for mapping at higher level. Sorry, I don't follow your meaning. When I said "size is implied" I meant that the IOPort object has a separate function pointer for sizes 1, 2, and 4, so ioport_register() doesn't need a size parameter. But I don't see how that relates to your comment. The purpose of that patch series was to perform the separation for PCI BARs. I wasn't so happy with the series, so I never pushed. In fact I think an IOPort is even more suitable; if we need additional attributes we can use a derived object:

struct PCIIOPort {
    IOPort ioport;
    /* additional fields */
};

One issue with my series was that it would be great if the devices just had some BAR structures (used by the PCI layer to map the devices) inside PCI/qdev structures, but I invented that too late. Maybe this can be addressed in your design? It looks to be orthogonal. It would be great to have a BAR object; that object can then use your API, my API, or the existing API. -- error compiling committee.c: too many arguments to function